Skip to content

fix: chunk CSV/JSON knowledge file content, not the dict repr#6174

Open
ly-wang19 wants to merge 1 commit into
crewAIInc:mainfrom
ly-wang19:fix/knowledge-csv-json-chunk-content
Open

fix: chunk CSV/JSON knowledge file content, not the dict repr#6174
ly-wang19 wants to merge 1 commit into
crewAIInc:mainfrom
ly-wang19:fix/knowledge-csv-json-chunk-content

Conversation

@ly-wang19

@ly-wang19 ly-wang19 commented Jun 16, 2026

Copy link
Copy Markdown

Found while reviewing the knowledge-source siblings for consistency.

Problem

CSVKnowledgeSource.add() / aadd() and JSONKnowledgeSource.add() / aadd() build their chunks from:

content_str = str(self.content) if isinstance(self.content, dict) else self.content
new_chunks = self._chunk_text(content_str)

But content is declared in BaseFileKnowledgeSource as dict[Path, str] and is always populated (in model_post_initload_content()), so the isinstance(..., dict) branch is always taken. The chunks therefore contain the str() of the dict, e.g.

{PosixPath('/abs/path/data.csv'): 'Name City\nBrandon New York\n'}

i.e. the embedded/indexed knowledge holds the Python dict repr, the absolute file path, and escaped newlines — instead of the file content. This degrades retrieval quality and leaks local filesystem paths into the vector store.

TextFileKnowledgeSource and ExcelKnowledgeSource already do the right thing — they iterate self.content.values(). CSV and JSON are the inconsistent ones.

Fix

Make CSVKnowledgeSource and JSONKnowledgeSource (both add and aadd) iterate self.content.values() and chunk each file's content, matching the existing text/excel sources.

Tests

Added test_csv_knowledge_source_chunks_hold_content_not_dict_repr and test_json_knowledge_source_chunks_hold_content_not_dict_repr, which call add() and assert the chunks contain the data and not PosixPath(...) / dict-repr noise. Both fail before the change (the failure surfaces the {PosixPath(...): ...} chunk) and pass after. Existing CSV/JSON/text/string knowledge tests still pass; ruff check is clean.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed CSV and JSON knowledge source chunking to process individual data entries separately instead of converting entire structures into single strings. Generated knowledge chunks now contain actual content values rather than internal Python representations.

CSVKnowledgeSource.add()/aadd() and JSONKnowledgeSource.add()/aadd() built
their chunks from `str(self.content)`. Since `content` is a
`dict[Path, str]` (set in BaseFileKnowledgeSource.model_post_init), this
embedded the dict's repr — e.g.
`{PosixPath('/abs/path/data.csv'): 'col1 col2\\n...'}` — into the knowledge
base, leaking the absolute file path and Python repr/escape noise instead of
the file's actual text.

TextFileKnowledgeSource and ExcelKnowledgeSource already iterate
`self.content.values()`; make CSV and JSON do the same so the chunks hold the
real content. Add regression tests asserting the chunks contain the data and
not `PosixPath(...)`/dict-repr noise.
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

CSVKnowledgeSource and JSONKnowledgeSource are updated so that add() and aadd() iterate over self.content.values() and chunk each entry individually, replacing prior logic that serialized the entire content dict to one string. Two regression tests verify that resulting chunks contain actual field values rather than Python dict or path representations.

Changes

Per-entry chunking fix for CSV and JSON knowledge sources

Layer / File(s) Summary
Per-entry chunking in CSV and JSON add/aadd methods
lib/crewai/src/crewai/knowledge/source/csv_knowledge_source.py, lib/crewai/src/crewai/knowledge/source/json_knowledge_source.py
Both add() and aadd() in each source class are changed to loop over self.content.values(), chunk each value, and append results, replacing the previous single str(self.content) conversion.
Regression tests for chunk content correctness
lib/crewai/tests/knowledge/test_knowledge.py
Adds MagicMock to the mock import and inserts two tests that write real CSV/JSON temp files, call add(), and assert chunks hold field content strings while excluding PosixPath and { substrings.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 No more dicts in the chunk parade,
Each CSV row gets its own crusade.
JSON values, one by one they flow,
No PosixPath leaks putting on a show!
The rabbit checked the chunks—hooray,
Clean content chunks are here to stay! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main change: fixing CSV/JSON knowledge source to chunk file content instead of dictionary representation.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
lib/crewai/tests/knowledge/test_knowledge.py (1)

482-519: ⚡ Quick win

Add async regression coverage for aadd() as well.

These regressions validate add() well, but this PR also changes aadd() in both CSV and JSON sources. Mirroring these tests for async would lock the same no-dict-repr guarantee on the async path too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/tests/knowledge/test_knowledge.py` around lines 482 - 519, Add
async regression coverage by creating two new test functions that mirror the
existing synchronous tests for CSVKnowledgeSource and JSONKnowledgeSource.
Create test_csv_knowledge_source_chunks_hold_content_not_dict_repr_async and
test_json_knowledge_source_chunks_hold_content_not_dict_repr_async as async test
functions (using async def and await) that validate the same regression
behavior: that chunks hold actual file content rather than dict representations
with PosixPath leakage. For each async test, follow the same setup pattern as
the synchronous counterpart but call await source.aadd() instead of
source.add(), and verify the same assertions about content presence and absence
of PosixPath and curly braces in the joined chunks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lib/crewai/tests/knowledge/test_knowledge.py`:
- Around line 482-519: Add async regression coverage by creating two new test
functions that mirror the existing synchronous tests for CSVKnowledgeSource and
JSONKnowledgeSource. Create
test_csv_knowledge_source_chunks_hold_content_not_dict_repr_async and
test_json_knowledge_source_chunks_hold_content_not_dict_repr_async as async test
functions (using async def and await) that validate the same regression
behavior: that chunks hold actual file content rather than dict representations
with PosixPath leakage. For each async test, follow the same setup pattern as
the synchronous counterpart but call await source.aadd() instead of
source.add(), and verify the same assertions about content presence and absence
of PosixPath and curly braces in the joined chunks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6d2f896a-5f35-4c96-a0a3-d4e0a37ce50d

📥 Commits

Reviewing files that changed from the base of the PR and between e9d568d and fde0478.

📒 Files selected for processing (3)
  • lib/crewai/src/crewai/knowledge/source/csv_knowledge_source.py
  • lib/crewai/src/crewai/knowledge/source/json_knowledge_source.py
  • lib/crewai/tests/knowledge/test_knowledge.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant