fix: chunk CSV/JSON knowledge file content, not the dict repr#6174
fix: chunk CSV/JSON knowledge file content, not the dict repr#6174ly-wang19 wants to merge 1 commit into
Conversation
CSVKnowledgeSource.add()/aadd() and JSONKnowledgeSource.add()/aadd() built
their chunks from `str(self.content)`. Since `content` is a
`dict[Path, str]` (set in BaseFileKnowledgeSource.model_post_init), this
embedded the dict's repr — e.g.
`{PosixPath('/abs/path/data.csv'): 'col1 col2\\n...'}` — into the knowledge
base, leaking the absolute file path and Python repr/escape noise instead of
the file's actual text.
TextFileKnowledgeSource and ExcelKnowledgeSource already iterate
`self.content.values()`; make CSV and JSON do the same so the chunks hold the
real content. Add regression tests asserting the chunks contain the data and
not `PosixPath(...)`/dict-repr noise.
📝 WalkthroughWalkthrough
ChangesPer-entry chunking fix for CSV and JSON knowledge sources
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
lib/crewai/tests/knowledge/test_knowledge.py (1)
482-519: ⚡ Quick winAdd async regression coverage for
aadd()as well.These regressions validate
add()well, but this PR also changesaadd()in both CSV and JSON sources. Mirroring these tests for async would lock the same no-dict-repr guarantee on the async path too.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai/tests/knowledge/test_knowledge.py` around lines 482 - 519, Add async regression coverage by creating two new test functions that mirror the existing synchronous tests for CSVKnowledgeSource and JSONKnowledgeSource. Create test_csv_knowledge_source_chunks_hold_content_not_dict_repr_async and test_json_knowledge_source_chunks_hold_content_not_dict_repr_async as async test functions (using async def and await) that validate the same regression behavior: that chunks hold actual file content rather than dict representations with PosixPath leakage. For each async test, follow the same setup pattern as the synchronous counterpart but call await source.aadd() instead of source.add(), and verify the same assertions about content presence and absence of PosixPath and curly braces in the joined chunks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/crewai/tests/knowledge/test_knowledge.py`:
- Around line 482-519: Add async regression coverage by creating two new test
functions that mirror the existing synchronous tests for CSVKnowledgeSource and
JSONKnowledgeSource. Create
test_csv_knowledge_source_chunks_hold_content_not_dict_repr_async and
test_json_knowledge_source_chunks_hold_content_not_dict_repr_async as async test
functions (using async def and await) that validate the same regression
behavior: that chunks hold actual file content rather than dict representations
with PosixPath leakage. For each async test, follow the same setup pattern as
the synchronous counterpart but call await source.aadd() instead of
source.add(), and verify the same assertions about content presence and absence
of PosixPath and curly braces in the joined chunks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 6d2f896a-5f35-4c96-a0a3-d4e0a37ce50d
📒 Files selected for processing (3)
lib/crewai/src/crewai/knowledge/source/csv_knowledge_source.pylib/crewai/src/crewai/knowledge/source/json_knowledge_source.pylib/crewai/tests/knowledge/test_knowledge.py
Found while reviewing the knowledge-source siblings for consistency.
Problem
CSVKnowledgeSource.add()/aadd()andJSONKnowledgeSource.add()/aadd()build their chunks from:But
contentis declared inBaseFileKnowledgeSourceasdict[Path, str]and is always populated (inmodel_post_init→load_content()), so theisinstance(..., dict)branch is always taken. The chunks therefore contain thestr()of the dict, e.g.i.e. the embedded/indexed knowledge holds the Python dict repr, the absolute file path, and escaped newlines — instead of the file content. This degrades retrieval quality and leaks local filesystem paths into the vector store.
TextFileKnowledgeSourceandExcelKnowledgeSourcealready do the right thing — they iterateself.content.values(). CSV and JSON are the inconsistent ones.Fix
Make
CSVKnowledgeSourceandJSONKnowledgeSource(bothaddandaadd) iterateself.content.values()and chunk each file's content, matching the existing text/excel sources.Tests
Added
test_csv_knowledge_source_chunks_hold_content_not_dict_reprandtest_json_knowledge_source_chunks_hold_content_not_dict_repr, which calladd()and assert the chunks contain the data and notPosixPath(...)/ dict-repr noise. Both fail before the change (the failure surfaces the{PosixPath(...): ...}chunk) and pass after. Existing CSV/JSON/text/string knowledge tests still pass;ruff checkis clean.Summary by CodeRabbit