fix: chunk CSV/JSON knowledge file content, not the dict repr by ly-wang19 · Pull Request #6174 · crewAIInc/crewAI

ly-wang19 · 2026-06-16T05:07:36Z

Found while reviewing the knowledge-source siblings for consistency.

Problem

CSVKnowledgeSource.add() / aadd() and JSONKnowledgeSource.add() / aadd() build their chunks from:

content_str = str(self.content) if isinstance(self.content, dict) else self.content
new_chunks = self._chunk_text(content_str)

But content is declared in BaseFileKnowledgeSource as dict[Path, str] and is always populated (in model_post_init → load_content()), so the isinstance(..., dict) branch is always taken. The chunks therefore contain the str() of the dict, e.g.

{PosixPath('/abs/path/data.csv'): 'Name City\nBrandon New York\n'}

i.e. the embedded/indexed knowledge holds the Python dict repr, the absolute file path, and escaped newlines — instead of the file content. This degrades retrieval quality and leaks local filesystem paths into the vector store.

TextFileKnowledgeSource and ExcelKnowledgeSource already do the right thing — they iterate self.content.values(). CSV and JSON are the inconsistent ones.

Fix

Make CSVKnowledgeSource and JSONKnowledgeSource (both add and aadd) iterate self.content.values() and chunk each file's content, matching the existing text/excel sources.

Tests

Added test_csv_knowledge_source_chunks_hold_content_not_dict_repr and test_json_knowledge_source_chunks_hold_content_not_dict_repr, which call add() and assert the chunks contain the data and not PosixPath(...) / dict-repr noise. Both fail before the change (the failure surfaces the {PosixPath(...): ...} chunk) and pass after. Existing CSV/JSON/text/string knowledge tests still pass; ruff check is clean.

Summary by CodeRabbit

Bug Fixes
- Fixed CSV and JSON knowledge source chunking to process individual data entries separately instead of converting entire structures into single strings. Generated knowledge chunks now contain actual content values rather than internal Python representations.

CSVKnowledgeSource.add()/aadd() and JSONKnowledgeSource.add()/aadd() built their chunks from `str(self.content)`. Since `content` is a `dict[Path, str]` (set in BaseFileKnowledgeSource.model_post_init), this embedded the dict's repr — e.g. `{PosixPath('/abs/path/data.csv'): 'col1 col2\\n...'}` — into the knowledge base, leaking the absolute file path and Python repr/escape noise instead of the file's actual text. TextFileKnowledgeSource and ExcelKnowledgeSource already iterate `self.content.values()`; make CSV and JSON do the same so the chunks hold the real content. Add regression tests asserting the chunks contain the data and not `PosixPath(...)`/dict-repr noise.

coderabbitai · 2026-06-16T05:08:04Z

📝 Walkthrough

Walkthrough

CSVKnowledgeSource and JSONKnowledgeSource are updated so that add() and aadd() iterate over self.content.values() and chunk each entry individually, replacing prior logic that serialized the entire content dict to one string. Two regression tests verify that resulting chunks contain actual field values rather than Python dict or path representations.

Changes

Per-entry chunking fix for CSV and JSON knowledge sources

Layer / File(s)	Summary
Per-entry chunking in CSV and JSON add/aadd methods `lib/crewai/src/crewai/knowledge/source/csv_knowledge_source.py`, `lib/crewai/src/crewai/knowledge/source/json_knowledge_source.py`	Both `add()` and `aadd()` in each source class are changed to loop over `self.content.values()`, chunk each value, and append results, replacing the previous single `str(self.content)` conversion.
Regression tests for chunk content correctness `lib/crewai/tests/knowledge/test_knowledge.py`	Adds `MagicMock` to the mock import and inserts two tests that write real CSV/JSON temp files, call `add()`, and assert chunks hold field content strings while excluding `PosixPath` and `{` substrings.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 No more dicts in the chunk parade,
Each CSV row gets its own crusade.
JSON values, one by one they flow,
No PosixPath leaks putting on a show!
The rabbit checked the chunks—hooray,
Clean content chunks are here to stay! 🎉

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly summarizes the main change: fixing CSV/JSON knowledge source to chunk file content instead of dictionary representation.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

lib/crewai/tests/knowledge/test_knowledge.py (1)

482-519: ⚡ Quick win

Add async regression coverage for aadd() as well.

These regressions validate add() well, but this PR also changes aadd() in both CSV and JSON sources. Mirroring these tests for async would lock the same no-dict-repr guarantee on the async path too.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/tests/knowledge/test_knowledge.py` around lines 482 - 519, Add
async regression coverage by creating two new test functions that mirror the
existing synchronous tests for CSVKnowledgeSource and JSONKnowledgeSource.
Create test_csv_knowledge_source_chunks_hold_content_not_dict_repr_async and
test_json_knowledge_source_chunks_hold_content_not_dict_repr_async as async test
functions (using async def and await) that validate the same regression
behavior: that chunks hold actual file content rather than dict representations
with PosixPath leakage. For each async test, follow the same setup pattern as
the synchronous counterpart but call await source.aadd() instead of
source.add(), and verify the same assertions about content presence and absence
of PosixPath and curly braces in the joined chunks.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lib/crewai/tests/knowledge/test_knowledge.py`:
- Around line 482-519: Add async regression coverage by creating two new test
functions that mirror the existing synchronous tests for CSVKnowledgeSource and
JSONKnowledgeSource. Create
test_csv_knowledge_source_chunks_hold_content_not_dict_repr_async and
test_json_knowledge_source_chunks_hold_content_not_dict_repr_async as async test
functions (using async def and await) that validate the same regression
behavior: that chunks hold actual file content rather than dict representations
with PosixPath leakage. For each async test, follow the same setup pattern as
the synchronous counterpart but call await source.aadd() instead of
source.add(), and verify the same assertions about content presence and absence
of PosixPath and curly braces in the joined chunks.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6d2f896a-5f35-4c96-a0a3-d4e0a37ce50d

📥 Commits

Reviewing files that changed from the base of the PR and between e9d568d and fde0478.

📒 Files selected for processing (3)

lib/crewai/src/crewai/knowledge/source/csv_knowledge_source.py
lib/crewai/src/crewai/knowledge/source/json_knowledge_source.py
lib/crewai/tests/knowledge/test_knowledge.py

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: chunk CSV/JSON knowledge file content, not the dict repr#6174

fix: chunk CSV/JSON knowledge file content, not the dict repr#6174
ly-wang19 wants to merge 1 commit into
crewAIInc:mainfrom
ly-wang19:fix/knowledge-csv-json-chunk-content

ly-wang19 commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ly-wang19 commented Jun 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ly-wang19 commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading