feat(knowledge): add token, sentence, recursive, and regex chunkers#4102
waleedlatif1 merged 20 commits into staging
PR Summary
Medium Risk
Overview: Introduces new chunker implementations (...). Enhances safety/validation around strategy inputs (e.g., overlap < chunk size, regex pattern required and length-limited, basic catastrophic-backtracking checks) and updates/extends chunker test coverage accordingly.
Reviewed by Cursor Bugbot for commit 97a0bd4.
Greptile Summary
This PR adds four new chunking strategies (token, sentence, recursive, regex) by extracting shared utilities into ...
Confidence Score: 5/5
Safe to merge; all remaining findings are P2 edge-case suggestions with no data-loss or correctness risk on typical inputs.
The two previously blocking issues (token chunker overlap, misleading label) are resolved. The two remaining findings require uncommon user inputs (regex capture groups, trailing commas in separators) and neither corrupts data for the default or normal usage paths.
apps/sim/lib/chunkers/regex-chunker.ts (capture group handling), apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx (separator filtering)
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
UI["CreateBaseModal\n(strategy selector)"] -->|"POST /api/knowledge"| API["route.ts\nZod validation\n(strategy + strategyOptions)"]
API --> KB_SVC["knowledge/service.ts\nstoreChunkingConfig in DB"]
KB_SVC --> JOB["dispatchDocumentProcessingJob"]
JOB --> PROC["processDocumentAsync\nreads strategy from DB"]
PROC --> DP["document-processor.ts\napplyStrategy()"]
DP -->|"auto"| AUTO{"auto-detect\ncontent type"}
AUTO -->|json/yaml| JY["JsonYamlChunker"]
AUTO -->|csv/xlsx| SD["StructuredDataChunker"]
AUTO -->|"default"| TC["TextChunker"]
DP -->|"token"| TK["TokenChunker"]
DP -->|"sentence"| SC["SentenceChunker"]
DP -->|"recursive"| RC["RecursiveChunker\n(plain/markdown/code recipe)"]
DP -->|"regex"| RX["RegexChunker\n(user pattern + ReDoS guard)"]
TK & SC & RC & RX & TC & JY & SD --> UTILS["utils.ts\nestimateTokens · cleanText\nsplitAtWordBoundaries · buildChunks · addOverlap"]
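The routing in the flowchart can be read as a small dispatch function. The following is a minimal sketch only: `pickChunker` and the `Strategy` type are hypothetical names, and auto-detection is simplified here to a file-extension check, which may not be how `applyStrategy()` actually detects content type.

```typescript
// Strategy names as they appear on the flowchart edges.
type Strategy = "auto" | "token" | "sentence" | "recursive" | "regex";

// Hypothetical helper: returns the name of the chunker the flowchart routes to.
// Assumption: "auto" is decided from the file extension alone.
function pickChunker(strategy: Strategy, fileExtension: string): string {
  if (strategy === "token") return "TokenChunker";
  if (strategy === "sentence") return "SentenceChunker";
  if (strategy === "recursive") return "RecursiveChunker";
  if (strategy === "regex") return "RegexChunker";

  // strategy === "auto": fall through to content-type detection.
  const ext = fileExtension.toLowerCase();
  if (ext === "json" || ext === "yaml") return "JsonYamlChunker";
  if (ext === "csv" || ext === "xlsx") return "StructuredDataChunker";
  return "TextChunker";
}
```

An explicit strategy always wins over the file type in this sketch; only `auto` looks at the extension.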
Reviews (5): Last reviewed commit: "fix(chunkers): restore separator-as-join..."
- Refactor all existing chunkers (Text, JsonYaml, StructuredData, Docs) to use shared utils
- Fix inconsistent token estimation (JsonYaml used tiktoken, StructuredData used /3 ratio)
- Fix DocsChunker operator precedence bug and hard-coded 300-token limit
- Fix JsonYamlChunker isStructuredData false positive on plain strings
- Add MAX_DEPTH recursion guard to JsonYamlChunker
- Replace @/components/ui/select with emcn DropdownMenu in strategy selector
- Expand RecursiveChunker recipes: markdown adds horizontal rules, code fences, blockquotes; code adds const/let/var/if/for/while/switch/return
- RecursiveChunker fallback uses splitAtWordBoundaries instead of char slicing
- RegexChunker ReDoS test uses adversarial strings (repeated chars, spaces)
- SentenceChunker abbreviation list adds St/Rev/Gen/No/Fig/Vol/months and single-capital-letter lookbehind
- Add overlap < maxSize validation in Zod schema and UI form
- Add pattern max length (500) validation in Zod schema
- Fix StructuredDataChunker footer grammar
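The two validation rules in this commit (overlap < maxSize, pattern length capped at 500) can be illustrated without Zod. This is a dependency-free sketch, not the PR's schema: `validateChunkingInput`, `ChunkingInput`, and the error strings are hypothetical, and the "pattern must compile" check is an extra assumption added for illustration.

```typescript
type ChunkStrategy = "auto" | "token" | "sentence" | "recursive" | "regex";

// Hypothetical input shape; field names are assumptions.
interface ChunkingInput {
  strategy: ChunkStrategy;
  maxSize: number;
  overlap: number;
  pattern?: string;
}

const MAX_PATTERN_LENGTH = 500; // limit stated in the commit message

// Returns a list of human-readable problems; an empty list means the input is valid.
function validateChunkingInput(input: ChunkingInput): string[] {
  const errors: string[] = [];

  if (input.overlap >= input.maxSize) {
    errors.push("overlap must be smaller than maxSize");
  }

  if (input.strategy === "regex") {
    const pattern = input.pattern ?? "";
    if (pattern.length === 0) {
      errors.push("pattern is required for the regex strategy");
    } else if (pattern.length > MAX_PATTERN_LENGTH) {
      errors.push(`pattern must be at most ${MAX_PATTERN_LENGTH} characters`);
    } else {
      // Assumption: reject patterns that do not compile at all.
      try {
        new RegExp(pattern);
      } catch {
        errors.push("pattern is not a valid regular expression");
      }
    }
  }

  return errors;
}
```

Collecting every problem instead of throwing on the first one lets a form show all field errors at once.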
- DocsChunker: extract headers from cleaned content (not raw markdown) to fix position mismatch between header positions and chunk positions
- DocsChunker: strip export statements and JSX expressions in cleanContent
- DocsChunker: fix table merge dedup using equality instead of includes
- JsonYamlChunker: preserve path breadcrumbs when nested value fits in one chunk, matching LangChain RecursiveJsonSplitter behavior
- StructuredDataChunker: detect 2-column CSV (lowered threshold from >2 to >=1) and use 20% relative tolerance instead of absolute +/-2
- TokenChunker: use sliding window overlap (matching LangChain/Chonkie) where chunks stay within chunkSize instead of exceeding it
- utils: splitAtWordBoundaries accepts optional stepChars for sliding window overlap; addOverlap uses newline join instead of space
- Fix SentenceChunker regex: lookbehinds now include the period to correctly handle abbreviations (Mr., Dr., etc.), initials (J.K.), and decimals
- Fix RegexChunker ReDoS: reset lastIndex between adversarial test iterations, add poisoned-suffix test strings
- Fix DocsChunker: skip code blocks during table boundary detection to prevent false positives from pipe characters
- Fix JsonYamlChunker: oversized primitive leaf values now fall back to text chunking instead of emitting a single chunk
- Fix TokenChunker: pass 0 to buildChunks for overlap metadata since sliding window handles overlap inherently
- Add defensive guard in splitAtWordBoundaries to prevent infinite loops if step is 0
- Add tests for utils, TokenChunker, SentenceChunker, RecursiveChunker, RegexChunker (236 total tests, 0 failures)
- Fix existing test expectations for updated footer format and isStructuredData behavior
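To illustrate why the lookbehinds need to include the period, here is a minimal sentence splitter. It is a sketch under assumptions: `splitSentences` is a hypothetical name, the abbreviation list is a small subset of the one described above, and the actual SentenceChunker regex may differ.

```typescript
// Split on whitespace that follows ., ! or ?, unless the text right before the
// whitespace is a known abbreviation ("Dr.") or a single-capital initial ("J.").
// The negative lookbehinds match the period itself, so "Dr." is tested as a unit.
const SENTENCE_BOUNDARY =
  /(?<=[.!?])(?<!\b(?:Mr|Mrs|Ms|Dr|Prof|St|Fig|Vol)\.)(?<!\b[A-Z]\.)\s+/;

// Hypothetical helper: returns trimmed, non-empty sentences.
function splitSentences(text: string): string[] {
  return text
    .split(SENTENCE_BOUNDARY)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Decimals such as "3.50" are never split because no whitespace follows the period.
const example = splitSentences("Dr. Smith met J. K. Rowling. It cost 3.50 dollars. Cheap!");
console.log(example);
```

A known tradeoff of the single-capital rule is that a sentence ending in a lone capital letter ("Plan B. Next.") is not split.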
Strip 445 lines of redundant TSDoc, math calculation comments, implementation rationale notes, and assertion-restating comments across all chunker source and test files.
- Fix regex fallback path: use sliding window for overlap instead of passing chunkOverlap to buildChunks without prepended overlap text
- Fix misleading strategy label: "Text (hierarchical splitting)" → "Text (word boundary splitting)"
Use addOverlap + buildChunks(chunks, overlap) in the regex fallback path to match the main path and all other chunkers (TextChunker, RecursiveChunker). The sliding window approach was inconsistent.
@greptile
@cursor review
When splitAtWordBoundaries snaps end back to a word boundary, advance pos from end (not pos + step) in non-overlapping mode. The step-based advancement is preserved for the sliding window case (TokenChunker).
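The two advancement modes described in this commit can be sketched as follows. This is an illustrative reimplementation under assumptions, not the code in utils.ts: the parameter names, the space-only boundary rule, and the backward snap of the sliding-window start are all choices made for the example.

```typescript
// Splits text into pieces of at most chunkChars characters, ending on a space
// where possible. Without stepChars, pieces do not overlap and the next piece
// starts where the previous one ended. With stepChars, the window slides by a
// fixed step so consecutive pieces overlap while staying within chunkChars.
function splitAtWordBoundaries(text: string, chunkChars: number, stepChars?: number): string[] {
  const size = Math.max(1, Math.floor(chunkChars));
  const parts: string[] = [];
  let pos = 0;

  while (pos < text.length) {
    let end = Math.min(pos + size, text.length);
    if (end < text.length) {
      // Snap the end back to the last space inside the window, if any.
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > pos) end = lastSpace;
    }

    const piece = text.slice(pos, end).trim();
    if (piece.length > 0) parts.push(piece);
    if (end >= text.length) break;

    if (stepChars === undefined) {
      // Non-overlapping mode: continue from the snapped end, not pos + size.
      pos = end;
    } else {
      // Sliding-window mode: a step of at least 1 guards against infinite loops.
      let next = pos + Math.max(1, Math.floor(stepChars));
      const midWord = next < text.length && text[next] !== " " && text[next - 1] !== " ";
      if (midWord) {
        // Assumption: start windows on a word start by snapping backward.
        const prevSpace = text.lastIndexOf(" ", next);
        if (prevSpace > pos) next = prevSpace + 1;
      }
      pos = next;
    }
  }
  return parts;
}
```

Advancing from the snapped `end` in non-overlapping mode is what keeps words from being skipped or duplicated when the window is shortened to a boundary.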
@cursor review
When no complete sentence fits within the overlap budget, fall back to character-level word-boundary overlap from the previous group's text. This ensures buildChunks metadata is always correct.
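A minimal sketch of that fallback, assuming overlap is budgeted in characters: `overlapFromPreviousGroup` and its parameters are hypothetical names, and the real SentenceChunker may measure the budget in tokens instead.

```typescript
// Builds the overlap text to prepend to the next group of sentences.
// First tries to take whole trailing sentences from the previous group; if not
// even the last sentence fits, falls back to a word-boundary tail of its text.
function overlapFromPreviousGroup(prevSentences: string[], overlapChars: number): string {
  if (overlapChars <= 0 || prevSentences.length === 0) return "";

  // Pass 1: whole sentences, taken from the end, joined by single spaces.
  const picked: string[] = [];
  let used = 0;
  for (let i = prevSentences.length - 1; i >= 0; i--) {
    const sentence = prevSentences[i];
    const cost = sentence.length + (picked.length > 0 ? 1 : 0);
    if (used + cost > overlapChars) break;
    picked.unshift(sentence);
    used += cost;
  }
  if (picked.length > 0) return picked.join(" ");

  // Pass 2 (fallback): last overlapChars characters, trimmed to a word start.
  const prevText = prevSentences.join(" ");
  const cut = prevText.length - overlapChars;
  let tail = prevText.slice(Math.max(0, cut));
  const startsAtBoundary = cut <= 0 || prevText[cut - 1] === " ";
  if (!startsAtBoundary) {
    const firstSpace = tail.indexOf(" ");
    if (firstSpace === -1) return ""; // only a partial word is available
    tail = tail.slice(firstSpace + 1);
  }
  return tail.trim();
}
```

Because the fallback always returns text that really is prepended, any overlap value recorded in chunk metadata can stay consistent with the chunk content.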
@greptile
@cursor review
- Fix regex fallback log: "character splitting" → "word-boundary splitting"
- Add Jun and Jul to sentence chunker abbreviation list
avgCount >= 1 was too permissive — prose with consistent comma usage would be misclassified as CSV. Restore original > 2 threshold while keeping the improved proportional tolerance.
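The heuristic described here (average comma count above 2, with each line within 20% of the average) could look roughly like the sketch below. `looksLikeCsv`, the ten-line sample size, and the two-line minimum are assumptions for illustration, not the StructuredDataChunker's actual code.

```typescript
// Guesses whether text is comma-separated data by checking that the sampled
// lines contain a similar, sufficiently large number of commas.
function looksLikeCsv(text: string, sampleLines: number = 10): boolean {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(0, sampleLines);
  if (lines.length < 2) return false; // assumption: one line is not enough evidence

  const counts = lines.map((line) => line.split(",").length - 1);
  const avgCount = counts.reduce((sum, c) => sum + c, 0) / counts.length;

  // Threshold restored by this commit: prose with one or two commas per line
  // must not be classified as CSV.
  if (avgCount <= 2) return false;

  // Proportional tolerance: every line within 20% of the average.
  const tolerance = avgCount * 0.2;
  return counts.every((c) => Math.abs(c - avgCount) <= tolerance);
}
```

With these rules, a file with four or more columns passes, while sentences that each contain a single comma are rejected by the threshold before the tolerance check runs.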
@greptile
@cursor review
@greptile
@cursor review
Separator was unconditionally prepended to parts after the first, leaving leading punctuation on chunks after a boundary reset.
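The fix can be illustrated with a small merge loop in which the separator is used only as a join between parts inside the same chunk. This is a sketch under assumptions: `mergeParts` is a hypothetical name, sizes are measured in characters, and oversized single parts are passed through unchanged.

```typescript
// Greedily merges parts into chunks of at most maxChars characters.
// The separator is inserted only when appending to a non-empty chunk, so a
// chunk that starts after a boundary reset never begins with the separator.
function mergeParts(parts: string[], separator: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";

  for (const part of parts) {
    const candidate = current.length > 0 ? current + separator + part : part;
    if (current.length === 0 || candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = part; // boundary reset: start fresh with no leading separator
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

The buggy variant would build `separator + part` for every part after the first, which is what left leading punctuation on the first part of each new chunk.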
@greptile
@cursor review
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 97a0bd4.
Parses JSON Lines files by splitting on newlines and converting to a JSON array, which then flows through the existing JsonYamlChunker.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
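The JSON Lines conversion described in this commit amounts to parsing each non-empty line and collecting the results. A minimal sketch follows; `jsonlToArray` is a hypothetical name, and skipping blank lines and reporting 1-based line numbers are assumptions made for the example.

```typescript
// Parses JSON Lines content (one JSON value per line) into an array, so the
// result can be handed to any code that already accepts a parsed JSON array.
function jsonlToArray(content: string): unknown[] {
  const values: unknown[] = [];
  const lines = content.split(/\r?\n/);

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line.length === 0) continue; // assumption: blank lines are ignored
    try {
      values.push(JSON.parse(line));
    } catch {
      throw new Error(`Invalid JSON on line ${i + 1}`);
    }
  }
  return values;
}

// Re-serializing gives a regular JSON array document.
const asJsonArray = JSON.stringify(jsonlToArray('{"a":1}\n\n{"b":[2,3]}\n'));
console.log(asJsonArray);
```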

Summary
Type of Change
Testing
Tested manually. All 53 existing chunker tests pass.
Checklist