fix: prevent infinite loop in OverlappingWindowChunking when overlap >= window_size #2012
Open
Osamaali313 wants to merge 1 commit into
…>= window_size

OverlappingWindowChunking.chunk() advanced the window with start = end - self.overlap. When overlap >= window_size this leaves start unchanged (overlap == window_size) or moves it backwards (overlap > window_size), so the while loop never terminates and the call hangs while the chunk list grows without bound.

Advance by a guaranteed-positive stride of max(1, window_size - overlap). For valid configurations (overlap < window_size) the stride equals window_size - overlap, so the produced chunks are identical to before; degenerate configurations now terminate instead of hanging.

Adds unit tests for OverlappingWindowChunking covering normal overlap, zero overlap, short input, and the overlap >= window_size regression.
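To make the stride argument concrete, here is a minimal sketch of the loop as the commit message describes it. The function `chunk_words` is a hypothetical stand-in written for this illustration, not the crawl4ai source; it works on a list of words and assumes the loop exits once a window reaches the last word.

```python
def chunk_words(words, window_size, overlap):
    """Hypothetical stand-in for the chunking loop; not the crawl4ai source."""
    # Guaranteed-positive stride, as described in the commit message.
    stride = max(1, window_size - overlap)
    chunks = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(words[start:end])
        if end >= len(words):  # assumed exit: the window reached the last word
            break
        start += stride  # always moves forward by at least one word
    return chunks


words = [f"w{i}" for i in range(10)]
valid = chunk_words(words, window_size=4, overlap=1)       # stride is 3
degenerate = chunk_words(words, window_size=4, overlap=6)  # stride clamped to 1
```

With the clamp in place, the degenerate call still slides one word at a time and ends on the final word rather than spinning on the same offset.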
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a regression fix to prevent infinite loops in OverlappingWindowChunking when overlap >= window_size, and introduces unit tests covering basic behavior and those edge cases.
Changes:
- Ensure the chunking loop always advances by introducing a positive stride.
- Add unit tests for overlap/no-overlap/short text and regression cases (overlap >= window_size).
- Verify chunk coverage reaches the final word in edge cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/general/test_chunking_strategy.py | Adds unit/regression tests validating termination and coverage for overlapping window chunking. |
| crawl4ai/chunking_strategy.py | Fixes potential infinite loop by forcing forward progress via a computed stride. |
Comment on lines +244 to +247

    + # The stride must be positive so ``start`` always advances. Otherwise an
    + # overlap >= window_size leaves start unchanged (or moving backwards),
    + # turning the crawl into an infinite loop that never terminates.
    + stride = max(1, self.window_size - self.overlap)

          break

    - start = end - self.overlap
    + start += stride
    def test_overlapping_window_no_overlap():
        chunks = OverlappingWindowChunking(window_size=100, overlap=0).chunk(_words(250))
        assert len(chunks) == 3  # 0-100, 100-200, 200-250
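The expected count of 3 in this test can be checked with a little arithmetic. The sketch below is illustrative only: `expected_chunk_count` is a hypothetical helper, and it assumes non-empty input and a loop that stops as soon as a window reaches the last word. Windows start at multiples of the stride, so the last window is the first one whose end reaches `n_words`.

```python
import math


def expected_chunk_count(n_words, window_size, overlap):
    """Illustrative arithmetic only; assumes n_words >= 1 and that the loop
    stops once a window reaches the last word."""
    if n_words <= window_size:
        return 1  # a single window already covers everything
    stride = max(1, window_size - overlap)
    # Smallest k with k * stride + window_size >= n_words, plus the first window.
    return 1 + math.ceil((n_words - window_size) / stride)


no_overlap = expected_chunk_count(250, window_size=100, overlap=0)
with_overlap = expected_chunk_count(250, window_size=100, overlap=20)
```

For 250 words, a window of 100, and no overlap, this gives 1 + ceil(150 / 100) = 3, matching the 0-100, 100-200, 200-250 windows noted in the test comment.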
Summary

`OverlappingWindowChunking.chunk()` hangs in an infinite loop whenever `overlap >= window_size`.

It advanced the window with `start = end - self.overlap`. When `overlap == window_size`, `start` is unchanged on every iteration; when `overlap > window_size`, `start` moves backwards. Either way the `while start < len(words)` loop never terminates and the chunk list grows without bound until the process runs out of memory or is killed.

The fix advances with a guaranteed-positive stride of `max(1, window_size - overlap)`. For valid configurations (`overlap < window_size`) the stride equals `window_size - overlap`, so `start += stride` is exactly equivalent to the old `start = end - overlap` and the produced chunks are unchanged. Degenerate configurations now terminate instead of hanging.

List of files changed and why

- `crawl4ai/chunking_strategy.py`: guard the slide step in `OverlappingWindowChunking.chunk()` so `start` always advances.
- `tests/general/test_chunking_strategy.py`: new unit tests for `OverlappingWindowChunking`: normal overlap, zero overlap, short input, and the `overlap >= window_size` regression that previously hung.

How Has This Been Tested?

I verified that for valid configurations (window=100/overlap=20, window=100/overlap=0, window=1000/overlap=100) the new stride-based loop produces byte-for-byte identical output to the previous implementation, and that the degenerate `overlap >= window_size` configurations now terminate and still reach the final word instead of looping forever. The added unit tests encode these cases.

Note: I validated the chunking logic in isolation and did not run the full crawl4ai dependency stack locally, so the new tests are intended to be confirmed by CI.
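An isolated check along these lines might look like the sketch below. It is not the test file from this PR: `chunk_starts` is a hypothetical helper that records window start offsets under either update rule, and `max_iters` is a safety cap added here so the old rule can be exercised without hanging. The 2500-word input length is an arbitrary choice for illustration.

```python
def chunk_starts(n_words, window_size, overlap, use_stride, max_iters=10_000):
    """Return window start offsets; max_iters is a hypothetical safety cap."""
    stride = max(1, window_size - overlap)
    starts, start = [], 0
    while start < n_words and len(starts) < max_iters:
        end = start + window_size
        starts.append(start)
        if end >= n_words:
            break
        # New rule: start += stride. Old rule: start = end - overlap.
        start = start + stride if use_stride else end - overlap
    return starts


# Valid configurations: both update rules visit the same offsets.
for window_size, overlap in [(100, 20), (100, 0), (1000, 100)]:
    old = chunk_starts(2500, window_size, overlap, use_stride=False)
    new = chunk_starts(2500, window_size, overlap, use_stride=True)
    assert old == new

# Degenerate configuration: the old rule stalls at offset 0 until the cap.
stuck = chunk_starts(50, window_size=10, overlap=10, use_stride=False, max_iters=100)
fixed = chunk_starts(50, window_size=10, overlap=10, use_stride=True)
```

Comparing start offsets rather than joined text keeps the check independent of how the real class splits and rejoins words.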
Checklist: