
fix: prevent infinite loop in OverlappingWindowChunking when overlap >= window_size #2012

Open
Osamaali313 wants to merge 1 commit into unclecode:develop from Osamaali313:fix/overlapping-window-infinite-loop

Conversation

@Osamaali313


Summary

OverlappingWindowChunking.chunk() hangs in an infinite loop whenever overlap >= window_size.

Previously the loop advanced the window with start = end - self.overlap. Because end = start + window_size, each iteration moves start by window_size - overlap. When overlap == window_size, start is unchanged on every iteration; when overlap > window_size, start moves backwards. Either way the while start < len(words) loop never terminates, and the chunk list grows without bound until the process runs out of memory or is killed.
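The arithmetic can be traced with a small standalone sketch. The helper name `legacy_starts` is hypothetical and is not part of crawl4ai; it only replays the old update rule for a fixed number of steps so that it cannot hang.

```python
def legacy_starts(window_size: int, overlap: int, steps: int) -> list[int]:
    """Replay the old update rule (start = end - overlap) for a bounded
    number of iterations and record each value of start.

    Hypothetical helper for illustration only; the real loop has no such bound.
    """
    starts = []
    start = 0
    for _ in range(steps):
        starts.append(start)
        end = start + window_size
        start = end - overlap  # the pre-fix update
    return starts


print(legacy_starts(5, 2, 4))  # valid config: [0, 3, 6, 9]
print(legacy_starts(5, 5, 4))  # overlap == window_size: [0, 0, 0, 0]
print(legacy_starts(5, 7, 4))  # overlap > window_size: [0, -2, -4, -6]
```

In the last two cases start never reaches len(words), so the loop condition stays true.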

The fix advances with a guaranteed-positive stride of max(1, window_size - overlap). For valid configurations (overlap < window_size) the stride equals window_size - overlap, so start += stride is exactly equivalent to the old start = end - overlap and the produced chunks are unchanged. Degenerate configurations now terminate instead of hanging.
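A minimal sketch of the stride-based loop described above, written as a free function so it runs without crawl4ai installed. The function name `chunk_words` and the short-input early return are assumptions about the shape of the method, not a copy of the code in crawl4ai/chunking_strategy.py.

```python
def chunk_words(text: str, window_size: int, overlap: int) -> list[str]:
    """Split text into overlapping word windows, always moving forward.

    Illustrative stand-in for OverlappingWindowChunking.chunk(); the
    surrounding structure is assumed, only the stride rule is from the PR.
    """
    words = text.split()
    if len(words) <= window_size:
        return [text]  # assumed behaviour for short input

    # Guaranteed-positive step: equals window_size - overlap for valid
    # configurations, and falls back to 1 when overlap >= window_size.
    stride = max(1, window_size - overlap)

    chunks = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # the last window already covers the final word
        start += stride
    return chunks


text = " ".join(f"w{i}" for i in range(10))
print(len(chunk_words(text, 4, 1)))  # valid config, stride 3: 3 chunks
print(len(chunk_words(text, 3, 3)))  # degenerate config, stride 1: 8 chunks
```

With a degenerate configuration the loop degrades to a one-word step, which produces many heavily overlapping chunks but still terminates.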

List of files changed and why

  • crawl4ai/chunking_strategy.py — guard the slide step in OverlappingWindowChunking.chunk() so start always advances.
  • tests/general/test_chunking_strategy.py — new unit tests for OverlappingWindowChunking: normal overlap, zero overlap, short input, and the overlap >= window_size regression that previously hung.

How Has This Been Tested?

I verified that for valid configurations (window=100/overlap=20, window=100/overlap=0, window=1000/overlap=100) the new stride-based loop produces byte-for-byte identical output to the previous implementation, and that the degenerate overlap >= window_size configurations now terminate and still reach the final word instead of looping forever. The added unit tests encode these cases.
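The equivalence check for valid configurations could look roughly like the sketch below. The helpers `sliding_chunks` and `outputs_match` are hypothetical names, and the 2500-word input for the window=1000 case is an assumed size; the PR does not state which inputs were used.

```python
def sliding_chunks(words, window_size, overlap, use_stride):
    """Run the window loop with either the new stride update or the old one."""
    stride = max(1, window_size - overlap)
    chunks = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # New rule: start += stride. Old rule: start = end - overlap.
        start = start + stride if use_stride else end - overlap
    return chunks


def outputs_match(n_words, window_size, overlap):
    """Compare old and new updates; only safe when overlap < window_size."""
    if overlap >= window_size:
        raise ValueError("the old update rule does not terminate here")
    words = [f"w{i}" for i in range(n_words)]
    new = sliding_chunks(words, window_size, overlap, use_stride=True)
    old = sliding_chunks(words, window_size, overlap, use_stride=False)
    return new == old


for n_words, window, overlap in [(250, 100, 20), (250, 100, 0), (2500, 1000, 100)]:
    print(window, overlap, outputs_match(n_words, window, overlap))
```

The guard in `outputs_match` exists because running the old rule on a degenerate configuration would reproduce the hang this PR fixes.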

Note: I validated the chunking logic in isolation and did not run the full crawl4ai dependency stack locally, so the new tests are intended to be confirmed by CI.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…>= window_size

OverlappingWindowChunking.chunk() advanced the window with
start = end - self.overlap. When overlap >= window_size this leaves start
unchanged (overlap == window_size) or moves it backwards (overlap >
window_size), so the while loop never terminates and the call hangs while
the chunk list grows without bound.

Advance by a guaranteed-positive stride of max(1, window_size - overlap).
For valid configurations (overlap < window_size) the stride equals
window_size - overlap, so the produced chunks are identical to before;
degenerate configurations now terminate instead of hanging.

Adds unit tests for OverlappingWindowChunking covering normal overlap,
zero overlap, short input, and the overlap >= window_size regression.
Copilot AI review requested due to automatic review settings June 13, 2026 20:03

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a regression fix to prevent infinite loops in OverlappingWindowChunking when overlap >= window_size, and introduces unit tests covering basic behavior and those edge cases.

Changes:

  • Ensure chunking loop always advances by introducing a positive stride.
  • Add unit tests for overlap/no-overlap/short text and regression cases (overlap >= window_size).
  • Verify chunk coverage reaches the final word in edge cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/general/test_chunking_strategy.py | Adds unit/regression tests validating termination and coverage for overlapping window chunking. |
| crawl4ai/chunking_strategy.py | Fixes potential infinite loop by forcing forward progress via a computed stride. |


Comment on lines +244 to +247
+        # The stride must be positive so ``start`` always advances. Otherwise an
+        # overlap >= window_size leaves start unchanged (or moving backwards),
+        # turning the crawl into an infinite loop that never terminates.
+        stride = max(1, self.window_size - self.overlap)

                 break

-            start = end - self.overlap
+            start += stride

def test_overlapping_window_no_overlap():
    chunks = OverlappingWindowChunking(window_size=100, overlap=0).chunk(_words(250))
    assert len(chunks) == 3  # 0-100, 100-200, 200-250


2 participants