fix: clone all batches before processing commits to avoid broken range [CM-1200] #4138

Merged
mbani01 merged 5 commits into main from fix/simplify-batched-commit-range on May 21, 2026

mbani01 commented May 20, 2026

This pull request refactors and simplifies the batched cloning and commit processing logic for repositories. The main changes remove unnecessary tracking of shallow clone boundaries, streamline batch management, and improve error handling and metrics reporting. The changes also clarify the conditions for completing a clone and standardize the commit processing flow.

Clone batch management and logic simplification:

  • Removed edge_commit and prev_batch_edge_commit fields from the CloneBatchInfo model and all related logic, eliminating the need to track shallow clone boundaries for batch processing. [1] [2] [3]
  • Refactored _check_if_final_batch to more clearly determine when a batched clone is complete, including handling of timeouts and force-push scenarios, and raising ReOnboardingRequiredError as needed.
  • Updated the batch generator to yield only meaningful batches and log clone completion, further simplifying the batch loop.
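As a rough illustration of the completion check described in this list, the decision can be sketched as a pure function. Everything here is an assumption made for illustration: the function name `check_if_final_batch`, its parameters, and the locally defined `ReOnboardingRequiredError` stand in for the project's real code, which this page does not show.

```python
class ReOnboardingRequiredError(Exception):
    """Local stand-in for the error named in the PR (hypothetical definition)."""


def check_if_final_batch(
    last_commit_available: bool,
    repo_still_shallow: bool,
    elapsed_seconds: float,
    stuck_timeout_seconds: float,
) -> bool:
    """Return True when a batched clone can stop deepening.

    All inputs are assumed to be computed by the caller, for example from
    git commands and a clone start timestamp.
    """
    if last_commit_available:
        # The last processed commit is present locally, so the range
        # last_processed_commit..HEAD can be resolved.
        return True
    if not repo_still_shallow:
        # The full history is present but the commit is missing, which
        # suggests the history was rewritten (for example by a force-push).
        raise ReOnboardingRequiredError("last processed commit not found in full history")
    if elapsed_seconds >= stuck_timeout_seconds:
        # Deepening has run longer than the configured limit.
        raise ReOnboardingRequiredError("clone exceeded the configured stuck timeout")
    # Keep deepening.
    return False
```

Keeping the check free of git calls makes the three outcomes (done, keep deepening, re-onboard) easy to unit test; the real method may well be structured differently.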

Commit processing and metrics:

  • Simplified the commit processing method (process_batch_commits), removing complex metrics context resetting and error tracking, and always updating the last processed commit at the end. [1] [2] [3] [4] [5]
  • Refactored _execute_git_log to use a consistent commit range (last_processed_commit..HEAD) for batched clones, removing the need for shallow boundary logic. [1] [2]
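A minimal sketch of the range selection described above, assuming a hypothetical helper `build_git_log_args` (not a function from the PR) that only builds the argument list. The `A..B` range syntax is standard git; the fallback when no commit has been processed yet is an assumption.

```python
from typing import List, Optional


def build_git_log_args(
    last_processed_commit: Optional[str],
    batched: bool,
    ref: str = "HEAD",
) -> List[str]:
    """Build the argv for `git log` (hypothetical helper, for illustration only)."""
    args = ["git", "log"]
    if batched and last_processed_commit:
        # Batched clone: only commits reachable from HEAD but not from the
        # last processed commit.
        args.append(f"{last_processed_commit}..HEAD")
    else:
        # Full clone, or nothing processed yet (assumed fallback): walk the
        # whole history of the ref.
        args.append(ref)
    return args
```

For example, `build_git_log_args("abc123", batched=True)` yields `["git", "log", "abc123..HEAD"]`.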

Configuration and error handling:

  • Added support for stuck repository timeouts and re-onboarding errors, with new settings imports and error handling paths. [1] [2] [3]

These changes make the clone and commit processing code easier to maintain and less error-prone by removing unnecessary complexity and clarifying the core logic.


Note

Medium Risk
Changes the core incremental clone/commit pipeline and alters when commits are processed and how last_processed_commit is advanced; mistakes could cause missed commits or unnecessary re-onboarding.

Overview
Batched cloning now deepens to completion before commit processing. CloneBatchInfo drops shallow-boundary fields (edge_commit, prev_batch_edge_commit), and CloneService reworks final-batch detection to verify the full last_processed_commit..HEAD range is available, raising ReOnboardingRequiredError on force-push or configured “stuck” timeouts.

Commit extraction is simplified and runs only once per repo update. CommitService consolidates into process_batch_commits, runs git log either for the full ref (full clone) or last_processed_commit..HEAD (batched), records per-run execution metrics, and moves update_last_processed_commit into the commit service; RepositoryWorker now calls commit processing only for the final batch.
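The "only for the final batch" behaviour can be sketched as a small consumer loop. This is an assumption-laden illustration: `BatchInfo` is a hypothetical stand-in for `CloneBatchInfo`, and `run_repo_update` and the `process_commits` callback are made-up names, not the actual `RepositoryWorker` code.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, Callable


@dataclass
class BatchInfo:
    """Hypothetical stand-in for CloneBatchInfo with only the fields used here."""
    repo_path: str
    is_final_batch: bool


async def run_repo_update(
    batches: AsyncIterator[BatchInfo],
    process_commits: Callable[[BatchInfo], Awaitable[None]],
) -> int:
    """Consume clone batches and process commits only once cloning is complete.

    Returns how many times commit processing ran.
    """
    runs = 0
    async for batch in batches:
        if not batch.is_final_batch:
            # Intermediate batch: history is still being deepened, so the
            # commit range is not yet safe to read.
            continue
        await process_commits(batch)
        runs += 1
    return runs
```

With a batch stream that ends in a single final batch, commit processing runs exactly once per repository update.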

Tests/fixtures are updated to reflect the new flow and activity payloads (notably adding username fields and updating expected outputs).

Reviewed by Cursor Bugbot for commit 214b637.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Copilot AI review requested due to automatic review settings May 20, 2026 17:45
@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Copilot AI left a comment

Pull request overview

This PR adjusts the git integration flow so that, for batched (shallow + deepen) clones, the worker completes all deepening first and only then runs commit processing, avoiding issues with unstable/invalid commit ranges while the shallow boundary moves.

Changes:

  • Move commit processing in RepositoryWorker to run only when batch_info.is_final_batch is reached.
  • Simplify commit range selection in CommitService to use last_processed_commit..HEAD for batched clones and remove edge-commit range optimization logic.
  • Change CloneService.clone_batches_generator() to stop yielding per-deepen batch and instead yield again only after deepening completes.
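The generator change in the last bullet can be pictured with a simplified synchronous sketch. The name `clone_batches` and the `deepen` and `is_complete` callbacks are hypothetical, and the real `clone_batches_generator()` may differ (for example, it may be asynchronous and yield `CloneBatchInfo` objects rather than dicts).

```python
from typing import Callable, Dict, Iterator


def clone_batches(
    deepen: Callable[[], None],
    is_complete: Callable[[], bool],
) -> Iterator[Dict[str, bool]]:
    """Yield once after the initial clone and once more when deepening is done."""
    if is_complete():
        # The initial clone already covers the needed history (assumed case).
        yield {"is_final_batch": True}
        return
    # First yield: the initial minimal clone, not yet final.
    yield {"is_final_batch": False}
    # Deepen repeatedly without yielding per-deepen batches.
    while not is_complete():
        deepen()
    # Second yield: the clone is complete and commits can be processed.
    yield {"is_final_batch": True}
```

Callers therefore see at most two batches, and only the last one is marked final.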

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File and description:

  • services/apps/git_integration/src/crowdgit/worker/repository_worker.py: Defers commit processing until the final clone batch.
  • services/apps/git_integration/src/crowdgit/services/commit/commit_service.py: Removes edge-based range logic; uses last_processed_commit..HEAD for batched clones; simplifies commit skipping.
  • services/apps/git_integration/src/crowdgit/services/clone/clone_service.py: Stops yielding intermediate batches; yields after deepening completes.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment thread services/apps/git_integration/src/crowdgit/services/clone/clone_service.py Outdated
Comment on lines +163 to +168
await update_last_processed_commit(
    repo_id=repository.id,
    commit_hash=batch_info.latest_commit_in_repo,
    branch=await get_default_branch(batch_info.repo_path),
)

mbani01 (author) replied:

latest_commit_in_repo is set by CloneService during the initial minimal clone, so this is out of scope and currently working as expected.

Comment thread services/apps/git_integration/src/test/test_activity_extraction.py Outdated
Comment thread services/apps/git_integration/src/test/test_activity_extraction.py Outdated
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread services/apps/git_integration/src/crowdgit/services/clone/clone_service.py Outdated
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

mbani01 marked this pull request as ready for review May 21, 2026 12:41
mbani01 changed the title from "fix: clone all batches before processing commits to avoid broken range" to "fix: clone all batches before processing commits to avoid broken range [CM-1200]" May 21, 2026
mbani01 requested a review from themarolt May 21, 2026 13:05
mbani01 merged commit cd5292c into main May 21, 2026
15 checks passed
mbani01 deleted the fix/simplify-batched-commit-range branch May 21, 2026 15:00

4 participants