Skip to content

Cache Playwright seed repo clones for FalkorDB test data#552

Open
Copilot wants to merge 9 commits intostagingfrom
copilot/cache-pre-seed-falkordb-test-data
Open

Cache Playwright seed repo clones for FalkorDB test data#552
Copilot wants to merge 9 commits intostagingfrom
copilot/cache-pre-seed-falkordb-test-data

Conversation

Copy link
Contributor

Copilot AI commented Mar 10, 2026

Playwright setup was spending most of its time re-cloning and re-analyzing the same two repositories on every run. This change avoids repeated network clone cost by reusing cached local clones and invalidating that cache only when the seed inputs change.

  • Workflow cache for seeded repositories

    • Add a GitHub Actions cache for the repositories/ directory used by e2e/seed_test_data.py
    • Key the cache by:
      • e2e/seed_test_data.py contents
      • current HEAD SHA of FalkorDB/GraphRAG-SDK
      • current HEAD SHA of pallets/flask
    • This keeps cache reuse automatic while refreshing when either upstream seed repo changes
  • Seed script clone reuse

    • Update e2e/seed_test_data.py to load projects from cached local clones instead of calling Project.from_git_repository(...) every run
    • On cache miss, do a fresh shallow clone into repositories/<repo>
    • Keep the existing source analysis and graph edge seeding behavior unchanged
  • Focused regression coverage

    • Add targeted tests for:
      • reusing an existing cached clone
      • cloning into the cache path when no cached repo is present

Example of the new seed path:

def load_project(url: str) -> Project:
    repo_path = REPOSITORIES_DIR / repo_name_from_url(url)

    if (repo_path / ".git").exists():
        logger.info("Using cached repository clone at %s", repo_path)
    else:
        logger.info("Cloning repository into cache at %s", repo_path)
        fresh_clone_repository(url, repo_path)

    return Project.from_local_repository(repo_path)
Original prompt

This section details on the original issue you should resolve

<issue_title>Cache or pre-seed FalkorDB test data</issue_title>
<issue_description>### 1. Cache or pre-seed FalkorDB test data (saves ~3.5 min/shard = ~7 min total)

seed_test_data.py clones 2 GitHub repos (GraphRAG-SDK, Flask) and runs full source analysis every run. This is the single biggest time sink at 37% of Playwright runtime.

Options:

  • Best: Export the seeded graph as an RDB dump, commit it as a test fixture, and restore with redis-cli. Eliminates the 220s step entirely.
  • Good: Cache the cloned repos + analysis output with actions/cache keyed on the seed script hash + repo commit SHAs.
  • Minimum: Cache just the git clones to skip network time.</issue_description>

Comments on the Issue (you are @copilot in this section)


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

@vercel
Copy link

vercel bot commented Mar 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
code-graph Error Error Mar 10, 2026 9:45am

Request Review

Co-authored-by: gkorland <753206+gkorland@users.noreply.github.com>
Co-authored-by: gkorland <753206+gkorland@users.noreply.github.com>
Copilot AI changed the title [WIP] Cache or pre-seed FalkorDB test data to improve performance Cache Playwright seed repo clones for FalkorDB test data Mar 10, 2026
@gkorland gkorland marked this pull request as ready for review March 10, 2026 08:27
@gkorland gkorland requested a review from Copilot March 10, 2026 08:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the "Good" caching option from issue #551, reducing Playwright CI runtime by avoiding repeated network clones of two seeded repositories (FalkorDB/GraphRAG-SDK and pallets/flask) on every run. It introduces a repositories/ directory cache in GitHub Actions keyed on the seed script hash and upstream HEAD SHAs, and updates the seeding script to reuse existing local clones instead of always calling Project.from_git_repository(...).

Changes:

  • Workflow cache step added with a composite cache key (OS + seed script hash + both upstream repo SHAs)
  • e2e/seed_test_data.py refactored to check for a cached .git directory and skip re-cloning when present; introduces repo_name_from_url, fresh_clone_repository, and load_project helpers
  • New unit tests for the two main load_project code paths (cache hit and cache miss)

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/playwright.yml Adds SHA-resolution step and actions/cache@v4 step for repositories/ before seeding
e2e/seed_test_data.py Replaces unconditional Project.from_git_repository with load_project which reuses cached shallow clones
tests/test_seed_test_data.py New test file covering the cache-hit and cache-miss branches of load_project

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@gkorland
Copy link
Contributor

@CodeRabbit review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 10, 2026

Warning

Rate limit exceeded

@gkorland has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 5 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16c00ef3-f865-4b88-8adf-98a7f35bfce9

📥 Commits

Reviewing files that changed from the base of the PR and between f063417 and f779df8.

📒 Files selected for processing (3)
  • .github/workflows/playwright.yml
  • e2e/seed_test_data.py
  • tests/test_seed_test_data.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch copilot/cache-pre-seed-falkordb-test-data

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

@gkorland gkorland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Must Fix

  1. actions/cache@v4 should be @v5 — The existing Playwright browser cache in this workflow uses actions/cache@v5. The new seed-repo cache step uses @v4. Should be consistent.

Considerations

  1. restore-keys partial match can serve stale clones — The restore key matches on script hash but ignores upstream repo SHAs. On a partial cache hit, load_project will use the cached clone without pulling, so tests may run against outdated code. Probably acceptable since these repos don't change often, but worth documenting as a known trade-off.

  2. Limited CI time savings — This implements the "Minimum" option from the issue (cache git clones). The bulk of the ~220s comes from proj.analyze_sources(), which still runs on every invocation. This PR saves network clone time (likely seconds, not minutes). For dramatic improvement, the analysis output or an RDB dump would need to be cached as well.

Nice to Have

  1. repo_name_from_url edge case tests — No tests for trailing slashes, .git suffix, etc. Not a blocker but would improve confidence.

Copy link
Contributor

@gkorland gkorland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Must Fix

  1. actions/cache@v4 should be @v5 — The existing Playwright browser cache in this workflow uses actions/cache@v5. The new seed-repo cache step uses @v4. Should be consistent.

Considerations

  1. restore-keys partial match can serve stale clones — The restore key matches on script hash but ignores upstream repo SHAs. On a partial cache hit, load_project will use the cached clone without pulling, so tests may run against outdated code. Probably acceptable since these repos don't change often, but worth documenting as a known trade-off.

  2. Limited CI time savings — This implements the "Minimum" option from the issue (cache git clones). The bulk of the ~220s comes from proj.analyze_sources(), which still runs on every invocation. This PR saves network clone time (likely seconds, not minutes). For dramatic improvement, the analysis output or an RDB dump would need to be cached as well.

Nice to Have

  1. repo_name_from_url edge case tests — No tests for trailing slashes, .git suffix, etc. Not a blocker but would improve confidence.

Match the version used by the existing Playwright browser cache step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gkorland
Copy link
Contributor

@copilot please resolve the conflicts

Copy link
Contributor Author

Copilot AI commented Mar 10, 2026

Code Review

Must Fix...

Updated in 8bd70fe to use actions/cache@v5 for the seed repo cache. The current workflow also keeps the restore key behavior, with the stale-clone tradeoff limited to cache reuse and no pull-on-restore refresh.

Code Review

Must Fix...

Addressed in 8bd70fe: the seed repository cache step now uses actions/cache@v5. I left the partial restore behavior in place as the current tradeoff for clone reuse, and did not expand this PR to cache analysis output or RDB fixtures.

@copilot please resolve the conflicts

Resolved on this branch in 0b541c6. I checked the current tree and there are no unresolved conflict files, and the latest Playwright run for the branch succeeded at 8bd70fe.

@gkorland
Copy link
Contributor

@copilot please resolve the conflicts

Copy link
Contributor Author

Copilot AI commented Mar 10, 2026

@copilot please resolve the conflicts

Resolved on this branch in 0b541c6. I re-checked the current tree and there are no unresolved conflict files, and the latest Playwright run for the branch succeeded at 8bd70fe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cache or pre-seed FalkorDB test data

3 participants