refactor(tests): split suites by execution level and speed up CI#1609
refactor(tests): split suites by execution level and speed up CI#1609
Conversation
|
@claude review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb98e3ad52
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6fed925e43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 170beb8201
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: daaf571e78
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@claude review |
|
@codex review |
|
Codex Review: Didn't find any major issues. Hooray! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c38f2a197
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…es-by-directory # Conflicts: # .github/workflows/ci.yml # tests/e2e/test_datasets.py # tests/unit/test_resource_manager.py
There was a problem hiding this comment.
All prior concerns addressed — LGTM.
Extended reasoning...
Overview
This large refactor restructures 56 files to split tests into unit/e2e/live_provider/support directories, speeds up CI, and hardens several production components (prompt cache thread-safety, sentinel-based worker shutdown, media manager wakeup).
Security risks
None. The CI jobs for unit tests use placeholder credentials and no real secrets.
Level of scrutiny
High scrutiny was warranted given the production changes to prompt_cache.py, score_ingestion_consumer.py, and media_manager.py. Multiple review passes were conducted.
Other factors
All issues flagged in prior reviews have been addressed in the latest commit:
- retry_until_ready now clears last_error = None in the else branch, preventing stale errors from being raised after a successful call
- PromptCacheRefreshConsumer uses break (not continue) on sentinel, so any consumer that receives a shutdown sentinel exits unconditionally
- The unit-tests CI job now uses pinned SHA references matching all other jobs
- tests/live_provider/test_openai.py replaced the module-level Langfuse() init with a fixture, eliminating the cross-suite state invalidation risk
Remaining minor items (AGENTS.md example references a nonexistent test name; count_test_functions uses module.body for shard weight estimation) do not affect correctness or CI reliability under current usage and can be addressed as follow-ups.
Summary
This PR restructures the test suite around execution level and follows through on the CI and reliability work needed to make that split useful in practice.
tests/unittests/e2etests/live_providertests/supportAGENTS.mdand removeCLAUDE.mdCI and test infrastructure
This also updates CI and test plumbing so the new layout stays fast and reliable.
tests/uniton the Python 3.10-3.14 matrixtests/e2ein 2 mechanical shards selected byscripts/select_e2e_shard.pytests/live_provideras one dedicated suite that always runsserial_e2eas the scheduling-specific pytest markere2e_core/e2e_datamarker routing with mechanical file shardingcancel-in-progress: trueLANGFUSE_INIT_*Unit test speedups
The unit-suite speed work keeps coverage intact rather than weakening assertions.
--dist workstealfor the unit laneWhy
The repo already had a mix of local-only, real-server, and live-provider tests, but the old flat
tests/layout made the boundary hard to see and easy to erode. This change makes the split explicit in the filesystem, keeps unit coverage fast, and keeps the expensive e2e and provider surface intentional.Results
Local measurements after the unit refactor:
tests/unit/test_otel.py+tests/unit/test_propagate_attributes.py:215.08s->2.38stests/unitrun:360 passed, 2 skipped in 13.42s-n auto --dist worksteal:360 passed, 2 skipped in 5.42sValidation
uv run --frozen pytest -q tests/unit --maxfail=1uv run --frozen pytest -q -n auto --dist worksteal tests/unit --maxfail=1uv run --frozen pytest --collect-only -q tests/unit tests/e2e tests/live_provideruv run --frozen pytest -q tests/unit/test_e2e_sharding.pyuv run --frozen pytest --collect-only -q tests/live_provider -m 'live_provider'uv run --frozen ruff check scripts/select_e2e_shard.py tests/conftest.py tests/unit/test_e2e_sharding.py tests/e2e/test_core_sdk.pyCI is the authoritative validation for the real Langfuse-server and live-provider suites.
Disclaimer: Experimental PR review
Greptile Summary
This PR is a large, well-engineered refactoring that splits the flat
tests/directory intotests/unit,tests/e2e,tests/live_provider, andtests/support, and rewires CI to match. It also meaningfully speeds up the unit suite (360 tests in ~5 s) by replacing polling-based waits with sentinel-driven shutdown and deterministic queue draining, and adds proper thread-safety locks toPromptCache. The structural changes are clean and the new retry/sharding helpers are solid.Key changes:
langfuse/_utils/prompt_cache.py:PromptCacheandPromptCacheTaskManagerare now fully protected byRLock; consumer switches from a 1-second polling loop to a blockingqueue.get()with sentinel-based shutdown;add_refresh_prompt_task_if_current()avoids redundant background refreshes.langfuse/_task_manager/media_upload_consumer.py/score_ingestion_consumer.py/media_manager.py: sentinel-drivenpause()/signal_shutdown()for immediate worker wake-up on shutdown.scripts/select_e2e_shard.py: greedy weight-based shard assignment that falls back to local AST-based test counting for unknown files.tests/support/retry.py: bounded retry helper for e2e eventual-consistency reads..github/workflows/ci.yml:unit-testsjob runs across Python 3.10–3.14;e2e-testsjob runs two mechanical shards + live-provider suite; Langfuse server bootstrapped viaLANGFUSE_INIT_*env vars instead of a manual seeder script.Confidence Score: 5/5
Safe to merge — all findings are P2 style/improvement items with no production correctness risk under default configuration
The core source changes (sentinel-based shutdown, RLock thread-safety, prompt cache freshness guard) are sound and well-tested. The multi-consumer sentinel issue only affects threads > 1 which is not the default. The retry edge case requires a specific timing race that is unlikely in practice. All remaining findings are style, robustness improvements, or minor logic nuances — none block correctness on the happy path.
tests/support/retry.py (stale-error edge case), langfuse/_utils/prompt_cache.py (multi-consumer shutdown), scripts/select_e2e_shard.py (class-based test counting)
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD PR[Pull Request / Push] --> C{Concurrency group} C -->|cancel-in-progress| LINT[linting job] C --> TC[type-checking job] C --> UNIT[unit-tests\nPython 3.10–3.14 matrix] C --> E2E[e2e-tests matrix] E2E --> S1[Shard 1\nselect_e2e_shard --shard-index 0] E2E --> S2[Shard 2\nselect_e2e_shard --shard-index 1] E2E --> LP[live_provider suite] S1 --> PAR1[pytest -n 4 --dist worksteal\nnot serial_e2e] S1 --> SER1[pytest serial_e2e] S2 --> PAR2[pytest -n 4 --dist worksteal\nnot serial_e2e] S2 --> SER2[pytest serial_e2e] LP --> LPT[pytest -n 4 --dist worksteal\ntests/live_provider -m live_provider] UNIT --> |pytest -n auto --dist worksteal| UT[tests/unit 360 tests] LINT --> GATE[all-tests-passed] TC --> GATE UNIT --> GATE S1 --> GATE S2 --> GATE LP --> GATE GATE --> MERGE[Branch protection check]Reviews (1): Last reviewed commit: "fix(prompt-cache): avoid redundant refre..." | Re-trigger Greptile
Context used:
Learnt From
langfuse/langfuse-python#1387