refactor(tests): split suites by execution level and speed up CI by hassiebp · Pull Request #1609 · langfuse/langfuse-python

hassiebp · 2026-04-04T13:23:27Z

Summary

This PR restructures the test suite around execution level and follows through on the CI and reliability work needed to make that split useful in practice.

move deterministic local tests into tests/unit
move Langfuse-server-backed tests into tests/e2e
keep live provider coverage in one dedicated always-on suite under tests/live_provider
keep shared helpers in tests/support
split mixed prompt and media coverage so unit, e2e, and live-provider cases live in the right buckets
keep OpenAI and LangChain unit coverage local by asserting against the in-memory exporter instead of the Langfuse server roundtrip
add a repo-level AGENTS.md and remove CLAUDE.md

CI and test infrastructure

This also updates CI and test plumbing so the new layout stays fast and reliable.

run tests/unit on the Python 3.10-3.14 matrix
run tests/e2e in 2 mechanical shards selected by scripts/select_e2e_shard.py
keep tests/live_provider as one dedicated suite that always runs
keep only serial_e2e as the scheduling-specific pytest marker
replace hidden e2e_core / e2e_data marker routing with mechanical file sharding
keep cancel-in-progress: true
bootstrap Langfuse in CI via LANGFUSE_INIT_*
remove the Docker image cache path after it proved slower than fresh pulls
replace fixed e2e sleeps with bounded retry helpers where read-after-write consistency mattered

Unit test speedups

The unit-suite speed work keeps coverage intact rather than weakening assertions.

replace prompt-cache polling with deterministic queue draining
restore the stricter OTEL timing and prompt atexit stress assertions
wake score-ingestion and media-upload workers immediately on shutdown via sentinels instead of waiting on queue timeouts
add regression coverage for the worker shutdown behavior
use --dist worksteal for the unit lane
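
The sentinel-based wake-up in the list above can be illustrated with a minimal sketch. The `Worker` class, the `_SHUTDOWN` marker, and the `demo` helper are hypothetical names chosen for illustration; they are not the SDK's actual consumers, which may be structured differently.

```python
import queue
import threading
import time

_SHUTDOWN = object()  # hypothetical sentinel; the real SDK's marker may differ


class Worker(threading.Thread):
    """Consumes items from a queue until it receives the shutdown sentinel."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.items: "queue.Queue[object]" = queue.Queue()
        self.processed: list = []

    def run(self) -> None:
        while True:
            item = self.items.get()  # blocks without a polling timeout
            if item is _SHUTDOWN:
                break  # exit as soon as the sentinel arrives
            self.processed.append(item)

    def pause(self) -> None:
        # Wake the blocked get() immediately instead of waiting for a timeout.
        self.items.put(_SHUTDOWN)


def demo():
    worker = Worker()
    worker.start()
    worker.items.put("score-1")
    worker.items.put("score-2")
    started = time.monotonic()
    worker.pause()
    worker.join(timeout=5)
    return worker.processed, time.monotonic() - started
```

Because the queue is FIFO, items enqueued before the sentinel are still processed before the worker exits, so shutdown is prompt without dropping pending work.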

Why

The repo already had a mix of local-only, real-server, and live-provider tests, but the old flat tests/ layout made the boundary hard to see and easy to erode. This change makes the split explicit in the filesystem, keeps unit coverage fast, and keeps the expensive e2e and provider surface intentional.

Results

Local measurements after the unit refactor:

tests/unit/test_otel.py + tests/unit/test_propagate_attributes.py: 215.08s -> 2.38s
full tests/unit run: 360 passed, 2 skipped in 13.42s
CI-style unit run with -n auto --dist worksteal: 360 passed, 2 skipped in 5.42s

Validation

uv run --frozen pytest -q tests/unit --maxfail=1
uv run --frozen pytest -q -n auto --dist worksteal tests/unit --maxfail=1
uv run --frozen pytest --collect-only -q tests/unit tests/e2e tests/live_provider
uv run --frozen pytest -q tests/unit/test_e2e_sharding.py
uv run --frozen pytest --collect-only -q tests/live_provider -m 'live_provider'
uv run --frozen ruff check scripts/select_e2e_shard.py tests/conftest.py tests/unit/test_e2e_sharding.py tests/e2e/test_core_sdk.py

CI is the authoritative validation for the real Langfuse-server and live-provider suites.

Disclaimer: Experimental PR review

Greptile Summary

This PR is a large, well-engineered refactoring that splits the flat tests/ directory into tests/unit, tests/e2e, tests/live_provider, and tests/support, and rewires CI to match. It also meaningfully speeds up the unit suite (360 tests in ~5 s) by replacing polling-based waits with sentinel-driven shutdown and deterministic queue draining, and adds proper thread-safety locks to PromptCache. The structural changes are clean and the new retry/sharding helpers are solid.

Key changes:

langfuse/_utils/prompt_cache.py: PromptCache and PromptCacheTaskManager are now fully protected by RLock; consumer switches from a 1-second polling loop to a blocking queue.get() with sentinel-based shutdown; add_refresh_prompt_task_if_current() avoids redundant background refreshes.
langfuse/_task_manager/media_upload_consumer.py / score_ingestion_consumer.py / media_manager.py: sentinel-driven pause() / signal_shutdown() for immediate worker wake-up on shutdown.
scripts/select_e2e_shard.py: greedy weight-based shard assignment that falls back to local AST-based test counting for unknown files.
tests/support/retry.py: bounded retry helper for e2e eventual-consistency reads.
.github/workflows/ci.yml: unit-tests job runs across Python 3.10–3.14; e2e-tests job runs two mechanical shards + live-provider suite; Langfuse server bootstrapped via LANGFUSE_INIT_* env vars instead of a manual seeder script.
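
As a rough illustration of greedy weight-based shard assignment, the sketch below places the heaviest file first and always picks the currently lightest shard. The function name `assign_shards`, the file names, and the weights are assumptions made for this example; the actual logic in scripts/select_e2e_shard.py may differ.

```python
def assign_shards(weights: dict, shard_count: int) -> list:
    """Greedy assignment: heaviest file first, always into the lightest shard."""
    shards = [[] for _ in range(shard_count)]
    loads = [0] * shard_count
    # Sort by descending weight, then by name, so the result is deterministic.
    for name, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        target = loads.index(min(loads))  # lightest shard; lowest index wins ties
        shards[target].append(name)
        loads[target] += weight
    return shards


# Hypothetical weights, for example the number of tests per file.
example = {
    "test_a.py": 40,
    "test_b.py": 25,
    "test_c.py": 20,
    "test_d.py": 10,
}
shards = assign_shards(example, 2)
```

With these weights the two shards end up with loads of 50 and 45, and every file is assigned to exactly one shard.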

Confidence Score: 5/5

Safe to merge — all findings are P2 style/improvement items with no production correctness risk under default configuration

The core source changes (sentinel-based shutdown, RLock thread-safety, prompt cache freshness guard) are sound and well-tested. The multi-consumer sentinel issue only affects threads > 1 which is not the default. The retry edge case requires a specific timing race that is unlikely in practice. All remaining findings are style, robustness improvements, or minor logic nuances — none block correctness on the happy path.

tests/support/retry.py (stale-error edge case), langfuse/_utils/prompt_cache.py (multi-consumer shutdown), scripts/select_e2e_shard.py (class-based test counting)

Important Files Changed

Filename	Overview
langfuse/_utils/prompt_cache.py	Adds RLock thread-safety to PromptCache and PromptCacheTaskManager; switches consumer to blocking queue.get() with sentinel-based shutdown; adds wait_for_idle() and add_refresh_prompt_task_if_current() — multi-consumer sentinel deadlock risk when threads > 1
langfuse/_task_manager/media_manager.py	Adds sentinel-based signal_shutdown() to wake MediaManager consumer immediately instead of waiting on queue timeout
langfuse/_task_manager/score_ingestion_consumer.py	Adds sentinel-based shutdown to ScoreIngestionConsumer.pause() so the consumer breaks its loop immediately
langfuse/_task_manager/media_upload_consumer.py	Calls signal_shutdown() on pause so the media manager wakes up immediately
langfuse/_client/client.py	Switches to add_refresh_prompt_task_if_current() to skip redundant background refreshes when cache is already fresh
tests/support/retry.py	New bounded retry helper for e2e eventual-consistency reads; stale error can be raised after a not-ready result when the deadline expires
tests/support/utils.py	Transparent _RetryingApiProxy retries list/get reads until data is available for e2e tests
tests/support/api_wrapper.py	HTTP-level Langfuse API wrapper with built-in retry on 404/NotFoundError payloads for e2e helpers
scripts/select_e2e_shard.py	Greedy weight-based shard assignment for e2e files; count_test_functions only counts top-level functions and misses class-based test methods
tests/conftest.py	Central fixture file providing InMemorySpanExporter, langfuse_memory_client, and marker routing; contains imports inside mock_init closure
.github/workflows/ci.yml	Splits CI into unit-tests (Python 3.10–3.14 matrix) and e2e-tests (sharded + live_provider); unit-tests job uses unpinned actions/checkout@v3 and setup-uv@v7
tests/unit/test_e2e_sharding.py	Unit tests verifying shard assignment covers all e2e files exactly once and weight estimation works
tests/unit/test_e2e_support.py	Unit tests covering retry helper behavior: NotFound retries, filtered-list retries, disabling retry, and predicate-based waiting
tests/unit/test_prompt.py	Unit tests for prompt cache using mocked server calls; contains import inside fixture function
tests/unit/test_prompt_atexit.py	Subprocess-based stress tests verifying prompt cache shuts down cleanly at process exit with 10 workers
tests/unit/test_otel.py	Comprehensive OTel span tests using InMemorySpanExporter; many imports placed inside test methods rather than at module level
AGENTS.md	New repo-level agent instructions replacing CLAUDE.md; no code changes
CLAUDE.md	Removed in favour of AGENTS.md; no code impact
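
The overview notes that count_test_functions only counts top-level functions and misses class-based test methods. The sketch below shows one way AST-based counting could include methods of `Test*` classes. The names `count_tests` and `_is_test_function` and the `SAMPLE` source are hypothetical; this is not the code in scripts/select_e2e_shard.py.

```python
import ast

SAMPLE = '''
def test_top_level():
    pass

async def test_async_top_level():
    pass

def helper():
    pass

class TestGroup:
    def test_one(self):
        pass

    def test_two(self):
        pass

    def setup_method(self):
        pass
'''


def _is_test_function(node: ast.AST) -> bool:
    # Matches both sync and async functions whose name starts with "test_".
    return isinstance(
        node, (ast.FunctionDef, ast.AsyncFunctionDef)
    ) and node.name.startswith("test_")


def count_tests(source: str, include_classes: bool = True) -> int:
    """Count test functions in module.body, optionally descending into Test* classes."""
    module = ast.parse(source)
    total = 0
    for node in module.body:
        if _is_test_function(node):
            total += 1
        elif (
            include_classes
            and isinstance(node, ast.ClassDef)
            and node.name.startswith("Test")
        ):
            total += sum(1 for child in node.body if _is_test_function(child))
    return total
```

Calling `count_tests(SAMPLE, include_classes=False)` mimics top-level-only counting, while the default also counts the two methods in `TestGroup`.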

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PR[Pull Request / Push] --> C{Concurrency group}
    C -->|cancel-in-progress| LINT[linting job]
    C --> TC[type-checking job]
    C --> UNIT[unit-tests\nPython 3.10–3.14 matrix]
    C --> E2E[e2e-tests matrix]

    E2E --> S1[Shard 1\nselect_e2e_shard --shard-index 0]
    E2E --> S2[Shard 2\nselect_e2e_shard --shard-index 1]
    E2E --> LP[live_provider suite]

    S1 --> PAR1[pytest -n 4 --dist worksteal\nnot serial_e2e]
    S1 --> SER1[pytest serial_e2e]
    S2 --> PAR2[pytest -n 4 --dist worksteal\nnot serial_e2e]
    S2 --> SER2[pytest serial_e2e]
    LP --> LPT[pytest -n 4 --dist worksteal\ntests/live_provider -m live_provider]

    UNIT --> |pytest -n auto --dist worksteal| UT[tests/unit 360 tests]

    LINT --> GATE[all-tests-passed]
    TC --> GATE
    UNIT --> GATE
    S1 --> GATE
    S2 --> GATE
    LP --> GATE
    GATE --> MERGE[Branch protection check]

Reviews (1): Last reviewed commit: "fix(prompt-cache): avoid redundant refre..."

Greptile also left 3 inline comments on this PR.

Context used:

Rule used - Move imports to the top of the module instead of p... (source)

Learnt From
langfuse/langfuse-python#1387

github-actions · 2026-04-04T21:18:45Z

@claude review

scripts/select_e2e_shard.py

tests/support/retry.py

langfuse/_utils/prompt_cache.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb98e3ad52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

tests/conftest.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fed925e43

langfuse/_utils/prompt_cache.py

tests/support/retry.py

.github/workflows/ci.yml

AGENTS.md

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 170beb8201

langfuse/_task_manager/media_manager.py

langfuse/_utils/prompt_cache.py

scripts/select_e2e_shard.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: daaf571e78

scripts/select_e2e_shard.py

hassiebp · 2026-04-07T12:21:38Z

@claude review

hassiebp · 2026-04-07T12:21:43Z

@codex review

chatgpt-codex-connector · 2026-04-07T12:31:24Z

Codex Review: Didn't find any major issues. Hooray!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c38f2a197

tests/conftest.py

tests/e2e/test_datasets.py

claude

All prior concerns addressed — LGTM.

Extended reasoning...

Overview

This large refactor restructures 56 files to split tests into unit/e2e/live_provider/support directories, speeds up CI, and hardens several production components (prompt cache thread-safety, sentinel-based worker shutdown, media manager wakeup).

Security risks

None. The CI jobs for unit tests use placeholder credentials and no real secrets.

Level of scrutiny

High scrutiny was warranted given the production changes to prompt_cache.py, score_ingestion_consumer.py, and media_manager.py. Multiple review passes were conducted.

Other factors

All issues flagged in prior reviews have been addressed in the latest commit:

retry_until_ready now resets last_error to None in the else branch, preventing stale errors from being raised after a successful call
PromptCacheRefreshConsumer uses break (not continue) on sentinel, so any consumer that receives a shutdown sentinel exits unconditionally
The unit-tests CI job now uses pinned SHA references matching all other jobs
tests/live_provider/test_openai.py replaced the module-level Langfuse() init with a fixture, eliminating the cross-suite state invalidation risk

Remaining minor items (AGENTS.md example references a nonexistent test name; count_test_functions uses module.body for shard weight estimation) do not affect correctness or CI reliability under current usage and can be addressed as follow-ups.
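
The stale-error fix mentioned above can be sketched as follows. Only the name `retry_until_ready` and the idea of resetting `last_error` in the `else` branch come from the review; the signature, the `LookupError` stand-in for a not-found API error, and the timeout defaults are assumptions made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until_ready(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> T:
    """Call fetch() until is_ready(result) holds or the deadline passes."""
    deadline = time.monotonic() + timeout
    last_error: Optional[Exception] = None
    while True:
        try:
            result = fetch()
        except LookupError as exc:  # stand-in for a "not found yet" API error
            last_error = exc
        else:
            last_error = None  # a successful call invalidates any earlier error
            if is_ready(result):
                return result
        if time.monotonic() >= deadline:
            if last_error is not None:
                raise last_error
            raise TimeoutError("result was not ready before the deadline")
        time.sleep(interval)
```

Without the reset in the `else` branch, a call that failed once and then kept returning a not-ready result would surface the old lookup error at the deadline rather than a timeout.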

hassiebp added 21 commits April 4, 2026 15:23

split test suites by execution level

26d3966

speed up unit test suite

38b5c9a

speed up unit shutdown without weakening assertions

48fcd29

Merge remote-tracking branch 'origin/main' into codex/split-test-suites-by-directory

82bfb88

fix post-merge propagate attributes test

a20e811

stabilize e2e readbacks in ci

4cd23c3

stabilize remaining e2e checks

3c3b264

reduce e2e ci load

78395e7

speed up langfuse server startup in ci

b5b0bc6

use langfuse init bootstrap in ci

c9ed554

split serial e2e tests from parallel ci lane

46785f9

serialize flaky trace e2e test

12c5847

split e2e ci into core and data shards

9679ca9

make e2e data shard the catch-all

174884c

cache langfuse docker images in ci

a98487e

stabilize live-provider langchain assertions

6fd4dde

replace marker-based e2e sharding

30ffc8f

remove docker image cache from e2e ci

27ae9d3

add shared agent instructions

67e0682

sync agent guidance with monorepo standards

1719aee

fix(tests): wait for generation visibility in e2e

871fc31

hassiebp changed the title ~~[codex] split test suites into execution-level directories~~ refactor(tests): split suites by execution level and speed up CI Apr 4, 2026

fix(prompt-cache): avoid redundant refresh races

eb98e3a

hassiebp marked this pull request as ready for review April 4, 2026 21:18

greptile-apps bot reviewed Apr 4, 2026

scripts/select_e2e_shard.py

tests/support/retry.py

langfuse/_utils/prompt_cache.py (outdated)

chatgpt-codex-connector bot reviewed Apr 4, 2026

tests/conftest.py

ci: migrate GitHub Actions to Blacksmith runners

6fed925

chatgpt-codex-connector bot reviewed Apr 4, 2026

langfuse/_utils/prompt_cache.py (outdated)

tests/support/retry.py

claude bot reviewed Apr 4, 2026

tests/support/retry.py

.github/workflows/ci.yml

AGENTS.md

revert(ci): switch back from Blacksmith runners

170beb8

chatgpt-codex-connector bot reviewed Apr 4, 2026

langfuse/_task_manager/media_manager.py

claude bot reviewed Apr 4, 2026

langfuse/_utils/prompt_cache.py

claude bot reviewed Apr 4, 2026

langfuse/_utils/prompt_cache.py (outdated)

scripts/select_e2e_shard.py

fix(tests): address high-priority review feedback

daaf571

chatgpt-codex-connector bot reviewed Apr 7, 2026

scripts/select_e2e_shard.py

fix(tests): stabilize remaining CI failures

fbac98e

fix(tests): use direct dataset-item reads in e2e

0b65df2

hassiebp added 2 commits April 7, 2026 14:47

fix(tests): align dataset e2e with archived item behavior

43e3059

fix(tests): wait for async generator observations

7c38f2a

chatgpt-codex-connector bot reviewed Apr 7, 2026

tests/conftest.py

claude bot reviewed Apr 7, 2026

tests/conftest.py

claude bot reviewed Apr 7, 2026

tests/e2e/test_datasets.py (outdated)

hassiebp added 4 commits April 10, 2026 15:04

Merge remote-tracking branch 'origin/main' into codex/split-test-suites-by-directory

Conflicts: .github/workflows/ci.yml, tests/e2e/test_datasets.py, tests/unit/test_resource_manager.py

76feeef

fix(tests): isolate live provider client state

5a63f6b

Merge branch 'main' into codex/split-test-suites-by-directory

f305068

fix(tests): align dataset archival assertion with server

edf219e

hassiebp merged commit dea5fdf into main Apr 10, 2026
15 checks passed

hassiebp deleted the codex/split-test-suites-by-directory branch April 10, 2026 14:02

claude bot reviewed Apr 10, 2026
