fireflyframework · ancongui · Jun 7, 2026 · Jun 7, 2026 · Jun 7, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,32 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ---
 
+## v26.06.73 (2026-06-07)
+
+### Changed (benchmarks — rigor overhaul)
+
+A critical audit of `benchmarks/run.py` to measure what actually matters and keep ratios stable:
+
+- **Baseline de-inflated.** The old "bare Starlette ~436µs/req" was ~99% `TestClient`/httpx
+  round-trip artifact. Request benchmarks now drive the ASGI app **directly** (one event loop,
+  no httpx) — real bare-ASGI handling is **~5µs/req**. PyFly's filter-chain overhead is now
+  reported as an **absolute ~+44µs/req** over that real base (not a % of an inflated number);
+  the TestClient cost is called out and excluded.
+- **DI coverage expanded** to transient with 1/3/5/10 dependencies + nested depth-1/3/5 graphs —
+  the data shows resolution is **linear** in both width (~+0.4µs/cached dependency) and depth
+  (~+2.7µs/node, the transient-construction rate), with no superlinear or hidden per-bean cost.
+- **Request overhead reconciles.** The +~44µs is the sum of all **seven** default filters
+  (including the default-on `MetricsFilter` + `HttpExchangeRecorderFilter`, not just the explicit
+  web filters); the README per-filter decomposition sums to it, and the zero-filter chain
+  machinery is confirmed free (+0.2µs vs bare ASGI).
+- **Dependency baselines labelled.** `pydantic v2 model_dump_json` is tagged `[dep]` (measures
+  Pydantic, not PyFly); bare Starlette is tagged `[base]`.
+- **Naming fixed.** `TransactionIdFilter` is documented as an MDC-style `X-Transaction-Id`
+  correlation id — explicitly **not** declarative `@Transactional` transaction management.
+- **Measurement reliability.** Warmup + 9 timed runs with the GC disabled during measurement;
+  reports median µs/op, best, p99, and run-to-run spread (`±%`, observed ~1–3%). The access log
+  (sink-dependent I/O) is excluded from the CPU number and documented separately.
+
 ## v26.06.72 (2026-06-07)
 
 ### Added (observability — metrics recording port)

diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@
   <a href="https://github.com/fireflyframework"><img src="https://img.shields.io/badge/Firefly_Framework-official-ff6600?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==" alt="Firefly Framework"></a>
   <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.12%2B-blue?logo=python&logoColor=white" alt="Python 3.12+"></a>
   <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-green" alt="License: Apache 2.0"></a>
-  <a href="#"><img src="https://img.shields.io/badge/version-26.06.72-brightgreen" alt="Version: 26.06.72"></a>
+  <a href="#"><img src="https://img.shields.io/badge/version-26.06.73-brightgreen" alt="Version: 26.06.73"></a>
   <a href="#"><img src="https://img.shields.io/badge/type--checked-mypy%20strict-blue?logo=python&logoColor=white" alt="Type Checked: mypy strict"></a>
   <a href="#"><img src="https://img.shields.io/badge/code%20style-ruff-purple?logo=ruff&logoColor=white" alt="Code Style: Ruff"></a>
   <a href="#"><img src="https://img.shields.io/badge/async-first-brightgreen" alt="Async First"></a>

diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -1,49 +1,88 @@
 # PyFly benchmarks
 
 Dependency-free micro-benchmarks for representative hot paths. **Not part of CI** — run
-manually to spot regressions and quantify framework overhead.
+manually to spot regressions and quantify the overhead PyFly *itself* adds.
 
 ```bash
 uv run python benchmarks/run.py
 ```
 
-Covers:
-- **DI container** — singleton (cached), transient (construct), transient + 1 dependency.
-- **Serialization** — Pydantic `model_dump_json`.
-- **Request overhead** — a GET through the full PyFly filter chain vs bare Starlette.
-
-Sample run (Apple Silicon, Python 3.12; absolute numbers are machine-dependent — watch the
-*ratios* over time):
-
-| benchmark                                   |  ops/sec | µs/op |
-|---------------------------------------------|---------:|------:|
-| container.resolve (singleton, cached)       |    ~5.1M |  0.19 |
-| container.resolve (transient, construct)    |    ~420K |  2.40 |
-| container.resolve (transient + 1 dep)       |    ~342K |  2.92 |
-| pydantic model_dump_json                    |    ~1.5M |  0.68 |
-| pyfly GET vs bare Starlette                 |      —   | ~+27% |
-
-Notes:
-- Singleton resolution is effectively free (cached).
-- **Transient-with-deps was ~15.55µs (~64K ops/s) before v26.06.63** — dominated by
-  `typing.get_type_hints` + `inspect.signature` running on *every* resolve. v26.06.63 caches
-  the constructor injection plan at registration (parsed once) and computes `get_origin` once
-  per parameter, bringing it to ~2.92µs (~342K ops/s) — a **~5.3x** speedup. The residual cost
-  is the actual per-resolve dependency resolution + object construction, which is irreducible
-  for a transient bean.
-- The PyFly filter chain adds ~22% over a bare Starlette route — the cost of the enterprise
-  middleware. The chain *machinery itself is free* (the filter-chain middleware with **zero**
-  filters measures within noise of bare Starlette); the overhead is entirely the per-filter
-  features. Decomposition (µs/req added vs bare ~436µs):
-
-  | filter | +µs/req | notes |
-  |--------|--------:|-------|
-  | RequestContextFilter   | ~11 | request id + contextvar (needed for REQUEST scope / method security) |
-  | CorrelationFilter      | ~17 | correlation id |
-  | TracingFilter          | ~13 | opens a server span (only when OpenTelemetry is installed) |
-  | TransactionIdFilter    | ~10 | transaction id |
-  | RequestLoggingFilter   | ~38 | **biggest** — structured access log per request |
-  | SecurityHeadersFilter  | ~10 | OWASP headers (precomputed + bulk-appended since v26.06.64) |
-
-  The access log is the single biggest cost, so it is opt-out: set
-  `pyfly.web.request-logging.enabled=false` to drop ~38µs/req (overhead ~22% → ~13%).
+## Methodology (so the ratios are trustworthy)
+
+- **Measures the framework, not the test harness.** Request benchmarks drive the ASGI app
+  **directly** (build a scope, drive `receive`/`send` on one shared event loop). They do **not**
+  use `TestClient`: a TestClient round-trip is ~500 µs/req dominated by httpx + the test
+  transport + a per-request portal, so measuring through it computes PyFly's overhead against a
+  ~100× inflated base. Real Starlette ASGI handling is ~5 µs/req.
+- **Doesn't attribute dependencies to PyFly.** Rows tagged `[dep]` measure a third-party library
+  (e.g. Pydantic v2), not PyFly code; `[base]` is a reference baseline.
+- **Quantifies noise.** Each benchmark warms up, then runs 9 timed passes with the GC disabled
+  during measurement; we report median µs/op, best, p99, and the run-to-run spread (`±%`).
+
+Absolute numbers are machine-dependent — watch the **ratios** and the **±%** over time. Sample
+run (Apple Silicon, Python 3.12):
+
+| benchmark                                       | median µs/op | ops/sec |  ±%  |
+|-------------------------------------------------|-------------:|--------:|-----:|
+| container.resolve singleton (cached)            |        ~0.19 |   ~5.3M |  ~2% |
+| container.resolve transient (no deps)           |        ~2.37 |   ~423K |  ~1% |
+| container.resolve transient + 1 dep             |        ~2.95 |   ~339K |  ~3% |
+| container.resolve transient + 3 deps            |        ~3.79 |   ~264K |  ~1% |
+| container.resolve transient + 5 deps            |        ~4.59 |   ~218K |  ~1% |
+| container.resolve transient + 10 deps           |        ~6.52 |   ~153K |  ~1% |
+| container.resolve nested depth-1 (transient)    |        ~5.34 |   ~187K |  ~1% |
+| container.resolve nested depth-3 (transient)    |       ~11.12 |    ~90K |  ~6% |
+| container.resolve nested depth-5 (transient)    |       ~17.24 |    ~58K |  ~2% |
+| pydantic v2 model_dump_json `[dep]`             |        ~0.67 |   ~1.5M |  ~1% |
+| bare Starlette ASGI (real baseline) `[base]`    |        ~4.80 |   ~208K |  ~2% |
+| pyfly filter chain (access log off)             |       ~47.9  |    ~21K |  ~1% |
+
+## DI container — how it scales
+
+Singleton resolution is effectively free (cached, ~0.19 µs). Both **width** and **depth** scale
+**linearly**, but at *different per-node rates* because they do different work:
+
+- **Width** (transient target, cached-singleton dependencies): **~+0.4 µs per dependency**
+  (1→2.95, 3→3.79, 5→4.59, 10→6.52 µs). Each extra dependency is just a cached-singleton lookup
+  via `_resolve_param`.
+- **Depth** (a fully-transient chain): **~+2.7 µs per node** (depth-1→5.34, depth-3→11.12,
+  depth-5→17.24 µs; ≈2.7–2.9 µs/node). Each level is itself a full transient *resolve + build*, so
+  the per-node cost is the transient-construction rate — not the cheaper cached-lookup rate that
+  width pays.
+
+Either way the growth is linear with no superlinear or hidden per-bean cost; the construction
+cost a transient pays for itself and each transient ancestor is irreducible.
+
+> Historical note: transient-with-deps was ~15.6 µs (~64K ops/s) before v26.06.63, dominated by
+> `typing.get_type_hints` + `inspect.signature` running on *every* resolve. v26.06.63 caches the
+> constructor injection plan at registration and computes `get_origin` once per parameter — a
+> ~5.3× speedup (15.6 → 2.95 µs at 1 dep), now reflected above.
+
+## Request overhead — what PyFly's filter chain costs
+
+Against the **real** bare-ASGI baseline (~5 µs), the full default chain (access log off) costs
+~49 µs/req → **+~44 µs/req of framework CPU overhead**, ~20K req/s through the chain.
+
+The chain *machinery* is genuinely free — the filter-chain middleware with **zero** filters
+measures ~5.1 µs vs ~4.9 µs bare ASGI (+0.2 µs, within noise). The +44 µs is entirely the
+**seven default filters** (with the access log off), and they reconcile to it. The default chain
+includes two filters beyond the explicit web filters: `MetricsFilter` (metrics are on by default)
+and `HttpExchangeRecorderFilter` (the actuator's HTTP-exchange recorder). Per-filter decomposition
+(direct ASGI, in chain order; access log excluded — it is I/O, see below):
+
+| filter | +µs/req | notes |
+|--------|--------:|-------|
+| RequestContextFilter        | ~6.1 | request id + contextvar (needed for REQUEST scope / method security) |
+| CorrelationFilter           | ~9.1 | correlation id propagation |
+| TracingFilter               | ~6.4 | opens a server span (only when OpenTelemetry is installed) |
+| TransactionIdFilter         | ~3.8 | propagates/generates an **`X-Transaction-Id`** correlation id (MDC-style request id for log/trace correlation — **not** declarative `@Transactional` transaction management) |
+| SecurityHeadersFilter       | ~0.7 | OWASP headers (precomputed + bulk-appended since v26.06.64) |
+| MetricsFilter               | ~9.4 | per-request metrics (default-on; disable via `pyfly.observability.metrics.enabled=false`) |
+| HttpExchangeRecorderFilter  | ~8.8 | actuator HTTP-exchange recorder (default-on; tied to the actuator) |
+| **cumulative**              | **~44.3** | ≈ the +44 µs measured above |
+
+The **access log** (`RequestLoggingFilter`) is deliberately excluded from the CPU number: its
+cost is structured-log *render + I/O*, which is sink-dependent, not a CPU hot path. It is opt-out
+via `pyfly.web.request-logging.enabled=false`. With a typical sink it adds on the order of tens of
+µs/req — measure it in your own environment against your real log sink rather than trusting a
+microbenchmark number.