Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 26 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,32 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

---

## v26.06.73 (2026-06-07)

### Changed (benchmarks — rigor overhaul)

A critical audit of `benchmarks/run.py` to measure what actually matters and keep ratios stable:

- **Baseline de-inflated.** The old "bare Starlette ~436µs/req" was ~99% `TestClient`/httpx
round-trip artifact. Request benchmarks now drive the ASGI app **directly** (one event loop,
no httpx) — real bare-ASGI handling is **~5µs/req**. PyFly's filter-chain overhead is now
reported as an **absolute ~+44µs/req** over that real base (not a % of an inflated number);
the TestClient cost is called out and excluded.
- **DI coverage expanded** to transient with 1/3/5/10 dependencies + nested depth-1/3/5 graphs —
the data shows resolution is **linear** in both width (~+0.4µs/cached dependency) and depth
(~+2.7µs/node, the transient-construction rate), with no superlinear or hidden per-bean cost.
- **Request overhead reconciles.** The +~44µs is the sum of all **seven** default filters
(including the default-on `MetricsFilter` + `HttpExchangeRecorderFilter`, not just the explicit
web filters); the README per-filter decomposition sums to it, and the zero-filter chain
machinery is confirmed free (+0.2µs vs bare ASGI).
- **Dependency baselines labelled.** `pydantic v2 model_dump_json` is tagged `[dep]` (measures
Pydantic, not PyFly); bare Starlette is tagged `[base]`.
- **Naming fixed.** `TransactionIdFilter` is documented as an MDC-style `X-Transaction-Id`
correlation id — explicitly **not** declarative `@Transactional` transaction management.
- **Measurement reliability.** Warmup + 9 timed runs with the GC disabled during measurement;
reports median µs/op, best, p99, and run-to-run spread (`±%`, observed ~1–3%). The access log
(sink-dependent I/O) is excluded from the CPU number and documented separately.

## v26.06.72 (2026-06-07)

### Added (observability — metrics recording port)
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
<a href="https://github.com/fireflyframework"><img src="https://img.shields.io/badge/Firefly_Framework-official-ff6600?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==" alt="Firefly Framework"></a>
<a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.12%2B-blue?logo=python&logoColor=white" alt="Python 3.12+"></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-green" alt="License: Apache 2.0"></a>
<a href="#"><img src="https://img.shields.io/badge/version-26.06.72-brightgreen" alt="Version: 26.06.72"></a>
<a href="#"><img src="https://img.shields.io/badge/version-26.06.73-brightgreen" alt="Version: 26.06.73"></a>
<a href="#"><img src="https://img.shields.io/badge/type--checked-mypy%20strict-blue?logo=python&logoColor=white" alt="Type Checked: mypy strict"></a>
<a href="#"><img src="https://img.shields.io/badge/code%20style-ruff-purple?logo=ruff&logoColor=white" alt="Code Style: Ruff"></a>
<a href="#"><img src="https://img.shields.io/badge/async-first-brightgreen" alt="Async First"></a>
Expand Down
121 changes: 80 additions & 41 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
@@ -1,49 +1,88 @@
# PyFly benchmarks

Dependency-free micro-benchmarks for representative hot paths. **Not part of CI** — run
manually to spot regressions and quantify framework overhead.
manually to spot regressions and quantify the overhead PyFly *itself* adds.

```bash
uv run python benchmarks/run.py
```

Covers:
- **DI container** — singleton (cached), transient (construct), transient + 1 dependency.
- **Serialization** — Pydantic `model_dump_json`.
- **Request overhead** — a GET through the full PyFly filter chain vs bare Starlette.

Sample run (Apple Silicon, Python 3.12; absolute numbers are machine-dependent — watch the
*ratios* over time):

| benchmark | ops/sec | µs/op |
|---------------------------------------------|---------:|------:|
| container.resolve (singleton, cached) | ~5.1M | 0.19 |
| container.resolve (transient, construct) | ~420K | 2.40 |
| container.resolve (transient + 1 dep) | ~342K | 2.92 |
| pydantic model_dump_json | ~1.5M | 0.68 |
| pyfly GET vs bare Starlette | — | ~+27% |

Notes:
- Singleton resolution is effectively free (cached).
- **Transient-with-deps was ~15.55µs (~64K ops/s) before v26.06.63** — dominated by
`typing.get_type_hints` + `inspect.signature` running on *every* resolve. v26.06.63 caches
the constructor injection plan at registration (parsed once) and computes `get_origin` once
per parameter, bringing it to ~2.92µs (~342K ops/s) — a **~5.3x** speedup. The residual cost
is the actual per-resolve dependency resolution + object construction, which is irreducible
for a transient bean.
- The PyFly filter chain adds ~22% over a bare Starlette route — the cost of the enterprise
middleware. The chain *machinery itself is free* (the filter-chain middleware with **zero**
filters measures within noise of bare Starlette); the overhead is entirely the per-filter
features. Decomposition (µs/req added vs bare ~436µs):

| filter | +µs/req | notes |
|--------|--------:|-------|
| RequestContextFilter | ~11 | request id + contextvar (needed for REQUEST scope / method security) |
| CorrelationFilter | ~17 | correlation id |
| TracingFilter | ~13 | opens a server span (only when OpenTelemetry is installed) |
| TransactionIdFilter | ~10 | transaction id |
| RequestLoggingFilter | ~38 | **biggest** — structured access log per request |
| SecurityHeadersFilter | ~10 | OWASP headers (precomputed + bulk-appended since v26.06.64) |

The access log is the single biggest cost, so it is opt-out: set
`pyfly.web.request-logging.enabled=false` to drop ~38µs/req (overhead ~22% → ~13%).
## Methodology (so the ratios are trustworthy)

- **Measures the framework, not the test harness.** Request benchmarks drive the ASGI app
**directly** (build a scope, drive `receive`/`send` on one shared event loop). They do **not**
use `TestClient`: a TestClient round-trip is ~500 µs/req dominated by httpx + the test
transport + a per-request portal, so measuring through it computes PyFly's overhead against a
~100× inflated base. Real Starlette ASGI handling is ~5 µs/req.
- **Doesn't attribute dependencies to PyFly.** Rows tagged `[dep]` measure a third-party library
(e.g. Pydantic v2), not PyFly code; `[base]` is a reference baseline.
- **Quantifies noise.** Each benchmark warms up, then runs 9 timed passes with the GC disabled
during measurement; we report median µs/op, best, p99, and the run-to-run spread (`±%`).

Absolute numbers are machine-dependent — watch the **ratios** and the **±%** over time. Sample
run (Apple Silicon, Python 3.12):

| benchmark | median µs/op | ops/sec | ±% |
|-------------------------------------------------|-------------:|--------:|-----:|
| container.resolve singleton (cached) | ~0.19 | ~5.3M | ~2% |
| container.resolve transient (no deps) | ~2.37 | ~423K | ~1% |
| container.resolve transient + 1 dep | ~2.95 | ~339K | ~3% |
| container.resolve transient + 3 deps | ~3.79 | ~264K | ~1% |
| container.resolve transient + 5 deps | ~4.59 | ~218K | ~1% |
| container.resolve transient + 10 deps | ~6.52 | ~153K | ~1% |
| container.resolve nested depth-1 (transient) | ~5.34 | ~187K | ~1% |
| container.resolve nested depth-3 (transient) | ~11.12 | ~90K | ~6% |
| container.resolve nested depth-5 (transient) | ~17.24 | ~58K | ~2% |
| pydantic v2 model_dump_json `[dep]` | ~0.67 | ~1.5M | ~1% |
| bare Starlette ASGI (real baseline) `[base]` | ~4.80 | ~208K | ~2% |
| pyfly filter chain (access log off) | ~47.9 | ~21K | ~1% |

## DI container — how it scales

Singleton resolution is effectively free (cached, ~0.19 µs). Both **width** and **depth** scale
**linearly**, but at *different per-node rates* because they do different work:

- **Width** (transient target, cached-singleton dependencies): **~+0.4 µs per dependency**
(1→2.95, 3→3.79, 5→4.59, 10→6.52 µs). Each extra dependency is just a cached-singleton lookup
via `_resolve_param`.
- **Depth** (a fully-transient chain): **~+2.7 µs per node** (depth-1→5.34, depth-3→11.12,
depth-5→17.24 µs; ≈2.7–2.9 µs/node). Each level is itself a full transient *resolve + build*, so
the per-node cost is the transient-construction rate — not the cheaper cached-lookup rate that
width pays.

Either way the growth is linear with no superlinear or hidden per-bean cost; the construction
cost a transient pays for itself and each transient ancestor is irreducible.

> Historical note: transient-with-deps was ~15.6 µs (~64K ops/s) before v26.06.63, dominated by
> `typing.get_type_hints` + `inspect.signature` running on *every* resolve. v26.06.63 caches the
> constructor injection plan at registration and computes `get_origin` once per parameter — a
> ~5.3× speedup (15.6 → 2.95 µs at 1 dep), now reflected above.

## Request overhead — what PyFly's filter chain costs

Against the **real** bare-ASGI baseline (~5 µs), the full default chain (access log off) costs
~49 µs/req → **+~44 µs/req of framework CPU overhead**, ~20K req/s through the chain.

The chain *machinery* is genuinely free — the filter-chain middleware with **zero** filters
measures ~5.1 µs vs ~4.9 µs bare ASGI (+0.2 µs, within noise). The +44 µs is entirely the
**seven default filters** (with the access log off), and they reconcile to it. The default chain
includes two filters beyond the explicit web filters: `MetricsFilter` (metrics are on by default)
and `HttpExchangeRecorderFilter` (the actuator's HTTP-exchange recorder). Per-filter decomposition
(direct ASGI, in chain order; access log excluded — it is I/O, see below):

| filter | +µs/req | notes |
|--------|--------:|-------|
| RequestContextFilter | ~6.1 | request id + contextvar (needed for REQUEST scope / method security) |
| CorrelationFilter | ~9.1 | correlation id propagation |
| TracingFilter | ~6.4 | opens a server span (only when OpenTelemetry is installed) |
| TransactionIdFilter | ~3.8 | propagates/generates an **`X-Transaction-Id`** correlation id (MDC-style request id for log/trace correlation — **not** declarative `@Transactional` transaction management) |
| SecurityHeadersFilter | ~0.7 | OWASP headers (precomputed + bulk-appended since v26.06.64) |
| MetricsFilter | ~9.4 | per-request metrics (default-on; disable via `pyfly.observability.metrics.enabled=false`) |
| HttpExchangeRecorderFilter | ~8.8 | actuator HTTP-exchange recorder (default-on; tied to the actuator) |
| **cumulative** | **~44.3** | ≈ the +44 µs measured above |

The **access log** (`RequestLoggingFilter`) is deliberately excluded from the CPU number: its
cost is structured-log *render + I/O*, which is sink-dependent, not a CPU hot path. It is opt-out
via `pyfly.web.request-logging.enabled=false`. With a typical sink it adds on the order of tens of
µs/req — measure it in your own environment against your real log sink rather than trusting a
microbenchmark number.
Loading