perf: top-level CpuKernel dispatch at FrameDecoder/FrameCompressor entry + FSE Entry layout

## Context

Architectural performance gap analysis on PR #245 surfaced that
our hot paths gate CPU-feature kernel selection at the wrong
place AND at the wrong time:

- **Compile-time gating** (`cfg!(target_feature = ...)`) means
  consumer builds with the default rustc x86_64 target (SSE2 only)
  silently use the scalar fallback even on hardware that has
  BMI2 / AVX2 / VBMI2 etc. Donor zstd-sys does runtime detection
  and always picks up the runner's full ISA.

- **Per-subsystem dispatch** scatters the choice across HUF, FSE,
  SIMD-copy independently. Each subsystem detects features on its
  own (sometimes per-call, sometimes per-frame). We pay the
  dispatch cost N times instead of once, and we can't reuse the
  monomorphised pipeline across subsystems.

- **Per-sequence table lookups** for FSE base values (`lookup_ll_code` /
  `lookup_ml_code`) instead of donor's pre-computed `ZSTD_seqSymbol`
  entries cost 2 extra L1d touches per sequence.

The architectural fix is **single top-level dispatch at the zstd
entry**: `FrameDecoder::{decode_to_slice, decode_all, decode_blocks}`
and `FrameCompressor::compress` detect the kernel ONCE (cached in
a process-wide OnceLock) and propagate the choice through the
ENTIRE pipeline as a generic parameter. Inner code (HUF, FSE,
SIMD-copy, bit-reader, match-copy) instantiates against the
chosen kernel; monomorphisation specialises ONLY the bodies that
actually differ, common code stays single-impl.

## Architecture

```rust
// Single kernel ZST, dispatched once at zstd entry.
pub(crate) trait CpuKernel: Copy {
    // Bodies that DIFFER per kernel — these get monomorphised:
    fn mask_lower_bits(value: u64, n: u8) -> u64;     // FSE bit reader
    fn huf_burst(...);                                 // HUF 4-stream
    fn copy_chunk(src: *const u8, dst: *mut u8);      // SIMD copy
    // ... etc

    // Common code (FCS check, block loop, sequence-execute structure,
    // offset history, repeat semantics) is NOT on the trait. Those
    // stay single-impl, generic ONLY over the kernel ZST so the
    // compiler routes through K::method calls.
}

pub(crate) struct ScalarKernel;
pub(crate) struct Sse2Kernel;
pub(crate) struct Avx2Kernel;     // includes BMI2 (x86-64-v3 baseline)
pub(crate) struct Avx512Kernel;
// + Aarch64Neon, Aarch64Sve

impl CpuKernel for ScalarKernel { /* portable */ }
#[cfg(target_arch = "x86_64")]
impl CpuKernel for Avx2Kernel { /* #[target_feature(enable = "bmi2,avx2")] bodies */ }
// ...

// Entry-level dispatch (OnceLock-cached once per process):
fn detect_cpu_kernel() -> CpuKernelTag {
    static CACHED: OnceLock<CpuKernelTag> = OnceLock::new();
    *CACHED.get_or_init(|| { /* is_x86_feature_detected! chain */ })
}

// FrameDecoder entry:
pub fn decode_to_slice(&mut self, input, output) -> Result<usize> {
    match detect_cpu_kernel() {
        CpuKernelTag::Avx2 => self.decode_to_slice_impl::<Avx2Kernel>(input, output),
        CpuKernelTag::Sse2 => self.decode_to_slice_impl::<Sse2Kernel>(input, output),
        CpuKernelTag::Scalar => self.decode_to_slice_impl::<ScalarKernel>(input, output),
        // ...
    }
}

fn decode_to_slice_impl<K: CpuKernel>(&mut self, input, output) -> Result<usize> {
    // ALL inner code (block loop, HUF burst, FSE state update,
    // sequence executor, match copy) is generic over K and routes
    // through K::method at the leaf hot-path call sites. Common
    // structure stays single-impl.
}

// Same shape for FrameCompressor::compress.
```

The HUF 4-stream burst already follows this shape internally
(via the `HufKernel` trait + `detect_huffman_decode_kernel()`
OnceLock). This issue lifts the dispatch to the FrameDecoder /
FrameCompressor entry so a SINGLE choice covers HUF + FSE +
SIMD-copy + bit-reader + match-copy for the duration of the
decode / encode call.

## Why top-level, not per-subsystem

- **Cost amortisation**: one detect + match per zstd call instead
  of per-subsystem. On small frames this matters; the runtime
  detect's atomic load + branch chain isn't free.

- **Monomorphisation sharing**: the same `K` propagates through
  HUF, FSE, sequence-execute, match-copy. The compiler emits one
  specialised pipeline per kernel; subsystems don't each carry
  their own dispatcher and their own scalar fallback.

- **Consistency**: avoids the asymmetry where HUF runtime-dispatches
  but FSE compile-time-gates. Either both pick up native ISA or
  neither does — anything else is a footgun for benchmarks (we
  hit it on PR #245 where CI bench had to set
  `target-cpu=x86-64-v3` to make FSE inline `_bzhi_u64`).

- **Specialise only the differing bodies**: `CpuKernel` trait
  methods cover the leaf hot-path ops where SIMD/BMI2 actually
  changes the codegen. FCS checks, block-loop structure, offset
  history etc. stay single-impl with `K` as a phantom on the
  outer function only. Monomorphisation cost is bounded.

## Part 1 — FSE Entry layout (independent of kernel dispatch)

Current `Entry` (zstd/src/fse/fse_decoder.rs:578):

```rust
pub struct Entry {
    pub new_state: u16,    // 2 bytes
    pub symbol: u8,        // 1 byte
    pub num_bits: u8,      // 1 byte
}                          // total: 4 bytes per state
```

Per-sequence FSE state update touches the 4-byte entry, then:

```rust
let (ll_value, ll_num_bits) = lookup_ll_code(ll_code); // touch LL_BASE/BITS arrays
let (ml_value, ml_num_bits) = lookup_ml_code(ml_code); // touch ML_BASE/BITS arrays
```

That is 2 extra small-array touches per sequence. On i9 with hot
L1 they cost ~4-6 cycles each; ~20K sequences per MiB on L-1
fast corpora = ~80-120 us / MiB extra. Estimated saving: 2-4%
throughput.

Donor `ZSTD_seqSymbol`:

```c
typedef struct {
    U16  nextState;
    BYTE nbAdditionalBits;
    BYTE nbBits;
    U32  baseValue;
} ZSTD_seqSymbol;   // 12 bytes, 16 with alignment
```

Proposed expansion:

```rust
pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
    pub base_value: u32,           // <-- new
    pub num_additional_bits: u8,   // <-- new
    // 3 bytes padding to 12B; or pack to 16B for cache alignment
}
```

Build path: in the LL / ML / OF table builders (`FSEScratch`
sub-tables), populate `base_value` and `num_additional_bits` from
the existing LL_TABLE / ML_TABLE / OF constants based on
`Entry.symbol`. `decode_one_sequence_inline` then reads these
directly from the entry, dropping both `lookup_*` calls.

Table size: LL/ML/OF combined ~20 KiB at 16B-aligned 12B entries
(vs current 12 KiB). Still fits in L1d (32 KiB i9) but tighter
margin alongside HUF table + burst working set.

## Part 2 — Top-level `CpuKernel` dispatch (decoder)

Implement the architecture sketched above on the decoder side:

- New `CpuKernel` trait + per-ISA impls (Scalar / Sse2 / Avx2 /
  Avx512, plus aarch64-neon / aarch64-sve).
- `detect_cpu_kernel()` with OnceLock cache.
- `decode_to_slice` / `decode_all` / `decode_blocks` ALL dispatch
  at entry via match over the cached tag, calling generic-over-K
  implementations.
- Inner code (HUF burst, FSE state update, sequence executor,
  match-copy) routes through `K::method` calls. The existing
  per-subsystem `HufKernel` trait merges into / is replaced by
  the unified `CpuKernel` to remove the second-level dispatch.
- Inner functions that DON'T touch SIMD/BMI2 ops (block loop,
  FCS check, offset history update, structure code) keep their
  single-impl shape with K as phantom on the outer function only.

After this lands, the CI bench-build can drop its
`CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3`
override — the default rustc build will runtime-dispatch into
the BMI2/AVX2 path on any x86_64 hardware that supports it.

## Part 3 — Top-level `CpuKernel` dispatch (encoder)

Symmetric refactor on the FrameCompressor side. Single detect at
`FrameCompressor::compress` entry; propagate K through the encode
pipeline (HUF table build + encode, FSE table build + encode,
match finder if any of its kernels differ per ISA, sequence
emission). Same monomorphise-only-where-bodies-differ rule.

## Acceptance criteria

### Part 1 (FSE Entry layout) — ✅ DONE (PR #252)
- [x] `Entry` expanded with `base_value` (u32) and
      `num_additional_bits` (u8); `#[repr(C)]` layout doc updated.
- [x] LL/ML/OF `FSEScratch` builders pre-compute new fields.
- [x] `decode_one_sequence_inline` drops `lookup_ll_code` /
      `lookup_ml_code` and reads from the new entry fields.
- [x] L-1 fast c_stream bench shows >= 2% throughput improvement
      on i9.
- [x] No regression on other levels / scenarios.
- [x] 588+ nextest pass.

### Part 2 (decoder top-level dispatch) — ✅ DONE (PR #254 + PR #263)
- [x] `CpuKernel` trait + per-ISA impls live alongside or replace
      the existing `HufKernel`. Implemented as
      `Scalar` / `Bmi2` / `Avx2` / `Vbmi2` (x86_64) +
      `Neon` / `Sve` (aarch64) in `zstd/src/cpu_kernel.rs`.
- [x] `detect_cpu_kernel()` with `OnceLock` cache; one detect +
      match per call.
- [x] Decoder hot paths dispatch at entry; inner code generic over
      `K`. Literals section (`literals_section_decoder.rs:143`)
      and sequence section (`sequence_section_decoder.rs:80`)
      both `match detect_cpu_kernel()` and route into per-K
      `#[target_feature]`-trampolined impls.
- [x] CI bench-build can drop the
      `CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3`
      override; default rustc build hits the BMI2 path on capable
      x86_64 hardware.
- [x] L-1 c_stream bench delta on i9 (post-PR-#263 unification):
      `pure_rust` 2.034 ms → 1.555 ms = **−23.5% vs origin/main**.
      `high-entropy-1m` reaches **1.03× donor (parity)**;
      `low-entropy-1m` **−39.7%**. Combined target (≤ 2.0× donor)
      achieved on the structured corpora; the small-block
      `small-4k-log-lines` worst case still sits at ~2.5×.
- [x] No regression on Scalar path / aarch64 / wasm32 targets.

### Part 3 (encoder top-level dispatch) — ❌ TODO
- [ ] Same dispatch shape on `FrameCompressor::compress` entry.
- [ ] Encode benches show ≥ 3% throughput improvement on x86_64
      hardware with BMI2/AVX2.
- [ ] Symmetric with the decoder side: same `CpuKernel` trait
      reused for shared ops where applicable.
- [ ] No regression on the scalar path.

### Part 5 (Cargo feature flags for kernel selection) — ❌ TODO

PR #263 introduced `SUPPORTS_INLINE_SEQUENCE_EXEC: bool =
cfg!(target_arch = "x86_64")` as a compile-time arch gate on the
inline sequence executor. That gate gives "use the SSE2 inline
path on every x86_64 build" — which is the right default for
universal binaries but doesn't let downstreams trim the binary
for embedded / constrained targets where the inline-exec body
and the per-kernel target_feature trampolines are dead weight.

Replace the arch-gated booleans with Cargo features, default
all-on (universal binary), opt-out for embedded:

```toml
[features]
default = [
  "kernel_scalar",
  "kernel_sse2",   # x86_64 baseline, no-op on non-x86_64 builds
  "kernel_bmi2",   # x86_64
  "kernel_avx2",   # x86_64
  "kernel_vbmi2",  # x86_64 (= AVX-512 BMI2 family)
  "kernel_neon",   # aarch64
  "kernel_sve",    # aarch64
]
kernel_scalar = []
kernel_sse2 = []
kernel_bmi2 = ["kernel_sse2"]
kernel_avx2 = ["kernel_bmi2"]
kernel_vbmi2 = ["kernel_avx2"]
kernel_neon = []
kernel_sve = ["kernel_neon"]
```

Semantic rules:

- **Default build** (no flags) → every kernel compiled in;
  `detect_cpu_kernel()` picks the best at runtime. Universal
  binary, same shape as today.
- **Embedded** (e.g. `--no-default-features --features kernel_scalar`)
  → only the scalar kernel compiled; `detect_cpu_kernel()`
  collapses to a constant `ScalarKernel` tag; the SIMD
  trampolines + the `exec_sequence_inline` body are
  dead-eliminated by `#[cfg(feature = "...")]`. Significant
  binary-size win.
- **Specialised build** (e.g. `--no-default-features --features
  kernel_avx2`) → AVX2 path only, no Scalar fallback for the
  hot dispatch sites. Fails to compile on non-x86_64 targets
  via the existing arch-cfg, so the feature surfaces a clear
  error rather than producing a broken binary.
- **Feature implications** (`kernel_bmi2 = ["kernel_sse2"]` etc.)
  enforce the dependency graph: AVX2 requires BMI2; VBMI2
  requires AVX2; SVE requires NEON.

Acceptance:

- [ ] Cargo features `kernel_*` defined per the table above with
      correct implication chain.
- [ ] Per-kernel `impl CpuKernel for ...` and the dispatch arms
      inside the entry-level `match detect_cpu_kernel()` are
      gated `#[cfg(feature = "kernel_<name>")]`.
- [ ] `detect_cpu_kernel()` returns only tags whose backing
      feature is enabled (falls through to the lowest enabled
      kernel — Scalar by default).
- [ ] `exec_sequence_inline` body (currently
      `#[cfg(target_arch = "x86_64")]`) becomes
      `#[cfg(all(target_arch = "x86_64", any(feature =
      "kernel_sse2", feature = "kernel_bmi2", ...)))]` so the
      whole module is dead-elimable on embedded builds.
- [ ] CI build matrix gains an `embedded_minimal` row:
      `--no-default-features --features kernel_scalar,std,hash`,
      plus binary-size delta vs default in the bench dashboard.
- [ ] Documentation: README + lib.rs preamble explain the
      feature axis with an example for each shape (universal,
      embedded, specialised).

## Combined target

Bring L-1 c_stream decompress from ~31% of FFI to **≥ 50% of FFI**
on i9 (≤ 2.0x gap). Part 1: −2-4%. Part 2: −5-12% (depends on
how much of the bench was previously running scalar FSE / SIMD-
copy fallbacks). The CI bench's `target-cpu=x86-64-v3` workaround
gives a current snapshot of what runtime dispatch can deliver;
this issue makes that available to ALL consumers without build
flags.

## Out of scope

- Re-organising the FSE encoder side TABLE FORMAT (Part 1 fixes
  the decoder Entry; encoder has its own table that already
  pre-computes what it needs for its own hot path).
- Per-call runtime dispatch inside hot inner loops (per the
  cost analysis: per-call dispatch is strictly WORSE than
  function-level monomorphisation via the `K` trait).

Part of the #178 perf roadmap follow-up.


---

## Part 4 — Memory copy path donor parity (post-flamegraph reframe, 2026-05-25)

Flamegraph + perf annotate on i9 of two worst-case decode fixtures (`decompress/level_-1_fast/low-entropy-1m/rust_stream/matrix/pure_rust_direct` and `decompress/level_-3_fast/decodecorpus-z000033/c_stream/matrix/pure_rust_direct`) found that the post-Part-1+Part-2 hot path is NOT BitReader / FSE inner loops anymore — it's the **memory-copy primitives**:

- **low-entropy-1m L-1**: 95.6% of total time in `__memmove_avx_unaligned_erms` (libc memmove). Source: `simd_copy::copy_bytes_overshooting`'s `copy_from_nonoverlapping` fallback fires for every match-copy > 32 bytes whose destination has no overshoot slack. Donor `ZSTD_wildcopy` never calls libc on the hot path.
- **z000033 L-3 c_stream**: 82% memmove + **67% page-fault stack** (`do_anonymous_page` 55%, `handle_mm_fault` 60%). The page-fault overhead is a bench-harness artifact (fresh output Vec per iteration, kernel lazy-maps pages). Decoder-side cost matches the memmove story.

Four independent sub-issues split this work so each can be benched + landed separately:

- **#256** — Replace libc memmove fallback in `simd_copy.rs:170` with inline SIMD wildcopy + overlapping-tail trick. Primary win.
- **#257** — Bench harness pre-touch: allocate + zero output Vec once outside `b.iter`. Removes the page-fault artifact from dashboard delta on large-FCS fixtures.
- **#258** — SIMD broadcast in `extend_and_fill` to replace `slice::fill` libc memset for fills <= 256 bytes (RLE-heavy fixtures).
- **#259** — `decode_to_slice_trusted` requires `output.len() >= FCS + WILDCOPY_OVERLENGTH (= 32)`. API-breaking; enables chunked SIMD kernels on tail copies (synergy with #256).

### Acceptance criteria (Part 4) — ✅ DONE

- [x] **#256** evaluated; libc memmove fallback investigated.
      Inline 16-byte SSE2 loop regressed both target fixtures
      (Intel ERMS microcode beats hand-rolled SIMD on the
      medium-range copies that flamegraph attributed to libc).
      Closed; work moved to `decode_to_slice_trusted` slack
      contract path which delivered the win.
- [x] **#257** landed; bench-harness `pretouch_pages` allocates
      + faults the target Vec once outside `b.iter`. Removed the
      page-fault artifact from dashboard delta on large-FCS
      fixtures.
- [x] **#258** evaluated; SIMD broadcast in `extend_and_fill`
      not hot on FP-flamegraph (0.49% on low-entropy-1m).
      Closed without implementation — the win is consumed by
      Intel ERMS at this size.
- [x] **#259** landed; `WILDCOPY_OVERLENGTH = 32` slack contract
      enables AVX2 chunked kernel reach on tail copies.
- [x] **#262** evaluated; donor `overlapCopy8` 8-byte stride
      port regressed +17/+24% on i9 target fixtures (our
      existing chunked path emits 16-byte SSE2 stores already).
      Closed; the donor portable 8-byte stride is suboptimal on
      x86.
- [x] **#264** evaluated; HUF burst donor port — already
      ported. Closed.
- [x] **#266** evaluated; FSE state-interleave donor port
      regressed +1.40% on z000033 (LLVM already optimal at
      3× separate `#[inline(always)]` calls). Closed.

Post-Part-4 + PR #263 unification: `pure_rust` on
`high-entropy-1m` reaches **1.03× donor parity**;
`low-entropy-1m` **−39.7% vs origin/main**.

### Combined target update

Pre-Part-4 status (after Part 1 + Part 2 + #255): x86_64 `pure_rust_direct` on primary fixture = 1.628 ms vs donor 641 µs = **2.54× donor**. Part 4 closes the memmove/page-fault gap; new target = **≤ 1.5× donor** on primary fixture, with the goal of ≤ 1.0× kept as the ultimate finish line via subsequent Tier 9 BitReader donor parity work (now staged on the refactor done in #255).



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: top-level CpuKernel dispatch at FrameDecoder/FrameCompressor entry + FSE Entry layout #247

Context

Architecture

Why top-level, not per-subsystem

Part 1 — FSE Entry layout (independent of kernel dispatch)

Part 2 — Top-level `CpuKernel` dispatch (decoder)

Part 3 — Top-level `CpuKernel` dispatch (encoder)

Acceptance criteria

Part 1 (FSE Entry layout) — ✅ DONE (PR #252)

Part 2 (decoder top-level dispatch) — ✅ DONE (PR #254 + PR #263)

Part 3 (encoder top-level dispatch) — ❌ TODO

Part 5 (Cargo feature flags for kernel selection) — ❌ TODO

Combined target

Out of scope

Part 4 — Memory copy path donor parity (post-flamegraph reframe, 2026-05-25)

Acceptance criteria (Part 4) — ✅ DONE

Combined target update

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

perf: top-level CpuKernel dispatch at FrameDecoder/FrameCompressor entry + FSE Entry layout #247

Description

Context

Architecture

Why top-level, not per-subsystem

Part 1 — FSE Entry layout (independent of kernel dispatch)

Part 2 — Top-level CpuKernel dispatch (decoder)

Part 3 — Top-level CpuKernel dispatch (encoder)

Acceptance criteria

Part 1 (FSE Entry layout) — ✅ DONE (PR #252)

Part 2 (decoder top-level dispatch) — ✅ DONE (PR #254 + PR #263)

Part 3 (encoder top-level dispatch) — ❌ TODO

Part 5 (Cargo feature flags for kernel selection) — ❌ TODO

Combined target

Out of scope

Part 4 — Memory copy path donor parity (post-flamegraph reframe, 2026-05-25)

Acceptance criteria (Part 4) — ✅ DONE

Combined target update

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Part 2 — Top-level `CpuKernel` dispatch (decoder)

Part 3 — Top-level `CpuKernel` dispatch (encoder)