Skip to content

perf: top-level CpuKernel dispatch at FrameDecoder/FrameCompressor entry + FSE Entry layout #247

@polaz

Description

@polaz

Context

Architectural performance gap analysis on PR #245 surfaced that
our hot paths gate CPU-feature kernel selection at the wrong
place AND at the wrong time:

  • Compile-time gating (cfg!(target_feature = ...)) means
    consumer builds with the default rustc x86_64 target (SSE2 only)
    silently use the scalar fallback even on hardware that has
    BMI2 / AVX2 / VBMI2 etc. Donor zstd-sys does runtime detection
    and always picks up the runner's full ISA.

  • Per-subsystem dispatch scatters the choice across HUF, FSE,
    SIMD-copy independently. Each subsystem detects features on its
    own (sometimes per-call, sometimes per-frame). We pay the
    dispatch cost N times instead of once, and we can't reuse the
    monomorphised pipeline across subsystems.

  • Per-sequence table lookups for FSE base values (lookup_ll_code /
    lookup_ml_code) instead of donor's pre-computed ZSTD_seqSymbol
    entries cost 2 extra L1d touches per sequence.

The architectural fix is single top-level dispatch at the zstd
entry
: FrameDecoder::{decode_to_slice, decode_all, decode_blocks}
and FrameCompressor::compress detect the kernel ONCE (cached in
a process-wide OnceLock) and propagate the choice through the
ENTIRE pipeline as a generic parameter. Inner code (HUF, FSE,
SIMD-copy, bit-reader, match-copy) instantiates against the
chosen kernel; monomorphisation specialises ONLY the bodies that
actually differ, common code stays single-impl.

Architecture

// Single kernel ZST, dispatched once at zstd entry.
pub(crate) trait CpuKernel: Copy {
    // Bodies that DIFFER per kernel — these get monomorphised:
    fn mask_lower_bits(value: u64, n: u8) -> u64;     // FSE bit reader
    fn huf_burst(...);                                 // HUF 4-stream
    fn copy_chunk(src: *const u8, dst: *mut u8);      // SIMD copy
    // ... etc

    // Common code (FCS check, block loop, sequence-execute structure,
    // offset history, repeat semantics) is NOT on the trait. Those
    // stay single-impl, generic ONLY over the kernel ZST so the
    // compiler routes through K::method calls.
}

pub(crate) struct ScalarKernel;
pub(crate) struct Sse2Kernel;
pub(crate) struct Avx2Kernel;     // includes BMI2 (x86-64-v3 baseline)
pub(crate) struct Avx512Kernel;
// + Aarch64Neon, Aarch64Sve

impl CpuKernel for ScalarKernel { /* portable */ }
#[cfg(target_arch = "x86_64")]
impl CpuKernel for Avx2Kernel { /* #[target_feature(enable = "bmi2,avx2")] bodies */ }
// ...

// Entry-level dispatch (OnceLock-cached once per process):
fn detect_cpu_kernel() -> CpuKernelTag {
    static CACHED: OnceLock<CpuKernelTag> = OnceLock::new();
    *CACHED.get_or_init(|| { /* is_x86_feature_detected! chain */ })
}

// FrameDecoder entry:
pub fn decode_to_slice(&mut self, input, output) -> Result<usize> {
    match detect_cpu_kernel() {
        CpuKernelTag::Avx2 => self.decode_to_slice_impl::<Avx2Kernel>(input, output),
        CpuKernelTag::Sse2 => self.decode_to_slice_impl::<Sse2Kernel>(input, output),
        CpuKernelTag::Scalar => self.decode_to_slice_impl::<ScalarKernel>(input, output),
        // ...
    }
}

fn decode_to_slice_impl<K: CpuKernel>(&mut self, input, output) -> Result<usize> {
    // ALL inner code (block loop, HUF burst, FSE state update,
    // sequence executor, match copy) is generic over K and routes
    // through K::method at the leaf hot-path call sites. Common
    // structure stays single-impl.
}

// Same shape for FrameCompressor::compress.

The HUF 4-stream burst already follows this shape internally
(via the HufKernel trait + detect_huffman_decode_kernel()
OnceLock). This issue lifts the dispatch to the FrameDecoder /
FrameCompressor entry so a SINGLE choice covers HUF + FSE +
SIMD-copy + bit-reader + match-copy for the duration of the
decode / encode call.

Why top-level, not per-subsystem

  • Cost amortisation: one detect + match per zstd call instead
    of per-subsystem. On small frames this matters; the runtime
    detect's atomic load + branch chain isn't free.

  • Monomorphisation sharing: the same K propagates through
    HUF, FSE, sequence-execute, match-copy. The compiler emits one
    specialised pipeline per kernel; subsystems don't each carry
    their own dispatcher and their own scalar fallback.

  • Consistency: avoids the asymmetry where HUF runtime-dispatches
    but FSE compile-time-gates. Either both pick up native ISA or
    neither does — anything else is a footgun for benchmarks (we
    hit it on PR perf(decode): direct-write decode_to_slice path (#244) #245 where CI bench had to set
    target-cpu=x86-64-v3 to make FSE inline _bzhi_u64).

  • Specialise only the differing bodies: CpuKernel trait
    methods cover the leaf hot-path ops where SIMD/BMI2 actually
    changes the codegen. FCS checks, block-loop structure, offset
    history etc. stay single-impl with K as a phantom on the
    outer function only. Monomorphisation cost is bounded.

Part 1 — FSE Entry layout (independent of kernel dispatch)

Current Entry (zstd/src/fse/fse_decoder.rs:578):

pub struct Entry {
    pub new_state: u16,    // 2 bytes
    pub symbol: u8,        // 1 byte
    pub num_bits: u8,      // 1 byte
}                          // total: 4 bytes per state

Per-sequence FSE state update touches the 4-byte entry, then:

let (ll_value, ll_num_bits) = lookup_ll_code(ll_code); // touch LL_BASE/BITS arrays
let (ml_value, ml_num_bits) = lookup_ml_code(ml_code); // touch ML_BASE/BITS arrays

That is 2 extra small-array touches per sequence. On i9 with hot
L1 they cost ~4-6 cycles each; ~20K sequences per MiB on L-1
fast corpora = ~80-120 us / MiB extra. Estimated saving: 2-4%
throughput.

Donor ZSTD_seqSymbol:

typedef struct {
    U16  nextState;
    BYTE nbAdditionalBits;
    BYTE nbBits;
    U32  baseValue;
} ZSTD_seqSymbol;   // 12 bytes, 16 with alignment

Proposed expansion:

pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
    pub base_value: u32,           // <-- new
    pub num_additional_bits: u8,   // <-- new
    // 3 bytes padding to 12B; or pack to 16B for cache alignment
}

Build path: in the LL / ML / OF table builders (FSEScratch
sub-tables), populate base_value and num_additional_bits from
the existing LL_TABLE / ML_TABLE / OF constants based on
Entry.symbol. decode_one_sequence_inline then reads these
directly from the entry, dropping both lookup_* calls.

Table size: LL/ML/OF combined ~20 KiB at 16B-aligned 12B entries
(vs current 12 KiB). Still fits in L1d (32 KiB i9) but tighter
margin alongside HUF table + burst working set.

Part 2 — Top-level CpuKernel dispatch (decoder)

Implement the architecture sketched above on the decoder side:

  • New CpuKernel trait + per-ISA impls (Scalar / Sse2 / Avx2 /
    Avx512, plus aarch64-neon / aarch64-sve).
  • detect_cpu_kernel() with OnceLock cache.
  • decode_to_slice / decode_all / decode_blocks ALL dispatch
    at entry via match over the cached tag, calling generic-over-K
    implementations.
  • Inner code (HUF burst, FSE state update, sequence executor,
    match-copy) routes through K::method calls. The existing
    per-subsystem HufKernel trait merges into / is replaced by
    the unified CpuKernel to remove the second-level dispatch.
  • Inner functions that DON'T touch SIMD/BMI2 ops (block loop,
    FCS check, offset history update, structure code) keep their
    single-impl shape with K as phantom on the outer function only.

After this lands, the CI bench-build can drop its
CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3
override — the default rustc build will runtime-dispatch into
the BMI2/AVX2 path on any x86_64 hardware that supports it.

Part 3 — Top-level CpuKernel dispatch (encoder)

Symmetric refactor on the FrameCompressor side. Single detect at
FrameCompressor::compress entry; propagate K through the encode
pipeline (HUF table build + encode, FSE table build + encode,
match finder if any of its kernels differ per ISA, sequence
emission). Same monomorphise-only-where-bodies-differ rule.

Acceptance criteria

Part 1 (FSE Entry layout) — ✅ DONE (PR #252)

  • Entry expanded with base_value (u32) and
    num_additional_bits (u8); #[repr(C)] layout doc updated.
  • LL/ML/OF FSEScratch builders pre-compute new fields.
  • decode_one_sequence_inline drops lookup_ll_code /
    lookup_ml_code and reads from the new entry fields.
  • L-1 fast c_stream bench shows >= 2% throughput improvement
    on i9.
  • No regression on other levels / scenarios.
  • 588+ nextest pass.

Part 2 (decoder top-level dispatch) — ✅ DONE (PR #254 + PR #263)

  • CpuKernel trait + per-ISA impls live alongside or replace
    the existing HufKernel. Implemented as
    Scalar / Bmi2 / Avx2 / Vbmi2 (x86_64) +
    Neon / Sve (aarch64) in zstd/src/cpu_kernel.rs.
  • detect_cpu_kernel() with OnceLock cache; one detect +
    match per call.
  • Decoder hot paths dispatch at entry; inner code generic over
    K. Literals section (literals_section_decoder.rs:143)
    and sequence section (sequence_section_decoder.rs:80)
    both match detect_cpu_kernel() and route into per-K
    #[target_feature]-trampolined impls.
  • CI bench-build can drop the
    CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3
    override; default rustc build hits the BMI2 path on capable
    x86_64 hardware.
  • L-1 c_stream bench delta on i9 (post-PR-perf(decode): inline sequence executor for direct path + auto-route decode_all (z000033 −24%, high-entropy-1m parity) #263 unification):
    pure_rust 2.034 ms → 1.555 ms = −23.5% vs origin/main.
    high-entropy-1m reaches 1.03× donor (parity);
    low-entropy-1m −39.7%. Combined target (≤ 2.0× donor)
    achieved on the structured corpora; the small-block
    small-4k-log-lines worst case still sits at ~2.5×.
  • No regression on Scalar path / aarch64 / wasm32 targets.

Part 3 (encoder top-level dispatch) — ❌ TODO

  • Same dispatch shape on FrameCompressor::compress entry.
  • Encode benches show ≥ 3% throughput improvement on x86_64
    hardware with BMI2/AVX2.
  • Symmetric with the decoder side: same CpuKernel trait
    reused for shared ops where applicable.
  • No regression on the scalar path.

Part 5 (Cargo feature flags for kernel selection) — ❌ TODO

PR #263 introduced SUPPORTS_INLINE_SEQUENCE_EXEC: bool = cfg!(target_arch = "x86_64") as a compile-time arch gate on the
inline sequence executor. That gate gives "use the SSE2 inline
path on every x86_64 build" — which is the right default for
universal binaries but doesn't let downstreams trim the binary
for embedded / constrained targets where the inline-exec body
and the per-kernel target_feature trampolines are dead weight.

Replace the arch-gated booleans with Cargo features, default
all-on (universal binary), opt-out for embedded:

[features]
default = [
  "kernel_scalar",
  "kernel_sse2",   # x86_64 baseline, no-op on non-x86_64 builds
  "kernel_bmi2",   # x86_64
  "kernel_avx2",   # x86_64
  "kernel_vbmi2",  # x86_64 (= AVX-512 BMI2 family)
  "kernel_neon",   # aarch64
  "kernel_sve",    # aarch64
]
kernel_scalar = []
kernel_sse2 = []
kernel_bmi2 = ["kernel_sse2"]
kernel_avx2 = ["kernel_bmi2"]
kernel_vbmi2 = ["kernel_avx2"]
kernel_neon = []
kernel_sve = ["kernel_neon"]

Semantic rules:

  • Default build (no flags) → every kernel compiled in;
    detect_cpu_kernel() picks the best at runtime. Universal
    binary, same shape as today.
  • Embedded (e.g. --no-default-features --features kernel_scalar)
    → only the scalar kernel compiled; detect_cpu_kernel()
    collapses to a constant ScalarKernel tag; the SIMD
    trampolines + the exec_sequence_inline body are
    dead-eliminated by #[cfg(feature = "...")]. Significant
    binary-size win.
  • Specialised build (e.g. --no-default-features --features kernel_avx2) → AVX2 path only, no Scalar fallback for the
    hot dispatch sites. Fails to compile on non-x86_64 targets
    via the existing arch-cfg, so the feature surfaces a clear
    error rather than producing a broken binary.
  • Feature implications (kernel_bmi2 = ["kernel_sse2"] etc.)
    enforce the dependency graph: AVX2 requires BMI2; VBMI2
    requires AVX2; SVE requires NEON.

Acceptance:

  • Cargo features kernel_* defined per the table above with
    correct implication chain.
  • Per-kernel impl CpuKernel for ... and the dispatch arms
    inside the entry-level match detect_cpu_kernel() are
    gated #[cfg(feature = "kernel_<name>")].
  • detect_cpu_kernel() returns only tags whose backing
    feature is enabled (falls through to the lowest enabled
    kernel — Scalar by default).
  • exec_sequence_inline body (currently
    #[cfg(target_arch = "x86_64")]) becomes
    #[cfg(all(target_arch = "x86_64", any(feature = "kernel_sse2", feature = "kernel_bmi2", ...)))] so the
    whole module is dead-elimable on embedded builds.
  • CI build matrix gains an embedded_minimal row:
    --no-default-features --features kernel_scalar,std,hash,
    plus binary-size delta vs default in the bench dashboard.
  • Documentation: README + lib.rs preamble explain the
    feature axis with an example for each shape (universal,
    embedded, specialised).

Combined target

Bring L-1 c_stream decompress from ~31% of FFI to ≥ 50% of FFI
on i9 (≤ 2.0x gap). Part 1: −2-4%. Part 2: −5-12% (depends on
how much of the bench was previously running scalar FSE / SIMD-
copy fallbacks). The CI bench's target-cpu=x86-64-v3 workaround
gives a current snapshot of what runtime dispatch can deliver;
this issue makes that available to ALL consumers without build
flags.

Out of scope

  • Re-organising the FSE encoder side TABLE FORMAT (Part 1 fixes
    the decoder Entry; encoder has its own table that already
    pre-computes what it needs for its own hot path).
  • Per-call runtime dispatch inside hot inner loops (per the
    cost analysis: per-call dispatch is strictly WORSE than
    function-level monomorphisation via the K trait).

Part of the #178 perf roadmap follow-up.


Part 4 — Memory copy path donor parity (post-flamegraph reframe, 2026-05-25)

Flamegraph + perf annotate on i9 of two worst-case decode fixtures (decompress/level_-1_fast/low-entropy-1m/rust_stream/matrix/pure_rust_direct and decompress/level_-3_fast/decodecorpus-z000033/c_stream/matrix/pure_rust_direct) found that the post-Part-1+Part-2 hot path is NOT BitReader / FSE inner loops anymore — it's the memory-copy primitives:

  • low-entropy-1m L-1: 95.6% of total time in __memmove_avx_unaligned_erms (libc memmove). Source: simd_copy::copy_bytes_overshooting's copy_from_nonoverlapping fallback fires for every match-copy > 32 bytes whose destination has no overshoot slack. Donor ZSTD_wildcopy never calls libc on the hot path.
  • z000033 L-3 c_stream: 82% memmove + 67% page-fault stack (do_anonymous_page 55%, handle_mm_fault 60%). The page-fault overhead is a bench-harness artifact (fresh output Vec per iteration, kernel lazy-maps pages). Decoder-side cost matches the memmove story.

Four independent sub-issues split this work so each can be benched + landed separately:

Acceptance criteria (Part 4) — ✅ DONE

Post-Part-4 + PR #263 unification: pure_rust on
high-entropy-1m reaches 1.03× donor parity;
low-entropy-1m −39.7% vs origin/main.

Combined target update

Pre-Part-4 status (after Part 1 + Part 2 + #255): x86_64 pure_rust_direct on primary fixture = 1.628 ms vs donor 641 µs = 2.54× donor. Part 4 closes the memmove/page-fault gap; new target = ≤ 1.5× donor on primary fixture, with the goal of ≤ 1.0× kept as the ultimate finish line via subsequent Tier 9 BitReader donor parity work (now staged on the refactor done in #255).

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2-mediumMedium priority — important improvementdocumentationImprovements or additions to documentationenhancementNew feature or requestperformancePerformance optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions