You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Architectural performance gap analysis on PR #245 surfaced that
our hot paths gate CPU-feature kernel selection at the wrong
place AND at the wrong time:
Compile-time gating (cfg!(target_feature = ...)) means
consumer builds with the default rustc x86_64 target (SSE2 only)
silently use the scalar fallback even on hardware that has
BMI2 / AVX2 / VBMI2 etc. Donor zstd-sys does runtime detection
and always picks up the runner's full ISA.
Per-subsystem dispatch scatters the choice across HUF, FSE,
SIMD-copy independently. Each subsystem detects features on its
own (sometimes per-call, sometimes per-frame). We pay the
dispatch cost N times instead of once, and we can't reuse the
monomorphised pipeline across subsystems.
Per-sequence table lookups for FSE base values (lookup_ll_code / lookup_ml_code) instead of donor's pre-computed ZSTD_seqSymbol
entries cost 2 extra L1d touches per sequence.
The architectural fix is single top-level dispatch at the zstd
entry: FrameDecoder::{decode_to_slice, decode_all, decode_blocks}
and FrameCompressor::compress detect the kernel ONCE (cached in
a process-wide OnceLock) and propagate the choice through the
ENTIRE pipeline as a generic parameter. Inner code (HUF, FSE,
SIMD-copy, bit-reader, match-copy) instantiates against the
chosen kernel; monomorphisation specialises ONLY the bodies that
actually differ, common code stays single-impl.
Architecture
// Single kernel ZST, dispatched once at zstd entry.pub(crate)traitCpuKernel:Copy{// Bodies that DIFFER per kernel — these get monomorphised:fnmask_lower_bits(value:u64,n:u8) -> u64;// FSE bit readerfnhuf_burst(...);// HUF 4-streamfncopy_chunk(src:*constu8,dst:*mutu8);// SIMD copy// ... etc// Common code (FCS check, block loop, sequence-execute structure,// offset history, repeat semantics) is NOT on the trait. Those// stay single-impl, generic ONLY over the kernel ZST so the// compiler routes through K::method calls.}pub(crate)structScalarKernel;pub(crate)structSse2Kernel;pub(crate)structAvx2Kernel;// includes BMI2 (x86-64-v3 baseline)pub(crate)structAvx512Kernel;// + Aarch64Neon, Aarch64SveimplCpuKernelforScalarKernel{/* portable */}#[cfg(target_arch = "x86_64")]implCpuKernelforAvx2Kernel{/* #[target_feature(enable = "bmi2,avx2")] bodies */}// ...// Entry-level dispatch (OnceLock-cached once per process):fndetect_cpu_kernel() -> CpuKernelTag{staticCACHED:OnceLock<CpuKernelTag> = OnceLock::new();*CACHED.get_or_init(|| {/* is_x86_feature_detected! chain */})}// FrameDecoder entry:pubfndecode_to_slice(&mutself,input,output) -> Result<usize>{matchdetect_cpu_kernel(){CpuKernelTag::Avx2 => self.decode_to_slice_impl::<Avx2Kernel>(input, output),CpuKernelTag::Sse2 => self.decode_to_slice_impl::<Sse2Kernel>(input, output),CpuKernelTag::Scalar => self.decode_to_slice_impl::<ScalarKernel>(input, output),// ...}}fndecode_to_slice_impl<K:CpuKernel>(&mutself,input,output) -> Result<usize>{// ALL inner code (block loop, HUF burst, FSE state update,// sequence executor, match copy) is generic over K and routes// through K::method at the leaf hot-path call sites. Common// structure stays single-impl.}// Same shape for FrameCompressor::compress.
The HUF 4-stream burst already follows this shape internally
(via the HufKernel trait + detect_huffman_decode_kernel()
OnceLock). This issue lifts the dispatch to the FrameDecoder /
FrameCompressor entry so a SINGLE choice covers HUF + FSE +
SIMD-copy + bit-reader + match-copy for the duration of the
decode / encode call.
Why top-level, not per-subsystem
Cost amortisation: one detect + match per zstd call instead
of per-subsystem. On small frames this matters; the runtime
detect's atomic load + branch chain isn't free.
Monomorphisation sharing: the same K propagates through
HUF, FSE, sequence-execute, match-copy. The compiler emits one
specialised pipeline per kernel; subsystems don't each carry
their own dispatcher and their own scalar fallback.
Consistency: avoids the asymmetry where HUF runtime-dispatches
but FSE compile-time-gates. Either both pick up native ISA or
neither does — anything else is a footgun for benchmarks (we
hit it on PR perf(decode): direct-write decode_to_slice path (#244) #245 where CI bench had to set target-cpu=x86-64-v3 to make FSE inline _bzhi_u64).
Specialise only the differing bodies: CpuKernel trait
methods cover the leaf hot-path ops where SIMD/BMI2 actually
changes the codegen. FCS checks, block-loop structure, offset
history etc. stay single-impl with K as a phantom on the
outer function only. Monomorphisation cost is bounded.
Part 1 — FSE Entry layout (independent of kernel dispatch)
Current Entry (zstd/src/fse/fse_decoder.rs:578):
pubstructEntry{pubnew_state:u16,// 2 bytespubsymbol:u8,// 1 bytepubnum_bits:u8,// 1 byte}// total: 4 bytes per state
Per-sequence FSE state update touches the 4-byte entry, then:
That is 2 extra small-array touches per sequence. On i9 with hot
L1 they cost ~4-6 cycles each; ~20K sequences per MiB on L-1
fast corpora = ~80-120 us / MiB extra. Estimated saving: 2-4%
throughput.
pubstructEntry{pubnew_state:u16,pubsymbol:u8,pubnum_bits:u8,pubbase_value:u32,// <-- newpubnum_additional_bits:u8,// <-- new// 3 bytes padding to 12B; or pack to 16B for cache alignment}
Build path: in the LL / ML / OF table builders (FSEScratch
sub-tables), populate base_value and num_additional_bits from
the existing LL_TABLE / ML_TABLE / OF constants based on Entry.symbol. decode_one_sequence_inline then reads these
directly from the entry, dropping both lookup_* calls.
Table size: LL/ML/OF combined ~20 KiB at 16B-aligned 12B entries
(vs current 12 KiB). Still fits in L1d (32 KiB i9) but tighter
margin alongside HUF table + burst working set.
Part 2 — Top-level CpuKernel dispatch (decoder)
Implement the architecture sketched above on the decoder side:
New CpuKernel trait + per-ISA impls (Scalar / Sse2 / Avx2 /
Avx512, plus aarch64-neon / aarch64-sve).
detect_cpu_kernel() with OnceLock cache.
decode_to_slice / decode_all / decode_blocks ALL dispatch
at entry via match over the cached tag, calling generic-over-K
implementations.
Inner code (HUF burst, FSE state update, sequence executor,
match-copy) routes through K::method calls. The existing
per-subsystem HufKernel trait merges into / is replaced by
the unified CpuKernel to remove the second-level dispatch.
Inner functions that DON'T touch SIMD/BMI2 ops (block loop,
FCS check, offset history update, structure code) keep their
single-impl shape with K as phantom on the outer function only.
After this lands, the CI bench-build can drop its CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3
override — the default rustc build will runtime-dispatch into
the BMI2/AVX2 path on any x86_64 hardware that supports it.
Part 3 — Top-level CpuKernel dispatch (encoder)
Symmetric refactor on the FrameCompressor side. Single detect at FrameCompressor::compress entry; propagate K through the encode
pipeline (HUF table build + encode, FSE table build + encode,
match finder if any of its kernels differ per ISA, sequence
emission). Same monomorphise-only-where-bodies-differ rule.
CpuKernel trait + per-ISA impls live alongside or replace
the existing HufKernel. Implemented as Scalar / Bmi2 / Avx2 / Vbmi2 (x86_64) + Neon / Sve (aarch64) in zstd/src/cpu_kernel.rs.
detect_cpu_kernel() with OnceLock cache; one detect +
match per call.
Decoder hot paths dispatch at entry; inner code generic over K. Literals section (literals_section_decoder.rs:143)
and sequence section (sequence_section_decoder.rs:80)
both match detect_cpu_kernel() and route into per-K #[target_feature]-trampolined impls.
CI bench-build can drop the CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3
override; default rustc build hits the BMI2 path on capable
x86_64 hardware.
No regression on Scalar path / aarch64 / wasm32 targets.
Part 3 (encoder top-level dispatch) — ❌ TODO
Same dispatch shape on FrameCompressor::compress entry.
Encode benches show ≥ 3% throughput improvement on x86_64
hardware with BMI2/AVX2.
Symmetric with the decoder side: same CpuKernel trait
reused for shared ops where applicable.
No regression on the scalar path.
Part 5 (Cargo feature flags for kernel selection) — ❌ TODO
PR #263 introduced SUPPORTS_INLINE_SEQUENCE_EXEC: bool = cfg!(target_arch = "x86_64") as a compile-time arch gate on the
inline sequence executor. That gate gives "use the SSE2 inline
path on every x86_64 build" — which is the right default for
universal binaries but doesn't let downstreams trim the binary
for embedded / constrained targets where the inline-exec body
and the per-kernel target_feature trampolines are dead weight.
Replace the arch-gated booleans with Cargo features, default
all-on (universal binary), opt-out for embedded:
Default build (no flags) → every kernel compiled in; detect_cpu_kernel() picks the best at runtime. Universal
binary, same shape as today.
Embedded (e.g. --no-default-features --features kernel_scalar)
→ only the scalar kernel compiled; detect_cpu_kernel()
collapses to a constant ScalarKernel tag; the SIMD
trampolines + the exec_sequence_inline body are
dead-eliminated by #[cfg(feature = "...")]. Significant
binary-size win.
Specialised build (e.g. --no-default-features --features kernel_avx2) → AVX2 path only, no Scalar fallback for the
hot dispatch sites. Fails to compile on non-x86_64 targets
via the existing arch-cfg, so the feature surfaces a clear
error rather than producing a broken binary.
Feature implications (kernel_bmi2 = ["kernel_sse2"] etc.)
enforce the dependency graph: AVX2 requires BMI2; VBMI2
requires AVX2; SVE requires NEON.
Acceptance:
Cargo features kernel_* defined per the table above with
correct implication chain.
Per-kernel impl CpuKernel for ... and the dispatch arms
inside the entry-level match detect_cpu_kernel() are
gated #[cfg(feature = "kernel_<name>")].
detect_cpu_kernel() returns only tags whose backing
feature is enabled (falls through to the lowest enabled
kernel — Scalar by default).
exec_sequence_inline body (currently #[cfg(target_arch = "x86_64")]) becomes #[cfg(all(target_arch = "x86_64", any(feature = "kernel_sse2", feature = "kernel_bmi2", ...)))] so the
whole module is dead-elimable on embedded builds.
CI build matrix gains an embedded_minimal row: --no-default-features --features kernel_scalar,std,hash,
plus binary-size delta vs default in the bench dashboard.
Documentation: README + lib.rs preamble explain the
feature axis with an example for each shape (universal,
embedded, specialised).
Combined target
Bring L-1 c_stream decompress from ~31% of FFI to ≥ 50% of FFI
on i9 (≤ 2.0x gap). Part 1: −2-4%. Part 2: −5-12% (depends on
how much of the bench was previously running scalar FSE / SIMD-
copy fallbacks). The CI bench's target-cpu=x86-64-v3 workaround
gives a current snapshot of what runtime dispatch can deliver;
this issue makes that available to ALL consumers without build
flags.
Out of scope
Re-organising the FSE encoder side TABLE FORMAT (Part 1 fixes
the decoder Entry; encoder has its own table that already
pre-computes what it needs for its own hot path).
Per-call runtime dispatch inside hot inner loops (per the
cost analysis: per-call dispatch is strictly WORSE than
function-level monomorphisation via the K trait).
Flamegraph + perf annotate on i9 of two worst-case decode fixtures (decompress/level_-1_fast/low-entropy-1m/rust_stream/matrix/pure_rust_direct and decompress/level_-3_fast/decodecorpus-z000033/c_stream/matrix/pure_rust_direct) found that the post-Part-1+Part-2 hot path is NOT BitReader / FSE inner loops anymore — it's the memory-copy primitives:
low-entropy-1m L-1: 95.6% of total time in __memmove_avx_unaligned_erms (libc memmove). Source: simd_copy::copy_bytes_overshooting's copy_from_nonoverlapping fallback fires for every match-copy > 32 bytes whose destination has no overshoot slack. Donor ZSTD_wildcopy never calls libc on the hot path.
z000033 L-3 c_stream: 82% memmove + 67% page-fault stack (do_anonymous_page 55%, handle_mm_fault 60%). The page-fault overhead is a bench-harness artifact (fresh output Vec per iteration, kernel lazy-maps pages). Decoder-side cost matches the memmove story.
Four independent sub-issues split this work so each can be benched + landed separately:
Post-Part-4 + PR #263 unification: pure_rust on high-entropy-1m reaches 1.03× donor parity; low-entropy-1m−39.7% vs origin/main.
Combined target update
Pre-Part-4 status (after Part 1 + Part 2 + #255): x86_64 pure_rust_direct on primary fixture = 1.628 ms vs donor 641 µs = 2.54× donor. Part 4 closes the memmove/page-fault gap; new target = ≤ 1.5× donor on primary fixture, with the goal of ≤ 1.0× kept as the ultimate finish line via subsequent Tier 9 BitReader donor parity work (now staged on the refactor done in #255).
Context
Architectural performance gap analysis on PR #245 surfaced that
our hot paths gate CPU-feature kernel selection at the wrong
place AND at the wrong time:
Compile-time gating (
cfg!(target_feature = ...)) meansconsumer builds with the default rustc x86_64 target (SSE2 only)
silently use the scalar fallback even on hardware that has
BMI2 / AVX2 / VBMI2 etc. Donor zstd-sys does runtime detection
and always picks up the runner's full ISA.
Per-subsystem dispatch scatters the choice across HUF, FSE,
SIMD-copy independently. Each subsystem detects features on its
own (sometimes per-call, sometimes per-frame). We pay the
dispatch cost N times instead of once, and we can't reuse the
monomorphised pipeline across subsystems.
Per-sequence table lookups for FSE base values (
lookup_ll_code/lookup_ml_code) instead of donor's pre-computedZSTD_seqSymbolentries cost 2 extra L1d touches per sequence.
The architectural fix is single top-level dispatch at the zstd
entry:
FrameDecoder::{decode_to_slice, decode_all, decode_blocks}and
FrameCompressor::compressdetect the kernel ONCE (cached ina process-wide OnceLock) and propagate the choice through the
ENTIRE pipeline as a generic parameter. Inner code (HUF, FSE,
SIMD-copy, bit-reader, match-copy) instantiates against the
chosen kernel; monomorphisation specialises ONLY the bodies that
actually differ, common code stays single-impl.
Architecture
The HUF 4-stream burst already follows this shape internally
(via the
HufKerneltrait +detect_huffman_decode_kernel()OnceLock). This issue lifts the dispatch to the FrameDecoder /
FrameCompressor entry so a SINGLE choice covers HUF + FSE +
SIMD-copy + bit-reader + match-copy for the duration of the
decode / encode call.
Why top-level, not per-subsystem
Cost amortisation: one detect + match per zstd call instead
of per-subsystem. On small frames this matters; the runtime
detect's atomic load + branch chain isn't free.
Monomorphisation sharing: the same
Kpropagates throughHUF, FSE, sequence-execute, match-copy. The compiler emits one
specialised pipeline per kernel; subsystems don't each carry
their own dispatcher and their own scalar fallback.
Consistency: avoids the asymmetry where HUF runtime-dispatches
but FSE compile-time-gates. Either both pick up native ISA or
neither does — anything else is a footgun for benchmarks (we
hit it on PR perf(decode): direct-write decode_to_slice path (#244) #245 where CI bench had to set
target-cpu=x86-64-v3to make FSE inline_bzhi_u64).Specialise only the differing bodies:
CpuKerneltraitmethods cover the leaf hot-path ops where SIMD/BMI2 actually
changes the codegen. FCS checks, block-loop structure, offset
history etc. stay single-impl with
Kas a phantom on theouter function only. Monomorphisation cost is bounded.
Part 1 — FSE Entry layout (independent of kernel dispatch)
Current
Entry(zstd/src/fse/fse_decoder.rs:578):Per-sequence FSE state update touches the 4-byte entry, then:
That is 2 extra small-array touches per sequence. On i9 with hot
L1 they cost ~4-6 cycles each; ~20K sequences per MiB on L-1
fast corpora = ~80-120 us / MiB extra. Estimated saving: 2-4%
throughput.
Donor
ZSTD_seqSymbol:Proposed expansion:
Build path: in the LL / ML / OF table builders (
FSEScratchsub-tables), populate
base_valueandnum_additional_bitsfromthe existing LL_TABLE / ML_TABLE / OF constants based on
Entry.symbol.decode_one_sequence_inlinethen reads thesedirectly from the entry, dropping both
lookup_*calls.Table size: LL/ML/OF combined ~20 KiB at 16B-aligned 12B entries
(vs current 12 KiB). Still fits in L1d (32 KiB i9) but tighter
margin alongside HUF table + burst working set.
Part 2 — Top-level
CpuKerneldispatch (decoder)Implement the architecture sketched above on the decoder side:
CpuKerneltrait + per-ISA impls (Scalar / Sse2 / Avx2 /Avx512, plus aarch64-neon / aarch64-sve).
detect_cpu_kernel()with OnceLock cache.decode_to_slice/decode_all/decode_blocksALL dispatchat entry via match over the cached tag, calling generic-over-K
implementations.
match-copy) routes through
K::methodcalls. The existingper-subsystem
HufKerneltrait merges into / is replaced bythe unified
CpuKernelto remove the second-level dispatch.FCS check, offset history update, structure code) keep their
single-impl shape with K as phantom on the outer function only.
After this lands, the CI bench-build can drop its
CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3override — the default rustc build will runtime-dispatch into
the BMI2/AVX2 path on any x86_64 hardware that supports it.
Part 3 — Top-level
CpuKerneldispatch (encoder)Symmetric refactor on the FrameCompressor side. Single detect at
FrameCompressor::compressentry; propagate K through the encodepipeline (HUF table build + encode, FSE table build + encode,
match finder if any of its kernels differ per ISA, sequence
emission). Same monomorphise-only-where-bodies-differ rule.
Acceptance criteria
Part 1 (FSE Entry layout) — ✅ DONE (PR #252)
Entryexpanded withbase_value(u32) andnum_additional_bits(u8);#[repr(C)]layout doc updated.FSEScratchbuilders pre-compute new fields.decode_one_sequence_inlinedropslookup_ll_code/lookup_ml_codeand reads from the new entry fields.on i9.
Part 2 (decoder top-level dispatch) — ✅ DONE (PR #254 + PR #263)
CpuKerneltrait + per-ISA impls live alongside or replacethe existing
HufKernel. Implemented asScalar/Bmi2/Avx2/Vbmi2(x86_64) +Neon/Sve(aarch64) inzstd/src/cpu_kernel.rs.detect_cpu_kernel()withOnceLockcache; one detect +match per call.
K. Literals section (literals_section_decoder.rs:143)and sequence section (
sequence_section_decoder.rs:80)both
match detect_cpu_kernel()and route into per-K#[target_feature]-trampolined impls.CARGO_TARGET_X86_64_UNKNOWN_LINUX_*_RUSTFLAGS=target-cpu=x86-64-v3override; default rustc build hits the BMI2 path on capable
x86_64 hardware.
pure_rust2.034 ms → 1.555 ms = −23.5% vs origin/main.high-entropy-1mreaches 1.03× donor (parity);low-entropy-1m−39.7%. Combined target (≤ 2.0× donor)achieved on the structured corpora; the small-block
small-4k-log-linesworst case still sits at ~2.5×.Part 3 (encoder top-level dispatch) — ❌ TODO
FrameCompressor::compressentry.hardware with BMI2/AVX2.
CpuKerneltraitreused for shared ops where applicable.
Part 5 (Cargo feature flags for kernel selection) — ❌ TODO
PR #263 introduced
SUPPORTS_INLINE_SEQUENCE_EXEC: bool = cfg!(target_arch = "x86_64")as a compile-time arch gate on theinline sequence executor. That gate gives "use the SSE2 inline
path on every x86_64 build" — which is the right default for
universal binaries but doesn't let downstreams trim the binary
for embedded / constrained targets where the inline-exec body
and the per-kernel target_feature trampolines are dead weight.
Replace the arch-gated booleans with Cargo features, default
all-on (universal binary), opt-out for embedded:
Semantic rules:
detect_cpu_kernel()picks the best at runtime. Universalbinary, same shape as today.
--no-default-features --features kernel_scalar)→ only the scalar kernel compiled;
detect_cpu_kernel()collapses to a constant
ScalarKerneltag; the SIMDtrampolines + the
exec_sequence_inlinebody aredead-eliminated by
#[cfg(feature = "...")]. Significantbinary-size win.
--no-default-features --features kernel_avx2) → AVX2 path only, no Scalar fallback for thehot dispatch sites. Fails to compile on non-x86_64 targets
via the existing arch-cfg, so the feature surfaces a clear
error rather than producing a broken binary.
kernel_bmi2 = ["kernel_sse2"]etc.)enforce the dependency graph: AVX2 requires BMI2; VBMI2
requires AVX2; SVE requires NEON.
Acceptance:
kernel_*defined per the table above withcorrect implication chain.
impl CpuKernel for ...and the dispatch armsinside the entry-level
match detect_cpu_kernel()aregated
#[cfg(feature = "kernel_<name>")].detect_cpu_kernel()returns only tags whose backingfeature is enabled (falls through to the lowest enabled
kernel — Scalar by default).
exec_sequence_inlinebody (currently#[cfg(target_arch = "x86_64")]) becomes#[cfg(all(target_arch = "x86_64", any(feature = "kernel_sse2", feature = "kernel_bmi2", ...)))]so thewhole module is dead-elimable on embedded builds.
embedded_minimalrow:--no-default-features --features kernel_scalar,std,hash,plus binary-size delta vs default in the bench dashboard.
feature axis with an example for each shape (universal,
embedded, specialised).
Combined target
Bring L-1 c_stream decompress from ~31% of FFI to ≥ 50% of FFI
on i9 (≤ 2.0x gap). Part 1: −2-4%. Part 2: −5-12% (depends on
how much of the bench was previously running scalar FSE / SIMD-
copy fallbacks). The CI bench's
target-cpu=x86-64-v3workaroundgives a current snapshot of what runtime dispatch can deliver;
this issue makes that available to ALL consumers without build
flags.
Out of scope
the decoder Entry; encoder has its own table that already
pre-computes what it needs for its own hot path).
cost analysis: per-call dispatch is strictly WORSE than
function-level monomorphisation via the
Ktrait).Part of the #178 perf roadmap follow-up.
Part 4 — Memory copy path donor parity (post-flamegraph reframe, 2026-05-25)
Flamegraph + perf annotate on i9 of two worst-case decode fixtures (
decompress/level_-1_fast/low-entropy-1m/rust_stream/matrix/pure_rust_directanddecompress/level_-3_fast/decodecorpus-z000033/c_stream/matrix/pure_rust_direct) found that the post-Part-1+Part-2 hot path is NOT BitReader / FSE inner loops anymore — it's the memory-copy primitives:__memmove_avx_unaligned_erms(libc memmove). Source:simd_copy::copy_bytes_overshooting'scopy_from_nonoverlappingfallback fires for every match-copy > 32 bytes whose destination has no overshoot slack. DonorZSTD_wildcopynever calls libc on the hot path.do_anonymous_page55%,handle_mm_fault60%). The page-fault overhead is a bench-harness artifact (fresh output Vec per iteration, kernel lazy-maps pages). Decoder-side cost matches the memmove story.Four independent sub-issues split this work so each can be benched + landed separately:
simd_copy.rs:170with inline SIMD wildcopy + overlapping-tail trick. Primary win.b.iter. Removes the page-fault artifact from dashboard delta on large-FCS fixtures.extend_and_fillto replaceslice::filllibc memset for fills <= 256 bytes (RLE-heavy fixtures).decode_to_slice_trustedrequiresoutput.len() >= FCS + WILDCOPY_OVERLENGTH (= 32). API-breaking; enables chunked SIMD kernels on tail copies (synergy with perf(decode): replace libc memmove fallback in simd_copy with inline SIMD wildcopy (exact-length tail) #256).Acceptance criteria (Part 4) — ✅ DONE
Inline 16-byte SSE2 loop regressed both target fixtures
(Intel ERMS microcode beats hand-rolled SIMD on the
medium-range copies that flamegraph attributed to libc).
Closed; work moved to
decode_to_slice_trustedslackcontract path which delivered the win.
pretouch_pagesallocates+ faults the target Vec once outside
b.iter. Removed thepage-fault artifact from dashboard delta on large-FCS
fixtures.
extend_and_fillnot hot on FP-flamegraph (0.49% on low-entropy-1m).
Closed without implementation — the win is consumed by
Intel ERMS at this size.
WILDCOPY_OVERLENGTH = 32slack contractenables AVX2 chunked kernel reach on tail copies.
overlapCopy88-byte strideport regressed +17/+24% on i9 target fixtures (our
existing chunked path emits 16-byte SSE2 stores already).
Closed; the donor portable 8-byte stride is suboptimal on
x86.
ported. Closed.
regressed +1.40% on z000033 (LLVM already optimal at
3× separate
#[inline(always)]calls). Closed.Post-Part-4 + PR #263 unification:
pure_rustonhigh-entropy-1mreaches 1.03× donor parity;low-entropy-1m−39.7% vs origin/main.Combined target update
Pre-Part-4 status (after Part 1 + Part 2 + #255): x86_64
pure_rust_directon primary fixture = 1.628 ms vs donor 641 µs = 2.54× donor. Part 4 closes the memmove/page-fault gap; new target = ≤ 1.5× donor on primary fixture, with the goal of ≤ 1.0× kept as the ultimate finish line via subsequent Tier 9 BitReader donor parity work (now staged on the refactor done in #255).