
Add GPU arrival time readback for timing-aware VCD output#49

Open
robtaylor wants to merge 19 commits into `main` from `timing-vcd-readback`

Conversation

@robtaylor
Contributor

Summary

  • Adds --timing-vcd flag (requires --sdf) that produces timing-accurate VCD output where signal transitions are offset from clock edges by their computed arrival times
  • GPU kernels (Metal/CUDA) write shared_writeout_arrival to global memory via a new arrival state section alongside values and xmask
  • Host-side extracts arrival data and writes sub-cycle-accurate VCD with proper timescale conversion

Details

The GPU kernel already computes per-gate arrival times for setup/hold violation checking, but discards them after each partition. This PR adds an opt-in sideband (arrival_state_offset in SimParams) that writes arrival times to global memory; a new write_output_vcd_timed() function then offsets each signal transition from its clock edge by its arrival time in picoseconds.

State buffer layout when enabled: [values (rio) | xmask (rio, if xprop) | arrivals (rio)]
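The three-section layout above can be sketched in Rust. This is a minimal illustration, not the actual `FlattenedScriptV1` code: the field and function names follow the PR, but the assumption that each section occupies the same `rio` word count is mine.

```rust
// Hypothetical sketch of the [values | xmask | arrivals] state layout.
// `rio` is the per-partition word count of one section (assumed equal
// for all three sections); xmask exists only with xprop, arrivals only
// with --timing-vcd.
struct Layout {
    rio: usize,
    xprop_enabled: bool,
    timing_arrivals_enabled: bool,
}

impl Layout {
    /// Offset of the arrivals section: after values, and after xmask
    /// when xprop is enabled. None when the sideband is disabled.
    fn arrival_state_offset(&self) -> Option<usize> {
        self.timing_arrivals_enabled
            .then(|| self.rio * if self.xprop_enabled { 2 } else { 1 })
    }

    /// Total words across all enabled sections.
    fn effective_state_size(&self) -> usize {
        let sections =
            1 + self.xprop_enabled as usize + self.timing_arrivals_enabled as usize;
        self.rio * sections
    }
}

fn main() {
    let l = Layout { rio: 1024, xprop_enabled: true, timing_arrivals_enabled: true };
    println!("arrivals at {:?}, total {}", l.arrival_state_offset(), l.effective_state_size());
}
```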

Files changed:

  • csrc/kernel_v1.metal, csrc/kernel_v1_impl.cuh — arrival write + SimParams update
  • src/flatten.rs — timing_arrivals_enabled, arrival_state_offset, updated effective_state_size()
  • src/sim/vcd_io.rs — expand_states_for_arrivals(), split_arrival_states(), write_output_vcd_timed()
  • src/bin/loom.rs — CLI flag, SimParams wiring, timed VCD dispatch

Test plan

  • cargo test — 97 tests pass (3 new timing arrival tests)
  • cargo build -r --features metal --bin loom — Metal shader compiles
  • Run on inv_chain with --sdf and --timing-vcd, compare against CVC reference output
  • Verify default behavior unchanged without --timing-vcd

robtaylor and others added 19 commits February 28, 2026 16:27
Add --timing-vcd flag that produces timing-accurate VCD output where
signal transitions are offset from clock edges by their computed
arrival times. The GPU kernel already computes per-gate arrival times
for setup/hold checking; this feature writes them to global memory
so the host can produce sub-cycle-accurate output.

Changes:
- GPU kernels (Metal/CUDA): write shared_writeout_arrival to global
  memory at arrival_state_offset when enabled
- FlattenedScriptV1: add timing_arrivals_enabled, arrival_state_offset
  fields; update effective_state_size() for 3-section layout
- vcd_io: add expand_states_for_arrivals(), split_arrival_states(),
  write_output_vcd_timed() with ps-to-timescale conversion
- loom CLI: wire --timing-vcd flag, SimParams.arrival_state_offset,
  and timed VCD writer dispatch

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
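The ps-to-timescale conversion mentioned above can be sketched as follows. This is an assumed implementation, not the actual write_output_vcd_timed() internals: the PR only states that arrivals in picoseconds are converted to the output timescale, so the round-to-nearest policy here is an assumption.

```rust
// Hedged sketch: place each transition at clock_edge + arrival, with the
// arrival converted from picoseconds to the VCD timescale.
// `timescale_fs` is femtoseconds per VCD unit (1_000 for a 1ps timescale).

fn ps_to_timescale_units(arrival_ps: u64, timescale_fs: u64) -> u64 {
    // Round to nearest unit so sub-unit arrivals don't all floor to 0.
    (arrival_ps * 1_000 + timescale_fs / 2) / timescale_fs
}

fn timed_transition(clock_edge: u64, arrival_ps: u64, timescale_fs: u64) -> u64 {
    clock_edge + ps_to_timescale_units(arrival_ps, timescale_fs)
}

fn main() {
    // A 1323ps arrival on a 1ps timescale lands 1323 units after the edge.
    println!("{}", timed_transition(10_000, 1323, 1_000));
}
```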
Add detailed section to Known Issues explaining why Loom only supports
edge-triggered DFFs, why CVC's test suite can't be reused as reference
tests (NAND-latch flip-flops), and what would be needed to add latch
support (new DriverType, two-phase evaluation, GPU kernel changes).

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
- Change IdCode::from(0) to IdCode(0) for vcd_ng tuple struct API
- Make write_output_vcd_timed generic over W: Write for testability
- Remove writer.flush() calls (vcd_ng::Writer has no flush method)
- Add 8 comprehensive tests for expand/split/write timing arrivals

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
The Metal kernel uses a double-buffered read pattern where t4_5 holds
the current stage's data while the next stage's data is pre-loaded. The
gate_delay extraction was incorrectly placed AFTER the t4_5 overwrite,
causing it to read the next stage's padding slot instead of the current
one. For single-stage designs (like inv_chain), this read returned garbage or zeros.

Fix: extract gate_delay from t4_5.c4 before overwriting t4_5.

Also fix arrival tracking to add gate_delay even for pass-through
positions (orb == 0xFFFFFFFF) across all hierarchy levels, since
pass-throughs can represent physical cells (e.g., inverter chains)
with accumulated delays.

Also fix load_timing_from_sdf to iterate all cell origins per AIG pin
instead of only the first, enabling correct delay accumulation for
inverter chains collapsed to a single AIG wire.

Verified: inv_chain test produces correct 1323ps arrival delay matching
the analytical SDF sum (CLK→Q=350ps + 16 inverters=973ps).

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
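The ordering bug in the double-buffered read pattern can be illustrated with a small Rust sketch (the real code is a Metal kernel; the names here are hypothetical and only the extract-before-overwrite ordering reflects the actual fix):

```rust
// With double buffering, per-stage metadata must be read from the current
// buffer BEFORE it is overwritten with the next stage's pre-load.

#[derive(Clone, Copy)]
struct StageData {
    c4: u32, // packed slot holding gate_delay for this stage
}

fn run_stages(stages: &[StageData]) -> Vec<u32> {
    let mut delays = Vec::new();
    let mut t4_5 = stages[0]; // pre-loaded first stage
    for i in 0..stages.len() {
        // FIX: extract gate_delay from the CURRENT stage's buffer first...
        delays.push(t4_5.c4);
        // ...then overwrite t4_5 with the next stage's data. Reading c4
        // after this assignment would return the next stage's slot, which
        // is exactly the bug described above.
        if i + 1 < stages.len() {
            t4_5 = stages[i + 1];
        }
    }
    delays
}

fn main() {
    let stages = [StageData { c4: 350 }, StageData { c4: 973 }];
    println!("{:?}", run_stages(&stages));
}
```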
Suppress unused variable warnings (staged, num_srams, num_ios, num_dup,
part_end) and remove dead assignments (offset before break, script_pi
before break) that were cluttering build output.

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
- tb_cvc.v: CVC testbench with SDF annotation for inv_chain timing
  validation (expected total delay: 1323ps)
- inv_chain_stimulus.vcd: Input stimulus for timing VCD tests
- compare_vcd.py: VCD comparison script for Loom vs CVC output
- watchlist.json: Signal watchlist for timing_sim_cpu tracing
- CI workflow: CVC reference simulation job for automated validation

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
Dockerfile builds CVC (open-src-cvc) from source on linux/amd64 with
gcc/binutils for its native code compilation. run_cvc.sh builds the
image, runs the inv_chain testbench with SDF back-annotation, and
compares against Loom's timing output.

Results: CVC reports 1235ps total delay vs Loom's 1323ps — an 88ps
(7.1%) conservative overestimate. This is expected: Loom uses
max(rise, fall) per cell since the GPU kernel processes 32 packed
signals and cannot track per-signal transition direction. CVC tracks
actual rise/fall transitions through the inverter chain.

The 88ps decomposes as:
  8 inverter stages × 10ps IOPATH rise/fall asymmetry = 80ps
  8 interconnect wires × 1ps rise/fall asymmetry = 8ps

Usage: bash tests/timing_test/cvc/run_cvc.sh

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
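The decomposition above is simple arithmetic worth checking: with alternating rise/fall transitions through the 16-inverter chain, 8 stages actually take the faster edge but get charged max(rise, fall), and likewise for 8 interconnect wires.

```rust
// Check that the stated decomposition reproduces the 88ps gap
// between Loom (1323ps) and CVC (1235ps).
fn total_overestimate_ps() -> u32 {
    let iopath = 8 * 10; // 8 inverter stages x 10ps IOPATH rise/fall asymmetry
    let wires = 8 * 1;   // 8 interconnect wires x 1ps asymmetry
    iopath + wires
}

fn main() {
    println!("{}ps", total_overestimate_ps());
}
```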
Add detailed section to timing-simulation.md covering the three
independent sources of timing overestimation:

1. max(rise, fall) per cell — GPU can't track transition direction
   across 32 packed signals (80ps / 6.5% for inv_chain)
2. max wire delay across multi-input pins — single wire delay per
   cell regardless of which input is critical (8ps for inv_chain)
3. max arrival across 32 packed signals per thread — mitigated by
   timing-aware bit packing (0ps for inv_chain, larger in practice)

Documents CVC reference validation: Loom 1323ps vs CVC 1235ps (88ps
/ 7.1% conservative overestimate) for the inv_chain design.

Updates implementation phases to reflect completed GPU arrival
tracking and timing-aware VCD output.

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
40 outputs at 5 logic depths (3, 5, 9, 13, 17) exercise Source 3
overestimation in timing-aware bit packing. CVC reference shows
distinct arrival times per group (513ps to 1286ps), confirming the
conservative timing model. Includes hand-crafted SDF, stimulus VCD,
CVC testbench, and Docker runner script.

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
The previous fallback logic used `find | sort -r | head -1` which
grabbed a pre-PnR SDF (step 08) alphabetically instead of the
post-PnR SDF from STAPostPNR (step 51) that includes interconnect
delays. Now explicitly searches for stapostpnr nom_tt SDF first.

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
Adds a new --stimulus-vcd <path> CLI option to `loom cosim` that writes
all primary input signals (clock, reset, flash MISO, constants) to a VCD
file. This enables CVC reference simulation by replaying the exact same
stimulus that the GPU cosim applied.

When enabled, forces single-tick mode (batch=1) to read back GPU state
after each cycle. Each tick produces two VCD timestamps (falling + rising
edge) for correct clock waveform reconstruction. Change-based encoding
minimizes file size by only writing transitions.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
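The two-timestamps-per-tick and change-based-encoding behavior described above can be sketched like this (illustrative names, not the actual writer code; the timestamp numbering scheme is an assumption):

```rust
/// Indices of signals whose value changed since the previous tick; only
/// these get a value-change line in the VCD body, which is what keeps
/// the stimulus file small.
fn changed_signals(prev: &[bool], cur: &[bool]) -> Vec<usize> {
    prev.iter()
        .zip(cur)
        .enumerate()
        .filter(|(_, (p, c))| p != c)
        .map(|(i, _)| i)
        .collect()
}

/// Each simulation tick expands to two VCD timestamps: the falling clock
/// edge, then the rising edge.
fn tick_timestamps(tick: u64) -> (u64, u64) {
    (2 * tick, 2 * tick + 1)
}

fn main() {
    let prev = [false, true, true];
    let cur = [false, false, true];
    println!("{:?} at {:?}", changed_signals(&prev, &cur), tick_timestamps(3));
}
```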
Recognize `dlymetal` prefix as a delay cell (same A→X interface as
`dlygate`) and `diode` prefix as a non-functional cell (like fill/tap/
decap) so post-PnR netlists containing these cells parse correctly.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Sorts level-1 endpoint placement by logic level so signals with similar
arrival times land in the same 32-slot groups. This tightens the
conservative timing estimate by reducing intra-group level spread.
Adds diagnostic logging of per-group timing spread statistics.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
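The packing heuristic above amounts to sort-then-chunk. A minimal sketch, with hypothetical names (the real placement code is more involved):

```rust
/// Sort endpoints by logic level, then split into 32-slot groups so that
/// each thread's max-over-32 arrival spans signals of similar depth.
fn pack_by_level(mut endpoints: Vec<(String, u32)>) -> Vec<Vec<(String, u32)>> {
    endpoints.sort_by_key(|&(_, level)| level);
    endpoints.chunks(32).map(|c| c.to_vec()).collect()
}

/// Intra-group level spread: the quantity this heuristic tries to shrink,
/// and what the new diagnostic logging reports per group.
fn group_spread(group: &[(String, u32)]) -> u32 {
    let max = group.iter().map(|&(_, l)| l).max().unwrap();
    let min = group.iter().map(|&(_, l)| l).min().unwrap();
    max - min
}

fn main() {
    let eps = vec![("q0".to_string(), 17), ("q1".to_string(), 3), ("q2".to_string(), 5)];
    for g in pack_by_level(eps) {
        println!("spread {}", group_spread(&g));
    }
}
```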
Update eda-infra-rs submodule to include support for parsing
`assign y = ~(x)` in structural Verilog. This is needed for
SKY130 post-PnR netlists that use bitwise NOT in assign statements.

The parser adds a Not(Box<Wirexpr>) variant and the netlistdb builder
synthesizes INV cells, which the existing AIG builder handles natively.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
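The shape of the parser extension can be sketched as follows. The `Not(Box<Wirexpr>)` variant is from the commit message; the lowering function and its return convention are illustrative, not the actual eda-infra-rs API:

```rust
// Wire-expression AST with the new unary-NOT variant for `assign y = ~(x);`.
enum Wirexpr {
    Wire(String),
    Not(Box<Wirexpr>),
}

/// Lower an assign RHS: a bare wire is a direct connection; a Not wraps
/// its operand in a synthesized INV cell (named "INV<n>(<input>)" here
/// purely for illustration), which the AIG builder handles natively.
fn lower(expr: &Wirexpr, inv_count: &mut usize) -> String {
    match expr {
        Wirexpr::Wire(w) => w.clone(),
        Wirexpr::Not(inner) => {
            let a = lower(inner, inv_count);
            *inv_count += 1;
            format!("INV{}({})", inv_count, a)
        }
    }
}

fn main() {
    let mut n = 0;
    let y = lower(&Wirexpr::Not(Box::new(Wirexpr::Wire("x".into()))), &mut n);
    println!("{}", y);
}
```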
When the gemparts file is omitted, partitions are generated inline
using the same mt-kahypar loop as `loom map`. This adds ~20s but
removes the need for a separate mapping step during development.

Refactored generate_partitions() and run_par() into setup.rs to
share between map, sim, and cosim code paths.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Stimulus VCD writer: use proper VCD reference + index format
(e.g., "gpio_in [38]" instead of "gpio_in[38]") so the VCD
roundtrips correctly through the vcd-ng parser for sim playback.

SKY130: add INV cell pin mapping to sky130.rs for post-PnR
netlists that contain inverter cells.

Config: update MCU SoC sim config with timing section for SDF.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Infrastructure for comparing Loom GPU simulation against CVC
(open-source event-driven simulator) with SDF back-annotation
on the MCU SoC SKY130 post-PnR netlist.

Workflow:
1. loom cosim --stimulus-vcd captures primary inputs
2. convert_stimulus.py converts VCD to Verilog assignments
3. gen_cell_models.py generates SKY130 behavioral + specify models
4. strip_sdf_checks.py preprocesses SDF for CVC compatibility
5. CVC runs with SDF timing via Docker (run_cvc.sh)
6. compare_outputs.py compares gpio_out waveforms

Key fixes for CVC compatibility:
- Wire _delayed signals directly to inputs in behavioral models
- Initialize DFF UDPs to 0 (matching Loom's initialization)
- Strip TIMINGCHECK/INTERCONNECT from SDF
- Remove empty DELAY blocks and escaped-$ CELL entries
- Add specify blocks to sized wrapper modules
- Behavioral CF_SRAM_1024x32 model with per-bit specify paths

Result: 100% match on CPU-driven GPIO outputs (bits 6-43),
sub-cycle SPI flash differences expected due to sampling.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
The separate `loom map` step added complexity without meaningful benefit
since partitioning is fast (~20s). Sim and cosim now always generate
partitions at startup, simplifying the workflow from two steps to one.

Changes:
- Remove Map subcommand, MapArgs, and cmd_map from loom.rs
- Remove gemparts field from DesignArgs and SimArgs/CosimArgs
- Make generate_partitions() and run_par() private to setup.rs
- Update CI to remove loom map steps and gemparts args
- Delete checked-in .gemparts files (no longer needed)
- Update all documentation to reflect single-step workflow

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)