
Add GPU arrival time readback for timing-aware VCD output#49

Open
robtaylor wants to merge 19 commits into `main` from `timing-vcd-readback`

Conversation

@robtaylor
Contributor

Summary

  • Adds --timing-vcd flag (requires --sdf) that produces timing-accurate VCD output where signal transitions are offset from clock edges by their computed arrival times
  • GPU kernels (Metal/CUDA) write shared_writeout_arrival to global memory via a new arrival state section alongside values and xmask
  • Host-side extracts arrival data and writes sub-cycle-accurate VCD with proper timescale conversion

Details

The GPU kernel already computes per-gate arrival times for setup/hold violation checking, but discards them after each partition. This PR adds an opt-in sideband (arrival_state_offset in SimParams) that writes arrival times to global memory; a new write_output_vcd_timed() function then offsets each signal transition from its clock edge by its arrival time in picoseconds.

State buffer layout when enabled: [values (rio) | xmask (rio, if xprop) | arrivals (rio)]
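The three-section layout above can be sketched in Rust. This is a minimal illustration, not the actual `FlattenedScriptV1` code: the field and function names follow the PR, but the assumption that each section occupies the same `rio` word count is mine.

```rust
// Hypothetical sketch of the [values | xmask | arrivals] state layout.
// `rio` is the per-partition word count of one section (assumed equal
// for all three sections); xmask exists only with xprop, arrivals only
// with --timing-vcd.
struct Layout {
    rio: usize,
    xprop_enabled: bool,
    timing_arrivals_enabled: bool,
}

impl Layout {
    /// Offset of the arrivals section: after values, and after xmask
    /// when xprop is enabled. None when the sideband is disabled.
    fn arrival_state_offset(&self) -> Option<usize> {
        self.timing_arrivals_enabled
            .then(|| self.rio * if self.xprop_enabled { 2 } else { 1 })
    }

    /// Total words across all enabled sections.
    fn effective_state_size(&self) -> usize {
        let sections =
            1 + self.xprop_enabled as usize + self.timing_arrivals_enabled as usize;
        self.rio * sections
    }
}

fn main() {
    let l = Layout { rio: 1024, xprop_enabled: true, timing_arrivals_enabled: true };
    println!("arrivals at {:?}, total {}", l.arrival_state_offset(), l.effective_state_size());
}
```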

Files changed:

  • csrc/kernel_v1.metal, csrc/kernel_v1_impl.cuh — arrival write + SimParams update
  • src/flatten.rs — timing_arrivals_enabled, arrival_state_offset, updated effective_state_size()
  • src/sim/vcd_io.rs — expand_states_for_arrivals(), split_arrival_states(), write_output_vcd_timed()
  • src/bin/loom.rs — CLI flag, SimParams wiring, timed VCD dispatch

Test plan

  • cargo test — 97 tests pass (3 new timing arrival tests)
  • cargo build -r --features metal --bin loom — Metal shader compiles
  • Run on inv_chain with --sdf and --timing-vcd, compare against CVC reference output
  • Verify default behavior unchanged without --timing-vcd

robtaylor and others added 19 commits February 28, 2026 16:27
Add --timing-vcd flag that produces timing-accurate VCD output where
signal transitions are offset from clock edges by their computed
arrival times. The GPU kernel already computes per-gate arrival times
for setup/hold checking; this feature writes them to global memory
so the host can produce sub-cycle-accurate output.

Changes:
- GPU kernels (Metal/CUDA): write shared_writeout_arrival to global
  memory at arrival_state_offset when enabled
- FlattenedScriptV1: add timing_arrivals_enabled, arrival_state_offset
  fields; update effective_state_size() for 3-section layout
- vcd_io: add expand_states_for_arrivals(), split_arrival_states(),
  write_output_vcd_timed() with ps-to-timescale conversion
- loom CLI: wire --timing-vcd flag, SimParams.arrival_state_offset,
  and timed VCD writer dispatch

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
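The ps-to-timescale conversion mentioned above can be sketched as follows. This is an assumed implementation, not the actual write_output_vcd_timed() internals: the PR only states that arrivals in picoseconds are converted to the output timescale, so the round-to-nearest policy here is an assumption.

```rust
// Hedged sketch: place each transition at clock_edge + arrival, with the
// arrival converted from picoseconds to the VCD timescale.
// `timescale_fs` is femtoseconds per VCD unit (1_000 for a 1ps timescale).

fn ps_to_timescale_units(arrival_ps: u64, timescale_fs: u64) -> u64 {
    // Round to nearest unit so sub-unit arrivals don't all floor to 0.
    (arrival_ps * 1_000 + timescale_fs / 2) / timescale_fs
}

fn timed_transition(clock_edge: u64, arrival_ps: u64, timescale_fs: u64) -> u64 {
    clock_edge + ps_to_timescale_units(arrival_ps, timescale_fs)
}

fn main() {
    // A 1323ps arrival on a 1ps timescale lands 1323 units after the edge.
    println!("{}", timed_transition(10_000, 1323, 1_000));
}
```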
Add detailed section to Known Issues explaining why Loom only supports
edge-triggered DFFs, why CVC's test suite can't be reused as reference
tests (NAND-latch flip-flops), and what would be needed to add latch
support (new DriverType, two-phase evaluation, GPU kernel changes).

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
- Change IdCode::from(0) to IdCode(0) for vcd_ng tuple struct API
- Make write_output_vcd_timed generic over W: Write for testability
- Remove writer.flush() calls (vcd_ng::Writer has no flush method)
- Add 8 comprehensive tests for expand/split/write timing arrivals

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
The Metal kernel uses a double-buffered read pattern where t4_5 holds
the current stage's data while the next stage's data is pre-loaded. The
gate_delay extraction was incorrectly placed AFTER the t4_5 overwrite,
causing it to read the next stage's padding slot instead of the current
one. For single-stage designs (like inv_chain), this read returned garbage or zeros.

Fix: extract gate_delay from t4_5.c4 before overwriting t4_5.

Also fix arrival tracking to add gate_delay even for pass-through
positions (orb == 0xFFFFFFFF) across all hierarchy levels, since
pass-throughs can represent physical cells (e.g., inverter chains)
with accumulated delays.

Also fix load_timing_from_sdf to iterate all cell origins per AIG pin
instead of only the first, enabling correct delay accumulation for
inverter chains collapsed to a single AIG wire.

Verified: inv_chain test produces correct 1323ps arrival delay matching
the analytical SDF sum (CLK→Q=350ps + 16 inverters=973ps).

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
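The ordering bug in the double-buffered read pattern can be illustrated with a small Rust sketch (the real code is a Metal kernel; the names here are hypothetical and only the extract-before-overwrite ordering reflects the actual fix):

```rust
// With double buffering, per-stage metadata must be read from the current
// buffer BEFORE it is overwritten with the next stage's pre-load.

#[derive(Clone, Copy)]
struct StageData {
    c4: u32, // packed slot holding gate_delay for this stage
}

fn run_stages(stages: &[StageData]) -> Vec<u32> {
    let mut delays = Vec::new();
    let mut t4_5 = stages[0]; // pre-loaded first stage
    for i in 0..stages.len() {
        // FIX: extract gate_delay from the CURRENT stage's buffer first...
        delays.push(t4_5.c4);
        // ...then overwrite t4_5 with the next stage's data. Reading c4
        // after this assignment would return the next stage's slot, which
        // is exactly the bug described above.
        if i + 1 < stages.len() {
            t4_5 = stages[i + 1];
        }
    }
    delays
}

fn main() {
    let stages = [StageData { c4: 350 }, StageData { c4: 973 }];
    println!("{:?}", run_stages(&stages));
}
```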
Suppress unused variable warnings (staged, num_srams, num_ios, num_dup,
part_end) and remove dead assignments (offset before break, script_pi
before break) that were cluttering build output.

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
- tb_cvc.v: CVC testbench with SDF annotation for inv_chain timing
  validation (expected total delay: 1323ps)
- inv_chain_stimulus.vcd: Input stimulus for timing VCD tests
- compare_vcd.py: VCD comparison script for Loom vs CVC output
- watchlist.json: Signal watchlist for timing_sim_cpu tracing
- CI workflow: CVC reference simulation job for automated validation

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
Dockerfile builds CVC (open-src-cvc) from source on linux/amd64 with
gcc/binutils for its native code compilation. run_cvc.sh builds the
image, runs the inv_chain testbench with SDF back-annotation, and
compares against Loom's timing output.

Results: CVC reports 1235ps total delay vs Loom's 1323ps — an 88ps
(7.1%) conservative overestimate. This is expected: Loom uses
max(rise, fall) per cell since the GPU kernel processes 32 packed
signals and cannot track per-signal transition direction. CVC tracks
actual rise/fall transitions through the inverter chain.

The 88ps decomposes as:
  8 inverter stages × 10ps IOPATH rise/fall asymmetry = 80ps
  8 interconnect wires × 1ps rise/fall asymmetry = 8ps

Usage: bash tests/timing_test/cvc/run_cvc.sh

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
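The decomposition above is simple arithmetic worth checking: with alternating rise/fall transitions through the 16-inverter chain, 8 stages actually take the faster edge but get charged max(rise, fall), and likewise for 8 interconnect wires.

```rust
// Check that the stated decomposition reproduces the 88ps gap
// between Loom (1323ps) and CVC (1235ps).
fn total_overestimate_ps() -> u32 {
    let iopath = 8 * 10; // 8 inverter stages x 10ps IOPATH rise/fall asymmetry
    let wires = 8 * 1;   // 8 interconnect wires x 1ps asymmetry
    iopath + wires
}

fn main() {
    println!("{}ps", total_overestimate_ps());
}
```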
Add detailed section to timing-simulation.md covering the three
independent sources of timing overestimation:

1. max(rise, fall) per cell — GPU can't track transition direction
   across 32 packed signals (80ps / 6.5% for inv_chain)
2. max wire delay across multi-input pins — single wire delay per
   cell regardless of which input is critical (8ps for inv_chain)
3. max arrival across 32 packed signals per thread — mitigated by
   timing-aware bit packing (0ps for inv_chain, larger in practice)

Documents CVC reference validation: Loom 1323ps vs CVC 1235ps (88ps
/ 7.1% conservative overestimate) for the inv_chain design.

Updates implementation phases to reflect completed GPU arrival
tracking and timing-aware VCD output.

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
40 outputs at 5 logic depths (3, 5, 9, 13, 17) exercise Source 3
overestimation in timing-aware bit packing. CVC reference shows
distinct arrival times per group (513ps to 1286ps), confirming the
conservative timing model. Includes hand-crafted SDF, stimulus VCD,
CVC testbench, and Docker runner script.

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
The previous fallback logic used `find | sort -r | head -1` which
grabbed a pre-PnR SDF (step 08) alphabetically instead of the
post-PnR SDF from STAPostPNR (step 51) that includes interconnect
delays. Now explicitly searches for stapostpnr nom_tt SDF first.

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
Adds a new --stimulus-vcd <path> CLI option to `loom cosim` that writes
all primary input signals (clock, reset, flash MISO, constants) to a VCD
file. This enables CVC reference simulation by replaying the exact same
stimulus that the GPU cosim applied.

When enabled, forces single-tick mode (batch=1) to read back GPU state
after each cycle. Each tick produces two VCD timestamps (falling + rising
edge) for correct clock waveform reconstruction. Change-based encoding
minimizes file size by only writing transitions.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
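The two-timestamps-per-tick and change-based-encoding behavior described above can be sketched like this (illustrative names, not the actual writer code; the timestamp numbering scheme is an assumption):

```rust
/// Indices of signals whose value changed since the previous tick; only
/// these get a value-change line in the VCD body, which is what keeps
/// the stimulus file small.
fn changed_signals(prev: &[bool], cur: &[bool]) -> Vec<usize> {
    prev.iter()
        .zip(cur)
        .enumerate()
        .filter(|(_, (p, c))| p != c)
        .map(|(i, _)| i)
        .collect()
}

/// Each simulation tick expands to two VCD timestamps: the falling clock
/// edge, then the rising edge.
fn tick_timestamps(tick: u64) -> (u64, u64) {
    (2 * tick, 2 * tick + 1)
}

fn main() {
    let prev = [false, true, true];
    let cur = [false, false, true];
    println!("{:?} at {:?}", changed_signals(&prev, &cur), tick_timestamps(3));
}
```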
Recognize `dlymetal` prefix as a delay cell (same A→X interface as
`dlygate`) and `diode` prefix as a non-functional cell (like fill/tap/
decap) so post-PnR netlists containing these cells parse correctly.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Sorts level-1 endpoint placement by logic level so signals with similar
arrival times land in the same 32-slot groups. This tightens the
conservative timing estimate by reducing intra-group level spread.
Adds diagnostic logging of per-group timing spread statistics.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
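The packing heuristic above amounts to sort-then-chunk. A minimal sketch, with hypothetical names (the real placement code is more involved):

```rust
/// Sort endpoints by logic level, then split into 32-slot groups so that
/// each thread's max-over-32 arrival spans signals of similar depth.
fn pack_by_level(mut endpoints: Vec<(String, u32)>) -> Vec<Vec<(String, u32)>> {
    endpoints.sort_by_key(|&(_, level)| level);
    endpoints.chunks(32).map(|c| c.to_vec()).collect()
}

/// Intra-group level spread: the quantity this heuristic tries to shrink,
/// and what the new diagnostic logging reports per group.
fn group_spread(group: &[(String, u32)]) -> u32 {
    let max = group.iter().map(|&(_, l)| l).max().unwrap();
    let min = group.iter().map(|&(_, l)| l).min().unwrap();
    max - min
}

fn main() {
    let eps = vec![("q0".to_string(), 17), ("q1".to_string(), 3), ("q2".to_string(), 5)];
    for g in pack_by_level(eps) {
        println!("spread {}", group_spread(&g));
    }
}
```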
Update eda-infra-rs submodule to include support for parsing
`assign y = ~(x)` in structural Verilog. This is needed for
SKY130 post-PnR netlists that use bitwise NOT in assign statements.

The parser adds a Not(Box<Wirexpr>) variant and the netlistdb builder
synthesizes INV cells, which the existing AIG builder handles natively.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
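The shape of the parser extension can be sketched as follows. The `Not(Box<Wirexpr>)` variant is from the commit message; the lowering function and its return convention are illustrative, not the actual eda-infra-rs API:

```rust
// Wire-expression AST with the new unary-NOT variant for `assign y = ~(x);`.
enum Wirexpr {
    Wire(String),
    Not(Box<Wirexpr>),
}

/// Lower an assign RHS: a bare wire is a direct connection; a Not wraps
/// its operand in a synthesized INV cell (named "INV<n>(<input>)" here
/// purely for illustration), which the AIG builder handles natively.
fn lower(expr: &Wirexpr, inv_count: &mut usize) -> String {
    match expr {
        Wirexpr::Wire(w) => w.clone(),
        Wirexpr::Not(inner) => {
            let a = lower(inner, inv_count);
            *inv_count += 1;
            format!("INV{}({})", inv_count, a)
        }
    }
}

fn main() {
    let mut n = 0;
    let y = lower(&Wirexpr::Not(Box::new(Wirexpr::Wire("x".into()))), &mut n);
    println!("{}", y);
}
```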
When the gemparts file is omitted, partitions are generated inline
using the same mt-kahypar loop as `loom map`. This adds ~20s but
removes the need for a separate mapping step during development.

Refactored generate_partitions() and run_par() into setup.rs to
share between map, sim, and cosim code paths.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Stimulus VCD writer: use proper VCD reference + index format
(e.g., "gpio_in [38]" instead of "gpio_in[38]") so the VCD
roundtrips correctly through the vcd-ng parser for sim playback.

SKY130: add INV cell pin mapping to sky130.rs for post-PnR
netlists that contain inverter cells.

Config: update MCU SoC sim config with timing section for SDF.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Infrastructure for comparing Loom GPU simulation against CVC
(open-source event-driven simulator) with SDF back-annotation
on the MCU SoC SKY130 post-PnR netlist.

Workflow:
1. loom cosim --stimulus-vcd captures primary inputs
2. convert_stimulus.py converts VCD to Verilog assignments
3. gen_cell_models.py generates SKY130 behavioral + specify models
4. strip_sdf_checks.py preprocesses SDF for CVC compatibility
5. CVC runs with SDF timing via Docker (run_cvc.sh)
6. compare_outputs.py compares gpio_out waveforms

Key fixes for CVC compatibility:
- Wire _delayed signals directly to inputs in behavioral models
- Initialize DFF UDPs to 0 (matching Loom's initialization)
- Strip TIMINGCHECK/INTERCONNECT from SDF
- Remove empty DELAY blocks and escaped-$ CELL entries
- Add specify blocks to sized wrapper modules
- Behavioral CF_SRAM_1024x32 model with per-bit specify paths

Result: 100% match on CPU-driven GPIO outputs (bits 6-43),
sub-cycle SPI flash differences expected due to sampling.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
The separate `loom map` step added complexity without meaningful benefit
since partitioning is fast (~20s). Sim and cosim now always generate
partitions at startup, simplifying the workflow from two steps to one.

Changes:
- Remove Map subcommand, MapArgs, and cmd_map from loom.rs
- Remove gemparts field from DesignArgs and SimArgs/CosimArgs
- Make generate_partitions() and run_par() private to setup.rs
- Update CI to remove loom map steps and gemparts args
- Delete checked-in .gemparts files (no longer needed)
- Update all documentation to reflect single-step workflow

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)