Add GPU arrival time readback for timing-aware VCD output #49
Open
Add --timing-vcd flag that produces timing-accurate VCD output where signal transitions are offset from clock edges by their computed arrival times. The GPU kernel already computes per-gate arrival times for setup/hold checking; this feature writes them to global memory so the host can produce sub-cycle-accurate output.

Changes:
- GPU kernels (Metal/CUDA): write shared_writeout_arrival to global memory at arrival_state_offset when enabled
- FlattenedScriptV1: add timing_arrivals_enabled, arrival_state_offset fields; update effective_state_size() for 3-section layout
- vcd_io: add expand_states_for_arrivals(), split_arrival_states(), write_output_vcd_timed() with ps-to-timescale conversion
- loom CLI: wire --timing-vcd flag, SimParams.arrival_state_offset, and timed VCD writer dispatch

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
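To make the 3-section layout concrete, here is a minimal sketch of how the offsets could be derived. It follows the `[values | xmask (if xprop) | arrivals]` order given in the PR description. The `StateLayout` type, its `section_words` field, and `split_arrivals` are hypothetical names for illustration, and the signatures are guesses rather than the actual `FlattenedScriptV1` API.

```rust
/// Hypothetical description of one cycle's state buffer.
/// Layout assumed from the PR text: [values | xmask (if xprop) | arrivals].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StateLayout {
    /// Number of 32-bit words in one section (assumed equal for all sections).
    section_words: usize,
    /// True when the xmask section is present (X-propagation enabled).
    xprop_enabled: bool,
    /// True when the arrivals section is present (timing VCD requested).
    timing_arrivals_enabled: bool,
}

impl StateLayout {
    /// Word offset of the arrivals section, or None when it is disabled.
    fn arrival_state_offset(&self) -> Option<usize> {
        if !self.timing_arrivals_enabled {
            return None;
        }
        let xmask_words = if self.xprop_enabled { self.section_words } else { 0 };
        Some(self.section_words + xmask_words)
    }

    /// Total words per cycle across all enabled sections.
    fn effective_state_size(&self) -> usize {
        let sections = 1 + self.xprop_enabled as usize + self.timing_arrivals_enabled as usize;
        sections * self.section_words
    }

    /// Split one cycle's state into (values, arrivals).
    fn split_arrivals<'a>(&self, cycle_state: &'a [u32]) -> (&'a [u32], Option<&'a [u32]>) {
        let values = &cycle_state[..self.section_words];
        let arrivals = match self.arrival_state_offset() {
            Some(off) => Some(&cycle_state[off..off + self.section_words]),
            None => None,
        };
        (values, arrivals)
    }
}
```

Keeping the arrivals section last means the values and xmask offsets are unchanged when the flag is off, which fits the opt-in nature of the feature.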
Add detailed section to Known Issues explaining why Loom only supports edge-triggered DFFs, why CVC's test suite can't be reused as reference tests (NAND-latch flip-flops), and what would be needed to add latch support (new DriverType, two-phase evaluation, GPU kernel changes). Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
- Change IdCode::from(0) to IdCode(0) for vcd_ng tuple struct API
- Make write_output_vcd_timed generic over W: Write for testability
- Remove writer.flush() calls (vcd_ng::Writer has no flush method)
- Add 8 comprehensive tests for expand/split/write timing arrivals

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
The Metal kernel uses a double-buffered read pattern where t4_5 holds the current stage's data while the next stage's data is pre-loaded. The gate_delay extraction was incorrectly placed AFTER the t4_5 overwrite, causing it to read the next stage's padding slot instead of the current one. For single-stage designs (like inv_chain), this read garbage/zeros.

Fix: extract gate_delay from t4_5.c4 before overwriting t4_5.

Also fix arrival tracking to add gate_delay even for pass-through positions (orb == 0xFFFFFFFF) across all hierarchy levels, since pass-throughs can represent physical cells (e.g., inverter chains) with accumulated delays.

Also fix load_timing_from_sdf to iterate all cell origins per AIG pin instead of only the first, enabling correct delay accumulation for inverter chains collapsed to a single AIG wire.

Verified: inv_chain test produces correct 1323ps arrival delay matching the analytical SDF sum (CLK→Q=350ps + 16 inverters=973ps).

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
Suppress unused variable warnings (staged, num_srams, num_ios, num_dup, part_end) and remove dead assignments (offset before break, script_pi before break) that were cluttering build output. Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
- tb_cvc.v: CVC testbench with SDF annotation for inv_chain timing validation (expected total delay: 1323ps)
- inv_chain_stimulus.vcd: Input stimulus for timing VCD tests
- compare_vcd.py: VCD comparison script for Loom vs CVC output
- watchlist.json: Signal watchlist for timing_sim_cpu tracing
- CI workflow: CVC reference simulation job for automated validation

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
Dockerfile builds CVC (open-src-cvc) from source on linux/amd64 with gcc/binutils for its native code compilation. run_cvc.sh builds the image, runs the inv_chain testbench with SDF back-annotation, and compares against Loom's timing output.

Results: CVC reports 1235ps total delay vs Loom's 1323ps, an 88ps (7.1%) conservative overestimate. This is expected: Loom uses max(rise, fall) per cell since the GPU kernel processes 32 packed signals and cannot track per-signal transition direction. CVC tracks actual rise/fall transitions through the inverter chain.

The 88ps decomposes as:
  8 inverter stages × 10ps IOPATH rise/fall asymmetry = 80ps
  8 interconnect wires × 1ps rise/fall asymmetry = 8ps

Usage: bash tests/timing_test/cvc/run_cvc.sh

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
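The gap between a max(rise, fall) model and a direction-aware one can be illustrated with a small sketch. The `Delay` type and both functions are hypothetical, and the 60ps/50ps delays used in the note below are made-up values chosen only to give a 10ps asymmetry; they are not taken from the inv_chain SDF.

```rust
/// Rise/fall delay pair in picoseconds (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
struct Delay {
    rise_ps: u32,
    fall_ps: u32,
}

/// Conservative model: every cell contributes max(rise, fall).
fn conservative_sum(path: &[Delay]) -> u32 {
    path.iter().map(|d| d.rise_ps.max(d.fall_ps)).sum()
}

/// Direction-aware model for a chain of inverting cells:
/// the output transition direction alternates at every stage.
fn direction_aware_sum(path: &[Delay], first_output_rises: bool) -> u32 {
    let mut rising = first_output_rises;
    let mut total = 0;
    for d in path {
        total += if rising { d.rise_ps } else { d.fall_ps };
        rising = !rising; // an inverter flips the edge direction
    }
    total
}
```

With 16 inverters at an assumed 60ps rise and 50ps fall, the conservative sum is 960ps and the direction-aware sum is 880ps. Only half of the stages take the slower edge, so the difference is 8 × 10ps = 80ps, the same shape as the IOPATH term in the decomposition above.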
Add detailed section to timing-simulation.md covering the three independent sources of timing overestimation:

1. max(rise, fall) per cell: GPU can't track transition direction across 32 packed signals (80ps / 6.5% for inv_chain)
2. max wire delay across multi-input pins: single wire delay per cell regardless of which input is critical (8ps for inv_chain)
3. max arrival across 32 packed signals per thread: mitigated by timing-aware bit packing (0ps for inv_chain, larger in practice)

Documents CVC reference validation: Loom 1323ps vs CVC 1235ps (88ps / 7.1% conservative overestimate) for the inv_chain design. Updates implementation phases to reflect completed GPU arrival tracking and timing-aware VCD output.

Co-developed-by: Claude Code v2.1.62 (claude-opus-4-6)
40 outputs at 5 logic depths (3, 5, 9, 13, 17) exercise Source 3 overestimation in timing-aware bit packing. CVC reference shows distinct arrival times per group (513ps to 1286ps), confirming the conservative timing model. Includes hand-crafted SDF, stimulus VCD, CVC testbench, and Docker runner script. Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
The previous fallback logic used `find | sort -r | head -1` which grabbed a pre-PnR SDF (step 08) alphabetically instead of the post-PnR SDF from STAPostPNR (step 51) that includes interconnect delays. Now explicitly searches for stapostpnr nom_tt SDF first. Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
Adds a new --stimulus-vcd <path> CLI option to `loom cosim` that writes all primary input signals (clock, reset, flash MISO, constants) to a VCD file. This enables CVC reference simulation by replaying the exact same stimulus that the GPU cosim applied. When enabled, forces single-tick mode (batch=1) to read back GPU state after each cycle. Each tick produces two VCD timestamps (falling + rising edge) for correct clock waveform reconstruction. Change-based encoding minimizes file size by only writing transitions. Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
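Change-based encoding with two timestamps per tick might look roughly like the sketch below. Both functions are hypothetical and handle a single scalar signal. The edge ordering (falling edge at the start of the tick, rising edge at half the period) is an assumption, since the commit message only says that each tick yields a falling and a rising timestamp.

```rust
/// Emit VCD value-change lines for one scalar signal,
/// writing a timestamp and value only when the value changes.
fn encode_changes(id: &str, samples: &[(u64, bool)]) -> Vec<String> {
    let mut out = Vec::new();
    let mut last: Option<bool> = None;
    for &(time, value) in samples {
        if last != Some(value) {
            out.push(format!("#{time}"));
            out.push(format!("{}{id}", if value { '1' } else { '0' }));
            last = Some(value);
        }
    }
    out
}

/// Two samples per tick for the clock: low at the tick start,
/// high at half the period (assumed ordering).
fn clock_samples(ticks: u64, period: u64) -> Vec<(u64, bool)> {
    (0..ticks)
        .flat_map(|t| [(t * period, false), (t * period + period / 2, true)])
        .collect()
}
```

A signal such as reset that stays constant for many ticks then costs one entry per transition rather than two per tick, while the clock still changes at every timestamp.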
Recognize `dlymetal` prefix as a delay cell (same A→X interface as `dlygate`) and `diode` prefix as a non-functional cell (like fill/tap/decap) so post-PnR netlists containing these cells parse correctly. Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
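A prefix-based classification of this kind could be sketched as follows. `CellKind` and `classify_cell` are hypothetical names, and stripping the library prefix at the last `__` is an assumption about how the base cell name is obtained, not a description of the code in sky130.rs.

```rust
#[derive(Debug, PartialEq, Eq)]
enum CellKind {
    /// Buffer-like delay cell with an A -> X interface.
    Delay,
    /// Physical-only cell with no logic function.
    NonFunctional,
    /// Anything else (handled by the regular cell mapping).
    Other,
}

/// Classify a SKY130 cell type by the prefix of its base name.
fn classify_cell(cell_type: &str) -> CellKind {
    // Assumption: the base name follows the last "__",
    // e.g. "sky130_fd_sc_hd__diode_2" -> "diode_2".
    let base = cell_type.rsplit("__").next().unwrap_or(cell_type);

    const DELAY_PREFIXES: [&str; 2] = ["dlygate", "dlymetal"];
    const NON_FUNCTIONAL_PREFIXES: [&str; 4] = ["fill", "tap", "decap", "diode"];

    if DELAY_PREFIXES.iter().any(|p| base.starts_with(*p)) {
        CellKind::Delay
    } else if NON_FUNCTIONAL_PREFIXES.iter().any(|p| base.starts_with(*p)) {
        CellKind::NonFunctional
    } else {
        CellKind::Other
    }
}
```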
Sorts level-1 endpoint placement by logic level so signals with similar arrival times land in the same 32-slot groups. This tightens the conservative timing estimate by reducing intra-group level spread. Adds diagnostic logging of per-group timing spread statistics. Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
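The effect of sorting on intra-group spread can be shown with a small sketch. It treats each endpoint as just its logic level and packs endpoints into consecutive 32-slot groups. `group_spreads` and `sorted_by_level` are hypothetical helpers, and the real placement works on endpoints with more state than a single integer.

```rust
/// Number of signals packed into one group (32 bits per word).
const GROUP_SIZE: usize = 32;

/// Spread (max level minus min level) within each consecutive 32-slot group.
fn group_spreads(levels: &[u32]) -> Vec<u32> {
    levels
        .chunks(GROUP_SIZE)
        .map(|group| {
            let min = group.iter().copied().min().unwrap_or(0);
            let max = group.iter().copied().max().unwrap_or(0);
            max - min
        })
        .collect()
}

/// Order endpoints by logic level so similar levels share a group.
fn sorted_by_level(mut levels: Vec<u32>) -> Vec<u32> {
    levels.sort_unstable();
    levels
}
```

For 64 endpoints that alternate between levels 3 and 17, the unsorted placement gives a spread of 14 in both groups, while the sorted placement gives a spread of 0 in each. Since the slowest signal in a group sets the arrival for all 32, a smaller spread means less overestimation for the shallow signals.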
Update eda-infra-rs submodule to include support for parsing `assign y = ~(x)` in structural Verilog. This is needed for SKY130 post-PnR netlists that use bitwise NOT in assign statements. The parser adds a Not(Box<Wirexpr>) variant and the netlistdb builder synthesizes INV cells, which the existing AIG builder handles natively. Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
When the gemparts file is omitted, partitions are generated inline using the same mt-kahypar loop as `loom map`. This adds ~20s but removes the need for a separate mapping step during development. Refactored generate_partitions() and run_par() into setup.rs to share between map, sim, and cosim code paths. Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Stimulus VCD writer: use proper VCD reference + index format (e.g., "gpio_in [38]" instead of "gpio_in[38]") so the VCD roundtrips correctly through the vcd-ng parser for sim playback. SKY130: add INV cell pin mapping to sky130.rs for post-PnR netlists that contain inverter cells. Config: update MCU SoC sim config with timing section for SDF. Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
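The reference + index formatting can be sketched with a small helper. `vcd_reference` is a hypothetical function that assumes plain names with at most one trailing bit index; escaped identifiers and ranges are not handled.

```rust
/// Turn a flat signal name such as "gpio_in[38]" into the VCD
/// "reference index" form "gpio_in [38]". Names without a trailing
/// index are returned unchanged.
fn vcd_reference(name: &str) -> String {
    match name.rfind('[') {
        Some(pos) if pos > 0 && name.ends_with(']') => {
            format!("{} {}", &name[..pos], &name[pos..])
        }
        _ => name.to_string(),
    }
}
```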
Infrastructure for comparing Loom GPU simulation against CVC (open-source event-driven simulator) with SDF back-annotation on the MCU SoC SKY130 post-PnR netlist.

Workflow:
1. loom cosim --stimulus-vcd captures primary inputs
2. convert_stimulus.py converts VCD to Verilog assignments
3. gen_cell_models.py generates SKY130 behavioral + specify models
4. strip_sdf_checks.py preprocesses SDF for CVC compatibility
5. CVC runs with SDF timing via Docker (run_cvc.sh)
6. compare_outputs.py compares gpio_out waveforms

Key fixes for CVC compatibility:
- Wire _delayed signals directly to inputs in behavioral models
- Initialize DFF UDPs to 0 (matching Loom's initialization)
- Strip TIMINGCHECK/INTERCONNECT from SDF
- Remove empty DELAY blocks and escaped-$ CELL entries
- Add specify blocks to sized wrapper modules
- Behavioral CF_SRAM_1024x32 model with per-bit specify paths

Result: 100% match on CPU-driven GPIO outputs (bits 6-43); sub-cycle SPI flash differences are expected due to sampling.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
The separate `loom map` step added complexity without meaningful benefit since partitioning is fast (~20s). Sim and cosim now always generate partitions at startup, simplifying the workflow from two steps to one.

Changes:
- Remove Map subcommand, MapArgs, and cmd_map from loom.rs
- Remove gemparts field from DesignArgs and SimArgs/CosimArgs
- Make generate_partitions() and run_par() private to setup.rs
- Update CI to remove loom map steps and gemparts args
- Delete checked-in .gemparts files (no longer needed)
- Update all documentation to reflect single-step workflow

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
25a9dfd to
b070eb9
## Summary

- Add `--timing-vcd` flag (requires `--sdf`) that produces timing-accurate VCD output where signal transitions are offset from clock edges by their computed arrival times
- GPU kernels (Metal/CUDA) write `shared_writeout_arrival` to global memory via a new arrival state section alongside values and xmask

## Details

The GPU kernel already computes per-gate arrival times for setup/hold violation checking, but discards them after each partition. This PR adds an opt-in sideband (`arrival_state_offset` in `SimParams`) that writes arrival times to global memory. A new `write_output_vcd_timed()` function then offsets each signal transition from its clock edge by the arrival time in picoseconds.

State buffer layout when enabled:

`[values (rio) | xmask (rio, if xprop) | arrivals (rio)]`

Files changed:

- `csrc/kernel_v1.metal`, `csrc/kernel_v1_impl.cuh`: arrival write + SimParams update
- `src/flatten.rs`: `timing_arrivals_enabled`, `arrival_state_offset`, updated `effective_state_size()`
- `src/sim/vcd_io.rs`: `expand_states_for_arrivals()`, `split_arrival_states()`, `write_output_vcd_timed()`
- `src/bin/loom.rs`: CLI flag, SimParams wiring, timed VCD dispatch

## Test plan

- `cargo test`: 97 tests pass (3 new timing arrival tests)
- `cargo build -r --features metal --bin loom`: Metal shader compiles
- […] `--sdf` and `--timing-vcd`, compare against CVC reference output
- […] `--timing-vcd`
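The offsetting step described under Details can be sketched as a small piece of arithmetic. The two functions are hypothetical, and rounding to the nearest timescale unit is an assumption; the PR text only states that arrivals are in picoseconds and that a ps-to-timescale conversion exists.

```rust
/// Convert picoseconds to VCD timescale units, rounding to nearest.
/// `timescale_ps` is the timescale expressed in picoseconds (1 for 1ps,
/// 1000 for 1ns) and must be non-zero.
fn ps_to_timescale(ps: u64, timescale_ps: u64) -> u64 {
    (ps + timescale_ps / 2) / timescale_ps
}

/// Timestamp of a transition: the clock edge plus the converted arrival.
/// `edge_time` is already expressed in timescale units.
fn timed_timestamp(edge_time: u64, arrival_ps: u64, timescale_ps: u64) -> u64 {
    edge_time + ps_to_timescale(arrival_ps, timescale_ps)
}
```

With a 1ps timescale, a transition with the 1323ps inv_chain arrival on a clock edge at time 10000 would be written at 11323. With a coarser 1ns timescale the same arrival rounds to a single unit, so sub-cycle detail is only preserved when the VCD timescale is fine enough.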