docker build -t verilogeval:v2 .
docker run -it --name your-perfect-dokcer-name -v $(pwd):/workspace/verilogeval verilogeval:v2
This repository provides an end-to-end evaluation framework that spans Verilog generation, Verilog debugging, reference model generation, and utility tooling. Conceptually, it is organized into four main components:
- Verilog Gen: datasets for generating Verilog designs with different structures and difficulty levels (self‑contain / non‑self‑contain / CPU IP).
- Verilog Debugging: buggy RTL variants and their associated
prompt.txt / ref.sv / test.svfor 0‑shot and 1‑shot debugging tasks. - Ref Model Gen: cross‑language functional reference models (Python / CXXRTL / SystemC) for the same set of problems.
- Tool Box: utilities for verification and data generation, such as cross‑language consistency checking and reference‑model‑based testbench generation.
Below we describe each component and its directory.
Directory: Verilog Gen/
-
dataset_self_contain/- Each problem is a self‑contained module with a single top module (e.g.,
TopModule) and no external submodule dependencies. - Contains challenging combinational and sequential logic problems, suitable for evaluating spec‑to‑RTL generation.
- Each problem is a self‑contained module with a single top module (e.g.,
-
dataset_not_self_contain/- Each problem is a non‑self‑contained design where a
Top Moduleinstantiates multipleSub‑Modules. - Evaluates the model’s ability to handle hierarchical designs and module interfaces.
- Each problem is a non‑self‑contained design where a
-
dataset_cpu_ip/- Contains more complex CPU IP–level problems, such as ALU, register file, branch, and control logic.
- Corresponds to the “CPU IP / ALU / Reg File / Branch / Control ...” region in the high‑level figure and targets high‑difficulty RTL generation.
Directory: Verilog Debugging/
Each subdirectory corresponds to a specific debugging setting (0‑shot / 1‑shot) and bug type (state machine / timing / arithmetic / assignment), matching the “Verilog Debugging” block in the figure.
-
Bug types:
- Timing Bug: timing‑related issues.
- Assignment Bug: incorrect blocking / non‑blocking or combinational assignments.
- Arithmetic Bug: incorrect arithmetic operations.
- State Machine Bug: wrong state transitions or output logic.
-
Shot settings:
dataset_debug_zero_shot_*: 0‑shot debugging, without in‑context examples.dataset_debug_one_shot_*: 1‑shot debugging, with one in‑context example problem.
Typical file triplet for each problem:
*_prompt.txt: prompt given to the LLM (natural language + code context) for generating or fixing Verilog.*_ref.sv: reference Verilog implementation (bug‑free).*_test.sv: testbench for automatic simulation and checking.
For example, in dataset_debug_one_shot_state_machine/:
Prob003_non-overlapping_sequence_detect_prompt.txtProb003_non-overlapping_sequence_detect_ref.svProb003_non-overlapping_sequence_detect_test.sv
form a non‑overlapping sequence detector state‑machine debugging task.
Directory: Ref Model Gen/
This component generates functional reference models, corresponding to the “Ref Model Gen” block in the figure:
gen.py: uses the specifications indata.jsonlto generate reference models in multiple backends (Python / CXXRTL / SystemC).gen_python_prompt.txt/gen_cxxrtl_prompt.txt/gen_systemc_prompt.txt: prompt templates for each backend.data.jsonl: unified description of problems and specifications.key.cfg: sample API‑key configuration (do not commit real keys).
Example generated reference models live under Tool Box/:
Tool Box/python/dut.pyTool Box/cxxrtl/dut.ccTool Box/systemc/dut.cc
These reference models can be used to cross‑check the Verilog designs.
Directory: Tool Box/
This corresponds to the “Tool Box” block and mainly provides:
-
Cross‑language consistency checking
crosslang_verify.py: verifies that multiple implementations (Python / CXXRTL / SystemC / Verilog) of the same task behave identically, and reports results.verilog/: containsdut.sv(design under test) andref.sv(reference RTL).python/,cxxrtl/,systemc/: language‑specific reference models.
-
Testbench / data generation
- By combining the reference models and scripts, you can batch‑generate testbenches and training data for different problems.
Directory: scripts/
sv-generate: unified LLM Verilog generation / debugging script.- Supports multiple backends (OpenAI, DeepSeek, Gemini, Claude, Together, local vLLM server).
--taskselects the high‑level prompting style, e.g.:code-complete-iccad2023: complete the body ofTopModule.spec-to-rtl: generate RTL directly from problem specification.
--examplesand--rulescontrol few‑shot examples and coding conventions.
- Other files such as
verilog-example-prefix_*.txtandprompt-example-prefix.txtprovide prefix examples for different tasks / shot settings.
mkdir -p build/
MODEL_NAME="gpt-5.2" # change this to your model
TASK_NAME="nowcoder" # change this to your task
./configure --with-model=$MODEL_NAME --with-task=$TASK_NAME
make
mkdir -p .save && mv Prob* .save/
The evalution harness is run using make and various evaluation parameters can be set as below:
mkdir -p build/
./configure --with-task=$task --with-model=$model --with-examples=$shots --with-samples=$samples --with-temperature=$temperature --with-top-p=$top_p
make
Evaluation can be sped up by providing the -j flag to make, such as -j4 to run 4 worker processes.
Valid models are listed at the top of scripts/sv-generate. The number of in-context learning examples can be between 0-4, and given with --with-examples. Samples to collect per problem are given by --with-samples. Finally, model temperature and top_p can be set to --with-temperature and --with-top-p, respectively.
These parameters can be easily swept with a shell script, to create separate build directories for each evaluation harness configuration target.