Reference for all file schemas used by the evaluation framework. The schemas should be identical for host-based and docker-based model evaluation.
Abbreviation: CDB -- CausalDriveBench
- The following constants determine the file organization.
- Set these directory paths in eval_config.yaml to the corresponding directories on the host environment, based on the descriptions below.
- The docker path values are constants for the expected volume mounts in the docker environment.
- Instructions to the user:
  - Create a local copy of eval_config.yaml.
  - When running evaluation scripts directly on the host, update your local copy to replace the default values.
- The EVAL_RUNTIME environment variable decides which paths are used during inference, so that the evaluation scripts run unchanged on both the host environment and the docker environment.
- evaluation/common/utils/config_reader.py reads the host configs from eval_config.yaml and creates a Namespace mapping that the evaluation code files use to access the relevant paths at runtime.
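The runtime selection described above could be sketched roughly as follows. This is only an illustration, not the actual config_reader.py: the helper name `resolve_paths`, the trimmed `example_dir_maps`, and the use of `types.SimpleNamespace` as the Namespace type are all assumptions.

```python
import os
from types import SimpleNamespace

# Hypothetical, trimmed-down mapping in the same shape as dict_dir_maps.
example_dir_maps = {
    "workspace": {"host": "/path/to/workspace", "docker": "/workspace"},
    "dir_data_root": {"host": "/path/to/data/root", "docker": "/workspace/data"},
}


def resolve_paths(dir_maps, runtime=None):
    """Pick the host or docker path for every key (hypothetical helper)."""
    # Fall back to the EVAL_RUNTIME environment variable, defaulting to "host".
    runtime = runtime or os.environ.get("EVAL_RUNTIME", "host")
    if runtime not in ("host", "docker"):
        raise ValueError(f"EVAL_RUNTIME must be 'host' or 'docker', got {runtime!r}")
    return SimpleNamespace(**{key: entry[runtime] for key, entry in dir_maps.items()})


paths = resolve_paths(example_dir_maps, runtime="docker")
print(paths.dir_data_root)  # /workspace/data
```

Calling code would then read `paths.dir_data_root` without caring which runtime is active.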
import os
EVAL_RUNTIME = "host" # or "docker"
# All the key value pairs in this dict should follow the same structure
# "dir_key": {"host": "host_path", "docker": "docker_path", "description": "describe what this path is for"}
dict_dir_maps = {
"workspace": {
"host": "/path/to/workspace",
"docker": "/workspace",
"description": "Path to your workspace."
}
}
dict_dir_maps["dir_data_root"] = {
"host": "/path/to/data/root",
"docker": os.path.join(dict_dir_maps["workspace"]["docker"], "data"),
"description": "Path to the data files root"
}
dict_dir_maps["dir_raw_data"] = {
"host": "/path/to/raw/data",
"docker": os.path.join(dict_dir_maps["dir_data_root"]["docker"], "raw_data"),
"description": "Path to the raw_data root folder. This has subfolders for the opensource datasets."
}
dict_dir_maps["dir_causal_drive_bench"] = {
"host": "/path/to/causal/drive/bench",
"docker": os.path.join(dict_dir_maps["dir_data_root"]["docker"], "causal_drive_bench"),
"description": "Path to the causal_drive_bench root folder. This has subfolders for the CausalDriveBench data annotations."
}
dict_dir_maps["dir_cdb_outputs"] = {
"host": "/path/to/cdb/outputs",
"docker": os.path.join(dict_dir_maps["dir_data_root"]["docker"], "cdb_outputs"),
"description": "Path to the cdb_outputs root folder. This has subfolders for the CausalDriveBench inference outputs."
}
dict_dir_maps["dir_cdb_reports"] = {
"host": "/path/to/cdb/reports",
"docker": os.path.join(dict_dir_maps["dir_data_root"]["docker"], "cdb_reports"),
"description": "Path to the cdb_reports root folder. This has subfolders for the CausalDriveBench evaluation reports."
}
dict_dir_maps["dir_cdb_eval_repo"] = {
"host": "/path/to/parent/of/evaluation",
"docker": os.path.join(dict_dir_maps["workspace"]["docker"], "repo"),
"description": "Path to the parent directory of evaluation code."
}
dict_dir_maps["dir_model_code"] = {
"host": "/path/to/model/code",
"docker": os.path.join(dict_dir_maps["workspace"]["docker"], "model"),
"description": "Path to the model's code root folder. This has model's codebase that is being evaluated."
}
dict_dir_maps["dir_model_weights"] = {
"host": "/path/to/model/weights",
"docker": os.path.join(dict_dir_maps["workspace"]["docker"], "weights"),
"description": "Path to the model's trained weights root folder. This could also refer to HF_HOME cache folder, etc."
}
The raw data directory is based on the 3 open-source datasets listed below. Each dataset keeps its own folder structure, which does not affect the evaluation framework. These files are mainly referenced from frames.json so that the correct image files can be read.
<RAW_DATA_DIR>/ # e.g. /raw_data
├── nuscenes/
├── argoverse2/
└── openscene/
For each subfolder in the dir_raw_data directory above (e.g. the 3 open-source datasets), there is a corresponding causal_<dataset> folder with the expanded structure shown below.
<CAUSAL_BENCH_DIR>/ # e.g. evaluation/sample_data/
├── causal_nuscenes/
│ ├── manifest.json # Metadata about all the SAMPLED instances
│ ├── nuscenes-scene-<...>/ # e.g. nuscenes-scene-0001/
│ ├── nuscenes-scene-<...>/
│ └── nuscenes-scene-<...>/
│ ├── SAMPLED_x/
│ ├── SAMPLED_.../
│ └── SAMPLED_<N>/
├── causal_argoverse2/
│ ├── manifest.json
│ ├── argoverse2-<id1>/ # e.g. argoverse2-scene-0001/
│ ├── argoverse2-<id...>/
│ └── argoverse2-<id-N>/
│ ├── SAMPLED_x/
│ ├── SAMPLED_.../
│ └── SAMPLED_<N>/
└── causal_openscene/
├── manifest.json
├── openscene-<id1>/ # e.g. openscene-scene-0001/
├── openscene-<id...>/
└── openscene-<id-N>/
├── SAMPLED_x/
├── SAMPLED_.../
└── SAMPLED_<N>/
-----------
# Each SAMPLED_<N> folder above has the following folder structure of the CausalDriveBench
# Refer to the evaluation/sample_data files for example implementations.
<bench-dir-sample-folder> # e.g. /causal_drive_bench/nuscenes-scene-0001/SAMPLED_0/
├── frames.json — per-frame image paths keyed by frame name
├── state.json — ego + agent state at T=0 (ego-centric history)
├── calib.json — camera calibration matrices
├── meta.json — scene metadata (dataset, split, tokens)
├── graph.json — pruned causal graph (falls back to raw)
└── qa — all benchmark question JSON
├── active_qa.json — ladder / active questions
├── dormant_qa.json — dormant link questions
└── distractor_qa.json — distractor node questions
Notes:
- The sampled folders are not sequential and some folders might be malformed.
- qa (subfolder)
  - Has the following 3 question-answer JSON files, generated from the scene graph:
    - active_qa.json
    - dormant_qa.json
    - distractor_qa.json
  - Some of the individual _qa.json files may be missing.
  - A file that exists may still be malformed or contain no QA pairs.
- Each _qa.json file has the following structure:
  - A "questions" key whose value is a list of question metadata dictionaries.
  - A question metadata dictionary is valid if it has:
    - REQUIRED
      - "id" -- question ID in the sample
      - "question" -- string
      - "answer_format" -- binary or mcq
      - "options" -- A.../B.../C.../D... etc. if answer_format is mcq, else Yes/No for binary
      - "correct_answer" -- Yes/No for binary, or the option letter (e.g. A) for mcq
      - "reasoning" -- the ground-truth reasoning text
    - OPTIONAL
      - Other metadata key-value pairs
  - If a question is invalid or cannot be processed, raise an appropriate exception and skip that question.
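The validity rules above could be enforced along these lines. This is a minimal sketch, not the framework's loader: `InvalidQuestionError`, `validate_question`, and `load_valid_questions` are hypothetical names, and treating a missing or malformed file as an empty question list is an assumption.

```python
import json

REQUIRED_KEYS = ("id", "question", "answer_format", "options", "correct_answer", "reasoning")


class InvalidQuestionError(ValueError):
    """Hypothetical exception for a question that fails validation."""


def validate_question(question):
    """Return the question unchanged if it has all required fields."""
    if not isinstance(question, dict):
        raise InvalidQuestionError("question entry is not a dictionary")
    missing = [key for key in REQUIRED_KEYS if key not in question]
    if missing:
        raise InvalidQuestionError(f"missing required keys: {missing}")
    if question["answer_format"] not in ("binary", "mcq"):
        raise InvalidQuestionError(f"unknown answer_format: {question['answer_format']!r}")
    if question["answer_format"] == "mcq" and not question["options"]:
        raise InvalidQuestionError("mcq question has no options")
    return question


def load_valid_questions(qa_path):
    """Load one _qa.json file; a missing or malformed file yields an empty list."""
    try:
        with open(qa_path, encoding="utf-8") as handle:
            payload = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return []
    questions = payload.get("questions") if isinstance(payload, dict) else None
    if not isinstance(questions, list):
        return []
    valid = []
    for question in questions:
        try:
            valid.append(validate_question(question))
        except InvalidQuestionError as exc:
            # Report the problem, then skip only this question.
            print(f"Skipping question in {qa_path}: {exc}")
    return valid
```

Raising inside `validate_question` and catching in the loader keeps one bad question from discarding the rest of the file.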
- This layout mirrors the CDB benchmark directory layout above.
- During evaluation, the corresponding JSONL files are created for each SAMPLED directory.
<CAUSAL_BENCH_OUTPUTS_DIR>/ # e.g. /cdb_outputs/
├── <model-id-1>_<timestamp>/ # e.g. alpamayo_1_0_20260409_171324
│ ├── causal_nuscenes/
│ │ ├── nuscenes-scene-<...>/ # e.g. nuscenes-scene-0001/
│ │ ├── nuscenes-scene-<...>/
│ │ └── nuscenes-scene-<...>/
│ │ ├── SAMPLED_x/
│ │ ├── SAMPLED_.../
│ │ └── SAMPLED_<N>/
│ ├── causal_argoverse2/
│ │ ├── argoverse2-<id1>/ # e.g. argoverse2-72737475-7678-79/
│ │ ├── argoverse2-<id...>/
│ │ └── argoverse2-<id-N>/
│ │ ├── SAMPLED_x/
│ │ ├── SAMPLED_.../
│ │ └── SAMPLED_<N>/
│ └── causal_openscene/
│ ├── openscene-<id1>/ # e.g. openscene-2021.06.14.17.26.26_veh-38_027/
│ ├── openscene-<id...>/
│ └── openscene-<id-N>/
│ ├── SAMPLED_x/
│ ├── SAMPLED_.../
│ └── SAMPLED_<N>/
└── <model-id-2>_<timestamp>/
├── causal_nuscenes/
├── causal_argoverse2/
└── causal_openscene/
-----------
# Each SAMPLED_<N> folder above has the following folder structure of the CausalDriveBench
<output-bench-dir-sample-folder> # e.g. /cdb_outputs/nuscenes-scene-0001/SAMPLED_0/
├── prompts.jsonl — inference input -- one line per question
└── outputs.jsonl — inference output -- one line per question
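As an illustration of the layout above, the output location for one sample could be derived as follows. The helper names are hypothetical, and the `%Y%m%d_%H%M%S` timestamp format is inferred from the `alpamayo_1_0_20260409_171324` example rather than confirmed by the framework.

```python
from datetime import datetime
from pathlib import PurePosixPath


def make_run_name(model_id, now=None):
    """Build '<model-id>_<timestamp>' (timestamp format is an assumption)."""
    now = now or datetime.now()
    return f"{model_id}_{now.strftime('%Y%m%d_%H%M%S')}"


def output_sample_dir(outputs_root, run_name, dataset, scene_id, sample_id):
    """Mirror the benchmark layout underneath the run directory."""
    return PurePosixPath(outputs_root) / run_name / dataset / scene_id / sample_id


run_name = make_run_name("alpamayo_1_0", datetime(2026, 4, 9, 17, 13, 24))
sample_dir = output_sample_dir(
    "/cdb_outputs", run_name, "causal_nuscenes", "nuscenes-scene-0001", "SAMPLED_0"
)
print(sample_dir / "prompts.jsonl")
# /cdb_outputs/alpamayo_1_0_20260409_171324/causal_nuscenes/nuscenes-scene-0001/SAMPLED_0/prompts.jsonl
```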
The files below are documented using the bundled sample data under evaluation/sample_data/.
frames.json stores relative image paths for each temporal frame. The data_root field is
ignored by the evaluation framework; use CDBConfig.resolve_frame_path() instead.
{
"data_root": "/workspace/data",
"frames": {
"Tm1p5": {
"cam_front": "raw_data/nuscenes/samples/CAM_FRONT/<file>.jpg",
"cam_front_right": "raw_data/nuscenes/samples/CAM_FRONT_RIGHT/<file>.jpg",
"cam_front_left": "raw_data/nuscenes/samples/CAM_FRONT_LEFT/<file>.jpg",
"cam_back": "raw_data/nuscenes/samples/CAM_BACK/<file>.jpg"
},
"Tm1p0": { "...": "..." },
"Tm0p5": { "...": "..." },
"Tp0p0": { "...": "..." }
}
}
Camera keys vary by source dataset:
- nuscenes: cam_front, cam_front_left, cam_front_right, cam_back
- argoverse2: cam_front, cam_front_left, cam_front_right, cam_back_left, cam_back_right
- openscene: cam_front, cam_front_left, cam_front_right, cam_back
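The sketch below shows one way the frames mapping could be walked in temporal order. It is not the CDBConfig.resolve_frame_path() implementation: the helper names, the example file names, and the reading of time keys such as Tm1p5 as "T minus 1.5" and Tp0p0 as "T plus 0.0" are all assumptions.

```python
import re
from pathlib import PurePosixPath

# Assumed encoding: 'T', then 'm' (minus) or 'p' (plus), then '<whole>p<fraction>'.
_TIME_KEY = re.compile(r"^T([mp])(\d+)p(\d+)$")


def parse_time_key(time_key):
    """Convert e.g. 'Tm1p5' to -1.5 and 'Tp0p0' to 0.0."""
    match = _TIME_KEY.match(time_key)
    if match is None:
        raise ValueError(f"unrecognised time key: {time_key!r}")
    sign, whole, fraction = match.groups()
    value = float(f"{whole}.{fraction}")
    return -value if sign == "m" else value


def ordered_frame_paths(frames_json, data_root):
    """Return (time_key, camera_key, absolute_path) tuples, oldest frame first.

    The data_root stored inside frames.json is deliberately ignored;
    the caller supplies the root for the current runtime.
    """
    rows = []
    for time_key in sorted(frames_json["frames"], key=parse_time_key):
        for camera_key, relative_path in frames_json["frames"][time_key].items():
            rows.append((time_key, camera_key, str(PurePosixPath(data_root) / relative_path)))
    return rows


# Hypothetical two-frame example with made-up file names.
example_frames = {
    "data_root": "/workspace/data",
    "frames": {
        "Tp0p0": {"cam_front": "raw_data/nuscenes/samples/CAM_FRONT/b.jpg"},
        "Tm1p5": {"cam_front": "raw_data/nuscenes/samples/CAM_FRONT/a.jpg"},
    },
}
for row in ordered_frame_paths(example_frames, "/path/to/data/root"):
    print(row)
```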
state.json holds the ego vehicle and agent states at T=0 (ego-centric frame). Structure varies
slightly by dataset. Ego history is extracted by ModelDataLoader.interpolate_ego_history().
calib.json holds the camera calibration matrices (intrinsics, extrinsics) for all cameras. It is used by models that need camera-to-ego transformations.
meta.json holds the scene metadata: dataset name, split, source tokens, timestamp.
graph.json holds the pruned causal scene graph used to generate the QA questions.
Each question file under qa/ has the following structure:
{
"questions": [
{
"id": "Q1",
"question": "Which element is currently preventing you from proceeding?",
"answer_format": "mcq",
"options": [
"A) The construction worker on the crosswalk",
"B) The SUV stopped behind you",
"C) The construction barriers on the right",
"D) The traffic signal ahead"
],
"correct_answer": "A",
"reasoning": "You are stopped because people are crossing directly ahead…",
"graph_structure": "Direct",
"question_type": "CaI",
"rung": 0,
"meta": { "...": "..." }
}
]
}
For binary questions, options is null and correct_answer is "Yes" or "No".
- prompts.jsonl has one JSON object per line, one line per question.
- Written by DataLoader.save_prompts().
- is_evaluated is set to false when written. Resume detection uses the presence of the corresponding question_id in outputs.jsonl rather than this flag, keeping GPU inference unblocked by file I/O.
- qa_text contains the question plus a format hint only, with no ground-truth answer or reasoning, to prevent answer leakage into the model prompt. It is generated by ModelDataLoader._format_inference_text().
- question_text is the raw question string without any formatting.
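To make the leakage rule above concrete, here is a rough sketch of building qa_text from a question entry. It is not the real ModelDataLoader._format_inference_text(): the function name `build_qa_text`, the way options are listed, and the mcq format hint are assumptions. Only the binary hint ("Format: Answer: Yes or No") follows the example record.

```python
def build_qa_text(question):
    """Build model-facing text from a QA entry without leaking the answer.

    Only "question", "answer_format" and "options" are read; "correct_answer"
    and "reasoning" are never touched, so they cannot reach the prompt.
    """
    lines = [f"Question: {question['question']}"]
    if question["answer_format"] == "binary":
        hint = "Answer: Yes or No"
    else:
        # Assumed mcq handling: list the options, then hint at the letters.
        lines.extend(question["options"])
        letters = [option.split(")")[0].strip() for option in question["options"]]
        hint = "Answer: one of " + ", ".join(letters)
    return "\n".join(lines) + f"\n\nFormat: {hint}"


# Hypothetical binary question used only for illustration.
example = {
    "question": "Is the silver sedan blocking your lane?",
    "answer_format": "binary",
    "options": None,
    "correct_answer": "No",
    "reasoning": "The sedan is parked outside the ego lane.",
}
print(build_qa_text(example))
```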
Minimum required fields (framework contract):
{
"scene_id": "openscene-2021.06.14.17.26.26_veh-38_027",
"sample_id": "SAMPLED_2",
"question_id": "DQ1",
"prompt_id": "<sequence-id-in-this-json-file>",
"is_evaluated": "false",
"question_json_file": "dormant_qa.json",
"answer_format": "binary",
"question_text": "Question: Is the silver sedan...",
"qa_text": "Question: Is the silver sedan... \n\nFormat: Answer: Yes or No",
"image_paths": [
# list of image path metadata of following dict format based on the model input format
{
"path": "raw_data/openscene/2021.06.14.17.26.26_veh-38_02740_03036/CAM_R0/a04d050b77765886.jpg",
"time_key": "Tm1p0",
"camera_key": "cam_front_right"
},
{
"path": "...",
"time_key": "...",
"camera_key": "..."
},
...
]
}
Model-specific additions (alpamayo_1_0 example):
{
"ego_history_xyz": [
[
-5.7,
1.7,
0.0
],
[
-3.9,
0.8,
0.0
],
[
-2.0,
0.2,
0.0
],
[
0.0,
0.0,
0.0
]
],
"ego_history_rot": [
[
[
1,
0,
0
],
[
0,
1,
0
],
[
0,
0,
1
]
],
...
]
}
All fields must be JSON-serialisable (no numpy arrays, no PIL.Image objects).
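The resume detection described for prompts.jsonl could look roughly like this. The helper names are hypothetical, and the sketch assumes question_id is unique within one SAMPLED directory. Serialising with json.dumps before opening the file also surfaces non-serialisable fields as a TypeError without writing a partial line.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSONL file into a list of dicts; a missing file yields []."""
    path = Path(path)
    if not path.exists():
        return []
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def pending_prompts(prompts_path, outputs_path):
    """Return prompts whose question_id is not yet present in outputs.jsonl."""
    done = {record["question_id"] for record in read_jsonl(outputs_path)}
    return [p for p in read_jsonl(prompts_path) if p["question_id"] not in done]


def append_record(jsonl_path, record):
    """Append one record; raises TypeError if a field is not JSON-serialisable."""
    line = json.dumps(record)  # serialise first so a bad record writes nothing
    with open(jsonl_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

A resumed run would then call `pending_prompts(...)` once per SAMPLED directory and run inference only on the returned prompts.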
- outputs.jsonl has one JSON object per line, one line per question.
- Written by ModelInference.run_from_jsonl().
- The raw_output from the model is postprocessed to generate the metrics.
- Depending on the model, there might be different kinds of raw_output.
{
"scene_id": "openscene-2021.06.14.17.26.26_veh-38_027",
"sample_id": "SAMPLED_2",
"question_id": "DQ1",
"prompt_id": "<from-prompts-jsonl>",
"raw_output": {
"text": "Answer: <model-answer>\nReasoning: <model-reasoning ... e.g. The truck is parked and not moving...>",
"waypoints": [], # optional other raw_output
"other_key": ..., # serializable output
},
"inference_time_s": 2.34,
"timestamp": "2026-04-04T10:00:00Z"
}
report.json is written by the model-specific postprocessor (e.g. alpamayo_1_0/postprocess.py).
Per-sample reports live at <run_dir>/<dataset>/<scene_id>/<sample_id>/report.json.
{
"schema_version": "1.0",
"generated_at": "2026-04-17T21:16:27.873834+00:00",
"level": "sample",
"run_name": "alpamayo_1_0_20260417_190704",
"dataset": "causal_nuscenes",
"scene_id": "nuscenes-scene-0007",
"sample_id": "SAMPLED_0",
"n_questions": 22,
"metrics": {
"overall": {
"accuracy": 0.82,
"n": 22,
"correct": 18
},
"per_qa_type": {
"ladder": {"accuracy": 0.80, "n": 5, "correct": 4},
"dormant": {"accuracy": 0.89, "n": 9, "correct": 8},
"distractor": {"accuracy": 0.75, "n": 8, "correct": 6}
},
"confusion": {
"matrix": {
"Yes": {"Yes": 10, "No": 2},
"No": {"Yes": 2, "No": 8}
},
"most_confused": [
{"true": "Yes", "predicted": "No", "count": 2},
{"true": "No", "predicted": "Yes", "count": 2}
]
}
},
"qa_results": [
{
"question_id": "CI1",
"qa_type": "distractor",
"answer_format": "binary",
"question_text": "Is the white van in the adjacent lane to your right responsible for your current speed?",
"predicted": "No",
"ground_truth": "No",
"correct": true,
"raw_output_text": "No, the white van is not responsible for my current speed.",
"inference_time_s": 2.34
}
]
}
Notes:
- level is always "sample" for per-sample reports; higher-level aggregate reports use "dataset" or "run".
- metrics contains the nested accuracy / confusion data; qa_results holds one entry per question with full traceability.
- raw_output_text is the extracted text from the model's raw_output dict (the "text" key).
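The metrics block of a per-sample report could be derived from qa_results along the following lines. This is a sketch under assumptions, not the actual postprocessor: the function names are hypothetical, and rounding accuracy to two decimals and sorting most_confused by count are guesses based on the example report.

```python
from collections import Counter, defaultdict


def accuracy_block(results):
    """Summarise a list of qa_results entries as accuracy / n / correct."""
    n = len(results)
    correct = sum(1 for result in results if result["correct"])
    return {"accuracy": round(correct / n, 2) if n else 0.0, "n": n, "correct": correct}


def compute_sample_metrics(qa_results):
    """Build overall, per_qa_type and confusion entries from qa_results."""
    by_type = defaultdict(list)
    matrix = defaultdict(Counter)  # matrix[ground_truth][predicted] -> count
    for result in qa_results:
        by_type[result["qa_type"]].append(result)
        matrix[result["ground_truth"]][result["predicted"]] += 1
    most_confused = [
        {"true": truth, "predicted": predicted, "count": count}
        for truth, row in matrix.items()
        for predicted, count in row.items()
        if truth != predicted
    ]
    most_confused.sort(key=lambda entry: entry["count"], reverse=True)
    return {
        "overall": accuracy_block(qa_results),
        "per_qa_type": {name: accuracy_block(items) for name, items in by_type.items()},
        "confusion": {
            "matrix": {truth: dict(row) for truth, row in matrix.items()},
            "most_confused": most_confused,
        },
    }
```

With the example report above, 18 correct answers out of 22 questions gives an overall accuracy of 0.82 after rounding.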
| File / Field | Written by | Read by | Notes |
|---|---|---|---|
| frames.json | Dataset pipeline | BenchmarkDataset, DataLoader | Ignore data_root; use CDBConfig.resolve_frame_path() |
| qa/<type>_qa.json | Dataset pipeline | BenchmarkDataset, Postprocessor | Missing files are silently skipped |
| prompts.jsonl | ModelDataLoader.save_prompts() | ModelInference.run_from_jsonl() | One per SAMPLED_N dir |
| outputs.jsonl | ModelInference.run_from_jsonl() | ModelPostprocessor.compute_metrics() | One per SAMPLED_N dir; appended for resume |
| report.json | Model postprocessor (postprocess.py) | Orchestrator, HTML reporter, human review | Per sample; aggregated at dataset/run level |
| inference.log | inference.py logging setup | Developer debugging | Per run, at run_dir root |