End-to-end pipeline for Spider2-Snowflake: environment setup, data download, metadata preprocessing, multi-program inference, majority voting, and evaluation under both pass@k and pass@1.
💡 The same instructions apply to the SQLite subset (Spider2-Lite). Download `spider2-lite` from the official repo and pass `--db_path datasets/Spider2/spider2-lite/resource/databases/sqlite --db_type sqlite` to the agent script in place of the Snowflake equivalents.
Given a natural-language question and a Snowflake schema, the pipeline:
- Cleans and enriches the released Spider2 metadata so schema linking sees accurate column casing, no all-null columns, and concrete example values.
- Generates `k` candidate programs per question with a hierarchical schema-linking + 2-step batched planning agent, executed against Snowflake (a conceptual sketch of the table-shortlisting step follows this list).
- Reduces the `k` candidates to a single answer per question via table-equivalence majority voting, and reports both pass@k and pass@1.
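For intuition, here is a minimal sketch of the table-shortlisting step that hierarchical schema linking performs. The report shape and the function name are assumptions for illustration; the actual agent in `src/main.py` does considerably more than this.

```python
import json

def shortlist_tables(report_path: str, question_id: str, top_k: int = 8) -> list[str]:
    """Keep only the top_k tables most similar to the question (hypothetical helper)."""
    with open(report_path) as f:
        # Assumed report shape: {question_id: {table_name: similarity_score, ...}}.
        report = json.load(f)
    scores = report[question_id]
    # Mirrors the --planning_top_k flag: only the best-matching tables are
    # passed on as candidates to the planning stage.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```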
Install uv and sync dependencies:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
. .venv/bin/activate
```

Fetch the Spider2-Snowflake split from the official repo (xlang-ai/Spider2) and place it at `datasets/Spider2/spider2-snow/`:
```bash
mkdir -p datasets/Spider2
git clone --depth 1 https://github.com/xlang-ai/Spider2.git /tmp/Spider2
cp -r /tmp/Spider2/spider2-snow datasets/Spider2/
```

After this step the tree should look like:

```
datasets/Spider2/spider2-snow/
├── spider2-snow.jsonl
├── snowflake_credential.json   # fill in with your Snowflake credentials
├── resource/
│   └── databases/
└── evaluation_suite/
    ├── evaluate.py
    └── gold/
```
⚠️ A working Snowflake account is required. Populate `datasets/Spider2/spider2-snow/snowflake_credential.json` before running any preprocessing or evaluation step.
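The credential file is a JSON object with your Snowflake connection fields. The keys below are an assumption based on the standard snowflake-connector-python parameters; check the official Spider2 repo for the authoritative format.

```json
{
  "account": "<your-account-identifier>",
  "user": "<your-username>",
  "password": "<your-password>"
}
```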
The released Spider2-Snowflake metadata has casing inconsistencies, columns with only null values, and missing example values. We provide a cleaned/enriched version.
Option A — download the prepared metadata (recommended):
Download from Google Drive and extract under `datasets/Spider2/spider2-snow/`. Expected layout after extraction:

```
datasets/Spider2/spider2-snow/resource/databases_no_nulls_2/
datasets/Spider2/spider2-snow/table_similarities_report_no_nulls.json
```
Option B — run the preprocessing pipeline yourself:
See [src/preprocessing/README.md](src/preprocessing/README.md). In short:
```bash
bash src/preprocessing/run_all.sh
```

This produces `resource/databases_no_nulls_2/` and a table-similarity report under `datasets/Spider2/spider2-snow/`.
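To illustrate the kind of cleanup this performs, the snippet below drops all-null columns and records a few example values per column. It is a simplified sketch, not the actual preprocessing code; the function name is hypothetical.

```python
import pandas as pd

def clean_table(df: pd.DataFrame, n_examples: int = 3) -> tuple[pd.DataFrame, dict]:
    # Drop columns containing only nulls, one of the issues in the released metadata.
    df = df.dropna(axis=1, how="all")
    # Collect a few concrete example values per column so schema linking
    # can see what the data actually looks like.
    examples = {
        col: df[col].dropna().astype(str).unique()[:n_examples].tolist()
        for col in df.columns
    }
    return df, examples
```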
The main experiments are driven by `src/main.py`, which can be launched via `src/main_slurm_laucher.py` on a Slurm cluster with local vLLM servers.
```bash
python src/main_slurm_laucher.py \
    --input_file datasets/Spider2/spider2-snow/spider2-snow.jsonl \
    --num_folds 600 \
    --vllm_model <path-to-your-model> \
    --gpu_name l40s \
    --vllm_server_extra_args "-tp 4 --enable-auto-tool-choice --tool-call-parser openai --gpu-memory-utilization 0.8" \
    --sbatch_args $'#SBATCH --time=3:00:00' \
    --base_dir inference_res \
    --slurm_log_dir gpt-oss-120b-logs \
    --agent_script_extra_args "--model gpt-oss-120b \
        --db_path datasets/Spider2/spider2-snow --db_type snowflake \
        --num_programs 1 --hierarchical-sl --planning_top_k 8 --planning_batch_size 4 --use_2step_batch_planning \
        --similarities_path datasets/Spider2/spider2-snow/table_similarities_report.json \
        --custom_exp_name spider2-snow-exp \
        --planning_gen_config '{\"temperature\": 1.0}' \
        --stitching_gen_config '{\"temperature\": 1.0}' \
        --eval_gen_config '{\"temperature\": 1.0}' \
        --max_self_eval_rounds 3"
```

Per-question outputs land under `inference_res/<custom_exp_name>/<question_id>/`, each containing `program_<k>.py` (or `.sql`) and a corresponding `program_output_<k>.csv`.
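For example, a run with `--custom_exp_name spider2-snow-exp` might leave a layout like the following (question IDs are placeholders):

```
inference_res/spider2-snow-exp/
├── <question_id_1>/
│   ├── program_0.py
│   └── program_output_0.csv
└── <question_id_2>/
    ├── program_0.sql
    └── program_output_0.csv
```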
Our experiment logs are available here.
Reduce the `k` candidate outputs per question to a single majority answer:

```bash
python utils/major_voting.py \
    --input_dir inference_res/<run_name>/<custom_exp_name> \
    --output_dir inference_res/<run_name>/<custom_exp_name>_majority \
    --fuzzy_threshold 0.90 \
    --allow_superset
```

This writes one `<question_id>.csv` per question into `--output_dir`.
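Conceptually, two candidate outputs vote for each other when their normalized cell contents are sufficiently similar, and the candidate with the most agreements wins. The sketch below is a simplification assuming a difflib-based similarity; the real `utils/major_voting.py` also handles column supersets (`--allow_superset`) and other details.

```python
import difflib
import pandas as pd

def table_signature(path: str) -> str:
    # Normalize a CSV into an order-insensitive string of its cells.
    df = pd.read_csv(path, dtype=str).fillna("")
    cells = sorted(str(v).strip().lower() for v in df.to_numpy().ravel())
    return "|".join(cells)

def majority_answer(csv_paths: list[str], threshold: float = 0.90) -> str:
    sigs = [table_signature(p) for p in csv_paths]

    def votes(i: int) -> int:
        # Count how many other candidates this one fuzzily agrees with.
        return sum(
            difflib.SequenceMatcher(None, sigs[i], sigs[j]).ratio() >= threshold
            for j in range(len(sigs))
            if j != i
        )

    # Return the candidate belonging to the largest agreement group.
    return csv_paths[max(range(len(csv_paths)), key=votes)]
```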
Pass@k groups all `k` candidates per question and counts a question correct if any candidate matches gold. Use the provided wrapper, which first flattens the per-question `program_output_*.csv` files into a single folder:

```bash
bash eval_spider20.sh inference_res/<run_name>/<custom_exp_name>
```

This calls:

```bash
python utils/move_csv_to_evaluate_folder.py \
    --source_dir <infer_folder> \
    --output <infer_folder>/evaluation
python utils/evaluate_passk.py \
    --result_dir <infer_folder>/evaluation \
    --gold_dir datasets/Spider2/spider2-snow/evaluation_suite/gold \
    --mode exec_result
```
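Under this definition, pass@k reduces to a per-question any() over candidate correctness flags. A minimal sketch of the counting, with a hypothetical `correct` mapping (`utils/evaluate_passk.py` derives correctness by comparing execution results against gold):

```python
def pass_at_k(correct: dict[str, list[bool]]) -> float:
    # correct maps question_id -> one boolean per candidate program.
    # A question counts as solved if ANY of its k candidates matches gold.
    solved = sum(any(flags) for flags in correct.values())
    return solved / len(correct)

# Example: two questions, k = 3 candidates each -> pass@3 = 0.5.
print(pass_at_k({"q1": [False, True, False], "q2": [False, False, False]}))
```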
Pass@1 evaluates a single answer per question. Run majority voting first (previous section), then point the official Spider2 evaluator at the majority output:

```bash
python datasets/Spider2/spider2-snow/evaluation_suite/evaluate.py \
    --mode exec_result \
    --result_dir inference_res/<run_name>/<custom_exp_name>_majority \
    --gold_dir datasets/Spider2/spider2-snow/evaluation_suite/gold
```

A Claude Code variant of this pipeline — packaged as Claude Code skills and backing MCP servers — is available in `claude_harness`.
```bibtex
@misc{pham2026flexsql,
      title={FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents},
      author={Quang Hieu Pham and Yang He and Ping Nie and Canwen Xu and Davood Rafiei and Yuepeng Wang and Xi Ye and Jocelyn Qiaochu Chen},
      year={2026},
      eprint={2605.02815},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.02815},
}
```