feat: bench suite — employees (MariaDB/4M rows) + lahman (SQLite/700k rows) by dimitri · Pull Request #1718 · dimitri/pgloader

dimitri · 2026-06-23T00:20:03Z

What

Adds clojure/tests/bench/ — a timing benchmark suite that runs each dataset 3 times against both v4 and v3 and reports a side-by-side comparison table.

Also fixes two pre-requisite gaps in v4's summary output:

- --summary now emits real JSON (clojure.data.json) instead of Clojure EDN (pr-str), so downstream scripts can json.load() it directly
- -S FILE added as an alias for --summary FILE, matching v3's CLI
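To illustrate why the EDN output was a blocker for downstream scripts, here is a minimal sketch. Both payloads are hypothetical, reduced stand-ins for a summary file; only the grand-total, rows, and total-nanos names come from this PR.

```python
import json

# Hypothetical, heavily reduced summary payloads (the real files hold more keys).
edn_text = '{:grand-total {:rows 4000000, :total-nanos 2047000000}}'
json_text = '{"grand-total": {"rows": 4000000, "total-nanos": 2047000000}}'


def try_parse(text):
    """Return the parsed document, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


edn_result = try_parse(edn_text)    # EDN keywords (:rows) are not JSON
json_result = try_parse(json_text)  # parses into plain dicts

print(edn_result)
print(json_result["grand-total"]["rows"])
```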

Datasets

| Dataset | Source | Rows | Size |
|---|---|---|---|
| employees | MariaDB (custom Docker image, built at CI cache time) | ~4M | 35 MB tarball |
| lahman | SQLite (fetched by `make lahman.sqlite`, cached in CI) | ~700k / 30 tables | 66 MB |

How timing works

Each Makefile target runs pgloader N times (default 3). Around each invocation:

T0=$(date +%s%3N)
java -jar /pgloader.jar -S summary.json employees.load
T1=$(date +%s%3N)
# augment JSON with OS-measured wall time
python3 -c "import json; d=json.load(open('summary.json')); d['os-wall-ms']=$((T1-T0)); json.dump(d,open('summary.json','w'))"
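The same wrap-and-augment step can be sketched in Python, which avoids relying on `date +%s%3N`. This is an illustration rather than the suite's actual code: `timed_run` is a hypothetical helper, and the child process below is a stand-in for the pgloader invocation so the sketch runs anywhere.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def timed_run(cmd, summary_path):
    """Run cmd, then record the measured wall time in the JSON summary it wrote."""
    start = time.monotonic_ns()
    subprocess.run(cmd, check=True)
    elapsed_ms = (time.monotonic_ns() - start) // 1_000_000

    path = Path(summary_path)
    summary = json.loads(path.read_text())
    summary["os-wall-ms"] = elapsed_ms  # same key the shell snippet injects
    path.write_text(json.dumps(summary))
    return elapsed_ms


# Stand-in for pgloader: a child Python process that writes a tiny summary
# file, so the sketch does not need pgloader or a database.
with tempfile.TemporaryDirectory() as tmp:
    summary_file = Path(tmp) / "summary.json"
    fake_loader = [
        sys.executable, "-c",
        "import json, sys; json.dump({'grand-total': {'rows': 1}}, open(sys.argv[1], 'w'))",
        str(summary_file),
    ]
    wall_ms = timed_run(fake_loader, summary_file)
    augmented = json.loads(summary_file.read_text())

print(augmented)
```

With the real tool, the call would presumably look like `timed_run(["java", "-jar", "/pgloader.jar", "-S", "summary.json", "employees.load"], "summary.json")`.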

report.py reads all JSON files, extracts pgloader-reported time (grand-total.total-nanos) and OS wall time (os-wall-ms), computes medians and v3÷v4 ratio, and prints:

dataset      ver  run    pgloader   COPY wall   OS wall    rows
-----------------------------------------------------------------
employees    v4   1       2.047s     1.482s      2.150s    4.00M
employees    v4   2       1.998s     1.411s      2.098s    4.00M
employees    v4   3       2.103s     1.530s      2.210s    4.00M
employees    v4   med     2.047s     1.482s      2.150s
employees    v3   1       7.213s         —        7.40s    4.00M
...
employees    —    v3÷v4    3.52×         —        3.44×
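The median and ratio rows can be reproduced with a few lines of Python. This is a sketch, not report.py itself: the nested layout of the v4 summary is inferred from the grand-total.total-nanos path, the v4 figures are the ones in the table above, and the second and third v3 timings are placeholders invented for illustration (only run 1 appears in the table).

```python
from statistics import median


def v4_pgloader_seconds(summary):
    """pgloader-reported time for a v4 summary, converted from nanoseconds."""
    return summary["grand-total"]["total-nanos"] / 1e9


# v4 runs: values taken from the example table.
v4_runs = [
    {"grand-total": {"total-nanos": 2_047_000_000}, "os-wall-ms": 2150},
    {"grand-total": {"total-nanos": 1_998_000_000}, "os-wall-ms": 2098},
    {"grand-total": {"total-nanos": 2_103_000_000}, "os-wall-ms": 2210},
]

# v3 runs in seconds: run 1 is from the table, runs 2 and 3 are placeholders.
v3_seconds = [7.213, 7.150, 7.320]

v4_med = median([v4_pgloader_seconds(s) for s in v4_runs])
v3_med = median(v3_seconds)
ratio = v3_med / v4_med  # v3÷v4: values above 1 mean v4 is faster

print(f"v4 med {v4_med:.3f}s  v3 med {v3_med:.3f}s  v3÷v4 {ratio:.2f}×")
```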

In GitHub Actions the table is also written to the job's step summary as Markdown.
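A minimal sketch of the step-summary write, assuming the table is already available as text. `append_step_summary` is a hypothetical helper name; GITHUB_STEP_SUMMARY is the file path that GitHub Actions exposes to each step, and the function does nothing when it is unset (for example on a local run).

```python
import os
import tempfile


def append_step_summary(table_text, env=os.environ):
    """Append the table as a Markdown code block; return False outside CI."""
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    fence = "`" * 3  # built at runtime to keep this example readable
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{fence}\n{table_text.rstrip()}\n{fence}\n")
    return True


# Demonstration with a temporary file standing in for the CI-provided path.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "summary.md")
    wrote = append_step_summary("dataset ver run", {"GITHUB_STEP_SUMMARY": target})
    with open(target, encoding="utf-8") as fh:
        written = fh.read()

skipped = append_step_summary("dataset ver run", {})
print(wrote, skipped)
```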

CI additions

- build-bench-source: builds and caches the custom MariaDB+employees image (keyed on Dockerfile + init-employees.sh; skipped when unchanged)
- bench matrix: 4 jobs, employees×{v4,v3} + lahman×{v4,v3}, RUNS=3 each; uploads per-run timing JSON as artifacts
- bench-report: aggregates all 4 artifact sets and emits the comparison table
- publish-dev: now waits for bench jobs to pass before publishing

write-summary-json was using pr-str which produces Clojure EDN, not valid JSON. Switch to clojure.data.json/write-str (new dep: org.clojure/data.json 2.5.0) so --summary foo.json produces JSON that downstream tools (report scripts, CI aggregators) can parse directly. Add -S FILE as alias for --summary FILE, matching v3 pgloader's CLI.

…0k rows)

Adds clojure/tests/bench/, a dedicated benchmark suite that runs each dataset 3 times against v4 and v3 and produces a timing comparison table.

Layout
------
- Dockerfile: custom MariaDB 11 image; fetches the employees tarball (35 MB) at build time and pre-seeds via init-employees.sh
- init-employees.sh: strips SOURCE commands (MySQL CLI-only) from employees.sql and runs DDL + per-dump-file loading directly
- docker-compose.yml: mariadb (bench source) + postgres (target) + test-runner
- employees.load: pgloader LOAD DATABASE mariadb→postgres, workers=4
- lahman.load: pgloader LOAD DATABASE sqlite→postgres
- Makefile: 3-run timing loop per target; augments each JSON summary with os-wall-ms (date +%s%3N before/after pgloader)
- report.py: reads v4 JSON (grand-total.total-nanos) and v3 JSON (root SECS key) plus os-wall-ms; prints the comparison table and writes Markdown to $GITHUB_STEP_SUMMARY when set

CI additions
------------
- build-bench-source: builds + caches the MariaDB image (keyed on Dockerfile and init-employees.sh; skipped when unchanged)
- bench matrix: employees×{v4,v3} + lahman×{v4,v3}, RUNS=3 each
- bench-report: aggregates timing JSONs, prints table, writes step summary
- publish-dev: now requires bench jobs to pass before publishing

Lahman SQLite (66 MB, jknecht/baseball-archive-sqlite 2022) is fetched by 'make lahman.sqlite' and cached in CI with actions/cache keyed on the release tag. Not committed to git.

make -C tests bench now delegates to tests/bench/Makefile. Also adds employees, employees-v3, lahman, lahman-v3, bench-report, bench-down as individual pass-throughs.

Three bugs found during a local trial run and fixed:

1. date +%s%3N (macOS): BSD date appends a literal N; switch to perl -MTime::HiRes to get the ms epoch on both Linux and macOS.
2. Perl inline define block: a single $ in Makefile define blocks is expanded by make as $(var) (empty string). Extracted the JSON injection to inject-ms.pl so all Perl variables are unambiguous.
3. Missing GRANT: MariaDB Docker creates the pgloader user via MARIADB_USER/PASSWORD but grants it no database access. Added GRANT ALL PRIVILEGES ON employees.* TO 'pgloader'@'%' at the end of init-employees.sh.

Previously 'make -C tests bench' delegated to 'make -C bench bench' which ran pgloader directly on the host. That means the Docker hostnames mariadb/postgres can't resolve, date +%s%3N fails on macOS, python3 isn't available, and summary files are lost when the container exits.

The fix mirrors every other integration suite:

make -C tests bench
  → docker compose -f bench/docker-compose.yml run --rm test-runner (starts mariadb + postgres, waits for healthchecks, then runs 'make -C /suite bench' inside the test-runner)
  → python3 bench/report.py bench/summaries (host, after container exits)
  → teardown

Key changes:
- tests/Makefile bench target uses docker compose run (not make -C bench)
- bench/lahman.sqlite is downloaded as a prerequisite before docker starts
- SUMMARY_DIR defaults to $(BENCH_DIR)summaries = /suite/summaries inside the container, which maps to bench/summaries/ on the host via .:/suite
- bench target in bench/Makefile drops 'report' (no python3 in container)
- bench/summaries/ and bench/lahman.sqlite added to .gitignore

…clear summaries

Three fixes:

1. report.py parse_v3: v3 JSON DATA is a list of groups where each group is a list of per-table dicts (concurrent batches), not a flat list. Some groups can also be JSON null (e.g. SQLite loads with no DATA). Fixed by flattening and filtering Nones before summing ROWS.
2. report.py median row now includes the median row count column.
3. tests/Makefile bench target:
   - clears bench/summaries/ before each run so stale files from a previous run (different RUNS count or old stub lahman.sqlite) don't pollute the report
   - propagates the RUNS variable into the container via 'make -C /suite bench RUNS=$(RUNS)'
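The flatten-and-filter step described for parse_v3 can be sketched as follows. This is not the code from report.py: `v3_total_rows` is a hypothetical name and the sample document is invented; only the DATA, ROWS, and SECS keys and the list-of-groups shape come from the commit message.

```python
def v3_total_rows(summary):
    """Sum ROWS across every per-table dict, skipping null groups."""
    groups = summary.get("DATA") or []  # DATA itself may be missing or null
    tables = [
        table
        for group in groups
        if group is not None          # some groups are JSON null
        for table in group
        if table is not None
    ]
    return sum(table.get("ROWS", 0) for table in tables)


# Invented sample: two concurrent batches plus one null group.
sample = {
    "SECS": 7.213,
    "DATA": [
        [{"ROWS": 300024}, {"ROWS": 2844047}],
        None,
        [{"ROWS": 9}],
    ],
}

total = v3_total_rows(sample)
empty = v3_total_rows({"SECS": 0.5, "DATA": None})
print(total, empty)
```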

New layout:

run │ step      │ employees v3 │ employees v4 │ v3÷v4 │ lahman v3 │ lahman v4 │ v3÷v4
1   │ pgloader  │ x.xxxs       │ x.xxxs       │ x.xx× │ x.xxxs    │ x.xxxs    │ x.xx×
1   │ COPY wall │ —            │ x.xxxs       │ —     │ —         │ x.xxxs    │ —
1   │ OS wall   │ x.xxxs       │ x.xxxs       │ x.xx× │ x.xxxs    │ x.xxxs    │ x.xx×
...
med │ pgloader  │ ...

Column width adapts to the widest suite+version header. ─ separators after each run block. v3 COPY wall is always — (not reported by v3).

…--quiet runs

- parse_v3: extract COPY wall time from the POSTLOAD 'COPY Threads Completion' entry (equivalent to v4's 'COPY Wall-Clock Time' post-phase entry)
- build_table: v3÷v4 ratio is always pgloader-time based (total time reported by pgloader), shown only on the pgloader row; COPY wall and OS wall rows show — in the ratio column
- Makefile: use --quiet for both v4 (java -jar ... --quiet) and v3 (pgloader --quiet) so log I/O does not inflate bench timings

logback.xml had an explicit '<logger name="pgloader" level="DEBUG"/>' that pinned every pgloader.* logger to DEBUG regardless of --quiet or any programmatic level change: set-log-level! only updated the root logger.

Two-part fix:

1. Remove the explicit DEBUG override from logback.xml so the pgloader logger inherits from root (INFO by default, matching the root setting).
2. Harden set-log-level! to always clear the pgloader named-logger level (setLevel null → inherit from root) so any future logback.xml pin cannot defeat --quiet again.

Default output is now INFO-only (was DEBUG) without any flags.

The COPY wall comparison is the most meaningful signal (pure transfer time, no connection setup or index overhead). OS wall stays — as it is indicative only.

## New benchmark: Divvy Bikeshare trips (CSV)

- 3 summer months 2023 (June/July/August): ≈ 2.2 M rows, ≈ 450 MB CSV
- Uses pgloader's filename-pattern feature: FROM ALL FILENAMES MATCHING ~/\d{6}-divvy-tripdata\.csv$/ IN DIRECTORY ...
- Data fetched on host via `make divvy-data` (curl + unzip), then bind-mounted read-only into the test-runner at /work/divvy/
- Same CI caching pattern as lahman (actions/cache keyed on month range)

## Report rewritten to match PR-description format

- Columns: dataset │ ver │ run │ pgloader │ COPY wall │ OS wall │ rows │ bytes │ MB/s
- Rows grouped by (dataset, version); v3÷v4 ratio row per dataset
- bytes from grand-total.bytes (v4) / root BYTES (v3)
- rows from grand-total.rows (v4) / sum of DATA[*].ROWS (v3)
- MB/s = bytes / (1 MiB) / copy_wall_s (shown per run and median)
- Suites auto-detected from summary file names (no hard-coded list)
- GITHUB_STEP_SUMMARY: appended as markdown code block from the CI step

## CI fixes (bench jobs were all failing with exit code 2)

1. make command: was `make $target` (wrong CWD in container); fixed to `make -C /suite $target`
2. artifact path: was /tmp/pgloader-bench/ (non-existent); fixed to clojure/tests/bench/summaries/
3. step summary: removed invalid ${{ github.step_summary }} env override; now wraps report output in a markdown code block via tee + redirect
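The MB/s formula from the report section above, as a small sketch. `throughput_mb_s` is a hypothetical name, the sample figures are invented, and returning None when no COPY wall time is available is an assumption about how a missing value might be handled.

```python
MIB = 1024 * 1024  # the report divides by 1 MiB, so "MB/s" here means MiB/s


def throughput_mb_s(total_bytes, copy_wall_s):
    """bytes / (1 MiB) / copy_wall_s, or None when no COPY wall time exists."""
    if not copy_wall_s:
        return None
    return total_bytes / MIB / copy_wall_s


# Invented figures: 450 MiB moved in 3 seconds of COPY wall time.
rate = throughput_mb_s(450 * MIB, 3.0)
missing = throughput_mb_s(450 * MIB, None)
print(rate, missing)
```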

report.py: ratio was v3÷v4 (> 1 = v4 faster), flip to v4÷v3 (< 1 = v4 faster, > 1 = v4 slower) so the direction reads naturally when comparing v4 against v3.

csv.clj: GlobCSVSource.read-rows was wrapping each file's lazy sequence in (vec ...), forcing the entire CSV file into memory before the prefetch pipeline could drain it. For large multi-file loads (e.g. 3 × 150 MB Divvy CSVs) this blows the default JVM heap. Drop the vec to keep rows lazy; the prefetch reader loop already handles lazy seqs via first/rest.

bench/Makefile: add -Xmx2g to PGLOADER_V4 as a belt-and-suspenders safety net for the bench runs.

dimitri added 12 commits June 23, 2026 02:13

fix: add bench pass-through targets to tests/Makefile

8a38978


bench: show v3÷v4 ratio on COPY wall row too

cb85f3c


dimitri wants to merge 12 commits into main from feat/bench-suite
