feat: bench suite — employees (MariaDB/4M rows) + lahman (SQLite/700k rows) #1718

Open
dimitri wants to merge 12 commits into main from feat/bench-suite

dimitri (Owner) commented Jun 23, 2026

What

Adds clojure/tests/bench/ — a timing benchmark suite that runs each dataset 3 times against both v4 and v3 and reports a side-by-side comparison table.

Also fixes two pre-requisite gaps in v4's summary output:

  • --summary now emits real JSON (clojure.data.json) instead of Clojure EDN (pr-str), so downstream scripts can json.load() it directly
  • -S FILE added as alias for --summary FILE, matching v3's CLI

Datasets

| Dataset | Source | Rows | Size |
|---|---|---|---|
| employees | MariaDB (custom Docker image, built at CI cache time) | ~4M | 35 MB tarball |
| lahman | SQLite (fetched by make lahman.sqlite, cached in CI) | ~700k / 30 tables | 66 MB |

How timing works

Each Makefile target runs pgloader N times (default 3). Around each invocation:

T0=$(date +%s%3N)
java -jar /pgloader.jar -S summary.json employees.load
T1=$(date +%s%3N)
# augment JSON with OS-measured wall time
python3 -c "import json; d=json.load(open('summary.json')); d['os-wall-ms']=$((T1-T0)); json.dump(d,open('summary.json','w'))"
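
For illustration only, the same wrap-and-augment step can be sketched in Python. This is not what the Makefile does (it uses the shell lines above); `timed_run` is a hypothetical helper name, and the only key taken from this PR is `os-wall-ms`.

```python
import json
import subprocess
import time
from pathlib import Path


def timed_run(cmd, summary_path):
    """Run cmd, then record the OS-measured wall time (ms) in its JSON summary.

    Assumes cmd writes a JSON object to summary_path before it exits.
    """
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock jumps
    subprocess.run(cmd, check=True)
    elapsed_ms = round((time.monotonic() - start) * 1000)

    path = Path(summary_path)
    summary = json.loads(path.read_text())
    summary["os-wall-ms"] = elapsed_ms  # same key the Makefile injects
    path.write_text(json.dumps(summary))
    return elapsed_ms
```

A call such as `timed_run(["java", "-jar", "/pgloader.jar", "-S", "summary.json", "employees.load"], "summary.json")` would mirror the shell snippet; the measurement does not depend on `date` supporting `%3N`.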

report.py reads all JSON files, extracts pgloader-reported time (grand-total.total-nanos) and OS wall time (os-wall-ms), computes medians and v3÷v4 ratio, and prints:

dataset      ver  run    pgloader   COPY wall   OS wall    rows
-----------------------------------------------------------------
employees    v4   1       2.047s     1.482s      2.150s    4.00M
employees    v4   2       1.998s     1.411s      2.098s    4.00M
employees    v4   3       2.103s     1.530s      2.210s    4.00M
employees    v4   med     2.047s     1.482s      2.150s
employees    v3   1       7.213s         —        7.40s    4.00M
...
employees    —    v3÷v4    3.52×         —        3.44×
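
A minimal sketch of the median and ratio arithmetic behind the table above, using the pgloader-reported times it shows. The function name `summarize` and the result keys are hypothetical; the actual report.py may be organized differently. Only v3 run 1 is visible in the sample output, so it is the only v3 value used here.

```python
from statistics import median


def summarize(v3_secs, v4_secs):
    """Return per-version medians and the v3/v4 ratio (> 1 means v4 is faster)."""
    v3_med = median(v3_secs)
    v4_med = median(v4_secs)
    return {"v3-median": v3_med, "v4-median": v4_med, "v3/v4": v3_med / v4_med}


v4_runs = [2.047, 1.998, 2.103]  # pgloader column, employees v4, runs 1-3
v3_runs = [7.213]                # only run 1 appears in the sample table

result = summarize(v3_runs, v4_runs)
print(f"v4 median {result['v4-median']:.3f}s, v3/v4 {result['v3/v4']:.2f}x")
# prints: v4 median 2.047s, v3/v4 3.52x
```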

In GitHub Actions the table is also written to the job's step summary as Markdown.

CI additions

  • build-bench-source — builds and caches the custom MariaDB+employees image (keyed on Dockerfile + init-employees.sh; skipped when unchanged)
  • bench matrix — 4 jobs: employees×{v4,v3} + lahman×{v4,v3}, RUNS=3 each; uploads per-run timing JSON as artifacts
  • bench-report — aggregates all 4 artifact sets and emits the comparison table
  • publish-dev now waits for bench jobs to pass before publishing

dimitri added 12 commits June 23, 2026 02:13
write-summary-json was using pr-str which produces Clojure EDN, not
valid JSON. Switch to clojure.data.json/write-str (new dep:
org.clojure/data.json 2.5.0) so --summary foo.json produces JSON that
downstream tools (report scripts, CI aggregators) can parse directly.

Add -S FILE as alias for --summary FILE, matching v3 pgloader's CLI.
…0k rows)

Adds clojure/tests/bench/ — a dedicated benchmark suite that runs each
dataset 3 times against v4 and v3 and produces a timing comparison table.

Layout
------
  Dockerfile        Custom MariaDB 11 image; fetches employees tarball
                    (35 MB) at build time, pre-seeds via init-employees.sh
  init-employees.sh Strips SOURCE commands (MySQL CLI-only) from employees.sql
                    and runs DDL + per-dump-file loading directly
  docker-compose.yml mariadb (bench source) + postgres (target) + test-runner
  employees.load    pgloader LOAD DATABASE mariadb→postgres, workers=4
  lahman.load       pgloader LOAD DATABASE sqlite→postgres
  Makefile          3-run timing loop per target; augments each JSON summary
                    with os-wall-ms (date +%s%3N before/after pgloader)
  report.py         Reads v4 JSON (grand-total.total-nanos) and v3 JSON
                    (root SECS key) plus os-wall-ms; prints comparison table
                    and writes Markdown to $GITHUB_STEP_SUMMARY when set

CI additions
------------
  build-bench-source  Builds + caches the MariaDB image (keyed on Dockerfile
                      and init-employees.sh; skipped when unchanged)
  bench matrix        employees×{v4,v3} + lahman×{v4,v3}, RUNS=3 each
  bench-report        Aggregates timing JSONs, prints table, writes step summary
  publish-dev         Now requires bench jobs to pass before publishing

Lahman SQLite (66 MB, jknecht/baseball-archive-sqlite 2022) is fetched by
'make lahman.sqlite' and cached in CI with actions/cache keyed on the
release tag.  Not committed to git.

make -C tests bench now delegates to tests/bench/Makefile.
Also adds employees, employees-v3, lahman, lahman-v3, bench-report,
bench-down as individual pass-throughs.

Three bugs found during a local trial run and fixed:

1. date +%s%3N (macOS): BSD date appends literal N; switch to
   perl -MTime::HiRes to get ms epoch on both Linux and macOS.

2. Perl inline define block: a single $ inside a Makefile define block is
   expanded by make as a make variable reference (which yields an empty
   string), so Perl never sees it.  Extracted the JSON injection to
   inject-ms.pl so all Perl variables are unambiguous.

3. Missing GRANT: MariaDB Docker creates the pgloader user via
   MARIADB_USER/PASSWORD but grants it no database access.  Added
   GRANT ALL PRIVILEGES ON employees.* TO 'pgloader'@'%' at the end
   of init-employees.sh.

Previously 'make -C tests bench' delegated to 'make -C bench bench', which
ran pgloader directly on the host.  That meant the Docker hostnames
mariadb/postgres could not resolve, date +%s%3N failed on macOS, python3
was not available, and summary files were lost when the container exited.

The fix mirrors every other integration suite:

  make -C tests bench
    → docker compose -f bench/docker-compose.yml run --rm test-runner
        (starts mariadb + postgres, waits for healthchecks,
         then runs 'make -C /suite bench' inside the test-runner)
    → python3 bench/report.py bench/summaries   (host, after container exits)
    → teardown

Key changes:
- tests/Makefile bench target uses docker compose run (not make -C bench)
- bench/lahman.sqlite is downloaded as a prerequisite before docker starts
- SUMMARY_DIR defaults to $(BENCH_DIR)summaries = /suite/summaries inside
  the container, which maps to bench/summaries/ on the host via .:/suite
- bench target in bench/Makefile drops 'report' (no python3 in container)
- bench/summaries/ and bench/lahman.sqlite added to .gitignore
…clear summaries

Three fixes:

1. report.py parse_v3: v3 JSON DATA is a list of groups where each group
   is a list of per-table dicts (concurrent batches), not a flat list.
   Some groups can also be JSON null (e.g. SQLite loads with no DATA).
   Fixed by flattening and filtering Nones before summing ROWS.

2. report.py median row now includes the median row count column.

3. tests/Makefile bench target:
   - clears bench/summaries/ before each run so stale files from a
     previous run (different RUNS count or old stub lahman.sqlite) don't
     pollute the report
   - propagates RUNS variable into the container via
     'make -C /suite bench RUNS=$(RUNS)'
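
The parse_v3 fix in item 1 can be sketched as follows. The `DATA` and `ROWS` keys come from the commit message; the function name `v3_total_rows` and the sample values are hypothetical, and the real parse_v3 may differ.

```python
def v3_total_rows(summary):
    """Sum ROWS over a v3 summary whose DATA is a list of groups.

    Each group is a list of per-table dicts, or None when a group has no data.
    """
    groups = summary.get("DATA") or []
    tables = [
        table
        for group in groups
        if group is not None       # skip JSON null groups
        for table in group
        if table is not None
    ]
    return sum(table.get("ROWS", 0) for table in tables)


sample = {"DATA": [[{"ROWS": 10}, {"ROWS": 5}], None, [{"ROWS": 1}]]}
print(v3_total_rows(sample))  # prints: 16
```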

New layout:

  run │      step │ employees v3 │ employees v4 │ v3÷v4 │ lahman v3 │ lahman v4 │ v3÷v4
    1 │  pgloader │       x.xxxs │       x.xxxs │ x.xx× │    x.xxxs │    x.xxxs │ x.xx×
    1 │ COPY wall │            — │       x.xxxs │     — │         — │    x.xxxs │     —
    1 │   OS wall │       x.xxxs │       x.xxxs │ x.xx× │    x.xxxs │    x.xxxs │ x.xx×
  ...
  med │  pgloader │         ...

Column width adapts to the widest suite+version header.  ─ separators
after each run block.  v3 COPY wall is always — (not reported by v3).
…--quiet runs

- parse_v3: extract COPY wall time from POSTLOAD 'COPY Threads Completion'
  entry (equivalent to v4's 'COPY Wall-Clock Time' post-phase entry)

- build_table: v3÷v4 ratio is always pgloader-time based (total time
  reported by pgloader), shown only on the pgloader row; COPY wall and
  OS wall rows show — in the ratio column

- Makefile: use --quiet for both v4 (java -jar ... --quiet) and v3
  (pgloader --quiet) so log I/O does not inflate bench timings

logback.xml had an explicit '<logger name="pgloader" level="DEBUG"/>'
that pinned every pgloader.* logger to DEBUG regardless of --quiet or any
programmatic level change — set-log-level! only updated the root logger.

Two-part fix:
1. Remove the explicit DEBUG override from logback.xml so the pgloader
   logger inherits from root (INFO by default, matching the root setting).
2. Harden set-log-level! to always clear the pgloader named-logger level
   (setLevel null → inherit from root) so any future logback.xml pin
   cannot defeat --quiet again.

Default output is now INFO-only (was DEBUG) without any flags.
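
As an analogy only (the fix itself is in logback.xml and the Clojure set-log-level!), Python's standard logging module has the same inheritance rule: an explicit level on a named logger overrides the root level until it is cleared. The logger name is reused here purely for illustration.

```python
import logging

root = logging.getLogger()
named = logging.getLogger("pgloader")

# An explicit level on the named logger, like the logback.xml override.
named.setLevel(logging.DEBUG)
# Changing only the root level, like the old set-log-level! did.
root.setLevel(logging.WARNING)
pinned_level = named.getEffectiveLevel()     # still DEBUG: the pin wins

# Clearing the named logger's level makes it inherit from root again.
named.setLevel(logging.NOTSET)
inherited_level = named.getEffectiveLevel()  # now WARNING

print(logging.getLevelName(pinned_level), logging.getLevelName(inherited_level))
# prints: DEBUG WARNING
```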

The COPY wall comparison is the most meaningful signal (pure transfer
time, no connection setup or index overhead).  OS wall stays — as it
is indicative only.

## New benchmark: Divvy Bikeshare trips (CSV)
- 3 summer months 2023 (June/July/August): ≈ 2.2 M rows, ≈ 450 MB CSV
- Uses pgloader's filename-pattern feature:
    FROM ALL FILENAMES MATCHING ~/\d{6}-divvy-tripdata\.csv$/ IN DIRECTORY ...
- Data fetched on host via `make divvy-data` (curl + unzip), then
  bind-mounted read-only into the test-runner at /work/divvy/
- Same CI caching pattern as lahman (actions/cache keyed on month range)
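
To show which files the pattern above would pick up, here is a small Python check using the same regular expression. The file names are hypothetical examples, and the sketch assumes the pattern is applied as an unanchored search on the file name.

```python
import re

# Same regex as in the FILENAMES MATCHING clause above.
PATTERN = re.compile(r"\d{6}-divvy-tripdata\.csv$")


def matching(names):
    """Return the names the pattern accepts, in their original order."""
    return [name for name in names if PATTERN.search(name)]


names = [
    "202306-divvy-tripdata.csv",
    "202307-divvy-tripdata.csv",
    "202308-divvy-tripdata.csv",
    "202306-divvy-tripdata.zip",  # rejected: wrong extension
    "Divvy_Stations.csv",         # rejected: no six-digit month prefix
]
print(matching(names))
```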

## Report rewritten to match PR-description format
- Columns: dataset │ ver │ run │ pgloader │ COPY wall │ OS wall │ rows │ bytes │ MB/s
- Rows grouped by (dataset, version); v3÷v4 ratio row per dataset
- bytes from grand-total.bytes (v4) / root BYTES (v3)
- rows from grand-total.rows (v4) / sum of DATA[*].ROWS (v3)
- MB/s = bytes / 2^20 / copy_wall_s, i.e. MiB per second (shown per run and median)
- Suites auto-detected from summary file names (no hard-coded list)
- GITHUB_STEP_SUMMARY: appended as markdown code block from the CI step
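
The throughput column reduces to one division. A minimal sketch, where `throughput_mb_s` is a hypothetical name and returning None for a missing COPY wall time is an assumption about how an absent value would be handled:

```python
MIB = 1024 * 1024


def throughput_mb_s(total_bytes, copy_wall_s):
    """Return bytes / 1 MiB / copy_wall_s, or None if no COPY wall time exists."""
    if not copy_wall_s:
        return None  # missing or zero COPY wall time: nothing to report
    return total_bytes / MIB / copy_wall_s


print(throughput_mb_s(10 * MIB, 2.0))  # prints: 5.0
```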

## CI fixes (bench jobs were all failing with exit code 2)
1. make command: was `make $target` (wrong CWD in container);
   fixed to `make -C /suite $target`
2. artifact path: was /tmp/pgloader-bench/ (non-existent);
   fixed to clojure/tests/bench/summaries/
3. step summary: removed invalid ${{ github.step_summary }} env override;
   now wraps report output in a markdown code block via tee + redirect

report.py: ratio was v3÷v4 (> 1 = v4 faster), flip to v4÷v3
(< 1 = v4 faster, > 1 = v4 slower) so the direction reads
naturally when comparing v4 against v3.

csv.clj: GlobCSVSource.read-rows was wrapping each file's lazy
sequence in (vec ...), forcing the entire CSV file into memory
before the prefetch pipeline could drain it. For large multi-file
loads (e.g. 3 × 150 MB Divvy CSVs) this blows the default JVM
heap. Drop the vec to keep rows lazy -- the prefetch reader loop
already handles lazy seqs via first/rest.
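
The effect of dropping the vec can be illustrated with a Python analogy. This is not the csv.clj code: generators stand in for Clojure lazy sequences, `list(...)` stands in for `(vec ...)`, and all names are hypothetical.

```python
from itertools import chain


def rows_of(name, n, produced):
    """Stand-in for lazily reading one CSV file; logs every row it produces."""
    for i in range(n):
        produced.append((name, i))
        yield (name, i)


def eager_rows(files, produced):
    # Like (vec ...): a whole file is materialized before its first row is used.
    return chain.from_iterable(list(rows_of(f, n, produced)) for f, n in files)


def lazy_rows(files, produced):
    # Rows are produced one at a time, only as the consumer asks for them.
    return chain.from_iterable(rows_of(f, n, produced) for f, n in files)


files = [("a.csv", 1000), ("b.csv", 1000)]

eager_log, lazy_log = [], []
next(eager_rows(files, eager_log))
next(lazy_rows(files, lazy_log))
print(len(eager_log), len(lazy_log))  # prints: 1000 1
```

Taking a single row forces 1000 rows in the eager variant and only one in the lazy variant, which is the memory difference the commit describes.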

bench/Makefile: add -Xmx2g to PGLOADER_V4 as a belt-and-suspenders
safety net for the bench runs.