Skip to content

Fix "reset sequences" crash on tables with case sensitive column names#1509

Open
codereverser wants to merge 1 commit into
dimitri:mainfrom
codereverser:fix/quote-idenfitiers-reset-sequences
Open

Fix "reset sequences" crash on tables with case sensitive column names#1509
codereverser wants to merge 1 commit into
dimitri:mainfrom
codereverser:fix/quote-idenfitiers-reset-sequences

Conversation

@codereverser

@codereverser codereverser commented Jul 16, 2023

Copy link
Copy Markdown

On case sensitive migrations (i.e. with "quote identifiers"), reset sequences fail with the following error.

Database error 42703: column ""<columnName>"" of relation "<table>" does not exist

This is due to invalid parameters being passed to pg_get_serial_sequence function introduced in e32066d

for example, on a mysql table with following definition

CREATE TABLE botLog(
    `logID` bigint(20) NOT NULL AUTO_INCREMENT, 
    `message` varchar(255), 
    PRIMARY KEY (`logID`)
);
SELECT pg_get_serial_sequence(quote_ident('public') || '.' || quote_ident('botLog'), quote_ident('logID'));
ERROR:  column ""logID"" of relation "botLog" does not exist

throws the same error, but the following works

SELECT pg_get_serial_sequence(quote_ident('public') || '.' || quote_ident('botLog'), 'logID');
  pg_get_serial_sequence
---------------------------
 public."botLog_logID_seq"

ref: #425

@Asher-JH

Asher-JH commented Oct 3, 2024

Copy link
Copy Markdown

Is this something that can be merged in soon? facing this issue

dimitri added a commit that referenced this pull request Jun 20, 2026
Comprehensive scan of all 200 open issues and 40+ open PRs on the repo
against v4 fix status. Major additions vs previous version:

- FULLTEXT → GIN tsvector: #1230 now ✅ (was 🔧)
- MySQL/MariaDB: add #1539, #1570, #1572, #1592, #1617, #1641, #1653,
  #1654, #1661
- PostgreSQL target: add #1461, #1490, #1576, #1600, #1693
- pgsql→pgsql: add #1419, #1550, #1556, #1601
- SQLite: add #1450, #1451, #1472, #1515, #1517, #1523, #1543, #1547,
  #1552, #1607, #1631, #1665, #1687
- MSSQL: add #304, #1445, #1454, #1551, #1582, #1586, #1590, #1597,
  #1627, #1630, #1669
- CLI: add #1463, #1475, #1476, #1682
- Parallelism: add #1405, #1487, #1520, #1583, #1589, #1623, #1648, #1680
- Connection: add #1480, #1507, #1562, #1579, #1636, #1685
- Summary: add #1561
- Open PRs superseded by v4: #1662, #1652, #1596, #1595, #1594, #1531,
  #1509, #1321
- Known gaps section updated
dimitri added a commit that referenced this pull request Jun 20, 2026
* Add pgloader v4 Clojure rewrite

Initial commit of the Clojure v4 rewrite of pgloader.

## What this adds

- Full LOAD DATABASE pipeline for MySQL, MariaDB, PostgreSQL-as-source,
  MSSQL, and SQLite
- File sources: CSV, COPY format, Fixed-width, DBF (including memo fields),
  LOAD ARCHIVE (zip/tar with HTTP fetch)
- Grammar / parser: instaparse-based .load file parser covering all v3
  command types; optional trailing semicolon
- DDL pipeline: CREATE TABLE, indexes (OID-named), FKs, sequence reset,
  ALTER SCHEMA RENAME, ALTER TABLE NAMES MATCHING
- Type mapping: MySQL → PostgreSQL with full unsigned upcast, tinyint(1),
  ENUM, geometry, JSON, spatial types
- Citus: grammar + Phase 1 (reference table, distributed table)
- CLI: --verbose, --debug, --quiet, --version, --summary, --type,
  --encoding, --with, --cast, --set, --field, --before, --after
- Summary output: sectioned table (pre/data/post), download+extract bytes,
  COPY Wall-Clock Time, optional read/write timing (--debug)
- Regression test framework (regress subcommand) with Docker-based E2E
  suites for all source types

## Bug fixes vs v3

- MariaDB detection done once per connection (not per catalog call)
- pg_get_serial_sequence second argument is correctly unquoted
- ENUM cast rules deduplicated by source definition
- int(N) with N≥10 maps to bigint (matches CL cast rule)
- Identifier quoting in summary matches Postgres rules (no over-quoting)
- Sequence reset guard includes bigserial columns

## CI

- GitHub Actions workflow for integration tests

## Docs

- GITHUB-ISSUES.md: maps GitHub issues to v4 fix status

* Fix Citus, CamelCase, Makefile silencers, add unit tests to CI

- Citus: uncomment all DISTRIBUTE clauses in citus.load — FK backfill
  implementation handles multi-hop chains (ads/clicks/impressions via
  campaigns). Both v4 (citus.clj augment-catalog) and v3 support this.

- CamelCase: add 02-camelcase test to mysql/source suite verifying that
  CamelCaseTable and mixed-case column names (MyId, MyName) are preserved
  end-to-end when quote identifiers is set.

- Makefile silencers: remove - prefix from my/sakila targets in mysql,
  mariadb, and mysql57 test Makefiles. Regression diffs all pass; the
  silencers were vestigial from earlier development.

- CI: add unit test job that runs before the integration matrix. Covers all
  90 unit tests (cast, batch, DDL, citus, parser, transforms, pg-service, cli).
  Integration jobs now have needs: unit so they only run after unit tests pass.

- CI: add v4-clojure-rewrite to push branches so the workflow fires on the
  new development branch.

* CI: fix DeLaGuardo/setup-clojure@12 → @13 (v12 tag does not exist)

* CI: fix Docker build context — must be repo root, not clojure/

* CI: cp JAR to fixed name instead of relative symlink (Docker can't follow relative symlinks across bind-mounts)

* CI: build test-runner image and v4 JAR once, share via artifacts

- New build-test-runner job: builds image once, docker save | gzip,
  uploads as artifact. Cross-run cache keyed on Dockerfile + v3 source
  hash so it's only rebuilt when those files change.
- New build-jar job: clojure -T:build uber once, uploads JAR as artifact.
- integration jobs: download both artifacts, docker load, run tests.
  No Java/Clojure setup needed — nothing is compiled per-job.

Before: each of 22 matrix jobs independently rebuilt the image (~2min)
and compiled the JAR (~2min) = ~88 min of redundant work per run.
After: one image build + one JAR build, 22 jobs just download + test.

* CI: rm symlink before cp JAR — build tool already creates pgloader.jar symlink, cp fails with 'same file'

* CI: replace JAR symlink with hard link — ln -f, no rm+cp needed

* Fix 8 CI test failures: guess-csv, GlobCSVSource arity, expected files, schema remap, SQLite defaults

* Fix more CI failures: guess-csv delimiter, strip-quotes recursion, Dockerfile heredocs, target-columns v3

* Fix target-columns: rename HAVING FIELDS to match TARGET COLUMNS so v3 works too

* Align CSV test data with v3: non-printable control chars, blanks-keep 7 rows

- non-printable: replace 4 clean rows with 7 rows from v3 inline data,
  including actual non-printable control characters (0x05, 0x06) in
  the c2 field — this actually tests non-printable char handling.
  Update expected count from 4 to 7.

- blanks-keep: replace 4-row header CSV with v3's 7-row inline data
  (no header row); remove skip header = 1 from load file to match.
  Update expected output for all 7 rows, showing that keep unquoted
  blanks preserves spaces but empty/null values are still NULL.

* v4 CSV tests: align all tests with v3, fix per-column null-if with backslash escape

- Fix per-column [null if '\N'] with fields escaped by backslash-quote:
  when opencsv strips backslash escapes (\N → N), the per-column nullif
  comparison failed. Build col-nil-val-alt map to match the stripped form.

- Fix ArrayIndexOutOfBoundsException for short CSV rows (fewer fields
  than declared columns): guard aget with bounds check.

- Add actual-table-columns + column filtering in core.clj: v4 now silently
  drops columns declared in load file that don't exist in the target table,
  matching v3 behavior (fixes 'column c does not exist' COPY errors).

- Add row-filter-fn plumbing through prefetch.clj to apply column filtering
  after reading rows but before COPY.

- Align CSV test load files and data with v3 counterparts:
  blanks-keep, blanks-trim, nulls, null-if, parse-date, guess-csv,
  header-auto, filename-pattern, non-printable

- Regenerate all expected output files by running v3 and v4 against
  identical load files; all 38 CSV tests now pass with v4.

* Fix v4 sqlite/mysql integration test failures

SQLite source:
- Default table name case to lowercase (matching v3 behavior). v3 lowercases
  SQLite table names by default; v4 was preserving original CamelCase. Fixed
  chinook-noseq (expected lowercase album/track/etc.) and sqlite tests.
- Use 'main' as source schema name (SQLite's default schema is 'main').
  ALTER SCHEMA 'main' RENAME TO 'chinook' now applies correctly. Previously
  the catalog used 'public' as schema, so only ALTER SCHEMA 'public' worked.
- Add source-table-name field so read-rows queries use original SQLite name
  while DDL uses the normalized (lowercased) name.
- Pass with-options to SQLite source (snake-case-ids, downcase-ids).

MySQL source:
- Use MariaDB index query (without expression column) for MySQL 5.x.
  information_schema.statistics.expression is MySQL 8.0+ only; MySQL 5.7
  throws 'Unknown column expression'. Now branches on both mariadb variant
  AND major-version < 8.

Test fixes:
- sqlite/sqlite.load: ALTER SCHEMA 'main' (was 'public')
- sqlite/base64.load: ALTER SCHEMA 'main' (was 'public')
- sqlite/chinook/sql/01-counts.sql: use chinook.track (lowercase, correct schema)
- Update sqlite expected files to match v4 output (v3 was broken on these)
- mysql expression-index expected: add trailing space to match psql output

* Translate MySQL FULLTEXT indexes to PostgreSQL GIN tsvector

v3 (CL pgloader) translates FULLTEXT KEY → CREATE INDEX USING gin(to_tsvector('simple', col)).
v4 was silently dropping them: the btree attempt failed on long mediumtext values
(row size exceeds btree maximum) and exec-post-ddl! swallowed the error.

Fix:
- Preserve index_type in the MySQL catalog index map (:index-type field)
- In create-indexes-sql, detect FULLTEXT and emit USING gin(to_tsvector(...))
  instead of a plain btree

Update expected/03-indexes.out for mysql, mysql57, mariadb to include
idx_{oid}_search (the FULLTEXT KEY on fcm_batches.raw_payload).

* SQLite: use 'public' as catalog schema name (matches v3 behavior)

v3's SQLite source creates catalog entries with schema 'public'.
v4 was using 'main' (SQLite's internal schema name) after a previous
fix, but this broke v3 compatibility since v3's ALTER SCHEMA rules
match against 'public', not 'main'.

Revert catalog schema to 'public' and load files from 'main' to 'public'
so ALTER SCHEMA 'public' RENAME TO '...' works in v4.

Note: v3 SQLite assigns nil as the source schema name (not 'public'),
so ALTER SCHEMA clauses still don't work for v3. This is a pre-existing
v3 limitation — chinook and test-pk will fail under v3-v3 runs.
The typenames '"0"' default is similarly a pre-existing v3 bug that
commit 486be8c partially fixed (handles "0" but not '"0"').

* Fix v3 SQLite type handling and v4 snake_case for CamelCase identifiers

v3 fixes:
- sqlite-schema.lisp: double-unquote defaults to strip nested '"0"' quoting
  (unquote strips outer single-quotes, leaving "0"; unquote again with #\"
  strips inner double-quotes, yielding bare 0 for the PostgreSQL DEFAULT)
- sqlite-cast-rules.lisp: add catch-all rule mapping unknown SQLite types to
  text (matches v4 fallback; fixes 'intege' table with type 'intege')
- casting-rules.lisp: support :type t as a wildcard in cast rule matching

v4 fix:
- sqlite.clj: apply CamelCase -> snake_case conversion for SQLite identifiers
  when snake_case identifiers is set (was always lowercasing, ignoring id-case)
  'TableName' now correctly becomes 'table_name' with snake_case identifiers

Update expected: sqlite.out 'tablename' -> 'table_name' to match v3+v4 output

* Fix pre-existing v3 bugs: sequence reset, ENUM dedup, pgsql DSN

pgsql-create-schema.lisp: remove erroneous quote_ident() around attname in
  pg_get_serial_sequence() call. The column_name argument is TEXT, not a SQL
  identifier — quote_ident was embedding literal double-quote chars into the
  name, causing 'column ""raceId"" does not exist' for CamelCase primary
  key columns (e.g. MySQL races.raceId). Bug introduced in commit 644f261.

mysql-cast-rules.lisp: deduplicate ENUM/SET types by source definition.
  MySQL ENUM/SET are inline per-column; two columns with identical values
  (e.g. film.rating and film_list.rating in a materialized view) were
  registered twice under different names. Reuse the existing type.

pgsql/pgsql.load: use explicit connection URIs instead of ?service=name.
  v3 pgloader does not support the ?service= URI format; use explicit
  postgresql://user:pass@host/db form so mysql-v3 CI tests can run.

* Fix v3 test compatibility: semicolons, IN SCHEMA, funny_string cast, grammar

- source.load: add missing trailing semicolon (v3 end-of-command rule requires it)
- pgsql.load: add IN SCHEMA 'pgsql_source' to INCLUDING clause (required by
  v3's including-matching-in-schema-filter grammar rule)
- my.load: add 'column funny_string.s to text drop not null' cast so GBK
  decoding failures produce NULL rather than causing a NOT NULL violation
- grammar.clj: accept optional IN SCHEMA 'name' suffix in including-only and
  excluding-only rules (v4 ignores the schema qualifier; v3 requires it)

* Fix pgsql v3 schema mapping and mysql v3 typemod error

- pgsql.load: add ALTER SCHEMA 'pgsql_source' RENAME TO 'pgsql_migrated' so
  v3 puts tables in pgsql_migrated (v3 doesn't treat SET search_path as an
  implicit schema remap the way v4 does)
- my.load: add 'drop typemod' to funny_string.s cast; MySQL char(N) has a
  typemod and PostgreSQL rejects text(N)

* Remove search_path schema inference; rebuild pgsql test with multi-schema source

search_path inference in core.clj was a security risk: any load file that used
SET search_path for name resolution (a PostgreSQL session variable) would
silently remap ALL source schemas to that value as a side effect. Schema
mapping must be explicit via ALTER SCHEMA 'old' RENAME TO 'new'.

Changes:
- core.clj: remove implicit search_path→schema-remap logic; SET search_path
  now passes through as a literal SET command on the target connection only
- chinook-noseq.load: add explicit ALTER SCHEMA (was relying on inference)
- sqlite.load, base64.load: drop now-redundant search_path from SET clause
  (ALTER SCHEMA already handles the rename; search_path had no other effect)

Rebuild pgsql test suite to exercise real pgsql-to-pgsql migration:
- source PostgreSQL is populated from diverse source types (SQLite Chinook,
  SQLite sqlite.db, DBF dbase_8b, CSV track, fixed-width census places),
  each loaded into its own named schema by v4 load scripts in source/
- pgsql.load then copies the entire source DB to target with no schema
  filters — all user schemas migrate intact, names preserved
- Regression checks schema list and row counts in target
- docker-compose adds fileserver (for census places archive) and mounts
  sqlite/dbf/csv test data directories into the test-runner container
- init.sql and pg_service.conf removed (no longer needed)

* Fix v3 crash on expression index; use plain postgres:16 for pgsql test

pgsql-ddl.lisp: guard against NIL when looking up an index column in
the table's column list. Expression indexes (e.g. (LOWER(email))) have
no matching column name in the catalog; the find returns NIL and the
subsequent column-type-name call crashed with TYPE-ERROR. The fix binds
col separately and wraps the accessor in (when col ...) so expression
index columns simply produce a NIL type, which is already handled
downstream by the (when (stringp idx-type) ...) guard.

pgsql/docker-compose.yml: switch both source and target from
postgis/postgis:16-3.4 to postgres:16. The postgis image pre-installs
the tiger and topology schemas; with 'include drop' pgloader tried to
DROP TABLE on extension-owned tables (spatial_ref_sys, tiger.*, etc.)
which PostgreSQL refuses. Our test data has no geometry columns so
PostGIS is not needed in either container.

* Fix three CI failures

v3 mysql: use COALESCE(column_name, expression) in list-all-indexes.sql so
MySQL 8.0 functional/expression indexes are captured. Translate backtick-quoted
identifiers in expressions to double-quoted for PostgreSQL compatibility.

v3 pgsql-ddl: use :when col :collect instead of :collect (when col ...) in
index-access-method loop so NIL is not inserted into idx-types list, preventing
the 'column "nil" does not exist' SQL error on expression indexes.

v4 regress: strip trailing whitespace from psql output before diff so psql
column-header alignment padding across versions does not cause spurious
01-schemas and 02-counts failures.

* regress: fix trailing-whitespace strip eating final blank line

str/split-lines drops trailing newline so joining back lost the blank
line that psql emits after (N rows). Use regex replace instead to
strip trailing spaces/tabs per line without disturbing line endings.

* regress: normalize expected file whitespace before diff

Expected files created before trailing-space stripping was added still
have trailing spaces in column headers. Normalize both expected and
actual to a temp file before diffing so neither psql version nor
pre-existing baseline format causes spurious failures.

* v3 mysql: fetch expression index columns via secondary query

COALESCE(column_name, expression) in list-all-indexes.sql broke
MariaDB and MySQL 5.7 which lack the EXPRESSION column. Revert the
SQL to column_name-only.

Instead, when cols is NIL (expression/functional index), issue a
secondary targeted query against information_schema.statistics.EXPRESSION
(MySQL 8.0+ only). The query is wrapped in handler-case so MariaDB
and older MySQL silently skip expression indexes rather than crashing.

Also translate MySQL backtick-quoted identifiers to double-quotes in
all index columns (no-op for plain names, correct for expressions like
lower(`email`) → lower("email")).

* regress: remove whitespace normalization; fix hand-written pgsql expected files

The normalization was a workaround for expected files that were authored by
hand rather than generated by 'make update-expected' inside Docker.  Same
Docker image = same psql 14 = deterministic output; no normalization needed.

psql 14 on Ubuntu Jammy (eclipse-temurin:21-jre-jammy) pads column headers
to full display width including the right border space (e.g. ' schema_name '
not ' schema_name').  All pre-existing expected files were generated inside
Docker and already match this format.  The two pgsql expected files were
written by hand and were missing the trailing space — corrected here.

* v4: parallel index building via ExecutorService

Each table's indexes are submitted to a fixed thread pool as soon as
that table's COPY finishes, so index builds overlap with subsequent
table copies — matching v3 behavior.

Key changes in core.clj:
- Create idx-executor (Executors/newFixedThreadPool) sized to
  max-parallel-create-index or total index count (v3 default)
- After each copy-table call, submit per-index futures to executor
- schema-only path: submit all indexes after DDL (no COPY ran)
- After Phase 2, shutdown + awaitTermination on the executor
- Remove the old sequential Phase 3 create-indexes doseq

Each index future opens its own PostgreSQL connection (same as v3's
pgsql-connect-and-execute-with-timing), executes one CREATE INDEX
statement, then closes it. exec-post-ddl! handles errors internally
so futures always complete normally.

* feat: add table-level worker parallelism (workers setting)

Mirrors v3 behavior: each table is copied by a dedicated virtual thread,
with a pool capped at  (default 4 for DB sources, 8 for file
sources). Each worker creates its own source connection and PostgreSQL
connection so they don't share state.

Index futures are submitted to the shared idx-executor as each worker
completes its COPY, matching the v3 interleaving of copy and index
phases.

* fix: apply PG SET params to each worker connection

Worker connections are created fresh (one per table), so the SET
parameters from the LOAD command (e.g. DateStyle) must be applied
to each worker-pg connection, mirroring what is done for the shared
pg-conn at the start of the run.

* feat: multiple readers per table for MySQL and PostgreSQL sources

Add opt-in concurrent reads per table via PK range splitting (MySQL) or
ctid range scans (PostgreSQL >= 14).  The same 1-reader-1-writer
pipeline architecture used for normal tables applies to each partition;
partitions of the same table run concurrently inside a virtual-thread
executor while the table's workers-pool slot waits for all of them.

Options (reuse v3 names where applicable):
  WITH multiple readers per thread   -- opt-in, default OFF
  WITH single reader                 -- explicit reset to single reader
  WITH concurrency = N               -- number of parallel readers per table (default 1)
  WITH rows per range = N            -- explicit rows per chunk (MySQL; overrides chunk size)
  WITH chunk size = 50 MB            -- target bytes per chunk (new; default 50 MB when enabled)

Chunk size calculation:
  MySQL:    rows_per_chunk = chunk_bytes / AVG_ROW_LENGTH (information_schema.TABLES)
  PostgreSQL: pages_per_chunk = chunk_bytes / 8192 (relpages from pg_class)

Partitioning falls back to single reader when:
  - concurrency <= 1
  - table has no single-column integer PK (MySQL)
  - PostgreSQL server < 14 (ctid range scans require PG 14+)
  - table has fewer than 2 range chunks (too small to benefit)
  - partition-source query fails for any reason

Implementation across 14 files:
  grammar.clj / ast.clj: chunk-size option with unit-aware parsing
    (50 MB -> 52428800 bytes); multiple-readers / single-reader now
    correctly map to {:multiple-readers true/false}
  protocol.clj: partition-source method added to Source protocol
  mysql.clj: refactored read-rows to use stream-ranges! helper that
    executes range queries sequentially on the same connection; added
    mysql-partition-source that splits PK range across N readers
  pgsql.clj: read-rows appends ctid WHERE clause when ctid-lo/hi set;
    pgsql-partition-source uses pg_class.relpages to size chunks
  mysql.sql / pgsql.sql: new AVG_ROW_LENGTH, relpages, and
    server_version_num queries
  core.clj: workers dispatch checks multiple-readers? + concurrency;
    partition-source called per table; N virtual-thread copy-table
    calls run in parallel with result merge for stats

* chore: remove stray files accidentally staged in previous commit

* refactor: move inline SQL into named HugSQL .sql files

- target/pgsql.sql (new): table-exists, table-columns, table-oid
  loaded in core.clj; replaces three jdbc/execute-one! inline queries
- ddl/citus.sql (new): exec-create-reference-table, exec-create-distributed-table
  loaded in citus.clj; distribute-sql replaced by execute-distribute!/conn
- source/mysql.sql: add table-pk-min-max (:sql: substitution for backtick identifiers)
  replaces two inline MIN/MAX jdbc queries in mysql-partition-source
- source/pgsql.sql: add ctid-where :snip (def-sqlvec-fns)
  replaces inline WHERE ctid string concat in PGSQLSource.read-rows
- clojure/.gitignore: anchor target/ to repo root so src/pgloader/target/ is not excluded

* docs: update parallelism docs for v4 architecture

batches.rst: rewrite the 'A Note About Parallelism' section to reflect v4's
  architecture — workers = simultaneous tables, concurrency = parallel readers
  per table (1:1 reader→writer pairs), chunk size = range sizing method.
  Removed the v3-specific '1 reader + N writers per table' pipeline description.

command.rst: add 'chunk size' to the global WITH options list.

docs/ref/pgsql.rst: add 'single/multiple readers per thread', 'chunk size',
  and a corrected 'rows per range' entry to the PostgreSQL source WITH options.
  The previous 'rows per range' entry had a dangling 'see above' reference
  pointing to nothing (the reader options were never documented here).

* ci: publish evergreen v4-dev release after each green push

- build.clj: version now read from PGLOADER_VERSION env var (fallback 4.0.0-SNAPSHOT)
- clojure-integration-tests.yml: add publish-dev job that runs after all tests pass
  on push events; overwrites assets on the fixed v4-dev pre-release tag so that
  .../releases/download/v4-dev/pgloader.jar always points to latest green HEAD.
  Also uploads a versioned copy pgloader-4.0.0-<sha>.jar for pinning.

* docs,ci: v4 install section + smoke-test workflow

docs/install.rst:
- Fix 'pgcopydb' typo in opening sentence
- Add 'pgloader v4 — Clojure/JVM (recommended)' section at the top with
  the stable v4-dev JAR download URL and one-liner usage examples
- Relabel existing Debian/RPM/Docker/build-from-source sections as v3
- Fix RST heading hierarchy (sub-sub-headings under 'Build from sources (v3)'
  changed from ^^^ to """ to avoid Sphinx level-conflict warnings)

.github/workflows/smoke-test.yml (new):
- Triggered on release:prereleased (fires when publish-dev updates v4-dev)
  and on workflow_dispatch for manual runs
- Downloads the published pgloader.jar from the v4-dev GitHub release
- Runs java -jar pgloader.jar --version to confirm the JAR starts
- Migrates the Chinook SQLite sample database into a PostgreSQL service
  using the Mustache-templated load file
- Validates exact row counts (tracks=3503, albums=347, customers=59,
  invoices=412) with a PL/pgSQL ASSERT block
- Uploads pgloader.log as an artifact on failure

clojure/tests/sqlite/smoke/chinook.load (new):
- Mustache-templated load file for the smoke test; SQLite path injected
  via CHINOOK_SQLITE env var so it works on the GHA runner without Docker

* docs: update GITHUB-ISSUES.md before merge

Comprehensive scan of all 200 open issues and 40+ open PRs on the repo
against v4 fix status. Major additions vs previous version:

- FULLTEXT → GIN tsvector: #1230 now ✅ (was 🔧)
- MySQL/MariaDB: add #1539, #1570, #1572, #1592, #1617, #1641, #1653,
  #1654, #1661
- PostgreSQL target: add #1461, #1490, #1576, #1600, #1693
- pgsql→pgsql: add #1419, #1550, #1556, #1601
- SQLite: add #1450, #1451, #1472, #1515, #1517, #1523, #1543, #1547,
  #1552, #1607, #1631, #1665, #1687
- MSSQL: add #304, #1445, #1454, #1551, #1582, #1586, #1590, #1597,
  #1627, #1630, #1669
- CLI: add #1463, #1475, #1476, #1682
- Parallelism: add #1405, #1487, #1520, #1583, #1589, #1623, #1648, #1680
- Connection: add #1480, #1507, #1562, #1579, #1636, #1685
- Summary: add #1561
- Open PRs superseded by v4: #1662, #1652, #1596, #1595, #1594, #1531,
  #1509, #1321
- Known gaps section updated

* fix(smoke-test): use URI one-liner instead of Mustache load file

Mustache template expansion ({{VAR}} in .load files) is not implemented
in v4. The smoke test was using a load file with {{CHINOOK_SQLITE}} which
would have failed at runtime. Switch to the URI one-liner form which
needs no templating and exercises the same code path.

* feat: expand {{VAR}} placeholders in .load files from OS environment

Implements the same env-var substitution that v3 provides via its
Mustache template support. Before the file content is handed to the
parser, every {{VAR}} token is replaced with the value of the OS
environment variable VAR. An undefined variable produces a clear
Template error rather than a confusing parse failure.

Implementation: clojure.string/replace with a single regex pass in
expand-env (parser.clj). No new dependencies.

Tests: 3 new assertions in parser_test covering successful expansion,
multiple placeholders in one string, undefined-var error, and
pass-through of strings with no placeholders.

* docs: mark Mustache templating as implemented in GITHUB-ISSUES

* feat: skip generated columns in MySQL and SQLite sources

MySQL: filter columns where EXTRA contains 'generated' (VIRTUAL GENERATED
or STORED GENERATED). These columns cannot be written via INSERT/COPY.

SQLite: parse the raw CREATE TABLE SQL from sqlite_master to detect
GENERATED ALWAYS AS columns, then exclude them from the catalog before
the schema is introspected.

* feat: MATERIALIZE VIEWS named list adds views to catalog

Named views WITHOUT an explicit SQL definition are now added to the
catalog so the normal DDL + COPY pipeline creates the target table and
copies the data — same pipeline as MATERIALIZE ALL VIEWS but filtered
to just the named views.

Named views WITH an explicit SQL definition (name AS $$ sql $$) are
still handled by materialize-views!, which runs the user SQL against the
source and writes to an already-existing target table (created via
BEFORE LOAD DO, as in the db789 test fixture).

* feat: migrate MSSQL MS_Description comments to PostgreSQL COMMENT ON

Query sys.extended_properties for MS_Description values (minor_id=0
for the table, minor_id>0 for columns) and store them in :table-comment
and :column-comment on catalog entries.

The existing create-table-sql in ddl/common.clj already emits
COMMENT ON TABLE / COMMENT ON COLUMN for any non-empty comment values,
so no further changes are needed.

* docs: update GITHUB-ISSUES.md — mark generated columns, matviews, MSSQL comments as fixed

* feat: recreate generated columns as GENERATED ALWAYS AS on target PostgreSQL

MySQL VIRTUAL GENERATED and STORED GENERATED columns, and SQLite
GENERATED ALWAYS AS columns, are now reproduced on the target as
PostgreSQL GENERATED ALWAYS AS (expr) STORED columns.

Changes:
- mysql.sql: add GENERATION_EXPRESSION to the columns query
- mysql.clj: populate :generated-expression from GENERATION_EXPRESSION;
  translate backtick-quoted identifiers to double-quoted ("col")
- sqlite.clj: replace generated-column-names (filter-out) with
  generated-column-expressions (extract expression from sqlite_master);
  balanced-paren extractor handles nested expressions
- ddl/common.clj: column-def emits GENERATED ALWAYS AS (expr) STORED
  when :generated-expression is set; skips NOT NULL and DEFAULT for
  generated columns (they are auto-computed by the engine)
- core.clj: filter :generated-expression columns from the COPY table-spec
  so they are excluded from SELECT and COPY (PostgreSQL rejects inserts
  into generated columns)

PostgreSQL only supports STORED generated columns; MySQL VIRTUAL columns
are migrated as STORED since PostgreSQL has no VIRTUAL equivalent.
Expressions are passed through as-is; complex expressions using
MySQL-specific functions may need manual adjustment after migration.

Closes #1442, closes #1687

* docs: update GITHUB-ISSUES — generated columns now recreated on target (not just skipped)

* docs: Citus FK backfill is fully implemented — close #1195, remove from known gaps

* feat: JDBC URI compatibility for all database sources

Accept jdbc: URLs everywhere a source/target connection string is
expected, alongside the existing pgloader-native URI schemes.

Grammar:
- source-uri matches jdbc:[^\s]+ so jdbc:mysql://, jdbc:sqlserver://
  (including semicolon parameters) and other JDBC URLs are accepted
- pg-uri matches jdbc:postgresql:// and jdbc:pgsql://

parse-uri:
- Always synthesises :jdbc-url in the returned map
- For jdbc:sqlserver:// uses a custom semicolon-param parser since
  the MSSQL JDBC driver uses ; instead of ? for connection properties
- For jdbc:postgresql://, jdbc:mysql://, etc. strips the jdbc: prefix,
  parses the inner URI for metadata (host/port/db/user/password), and
  stores the full original JDBC URL as :jdbc-url
- For native mysql://?useSSL=false, mssql://host/db, postgresql://...
  the query string is now preserved in the synthesised jdbc-url so
  query parameters actually reach the JDBC driver

Connection functions (mysql.clj, mssql.clj, pgsql.clj, core.clj):
- All use :jdbc-url from the uri-map when present, falling back to
  constructing the URL from parsed components
- mysql.clj switched from MysqlDataSource to DriverManager/getConnection
  so MySQL JDBC query parameters (useSSL, allowPublicKeyRetrieval,
  serverTimezone, …) are now correctly passed to the driver

Test suite:
- mssql.load updated to use jdbc:sqlserver:// URL demonstrating that
  special characters in passwords need no URL-encoding
- MySQL test files keep mysql://…?useSSL=false form which now correctly
  reaches the driver via the synthesised jdbc-url
- New ast_test.clj covers mysql://?useSSL=false jdbc-url synthesis,
  jdbc:mysql:// passthrough, jdbc:postgresql:// passthrough,
  jdbc:sqlserver:// semicolon parsing, mssql:// jdbc-url synthesis

Docs:
- docs/command.rst: Connection String section now covers both formats
  with JDBC URL examples for all four source types
- docs/ref/mysql.rst: documents JDBC URL form and MySQL Connector/J
  property passthrough
- docs/ref/mssql.rst: updated for MSSQL JDBC driver (replaces FreeTDS
  reference), documents jdbc:sqlserver:// form with Azure SQL example

Fixes MySQL authentication issues caused by useSSL=false being silently
ignored when using MysqlDataSource; enables Azure SQL connectivity via
JDBC URL without needing mssql:// URI workarounds.

* fix: revert mssql.load to mssql:// and drop MSSQL JDBC URL parser

Two issues with the previous commit:

1. mssql.load is shared between v3 and v4 test runners. v3's Common Lisp
   ESRAP parser does not understand jdbc:sqlserver:// URLs, causing the
   mssql (v3) CI job to fail with ESRAP-PARSE-ERROR. Reverted back to
   the mssql:// form that both parsers accept.

2. parse-mssql-jdbc-url was unnecessarily parsing the MSSQL JDBC URL's
   semicolon-delimited connection properties to extract host/port/db/user/
   password. This was wrong in principle: the MSSQL JDBC driver owns the
   parsing of its own URL format. Credentials embedded in the URL
   (user=sa;password=...) should not be re-extracted and re-passed via
   Properties (Properties take precedence over URL in the MSSQL driver,
   which would break if we ever passed wrong values). Replaced with a
   single-line passthrough: {:type :mssql :jdbc-url uri-str}.

The mssql.clj connection function already guards with (when (:user ...)),
so passing nil user/password when a JDBC URL is used is correct — the
driver reads credentials from the URL itself.

* test: validate synthesised jdbc-urls against actual JDBC drivers

DriverManager/getDriver calls Driver.acceptsURL() on every registered
driver without opening a connection. Covers all three drivers on the
classpath (MySQL Connector/J, PostgreSQL JDBC, mssql-jdbc) and both
path variants for each:

- native URI  → synthesised jdbc-url  (mysql://, postgresql://, mssql://)
- jdbc: prefix → passthrough jdbc-url (jdbc:mysql://, jdbc:postgresql://, jdbc:sqlserver://)

Catches malformed synthesised URLs (wrong scheme prefix, mis-placed
databaseName=, invalid sqlserver:// semicolon format) that the
round-trip string tests in the same file cannot detect.

* fix: add --logfile FILE flag to CLI

The smoke-test workflow passes --logfile /tmp/pgloader.log to the JAR.
Without this flag the argument fell through to the positional parser,
was treated as a third positional arg, and triggered 'Unknown arg'.

Implementation: at logging-configure time, if --logfile is set, attach
a plain FileAppender (PatternLayoutEncoder, append=false) to the root
Logback logger. All log events flow to both the console and the file,
matching the behaviour of the FILE appender in logback.xml but writing
to the user-specified path instead of /tmp/pgloader/pgloader.log.

* feat: implement --on-error-stop, --dry-run, --list-encodings, --client-min-messages, --log-min-messages; deprecate v3-only flags

--on-error-stop
  copy/*on-error-stop* was already declared in copy.clj but never read.
  Now wired: CLI sets the binding; the per-table catch block in core.clj
  re-throws after logging 'ON ERROR STOP', which propagates out of the
  worker Future and aborts the remaining tables in the pool.

--dry-run
  New copy/*dry-run* dynamic var in copy.clj. When set, core.clj prints
  'DRY RUN — skipping COPY for N tables' and skips the entire Phase 2
  (worker pool + COPY). DDL (CREATE TABLE, indexes) still runs so the
  schema is validated. Summary is printed normally.

--list-encodings
  Prints all charset names from java.nio.charset.Charset/availableCharsets
  (sorted) and exits. 173 charsets on a typical JVM 21.

--client-min-messages LEVEL
  Sets a ThresholdFilter on the CONSOLE Logback appender so console
  output can be quieter than the root logger level (e.g. suppress INFO
  on console while keeping DEBUG in the logfile).

--log-min-messages LEVEL
  Same but for the FILE appender defined in logback.xml.

Both --client-min-messages and --log-min-messages accept the Logback
level names: trace, debug, info, warn, error (case-insensitive).

Deprecated flags (exit 1 with a clear migration message):
  --upgrade-config        v2 INI converter; INI not supported in v4
  --self-upgrade          Lisp hot-reload; not applicable to JAR
  --load-lisp-file / -l   Lisp extensions; contact maintainers if needed
  --context / -C          INI variable file; use env vars / {{VAR}} instead
  --no-ssl-cert-verification  use ?sslmode=disable in the URI instead

Help text updated: all new flags documented, deprecated section added,
more source URI examples including JDBC URL forms.

* docs: annotate pgloader.rst CLI options with v4 changes and deprecations

For each option that behaves differently in v4, or was removed, a
'.. note:: v4 ...' callout is added immediately after the existing
description text. Original prose is preserved verbatim.

Annotated options:
- --with-encodings → renamed --list-encodings in v4; note explains
- --upgrade-config → removed; migration advice given
- --logfile → v4 adds a second sink, does not replace console
- --log-min-messages / --client-min-messages → level name mapping
  (critical→error, log→debug, notice→info, data→trace); v4 uses
  Logback names; --client-min-messages sets a ThresholdFilter so
  console and file levels are independently controlled
- --load-lisp-file → deprecated in v4; built-in transforms cover
  common cases; contact maintainers for custom extension needs
- --dry-run → fully implemented in v4; DDL runs, COPY skipped
- --on-error-stop → fully implemented in v4; re-raises from worker
  thread to abort remaining tables
- --self-upgrade → removed; replace the JAR to upgrade
- --no-ssl-cert-verification → removed; use ?sslmode= in URI

Also added a note under Source Connection String documenting the JDBC
URL forms accepted in v4 (jdbc:mysql://, jdbc:postgresql://, etc.).

* docs: trim 'fully implemented' v4 notes; keep only behaviour surprises

* fix: dry-run checks connections only — no catalog fetch, DDL, or COPY

Match v3 behaviour: when --dry-run is set, pgloader opens and immediately
closes both the source and target connections, logs 'connections OK', and
exits. No catalog is fetched, no DDL is executed, no data is copied.

Previously v4 ran the full DDL phase (CREATE TABLE, indexes) and only
skipped the COPY phase, which defeats the purpose of a dry run.

* docs: fix --with-encodings phantom name; remove dry-run note

The option has always been --list-encodings in both v3 and v4.
--with-encodings never existed. Also drop the --dry-run v4 note
since the behaviour matches v3 exactly.

* docs: add #1584 #1675 #1678 to GITHUB-ISSUES as fixed by JDBC URI support

* fix: interrupt reader thread on writer failure to prevent deadlock

When the writer virtual thread exits with an exception, the queue is no
longer drained. The reader is likely blocked on .put (queue capacity 4),
and the subsequent (.join reader) call blocks forever — a deadlock.

Fix: interrupt the reader thread immediately when the writer catch fires.
The reader's .put then throws InterruptedException, which is caught
separately and suppressed (the writer exception is the real one, stored
in exc-holder and re-thrown after the join). The interrupted flag is
restored on the reader thread so its interrupted state is visible to
callers of .join.

The reader-fails → writer-unblocked path was already correct:
reader catches its own exception, calls .offer :end-of-data, and sets
done=true, which causes the writer's .take to return :end-of-data and
exit cleanly.

* fix: close race in get-table-lock where two threads could get different lock instances

The old code read from the atom, then conditionally wrote to it — the
check and the insert were not atomic:

  (or (get @locks key)         ; thread A reads: key absent
      (let [lock (ReentrantLock.)]
        (swap! locks assoc key lock)  ; both threads swap, one wins
        lock))                 ; both return their *own* local lock

Both threads pass the (or) nil check, both swap!, one lock ends up in
the map but both callers hold different instances — the mutex is broken.

Fix: move the check inside the swap! update function so the entire
check-and-insert is a single CAS. swap! may retry under contention but
the update fn is pure. The canonical lock is whatever ends up in the
map, and get on the swap! return value retrieves that same instance for
all callers.

* style: add cljfmt formatter; reformat all source files

- deps.edn: add :cljfmt alias (dev.weavejester/cljfmt 0.13.1)
- Reformat 28 source and test files with 'clojure -M:cljfmt fix src test'
- CI: add 'format' job (cljfmt check) that runs before 'unit'; unit now
  needs: [format] so a formatting failure blocks the test run
- .git/hooks/pre-commit: run cljfmt check on commit when any .clj file
  is staged; prints fix command on failure; skippable with --no-verify

All 124 unit tests pass after reformatting.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants