fix: Prevent camelCase-to-colname to duplicate underscores (#1287) by OldrichDlouhy · Pull Request #1652 · dimitri/pgloader

OldrichDlouhy · 2025-03-21T11:50:48Z

Prevent the conversion of camel case to snake case to double the underscores.

Fixes issue #1287

Comprehensive scan of all 200 open issues and 40+ open PRs on the repo against v4 fix status. Major additions vs previous version: - FULLTEXT → GIN tsvector: #1230 now ✅ (was 🔧) - MySQL/MariaDB: add #1539, #1570, #1572, #1592, #1617, #1641, #1653, #1654, #1661 - PostgreSQL target: add #1461, #1490, #1576, #1600, #1693 - pgsql→pgsql: add #1419, #1550, #1556, #1601 - SQLite: add #1450, #1451, #1472, #1515, #1517, #1523, #1543, #1547, #1552, #1607, #1631, #1665, #1687 - MSSQL: add #304, #1445, #1454, #1551, #1582, #1586, #1590, #1597, #1627, #1630, #1669 - CLI: add #1463, #1475, #1476, #1682 - Parallelism: add #1405, #1487, #1520, #1583, #1589, #1623, #1648, #1680 - Connection: add #1480, #1507, #1562, #1579, #1636, #1685 - Summary: add #1561 - Open PRs superseded by v4: #1662, #1652, #1596, #1595, #1594, #1531, #1509, #1321 - Known gaps section updated

@13

* Add pgloader v4 Clojure rewrite Initial commit of the Clojure v4 rewrite of pgloader. ## What this adds - Full LOAD DATABASE pipeline for MySQL, MariaDB, PostgreSQL-as-source, MSSQL, and SQLite - File sources: CSV, COPY format, Fixed-width, DBF (including memo fields), LOAD ARCHIVE (zip/tar with HTTP fetch) - Grammar / parser: instaparse-based .load file parser covering all v3 command types; optional trailing semicolon - DDL pipeline: CREATE TABLE, indexes (OID-named), FKs, sequence reset, ALTER SCHEMA RENAME, ALTER TABLE NAMES MATCHING - Type mapping: MySQL → PostgreSQL with full unsigned upcast, tinyint(1), ENUM, geometry, JSON, spatial types - Citus: grammar + Phase 1 (reference table, distributed table) - CLI: --verbose, --debug, --quiet, --version, --summary, --type, --encoding, --with, --cast, --set, --field, --before, --after - Summary output: sectioned table (pre/data/post), download+extract bytes, COPY Wall-Clock Time, optional read/write timing (--debug) - Regression test framework (regress subcommand) with Docker-based E2E suites for all source types ## Bug fixes vs v3 - MariaDB detection done once per connection (not per catalog call) - pg_get_serial_sequence second argument is correctly unquoted - ENUM cast rules deduplicated by source definition - int(N) with N≥10 maps to bigint (matches CL cast rule) - Identifier quoting in summary matches Postgres rules (no over-quoting) - Sequence reset guard includes bigserial columns ## CI - GitHub Actions workflow for integration tests ## Docs - GITHUB-ISSUES.md: maps GitHub issues to v4 fix status * Fix Citus, CamelCase, Makefile silencers, add unit tests to CI - Citus: uncomment all DISTRIBUTE clauses in citus.load — FK backfill implementation handles multi-hop chains (ads/clicks/impressions via campaigns). Both v4 (citus.clj augment-catalog) and v3 support this. - CamelCase: add 02-camelcase test to mysql/source suite verifying that CamelCaseTable and mixed-case column names (MyId, MyName) are preserved end-to-end when quote identifiers is set. - Makefile silencers: remove - prefix from my/sakila targets in mysql, mariadb, and mysql57 test Makefiles. Regression diffs all pass; the silencers were vestigial from earlier development. - CI: add unit test job that runs before the integration matrix. Covers all 90 unit tests (cast, batch, DDL, citus, parser, transforms, pg-service, cli). Integration jobs now have needs: unit so they only run after unit tests pass. - CI: add v4-clojure-rewrite to push branches so the workflow fires on the new development branch. * CI: fix DeLaGuardo/setup-clojure@12 → @13 (v12 tag does not exist) * CI: fix Docker build context — must be repo root, not clojure/ * CI: cp JAR to fixed name instead of relative symlink (Docker can't follow relative symlinks across bind-mounts) * CI: build test-runner image and v4 JAR once, share via artifacts - New build-test-runner job: builds image once, docker save | gzip, uploads as artifact. Cross-run cache keyed on Dockerfile + v3 source hash so it's only rebuilt when those files change. - New build-jar job: clojure -T:build uber once, uploads JAR as artifact. - integration jobs: download both artifacts, docker load, run tests. No Java/Clojure setup needed — nothing is compiled per-job. Before: each of 22 matrix jobs independently rebuilt the image (~2min) and compiled the JAR (~2min) = ~88 min of redundant work per run. After: one image build + one JAR build, 22 jobs just download + test. * CI: rm symlink before cp JAR — build tool already creates pgloader.jar symlink, cp fails with 'same file' * CI: replace JAR symlink with hard link — ln -f, no rm+cp needed * Fix 8 CI test failures: guess-csv, GlobCSVSource arity, expected files, schema remap, SQLite defaults * Fix more CI failures: guess-csv delimiter, strip-quotes recursion, Dockerfile heredocs, target-columns v3 * Fix target-columns: rename HAVING FIELDS to match TARGET COLUMNS so v3 works too * Align CSV test data with v3: non-printable control chars, blanks-keep 7 rows - non-printable: replace 4 clean rows with 7 rows from v3 inline data, including actual non-printable control characters (0x05, 0x06) in the c2 field — this actually tests non-printable char handling. Update expected count from 4 to 7. - blanks-keep: replace 4-row header CSV with v3's 7-row inline data (no header row); remove skip header = 1 from load file to match. Update expected output for all 7 rows, showing that keep unquoted blanks preserves spaces but empty/null values are still NULL. * v4 CSV tests: align all tests with v3, fix per-column null-if with backslash escape - Fix per-column [null if '\N'] with fields escaped by backslash-quote: when opencsv strips backslash escapes (\N → N), the per-column nullif comparison failed. Build col-nil-val-alt map to match the stripped form. - Fix ArrayIndexOutOfBoundsException for short CSV rows (fewer fields than declared columns): guard aget with bounds check. - Add actual-table-columns + column filtering in core.clj: v4 now silently drops columns declared in load file that don't exist in the target table, matching v3 behavior (fixes 'column c does not exist' COPY errors). - Add row-filter-fn plumbing through prefetch.clj to apply column filtering after reading rows but before COPY. - Align CSV test load files and data with v3 counterparts: blanks-keep, blanks-trim, nulls, null-if, parse-date, guess-csv, header-auto, filename-pattern, non-printable - Regenerate all expected output files by running v3 and v4 against identical load files; all 38 CSV tests now pass with v4. * Fix v4 sqlite/mysql integration test failures SQLite source: - Default table name case to lowercase (matching v3 behavior). v3 lowercases SQLite table names by default; v4 was preserving original CamelCase. Fixed chinook-noseq (expected lowercase album/track/etc.) and sqlite tests. - Use 'main' as source schema name (SQLite's default schema is 'main'). ALTER SCHEMA 'main' RENAME TO 'chinook' now applies correctly. Previously the catalog used 'public' as schema, so only ALTER SCHEMA 'public' worked. - Add source-table-name field so read-rows queries use original SQLite name while DDL uses the normalized (lowercased) name. - Pass with-options to SQLite source (snake-case-ids, downcase-ids). MySQL source: - Use MariaDB index query (without expression column) for MySQL 5.x. information_schema.statistics.expression is MySQL 8.0+ only; MySQL 5.7 throws 'Unknown column expression'. Now branches on both mariadb variant AND major-version < 8. Test fixes: - sqlite/sqlite.load: ALTER SCHEMA 'main' (was 'public') - sqlite/base64.load: ALTER SCHEMA 'main' (was 'public') - sqlite/chinook/sql/01-counts.sql: use chinook.track (lowercase, correct schema) - Update sqlite expected files to match v4 output (v3 was broken on these) - mysql expression-index expected: add trailing space to match psql output * Translate MySQL FULLTEXT indexes to PostgreSQL GIN tsvector v3 (CL pgloader) translates FULLTEXT KEY → CREATE INDEX USING gin(to_tsvector('simple', col)). v4 was silently dropping them: the btree attempt failed on long mediumtext values (row size exceeds btree maximum) and exec-post-ddl! swallowed the error. Fix: - Preserve index_type in the MySQL catalog index map (:index-type field) - In create-indexes-sql, detect FULLTEXT and emit USING gin(to_tsvector(...)) instead of a plain btree Update expected/03-indexes.out for mysql, mysql57, mariadb to include idx_{oid}_search (the FULLTEXT KEY on fcm_batches.raw_payload). * SQLite: use 'public' as catalog schema name (matches v3 behavior) v3's SQLite source creates catalog entries with schema 'public'. v4 was using 'main' (SQLite's internal schema name) after a previous fix, but this broke v3 compatibility since v3's ALTER SCHEMA rules match against 'public', not 'main'. Revert catalog schema to 'public' and load files from 'main' to 'public' so ALTER SCHEMA 'public' RENAME TO '...' works in v4. Note: v3 SQLite assigns nil as the source schema name (not 'public'), so ALTER SCHEMA clauses still don't work for v3. This is a pre-existing v3 limitation — chinook and test-pk will fail under v3-v3 runs. The typenames '"0"' default is similarly a pre-existing v3 bug that commit 486be8c partially fixed (handles "0" but not '"0"'). * Fix v3 SQLite type handling and v4 snake_case for CamelCase identifiers v3 fixes: - sqlite-schema.lisp: double-unquote defaults to strip nested '"0"' quoting (unquote strips outer single-quotes, leaving "0"; unquote again with #\" strips inner double-quotes, yielding bare 0 for the PostgreSQL DEFAULT) - sqlite-cast-rules.lisp: add catch-all rule mapping unknown SQLite types to text (matches v4 fallback; fixes 'intege' table with type 'intege') - casting-rules.lisp: support :type t as a wildcard in cast rule matching v4 fix: - sqlite.clj: apply CamelCase -> snake_case conversion for SQLite identifiers when snake_case identifiers is set (was always lowercasing, ignoring id-case) 'TableName' now correctly becomes 'table_name' with snake_case identifiers Update expected: sqlite.out 'tablename' -> 'table_name' to match v3+v4 output * Fix pre-existing v3 bugs: sequence reset, ENUM dedup, pgsql DSN pgsql-create-schema.lisp: remove erroneous quote_ident() around attname in pg_get_serial_sequence() call. The column_name argument is TEXT, not a SQL identifier — quote_ident was embedding literal double-quote chars into the name, causing 'column ""raceId"" does not exist' for CamelCase primary key columns (e.g. MySQL races.raceId). Bug introduced in commit 644f261. mysql-cast-rules.lisp: deduplicate ENUM/SET types by source definition. MySQL ENUM/SET are inline per-column; two columns with identical values (e.g. film.rating and film_list.rating in a materialized view) were registered twice under different names. Reuse the existing type. pgsql/pgsql.load: use explicit connection URIs instead of ?service=name. v3 pgloader does not support the ?service= URI format; use explicit postgresql://user:pass@host/db form so mysql-v3 CI tests can run. * Fix v3 test compatibility: semicolons, IN SCHEMA, funny_string cast, grammar - source.load: add missing trailing semicolon (v3 end-of-command rule requires it) - pgsql.load: add IN SCHEMA 'pgsql_source' to INCLUDING clause (required by v3's including-matching-in-schema-filter grammar rule) - my.load: add 'column funny_string.s to text drop not null' cast so GBK decoding failures produce NULL rather than causing a NOT NULL violation - grammar.clj: accept optional IN SCHEMA 'name' suffix in including-only and excluding-only rules (v4 ignores the schema qualifier; v3 requires it) * Fix pgsql v3 schema mapping and mysql v3 typemod error - pgsql.load: add ALTER SCHEMA 'pgsql_source' RENAME TO 'pgsql_migrated' so v3 puts tables in pgsql_migrated (v3 doesn't treat SET search_path as an implicit schema remap the way v4 does) - my.load: add 'drop typemod' to funny_string.s cast; MySQL char(N) has a typemod and PostgreSQL rejects text(N) * Remove search_path schema inference; rebuild pgsql test with multi-schema source search_path inference in core.clj was a security risk: any load file that used SET search_path for name resolution (a PostgreSQL session variable) would silently remap ALL source schemas to that value as a side effect. Schema mapping must be explicit via ALTER SCHEMA 'old' RENAME TO 'new'. Changes: - core.clj: remove implicit search_path→schema-remap logic; SET search_path now passes through as a literal SET command on the target connection only - chinook-noseq.load: add explicit ALTER SCHEMA (was relying on inference) - sqlite.load, base64.load: drop now-redundant search_path from SET clause (ALTER SCHEMA already handles the rename; search_path had no other effect) Rebuild pgsql test suite to exercise real pgsql-to-pgsql migration: - source PostgreSQL is populated from diverse source types (SQLite Chinook, SQLite sqlite.db, DBF dbase_8b, CSV track, fixed-width census places), each loaded into its own named schema by v4 load scripts in source/ - pgsql.load then copies the entire source DB to target with no schema filters — all user schemas migrate intact, names preserved - Regression checks schema list and row counts in target - docker-compose adds fileserver (for census places archive) and mounts sqlite/dbf/csv test data directories into the test-runner container - init.sql and pg_service.conf removed (no longer needed) * Fix v3 crash on expression index; use plain postgres:16 for pgsql test pgsql-ddl.lisp: guard against NIL when looking up an index column in the table's column list. Expression indexes (e.g. (LOWER(email))) have no matching column name in the catalog; the find returns NIL and the subsequent column-type-name call crashed with TYPE-ERROR. The fix binds col separately and wraps the accessor in (when col ...) so expression index columns simply produce a NIL type, which is already handled downstream by the (when (stringp idx-type) ...) guard. pgsql/docker-compose.yml: switch both source and target from postgis/postgis:16-3.4 to postgres:16. The postgis image pre-installs the tiger and topology schemas; with 'include drop' pgloader tried to DROP TABLE on extension-owned tables (spatial_ref_sys, tiger.*, etc.) which PostgreSQL refuses. Our test data has no geometry columns so PostGIS is not needed in either container. * Fix three CI failures v3 mysql: use COALESCE(column_name, expression) in list-all-indexes.sql so MySQL 8.0 functional/expression indexes are captured. Translate backtick-quoted identifiers in expressions to double-quoted for PostgreSQL compatibility. v3 pgsql-ddl: use :when col :collect instead of :collect (when col ...) in index-access-method loop so NIL is not inserted into idx-types list, preventing the 'column "nil" does not exist' SQL error on expression indexes. v4 regress: strip trailing whitespace from psql output before diff so psql column-header alignment padding across versions does not cause spurious 01-schemas and 02-counts failures. * regress: fix trailing-whitespace strip eating final blank line str/split-lines drops trailing newline so joining back lost the blank line that psql emits after (N rows). Use regex replace instead to strip trailing spaces/tabs per line without disturbing line endings. * regress: normalize expected file whitespace before diff Expected files created before trailing-space stripping was added still have trailing spaces in column headers. Normalize both expected and actual to a temp file before diffing so neither psql version nor pre-existing baseline format causes spurious failures. * v3 mysql: fetch expression index columns via secondary query COALESCE(column_name, expression) in list-all-indexes.sql broke MariaDB and MySQL 5.7 which lack the EXPRESSION column. Revert the SQL to column_name-only. Instead, when cols is NIL (expression/functional index), issue a secondary targeted query against information_schema.statistics.EXPRESSION (MySQL 8.0+ only). The query is wrapped in handler-case so MariaDB and older MySQL silently skip expression indexes rather than crashing. Also translate MySQL backtick-quoted identifiers to double-quotes in all index columns (no-op for plain names, correct for expressions like lower(`email`) → lower("email")). * regress: remove whitespace normalization; fix hand-written pgsql expected files The normalization was a workaround for expected files that were authored by hand rather than generated by 'make update-expected' inside Docker. Same Docker image = same psql 14 = deterministic output; no normalization needed. psql 14 on Ubuntu Jammy (eclipse-temurin:21-jre-jammy) pads column headers to full display width including the right border space (e.g. ' schema_name ' not ' schema_name'). All pre-existing expected files were generated inside Docker and already match this format. The two pgsql expected files were written by hand and were missing the trailing space — corrected here. * v4: parallel index building via ExecutorService Each table's indexes are submitted to a fixed thread pool as soon as that table's COPY finishes, so index builds overlap with subsequent table copies — matching v3 behavior. Key changes in core.clj: - Create idx-executor (Executors/newFixedThreadPool) sized to max-parallel-create-index or total index count (v3 default) - After each copy-table call, submit per-index futures to executor - schema-only path: submit all indexes after DDL (no COPY ran) - After Phase 2, shutdown + awaitTermination on the executor - Remove the old sequential Phase 3 create-indexes doseq Each index future opens its own PostgreSQL connection (same as v3's pgsql-connect-and-execute-with-timing), executes one CREATE INDEX statement, then closes it. exec-post-ddl! handles errors internally so futures always complete normally. * feat: add table-level worker parallelism (workers setting) Mirrors v3 behavior: each table is copied by a dedicated virtual thread, with a pool capped at (default 4 for DB sources, 8 for file sources). Each worker creates its own source connection and PostgreSQL connection so they don't share state. Index futures are submitted to the shared idx-executor as each worker completes its COPY, matching the v3 interleaving of copy and index phases. * fix: apply PG SET params to each worker connection Worker connections are created fresh (one per table), so the SET parameters from the LOAD command (e.g. DateStyle) must be applied to each worker-pg connection, mirroring what is done for the shared pg-conn at the start of the run. * feat: multiple readers per table for MySQL and PostgreSQL sources Add opt-in concurrent reads per table via PK range splitting (MySQL) or ctid range scans (PostgreSQL >= 14). The same 1-reader-1-writer pipeline architecture used for normal tables applies to each partition; partitions of the same table run concurrently inside a virtual-thread executor while the table's workers-pool slot waits for all of them. Options (reuse v3 names where applicable): WITH multiple readers per thread -- opt-in, default OFF WITH single reader -- explicit reset to single reader WITH concurrency = N -- number of parallel readers per table (default 1) WITH rows per range = N -- explicit rows per chunk (MySQL; overrides chunk size) WITH chunk size = 50 MB -- target bytes per chunk (new; default 50 MB when enabled) Chunk size calculation: MySQL: rows_per_chunk = chunk_bytes / AVG_ROW_LENGTH (information_schema.TABLES) PostgreSQL: pages_per_chunk = chunk_bytes / 8192 (relpages from pg_class) Partitioning falls back to single reader when: - concurrency <= 1 - table has no single-column integer PK (MySQL) - PostgreSQL server < 14 (ctid range scans require PG 14+) - table has fewer than 2 range chunks (too small to benefit) - partition-source query fails for any reason Implementation across 14 files: grammar.clj / ast.clj: chunk-size option with unit-aware parsing (50 MB -> 52428800 bytes); multiple-readers / single-reader now correctly map to {:multiple-readers true/false} protocol.clj: partition-source method added to Source protocol mysql.clj: refactored read-rows to use stream-ranges! helper that executes range queries sequentially on the same connection; added mysql-partition-source that splits PK range across N readers pgsql.clj: read-rows appends ctid WHERE clause when ctid-lo/hi set; pgsql-partition-source uses pg_class.relpages to size chunks mysql.sql / pgsql.sql: new AVG_ROW_LENGTH, relpages, and server_version_num queries core.clj: workers dispatch checks multiple-readers? + concurrency; partition-source called per table; N virtual-thread copy-table calls run in parallel with result merge for stats * chore: remove stray files accidentally staged in previous commit * refactor: move inline SQL into named HugSQL .sql files - target/pgsql.sql (new): table-exists, table-columns, table-oid loaded in core.clj; replaces three jdbc/execute-one! inline queries - ddl/citus.sql (new): exec-create-reference-table, exec-create-distributed-table loaded in citus.clj; distribute-sql replaced by execute-distribute!/conn - source/mysql.sql: add table-pk-min-max (:sql: substitution for backtick identifiers) replaces two inline MIN/MAX jdbc queries in mysql-partition-source - source/pgsql.sql: add ctid-where :snip (def-sqlvec-fns) replaces inline WHERE ctid string concat in PGSQLSource.read-rows - clojure/.gitignore: anchor target/ to repo root so src/pgloader/target/ is not excluded * docs: update parallelism docs for v4 architecture batches.rst: rewrite the 'A Note About Parallelism' section to reflect v4's architecture — workers = simultaneous tables, concurrency = parallel readers per table (1:1 reader→writer pairs), chunk size = range sizing method. Removed the v3-specific '1 reader + N writers per table' pipeline description. command.rst: add 'chunk size' to the global WITH options list. docs/ref/pgsql.rst: add 'single/multiple readers per thread', 'chunk size', and a corrected 'rows per range' entry to the PostgreSQL source WITH options. The previous 'rows per range' entry had a dangling 'see above' reference pointing to nothing (the reader options were never documented here). * ci: publish evergreen v4-dev release after each green push - build.clj: version now read from PGLOADER_VERSION env var (fallback 4.0.0-SNAPSHOT) - clojure-integration-tests.yml: add publish-dev job that runs after all tests pass on push events; overwrites assets on the fixed v4-dev pre-release tag so that .../releases/download/v4-dev/pgloader.jar always points to latest green HEAD. Also uploads a versioned copy pgloader-4.0.0-<sha>.jar for pinning. * docs,ci: v4 install section + smoke-test workflow docs/install.rst: - Fix 'pgcopydb' typo in opening sentence - Add 'pgloader v4 — Clojure/JVM (recommended)' section at the top with the stable v4-dev JAR download URL and one-liner usage examples - Relabel existing Debian/RPM/Docker/build-from-source sections as v3 - Fix RST heading hierarchy (sub-sub-headings under 'Build from sources (v3)' changed from ^^^ to """ to avoid Sphinx level-conflict warnings) .github/workflows/smoke-test.yml (new): - Triggered on release:prereleased (fires when publish-dev updates v4-dev) and on workflow_dispatch for manual runs - Downloads the published pgloader.jar from the v4-dev GitHub release - Runs java -jar pgloader.jar --version to confirm the JAR starts - Migrates the Chinook SQLite sample database into a PostgreSQL service using the Mustache-templated load file - Validates exact row counts (tracks=3503, albums=347, customers=59, invoices=412) with a PL/pgSQL ASSERT block - Uploads pgloader.log as an artifact on failure clojure/tests/sqlite/smoke/chinook.load (new): - Mustache-templated load file for the smoke test; SQLite path injected via CHINOOK_SQLITE env var so it works on the GHA runner without Docker * docs: update GITHUB-ISSUES.md before merge Comprehensive scan of all 200 open issues and 40+ open PRs on the repo against v4 fix status. Major additions vs previous version: - FULLTEXT → GIN tsvector: #1230 now ✅ (was 🔧) - MySQL/MariaDB: add #1539, #1570, #1572, #1592, #1617, #1641, #1653, #1654, #1661 - PostgreSQL target: add #1461, #1490, #1576, #1600, #1693 - pgsql→pgsql: add #1419, #1550, #1556, #1601 - SQLite: add #1450, #1451, #1472, #1515, #1517, #1523, #1543, #1547, #1552, #1607, #1631, #1665, #1687 - MSSQL: add #304, #1445, #1454, #1551, #1582, #1586, #1590, #1597, #1627, #1630, #1669 - CLI: add #1463, #1475, #1476, #1682 - Parallelism: add #1405, #1487, #1520, #1583, #1589, #1623, #1648, #1680 - Connection: add #1480, #1507, #1562, #1579, #1636, #1685 - Summary: add #1561 - Open PRs superseded by v4: #1662, #1652, #1596, #1595, #1594, #1531, #1509, #1321 - Known gaps section updated * fix(smoke-test): use URI one-liner instead of Mustache load file Mustache template expansion ({{VAR}} in .load files) is not implemented in v4. The smoke test was using a load file with {{CHINOOK_SQLITE}} which would have failed at runtime. Switch to the URI one-liner form which needs no templating and exercises the same code path. * feat: expand {{VAR}} placeholders in .load files from OS environment Implements the same env-var substitution that v3 provides via its Mustache template support. Before the file content is handed to the parser, every {{VAR}} token is replaced with the value of the OS environment variable VAR. An undefined variable produces a clear Template error rather than a confusing parse failure. Implementation: clojure.string/replace with a single regex pass in expand-env (parser.clj). No new dependencies. Tests: 3 new assertions in parser_test covering successful expansion, multiple placeholders in one string, undefined-var error, and pass-through of strings with no placeholders. * docs: mark Mustache templating as implemented in GITHUB-ISSUES * feat: skip generated columns in MySQL and SQLite sources MySQL: filter columns where EXTRA contains 'generated' (VIRTUAL GENERATED or STORED GENERATED). These columns cannot be written via INSERT/COPY. SQLite: parse the raw CREATE TABLE SQL from sqlite_master to detect GENERATED ALWAYS AS columns, then exclude them from the catalog before the schema is introspected. * feat: MATERIALIZE VIEWS named list adds views to catalog Named views WITHOUT an explicit SQL definition are now added to the catalog so the normal DDL + COPY pipeline creates the target table and copies the data — same pipeline as MATERIALIZE ALL VIEWS but filtered to just the named views. Named views WITH an explicit SQL definition (name AS $$ sql $$) are still handled by materialize-views!, which runs the user SQL against the source and writes to an already-existing target table (created via BEFORE LOAD DO, as in the db789 test fixture). * feat: migrate MSSQL MS_Description comments to PostgreSQL COMMENT ON Query sys.extended_properties for MS_Description values (minor_id=0 for the table, minor_id>0 for columns) and store them in :table-comment and :column-comment on catalog entries. The existing create-table-sql in ddl/common.clj already emits COMMENT ON TABLE / COMMENT ON COLUMN for any non-empty comment values, so no further changes are needed. * docs: update GITHUB-ISSUES.md — mark generated columns, matviews, MSSQL comments as fixed * feat: recreate generated columns as GENERATED ALWAYS AS on target PostgreSQL MySQL VIRTUAL GENERATED and STORED GENERATED columns, and SQLite GENERATED ALWAYS AS columns, are now reproduced on the target as PostgreSQL GENERATED ALWAYS AS (expr) STORED columns. Changes: - mysql.sql: add GENERATION_EXPRESSION to the columns query - mysql.clj: populate :generated-expression from GENERATION_EXPRESSION; translate backtick-quoted identifiers to double-quoted ("col") - sqlite.clj: replace generated-column-names (filter-out) with generated-column-expressions (extract expression from sqlite_master); balanced-paren extractor handles nested expressions - ddl/common.clj: column-def emits GENERATED ALWAYS AS (expr) STORED when :generated-expression is set; skips NOT NULL and DEFAULT for generated columns (they are auto-computed by the engine) - core.clj: filter :generated-expression columns from the COPY table-spec so they are excluded from SELECT and COPY (PostgreSQL rejects inserts into generated columns) PostgreSQL only supports STORED generated columns; MySQL VIRTUAL columns are migrated as STORED since PostgreSQL has no VIRTUAL equivalent. Expressions are passed through as-is; complex expressions using MySQL-specific functions may need manual adjustment after migration. Closes #1442, closes #1687 * docs: update GITHUB-ISSUES — generated columns now recreated on target (not just skipped) * docs: Citus FK backfill is fully implemented — close #1195, remove from known gaps * feat: JDBC URI compatibility for all database sources Accept jdbc: URLs everywhere a source/target connection string is expected, alongside the existing pgloader-native URI schemes. Grammar: - source-uri matches jdbc:[^\s]+ so jdbc:mysql://, jdbc:sqlserver:// (including semicolon parameters) and other JDBC URLs are accepted - pg-uri matches jdbc:postgresql:// and jdbc:pgsql:// parse-uri: - Always synthesises :jdbc-url in the returned map - For jdbc:sqlserver:// uses a custom semicolon-param parser since the MSSQL JDBC driver uses ; instead of ? for connection properties - For jdbc:postgresql://, jdbc:mysql://, etc. strips the jdbc: prefix, parses the inner URI for metadata (host/port/db/user/password), and stores the full original JDBC URL as :jdbc-url - For native mysql://?useSSL=false, mssql://host/db, postgresql://... the query string is now preserved in the synthesised jdbc-url so query parameters actually reach the JDBC driver Connection functions (mysql.clj, mssql.clj, pgsql.clj, core.clj): - All use :jdbc-url from the uri-map when present, falling back to constructing the URL from parsed components - mysql.clj switched from MysqlDataSource to DriverManager/getConnection so MySQL JDBC query parameters (useSSL, allowPublicKeyRetrieval, serverTimezone, …) are now correctly passed to the driver Test suite: - mssql.load updated to use jdbc:sqlserver:// URL demonstrating that special characters in passwords need no URL-encoding - MySQL test files keep mysql://…?useSSL=false form which now correctly reaches the driver via the synthesised jdbc-url - New ast_test.clj covers mysql://?useSSL=false jdbc-url synthesis, jdbc:mysql:// passthrough, jdbc:postgresql:// passthrough, jdbc:sqlserver:// semicolon parsing, mssql:// jdbc-url synthesis Docs: - docs/command.rst: Connection String section now covers both formats with JDBC URL examples for all four source types - docs/ref/mysql.rst: documents JDBC URL form and MySQL Connector/J property passthrough - docs/ref/mssql.rst: updated for MSSQL JDBC driver (replaces FreeTDS reference), documents jdbc:sqlserver:// form with Azure SQL example Fixes MySQL authentication issues caused by useSSL=false being silently ignored when using MysqlDataSource; enables Azure SQL connectivity via JDBC URL without needing mssql:// URI workarounds. * fix: revert mssql.load to mssql:// and drop MSSQL JDBC URL parser Two issues with the previous commit: 1. mssql.load is shared between v3 and v4 test runners. v3's Common Lisp ESRAP parser does not understand jdbc:sqlserver:// URLs, causing the mssql (v3) CI job to fail with ESRAP-PARSE-ERROR. Reverted back to the mssql:// form that both parsers accept. 2. parse-mssql-jdbc-url was unnecessarily parsing the MSSQL JDBC URL's semicolon-delimited connection properties to extract host/port/db/user/ password. This was wrong in principle: the MSSQL JDBC driver owns the parsing of its own URL format. Credentials embedded in the URL (user=sa;password=...) should not be re-extracted and re-passed via Properties (Properties take precedence over URL in the MSSQL driver, which would break if we ever passed wrong values). Replaced with a single-line passthrough: {:type :mssql :jdbc-url uri-str}. The mssql.clj connection function already guards with (when (:user ...)), so passing nil user/password when a JDBC URL is used is correct — the driver reads credentials from the URL itself. * test: validate synthesised jdbc-urls against actual JDBC drivers DriverManager/getDriver calls Driver.acceptsURL() on every registered driver without opening a connection. Covers all three drivers on the classpath (MySQL Connector/J, PostgreSQL JDBC, mssql-jdbc) and both path variants for each: - native URI → synthesised jdbc-url (mysql://, postgresql://, mssql://) - jdbc: prefix → passthrough jdbc-url (jdbc:mysql://, jdbc:postgresql://, jdbc:sqlserver://) Catches malformed synthesised URLs (wrong scheme prefix, mis-placed databaseName=, invalid sqlserver:// semicolon format) that the round-trip string tests in the same file cannot detect. * fix: add --logfile FILE flag to CLI The smoke-test workflow passes --logfile /tmp/pgloader.log to the JAR. Without this flag the argument fell through to the positional parser, was treated as a third positional arg, and triggered 'Unknown arg'. Implementation: at logging-configure time, if --logfile is set, attach a plain FileAppender (PatternLayoutEncoder, append=false) to the root Logback logger. All log events flow to both the console and the file, matching the behaviour of the FILE appender in logback.xml but writing to the user-specified path instead of /tmp/pgloader/pgloader.log. * feat: implement --on-error-stop, --dry-run, --list-encodings, --client-min-messages, --log-min-messages; deprecate v3-only flags --on-error-stop copy/*on-error-stop* was already declared in copy.clj but never read. Now wired: CLI sets the binding; the per-table catch block in core.clj re-throws after logging 'ON ERROR STOP', which propagates out of the worker Future and aborts the remaining tables in the pool. --dry-run New copy/*dry-run* dynamic var in copy.clj. When set, core.clj prints 'DRY RUN — skipping COPY for N tables' and skips the entire Phase 2 (worker pool + COPY). DDL (CREATE TABLE, indexes) still runs so the schema is validated. Summary is printed normally. --list-encodings Prints all charset names from java.nio.charset.Charset/availableCharsets (sorted) and exits. 173 charsets on a typical JVM 21. --client-min-messages LEVEL Sets a ThresholdFilter on the CONSOLE Logback appender so console output can be quieter than the root logger level (e.g. suppress INFO on console while keeping DEBUG in the logfile). --log-min-messages LEVEL Same but for the FILE appender defined in logback.xml. Both --client-min-messages and --log-min-messages accept the Logback level names: trace, debug, info, warn, error (case-insensitive). Deprecated flags (exit 1 with a clear migration message): --upgrade-config v2 INI converter; INI not supported in v4 --self-upgrade Lisp hot-reload; not applicable to JAR --load-lisp-file / -l Lisp extensions; contact maintainers if needed --context / -C INI variable file; use env vars / {{VAR}} instead --no-ssl-cert-verification use ?sslmode=disable in the URI instead Help text updated: all new flags documented, deprecated section added, more source URI examples including JDBC URL forms. * docs: annotate pgloader.rst CLI options with v4 changes and deprecations For each option that behaves differently in v4, or was removed, a '.. note:: v4 ...' callout is added immediately after the existing description text. Original prose is preserved verbatim. Annotated options: - --with-encodings → renamed --list-encodings in v4; note explains - --upgrade-config → removed; migration advice given - --logfile → v4 adds a second sink, does not replace console - --log-min-messages / --client-min-messages → level name mapping (critical→error, log→debug, notice→info, data→trace); v4 uses Logback names; --client-min-messages sets a ThresholdFilter so console and file levels are independently controlled - --load-lisp-file → deprecated in v4; built-in transforms cover common cases; contact maintainers for custom extension needs - --dry-run → fully implemented in v4; DDL runs, COPY skipped - --on-error-stop → fully implemented in v4; re-raises from worker thread to abort remaining tables - --self-upgrade → removed; replace the JAR to upgrade - --no-ssl-cert-verification → removed; use ?sslmode= in URI Also added a note under Source Connection String documenting the JDBC URL forms accepted in v4 (jdbc:mysql://, jdbc:postgresql://, etc.). * docs: trim 'fully implemented' v4 notes; keep only behaviour surprises * fix: dry-run checks connections only — no catalog fetch, DDL, or COPY Match v3 behaviour: when --dry-run is set, pgloader opens and immediately closes both the source and target connections, logs 'connections OK', and exits. No catalog is fetched, no DDL is executed, no data is copied. Previously v4 ran the full DDL phase (CREATE TABLE, indexes) and only skipped the COPY phase, which defeats the purpose of a dry run. * docs: fix --with-encodings phantom name; remove dry-run note The option has always been --list-encodings in both v3 and v4. --with-encodings never existed. Also drop the --dry-run v4 note since the behaviour matches v3 exactly. * docs: add #1584 #1675 #1678 to GITHUB-ISSUES as fixed by JDBC URI support * fix: interrupt reader thread on writer failure to prevent deadlock When the writer virtual thread exits with an exception, the queue is no longer drained. The reader is likely blocked on .put (queue capacity 4), and the subsequent (.join reader) call blocks forever — a deadlock. Fix: interrupt the reader thread immediately when the writer catch fires. The reader's .put then throws InterruptedException, which is caught separately and suppressed (the writer exception is the real one, stored in exc-holder and re-thrown after the join). The interrupted flag is restored on the reader thread so its interrupted state is visible to callers of .join. The reader-fails → writer-unblocked path was already correct: reader catches its own exception, calls .offer :end-of-data, and sets done=true, which causes the writer's .take to return :end-of-data and exit cleanly. * fix: close race in get-table-lock where two threads could get different lock instances The old code read from the atom, then conditionally wrote to it — the check and the insert were not atomic: (or (get @locks key) ; thread A reads: key absent (let [lock (ReentrantLock.)] (swap! locks assoc key lock) ; both threads swap, one wins lock)) ; both return their *own* local lock Both threads pass the (or) nil check, both swap!, one lock ends up in the map but both callers hold different instances — the mutex is broken. Fix: move the check inside the swap! update function so the entire check-and-insert is a single CAS. swap! may retry under contention but the update fn is pure. The canonical lock is whatever ends up in the map, and get on the swap! return value retrieves that same instance for all callers. * style: add cljfmt formatter; reformat all source files - deps.edn: add :cljfmt alias (dev.weavejester/cljfmt 0.13.1) - Reformat 28 source and test files with 'clojure -M:cljfmt fix src test' - CI: add 'format' job (cljfmt check) that runs before 'unit'; unit now needs: [format] so a formatting failure blocks the test run - .git/hooks/pre-commit: run cljfmt check on commit when any .clj file is staged; prints fix command on failure; skippable with --no-verify All 124 unit tests pass after reformatting.

fix: Prevent camelCase-to-colname to duplicate underscores

119f01a

Fixes issue dimitri#1287

OldrichDlouhy changed the title ~~fix: Prevent camelCase-to-colname to duplicate underscores~~ fix: Prevent camelCase-to-colname to duplicate underscores (#1287) Mar 21, 2025

GentiSe mentioned this pull request May 15, 2025

Extra Underscore in Column Names When Using snake_case During Data-Only Migration from MSSQL to PostgreSQL #1669

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Prevent camelCase-to-colname to duplicate underscores (#1287)#1652

fix: Prevent camelCase-to-colname to duplicate underscores (#1287)#1652
OldrichDlouhy wants to merge 1 commit into
dimitri:mainfrom
OldrichDlouhy:snake-case-doubles-underscores

OldrichDlouhy commented Mar 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

OldrichDlouhy commented Mar 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant