util/collate: support latin1_swedish_ci#67255

Open
kennytm wants to merge 9 commits into pingcap:master from kennytm:latin1_swedish_ci

Conversation

@kennytm
Contributor

@kennytm kennytm commented Mar 24, 2026

What problem does this PR solve?

Issue Number: close #67198, close #36057

Problem Summary: TiDB did not support the latin1_swedish_ci collation, which was the default collation on MySQL 5.x and is often inherited after upgrading. The lack of this collation complicates migrations to and from TiDB.

What changed and how does it work?

A new Collator is added to the newCollatorMap for "latin1_swedish_ci" (ID = 8). The Collator embeds a look-up table which maps every Latin-1 character to a specific weight, compatible with MySQL 8.4 (the look-up table can be exchanged to support more SBCS collations e.g. latin1_general_ci in the future, if there is actual demand).

Other latin1 behaviors, e.g. the default collation (latin1_bin) and the incompatible HEX() behavior (#18955), are unchanged.
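
For readers unfamiliar with table-driven single-byte collations, the look-up approach described above can be sketched roughly as follows. This is not the code in this PR: `toyWeights`, `sortKey` and `compare` are hypothetical names, and the table only folds ASCII case, so it is not the MySQL 8.4 latin1_swedish_ci weight table.

```go
package main

import (
	"bytes"
	"fmt"
)

// toyWeights is a 256-entry look-up table from byte to sort weight.
// ASSUMPTION: it only folds ASCII a-z onto A-Z and leaves every other byte
// unchanged. It is NOT the MySQL 8.4 latin1_swedish_ci table.
var toyWeights = buildToyWeights()

func buildToyWeights() [256]byte {
	var t [256]byte
	for i := range t {
		t[i] = byte(i)
	}
	for c := 'a'; c <= 'z'; c++ {
		t[c] = byte(c - 'a' + 'A')
	}
	return t
}

// sortKey replaces each byte of s with its weight from the table.
func sortKey(s []byte) []byte {
	key := make([]byte, len(s))
	for i, b := range s {
		key[i] = toyWeights[b]
	}
	return key
}

// compare orders two single-byte strings by comparing their sort keys.
func compare(a, b []byte) int {
	return bytes.Compare(sortKey(a), sortKey(b))
}

func main() {
	fmt.Println(compare([]byte("abc"), []byte("ABC"))) // equal under the toy table
	fmt.Println(compare([]byte("abc"), []byte("ABD"))) // "abc" sorts first
}
```

Swapping in a different 256-entry table would change the collation without touching `sortKey` or `compare`, which is presumably what makes other SBCS collations cheap to add later.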

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Supports `latin1_swedish_ci` collation, to simplify migration from MySQL/MariaDB databases.

Summary by CodeRabbit

  • New Features

    • Added latin1_swedish_ci collation: case-insensitive Swedish ordering for comparisons, sorting, LIKE, pattern matching and range queries, with a binary mapping to latin1_bin.
  • Tests

    • Added and updated unit and integration tests validating ordering, key generation, pattern matching, index usage, and query plans/results for latin1_swedish_ci.
  • Chores

    • Enabled the Latin-1 collator in the build and updated collation metadata and expected outputs for latin1/latin1_swedish_ci.
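
As a rough illustration of what case-insensitive LIKE means under a weight-based collation, the sketch below matches weights instead of raw bytes. `toyWeight` and `likeMatch` are hypothetical names, the weight function only folds ASCII case (it is not the real latin1_swedish_ci table), and escape characters are not handled.

```go
package main

import "fmt"

// toyWeight is a hypothetical stand-in for a collation weight table.
// ASSUMPTION: it only folds ASCII a-z onto A-Z.
func toyWeight(b byte) byte {
	if b >= 'a' && b <= 'z' {
		return b - 'a' + 'A'
	}
	return b
}

// likeMatch reports whether s matches pattern, where '%' matches any run of
// bytes, '_' matches exactly one byte, and every other pattern byte matches
// any byte with the same weight. Escape sequences are not supported.
func likeMatch(pattern, s []byte) bool {
	if len(pattern) == 0 {
		return len(s) == 0
	}
	switch pattern[0] {
	case '%':
		// Try every possible length for the run consumed by '%'.
		for i := 0; i <= len(s); i++ {
			if likeMatch(pattern[1:], s[i:]) {
				return true
			}
		}
		return false
	case '_':
		return len(s) > 0 && likeMatch(pattern[1:], s[1:])
	default:
		return len(s) > 0 &&
			toyWeight(pattern[0]) == toyWeight(s[0]) &&
			likeMatch(pattern[1:], s[1:])
	}
}

func main() {
	fmt.Println(likeMatch([]byte("a%c"), []byte("ABBC")))
	fmt.Println(likeMatch([]byte("a_"), []byte("A")))
}
```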

@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 24, 2026
@pantheon-ai

This comment was marked as resolved.

@ti-chi-bot

ti-chi-bot bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tisonkun for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 24, 2026
@tiprow

This comment was marked as resolved.

@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Mar 24, 2026
@coderabbitai

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds native support for the latin1_swedish_ci collation: new Latin‑1 collator implementation, Bazel build inclusion, registry entries, unit tests, and integration test updates covering classification, ordering, keys, pattern matching, and SQL-level behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build: pkg/util/collate/BUILD.bazel | Added latin1.go to srcs and added @org_golang_x_text//encoding/charmap to deps. |
| Collation registry & helpers: pkg/util/collate/collate.go | Marked latin1_swedish_ci as case‑insensitive, map to latin1_bin for binary conversion, and register name/ID in init(). |
| Latin‑1 collator implementation: pkg/util/collate/latin1.go | New file implementing latin1Collator and latin1Pattern with a 256‑entry weight table, Compare/Key/ImmutableKey/KeyWithoutTrimRightSpace, Pattern/Clone/MaxKeyLen, and Windows‑1252 encoding fallback for rune→byte mapping. |
| Unit tests: pkg/util/collate/collate_test.go | Extended tests to cover latin1_swedish_ci collator resolution (by name and ID), classification, binary conversion, ordering, exact keys, and pattern behavior. |
| Integration tests & expectations: tests/integrationtest/t/collation_misc.test, tests/integrationtest/r/collation_misc_enabled.result, tests/integrationtest/r/collation_misc_disabled.result | Added/updated scenarios and expected outputs for latin1 DBs using latin1_swedish_ci: DB/table creation, sample inserts, lag()/hex(weight_string(...)) checks, and planner/execution assertions for equality, LIKE, and range predicates. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 I nibble through bytes from A to Ö,
I map each rune to Windows‑1252 glow,
Keys trimmed tidy, patterns hop in line,
Old latin1 tables wake up fine,
Hooray — collations sorted, row by row.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'util/collate: support latin1_swedish_ci' is concise, specific, and accurately describes the main change—adding support for the latin1_swedish_ci collation to the collation package. |
| Description check | ✅ Passed | The PR description follows the template structure with clear issue numbers, problem summary, implementation details, completed checklists (unit and integration tests, affects user behavior and MySQL compatibility), and an appropriate release note. |
| Linked Issues check | ✅ Passed | The PR fully addresses the coding requirements from linked issues #67198 and #36057: implements latin1_swedish_ci collation with proper weight mapping compatible with MySQL 8.4, registers it in newCollatorMap, and includes comprehensive unit and integration tests. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to adding latin1_swedish_ci support: new collator implementation, integration into the collation registry, test coverage for the new collation, and expected result updates for integration tests. |

coderabbitai[bot]

This comment was marked as resolved.

Comment thread pkg/util/collate/latin1.go
@kennytm
Contributor Author

kennytm commented Mar 24, 2026

Since TiDB treats latin1 as equivalent to utf8mb4, #67255 (comment) forces us to make some compromises in handling non-Latin1 characters.

Currently we treat all non-Latin1 characters as distinct and ordered after "ÿ". This requires changing the WEIGHT_STRING() of "ÿ" from x'FF' to x'FF00' to maintain the proper lexical ordering, so that "ÿÿ" < "Ā". But this breaks WEIGHT_STRING() compatibility with MySQL, and also increases MaxKeyLen(s) to len(s)*2.
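
To see why a 1-byte x'FF' weight for "ÿ" breaks the ordering, consider the sketch below. The key layout for non-Latin1 characters here is purely hypothetical (0xFF followed by the character's UTF-8 bytes) and is only chosen to make the problem visible; it is not necessarily the layout used in any commit of this PR. `encode` and `less` are illustrative names, and Latin-1 characters below U+00FF use an identity weight as a stand-in for the real table.

```go
package main

import (
	"bytes"
	"fmt"
)

// encode builds a sort key for s. yWeight is the key emitted for U+00FF.
// HYPOTHETICAL layout: runes above U+00FF are emitted as 0xFF followed by
// their UTF-8 bytes, which keeps them distinct and ordered by code point.
func encode(s string, yWeight []byte) []byte {
	var key []byte
	for _, r := range s {
		switch {
		case r < 0xFF:
			key = append(key, byte(r)) // identity weight, stand-in for the real table
		case r == 0xFF:
			key = append(key, yWeight...)
		default:
			key = append(key, 0xFF)
			key = append(key, string(r)...)
		}
	}
	return key
}

// less reports whether a sorts strictly before b under the given weight for U+00FF.
func less(a, b string, yWeight []byte) bool {
	return bytes.Compare(encode(a, yWeight), encode(b, yWeight)) < 0
}

func main() {
	yy, aMacron := "\u00ff\u00ff", "\u0100"
	// 1-byte weight: keys are FF FF vs FF C4 80, so the pair is misordered.
	fmt.Println(less(yy, aMacron, []byte{0xFF}))
	// 2-byte weight: keys are FF 00 FF 00 vs FF C4 80, so the order is restored.
	fmt.Println(less(yy, aMacron, []byte{0xFF, 0x00}))
}
```

The 2-byte weight works because 0x00 is smaller than any UTF-8 lead byte, at the cost of keys that can be twice as long as the input.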

Alternatively, we can treat all non-Latin1 characters as invalid and map them all to x'3F' ("?"), keeping the sort key at 1 byte per character.

Please pick a choice.

EDIT: Went with the x'3F' approach because on MySQL "latin1" actually means Windows-1252, so out-of-sequence characters like "œ" (U+0153, but 0x9C in Windows-1252) have to be supported. So there is no advantage in keeping r <= 0xff special.
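
A minimal sketch of the rune-to-byte mapping implied by the x'3F' approach follows. According to the walkthrough, the actual change uses golang.org/x/text/encoding/charmap for the Windows-1252 fallback; the version here instead uses a small hand-written, deliberately partial map so that it stays standard-library only. `cp1252High` and `toLatin1Byte` are hypothetical names, and C1 control code points (U+0080 to U+009F) are simply mapped to "?" as a simplification.

```go
package main

import "fmt"

// cp1252High lists a few of the Windows-1252 characters that live in the
// 0x80-0x9F byte range but have Unicode code points above U+00FF.
// ASSUMPTION: this map is partial and for illustration only.
var cp1252High = map[rune]byte{
	0x20AC: 0x80, // euro sign
	0x0160: 0x8A, // S with caron
	0x0152: 0x8C, // capital OE ligature
	0x0161: 0x9A, // s with caron
	0x0153: 0x9C, // small oe ligature
	0x0178: 0x9F, // Y with diaeresis
}

// toLatin1Byte maps a rune to its single-byte form, or '?' (0x3F) when the
// rune has no single-byte representation in this sketch.
func toLatin1Byte(r rune) byte {
	switch {
	case r >= 0 && r < 0x80, r >= 0xA0 && r <= 0xFF:
		return byte(r) // ASCII and the upper Latin-1 block map directly
	}
	if b, ok := cp1252High[r]; ok {
		return b
	}
	return '?'
}

func main() {
	for _, r := range "a\u00ff\u0153\u0100" {
		fmt.Printf("U+%04X -> 0x%02X\n", r, toLatin1Byte(r))
	}
}
```

Because every rune collapses to exactly one byte before the weight look-up, all unmappable characters compare equal to each other and to a literal "?".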

coderabbitai[bot]

This comment was marked as resolved.

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 54.34783% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.6696%. Comparing base (d3360c6) to head (501af8e).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67255        +/-   ##
================================================
- Coverage   77.8039%   77.6696%   -0.1344%     
================================================
  Files          2023       2025         +2     
  Lines        556185     556234        +49     
================================================
- Hits         432734     432025       -709     
- Misses       121706     122428       +722     
- Partials       1745       1781        +36     
Flag Coverage Δ
unit 76.1264% <54.3478%> (-0.2270%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (ø)
parser ∅ <ø> (∅)
br 60.8190% <ø> (-0.0259%) ⬇️

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@mjonss
Contributor

mjonss commented Mar 26, 2026

@kennytm Are you also adding the collation to TiKV and TiFlash?

@kennytm
Contributor Author

kennytm commented Mar 27, 2026

Not yet. I'll submit the PRs if the current design (all non-Latin1 characters map to '?') is acceptable.

mysql> select count(1) from t where val < concat(_latin1'ä', val);
ERROR 1105 (HY000): other error: [components/tidb_query_datatype/src/codec/error.rs:215]: invalid schema: UnsupportedCollation { code: -8 }

mysql> select /*+ read_from_storage(tiflash[t]) */ count(1) from t where val < concat(_latin1'ä', val);
ERROR 1105 (HY000): other error for mpp stream: Code: 49, e.displayText() = DB::Exception: static TiDBCollatorPtr TiDB::ITiDBCollator::getCollator(int32_t): invalid collation ID: 8, e.what() = DB::Exception,

Edit:

@kennytm kennytm force-pushed the latin1_swedish_ci branch from fca80e0 to 501af8e Compare March 27, 2026 09:37
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 27, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 27, 2026

@kennytm: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| idc-jenkins-ci-tidb/mysql-test | 501af8e | link | true | /test mysql-test |
| idc-jenkins-ci-tidb/check_dev_2 | 501af8e | link | true | /test check-dev2 |
| idc-jenkins-ci-tidb/unit-test | 501af8e | link | true | /test unit-test |

Labels

component/charset ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support latin1_swedish_ci
Incompatible with MySQL when using ProxySQL

3 participants