fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O #2686

Open
Quratulain-bilal wants to merge 1 commit into github:main from Quratulain-bilal:fix/cli-explicit-utf8-encoding

@Quratulain-bilal
Contributor

What

Path.read_text() / Path.write_text() default to the system locale codec, which on Windows is usually a legacy ANSI code page such as cp1252, cp936 (GBK), or cp932.
Two user-facing file paths in spec-kit were calling them without an explicit encoding= argument:

  • src/specify_cli/__init__.py:400, 412: save_init_options / load_init_options for
    .specify/init-options.json. A peer machine with a different default locale (or a UTF-8 Unix CI runner reading a file
    written on a cp1252 Windows host) cannot decode the file, raising UnicodeDecodeError. UnicodeDecodeError is a
    subclass of ValueError — not OSError / json.JSONDecodeError — so the existing fall-back except tuple in
    load_init_options also misses it and the error propagates raw to the CLI.

  • src/specify_cli/extensions.py:764: the .extensionignore pattern reader. The very next line already normalises
    backslashes "so Windows-authored files work", proving the codebase expects Windows authors to write this file.
    Multibyte UTF-8 patterns (Chinese filenames, accented directory names) silently mojibake when the host locale is not
    UTF-8, so the patterns fail to match and unintended files are shipped with the extension.
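
The mojibake failure described in the second bullet can be shown without a Windows host. The sketch below is illustrative only: it uses the standard-library fnmatch module as a stand-in for whatever matcher the .extensionignore reader actually uses, and the pattern and path are made-up examples.

```python
import fnmatch

# A pattern as an author would save it in a UTF-8 .extensionignore file.
pattern = "café/*"
raw_bytes = pattern.encode("utf-8")

# What the reader sees when the host locale codec is cp1252 instead of UTF-8:
# the two UTF-8 bytes of "é" (0xC3 0xA9) become two separate characters.
misread = raw_bytes.decode("cp1252")

# What the reader sees when the encoding is pinned.
pinned = raw_bytes.decode("utf-8")

path = "café/notes.md"
misread_matches = fnmatch.fnmatchcase(path, misread)
pinned_matches = fnmatch.fnmatchcase(path, pinned)

print(misread_matches)  # False: the mangled pattern no longer matches
print(pinned_matches)   # True
```

No exception is raised in the misread case, which is why the failure is silent: the file is read successfully, the pattern simply stops matching.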

Reproducer

# On a non-UTF-8 Windows host (e.g. cp1252):
from pathlib import Path

from specify_cli import save_init_options, load_init_options
save_init_options(Path("."), {"project_name": "café"})
# File written as cp1252.

# On a UTF-8 host (or another Windows host with a different default codec):
load_init_options(Path("."))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9...
# It propagates raw because UnicodeDecodeError is a ValueError, not an OSError.
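
For reviewers without two differently configured hosts, the same failure can be approximated on a single machine by passing the two codecs explicitly. This is a sketch under assumptions, not the spec-kit code: the except tuple is modelled on the description above, and ensure_ascii=False is passed so that the raw "é" reaches the file (whether the real save_init_options does this is not shown here).

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "init-options.json"

    # Stand-in for the cp1252 host: "é" is written as the single byte 0xE9.
    payload = json.dumps({"project_name": "café"}, ensure_ascii=False)
    path.write_text(payload, encoding="cp1252")

    # Stand-in for the UTF-8 host, with the pre-fix style of error handling.
    try:
        json.loads(path.read_text(encoding="utf-8"))
        caught = None
    except (OSError, json.JSONDecodeError) as exc:
        caught = exc  # not reached: the decode error is neither of these
    except UnicodeDecodeError as exc:
        caught = exc

print(type(caught).__name__)
print(isinstance(caught, ValueError), isinstance(caught, OSError))
```

json.JSONDecodeError is itself a ValueError subclass, but it is a sibling of UnicodeDecodeError rather than a parent, so listing it in the tuple does not help.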

Why this matters

  • The sibling integration-catalog reader at src/specify_cli/integrations/catalog.py:150,156,193,202,374 already pins
    encoding="utf-8" everywhere. PR #2280, "fix(powershell): ensure UTF-8 templates are written without BOM" (f684305),
    fixed the symmetric PowerShell-template BOM bug. The two paths in this PR were the remaining drifted ones.
  • init-options.json is meant to be a portable record of how a project was scaffolded — a peer cloning the repo on a
    different OS / locale must be able to read it. Today they can't if the original author's project name (or any future
    field) contains non-ASCII.
  • .extensionignore already explicitly supports Windows authors (see line 766). UTF-8 patterns are part of that same
    contract.

The change

  • src/specify_cli/__init__.py — pin encoding="utf-8" on both write_text and read_text; extend the existing
    except tuple in load_init_options to include UnicodeDecodeError so a peer file written in a non-UTF-8 codec falls
    back to {} per the existing contract instead of crashing.
  • src/specify_cli/extensions.py — pin encoding="utf-8" on the .extensionignore reader.
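
A minimal sketch of the shape of the init-options change, for orientation only. It is not the actual diff: the function bodies, the INIT_OPTIONS_FILE constant, and the JSON formatting arguments are assumptions; only the pinned encoding and the widened except tuple come from the description above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical constant; the PR only names the file .specify/init-options.json.
INIT_OPTIONS_FILE = Path(".specify") / "init-options.json"


def save_init_options(project_path: Path, options: dict) -> None:
    dest = project_path / INIT_OPTIONS_FILE
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Pin the codec so the bytes on disk do not depend on the host locale.
    dest.write_text(json.dumps(options, indent=2, ensure_ascii=False), encoding="utf-8")


def load_init_options(project_path: Path) -> dict:
    path = project_path / INIT_OPTIONS_FILE
    if not path.exists():
        return {}
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError, UnicodeDecodeError):
        # Unreadable, malformed, or non-UTF-8 files all fall back to {}.
        return {}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_init_options(root, {"project_name": "café"})
    round_trip = load_init_options(root)

    # Simulate a peer file written in cp1252 (0xE9 is not valid UTF-8 here).
    (root / INIT_OPTIONS_FILE).write_bytes(b'{"project_name": "caf\xe9"}')
    fallback = load_init_options(root)

print(fallback)
```

The fallback keeps the existing "return {} on a bad file" contract rather than introducing a new error path for the CLI to handle.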

Tests

  • tests/test_presets.py::TestInitOptions — parametrized non-ASCII round-trip (CJK / Latin-1 / Greek / emoji) plus a
    0xe9-byte corrupted-file fallback test.
  • tests/test_extensions.py::TestExtensionIgnore — Japanese (ドキュメント/) and Latin-1 (café/) ignore patterns
    correctly exclude their directories during install.
$ python -m pytest tests/test_presets.py::TestInitOptions tests/test_extensions.py::TestExtensionIgnore -q
22 passed in 4.56s
$ python -m ruff check src/
All checks passed!

Scope

Intentionally narrow: no behaviour change for ASCII content (UTF-8 is a superset of ASCII). Only non-ASCII content that previously
round-tripped accidentally (when host locale happened to be UTF-8) or silently mojibaked (when it wasn't) now
round-trips reliably across all hosts.
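
The "no behaviour change for ASCII" claim can be checked directly: the codecs named in this PR all encode ASCII text to the same bytes, and they diverge only once a non-ASCII character appears. The sample strings below are illustrative.

```python
ascii_text = '{"project_name": "demo"}'
encodings = ["utf-8", "cp1252", "cp932", "gb2312"]

# Encode the same ASCII-only text with every codec and compare the bytes.
encoded = {name: ascii_text.encode(name) for name in encodings}
all_identical = len(set(encoded.values())) == 1

# A single non-ASCII character is enough for the byte sequences to diverge.
non_ascii = "café"
differs = non_ascii.encode("utf-8") != non_ascii.encode("cp1252")

print(all_identical)  # True
print(differs)        # True
```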

``Path.read_text`` / ``Path.write_text`` default to the system locale
codec, which on Windows is usually a legacy ANSI code page such as
cp1252, cp936 (GBK), or cp932. Two user-facing
file paths in spec-kit were calling them without an explicit
``encoding=`` argument:

  - ``src/specify_cli/__init__.py:400,412`` —
    ``save_init_options`` / ``load_init_options`` for
    ``.specify/init-options.json``. A peer machine with a different
    default locale (or a UTF-8 Unix CI runner reading a file written on
    a cp1252 Windows host) cannot decode the file, raising
    ``UnicodeDecodeError``. ``UnicodeDecodeError`` is a subclass of
    ``ValueError`` — not ``OSError`` / ``json.JSONDecodeError`` — so
    the existing fall-back ``except`` tuple in ``load_init_options``
    also misses it and the error propagates raw to the CLI.

  - ``src/specify_cli/extensions.py:764`` — ``.extensionignore``
    pattern reader. The very next line already normalises
    backslashes "so Windows-authored files work", proving the codebase
    expects Windows authors to write this file. Multibyte UTF-8
    patterns (Chinese filenames, accented directory names) silently
    mojibake when the host locale is not UTF-8, so the patterns fail
    to match and unintended files are shipped with the extension.

The sibling integration-catalog reader at
``src/specify_cli/integrations/catalog.py:150,156,193,202,374``
already pins ``encoding="utf-8"`` everywhere. PR github#2280 fixed the
symmetric PowerShell-template BOM bug. This change brings the two
remaining drifted paths in line with that precedent.

Regression tests:

  - ``tests/test_presets.py::TestInitOptions`` — parametrized non-ASCII
    round-trip (CJK, Latin-1, Greek, emoji) plus a corrupted-file case
    that asserts the existing "fall back to {}" contract still holds
    when a peer file contains bytes invalid as UTF-8.
  - ``tests/test_extensions.py::TestExtensionIgnore`` — Japanese
    (``ドキュメント/``) and Latin-1 (``café/``) ignore patterns
    correctly exclude their directories during install.

Quratulain-bilal requested a review from mnriem as a code owner on May 23, 2026 at 10:44