fix: handle non-BMP Unicode codepoints in foldl, foldr, and %c format #606

Open
JoshRosen wants to merge 1 commit into databricks:master from JoshRosen:fix-unicode-foldl-foldr-format-c
@JoshRosen
Contributor

This PR fixes two more non-BMP Unicode bugs:

  • foldl/foldr iterated over strings by UTF-16 code unit (for (char <- s.value)), splitting non-BMP characters such as emoji into their two surrogate halves. The fix iterates with codePointAt/codePointBefore and steps by Character.charCount, so each element is a whole codepoint.
  • The %c format conversion used s.toChar.toString, which truncates codepoints above U+FFFF to 16 bits. The fix uses Character.toString(s.toInt) instead.
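
To illustrate the two fixes, here is a minimal standalone sketch in Scala. It is not the code in this PR: the object name CodePointSketch and the helpers foldlCodePoints, foldrCodePoints, and formatChar are hypothetical, and plain String values stand in for the interpreter's own value types. It uses new String(Character.toChars(cp)) so that it also runs on Java 8; Character.toString(int), which the description names, is the equivalent call on Java 11 and later.

```scala
object CodePointSketch {
  // Left fold over the code points of s (hypothetical helper, not the PR's code).
  // Each element passed to f is a whole code point rendered as a String.
  def foldlCodePoints[A](s: String, init: A)(f: (A, String) => A): A = {
    var acc = init
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)          // reads one or two UTF-16 units
      acc = f(acc, new String(Character.toChars(cp)))
      i += Character.charCount(cp)       // 1 for BMP, 2 for non-BMP
    }
    acc
  }

  // Right fold: walk backwards from the end using codePointBefore.
  def foldrCodePoints[A](s: String, init: A)(f: (String, A) => A): A = {
    var acc = init
    var i = s.length
    while (i > 0) {
      val cp = s.codePointBefore(i)      // code point ending just before index i
      acc = f(new String(Character.toChars(cp)), acc)
      i -= Character.charCount(cp)
    }
    acc
  }

  // %c-style conversion: code point number to a one-character string.
  // Unlike cp.toChar.toString, this keeps code points above U+FFFF intact.
  def formatChar(cp: Int): String = new String(Character.toChars(cp))

  def main(args: Array[String]): Unit = {
    val s = "a\uD83D\uDE00b"             // "a", U+1F600, "b": 4 UTF-16 units
    println(foldlCodePoints(s, 0)((n, _) => n + 1))
    println(foldrCodePoints(s, List.empty[String])(_ :: _))
    println(formatChar(0x1F600) == "\uD83D\uDE00")
  }
}
```

Folding over the sample string visits three elements rather than four, because the surrogate pair is consumed as a single step in both directions.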

All code written by Claude Opus 4.6.


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>