Skip to content

validate utf-8 continuation bytes in utf8ToCodepoint#1702

Open
SABITHSAHEB wants to merge 1 commit into
open-source-parsers:masterfrom
SABITHSAHEB:writer-utf8-continuation-bytes
Open

validate utf-8 continuation bytes in utf8ToCodepoint#1702
SABITHSAHEB wants to merge 1 commit into
open-source-parsers:masterfrom
SABITHSAHEB:writer-utf8-continuation-bytes

Conversation

@SABITHSAHEB

Copy link
Copy Markdown
Contributor
  1. utf8ToCodepoint consumed the bytes after a multibyte lead without checking they are continuation bytes (10xxxxxx), so a broken sequence swallowed the ASCII that followed it. With the default emitUTF8=off, serializing a string containing "\xE0AB" produced "�" and dropped the valid A and B; "\xF0XYZ" produced a bogus surrogate pair with X, Y, Z gone.
  2. a well-formed 4-byte sequence decoding past U+10FFFF was still emitted, as an invalid surrogate pair.
    Validate each trailing byte and cap at U+10FFFF, returning U+FFFD without advancing past bytes that are not part of the sequence. This is the lenient replacement behaviour rather than a hard error, so valid UTF-8 output is unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant