validate utf-8 continuation bytes in utf8ToCodepoint by SABITHSAHEB · Pull Request #1702 · open-source-parsers/jsoncpp

SABITHSAHEB · 2026-07-02T07:02:10Z

utf8ToCodepoint consumed the bytes after a multibyte lead without checking they are continuation bytes (10xxxxxx), so a broken sequence swallowed the ASCII that followed it. With the default emitUTF8=off, serializing a string containing "\xE0AB" produced "�" and dropped the valid A and B; "\xF0XYZ" produced a bogus surrogate pair with X, Y, Z gone.
a well-formed 4-byte sequence decoding past U+10FFFF was still emitted, as an invalid surrogate pair.
Validate each trailing byte and cap at U+10FFFF, returning U+FFFD without advancing past bytes that are not part of the sequence. This is the lenient replacement behaviour rather than a hard error, so valid UTF-8 output is unchanged.

validate utf-8 continuation bytes in utf8ToCodepoint

c4b40c1

Provide feedback