Skip to content

Decode PBF blobs by data field#895

Open
Symmetricity wants to merge 1 commit into
systemed:masterfrom
Symmetricity:fix/pbf-zlib-without-raw-size
Open

Decode PBF blobs by data field#895
Symmetricity wants to merge 1 commit into
systemed:masterfrom
Symmetricity:fix/pbf-zlib-without-raw-size

Conversation

@Symmetricity
Copy link
Copy Markdown

This PR is AI generated.

Summary

This fixes PBF Blob decoding so tilemaker uses the actual Blob data field to
decide whether payload bytes are raw or zlib-compressed.

The current reader uses raw_size == -1 to decide that a Blob is uncompressed:

if (rawSize == -1)
	return view;

That is not the right discriminator. In the OSM-binary storage format,
raw_size is optional metadata for the uncompressed size, while the payload
type is carried by the Blob oneof data field:

  • raw: uncompressed payload
  • zlib_data: zlib-compressed payload
  • lzma_data, lz4_data, zstd_data: other compressed payloads

Reference:
https://github.com/openstreetmap/OSM-binary/blob/master/osmpbf/fileformat.proto

If a Blob contains zlib_data but omits raw_size, the old code returns the
compressed bytes directly to the protozero parser. Protozero then attempts to
parse compressed data as a HeaderBlock or PrimitiveBlock, which can surface
as errors such as protozero::invalid_tag_exception or
protozero::end_of_buffer_exception.

Implementation

The patch keeps the change local to PbfReader::readBlob():

  • Track which Blob data field was present.
  • Return bytes directly only when the Blob contains the raw field.
  • Inflate zlib_data even when raw_size is omitted.
  • Preserve the existing pre-sized output buffer when raw_size is available.
  • Report missing Blob payloads and unsupported compression fields explicitly.

No block scheduling, offset handling, storage, or protozero object parsing logic
is changed.

Expected Behavior

Before this change, a zlib-compressed Blob without raw_size could be silently
misclassified as raw data. The failure would happen later, when protozero tried
to parse compressed bytes as protobuf message content.

After this change, tilemaker determines the payload type from the Blob data
field. Raw payloads still avoid a copy/decompression step, while zlib payloads
are decompressed before parsing.

Possible Regressions

The normal path for common PBFs with zlib_data and raw_size should behave
the same, except the compression decision is now based on the data field rather
than the metadata field.

For zlib Blobs that omit raw_size, the decompression buffer may need to grow
dynamically because tilemaker does not know the uncompressed size up front.
That is a small allocation cost on this uncommon path and avoids parsing the
wrong bytes.

For unusual raw Blobs that include raw_size, this changes behavior from
attempting to decompress raw bytes to returning the raw payload. That follows
the Blob data field and should be more correct.

Unsupported compression fields remain unsupported, but the error message is now
explicit.

Related Issues And PRs

I did not find an existing upstream PR for the same raw_size/zlib_data
change.

I also did not find an upstream issue that directly reports a zlib Blob without
raw_size. This may be relevant to broader PBF parsing failures, but this PR
does not claim to fix those reports without a matching reproducer.

Testing

Code checks:

git diff --cached --check

Local build:

cmake -S . -B local-builds/pbf-zlib-without-raw-size -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build local-builds/pbf-zlib-without-raw-size --target tilemaker -j1

Focused parser test:

make test_pbf_reader -j1

Result:

1 tests, 29 assertions, 0 failures

Blob.raw_size is optional metadata for compressed payloads. The reader was using the absence of raw_size to decide that a blob was raw, so a zlib_data blob without raw_size could be returned to protozero still compressed and surface as invalid_tag_exception while parsing a PrimitiveBlock.

Track the actual Blob data field instead. Return raw payloads only when the raw field is present, inflate zlib_data even when raw_size is omitted, and report missing or unsupported blob payloads explicitly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant