feat(gcp_cloud_storage sink): support Parquet batch encoding #25590
dshmatov wants to merge 10 commits into
Add Apache Parquet columnar batch encoding to the `gcp_cloud_storage` sink, matching the existing `aws_s3` sink capability. Enable it via `batch_encoding.codec = "parquet"`. When `batch_encoding` is set, events are encoded together as a batch in the columnar Parquet format instead of the standard per-event, framing-based encoding. Parquet handles its own compression internally (configurable via `batch_encoding.compression`), so the top-level `compression` setting is bypassed (with a warning if set), the object `Content-Type` defaults to `application/vnd.apache.parquet`, and the filename extension defaults to `parquet`. This reuses the shared codec/batch-encoder infrastructure already used by the `aws_s3` sink; the change is purely the sink-side wiring plus the `GcsBatchEncoding` config enum.
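The defaulting rules described above can be sketched roughly as follows. This is an illustrative model only: `SinkSettings`, `ResolvedSettings`, and `resolve` are hypothetical names, not the types this PR adds, and the non-Parquet path is reduced to a plain pass-through.

```rust
/// Hypothetical, simplified view of the settings involved (not Vector's real config types).
#[derive(Debug, Clone, Default)]
pub struct SinkSettings {
    /// True when `batch_encoding.codec = "parquet"` is configured.
    pub parquet_batch_encoding: bool,
    /// Top-level `compression`, if the user set it.
    pub compression: Option<String>,
    /// Explicit user override for the object `Content-Type`.
    pub content_type: Option<String>,
    /// Explicit user override for the filename extension.
    pub filename_extension: Option<String>,
}

/// What the sink would end up using after defaults are applied.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedSettings {
    pub compression: Option<String>,
    pub content_type: Option<String>,
    pub filename_extension: Option<String>,
    pub warnings: Vec<String>,
}

pub fn resolve(settings: &SinkSettings) -> ResolvedSettings {
    if !settings.parquet_batch_encoding {
        // Standard per-event path: outside the scope of this sketch, so pass through.
        return ResolvedSettings {
            compression: settings.compression.clone(),
            content_type: settings.content_type.clone(),
            filename_extension: settings.filename_extension.clone(),
            warnings: Vec::new(),
        };
    }

    let mut warnings = Vec::new();
    if settings.compression.is_some() {
        // Parquet compresses internally, so the top-level setting is bypassed.
        warnings.push(String::from(
            "top-level `compression` is ignored when `batch_encoding` is set",
        ));
    }

    ResolvedSettings {
        compression: None,
        // User-supplied values win; otherwise fall back to the Parquet defaults.
        content_type: Some(
            settings
                .content_type
                .clone()
                .unwrap_or_else(|| String::from("application/vnd.apache.parquet")),
        ),
        filename_extension: Some(
            settings
                .filename_extension
                .clone()
                .unwrap_or_else(|| String::from("parquet")),
        ),
        warnings,
    }
}
```

Returning the warning as data instead of logging it keeps the sketch free of logging dependencies; the actual sink logs the warning.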
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
…ct list
- 'vnd' is introduced by this PR (the application/vnd.apache.parquet MIME type)
- 'deser' comes from the upstream 'deser_failed' buffer metric in internal_metrics.cue, which the PR's merge with master surfaces
drichards-87 left a comment
Left a couple of suggestions from Docs and approved the PR.
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
…batch-encoding
# Conflicts:
#	.github/actions/spelling/expect.txt
… into feat/gcs-parquet-batch-encoding
…ptions
Improve the wording of the shared `ParquetSerializerConfig` doc-comments (`schema_mode`, `auto_infer`, and compression level). Regenerates the `aws_s3` and `gcp_cloud_storage` component docs accordingly.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a4a585b76
… in changelog
A config with only `batch_encoding.codec = "parquet"` fails at sink startup because `schema_mode` defaults to `relaxed`, which requires a `schema_file`. Clarify that either `schema_mode = "auto_infer"` or a `schema_file` must be set.
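The startup failure described in this commit message comes down to one validation rule, sketched below. All names are hypothetical: `SchemaMode` models only the two modes named here (the real config may have others), `ParquetOptions` and `validate` are illustrative, and accepting `auto_infer` together with a `schema_file` is an assumption the PR text does not address.

```rust
/// Hypothetical model of the schema modes named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum SchemaMode {
    /// The default mode; it needs a schema file to work from.
    #[default]
    Relaxed,
    /// Infers the schema from the events in each batch.
    AutoInfer,
}

/// Hypothetical subset of the Parquet options under `batch_encoding`.
#[derive(Debug, Clone, Default)]
pub struct ParquetOptions {
    pub schema_mode: SchemaMode,
    pub schema_file: Option<String>,
}

/// Rejects the combination that fails at sink startup: `relaxed` without a `schema_file`.
pub fn validate(opts: &ParquetOptions) -> Result<(), String> {
    match (opts.schema_mode, &opts.schema_file) {
        (SchemaMode::Relaxed, None) => Err(String::from(
            "schema_mode = \"relaxed\" requires a schema_file; \
             set schema_file or use schema_mode = \"auto_infer\"",
        )),
        // Assumption: every other combination is accepted in this sketch.
        _ => Ok(()),
    }
}
```

`ParquetOptions::default()` stands in for a config that sets only the codec, which is the case the changelog wording was clarified for.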
Hi, I wanted to follow up on this PR. It would help fix the limitation I'm facing, and I'd really like to use it upstream.
Summary
Adds Apache Parquet columnar batch encoding to the `gcp_cloud_storage` sink, matching the capability the `aws_s3` sink already has. Enable it with `batch_encoding.codec = "parquet"`.
When `batch_encoding` is set, events are encoded together as a batch into a single Parquet file per batch (instead of the standard per-event, framing-based encoding). Parquet handles its own compression internally, so:
- the top-level `compression` setting is bypassed (a warning is logged if it was set),
- the object `Content-Type` defaults to `application/vnd.apache.parquet`,
- the filename extension defaults to `parquet`.
This reuses the shared codec/batch-encoder infrastructure introduced for `aws_s3` (the `BatchEncoder` / `BatchSerializerConfig` / `EncoderKind` machinery in `lib/codecs` and `src/sinks/util`). The change is purely the sink-side wiring plus a `GcsBatchEncoding` config enum, with no new encoding logic.
Vector configuration
Schema-file mode is also supported:
How did you test this PR?
- `cargo test --features "sinks-gcp,codecs-parquet" --lib gcp::cloud_storage`: 9 new Parquet tests + the 15 existing GCS sink tests pass.
- The new tests cover `schema_mode` parsing/defaults, content-type auto-detection + user override, `.parquet` extension default + override, top-level-`compression` bypass, and rejection of non-parquet codecs.
- `parquet_encodes_valid_file` runs a real batch through the sink's request builder and asserts the output is a valid Parquet file (`PAR1` magic bytes, correct row count, inferred `message`/`host` columns) using the `parquet` reader.
- Checked with `codecs-parquet` both enabled and disabled (the feature-gated paths).
- `make check-clippy`, `make fmt`, and `make check-generated-docs` (regenerated `gcp_cloud_storage.cue`) pass.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References