feat(gcp_cloud_storage sink): support Parquet batch encoding #25590

Open
dshmatov wants to merge 10 commits into vectordotdev:master from dshmatov:feat/gcs-parquet-batch-encoding

dshmatov commented Jun 7, 2026

Summary

Adds Apache Parquet columnar batch encoding to the gcp_cloud_storage sink, matching the capability the aws_s3 sink already has. Enable it with batch_encoding.codec = "parquet".

When batch_encoding is set, each batch of events is encoded into a single Parquet file instead of going through the standard per-event, framing-based encoding. Parquet handles its own compression internally (configurable via batch_encoding.compression), so:

  • the top-level compression setting is bypassed (a warning is logged if it was set),
  • the object Content-Type defaults to application/vnd.apache.parquet,
  • the filename extension defaults to parquet.
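
The defaulting rules in the list above can be modelled roughly as follows. This is an illustrative Python sketch, not the sink's Rust implementation: the function name `resolve_object_settings`, the `ObjectSettings` type, and the parameter names are hypothetical, and the behaviour without `batch_encoding` is simply passed through because this PR does not change it.

```python
import logging
from typing import NamedTuple, Optional

logger = logging.getLogger(__name__)

# Both defaults are taken from the PR description.
PARQUET_CONTENT_TYPE = "application/vnd.apache.parquet"
PARQUET_EXTENSION = "parquet"


class ObjectSettings(NamedTuple):
    """Hypothetical container for the settings that end up on the uploaded object."""
    content_type: Optional[str]
    filename_extension: Optional[str]
    compression: Optional[str]


def resolve_object_settings(
    batch_codec: Optional[str],
    content_type: Optional[str] = None,
    filename_extension: Optional[str] = None,
    compression: Optional[str] = None,
) -> ObjectSettings:
    """Model the defaulting described above (assumed shape, not Vector's API)."""
    if batch_codec is None:
        # No batch_encoding: the existing per-event path applies and its own
        # defaults (not modelled here) are used, so values pass through as-is.
        return ObjectSettings(content_type, filename_extension, compression)

    if batch_codec != "parquet":
        # The PR's tests cover rejection of non-parquet batch codecs.
        raise ValueError(f"unsupported batch_encoding codec: {batch_codec!r}")

    if compression is not None:
        # Parquet compresses internally, so the top-level setting is bypassed.
        logger.warning("top-level compression %r is ignored for Parquet", compression)

    return ObjectSettings(
        content_type=content_type or PARQUET_CONTENT_TYPE,  # user override wins
        filename_extension=filename_extension or PARQUET_EXTENSION,
        compression=None,
    )
```

For example, `resolve_object_settings("parquet", compression="gzip")` yields the Parquet content type and extension with the top-level compression dropped, while an explicit `content_type` argument is kept unchanged.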

This reuses the shared codec/batch-encoder infrastructure introduced for aws_s3 (the BatchEncoder / BatchSerializerConfig / EncoderKind machinery in lib/codecs and src/sinks/util). The change is purely the sink-side wiring plus a GcsBatchEncoding config enum — no new encoding logic.

Vector configuration

sources:
  demo:
    type: demo_logs
    format: json

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: [demo]
    bucket: my-bucket
    encoding:
      codec: text
    batch_encoding:
      codec: parquet
      schema_mode: auto_infer
      compression:
        algorithm: snappy

Schema-file mode is also supported:

    batch_encoding:
      codec: parquet
      schema_mode: strict
      schema_file: /etc/vector/schema.parquet
      compression:
        algorithm: zstd
        level: 10

How did you test this PR?

  • cargo test --features "sinks-gcp,codecs-parquet" --lib gcp::cloud_storage — 9 new Parquet tests + the 15 existing GCS sink tests pass.
    • Config-level: TOML/YAML shape, schema_mode parsing/defaults, content-type auto-detection + user override, .parquet extension default + override, top-level-compression bypass, rejection of non-parquet codecs.
    • End-to-end encoding: parquet_encodes_valid_file runs a real batch through the sink's request builder and asserts the output is a valid Parquet file (PAR1 magic bytes, correct row count, inferred message/host columns) using the parquet reader.
  • Verified the crate compiles with codecs-parquet both enabled and disabled (the feature-gated paths).
  • make check-clippy, make fmt, and make check-generated-docs (regenerated gcp_cloud_storage.cue) pass.
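
For readers who want a quick sanity check on an uploaded object, the PAR1 magic-byte assertion mentioned above can be approximated outside the test suite. The sketch below is a minimal Python check based on the Parquet file layout (a 4-byte `PAR1` header, and a trailer made of a 4-byte little-endian footer length followed by `PAR1`). The function name `looks_like_parquet` is hypothetical, and this is not the PR's test, which uses the Rust parquet reader and also verifies row count and columns.

```python
PARQUET_MAGIC = b"PAR1"
# Header magic (4) + footer length field (4) + trailer magic (4).
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data: bytes) -> bool:
    """Cheap structural check; it does not parse or validate the footer metadata."""
    if len(data) < MIN_PARQUET_SIZE:
        return False
    if data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        return False
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    footer_len = int.from_bytes(data[-8:-4], "little")
    # The footer must fit between the header magic and the 8-byte trailer.
    return footer_len <= len(data) - MIN_PARQUET_SIZE
```

A check like this only rules out obviously wrong output (for example, newline-delimited JSON written with a `.parquet` extension); a real reader is still needed to confirm the schema and row count.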

Note: this repo currently has no GCS integration-test harness (no fake-gcs-server service under scripts/integration/), so Parquet output is validated at the unit level through the real encode path rather than against a live backend. Happy to add an integration harness if maintainers would prefer one.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Add Apache Parquet columnar batch encoding to the `gcp_cloud_storage`
sink, matching the existing `aws_s3` sink capability. Enable it via
`batch_encoding.codec = "parquet"`.

When `batch_encoding` is set, events are encoded together as a batch in
the columnar Parquet format instead of the standard per-event,
framing-based encoding. Parquet handles its own compression internally
(configurable via `batch_encoding.compression`), so the top-level
`compression` setting is bypassed (with a warning if set), the object
`Content-Type` defaults to `application/vnd.apache.parquet`, and the
filename extension defaults to `parquet`.

This reuses the shared codec/batch-encoder infrastructure already used
by the `aws_s3` sink; the change is purely the sink-side wiring plus the
`GcsBatchEncoding` config enum.
dshmatov requested review from a team as code owners June 7, 2026 23:37
github-actions Bot added the docs review on hold, domain: sinks, and domain: external docs labels and removed the docs review on hold label Jun 7, 2026
github-actions Bot commented Jun 7, 2026 (Contributor)

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

dshmatov commented Jun 7, 2026 (Author)

I have read the CLA Document and I hereby sign the CLA

dshmatov added 2 commits June 8, 2026 01:54
…ct list

- 'vnd' is introduced by this PR (the application/vnd.apache.parquet MIME type)
- 'deser' comes from the upstream 'deser_failed' buffer metric in
  internal_metrics.cue, which the PR's merge with master surfaces
github-actions Bot added the docs review on hold and domain: ci labels and removed the docs review on hold label Jun 7, 2026
drichards-87 self-assigned this Jun 8, 2026

drichards-87 left a comment (Contributor)

Left a couple of suggestions from Docs and approved the PR.

Comment thread website/cue/reference/components/sinks/generated/gcp_cloud_storage.cue Outdated
Comment thread website/cue/reference/components/sinks/generated/gcp_cloud_storage.cue Outdated
Comment thread website/cue/reference/components/sinks/generated/gcp_cloud_storage.cue Outdated
drichards-87 removed their assignment Jun 8, 2026
dshmatov and others added 2 commits June 8, 2026 19:24
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
github-actions Bot added the docs review on hold label Jun 8, 2026
dshmatov and others added 4 commits June 8, 2026 19:24
Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
…batch-encoding

# Conflicts:
#	.github/actions/spelling/expect.txt
…ptions

Improve the wording of the shared ParquetSerializerConfig doc-comments
(schema_mode, auto_infer, and compression level). Regenerates the
aws_s3 and gcp_cloud_storage component docs accordingly.
github-actions Bot removed the domain: ci label Jun 12, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a4a585b76

Comment thread changelog.d/gcs_parquet_encoding.feature.md Outdated
… in changelog

A config with only `batch_encoding.codec = "parquet"` fails at sink
startup because `schema_mode` defaults to `relaxed`, which requires a
`schema_file`. Clarify that either `schema_mode = "auto_infer"` or a
`schema_file` must be set.
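
The startup rule described in this commit message can be sketched as a small validation function. This is a hedged Python model rather than Vector's Rust code: the function name `validate_parquet_schema_settings` is hypothetical, the set of modes is limited to the three named in this PR (`relaxed`, `strict`, `auto_infer`), and treating `strict` the same as `relaxed` with respect to `schema_file` is an assumption based on the wording above.

```python
from typing import Optional

# Mode names as they appear in this PR; the real enum may contain others.
SCHEMA_MODES = ("relaxed", "strict", "auto_infer")
DEFAULT_SCHEMA_MODE = "relaxed"  # default stated in the commit message


def validate_parquet_schema_settings(
    schema_mode: Optional[str] = None,
    schema_file: Optional[str] = None,
) -> str:
    """Return the effective schema mode, or raise if the config cannot start."""
    mode = schema_mode or DEFAULT_SCHEMA_MODE
    if mode not in SCHEMA_MODES:
        raise ValueError(f"unknown schema_mode: {mode!r}")
    if mode != "auto_infer" and not schema_file:
        # This is the failure hit by a config that sets only codec = "parquet".
        raise ValueError(
            f'schema_mode "{mode}" requires schema_file; '
            'set schema_file or use schema_mode = "auto_infer"'
        )
    return mode
```

Under these assumptions, both example configurations from the PR description pass (`auto_infer` without a file, and `strict` with `schema_file`), while calling the function with no arguments raises, mirroring the codec-only config.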
akukhar commented Jun 12, 2026

Hi, I wanted to follow up on this PR. It would help fix the limitation I'm facing, and I'd really like to use it upstream.

Labels

  • docs review on hold (The documentation team reviews PRs only after a PR is approved by the COSE team.)
  • domain: external docs (Anything related to Vector's external, public documentation)
  • domain: sinks (Anything related to the Vector's sinks)

3 participants