fix(file source): handle concatenated gzip streams#25614

Draft
thomasqueirozb wants to merge 8 commits into master from fix/file-source-gzip-multi-stream
@thomasqueirozb thomasqueirozb commented Jun 12, 2026

Member

Summary

The async file source migration (#23612, v0.50.0) replaced flate2::bufread::MultiGzDecoder with async_compression::tokio::bufread::GzipDecoder. MultiGzDecoder handles concatenated gzip streams by design; GzipDecoder stops after the first member unless .multiple_members(true) is called, which was never done. This caused the file source and fingerprinter to silently drop all but the first gzip stream in multi-member files.

The fix introduces gzip_multiple_decoder in vector-common::compression — a thin wrapper that constructs a GzipDecoder with multiple_members enabled — and replaces all bare GzipDecoder::new call sites (file watcher, fingerprinter, aws_s3 source). GzipDecoder::new is now a denied method in clippy.toml to prevent recurrence.
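
To make the failure mode concrete, here is a minimal Python sketch of the same behaviour using only the standard library. It is not the Rust code from this PR: `decode_first_member` and `decode_all_members` are hypothetical helper names, and `zlib.decompressobj` stands in for a decoder that stops at the end of the first gzip member, roughly what GzipDecoder does without multiple_members enabled.

```python
import gzip
import zlib

# Two independently compressed members, concatenated (same shape as `gzip -c >>`).
member_1 = gzip.compress(b"multiple_1hello\n")
member_2 = gzip.compress(b"multiple_2world\n")
data = member_1 + member_2

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to expect a gzip header and trailer


def decode_first_member(buf: bytes) -> bytes:
    """Decode one gzip member and ignore whatever follows it (hypothetical helper)."""
    decoder = zlib.decompressobj(GZIP_WBITS)
    return decoder.decompress(buf)


def decode_all_members(buf: bytes) -> bytes:
    """Decode every gzip member by restarting on the leftover bytes (hypothetical helper)."""
    out = bytearray()
    while buf:
        decoder = zlib.decompressobj(GZIP_WBITS)
        out += decoder.decompress(buf)
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        # Bytes after the end of this member belong to the next one.
        buf = decoder.unused_data
    return bytes(out)


first_only = decode_first_member(data)
everything = decode_all_members(data)
```

In this sketch `first_only` contains only the first line, while `everything` contains both, which mirrors the difference between the pre-fix and post-fix behaviour described above.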

Vector configuration

data_dir: /tmp/vector-test

sources:
  files:
    type: file
    include:
      - /tmp/vector-test/*.gz
    fingerprint:
      strategy: checksum
    read_from: beginning

sinks:
  out:
    type: console
    inputs: [files]
    encoding:
      codec: text

How did you test this PR?

Create a multi-member gzip file and a standard single-member gzip file:

mkdir -p /tmp/vector-test

# multi-stream: two separate gzip members concatenated
echo "multiple_1hello" | gzip -c >  /tmp/vector-test/multiple-stream.gz
echo "multiple_2world" | gzip -c >> /tmp/vector-test/multiple-stream.gz

# single-stream: two lines in one gzip member
printf "single_1hello\nsingle_2world\n" | gzip -c > /tmp/vector-test/single-stream.gz

Run vector with the config above. Expected output (order may vary):

multiple_1hello
multiple_2world
single_1hello
single_2world

Before this fix, multiple_2world was silently dropped because GzipDecoder stopped after the first member. To stress the path further, a third member was appended and all three were read correctly.
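
The three-member check could also be scripted. The following is a rough Python sketch, not part of the PR: it writes a fixture into a temporary directory, counts the gzip members, and reads the lines back. The helper names and the third line's content (`multiple_3again`) are assumptions made for illustration.

```python
import gzip
import tempfile
import zlib
from pathlib import Path


def write_multi_member(path: Path, lines: list[str]) -> None:
    """Write each line as its own gzip member, like repeated `gzip -c >> file`."""
    with path.open("wb") as f:
        for line in lines:
            f.write(gzip.compress(line.encode() + b"\n"))


def count_members(path: Path) -> int:
    """Count gzip members by decoding one at a time and following the leftover bytes."""
    buf = path.read_bytes()
    count = 0
    while buf:
        decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decoder.decompress(buf)
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        buf = decoder.unused_data
        count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "multiple-stream.gz"
    # The third line is a made-up value; the PR does not state what was appended.
    write_multi_member(fixture, ["multiple_1hello", "multiple_2world", "multiple_3again"])
    member_count = count_members(fixture)
    # gzip.open reads across member boundaries, so all lines should come back.
    with gzip.open(fixture, "rt") as f:
        decoded_lines = f.read().splitlines()
```

A fixture generated this way could be dropped into /tmp/vector-test to repeat the manual run with more members.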

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

github-actions (bot) added the "domain: sources" label (Anything related to the Vector's sources) Jun 12, 2026

Labels

domain: sources Anything related to the Vector's sources

Projects

None yet

Development

Successfully merging this pull request may close these issues.

File source no longer can decompress Gzip

1 participant