From efda01284a979f7d30b229968f965462c882a546 Mon Sep 17 00:00:00 2001
From: Rituparna Khaund
Date: Fri, 29 May 2026 22:34:57 +0000
Subject: [PATCH 1/2] s3: document format=parquet option and page-level compression

Update S3 output plugin documentation to reflect the new format=parquet
option that separates output format selection from byte-level compression.

Documents:
- New parquet value for the format option
- Page-level compression codec control via compression when format is parquet
- Migration path from deprecated compression=parquet syntax
- Configuration examples with and without page-level compression
- Updated existing parquet examples to use new syntax

Related code PR: https://github.com/fluent/fluent-bit/pull/11885

Signed-off-by: Rituparna Khaund
---
 pipeline/outputs/s3.md | 91 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 5 deletions(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 74046e0e0..208c0272f 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -46,12 +46,12 @@ The [Prometheus success/retry/error metrics values](../../administration/monitor
 | `blob_database_file` | Absolute path to a database file to be used to store blob files contexts. | _none_ |
 | `bucket` | S3 bucket name. | _none_ |
 | `canned_acl` | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | _none_ |
-| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`. `arrow` and `parquet` are also available if Apache Arrow was enabled at compile time. See [Compression](#compression). | _none_ |
+| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`, `arrow`. When `format` is set to `parquet`, this controls the page-level codec inside the Parquet file (supported: `snappy`, `zstd`, `gzip`). `compression=parquet` is deprecated; use `format parquet` instead. See [Compression](#compression). | _none_ |
 | `content_type` | A standard MIME type for the S3 object, set as the Content-Type HTTP header. | _none_ |
 | `endpoint` | Custom endpoint for the S3 API. Endpoints can contain scheme and port. | _none_ |
 | `external_id` | Specify an external ID for the STS API. Can be used with the `role_arn` parameter if your role requires an external ID. | _none_ |
 | `file_delivery_attempt_limit` | File delivery attempt limit. | `1` |
-| `format` | Set the record output format. Supported values: `json_lines`, `otlp_json`. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
+| `format` | Set the output format. Supported values: `json_lines`, `otlp_json`, `parquet`. When set to `parquet`, records are converted to Apache Parquet columnar format (requires Apache Arrow Parquet support at compile time). The `compression` option controls the page-level codec inside the Parquet file. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
 | `host` | IP address or hostname of the target HTTP server. | `127.0.0.1` |
 | `json_date_format` | Specify the format of the date. Accepted values: `double`, `epoch`, `epoch_ms`, `iso8601` (2018-05-30T09:39:52.000681Z), `_java_sql_timestamp_` (2018-05-30 09:39:52.000681). | _none_ |
 | `json_date_key` | Specify the name of the date key in the output record. To disable the time key, set the value to `false`. | `date` |
@@ -128,6 +128,85 @@ Fluent Bit compresses data before uploading to S3. Consumers must decompress the
 
 {% endhint %}
 
+## Parquet format
+
+Setting `format` to `parquet` converts log records to Apache Parquet columnar format before uploading to S3. Parquet files are directly queryable by Athena, Spark, and Presto without additional transformation.
+
+The `compression` option controls the page-level codec applied inside the Parquet file:
+
+| `compression` value | Parquet page codec | Notes |
+|---------------------|-------------------|-------|
+| `snappy` | Snappy | Fast, moderate compression ratio. Industry standard default. |
+| `zstd` | Zstandard | Better ratio, slightly slower. |
+| `gzip` | Gzip | Best ratio, slowest. |
+| _(unset)_ | Uncompressed | No page-level compression. |
+
+{% hint style="info" %}
+
+`format parquet` requires `use_put_object On`. Multipart uploads are not supported with Parquet format.
+
+{% endhint %}
+
+### Example: Parquet with Snappy compression
+
+```yaml
+pipeline:
+  outputs:
+    - name: s3
+      match: '*'
+      bucket: my-bucket
+      region: us-east-1
+      format: parquet
+      compression: snappy
+      use_put_object: on
+      upload_timeout: 60s
+      total_file_size: 50M
+      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
+```
+
+### Example: Parquet without page-level compression
+
+```yaml
+pipeline:
+  outputs:
+    - name: s3
+      match: '*'
+      bucket: my-bucket
+      region: us-east-1
+      format: parquet
+      use_put_object: on
+      upload_timeout: 60s
+      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
+```
+
+### Migrating from `compression=parquet`
+
+The `compression=parquet` syntax is deprecated. To migrate:
+
+**Before (deprecated):**
+
+```yaml
+compression: parquet
+```
+
+**After (recommended):**
+
+```yaml
+format: parquet
+compression: snappy
+```
+
+The deprecated syntax continues to work but produces Parquet files with uncompressed pages and emits a warning at startup.
+
+### Build requirements
+
+Parquet format requires Apache Arrow Parquet support at compile time:
+
+- CMake flag: `-DFLB_ARROW=On`
+- System packages: `arrow-glib-devel` and `parquet-glib-devel`
+
+The `AWS for Fluent Bit` version 3 container image includes these dependencies by default.
+
 ## Permissions
 
 The plugin requires the following AWS IAM permissions:
@@ -694,7 +773,7 @@ pipeline:
 {% endtab %}
 {% endtabs %}
 
-Setting `Compression` to `arrow` makes Fluent Bit convert payload into Apache Arrow format.
+Setting `compression` to `arrow` converts the payload to Apache Arrow (Feather) format. For Parquet output, use `format parquet` instead.
 
 Load, analyze, and process stored data using popular data processing tools such as Python pandas, Apache Spark and Tensorflow.
 
@@ -766,7 +845,8 @@ pipeline:
       region: us-east-2
       bucket:
       use_put_object: On
-      compression: parquet
+      format: parquet
+      compression: snappy
       # other parameters
 ```
 
@@ -791,7 +871,8 @@ pipeline:
   Region us-east-2
   Bucket
   Use_Put_Object On
-  Compression parquet
+  Format parquet
+  Compression snappy
   # other parameters
 ```
 

From 546958f8d9386f5ca29410ffee3509d3f6f1a77e Mon Sep 17 00:00:00 2001
From: "Eric D. Schabell"
Date: Mon, 1 Jun 2026 08:26:24 +0200
Subject: [PATCH 2/2] docs: pipeline: outputs: s3: fix Vale spelling and contraction suggestions

- Wrap `codec` in backticks for linting issue
- Replace "are not" with "aren't" for linting issue

Applies to #2591

Signed-off-by: Eric D. Schabell
---
 pipeline/outputs/s3.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 208c0272f..7dd2023b5 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -46,12 +46,12 @@ The [Prometheus success/retry/error metrics values](../../administration/monitor
 | `blob_database_file` | Absolute path to a database file to be used to store blob files contexts. | _none_ |
 | `bucket` | S3 bucket name. | _none_ |
 | `canned_acl` | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | _none_ |
-| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`, `arrow`. When `format` is set to `parquet`, this controls the page-level codec inside the Parquet file (supported: `snappy`, `zstd`, `gzip`). `compression=parquet` is deprecated; use `format parquet` instead. See [Compression](#compression). | _none_ |
+| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`, `arrow`. When `format` is set to `parquet`, this controls the page-level `codec` inside the Parquet file (supported: `snappy`, `zstd`, `gzip`). `compression=parquet` is deprecated; use `format parquet` instead. See [Compression](#compression). | _none_ |
 | `content_type` | A standard MIME type for the S3 object, set as the Content-Type HTTP header. | _none_ |
 | `endpoint` | Custom endpoint for the S3 API. Endpoints can contain scheme and port. | _none_ |
 | `external_id` | Specify an external ID for the STS API. Can be used with the `role_arn` parameter if your role requires an external ID. | _none_ |
 | `file_delivery_attempt_limit` | File delivery attempt limit. | `1` |
-| `format` | Set the output format. Supported values: `json_lines`, `otlp_json`, `parquet`. When set to `parquet`, records are converted to Apache Parquet columnar format (requires Apache Arrow Parquet support at compile time). The `compression` option controls the page-level codec inside the Parquet file. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
+| `format` | Set the output format. Supported values: `json_lines`, `otlp_json`, `parquet`. When set to `parquet`, records are converted to Apache Parquet columnar format (requires Apache Arrow Parquet support at compile time). The `compression` option controls the page-level `codec` inside the Parquet file. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
 | `host` | IP address or hostname of the target HTTP server. | `127.0.0.1` |
 | `json_date_format` | Specify the format of the date. Accepted values: `double`, `epoch`, `epoch_ms`, `iso8601` (2018-05-30T09:39:52.000681Z), `_java_sql_timestamp_` (2018-05-30 09:39:52.000681). | _none_ |
 | `json_date_key` | Specify the name of the date key in the output record. To disable the time key, set the value to `false`. | `date` |
@@ -132,9 +132,9 @@ Fluent Bit compresses data before uploading to S3. Consumers must decompress the
 
 Setting `format` to `parquet` converts log records to Apache Parquet columnar format before uploading to S3. Parquet files are directly queryable by Athena, Spark, and Presto without additional transformation.
 
-The `compression` option controls the page-level codec applied inside the Parquet file:
+The `compression` option controls the page-level `codec` applied inside the Parquet file:
 
-| `compression` value | Parquet page codec | Notes |
+| `compression` value | Parquet page `codec` | Notes |
 |---------------------|-------------------|-------|
 | `snappy` | Snappy | Fast, moderate compression ratio. Industry standard default. |
 | `zstd` | Zstandard | Better ratio, slightly slower. |
@@ -143,7 +143,7 @@ The `compression` option controls the page-level codec applied inside the Parque
 
 {% hint style="info" %}
 
-`format parquet` requires `use_put_object On`. Multipart uploads are not supported with Parquet format.
+`format parquet` requires `use_put_object On`. Multipart uploads aren't supported with Parquet format.
 
 {% endhint %}
 
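
The migration step documented in this series (replace `compression: parquet` with `format: parquet` plus a page-level codec) could be applied to existing YAML configs with a small script. The following is a minimal sketch, not part of Fluent Bit or of this patch: the helper name `migrate_parquet_config` and its `codec` parameter are hypothetical, and the rewrite is purely line-based, so it assumes the output section does not already contain a `format` key.

```python
import re

# Hypothetical helper (an assumption for illustration, not a Fluent Bit tool).
# Matches a whole line of the form `compression: parquet`, keeping its indent.
_DEPRECATED = re.compile(
    r"^(?P<indent>[ \t]*)compression:[ \t]*parquet[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

# Page-level codecs listed in the patch's documentation table.
_CODECS = {"snappy", "zstd", "gzip"}


def migrate_parquet_config(text, codec="snappy"):
    """Rewrite deprecated `compression: parquet` lines in YAML config text.

    With codec=None only `format: parquet` is emitted, which keeps pages
    uncompressed (the documented behaviour of the deprecated syntax).
    Limitation: an existing `format:` key in the same output is not detected.
    """
    if codec is not None and codec not in _CODECS:
        raise ValueError(f"unsupported page-level codec: {codec!r}")

    def _replace(match):
        indent = match.group("indent")
        lines = [f"{indent}format: parquet"]
        if codec is not None:
            lines.append(f"{indent}compression: {codec}")
        return "\n".join(lines)

    return _DEPRECATED.sub(_replace, text)


if __name__ == "__main__":
    old = "      use_put_object: On\n      compression: parquet\n"
    print(migrate_parquet_config(old))
```

Note that choosing `snappy` follows the "After (recommended)" example and changes the produced files from uncompressed to Snappy-compressed pages; passing `codec=None` in this sketch keeps the previous uncompressed output.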