diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 74046e0e0..7dd2023b5 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -46,12 +46,12 @@ The [Prometheus success/retry/error metrics values](../../administration/monitor
 | `blob_database_file` | Absolute path to a database file to be used to store blob files contexts. | _none_ |
 | `bucket` | S3 bucket name. | _none_ |
 | `canned_acl` | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | _none_ |
-| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`. `arrow` and `parquet` are also available if Apache Arrow was enabled at compile time. See [Compression](#compression). | _none_ |
+| `compression` | Compression type for S3 objects. Supported values: `gzip`, `zstd`, `snappy`, `arrow`. When `format` is set to `parquet`, this controls the page-level `codec` inside the Parquet file (supported: `snappy`, `zstd`, `gzip`). `compression=parquet` is deprecated; use `format parquet` instead. See [Compression](#compression). | _none_ |
 | `content_type` | A standard MIME type for the S3 object, set as the Content-Type HTTP header. | _none_ |
 | `endpoint` | Custom endpoint for the S3 API. Endpoints can contain scheme and port. | _none_ |
 | `external_id` | Specify an external ID for the STS API. Can be used with the `role_arn` parameter if your role requires an external ID. | _none_ |
 | `file_delivery_attempt_limit` | File delivery attempt limit. | `1` |
-| `format` | Set the record output format. Supported values: `json_lines`, `otlp_json`. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
+| `format` | Set the output format. Supported values: `json_lines`, `otlp_json`, `parquet`. When set to `parquet`, records are converted to Apache Parquet columnar format (requires Apache Arrow Parquet support at compile time). The `compression` option controls the page-level `codec` inside the Parquet file. When set to `otlp_json`, the `log_key` option isn't supported and only `logs` event chunks are converted. | `json_lines` |
 | `host` | IP address or hostname of the target HTTP server. | `127.0.0.1` |
 | `json_date_format` | Specify the format of the date. Accepted values: `double`, `epoch`, `epoch_ms`, `iso8601` (2018-05-30T09:39:52.000681Z), `_java_sql_timestamp_` (2018-05-30 09:39:52.000681). | _none_ |
 | `json_date_key` | Specify the name of the date key in the output record. To disable the time key, set the value to `false`. | `date` |
@@ -128,6 +128,85 @@ Fluent Bit compresses data before uploading to S3. Consumers must decompress the
 
 {% endhint %}
 
+## Parquet format
+
+Setting `format` to `parquet` converts log records to Apache Parquet columnar format before uploading to S3. Parquet files are directly queryable by Athena, Spark, and Presto without additional transformation.
+
+The `compression` option controls the page-level `codec` applied inside the Parquet file:
+
+| `compression` value | Parquet page `codec` | Notes |
+|---------------------|-------------------|-------|
+| `snappy` | Snappy | Fast, moderate compression ratio. Industry standard default. |
+| `zstd` | Zstandard | Better ratio, slightly slower. |
+| `gzip` | Gzip | Best ratio, slowest. |
+| _(unset)_ | Uncompressed | No page-level compression. |
+
+{% hint style="info" %}
+
+`format parquet` requires `use_put_object On`. Multipart uploads aren't supported with Parquet format.
+
+{% endhint %}
+
+### Example: Parquet with Snappy compression
+
+```yaml
+pipeline:
+  outputs:
+    - name: s3
+      match: '*'
+      bucket: my-bucket
+      region: us-east-1
+      format: parquet
+      compression: snappy
+      use_put_object: on
+      upload_timeout: 60s
+      total_file_size: 50M
+      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
+```
+
+### Example: Parquet without page-level compression
+
+```yaml
+pipeline:
+  outputs:
+    - name: s3
+      match: '*'
+      bucket: my-bucket
+      region: us-east-1
+      format: parquet
+      use_put_object: on
+      upload_timeout: 60s
+      s3_key_format: '/logs/dt=%Y-%m-%d/h=%H/$UUID.parquet'
+```
+
+### Migrating from `compression=parquet`
+
+The `compression=parquet` syntax is deprecated. To migrate:
+
+**Before (deprecated):**
+
+```yaml
+compression: parquet
+```
+
+**After (recommended):**
+
+```yaml
+format: parquet
+compression: snappy
+```
+
+The deprecated syntax continues to work but produces Parquet files with uncompressed pages and emits a warning at startup.
+
+### Build requirements
+
+Parquet format requires Apache Arrow Parquet support at compile time:
+
+- CMake flag: `-DFLB_ARROW=On`
+- System packages: `arrow-glib-devel` and `parquet-glib-devel`
+
+The `AWS for Fluent Bit` version 3 container image includes these dependencies by default.
+
 ## Permissions
 
 The plugin requires the following AWS IAM permissions:
@@ -694,7 +773,7 @@ pipeline:
 {% endtab %}
 {% endtabs %}
 
-Setting `Compression` to `arrow` makes Fluent Bit convert payload into Apache Arrow format.
+Setting `compression` to `arrow` converts the payload to Apache Arrow (Feather) format. For Parquet output, use `format parquet` instead.
 
 Load, analyze, and process stored data using popular data processing tools such as Python pandas, Apache Spark and Tensorflow.
 
@@ -766,7 +845,8 @@ pipeline:
       region: us-east-2
       bucket:
       use_put_object: On
-      compression: parquet
+      format: parquet
+      compression: snappy
       # other parameters
 ```
 
@@ -791,7 +871,8 @@ pipeline:
   Region us-east-2
   Bucket
   Use_Put_Object On
-  Compression parquet
+  Format parquet
+  Compression snappy
   # other parameters
 ```