Add support for parquetOptions in GCSToBigQueryOperator
#60876
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
providers/google/src/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py
Overall LGTM - small comment regarding the newsfragment file. Once you fix it, I'll wait a couple more days to let the Google team review it before merging.
As you tested the changes on a real BigQuery instance and the changes are non-breaking, I feel comfortable merging it.
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.
Description
Add support for parquetOptions in GCSToBigQueryOperator src_fmt_configs.

My team has a lot of workflows that involve loading Parquet files from GCS into BigQuery using Airflow. By default (without parquetOptions.enable_list_inference), Parquet lists get loaded into BigQuery as STRUCT<list ARRAY<STRUCT<element $TYPE>>>. This nested struct is cumbersome to use for querying and analysis.

With the enable_list_inference flag, the same Parquet list is loaded simply as ARRAY<$TYPE>, which is much easier to work with.

This PR adds support for passing enableListInference as one of the options in src_fmt_configs when the source format is PARQUET. This works for both the external table code path and the bq load code path.

Testing

providers/google/tests/system/google/cloud/gcs/example_gcs_to_bigquery.py (haven't managed to get this running yet)

Without enableListInference:
Produces:

With enableListInference:
Produces:

And likewise for the external table case, which I also tested.
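For readers who want a concrete picture of the change described above, here is a minimal sketch in plain Python. It does not import Airflow and is not the operator's actual implementation: build_load_config and loaded_list_type are hypothetical helpers, the bucket path is made up, and nesting the flag under a "parquetOptions" key in src_fmt_configs is an assumption based on the PR title. The field names parquetOptions and enableListInference follow the BigQuery load job configuration.

```python
from typing import Any


def build_load_config(
    source_uris: list[str],
    source_format: str,
    src_fmt_configs: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical helper: sketch a BigQuery load job configuration.

    Assumption: parquetOptions is forwarded only when the source format
    is PARQUET, as the PR description states.
    """
    load: dict[str, Any] = {
        "sourceUris": source_uris,
        "sourceFormat": source_format,
    }
    parquet_options = src_fmt_configs.get("parquetOptions")
    if source_format == "PARQUET" and parquet_options:
        # Copy so the caller's dict is not shared with the job config.
        load["parquetOptions"] = dict(parquet_options)
    return {"load": load}


def loaded_list_type(element_type: str, list_inference: bool) -> str:
    """Return the BigQuery type a Parquet list column ends up with,
    per the two shapes quoted in the PR description."""
    if list_inference:
        return f"ARRAY<{element_type}>"
    return f"STRUCT<list ARRAY<STRUCT<element {element_type}>>>"


# Illustrative values only; the bucket and path are placeholders.
config = build_load_config(
    source_uris=["gs://example-bucket/data/*.parquet"],
    source_format="PARQUET",
    src_fmt_configs={"parquetOptions": {"enableListInference": True}},
)
```

In a DAG, the same dictionary would presumably be passed as the src_fmt_configs argument of GCSToBigQueryOperator together with source_format="PARQUET"; check the merged operator code for the exact accepted shape.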
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Sonnet 4.5 following the guidelines
GenAI tooling was used only for code review and discussion, no lines of code in the PR were written or directly copied from Claude.
{pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.