Docs: Cloud data catalog integration quickstart #6449
base: main
Changes from all commits
f7a2f60
fd1cf5a
1101604
c9385c3
eb13bde
d63bc7b
7eb794c
6d8dc9c
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,217 @@ | ||||||
| --- | ||||||
| sidebar_label: 'Connect a data catalog' | ||||||
| description: 'Connect an external data catalog to ClickHouse Cloud via the Data sources UI.' | ||||||
| slug: /integrations/data-catalogs | ||||||
| title: 'Connect a data catalog in ClickHouse Cloud' | ||||||
| doc_type: 'guide' | ||||||
| keywords: ['data catalogs', 'data lake', 'iceberg', 'unity catalog', 'glue', 'delta lake', 'onelake', 'polaris', 'biglake', 'clickhouse cloud'] | ||||||
| integration: | ||||||
| - support_level: 'core' | ||||||
| - category: 'data_lake' | ||||||
| --- | ||||||
|
|
||||||
| import Image from '@theme/IdealImage'; | ||||||
| import BetaBadge from '@theme/badges/BetaBadge'; | ||||||
| import catalog_flyout_select from '@site/static/images/integrations/data-catalogs/catalog-flyout-select.png'; | ||||||
| import linked_catalogs_table from '@site/static/images/integrations/data-catalogs/linked-catalogs-table.png'; | ||||||
| import catalog_tables_browser from '@site/static/images/integrations/data-catalogs/catalog-tables-browser.png'; | ||||||
| import catalog_sql_query from '@site/static/images/integrations/data-catalogs/catalog-sql-query.png'; | ||||||
| import data_catalogs_ui from '@site/static/images/cloud/features/data-catalogs-ui.png'; | ||||||
|
|
||||||
| <BetaBadge/> | ||||||
|
|
||||||
| # Connect a data catalog in ClickHouse Cloud | ||||||
|
|
||||||
| :::info | ||||||
| Data catalog integrations in ClickHouse Cloud are in public beta. | ||||||
|
Collaborator
I don't think this is needed because of the BetaBadge up top. |
||||||
| ::: | ||||||
|
|
||||||
| Connect ClickHouse Cloud to your data catalogs to access your open table format tables. You can set up connections in the **Data sources** UI. For setup via SQL, use the `[DataLakeCatalog](/engines/database-engines/datalakecatalog)` database engine in your SQL editor of choice. | ||||||
|
Collaborator
Suggested change
Fixing the backticks here so that the link renders properly. |
||||||
|
|
||||||
| <Image img={data_catalogs_ui} size="md" alt="ClickHouse Cloud UI with data catalog integrations"/> | ||||||
|
|
||||||
| Once connected, catalog tables show up in the SQL console under the database name you choose. You can query them with standard ClickHouse SQL, join them with [MergeTree](/engines/table-engines/mergetree-family/mergetree) tables, and use them as sources for [materialized views](/materialized-views). | ||||||
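As a rough sketch of the join described above: it assumes the catalog was connected under a hypothetical database name `lake` and that a hypothetical MergeTree table `default.orders` exists. The catalog table name is borrowed from the query example later in this guide; all column names are assumptions.

```sql
-- Hypothetical names: `lake` is the database name chosen in the flyout,
-- `default.orders` is an existing MergeTree table, and the columns are assumed.
SELECT
    o.order_id,
    p.user_id
FROM default.orders AS o
INNER JOIN lake.`identity_profiles.identity_profiles_iceberg` AS p
    ON o.user_id = p.user_id
LIMIT 10;
```

The catalog table keeps its dotted `namespace.table` name, so it is wrapped in backticks and qualified with the ClickHouse database name.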
|
|
||||||
| ## Prerequisites {#prerequisites} | ||||||
|
|
||||||
| Before you connect a catalog, confirm the following: | ||||||
|
|
||||||
| - **Service permissions.** You need the `control-plane:service:manage` permission to access the **Data sources** page and add catalogs. | ||||||
| - **Running service.** If the service is idle, wake it from the **Data sources** page or the service overview before connecting or viewing linked catalogs. | ||||||
| - **Catalog credentials.** Gather connection details for your catalog type before opening the flyout. Each catalog uses different fields and authentication — see [Add your catalog connection](#add-your-catalog-connection) below. | ||||||
|
|
||||||
| ## Connect your catalog {#connect-your-catalog} | ||||||
|
|
||||||
| Make sure you're logged in to your [ClickHouse Cloud](https://cloud.clickhouse.com/) account. | ||||||
|
|
||||||
| 1. In the console, open the ClickHouse Cloud service you want to connect. | ||||||
| 2. Select **Data sources** in the left navigation. | ||||||
| 3. If you haven't set up any data sources, click **+ Add catalog**. Otherwise, click **Add data source** > **Add data lake catalog**. | ||||||
| 4. In the **Connect your data catalog** flyout, select your catalog from the **Select catalog** dropdown. If the catalog supports multiple open table formats, choose the format in **Open table format**. | ||||||
|
|
||||||
| <Image img={catalog_flyout_select} alt="Connect your data catalog flyout with Select catalog dropdown showing AWS Glue, BigLake Metastore, Microsoft OneLake, Polaris, REST Catalog, and Unity Catalog" size="lg" border/> | ||||||
|
|
||||||
| ### Add your catalog connection {#add-your-catalog-connection} | ||||||
|
|
||||||
| 1. Fill in the connection parameters and **Database name** for your catalog type. The **Database name** is the ClickHouse database that exposes your catalog tables in the SQL console. | ||||||
| Select your catalog below for field-level guidance and prerequisites. | ||||||
|
|
||||||
| [AWS Glue Catalog](/use-cases/data-lake/glue-catalog) exposes [Iceberg](/engines/table-engines/integrations/iceberg) tables registered in the Glue Data Catalog. | ||||||
|
Collaborator
I'd suggest that we give each of these catalog sections a separate header, or create a tabbed section with one tab per catalog. |
||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 25.12 or later. | ||||||
| - Iceberg tables are registered in AWS Glue Data Catalog in your target region. | ||||||
| - For access key authentication, you have an IAM user access key with permissions to read Glue metadata and the underlying S3 objects. | ||||||
| - For IAM role authentication (26.2+), you have an IAM role that trusts your ClickHouse service role. Include the service role ARN from **Settings → Network security information** in the role trust policy. See [Accessing Iceberg data securely](/cloud/data-sources/secure-iceberg) for IAM policy examples. | ||||||
|
|
||||||
| In the flyout, enter your AWS **Region** (e.g. `us-west-2`), then choose an authentication method: | ||||||
|
|
||||||
| **Access key authentication** | ||||||
|
|
||||||
| 1. Select **AWS Access Key** as the **Authentication method**. | ||||||
| 2. Enter your **Access Key ID** and **Secret Access Key**. | ||||||
| 3. Enter a **Database name** for the ClickHouse database that exposes your Glue tables. | ||||||
|
|
||||||
| **IAM role authentication (26.2+)** | ||||||
|
|
||||||
| 1. Select **AWS IAM Role** as the **Authentication method**. | ||||||
| 2. Copy the **Service role ID (IAM)** from the flyout panel and add it to your IAM role trust policy. | ||||||
| 3. Enter your **AWS Role ARN** and an optional **AWS Role Session Name**. | ||||||
| 4. Enter a **Database name** for the ClickHouse database that exposes your Glue tables. | ||||||
|
|
||||||
| :::note | ||||||
| Glue supports multiple table formats, but ClickHouse only reads **Iceberg** tables from Glue. | ||||||
| ::: | ||||||
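For comparison, the SQL route with the `DataLakeCatalog` database engine might look roughly like the sketch below for access key authentication. The database name `glue` and the placeholder values are hypothetical, and the setting names are an assumption based on the linked Glue catalog guide; check them against the `DataLakeCatalog` engine reference for your ClickHouse version before relying on them.

```sql
-- Sketch only: setting names are assumed from the Glue catalog guide,
-- and all values are placeholders to replace with your own.
CREATE DATABASE glue
ENGINE = DataLakeCatalog
SETTINGS
    catalog_type = 'glue',
    region = 'us-west-2',
    aws_access_key_id = '<access-key-id>',
    aws_secret_access_key = '<secret-access-key>';
```

The database name here plays the same role as the **Database name** field in the flyout.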
|
|
||||||
| Query Unity Catalog managed [Iceberg](/engines/table-engines/integrations/iceberg) tables using OAuth client credentials from a Databricks service principal. See the [Unity Catalog guide](/use-cases/data-lake/unity-catalog#read-iceberg) for full setup. | ||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 25.12 or later. | ||||||
| - [Unity Catalog is configured for external data access](https://docs.databricks.com/aws/en/external-access/admin). | ||||||
| - You have a Databricks [service principal](https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m) with an OAuth client ID and secret. The service principal has `USE CATALOG`, `USE SCHEMA`, `EXTERNAL USE SCHEMA`, and `SELECT` privileges on the tables you want to query. | ||||||
|
|
||||||
| In the flyout: | ||||||
|
|
||||||
| 1. Enter your Databricks **Workspace URL** (e.g. `dbc-1234567a-cbde`). | ||||||
| 2. Enter the **Databricks catalog name** to connect (e.g. `icebench`). | ||||||
| 3. Enter the OAuth **Client ID** and **Client secret** for your service principal. | ||||||
| 4. Enter a **Database name** for the ClickHouse database that exposes your Unity Catalog tables. | ||||||
|
|
||||||
| Query Unity Catalog [Delta Lake](/engines/table-engines/integrations/deltalake) tables using a Databricks Personal Access Token (PAT). See the [Unity Catalog guide](/use-cases/data-lake/unity-catalog#read-delta) for full setup. | ||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 25.12 or later. | ||||||
| - [Unity Catalog is configured for external data access](https://docs.databricks.com/aws/en/external-access/admin). | ||||||
| - You have a Databricks [Personal Access Token](https://docs.databricks.com/aws/en/dev-tools/auth/pat) with at least `EXTERNAL USE SCHEMA`, `USE CATALOG`, `USE SCHEMA`, and `SELECT` on the target tables. | ||||||
|
|
||||||
| In the flyout: | ||||||
|
|
||||||
| 1. Enter your Databricks **Workspace URL** (e.g. `dbc-1234567a-cbde.azuredatabricks.net`). | ||||||
| 2. Enter the **Databricks catalog name** to connect. | ||||||
| 3. Enter your **Personal Access Token**. | ||||||
| 4. Enter a **Database name** for the ClickHouse database that exposes your Delta tables. | ||||||
|
|
||||||
| :::note | ||||||
| Iceberg and Delta use different authentication methods in the UI, so accessing both table types requires two separate ClickHouse databases. | ||||||
| ::: | ||||||
|
|
||||||
| Connect to any catalog that implements the [Iceberg REST Catalog](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml) specification. See the [REST catalog guide](/use-cases/data-lake/rest-catalog) for full setup. | ||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 25.12 or later. | ||||||
| - Your REST catalog endpoint is reachable from ClickHouse Cloud. | ||||||
| - You have OAuth client credentials or a bearer token, depending on your catalog configuration. | ||||||
| - You have an S3 or compatible **Storage Endpoint** URI for table data (e.g. `s3://my-bucket/path`). | ||||||
|
|
||||||
| In the flyout: | ||||||
|
|
||||||
| 1. Enter the **Catalog URL** (e.g. `https://catalog.example.com/v1`). | ||||||
| 2. Enter the **Warehouse** or catalog namespace (e.g. `demo`). | ||||||
| 3. Enter the **Storage Endpoint** URI prefix for table storage. | ||||||
| 4. Select an **Authentication method**: **OAuth Client Credentials** or **Bearer Token**, then enter the matching credentials. | ||||||
| 5. Enter a **Database name** for the ClickHouse database that exposes your REST catalog tables. | ||||||
|
|
||||||
| Query [Iceberg](/engines/table-engines/integrations/iceberg) tables in Microsoft Fabric OneLake using Azure AD application credentials. See the [Fabric OneLake guide](/use-cases/data-lake/onelake-catalog) for full setup. | ||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 25.12 or later. | ||||||
| - Iceberg tables exist in a Fabric workspace. | ||||||
| - You have an Entra ID (Azure AD) application with client ID and secret. | ||||||
| - You have your tenant ID, workspace ID, and a data item ID. Use your **Lakehouse ID** from the Lakehouse page URL. See [Microsoft OneLake prerequisites](https://learn.microsoft.com/en-us/fabric/onelake/table-apis/table-apis-overview#prerequisites) for help locating these values. | ||||||
|
|
||||||
| In the flyout: | ||||||
|
|
||||||
| 1. Enter your Fabric **Workspace ID**. | ||||||
| 2. Enter the **Data Item ID** — use your Lakehouse GUID. Warehouse IDs are not supported. | ||||||
|
|
||||||
| 3. Enter your Entra ID **Tenant ID**, **Application (client) ID**, and **Client secret**. | ||||||
| 4. Enter a **Database name** for the ClickHouse database that exposes your OneLake tables. | ||||||
|
|
||||||
| Connect to a Snowflake Open Catalog (Polaris) deployment for [Iceberg](/engines/table-engines/integrations/iceberg) tables. See the [Polaris catalog guide](/use-cases/data-lake/polaris-catalog) for full setup. | ||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 26.2 or later. | ||||||
| - You have a Polaris catalog with OAuth client credentials. | ||||||
| - You have a storage endpoint URI for Iceberg table data (e.g. `s3://company-iceberg-prod/warehouse/`). | ||||||
|
|
||||||
| In the flyout: | ||||||
|
|
||||||
| 1. Enter the **Catalog Account Identifier** (e.g. `ab12345.snowflakecomputing.com`). | ||||||
| 2. Enter the **Catalog Name** (e.g. `snowflake_open_catalog`). | ||||||
| 3. Enter the OAuth **Client ID** and **Client Secret**. | ||||||
| 4. Enter the **Storage Endpoint** URI prefix for table storage. | ||||||
| 5. Enter a **Database name** for the ClickHouse database that exposes your Polaris tables. | ||||||
|
|
||||||
| Connect to Google Cloud BigLake Metastore (also known as the Lakehouse runtime catalog) for [Iceberg](/engines/table-engines/integrations/iceberg) tables in GCS. See the [BigLake Metastore guide](/use-cases/data-lake/biglake-catalog) for full setup. | ||||||
|
|
||||||
| Before you connect, confirm: | ||||||
|
|
||||||
| - ClickHouse version 26.2 or later. | ||||||
| - You have a BigLake Metastore instance with Iceberg tables in GCS. | ||||||
| - You have Google Application Default Credentials (ADC) with client ID, client secret, refresh token, and quota project ID. | ||||||
|
|
||||||
| In the flyout: | ||||||
|
|
||||||
| 1. Enter your **Google ADC Client ID**, **Client Secret**, **Refresh Token**, and **Quota Project ID**. | ||||||
| 2. Enter the **Cloud Storage Bucket** URI for table data (e.g. `gs://biglake-public-nyc-taxi-iceberg`). | ||||||
| 3. Enter a **Database name** for the ClickHouse database that exposes your BigLake tables. | ||||||
|
|
||||||
| 1. Click **Add catalog**. ClickHouse validates the connection and credentials when saving. | ||||||
| 2. On success, a confirmation toast appears with a **View in SQL console** link. Your catalog is listed in the **Linked catalogs** table with its connection status and table count. | ||||||
|
|
||||||
| From the **Actions** menu on a linked catalog row, you can drop the catalog connection. Dropping removes the ClickHouse database binding — it does not delete data in your external catalog. | ||||||
|
|
||||||
| ## Query your data {#query-data} | ||||||
|
|
||||||
| 1. On the **Data sources** page, find your catalog in the **Linked catalogs** table and click **View tables**. | ||||||
|
|
||||||
| <Image img={linked_catalogs_table} alt="Linked catalogs table with View tables action" size="lg" border/> | ||||||
|
|
||||||
| 1. ClickHouse opens the SQL console with your catalog database selected and lists the available tables. | ||||||
|
|
||||||
| <Image img={catalog_tables_browser} alt="SQL console table browser showing catalog tables" size="lg" border/> | ||||||
|
|
||||||
| 1. Write a query in the SQL editor and click **Run**. | ||||||
|
|
||||||
| Wrap the full table name in backticks: | ||||||
|
|
||||||
| ```sql | ||||||
| SELECT * FROM `identity_profiles.identity_profiles_iceberg` | ||||||
| ``` | ||||||
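A slightly fuller sketch of the same step, assuming the catalog was connected under a hypothetical database name `lake`: list the tables ClickHouse sees, then query one with the database name spelled out and a `LIMIT` so that a first exploratory query stays small.

```sql
-- `lake` is a hypothetical database name; use the one entered in the flyout.
SHOW TABLES FROM lake;

-- The dotted namespace.table name stays inside a single pair of backticks.
SELECT *
FROM lake.`identity_profiles.identity_profiles_iceberg`
LIMIT 10;
```

Qualifying the table with the database name makes the query work regardless of which database is currently selected in the SQL console.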
|
|
||||||
| <Image img={catalog_sql_query} alt="SQL query with results using backticks for dotted table name" size="lg" border/> | ||||||
|
|
||||||
| See also: | ||||||
|
|
||||||
| - [Accelerating analytics on lakehouse data](/use-cases/data-lake/getting-started/accelerating-analytics) — load catalog tables into MergeTree for repeated queries | ||||||
| - [Accessing Iceberg data securely](/cloud/data-sources/secure-iceberg) — IAM role setup for AWS Iceberg and Glue access | ||||||
|
|
||||||
| ## Troubleshooting {#troubleshooting} | ||||||
|
|
||||||
| - If you don't see your tables in the SQL console, verify your credentials, network access, and the table types in the catalog. Make sure the tables you expect to see use supported file and table formats. | ||||||
| - If you can't resolve the issue, open a support ticket. | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1024,6 +1024,7 @@ const sidebars = { | |
| 'integrations/data-sources/cassandra', | ||
| 'integrations/data-ingestion/gcs/index', | ||
| 'integrations/data-ingestion/s3-minio', | ||
| 'integrations/data-catalogs/index', | ||
|
Collaborator
Added this to the sidebar so that the preview build would pass, but I'm not convinced by this placement. Also, the sidebar label is currently "Connect a data catalog"; perhaps it could just be "Data catalog". |
||
| 'integrations/data-ingestion/emqx/index', | ||
| 'integrations/data-ingestion/insert-local-files', | ||
| 'integrations/data-ingestion/dbms/jdbc-with-clickhouse', | ||
|
|
||
What is the intention of the query string params in these links (`?catalog=aws-glue`)? As of now, every link lands at the top of the page.