101 changes: 101 additions & 0 deletions docs/user_guides/fs/data_source/creation/glue.md
@@ -0,0 +1,101 @@
# How-To set up an AWS Glue Data Source { #data-source-glue }

## Introduction

The Glue Data Source integrates with the AWS Glue Data Catalog.
It points at a Glue database backed by Amazon S3, where the data always lives.
For this reason, the Glue Data Source provides the same S3 credentials (`access_key`, `secret_key`, `session_token`, `region`) as the [S3 Data Source](s3.md).
This works for any data format: Apache Iceberg, Delta Lake and Apache Hudi, as well as plain file formats such as `csv` and `parquet`.

How the Glue Data Catalog itself is used depends on the format:

- Iceberg: the catalog owns the table's current-metadata pointer, so reads and writes are mediated by the catalog (the table is addressed by `<database>.<table>`).
- Delta and Hudi: the on-path transaction log or timeline stays authoritative; the catalog is a discoverability mirror that is registered on create and synced on write so external engines (Athena, EMR, ...) can find the table by name.
- Plain file formats (`csv`, `parquet`, ...): the Data Source is used only for S3 access; nothing is registered in the catalog.

In this guide, you will configure a Data Source in Hopsworks that stores the authentication information needed to connect to your AWS Glue database.
When you're finished, you'll be able to read tables using Spark through Hopsworks APIs, and to create managed feature groups whose offline data is stored in the Glue-registered location on S3.

!!! note
    Currently, it is only possible to create data sources in the Hopsworks UI.
    You cannot create a data source programmatically.

## Prerequisites

Before you begin this guide you'll need to retrieve the following information from your AWS Glue and S3 setup:

- **Database:** You will need the name of the Glue database that contains, or will contain, your tables.
- **Region:** You will need the AWS region in which the Glue Data Catalog and the backing S3 bucket reside.
The region is identified by its code, for example `eu-west-1`.
- **Authentication Method:** You can authenticate using Access Key/Secret, or use IAM roles.
If you want to use an IAM role, it either needs to be attached to the entire Hopsworks cluster or Hopsworks needs to be able to assume the role.
See [IAM role documentation](../../../../setup_installation/admin/roleChaining.md) for more information.
The credentials must grant access both to the Glue Data Catalog and to the backing S3 bucket.
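
The last requirement, access to both the Glue Data Catalog and the backing S3 bucket, can be illustrated with a minimal IAM policy sketch. The helper `build_policy`, the bucket name `my-bucket`, and the selection of actions are assumptions for illustration and are not taken from the Hopsworks documentation; the permissions you actually need depend on the table format and on whether you only read or also write.

```python
import json


def build_policy(bucket: str) -> dict:
    """Build an illustrative IAM policy covering Glue catalog and S3 access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Catalog side: look up, create, and update tables.
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                ],
                # Assumption: in practice, narrow this to specific Glue ARNs.
                "Resource": "*",
            },
            {
                # Storage side: list the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Storage side: read, write, and delete objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


print(json.dumps(build_policy("my-bucket"), indent=2))
```

The same document could be attached to the IAM user or role that you select in the authentication step.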

## Creation in the UI

### Step 1: Set up new Data Source

Head to the Data Source View on Hopsworks (1) and set up a new data source (2).

<figure markdown>
![Data Source Creation](../../../../assets/images/guides/fs/data_source/data_source_overview.png)
<figcaption>The Data Source View in the User Interface</figcaption>
</figure>

### Step 2: Enter Glue Information

Enter the details for your Glue connector.
Start by giving it a **name** and an optional **description**.
Then set the name of the Glue **database** you want to point the connector to, and the AWS **region** of the Glue Data Catalog and its backing S3 bucket.
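
To keep these values together before filling in the form, the fields can be modelled as a small record. `GlueDataSourceForm` is a hypothetical class and not part of any Hopsworks API; its regular expression only checks that the region looks like an AWS region code, not that the region exists.

```python
import re
from dataclasses import dataclass

# Loose shape of an AWS region code, e.g. "eu-west-1" or "us-gov-west-1".
REGION_SHAPE = re.compile(r"^[a-z]+(-[a-z]+)+-\d+$")


@dataclass(frozen=True)
class GlueDataSourceForm:
    """Hypothetical record of the values entered in Step 2."""

    name: str
    database: str
    region: str
    description: str = ""  # optional in the form

    def __post_init__(self) -> None:
        if not self.name or not self.database:
            raise ValueError("name and database are required")
        if not REGION_SHAPE.match(self.region):
            raise ValueError(f"not a region code: {self.region!r}")


form = GlueDataSourceForm(name="glue", database="myglue", region="eu-west-1")
print(form)
```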

### Step 3: Configure Authentication

#### Instance Role

Choose *Instance Role* if an EC2 instance profile is attached to your Hopsworks cluster nodes and its role grants access to the Glue Data Catalog and the backing S3 bucket.

#### Temporary Credentials

Choose temporary credentials if you are using [AWS Role chaining](../../../../setup_installation/admin/roleChaining.md) to control access permissions on a per-project and per-user-role basis.
Once you have selected *Temporary Credentials*, select the role that gives access to the Glue Data Catalog and the backing S3 bucket.
For this role to appear in the list, it needs to have been configured by an administrator; see the [AWS Role chaining documentation](../../../../setup_installation/admin/roleChaining.md) for more details.

!!! warning "Session Duration"
    By default, the role is assumed for a session of 1 hour (3600 seconds).
    This means that if you use the data source, for example, to write [training data to S3](../usage.md#writing-training-data), the training dataset creation cannot take longer than one hour.

    Your administrator can change the default session duration for AWS data sources by first [increasing the max session duration of the IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html#id_roles_use_view-role-max-session) that you are assuming, and then setting the `fs_data_source_session_duration` [configuration variable](../../../../setup_installation/admin/variables.md) to the appropriate value in seconds.
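
A quick way to reason about this limit is to compare the expected job runtime with the session duration. The helper below is a hypothetical sketch and not a Hopsworks API; the 3600-second default comes from the warning above, while the 5-minute safety margin is an assumption.

```python
DEFAULT_SESSION_SECONDS = 3600  # default session duration: 1 hour


def fits_in_session(
    estimated_job_seconds: int,
    session_seconds: int = DEFAULT_SESSION_SECONDS,
    margin_seconds: int = 300,  # assumption: keep a 5-minute buffer
) -> bool:
    """Return True if the job should finish before the credentials expire."""
    return estimated_job_seconds + margin_seconds <= session_seconds


# A 45-minute job against the default session.
print(fits_in_session(45 * 60))
# A 90-minute job against the default session.
print(fits_in_session(90 * 60))
# The same 90-minute job after the session duration is raised to 2 hours.
print(fits_in_session(90 * 60, session_seconds=2 * 3600))
```

If a job does not fit, ask your administrator to raise the session duration as described in the warning.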

#### Access Key/Secret

The simplest authentication method is Access Key/Secret.
Choose this option to get started quickly if you are able to retrieve the keys using the IAM user administration.

### Step 4: Save changes

Click on "Save Credentials".

## Feature group path

When you create a feature group from this Data Source and the Glue database has a location, the feature group path is generated automatically by appending the new table to that database location, so no path needs to be set.

Otherwise, the path must be set explicitly on the data source, for example:

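=== "PySpark"

    ```python
    # `fs` is a feature store handle, e.g. hopsworks.login().get_feature_store()
    ds = fs.get_data_source("glue")  # "glue" is the name of the Data Source
    # Explicit S3 location for the feature group's table
    ds.path = "s3://mybucket/iceberg-warehouse/myglue.db/fg_1/"
    ```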

An explicitly set path always takes precedence over the generated one.
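
The precedence rules above can be summarised in a small sketch. `resolve_feature_group_path` is a hypothetical helper, not the Hopsworks implementation; in particular, joining the database location and the table name with a `/` is an assumption about how the generated path is formed.

```python
from typing import Optional


def resolve_feature_group_path(
    table_name: str,
    explicit_path: Optional[str] = None,
    database_location: Optional[str] = None,
) -> str:
    """Pick the feature group path following the precedence described above."""
    if explicit_path:
        # An explicitly set path always wins.
        return explicit_path
    if database_location:
        # Otherwise derive the path from the Glue database location.
        return f"{database_location.rstrip('/')}/{table_name}"
    # No explicit path and no database location: the path must be set.
    raise ValueError("Glue database has no location; set the path explicitly")


db_location = "s3://mybucket/iceberg-warehouse/myglue.db"
print(resolve_feature_group_path("fg_1", database_location=db_location))
print(
    resolve_feature_group_path(
        "fg_1",
        explicit_path="s3://otherbucket/custom/fg_1/",
        database_location=db_location,
    )
)
```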

## Direct Spark or PyIceberg access

For direct Spark or PyIceberg access outside the feature group APIs, the Data Source supplies the matching catalog properties.
See [`GlueConnector.catalog_options`][hsfs.storage_connector.GlueConnector.catalog_options] (Spark) and [`GlueConnector.pyiceberg_catalog_options`][hsfs.storage_connector.GlueConnector.pyiceberg_catalog_options] (PyIceberg).
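
To give a sense of what such catalog properties look like, the sketch below builds the standard Apache Iceberg settings for a Glue-backed Spark catalog. It does not call `GlueConnector.catalog_options`, whose exact return value is not shown here; the helper `iceberg_glue_spark_conf`, the catalog name `glue`, and the warehouse path are assumptions for illustration.

```python
def iceberg_glue_spark_conf(catalog: str, warehouse: str) -> dict:
    """Build Spark settings for an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Register a Spark catalog implemented by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Use the Glue Data Catalog as the Iceberg catalog backend.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Read and write table files on S3.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Default location for new tables (hypothetical path).
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_glue_spark_conf("glue", "s3://mybucket/iceberg-warehouse")
for key, value in conf.items():
    print(f"{key}={value}")
```

With a Spark session available, each pair could be passed to `SparkSession.builder.config(key, value)`, after which an Iceberg table would be addressed as `glue.<database>.<table>`.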

## Next Steps

Move on to the [usage guide for data sources](../usage.md) to see how you can use your newly created Glue connector.
7 changes: 4 additions & 3 deletions docs/user_guides/fs/data_source/index.md
@@ -16,7 +16,7 @@ There are four main use cases for Data Sources:
This is also called the Connector API.
- Write [training data](../../../concepts/fs/feature_view/offline_api.md) to an external storage system to make it accessible by third parties.
- Managed [feature group](../../../user_guides/fs/feature_group/create.md) that stores offline data in an external storage system.
Currently [S3](../data_source/creation/s3.md) and [GCS](../data_source/creation/gcs.md) connectors are supported.
Currently [S3](../data_source/creation/s3.md), [GCS](../data_source/creation/gcs.md) and [AWS Glue](../data_source/creation/glue.md) connectors are supported.

Data Sources provide two main mechanisms for authentication: using credentials or an authentication role (IAM Role on AWS or Managed Identity on Azure).
Hopsworks supports both a single IAM role (AWS) or Managed Identity (Azure) for the whole Hopsworks cluster or multiple IAM roles (AWS) or Managed Identities (Azure) that can only be assumed by users with a specific role in a specific project.
@@ -45,8 +45,9 @@ Cloud agnostic storage systems:
For AWS the following storage systems are supported:

1. [S3](creation/s3.md): Read data from a variety of file based storage in S3 such as parquet or CSV.
2. [Redshift](creation/redshift.md): Query Redshift databases and tables using SQL.
3. [SQL](creation/sql.md): Query Amazon SQL (Relational Database Service) using SQL.
2. [AWS Glue](creation/glue.md): Integrate with the AWS Glue Data Catalog over S3, for Iceberg, Delta, Hudi and plain file formats.
3. [Redshift](creation/redshift.md): Query Redshift databases and tables using SQL.
4. [SQL](creation/sql.md): Query Amazon SQL (Relational Database Service) using SQL.

## Azure

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -66,6 +66,7 @@ nav:
- Kafka: user_guides/fs/data_source/creation/kafka.md
- HopsFS: user_guides/fs/data_source/creation/hopsfs.md
- S3: user_guides/fs/data_source/creation/s3.md
- AWS Glue: user_guides/fs/data_source/creation/glue.md
- Redshift: user_guides/fs/data_source/creation/redshift.md
- ADLS: user_guides/fs/data_source/creation/adls.md
- BigQuery: user_guides/fs/data_source/creation/bigquery.md