Add dlt integration for EdX.org S3 data ingestion#1874
Conversation
Pull request overview
This PR integrates dlt (data load tool) into the lakehouse code location to enable Python-based data ingestion. The initial implementation provides an EdX.org S3 data source that loads TSV files containing database table exports from S3.
Changes:
- Added dlt integration dependencies (`dagster-dlt`, `dlt`, `duckdb`, `pyiceberg`)
- Implemented EdX.org S3 ingestion pipeline for loading 40+ database tables from S3
- Added `SKIP_AIRBYTE` environment variable for faster local development iteration
- Created documentation and helper scripts for dlt pipeline development and testing
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| dg_projects/lakehouse/pyproject.toml | Added dlt-related dependencies to project configuration |
| dg_projects/lakehouse/uv.lock | Lock file updates for new dependencies |
| dg_projects/lakehouse/lakehouse/definitions.py | Added SKIP_AIRBYTE logic to conditionally load Airbyte assets |
| dg_projects/lakehouse/lakehouse/defs/edxorg_s3_ingestion/loads.py | Core dlt source implementation for EdX.org S3 data ingestion |
| dg_projects/lakehouse/lakehouse/defs/edxorg_s3_ingestion/defs.yaml | Dagster component configuration for dlt integration |
| dg_projects/lakehouse/lakehouse/defs/edxorg_s3_ingestion/README.md | Documentation for EdX.org S3 pipeline |
| dg_projects/lakehouse/.dlt/config.toml | dlt non-sensitive configuration for local and production destinations |
| dg_projects/lakehouse/.dlt/secrets.toml.template | Template for dlt credentials configuration |
| dg_projects/lakehouse/scripts/dlt_env.py | Helper script for managing dlt environment switching |
| dg_projects/lakehouse/scripts/test_edxorg_import.py | Import validation script for EdX.org pipeline |
| dg_projects/lakehouse/docs/README.md | Documentation hub for lakehouse guides |
| dg_projects/lakehouse/docs/local_development.md | Local development guide with SKIP_AIRBYTE usage |
| .gitignore | Added dlt-specific file patterns to gitignore |
Force-pushed from 106d960 to 554c88f
Pull request overview
Copilot reviewed 25 out of 33 changed files in this pull request and generated 16 comments.
Force-pushed from 496ce3d to d45ba06
Supersedes #1598
Pull request overview
Copilot reviewed 25 out of 33 changed files in this pull request and generated 13 comments.
rachellougee
left a comment
Initial testing of one table, `student_courseenrollment`, looks good. Just a comment about a `table_format` issue I ran into in local testing. We need further testing around incremental loading for large tables.
- Add dagster-dlt, dlt[filesystem,parquet], duckdb, pyiceberg dependencies
- Configure dlt with dual destinations (local dev & production S3)
- Add SKIP_AIRBYTE environment variable for faster local development
- Implement edxorg_s3 dlt source for loading TSV files from S3
- Loads 40+ database tables (auth_user, enrollments, assessments, etc.)
- Memory-efficient chunked processing
- Selective table loading capability
- Metadata tracking per file
- Add comprehensive documentation
- Local development guide with SKIP_AIRBYTE usage
- EdX.org S3 pipeline README with examples
- Helper scripts for environment management
- Integrate with Dagster via DltLoadCollectionComponent
The dlt source loads TSV files generated by the edxorg_archive asset from:
s3://ol-data-lake-landing-zone-production/edxorg-raw-data/edxorg/raw_data/
Structure: db_table/{table_name}/{source_system}/{course_id}/{hash}.tsv
This enables analytics on edX.org course data including user enrollments,
assessments, certificates, and course progress across 40+ database tables.
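Each S3 key in the layout above encodes its own metadata. A minimal stdlib-only sketch of pulling those components out of a key (the function and field names are illustrative, not from the PR):

```python
from typing import NamedTuple

class TsvKeyParts(NamedTuple):
    table_name: str
    source_system: str
    course_id: str
    file_hash: str

def parse_tsv_key(key: str) -> TsvKeyParts:
    """Split a db_table/{table_name}/{source_system}/{course_id}/{hash}.tsv key."""
    parts = key.split("/")
    if len(parts) != 5 or parts[0] != "db_table" or not parts[4].endswith(".tsv"):
        raise ValueError(f"not a db_table TSV key: {key!r}")
    _, table, source_system, course_id, filename = parts
    return TsvKeyParts(table, source_system, course_id, filename.removesuffix(".tsv"))
```

This assumes course ids occupy exactly one path segment, as the documented structure implies.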
for more information, see https://pre-commit.ci
- Add logging instead of print statements in loads.py
- Add noqa comments for CLI utility scripts
- Add pragma: allowlist secret for template credential examples
- Fix string escaping in dlt_env.py
- Add type hints for mypy compliance
- Fix magic number and SQL injection warnings
- Created new component-aware Dagster code location for dlt-based data loading
- Implements EdX.org S3 TSV ingestion for 40+ database tables
- Uses AWS IAM-based authentication (IRSA for K8s, ~/.aws/ for local)
- Configured with dual destinations (local filesystem and production S3)
- Autoloads components from defs/ directory using load_from_defs_folder
- All 40 database table assets verified loading in dg list defs
- Added dlt incremental loading based on file modification_date
- Only processes files modified since last successful run
- Significantly improves performance for subsequent runs
- Added _dlt_file_modified metadata column for tracking
- Updated README with incremental loading documentation
- State automatically managed by dlt in pipeline storage
- Added dlt[pyiceberg] dependency (version >= 1.21.0)
- Configured Iceberg table format for QA and Production environments
- QA writes to s3://ol-data-lake-raw-qa/edxorg with ol_warehouse_qa_raw Glue catalog
- Production writes to s3://ol-data-lake-raw-production/edxorg with ol_warehouse_production_raw Glue catalog
- Local dev uses Parquet format to file:///.dlt/data (no Iceberg overhead)
- Pipeline automatically selects destination based on DAGSTER_ENVIRONMENT
- Iceberg catalog config uses Glue catalog type for AWS integration
- Maintains IAM-based authentication for all S3 operations
- Documented DAGSTER_ENVIRONMENT-based destination selection
- Explained local (Parquet), QA (Iceberg+Glue), and production (Iceberg+Glue) configs
- Added configuration snippets showing Iceberg catalog setup
- Clarified Glue database names: ol_warehouse_qa_raw and ol_warehouse_production_raw
- Set DLT_PROJECT_DIR environment variable to point to data_loading project root
- Calculates path relative to loads.py location (4 parents up)
- Ensures .dlt/config.toml is found when running 'dg dev' from repo root
- Resolves warning about config/secrets files not being found
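That relative-path calculation can be sketched with `pathlib` (the helper name is illustrative; in the real module the starting point would be `Path(__file__)`):

```python
import os
from pathlib import Path

def configure_dlt_project_dir(loads_py: str) -> str:
    """Point DLT_PROJECT_DIR at the project root, four parents up from loads.py,
    so .dlt/config.toml resolves regardless of the current working directory."""
    root = Path(loads_py).resolve().parents[3]  # parents[3] == 4 levels above the file
    os.environ["DLT_PROJECT_DIR"] = str(root)
    return str(root)
```

For a file at `dg_projects/data_loading/data_loading/defs/edxorg_s3_ingest/loads.py`, four parents up lands on `dg_projects/data_loading`, which is where the `.dlt/` directory lives.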
- Filter to prod-only sources (exclude edge) via glob pattern
- Create consolidated non-partitioned downstream assets
- Add upstream dependencies on edxorg db_table assets
- Each dlt asset consolidates all course data into single table
- Update documentation with dependency architecture
- Replace component YAML with Python-based asset wrapper
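The PR does this with a glob pattern on the filesystem source; the equivalent predicate can be sketched with a plain path-segment check (helper name illustrative — fnmatch-style `*` is not path-aware, so a segment check is the unambiguous form):

```python
def is_prod_table_tsv(key: str) -> bool:
    """Keep db_table/{table}/prod/{course}/{hash}.tsv keys, excluding edge exports."""
    parts = key.split("/")
    return (
        len(parts) == 5
        and parts[0] == "db_table"
        and parts[2] == "prod"
        and parts[4].endswith(".tsv")
    )
```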
- Change asset keys from edxorg_s3_local/edxorg_{table} to edxorg/tables/{table}
- Add table name prefix: raw__edxorg__s3__tables__{table_name}
- Update documentation to reflect new naming structure
- Implement custom EdxorgDltTranslator using get_asset_spec()
- Map dlt resources to ol_warehouse_raw_data asset keys
- Create one-to-one AssetDep dependencies (non-blocking)
- Move edxorg_db_table_specs to edxorg code location
- Update edxorg definitions to include db_table external specs
- Use modern DagsterDltTranslator API (no deprecation warnings)
- Table names: raw__edxorg__s3__tables__{table_name}
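The mapping the translator performs can be sketched as plain functions (in the real code this logic lives inside the `EdxorgDltTranslator` override of `DagsterDltTranslator.get_asset_spec()`; the function names below are illustrative):

```python
def asset_key_for_table(table_name: str) -> list[str]:
    """dlt resource -> Dagster asset key under the ol_warehouse_raw_data prefix."""
    return ["ol_warehouse_raw_data", f"raw__edxorg__s3__tables__{table_name}"]

def upstream_dep_for_table(table_name: str) -> list[str]:
    """One-to-one dependency on the external edxorg db_table asset spec."""
    return ["edxorg", "raw_data", "db_table", table_name]
```

Each dlt resource thus gets exactly one warehouse asset key and one non-blocking `AssetDep` back to the matching `edxorg` db_table spec.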
- Use named destination pattern: [destination.&lt;env&gt;] with destination_type
- Update config.toml to use destination.local, destination.qa, destination.production
- Each named destination has destination_type='filesystem'
- Pipeline now correctly resolves bucket_url from environment-specific config
- Fixes ConfigFieldMissingException when materializing dlt assets
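A sketch of what that named-destination layout in `.dlt/config.toml` might look like (the bucket URLs are taken from this PR; the exact key names are an assumption about dlt's named-destination syntax):

```toml
[destination.local]
destination_type = "filesystem"
bucket_url = "file:///.dlt/data"

[destination.qa]
destination_type = "filesystem"
bucket_url = "s3://ol-data-lake-raw-qa/edxorg"

[destination.production]
destination_type = "filesystem"
bucket_url = "s3://ol-data-lake-raw-production/edxorg"
```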
- Remove dlt dependencies from pyproject.toml (dagster-dlt, dlt, duckdb)
- Remove .dlt/ configuration directory
- Remove edxorg_s3_ingestion defs and scripts
- Update documentation to remove dlt references
- Sync uv.lock with updated dependencies

dlt integration has been moved to the data_loading code location
- Fix Python version constraint to match lakehouse (>=3.13,&lt;3.15)
- Remove defs.yaml.bak backup file
- Fix upstream asset path in README (../../ -> ../../../../)
- Remove Qualtrics reference from related pipelines
- Add missing 'course' table to all_tables list
- Fix module path in usage example (data_loading not lakehouse)
- Fix environment variable name in usage (DAGSTER_ENVIRONMENT)
Force-pushed from 5d75d3d to ae1c393
Pull request overview
Copilot reviewed 20 out of 29 changed files in this pull request and generated 12 comments.
Force-pushed from ae1c393 to 8fc3603
Force-pushed from d4457ae to 4f26792
…cessing Many of the CSV files that are sent in the edxorg raw data are lacking any identifying information about the courses other than the contextualized file hierarchy in the export. This adds extra columns including that extracted content at the point that we have it. This allows for us to use that course detail downstream when we are aggregating and filtering this data into downstream assets. This also addresses an edge case bug that was causing some course XML files to be erroneously processed as CSV data due to how the file path was parsed by the regular expression.
Force-pushed from 4f26792 to e9d934a
Pull request overview
Copilot reviewed 21 out of 31 changed files in this pull request and generated 5 comments.
…ADME.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```diff
  archive_file.unlink()
  continue
- if df.is_empty():
+ if df.collect().is_empty():
```
Bug: The code calls df.collect() on a Polars LazyFrame after it has already been consumed by df.sink_csv(). A LazyFrame cannot be consumed twice.
Severity: HIGH
Suggested Fix
The emptiness check should be performed before the data is written to the file. Move the if df.collect().is_empty(): block to be before the df.sink_csv() call. This ensures the check is performed on the valid, unconsumed LazyFrame.
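An illustrative patch sketch of that reordering (variable and path names are placeholders): materializing the frame once with `collect()` lets the result be both checked for emptiness and written, so the LazyFrame is never executed twice.

```diff
- df.sink_csv(output_path)
- if df.collect().is_empty():
-     archive_file.unlink()
-     continue
+ frame = df.collect()
+ if frame.is_empty():
+     archive_file.unlink()
+     continue
+ frame.write_csv(output_path)
```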
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: dg_projects/edxorg/edxorg/assets/edxorg_archive.py#L304
Potential issue: In `edxorg_archive.py`, a Polars LazyFrame `df` is created and then
written to a file using `df.sink_csv()`. Immediately after, the code attempts to call
`df.collect().is_empty()` on the same LazyFrame reference. This is invalid because
`sink_csv()` consumes the LazyFrame, and it cannot be executed or collected a second
time. This will cause a runtime error during the archive processing operation,
preventing the data from being properly ingested.
rachellougee
left a comment
Tested loading a small table, and it works
Overview
This PR adds dlt (data load tool) integration to a new data_loading code location, enabling Python-based data ingestion as a complement to Airbyte. The initial implementation loads EdX.org database table exports from S3 with proper asset dependency tracking.
What's Added
New Code Location: data_loading
Dependencies
EdX.org S3 Data Source
Implements a production-ready dlt source that:
- Loads TSV files generated by the `edxorg_archive` asset from S3

Source:
`s3://ol-data-lake-landing-zone-production/edxorg-raw-data/edxorg/raw_data/db_table/{table}/prod/{course}/*.tsv`

Destinations:
- Local: `file:///.dlt/data` (Parquet for fast iteration)
- QA: `s3://ol-data-lake-raw-qa/edxorg` (Iceberg + Glue: `ol_warehouse_qa_raw`)
- Production: `s3://ol-data-lake-raw-production/edxorg` (Iceberg + Glue: `ol_warehouse_production_raw`)

Asset Architecture
Upstream Assets (edxorg code location)
- Defined in `dg_projects/edxorg/edxorg/assets/edxorg_db_table_specs.py`
- Asset keys: `AssetKey(["edxorg", "raw_data", "db_table", {table}])`
- Keyed by `course_id` and `source_system` (prod/edge)

Downstream Assets (data_loading code location)
- Defined in `dg_projects/data_loading/data_loading/defs/edxorg_s3_ingest/`
- Asset keys: `AssetKey(["ol_warehouse_raw_data", "raw__edxorg__s3__tables__{table}"])`
- Table names: `raw__edxorg__s3__tables__{table}` (matches Airbyte naming pattern)
- `AssetDep` on corresponding upstream table

Dependency Mapping
Custom `EdxorgDltTranslator` class:
- Implements `get_asset_spec()` using modern Dagster API
- Maps dlt resources to asset keys under the `ol_warehouse_raw_data` prefix
- Creates `AssetDep` for lineage tracking

Tables Loaded (40+)
User & Authentication:
Enrollment & Progress:
Assessments & Submissions:
Certificates & Grading:
And 20+ more (teams, wiki, workflow, user profile data)
Key Features
1. Incremental Loading
- Uses `incremental("modification_date")` on the filesystem source
- State stored in the `.dlt/edxorg_s3/` directory

2. Environment-Based Configuration
- QA: `s3://ol-data-lake-raw-qa/edxorg` with Glue catalog
- Production: `s3://ol-data-lake-raw-production/edxorg` with Glue catalog
- Selected via the `DAGSTER_ENVIRONMENT` variable

3. Prod-Only Filtering
- Glob pattern: `db_table/{table}/prod/**/*.tsv`

4. IAM-Based Authentication
- AWS IAM roles (IRSA in Kubernetes, `~/.aws/credentials` locally)

5. Iceberg + Glue Catalog (QA/Production)
Project Structure
Usage Examples
Materialize via Dagster UI
- Asset group: `ol_warehouse_raw_data`
- Select a table asset (e.g. `raw__edxorg__s3__tables__auth_user`)
- Upstream dependency: `edxorg/raw_data/db_table/auth_user`

Local Testing
Environment Variables
- `DAGSTER_ENVIRONMENT`: Controls destination (dev/ci/qa/production)
- `DLT_PROJECT_DIR`: Auto-set for config path resolution

Technical Implementation
Custom DagsterDltTranslator
Named Destination Configuration
Testing
- All 40 table assets verified loading (under `ol_warehouse_raw_data`)
Migration Path
This establishes the foundation for migrating other data sources to dlt:
- REST API sources via `rest_api_resources`
New Code Location:
- `dg_projects/data_loading/pyproject.toml` - Dependencies and metadata
- `dg_projects/data_loading/.dlt/config.toml` - dlt configuration
- `dg_projects/data_loading/data_loading/definitions.py` - Code location setup
- `dg_projects/data_loading/data_loading/defs/edxorg_s3_ingest/` - Source implementation

EdX.org Updates:
- `dg_projects/edxorg/edxorg/assets/edxorg_db_table_specs.py` - External asset specs
- `dg_projects/edxorg/edxorg/definitions.py` - Include db_table specs

References