Add dlt integration for EdX.org S3 data ingestion#1874
Conversation
Pull request overview
This PR integrates dlt (data load tool) into the lakehouse code location to enable Python-based data ingestion. The initial implementation provides an EdX.org S3 data source that loads TSV files containing database table exports from S3.
Changes:
- Added dlt integration dependencies (`dagster-dlt`, `dlt`, `duckdb`, `pyiceberg`)
- Implemented EdX.org S3 ingestion pipeline for loading 40+ database tables from S3
- Added `SKIP_AIRBYTE` environment variable for faster local development iteration
- Created documentation and helper scripts for dlt pipeline development and testing
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| dg_projects/lakehouse/pyproject.toml | Added dlt-related dependencies to project configuration |
| dg_projects/lakehouse/uv.lock | Lock file updates for new dependencies |
| dg_projects/lakehouse/lakehouse/definitions.py | Added SKIP_AIRBYTE logic to conditionally load Airbyte assets |
| dg_projects/lakehouse/lakehouse/defs/edxorg_s3_ingestion/loads.py | Core dlt source implementation for EdX.org S3 data ingestion |
| dg_projects/lakehouse/lakehouse/defs/edxorg_s3_ingestion/defs.yaml | Dagster component configuration for dlt integration |
| dg_projects/lakehouse/lakehouse/defs/edxorg_s3_ingestion/README.md | Documentation for EdX.org S3 pipeline |
| dg_projects/lakehouse/.dlt/config.toml | dlt non-sensitive configuration for local and production destinations |
| dg_projects/lakehouse/.dlt/secrets.toml.template | Template for dlt credentials configuration |
| dg_projects/lakehouse/scripts/dlt_env.py | Helper script for managing dlt environment switching |
| dg_projects/lakehouse/scripts/test_edxorg_import.py | Import validation script for EdX.org pipeline |
| dg_projects/lakehouse/docs/README.md | Documentation hub for lakehouse guides |
| dg_projects/lakehouse/docs/local_development.md | Local development guide with SKIP_AIRBYTE usage |
| .gitignore | Added dlt-specific file patterns to gitignore |
Force-pushed from 106d960 to 554c88f
Pull request overview
Copilot reviewed 25 out of 33 changed files in this pull request and generated 16 comments.
Force-pushed from 496ce3d to d45ba06
Supersedes #1598
Pull request overview
Copilot reviewed 25 out of 33 changed files in this pull request and generated 13 comments.
rachellougee
left a comment
Initial testing of one table, `student_courseenrollment`, looks good. Just a comment about a `table_format` issue I ran into in local testing. We need further testing around incremental loading for large tables.
- Add dagster-dlt, dlt[filesystem,parquet], duckdb, pyiceberg dependencies
- Configure dlt with dual destinations (local dev & production S3)
- Add SKIP_AIRBYTE environment variable for faster local development
- Implement edxorg_s3 dlt source for loading TSV files from S3
- Loads 40+ database tables (auth_user, enrollments, assessments, etc.)
- Memory-efficient chunked processing
- Selective table loading capability
- Metadata tracking per file
- Add comprehensive documentation
- Local development guide with SKIP_AIRBYTE usage
- EdX.org S3 pipeline README with examples
- Helper scripts for environment management
- Integrate with Dagster via DltLoadCollectionComponent
The dlt source loads TSV files generated by the edxorg_archive asset from:
s3://ol-data-lake-landing-zone-production/edxorg-raw-data/edxorg/raw_data/
Structure: db_table/{table_name}/{source_system}/{course_id}/{hash}.tsv
This enables analytics on edX.org course data including user enrollments,
assessments, certificates, and course progress across 40+ database tables.
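Each S3 key in the layout above encodes its own metadata. A minimal stdlib-only sketch of pulling those components out of a key (the function and field names are illustrative, not from the PR):

```python
from typing import NamedTuple

class TsvKeyParts(NamedTuple):
    table_name: str
    source_system: str
    course_id: str
    file_hash: str

def parse_tsv_key(key: str) -> TsvKeyParts:
    """Split a db_table/{table_name}/{source_system}/{course_id}/{hash}.tsv key."""
    parts = key.split("/")
    if len(parts) != 5 or parts[0] != "db_table" or not parts[4].endswith(".tsv"):
        raise ValueError(f"not a db_table TSV key: {key!r}")
    _, table, source_system, course_id, filename = parts
    return TsvKeyParts(table, source_system, course_id, filename.removesuffix(".tsv"))
```

This assumes course ids occupy exactly one path segment, as the documented structure implies.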
for more information, see https://pre-commit.ci
- Add logging instead of print statements in loads.py
- Add noqa comments for CLI utility scripts
- Add pragma: allowlist secret for template credential examples
- Fix string escaping in dlt_env.py
- Add type hints for mypy compliance
- Fix magic number and SQL injection warnings
- Created new component-aware Dagster code location for dlt-based data loading
- Implements EdX.org S3 TSV ingestion for 40+ database tables
- Uses AWS IAM-based authentication (IRSA for K8s, ~/.aws/ for local)
- Configured with dual destinations (local filesystem and production S3)
- Autoloads components from defs/ directory using load_from_defs_folder
- All 40 database table assets verified loading in dg list defs
- Added dlt incremental loading based on file modification_date
- Only processes files modified since last successful run
- Significantly improves performance for subsequent runs
- Added _dlt_file_modified metadata column for tracking
- Updated README with incremental loading documentation
- State automatically managed by dlt in pipeline storage
- Added dlt[pyiceberg] dependency (version >= 1.21.0)
- Configured Iceberg table format for QA and Production environments
- QA writes to s3://ol-data-lake-raw-qa/edxorg with ol_warehouse_qa_raw Glue catalog
- Production writes to s3://ol-data-lake-raw-production/edxorg with ol_warehouse_production_raw Glue catalog
- Local dev uses Parquet format to file:///.dlt/data (no Iceberg overhead)
- Pipeline automatically selects destination based on DAGSTER_ENVIRONMENT
- Iceberg catalog config uses Glue catalog type for AWS integration
- Maintains IAM-based authentication for all S3 operations
- Documented DAGSTER_ENVIRONMENT-based destination selection
- Explained local (Parquet), QA (Iceberg+Glue), and production (Iceberg+Glue) configs
- Added configuration snippets showing Iceberg catalog setup
- Clarified Glue database names: ol_warehouse_qa_raw and ol_warehouse_production_raw
- Set DLT_PROJECT_DIR environment variable to point to data_loading project root
- Calculates path relative to loads.py location (4 parents up)
- Ensures .dlt/config.toml is found when running 'dg dev' from repo root
- Resolves warning about config/secrets files not being found
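That relative-path calculation can be sketched with `pathlib` (the helper name is illustrative; in the real module the starting point would be `Path(__file__)`):

```python
import os
from pathlib import Path

def configure_dlt_project_dir(loads_py: str) -> str:
    """Point DLT_PROJECT_DIR at the project root, four parents up from loads.py,
    so .dlt/config.toml resolves regardless of the current working directory."""
    root = Path(loads_py).resolve().parents[3]  # parents[3] == 4 levels above the file
    os.environ["DLT_PROJECT_DIR"] = str(root)
    return str(root)
```

For a file at `dg_projects/data_loading/data_loading/defs/edxorg_s3_ingest/loads.py`, four parents up lands on `dg_projects/data_loading`, which is where the `.dlt/` directory lives.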
- Filter to prod-only sources (exclude edge) via glob pattern
- Create consolidated non-partitioned downstream assets
- Add upstream dependencies on edxorg db_table assets
- Each dlt asset consolidates all course data into single table
- Update documentation with dependency architecture
- Replace component YAML with Python-based asset wrapper
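The PR does this with a glob pattern on the filesystem source; the equivalent predicate can be sketched with a plain path-segment check (helper name illustrative — fnmatch-style `*` is not path-aware, so a segment check is the unambiguous form):

```python
def is_prod_table_tsv(key: str) -> bool:
    """Keep db_table/{table}/prod/{course}/{hash}.tsv keys, excluding edge exports."""
    parts = key.split("/")
    return (
        len(parts) == 5
        and parts[0] == "db_table"
        and parts[2] == "prod"
        and parts[4].endswith(".tsv")
    )
```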
- Change asset keys from edxorg_s3_local/edxorg_{table} to edxorg/tables/{table}
- Add table name prefix: raw__edxorg__s3__tables__{table_name}
- Update documentation to reflect new naming structure
- Implement custom EdxorgDltTranslator using get_asset_spec()
- Map dlt resources to ol_warehouse_raw_data asset keys
- Create one-to-one AssetDep dependencies (non-blocking)
- Move edxorg_db_table_specs to edxorg code location
- Update edxorg definitions to include db_table external specs
- Use modern DagsterDltTranslator API (no deprecation warnings)
- Table names: raw__edxorg__s3__tables__{table_name}
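The mapping the translator performs can be sketched as plain functions (in the real code this logic lives inside the `EdxorgDltTranslator` override of `DagsterDltTranslator.get_asset_spec()`; the function names below are illustrative):

```python
def asset_key_for_table(table_name: str) -> list[str]:
    """dlt resource -> Dagster asset key under the ol_warehouse_raw_data prefix."""
    return ["ol_warehouse_raw_data", f"raw__edxorg__s3__tables__{table_name}"]

def upstream_dep_for_table(table_name: str) -> list[str]:
    """One-to-one dependency on the external edxorg db_table asset spec."""
    return ["edxorg", "raw_data", "db_table", table_name]
```

Each dlt resource thus gets exactly one warehouse asset key and one non-blocking `AssetDep` back to the matching `edxorg` db_table spec.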
- Use named destination pattern: [destination.&lt;env&gt;] with destination_type
- Update config.toml to use destination.local, destination.qa, destination.production
- Each named destination has destination_type='filesystem'
- Pipeline now correctly resolves bucket_url from environment-specific config
- Fixes ConfigFieldMissingException when materializing dlt assets
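A sketch of what that named-destination layout in `.dlt/config.toml` might look like (the bucket URLs are taken from this PR; the exact key names are an assumption about dlt's named-destination syntax):

```toml
[destination.local]
destination_type = "filesystem"
bucket_url = "file:///.dlt/data"

[destination.qa]
destination_type = "filesystem"
bucket_url = "s3://ol-data-lake-raw-qa/edxorg"

[destination.production]
destination_type = "filesystem"
bucket_url = "s3://ol-data-lake-raw-production/edxorg"
```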
- Remove dlt dependencies from pyproject.toml (dagster-dlt, dlt, duckdb)
- Remove .dlt/ configuration directory
- Remove edxorg_s3_ingestion defs and scripts
- Update documentation to remove dlt references
- Sync uv.lock with updated dependencies

dlt integration has been moved to the data_loading code location
- Fix Python version constraint to match lakehouse (>=3.13,&lt;3.15)
- Remove defs.yaml.bak backup file
- Fix upstream asset path in README (../../ -> ../../../../)
- Remove Qualtrics reference from related pipelines
- Add missing 'course' table to all_tables list
- Fix module path in usage example (data_loading not lakehouse)
- Fix environment variable name in usage (DAGSTER_ENVIRONMENT)
Force-pushed from 5d75d3d to ae1c393
Pull request overview
Copilot reviewed 20 out of 29 changed files in this pull request and generated 12 comments.
Force-pushed from ae1c393 to 8fc3603
Force-pushed from d4457ae to 4f26792
…cessing Many of the CSV files that are sent in the edxorg raw data are lacking any identifying information about the courses other than the contextualized file hierarchy in the export. This adds extra columns including that extracted content at the point that we have it. This allows for us to use that course detail downstream when we are aggregating and filtering this data into downstream assets. This also addresses an edge case bug that was causing some course XML files to be erroneously processed as CSV data due to how the file path was parsed by the regular expression.
Force-pushed from 4f26792 to e9d934a
Pull request overview
Copilot reviewed 21 out of 31 changed files in this pull request and generated 5 comments.
…ADME.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```diff
  archive_file.unlink()
  continue
- if df.is_empty():
+ if df.collect().is_empty():
```
Bug: The code calls df.collect() on a Polars LazyFrame after it has already been consumed by df.sink_csv(). A LazyFrame cannot be consumed twice.
Severity: HIGH
Suggested Fix
The emptiness check should be performed before the data is written to the file. Move the if df.collect().is_empty(): block to be before the df.sink_csv() call. This ensures the check is performed on the valid, unconsumed LazyFrame.
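An illustrative patch sketch of that reordering (variable and path names are placeholders): materializing the frame once with `collect()` lets the result be both checked for emptiness and written, so the LazyFrame is never executed twice.

```diff
- df.sink_csv(output_path)
- if df.collect().is_empty():
-     archive_file.unlink()
-     continue
+ frame = df.collect()
+ if frame.is_empty():
+     archive_file.unlink()
+     continue
+ frame.write_csv(output_path)
```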
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: dg_projects/edxorg/edxorg/assets/edxorg_archive.py#L304
Potential issue: In `edxorg_archive.py`, a Polars LazyFrame `df` is created and then
written to a file using `df.sink_csv()`. Immediately after, the code attempts to call
`df.collect().is_empty()` on the same LazyFrame reference. This is invalid because
`sink_csv()` consumes the LazyFrame, and it cannot be executed or collected a second
time. This will cause a runtime error during the archive processing operation,
preventing the data from being properly ingested.
rachellougee
left a comment
Tested loading a small table, and it works
Overview
This PR adds dlt (data load tool) integration to a new data_loading code location, enabling Python-based data ingestion as a complement to Airbyte. The initial implementation loads EdX.org database table exports from S3 with proper asset dependency tracking.
What's Added
New Code Location: data_loading
Dependencies
EdX.org S3 Data Source
Implements a production-ready dlt source that:
- Loads TSV files generated by the `edxorg_archive` asset from S3

Source:
`s3://ol-data-lake-landing-zone-production/edxorg-raw-data/edxorg/raw_data/db_table/{table}/prod/{course}/*.tsv`

Destinations:
- Local: `file:///.dlt/data` (Parquet for fast iteration)
- QA: `s3://ol-data-lake-raw-qa/edxorg` (Iceberg + Glue: `ol_warehouse_qa_raw`)
- Production: `s3://ol-data-lake-raw-production/edxorg` (Iceberg + Glue: `ol_warehouse_production_raw`)

Asset Architecture
Upstream Assets (edxorg code location)
- Defined in `dg_projects/edxorg/edxorg/assets/edxorg_db_table_specs.py`
- Asset keys: `AssetKey(["edxorg", "raw_data", "db_table", {table}])`
- Keyed by `course_id` and `source_system` (prod/edge)

Downstream Assets (data_loading code location)
- Defined in `dg_projects/data_loading/data_loading/defs/edxorg_s3_ingest/`
- Asset keys: `AssetKey(["ol_warehouse_raw_data", "raw__edxorg__s3__tables__{table}"])`
- Table names: `raw__edxorg__s3__tables__{table}` (matches Airbyte naming pattern)
- `AssetDep` on corresponding upstream table

Dependency Mapping
Custom `EdxorgDltTranslator` class:
- Implements `get_asset_spec()` using modern Dagster API
- Maps dlt resources to asset keys under the `ol_warehouse_raw_data` prefix
- Creates `AssetDep` for lineage tracking

Tables Loaded (40+)
User & Authentication:
Enrollment & Progress:
Assessments & Submissions:
Certificates & Grading:
And 20+ more (teams, wiki, workflow, user profile data)
Key Features
1. Incremental Loading
- Uses `incremental("modification_date")` on the filesystem source
- State stored in the `.dlt/edxorg_s3/` directory

2. Environment-Based Configuration
- QA: `s3://ol-data-lake-raw-qa/edxorg` with Glue catalog
- Production: `s3://ol-data-lake-raw-production/edxorg` with Glue catalog
- Selected via the `DAGSTER_ENVIRONMENT` variable

3. Prod-Only Filtering
- Glob pattern: `db_table/{table}/prod/**/*.tsv`

4. IAM-Based Authentication
- AWS IAM roles (IRSA in Kubernetes, `~/.aws/credentials` locally)

5. Iceberg + Glue Catalog (QA/Production)
Project Structure
Usage Examples
Materialize via Dagster UI
- Asset group: `ol_warehouse_raw_data`
- Select a table asset (e.g. `raw__edxorg__s3__tables__auth_user`)
- Upstream dependency: `edxorg/raw_data/db_table/auth_user`

Local Testing
Environment Variables
- `DAGSTER_ENVIRONMENT`: Controls destination (dev/ci/qa/production)
- `DLT_PROJECT_DIR`: Auto-set for config path resolution

Technical Implementation
Custom DagsterDltTranslator
Named Destination Configuration
Testing
- All 40 table assets verified loading (under `ol_warehouse_raw_data`)
Migration Path
This establishes the foundation for migrating other data sources to dlt:
- REST API sources via `rest_api_resources`
New Code Location:
- `dg_projects/data_loading/pyproject.toml` - Dependencies and metadata
- `dg_projects/data_loading/.dlt/config.toml` - dlt configuration
- `dg_projects/data_loading/data_loading/definitions.py` - Code location setup
- `dg_projects/data_loading/data_loading/defs/edxorg_s3_ingest/` - Source implementation

EdX.org Updates:
- `dg_projects/edxorg/edxorg/assets/edxorg_db_table_specs.py` - External asset specs
- `dg_projects/edxorg/edxorg/definitions.py` - Include db_table specs

References