From 01bd85d9b8dd526ab8b732186f3bbf9a64f77f38 Mon Sep 17 00:00:00 2001 From: Dani Palma Date: Fri, 26 Jun 2026 10:36:52 -0300 Subject: [PATCH 1/2] docs: add and refresh README for every example Write or rewrite a technical, SEO-oriented README.md for all 29 example projects plus the four Python derivation subprojects, and rebuild the root README as a categorized index linking every example. Standardize the product name as "Estuary" throughout (no "Estuary Flow"/"Flow"). Each README documents the architecture in Estuary terms (capture -> collection -> materialization/derivation), what's included, prerequisites, copy-pasteable setup, and how to configure the capture/materialization via the dashboard or flowctl, with links to the relevant connector docs. Co-Authored-By: Claude Opus 4.8 --- README.md | 69 ++- dekaf-kcat/README.md | 119 +++++ dekaf-python/README.md | 121 +++++ derivations-ad-performance/README.md | 147 +++++- derivations-sql-full-outer-join/README.md | 127 +++++- estuary-bytewax/README.md | 161 ++++++- estuary-coaelsce-demo-2025/README.md | 150 ++++++- estuary-demo-movies/README.md | 107 +++++ estuary-motherduck-demo-2025/README.md | 140 +++++- estuary-motherduck-orders/README.md | 175 +++++++- google-sheets-pinecone-rag/README.md | 141 +++++- hands-on-lab-postgres-motherduck/README.md | 419 ++++++++---------- kafka-capture/README.md | 273 ++++++++---- mongodb-pinecone-rag/README.md | 188 +++++++- mongodb-tinybird-clickstream/README.md | 184 ++++++++ oracle-capture/README.md | 164 +++++-- postgres-cdc-bigquery-dbt/README.md | 162 ++++++- postgres-cloudsql-simple-capture/README.md | 189 +++++++- postgres-measure-wal-throughput/README.md | 171 ++++++- postgres-simple-capture/README.md | 177 +++++++- .../README.md | 144 +++++- pyiceberg-aws-glue/README.md | 118 +++++ python-derivations/README.md | 123 +++++ python-derivations/shipments-ai/README.md | 129 ++++++ python-derivations/shipments-joins/README.md | 128 ++++++ .../shipments-stateful/README.md | 159 
++++++- .../shipments-stateless/README.md | 108 +++++ shipments-datagen/README.md | 125 +++++- shipments_eta/README.md | 224 +++++++++- singlestore-webinar-2025/README.md | 199 ++++++++- snowflake-cdc-pinecone-rag/README.md | 182 +++++++- sqlserver-cdc-capture/README.md | 181 ++++++-- sqlserver-cdc-materialize/README.md | 162 ++++++- streaming-lakehouse-iceberg-duckdb/README.md | 164 ++++++- 34 files changed, 5060 insertions(+), 470 deletions(-) create mode 100644 dekaf-python/README.md create mode 100644 estuary-demo-movies/README.md create mode 100644 mongodb-tinybird-clickstream/README.md create mode 100644 pyiceberg-aws-glue/README.md create mode 100644 python-derivations/README.md create mode 100644 python-derivations/shipments-ai/README.md create mode 100644 python-derivations/shipments-joins/README.md create mode 100644 python-derivations/shipments-stateless/README.md diff --git a/README.md b/README.md index da9273a..635dcb1 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,70 @@ # Estuary Examples -This repository is a collection of example projects utilizing Estuary. See subdirectories for specific projects and instructions. +A collection of hands-on, runnable examples for building real-time data pipelines with [Estuary](https://estuary.dev). Each project is self-contained and covers a production-grade pattern: change data capture (CDC) from databases like PostgreSQL, MongoDB, Oracle, and SQL Server; streaming ETL and materializations into warehouses and lakehouses; SQL, TypeScript, and Python derivations; and real-time Retrieval-Augmented Generation (RAG) and AI pipelines. Clone any folder and stream live data in minutes. -For more, check out the Estuary [blog](https://estuary.dev/blog/) and [documentation](https://docs.estuary.dev/). +## What is Estuary? 
+ +Estuary is a real-time data integration platform that captures change data (CDC) and event streams from your databases, SaaS apps, and message streams, then materializes them into warehouses, lakehouses, vector databases, and analytics tools with millisecond latency. Learn more at [estuary.dev](https://estuary.dev) and read the [documentation](https://docs.estuary.dev). + +## Database CDC Captures + +| Example | Description | +| --- | --- | +| [postgres-simple-capture](./postgres-simple-capture) | Minimal, self-contained PostgreSQL CDC demo using Docker and an ngrok tunnel to stream row changes into an Estuary collection. | +| [postgres-cloudsql-simple-capture](./postgres-cloudsql-simple-capture) | Real-time PostgreSQL CDC pipeline targeting Google Cloud SQL for PostgreSQL with the Cloud SQL Python Connector. | +| [oracle-capture](./oracle-capture) | Oracle CDC capture from a free, local Oracle Database 23.6 in Docker using LogMiner-based logical replication. | +| [sqlserver-cdc-capture](./sqlserver-cdc-capture) | Self-contained SQL Server 2022 environment with Change Data Capture enabled and a continuous insert/update/delete data generator. | +| [kafka-capture](./kafka-capture) | Capture real-time IoT topics from an Amazon MSK (Apache Kafka) cluster into Estuary using AWS IAM authentication. | +| [estuary-demo-movies](./estuary-demo-movies) | Seed a `movies` table in any ANSI-SQL database as a ready-made source for an Estuary capture. | +| [shipments-datagen](./shipments-datagen) | Dockerized PostgreSQL data generator that continuously mutates realistic shipments data, pre-wired for Estuary CDC. | +| [postgres-measure-wal-throughput](./postgres-measure-wal-throughput) | Measure PostgreSQL WAL throughput to size and forecast a CDC pipeline's change-event volume before you build it. 
| + +## Materializations & Destinations + +| Example | Description | +| --- | --- | +| [estuary-motherduck-demo-2025](./estuary-motherduck-demo-2025) | Stream PostgreSQL CDC into MotherDuck in real time, keeping analytical tables up to date with low latency. | +| [estuary-motherduck-orders](./estuary-motherduck-orders) | Stream a live pet-store order feed from PostgreSQL into MotherDuck (serverless DuckDB) via CDC. | +| [postgres-cdc-bigquery-dbt](./postgres-cdc-bigquery-dbt) | End-to-end ELT pipeline streaming PostgreSQL CDC into Google BigQuery, then modeling it with dbt. | +| [postgresql-cdc-databricks-fraud-detection](./postgresql-cdc-databricks-fraud-detection) | Real-time fraud detection pipeline streaming PostgreSQL CDC into Databricks for SQL-based lakehouse analysis. | +| [singlestore-webinar-2025](./singlestore-webinar-2025) | Stream PostgreSQL CDC into SingleStore in real time for low-latency analytics (Estuary x SingleStore webinar demo). | +| [mongodb-tinybird-clickstream](./mongodb-tinybird-clickstream) | Capture a live e-commerce clickstream from MongoDB Atlas and materialize it into Tinybird. | +| [sqlserver-cdc-materialize](./sqlserver-cdc-materialize) | Stream SQL Server CDC into Materialize via the Dekaf Kafka-compatible API to power an incrementally maintained view. | +| [pyiceberg-aws-glue](./pyiceberg-aws-glue) | Query an Apache Iceberg table that Estuary materialized to S3 using PyIceberg and the AWS Glue Data Catalog. | + +## Derivations & Transformations + +| Example | Description | +| --- | --- | +| [derivations-ad-performance](./derivations-ad-performance) | Real-time ad performance analytics joining impression and click streams with a stateful TypeScript derivation. | +| [derivations-sql-full-outer-join](./derivations-sql-full-outer-join) | Implement a full outer join across two collections with a SQLite-backed Estuary SQL derivation. 
| +| [python-derivations](./python-derivations) | Four Python derivation patterns: stateless transforms, stateful aggregation, streaming joins, and ML feature engineering. | + +## Real-Time RAG & AI + +| Example | Description | +| --- | --- | +| [google-sheets-pinecone-rag](./google-sheets-pinecone-rag) | End-to-end real-time RAG: stream Google Sheets rows to Pinecone embeddings and serve a Streamlit chatbot. | +| [mongodb-pinecone-rag](./mongodb-pinecone-rag) | Stream MongoDB product reviews to a Pinecone vector index for a real-time RAG Streamlit chat app. | +| [snowflake-cdc-pinecone-rag](./snowflake-cdc-pinecone-rag) | Stream Snowflake CDC into Pinecone vectors in real time, queried by a Streamlit RAG chatbot. | + +## Streaming, Lakehouse & Stream Processing + +| Example | Description | +| --- | --- | +| [dekaf-kcat](./dekaf-kcat) | Consume a live Estuary collection from the CLI with kcat over Estuary's Kafka-compatible Dekaf API. | +| [dekaf-python](./dekaf-python) | Consume a real-time Estuary collection in Python with `confluent-kafka`, Dekaf, and an Avro schema registry. | +| [estuary-bytewax](./estuary-bytewax) | Stream MongoDB CDC into a Bytewax Python dataflow via Dekaf to compute tumbling-window metrics. | +| [streaming-lakehouse-iceberg-duckdb](./streaming-lakehouse-iceberg-duckdb) | Build a streaming lakehouse: PostgreSQL CDC into Apache Iceberg on S3 (AWS Glue), queried with PyIceberg/DuckDB. | +| [shipments_eta](./shipments_eta) | Real-time freight ETA tracking from MongoDB CDC to Tinybird/ClickHouse with a Next.js dashboard via Dekaf. | + +## Demos, Workshops & Webinars + +| Example | Description | +| --- | --- | +| [hands-on-lab-postgres-motherduck](./hands-on-lab-postgres-motherduck) | Guided hands-on lab: PostgreSQL CDC to MotherDuck with soft delete, hard delete, and SCD2 materialization patterns. 
| +| [estuary-coaelsce-demo-2025](./estuary-coaelsce-demo-2025) | Self-contained PostgreSQL CDC fraud-detection demo with anomaly injection (Estuary x Coalesce 2025). | + +--- + +Built with [Estuary](https://estuary.dev). Read the [blog](https://estuary.dev/blog/), explore the [documentation](https://docs.estuary.dev), or get started in the [dashboard](https://dashboard.estuary.dev). diff --git a/dekaf-kcat/README.md b/dekaf-kcat/README.md index e69de29..e5c5d26 100644 --- a/dekaf-kcat/README.md +++ b/dekaf-kcat/README.md @@ -0,0 +1,119 @@ +# Consume an Estuary Collection from the CLI with kcat (Kafka) via Dekaf + +Read a live Estuary collection straight from your terminal using [kcat](https://github.com/edenhill/kcat) (the Kafka CLI, formerly `kafkacat`) over Estuary's Kafka-compatible **Dekaf** API. The included `consume.sh` connects to a Dekaf bootstrap endpoint over `SASL_SSL` / `PLAIN` and tails the public demo `demo/wikipedia/recentchange-sampled` collection (a real-time stream of Wikipedia edit events) with no Estuary-specific tooling required. + +Because Dekaf speaks the Kafka wire protocol, any existing Kafka consumer (kcat, the Java client, `confluent-kafka`, Spark, Flink, ksqlDB, etc.) can read an Estuary collection as if it were a Kafka topic. + +## How it works + +Estuary [captures](https://docs.estuary.dev/concepts/captures/) data from sources into [collections](https://docs.estuary.dev/concepts/collections/) — schematized JSON streams backed by cloud storage. **Dekaf** exposes those collections through a Kafka-compatible endpoint, so a Kafka consumer can subscribe to a collection as a topic. + +``` +Source ──capture──▶ Estuary collection ──Dekaf (Kafka API)──▶ kcat (this example) + (demo/wikipedia/recentchange-sampled) +``` + +- **Bootstrap server**: the Dekaf endpoint, addressed like a Kafka broker. +- **Topic**: the Estuary collection name. +- **Auth**: SASL `PLAIN` over TLS. 
Username is the Dekaf task name (or `{}` for public demo topics); password is an Estuary access token (empty for public demo topics). + +## What's included + +- `consume.sh` — a single `kcat` consumer invocation that connects to Dekaf and prints messages from the public Wikipedia recent-changes demo collection. + +## Prerequisites + +- **kcat** installed and on your `PATH`: + - macOS: `brew install kcat` + - Debian/Ubuntu: `apt-get install kcat` + - Or see the [kcat install docs](https://github.com/edenhill/kcat#install). +- Nothing else for the public demo topic. To read **your own** collections you need a free [Estuary account](https://dashboard.estuary.dev) and an Estuary access token. + +## Running it + +The bundled `consume.sh` ships with the legacy bootstrap host `dekaf.fly.dev:9092`, which no longer resolves. Update that one line to Estuary's current Dekaf endpoint, `dekaf.estuary-data.com:9092`, so the script reads: + +```bash +kcat -C \ + -b dekaf.estuary-data.com:9092 \ + -t demo/wikipedia/recentchange-sampled \ + -X security.protocol=sasl_ssl \ + -X sasl.mechanisms=PLAIN \ + -X sasl.username='{}' \ + -X sasl.password='' +``` + +Run it: + +```bash +chmod +x consume.sh +./consume.sh +``` + +`kcat -C` runs in **consumer** mode and streams the Wikipedia `recentchange` events to stdout. Press `Ctrl-C` to stop. 
+ +Flags explained: + +| Flag | Value | Meaning | +| --- | --- | --- | +| `-C` | — | Consumer mode | +| `-b` | `dekaf.estuary-data.com:9092` | Dekaf bootstrap server (Kafka broker) | +| `-t` | `demo/wikipedia/recentchange-sampled` | Collection name, used as the Kafka topic | +| `-X security.protocol` | `sasl_ssl` | Encrypted connection with SASL auth | +| `-X sasl.mechanisms` | `PLAIN` | SASL PLAIN mechanism | +| `-X sasl.username` | `{}` | Public demo placeholder (use your Dekaf task name for private collections) | +| `-X sasl.password` | (empty) | Public demo placeholder (use your Estuary access token for private collections) | + +> **Note on the bootstrap host:** Estuary's current Dekaf endpoint is `dekaf.estuary-data.com:9092` (with the schema registry at `https://dekaf.estuary-data.com`). The bundled `consume.sh` still references the legacy host `dekaf.fly.dev:9092`, which no longer resolves — switch that line to `dekaf.estuary-data.com:9092`. See the [Dekaf reading guide](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/) for the authoritative endpoint and connection settings. + +## Reading your own collections + +To consume a private collection instead of the public demo: + +1. Sign in to the [Estuary dashboard](https://dashboard.estuary.dev) and create a [Dekaf materialization](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/) (or use an existing one) to expose your collection over the Kafka API. +2. Generate an Estuary access token (refresh token) from the dashboard. +3. 
Run kcat with the production endpoint, your Dekaf task name as the username, and the token as the password: + +```bash +export DEKAF_TASK_NAME="your-org/your-dekaf-task" +export DEKAF_ACCESS_TOKEN="your-estuary-access-token" + +kcat -C \ + -b dekaf.estuary-data.com:9092 \ + -t your-collection-name \ + -X security.protocol=sasl_ssl \ + -X sasl.mechanisms=PLAIN \ + -X sasl.username="$DEKAF_TASK_NAME" \ + -X sasl.password="$DEKAF_ACCESS_TOKEN" +``` + +The Kafka topic name is the collection name as exposed by your Dekaf task. + +## Verify + +If the connection is working, you'll see a continuous stream of JSON Wikipedia edit events printed to your terminal. To consume from the beginning of the available data instead of the live tail, add `-o beginning`: + +```bash +kcat -C -o beginning \ + -b dekaf.estuary-data.com:9092 \ + -t demo/wikipedia/recentchange-sampled \ + -X security.protocol=sasl_ssl \ + -X sasl.mechanisms=PLAIN \ + -X sasl.username='{}' \ + -X sasl.password='' +``` + +To print only a fixed number of messages and exit, add `-c <count>` (for example `-c 10`). + +## Next steps + +- Consume the same collection from a Python application with the Avro schema registry: see the [`dekaf-python`](../dekaf-python) example in this repository. Note that the Python example subscribes to the topic as `recentchange-sampled` while this kcat example uses the fully-qualified `demo/wikipedia/recentchange-sampled`; both refer to the same demo collection, so use whichever topic string the example you are running already specifies. +- Wire any other Kafka client (Spark, Flink, ksqlDB, the Java client) to an Estuary collection through Dekaf. +- Build your own pipeline: create a [capture](https://dashboard.estuary.dev/captures), land it in a collection, and expose it over Dekaf. 
+ +## References + +- Dekaf — reading collections from Kafka: https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/ +- Estuary documentation: https://docs.estuary.dev +- Estuary dashboard: https://dashboard.estuary.dev +- kcat (Kafka CLI): https://github.com/edenhill/kcat diff --git a/dekaf-python/README.md b/dekaf-python/README.md new file mode 100644 index 0000000..4a62a74 --- /dev/null +++ b/dekaf-python/README.md @@ -0,0 +1,121 @@ +# Consume an Estuary Collection in Python with Kafka and Avro via Estuary Dekaf + +Read a real-time Estuary collection from Python using the `confluent-kafka` client, Estuary's **Dekaf** Kafka-compatible API, and an Avro schema registry. This example consumes the `recentchange-sampled` collection (Wikimedia RecentChange events) over SASL_SSL, deserializes each record against the registry-managed Avro schema, and prints it. No Kafka cluster, no Debezium, no schema-registry to operate — Dekaf exposes any Estuary collection as a Kafka topic. + +## How it works + +Estuary captures source data into **collections** (a schematized, real-time data lake of JSON in cloud storage). **Dekaf** is Estuary's Kafka-compatible streaming API: it lets any Kafka consumer read a collection as if it were a Kafka topic, complete with a Confluent-style schema registry that serves Avro schemas for each collection. + +``` +Estuary collection (recentchange-sampled) + │ + ▼ +Dekaf ── Kafka-compatible broker + Avro schema registry + broker: dekaf.estuary-data.com:9092 (SASL_SSL / PLAIN) + schema registry: https://dekaf.estuary-data.com + │ + ▼ +main.py ── confluent-kafka Consumer + AvroDeserializer ──► prints records +``` + +`main.py`: +1. Connects a `confluent_kafka.Consumer` to `dekaf.estuary-data.com:9092` using `security.protocol=SASL_SSL`, `sasl.mechanism=PLAIN`, username = the Dekaf task name, password = an Estuary access token. +2. 
Connects a `SchemaRegistryClient` to `https://dekaf.estuary-data.com` with the same credentials (`basic.auth.user.info`). +3. Fetches the latest Avro schema for the `recentchange-sampled-value` subject and builds an `AvroDeserializer`. +4. Subscribes to the `recentchange-sampled` topic and polls in a loop, deserializing each message value and printing `id`, `meta.domain`, `timestamp`, and `title`. + +## What's included + +- **`main.py`** — the consumer. Holds the Dekaf endpoint, schema registry URL, target topic (`recentchange-sampled`), Kafka/SASL config, Avro deserialization, and the poll loop. +- **`requirements.txt`** — single dependency: `confluent-kafka[avro,schemaregistry,rules]` (the Kafka client plus Avro + schema-registry extras). Note `main.py` also uses `python-dotenv` (`load_dotenv`) to read credentials from a `.env` file — install it as well (see below). + +## Prerequisites + +- **Python 3.8+** and `pip`. +- A **free Estuary account** — sign up at [https://dashboard.estuary.dev](https://dashboard.estuary.dev). +- The `recentchange-sampled` **collection available in your Estuary account**. This is the Wikimedia RecentChange sample stream. If you don't already have it, create a capture for the Wikimedia / public demo source (or any collection of your own) and update the `topic` variable in `main.py` to match the collection name you want to read. +- A **Dekaf access token** (an Estuary refresh/access token) to authenticate the consumer and schema registry. + +> Dekaf authenticates with `sasl.mechanism=PLAIN` where the username is the Dekaf **task name** and the password is an Estuary **access token**. Public demo topics can be read with username `{}` and an empty password, but this example is wired for an authenticated collection via `DEKAF_TASK_NAME` and `DEKAF_ACCESS_TOKEN`. 
+ +## Setup + +Install the dependencies: + +```bash +pip install -r requirements.txt +pip install python-dotenv +``` + +Create a `.env` file in this directory with your Dekaf credentials: + +```bash +# .env +DEKAF_TASK_NAME=your-dekaf-task-name +DEKAF_ACCESS_TOKEN=your-estuary-access-token +``` + +These map directly to the consumer's `sasl.username` / `sasl.password` and the schema registry's `basic.auth.user.info` in `main.py`. + +### Getting your Dekaf credentials + +1. In the [Estuary dashboard](https://dashboard.estuary.dev), open the collection you want to consume (here, `recentchange-sampled`). +2. Create or open a **Dekaf** materialization/task for it — its name is your `DEKAF_TASK_NAME`. +3. Generate an **access token** (or refresh token) from your account settings — this is your `DEKAF_ACCESS_TOKEN`. + +See the Dekaf guide for the exact steps: [Reading Estuary collections from Kafka (Dekaf)](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/). + +## Running it + +```bash +python main.py +``` + +You should see one line per record as events stream in, for example: + +``` +('', '', '', '') +``` + +Stop with `Ctrl+C` — the script catches `KeyboardInterrupt` and closes the consumer cleanly. 
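For orientation, the consumer flow described under "How it works" can be sketched roughly as follows. This is an illustrative outline, not the repository's `main.py`: the helper names `consumer_config`, `registry_config`, and `run` are hypothetical, and the record field access assumes the `id`, `meta.domain`, `timestamp`, and `title` fields mentioned above are present in each Avro record.

```python
BROKER = "dekaf.estuary-data.com:9092"
REGISTRY_URL = "https://dekaf.estuary-data.com"
TOPIC = "recentchange-sampled"


def consumer_config(task_name: str, token: str) -> dict:
    """Kafka consumer settings for Dekaf (SASL_SSL with the PLAIN mechanism)."""
    return {
        "bootstrap.servers": BROKER,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": task_name,   # Dekaf task name
        "sasl.password": token,       # Estuary access token
        "group.id": "my-group",
        "auto.offset.reset": "latest",
    }


def registry_config(task_name: str, token: str) -> dict:
    """Schema registry settings; reuses the same credentials as the broker."""
    return {"url": REGISTRY_URL, "basic.auth.user.info": f"{task_name}:{token}"}


def run(task_name: str, token: str) -> None:
    """Poll the topic, decode each Avro value, and print a few fields."""
    # Imported lazily so the config helpers work without confluent-kafka installed.
    from confluent_kafka import Consumer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.avro import AvroDeserializer
    from confluent_kafka.serialization import MessageField, SerializationContext

    registry = SchemaRegistryClient(registry_config(task_name, token))
    schema_str = registry.get_latest_version(f"{TOPIC}-value").schema.schema_str
    deserialize = AvroDeserializer(registry, schema_str)

    consumer = Consumer(consumer_config(task_name, token))
    consumer.subscribe([TOPIC])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue  # no message within the timeout
            if msg.error():
                print(f"consumer error: {msg.error()}")
                continue
            ctx = SerializationContext(TOPIC, MessageField.VALUE)
            record = deserialize(msg.value(), ctx)
            if record is None:
                continue  # empty message value
            print(record["id"], record["meta"]["domain"],
                  record["timestamp"], record["title"])
    except KeyboardInterrupt:
        pass
    finally:
        consumer.close()  # leave the consumer group cleanly
```

To try it, call `run(os.environ["DEKAF_TASK_NAME"], os.environ["DEKAF_ACCESS_TOKEN"])` after loading your `.env`. Keeping the two config builders separate from the poll loop makes the connection settings easy to inspect without opening a network connection.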
+ +## Configuration reference + +All connection settings live at the top of `main.py`: + +| Setting | Value | Notes | +| --- | --- | --- | +| `bootstrap.servers` | `dekaf.estuary-data.com` | Dekaf broker (port `9092`) | +| `security.protocol` | `SASL_SSL` | Required by Dekaf | +| `sasl.mechanism` | `PLAIN` | Required by Dekaf | +| `sasl.username` | `DEKAF_TASK_NAME` | Dekaf task name | +| `sasl.password` | `DEKAF_ACCESS_TOKEN` | Estuary access token | +| Schema registry URL | `https://dekaf.estuary-data.com` | Confluent-compatible Avro registry | +| `group.id` | `my-group` | Consumer group | +| `auto.offset.reset` | `latest` | Set to `earliest` to read from the start of the collection | +| `topic` | `recentchange-sampled` | The Estuary collection name | + +To consume a different collection, change `topic` to that collection's name; the deserializer automatically resolves the `<topic>-value` subject from the schema registry. + +## Verify + +- Watch `main.py`'s output — a steady stream of printed records confirms Dekaf is serving the collection and the Avro schema resolved correctly. +- Cross-check against the collection in the [Estuary dashboard](https://dashboard.estuary.dev) (open the collection and view recent documents). +- With [flowctl](https://docs.estuary.dev/concepts/flowctl/) you can read the same collection directly: + + ```bash + flowctl collections read --collection recentchange-sampled --uncommitted | head + ``` + +## Next steps + +- Point any other Kafka client (kcat, Kafka Connect, Flink, Spark, Tinybird, ClickPipes, etc.) at the same Dekaf endpoint — the credentials and schema registry are identical across clients. +- Swap the `print` for your own processing: write to a database, push to another stream, or run real-time analytics. +- Build a full pipeline: capture a source into a collection, optionally transform it with a [derivation](https://docs.estuary.dev/concepts/derivations/), and consume it here. 
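As one possible shape for "your own processing", the sketch below tallies records per wiki domain instead of printing them. The `DomainTally` class and `domain_of` helper are hypothetical names introduced here for illustration; the only assumption about the data is that a deserialized record is a dict with an optional `meta.domain` field, as printed by `main.py`.

```python
from collections import Counter
from typing import Any, List, Mapping, Optional, Tuple


def domain_of(record: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return record['meta']['domain'], or None if any part is missing."""
    if not record:
        return None
    meta = record.get("meta") or {}
    return meta.get("domain")


class DomainTally:
    """Running count of records seen per domain (illustrative handler)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def handle(self, record: Optional[Mapping[str, Any]]) -> None:
        domain = domain_of(record)
        if domain is not None:
            self.counts[domain] += 1

    def top(self, n: int) -> List[Tuple[str, int]]:
        """The n busiest domains, most active first."""
        return self.counts.most_common(n)
```

In the poll loop you would call `tally.handle(record)` where the script currently prints, and report `tally.top(5)` on whatever interval suits you.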
+ +## References + +- Dekaf guide: [Reading Estuary collections from Kafka](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/) +- Estuary docs: [https://docs.estuary.dev](https://docs.estuary.dev) +- Estuary dashboard: [https://dashboard.estuary.dev](https://dashboard.estuary.dev) +- `confluent-kafka` Python client: [https://github.com/confluentinc/confluent-kafka-python](https://github.com/confluentinc/confluent-kafka-python) diff --git a/derivations-ad-performance/README.md b/derivations-ad-performance/README.md index 5bbd180..bfb8b62 100644 --- a/derivations-ad-performance/README.md +++ b/derivations-ad-performance/README.md @@ -1,11 +1,144 @@ -# Demo for a simple PostgreSQL capture +# Real-Time Ad Performance Analytics with Estuary TypeScript Derivations and PostgreSQL CDC + +Stream ad impression and ad click events from PostgreSQL into Estuary with change data capture (CDC), then use a stateful **TypeScript derivation** to join the two streams and maintain a running click count per advertising platform in real time. This example shows how to combine multiple captured collections in a single derivation, using `reduce` annotations to incrementally aggregate counts as new events arrive. + +Watch the walkthrough video: https://youtu.be/dbHgn-AdVzU + +## Architecture + +A local PostgreSQL instance is seeded with synthetic ad-tech data and exposed to Estuary's managed control plane over an ngrok TCP tunnel. Estuary captures both tables via Postgres CDC into collections, and a TypeScript derivation reads both collections to emit a per-platform click count. 
+ +``` + ┌──────────────────────────┐ + datagen ──INSERT──▶ │ PostgreSQL (wal_level= │ + (impressions + │ logical, flow_publication)│ + ~10% clicks) │ ad_impressions, ad_clicks │ + └────────────┬──────────────┘ + │ logical replication + ngrok tcp postgres:5432 + │ + ▼ + ┌───────────────────────────────┐ + │ Estuary source-postgres capture│ + └───────────────┬────────────────┘ + │ + ┌───────────────────────┴───────────────────────┐ + ▼ ▼ + .../ad_impressions (collection) .../ad_clicks (collection) + │ │ + │ fromImpressions (click_count: 0) │ fromClicks (click_count: 1) + └───────────────────────┬─────────────────────────┘ + ▼ + TypeScript derivation: ad-clicks-by-platform + keyed on /platform, reduce { sum } on click_count + │ + ▼ + (optional) materialize to any destination +``` + +How the derivation works: + +- `fromImpressions` emits `{ platform, click_count: 0 }` for every impression, so every platform that has been seen shows up even with zero clicks. +- `fromClicks` emits `{ platform, click_count: 1 }` for every click. +- The collection is keyed on `/platform` with `reduce: { strategy: sum }` on `click_count`, so Estuary continuously sums the contributions into a live click tally per platform. + +## What's included + +- **`docker-compose.yml`** — spins up three services: `postgres-ad-performance` (PostgreSQL with `wal_level=logical`, port `5432`), `datagen-ad-performance` (the synthetic data generator), and `ngrok-ad-performance` (a TCP tunnel exposing `postgres:5432`, with the inspector UI on port `4040`). +- **`postgres/init.sql`** — bootstraps the database for CDC: creates the `flow_capture` replication user, the `flow_watermarks` table, the `flow_publication` publication, and the `ad_impressions` and `ad_clicks` tables, then adds all tables to the publication. +- **`datagen/datagen.py`** — continuously inserts fake `ad_impressions` (using `Faker`); for roughly 10% of impressions it also inserts a related `ad_clicks` row. 
Connection is configured via `POSTGRES_*` environment variables. +- **`datagen/Dockerfile`** / **`datagen/requirements.txt`** — Python 3.12 image with `Faker==25.1.0` and `psycopg2==2.9.9`. +- **`derivation/flow.yaml`** — defines the `ad-clicks-by-platform` derived collection: its schema (with the `sum` reduce on `click_count`), key (`/platform`), and the two TypeScript transforms (`fromImpressions`, `fromClicks`) and their source collections. +- **`derivation/ad-clicks-by-platform.flow.ts`** — the TypeScript transform logic for both `fromImpressions` and `fromClicks`. +- **`derivation/deno.json`** — import map pointing `flow/` at the generated TypeScript types. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose +- A free [ngrok](https://ngrok.com/) account and authtoken (the local Postgres must be reachable by Estuary's managed connector) +- A free [Estuary account](https://dashboard.estuary.dev) +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and authenticated, to publish the derivation ## Setup -1. Start containers: `docker compose up` -2. Get PostgreSQL URL: `curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url'` -3. Set up Estuary capture -4. ??? -5. Profit! +### 1. Start the stack + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up +``` + +This starts PostgreSQL, runs `postgres/init.sql`, begins generating ad events, and opens the ngrok TCP tunnel. + +### 2. Get the public PostgreSQL endpoint + +Read the tunnel's public address from the ngrok inspector at http://localhost:4040, or via the API: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' +``` + +This prints something like `tcp://6.tcp.ngrok.io:18923`. Strip the `tcp://` prefix when pasting host/port into Estuary. + +## Configure the Estuary capture + +Create a PostgreSQL CDC capture so the `ad_impressions` and `ad_clicks` tables become Estuary collections. 
You can do this in the [Estuary dashboard](https://dashboard.estuary.dev/captures) with the **source-postgres** connector ([connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/)). + +Use these connection values (from `docker-compose.yml` and `postgres/init.sql`): + +| Field | Value | +| --- | --- | +| Server Address | the ngrok host:port from step 2 (without `tcp://`) | +| User | `flow_capture` | +| Password | `password` | +| Database | `postgres` | + +Discover and enable the `public.ad_impressions` and `public.ad_clicks` bindings. After publishing, the capture writes to two collections (named under your tenant prefix, e.g. `<tenant>/.../ad_impressions` and `<tenant>/.../ad_clicks`). + +> The `flow.yaml` in this repo references the source collections as `dani-demo/demo-ad-performance/ad_impressions` and `dani-demo/demo-ad-performance/ad_clicks`. These are placeholders (marked `# Modify this`) — replace them with the actual collection names your capture produces. + +## Deploy the TypeScript derivation + +The derivation lives in `derivation/`. Before publishing, update `derivation/flow.yaml` to match your environment: + +1. Rename the derived collection `dani-demo/demo-ad-performance/ad-clicks-by-platform` to your own tenant prefix (e.g. `<tenant>/ad-performance/ad-clicks-by-platform`). +2. Point the two transform `source.name` fields at the real collection names produced by your capture. +3. Update the import path in `ad-clicks-by-platform.flow.ts` if you renamed the collection. + +Then authenticate and publish: + +```bash +flowctl auth login +cd derivation +flowctl catalog publish --source flow.yaml --auto-approve +``` + +`flowctl` generates the TypeScript types (resolved through `deno.json` at `flow/...`), builds the derivation, and deploys it. 
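To build intuition for what the deployed derivation computes, here is a small plain-Python simulation of the "emit 0 per impression, emit 1 per click, sum by platform" pattern. It is a sketch of the semantics only, not Estuary's runtime or the TypeScript in `derivation/`; the function names are hypothetical, and it assumes each impression and click event carries a `platform` field.

```python
from collections import defaultdict
from typing import Dict, Iterable


def from_impressions(row: dict) -> dict:
    """Mirror of the fromImpressions transform: register the platform with 0 clicks."""
    return {"platform": row["platform"], "click_count": 0}


def from_clicks(row: dict) -> dict:
    """Mirror of the fromClicks transform: contribute 1 click."""
    return {"platform": row["platform"], "click_count": 1}


def reduce_by_platform(docs: Iterable[dict]) -> Dict[str, int]:
    """Combine documents that share a key by summing click_count."""
    totals: Dict[str, int] = defaultdict(int)
    for doc in docs:
        totals[doc["platform"]] += doc["click_count"]
    return dict(totals)


impressions = [
    {"platform": "Google Ads"},
    {"platform": "Google Ads"},
    {"platform": "Facebook Ads"},
]
clicks = [{"platform": "Google Ads"}]

emitted = [from_impressions(r) for r in impressions] + [from_clicks(r) for r in clicks]
result = reduce_by_platform(emitted)
```

Here `result` contains `Facebook Ads` with a count of 0 even though it has no clicks, which is the reason the impression transform emits a zero rather than nothing.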
+ +## Verify + +Confirm the captured streams and the derived aggregate are flowing: + +```bash +# Raw click events +flowctl collections read --collection <tenant>/.../ad_clicks --uncommitted | head + +# Per-platform click counts (one document per platform, with a running sum) +flowctl collections read --collection <tenant>/.../ad-clicks-by-platform --uncommitted | head +``` + +You should see one document per `platform` (`Google Ads`, `Facebook Ads`, `Twitter Ads`) with a `click_count` that increases over time as `datagen` produces more clicks. You can also watch document counts climb on the collection's page in the [Estuary dashboard](https://dashboard.estuary.dev). + +## Next steps + +- Materialize the `ad-clicks-by-platform` collection to a warehouse or database to power a live dashboard — see [materialization connectors](https://docs.estuary.dev/reference/Connectors/materialization-connectors/). +- Extend the derivation to track conversions (the `conversion_flag` column on `ad_clicks`) or compute click-through rate by also summing impressions. 
+- Learn more about derivations: https://docs.estuary.dev/concepts/derivations/ + +## References -Watch the demo here: https://youtu.be/dbHgn-AdVzU +- Demo video: https://youtu.be/dbHgn-AdVzU +- Estuary docs: https://docs.estuary.dev +- PostgreSQL capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Derivations: https://docs.estuary.dev/concepts/derivations/ +- flowctl: https://docs.estuary.dev/concepts/flowctl/ diff --git a/derivations-sql-full-outer-join/README.md b/derivations-sql-full-outer-join/README.md index bf56724..061b1f8 100644 --- a/derivations-sql-full-outer-join/README.md +++ b/derivations-sql-full-outer-join/README.md @@ -1,12 +1,123 @@ -# Example full outer join implement in SQL +# Full Outer Join Across Collections with a SQL Derivation in Estuary + +This example shows how to implement a **full outer join across two collections** using an **Estuary SQL derivation**. A local PostgreSQL database streams `artists` and `albums` tables into Estuary collections via change data capture (CDC), and a SQLite-backed derivation joins them on `artist_id` to produce a real-time, per-artist rollup of total plays — combining rows that exist in either source, even when a matching row is missing on one side. + +The full write-up is here: [How to Join Collections in Estuary with SQL Derivations](https://estuary.dev/derivations-join-collections-sql/). + +## Architecture + +The pipeline is end-to-end real-time. Inserts into Postgres flow continuously through CDC into source collections, and the derivation reactively re-computes the joined output as new documents arrive. 
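The idea of joining by merging on a shared key can be pictured with a small plain-Python sketch. It is an illustration of the semantics under stated assumptions, not Estuary's implementation or the SQL in `derivation/flow.yaml`: the `merge` and `combine` helpers are hypothetical, and the partial documents are assumed to carry `artist_id` plus either `artist_name` (from the artists side) or `total_plays` (from the albums side).

```python
from typing import Dict, Iterable, Optional


def merge(left: dict, right: dict) -> dict:
    """Merge two partial documents for the same artist_id.

    artist_name keeps the maximum non-null value; total_plays is summed.
    """
    names = [n for n in (left.get("artist_name"), right.get("artist_name")) if n is not None]
    name: Optional[str] = max(names) if names else None
    return {
        "artist_id": left["artist_id"],
        "artist_name": name,
        "total_plays": left.get("total_plays", 0) + right.get("total_plays", 0),
    }


def combine(partials: Iterable[dict]) -> Dict[int, dict]:
    """Fold partial documents from either side into one document per key."""
    out: Dict[int, dict] = {}
    for doc in partials:
        key = doc["artist_id"]
        # Merging with an empty partial normalizes a first-seen document.
        base = out.get(key, {"artist_id": key})
        out[key] = merge(base, doc)
    return out


partials = [
    {"artist_id": 1, "artist_name": "A"},   # artist row
    {"artist_id": 1, "total_plays": 10},    # album row
    {"artist_id": 1, "total_plays": 5},     # another album row
    {"artist_id": 2, "artist_name": "B"},   # artist with no albums
    {"artist_id": 3, "total_plays": 7},     # album whose artist has not arrived
]
joined = combine(partials)
```

Keys 2 and 3 each appear in `joined` despite having data on only one side, which is what distinguishes a full outer join from an inner or left join.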
+ +```text +Postgres (artists, albums) + │ logical replication (CDC) + ▼ +source-postgres capture ──► dani-demo/demo-music/artists + └─► dani-demo/demo-music/albums + │ + ▼ + SQL derivation (SQLite) — full outer join on /artist_id + │ + ▼ + dani-demo/demo-derivations1/artist_total_plays +``` + +- A **capture** (`source-postgres`) reads the `artists` and `albums` tables from Postgres using logical replication and writes them to two collections. +- A **derivation** consumes both collections. Two transforms (`fromArtists` and `fromAlbums`) are each shuffled on `/artist_id` so related documents land in the same partition, and emit partial documents keyed by `artist_id`. +- The output collection's **reduction annotations** stitch the two sides together: documents are merged by key, `artist_name` is reduced with `maximize` (carries the name from the artists side), and `total_plays` is reduced with `sum` (accumulates plays from the albums side). Because reduction merges on key regardless of which transform produced the document, artists with no albums and albums whose artist row hasn't arrived yet both appear — the behavior of a full outer join. + +## What's included + +- `docker-compose.yml` — spins up three services: `postgres` (image `postgres:latest`, started with `wal_level=logical` for CDC), `datagen` (continuously inserts fake artists and albums), and `ngrok` (TCP tunnel exposing Postgres `5432` so Estuary's managed connector can reach your local database). +- `postgres/init.sql` — runs on first boot: creates the `flow_capture` replication user, grants read/write, creates the `flow_watermarks` table, creates the `flow_publication` publication, and creates the `artists` and `albums` tables, adding all of them to the publication. +- `datagen/datagen.py` — Python generator using `Faker` and `psycopg2` that inserts one artist plus 1-5 albums per loop, every second. 
+- `datagen/Dockerfile` / `datagen/requirements.txt` — build the datagen container (`Faker==25.1.0`, `psycopg2==2.9.9`). +- `derivation/flow.yaml` — defines the `dani-demo/demo-derivations1/artist_total_plays` collection and its SQL derivation (the join logic), deployed with `flowctl`. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose +- A free [ngrok](https://ngrok.com/) account and authtoken (the local DB is exposed through an ngrok TCP tunnel) +- A free [Estuary account](https://dashboard.estuary.dev) +- The [`flowctl` CLI](https://docs.estuary.dev/concepts/flowctl/) (used to publish the derivation) ## Setup -1. Start containers: `docker compose up` -2. Get PostgreSQL URL: `curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url'` -3. Set up Estuary capture -4. Deploy derivation via `flowctl` -5. ??? -5. Profit! +1. Export your ngrok authtoken and start the stack: + + ```bash + export NGROK_AUTHTOKEN=<your-ngrok-authtoken> + docker compose up + ``` + + This starts Postgres, begins generating data, and opens the tunnel. + +2. Get the public Postgres endpoint from the ngrok tunnel: + + ```bash + curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' + ``` + + You can also open the ngrok dashboard at [http://localhost:4040](http://localhost:4040). The value looks like `tcp://0.tcp.ngrok.io:12345` — strip the `tcp://` prefix when pasting into Estuary, and split the host and port. + +## Configure the Estuary capture + +Create the PostgreSQL capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) (or via `flowctl`) using the **PostgreSQL** connector (`source-postgres`). 
Connection values come straight from `docker-compose.yml` and `postgres/init.sql`: + +| Field | Value | +| ---------- | --------------------------------------- | +| Server Address | `<ngrok-host>:<ngrok-port>` (from step 2 above) | +| User | `flow_capture` | +| Password | `password` | +| Database | `postgres` | + +Select the `public.artists` and `public.albums` tables. To match the derivation's source names in `derivation/flow.yaml`, bind them to the collections `dani-demo/demo-music/artists` and `dani-demo/demo-music/albums` (substitute your own tenant prefix for `dani-demo` and update `flow.yaml` accordingly). + +Connector reference: [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). + +## Deploy the SQL derivation + +The derivation in `derivation/flow.yaml` joins the two captured collections. Authenticate and publish it with `flowctl`: + +```bash +flowctl auth login +flowctl catalog publish --source derivation/flow.yaml --auto-approve +``` + +The key parts of `derivation/flow.yaml`: + +- **Two transforms**, each shuffled on `/artist_id`: + - `fromArtists` reads `dani-demo/demo-music/artists` and emits `select $artist_id, $name as artist_name;` + - `fromAlbums` reads `dani-demo/demo-music/albums` and emits `select $artist_id, $total_plays;` +- **Reduction strategy** on the output schema does the join: + - top-level `reduce: { strategy: merge }` + - `artist_name` → `maximize` + - `total_plays` → `sum` +- **Collection key** `/artist_id`, with `artist_id` as the only required field. + +> Note: the names in `flow.yaml` use the `dani-demo` tenant prefix. Replace it with your own Estuary tenant in both the capture bindings and `flow.yaml` so the derivation's `source.name` values resolve to your collections. 
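
To build intuition for why these reduction annotations behave like a full outer join, here is a minimal Python sketch that simulates the merge locally. It is not Estuary's reduction engine: the `reduce_docs` helper, the `STRATEGIES` table, and the sample documents are hypothetical, and the sketch assumes that `maximize` keeps the larger value, that `sum` adds numbers, and that a field present on only one side is carried over unchanged.

```python
# Simplified local simulation of the output collection's reduction annotations.
# Assumption: this only mimics merge-by-key with `maximize` and `sum`;
# it is not how Estuary implements reductions internally.

STRATEGIES = {
    "artist_name": max,                  # stands in for `maximize`
    "total_plays": lambda a, b: a + b,   # stands in for `sum`
}


def reduce_docs(docs, key="artist_id"):
    """Fold partial documents that share a key into one document per key."""
    out = {}
    for doc in docs:
        acc = out.setdefault(doc[key], {key: doc[key]})
        for field, value in doc.items():
            if field == key or value is None:
                continue
            if field in acc:
                # Fields without a listed strategy fall back to last-write-wins.
                combine = STRATEGIES.get(field, lambda a, b: b)
                acc[field] = combine(acc[field], value)
            else:
                acc[field] = value  # first value seen for this field
    return out


# Hypothetical partial documents, shaped like the output of the two transforms.
from_artists = [
    {"artist_id": 1, "artist_name": "Ada"},
    {"artist_id": 2, "artist_name": "Bo"},  # artist with no albums yet
]
from_albums = [
    {"artist_id": 1, "total_plays": 10},
    {"artist_id": 1, "total_plays": 5},
    {"artist_id": 3, "total_plays": 7},     # album whose artist row has not arrived
]

joined = reduce_docs(from_artists + from_albums)
for artist_id in sorted(joined):
    print(joined[artist_id])
```

All three keys survive the merge: artist 2 has a name but no plays, and artist 3 has plays but no name yet. An inner join would have dropped both.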
+ +## Verify + +Confirm data is flowing into the derived collection: + +```bash +flowctl collections read --collection dani-demo/demo-derivations1/artist_total_plays --uncommitted | head +``` + +You should see merged documents with `artist_id`, `artist_name`, and an accumulating `total_plays`. You can also watch live throughput on the collection and tasks in the [Estuary dashboard](https://dashboard.estuary.dev). + +## Next steps + +- Add a **materialization** to push `artist_total_plays` into a warehouse such as [BigQuery](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/) or [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/). +- Adjust the reduction strategies (for example, swap `maximize` for `lastWriteWins`) to change how the join resolves conflicting values. +- Tear everything down with `docker compose down -v`. + +## Resources -Read the full article here: https://estuary.dev/derivations-join-collections-sql/ +- Blog: [How to Join Collections in Estuary with SQL Derivations](https://estuary.dev/derivations-join-collections-sql/) +- [Derivations concept docs](https://docs.estuary.dev/concepts/derivations/) +- [Reductions and reduction annotations](https://docs.estuary.dev/reference/reduction-strategies/) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [`flowctl` docs](https://docs.estuary.dev/concepts/flowctl/) +- [Estuary documentation](https://docs.estuary.dev) diff --git a/estuary-bytewax/README.md b/estuary-bytewax/README.md index dd7f185..59aed43 100644 --- a/estuary-bytewax/README.md +++ b/estuary-bytewax/README.md @@ -1,10 +1,159 @@ -# How to consume data from Estuary using Bytewax +# Stream MongoDB CDC to a Bytewax Dataflow with Estuary and Dekaf + +Stream real-time MongoDB change data capture (CDC) into a [Bytewax](https://bytewax.io/) Python dataflow using [Estuary](https://estuary.dev) and Dekaf, Estuary's 
Kafka-compatible API. This example captures `bookings` documents from a local MongoDB replica set, lands them in an Estuary collection, and consumes that collection from a Bytewax dataflow that computes tumbling-window booking metrics (total bookings, cancellations, passengers, revenue, and most popular destination) over a simulated space-tourism workload. + +## Architecture + +The pipeline wires MongoDB to Bytewax through Estuary without deploying or running any Kafka brokers: + +``` +MongoDB (replica set rs0) ──ngrok TCP tunnel──▶ Estuary capture (source-mongodb) + │ + ▼ + Estuary collection (CDC events) + │ + Dekaf (Kafka API) + │ + ▼ + Bytewax dataflow (main.py) ──▶ windowed metrics → stdout +``` + +- A **capture** uses Estuary's `source-mongodb` connector to read the change stream from the `space_tourism.bookings` collection and write CDC events into an Estuary **collection** (a schematized, real-time data lake in cloud storage). +- **Dekaf** exposes that collection over a Kafka-compatible interface, so any Kafka client can subscribe to it as a topic. +- The Bytewax dataflow uses `KafkaSource` to read the collection through Dekaf, parses MongoDB CDC events, groups them into 5-minute tumbling windows keyed on `booking_id`, and emits aggregate metrics. + +Because Estuary is fully managed, the locally running MongoDB is exposed to the connector through an ngrok TCP tunnel. + +## What's included + +- **`docker-compose.yml`** — spins up three services: + - `mongodb` — `mongo:latest`, container/hostname `mongodb`, run as a single-node replica set `rs0` (required for MongoDB change streams / CDC), with keyfile authentication and root credentials `root` / `password` on port `27017`. A healthcheck calls `rs.initiate(...)` to bootstrap the replica set. + - `datagen` — built from `datagen/`, container `bytewax-datagen`. Continuously writes inserts, updates, and deletes to MongoDB to simulate live booking traffic. 
+ - `ngrok` — `ngrok/ngrok:latest`, container `bytewax-ngrok`. Runs `tcp mongodb:27017` to publish a public TCP endpoint for the Estuary capture, with its inspection UI on port `4040`. +- **`datagen/datagen.py`** — connects to `mongodb://root:password@mongodb:27017`, targets database `space_tourism` and collection `bookings`, and emits one random `INSERT` / `UPDATE` / `DELETE` operation per second. Each booking document has `booking_id`, `customer_id`, `destination`, `booking_date`, `passengers`, and `total_price`. +- **`datagen/Dockerfile`** / **`datagen/requirements.txt`** — Python 3.12 image for the generator (`pymongo==4.8.0`, `python-dotenv==1.0.1`). +- **`mongodb/keyfile`** — keyfile mounted into the MongoDB container to enable internal replica-set authentication (`--keyFile`). Demo material only; do not reuse in production. +- **`main.py`** — the Bytewax dataflow consumer. Reads the Estuary collection through Dekaf with `KafkaSource`, parses CDC events, applies an `EventClock` + `TumblingWindower`, and computes per-window booking metrics. +- **`requirements.txt`** — consumer dependencies: `bytewax[confluent_kafka]==0.21`, `python-dotenv==1.0.1`. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [ngrok](https://ngrok.com/) account and authtoken (the local MongoDB must be tunneled so the managed connector can reach it). +- A free [Estuary account](https://dashboard.estuary.dev). +- Python 3.12 to run the Bytewax consumer locally. +- An Estuary access token (refresh token) for Dekaf authentication. Generate one in the dashboard under **Admin → CLI-API → Access Token**. ## Setup -1. docker compose up -2. `curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url'` -3. Set up MongoDB Capture in Estuary -4. Start consuming with Dekaf. +### 1. 
Start MongoDB, the data generator, and the ngrok tunnel + +Export your ngrok authtoken, then bring up the stack: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up -d +``` + +This starts the `mongodb` replica set, the `bytewax-datagen` generator (which immediately begins writing to `space_tourism.bookings`), and the `bytewax-ngrok` tunnel. + +### 2. Get the public MongoDB endpoint + +Read the public TCP address ngrok assigned to MongoDB: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' +``` + +You can also open the ngrok inspection dashboard at <http://localhost:4040>. The value looks like `tcp://0.tcp.ngrok.io:12345`. **Strip the `tcp://` prefix** — you'll paste the host and port separately into Estuary. + +## Configure the Estuary capture + +Create a MongoDB capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) using the **MongoDB** connector (`source-mongodb`). Use the values from this stack: + +| Field | Value | +| --- | --- | +| Address / Host | the ngrok host (e.g. `0.tcp.ngrok.io:12345`, without `tcp://`) | +| User | `root` | +| Password | `password` | +| Database | `space_tourism` | + +The connector reads the MongoDB change stream and writes CDC events for the `bookings` collection into an Estuary collection. Note the resulting collection's full name (e.g. `<your-tenant>/mongodb/space_tourism/bookings`) — you'll point Bytewax at it. + +Connector reference: <https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/> + +## Configure and run the Bytewax consumer + +The dataflow in `main.py` reads the collection through Dekaf. Before running, set two things: + +1. **`KAFKA_TOPIC`** in `main.py` — set it to your full Estuary collection name (it ships with a placeholder): + + ```python + # main.py + KAFKA_TOPIC = "<your-tenant>/mongodb/space_tourism/bookings" + ``` + +2. **`DEKAF_TOKEN`** environment variable — your Estuary access token, used as the SASL password. 
+ +The dataflow connects to Dekaf with these settings (already wired in `main.py`): + +```python +KAFKA_BOOTSTRAP_SERVERS = ["dekaf.estuary.dev:9092"] +add_config = { + "security.protocol": "SASL_SSL", + "sasl.mechanism": "PLAIN", + "sasl.username": "{}", + "sasl.password": os.getenv("DEKAF_TOKEN"), +} +``` + +Install dependencies and run the dataflow: + +```bash +python -m venv .venv && source .venv/bin/activate +pip install -r requirements.txt + +export DEKAF_TOKEN=<your-estuary-access-token> +python -m bytewax.run main.py +``` + +> Note: `main.py` points at the `dekaf.estuary.dev:9092` bootstrap with SASL username `{}`. If your account uses the standard Dekaf endpoint, set the bootstrap to `dekaf.estuary-data.com:9092` and the username to your Dekaf task name. See the Dekaf guide below for the configuration that matches your tenant. + +### What the dataflow does + +- `parse_message` decodes each CDC event. `insert` / `update` events use `fullDocument`; `delete` events use `documentKey`. Other operation types are dropped. +- Events are keyed by `booking_id` and grouped with an `EventClock` (using `booking_date` as event time, with a 10-second system-time grace period) into 5-minute `TumblingWindower` windows aligned to `2024-09-01T00:00:00Z`. +- `calculate_metrics` emits, per window: `total_bookings`, `total_cancellations`, `total_passengers`, `total_revenue`, and `most_popular_destination`. +- `op.inspect(...)` prints incoming and windowed messages; `StdOutSink` prints the computed metrics to stdout. + +## Verify + +- Confirm CDC is flowing in the Estuary dashboard by checking the capture's read/write stats, or stream the collection directly: + + ```bash + flowctl collections read --collection <your-tenant>/mongodb/space_tourism/bookings --uncommitted | head + ``` + +- In the Bytewax terminal, you should see `op.inspect` output for each parsed message followed by per-window metric dictionaries once a window closes. 
The `datagen` container produces an operation every second, so events appear continuously. + +## Cleanup + +```bash +docker compose down -v +``` + +The `-v` flag also removes the `mongo-data` volume. + +## Next steps + +- Swap `StdOutSink` for a Kafka, file, or database sink to persist the windowed metrics. +- Add an Estuary [materialization](https://dashboard.estuary.dev/materializations) to land the raw `bookings` collection in a warehouse (BigQuery, Snowflake, etc.) alongside the Bytewax stream processing. +- Point the same Dekaf topic at other Kafka-native tools (Flink, ksqlDB, kcat) to fan out the stream. + +## References -Full how-to guide: https://bytewax.io/blog/estuary-flow-mongodb-bytewax-real-time-data +- Full how-to guide: <https://bytewax.io/blog/estuary-flow-mongodb-bytewax-real-time-data> +- Estuary documentation: <https://docs.estuary.dev> +- MongoDB capture connector: <https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/> +- Reading collections from Kafka with Dekaf: <https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/> +- Bytewax documentation: <https://docs.bytewax.io/> diff --git a/estuary-coaelsce-demo-2025/README.md b/estuary-coaelsce-demo-2025/README.md index 05e7c5d..2a85ce7 100644 --- a/estuary-coaelsce-demo-2025/README.md +++ b/estuary-coaelsce-demo-2025/README.md @@ -1 +1,149 @@ -# Estuary x Coalesce fraud detection demo \ No newline at end of file +# Real-Time PostgreSQL CDC Fraud Detection Demo with Estuary (Estuary x Coalesce) + +A self-contained demo that streams pet-store transactions and product reviews out of PostgreSQL in real time using Change Data Capture (CDC) with [Estuary](https://estuary.dev). 
A data generator continuously writes `transactions` (with injected high/low-amount anomalies for fraud-detection use cases) and OpenAI-generated `reviews` into Postgres; Estuary captures every change via logical replication and streams it into collections that you can materialize into any downstream warehouse, lake, or analytics tool. This is the dataset used for the Estuary x Coalesce 2025 demo. + +## Architecture + +The pipeline uses Postgres logical replication (CDC) as the source of truth: + +``` +┌──────────────┐ INSERTs ┌─────────────────────┐ logical ┌──────────────────┐ +│ datagen.py │ ───────────────▶ │ PostgreSQL │ replication │ Estuary │ +│ (Faker + │ transactions │ wal_level=logical │ ─────────────▶│ source-postgres │ +│ OpenAI) │ reviews │ flow_publication │ (via ngrok) │ capture │ +└──────────────┘ └─────────────────────┘ └────────┬─────────┘ + │ + ▼ + ┌─────────────────────────┐ + │ Estuary collections │ + │ • products │ + │ • transactions │ + │ • reviews │ + └────────────┬─────────────┘ + │ materialization + ▼ + ┌─────────────────────────┐ + │ Your destination │ + │ (Snowflake, BigQuery, │ + │ Postgres, dbt/Coalesce)│ + └─────────────────────────┘ +``` + +In Estuary terms: + +- A **capture** (`source-postgres`) reads the Postgres write-ahead log via the `flow_publication` publication and the `public.flow_watermarks` table, streaming inserts from `products`, `transactions`, and `reviews`. +- Each table lands in its own real-time **collection** of schematized JSON. +- A **materialization** (configured separately for your warehouse/lake of choice) pushes those collections to the destination where Coalesce, dbt, or your fraud-detection models consume them. + +## What's included + +- **`docker-compose.yml`** — spins up three containers: + - `postgres-cdc-coalesce-postgres` — PostgreSQL (`postgres:latest`) started with `-c wal_level=logical` so logical replication / CDC works. Exposes port `5432`. 
+ - `postgres-cdc-coalesce-datagen` — the Python data generator (built from `datagen/`). + - `postgres-cdc-coalesce-ngrok` — an [ngrok](https://ngrok.com) TCP tunnel (`tcp postgres:5432`) that exposes the local database to Estuary's managed cloud. ngrok's inspector UI is published on port `4040`. +- **`postgres/init.sql`** — runs on first boot. It grants the `postgres` user `REPLICATION` + `pg_read_all_data`, creates the `public.flow_watermarks` table required by the connector, creates the `flow_publication` publication (with `publish_via_partition_root = true`), creates the `products`, `transactions`, and `reviews` tables, adds all of them to the publication, and seeds 51 pet-store products. +- **`datagen/datagen.py`** — connects to Postgres and, in a loop, inserts a transaction (60% of the time) or a review (40% of the time) every second. Transactions deliberately include amount anomalies (5% chance of a $500–$1000 charge, 5% chance of a $0.01–$5 charge) across payment methods including `crypto` — the signal a fraud-detection pipeline keys off of. Reviews are generated by the OpenAI API (`gpt-3.5-turbo-0125`) with an 80/20 positive/negative sentiment split. +- **`datagen/Dockerfile`** / **`datagen/requirements.txt`** — package the generator (`Faker`, `psycopg2`, `openai`) into a Python 3.12 image. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [Estuary account](https://dashboard.estuary.dev). +- A free [ngrok account](https://dashboard.ngrok.com/signup) and its **authtoken** — required because Estuary is fully managed and must reach your local Postgres over a public TCP endpoint. +- An [OpenAI API key](https://platform.openai.com/api-keys) — the data generator calls OpenAI to produce realistic review text. (Without it, the generator falls back to a static review string and still works.) +- A destination account if you want to materialize the data (e.g. 
Snowflake, BigQuery, Postgres) — optional for the capture-only walkthrough. + +## Setup + +Export the two required environment variables, then start the stack: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +export OPENAI_API_KEY=<your-openai-api-key> + +docker compose up --build +``` + +This launches Postgres (already configured for logical replication), the data generator (writing transactions and reviews once per second), and the ngrok tunnel. + +### Get the public database endpoint + +Estuary connects to Postgres through ngrok. Grab the public TCP address either from the ngrok inspector UI at [http://localhost:4040](http://localhost:4040) or with: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +You'll get something like `tcp://6.tcp.ngrok.io:18743`. **Strip the `tcp://` prefix** — the host and port (e.g. `6.tcp.ngrok.io:18743`) are what you paste into Estuary. + +## Configure the Estuary capture + +Create a new PostgreSQL capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures/create) (search for the **PostgreSQL** connector, `source-postgres`) and use the values baked into this demo: + +| Field | Value | +| --- | --- | +| Server Address | the ngrok host:port from above (e.g. `6.tcp.ngrok.io:18743`) | +| User | `postgres` | +| Password | `postgres` | +| Database | `postgres` | + +Under the connector's advanced settings, the watermarks table and publication already match the connector defaults created by `init.sql`: + +- Watermarks table: `public.flow_watermarks` +- Publication: `flow_publication` + +Click through discovery and Estuary will bind the `public.products`, `public.transactions`, and `public.reviews` tables, each into its own collection. 
+ +> Connector reference: [PostgreSQL capture connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) + +### (Optional) Configure with flowctl + +If you prefer the CLI, authenticate and discover the source: + +```bash +flowctl auth login +flowctl discover --source flow.yaml # after defining the source-postgres capture +flowctl catalog publish --source flow.yaml --auto-approve +``` + +See the [flowctl docs](https://docs.estuary.dev/concepts/flowctl/) for the full workflow. + +## Verify + +Confirm rows are flowing into your collections. In the dashboard, open the capture and watch the document/byte counts climb, or read a collection directly with flowctl: + +```bash +flowctl collections read --collection <your-prefix>/transactions --uncommitted | head +flowctl collections read --collection <your-prefix>/reviews --uncommitted | head +``` + +You should see new `transactions` (note the occasional anomalous `amount`) and AI-generated `reviews` arriving roughly once per second. + +## Configure the Estuary materialization + +To land the data somewhere downstream, create a [materialization](https://dashboard.estuary.dev/materializations/create) and bind the `products`, `transactions`, and `reviews` collections. Pick the connector for your destination, for example: + +- [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/) +- [BigQuery](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/) +- [PostgreSQL](https://docs.estuary.dev/reference/Connectors/materialization-connectors/PostgreSQL/) + +Once materialized, the transactions and reviews are available in real time for fraud-detection modeling and transformation in tools like Coalesce or dbt. + +## Next steps + +- Build a [derivation](https://docs.estuary.dev/concepts/derivations/) (SQL, TypeScript, or Python) to flag anomalous transactions — e.g. 
amounts above $500 or paid via `crypto` — as a real-time fraud signal. +- Join `transactions` and `reviews` against the static `products` collection to enrich your fraud-detection features. +- Materialize into Snowflake/BigQuery and transform with Coalesce or dbt. + +## Teardown + +```bash +docker compose down -v +``` + +## References + +- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [Estuary dashboard](https://dashboard.estuary.dev) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) diff --git a/estuary-demo-movies/README.md b/estuary-demo-movies/README.md new file mode 100644 index 0000000..e110d27 --- /dev/null +++ b/estuary-demo-movies/README.md @@ -0,0 +1,107 @@ +# Seed a Movies Table as a SQL Source for an Estuary Capture + +A minimal demo that creates and seeds a `movies` table in a relational database so it can be used as a source for an [Estuary](https://estuary.dev) capture. Run the single SQL script against PostgreSQL, MySQL, SQL Server, or any ANSI-SQL database, point an Estuary capture at the table, and stream the rows into an Estuary collection in real time. + +This is intentionally a "hello world" source: a tiny static dataset (10 Marvel movie rows) to validate a capture connection, walk through the capture setup flow, or seed downstream demos. + +## Architecture + +The script only provides the **source table**. Estuary does the data movement: + +``` +movies (SQL table) ──capture──▶ Estuary collection ──materialization──▶ destination + (this script) (real-time data lake) (warehouse / DB / Kafka) +``` + +- **Capture (source):** an Estuary capture connector reads the `movies` table and streams rows into an Estuary **collection** — a schematized, real-time copy of the data backed by cloud storage. +- **Materialization (destination):** optionally push the collection to a warehouse, database, or other destination. 
Not included here — add one once the capture is running. + +## What's included + +- `create-schema` — a SQL DDL + DML script that: + - Creates the `movies` table with columns `prod_id` (`INTEGER`, NOT NULL), `prod_price` (`NUMERIC(10,2)`, NOT NULL), and `prod_descrip` (`VARCHAR(100)`, NOT NULL). + - Adds a unique index `movies_x0` on `prod_id` (the natural primary key, used by the capture as the collection key). + - Inserts 10 movie rows (`X-Men: Apocalypse`, `Doctor Strange`, `Captain America: Civil War`, ...). + +## Prerequisites + +- A reachable SQL database (PostgreSQL, MySQL, SQL Server, etc.) where you can create a table. +- A free Estuary account: https://dashboard.estuary.dev +- The database client for your engine (`psql`, `mysql`, `sqlcmd`, ...). +- If the database runs **locally** (not on a public cloud host), expose it to Estuary's managed connectors with an [ngrok](https://ngrok.com) TCP tunnel or an SSH tunnel. + +## Setup + +Load the schema and seed data into your database. Pick the command for your engine: + +```bash +# PostgreSQL +psql "postgres://USER:PASSWORD@HOST:5432/DBNAME" -f create-schema + +# MySQL +mysql -h HOST -P 3306 -u USER -pPASSWORD DBNAME < create-schema + +# SQL Server +sqlcmd -S HOST,1433 -U USER -P PASSWORD -d DBNAME -i create-schema +``` + +Verify the rows landed: + +```sql +SELECT prod_id, prod_price, prod_descrip FROM movies ORDER BY prod_id; +-- expect 10 rows +``` + +### Exposing a local database (optional) + +Estuary is fully managed, so a database on `localhost` must be reachable from the internet. Expose the DB port with ngrok: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +ngrok tcp 5432 # use your DB's port: 5432 Postgres, 3306 MySQL, 1433 SQL Server +``` + +Read the public `host:port` from the ngrok dashboard at http://localhost:4040, or: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +Strip the `tcp://` prefix before pasting the address into Estuary. 
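
If you script this step, the prefix stripping can be done with the standard library instead of by hand. The following is a small sketch, not part of this example's files: `split_ngrok_endpoint` is a hypothetical helper name, and the URL passed to it is only a sample of what the ngrok API might return.

```python
from urllib.parse import urlsplit


def split_ngrok_endpoint(public_url):
    """Return (host, port) from an ngrok TCP URL such as tcp://0.tcp.ngrok.io:12345."""
    parts = urlsplit(public_url)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"not an ngrok TCP URL: {public_url!r}")
    return parts.hostname, parts.port


# Sample value only; substitute the public_url reported by your own tunnel.
host, port = split_ngrok_endpoint("tcp://6.tcp.ngrok.io:18743")
print(f"{host}:{port}")
```

The helper rejects anything that is not a `tcp://` URL with an explicit port, so a stale or HTTP tunnel address fails loudly before it reaches the capture form.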
+ +## Configure the Estuary capture + +Create the capture in the Estuary dashboard at https://dashboard.estuary.dev/captures, or via `flowctl`. + +1. Choose the capture connector that matches your database engine: + - PostgreSQL — [`source-postgres`](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) + - MySQL — [`source-mysql`](https://docs.estuary.dev/reference/Connectors/capture-connectors/MySQL/) + - SQL Server — [`source-sqlserver`](https://docs.estuary.dev/reference/Connectors/capture-connectors/SQLServer/) +2. Enter the connection details — the public host/port (the ngrok address if tunneling), database name, user, and password. +3. In the discovery step, select the `movies` table. Estuary infers the schema and uses the unique key on `prod_id` as the collection key. +4. Save and publish. The 10 rows backfill into the collection; new inserts/updates stream as they happen (for CDC-capable engines, with the prerequisites that connector requires — e.g. `wal_level=logical`, a replication user, and a publication for Postgres). + +> CDC connectors require additional source-side setup (replication user, publication/binlog/CDC enablement). See the connector docs linked above for the exact grants and server settings for your engine. + +## Verify + +Confirm rows are flowing into the collection: + +```bash +flowctl auth login +flowctl collections read --collection <your/collection/name> --uncommitted | head +``` + +Or watch the capture's document and byte counts on its page in the Estuary dashboard. 
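
To check the backfill programmatically, you could count distinct `prod_id` values in the documents you read. The sketch below assumes that `flowctl collections read` prints one JSON document per line and that each document carries the `prod_id` column as a top-level field. `distinct_ids` is a hypothetical helper, and the sample lines (including their ids and prices) are made up for illustration rather than copied from real output.

```python
import json


def distinct_ids(lines, field="prod_id"):
    """Collect the distinct values of `field` from newline-delimited JSON documents."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if field in doc:
            ids.add(doc[field])
    return ids


# Hypothetical sample; real documents typically carry extra metadata fields.
sample = [
    '{"prod_id": 1, "prod_price": 9.99, "prod_descrip": "Doctor Strange"}',
    '{"prod_id": 2, "prod_price": 12.5, "prod_descrip": "X-Men: Apocalypse"}',
    '{"prod_id": 1, "prod_price": 9.99, "prod_descrip": "Doctor Strange"}',
]
print(len(distinct_ids(sample)))
```

With the full backfill of this example's seed data, the same check should report 10 distinct ids. Counting distinct keys rather than lines keeps the result stable if a row is captured more than once.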
+ +## Next steps + +- Add a **materialization** to land the collection in a destination: https://dashboard.estuary.dev/materializations +- Transform the data with a **derivation** (SQL, TypeScript, or Python): https://docs.estuary.dev/concepts/derivations/ + +## Resources + +- Estuary docs: https://docs.estuary.dev +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Captures concept: https://docs.estuary.dev/concepts/captures/ +- Estuary blog: https://estuary.dev/blog/ diff --git a/estuary-motherduck-demo-2025/README.md b/estuary-motherduck-demo-2025/README.md index c2efda7..f8c6194 100644 --- a/estuary-motherduck-demo-2025/README.md +++ b/estuary-motherduck-demo-2025/README.md @@ -1 +1,139 @@ -# Estuary x MotherDuck PetStore Demo \ No newline at end of file +# Real-Time PostgreSQL CDC to MotherDuck with Estuary + +A self-contained demo that streams change data capture (CDC) from a local PostgreSQL database into [MotherDuck](https://motherduck.com) in real time using [Estuary](https://estuary.dev). A data generator continuously writes pet-store transactions and AI-generated product reviews into Postgres; Estuary's PostgreSQL capture connector reads the write-ahead log (WAL) via logical replication, lands the changes in Estuary collections, and a MotherDuck materialization keeps the analytical tables up to date with low latency. + +## Architecture + +``` +┌──────────────┐ INSERTs ┌──────────────┐ logical ┌─────────────────┐ materialize ┌──────────────┐ +│ datagen │ ───────────────► │ PostgreSQL │ replication │ Estuary │ ──────────────► │ MotherDuck │ +│ (Python + │ transactions/ │ (wal_level= │ ───────────► │ collections │ │ tables │ +│ OpenAI) │ reviews │ logical) │ (CDC) │ products / │ │ products / │ +└──────────────┘ └──────┬───────┘ │ transactions / │ │ transactions/│ + │ │ reviews │ │ reviews │ + ngrok TCP tunnel └─────────────────┘ └──────────────┘ + (public host:port) +``` + +End-to-end data flow in Estuary terms: + +1. 
**Capture (source):** the [`source-postgres`](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) connector reads `products`, `transactions`, and `reviews` from the `flow_publication` publication over logical replication. +2. **Collections:** each captured table becomes an Estuary collection — a schematized, real-time data lake of JSON documents in cloud storage. +3. **Materialization (destination):** the [`materialize-motherduck`](https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/) connector continuously pushes those collections into MotherDuck tables. + +Because Estuary is fully managed, the locally running Postgres is exposed to the hosted connector through an **ngrok TCP tunnel**. + +## What's included + +- **`docker-compose.yml`** — spins up three services: + - `postgres` (container `postgres-cdc-motherduck-postgres`): `postgres:latest` started with `wal_level=logical`, port `5432` published, init script mounted. + - `datagen` (container `postgres-cdc-motherduck-datagen`): builds the `datagen/` image and continuously writes rows into Postgres. + - `ngrok` (container `postgres-cdc-motherduck-ngrok`): runs `tcp postgres:5432` to expose Postgres publicly; the ngrok inspection UI is published on port `4040`. +- **`postgres/init.sql`** — runs on first boot. It grants the `postgres` user `REPLICATION` and `pg_read_all_data`, creates the `public.flow_watermarks` table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), creates the `products`, `transactions`, and `reviews` tables, adds them all to the publication, and seeds `products` with ~50 pet-store items (dog and cat supplies). +- **`datagen/datagen.py`** — a Python loop that every 3 seconds inserts either a `transactions` row (60%) or a `reviews` row (40%). 
Review text is generated with the OpenAI API (`gpt-3.5-turbo-0125`); transaction amounts intentionally include a small fraction of high/low anomalies for demo analytics. (The script also contains an optional Google Cloud SQL connection path gated by `USE_CLOUD_SQL`; it is off by default.) +- **`datagen/Dockerfile`** / **`datagen/requirements.txt`** — build the data-generator image (`python:3.12`, `psycopg2-binary`, `openai`, `Faker`, `python-dotenv`, etc.). + +## Prerequisites + +- **Docker** and Docker Compose. +- A **verified [ngrok](https://ngrok.com) account** and authtoken — needed to expose the local Postgres to Estuary's hosted connector. +- An **[OpenAI API key](https://platform.openai.com/api-keys)** — the data generator calls OpenAI to write realistic product reviews. (Without it, review inserts fall back to a canned string, but transactions still flow.) +- A free **[Estuary account](https://dashboard.estuary.dev)**. +- A **[MotherDuck account](https://app.motherduck.com)** and a service token (from the MotherDuck console). The MotherDuck materialization stages data through an object-storage bucket, so have a staging bucket and its credentials ready as well — see the [MotherDuck materialization docs](https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/) for the exact requirements. + +## Setup + +Export the required tokens and start the stack: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +export OPENAI_API_KEY=<your-openai-api-key> + +docker compose up -d +``` + +On first boot, `postgres/init.sql` provisions replication access, the publication, the three tables, and the seed products. The `datagen` service then begins inserting transactions and reviews every 3 seconds. + +### Get the public Postgres endpoint + +ngrok exposes Postgres over a public TCP address. 
Read it from the ngrok inspection UI at [http://localhost:4040](http://localhost:4040), or grab it from the API: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +# e.g. tcp://6.tcp.ngrok.io:12345 +``` + +Strip the `tcp://` prefix — you'll paste `host:port` (e.g. `6.tcp.ngrok.io:12345`) into Estuary. + +### Verify data is being generated (optional) + +```bash +docker exec -it postgres-cdc-motherduck-postgres \ + psql -U postgres -d postgres -c "SELECT count(*) FROM transactions;" +``` + +Run it again after a few seconds — the count should increase. + +## Configure the Estuary capture + +Set up the PostgreSQL source in the [Estuary dashboard](https://dashboard.estuary.dev/captures): + +1. Go to **Sources → + New Capture** and choose the **PostgreSQL** connector. +2. Enter the connection details from `docker-compose.yml` and the ngrok endpoint: + + | Field | Value | + | -------- | --------------------------------------------- | + | Server Address | `<ngrok-host>:<ngrok-port>` (from step above, no `tcp://`) | + | User | `postgres` | + | Password | `postgres` | + | Database | `postgres` | + +3. Estuary discovers the `products`, `transactions`, and `reviews` tables (it uses the `flow_publication` publication and the `public.flow_watermarks` table created by `init.sql`). Leave the default bindings and **Save and Publish**. + +> The seed `init.sql` grants the `postgres` user `REPLICATION` and `pg_read_all_data` and pre-creates the publication/watermarks table so discovery works out of the box. For production, create a dedicated capture user with least-privilege grants instead — see the [PostgreSQL connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). + +Connector image: `ghcr.io/estuary/source-postgres:dev`. + +## Configure the Estuary materialization + +Send the captured collections to MotherDuck from the [Estuary dashboard](https://dashboard.estuary.dev/materializations): + +1. 
Go to **Destinations → + New Materialization** and choose the **MotherDuck** connector. +2. Provide your MotherDuck **service token**, target **database**, and the **staging bucket** credentials (the connector stages files in object storage before loading into MotherDuck). +3. Under **Source Collections**, add the `products`, `transactions`, and `reviews` collections created by the capture. +4. **Save and Publish.** Estuary backfills existing rows, then continuously applies new CDC events to the MotherDuck tables. + +Connector image: `ghcr.io/estuary/materialize-motherduck:dev`. See the [MotherDuck materialization docs](https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/) for token, bucket, and sync-schedule options. + +## Verify + +- In the Estuary dashboard, the capture, collections, and materialization should all show matching document counts and live throughput. +- Peek at documents flowing through a collection with [flowctl](https://docs.estuary.dev/concepts/flowctl/): + + ```bash + flowctl auth login + flowctl collections read --collection <your-prefix>/transactions --uncommitted | head + ``` + +- Query MotherDuck directly to confirm the data landed: + + ```sql + SELECT payment_method, count(*), round(avg(amount), 2) AS avg_amount + FROM transactions + GROUP BY payment_method + ORDER BY 2 DESC; + ``` + +## Next steps + +- Stop the stack with `docker compose down` (add `-v` to remove the Postgres volume and start clean). +- Explore Estuary [derivations](https://docs.estuary.dev/concepts/derivations/) to transform the `transactions` stream in SQL, TypeScript, or Python — for example, flagging the high/low anomaly amounts the generator produces. +- For a guided, multi-materialization version of this pipeline (soft delete, hard delete, and SCD Type 2), see the sibling [`hands-on-lab-postgres-motherduck`](../hands-on-lab-postgres-motherduck) example. 
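
The anomaly-flagging idea in the derivations bullet above can be prototyped in plain Python before it is ported to an Estuary derivation. The sketch below is not Estuary's derivation API; it only illustrates the rule itself. The function name `flag_anomalies`, the `transaction_id` field, and the `low`/`high` thresholds are assumptions chosen for illustration, so check them against the amounts `datagen.py` actually produces.

```python
from typing import Iterable, Iterator


def flag_anomalies(
    transactions: Iterable[dict],
    low: float = 1.0,     # assumed lower bound for a "normal" amount
    high: float = 500.0,  # assumed upper bound for a "normal" amount
) -> Iterator[dict]:
    """Yield each transaction with an added `anomaly` label.

    `anomaly` is "low", "high", or None. The thresholds are placeholders,
    not values taken from the demo's generator.
    """
    for tx in transactions:
        amount = float(tx["amount"])
        if amount < low:
            label = "low"
        elif amount > high:
            label = "high"
        else:
            label = None
        # Build a new dict so the caller's input document is left untouched.
        yield {**tx, "anomaly": label}


if __name__ == "__main__":
    # Hypothetical sample documents; only `amount` is a column named in this README.
    sample = [
        {"transaction_id": 1, "amount": 24.99},
        {"transaction_id": 2, "amount": 0.01},
        {"transaction_id": 3, "amount": 9500.00},
    ]
    for doc in flag_anomalies(sample):
        print(doc["transaction_id"], doc["anomaly"])
```

Keeping the thresholds as parameters makes it easy to tune them once you have looked at the real distribution of `amount` in MotherDuck.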
+ +## Resources + +- Estuary docs: https://docs.estuary.dev +- PostgreSQL capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- MotherDuck materialization connector: https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/ +- flowctl: https://docs.estuary.dev/concepts/flowctl/ +- Estuary dashboard: https://dashboard.estuary.dev diff --git a/estuary-motherduck-orders/README.md b/estuary-motherduck-orders/README.md index c2efda7..36191b4 100644 --- a/estuary-motherduck-orders/README.md +++ b/estuary-motherduck-orders/README.md @@ -1 +1,174 @@ -# Estuary x MotherDuck PetStore Demo \ No newline at end of file +# Real-Time PostgreSQL CDC to MotherDuck with Estuary + +Stream a live pet-store order feed from PostgreSQL into [MotherDuck](https://motherduck.com/) (serverless DuckDB) in real time using [Estuary](https://estuary.dev). This demo spins up a local Postgres configured for logical replication, a data generator that continuously inserts and updates orders, and an ngrok TCP tunnel so the fully managed Estuary `source-postgres` connector can reach the database. Change Data Capture (CDC) events flow into an Estuary collection and are materialized into a MotherDuck table you can query with DuckDB SQL. + +## Architecture + +``` + ngrok TCP tunnel + ┌─────────────┐ inserts/ ┌────────────────┐ :5432 ┌──────────────────────┐ + │ datagen │──updates────▶ │ PostgreSQL │ ─────────▶ │ Estuary source- │ + │ (orders) │ │ wal_level= │ │ postgres (CDC) │ + └─────────────┘ │ logical │ └──────────┬───────────┘ + └────────────────┘ │ capture + ▼ + ┌──────────────────────────┐ + │ Estuary collection │ + │ (schematized JSON) │ + └──────────┬────────────────┘ + │ materialization + ▼ + ┌──────────────────────────┐ + │ MotherDuck (DuckDB) │ + └──────────────────────────┘ +``` + +End-to-end flow in Estuary terms: + +1. 
**Capture** — the `source-postgres` connector reads the Postgres write-ahead log (WAL) via logical replication and emits insert/update events as CDC documents. +2. **Collection** — captured documents land in an Estuary collection, a real-time, schematized JSON data lake backed by cloud storage. +3. **Materialization** — the `materialize-motherduck` connector continuously pushes the collection into a MotherDuck table. + +Because Estuary is fully managed and hosted, the locally running Postgres is exposed through an ngrok TCP tunnel so the cloud connector can connect to it. + +## What's included + +- **`docker-compose.yml`** — defines three services: + - `postgres` (container `postgres-cdc-motherduck-postgres`, image `postgres:latest`) started with `wal_level=logical`, listening on port `5432`, seeded from `postgres/init.sql`. + - `datagen` (container `postgres-cdc-motherduck-datagen`) builds the `datagen/` image and continuously writes order data. + - `ngrok` (container `postgres-cdc-motherduck-ngrok`) runs `tcp postgres:5432` to publish a public TCP endpoint; its inspector UI is exposed on port `4040`. +- **`postgres/init.sql`** — bootstraps Postgres for CDC: grants `REPLICATION` and `pg_read_all_data` to the `postgres` user, creates the `public.flow_watermarks` table required by the connector, creates the `flow_publication` publication (with `publish_via_partition_root = true`), and seeds demo tables `products`, `transactions`, and `reviews`. +- **`datagen/datagen.py`** — creates the `orders` table at runtime (`pgcrypto` extension + `gen_random_uuid()` primary key) and loops forever: ~70% of the time it inserts a new order in `placed` status, ~30% of the time it advances a random open order through the lifecycle `placed → packed → shipped → delivered` (or `cancelled`). Supports an optional Google Cloud SQL connection path via `USE_CLOUD_SQL`. +- **`datagen/Dockerfile`** — Python 3.12 image that installs `datagen/requirements.txt` and runs `datagen.py`.
+- **`datagen/requirements.txt`** — Python dependencies (`psycopg2-binary`, `Faker`, `SQLAlchemy`, `pg8000`, `cloud-sql-python-connector`, `python-dotenv`). + +### Demo data: the `orders` table + +The generator drives change events against this table: + +| Column | Type | Notes | +| --------------- | ------------- | ----------------------------------------------------------- | +| `order_id` | `UUID` | Primary key, `DEFAULT gen_random_uuid()` | +| `customer_name` | `TEXT` | Faker-generated name | +| `product_name` | `TEXT` | One of 10 pet-store products | +| `status` | `TEXT` | `placed` / `packed` / `shipped` / `delivered` / `cancelled` | +| `created_at` | `TIMESTAMPTZ` | `DEFAULT now()` | + +The `status` updates are what make this a good CDC demo: each row changes over time, and those updates are streamed downstream in real time. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [ngrok](https://ngrok.com/) account and authtoken (the local Postgres is tunneled so Estuary can reach it). +- A free [Estuary account](https://dashboard.estuary.dev). +- A [MotherDuck](https://motherduck.com/) account and a service/access token for the materialization. + +## Setup + +1. Clone the repo and change into this directory: + + ```bash + cd estuary-motherduck-orders + ``` + +2. Export your ngrok authtoken (read by the `ngrok` service): + + ```bash + export NGROK_AUTHTOKEN=<your-ngrok-authtoken> + ``` + +3. Start everything: + + ```bash + docker compose up --build + ``` + + This launches Postgres with logical replication, the order generator, and the ngrok tunnel. You should see `Inserted new order.` and `Updated order ... → ...` log lines from the `datagen` container. + +> Note: the `datagen` service references an `OPENAI_API_KEY` environment variable. It is not used by `datagen.py`, so you can leave it unset. + +### Get the public Postgres endpoint + +The ngrok tunnel maps a public `host:port` to the local Postgres. 
Open the inspector UI at [http://localhost:4040](http://localhost:4040), or grab it from the API: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +You'll get something like `tcp://6.tcp.ngrok.io:18923`. **Strip the `tcp://` prefix** — the host is `6.tcp.ngrok.io` and the port is `18923` — when entering it into Estuary. + +### Make sure the `orders` table is captured + +`postgres/init.sql` adds `transactions`, `products`, and `reviews` to `flow_publication`, but the `orders` table is created later by the generator. Add it to the publication so its CDC events are replicated: + +```bash +docker exec -it postgres-cdc-motherduck-postgres \ + psql -U postgres -d postgres \ + -c "ALTER PUBLICATION flow_publication ADD TABLE public.orders;" +``` + +## Configure the Estuary capture + +You can wire this up entirely from the [Estuary dashboard](https://dashboard.estuary.dev). + +1. Go to [dashboard.estuary.dev/captures](https://dashboard.estuary.dev/captures) and create a new capture using the **PostgreSQL** connector (`source-postgres`). +2. Enter the connection details from `docker-compose.yml`, using the public endpoint from ngrok: + + | Field | Value | + | ---------- | ------------------------------------ | + | Server Address | `<ngrok-host>:<ngrok-port>` (e.g. `6.tcp.ngrok.io:18923`) | + | User | `postgres` | + | Password | `postgres` | + | Database | `postgres` | + +3. Run discovery and select the `orders` table (and any of `products`, `transactions`, `reviews` you want to stream). Save and publish. + +The connector uses the pre-created `flow_publication` and `public.flow_watermarks` from `init.sql`. + +PostgreSQL capture connector reference: https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ + +## Configure the Estuary materialization + +1. 
Go to [dashboard.estuary.dev/materializations](https://dashboard.estuary.dev/materializations) and create a new materialization using the **MotherDuck** connector (`materialize-motherduck`). +2. Provide your MotherDuck connection settings: + - **Token** — your MotherDuck service/access token. + - **Database** — the target MotherDuck database name. + - **Schema** — the destination schema (e.g. `main`). +3. Bind the capture's `orders` collection to a destination MotherDuck table and publish. + +MotherDuck materialization connector reference: https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/ + +## Verify + +- In the Estuary dashboard, open the capture and materialization and watch the **docs / bytes** metrics climb as the generator keeps writing. +- If you use [flowctl](https://docs.estuary.dev/concepts/flowctl/), tail the collection directly: + + ```bash + flowctl auth login + flowctl collections read --collection <your-prefix>/orders --uncommitted | head + ``` + +- Query the destination in MotherDuck: + + ```sql + SELECT status, COUNT(*) AS orders + FROM your_database.main.orders + GROUP BY status + ORDER BY orders DESC; + ``` + + Re-run it after a minute — counts shift as orders advance through their lifecycle, confirming updates are streaming end to end. + +## Next steps + +- Add the other seeded tables (`products`, `transactions`, `reviews`) to the same capture for a richer model. +- Transform the stream with a [derivation](https://docs.estuary.dev/concepts/derivations/) in SQL, TypeScript, or Python (e.g. compute current-status counts or per-product revenue). +- Fan the same collection out to additional destinations (Snowflake, BigQuery, ClickHouse) via more materializations. 
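
The current-status count mentioned above comes down to keeping only the latest document per `order_id` and then counting by `status`. Here is a minimal Python sketch of that reduction, run outside of Estuary rather than as a derivation. It assumes the documents arrive oldest to newest and carry the `order_id` and `status` fields from the `orders` table; the function name `current_status_counts` is hypothetical, and deletes are not handled.

```python
from collections import Counter
from typing import Iterable


def current_status_counts(docs: Iterable[dict]) -> Counter:
    """Count orders by their most recent status.

    Assumes `docs` is ordered oldest to newest, so a later document for
    the same `order_id` overwrites an earlier one.
    """
    latest = {}  # maps order_id -> most recently seen status
    for doc in docs:
        latest[doc["order_id"]] = doc["status"]
    return Counter(latest.values())


if __name__ == "__main__":
    # Hypothetical change events for two orders.
    events = [
        {"order_id": "a", "status": "placed"},
        {"order_id": "b", "status": "placed"},
        {"order_id": "a", "status": "packed"},
        {"order_id": "a", "status": "shipped"},
    ]
    print(current_status_counts(events))
```

This mirrors what the `GROUP BY status` query in the Verify section returns from MotherDuck, which makes it a convenient cross-check when testing a derivation.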
+ +## Resources + +- Estuary docs: https://docs.estuary.dev +- PostgreSQL capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- MotherDuck materialization connector: https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/ +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Estuary dashboard: https://dashboard.estuary.dev diff --git a/google-sheets-pinecone-rag/README.md b/google-sheets-pinecone-rag/README.md index ec2f4e2..a2884df 100644 --- a/google-sheets-pinecone-rag/README.md +++ b/google-sheets-pinecone-rag/README.md @@ -1,6 +1,139 @@ -# Snowflake CDC & RAG +# Real-Time RAG: Stream Google Sheets to Pinecone for a Streamlit Chatbot with Estuary -This projects showcases the components of an incremental data flow that streams data from -Google Sheets, vectorizes it then loads the embeddings into Pinecone. +This example builds an end-to-end, real-time Retrieval-Augmented Generation (RAG) pipeline. Estuary captures rows from a **Google Sheet** of customer support tickets, generates **OpenAI embeddings**, and materializes the vectors into a **Pinecone** index. A **Streamlit** chat app (LlamaIndex + OpenAI) then answers natural-language questions over the always-current support data — new, updated, and deleted rows in the sheet propagate to Pinecone continuously, so the chatbot's knowledge base stays fresh without batch reloads. 
-See blog post for details: https://estuary.dev/google-sheets-to-pinecone-rag/ +Reference walkthrough: https://estuary.dev/google-sheets-to-pinecone-rag/ + +## Architecture + +The data flow is: + +``` +Google Sheet ──capture──▶ Estuary collection ──materialize (embed)──▶ Pinecone index ──retrieve──▶ Streamlit RAG chat (LlamaIndex + OpenAI) +("Fake Customer Support" / + "Support Requests") +``` + +- **Capture (source):** The Estuary **Google Sheets** source connector polls the worksheet and streams every insert/update/delete into an Estuary **collection** — a schematized, real-time copy of the sheet backed by cloud storage. +- **Materialization (destination):** The Estuary **Pinecone** materialization connector takes each document from the collection, calls **OpenAI** to produce an embedding, and upserts the vector (with the original row text) into a Pinecone index. In this example the vectors land in the `Support_Requests` namespace, and the raw row text is stored under the `flow_document` key. +- **Retrieval / app:** `rag.py` wires LlamaIndex's `PineconeVectorStore` to that namespace and `flow_document` text key, retrieves the top-5 most similar tickets, and feeds them to an OpenAI chat model. `app.py` is the Streamlit front end. + +Because Estuary is a streaming CDC/ETL platform, the loop is continuous: edit the sheet, and within the connector's polling interval the change is embedded and searchable in Pinecone. + +## What's included + +- **`docker-compose.yml`** — Spins up two services: `datagen` (container `gsheet-ai-datagen`, generates fake support tickets into the Google Sheet) and `streamlit` (container `streamlit`, serves the RAG chat UI on port `8501`). +- **`datagen/`** — Synthetic data generator. + - `datagen.py` — Uses `pygsheets` to authenticate to Google Sheets via a service-account JSON, and OpenAI (`gpt-3.5-turbo-0125`) to write realistic customer-support ticket descriptions. 
Loops continuously, weighting `insert`/`update`/`delete` operations 70/10/20 so the sheet changes constantly (exercising CDC). Columns: `request_id`, `customer_id`, `request_date`, `request_type`, `status`, `description`. + - `Dockerfile` — `python:3.12` image that runs `datagen.py`. + - `requirements.txt` — `Faker`, `pygsheets`, `python-dotenv`, `openai`. +- **`app.py`** — Streamlit app titled "Real-time RAG with Estuary"; renders the chat interface ("Chat with Google Sheets") and streams responses from the LlamaIndex chat engine. +- **`rag.py`** — Builds the LlamaIndex retriever and chat engine: connects to Pinecone (`PINECONE_API_KEY`, `PINECONE_HOST`), opens the `Support_Requests` namespace with `text_key="flow_document"`, retrieves `similarity_top_k=5`, and answers with OpenAI `gpt-3.5-turbo`. +- **`Dockerfile`** — `python:3.11` image that runs `streamlit run app.py` on port `8501`. +- **`requirements.txt`** — `streamlit`, `llama-index`, `llama-index-vector-stores-pinecone`, `python-dotenv`. +- **`.streamlit/config.toml`** — Streamlit config (static serving enabled, light theme). +- **`estuary_logo.png`** — Logo shown in the app. + +> Note: the Estuary capture and materialization in this example are configured in the **Estuary dashboard**, not committed as a `flow.yaml` here. The sections below walk through that wiring. + +## Prerequisites + +- **Docker** and Docker Compose. +- A **Google Cloud service account** with a JSON key, and the **Google Sheets API** and **Google Drive API** enabled. Share the target Google Sheet with the service-account email. +- A **Google Sheet** named `Fake Customer Support` with a worksheet (tab) named `Support Requests`. Add a header row: `request_id`, `customer_id`, `request_date`, `request_type`, `status`, `description`. +- An **OpenAI API key** (used both by the data generator and by the RAG chat app). +- A **Pinecone account**, API key, and an index host URL (`PINECONE_HOST`). 
Use an embedding dimension that matches the OpenAI embedding model configured in the Estuary materialization (e.g. `text-embedding-3-small` → 1536 dimensions). +- A free **Estuary account**: https://dashboard.estuary.dev + +## Setup + +### 1. Configure the data generator and app + +Edit `docker-compose.yml` and fill in the values marked `# edit`: + +```yaml +services: + datagen: + environment: + SHEET_NAME: "Fake Customer Support" + WORKSHEET_NAME: "Support Requests" + volumes: + - /path-to-gcp-service-account-cred.json:/credentials.json # edit -> point at your service-account JSON + + streamlit: + ports: + - 8501:8501 + environment: + PINECONE_API_KEY: "" # edit + PINECONE_HOST: "" # edit -> your Pinecone index host URL + OPENAI_API_KEY: "" # edit +``` + +The `datagen` service also needs an `OPENAI_API_KEY` to write ticket descriptions; add it to the `datagen` service's `environment:` (or pass it via a `.env` file picked up by `python-dotenv`). + +### 2. Start the data generator (and later the app) + +You can start the generator on its own first so the sheet begins filling up before you wire up Estuary: + +```bash +docker compose up --build datagen +``` + +`datagen` connects to the Google Sheet and continuously inserts/updates/deletes support requests every ~2 seconds. You should see log lines like `Inserted new support request: [...]`. + +Once Pinecone is populated by the Estuary materialization (next sections), bring up the chat app: + +```bash +docker compose up --build streamlit +``` + +The Streamlit UI is then available at http://localhost:8501. + +## Configure the Estuary capture (Google Sheets → collection) + +1. Open the Estuary dashboard and create a new capture: https://dashboard.estuary.dev/captures +2. Choose the **Google Sheets** source connector. +3. Authenticate to Google (OAuth) or supply the service-account credentials, then point the connector at the spreadsheet URL of your `Fake Customer Support` sheet. +4. Save and publish. 
Estuary discovers the `Support Requests` worksheet and creates a collection that streams every row change from the sheet. + +Connector reference: https://docs.estuary.dev/reference/Connectors/capture-connectors/google-sheets/ + +## Configure the Estuary materialization (collection → Pinecone) + +1. Create a new materialization: https://dashboard.estuary.dev/materializations +2. Choose the **Pinecone** materialization connector. +3. Provide: + - **Pinecone API key** — same key you put in `PINECONE_API_KEY`. + - **Pinecone index** / host — the index whose host you put in `PINECONE_HOST`. + - **OpenAI API key** — the Pinecone connector calls OpenAI to embed each document before upserting it. + - **Namespace** — `Support_Requests` (this is what `rag.py` reads from; keep them in sync). +4. Bind the Google Sheets collection from the capture above to the Pinecone index and publish. + +The connector embeds each collection document with OpenAI and upserts the vector into Pinecone, storing the source text under the `flow_document` metadata key that `rag.py` uses as its `text_key`. + +Connector reference: https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/ + +> Keep the **namespace** (`Support_Requests`), the **embedding model/dimension**, and the **`flow_document` text key** consistent between the Estuary materialization and `rag.py`, or retrieval will return nothing. + +## Verify + +- **In Estuary:** open the capture and materialization in the dashboard and watch the docs/bytes counters increase as `datagen` mutates the sheet. You can also read the collection directly with flowctl: + ```bash + flowctl collections read --collection <your/collection/name> --uncommitted | head + ``` +- **In Pinecone:** check the index's vector count in the Pinecone console under the `Support_Requests` namespace; it should grow as rows are added. 
+- **In the app:** open http://localhost:8501 and ask, for example, "Show me open billing issues" or "What authentication problems have customers reported?" The chatbot answers from the retrieved support tickets. Insert a new row via `datagen` (or edit the sheet manually) and confirm a follow-up question reflects it. + +## Next steps + +- Swap the synthetic `datagen` source for a real Google Sheet your team already maintains — the pipeline works unchanged. +- Point the same Estuary collection at additional destinations (a warehouse, another vector store) to reuse the captured data without re-reading the source. +- Tune retrieval (`similarity_top_k`), the embedding model, or the OpenAI chat model in `rag.py` for your use case. + +## Resources + +- Full walkthrough: https://estuary.dev/google-sheets-to-pinecone-rag/ +- Estuary docs: https://docs.estuary.dev +- Google Sheets capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/google-sheets/ +- Pinecone materialization connector: https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/ +- Estuary dashboard: https://dashboard.estuary.dev diff --git a/hands-on-lab-postgres-motherduck/README.md b/hands-on-lab-postgres-motherduck/README.md index 9c6fadc..d5e5f7f 100644 --- a/hands-on-lab-postgres-motherduck/README.md +++ b/hands-on-lab-postgres-motherduck/README.md @@ -1,312 +1,265 @@ -# Estuary Hands on Lab (HoL) Workshop +# Real-Time PostgreSQL CDC to MotherDuck Hands-On Lab with Estuary -## Move Data from PostgreSQL to MotherDuck +A guided, hands-on workshop that builds a streaming Change Data Capture (CDC) pipeline from PostgreSQL to MotherDuck using [Estuary](https://dashboard.estuary.dev). 
You capture change events from a Postgres `products` table into an Estuary collection, then materialize that collection into MotherDuck three different ways — **soft delete** (default), **hard delete**, and **Slowly Changing Dimension Type 2 (SCD2)** — to cover the most common analytics warehousing patterns. -#### Introduction +Everything runs from your laptop: a Docker Compose stack spins up Postgres (with logical replication enabled), a fake data generator, and an ngrok TCP tunnel so the fully managed Estuary connector can reach your local database. -In this hands-on lab, we'll be setting up a streaming CDC pipeline from PostgreSQL to MotherDuck using Estuary. You'll use Estuary's PostgreSQL capture (source) connector and MotherDuck materialization (target) connector to set up an end-to-end CDC pipeline in three steps: -1. You’ll capture change event data from a PostgreSQL database, using a table filled with generated realistic product data. -2. You’ll learn how to configure Estuary to persist data as collections while maintaining data integrity. -3. You will see how you can materialize these collections in MotherDuck to make them ready for analytics using default configuration (soft delete). -You’ll then explore ways to customize your pipelines for different use cases. -4. You’ll create a second materialization to perform hard deletes, where a delete operation will physically delete the record from the target data warehouse. -5. And finally, you’ll create a third materialization which will perform a slowly changing dimension type 2, where all records (inc. updates and deletes) are inserted into the target table. This is for use cases where a history/audit table is required for data warehousing. -By the end of this tutorial, you'll have established robust and efficient data pipelines with near real-time replication of data from PostgreSQL to MotherDuck. +## Architecture -Before you get started, make sure to satisfy all prerequisites to complete this workshop. 
+``` +PostgreSQL (products) Estuary MotherDuck ++-------------------+ CDC +------------------------------+ +-------------------+ +| postgres_cdc | ------> | source-postgres capture | | lab1 (soft del.) | +| wal_level=logical| | | | -> | lab2 (hard del.) | +| + datagen | | v | | lab3 (SCD2/delta)| ++-------------------+ | collection: <tenant>/ | +-------------------+ + ^ | workshop/public/products | + | ngrok tcp 5432 | | | + +---------------------+ v | + | 3x materialize-motherduck | + +------------------------------+ +``` -#### Prerequisites +- **Capture (source):** Estuary's real-time PostgreSQL connector reads the WAL and streams inserts/updates/deletes from the `products` table. +- **Collection:** Change events land in an Estuary collection (a real-time data lake of schematized JSON in cloud storage). One collection feeds all three materializations. +- **Materializations (destinations):** Three MotherDuck materializations read from the same collection, each demonstrating a different delete/history strategy. -This tutorial will assume you have access to the following resources for this hands-on lab: -- Laptop and web browser: We’ll be running the workshop from your own equipment via web-based UI. -- Docker: for convenience, we are providing a docker compose definition which will allow you to spin up a database and a fake data generator service. -- Estuary account (free tier): We’ll be creating and managing our CDC pipeline from Estuary’s UI. -- MotherDuck account (free tier): The target data warehouse for our data pipeline is MotherDuck. In order to follow along with the hands-on lab, a trial account is perfectly fine. You’ll also need a service token, which can be obtained from the MotherDuck console. -- AWS S3 Bucket: An S3 bucket for staging temporary files. An S3 bucket in us-east-1 is recommended for best performance and costs, since MotherDuck is currently hosted in that region. You’ll also need your access key id and secret access key. 
-- A verified Ngrok account (free tier): Estuary is a fully managed service. Because the database used in this hands-on lab will be running on your machine, you’ll need something to expose it to the internet. ngrok is a lightweight tool that does just that. -<br> -<br> -<br> -Step 1. Set Up Source Environment Ngrok Authentication Token<br> -Once you have created your Ngrok free account, you will need to obtain your ‘Authentication Token’ for the next step. Save this to paste into the docker-compose.yaml script later. -<br> -<br> +## What's included -**PostgreSQL Setup** +- `docker-compose.yaml` — spins up three services: + - `postgres` (container `postgres_cdc`) running `postgres:latest` with `wal_level=logical`, exposed on port `5432`. + - `datagen` running `materialize/datagen`, generating 10,000 fake product records into Postgres (`-n 10000 -w 1000`). + - `ngrok` running `ngrok/ngrok:latest`, exposing `postgres:5432` over a public TCP tunnel; the ngrok web UI is on port `4040`. +- `init.sql` — runs on first boot via the Postgres entrypoint and creates the `products` table. +- `schemas/products.sql` — the table schema consumed by the `datagen` tool to generate realistic data. (The actual table is created by `init.sql`; `datagen` only reads this file for column/faker definitions.) -If you do not have PostgreSQL already running, we will set this up using a Docker image. Save the below yaml script as docker-compose.yaml. This file will contain the service definitions for the PostgreSQL database, the mock data generator and Ngrok tunnel service (if you prefer, you can download the files from our GitHub repository here). +### `products` data dictionary -We will now create a init.sql script, which contains the products database table that will be used to ingest change data for streaming downstream. 
+| Column | Type | Notes | +|---------------|-------------|----------------------------------------| +| `id` | `int` | Primary key | +| `name` | `varchar` | `faker.internet.userName()` | +| `merchant_id` | `int` | `NOT NULL`, `faker.datatype.number()` | +| `price` | `int` | `faker.datatype.number()` | +| `status` | `varchar` | `faker.datatype.boolean()` | +| `created_at` | `timestamp` | `DEFAULT now()` | -Create a new folder within the root folder called schemas and paste the below SQL DDL into a file called products.sql. This file contains the schema of the demo data generator. - -*NOTE: This file defines the schema via a create table statement, but the actual table creation happens in the init.sql file, this is just a quirk of the Datagen data generator tool.* +## Prerequisites -For the purpose of this hands-on lab, we will be using the postgres database user. For production use, we recommend creating a dedicated Estuary database user and assign it the privileges and grants required, per our documentation. +- **Docker** (with `docker compose`) — to run the Postgres + datagen + ngrok stack locally. +- **Estuary account** (free tier) — sign up at [dashboard.estuary.dev](https://dashboard.estuary.dev). The capture and materializations are configured entirely from the Estuary UI. +- **MotherDuck account** (free tier) — the target data warehouse. You'll need an **access token** from the MotherDuck console. +- **AWS S3 bucket** — used by MotherDuck for staging temporary files. A `us-east-1` bucket is recommended for best performance and cost (MotherDuck is currently hosted there). You'll need an `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` with read/write access (e.g. `AmazonS3FullAccess`). +- **Verified ngrok account** (free tier) — Estuary is a fully managed service, so the local database must be exposed to the internet. ngrok provides a TCP tunnel. Grab your **authtoken** from the ngrok dashboard. -We’re now ready to start the source database.
In order to initialize Postgres, the fake data generator and ngrok service, all you have to do is execute the following command from within the root folder of your hands-on lab: -> docker compose up +> For this lab we use the default `postgres` superuser for simplicity. For production, create a dedicated Estuary replication user with the minimum required grants — see the [PostgreSQL connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). -*NOTE: If you run into the following error: -‘postgres_cdc | psql:/docker-entrypoint-initdb.d/init.sql:8: ERROR: syntax error at or near "COMMENT" postgres_cdc | LINE 3: "name" varchar COMMENT 'faker.internet.userName()',’ -Wait a few minutes and then try again, it should resolve itself.* -<br> -<br> -After a few seconds, you should see the services are up and running. The postgres_cdc service should print the following on the terminal: +## Setup: start the source environment -postgres_cdc | LOG: database system is ready to accept connections +1. Get your ngrok **authtoken** and paste it into `docker-compose.yaml`, replacing the `<enter ngrok token here>` placeholder under the `ngrok` service: -While the datagen service will be a little bit more spammy, as it prints every record it generates, but don’t be alarmed, this is enough for us to verify that both are up and running. + ```yaml + ngrok: + image: ngrok/ngrok:latest + environment: + NGROK_AUTHTOKEN: <your-ngrok-authtoken> + command: 'tcp postgres:5432' + ports: + - 4040:4040 + ``` -Let’s see how we can connect to the database and check if we have created the objects. In another command line tab/window execute the following command: +2. 
From this folder, start the stack: -> docker exec -it postgres_cdc bash + ```bash + docker compose up + ``` -Let’s connect to PostgreSQL using the following command: + When Postgres is ready you'll see: -> psql -h postgres_cdc -U postgres -d postgres + ``` + postgres_cdc | LOG: database system is ready to accept connections + ``` -Enter the password postgres. - + The `datagen` service is verbose — it prints each record it generates. That's expected and confirms data is flowing. -Before we jump into setting up the replication, you can quickly verify the data is being properly generated by connecting to the database and peeking into the products table, as shown below: + > **Note:** On first boot you may briefly see `ERROR: syntax error at or near "COMMENT"` from `init.sql`. Wait a minute and it resolves itself. -Wait a few seconds and enter the select count(*) from products; command again. You should see a difference in values. -<br> -<br> +3. (Optional) Connect to Postgres to confirm data is being generated: -Step 2. Access Estuary Dashboard + ```bash + docker exec -it postgres_cdc bash + psql -h postgres_cdc -U postgres -d postgres # password: postgres + ``` -Hard part done, now comes the easy part! -Open a web browser and go to dashboard.estuary.dev/login to access Estuary’s UI (if you haven’t already registered for a free account, please do so now to continue with this exercise). -On the left-hand menu bar, there are several options for interacting with Estuary: -- Welcome -- Sources -- Collections -- Destinations -- Admin + ```sql + SELECT count(*) FROM products; + ``` -**Sources:** -Every Estuary data pipe-line starts with a capture, which is configured from ‘Sources’. Each table adds rows of data to a corresponding Estuary collection. 
-<br>**Collections:** -The rows of data of your Estuary data pipe-lines are stored in collections: real-time data lakes of JSON documents in cloud storage (AWS S3, Azure Blob Storage or Google Cloud Storage), which can be re-played if desired instead of going back to the source. -**Destinations:** -A materialization is how Estuary pushes data to an endpoint, binding one or more collections. As rows of data are added to the bound collections, the materialization continuously pushes it to the destination, where it is reflected with very low latency. + Run the count again after a few seconds — it should increase. -<br> -<br> +### Get the public database endpoint -### Lab Exercise 1: End-to-End Data Pipeline -Step 3. Set Up Estuary Capture +The Estuary connector connects to your database through the ngrok tunnel. Get the public `host:port` from the ngrok web UI at [http://localhost:4040](http://localhost:4040), or via: -1. Go to the sources page by clicking on the Sources on the left-hand side of your screen, then click on + New Capture. +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` -2. In the connector search box type in postgres and select the recommended real-time PostgreSQL connector. +This returns something like `tcp://0.tcp.ngrok.io:12345`. **Strip the `tcp://` prefix** when pasting the address into Estuary. -3. Let’s name this workshop for the connector and select the appropriate data-plane closest to your region (e.g. US). +### Source connection values -4. Complete the necessary connection details to login to PostgreSQL and click on History Mode as we’ll need this for later labs. Click on NEXT when done. 
+| Setting | Value | +|-----------|-----------------------------------------| +| Address | the ngrok host:port (without `tcp://`) | +| Database | `postgres` | +| User | `postgres` | +| Password | `postgres` | -*NOTE: To obtain server address info, login to your Ngrok account and select Endpoints and Traffic Policy from the left-hand menu. This will provide the address to use. Remember to remove tcp:// when pasting the address into Estuary’s UI.* +--- -5. On the following screen, a few points to note: -- Schema evolution: by default enabled with the options:<br> - - Automatically keep schemas up to date<br> - - Automatically add new collections, and<br> - - Changing primary keys re-versions collections. -- Bindings: This will capture the tables from source. Here you can be selective if there are more tables to pick from. -- Backfill: First time you initiate a capture task, Estuary will perform an initial load of existing data within the tables and once completed will stream changes using CDC (if the real-time connector was selected). +## Lab Exercise 1: End-to-end pipeline (soft delete) -*NOTE: If you do not want to perform an initial load of existing data, you can change the backfill option using Backfill Mode.* +### Step 1 — Create the Estuary capture -Leave everything as default settings and select NEXT again, then SAVE AND PUBLISH to deploy the connector and kick off a backfill. +1. In the [Estuary dashboard](https://dashboard.estuary.dev), go to **Sources → + New Capture**. +2. Search for `postgres` and select the **real-time PostgreSQL** connector. +3. Name the capture `workshop` and pick the data plane closest to you (e.g. US). +4. Enter the [source connection values](#source-connection-values) above. Enable **History Mode** — you'll need it for Lab Exercise 3. Click **NEXT**. +5. On the binding/schema screen, leave the defaults: + - **Schema evolution** (enabled): keep schemas up to date, add new collections, re-version collections on primary-key changes. 
+ - **Bindings**: the `products` table maps to a collection. + - **Backfill**: Estuary performs an initial load of existing rows, then streams ongoing changes via CDC. +6. Click **NEXT**, then **SAVE AND PUBLISH**. -*NOTE: You’ll get a warning message about a watermark table not be created. You can ignore this as Estuary will create this for us.* + > You may get a warning that the watermarks table doesn't exist. Ignore it — Estuary creates it for you. -6. Once your capture is set up, you’ll be able to get some insights within the Capture Details page. +7. The capture now appears under **Sources** and produces a collection named `<tenant>/workshop/public/products`. -*NOTE: Notice the additional tabs across the top for specific information about the capture task e.g. History (audit log) and Logs (troubleshoot for errors).* +### Step 2 — Inspect the collection -7. Go back to the Sources main page to see the capture running. +Open **Collections** in the left menu and drill into `<tenant>/workshop/public/products`. The document count should match the capture's metrics. (`Read By` shows `N/A` until a materialization reads from it.) -<br> -<br> -Step 4. View Associating Collection for Table Captured +### Step 3 — Prepare the MotherDuck target -<br>8. From the left-hand side, click on Collections<br> -<br> -*NOTE: Notice that the metrics presented here should match the metrics from the associating capture (I’ve temporarily stopped the data generator to get a consistent count across all my Estuary tasks).* -<br> -<br> -9. You can drill into each collection to view more detailed information +1. **S3 credentials:** In the AWS IAM console, create/select a user with read/write to your S3 bucket (e.g. `AmazonS3FullAccess`) in `us-east-1`. Note the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. +2. **MotherDuck access token:** In MotherDuck, go to **Settings → Integrations → Access Token → + Create token**. Save the token string. +3. 
**MotherDuck S3 secret:** In MotherDuck, go to **Settings → Integrations → Secrets → + Add secrets**. Set Name, Secret Type = **Amazon S3**, Access Key ID, Secret Access Key, and Region `us-east-1`. +4. **Create a database** in MotherDuck called `lab1`. -*NOTE: Notice that the ‘Read By’ states N/A. This will change when we set up the materialization task when we assign this collection task to it.* -<br> -<br> -<br>Step 5. Set Up Target Environment +### Step 4 — Create the MotherDuck materialization -**S3 bucket and Access key ID and Secret Access Key** <br> -You can use an existing user, or create a new user for the S3 bucket and for best performance use region us-east-1. Ensure you have assigned read/write permissions (i.e. AmazonS3FullAccess) and create security credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY). You can create a new user from the IAM console. +1. Go to **Destinations → + New Materialization**. +2. Search for `motherduck` and select the **real-time MotherDuck** connector. +3. Name it `lab1` and pick your data plane. +4. Enter the MotherDuck connection details (access token), set **Bucket Path** to `lab1`. Click **NEXT**. -**Create an Access Token within MotherDuck** <br> -You can create an access token from the Settings menu. Under Integrations you’ll find Access Token. Click on + Create token, provide a name and proceed to create token. Make a note of the token access string, which will be required within Estuary during the connection details. + > Below the credentials you can configure a **sync schedule** (time/timezone/days) for micro-batch applies to MotherDuck. Leave it default for the lab. -**Adding Secrets to MotherDuck to Access S3** <br> -You’ll need to provide MotherDuck access to your S3 bucket. Again, you can find this within Settings > Integrations > Secrets. Click on + Add secrets and complete Name, Secret Type (Amazon S3), Access Key ID, Secret Access Key and Region (us-east-1). +5. 
For **Target Resource Naming Convention**, choose **Use Table Name Only** (so it writes to the `main` schema rather than mirroring the source `public` schema). +6. Under **Advanced Options**, click **ADD** and select the `<tenant>/workshop/public/products` collection. +7. Click **NEXT**, then **SAVE AND PUBLISH**. -**Create a New Database** <br> -Create a new database in MotherDuck for this lab exercise called lab1. +### Step 5 — Verify in MotherDuck -<br> -<br> -Step 6. Set Up Estuary Materialization +Open the MotherDuck UI and confirm rows have landed in the `lab1` database. The row count should match your Estuary capture and collection metrics. -10. Go to the destination page by clicking on the Destinations on the left-hand side of your screen, then click on + New Materialization. +### How it behaves (default settings) -11. In the connector search box type in motherduck and select the recommended real-time MotherDuck connector. +**Capture side:** +- By default, Estuary **coalesces** change events — only the latest state per primary key is emitted (4 updates to the same row become 1). +- **History Mode** (enabled here) captures every transaction without reducing to a final state. -12. Let’s name this lab1 for the connector and select the appropriate data-plane closest to your region (e.g. US). +**Materialization side:** +- By default, Estuary performs **soft deletes**. On a delete, Estuary adds a metadata column marking the row for deletion (with `_meta/op` indicating the operation) but does not physically remove it. +- **Hard Delete** physically removes deleted rows — see Lab Exercise 2. +- **Delta Update** (combined with History Mode) inserts every change as a new row instead of overwriting — see Lab Exercise 3. -13. Complete the necessary connection details to login to MotherDuck and enter lab1 for Bucket Path. Click on NEXT when done. +Change operation types in `_meta/op`: `c` = create/insert, `u` = update, `d` = delete. 
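The capture-side difference between the default (coalesced) behavior and History Mode can be modeled in a few lines of Python. This is an illustrative sketch only, not Estuary's implementation: the `coalesce` helper and the event dictionaries are hypothetical, and only the "four updates become one" behavior comes from the description above.

```python
# Illustrative model only (not Estuary code): "latest state per primary key"
# versus keeping every change event, as History Mode does.

def coalesce(events):
    """Reduce a stream of change events to the latest event per primary key."""
    latest = {}
    for event in events:
        latest[event["id"]] = event  # a later event overwrites an earlier one
    return list(latest.values())


# Four successive updates to the same row (hypothetical values).
updates = [{"id": 1, "price": price} for price in (10, 12, 15, 20)]

coalesced = coalesce(updates)  # default behavior: one event survives
history = list(updates)        # History Mode: every event is kept

print(len(coalesced), len(history))
```

With the default behavior a downstream table only ever sees the final price, while History Mode keeps all four intermediate values available to the materialization.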
-*NOTE: Notice below the credential section, you have the ability to define a sync schedule for applying to the target based on time, timezone, start and end times and which days to perform sync. This is useful with data warehouses like MotherDuck to perform micro-batch applies. For this exercise, we will leave as default.* +--- -14. For Target Resource Naming Convention click on the drop-down and select Use Table Name Only as we want to use the main schema and not mirror the source public schema. +## Lab Exercise 2: One-to-many topology (hard delete) -15. Scroll down to Advanced Options and click on the ADD button to select the collection to read from. In this hands-on lab you will need to select the -<tenant name>/workshop/public/products collection. +Add a second materialization that reads from the **same** collection but physically deletes records. -16. Select NEXT again, then SAVE AND PUBLISH to deploy the connector and apply data into MotherDuck. Once your materialization is set up, you’ll be able to get some insights within the Materialization Details page. +1. In MotherDuck, create a new database called `lab2`. +2. In Estuary, go to **Destinations → + New Materialization** and select the **MotherDuck** connector. +3. Name it `lab2`, pick your data plane. +4. Enter the MotherDuck connection details, **check the Hard Delete checkbox**, and set **Bucket Path** to `lab2`. Click **NEXT**. +5. Set **Target Resource Naming Convention** to **Use Table Name Only**. +6. Under **Advanced Options**, click **ADD** and select the `<tenant>/workshop/public/products` collection. +7. Under **Advanced Options → Config → Field Selection**, find the `_meta/op` column and click **EXCLUDE** (not needed here). +8. Click **NEXT**, then **SAVE AND PUBLISH**. Verify in MotherDuck's `lab2` database. -*NOTE: Notice that the metric here matches my capture and collections (I’ve temporarily stopped the data generator to get a consistent count across all my Estuary tasks).* +--- -17. 
Go back to the Destination main page to see the materialization running. +## Lab Exercise 3: Slowly Changing Dimension Type 2 (delta updates) -<br> -<br> +Add a third materialization that inserts every change (including updates and deletes) as a new row — ideal for audit/history tables in a warehouse. -Step 7. View In MotherDuck - -18. Let’s go back into MotherDuck’s dashboard and verify that data has been replicated, as shown below (I’ve temporarily stopped the data generator to get a consistent count across all my Estuary tasks): - -<br> -<br> -Step 8. Some Observations About Your Newly Created Data Pipeline - - -Capture Side: <br> -• By default, Estuary will coalesce the data to only transmit the current state of the source data-point e.g. If there are 4 updates to an existing row with the same primary key, Estuary will only capture the latest update, as this mirror’s the current state, and ignores the previous 3 updates. <br> -• History Mode enabled will allow you to capture all the transactions without reducing them to a final state (which we have enabled for our lab exercise). <br> - -Materialization Side: <br> -• By default, Estuary will perform soft deletes downstream. When a delete operation is detected, Estuary will add an extra column on the target table indicating that the record is marked for deletion with another metadata column indicating the operation type. The record will not be physically deleted in the target. <br> -• Hard Delete enabled will perform the physical deletes in the destination (we’ll look into this for lab exercise 2). <br> -• Delta Update (in combination with History Mode) ensures all changes are inserted as new rows in the destination, rather than overwriting existing records (we’ll look into this for lab exercise 3). <br> - -Change operation type: 'c' Create/Insert, 'u' Update, 'd' Delete. - - -<br> -<br> - -### Lab Exercise 2: One-to-Many Topology -Step 9. Create a 2nd Materialization, Reading from the Same Collection - -19. 
In MotherDuck add a new database and call it lab2. - -20. In Estuary, go to the destination page by clicking on the Destinations on the left-hand side of your screen, then click on + New Materialization. - -21. In the connector search box type in motherduck and select the recommended real-time MotherDuck connector. - -22. Let’s name this lab2 for the connector and select the appropriate data-plane closest to your region (e.g. US). - -23. Complete the necessary connection details to login to MotherDuck and ensure to select the Hard Delete checkbox this time and enter lab2 for Bucket Path. Click on NEXT when done. - -24. For Target Resource Naming Convention click on the drop-down and select Use Table Name Only as we want to use the main schema and not mirror the source public schema. - -25. Scroll down to Advanced Options and click on the ADD button to select the collection to read from. In this hands-on lab you will need to select the -<tenant name>/workshop/public/products collection. - -26. Under Advanced Options > Config > Field Selection scroll down to column _meta/op and click EXCLUDE to remove this column from selection as we don’t require this for our purpose in this lab exercise. - -27. Select NEXT again, then SAVE AND PUBLISH to deploy the connector and apply data into MotherDuck. Once your materialization is set up, go back into MotherDuck’s dashboard and verify that data has been replicated. - - -<br> -<br> -<br> - - -### Lab Exercise 3: Slowly Changing Dimension Type 2 -Step 10. Create a 3rd Materialization to Insert All Records - -Prerequisite for this is to enable History Mode on capture side, which we already did in lab exercise 1. - -You’ll also need to configure the products table to use the entire row as the unique identifier for logical replication to support SCD2. Let’s go back into PSQL and perform an ALTER TABLE. +History Mode is already enabled (from Lab Exercise 1). 
You also need full-row replica identity on the source so logical replication carries enough detail for SCD2: +```sql ALTER TABLE products REPLICA IDENTITY FULL; +``` -Now let’s set up our 3rd materialization task. - -28. In MotherDuck add a new database and call it lab3. - -29. In Estuary. go to the destination page by clicking on the Destinations on the left-hand side of your screen, then click on + New Materialization. - -30. In the connector search box type in motherduck and select the recommended real-time MotherDuck connector. - -31. Let’s name this lab3 for the connector and select the appropriate data-plane closest to your region (e.g. US). - -32. Complete the necessary connection details to login to MotherDuck and enter lab3 for Bucket Path. Click on NEXT when done. - -33. For Target Resource Naming Convention click on the drop-down and select Use Table Name Only as we want to use the main schema. - -34. Scroll down to Advanced Options and click on the ADD button to select the collection to read from. In this hands-on lab you will need to select the -<tenant name>/workshop/public/products collection. - -35. Under Advanced Options > Config > Resource Configuration ensure to select the Delta Update checkbox this time. Click on NEXT when done. - -36. Select NEXT again, then SAVE AND PUBLISH to deploy the connector and apply data into MotherDuck. Once your materialization is set up, go back into MotherDuck’s dashboard and verify that data has been replicated. - - -Step 11. UPDATE and DELETE a Record - -37. At this point you will need to stop the datagen container (if you already haven’t done so). You can do this via command line: - -> docker stop datagen - -Or you can do this via docker UI: - -38. Let’s go back into the PostgreSQL command line (page 6 for log in instructions if you’ve closed this window). - -39. Find a record to change within the source PostgreSQL database. 
I am going to use record ID 17 as this is the first record I see within my table - -> SELECT * FROM products - ORDER BY id ASC - LIMIT 1; +Then: - +1. In MotherDuck, create a new database called `lab3`. +2. In Estuary, go to **Destinations → + New Materialization** and select the **MotherDuck** connector. +3. Name it `lab3`, pick your data plane. +4. Enter the MotherDuck connection details, set **Bucket Path** to `lab3`. Click **NEXT**. +5. Set **Target Resource Naming Convention** to **Use Table Name Only**. +6. Under **Advanced Options**, click **ADD** and select the `<tenant>/workshop/public/products` collection. +7. Under **Advanced Options → Config → Resource Configuration**, **check the Delta Update checkbox**. Click **NEXT**. +8. Click **NEXT**, then **SAVE AND PUBLISH**. Verify in MotherDuck's `lab3` database. -40. Perform an UPDATE on this record: +### Test it: update and delete a record -> UPDATE products - SET name = 'SmallData' - WHERE id = 17; +Stop the data generator so the table is stable: - -41. Perform a DELETE on this record: +```bash +docker stop datagen +``` -> DELETE FROM products - WHERE id = 17; +Connect to Postgres (`docker exec -it postgres_cdc bash`, then `psql -h postgres_cdc -U postgres -d postgres`) and pick the lowest-id row: - +```sql +SELECT * FROM products ORDER BY id ASC LIMIT 1; +``` +Update it, then delete it (using the example id `17`): +```sql +UPDATE products SET name = 'SmallData' WHERE id = 17; +DELETE FROM products WHERE id = 17; +``` +Check each MotherDuck database to compare behaviors: +- **`lab1` (soft delete):** the row remains with delete metadata set. +- **`lab2` (hard delete):** the row is physically removed. +- **`lab3` (SCD2 / delta update):** the insert, the update, and the delete each appear as separate rows — a full change history. 
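The three outcomes can be reproduced with a small Python model. It is a sketch under stated assumptions, not how the MotherDuck connector is implemented: the `apply_*` function names, the `(op, id, name)` event tuples, and the initial name `widget` are hypothetical. Only the `c`/`u`/`d` operation codes and the example id `17` are taken from the lab.

```python
# Illustrative model of the three materialization modes; not Estuary's code.
# Each change event is (op, id, name): 'c' = create, 'u' = update, 'd' = delete.
events = [
    ("c", 17, "widget"),      # hypothetical initial row
    ("u", 17, "SmallData"),   # the UPDATE from the lab
    ("d", 17, "SmallData"),   # the DELETE from the lab
]


def apply_soft_delete(events):
    """Keep one row per key; a delete only changes the row's op marker."""
    table = {}
    for op, key, name in events:
        table[key] = {"id": key, "name": name, "op": op}
    return table


def apply_hard_delete(events):
    """Keep one row per key; a delete physically removes the row."""
    table = {}
    for op, key, name in events:
        if op == "d":
            table.pop(key, None)
        else:
            table[key] = {"id": key, "name": name}
    return table


def apply_delta_updates(events):
    """Append every change as its own row, preserving full history."""
    return [{"id": key, "name": name, "op": op} for op, key, name in events]


soft = apply_soft_delete(events)     # lab1-style result
hard = apply_hard_delete(events)     # lab2-style result
delta = apply_delta_updates(events)  # lab3-style result

print(soft)
print(hard)
print(delta)
```

In this model the soft-delete table still holds row `17` marked with `d`, the hard-delete table is empty, and the delta table holds one row per change.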
+--- +## Cleanup +```bash +docker compose down -v +``` +Then disable or delete the capture and materializations from the Estuary dashboard, and drop the `lab1`/`lab2`/`lab3` databases in MotherDuck if you no longer need them. +## Next steps +- Swap the MotherDuck destination for another warehouse — e.g. [BigQuery](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/), [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/), or [Databricks](https://docs.estuary.dev/reference/Connectors/materialization-connectors/databricks/). +- Add an Estuary [derivation](https://docs.estuary.dev/concepts/derivations/) to transform the `products` collection in SQL, TypeScript, or Python before materializing. +- Point the same capture at additional tables by adding bindings. +## References +- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [MotherDuck materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/motherduck/) +- [Estuary dashboard](https://dashboard.estuary.dev) · [New capture](https://dashboard.estuary.dev/captures) · [New materialization](https://dashboard.estuary.dev/materializations) diff --git a/kafka-capture/README.md b/kafka-capture/README.md index 668c4dc..fd3dad8 100644 --- a/kafka-capture/README.md +++ b/kafka-capture/README.md @@ -1,50 +1,105 @@ -# Kafka MSK IAM Authentication Setup +# Stream Kafka (AWS MSK) IoT Topics to Any Destination with Estuary -This project contains Kafka producer and consumer scripts that use AWS MSK IAM authentication. +Generate and consume real-time IoT data on an [Amazon MSK](https://aws.amazon.com/msk/) (Managed Streaming for Apache Kafka) cluster using **AWS IAM authentication** (`SASL_SSL` + `OAUTHBEARER`), then capture those topics into [Estuary](https://estuary.dev) with the `source-kafka` connector. 
The Python producer (`datagen/`) emits sensor readings and device metadata to the `iot.readings` and `iot.devices` topics, and the consumer (`consumer/`) reads them back so you can verify connectivity before wiring up the Estuary capture. + +Use this example as a working reference for **MSK IAM authentication from Python (`kafka-python` + `aws-msk-iam-sasl-signer`)** and for streaming Kafka topics into a real-time data pipeline. + +## Architecture + +``` +datagen/datagen.py ──► AWS MSK topics ──► Estuary capture ──► Estuary collections ──► materialization ──► destination + (IoT producer) iot.readings (source-kafka) (real-time JSON) (any connector) (warehouse / lake) + iot.devices + ▲ + │ + consumer/consumer.py (verify) +``` + +- The **producer** authenticates to MSK with AWS IAM (no Kafka passwords), creates the topics if missing, and streams JSON events. +- An **Estuary capture** using the `source-kafka` connector reads the same topics and lands each topic in an [Estuary collection](https://docs.estuary.dev/concepts/collections/) — a real-time, schematized data lake of JSON in cloud storage. +- From there, a [materialization](https://docs.estuary.dev/concepts/materialization/) pushes the collections to any supported destination (Snowflake, BigQuery, Redshift, Iceberg, Postgres, etc.) in real time, with optional [derivations](https://docs.estuary.dev/concepts/derivations/) for SQL/TypeScript/Python transforms in between. + +## What's included + +| Path | Role | +| --- | --- | +| `datagen/datagen.py` | IoT producer. Creates `iot.readings` (3 partitions, replication factor 2) and `iot.devices` (1 partition, replication factor 2), seeds device metadata, then streams ~10 readings/sec with occasional SCD-2-style metadata changes. | +| `datagen/requirements.txt` | Producer deps: `kafka-python`, `aws-msk-iam-sasl-signer-python`, `faker`, `python-dotenv`. | +| `consumer/consumer.py` | IoT consumer. 
Reads a topic (`iot.readings` by default), pretty-prints each message, and exits after 10s of inactivity. | +| `consumer/run_consumer.sh` | Convenience wrapper around `consumer.py` with `-t/--topic` and `--from-beginning` flags. | +| `consumer/requirements.txt` | Consumer deps: `kafka-python`, `aws-msk-iam-sasl-signer-python`. | +| `check_setup.py` | Pre-flight checker: validates AWS credentials, region, `MSK_BROKERS`, MSK IAM token generation, and basic `kafka:ListClusters` permission. | +| `test_kafka_connection.py` | Minimal MSK connectivity smoke test (admin client + producer metadata). | + +## Data model + +**`iot.readings`** (keyed by `device_id`): + +| Field | Type | Notes | +| --- | --- | --- | +| `device_id` | string | e.g. `thermo-00001` | +| `ts` | string | ISO-8601 UTC timestamp (millisecond precision) | +| `temperature_c` | number | °C | +| `humidity_pct` | number | % | +| `battery_pct` | number | % | +| `status` | string | `ok` / `warn` / `error` (derived from thresholds) | + +**`iot.devices`** (keyed by `device_id`, SCD-2-style — a new record per change): + +| Field | Type | Notes | +| --- | --- | --- | +| `device_id` | string | | +| `effective_from` | string | ISO-8601 UTC timestamp of this version | +| `model` | string | `T900`…`T9000` | +| `firmware_version` | string | e.g. `1.3.0` | +| `site` | string | `nyc_manhattan_hq`, `sp_sao_paulo_lab`, `ldn_office` | +| `room` | string | `conf_a`, `conf_b`, `open_floor`, `server_room` | +| `lat` / `lon` | number | coordinates | ## Prerequisites -1. **AWS Credentials**: The scripts use AWS MSK IAM authentication, which requires valid AWS credentials -2. **Python 3.7+** -3. **Access to an AWS MSK cluster** +- **Python 3.7+** +- An **AWS MSK cluster** with **IAM access control** enabled, reachable from where you run the scripts (security groups / VPC / public access). +- **AWS credentials** with the MSK IAM permissions below (the signer uses the standard AWS default credential chain). 
+- A free **Estuary account** to create the capture: <https://dashboard.estuary.dev> -## AWS Credentials Configuration +## AWS credentials configuration -The `aws-msk-iam-sasl-signer` library uses the **AWS default credential chain**. Configure credentials using one of these methods: +The `aws-msk-iam-sasl-signer` library uses the **AWS default credential chain**. Configure credentials with any standard method. -### Option 1: Environment Variables (Recommended for Development) +### Option 1 — Environment variables (recommended for local dev) ```bash export AWS_ACCESS_KEY_ID="your-access-key-id" export AWS_SECRET_ACCESS_KEY="your-secret-access-key" -export AWS_REGION="us-east-1" # Your MSK cluster region +export AWS_REGION="us-east-1" # your MSK cluster region export MSK_BROKERS="your-msk-bootstrap-servers" ``` -### Option 2: AWS Credentials File +### Option 2 — AWS credentials file + +`~/.aws/credentials`: -Create/edit `~/.aws/credentials`: ```ini [default] aws_access_key_id = your-access-key-id aws_secret_access_key = your-secret-access-key ``` -Create/edit `~/.aws/config`: +`~/.aws/config`: + ```ini [default] region = us-east-1 ``` -### Option 3: Named AWS Profile - -If using a named profile instead of default: +### Option 3 — Named AWS profile ```bash export AWS_PROFILE="your-profile-name" ``` -Update the scripts to use the named profile by modifying the `MSKTokenProvider` class: +To use a named profile inside the token provider, modify the `MSKTokenProvider` class: ```python class MSKTokenProvider(AbstractTokenProvider): @@ -53,13 +108,11 @@ class MSKTokenProvider(AbstractTokenProvider): return token ``` -### Option 4: IAM Role (for EC2/ECS/Lambda) - -If running on AWS infrastructure, the scripts can automatically use the attached IAM role. +### Option 4 — IAM role (EC2/ECS/Lambda) -### Option 5: Assume Role +Running on AWS infrastructure? The scripts automatically use the attached IAM role — no extra config. 
-To use an assumed role, modify the `MSKTokenProvider`: +### Option 5 — Assume role ```python class MSKTokenProvider(AbstractTokenProvider): @@ -68,9 +121,9 @@ class MSKTokenProvider(AbstractTokenProvider): return token ``` -## Required IAM Permissions +## Required IAM permissions -Your AWS credentials need these permissions for MSK: +The identity used by the producer/consumer needs these MSK permissions. The same policy (Connect + topic Read/Write + group access) is what an Estuary capture's IAM identity needs to read the topics. ```json { @@ -106,120 +159,164 @@ Your AWS credentials need these permissions for MSK: } ``` -## Environment Variables - -Set these environment variables: +## Environment variables ```bash # Required -export AWS_REGION="us-east-1" # Your MSK cluster region +export AWS_REGION="us-east-1" # your MSK cluster region (default in scripts: us-east-1) export MSK_BROKERS="b-1.cluster.kafka.region.amazonaws.com:9098,b-2.cluster.kafka.region.amazonaws.com:9098" -# Optional - AWS Credentials (if not using other methods) +# Optional — AWS credentials (if not using a profile/role/file) export AWS_ACCESS_KEY_ID="your-access-key" export AWS_SECRET_ACCESS_KEY="your-secret-key" ``` -## Installation +> MSK IAM endpoints use port **9098** (private) or **9198** (public). Use the IAM bootstrap brokers, not the TLS/plaintext ones. + +> Note: the consumer and producer read `MSK_BROKERS` / accept `--bootstrap`; `datagen.py` and `test_kafka_connection.py` ship with a placeholder `BOOTSTRAP = "borker1,broker2,broker3"` — either pass `--bootstrap` (where supported) or edit the constant to your IAM bootstrap servers. The producer also imports `boto3`, so install it (`pip install boto3`) if it is not already present. 
+ +## Setup -### Producer Setup ```bash +# Producer cd datagen pip install -r requirements.txt + +# Consumer (in a separate shell) +cd consumer +pip install -r requirements.txt ``` -### Consumer Setup +Verify your environment before producing anything: + ```bash -cd consumer -pip install -r requirements.txt +# check_setup.py needs boto3/botocore, which aren't in the requirements files. +# (botocore ships with boto3.) The producer also imports boto3. +pip install boto3 +python check_setup.py ``` -## Usage +It runs five checks (AWS credentials, region, `MSK_BROKERS`, MSK token generation, and MSK list-clusters permission) and prints a pass/fail summary. + +> Note: `check_setup.py` imports `boto3` and `botocore` (and the producer imports `boto3` too), but neither is listed in `datagen/requirements.txt` or `consumer/requirements.txt`. Run `pip install boto3` first or you'll hit `ModuleNotFoundError`. + +## Running it + +### 1. Produce IoT data -### Running the Producer ```bash cd datagen python datagen.py +# or with explicit brokers: +python datagen.py --bootstrap "b-1.cluster.kafka.us-east-1.amazonaws.com:9098" ``` -With custom bootstrap servers: -```bash -python datagen.py --bootstrap "your-msk-cluster:9098" -``` +The producer creates the topics if needed, seeds 20 devices into `iot.devices`, then streams readings to `iot.readings` at ~10 events/sec until you press Ctrl+C. -### Running the Consumer +### 2. 
Consume to verify -Using the convenience script: ```bash cd consumer -./run_consumer.sh # Latest from iot.readings -./run_consumer.sh -t iot.devices # Latest from iot.devices -./run_consumer.sh -t iot.readings --from-beginning # All messages from beginning +./run_consumer.sh # latest from iot.readings +./run_consumer.sh -t iot.devices # latest from iot.devices +./run_consumer.sh -t iot.readings --from-beginning # all messages from the beginning ``` -Using Python directly: +Or call Python directly: + ```bash -cd consumer python consumer.py -t iot.readings --from-beginning ``` -## Configuration Details +Consumer flags: `-b/--bootstrap`, `-t/--topic` (default `iot.readings`), `-g/--group` (default `iot-consumer-group`), `--from-beginning`. -### MSK Cluster Configuration -- **Security Protocol**: SASL_SSL -- **SASL Mechanism**: OAUTHBEARER -- **Authentication**: AWS IAM +### Connection details (matches the scripts) -### Topics Created -- `iot.readings`: Sensor data (temperature, humidity, battery) -- `iot.devices`: Device metadata (location, firmware, etc.) +- **Security protocol:** `SASL_SSL` +- **SASL mechanism:** `OAUTHBEARER` +- **Authentication:** AWS MSK IAM (token from `aws-msk-iam-sasl-signer`) -## Troubleshooting +## Configure the Estuary capture -### Debug Credentials -To see which AWS identity is being used: +Once the topics exist and have data, capture them into Estuary with the **`source-kafka`** connector (image `ghcr.io/estuary/source-kafka:v1`). -```python -# Add this to debug credential issues -token, _ = MSKAuthTokenProvider.generate_auth_token(AWS_REGION, aws_debug_creds=True) -``` +### Via the dashboard -### Common Issues +1. Go to <https://dashboard.estuary.dev/captures> and click **New Capture**. +2. Choose the **Apache Kafka** / **Amazon MSK** (`source-kafka`) connector. +3. Enter the connection settings, using the same values as the scripts: + - **Bootstrap servers:** your IAM bootstrap brokers (e.g. 
`b-1.cluster.kafka.us-east-1.amazonaws.com:9098`) + - **TLS:** enabled (`SASL_SSL`) + - **Authentication:** AWS MSK IAM — provide the AWS region (`us-east-1`) and the access key / secret of an identity holding the [IAM permissions above](#required-iam-permissions). +4. Discover topics and select `iot.readings` and `iot.devices` to bind. Each becomes an Estuary collection. +5. Save and publish. Estuary backfills existing messages and then streams new ones in real time. -1. **Access Denied**: Check IAM permissions and ensure credentials are configured -2. **Region Mismatch**: Ensure `AWS_REGION` matches your MSK cluster region -3. **Network Issues**: Verify MSK cluster security groups allow access from your IP -4. **Bootstrap Servers**: Ensure you're using the correct MSK bootstrap servers +See the connector reference for the full option list (including AWS IAM auth and schema-registry settings): <https://docs.estuary.dev/reference/Connectors/capture-connectors/apache-kafka/> -### Testing Credentials -```bash -# Test AWS credentials -aws sts get-caller-identity +### Via flowctl + +Prefer the CLI? 
Authenticate, stub a minimal `flow.yaml` with the `source-kafka` config, discover the topics, then publish: -# Test MSK connectivity -aws kafka list-clusters --region us-east-1 +```yaml +# flow.yaml — minimal source-kafka stub for discovery +captures: + <tenant>/<prefix>/source-kafka: + endpoint: + connector: + image: ghcr.io/estuary/source-kafka:v1 + config: + bootstrap_servers: "b-1.cluster.kafka.us-east-1.amazonaws.com:9098,b-2.cluster.kafka.us-east-1.amazonaws.com:9098" + tls: system_certificates + credentials: + auth_type: AWS + aws_access_key_id: "your-access-key-id" + aws_secret_access_key: "your-secret-access-key" + region: us-east-1 + bindings: [] ``` -## Example Complete Setup +```bash +flowctl auth login +flowctl discover --source flow.yaml # fills in bindings for iot.readings / iot.devices +flowctl catalog publish --source flow.yaml --auto-approve +``` + +`flowctl discover` rewrites `flow.yaml` in place, adding a binding (and discovered collection schema) for each topic. See the [connector reference](https://docs.estuary.dev/reference/Connectors/capture-connectors/apache-kafka/) for the full config and AWS IAM auth option list. + +flowctl docs: <https://docs.estuary.dev/concepts/flowctl/> + +## Verify the pipeline + +Confirm messages are landing in your Estuary collections: ```bash -# 1. Set environment variables -export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE" -export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" -export AWS_REGION="us-east-1" -export MSK_BROKERS="b-1-public.democluster1.cp6yjz.c21.kafka.us-east-1.amazonaws.com:9198,b-2-public.democluster1.cp6yjz.c21.kafka.us-east-1.amazonaws.com:9198" +flowctl collections read --collection <tenant>/<prefix>/iot.readings --uncommitted | head +``` -# 2. 
Install dependencies -cd datagen && pip install -r requirements.txt -cd ../consumer && pip install -r requirements.txt +Or watch live throughput and document counts on the capture's page in the [Estuary dashboard](https://dashboard.estuary.dev/captures). -# 3. Start producer (in one terminal) -cd datagen -python datagen.py +## Next steps -# 4. Start consumer (in another terminal) -cd consumer -./run_consumer.sh --from-beginning +- Add a **materialization** to push the collections to a warehouse or lake: <https://dashboard.estuary.dev/materializations> +- Add a **derivation** to transform `iot.devices` into an SCD-2 dimension or to enrich `iot.readings`: <https://docs.estuary.dev/concepts/derivations/> +- Read collections from any Kafka client (no extra infra) via **Dekaf**, Estuary's Kafka-compatible API: <https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/> + +## Troubleshooting + +- **Access denied:** verify the [IAM permissions](#required-iam-permissions) and that credentials are loaded (`aws sts get-caller-identity`). +- **Region mismatch:** `AWS_REGION` must match the cluster's region. +- **Network issues:** confirm the MSK security groups allow access from your IP and that you are using the **IAM** bootstrap endpoints (port 9098/9198). +- **Wrong brokers:** `datagen.py` / `test_kafka_connection.py` default to a placeholder; set `MSK_BROKERS` or pass `--bootstrap`. + +Debug which identity the signer uses: + +```python +token, _ = MSKAuthTokenProvider.generate_auth_token(AWS_REGION, aws_debug_creds=True) ``` -The producer will generate IoT sensor data, and the consumer will display the messages in real-time. 
\ No newline at end of file +## Resources + +- Estuary docs: <https://docs.estuary.dev> +- Apache Kafka / Amazon MSK capture connector: <https://docs.estuary.dev/reference/Connectors/capture-connectors/apache-kafka/> +- Dekaf (read collections as Kafka): <https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/> +- AWS MSK IAM access control: <https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html> diff --git a/mongodb-pinecone-rag/README.md b/mongodb-pinecone-rag/README.md index 7b473a1..f8165be 100644 --- a/mongodb-pinecone-rag/README.md +++ b/mongodb-pinecone-rag/README.md @@ -1,6 +1,186 @@ -# Realtime RAG with Estuary and Pinecone +# Stream MongoDB to Pinecone for Real-Time RAG with Estuary -This projects showcases the components of a CDC data flow that streams data from -MongoDB, vectorizes it then loads the embeddings into Pinecone. +This example builds an end-to-end, real-time Retrieval-Augmented Generation (RAG) pipeline. It captures e-commerce product reviews from MongoDB with Estuary, materializes them as vector embeddings into a [Pinecone](https://www.pinecone.io/) index, and serves a [Streamlit](https://streamlit.io/) chat app that answers questions about the reviews using [LlamaIndex](https://www.llamaindex.ai/) and OpenAI. New documents written to MongoDB flow into Pinecone within seconds, so the chatbot always queries fresh data — no batch re-indexing jobs. 
-See blog post for details: https://estuary.dev/real-time-rag-with-estuary-and-pinecone/ +Full walkthrough: [Real-time RAG with Estuary and Pinecone](https://estuary.dev/real-time-rag-with-estuary-and-pinecone/) + +## Architecture + +``` +MongoDB (ecommerce.reviews) + │ change stream (CDC) + ▼ +Estuary capture ──► Estuary collection (schematized JSON) + │ + ▼ + Estuary materialization (Pinecone connector) + embeds each review's text, upserts vectors + │ + ▼ + Pinecone index (namespace: reviews) + ▲ + │ similarity search (top_k=5) + Streamlit + LlamaIndex + OpenAI chat app +``` + +- **Capture** — the [source-mongodb](https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/) connector tails the MongoDB change stream for the `ecommerce.reviews` collection and writes each document into an Estuary **collection** (a real-time, schematized data lake in cloud storage). +- **Materialization** — the [materialize-pinecone](https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/) connector reads the collection, generates an embedding for each review (via OpenAI), and upserts the resulting vectors into a Pinecone index under the `reviews` namespace. The original document text is stored on the vector under the `flow_document` key. +- **Query** — the Streamlit app (`app.py` + `rag.py`) uses LlamaIndex to embed the user's question, retrieve the most similar reviews from Pinecone, and feed them to OpenAI's `gpt-3.5-turbo` as grounding context. + +## What's included + +| Path | Role | +| --- | --- | +| `docker-compose.yml` | Defines two services: `datagen` (container `mongodb-datagen`) seeds MongoDB, and `streamlit` (container `streamlit`) runs the RAG chat app on port `8501`. | +| `datagen/datagen.py` | Reads every CSV in `datagen/data/` and inserts each row into the MongoDB `ecommerce.reviews` collection. 
| +| `datagen/data/` | Five Amazon product-review CSVs (`amazon_books_Data.csv`, `amazon_ebook_Data.csv`, `amazon_grocery_Data.csv`, `amazon_jwellery_Data.csv`, `amazon_pc_Data.csv`). | +| `datagen/Dockerfile` | Builds the loader image (Python 3.11 + `pymongo`). | +| `app.py` | Streamlit UI: chat interface that queries the RAG engine and renders responses. | +| `rag.py` | Wires up the Pinecone vector store (namespace `reviews`, text key `flow_document`), a `top_k=5` retriever, and a `CondensePlusContextChatEngine` backed by OpenAI `gpt-3.5-turbo`. | +| `Dockerfile` | Builds the Streamlit app image (Python 3.11) and serves it on port `8501`. | +| `requirements.txt` | App dependencies: `streamlit`, `llama-index`, `llama-index-vector-stores-pinecone`, `python-dotenv`. | +| `.streamlit/config.toml` | Streamlit theme + static file serving config. | + +### Review document schema + +Each row loaded from the CSVs becomes a MongoDB document with these fields: + +``` +market_place, customer_id, review_id, product_id, product_parent, +product_title, product_category, star_rating, helpful_votes, +total_votes, vine, verified_purchase, review_headline, review_body, review_date +``` + +## Prerequisites + +- **Docker** and Docker Compose. +- A running **MongoDB** instance that Estuary can reach. The simplest path is a free [MongoDB Atlas](https://www.mongodb.com/atlas) cluster (Atlas exposes a public endpoint and meets the change-stream / replica-set requirement of CDC out of the box). The datagen service defaults to `mongo:mongo@localhost:27017` — point it at your own instance via the environment variables below. +- A free **Estuary** account: [https://dashboard.estuary.dev](https://dashboard.estuary.dev). +- A **Pinecone** account with an API key and an index (cosine metric; dimension must match the embedding model you choose in the connector). 
+- An **OpenAI** API key (used by the Pinecone materialization to create embeddings, and by the Streamlit app to generate chat responses). + +> **MongoDB CDC requirements:** the source must be a replica set (or Atlas) so the connector can read change streams, and the capture user needs read access to the target database. See the [source-mongodb prerequisites](https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/). + +## Setup + +### 1. Configure connection values + +Edit the `streamlit` service environment in `docker-compose.yml` with your real keys: + +```yaml + streamlit: + environment: + PINECONE_API_KEY: "<pinecone-api-key>" + PINECONE_HOST: "<pinecone-host>" # e.g. https://my-index-xxxx.svc.us-east-1-aws.pinecone.io + OPENAI_API_KEY: "<openai-api-key>" +``` + +If you are not using the default local MongoDB, point the `datagen` service at your cluster: + +```yaml + datagen: + environment: + MONGODB_HOST: "<your-host>" + MONGODB_PORT: "27017" + MONGODB_USER: "mongo" + MONGODB_PASSWORD: "mongo" + MONGODB_DB: "ecommerce" + MONGODB_COLLECTION: "reviews" +``` + +> The bundled `datagen.py` builds the URI as `mongodb://USER:PASSWORD@HOST:PORT/`. For MongoDB Atlas (which requires the `mongodb+srv://` scheme and TLS), seed your cluster directly with `mongoimport`/`mongosh` or adapt the connection string in `datagen/datagen.py`. + +### 2. Seed MongoDB + +```bash +docker compose up --build datagen +``` + +This loads all five Amazon review CSVs into the `ecommerce.reviews` collection. Confirm the row count, e.g.: + +```bash +# with mongosh against your instance +mongosh "mongodb://mongo:mongo@localhost:27017/" --eval 'db.getSiblingDB("ecommerce").reviews.countDocuments()' +``` + +## Configure the Estuary capture + +Use the Estuary dashboard to create the MongoDB source. + +1. Go to [https://dashboard.estuary.dev/captures](https://dashboard.estuary.dev/captures) → **New Capture** → search for **MongoDB**. +2. 
Enter the connection details for your instance: + - **Address / Host:** your MongoDB host (Atlas connection string host, or your public endpoint) + - **User:** `mongo` (or your DB user) + - **Password:** `mongo` (or your DB password) + - **Database:** `ecommerce` +3. In the discovery step, select the `reviews` collection to bind. +4. Save and publish. Estuary backfills existing reviews and then tails the change stream, writing each document into an Estuary collection (e.g. `your-prefix/reviews`). + +Connector reference: [MongoDB capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/). + +## Configure the Estuary materialization + +Materialize the collection into Pinecone as embeddings. + +1. Go to [https://dashboard.estuary.dev/materializations](https://dashboard.estuary.dev/materializations) → **New Materialization** → search for **Pinecone**. +2. Provide: + - **Pinecone API key** — same value as `PINECONE_API_KEY` + - **Pinecone index** — your index name + - **OpenAI API key** — used by the connector to embed each document + - **Namespace:** `reviews` (must match the namespace `rag.py` queries) +3. Bind the `reviews` collection from your capture. +4. Save and publish. Estuary now upserts a vector per review into Pinecone, storing the source text under the `flow_document` key. + +Connector reference: [Pinecone materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/). + +> **Why `flow_document`?** The Pinecone connector stores the full source document in vector metadata under `flow_document`. `rag.py` configures the LlamaIndex `PineconeVectorStore` with `text_key="flow_document"` and `namespace="reviews"` so retrieval reads exactly what Estuary wrote. 
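To make the `flow_document` convention concrete, here is a hedged sketch of pulling review text back out of a query match. It does not call Pinecone; `review_text` is a hypothetical helper that operates on a plain dictionary shaped like a match (`id`, `score`, `metadata`). The exact encoding of the stored value is an assumption: the sketch accepts either a JSON-encoded document or plain text and falls back to the raw value.

```python
import json


def review_text(match: dict, text_key: str = "flow_document") -> str:
    """Return the source text stored in one match's metadata.

    Assumption: the value under `text_key` is a string holding either the
    JSON-encoded source document or plain text.
    """
    metadata = match.get("metadata") or {}
    raw = metadata.get(text_key, "")
    if not isinstance(raw, str):
        # Unexpected type: serialize it so callers always get a string.
        return json.dumps(raw)
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # plain text, use as-is
    if isinstance(doc, dict):
        # `review_body` is one of the review fields listed in the schema.
        return doc.get("review_body") or raw
    return raw


if __name__ == "__main__":
    sample = {
        "id": "example-id",  # hypothetical vector id
        "score": 0.9,
        "metadata": {
            "flow_document": json.dumps(
                {"product_title": "Laptop sleeve", "review_body": "Fits well."}
            )
        },
    }
    print(review_text(sample))
```

If the key or namespace in `rag.py` differs from what the materialization wrote, `metadata.get` finds nothing and retrieval returns empty text, which is why the two must match.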
+ +## Running the RAG app + +With the capture and materialization live and vectors landing in Pinecone, start the chat app: + +```bash +docker compose up --build streamlit +``` + +Open [http://localhost:8501](http://localhost:8501) and ask questions about the products, for example: + +- "What do reviewers say about laptop sleeves?" +- "Are there any complaints about gluten-free cookie mixes?" +- "Which jewelry products got good reviews for the price?" + +The app embeds your question, retrieves the five most similar reviews from Pinecone, and answers with `gpt-3.5-turbo` grounded in that context. + +To run everything at once: + +```bash +docker compose up --build +``` + +## Verify the pipeline is real-time + +1. Insert a new review into MongoDB: + ```bash + mongosh "mongodb://mongo:mongo@localhost:27017/" --eval \ + 'db.getSiblingDB("ecommerce").reviews.insertOne({product_title:"Test Widget", review_body:"This widget is amazing and very durable.", star_rating:"5"})' + ``` +2. Watch the capture and materialization metrics update in the [Estuary dashboard](https://dashboard.estuary.dev) (docs read/written counts increase). +3. Optionally read the collection directly with flowctl: + ```bash + flowctl collections read --collection your-prefix/reviews --uncommitted | head + ``` +4. Ask the chatbot about the new product — the answer reflects the just-inserted review. + +## Next steps + +- Swap the source for any of Estuary's [148+ connectors](https://docs.estuary.dev/reference/Connectors/) to power RAG over Postgres, Kafka, S3, or SaaS data. +- Add a [derivation](https://docs.estuary.dev/concepts/derivations/) (SQL, TypeScript, or Python) to clean, chunk, or enrich review text before it is embedded. +- Change the retrieval depth (`similarity_top_k` in `rag.py`) or the LLM (`OpenAI(model=...)`) to tune answer quality. 
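For intuition about what the retrieval depth controls, the sketch below ranks stored vectors by cosine similarity and keeps the best `k`. It is a pure-Python illustration under simple assumptions (tiny in-memory vectors, brute-force scoring), not how Pinecone or LlamaIndex implement search; `cosine` and `top_k` are names chosen for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 5):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(vec_id, cosine(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
    store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for vec_id, score in top_k([1.0, 0.0], store, k=2):
        print(vec_id, round(score, 3))
```

A larger `k` gives the LLM more context per question at the cost of a longer prompt, and lower-ranked matches are more likely to be off-topic.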
+ +## References + +- Blog: [Real-time RAG with Estuary and Pinecone](https://estuary.dev/real-time-rag-with-estuary-and-pinecone/) +- [Estuary documentation](https://docs.estuary.dev) +- [MongoDB capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/) +- [Pinecone materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) diff --git a/mongodb-tinybird-clickstream/README.md b/mongodb-tinybird-clickstream/README.md new file mode 100644 index 0000000..6d360db --- /dev/null +++ b/mongodb-tinybird-clickstream/README.md @@ -0,0 +1,184 @@ +# Stream MongoDB Clickstream Data to Tinybird in Real Time with Estuary + +Capture a live e-commerce clickstream from MongoDB Atlas with Estuary's +MongoDB CDC connector and materialize it into [Tinybird](https://www.tinybird.co/) +for real-time analytics. A small data generator continuously inserts clickstream +events into a MongoDB Atlas collection; Estuary captures every change as it +happens and streams it to a Tinybird Data Source, ready for low-latency SQL and +published API endpoints. + +## Architecture + +The pipeline is a standard Estuary capture-to-materialization flow: + +``` +click_stream.csv ──▶ datagen ──▶ MongoDB Atlas ──▶ Estuary capture ──▶ collection ──▶ Estuary materialization ──▶ Tinybird + (insert_one) ecommerce.clickstream (source-mongodb) (materialize-tinybird) +``` + +1. **datagen** reads `datagen/data/click_stream.csv` and inserts each row into the + `ecommerce.clickstream` collection on MongoDB Atlas, one document every 5 + seconds, to simulate a continuous event stream. +2. An Estuary **capture** (`source-mongodb`) tails the collection's change stream + and writes each event into a real-time **collection** (schematized JSON backed + by cloud storage). +3. 
An Estuary **materialization** (`materialize-tinybird`) pushes the collection + to a Tinybird Data Source, where you can query it with SQL and expose it as an + API. + +Because MongoDB Atlas is already a public, managed endpoint, no ngrok tunnel is +required — Estuary's hosted connector connects directly to your Atlas cluster. + +## What's included + +- `docker-compose.yml` — defines the `datagen` service (`container_name: + mongodb-datagen`) and passes the MongoDB Atlas connection settings as + environment variables. +- `datagen/datagen.py` — connects to Atlas with `pymongo` over a + `mongodb+srv://` URI and inserts every CSV row into the target collection, + sleeping 5 seconds between inserts. +- `datagen/data/click_stream.csv` — the source clickstream dataset. Columns: + `session_id`, `event_name`, `event_time`, `event_id`, `traffic_source`, + `event_metadata`. +- `datagen/Dockerfile` — builds the generator on `python:3.11` and runs + `python -u datagen.py`. +- `datagen/requirements.txt` — pins `pymongo==4.10.1`. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A **MongoDB Atlas** cluster. The included `docker-compose.yml` points at a + placeholder host (`cluster0.vun5h.mongodb.net`); replace it with your own. +- A free **Estuary** account: https://dashboard.estuary.dev +- A free **Tinybird** account and a Workspace: https://www.tinybird.co/ + +## MongoDB Atlas configuration + +The capture relies on MongoDB change streams, which require a replica set (every +Atlas cluster is a replica set by default) and a database user with read access +to the target database. Before running anything: + +1. Create a database user in Atlas (Database Access) with read access to the + `ecommerce` database. +2. Allow Estuary's IP addresses through **Network Access** (or temporarily allow + `0.0.0.0/0` for testing). 
See the + [MongoDB connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/) + for the current allowlist. + +## Running the data generator + +Edit the environment block in `docker-compose.yml` to match your Atlas cluster +and credentials: + +```yaml +environment: + MONGODB_HOST: "cluster0.vun5h.mongodb.net" # your Atlas SRV host + MONGODB_PORT: "27017" + MONGODB_USER: "mongo" + MONGODB_PASSWORD: "mongo" + MONGODB_DB: "ecommerce" + MONGODB_COLLECTION: "clickstream" +``` + +> The generator builds the connection string as +> `mongodb+srv://{MONGODB_USER}:{MONGODB_PASSWORD}@{MONGODB_HOST}/`, so +> `MONGODB_HOST` must be the Atlas SRV hostname (the `MONGODB_PORT` value is not +> used by the `+srv` scheme). + +Then build and start the generator: + +```bash +docker compose up -d --build +``` + +Watch it insert events: + +```bash +docker compose logs -f datagen +``` + +Each run inserts the full `click_stream.csv` dataset into +`ecommerce.clickstream`, one document every 5 seconds. + +## Configure the Estuary capture (MongoDB) + +Create the capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) +(**Sources → New Capture → MongoDB**) using the +[`source-mongodb`](https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/) +connector, or via [flowctl](https://docs.estuary.dev/concepts/flowctl/). + +Use the same values you set in `docker-compose.yml`: + +| Setting | Value | +| ---------- | ---------------------------------------------- | +| Address | `mongodb+srv://cluster0.vun5h.mongodb.net` | +| User | `mongo` | +| Password | `mongo` | +| Database | `ecommerce` | + +The connector discovers the `ecommerce.clickstream` collection and creates a +binding that streams its change stream into an Estuary collection (for example +`your-prefix/mongodb-clickstream/ecommerce/clickstream`). 
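The generator's connection string format described earlier, `mongodb+srv://{MONGODB_USER}:{MONGODB_PASSWORD}@{MONGODB_HOST}/`, can be sketched as below. This is an illustration rather than the contents of `datagen.py`: `build_srv_uri` is a hypothetical name, and percent-encoding the credentials with `quote_plus` is an added assumption so that passwords containing `@`, `:` or `/` do not break the URI.

```python
import os
from urllib.parse import quote_plus


def build_srv_uri(user: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI; the port is omitted because +srv ignores it."""
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/"


if __name__ == "__main__":
    # Defaults mirror the placeholder values in docker-compose.yml.
    uri = build_srv_uri(
        os.environ.get("MONGODB_USER", "mongo"),
        os.environ.get("MONGODB_PASSWORD", "mongo"),
        os.environ.get("MONGODB_HOST", "cluster0.vun5h.mongodb.net"),
    )
    # Print only the host part so the password never reaches the logs.
    print("connecting to", uri.rsplit("@", 1)[-1])
```

The same host, user, and password are what the capture form above expects, so building the URI once and testing it locally is a quick way to rule out credential typos before configuring Estuary.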
+ +To authenticate flowctl for CLI-based deploys: + +```bash +flowctl auth login +``` + +## Configure the Estuary materialization (Tinybird) + +Create the materialization in the +[Estuary dashboard](https://dashboard.estuary.dev/materializations) +(**Destinations → New Materialization → Tinybird**) using the +[`materialize-tinybird`](https://docs.estuary.dev/reference/Connectors/materialization-connectors/tinybird/) +connector. + +You'll need, from your Tinybird Workspace: + +- Your Tinybird **region/host** (e.g. `api.us-east.tinybird.co`). +- A Tinybird **Auth Token** with permission to create and append to Data Sources. + +Bind the clickstream collection from the capture above to a Tinybird Data Source. +Estuary streams new events directly into Tinybird, where each clickstream event +becomes a row you can query with SQL and expose as a published API endpoint. + +## Verify data is flowing + +Confirm events are reaching the Estuary collection: + +```bash +flowctl collections read \ + --collection your-prefix/mongodb-clickstream/ecommerce/clickstream \ + --uncommitted | head +``` + +You can also watch live throughput on the capture and materialization tiles in +the [Estuary dashboard](https://dashboard.estuary.dev). On the Tinybird side, +query the Data Source from the Tinybird UI to see clickstream rows arriving in +near real time, for example: + +```sql +SELECT event_name, count() AS events +FROM clickstream +GROUP BY event_name +ORDER BY events DESC +``` + +## Next steps + +- Add an Estuary **derivation** (SQL, TypeScript, or Python) to sessionize or + aggregate events before they reach Tinybird — + https://docs.estuary.dev/concepts/derivations/ +- Fan the same MongoDB collection out to additional destinations (warehouses, + lakehouses, vector stores) by adding more materializations — a collection can + power many destinations at once. 
+- Build Tinybird Pipes and API endpoints on top of the materialized Data Source + to serve real-time clickstream metrics to your application. + +## Resources + +- Estuary docs: https://docs.estuary.dev +- MongoDB capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/ +- Tinybird materialization connector: https://docs.estuary.dev/reference/Connectors/materialization-connectors/tinybird/ +- flowctl: https://docs.estuary.dev/concepts/flowctl/ diff --git a/oracle-capture/README.md b/oracle-capture/README.md index 003fe0b..b8d36b7 100644 --- a/oracle-capture/README.md +++ b/oracle-capture/README.md @@ -1,51 +1,151 @@ +# Oracle CDC Capture Demo with Estuary (Free Oracle 23.6 in Docker) -# Oracle Capture Example +Stream change data capture (CDC) from a free, local **Oracle Database 23.6** into [Estuary](https://estuary.dev) in real time. This example spins up Oracle Database 23ai Free in Docker, configures the LogMiner-based replication user that Estuary's Oracle connector needs, exposes the database to the internet with an ngrok TCP tunnel, and captures live `INSERT`/`UPDATE`/`DELETE` events into an Estuary collection — no Oracle license or Oracle Cloud account required. -This project provides a demo Oracle instance in a Docker container to test capturing Oracle data with Estuary. +Video walkthrough: https://www.youtube.com/watch?v=mE7LFSqfwY8 -See the video at: https://www.youtube.com/watch?v=mE7LFSqfwY8 +## Architecture -## Prerequisites +Estuary is fully managed, so it connects to the Oracle database over the public internet. Because Oracle runs locally in Docker here, an ngrok TCP tunnel publishes port `1521` to a reachable `host:port` that you paste into the capture config. 
+ +``` +Oracle 23.6 Free (Docker) Estuary (managed) + ┌──────────────────┐ + │ FREE database │ ┌──────────────────────┐ + │ c##estuary_flow_ │ ngrok │ source-oracle │ + │ user │ TCP │ capture (LogMiner) │ + │ inventory table │──tunnel──► │ │ │ + │ FLOW_WATERMARKS │ :1521 │ ▼ │ + │ ARCHIVELOG mode │ │ collection (JSON) │ + └──────────────────┘ │ │ │ + │ ▼ │ + │ materialization │ + │ (your destination) │ + └──────────────────────┘ +``` + +- **Capture (source):** the `source-oracle` connector uses Oracle LogMiner to read redo/archive logs and emit row-level CDC events. +- **Collection:** each captured table lands in an Estuary collection — a schematized JSON stream backed by cloud storage. +- **Materialization (optional):** push the collection downstream to a warehouse, database, or lake. Or transform it first with a derivation. + +## What's included -To run this project, you will need to have: -* A verified Ngrok account -* Docker installed -* An [Estuary account](https://estuary.dev) +- **`docker-compose.yaml`** — Brings up two services: `oracle-db` (image `oracle/database:23.6.0-free`, port `1521`, `ENABLE_ARCHIVELOG: true`) and `ngrok` (image `ngrok/ngrok:latest`) running `tcp oracle-db:1521` with its inspector UI on port `4040`. Oracle data persists to `./data` (gitignored). +- **`config/init.sql`** — Runs automatically on first boot (mounted at `/opt/oracle/scripts/setup`). It creates the common replication user `c##estuary_flow_user`, grants the privileges Estuary requires (`SELECT ANY TABLE`, `LOGMINING`, `SELECT_CATALOG_ROLE`, `EXECUTE_CATALOG_ROLE`, `SET CONTAINER`, etc.), enables supplemental logging (`ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS`), creates the required `FLOW_WATERMARKS` watermarks table, and seeds a sample `inventory` table with three rows. +- **`.gitignore`** — Excludes `.DS_Store` and the local Oracle data volume (`data/*`). -You do *not* need to have an Oracle license or Oracle Cloud account: we use a free demo instance. 
+## Prerequisites + +- **Docker** (with `docker compose`). +- A **verified ngrok account** and authtoken — required because the local Oracle database must be reachable by Estuary's managed connector. Sign up at https://dashboard.ngrok.com. +- A free **Estuary account**: https://dashboard.estuary.dev. +- The **Oracle 23.6 Free container image** built locally (see Setup step 1). You do *not* need an Oracle license or Oracle Cloud account. ## Setup -1. Build a container image of the free Oracle Database version 23.6. +### 1. Build the Oracle 23.6 Free container image + +The compose file expects a local image tagged `oracle/database:23.6.0-free`. Oracle publishes the build scripts on GitHub: https://github.com/oracle/docker-images/tree/main/OracleDatabase/SingleInstance. + +Clone that repo, switch to `OracleDatabase/SingleInstance`, and build: + +```bash +./buildContainerImage.sh -v 23.6.0 -f +``` + +### 2. Configure secrets in the compose file + +Edit `docker-compose.yaml` and set: + +- `ORACLE_PWD` — the `SYS`/`SYSTEM` password for the database (replace `YOUR-PW`). +- `NGROK_AUTHTOKEN` — your ngrok authtoken (replace `YOUR-TOKEN`). + +Then edit `config/init.sql` and change the password for the Estuary user. By default it is: + +```sql +CREATE USER c##estuary_flow_user IDENTIFIED BY test123 CONTAINER=ALL; +``` + +Replace `test123` with a strong password and remember it — you'll paste it into the capture config. + +### 3. Start the stack + +```bash +docker compose up +``` + +Wait for the database to finish initializing (Oracle's first boot is slow). `init.sql` runs automatically once the database is ready, creating the user, grants, watermarks table, and sample `inventory` data. + +### 4. Enable backups so LogMiner has archive logs to read + +Oracle's CDC needs archived redo logs available. 
With the container running, open a shell and run an RMAN backup: + +```bash +docker exec -it <your-container> bash +rman +``` + +Inside `rman`, log in as a DBA (enter your `ORACLE_PWD` when prompted) and configure retention + run a backup: + +``` +CONNECT TARGET "sys@FREE AS SYSDBA" +CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS; +BACKUP DATABASE PLUS ARCHIVELOG; +``` + +### 5. Get the public ngrok endpoint + +The ngrok inspector is exposed on port `4040`. Open http://localhost:4040 to see the public TCP forwarding address, or grab it from the command line: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +This returns something like `tcp://6.tcp.ngrok.io:14732`. **Strip the `tcp://` prefix** before pasting it into Estuary — the connector wants `host:port` (e.g. `6.tcp.ngrok.io:14732`). + +## Configure the Estuary capture + +Create a new capture in the Estuary dashboard with the **Oracle (Real-time)** / `source-oracle` connector: https://dashboard.estuary.dev/captures. + +Enter these values (matching `docker-compose.yaml` and `config/init.sql`): + +| Field | Value | +| --- | --- | +| **Server address** | Your ngrok endpoint, `host:port` (no `tcp://`) | +| **User** | `c##estuary_flow_user` | +| **Password** | The password you set for `c##estuary_flow_user` in `config/init.sql` | +| **Database** | `FREE` | + +Click **Next** to let Estuary discover tables, select the `inventory` table (and any others), then **Save and Publish**. Estuary backfills the existing rows and then tails the redo logs for new changes. - You can download Oracle's Docker images [here](https://github.com/oracle/docker-images/tree/main). Find instructions for working with Database images in [this directory](https://github.com/oracle/docker-images/tree/main/OracleDatabase/SingleInstance). 
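The `tcp://` stripping step for the server address can be automated. This is a small sketch using only the standard library; `ngrok_to_address` is a hypothetical helper name, and the input is assumed to be the `public_url` string returned by the ngrok inspector API shown in step 5.

```python
from urllib.parse import urlsplit


def ngrok_to_address(public_url: str) -> str:
    """Convert 'tcp://host:port' into the 'host:port' form the capture expects."""
    parts = urlsplit(public_url)
    if parts.scheme != "tcp" or not parts.hostname or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {public_url!r}")
    return f"{parts.hostname}:{parts.port}"


if __name__ == "__main__":
    # Example value taken from the setup step above.
    print(ngrok_to_address("tcp://6.tcp.ngrok.io:14732"))
```

Rejecting anything that is not a `tcp://` URL catches the common mistake of pasting an HTTP tunnel address, which the Oracle connector cannot use.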
+Connector reference: https://docs.estuary.dev/reference/Connectors/capture-connectors/OracleDB/ - You can, for example, make a local copy of their repo, navigate to the `/OracleDatabase/SingleInstance` path, and run: +## Verify - ``` - ./buildContainerImage.sh -v 23.6.0 -f - ``` +- In the dashboard, open the capture and watch the **bytes/docs read** metrics climb after publishing. +- Browse the captured collection in the Estuary UI — you should see the three seeded `inventory` rows (`Popcorn`, `Caramel corn`, `Cheese popcorn`). +- Test live CDC by inserting a row in Oracle and confirming it appears in the collection: -2. Clone this project and update the following variables: + ```sql + INSERT INTO c##estuary_flow_user.inventory VALUES ('3456-nopq', 'Kettle corn', 549, 40); + COMMIT; + ``` - * `ORACLE_PWD`: Environment variable in the compose file that will be the root password for your database - * `NGROK_AUTHTOKEN`: Environment variable in the compose file; paste in your own ngrok token to authenticate your account - * Update the password in `init.sql` for the Estuary database user +If you use [flowctl](https://docs.estuary.dev/concepts/flowctl/), you can tail the collection directly: -3. Run the container and make final database configurations. +```bash +flowctl collections read --collection <your-collection-name> --uncommitted | head +``` - 1. Run with `docker compose up` and wait for the database to complete setup - 2. Open a bash session in the running container: `docker exec -it <your-container> bash` - 3. Start `rman` - 4. Log in as a DBA, entering your `ORACLE_PWD` when prompted: `CONNECT TARGET "sys@FREE AS SYSDBA"` - 5. Update the retention policy: `CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;` - 6. Kick off a backup: `BACKUP DATABASE PLUS ARCHIVELOG;` +## Next steps -4. 
Create a new Oracle data capture [in Estuary](https://dashboard.estuary.dev/captures) and enter: +- **Materialize the data** to a destination (Snowflake, BigQuery, Postgres, MotherDuck, and more): https://dashboard.estuary.dev/materializations. +- **Transform with a derivation** in SQL, TypeScript, or Python: https://docs.estuary.dev/concepts/derivations/. - * **Server address:** Your ngrok endpoint (should be available from your ngrok dashboard; remove the `tcp://` protocol from the string) - * **User:** `c##estuary_flow_user` - * **Password:** Password for `c##estuary_flow_user` in `init.sql` - * **Database:** `FREE` +## References -5. Save and publish the capture. You can now view test data in an Estuary data collection, send it downstream with a materialization, or transform it with derivations. +- Video walkthrough: https://www.youtube.com/watch?v=mE7LFSqfwY8 +- Oracle capture connector docs: https://docs.estuary.dev/reference/Connectors/capture-connectors/OracleDB/ +- Estuary docs: https://docs.estuary.dev +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Oracle Database Docker images: https://github.com/oracle/docker-images/tree/main/OracleDatabase/SingleInstance diff --git a/postgres-cdc-bigquery-dbt/README.md b/postgres-cdc-bigquery-dbt/README.md index 3f27bd0..b168676 100644 --- a/postgres-cdc-bigquery-dbt/README.md +++ b/postgres-cdc-bigquery-dbt/README.md @@ -1,5 +1,161 @@ -# PostgreSQL CDC to BigQuery ELT data flow +# Real-Time PostgreSQL CDC to BigQuery with Estuary and dbt -This projects showcases the components of a CDC data flow that streams data from PostgreSQL into BigQuery, then transforms the results with dbt. +This example demonstrates an end-to-end ELT pipeline that streams change data capture (CDC) events from PostgreSQL into Google BigQuery in real time using [Estuary](https://estuary.dev), then models the raw, append-only data into analytics-ready tables with [dbt](https://www.getdbt.com/). 
A local Postgres instance with logical replication enabled emits inserts, updates, and deletes that Estuary captures and materializes to BigQuery, where a dbt project deduplicates and incrementalizes the stream into clean `sales` tables. -See blog post for details: https://estuary.dev/efficient-elt-with-estuary-flow-and-dbt/ +## Architecture + +The pipeline follows the standard Estuary data movement pattern, finished off with an in-warehouse dbt transformation: + +``` +PostgreSQL (logical replication) + │ CDC events (insert / update / delete) + ▼ +Estuary capture ──► Estuary collection ──► Estuary materialization + (source-postgres) (schematized JSON) (materialize-bigquery) + │ + ▼ + BigQuery raw table + │ dbt run + ▼ + stg_sales (view) ──► sales (incremental) +``` + +- **Capture**: the `source-postgres` connector reads the Postgres write-ahead log via the `flow_publication` publication and streams every row change into an Estuary **collection** (a real-time, schematized JSON data lake backed by cloud storage). Each document carries Estuary metadata such as `_meta_op` (the CDC operation: `c`/`u`/`d`), `flow_published_at`, and `flow_document`. +- **Materialization**: the `materialize-bigquery` connector continuously pushes the collection into a BigQuery table. +- **Transform**: the dbt project in [`sales_dbt_project/`](./sales_dbt_project) reads that BigQuery table as a source, stages it in `stg_sales`, then builds an **incremental** `sales` model that filters out deletes (`_meta_op != 'd'`) and only processes rows newer than the last run using `flow_published_at`. + +## What's included + +- **`docker-compose.yml`** — spins up three services: `postgres` (Postgres with `wal_level=logical`, exposed on port `5432`), `datagen` (continuous data generator), and `ngrok` (a TCP tunnel so Estuary's managed connector can reach your local database). 
+- **`postgres/init.sql`** — initializes Postgres for CDC: grants `REPLICATION` and `pg_read_all_data` to the `postgres` user, creates the `public.flow_watermarks` table, creates the `flow_publication` publication, and creates the `public.sales` table used as the source. +- **`datagen/`** — a Python container (`datagen.py`, `Dockerfile`, `requirements.txt`) that connects to Postgres and continuously inserts (70%), deletes (20%), and updates (10%) rows in the `sales` table every second using [Faker](https://faker.readthedocs.io/), producing a realistic CDC stream. +- **`sales_dbt_project/`** — the dbt project that transforms the materialized BigQuery data. See its own [README](./sales_dbt_project/README.md) and the models in [`sales_dbt_project/models/`](./sales_dbt_project/models). +- **`requirements.txt`** — Python dependencies for running dbt locally (`dbt-core==1.8.0`, `dbt-bigquery==1.8.1`). + +### The `sales` source table + +`postgres/init.sql` creates the table that the capture reads: + +| Column | Type | Notes | +| ------------- | --------------- | ------------------------------ | +| `sale_id` | `SERIAL` | Primary key | +| `product_id` | `INTEGER` | `NOT NULL` | +| `customer_id` | `INTEGER` | `NOT NULL` | +| `sale_date` | `TIMESTAMP` | `NOT NULL` | +| `quantity` | `INTEGER` | `NOT NULL` | +| `unit_price` | `NUMERIC(10,2)` | `NOT NULL` | +| `total_price` | `NUMERIC(10,2)` | `NOT NULL` | + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose +- A free [ngrok](https://ngrok.com/) account and authtoken (required because Estuary is fully managed and must reach your local Postgres over a public TCP tunnel) +- A free [Estuary account](https://dashboard.estuary.dev) +- A Google Cloud project with **BigQuery** and a [service account JSON key](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/) that can write to your target dataset (used by both the Estuary materialization and dbt) +- Python 3.12+ to 
run dbt locally (`pip install -r requirements.txt`) + +## Setup + +### 1. Start Postgres, the data generator, and the ngrok tunnel + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up --build -d +``` + +This launches Postgres (initialized via `postgres/init.sql`), starts the `datagen` container generating CDC traffic, and opens an ngrok TCP tunnel to `postgres:5432`. + +### 2. Get the public database endpoint + +Open the ngrok web UI at [http://localhost:4040](http://localhost:4040), or grab the public address from the command line: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +You'll get something like `tcp://6.tcp.ngrok.io:18922`. **Strip the `tcp://` prefix** — the host and port are what you'll paste into Estuary. + +## Configure the Estuary capture + +Create the PostgreSQL capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) (search for **PostgreSQL** / `source-postgres`) or with `flowctl`. Use the values from `docker-compose.yml` and `postgres/init.sql`: + +| Field | Value | +| ---------------- | ------------------------------------------------ | +| Server Address | the ngrok host:port (e.g. `6.tcp.ngrok.io:18922`) | +| User | `postgres` | +| Password | `postgres` | +| Database | `postgres` | + +The connector will discover the `public.sales` table (added to `flow_publication` in `init.sql`) and bind it to an Estuary collection. The pre-created `public.flow_watermarks` table and the `wal_level=logical` setting are what make CDC work. + +Connector reference: [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). + +## Configure the Estuary materialization + +Create a BigQuery materialization from the [dashboard](https://dashboard.estuary.dev/materializations) (search for **BigQuery** / `materialize-bigquery`) and bind the `sales` collection from the capture above. 
Provide: + +- Your Google Cloud **Project ID** +- The target **Dataset** (in this example the dbt project reads from dataset `dani_dev` — set yours and update the dbt source accordingly) +- The **Service Account JSON** key with BigQuery write access +- A **Cloud Storage bucket** Estuary uses to stage data before loading into BigQuery + +Connector reference: [BigQuery materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/). + +> Note: the dbt project's source ([`sales_dbt_project/models/staging_models.yml`](./sales_dbt_project/models/staging_models.yml)) points at `database: estuary-theatre`, `schema: dani_dev`, `table: sales2`. Update these to match your own BigQuery project, dataset, and the table name your materialization writes to. + +## Transform with dbt + +Once data is landing in BigQuery, run the dbt project to build the staging view and incremental table. + +```bash +pip install -r requirements.txt +``` + +Configure a `sales_dbt_project` profile in `~/.dbt/profiles.yml` pointing at your BigQuery project (using the same service account), then: + +```bash +cd sales_dbt_project +dbt deps +dbt run +dbt test +``` + +This builds two models: + +- **`stg_sales`** — a 1:1 staging view over the Estuary source table that selects the business columns plus Estuary metadata (`_meta_op`, `flow_published_at`, `flow_document`). +- **`sales`** — an `incremental` model that excludes deleted rows (`_meta_op != 'd'`) and, on incremental runs, only ingests rows with `flow_published_at` newer than the latest already loaded, keeping transforms fast and append-friendly. + +See [`sales_dbt_project/README.md`](./sales_dbt_project/README.md) and [`sales_dbt_project/models/`](./sales_dbt_project/models) for the model definitions and column documentation. 
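To make the incremental filter described for the `sales` model concrete, here is a minimal Python sketch of the same selection logic. It is an illustration only, not the dbt model shipped in `sales_dbt_project/`: the function name `incremental_batch` and the sample rows are hypothetical, and timestamps are assumed to be ISO 8601 UTC strings so that plain string comparison orders them correctly.

```python
def incremental_batch(source_rows, already_loaded):
    """Return the source rows an incremental run would append.

    Mirrors the two conditions described for the `sales` model:
    skip deletes, and keep only rows published after the newest loaded row.
    """
    # High-water mark: the latest flow_published_at already in the target.
    high_water = max(
        (row["flow_published_at"] for row in already_loaded), default=None
    )
    return [
        row
        for row in source_rows
        if row["_meta_op"] != "d"
        and (high_water is None or row["flow_published_at"] > high_water)
    ]


# Hypothetical sample data (ISO 8601 UTC strings compare correctly as text).
loaded = [{"sale_id": 1, "_meta_op": "c", "flow_published_at": "2025-01-01T12:00:00Z"}]
source = loaded + [
    {"sale_id": 2, "_meta_op": "c", "flow_published_at": "2025-01-01T12:00:05Z"},
    {"sale_id": 1, "_meta_op": "d", "flow_published_at": "2025-01-01T12:00:06Z"},
    {"sale_id": 3, "_meta_op": "u", "flow_published_at": "2025-01-01T12:00:07Z"},
]

new_rows = incremental_batch(source, loaded)
print([row["sale_id"] for row in new_rows])
```

On a first run, when nothing has been loaded yet, passing an empty `already_loaded` list keeps every non-delete row, which corresponds to a full build of the table.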
+ +## Verify + +Confirm the pipeline end to end: + +- **Capture/materialization metrics**: check the docs read/written counters on each task in the [Estuary dashboard](https://dashboard.estuary.dev). +- **Read the collection** directly with flowctl: + + ```bash + flowctl collections read --collection <your-tenant>/sales --uncommitted | head + ``` + +- **Query BigQuery** to see live data and CDC operations arriving: + + ```sql + SELECT _meta_op, COUNT(*) FROM `your_project.your_dataset.sales` GROUP BY _meta_op; + ``` + +- **Check the dbt output**: after `dbt run`, the `sales` table should contain no rows where `_meta_op = 'd'`. + +## Next steps + +- Add more source tables to `flow_publication` and re-discover the capture to stream additional collections. +- Extend the dbt project with marts (aggregations, dimensions) on top of the `sales` model. +- Swap the BigQuery materialization for another destination such as [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/) or [Databricks](https://docs.estuary.dev/reference/Connectors/materialization-connectors/databricks/). 
+ +## Resources + +- Blog: [Efficient ELT with Estuary and dbt](https://estuary.dev/efficient-elt-with-estuary-flow-and-dbt/) +- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [BigQuery materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) diff --git a/postgres-cloudsql-simple-capture/README.md b/postgres-cloudsql-simple-capture/README.md index cb60cd8..5d15747 100644 --- a/postgres-cloudsql-simple-capture/README.md +++ b/postgres-cloudsql-simple-capture/README.md @@ -1 +1,188 @@ -# Simple example for GCP CloudSQL PostgreSQL \ No newline at end of file +# PostgreSQL CDC Capture for Google Cloud SQL with Estuary + +A self-contained example for setting up a real-time **PostgreSQL CDC (Change Data Capture)** pipeline with [Estuary](https://dashboard.estuary.dev), targeting **Google Cloud SQL for PostgreSQL**. It ships a Docker Compose stack (local Postgres + ngrok TCP tunnel) so you can test the Estuary `source-postgres` connector end-to-end without a cloud database, plus a data generator (`datagen/`) written against the **Cloud SQL Python Connector** for streaming continuous inserts, updates, and deletes into a `sales` table on a real Cloud SQL instance. + +The Postgres bootstrap (`postgres/init.sql`) provisions everything Estuary needs for logical-replication CDC: a `flow_capture` replication user, a `flow_watermarks` table, and a `flow_publication` publication. + +## Architecture + +Estuary ingests row-level changes from Postgres via logical replication and lands them in a real-time collection, which you can then materialize to any destination. 
+ +``` +PostgreSQL (Cloud SQL or local Docker) + wal_level=logical + publication: flow_publication + tables: public.sales, public.flow_watermarks + │ (logical replication / CDC) + ▼ +Estuary source-postgres capture ──► Collection (schematized JSON in your tenant) + │ + ▼ +Materialization (BigQuery, Snowflake, Postgres, …) ← optional, your choice +``` + +- **Source / Capture:** the `source-postgres` connector reads inserts, updates, and deletes from `public.sales` over a logical replication slot. +- **Collection:** captured documents are written to an Estuary collection (a real-time data lake of schematized JSON in cloud storage). +- **Materialization:** add a destination connector later to push the collection into your warehouse, database, or lake. Not included here — this example focuses on the capture. + +Because Estuary is fully managed, the Postgres database must be reachable from the public internet. For Cloud SQL you expose a public IP and authorized network (or use the connector's SSH tunnel); for the local Docker Postgres in this repo, the included **ngrok** service opens a TCP tunnel. + +## What's included + +- **`docker-compose.yml`** — spins up three services: + - `postgres` — `postgres:latest` started with `wal_level=logical`, database `postgres`, user/password `postgres`/`postgres`, exposed on host port `5432`. Mounts `postgres/init.sql` as a Docker entrypoint init script. + - `datagen` — builds and runs the data generator (see `datagen/`). + - `ngrok` — `ngrok/ngrok:latest` running `tcp postgres:5432` to expose the local Postgres publicly; its inspection UI is on host port `4040`. +- **`postgres/init.sql`** — runs once on first container start. 
Creates the `flow_capture` user with `REPLICATION` (password `password`), grants `pg_read_all_data` / `pg_write_all_data`, creates the `public.flow_watermarks` table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), adds `public.flow_watermarks` and `public.sales` to it, and creates the `public.sales` table. +- **`datagen/datagen.py`** — connects to a **Google Cloud SQL** instance using the [Cloud SQL Python Connector](https://github.com/GoogleCloudPlatform/cloud-sql-python-connector) (`pg8000` driver), creates the `sales` table if needed, and loops once per second performing weighted random operations: 70% inserts, 20% deletes, 10% updates. This generates the CDC event stream the Estuary capture consumes. +- **`datagen/Dockerfile`** — `python:3.12` image that installs requirements and runs `python -u datagen.py`. +- **`datagen/requirements.txt`** — `Faker==25.1.0` and `cloud-sql-python-connector[pg8000]`. Note: `datagen.py` also imports `sqlalchemy` and `python-dotenv`, which aren't pinned here — install them too (`pip install SQLAlchemy python-dotenv`) or add them to this file before running the generator outside Docker. + +### `sales` table schema + +| Column | Type | Notes | +|---------------|-----------------|------------------------| +| `sale_id` | `SERIAL` | Primary key | +| `product_id` | `INTEGER` | `NOT NULL` | +| `customer_id` | `INTEGER` | `NOT NULL` | +| `sale_date` | `TIMESTAMP` | `NOT NULL` | +| `quantity` | `INTEGER` | `NOT NULL` | +| `unit_price` | `NUMERIC(10,2)` | `NOT NULL` | +| `total_price` | `NUMERIC(10,2)` | `NOT NULL` | + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose (for the local test path). +- A free [Estuary account](https://dashboard.estuary.dev). +- An [ngrok account](https://ngrok.com) and authtoken (required to expose the local Docker Postgres so the hosted connector can reach it). 
+- **For the Cloud SQL data generator only:** a Google Cloud project with a Cloud SQL for PostgreSQL instance, the Cloud SQL Admin API enabled, and [Application Default Credentials](https://cloud.google.com/docs/authentication/application-default-credentials) configured for the Cloud SQL Python Connector. + +## Two ways to run + +This example supports two paths. The data generator (`datagen.py`) is written for **Cloud SQL**; the Docker Compose stack provides a **local Postgres** so you can test the capture mechanics without a cloud database. + +> Note: the `datagen` service env vars in `docker-compose.yml` (`POSTGRES_HOST`, `POSTGRES_PORT`, …) point at the local `postgres` service, but `datagen.py` reads `DB_NAME`, `DB_USER`, `DB_PASSWORD`, and `GCP_PROJECT_ID` / `GCP_REGION` / `GCP_CLOUDSQL_INSTANCE_NAME` to build a Cloud SQL connection. To exercise the generator against real Cloud SQL, run it as a standalone script with those variables set (see below). For a purely local test, the `postgres` service alone is enough to wire up and verify the capture. + +### Option A: Local Postgres + ngrok (fast capture test) + +Start the stack with your ngrok authtoken set: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up -d +``` + +Get the public host:port for the Postgres tunnel: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' +# e.g. tcp://2.tcp.ngrok.io:14823 +``` + +Or open the ngrok inspector at <http://localhost:4040> and read the forwarding address. Strip the `tcp://` prefix when pasting into Estuary: the **Server Address** field takes `host:port` (e.g. `2.tcp.ngrok.io:14823`). + +### Option B: Real Google Cloud SQL + +Provision your Cloud SQL instance with logical replication enabled (set the `cloudsql.logical_decoding` flag to `on`), then run `postgres/init.sql` against it to create the `flow_capture` user, `flow_watermarks` table, and `flow_publication`.
Run the generator locally against the instance: + +```bash +cd datagen +pip install -r requirements.txt + +export GCP_PROJECT_ID=<your-project> +export GCP_REGION=<your-region> +export GCP_CLOUDSQL_INSTANCE_NAME=<your-instance> +export DB_NAME=postgres +export DB_USER=postgres +export DB_PASSWORD=<your-password> + +python -u datagen.py +``` + +The script prints the resolved instance connection name (`<project>:<region>:<instance>`) and then logs each insert/update/delete. + +## Configure the Estuary capture + +Use the [`source-postgres`](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) connector. Connector image: `ghcr.io/estuary/source-postgres:dev`. + +### Via the Estuary dashboard + +1. Go to <https://dashboard.estuary.dev/captures> and click **New Capture**. +2. Search for and select **PostgreSQL**. +3. Fill in the endpoint config using the values from this example: + + | Field | Local (ngrok) | Cloud SQL | + |------------|----------------------------------------|----------------------------------------| + | Server Address | `<host>:<port>` from the ngrok tunnel | Cloud SQL public IP : `5432` | + | User | `flow_capture` | `flow_capture` | + | Password | `password` | `password` | + | Database | `postgres` | `postgres` | + +4. Save and publish. The connector discovers `public.sales` (and `public.flow_watermarks`) and proposes bindings. Keep the `sales` binding to stream the generated data into a collection. + +> The `flow_capture` user and `password` are defined in `postgres/init.sql`. Change them for any non-demo deployment. 
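For the local (ngrok) path, the **Server Address** value in the table above is the tunnel's public URL with the `tcp://` scheme removed. The following is a small sketch of that conversion, assuming the URL has the `tcp://host:port` shape that the ngrok tunnels API returns; the helper name `ngrok_server_address` is hypothetical and not part of this repo.

```python
from urllib.parse import urlsplit


def ngrok_server_address(public_url: str) -> str:
    """Turn an ngrok TCP URL such as tcp://2.tcp.ngrok.io:14823 into host:port."""
    parts = urlsplit(public_url)
    # Reject anything that is not a complete tcp://host:port URL.
    if parts.scheme != "tcp" or not parts.hostname or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {public_url!r}")
    return f"{parts.hostname}:{parts.port}"


print(ngrok_server_address("tcp://2.tcp.ngrok.io:14823"))
```

Validating the scheme up front helps catch the common mistake of pasting the inspector URL (`http://localhost:4040`) instead of the forwarding address.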
+ +### Via flowctl + +Authenticate and discover from your `source-postgres` config, then publish: + +```bash +flowctl auth login +flowctl discover --source flow.yaml # generates bindings from the endpoint config +flowctl catalog publish --source flow.yaml --auto-approve +``` + +A minimal capture spec for this source looks like: + +```yaml +captures: + YOUR_PREFIX/postgres-cloudsql/source-postgres: + endpoint: + connector: + image: ghcr.io/estuary/source-postgres:dev + config: + address: "<host>:<port>" # ngrok host:port, or Cloud SQL IP:5432 + user: flow_capture + password: password + database: postgres + bindings: + - resource: + name: sales + namespace: public + target: YOUR_PREFIX/postgres-cloudsql/sales +``` + +See the [flowctl docs](https://docs.estuary.dev/concepts/flowctl/) for installation and auth details. + +## Verify + +Confirm data is flowing: + +- In the dashboard, open the capture and watch the documents/bytes counters increase as `datagen` writes to `sales`. +- With flowctl, tail the collection: + + ```bash + flowctl collections read --collection YOUR_PREFIX/postgres-cloudsql/sales --uncommitted | head + ``` + + You should see `sales` rows arriving with `_meta` CDC fields reflecting inserts, updates, and deletes. + +## Cleanup + +```bash +docker compose down -v +``` + +The `-v` flag removes the Postgres volume so the next `up` re-runs `init.sql` from scratch. + +## Next steps + +- Add a [materialization](https://dashboard.estuary.dev/materializations) to push the `sales` collection into BigQuery, Snowflake, Postgres, or another destination. +- Transform the stream with a [derivation](https://docs.estuary.dev/concepts/derivations/) in SQL, TypeScript, or Python. 
+ +## References + +- Estuary docs: <https://docs.estuary.dev> +- PostgreSQL capture connector reference: <https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/> +- flowctl: <https://docs.estuary.dev/concepts/flowctl/> +- Estuary dashboard: <https://dashboard.estuary.dev> diff --git a/postgres-measure-wal-throughput/README.md b/postgres-measure-wal-throughput/README.md index 49b06b1..8a3a781 100644 --- a/postgres-measure-wal-throughput/README.md +++ b/postgres-measure-wal-throughput/README.md @@ -1,5 +1,170 @@ -# Measure WAL throughput in PostgreSQL +# Measure PostgreSQL WAL Throughput to Size CDC Pipelines with Estuary -Learn how to approximate change event throughput for PostgreSQL by measuring the Write-ahead-log (WAL). +Measure PostgreSQL Write-Ahead-Log (WAL) throughput to approximate the change-event volume a Change Data Capture (CDC) pipeline will produce **before** you build it. This example spins up a self-contained PostgreSQL instance (configured for logical replication, exactly like an Estuary CDC source), continuously generates inserts/updates/deletes, samples the current WAL LSN every minute with `pg_cron`, and exposes SQL views that report bytes-per-second and rolling-window WAL rates. The same Postgres is pre-wired with a `flow_capture` replication user, publication, and watermarks table so you can point an Estuary capture at it over an ngrok tunnel and compare measured WAL throughput against real CDC throughput. -See blog post for details: https://estuary.dev/measuring-postgresql-wal-throughput/ +Reference blog post: https://estuary.dev/measuring-postgresql-wal-throughput/ + +## Why measure WAL throughput + +WAL is PostgreSQL's source of truth for every committed change, and logical-replication CDC connectors (including Estuary's `source-postgres`) read the change stream from it. The byte volume PostgreSQL writes to the WAL over a time window is a close upper bound on the data a CDC connector has to decode and ship. 
Sampling `pg_current_wal_lsn()` at fixed intervals and diffing consecutive LSNs with `pg_wal_lsn_diff()` gives you a concrete bytes/second figure you can use to size connectors, estimate egress, and forecast cost without first standing up the full pipeline. + +## How it works + +The technique is pure SQL plus a scheduler: + +``` +pg_cron (every 1 min) ──> record_current_wal_lsn() + │ INSERT pg_current_wal_lsn() into wal_lsn_history + ▼ + wal_lsn_history (timestamp, lsn_position) + │ diff consecutive LSNs with pg_wal_lsn_diff() + ├──> wal_volume_analytics (per-sample bytes & rate) + └──> wal_volume_summary (5 min / 15 min / 1 hr / 1 day rollups) +``` + +- `record_current_wal_lsn()` captures the current WAL LSN and appends it to `wal_lsn_history`. +- `pg_cron` runs that function once per minute via the scheduled job `Record WAL LSN every minute` (`*/1 * * * *`). +- `wal_volume_analytics` diffs each sample against the previous one to report `wal_bytes_since_previous`, `bytes_per_second`, and a pretty rate. +- `wal_volume_summary` aggregates samples into `Last 5 minutes`, `Last 15 minutes`, `Last hour`, and `Last day` windows with total size and average rate. + +When data is actively changing (the `datagen` service drives that), the rates in those views approximate the change-event throughput a CDC pipeline would carry. + +### Optional: validate against a real Estuary CDC pipeline + +Because the database is already configured for logical replication, you can also stand up a live Estuary capture and compare: + +``` +PostgreSQL (wal_level=logical, flow_publication, flow_watermarks) + │ exposed via ngrok TCP tunnel + ▼ +Estuary capture (source-postgres) ──> collection (public/sales) +``` + +The Estuary `source-postgres` connector reads the `public.sales` table through the `flow_publication` publication and streams every insert/update/delete into an Estuary collection, where you can observe the actual document/byte throughput in the dashboard. 
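The sample-and-diff arithmetic behind these views can be sketched outside the database. The Python below is an approximation for illustration, not the SQL in `init.sql`: it assumes LSNs arrive in PostgreSQL's textual `high/low` hexadecimal form and reproduces what `pg_wal_lsn_diff()` computes, the byte distance between two positions. The function names and the sample values are hypothetical.

```python
from datetime import datetime


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN like '0/16B3748' into an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def wal_rates(samples):
    """Return (bytes_since_previous, bytes_per_second) per consecutive pair.

    `samples` is a list of (timestamp, lsn_text) tuples ordered oldest first.
    """
    rates = []
    for (t0, lsn0), (t1, lsn1) in zip(samples, samples[1:]):
        delta_bytes = lsn_to_int(lsn1) - lsn_to_int(lsn0)
        seconds = (t1 - t0).total_seconds()
        rates.append((delta_bytes, delta_bytes / seconds if seconds else 0.0))
    return rates


# Hypothetical one-minute samples, as the pg_cron job would record them.
samples = [
    (datetime(2025, 1, 1, 12, 0), "0/1000000"),
    (datetime(2025, 1, 1, 12, 1), "0/1600000"),
    (datetime(2025, 1, 1, 12, 2), "1/0"),
]
for delta_bytes, bytes_per_second in wal_rates(samples):
    print(delta_bytes, round(bytes_per_second, 1))
```

The third sample crosses from `0/...` into `1/...`, which is why the conversion has to combine both halves of the LSN rather than compare only the part after the slash.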
+ +## What's included + +- `docker-compose.yml` — defines three services: + - `postgres` (container `postgres-wal-measure`, hostname `postgres`) built from `postgres/Dockerfile`. Started with `wal_level=logical`, `log_statement=all`, and `shared_preload_libraries=pg_cron`. Exposes port `5432`. + - `datagen` (container `datagen-wal-measure`) built from `datagen/Dockerfile`. Continuously writes to the `sales` table to produce WAL activity. + - `ngrok` (container `ngrok-wal-measure`, image `ngrok/ngrok:latest`) runs `tcp postgres:5432` to expose the local database to Estuary's managed connectors. The ngrok inspector is published on port `4040`. +- `postgres/Dockerfile` — `postgres:16` plus the `postgresql-16-cron` package (`pg_cron`). +- `postgres/init.sql` — the heart of the example. Creates the `flow_capture` replication user, grants, the `flow_watermarks` table, the `flow_publication` publication, the `sales` table, the `wal_lsn_history` table, the `record_current_wal_lsn()` function, the `wal_volume_analytics` and `wal_volume_summary` views, enables the `pg_cron` extension, and schedules the per-minute LSN sampling job. +- `datagen/Dockerfile` — `python:3.12` image that installs `requirements.txt` and runs `datagen.py`. +- `datagen/datagen.py` — connects to Postgres and loops forever, performing a weighted mix of `insert` (70%), `delete` (20%), and `update` (10%) on the `sales` table, one operation per second. +- `datagen/requirements.txt` — `Faker==25.1.0` and `psycopg2==2.9.9`. + +## Prerequisites + +- Docker and Docker Compose. +- An ngrok account and authtoken (free tier works) — required only if you want to connect a hosted Estuary capture to the local database. Get a token at https://dashboard.ngrok.com/get-started/your-authtoken. +- A free Estuary account at https://dashboard.estuary.dev — required only for the optional live-CDC comparison. + +You do **not** need Estuary, ngrok, or flowctl just to measure WAL throughput. 
The measurement runs entirely inside the Postgres container. + +## Running it + +From this directory, set your ngrok token and start the stack: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up --build +``` + +If you only want to measure WAL throughput and skip the tunnel, start just the database and data generator: + +```bash +docker compose up --build postgres datagen +``` + +On startup, `init.sql` runs automatically, `pg_cron` schedules the per-minute sampling job, and `datagen` begins writing to the `sales` table. Let it run for several minutes so the WAL history accumulates enough samples to diff. + +## Inspect WAL throughput + +Open a `psql` session inside the running container: + +```bash +docker exec -it postgres-wal-measure psql -U postgres -d postgres +``` + +Per-sample rate (newest first): + +```sql +SELECT * FROM wal_volume_analytics; +``` + +Example columns returned: `wal_bytes_since_previous`, `wal_size_since_previous`, `bytes_per_second`, `rate_pretty`, `seconds_since_previous`. + +Rolling-window summary: + +```sql +SELECT * FROM wal_volume_summary; +``` + +Returns one row per window (`Last 5 minutes`, `Last 15 minutes`, `Last hour`, `Last day`) with `samples`, `total_wal_size`, `avg_wal_per_minute`, and `avg_rate`. + +Check the raw samples or the live LSN directly: + +```sql +SELECT * FROM wal_lsn_history ORDER BY timestamp DESC LIMIT 10; +SELECT pg_current_wal_lsn(); +``` + +The `avg_rate` from `wal_volume_summary` is your headline number: the approximate WAL throughput, and therefore the approximate CDC change-event volume, for the captured workload. + +## Optional: connect an Estuary CDC capture + +To compare measured WAL throughput against a real Estuary CDC pipeline, expose the database and point a capture at it. + +### 1. 
Get the public ngrok endpoint + +With the `ngrok` service running, read the public TCP address from the inspector at http://localhost:4040, or via: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +This prints something like `tcp://6.tcp.ngrok.io:14108`. Strip the `tcp://` prefix when pasting into Estuary — you want `6.tcp.ngrok.io` as the host and `14108` as the port. + +### 2. Create the capture + +In the Estuary dashboard, create a new PostgreSQL capture at https://dashboard.estuary.dev/captures and use the **PostgreSQL** source connector (`source-postgres`). Connector docs: https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ + +Use the connection values that `postgres/init.sql` and `docker-compose.yml` provision: + +| Field | Value | +| --- | --- | +| Server Address | `<ngrok-host>:<ngrok-port>` (from step 1) | +| Database | `postgres` | +| User | `flow_capture` | +| Password | `password` | + +The connector discovers the `public.sales` table, reads it through the `flow_publication` publication, and uses `public.flow_watermarks` for consistent backfills — all already created by `init.sql`. Select the `public.sales` binding and publish the capture. + +### 3. Verify and compare + +Confirm change events are flowing into the collection: + +```bash +flowctl collections read --collection <YOUR_PREFIX>/public/sales --uncommitted | head +``` + +Then compare: + +- **Measured WAL throughput** from `wal_volume_summary.avg_rate`. +- **Actual CDC throughput** from the capture's bytes/docs metrics in the Estuary dashboard, or via `flowctl catalog status <capture-name>`. + +The two should track closely, validating WAL sampling as a planning tool for sizing CDC pipelines. + +## Next steps + +- Adjust the per-minute schedule in `init.sql` (`*/1 * * * *`) to sample more or less frequently for finer/coarser resolution. 
+- Replace the synthetic `datagen.py` workload with a snapshot of your real write patterns to forecast production CDC volume. +- Use the measured throughput to plan an Estuary capture and materialization to your warehouse or lakehouse of choice. + +## Resources + +- Blog: [Measuring PostgreSQL WAL Throughput](https://estuary.dev/measuring-postgresql-wal-throughput/) +- Estuary PostgreSQL capture connector docs: https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Estuary documentation: https://docs.estuary.dev +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ diff --git a/postgres-simple-capture/README.md b/postgres-simple-capture/README.md index 41ad3d6..0e7a022 100644 --- a/postgres-simple-capture/README.md +++ b/postgres-simple-capture/README.md @@ -1,9 +1,174 @@ -# Demo for a simple PostgreSQL capture +# Real-Time PostgreSQL CDC Capture with Estuary (Docker + ngrok Demo) + +A minimal, self-contained PostgreSQL change data capture (CDC) demo for [Estuary](https://estuary.dev). It spins up a logical-replication-ready PostgreSQL database with Docker, generates a continuous stream of inserts, updates, and deletes against a `public.sales` table, and exposes the database over an ngrok TCP tunnel so Estuary's fully managed `source-postgres` connector can stream every row change into a real-time Estuary collection. + +This is the simplest possible end-to-end example for learning how Estuary captures Postgres CDC: no cloud database, no external dependencies beyond Docker and free ngrok/Estuary accounts. 
+ +## Architecture + +``` +┌────────────┐ INSERT/UPDATE/DELETE ┌──────────────┐ +│ datagen │ ───────────────────────────► │ PostgreSQL │ +│ (Python) │ public.sales │ wal_level= │ +└────────────┘ │ logical │ + └──────┬───────┘ + │ logical replication + │ (flow_publication) + ┌─────▼──────┐ + │ ngrok │ tcp postgres:5432 + │ TCP tunnel │ + └─────┬──────┘ + │ public host:port + ┌─────────▼──────────┐ + │ Estuary │ + │ source-postgres │ capture + │ ▼ │ + │ collection │ real-time data lake + └────────────────────┘ +``` + +End-to-end data flow in Estuary terms: + +1. **Source** — A local PostgreSQL instance running with `wal_level=logical`, a `flow_capture` replication user, a `flow_publication` publication, and a `flow_watermarks` table (Estuary's CDC bookkeeping table). +2. **Tunnel** — Because Estuary is fully managed, the local database is published to the internet through an ngrok TCP tunnel pointing at `postgres:5432`. +3. **Capture** — Estuary's `source-postgres` connector reads the Postgres write-ahead log (WAL) via the publication and streams every change into an Estuary **collection** (a schematized, real-time JSON dataset backed by cloud storage). +4. **Materialization (optional)** — From the collection you can add a materialization to push rows into any supported destination (BigQuery, Snowflake, Postgres, etc.). That step is left to you. + +## What's included + +- `docker-compose.yml` — Defines three services: `postgres` (PostgreSQL started with `wal_level=logical`, exposed on port `5432`), `datagen` (the load generator), and `ngrok` (TCP tunnel to `postgres:5432`, dashboard on port `4040`). +- `postgres/init.sql` — Runs on first boot. Creates the `flow_capture` replication user, grants read/write access, creates the `flow_watermarks` table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), creates the `public.sales` table, and adds both tables to the publication. 
+- `datagen/datagen.py` — Connects to Postgres and, once per second, performs a randomized insert / update / delete (weighted 70% / 10% / 20%) against `public.sales` using Faker-generated data, producing a steady CDC workload. +- `datagen/Dockerfile` — Builds the Python 3.12 generator image. +- `datagen/requirements.txt` — Python dependencies: `Faker==25.1.0`, `psycopg2==2.9.9`. + +### The `sales` table + +The capture streams the `public.sales` table created by `init.sql`: + +| Column | Type | Notes | +|---------------|-----------------|----------------------| +| `sale_id` | `SERIAL` | Primary key | +| `product_id` | `INTEGER` | not null | +| `customer_id` | `INTEGER` | not null | +| `sale_date` | `TIMESTAMP` | not null | +| `quantity` | `INTEGER` | not null | +| `unit_price` | `NUMERIC(10,2)` | not null | +| `total_price` | `NUMERIC(10,2)` | not null | + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [ngrok account](https://dashboard.ngrok.com/signup) and an authtoken (required to expose the local database to Estuary's hosted connector). +- A free [Estuary account](https://dashboard.estuary.dev). ## Setup -1. Start containers: `docker compose up` -2. Get PostgreSQL URL: `curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url'` -3. Set up Estuary capture -4. ??? -5. Profit! \ No newline at end of file +1. Set your ngrok authtoken in the environment (the `ngrok` service reads `NGROK_AUTHTOKEN`): + + ```bash + export NGROK_AUTHTOKEN=<your-ngrok-authtoken> + ``` + +2. Start the stack: + + ```bash + docker compose up -d + ``` + + This builds the `datagen` image, starts PostgreSQL with `wal_level=logical`, runs `postgres/init.sql`, begins generating change events, and opens the ngrok TCP tunnel. + +3. 
Get the public host and port that Estuary will connect to: + + ```bash + curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' + ``` + + You can also open the ngrok web dashboard at http://localhost:4040. The URL looks like `tcp://0.tcp.ngrok.io:12345`. **Strip the `tcp://` prefix** — you will paste `0.tcp.ngrok.io` as the host and `12345` as the port into Estuary. + +## Configure the Estuary capture + +The capture uses Estuary's **PostgreSQL** source connector, `source-postgres` ([connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/)). + +### Option A — Estuary dashboard (recommended) + +1. Go to [dashboard.estuary.dev/captures](https://dashboard.estuary.dev/captures) and click **New Capture**. +2. Search for and select **PostgreSQL**. +3. Fill in the endpoint configuration using the values from this demo: + + | Field | Value | + |------------|----------------------------------------------------| + | Server Address | the ngrok host:port from the step above (e.g. `0.tcp.ngrok.io:12345`) | + | Database | `postgres` | + | User | `flow_capture` | + | Password | `password` | + +4. Click **Next**. Estuary discovers the available tables; select `public.sales` (the `public.flow_watermarks` table is internal bookkeeping and can be left unbound). +5. **Save and Publish**. Estuary backfills the existing rows, then streams new inserts, updates, and deletes from the WAL in real time. + +### Option B — flowctl + +Prefer the CLI? 
Authenticate and use [flowctl](https://docs.estuary.dev/concepts/flowctl/): + +```bash +flowctl auth login +``` + +Create a `flow.yaml` similar to the following (replace `your-prefix` with your Estuary tenant prefix and set `address` to the ngrok host:port): + +```yaml +captures: + your-prefix/postgres-simple/source-postgres: + endpoint: + connector: + image: ghcr.io/estuary/source-postgres:dev + config: + address: 0.tcp.ngrok.io:12345 + database: postgres + user: flow_capture + password: password + bindings: + - resource: + namespace: public + stream: sales + target: your-prefix/postgres-simple/public/sales +``` + +Then publish and check status: + +```bash +flowctl catalog publish --source flow.yaml --auto-approve +flowctl catalog status your-prefix/postgres-simple/source-postgres +``` + +## Verify + +Confirm data is flowing into the collection: + +```bash +flowctl collections read \ + --collection your-prefix/postgres-simple/public/sales \ + --uncommitted | head +``` + +Or watch live throughput and document counts on the capture's page in the [Estuary dashboard](https://dashboard.estuary.dev/captures). Because `datagen` runs continuously, you should see a steady stream of new documents, plus update and delete events reflecting the WAL changes. + +## Next steps + +- Add a **materialization** to push the `sales` collection into a destination such as [BigQuery](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/), [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/), or [PostgreSQL](https://docs.estuary.dev/reference/Connectors/materialization-connectors/PostgreSQL/) from [dashboard.estuary.dev/materializations](https://dashboard.estuary.dev/materializations). +- Transform the collection in SQL, TypeScript, or Python with a [derivation](https://docs.estuary.dev/concepts/derivations/). 
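
To go one step beyond `head` in the Verify section above, you can tally which CDC operations are arriving. The following is a minimal sketch, not part of this repository. It assumes `flowctl collections read` prints one JSON document per line and that each document records its operation at `_meta.op` (`c`, `u`, or `d`). The function name `count_cdc_ops` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_cdc_ops(lines: Iterable[str]) -> Counter:
    """Tally the CDC operation recorded on each captured document."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between documents
        doc = json.loads(line)
        # Assumption: the operation type lives at `_meta.op`;
        # documents without it are counted as "unknown".
        op = doc.get("_meta", {}).get("op", "unknown")
        counts[op] += 1
    return counts
```

Calling `count_cdc_ops(sys.stdin)` from a small script lets you pipe a bounded sample of the `flowctl collections read` output into it. With `datagen` running, creates should dominate, with updates and deletes also present, if the assumptions above hold.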
+ +## Cleanup + +```bash +docker compose down -v +``` + +Disable or delete the capture in the Estuary dashboard so it stops attempting to reach the (now closed) ngrok tunnel. + +## References + +- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector reference](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) +- [Estuary dashboard](https://dashboard.estuary.dev) diff --git a/postgresql-cdc-databricks-fraud-detection/README.md b/postgresql-cdc-databricks-fraud-detection/README.md index bc91ecc..be9f268 100644 --- a/postgresql-cdc-databricks-fraud-detection/README.md +++ b/postgresql-cdc-databricks-fraud-detection/README.md @@ -1,5 +1,143 @@ -# Real-time fraud detection with Estuary & Databricks +# Real-Time Fraud Detection with PostgreSQL CDC to Databricks using Estuary -How to implement a real-time fraud detection pipeline with Estuary and Databricks +This example demonstrates a real-time fraud detection pipeline that streams change data capture (CDC) events from PostgreSQL into Databricks using [Estuary](https://estuary.dev). A local Postgres instance with logical replication enabled emits a continuous stream of transaction inserts, updates, and deletes — including deliberately injected anomalies (unusually high and low amounts) — which Estuary captures and materializes into Databricks tables, ready for SQL-based fraud analysis on the lakehouse. -See blog post for details: https://estuary.dev/real-time-fraud-detection-databricks/ +Companion blog post: [Real-Time Fraud Detection with Estuary and Databricks](https://estuary.dev/real-time-fraud-detection-databricks/). + +## Architecture + +The pipeline follows the standard Estuary data movement pattern: a capture reads the source into collections, and a materialization pushes those collections to the destination. 
+ +``` +PostgreSQL (wal_level=logical) + │ CDC events (insert / update / delete) on users + transactions + ▼ +Estuary capture ──► Estuary collections ──► Estuary materialization + (source-postgres) (schematized JSON) (materialize-databricks) + │ + ▼ + Databricks tables (Unity Catalog) + │ SQL fraud analysis + ▼ + anomalous transactions surfaced +``` + +- **Capture**: the `source-postgres` connector reads the Postgres write-ahead log via the `flow_publication` publication and streams every row change into Estuary **collections** (a real-time, schematized JSON data lake backed by cloud storage). Each document carries Estuary metadata such as the CDC operation type, alongside the source columns. +- **Materialization**: the `materialize-databricks` connector continuously pushes the `users` and `transactions` collections into Databricks tables via Unity Catalog. +- **Analysis**: because the anomalous transactions land in Databricks in real time, you can run SQL (or notebooks / SQL alerts) to flag outliers — e.g. amounts far above or below the normal `10.0`–`1000.0` range — as they arrive. + +## What's included + +- **`docker-compose.yml`** — spins up three services: `postgres` (Postgres with `wal_level=logical`, exposed on port `5432`), `datagen` (a continuous transaction generator), and `ngrok` (a TCP tunnel so Estuary's managed connector can reach your local database). Container names are `postgres-cdc-databricks-postgres`, `postgres-cdc-databricks-datagen`, and `postgres-cdc-databricks-ngrok`. +- **`postgres/init.sql`** — initializes Postgres for CDC: grants `REPLICATION` and `pg_read_all_data` to the `postgres` user, creates the `public.flow_watermarks` table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), adds the `public.users` and `public.transactions` tables to the publication, and seeds 20 fake users. 
+- **`datagen/`** — a Python container (`datagen.py`, `Dockerfile`, `requirements.txt`) that connects to Postgres and every second performs a random insert (70%), delete (20%), or update (10%) against the `transactions` table using [Faker](https://faker.readthedocs.io/). Roughly 10% of generated transactions are anomalies: a 5% chance of an unusually high amount (`1000.0`–`10000.0`) and a 5% chance of an unusually low amount (`0.01`–`1.0`); the rest are normal (`10.0`–`1000.0`). This produces a realistic CDC stream with fraud-like outliers. + +### Source tables + +`postgres/init.sql` creates the two tables that the capture reads: + +**`public.users`** + +| Column | Type | Notes | +| ------------------- | -------------- | ------------------------------ | +| `user_id` | `SERIAL` | Primary key | +| `name` | `VARCHAR(100)` | | +| `email` | `VARCHAR(100)` | | +| `registration_date` | `TIMESTAMP` | Defaults to `CURRENT_TIMESTAMP`| + +**`public.transactions`** + +| Column | Type | Notes | +| ------------------ | --------------- | ------------------------------ | +| `transaction_id` | `SERIAL` | Primary key | +| `user_id` | `INT` | References a user (1–20) | +| `transaction_date` | `TIMESTAMP` | Defaults to `CURRENT_TIMESTAMP`| +| `amount` | `DECIMAL(10,2)` | Normal or anomalous value | + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose +- A free [ngrok](https://ngrok.com/) account and authtoken (required because Estuary is fully managed and must reach your local Postgres over a public TCP tunnel) +- A free [Estuary account](https://dashboard.estuary.dev) +- A [Databricks](https://www.databricks.com/) workspace with a **SQL Warehouse** and a **Unity Catalog** catalog/schema you can write to, plus a [personal access token](https://docs.estuary.dev/reference/Connectors/materialization-connectors/databricks/) for the materialization + +## Setup + +### 1. 
Start Postgres, the data generator, and the ngrok tunnel + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up --build -d +``` + +This launches Postgres (initialized via `postgres/init.sql`), starts the `datagen` container generating CDC traffic against the `transactions` table, and opens an ngrok TCP tunnel to `postgres:5432`. + +### 2. Get the public database endpoint + +Open the ngrok web UI at [http://localhost:4040](http://localhost:4040), or grab the public address from the command line: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +``` + +You'll get something like `tcp://6.tcp.ngrok.io:18922`. **Strip the `tcp://` prefix** — the host and port are what you'll paste into Estuary. + +## Configure the Estuary capture + +Create the PostgreSQL capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) (search for **PostgreSQL** / `source-postgres`) or with `flowctl`. Use the values from `docker-compose.yml` and `postgres/init.sql`: + +| Field | Value | +| ---------------- | -------------------------------------------------- | +| Server Address | the ngrok host:port (e.g. `6.tcp.ngrok.io:18922`) | +| User | `postgres` | +| Password | `postgres` | +| Database | `postgres` | + +The connector will discover the `public.users` and `public.transactions` tables (added to `flow_publication` in `init.sql`) and bind each to an Estuary collection. The pre-created `public.flow_watermarks` table and the `wal_level=logical` setting are what make CDC work. + +Connector reference: [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). + +## Configure the Estuary materialization + +Create a Databricks materialization from the [dashboard](https://dashboard.estuary.dev/materializations) (search for **Databricks** / `materialize-databricks`) and bind the `users` and `transactions` collections from the capture above. 
Provide: + +- Your Databricks **Server Hostname** and **HTTP Path** for the target SQL Warehouse +- The Unity Catalog **Catalog** and **Schema** to write into +- A Databricks **personal access token** with write access + +Connector reference: [Databricks materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/databricks/). + +## Verify + +Confirm the pipeline end to end: + +- **Capture/materialization metrics**: check the docs read/written counters on each task in the [Estuary dashboard](https://dashboard.estuary.dev). +- **Read a collection** directly with flowctl: + + ```bash + flowctl collections read --collection <your-tenant>/transactions --uncommitted | head + ``` + +- **Query Databricks** to see live data and surface anomalies arriving in real time: + + ```sql + SELECT transaction_id, user_id, amount, transaction_date + FROM <catalog>.<schema>.transactions + WHERE amount > 1000.0 OR amount < 1.0 + ORDER BY transaction_date DESC; + ``` + +## Next steps + +- Build a Databricks SQL alert or notebook that flags anomalous transactions (e.g. amount above `1000.0` or below `1.0`) as they land. +- Join `transactions` to `users` in Databricks to attribute suspicious activity to specific accounts. +- Add more source tables to `flow_publication` and re-discover the capture to stream additional collections. +- Swap Databricks for another destination such as [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/) or [BigQuery](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/). 
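
The threshold rule used in the Verify query and in the first next step can also be sketched in Python, for example inside a notebook. This is an illustrative sketch only: the thresholds come from the ranges described in this README, while the function names `classify_amount` and `flag_anomalies` and the dict-shaped rows are assumptions made for the example.

```python
from decimal import Decimal
from typing import Iterable, List, Mapping

# Bounds mirror the SQL filter in the Verify section:
#   amount > 1000.0 OR amount < 1.0
HIGH_THRESHOLD = Decimal("1000.0")
LOW_THRESHOLD = Decimal("1.0")


def classify_amount(amount) -> str:
    """Label an amount as 'high', 'low', or 'normal'."""
    # Go through str() so floats such as 0.01 convert without binary noise.
    value = Decimal(str(amount))
    if value > HIGH_THRESHOLD:
        return "high"
    if value < LOW_THRESHOLD:
        return "low"
    return "normal"


def flag_anomalies(transactions: Iterable[Mapping]) -> List[Mapping]:
    """Keep only the rows whose amount falls outside the normal band."""
    return [t for t in transactions if classify_amount(t["amount"]) != "normal"]
```

Like the SQL filter, the comparison is strict, so amounts of exactly `1000.0` or `1.0` are treated as normal. A production rule would likely need per-user baselines rather than fixed global bounds.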
+ +## Resources + +- Blog: [Real-Time Fraud Detection with Estuary and Databricks](https://estuary.dev/real-time-fraud-detection-databricks/) +- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [Databricks materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/databricks/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) diff --git a/pyiceberg-aws-glue/README.md b/pyiceberg-aws-glue/README.md new file mode 100644 index 0000000..03ef85e --- /dev/null +++ b/pyiceberg-aws-glue/README.md @@ -0,0 +1,118 @@ +# Read Estuary Apache Iceberg Tables with PyIceberg and the AWS Glue Catalog + +Query an Apache Iceberg table that Estuary materialized to Amazon S3, using [PyIceberg](https://py.iceberg.apache.org/) with the **AWS Glue Data Catalog**. This is the read/query side of an Estuary Iceberg materialization: Estuary streams your source data into an Iceberg table registered in Glue, and `main.py` loads that table and scans selected columns into a pandas DataFrame — no Spark, no Trino, no query engine to operate. + +## How it works + +Estuary captures source data into **collections** (a schematized, real-time data lake of JSON in cloud storage), then a **materialization** writes those collections to a destination. With the Amazon S3 Iceberg materialization, each collection lands as an **Apache Iceberg table** in an S3 bucket, with table metadata registered in the **AWS Glue Data Catalog** under a namespace (database). PyIceberg reads that catalog directly. + +``` +Source ──capture──► Estuary collection ──materialization──► Apache Iceberg table (S3) + │ metadata registered in + ▼ + AWS Glue Data Catalog + │ + ▼ + main.py (PyIceberg + Glue) ──► pandas DataFrame +``` + +`main.py`: +1. 
Builds a `GlueCatalog` named `catalog`, authenticating with `AWS_REGION`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` loaded from the environment. +2. Prints `catalog.list_namespaces()` and `catalog.list_tables(namespace=NAMESPACE)` so you can confirm the Glue catalog is reachable and the table exists. +3. Loads the table `{NAMESPACE}.support_requests` with `catalog.load_table(...)`. +4. Runs a `table.scan(...)` selecting the fields `customer_id`, `description`, `request_date`, `request_id`, `request_type`, and `status`, converts the result to a pandas DataFrame with `.to_pandas()`, and prints `df.describe()` and `df.head()`. + +## What's included + +- **`main.py`** — the reader. Constructs the Glue-backed PyIceberg catalog, loads the `support_requests` Iceberg table, scans the selected fields into pandas, and prints summary stats. +- **`requirements.txt`** — pinned dependencies: `pyiceberg==0.6.1` and `boto3==1.34.134`. `main.py` also calls `load_dotenv()`, so install `python-dotenv` as well (see Setup). +- **`.gitignore`** — ignores the local `.env` file that holds your AWS credentials and namespace. + +## Prerequisites + +- **Python 3.8+** and `pip`. +- A **free Estuary account** — sign up at [https://dashboard.estuary.dev](https://dashboard.estuary.dev). +- An existing **Amazon S3 Iceberg materialization** in Estuary that writes the `support_requests` table to S3 and registers it in the **AWS Glue Data Catalog** (see "Configure the Estuary materialization" below). +- **AWS credentials** (access key ID + secret access key) with permission to read the Glue catalog and the underlying S3 data — at minimum `glue:GetDatabase*`, `glue:GetTable*`, `glue:GetPartitions`, and `s3:GetObject` / `s3:ListBucket` on the materialization's bucket. +- The **AWS region** and the **Glue namespace (database)** that the materialization writes to. 
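
The four steps listed under "How it works" could look roughly like the sketch below. This is not the repository's `main.py`; it is a reconstruction from the description above. The helper names are hypothetical, and the catalog property keys (`region_name`, `aws_access_key_id`, `aws_secret_access_key`) are an assumption about what the pinned PyIceberg release accepts for Glue. Newer releases may expect different keys.

```python
import os
from typing import Any, Mapping

# Columns this README lists for the `support_requests` table.
SELECTED_FIELDS = (
    "customer_id", "description", "request_date",
    "request_id", "request_type", "status",
)


def glue_properties(env: Mapping[str, str]) -> dict:
    """Map the .env variables onto Glue catalog properties (assumed key names)."""
    return {
        "region_name": env["AWS_REGION"],
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


def build_catalog(env: Mapping[str, str] = os.environ) -> Any:
    """Construct the Glue-backed catalog named `catalog`."""
    # Imported lazily so the other helpers work without PyIceberg installed.
    from pyiceberg.catalog.glue import GlueCatalog
    return GlueCatalog("catalog", **glue_properties(env))


def read_support_requests(catalog: Any, namespace: str) -> Any:
    """List what the catalog exposes, then scan the table into pandas."""
    print(catalog.list_namespaces())
    print(catalog.list_tables(namespace=namespace))
    table = catalog.load_table(f"{namespace}.support_requests")
    return table.scan(selected_fields=SELECTED_FIELDS).to_pandas()
```

Keeping catalog construction separate from the read makes it possible to exercise the read logic against a stub catalog, without AWS credentials.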
+ +## Setup + +Install the dependencies: + +```bash +pip install -r requirements.txt +pip install python-dotenv +``` + +Create a `.env` file in this directory (it is git-ignored) with your AWS credentials, region, and the Glue namespace used by the materialization: + +```bash +# .env +AWS_REGION=us-east-1 +AWS_ACCESS_KEY_ID=your-access-key-id +AWS_SECRET_ACCESS_KEY=your-secret-access-key +NAMESPACE=your_glue_namespace +``` + +These map directly to the `GlueCatalog(...)` arguments and the `load_table` / `list_tables` calls in `main.py`. `NAMESPACE` must match the Glue database (namespace) that the Estuary materialization writes the `support_requests` table into. + +## Running it + +```bash +python main.py +``` + +Expected output (abbreviated): + +``` +[('your_glue_namespace',), ...] # list_namespaces() +[('your_glue_namespace', 'support_requests'), ...] # list_tables(NAMESPACE) + customer_id request_id ... # df.describe() +... + customer_id description request_date request_id request_type status +0 ... ... ... ... ... ... # df.head() +``` + +If `list_tables` does not show `support_requests`, double-check that `NAMESPACE` matches the materialization's Glue database and that the materialization has published data. + +## Configure the Estuary materialization + +This example only **reads** the Iceberg table. To produce it, set up an **Amazon S3 Iceberg** materialization in Estuary that points at the collection you want to expose as `support_requests`: + +1. In the [Estuary dashboard](https://dashboard.estuary.dev/materializations), create a new materialization and select the **Amazon S3 Iceberg** connector (`ghcr.io/estuary/materialize-s3-iceberg:dev`). +2. Configure the destination: + - **Catalog**: AWS Glue Data Catalog. + - **AWS Region**: the same region you put in `AWS_REGION`. + - **S3 bucket** and **prefix** for the Iceberg data and metadata. + - **AWS access key / secret** with write access to the bucket and Glue. 
+ - **Namespace** (Glue database): the value you put in `NAMESPACE`. +3. Bind your source collection to a table named `support_requests` (the table this example loads). To read different columns, edit the `selected_fields` tuple in `main.py` to match your table's schema. +4. Publish the materialization and let it backfill, then run `python main.py` to query the result. + +Connector reference: [Amazon S3 Iceberg materialization](https://docs.estuary.dev/reference/Connectors/materialization-connectors/amazon-s3-iceberg/). + +> Don't have a pipeline yet? Create a **capture** first (for example PostgreSQL, MySQL, or MongoDB CDC) at [https://dashboard.estuary.dev/captures](https://dashboard.estuary.dev/captures) so there's a collection to materialize into Iceberg. + +## Verify + +- The `list_namespaces()` and `list_tables()` output in `main.py` confirms PyIceberg can reach the Glue catalog and that `support_requests` exists. +- A populated `df.head()` confirms the Iceberg data files in S3 are readable end-to-end. +- Cross-check the row counts and freshness against the materialization's metrics in the [Estuary dashboard](https://dashboard.estuary.dev), or read the source collection directly with [flowctl](https://docs.estuary.dev/concepts/flowctl/): + + ```bash + flowctl collections read --collection <your/collection/name> --uncommitted | head + ``` + +## Next steps + +- Swap pandas for your engine of choice — PyIceberg can also return [PyArrow](https://py.iceberg.apache.org/api/) tables, or you can point DuckDB, Spark, Trino, or Athena at the same Glue catalog and S3 bucket. +- Adjust the `selected_fields` and add row-level filters with `table.scan(row_filter=...)` to push down predicates. +- Build the full real-time lakehouse: capture a source, optionally transform it with a [derivation](https://docs.estuary.dev/concepts/derivations/), and materialize it to Iceberg for query. 
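
As a hedged illustration of the `row_filter` suggestion above, the sketch below pushes a predicate on `status` into the scan. The function name `scan_by_status`, the default column subset, and the example value `open` are hypothetical; check the actual values in your table. It relies on PyIceberg accepting a string expression for `row_filter`.

```python
from typing import Any, Sequence


def scan_by_status(
    table: Any,
    status: str,
    fields: Sequence[str] = ("request_id", "customer_id", "status"),
) -> Any:
    """Scan only rows with the given status and return a pandas DataFrame."""
    # Reject quotes rather than trying to escape them inside the expression.
    if "'" in status:
        raise ValueError("status must not contain single quotes")
    return table.scan(
        row_filter=f"status == '{status}'",  # predicate evaluated during the scan
        selected_fields=tuple(fields),       # project only the needed columns
    ).to_pandas()
```

For example, `scan_by_status(catalog.load_table(f"{NAMESPACE}.support_requests"), "open")` would return only matching rows, assuming `open` is a status your data actually uses.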
+ +## References + +- PyIceberg documentation: [https://py.iceberg.apache.org/](https://py.iceberg.apache.org/) +- Amazon S3 Iceberg materialization connector: [https://docs.estuary.dev/reference/Connectors/materialization-connectors/amazon-s3-iceberg/](https://docs.estuary.dev/reference/Connectors/materialization-connectors/amazon-s3-iceberg/) +- Estuary docs: [https://docs.estuary.dev](https://docs.estuary.dev) +- Estuary dashboard: [https://dashboard.estuary.dev](https://dashboard.estuary.dev) diff --git a/python-derivations/README.md b/python-derivations/README.md new file mode 100644 index 0000000..156d1e9 --- /dev/null +++ b/python-derivations/README.md @@ -0,0 +1,123 @@ +# Python Derivations in Estuary: Stateless Transforms, Stateful Aggregation, Streaming Joins, and ML Feature Engineering + +Four self-contained example projects that show how to transform real-time collections with **Python derivations** in [Estuary](https://estuary.dev). Each project reads the same captured PostgreSQL `shipments` collection and demonstrates a distinct derivation pattern: a stateless map/transform, a stateful per-customer aggregation with persisted state and JSON merge-patch, a streaming left join that enriches shipments with reference data, and an append-only feature pipeline that emits labeled training rows for shipment delay prediction. + +These examples assume you already have an upstream capture producing a shipments collection (see [`../shipments-datagen`](../shipments-datagen) for the synthetic PostgreSQL source). Every derivation here sources from the collection `Artificial-Industries/postgres-shipments/public/shipments` and writes into the `dani-demo/python-derivations/...` namespace — rename both to match your own tenant before publishing. + +## What is a Python derivation? + +A [derivation](https://docs.estuary.dev/concepts/derivations/) is an Estuary collection that is continuously computed from one or more source collections. 
As new documents arrive in a source, Estuary runs your transform and produces output documents in the derived collection. Derivations can be written in SQL, TypeScript, or **Python** — these examples use Python. + +A Python derivation is a class implementing the generated `IDerivation` interface, with one `async` method per named transform. Each method receives a `read.doc` (a typed Pydantic model of the source document, including CDC metadata at `doc._meta`/`doc.m_meta`) and `yield`s output `Document`s. Derivations can be: + +- **Stateless** — pure functions of the input document (`shipments-stateless`). +- **Stateful** — they restore persisted state in `__init__(open)`, maintain it in memory, and durably persist it in `start_commit()` (`shipments-stateful`, `shipments-joins`, `shipments-ai`). State is partitioned by the transform's `shuffle.key`, so each worker only manages state for its assigned keys. + +Stateful derivations persist their state with a **JSON merge patch** (`merge_patch=True`), so each transaction only writes the keys that changed — not the entire state document. This scales to millions of keys with thousands changing per transaction. + +## The four projects + +| Folder | Pattern | Source(s) | Derived collection | Key | +| --- | --- | --- | --- | --- | +| [`shipments-stateless/`](shipments-stateless) | Stateless map / field derivation | `shipments` | `processed-shipments` | `/id` | +| [`shipments-stateful/`](shipments-stateful) | Stateful per-customer aggregation + merge-patch | `shipments` | `customer-metrics` | `/customer_id` | +| [`shipments-joins/`](shipments-joins) | Streaming left join (enrichment) | `shipments` + `customer-tiers` (Google Sheet) | `enriched-shipments` | `/shipment_id` | +| [`shipments-ai/`](shipments-ai) | ML feature engineering (append-only training rows) | `shipments` | `shipment-delay-training` | `/shipment_id` | + +### `shipments-stateless` — stateless map/transform + +The simplest pattern. 
The `shipments` transform reads each shipment document and emits a reshaped one: it concatenates `street_address` + `city` into `full_address`, derives an `is_urgent` flag (priority shipments or `delayed`/`critical` status), builds a `status_summary` string, and computes `days_until_delivery` from `expected_delivery_date`. No state, no `start_commit`. The transform uses `shuffle: any` and `backfill: 1`. See [`shipments-stateless/processed-shipments.flow.py`](shipments-stateless/processed-shipments.flow.py). + +### `shipments-stateful` — stateful aggregation with persisted state + +Maintains a running profile per customer (`total_shipments`, `on_time_count`, `late_count`, `active_shipments`, `avg_delivery_days`, `is_vip`, `last_shipment_date`). It restores state in `__init__`, processes CDC operations (`doc._meta.op` of `c`/`u`/`d`) to correctly handle status transitions (e.g. `In Transit` → `Delivered`) and deletions without double-counting, then persists only the customers it touched via a merge patch in `start_commit()`. The transform shuffles on `/customer_id` so each customer's state lives on one worker. This folder ships its own [`README.md`](shipments-stateful/README.md) with a Mermaid diagram of the full lifecycle. See [`shipments-stateful/customer-metrics.flow.py`](shipments-stateful/customer-metrics.flow.py). + +### `shipments-joins` — streaming left join / enrichment + +A continuously-maintained LEFT JOIN across two collections. The `shipments` transform (left side, `shuffle: any`) stores each shipment in state and emits it enriched with whatever tier data is known. The `customer_tiers` transform (right side, from the Google Sheet collection `dani-demo/customer-tiers/Sheet1`, shuffled on `/customer_id`) stores tier reference data and **re-emits all of that customer's shipments** with the updated `customer_tier`, `customer_region`, and `account_manager`. 
Because it is a left join, shipments are always emitted even when no tier row exists yet (enrichment fields are null). See [`shipments-joins/enriched-shipments.flow.py`](shipments-joins/enriched-shipments.flow.py). + +### `shipments-ai` — ML feature engineering for delay prediction + +An append-only training-data pipeline. It maintains the same per-customer aggregate state as `shipments-stateful`, but emits exactly **one labeled training row per delivered shipment**. The features (`total_shipments`, `on_time_count`, `late_count`, `active_shipments`, `avg_delivery_days`) are snapshotted *before* the current shipment is applied, to avoid label leakage, and the binary `label` (1 = late, 0 = on time) is computed by comparing the CDC `updated_at` delivery time against `expected_delivery_date`. The resulting collection is a ready-to-train feature store you can materialize to a warehouse and feed to a model. See [`shipments-ai/shipment-delay-training.flow.py`](shipments-ai/shipment-delay-training.flow.py). + +## Project layout + +Each subfolder is a self-contained Estuary Python derivation project with the same structure: + +``` +shipments-<pattern>/ +├── flow.yaml # Defines the derived collection: schema, key, transforms, source(s) +├── <name>.flow.py # The Derivation class — your transform logic +├── <name>.schema.yaml # JSON Schema of the derived collection's documents +├── pyproject.toml # Python project (pydantic, pyright); requires-python >=3.12 +├── pyrightconfig.json # Strict type-checking against flow_generated/python +└── flow_generated/ # Auto-generated types (IDerivation, Document, Request, SourceShipments...) +``` + +- **`flow.yaml`** — the catalog spec. It names the derived collection, its `schema` and `key`, selects `using.python.module: <name>.flow.py`, and lists the `transforms` (each with a `name`, a `source` collection, and a `shuffle` of `any` or a `key`). 
+- **`<name>.flow.py`** — a `Derivation(IDerivation)` class with one `async def <transform_name>(self, read) -> AsyncIterator[Document]` method per transform. Stateful variants also implement `__init__(open)`, `start_commit(...)`, and `reset()`. +- **`<name>.schema.yaml`** — the JSON Schema for the derived documents. Estuary validates every emitted document against it. +- **`pyproject.toml`** — declares dependencies (`pydantic>=2.0`, `pyright>=1.1`) and `requires-python = ">=3.12"`. +- **`pyrightconfig.json`** — points Pyright at `flow_generated/python` and enables `strict` mode for fully-typed transforms. +- **`flow_generated/`** — generated by `flowctl`. It contains the typed `IDerivation`, `Document`, `Request`, `Response`, and source models (e.g. `SourceShipments`) imported at the top of each `.flow.py`. Do not edit by hand; regenerate with `flowctl generate`. + +## Prerequisites + +- A free [Estuary account](https://dashboard.estuary.dev). +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and authenticated — derivations are deployed and their types generated via the CLI. +- Python 3.12+ (for local editing, type-checking, and the catalog tests). +- An existing **source collection** for these derivations to read from. All four read `Artificial-Industries/postgres-shipments/public/shipments`; `shipments-joins` additionally reads a Google Sheets collection (`dani-demo/customer-tiers/Sheet1`). Stand up the upstream shipments capture with [`../shipments-datagen`](../shipments-datagen) (PostgreSQL CDC via the [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/)), or repoint the `source` fields at your own collections. + +## Deploy a derivation + +Pick a project and publish it with `flowctl`. The CLI builds the Python module, generates types, runs any catalog tests, and deploys the derived collection. 
+ +```bash +# Authenticate once +flowctl auth login + +# Deploy one of the derivations +cd shipments-stateful +flowctl catalog publish --source flow.yaml --auto-approve +``` + +Before publishing, edit `flow.yaml` to match your environment: + +1. Rename the derived collection from `dani-demo/python-derivations/<name>` to a name under your tenant. +2. Point each transform's `source` at the real collection name produced by your capture (replace `Artificial-Industries/postgres-shipments/public/shipments`, and for `shipments-joins` the `dani-demo/customer-tiers/Sheet1` reference). +3. Keep the `import` path in the `.flow.py` aligned with the collection name (the generated package mirrors the collection path, e.g. `dani_demo.python_derivations.customer_metrics`). Run `flowctl generate --source flow.yaml` after renaming to refresh `flow_generated/`. + +To (re)generate the typed stubs without publishing: + +```bash +flowctl generate --source flow.yaml +``` + +## Verify + +Confirm the derived collection is producing documents: + +```bash +# Stream the derived collection (replace with your renamed collection) +flowctl collections read --collection dani-demo/python-derivations/customer-metrics --uncommitted | head + +# Check the task's control-plane status +flowctl catalog status dani-demo/python-derivations/customer-metrics +``` + +You can also watch document counts climb on the collection's page in the [Estuary dashboard](https://dashboard.estuary.dev). For example, `customer-metrics` produces one document per `customer_id` whose counters update as new shipment events arrive, while `shipment-delay-training` grows by one row each time a shipment is delivered. + +## Next steps + +- Materialize any of these derived collections to a warehouse or database to power dashboards or model training — see the [materialization connectors](https://docs.estuary.dev/reference/Connectors/materialization-connectors/). 
+- Adapt the patterns: change the aggregation logic in `shipments-stateful`, add more reference sources to `shipments-joins`, or extend the feature set / labeling rule in `shipments-ai`. +- Compare with the TypeScript and SQL derivation examples elsewhere in this repo (e.g. [`../derivations-ad-performance`](../derivations-ad-performance), [`../derivations-sql-full-outer-join`](../derivations-sql-full-outer-join)). + +## References + +- Estuary docs: https://docs.estuary.dev +- Derivations concept: https://docs.estuary.dev/concepts/derivations/ +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Collections: https://docs.estuary.dev/concepts/collections/ +- PostgreSQL capture connector (upstream source): https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Materialization connectors (downstream destinations): https://docs.estuary.dev/reference/Connectors/materialization-connectors/ diff --git a/python-derivations/shipments-ai/README.md b/python-derivations/shipments-ai/README.md new file mode 100644 index 0000000..b1765a5 --- /dev/null +++ b/python-derivations/shipments-ai/README.md @@ -0,0 +1,129 @@ +# Real-Time ML Feature Engineering with Python Derivations in Estuary: Shipment Delay Prediction + +A Python [derivation](https://docs.estuary.dev/concepts/derivations/) in [Estuary](https://estuary.dev) that turns a real-time PostgreSQL CDC stream of `shipments` into an append-only, labeled training dataset for **shipment delay prediction**. As each shipment is delivered, the derivation emits exactly one feature row — per-customer aggregate features plus a binary `label` (1 = late, 0 = on time) — into the derived collection `dani-demo/python-derivations/shipment-delay-training`, ready to materialize to a warehouse and feed to a model. + +This is a stateful, streaming feature store: features are computed incrementally from the change stream and snapshotted *before* the current delivery is applied to avoid label leakage. 
No batch jobs, no nightly recompute. + +## Architecture + +``` +Artificial-Industries/postgres-shipments/public/shipments (source collection, Postgres CDC) + │ transform "shipments" (shuffle key: /customer_id) + ▼ +shipment-delay-training.flow.py (Python derivation: per-customer state + labeling) + │ one Document per delivered shipment + ▼ +dani-demo/python-derivations/shipment-delay-training (derived collection, key: /shipment_id) + │ (optional) + ▼ +materialization → warehouse / feature store → model training +``` + +- **Source**: the existing collection `Artificial-Industries/postgres-shipments/public/shipments`, produced upstream by a [PostgreSQL CDC capture](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). +- **Transform**: a single transform named `shipments`, shuffled on `/customer_id` so each customer's running state lives on exactly one worker. +- **Derivation**: maintains per-customer counters (`CustomerState`) restored on startup, updated per CDC event (`op` of `c`/`u`/`d`), and persisted via JSON merge patch in `start_commit()`. +- **Output**: the derived collection `dani-demo/python-derivations/shipment-delay-training`, keyed by `/shipment_id`. Because the key is the shipment and a row is only emitted once per delivery, the collection is effectively append-only training data. + +## How the derivation works + +The transform logic lives in [`shipment-delay-training.flow.py`](shipment-delay-training.flow.py). For every shipment document it receives: + +1. **Skip malformed records** — documents missing `customer_id` or `shipment_status` are ignored. +2. **Snapshot features first** — before applying the current update, it snapshots the customer's current aggregates (`total_shipments`, `on_time_count`, `late_count`, `active_shipments`, `avg_delivery_days`). Snapshotting before the update prevents the delivery being scored from leaking into its own features. +3. 
**Apply the CDC update to state** — increments `total_shipments` for newly seen shipments, adjusts `active_shipments` on status transitions (active = `In Transit` / `At Checkpoint` / `Out for Delivery`), and on a first-time transition to `Delivered` updates the delivery counters via `_record_delivery()` (computing delivery days from `created_at` → `updated_at`, and on-time vs. late from `expected_delivery_date`). +4. **Emit one training row per delivered shipment** — only when a shipment transitions to `Delivered` does it `yield` a `Document` carrying the snapshotted features and a `label`. The label is computed by `_late_label()`: `expected_delivery_date` is treated as end-of-day (23:59:59 UTC) and compared against the CDC `updated_at` delivery time — `1` if delivered after that, else `0`. + +State persistence: + +- `__init__(open)` restores `State` from `open.state`. +- A `touched_customers` set tracks which customers changed during the transaction. +- `start_commit()` returns a `StartedCommit` with `merge_patch=True`, writing only the touched customers' state — not the entire state document. This scales to large customer counts with only the changed keys written per transaction. +- `reset()` clears in-memory state (used for catalog tests). + +## What's included + +| File | Role | +| --- | --- | +| [`flow.yaml`](flow.yaml) | The catalog spec for the derived collection `dani-demo/python-derivations/shipment-delay-training`: its `schema`, `key` (`/shipment_id`), the `python` derivation module, and the `shipments` transform sourcing `Artificial-Industries/postgres-shipments/public/shipments` shuffled on `/customer_id`. | +| [`shipment-delay-training.flow.py`](shipment-delay-training.flow.py) | The `Derivation(IDerivation)` class — the feature engineering and labeling logic described above. | +| [`shipment-delay-training.schema.yaml`](shipment-delay-training.schema.yaml) | JSON Schema for the emitted documents. Estuary validates every output row against it. 
| +| [`pyproject.toml`](pyproject.toml) | Python project metadata: `requires-python = ">=3.12"`, dependencies `pydantic>=2.0` and `pyright>=1.1`. | +| [`pyrightconfig.json`](pyrightconfig.json) | Points Pyright at `flow_generated/python` and enables `strict` type checking. | +| `flow_generated/` | Auto-generated typed stubs (`IDerivation`, `Document`, `Request`, `Response`, `SourceShipments`) imported at the top of the `.flow.py`. Regenerate with `flowctl generate`; do not edit by hand. | + +### Output document schema + +Each emitted training row (see [`shipment-delay-training.schema.yaml`](shipment-delay-training.schema.yaml)): + +| Field | Type | Notes | +| --- | --- | --- | +| `shipment_id` | integer | Collection key. | +| `customer_id` | integer | The shipment's customer. | +| `total_shipments` | integer | Feature: customer's shipment count (snapshotted). | +| `on_time_count` | integer | Feature: prior on-time deliveries. | +| `late_count` | integer | Feature: prior late deliveries. | +| `active_shipments` | integer | Feature: currently in-transit shipments. | +| `avg_delivery_days` | number / null | Feature: rolling average delivery days. | +| `label` | integer (0 or 1) | Target: 1 = delivered late, 0 = on time. | + +## Prerequisites + +- A free [Estuary account](https://dashboard.estuary.dev). +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and authenticated. Python derivations are deployed (and their typed stubs generated) through the CLI. +- Python 3.12+ for local editing and `strict` type-checking. +- An existing **source collection** for the derivation to read. This example reads `Artificial-Industries/postgres-shipments/public/shipments`. Stand up the upstream shipments capture (PostgreSQL CDC via the [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/)) — see the other examples in this repo for a synthetic source — or repoint the transform's `source` at your own collection. 
+ +## Deploy + +Authenticate once, then publish the derivation. `flowctl` builds the Python module, generates types, runs any catalog tests, and deploys the derived collection. + +```bash +# Authenticate (opens a browser; paste the CLI token) +flowctl auth login + +# From this folder, publish the derivation +flowctl catalog publish --source flow.yaml --auto-approve +``` + +To (re)generate the typed stubs in `flow_generated/` without publishing: + +```bash +flowctl generate --source flow.yaml +``` + +### Adapt to your tenant + +Before publishing under your own account, edit [`flow.yaml`](flow.yaml): + +1. Rename the derived collection from `dani-demo/python-derivations/shipment-delay-training` to a name under your tenant. +2. Point the `shipments` transform's `source` at the real collection produced by your capture (replacing `Artificial-Industries/postgres-shipments/public/shipments`). +3. Keep the `import` path in `shipment-delay-training.flow.py` aligned with the collection name — the generated package mirrors the collection path (`dani_demo.python_derivations.shipment_delay_training`). Run `flowctl generate --source flow.yaml` after renaming to refresh `flow_generated/`. + +## Verify + +Confirm the derived collection is producing training rows: + +```bash +# Stream the derived collection (replace with your renamed collection) +flowctl collections read --collection dani-demo/python-derivations/shipment-delay-training --uncommitted | head + +# Check the task's control-plane status +flowctl catalog status dani-demo/python-derivations/shipment-delay-training +``` + +The collection grows by one document each time a shipment is delivered. You can also watch document counts climb on the collection's page in the [Estuary dashboard](https://dashboard.estuary.dev). 
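To spot-check the `label` values in the streamed rows, the labeling rule described under "How the derivation works" can be reproduced locally. The sketch below is an illustration under stated assumptions, not the code in `shipment-delay-training.flow.py`: the function name `late_label` is hypothetical, and it assumes `expected_delivery_date` is an ISO date string and `updated_at` is an ISO 8601 timestamp.

```python
from datetime import date, datetime, time, timezone


def late_label(expected_delivery_date: str, updated_at: str) -> int:
    """Return 1 if delivery happened after end-of-day UTC on the expected date, else 0."""
    # Assumption: expected_delivery_date looks like "2026-06-20".
    expected = date.fromisoformat(expected_delivery_date)
    # The README treats the expected date as end-of-day, 23:59:59 UTC.
    deadline = datetime.combine(expected, time(23, 59, 59), tzinfo=timezone.utc)

    # Assumption: updated_at is ISO 8601. A trailing "Z" is normalized because
    # datetime.fromisoformat only accepts it from Python 3.11 onward.
    delivered = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    if delivered.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        delivered = delivered.replace(tzinfo=timezone.utc)

    return 1 if delivered > deadline else 0
```

Under these assumptions, `late_label("2026-06-20", "2026-06-21T08:15:00Z")` returns `1`, while a delivery at exactly `2026-06-20T23:59:59Z` returns `0` because the comparison is strictly "after".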
+ +## Next steps + +- **Materialize the feature store**: push `shipment-delay-training` to a warehouse or database to power model training and dashboards — see the [materialization connectors](https://docs.estuary.dev/reference/Connectors/materialization-connectors/) (e.g. [BigQuery](https://docs.estuary.dev/reference/Connectors/materialization-connectors/BigQuery/), [Snowflake](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Snowflake/)). +- **Extend the feature set or labeling rule**: add features to `CustomerState`/`_snapshot_features()`, or change `_late_label()` (for example, a grace window or a multi-class label). +- **Compare patterns**: this folder is one of several Python derivation examples — see the [parent README](../README.md) for the stateless, stateful aggregation, and streaming-join variants that read the same `shipments` source. + +## References + +- Estuary docs: https://docs.estuary.dev +- Derivations concept: https://docs.estuary.dev/concepts/derivations/ +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Collections: https://docs.estuary.dev/concepts/collections/ +- PostgreSQL capture connector (upstream source): https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Materialization connectors (downstream destinations): https://docs.estuary.dev/reference/Connectors/materialization-connectors/ diff --git a/python-derivations/shipments-joins/README.md b/python-derivations/shipments-joins/README.md new file mode 100644 index 0000000..d42f40f --- /dev/null +++ b/python-derivations/shipments-joins/README.md @@ -0,0 +1,128 @@ +# Streaming Left Join in Python: Enrich Shipments with Customer Reference Data using Estuary Derivations + +A **Python derivation** for [Estuary](https://estuary.dev) that maintains a continuously-updated **streaming left join** across two real-time collections. 
It joins a PostgreSQL `shipments` collection (left side) with a Google Sheets `customer-tiers` reference collection (right side) on `customer_id`, emitting an enriched `enriched-shipments` collection where every shipment carries its customer's `customer_tier`, `customer_region`, and `account_manager`. Because it is a left join, shipments are always emitted — even before a matching tier row exists — and re-emitted whenever either side changes. + +This is the streaming-join / enrichment example in the [Python Derivations collection](../README.md). See the parent README for the shared project layout, what a Python derivation is, and how state and merge-patch persistence work. + +## Architecture + +A derivation is an Estuary collection that is continuously computed from one or more source collections. This one reads from two sources and is **stateful**: it holds both sides of the join in memory (partitioned by join key), persists that state on every transaction, and restores it on restart. + +```text +Artificial-Industries/postgres-shipments/public/shipments (left, CDC) + │ + │ transform: shipments (shuffle: any) + ▼ + ┌──────────────────────────────────────────────────┐ + │ Derivation (enriched-shipments.flow.py) │ + │ in-memory State: customer_id -> { tier, shipments }│ + │ LEFT JOIN on customer_id │ + └──────────────────────────────────────────────────┘ + ▲ + │ transform: customer_tiers (shuffle key: /customer_id) + │ + dani-demo/customer-tiers/Sheet1 (right, Google Sheet reference data) + │ + ▼ + dani-demo/python-derivations/enriched-shipments (derived collection, key: /shipment_id) +``` + +- **Left side — `shipments`** (`Artificial-Industries/postgres-shipments/public/shipments`): each shipment is stored in state and immediately emitted, enriched with whatever tier data is currently known for its `customer_id` (enrichment fields are `null` if no tier row has arrived yet). This is what makes it a *left* join. 
+- **Right side — `customer_tiers`** (`dani-demo/customer-tiers/Sheet1`): when a tier row arrives or changes, the tier is stored and **all of that customer's shipments are re-emitted** with the updated `customer_tier`, `customer_region`, and `account_manager`. +- **Join state** is keyed by `customer_id`. For each customer it holds one `CustomerTier` (right) and a `dict[shipment_id -> ShipmentData]` (left, one-to-many). State is persisted via a JSON **merge patch** (`merge_patch=True`) so each transaction only writes the customers it touched. +- **Output** lands in `dani-demo/python-derivations/enriched-shipments`, keyed by `/shipment_id`. + +### Join and CDC semantics + +Both sources are change streams, so the derivation honors the CDC operation at `doc._meta.op` (`c` create, `u` update, `d` delete): + +- A shipment **delete** (`op == 'd'`) removes the shipment from state and emits nothing for it. +- A customer-tier **delete** clears the tier and re-emits that customer's shipments with `null` enrichment fields (a Google Sheets row removal arrives as a delete). +- The Google Sheet stores `customer_id` as a string; the derivation casts it to `int` before joining against the integer `customer_id` from shipments. + +## What's included + +| File | Role | +| --- | --- | +| `flow.yaml` | Catalog spec for the derived collection `dani-demo/python-derivations/enriched-shipments`: its `schema`, `key: [/shipment_id]`, the Python module, and the two `transforms` (`shipments` with `shuffle: any`, `customer_tiers` shuffled on `/customer_id`). | +| `enriched-shipments.flow.py` | The `Derivation(IDerivation)` class implementing the left join: `shipments()` (left), `customer_tiers()` (right), `start_commit()` (merge-patch persistence), `__init__()` (state restore), and `reset()`. | +| `enriched-shipments.schema.yaml` | JSON Schema of the derived documents — shipment fields plus the `customer_tier` / `customer_region` / `account_manager` enrichment fields. 
`required: [shipment_id, customer_id]`. | +| `pyproject.toml` | Python project metadata: `pydantic>=2.0`, `pyright>=1.1`, `requires-python = ">=3.12"`. | +| `pyrightconfig.json` | Points Pyright at `flow_generated/python` and enables `strict` type checking. | +| `flow_generated/` | Auto-generated types (`IDerivation`, `Document`, `Request`, `Response`, `SourceShipments`, `SourceCustomerTiers`). Regenerate with `flowctl generate`; do not edit by hand. | + +### The transform logic + +The class keeps a root `State(customers: dict[int, JoinState])`. Each `JoinState` holds: + +- `tier: CustomerTier | None` — the right side (one per customer: `tier`, `region`, `account_manager`). +- `shipments: dict[int, ShipmentData]` — the left side, keyed by `shipment_id` for efficient updates. + +`shipments(read)` upserts the incoming shipment into state and `yield`s one enriched `Document`. `customer_tiers(read)` upserts the tier and `yield`s one `Document` **per stored shipment** for that customer, propagating the new enrichment to existing rows. A `touched_customers` set tracks which keys changed within a transaction so `start_commit()` writes only those, with `merge_patch=True`. + +The derived `Document` fields are: `shipment_id`, `customer_id`, `shipment_status`, `is_priority`, `city`, `expected_delivery_date`, `customer_tier`, `customer_region`, `account_manager`. + +## Prerequisites + +- A free [Estuary account](https://dashboard.estuary.dev). +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and authenticated. Derivations are deployed and their typed stubs generated via the CLI. +- Python 3.12+ for local editing and strict type-checking. +- The two **source collections** this derivation reads from, already present in your tenant: + - A `shipments` collection. The reference is `Artificial-Industries/postgres-shipments/public/shipments`, produced by a [PostgreSQL CDC capture](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). 
Stand one up with [`../../shipments-datagen`](../../shipments-datagen) or repoint the `source`. + - A `customer-tiers` reference collection. The reference is `dani-demo/customer-tiers/Sheet1`, produced by the [Google Sheets capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/google-sheets/) with columns `customer_id`, `tier`, `region`, `account_manager`. + +## Deploy the derivation + +Edit `flow.yaml` to match your tenant first (see [Adapt to your environment](#adapt-to-your-environment)), then publish with `flowctl`. The CLI builds the Python module, generates types, runs catalog tests, and deploys the derived collection. + +```bash +# Authenticate once +flowctl auth login + +# Publish the enriched-shipments derivation +flowctl catalog publish --source flow.yaml --auto-approve +``` + +To (re)generate the typed stubs in `flow_generated/` without publishing: + +```bash +flowctl generate --source flow.yaml +``` + +### Adapt to your environment + +The collection names in `flow.yaml` use a `dani-demo` / `Artificial-Industries` tenant. Before publishing: + +1. Rename the derived collection `dani-demo/python-derivations/enriched-shipments` to a name under your tenant. +2. Point the `shipments` transform `source` at your real shipments collection (replace `Artificial-Industries/postgres-shipments/public/shipments`). +3. Point the `customer_tiers` transform `source` at your reference collection (replace `dani-demo/customer-tiers/Sheet1`). Keep its shuffle on `/customer_id`. +4. Keep the `import` path in `enriched-shipments.flow.py` aligned with the derived collection name — the generated package mirrors the collection path. Run `flowctl generate --source flow.yaml` after renaming to refresh `flow_generated/`. 
+ +## Verify + +Confirm the derived collection is producing enriched documents: + +```bash +# Stream the derived collection (use your renamed collection if you changed it) +flowctl collections read --collection dani-demo/python-derivations/enriched-shipments --uncommitted | head + +# Check the derivation task's control-plane status +flowctl catalog status dani-demo/python-derivations/enriched-shipments +``` + +Each output document is a shipment with `customer_tier`, `customer_region`, and `account_manager` populated when a matching `customer-tiers` row exists, or `null` when it does not. Edit a row in the source Google Sheet and you should see that customer's shipments re-emitted with the updated enrichment. You can also watch document counts on the collection's page in the [Estuary dashboard](https://dashboard.estuary.dev). + +## Next steps + +- Materialize `enriched-shipments` to a warehouse or database to power dashboards or operational queries — see the [materialization connectors](https://docs.estuary.dev/reference/Connectors/materialization-connectors/). +- Add more reference sources (e.g. carrier or SLA tables) as additional right-side transforms to layer in more enrichment. +- Compare with the other patterns in this collection: the [stateless map](../shipments-stateless), the [stateful aggregation](../shipments-stateful), and the [ML feature pipeline](../shipments-ai). 
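For experimenting with changes like these offline, the join behaviour described in "The transform logic" and "Join and CDC semantics" can be modelled in plain Python. This is a minimal sketch, not the code in `enriched-shipments.flow.py`: it uses no Estuary APIs, skips state persistence, and the names `on_shipment`, `on_tier`, and `enrich` are hypothetical stand-ins for the two transforms.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class JoinState:
    """Per-customer join state: one tier row (right) and many shipments (left)."""
    tier: dict | None = None
    shipments: dict[int, dict] = field(default_factory=dict)


# Hypothetical in-memory state, keyed by integer customer_id.
state: dict[int, JoinState] = {}


def enrich(shipment: dict, tier: dict | None) -> dict:
    """Attach enrichment fields; they are None when no tier row is known."""
    tier = tier or {}
    return {
        **shipment,
        "customer_tier": tier.get("tier"),
        "customer_region": tier.get("region"),
        "account_manager": tier.get("account_manager"),
    }


def on_shipment(shipment: dict, op: str = "c") -> list[dict]:
    """Left side: store and emit the shipment; a delete emits nothing."""
    js = state.setdefault(shipment["customer_id"], JoinState())
    if op == "d":
        js.shipments.pop(shipment["shipment_id"], None)
        return []
    js.shipments[shipment["shipment_id"]] = shipment
    return [enrich(shipment, js.tier)]


def on_tier(row: dict, op: str = "c") -> list[dict]:
    """Right side: store or clear the tier, then re-emit every stored shipment."""
    customer_id = int(row["customer_id"])  # the sheet stores customer_id as a string
    js = state.setdefault(customer_id, JoinState())
    js.tier = None if op == "d" else row
    return [enrich(s, js.tier) for s in js.shipments.values()]
```

Calling `on_shipment(...)` before any tier row exists yields a document with `None` enrichment fields, and a later `on_tier(...)` for the same customer returns one updated document per stored shipment, which mirrors the left-join behaviour described above.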
+ +## References + +- Estuary docs: https://docs.estuary.dev +- Derivations concept: https://docs.estuary.dev/concepts/derivations/ +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- PostgreSQL capture connector (left source): https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Google Sheets capture connector (right source): https://docs.estuary.dev/reference/Connectors/capture-connectors/google-sheets/ +- Materialization connectors (downstream destinations): https://docs.estuary.dev/reference/Connectors/materialization-connectors/ diff --git a/python-derivations/shipments-stateful/README.md b/python-derivations/shipments-stateful/README.md index 71f1b6d..5cce1e4 100644 --- a/python-derivations/shipments-stateful/README.md +++ b/python-derivations/shipments-stateful/README.md @@ -1,3 +1,27 @@ +# Stateful Python Derivation in Estuary: Per-Customer Shipping Metrics with Persisted State and JSON Merge-Patch + +A stateful [Python derivation](https://docs.estuary.dev/concepts/derivations/) for [Estuary](https://estuary.dev) that aggregates a real-time stream of shipment CDC events into a continuously-updated profile per customer. It restores persisted state when the task starts, maintains an in-memory aggregate as documents arrive, correctly handles CDC create/update/delete operations and status transitions (e.g. `In Transit` → `Delivered`) without double-counting, and durably persists only the customers it touched each transaction via a **JSON merge patch**. + +The derived collection emits one document per `customer_id` with running counters: `total_shipments`, `on_time_count`, `late_count`, `active_shipments`, `avg_delivery_days`, `is_vip`, and `last_shipment_date`. 
+ +## Architecture + +This project defines a single Estuary collection — a Python **derivation** — that reads from an upstream shipments collection and writes per-customer metrics: + +``` +source collection derived collection +Artificial-Industries/ dani-demo/python-derivations/ +postgres-shipments/ ──shuffle──▶ customer-metrics +public/shipments /customer_id (key: /customer_id) +(CDC: create/update/delete) one doc per customer, updated in real time +``` + +- **Source:** `Artificial-Industries/postgres-shipments/public/shipments` — a collection produced upstream by a PostgreSQL CDC capture (see the [../../shipments-datagen](../../shipments-datagen) example or repoint at your own collection). +- **Transform:** a single transform named `shipments`, shuffled on `/customer_id` so each customer's state lives on exactly one worker. State is partitioned by the shuffle key, which is what makes the aggregation horizontally scalable. +- **Derived collection:** `dani-demo/python-derivations/customer-metrics`, keyed on `/customer_id`. Because the collection is keyed by customer, each emitted document replaces the previous version for that customer. + +The lifecycle of the derivation — restore state, process each document, persist a merge patch on commit — is shown below. 
+ ```mermaid flowchart TD Start([Runtime Starts Derivation]) --> Init["<b>__init__(open)</b><br/>Restore persisted state<br/>Initialize touched_customers set"] @@ -42,4 +66,137 @@ flowchart TD style BuildOut fill:#ffd43b,color:#000 style StartCommit fill:#da77f2,color:#fff style Persisted fill:#da77f2,color:#fff -``` \ No newline at end of file +``` + +## What it computes + +For each `customer_id`, the derivation maintains a `CustomerState` and emits a document conforming to `customer-metrics.schema.yaml`: + +| Field | Type | Meaning | +| --- | --- | --- | +| `customer_id` | integer | Unique customer identifier (collection key) | +| `total_shipments` | integer | Total shipments seen for this customer | +| `on_time_count` | integer | Shipments delivered on or before `expected_delivery_date` | +| `late_count` | integer | Shipments delivered after `expected_delivery_date` | +| `active_shipments` | integer | Shipments currently `In Transit` / `At Checkpoint` / `Out For Delivery` | +| `avg_delivery_days` | number \| null | Mean days from creation to delivery (`total_delivery_days / delivered_count`) | +| `is_vip` | boolean | `true` when `total_shipments >= 10` (the `VIP_THRESHOLD`) | +| `last_shipment_date` | string (date) \| null | Date of the customer's most recent shipment | + +The `required` fields in the derived schema are `customer_id` and `total_shipments`. + +## The stateful pattern + +This example demonstrates the full stateful-derivation contract. The relevant code lives in [`customer-metrics.flow.py`](customer-metrics.flow.py). + +### 1. Restore state on startup — `__init__(open)` + +The runtime calls `__init__` with a `Request.Open` whenever the task starts or restarts. `open.state` holds whatever was returned from the previous `start_commit()` (an empty dict on the very first run). 
The derivation rehydrates it into Pydantic models: + +```python +def __init__(self, open: Request.Open): + super().__init__(open) + self.state = State(**open.state) # restore persisted per-customer metrics + self.touched_customers: dict[int, CustomerState] = {} +``` + +`State` and `CustomerState` are `pydantic.BaseModel`s, which gives automatic JSON serialization for persistence plus validation. `known_shipments: dict[int, str]` inside each `CustomerState` records the last-seen status of every shipment so that CDC transitions and deletions can be reconciled without double-counting. + +### 2. Process documents — `shipments(read)` + +The `async def shipments(self, read) -> AsyncIterator[Document]` method runs once per source document. It: + +- Skips documents missing `customer_id` or `shipment_status` instead of crashing. +- Uses `self.state.customers.setdefault(customer_id, CustomerState())` to get-or-create per-customer state. +- Branches on the CDC operation `doc.m_meta.op` (`doc._meta.op`): `d` (delete) reverses the shipment's contribution via `_handle_deletion()`; `c`/`u` (create/update) apply it via `_process_shipment()`. +- Tracks status transitions using the previously-seen status so `active_shipments` is adjusted correctly, and records on-time vs. late delivery when a shipment first reaches `Delivered`. +- Records the customer in `self.touched_customers`, then `yield`s the customer's current metrics as an output `Document`. + +### 3. Persist a merge patch — `start_commit()` + +At the end of each transaction the runtime calls `start_commit()`. 
Rather than rewriting the entire state, this derivation returns **only the customers it touched** as a JSON merge patch: + +```python +return Response.StartedCommit( + state=Response.StartedCommit.State( + updated={"customers": {str(cid): c.model_dump() + for cid, c in self.touched_customers.items()}}, + merge_patch=True, # merge with existing state, don't replace it + ) +) +``` + +With `merge_patch=True`, keys present in the update replace the corresponding keys in persisted state; unmentioned keys are left untouched. Because each commit writes only the touched keys, the amount of state written per transaction tracks the number of customers that changed, not the total number of customers. The `touched_customers` dict is then cleared for the next transaction. + +### 4. Reset for tests — `reset()` + +`async def reset()` clears all state between catalog tests so state from one test doesn't leak into the next. + +## What's included + +| File | Role | +| --- | --- | +| [`flow.yaml`](flow.yaml) | Catalog spec for the derived collection `dani-demo/python-derivations/customer-metrics`: its `schema`, `key` (`/customer_id`), `using.python.module: customer-metrics.flow.py`, and the `shipments` transform sourced from `Artificial-Industries/postgres-shipments/public/shipments`, shuffled on `/customer_id`. | +| [`customer-metrics.flow.py`](customer-metrics.flow.py) | The `Derivation(IDerivation)` class — `__init__`, the `shipments` transform, the `_handle_deletion` / `_process_shipment` / `_record_delivery` / `_build_output_document` helpers, `start_commit`, and `reset`. | +| [`customer-metrics.schema.yaml`](customer-metrics.schema.yaml) | JSON Schema for the derived documents. Estuary validates every emitted document against it. | +| [`pyproject.toml`](pyproject.toml) | Python project metadata: `requires-python = ">=3.12"`, deps `pydantic>=2.0` and `pyright>=1.1`. | +| [`pyrightconfig.json`](pyrightconfig.json) | Points Pyright at `flow_generated/python` with `strict` type checking. 
| +| `flow_generated/` | Auto-generated typed stubs imported at the top of the module — `IDerivation`, `Document`, `Request`, `Response`, `SourceShipments` from `dani_demo.python_derivations.customer_metrics`. Do not edit by hand; regenerate with `flowctl`. | + +## Prerequisites + +- A free [Estuary account](https://dashboard.estuary.dev). +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and authenticated — Python derivations are deployed and their types generated via the CLI. +- Python 3.12+ for local editing, type-checking, and catalog tests. +- An existing **source collection** for the derivation to read from. This example reads `Artificial-Industries/postgres-shipments/public/shipments`. Stand up the upstream shipments capture with the [../../shipments-datagen](../../shipments-datagen) example (PostgreSQL CDC via the [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/)), or repoint the transform's `source` at your own collection. + +## Deploy + +Authenticate once, then publish this derivation. `flowctl` builds the Python module, generates types, runs any catalog tests, and deploys the derived collection. + +```bash +# Authenticate once +flowctl auth login + +# Publish the derivation from this folder +flowctl catalog publish --source flow.yaml --auto-approve +``` + +Before publishing, edit [`flow.yaml`](flow.yaml) to match your environment: + +1. Rename the derived collection `dani-demo/python-derivations/customer-metrics` to a name under your own tenant. +2. Point the `shipments` transform's `source` at the real collection produced by your capture (replace `Artificial-Industries/postgres-shipments/public/shipments`). +3. Keep the import path in `customer-metrics.flow.py` aligned with the collection name — the generated package mirrors the collection path (here `dani_demo.python_derivations.customer_metrics`). 
After renaming, regenerate the typed stubs: + +```bash +flowctl generate --source flow.yaml +``` + +## Verify + +Confirm the derived collection is producing documents: + +```bash +# Stream the derived collection (use your renamed collection name) +flowctl collections read --collection dani-demo/python-derivations/customer-metrics --uncommitted | head + +# Check the task's control-plane status +flowctl catalog status dani-demo/python-derivations/customer-metrics +``` + +You should see one document per `customer_id` whose counters update as new shipment events arrive. You can also watch document counts climb on the collection's page in the [Estuary dashboard](https://dashboard.estuary.dev). + +## Next steps + +- Materialize `customer-metrics` to a warehouse or database to power dashboards — see the [materialization connectors](https://docs.estuary.dev/reference/Connectors/materialization-connectors/). +- Adjust the aggregation logic: change `VIP_THRESHOLD`, add new per-customer counters to `CustomerState`, or refine the on-time / late delivery rule in `_record_delivery`. +- Compare with the other Python derivation patterns in this repo: stateless map ([../shipments-stateless](../shipments-stateless)), streaming join ([../shipments-joins](../shipments-joins)), and ML feature engineering ([../shipments-ai](../shipments-ai)). See the parent [`../README.md`](../README.md) for an overview. 
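To make the merge-patch behavior from step 3 (`start_commit()`) concrete, here is a minimal, standalone sketch. It assumes the runtime merges state along the lines of RFC 7386 (JSON Merge Patch), where nested objects merge key by key and `null` deletes a key. `apply_merge_patch` is a hypothetical helper written for illustration, not part of the Estuary runtime or the generated stubs, and `total_shipments` is an illustrative field name.

```python
from copy import deepcopy
from typing import Any


def apply_merge_patch(target: Any, patch: Any) -> Any:
    """Apply an RFC 7386-style JSON merge patch and return the merged value."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return deepcopy(patch)
    result = deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a merge patch removes the key.
            result.pop(key, None)
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


# Persisted state before the transaction (field names are illustrative).
persisted = {
    "customers": {
        "1": {"total_shipments": 4, "active_shipments": 1},
        "2": {"total_shipments": 9, "active_shipments": 3},
    }
}

# Only customer "1" was touched, so only customer "1" appears in the patch.
patch = {"customers": {"1": {"total_shipments": 5, "active_shipments": 2}}}

merged = apply_merge_patch(persisted, patch)
print(merged["customers"]["1"])  # updated from the patch
print(merged["customers"]["2"])  # untouched, carried over from persisted state
```

The point of the sketch is the size of `patch`: it grows with the number of touched customers, not with the size of `persisted`.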
+ +## References + +- Estuary docs: https://docs.estuary.dev +- Derivations concept: https://docs.estuary.dev/concepts/derivations/ +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Collections: https://docs.estuary.dev/concepts/collections/ +- PostgreSQL capture connector (upstream source): https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Materialization connectors (downstream destinations): https://docs.estuary.dev/reference/Connectors/materialization-connectors/ diff --git a/python-derivations/shipments-stateless/README.md b/python-derivations/shipments-stateless/README.md new file mode 100644 index 0000000..9c2c6ac --- /dev/null +++ b/python-derivations/shipments-stateless/README.md @@ -0,0 +1,108 @@ +# Stateless Python Derivation in Estuary: Real-Time Shipment Transform + +A minimal, self-contained example of a **stateless Python derivation** in [Estuary](https://estuary.dev). It reads a real-time, CDC-backed `shipments` collection and maps each shipment document independently into a reshaped `processed-shipments` collection — combining address fields, deriving an `is_urgent` flag, building a `status_summary` string, and computing `days_until_delivery`. No state, no aggregation: a pure per-document transform that runs continuously as new shipment events arrive. + +This is the simplest of the four [Python derivation examples](../README.md) in this folder. Start here to learn the derivation file layout and the `flowctl` deploy loop before moving on to the stateful, join, and ML-feature variants. + +## Architecture + +Estuary continuously transforms one collection into another. 
A derivation is itself a collection that is recomputed from its source(s) as data flows: + +``` +PostgreSQL (CDC) + │ source-postgres capture + ▼ +Artificial-Industries/postgres-shipments/public/shipments (source collection) + │ python derivation (processed-shipments.flow.py) + ▼ +dani-demo/python-derivations/processed-shipments (derived collection, key /id) + │ materialization (optional) + ▼ +warehouse / database +``` + +- **Source collection** — `Artificial-Industries/postgres-shipments/public/shipments`, produced upstream by a PostgreSQL CDC capture (see [`../../shipments-datagen`](../../shipments-datagen) for the synthetic source). +- **Transform** — the `shipments` transform reads each source document via `read.doc` (a typed Pydantic `SourceShipments` model) and `yield`s exactly one output `Document`. It is **stateless**: the output depends only on the current input, so there is no `__init__` state restore and no `start_commit`. The transform uses `shuffle: any` (any worker can process any document, since there is no per-key state) and `backfill: 1`. +- **Derived collection** — `dani-demo/python-derivations/processed-shipments`, keyed on `/id`, validated against `processed-shipments.schema.yaml`. + +### Transform logic + +For each shipment document, `processed-shipments.flow.py` emits: + +| Output field | Type | Derivation | +| --- | --- | --- | +| `id` | integer | Passed through from `doc.id` (the collection key). | +| `full_address` | string | `f"{street_address or 'Unknown'}, {city or 'Unknown'}"` — concatenated street and city. | +| `is_urgent` | boolean | `is_priority` is true, **or** `shipment_status` is `delayed` or `critical`. | +| `status_summary` | string | `f"{shipment_status or 'unknown'} - {'Priority' if is_priority else 'Standard'}"`. | +| `days_until_delivery` | integer | Days between `expected_delivery_date` and today (`0` if no date). The `Z` suffix is normalized to `+00:00` before parsing with `datetime.fromisoformat`. 
| + +`id`, `full_address`, and `status_summary` are `required` in the output schema; `is_urgent` and `days_until_delivery` are optional. + +## What's included + +- **`flow.yaml`** — the catalog spec. Defines the derived collection `dani-demo/python-derivations/processed-shipments`, its `key: [/id]` and `schema: processed-shipments.schema.yaml`, selects `using.python.module: processed-shipments.flow.py`, and declares one transform named `shipments` sourcing `Artificial-Industries/postgres-shipments/public/shipments` with `shuffle: any` and `backfill: 1`. +- **`processed-shipments.flow.py`** — the `Derivation(IDerivation)` class. Implements a single `async def shipments(self, read) -> AsyncIterator[Document]` method containing the stateless transform logic. +- **`processed-shipments.schema.yaml`** — JSON Schema for the derived documents (`id`, `full_address`, `is_urgent`, `status_summary`, `days_until_delivery`). Estuary validates every emitted document against it. +- **`pyproject.toml`** — Python project metadata. `requires-python = ">=3.12"`; dependencies `pydantic>=2.0` and `pyright>=1.1`. +- **`pyrightconfig.json`** — points Pyright at `flow_generated/python` and enables `strict` type-checking. +- **`flow_generated/`** — auto-generated typed stubs (`IDerivation`, `Document`, `Request`, `SourceShipments`, …) imported at the top of the `.flow.py`. Generated by `flowctl`; do not edit by hand. + +## Prerequisites + +- A free [Estuary account](https://dashboard.estuary.dev). +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and authenticated — Python derivations are deployed and their types generated via the CLI. +- Python 3.12+ (for local editing and strict type-checking). +- An existing **source collection** for the derivation to read from. This example reads `Artificial-Industries/postgres-shipments/public/shipments`. 
Stand up the upstream shipments capture with [`../../shipments-datagen`](../../shipments-datagen) (PostgreSQL CDC via the [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/)), or repoint the transform's `source` at your own collection. + +## Deploy + +Publish the derivation with `flowctl`. The CLI builds the Python module, generates types, runs any catalog tests, and deploys the derived collection. + +```bash +# Authenticate once +flowctl auth login + +# From this folder, publish the derivation +flowctl catalog publish --source flow.yaml --auto-approve +``` + +Before publishing, edit `flow.yaml` to match your environment: + +1. Rename the derived collection from `dani-demo/python-derivations/processed-shipments` to a name under your own tenant. +2. Point the `shipments` transform's `source` at the real collection your capture produces (replace `Artificial-Industries/postgres-shipments/public/shipments`). +3. Keep the import path in `processed-shipments.flow.py` aligned with the collection name — the generated package mirrors the collection path (e.g. `dani_demo.python_derivations.processed_shipments`). After renaming, regenerate the typed stubs: + +```bash +flowctl generate --source flow.yaml +``` + +## Verify + +Confirm the derived collection is producing documents: + +```bash +# Stream the derived collection (use your renamed collection name) +flowctl collections read --collection dani-demo/python-derivations/processed-shipments --uncommitted | head + +# Check the task's control-plane status +flowctl catalog status dani-demo/python-derivations/processed-shipments +``` + +You should see one output document per source shipment, with `full_address`, `is_urgent`, `status_summary`, and `days_until_delivery` populated. You can also watch document counts climb on the collection's page in the [Estuary dashboard](https://dashboard.estuary.dev). 
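When checking the output by eye, it can help to have the rules from the "Transform logic" table as a plain function. The following is a rough sketch under stated assumptions, not the contents of `processed-shipments.flow.py`: it works on plain dicts instead of the generated `SourceShipments` model, `transform_shipment` and `days_until` are hypothetical names, and the exact-case status match is an assumption.

```python
from datetime import date, datetime
from typing import Any, Optional

# Statuses that mark a shipment urgent, per the table (exact-case match is an assumption).
URGENT_STATUSES = {"delayed", "critical"}


def days_until(expected: Optional[str], today: date) -> int:
    """Days from `today` to the expected delivery date; 0 when no date is present."""
    if not expected:
        return 0
    # Normalize a trailing Z so datetime.fromisoformat accepts it on older Pythons.
    parsed = datetime.fromisoformat(expected.replace("Z", "+00:00"))
    return (parsed.date() - today).days


def transform_shipment(doc: dict[str, Any], today: Optional[date] = None) -> dict[str, Any]:
    """Map one source shipment to one output document; no state is kept."""
    today = today or date.today()
    status = doc.get("shipment_status")
    is_priority = bool(doc.get("is_priority"))
    return {
        "id": doc["id"],
        "full_address": f"{doc.get('street_address') or 'Unknown'}, {doc.get('city') or 'Unknown'}",
        "is_urgent": is_priority or status in URGENT_STATUSES,
        "status_summary": f"{status or 'unknown'} - {'Priority' if is_priority else 'Standard'}",
        "days_until_delivery": days_until(doc.get("expected_delivery_date"), today),
    }


example = transform_shipment(
    {
        "id": 1,
        "street_address": "1 Main St",
        "city": None,
        "is_priority": False,
        "shipment_status": "delayed",
        "expected_delivery_date": "2026-07-04T00:00:00Z",
    },
    today=date(2026, 7, 1),
)
print(example)
```

Passing `today` explicitly keeps the function deterministic for tests; the deployed derivation would use the current date.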
+ +## Next steps + +- Materialize `processed-shipments` to a warehouse or database to power dashboards — see the [materialization connectors](https://docs.estuary.dev/reference/Connectors/materialization-connectors/). +- Add or change derived fields by editing `processed-shipments.flow.py` and `processed-shipments.schema.yaml`, then republish. +- Move beyond stateless transforms with the sibling examples: [`../shipments-stateful`](../shipments-stateful) (per-customer aggregation with persisted state and JSON merge-patch), [`../shipments-joins`](../shipments-joins) (streaming left join), and [`../shipments-ai`](../shipments-ai) (ML feature engineering). + +## References + +- Python derivations overview: [`../README.md`](../README.md) +- Estuary docs: https://docs.estuary.dev +- Derivations concept: https://docs.estuary.dev/concepts/derivations/ +- flowctl CLI: https://docs.estuary.dev/concepts/flowctl/ +- Collections: https://docs.estuary.dev/concepts/collections/ +- PostgreSQL capture connector (upstream source): https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/ +- Materialization connectors (downstream destinations): https://docs.estuary.dev/reference/Connectors/materialization-connectors/ diff --git a/shipments-datagen/README.md b/shipments-datagen/README.md index 94d982a..349cdf9 100644 --- a/shipments-datagen/README.md +++ b/shipments-datagen/README.md @@ -1,30 +1,112 @@ -# Shipments Datagen +# Real-Time PostgreSQL Shipments Data Generator for Estuary CDC -This app generates fake shipments data and stores it in a PostgreSQL database for future ingestion in StarTree. +A self-contained Docker stack that continuously generates realistic, mutating shipments and logistics data in PostgreSQL, wired for change data capture (CDC) with [Estuary](https://dashboard.estuary.dev). 
The Postgres instance ships with `wal_level=logical`, a replication-ready user, a publication, and a watermarks table, so you can point Estuary's PostgreSQL capture connector at it in minutes and stream inserts, updates, and deletes into a collection, then materialize to any destination (BigQuery, Snowflake, StarTree, ClickHouse, and more) for real-time dashboards and analytics. + +Use this as a synthetic backend to test, demo, and benchmark real-time CDC pipelines without needing a production database. + +## Architecture + +The Python generator drives a steady stream of `INSERT`, `UPDATE`, and `DELETE` operations against the `shipments` table. PostgreSQL's logical replication exposes those changes, an ngrok TCP tunnel makes the local database reachable from Estuary's managed cloud, and Estuary captures the change stream into a collection that can be materialized anywhere. + +``` +datagen (Python + Faker) + │ INSERT / UPDATE / DELETE + ▼ +PostgreSQL 17.4 (wal_level=logical, flow_publication, flow_watermarks) + │ logical replication + ▼ +ngrok TCP tunnel (postgres:5432 → public host:port) + │ + ▼ +Estuary capture (source-postgres) + │ + ▼ +Estuary collection (real-time, schematized JSON) + │ + ▼ +Materialization → BigQuery / Snowflake / StarTree / ClickHouse / ... +``` + +- **Capture (source):** the [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) reads the logical replication stream from the `public.shipments` table. +- **Collection:** captured change events land in an Estuary collection — a real-time data lake of schematized JSON in cloud storage. +- **Materialization (destination):** push the collection to any [supported destination connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/) to power live dashboards. 
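As a rough mental model of the change mix the generator produces, the sketch below decides which write an existing row would receive on a given tick. It is illustrative only and is not the logic in `datagen/datagen.py`: the 15-minute and 30-day thresholds come from "The data" section of this README, while `next_action` and the choice of `created_at` as the age field are assumptions.

```python
from datetime import datetime, timedelta, timezone

UPDATE_AFTER = timedelta(minutes=15)  # rows become eligible for an UPDATE after this
DELETE_AFTER = timedelta(days=30)     # rows older than this are deleted


def next_action(created_at: datetime, updated_at: datetime, now: datetime) -> str:
    """Return the write a generator tick might issue for one existing shipment."""
    if now - created_at > DELETE_AFTER:
        return "DELETE"  # surfaces as a delete change event in CDC
    if now - updated_at > UPDATE_AFTER:
        return "UPDATE"  # status and location advance
    return "SKIP"        # updated too recently; leave the row alone


now = datetime(2026, 6, 26, 12, 0, tzinfo=timezone.utc)
print(next_action(now - timedelta(days=31), now - timedelta(days=1), now))
print(next_action(now - timedelta(days=2), now - timedelta(minutes=20), now))
print(next_action(now - timedelta(days=2), now - timedelta(minutes=5), now))
```

Each non-`SKIP` result corresponds to one row-level change that logical replication exposes to the capture.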
+ +## What's included + +- **`docker-compose.yml`** — spins up three services: `shipments-datagen` (the generator), `shipments-postgres` (PostgreSQL 17.4 started with `wal_level=logical`, exposed on port `5432`), and `shipments-ngrok` (an ngrok TCP tunnel forwarding `postgres:5432`, with its inspection UI on port `4040`). +- **`postgres/init.sql`** — runs on first boot. Grants the `postgres` user `REPLICATION` and read access, creates the `public.flow_watermarks` table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), defines the `ship_status` enum and `coord` composite type, creates the `shipments` table, and adds both tables to the publication. +- **`datagen/datagen.py`** — the main loop. Randomly inserts new shipments, advances existing shipments through their statuses, updates current locations, and deletes shipments older than 30 days, producing a continuous CDC change feed. +- **`datagen/geo.py`** — geographic helpers: picks the nearest warehouse, estimates transit time (~800 mi/day), and moves shipments along a randomized route toward their destination within the United States. +- **`datagen/Dockerfile`** — builds the generator on `python:3.13.2` and runs `python -u datagen.py`. +- **`datagen/requirements.txt`** — Python dependencies: `Faker`, `geopy`, `psycopg2`. ## Prerequisites -* Docker -* Ngrok -* An Estuary account +- [Docker](https://docs.docker.com/get-docker/) (with Docker Compose) +- A free [ngrok](https://ngrok.com/) account and authtoken (required to expose the local PostgreSQL database to Estuary's managed service) +- A free [Estuary account](https://dashboard.estuary.dev) + +## Setup + +1. Add your ngrok authtoken to `docker-compose.yml`, replacing `<YOUR-TOKEN-HERE>` for the `ngrok` service: + + ```yaml + ngrok: + environment: + NGROK_AUTHTOKEN: <YOUR-TOKEN-HERE> + ``` + + Optionally change the `POSTGRES_PASSWORD` (default `postgres`) in both the `datagen` and `postgres` service blocks — keep them in sync. 
+ +2. From the `shipments-datagen` directory, start the stack: + + ```bash + docker compose up -d + ``` -## Quick instructions + On first boot, `postgres/init.sql` provisions the replication user, publication, watermarks table, and `shipments` table. The generator begins producing change events immediately. -1. Add your secrets to the `docker-compose.yml` file. This includes adding your ngrok auth token and your desired Postgres password. +3. Find the public TCP endpoint created by ngrok. Either open the ngrok inspection UI at [http://localhost:4040](http://localhost:4040) (or the **Endpoints** tab of your ngrok dashboard), or run: -2. From the `shipments-datagen` directory, run: `docker-compose up -d` + ```bash + curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" + ``` -3. Go to the **Endpoints** tab in your ngrok dashboard to find the public endpoint associated with your newly-created Postgres database. + You'll get something like `tcp://0.tcp.ngrok.io:12345`. Strip the `tcp://` prefix — the host and port are what you paste into Estuary. -4. Enter the ngrok host URL and your Postgres details into [Estuary's Postgres capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). +## Configure the Estuary capture -5. [Materialize to any supported connector.](https://docs.estuary.dev/reference/Connectors/materialization-connectors/) +Create a new capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) using the **PostgreSQL** connector ([docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/), image `ghcr.io/estuary/source-postgres:dev`). Use the values baked into this stack: -We'll be materializing to StarTree to create a spiffy real-time dashboard. While the frontend is in the works, this code can be used to easily set up a data-generating backend to test pipeline setup and CDC. 
+| Setting | Value | +| --- | --- | +| Server Address | the ngrok host:port from the step above (e.g. `0.tcp.ngrok.io:12345`) | +| User | `postgres` | +| Password | your `POSTGRES_PASSWORD` (default `postgres`) | +| Database | `postgres` | + +The connector auto-discovers `public.shipments` and uses the pre-created `flow_publication` and `public.flow_watermarks`. Save and publish; the capture begins backfilling and then streaming live CDC change events into a new collection. + +## Configure the materialization + +Once data is flowing into your collection, create a [materialization](https://dashboard.estuary.dev/materializations) to a destination of your choice — pick any [materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/) (BigQuery, Snowflake, ClickHouse, StarTree, and more) and bind it to the `shipments` collection. + +This dataset was originally built to drive a real-time StarTree dashboard, but the generator works as a drop-in CDC backend for testing any pipeline setup. Note that `delivery_coordinates` and `current_location` are PostgreSQL composite (`coord`) types — handy for geospatial dashboards. + +## Verify + +Confirm change events are arriving by reading from the collection with [flowctl](https://docs.estuary.dev/concepts/flowctl/): + +```bash +flowctl auth login +flowctl collections read --collection <your/collection/name> --uncommitted | head +``` + +You can also watch throughput and document counts on the capture's page in the Estuary dashboard, or query your destination table after the materialization is live. 
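To check that inserts, updates, and deletes are all arriving, you can tally the CDC operation on each captured document. The sketch below assumes `flowctl collections read` prints one JSON document per line and that each document carries the operation under `_meta.op` (`c`, `u`, or `d`); `tally_ops` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def tally_ops(lines: Iterable[str]) -> Counter:
    """Count documents per CDC operation found under _meta.op."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        doc = json.loads(line)
        op = (doc.get("_meta") or {}).get("op", "unknown")
        counts[op] += 1
    return counts


# Made-up sample standing in for flowctl output.
sample = [
    '{"id": 1, "_meta": {"op": "c"}}',
    '{"id": 2, "_meta": {"op": "c"}}',
    '{"id": 1, "_meta": {"op": "u"}}',
    '{"id": 2, "_meta": {"op": "d"}}',
    '{"id": 3}',
]

counts = tally_ops(sample)
for op, n in sorted(counts.items()):
    print(f"{op}\t{n}")
```

To run it against real data, replace `sample` with `sys.stdin` and pipe the `flowctl collections read` output into the script. Because rows are only deleted after 30 days, a freshly started stack may show no `d` operations for a while.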
## The data -Generated data is currently associated with a single `shipments` table that consists of: +Generated data is associated with a single `shipments` table: | Field name | Data type | Description | | --- | --- | --- | @@ -42,6 +124,19 @@ Generated data is currently associated with a single `shipments` table that cons | `expected_delivery_date` | date | Approximate expected delivery, based on distance and shipment priority | | `is_priority` | boolean | Whether or not a shipment is considered priority; affects initial processing time | -New shipments are generated approximately every minute. Existing shipments are updated approximately every 15 minutes and progress through shipment statuses while updating their current locations. While initial and ending coordinates should be actual points within the United States, the route between the two is randomly generated rather than corresponding to actual roads. +New shipments are generated roughly every 15-75 seconds. Existing shipments are updated only after their `updated_at` is older than 15 minutes; they then progress through shipment statuses while updating their current locations. Shipments older than 30 days are periodically deleted, producing delete change events for CDC. While initial and ending coordinates are real points within the United States, the route between the two is randomly generated rather than corresponding to actual roads. + +The data is meant for demonstration purposes, providing a facsimile of real-time shipping and logistics data. + +## Next steps + +- Add a [derivation](https://docs.estuary.dev/concepts/derivations/) to transform the `shipments` collection in SQL, TypeScript, or Python — for example, computing on-time delivery rates or per-status counts. +- Materialize to a warehouse or real-time analytics store and build a live logistics dashboard. + +## Resources -The data is meant to be used for demonstration purposes, providing a facsimile of real-time shipping and logistics data. 
+- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector reference](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [Materialization connectors reference](https://docs.estuary.dev/reference/Connectors/materialization-connectors/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) +- [Estuary dashboard](https://dashboard.estuary.dev) diff --git a/shipments_eta/README.md b/shipments_eta/README.md index ccb7ead..05b5070 100644 --- a/shipments_eta/README.md +++ b/shipments_eta/README.md @@ -1,5 +1,223 @@ -# Real-time Freight Tracking with Estuary and Tinybird +# Real-Time Freight Shipment ETA Tracking with Estuary, MongoDB, and Tinybird -Read the article for more details +Stream live freight shipment events out of MongoDB with Estuary's change data capture (CDC), consume those collections from Tinybird over Estuary's Kafka-compatible Dekaf API, compute updated ETAs and delay analytics in ClickHouse SQL, and visualize everything in a real-time Next.js dashboard. -https://estuary.dev/real-time-freight-tracking-estuary-tinybird/ +This example simulates a logistics workload: shipments are continuously inserted, updated (locations, statuses, delays), and enriched with traffic/weather data in MongoDB. Estuary captures every change in real time, materializes it into Tinybird, and a published Tinybird Pipe joins shipments against current traffic/weather to recompute each shipment's expected delivery time on the fly. + +Reference article: https://estuary.dev/real-time-freight-tracking-estuary-tinybird/ + +## Architecture + +``` +MongoDB (shipping db) Estuary Tinybird Next.js dashboard +┌─────────────────────┐ capture ┌──────────────────┐ Dekaf ┌──────────────────────┐ ┌────────────────┐ +│ shipments │ ───────────▶ │ collections │ (Kafka API) │ Data Sources │ │ Tremor charts │ +│ checkpoints │ source- │ shipping/... 
│ ───────────▶ │ shipments │ │ - delayed cust │ +│ traffic_weather │ mongodb │ (real-time lake) │ │ checkpoints │ HTTP │ - route perf │ +│ (datagen CDC sim) │ │ │ │ traffic_weather │ ◀──── │ - status dist │ +└─────────────────────┘ └──────────────────┘ │ + Shipping.pipe │ └────────────────┘ + └──────────────────────┘ +``` + +Data flow in Estuary terms: + +- **Capture (source):** the `source-mongodb` connector reads the `shipping` database and streams inserts/updates/deletes from the `shipments`, `checkpoints`, and `traffic_weather` collections. +- **Collections:** each MongoDB collection lands in an Estuary collection (a real-time data lake of schematized JSON), e.g. `Dani/shipments-demo/shipping/shipments`. +- **Consumption (Dekaf):** Tinybird reads those collections directly through Estuary's Kafka-compatible **Dekaf** API. Each Tinybird Data Source is bound to a Kafka topic (`KAFKA_TOPIC`) through a named Tinybird Kafka connection (`KAFKA_CONNECTION_NAME`) that maps to an Estuary collection. +- **Transformation (Tinybird):** `Shipping.pipe` and `eta.sql` run ClickHouse SQL to deduplicate to the latest version of each shipment, join against the latest traffic/weather row per route, and recompute ETAs (`expected_delivery_date + INTERVAL impact_on_ETA_minutes MINUTE`). +- **Visualization:** the Next.js dashboard calls published Tinybird Pipe endpoints and renders the metrics with Tremor charts. + +## What's included + +| Path | Role | +| --- | --- | +| `docker-compose.yml` | Spins up the `datagen` service (container `mongodb-shipping-datagen`) that simulates the live MongoDB CDC workload. | +| `datagen/datagen.py` | Connects to MongoDB and continuously inserts new shipments, updates statuses/locations/delays, and emits traffic/weather rows into the `shipping` database (`shipments`, `checkpoints`, `traffic_weather` collections). | +| `datagen/requirements.txt` | Python deps for the generator: `pymongo[srv]`, `faker`, `python-dotenv`. 
| +| `datagen/Dockerfile` | Builds the generator image (`python:3.11`, runs `python -u datagen.py`). | +| `tinybird/datasources/shipments.datasource` | Tinybird Data Source bound to the Estuary collection topic `Dani/shipments-demo/shipping/shipments` via Dekaf; flattens nested arrays (`delays[:]`, `events[:]`) and `current_location` into columns. | +| `tinybird/datasources/checkpoints.datasource` | Data Source for the `checkpoints` collection topic. | +| `tinybird/datasources/traffic_weather.datasource` | Data Source for the `traffic_weather` collection topic (route conditions and `impact_on_ETA_minutes`). | +| `tinybird/pipes/Shipping.pipe` | Multi-node Pipe: latest shipment per `shipment_id`, latest traffic/weather per `route_id`, per-shipment delays, route performance, status distribution, and top delayed customers. | +| `tinybird/eta.sql` | Standalone ClickHouse query that joins `shipments` to `traffic_weather` and computes `updated_eta` for in-transit shipments. | +| `dashboard/` | Next.js 15 + Tremor app that queries the published Tinybird Pipe endpoints and renders charts. (See `dashboard/README.md` for the standard Next.js commands.) | + +## Prerequisites + +- **Docker** with Compose (to run the data generator). +- A **MongoDB** instance Estuary can reach. `datagen/datagen.py` defaults to a MongoDB Atlas `mongodb+srv://` connection; the easiest path is a free [MongoDB Atlas](https://www.mongodb.com/atlas) cluster (which is publicly reachable, so no tunnel is needed). For CDC, MongoDB must run as a replica set (Atlas clusters always do). +- A free **Estuary** account: https://dashboard.estuary.dev +- A **Tinybird** account (free tier is fine): https://www.tinybird.co +- **Node.js 18+** (only for the `dashboard/` app). + +## Step 1 — Generate the MongoDB CDC workload + +`datagen/datagen.py` reads its MongoDB connection from environment variables (`MONGODB_USER`, `MONGODB_PASSWORD`, `MONGODB_HOST`) and builds a `mongodb+srv://` connection string. 
Point it at your MongoDB (Atlas recommended). + +The bundled `docker-compose.yml` hardcodes placeholder values in its `environment:` block (`MONGODB_HOST: "localhost"`, `MONGODB_USER: "mongo"`, `MONGODB_PASSWORD: "mongo"`, plus an unused `MONGODB_PORT`) that will **not** reach an Atlas cluster. Edit those values to your MongoDB credentials/host first: + +```yaml +# docker-compose.yml + environment: + MONGODB_HOST: "cluster0.xxxxx.mongodb.net" # your Atlas cluster host + MONGODB_USER: "<your-mongodb-user>" + MONGODB_PASSWORD: "<your-mongodb-password>" +``` + +Then build and run the generator: + +```bash +# From the shipments_eta/ directory. +docker compose up --build +``` + +Alternatively, run it without Docker — `datagen.py` picks up the exported env vars directly: + +```bash +cd datagen +pip install -r requirements.txt +export MONGODB_USER=<your-mongodb-user> +export MONGODB_PASSWORD=<your-mongodb-password> +export MONGODB_HOST=<your-cluster-host> # e.g. cluster0.xxxxx.mongodb.net +python -u datagen.py +``` + +`datagen.py` builds the connection string as: + +``` +mongodb+srv://${MONGODB_USER}:${MONGODB_PASSWORD}@${MONGODB_HOST}/?retryWrites=true&w=majority&appName=Cluster0 +``` + +It then loops forever in `simulate_cdc_workload()`: + +- inserts 2–5 new shipments per cycle into `shipping.shipments`, +- updates non-delivered shipments with new `status`, `current_location`, `events`, and `delays`, +- periodically inserts traffic/weather rows into `shipping.traffic_weather`, +- seeds 10 `checkpoints` once on startup. + +Watch the logs to confirm inserts/updates are happening: + +```bash +docker compose logs -f datagen +``` + +> Note: the bundled `docker-compose.yml` defines only the generator and ships with placeholder `environment` values (`localhost` / `mongo` / `mongo`) plus an unused `MONGODB_PORT`. On the Docker path, Compose's `environment` values override your shell, so edit them in the file as shown above rather than relying on `export`. 
Because `datagen.py` connects with `mongodb+srv://`, the port is ignored. If you run MongoDB locally instead of Atlas, expose it to Estuary's managed connectors with a publicly reachable host or an ngrok TCP tunnel (`ngrok tcp 27017`). + +## Step 2 — Configure the Estuary MongoDB capture + +Capture the three collections from your MongoDB `shipping` database into Estuary collections. + +Using the dashboard: + +1. Open https://dashboard.estuary.dev/captures and click **New Capture**. +2. Choose the **MongoDB** connector (`source-mongodb`). +3. Enter your connection details: + - **Address:** your MongoDB host (e.g. `cluster0.xxxxx.mongodb.net` for Atlas, or the ngrok host for a local DB). + - **User / Password:** the `MONGODB_USER` / `MONGODB_PASSWORD` you set above. + - **Database:** `shipping`. +4. Let the connector discover collections and bind `shipments`, `checkpoints`, and `traffic_weather`. +5. Publish. Estuary begins backfilling and then streaming change events into collections such as `<your-prefix>/shipping/shipments`. + +Prefer the CLI? This repo doesn't ship an Estuary spec, so create one yourself: run `flowctl auth login`, scaffold a capture spec with `flowctl raw discover --connector ghcr.io/estuary/source-mongodb:dev` (or `flowctl catalog pull-specs` to edit an existing draft), then publish your spec with `flowctl catalog publish --source <your-spec>.yaml`. See https://docs.estuary.dev/concepts/flowctl/ + +MongoDB capture connector docs: https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/ + +> The Data Sources in this repo reference topics under the `Dani/shipments-demo/shipping/...` prefix. Replace `Dani/shipments-demo` with your own Estuary tenant/prefix when you wire up Tinybird. + +## Step 3 — Connect Tinybird to Estuary via Dekaf + +Tinybird consumes the Estuary collections through Estuary's Kafka-compatible **Dekaf** API. 
In Tinybird, create a Kafka connection (for example, named `Estuary`) with these settings: + +- **Bootstrap servers:** `dekaf.estuary-data.com:9092` +- **Security protocol:** `SASL_SSL` +- **SASL mechanism:** `PLAIN` +- **Username:** your Dekaf task name (or `{}` for public demo topics) +- **Password:** an Estuary access/refresh token (generate one in the Estuary dashboard) +- **Schema registry:** `https://dekaf.estuary-data.com` + +Then create the three Data Sources, each bound to the Kafka topic that matches its Estuary collection: + +| Data Source | Kafka topic | +| --- | --- | +| `shipments` | `Dani/shipments-demo/shipping/shipments` | +| `checkpoints` | `Dani/shipments-demo/shipping/checkpoints` | +| `traffic_weather` | `Dani/shipments-demo/shipping/traffic_weather` | + +The `.datasource` files already declare the JSON-path schema mappings (for example `delays__reason Array(String) json:$.delays[:].reason` and `current_location_latitude Nullable(Float32) json:$.current_location.latitude`), so update each file's `KAFKA_CONNECTION_NAME` (to match the connection you created above) and its `KAFKA_TOPIC` / `KAFKA_GROUP_ID` lines (to your tenant prefix), then push them with the Tinybird CLI: + +```bash +# Authenticate the Tinybird CLI first (tb auth), then from the tinybird/ directory: +tb push datasources/shipments.datasource +tb push datasources/checkpoints.datasource +tb push datasources/traffic_weather.datasource +tb push pipes/Shipping.pipe +``` + +Dekaf docs: https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/ + +## Step 4 — Build the ETA / analytics transformations + +The transformations run as ClickHouse SQL inside Tinybird: + +- **`Shipping.pipe`** is the main multi-node Pipe: + - `latest_shipments` — dedupes to the most recent row per `shipment_id` using `ROW_NUMBER() OVER (PARTITION BY shipment_id ORDER BY __timestamp DESC)`. + - `latest_traffic_weather` — most recent traffic/weather row per `route_id`. 
+ - `delays` — joins latest shipments to latest traffic/weather and computes `updated_eta = expected_delivery_date + INTERVAL impact_on_ETA_minutes MINUTE`, plus total/avg delay minutes per shipment. + - `route_performance` — congestion insights aggregated per origin–destination route. + - `shipment_status_distribution` — status counts per route. + - `top_delayed_customers` — customers with the most cumulative delay minutes. +- **`eta.sql`** is a standalone query showing the core ETA recompute: join `shipments` to `traffic_weather` on `route_id` for `status = 'In Transit'` and return original vs. updated ETA with delay reasons. Note: as committed it references `arrayJoin(s.delays_reason, ', ')`, but the Tinybird Data Source flattens that field to the `Array(String)` column `delays__reason` (double underscore) — to run it against the Data Source, use `arrayStringConcat(s.delays__reason, ', ')`. + +Publish the relevant Pipe nodes as API endpoints (`Shipping.json`, `route_perf.json`, `route_staus_stats.json`) so the dashboard can query them. + +## Step 5 — Run the dashboard + +The dashboard is a Next.js 15 app that reads the published Tinybird endpoints with Tremor charts. + +```bash +# From the dashboard/ directory. +npm install +npm run dev +# open http://localhost:3000 +``` + +Set your Tinybird host and token in `dashboard/src/app/page.tsx` before running. The file currently ships with a `"xyz"` placeholder token — replace it with your Tinybird read token: + +```ts +const TINYBIRD_API_BASE = "https://api.us-east.tinybird.co/v0/pipes/Shipping.json"; +const TINYBIRD_TOKEN = "xyz" // replace "xyz" with your Tinybird read token +``` + +The page renders three panels driven by the Tinybird Pipe endpoints: + +- **Top Delayed Customers (minutes)** — from `Shipping.json`. +- **Average Delay by Route (minutes)** — from `route_perf.json`. +- **Status Distribution per route** — from `route_staus_stats.json`. 
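
The `latest_shipments`, `latest_traffic_weather`, and `delays` nodes from Step 4 feed the first panel, and their logic can be sketched outside ClickHouse. The following is a minimal Python illustration, not the Pipe's actual implementation: the helper names (`latest_per_key`, `updated_etas`) are hypothetical, the rows are plain dictionaries, and the join is treated as an inner join, which is an assumption about the Pipe. The field names (`shipment_id`, `route_id`, `__timestamp`, `expected_delivery_date`, `impact_on_ETA_minutes`) come from the node descriptions above.

```python
from datetime import datetime, timedelta


def latest_per_key(rows, key):
    """Keep the row with the greatest __timestamp for each value of `key`.

    Mirrors the ROW_NUMBER() ... ORDER BY __timestamp DESC dedupe pattern.
    """
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row["__timestamp"] > latest[k]["__timestamp"]:
            latest[k] = row
    return latest


def updated_etas(shipments, traffic_weather):
    """Return {shipment_id: updated_eta} using the latest row on each side."""
    latest_shipments = latest_per_key(shipments, "shipment_id")
    latest_routes = latest_per_key(traffic_weather, "route_id")
    result = {}
    for shipment_id, shipment in latest_shipments.items():
        route = latest_routes.get(shipment["route_id"])
        if route is None:
            continue  # assumption: shipments without a matching route are dropped
        delay = timedelta(minutes=route["impact_on_ETA_minutes"])
        result[shipment_id] = shipment["expected_delivery_date"] + delay
    return result


# Hypothetical sample rows, for illustration only.
sample_shipments = [
    {"shipment_id": "S1", "route_id": "R1", "__timestamp": 1,
     "expected_delivery_date": datetime(2025, 1, 1, 12, 0)},
    {"shipment_id": "S1", "route_id": "R1", "__timestamp": 2,
     "expected_delivery_date": datetime(2025, 1, 1, 13, 0)},
]
sample_traffic = [
    {"route_id": "R1", "__timestamp": 3, "impact_on_ETA_minutes": 90},
    {"route_id": "R1", "__timestamp": 5, "impact_on_ETA_minutes": 30},
]
sample_etas = updated_etas(sample_shipments, sample_traffic)
```

Only the newest row per key contributes, so a later `traffic_weather` event for a route replaces the earlier delay rather than adding to it.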
+ +## Verify + +- **Generator:** `docker compose logs -f datagen` should show `Inserted N new shipments.` / `Updated N shipments.` lines. +- **Estuary collection:** confirm change events are flowing: + + ```bash + flowctl collections read --collection <your-prefix>/shipping/shipments --uncommitted | head + ``` + + Or check throughput in the Estuary dashboard. +- **Tinybird:** query a Data Source or the published Pipe endpoint and confirm rows arrive within seconds of the generator's writes. ETAs in the `delays` node should shift as new `traffic_weather` rows change `impact_on_ETA_minutes`. +- **Dashboard:** the charts at http://localhost:3000 should populate and update as data flows. + +## Next steps + +- Add a `delete` simulation (uncomment `delete_old_shipments()` in `datagen.py`) to exercise Estuary CDC deletes end to end. +- Materialize the same collections into a warehouse (BigQuery, Snowflake, Databricks) directly from Estuary for historical analytics alongside the real-time Tinybird path. +- Add derivations in Estuary (SQL, TypeScript, or Python) to pre-aggregate or enrich shipments before they reach Tinybird. 
+ +## Resources + +- Blog: [Real-Time Freight Tracking with Estuary and Tinybird](https://estuary.dev/real-time-freight-tracking-estuary-tinybird/) +- Estuary docs: https://docs.estuary.dev +- MongoDB capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/mongodb/ +- Reading collections from Kafka (Dekaf): https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/ +- flowctl: https://docs.estuary.dev/concepts/flowctl/ +- Tinybird: https://www.tinybird.co/docs diff --git a/singlestore-webinar-2025/README.md b/singlestore-webinar-2025/README.md index 0d2f697..c3ee93e 100644 --- a/singlestore-webinar-2025/README.md +++ b/singlestore-webinar-2025/README.md @@ -1,14 +1,207 @@ -# Estuary x SingleStore Webinar +# Real-Time PostgreSQL CDC to SingleStore with Estuary + +Stream PostgreSQL change data capture (CDC) into [SingleStore](https://www.singlestore.com/) in real time using [Estuary](https://estuary.dev). This is the demo environment from the Estuary x SingleStore webinar: a local Postgres instance configured for logical replication, a Python data generator that continuously writes a realistic pet-store order stream, and an ngrok tunnel so the fully managed Estuary connector can reach your local database. Estuary captures every insert and update from Postgres into a collection and materializes it into SingleStore for low-latency analytics. 
+ +## Architecture + +The pipeline follows the standard Estuary capture -> collection -> materialization pattern: + +``` +PostgreSQL (wal_level=logical) Estuary SingleStore ++-------------------------+ +------------------------+ +---------------+ +| products / transactions | | source-postgres | | analytics | +| reviews / orders | --> | capture --> collection | --> | tables / | +| flow_publication | | (real-time data lake) | | pipelines | ++-------------------------+ +------------------------+ +---------------+ + ^ ^ + | datagen.py | ngrok tcp tunnel + | (inserts + status updates) | (exposes local Postgres to Estuary) +``` + +1. **Capture** — the `source-postgres` connector reads the Postgres write-ahead log (WAL) via the `flow_publication` publication and streams row-level inserts/updates into Estuary collections. +2. **Collections** — each captured table becomes a schematized collection: a real-time data lake of JSON stored in cloud storage. +3. **Materialization** — a SingleStore materialization pushes the collections into SingleStore tables, kept continuously up to date. + +Because Estuary is fully managed, your locally running Postgres has to be reachable from the internet. The included ngrok service publishes a TCP tunnel to `postgres:5432` that you paste into the Estuary capture configuration. + +## What's included + +- **`docker-compose.yml`** — spins up three services: + - `postgres` (container `postgres-cdc-motherduck-postgres`, image `postgres:latest`) started with `wal_level=logical` for CDC, listening on host port `5432`. + - `datagen` (container `postgres-cdc-motherduck-datagen`) built from `datagen/`, which continuously writes order activity into Postgres. + - `ngrok` (container `postgres-cdc-motherduck-ngrok`, image `ngrok/ngrok:latest`) running `tcp postgres:5432`, with its inspection UI on port `4040`. +- **`postgres/init.sql`** — runs on first boot to prepare Postgres for Estuary CDC. 
It grants the `postgres` user `REPLICATION` and `pg_read_all_data`, creates the `public.flow_watermarks` watermarks table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), creates the `products`, `transactions`, and `reviews` tables, adds all of them to the publication, and seeds the `products` table with a static catalog of pet-store items. +- **`datagen/datagen.py`** — connects to Postgres, creates an `orders` table (UUID primary key via `pgcrypto`), then loops forever generating fake orders with [Faker](https://faker.readthedocs.io/) and randomly advancing their `status` through `placed -> packed -> shipped -> delivered` (or `cancelled`). It can optionally target Google Cloud SQL instead of local Postgres (see env vars below). +- **`datagen/Dockerfile`** — Python 3.12 image that installs `requirements.txt` and runs `datagen.py`. +- **`datagen/requirements.txt`** — Python dependencies: `psycopg2-binary`, `Faker`, `SQLAlchemy`, `pg8000`, `cloud-sql-python-connector[pg8000]`, `python-dotenv`, `openai`. + +### Tables captured + +| Table | Key | Notes | +| --- | --- | --- | +| `products` | `product_id` | Seeded with a static pet-store catalog in `init.sql`. | +| `transactions` | `transaction_id` | Empty by default; available for the publication. | +| `reviews` | `review_id` | Empty by default; available for the publication. | +| `orders` | `order_id` (UUID) | Created and continuously updated by `datagen.py`. | + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose +- A free [ngrok](https://ngrok.com/) account and authtoken (used to expose local Postgres to Estuary) +- A free [Estuary account](https://dashboard.estuary.dev) +- A [SingleStore](https://www.singlestore.com/) account (Helios/Cloud) with a database and database user for the destination ## Setup +### 1. 
Set environment variables + +The compose file reads `NGROK_AUTHTOKEN` (required) and `OPENAI_API_KEY` (optional; passed to the datagen container but not required for the core pipeline). + ```bash -pip install -r requirements.txt +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +# Optional: +export OPENAI_API_KEY=<your-openai-key> ``` -## Run +### 2. Start the stack ```bash +docker compose up -d +``` + +This starts Postgres with logical replication enabled, applies `postgres/init.sql`, begins generating order data, and opens the ngrok TCP tunnel. + +### 3. Get the public Postgres endpoint + +Estuary connects to Postgres through ngrok. Read the public `host:port` from the ngrok inspection UI at [http://localhost:4040](http://localhost:4040), or grab it from the command line: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +# e.g. tcp://6.tcp.ngrok.io:18642 +``` + +Strip the `tcp://` prefix — you will paste `6.tcp.ngrok.io` as the host and `18642` as the port into Estuary. + +### Running the data generator locally (alternative to Docker) + +If you only want to produce data against an existing Postgres without the full compose stack: + +```bash +pip install -r datagen/requirements.txt python datagen/datagen.py ``` +Configure the target via environment variables (defaults shown): `POSTGRES_HOST=localhost`, `POSTGRES_PORT=5432`, `POSTGRES_DB=postgres`, `POSTGRES_USER=postgres`, `POSTGRES_PASSWORD=postgres`. To target Google Cloud SQL instead, set `USE_CLOUD_SQL=true` along with `CLOUD_SQL_CONNECTION_NAME`, `CLOUD_SQL_USER`, `CLOUD_SQL_PASSWORD`, and `CLOUD_SQL_DB`. + +## Configure the Estuary capture (PostgreSQL CDC) + +Create the source capture from the Estuary dashboard. + +1. Go to [dashboard.estuary.dev/captures](https://dashboard.estuary.dev/captures) and click **New Capture**. +2. Choose the **PostgreSQL** connector (`source-postgres`). +3. 
Enter the connection details from `docker-compose.yml` and the ngrok endpoint from Setup step 3: + + | Field | Value | + | --- | --- | + | Server Address | `<ngrok-host>:<ngrok-port>` (e.g. `6.tcp.ngrok.io:18642`) | + | Database | `postgres` | + | User | `postgres` | + | Password | `postgres` | + +4. The connector uses the publication and watermarks table created by `init.sql` (`flow_publication` and `public.flow_watermarks`). Select the tables to capture (`products`, `transactions`, `reviews`, `orders`). +5. Save and publish. Each captured table becomes a collection. + +Connector reference: [PostgreSQL capture connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). + +> The default `postgres`/`postgres` credentials and open ngrok tunnel are fine for a local webinar demo. Do not use them for anything you care about; rotate credentials and lock down access for real workloads. + +## Configure the Estuary materialization (SingleStore) + +Estuary offers two ways to land collections in SingleStore. Pick one. + +### Option A — direct SingleStore materialization (recommended) + +This option writes directly into SingleStore tables over the MySQL wire protocol. + +1. Go to [dashboard.estuary.dev/materializations](https://dashboard.estuary.dev/materializations) and click **New Materialization**. +2. Choose the **SingleStore** connector (`materialize-singlestore`). +3. Provide your SingleStore connection details: + + | Field | Value | + | --- | --- | + | Address | SingleStore host and port, e.g. `svc-abc123.aws-region.svc.singlestore.com:3333` | + | Database | your SingleStore database name | + | User | SingleStore database user | + | Password | SingleStore password | + + For SingleStore Helios/Cloud, expand **Advanced Options**, set SSL Mode to `verify_ca`, and supply SingleStore's TLS/SSL certificate as the SSL Server CA. +4. Bind the Postgres collections (`products`, `orders`, etc.) to destination tables and publish.
+ +Connector reference: [SingleStore materialization connector docs](https://docs.estuary.dev/reference/Connectors/materialization-connectors/MySQL/singlestore-mysql/). + +### Option B — Dekaf (SingleStore Kafka ingestion pipeline) + +Exposes your collections as Kafka-compatible topics that SingleStore pulls with a native `CREATE PIPELINE ... LOAD DATA KAFKA` statement. + +1. Create a **SingleStore (Dekaf)** materialization in Estuary and set an auth token of your choosing. +2. Note the full materialization name (`YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION`) — this is the SASL/schema-registry username. +3. In the SingleStore SQL Editor, create a table and a pipeline pointed at the Dekaf broker: + + ```sql + CREATE PIPELINE orders_pipeline AS + LOAD DATA KAFKA "dekaf.estuary-data.com:9092/orders" + CONFIG '{ + "security.protocol":"SASL_SSL", + "sasl.mechanism":"PLAIN", + "sasl.username":"YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION", + "broker.address.family": "v4", + "schema.registry.username": "YOUR-ORG/YOUR-PREFIX/YOUR-MATERIALIZATION", + "fetch.wait.max.ms": "2000" + }' + CREDENTIALS '{ + "sasl.password": "YOUR_AUTH_TOKEN", + "schema.registry.password": "YOUR_AUTH_TOKEN" + }' + INTO TABLE orders + FORMAT AVRO SCHEMA REGISTRY 'https://dekaf.estuary-data.com' + ( ... ); + ``` + +Connector reference: [SingleStore (Dekaf) materialization connector docs](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Dekaf/singlestore/). + +## Verify + +- In the Estuary dashboard, open your capture and materialization and watch the documents/bytes counters increase as `datagen.py` writes orders. 
+- Read live from a collection with flowctl: + + ```bash + flowctl auth login + flowctl collections read --collection <your-prefix>/orders --uncommitted | head + ``` + +- Query the destination in SingleStore to confirm rows are landing and order statuses advance over time: + + ```sql + SELECT status, COUNT(*) FROM orders GROUP BY status; + ``` + +## Teardown + +```bash +docker compose down -v +``` + +## Next steps + +- Add a [derivation](https://docs.estuary.dev/concepts/derivations/) to transform the order stream in SQL, TypeScript, or Python (e.g. join `orders` against the seeded `products` catalog, or compute rolling order-status funnels). +- Point the same Postgres collections at additional destinations (Snowflake, BigQuery, ClickHouse) by adding more materializations. +- Swap the local Postgres + ngrok setup for a managed Postgres (RDS, Cloud SQL, Supabase, Neon) and remove the tunnel. + +## Resources + +- [Estuary documentation](https://docs.estuary.dev) +- [Estuary dashboard](https://dashboard.estuary.dev) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [SingleStore materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/MySQL/singlestore-mysql/) +- [SingleStore (Dekaf) materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/Dekaf/singlestore/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) diff --git a/snowflake-cdc-pinecone-rag/README.md b/snowflake-cdc-pinecone-rag/README.md index 14c9fbe..4865c81 100644 --- a/snowflake-cdc-pinecone-rag/README.md +++ b/snowflake-cdc-pinecone-rag/README.md @@ -1,6 +1,180 @@ -# Snowflake CDC & RAG +# Real-Time Snowflake CDC to Pinecone for RAG with Estuary -This projects showcases the components of a CDC data flow that streams data from -Snowflake, vectorizes it then loads the embeddings into Pinecone. 
+Stream change data capture (CDC) from Snowflake into a Pinecone vector database in real time with [Estuary](https://estuary.dev), then query it with a Streamlit Retrieval-Augmented Generation (RAG) chatbot. New and updated customer support tickets in Snowflake are captured, embedded into vectors, and materialized into Pinecone so the chat app always answers from fresh data — no batch reindexing. -See blog post for details: \ No newline at end of file +## Architecture + +The pipeline uses Estuary to move data from Snowflake to Pinecone, and a LlamaIndex + OpenAI Streamlit app to query it: + +``` +Snowflake (SUPPORT_REQUESTS) + │ CDC capture (source-snowflake) + ▼ +Estuary collection ──► text embedding ──► materialization (materialize-pinecone) + │ │ + │ ▼ + │ Pinecone (namespace: Support_Requests) + │ ▲ + ▼ │ + datagen/ writes/updates/deletes rows Streamlit RAG app (app.py / rag.py) +``` + +End to end, in Estuary terms: + +1. A **capture** using the Snowflake source connector streams CDC events from the `SUPPORT_REQUESTS` table into an Estuary **collection** (a schematized, real-time data lake of JSON in cloud storage). +2. A **materialization** using the Pinecone connector embeds the collection's documents and writes the resulting vectors to a Pinecone index under the `Support_Requests` namespace. +3. The **Streamlit app** retrieves the most relevant vectors from Pinecone and feeds them to an OpenAI chat model to answer questions about the support tickets. + +## What's included + +- `docker-compose.yml` — spins up two services: `snowflake-cdc-datagen` (seeds and mutates the Snowflake table) and `snowflake-cdc-streamlit` (the RAG chat UI on port `8501`). +- `datagen/datagen.py` — connects to Snowflake, creates the `SUPPORT_REQUESTS` table if missing, and continuously inserts (70%), updates (10%), and deletes (20%) realistic support tickets. Descriptions are generated with OpenAI (`gpt-3.5-turbo-0125`) and Faker. 
+- `datagen/Dockerfile`, `datagen/requirements.txt` — container image and dependencies for the data generator. +- `app.py` — the Streamlit front end ("Chat with Snowflake"): session handling, chat history, prompt input, and source attribution. +- `rag.py` — wires up LlamaIndex: a `PineconeVectorStore` (namespace `Support_Requests`, text key `flow_document`), a top-5 retriever, an OpenAI `gpt-3.5-turbo` LLM, and a `CondensePlusContextChatEngine`. +- `Dockerfile` — builds the Streamlit container (`streamlit run app.py --server.port 8501`). +- `requirements.txt` — Streamlit app dependencies (Streamlit, LlamaIndex, the Pinecone vector store integration, python-dotenv). +- `.streamlit/config.toml` — enables static serving and the light theme. +- `estuary_logo.png` — branding shown in the chat UI. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [Estuary account](https://dashboard.estuary.dev). +- A **Snowflake** account with a warehouse, database, schema, role, and a user/password that can read and write the `SUPPORT_REQUESTS` table. +- A **GCP service account JSON key** for Snowflake-to-Estuary staging (mounted by the datagen service — see `docker-compose.yml`). Snowflake captures stage data through cloud storage; see the connector docs below. +- A **Pinecone** account with an index, plus its API key and index host URL. +- An **OpenAI** API key (used by both the data generator and the RAG app). + +## Setup + +### 1. Configure environment variables + +Edit `docker-compose.yml` and fill in the empty values. 
+ +For the `datagen` service (Snowflake source): + +```yaml +environment: + SNOWFLAKE_ACCOUNT: "<your-account>" + SNOWFLAKE_USER: "<your-user>" + SNOWFLAKE_PASSWORD: "<your-password>" + SNOWFLAKE_ROLE: "<your-role>" + SNOWFLAKE_WAREHOUSE: "<your-warehouse>" + SNOWFLAKE_DATABASE: "<your-database>" + SNOWFLAKE_SCHEMA: "<your-schema>" + SNOWFLAKE_TABLE: "SUPPORT_REQUESTS" +``` + +The datagen service also needs an `OPENAI_API_KEY` to generate ticket descriptions — add it under the `datagen` service `environment` block, or place it in a `.env` file that Compose loads. Update the volume mount to point at your GCP service-account credentials file: + +```yaml +volumes: + - /absolute/path/to/gcp-service-account-cred.json:/credentials.json +``` + +For the `streamlit` service (RAG app): + +```yaml +environment: + PINECONE_API_KEY: "<your-pinecone-api-key>" + PINECONE_HOST: "<your-pinecone-index-host>" + OPENAI_API_KEY: "<your-openai-api-key>" +``` + +### 2. Generate data in Snowflake + +Start the data generator so the `SUPPORT_REQUESTS` table exists and begins filling with rows: + +```bash +docker compose up -d datagen +``` + +Watch it work: + +```bash +docker compose logs -f datagen +``` + +You should see `Inserted new support request`, `Updated support request`, and `Deleted support request` messages every couple of seconds. The table schema is: + +| Column | Type | +| ------------- | ------ | +| REQUEST_ID | INT | +| CUSTOMER_ID | INT | +| REQUEST_DATE | STRING | +| REQUEST_TYPE | STRING | +| STATUS | STRING | +| DESCRIPTION | STRING | + +## Configure the Estuary capture (Snowflake CDC) + +Create the capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures): + +1. Click **New Capture** and choose the **Snowflake** source connector. +2. Enter your Snowflake connection details — the same `SNOWFLAKE_ACCOUNT`, user, password, role, warehouse, database, and schema you set in `docker-compose.yml`. +3. 
Select the `SUPPORT_REQUESTS` table to bind it to an Estuary collection. +4. Save and publish. Estuary begins streaming the table's change data into the collection. + +Connector reference: [Snowflake source connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/snowflake/). + +## Configure the Estuary materialization (Pinecone) + +Create the materialization in the [Estuary dashboard](https://dashboard.estuary.dev/materializations): + +1. Click **New Materialization** and choose the **Pinecone** destination connector. +2. Provide your Pinecone API key, index, and embedding configuration (the connector embeds documents before upserting vectors). +3. Bind the Snowflake collection from the capture above to the Pinecone index, using the namespace **`Support_Requests`** so it matches what `rag.py` queries. +4. Save and publish. Estuary embeds and upserts vectors into Pinecone in real time as rows change in Snowflake. + +> The RAG app reads vectors from the `Support_Requests` namespace and uses `flow_document` as the text key (`rag.py`). Keep these aligned with the materialization config. + +Connector reference: [Pinecone materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/). + +## Running the RAG app + +Once data is flowing into Pinecone, start the Streamlit chat UI: + +```bash +docker compose up -d streamlit +``` + +Open [http://localhost:8501](http://localhost:8501) and ask questions about the customer support tickets, for example: + +- "What are customers complaining about most?" +- "Summarize the open billing issues." +- "Are there any authentication failures reported recently?" + +The app retrieves the top 5 matching support tickets from Pinecone and answers with OpenAI `gpt-3.5-turbo`, citing the source documents it used. 
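
To make "top 5 matching" concrete, here is a toy sketch of top-k retrieval by cosine similarity. It is not what `rag.py` does internally (that work is delegated to LlamaIndex and Pinecone), and the similarity metric of your Pinecone index may differ; the function names and the in-memory dictionary of vectors are hypothetical stand-ins used only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, vectors, k=5):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [vector_id for vector_id, _ in ranked[:k]]


# Hypothetical two-dimensional "embeddings" keyed by ticket id.
sample_vectors = {"ticket-a": [1.0, 0.0], "ticket-b": [0.0, 1.0], "ticket-c": [1.0, 1.0]}
sample_hits = top_k([1.0, 0.0], sample_vectors, k=2)
```

In the real app the retrieved documents, not just their ids, are passed to the chat model as context.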
+ +To run everything at once: + +```bash +docker compose up -d +``` + +## Verify + +- **Snowflake**: query `SELECT COUNT(*) FROM SUPPORT_REQUESTS;` and confirm the count grows as datagen runs. +- **Estuary**: in the dashboard, check the capture and materialization task metrics for non-zero documents and bytes flowing. You can also tail the collection: + + ```bash + flowctl collections read --collection <your/collection/name> --uncommitted | head + ``` + +- **Pinecone**: confirm vectors appear in the index under the `Support_Requests` namespace. +- **App**: ask a question about a ticket you can see in Snowflake and verify the answer reflects it. + +## Next steps + +- Swap the OpenAI chat model in `rag.py` (`gpt-3.5-turbo`) for a different model. +- Point the same Snowflake collection at additional destinations (a warehouse, search index, etc.) via more materializations — no re-capture needed. +- Add a [derivation](https://docs.estuary.dev/concepts/derivations/) in SQL, TypeScript, or Python to clean or enrich tickets before they reach Pinecone. + +## Resources + +- [Estuary documentation](https://docs.estuary.dev) +- [Estuary dashboard](https://dashboard.estuary.dev) +- [Snowflake source connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/snowflake/) +- [Pinecone materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/pinecone/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) diff --git a/sqlserver-cdc-capture/README.md b/sqlserver-cdc-capture/README.md index cadf944..6395607 100644 --- a/sqlserver-cdc-capture/README.md +++ b/sqlserver-cdc-capture/README.md @@ -1,35 +1,125 @@ -# SQL Server CDC Capture Demo +# SQL Server CDC Capture Demo with Estuary -A self-contained SQL Server environment with CDC enabled and a data generator -producing continuous inserts, updates, and deletes against a `sales` table. -Useful for demoing or testing an Estuary SQL Server CDC capture end-to-end. 
+A self-contained SQL Server 2022 environment with Change Data Capture (CDC) +enabled and a data generator producing continuous inserts, updates, and deletes +against a `dbo.sales` table. An [Estuary](https://estuary.dev) capture +streams every change in real time into a collection, giving you an end-to-end, +copy-pasteable way to demo or test SQL Server CDC replication without touching a +production database. + +## Architecture + +The pipeline runs entirely from Docker, with the local SQL Server exposed to +Estuary's hosted connector through an ngrok TCP tunnel: + +``` +┌──────────────┐ inserts/updates/deletes ┌──────────────────────┐ +│ datagen │ ───────────────────────────────▶ │ SQL Server 2022 │ +│ (faker loop) │ once per second │ SampleDB.dbo.sales │ +└──────────────┘ │ (CDC enabled) │ + └──────────┬───────────┘ + │ CDC change tables + ┌──────────▼───────────┐ + │ ngrok tcp 1433 │ + │ public host:port │ + └──────────┬───────────┘ + │ source-sqlserver + ┌──────────▼───────────┐ + │ Estuary │ + │ capture ──▶ collection + │ .../dbo/sales │ + └──────────────────────┘ +``` + +- **Capture (source):** the [`source-sqlserver`](https://docs.estuary.dev/reference/Connectors/capture-connectors/SQLServer/) + connector reads from SQL Server CDC change tables and emits change events. +- **Collection:** each captured change lands in the `your-prefix/sqlserver-cdc/dbo/sales` + collection, a schematized JSON stream keyed on `/sale_id` with reduction + annotations that apply creates, updates, and deletes (`_meta.op` of `c`/`u`/`d`). +- **Materialization (next step):** point a materialization at the collection to + push the data into a destination such as Snowflake, BigQuery, or Postgres. ## What's included -- `sqlserver/` — SQL Server 2022 with an init script that enables CDC, creates - the `sales` table, and sets up the `flow_capture` user. -- `datagen/` — Python container that inserts, updates, and deletes rows in - `dbo.sales` once per second. 
-- `ngrok` — exposes SQL Server publicly over TCP so the hosted Estuary - connector can reach it. -- `flow.yaml` — Estuary capture spec wiring the SQL Server source to a - collection in your tenant. +- `docker-compose.yml` — spins up three services: `sql-server` (the database), + `sql-server-datagen` (the load generator), and `sql-server-ngrok` (the TCP + tunnel). +- `sqlserver/Dockerfile` — builds SQL Server 2022 (`mcr.microsoft.com/mssql/server:2022-latest`) + and runs `init.sql` on startup. +- `sqlserver/init.sql` — creates the `SampleDB` database, enables CDC at the + database level, creates the `flow_capture` login/user with the permissions the + connector needs, creates the `dbo.sales` table, and enables CDC on it. +- `datagen/datagen.py` — Python script using `pyodbc` and `Faker` that inserts, + updates, and deletes rows in `dbo.sales` roughly once per second (weighted + 70% insert, 20% delete, 10% update). +- `datagen/Dockerfile` / `datagen/requirements.txt` — container image and pinned + dependencies (`Faker==25.1.0`, `pyodbc==5.1.0`) for the data generator. +- `flow.yaml` — the Estuary capture spec: a `source-sqlserver` capture with one + binding for `dbo.sales`, targeting the `.../dbo/sales` collection. +- `your-prefix/` — the imported collection spec and inferred JSON schema + (`sales.schema.yaml`) generated by discovery. Rename this prefix to your own + tenant before publishing. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [ngrok](https://ngrok.com/) account and an authtoken (required because + the database runs locally and Estuary is fully managed/hosted). +- A free Estuary account: <https://dashboard.estuary.dev>. +- [`flowctl`](https://docs.estuary.dev/concepts/flowctl/) installed and + authenticated (`flowctl auth login`) if you want to deploy from `flow.yaml`. 
## Running it -Set your ngrok token and start the stack: +Set your ngrok authtoken and start the stack: ```bash export NGROK_AUTHTOKEN=<your-token> docker compose up -d ``` -Grab the public host/port from the ngrok dashboard at http://localhost:4040. +This starts SQL Server, runs `init.sql`, begins generating change events, and +opens a public TCP tunnel to port `1433`. + +Grab the public host and port for the tunnel from the ngrok dashboard at +<http://localhost:4040>, or with a one-liner: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +# e.g. tcp://0.tcp.sa.ngrok.io:11059 +``` + +Strip the `tcp://` prefix when you paste the value into Estuary — the connector +expects a bare `host:port` (e.g. `0.tcp.sa.ngrok.io:11059`). + +## Configure the Estuary capture + +You can wire up the capture either via the Estuary dashboard or via `flowctl` +using the provided `flow.yaml`. + +The connection values are fixed by `init.sql` and `docker-compose.yml`: -## Creating the Estuary capture +| Field | Value | +| ---------- | -------------------------------------- | +| Connector | SQL Server (`ghcr.io/estuary/source-sqlserver:v0`) | +| Address | the ngrok host:port from above | +| Database | `SampleDB` | +| User | `flow_capture` | +| Password | `Secretsecret1` | +| Namespace | `dbo` | +| Table | `sales` | -`flow.yaml` is the spec deployed via [flowctl](https://docs.estuary.dev/concepts/flowctl/). -It defines a SQL Server CDC capture and one binding for the `dbo.sales` table: +### Option A — Dashboard + +1. Go to <https://dashboard.estuary.dev/captures> and create a new capture. +2. Choose the **SQL Server** connector. +3. Enter the address, database, user, and password from the table above. +4. Save and publish. Estuary discovers `dbo.sales` and creates a collection + bound to it. 
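
If you script the setup, the conversion from the ngrok `public_url` to the bare `host:port` that the Address field expects can be sketched as below. This is a small illustration using only the Python standard library; the function name `ngrok_address` is hypothetical, and it assumes the URL has the `tcp://host:port` shape shown in the example above.

```python
from urllib.parse import urlsplit


def ngrok_address(public_url: str) -> str:
    """Turn 'tcp://host:port' into the bare 'host:port' form."""
    parts = urlsplit(public_url)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {public_url!r}")
    return f"{parts.hostname}:{parts.port}"


# Example value taken from the ngrok output shown earlier in this README.
sample_address = ngrok_address("tcp://0.tcp.sa.ngrok.io:11059")
```

Rejecting anything that is not a `tcp://` URL with an explicit port avoids pasting a half-formed address into the capture configuration.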
+ +### Option B — flowctl + flow.yaml + +`flow.yaml` defines a SQL Server CDC capture with one binding for `dbo.sales`: ```yaml captures: @@ -50,12 +140,15 @@ captures: target: your-prefix/sqlserver-cdc/dbo/sales ``` -Before publishing, edit `address` to match the host/port from the ngrok -dashboard, and replace `your-prefix` with your own tenant/prefix everywhere it -appears (in `flow.yaml`, in the imported yamls under `your-prefix/`, and the -directory name itself). +Before publishing: + +1. Edit `address` in `flow.yaml` to match the ngrok host/port (without + `tcp://`). +2. Replace `your-prefix` with your own tenant/prefix everywhere it appears — in + `flow.yaml`, in the imported specs under `your-prefix/`, and in the directory + name itself. -Then discover, publish, and verify: +Then discover (optional), publish, and check status: ```bash # (Re-)run discovery to refresh bindings from the source — optional once @@ -65,22 +158,48 @@ flowctl discover --source flow.yaml # Publish the capture and the generated collection. flowctl catalog publish --source flow.yaml --auto-approve -# Check status; first transition should be PENDING → BACKFILLING → OK. +# Check status; first transitions should be PENDING → BACKFILLING → OK. flowctl catalog status your-prefix/sqlserver-cdc/source-sqlserver +``` + +## Verify + +Confirm change events are flowing through the collection: -# Peek at a few documents flowing through the collection. +```bash flowctl collections read \ --collection your-prefix/sqlserver-cdc/dbo/sales \ --uncommitted | head ``` -### Credentials +You can also watch live throughput and document counts for the capture in the +Estuary dashboard at <https://dashboard.estuary.dev>. 
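
To check that inserts, updates, and deletes are all arriving, you can tally the `_meta.op` values (`c`/`u`/`d`) in the documents you read. The sketch below assumes the `flowctl collections read` output is one JSON document per line, which you should confirm against your own output; the function name `tally_ops` and the sample documents are hypothetical.

```python
import json
from collections import Counter


def tally_ops(lines):
    """Count documents by their _meta.op value ('c', 'u', 'd')."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between documents
        doc = json.loads(line)
        counts[doc.get("_meta", {}).get("op", "unknown")] += 1
    return counts


# Hypothetical documents shaped like captured change events.
sample_lines = [
    '{"_meta": {"op": "c"}, "sale_id": 1}',
    '{"_meta": {"op": "u"}, "sale_id": 1}',
    '{"_meta": {"op": "c"}, "sale_id": 2}',
    '{"_meta": {"op": "d"}, "sale_id": 2}',
    "",
]
sample_counts = tally_ops(sample_lines)
```

To run it against live data, pass `sys.stdin` to `tally_ops` and pipe the `flowctl collections read` output into the script. With the generator's weighting, creates should dominate the tally.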
+ +## Credentials reference + +`init.sql` provisions a dedicated CDC user that the capture connects as, grants +the permissions the connector requires (`SELECT` on schemas `dbo` and `cdc`, +plus `VIEW DATABASE STATE`), enables CDC on the database, and enables CDC on +`dbo.sales` with `@role_name = 'flow_capture'`: + +- **CDC user:** `flow_capture` +- **CDC password:** `Secretsecret1` +- **Database:** `SampleDB` + +The `sa` account (password `SuperSecurePassword1`, set in `docker-compose.yml`) +is used only by the local data generator, not by the Estuary connector. + +## Next steps -The `init.sql` script provisions a dedicated CDC user the capture connects as: +- Add a materialization to send the captured data to a warehouse or database: + <https://dashboard.estuary.dev/materializations>. +- Add a [derivation](https://docs.estuary.dev/concepts/derivations/) to + transform the `sales` collection in SQL, TypeScript, or Python. +- Tear down the demo with `docker compose down -v`. -- user: `flow_capture` -- password: `Secretsecret1` -- database: `SampleDB` +## Resources -It also grants the permissions the connector needs (`SELECT` on `dbo` and `cdc`, -plus `VIEW DATABASE STATE`) and enables CDC on `dbo.sales`. 
+- Estuary docs: <https://docs.estuary.dev> +- SQL Server capture connector reference: + <https://docs.estuary.dev/reference/Connectors/capture-connectors/SQLServer/> +- flowctl reference: <https://docs.estuary.dev/concepts/flowctl/> diff --git a/sqlserver-cdc-materialize/README.md b/sqlserver-cdc-materialize/README.md index 58c849d..c3818b9 100644 --- a/sqlserver-cdc-materialize/README.md +++ b/sqlserver-cdc-materialize/README.md @@ -1,5 +1,161 @@ -# SQL Server CDC to Materialize +# SQL Server CDC to Materialize for Real-Time Analytics with Estuary -See blog post for details: +Stream change data capture (CDC) events from Microsoft SQL Server into [Materialize](https://materialize.com/) — the streaming operational database — using [Estuary](https://estuary.dev) and the Dekaf Kafka-compatible API. This example spins up a SQL Server instance with CDC enabled, generates a continuous stream of insert/update/delete operations against a `sales` table, captures those changes with Estuary, and consumes them in Materialize as a Kafka source to power an incrementally maintained `sales_anomalies` view. -https://estuary.dev/cdc-sqlserver-materialize/ +Companion blog post: https://estuary.dev/cdc-sqlserver-materialize/ + +## Architecture + +``` +┌────────────┐ INSERT/UPDATE/DELETE ┌──────────────┐ CDC capture ┌──────────────────┐ +│ datagen │ ───────────────────────► │ SQL Server │ ───────────────► │ Estuary │ +│ (faker) │ dbo.sales │ (CDC + MSSQL│ source- │ collection │ +└────────────┘ │ Agent) │ sqlserver │ .../sales │ + └──────┬───────┘ └────────┬─────────┘ + │ ngrok TCP tunnel │ Dekaf + │ (port 1433 -> public) │ (Kafka API + CSR) + ▼ ▼ + Estuary connector ┌──────────────────┐ + reaches local DB │ Materialize │ + │ KAFKA SOURCE + │ + │ sales_anomalies │ + └──────────────────┘ +``` + +End-to-end flow in Estuary terms: + +1. 
**Capture** — the `source-sqlserver` connector reads the SQL Server CDC change tables for `dbo.sales` and streams every insert, update, and delete into an Estuary **collection**. +2. **Collection** — the change stream lands as schematized JSON in your Estuary collection (a real-time, durable data lake backing the pipeline). +3. **Consume via Dekaf** — instead of a managed materialization connector, Materialize reads the collection directly through Estuary's **Dekaf** Kafka-compatible API. Materialize is configured with a `KAFKA` connection and a `CONFLUENT SCHEMA REGISTRY` connection pointing at Dekaf, then declares a `CREATE SOURCE` over the collection topic with `ENVELOPE UPSERT`. +4. **Transform in Materialize** — the `sales_anomalies` view computes a rolling 7-day per-customer average spend and surfaces sales whose `total_price` exceeds 1.5x that average, kept fresh incrementally by Materialize. + +Because Estuary is fully managed, the SQL Server instance running on your machine is exposed to the connector through an `ngrok` TCP tunnel. + +## What's included + +- **`docker-compose.yml`** — defines three services: `sql-server` (the source database, port `1433`), `datagen` (the load generator), and `ngrok` (TCP tunnel exposing SQL Server to Estuary, web UI on port `4040`). +- **`sqlserver/Dockerfile`** — builds on `mcr.microsoft.com/mssql/server:2022-latest`, copies in `init.sql`, and runs it after the server starts. +- **`sqlserver/init.sql`** — creates the `SampleDB` database, enables CDC at the database level (`sys.sp_cdc_enable_db`), creates the `flow_capture` login/user with `SELECT` on the `dbo` and `cdc` schemas, creates the `dbo.flow_watermarks` table required by the Estuary connector, creates the `dbo.sales` table, and enables CDC on both tables via `sys.sp_cdc_enable_table`. 
+- **`datagen/datagen.py`** — connects via `pyodbc` and continuously performs weighted random operations against `dbo.sales` (70% inserts, 20% deletes, 10% updates), one per second, using `Faker` to generate realistic sale rows. +- **`datagen/Dockerfile`** + **`datagen/requirements.txt`** — Python 3.12 image with the Microsoft ODBC Driver 18 and `mssql-tools18`; pins `Faker==25.1.0` and `pyodbc==5.1.0`. +- **`materialize-setup.sql`** — the SQL you run inside Materialize: creates the Dekaf `KAFKA` and `CONFLUENT SCHEMA REGISTRY` connections, the `CREATE SOURCE sqlserver_sales` Kafka source, the `sales_anomalies` view, and an index on it. + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [ngrok](https://ngrok.com/) account and authtoken (the SQL Server instance runs locally and must be reachable by Estuary's hosted connector). +- A free [Estuary account](https://dashboard.estuary.dev). +- A [Materialize](https://materialize.com/) account (Materialize Cloud or a self-managed instance with the `psql` client). + +## Setup + +### 1. Start the stack + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up --build +``` + +This builds and starts: + +- `sql-server` — SQL Server 2022 (Developer edition) with the SQL Server Agent enabled (required for CDC). The `init.sql` script creates `SampleDB`, the `dbo.sales` table, the `flow_capture` capture user, the `dbo.flow_watermarks` table, and enables CDC. +- `datagen` — begins inserting/updating/deleting rows in `dbo.sales` once the database healthcheck passes. +- `ngrok` — opens a TCP tunnel to `sql-server:1433`. + +### 2. Get the public SQL Server endpoint + +Open the ngrok inspector at http://localhost:4040, or grab it from the API: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url" +# e.g. tcp://6.tcp.ngrok.io:14820 +``` + +Strip the `tcp://` prefix — the host (e.g. `6.tcp.ngrok.io`) and port (e.g. 
`14820`) go into the Estuary capture config. + +## Configure the Estuary capture + +Create a new SQL Server capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures) (**+ NEW CAPTURE → SQL Server**) using the [`source-sqlserver` connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/SQLServer/). The local stack provisions everything the connector needs, so use these exact values: + +| Field | Value | +| --- | --- | +| Server Address | the ngrok host:port from step 2 (e.g. `6.tcp.ngrok.io:14820`) | +| Database | `SampleDB` | +| User | `flow_capture` | +| Password | `Secretsecret1` | + +The capture discovers the `dbo.sales` table and writes its CDC stream into a collection (e.g. `<your-prefix>/<capture-name>/sales`). Note the full collection name — you'll reference it as the Dekaf topic in Materialize. + +> The `flow_capture` user, its `SELECT` grants on `dbo`/`cdc`, the `dbo.flow_watermarks` table, and database/table-level CDC are all created automatically by `sqlserver/init.sql`. No manual SQL Server setup is required. + +## Configure Materialize (consume via Dekaf) + +Materialize reads the Estuary collection through [Dekaf](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/), Estuary's Kafka-compatible API. The `materialize-setup.sql` file contains the full script. Before running it: + +1. Generate an Estuary access/refresh token at https://dashboard.estuary.dev/admin/api and paste it into the `estuary_refresh_token` secret. +2. Replace the `TOPIC` value with **your** full collection name from the capture step. The file ships with `Dani/sqlservertest1/sales` as an example — change it to your collection (e.g. `<your-prefix>/<capture-name>/sales`). 
+ +Then apply it against Materialize: + +```bash +psql "<your-materialize-connection-string>" -f materialize-setup.sql +``` + +Key statements in `materialize-setup.sql`: + +```sql +CREATE SECRET estuary_refresh_token AS '<your-estuary-refresh-token>'; + +CREATE CONNECTION estuary_connection TO KAFKA ( + BROKER 'dekaf.estuary.dev', + SECURITY PROTOCOL = 'SASL_SSL', + SASL MECHANISMS = 'PLAIN', + SASL USERNAME = '{}', + SASL PASSWORD = SECRET estuary_refresh_token +); + +CREATE CONNECTION csr_estuary_connection TO CONFLUENT SCHEMA REGISTRY ( + URL 'https://dekaf.estuary.dev', + USERNAME = '{}', + PASSWORD = SECRET estuary_refresh_token +); + +CREATE SOURCE sqlserver_sales + FROM KAFKA CONNECTION estuary_connection (TOPIC 'Dani/sqlservertest1/sales') + FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_estuary_connection + ENVELOPE UPSERT; +``` + +The script then defines the `sales_anomalies` view (rolling 7-day per-customer average spend, flagging sales above 1.5x that average) and indexes it on `customer_id`. + +## Verify + +Confirm change events are landing in your Estuary collection from the dashboard, or with `flowctl`: + +```bash +flowctl auth login +flowctl collections read --collection <your-prefix>/<capture-name>/sales --uncommitted | head +``` + +In Materialize, confirm rows are flowing and the anomaly view is populated: + +```sql +SELECT count(*) FROM sqlserver_sales; +SELECT * FROM sales_anomalies LIMIT 20; +``` + +Because `datagen` runs continuously, you'll see counts and anomalies change as new sales stream in. + +## Next steps + +- Add a [managed materialization](https://dashboard.estuary.dev/materializations) (Snowflake, BigQuery, ClickHouse, Postgres, etc.) on the same `sales` collection to fan out to more destinations. +- Apply a [derivation](https://docs.estuary.dev/concepts/derivations/) to transform the collection in SQL, TypeScript, or Python before it reaches downstream systems. 
+- Extend `sqlserver/init.sql` with more tables and let the capture discover them. + +## Resources + +- Blog post: [Real-Time CDC from SQL Server to Materialize](https://estuary.dev/cdc-sqlserver-materialize/) +- Estuary docs: https://docs.estuary.dev +- SQL Server capture connector: https://docs.estuary.dev/reference/Connectors/capture-connectors/SQLServer/ +- Reading collections from Kafka (Dekaf): https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/ +- flowctl: https://docs.estuary.dev/concepts/flowctl/ diff --git a/streaming-lakehouse-iceberg-duckdb/README.md b/streaming-lakehouse-iceberg-duckdb/README.md index 96e8299..907632c 100644 --- a/streaming-lakehouse-iceberg-duckdb/README.md +++ b/streaming-lakehouse-iceberg-duckdb/README.md @@ -1,11 +1,161 @@ -# Streaming Lakehouse with Apache Iceberg +# Real-Time PostgreSQL CDC to Apache Iceberg Lakehouse with Estuary, AWS Glue & DuckDB/PyIceberg + +Build a streaming lakehouse: stream change data capture (CDC) from PostgreSQL into an [Apache Iceberg](https://iceberg.apache.org/) lakehouse on Amazon S3 (cataloged by AWS Glue) using [Estuary](https://estuary.dev), then query the Iceberg tables with PyIceberg / DuckDB. A local Postgres database is seeded and continuously mutated by a data generator, exposed to Estuary's managed connectors via an ngrok TCP tunnel, captured in real time, and materialized into Iceberg where `main.py` reads the `transactions` table and reconstructs the latest row state from CDC operation metadata. + +This example accompanies the walkthrough: [Building a Streaming Lakehouse with Estuary and Iceberg](https://estuary.dev/building-streaming-lakehouse-flow-iceberg/). 
+ +## Architecture + +``` +PostgreSQL (logical replication) Estuary (managed) Apache Iceberg lakehouse ++-----------------------------+ +----------------------+ +-------------------------+ +| users | | source-postgres | | AWS Glue catalog | +| transactions --CDC--> | ngrok | capture | --coll--> | S3 data files | +| transaction_metadata | =======>| -> collections | materialize| iceberg-rest/glue tables| ++-----------------------------+ TCP +----------------------+ +-------------------------+ + ^ | + | inserts/updates/deletes (datagen) PyIceberg/DuckDB (main.py) + scan + rebuild latest state +``` + +Data flow in Estuary terms: + +1. **Capture** — the `source-postgres` connector reads the Postgres write-ahead log (logical replication) and streams inserts, updates, and deletes from `public.users`, `public.transactions`, and `public.transaction_metadata`. +2. **Collections** — each table lands in an Estuary collection: a schematized, real-time data lake of JSON documents in cloud storage. Each document carries CDC metadata under `_meta` (including the operation type `_meta.op`: `c`/`u`/`d`). +3. **Materialization** — the Iceberg materialization connector writes the collections to Apache Iceberg tables backed by S3 and registered in the AWS Glue Data Catalog. +4. **Query** — `main.py` uses PyIceberg's `GlueCatalog` to load the `transactions` Iceberg table into pandas and rebuilds the current state by applying the captured operations. + +## What's included + +| Path | Role | +| --- | --- | +| `docker-compose.yml` | Spins up three services: `postgres-streaming-lakehouse` (Postgres with `wal_level=logical`), `datagen-streaming-lakehouse` (the load generator), and `ngrok-streaming-lakehouse` (a TCP tunnel exposing Postgres to Estuary). 
| +| `postgres/init.sql` | Runs on first boot: creates the `flow_capture` replication user, grants read/write, creates the `public.flow_watermarks` table, creates the `flow_publication` publication, defines the `users`, `transactions`, and `transaction_metadata` tables, and seeds 20 users. | +| `datagen/datagen.py` | Continuously inserts (70%), deletes (20%), and updates (10%) random transactions once per second, including occasional anomalous amounts, to produce a steady CDC stream. Inserts also write the associated transaction metadata and deletes remove it; updates touch the `transactions` table only. | +| `datagen/Dockerfile` | Builds the Python 3.12 image for the data generator. | +| `datagen/requirements.txt` | Data generator dependencies: `Faker==25.1.0`, `psycopg2==2.9.9`. | +| `main.py` | Reads the materialized `transactions` Iceberg table via PyIceberg `GlueCatalog`, prints stats, and reconstructs latest row state from `_meta.op` (filtering deletes, applying updates, aggregating per user). | +| `requirements.txt` | Query-side dependencies: `pyiceberg==0.6.1`, `boto3==1.34.134`, `pandas`, and `python-dotenv` (imported by `main.py`). | + +## Prerequisites + +- [Docker](https://docs.docker.com/get-docker/) and Docker Compose. +- A free [ngrok](https://ngrok.com/) account and authtoken — Estuary is fully managed, so the local Postgres must be exposed via an ngrok TCP tunnel for the capture connector to reach it. +- A free [Estuary account](https://dashboard.estuary.dev). +- An AWS account with: + - An S3 bucket for Iceberg data files. + - AWS Glue Data Catalog access (the materialization registers Iceberg tables in Glue). + - IAM credentials (`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`) with permission to read/write the bucket and manage Glue tables. +- Python 3.12+ to run `main.py` (the query/consumer step). ## Setup -1. Start containers: `docker compose up` -2. 
Get PostgreSQL URL: `curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url'` -3. Set up Estuary capture -4. ??? -5. Profit! +### 1. Start the local stack + +Set your ngrok authtoken and bring up the containers: + +```bash +export NGROK_AUTHTOKEN=<your-ngrok-authtoken> +docker compose up +``` + +This starts Postgres (with logical replication enabled), applies `postgres/init.sql`, launches the data generator, and opens the ngrok tunnel. You should see the generator logging `Inserted new transaction ...` once it connects. + +### 2. Get the public Postgres endpoint + +The ngrok tunnel forwards `postgres:5432`. Read the public `host:port` from the ngrok dashboard at <http://localhost:4040>, or via the API: + +```bash +curl -s http://localhost:4040/api/tunnels | jq -r '.tunnels[0].public_url' +``` + +This prints something like `tcp://6.tcp.ngrok.io:18632`. Strip the `tcp://` prefix when pasting into Estuary (use `6.tcp.ngrok.io:18632`). + +## Configure the Estuary capture (PostgreSQL CDC) + +The `postgres/init.sql` script has already provisioned everything the `source-postgres` connector needs: the `flow_capture` user, the `flow_watermarks` table, and the `flow_publication` publication. Create the capture in the [Estuary dashboard](https://dashboard.estuary.dev/captures): + +1. Go to **Captures → New Capture** and select the **PostgreSQL** connector. +2. Enter the connection details: + + | Field | Value | + | --- | --- | + | Server Address | the ngrok `host:port` from step 2 (no `tcp://`) | + | Database | `postgres` | + | User | `flow_capture` | + | Password | `password` | + +3. Publish. The connector discovers `public.users`, `public.transactions`, and `public.transaction_metadata` and writes each to an Estuary collection. + +Connector reference: [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/). 
+ +> The seeded `flow_capture` credentials (`flow_capture` / `password`) and publication name (`flow_publication`) come directly from `postgres/init.sql`. Postgres is started with `-c wal_level=logical` in `docker-compose.yml`. + +## Configure the Estuary materialization (Apache Iceberg) + +Create a materialization from the captured collections into Apache Iceberg backed by S3 + AWS Glue in the [Estuary dashboard](https://dashboard.estuary.dev/materializations): + +1. Go to **Materializations → New Materialization** and choose the **Apache Iceberg** connector. +2. Provide your AWS credentials, S3 bucket, AWS region, and the Glue catalog namespace you want the tables created under. +3. Link the `users`, `transactions`, and `transaction_metadata` collections from the capture above and publish. + +Note: the Apache Iceberg materialization connector provisions and runs on AWS EMR Serverless to write Iceberg data files, so you'll also need to supply the EMR Serverless configuration (application/role) required by the connector in addition to the AWS credentials, S3 bucket, region, and Glue namespace — see the connector reference below for the full set of required fields. + +Connector reference: [Apache Iceberg materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/apache-iceberg/). + +Use the **same** namespace, AWS region, and credentials here that you will pass to `main.py` so the query step can find the materialized tables. + +## Verify + +Confirm CDC is flowing before querying Iceberg: + +- In the Estuary dashboard, watch the capture and materialization metrics climb as the generator runs (it commits an insert/update/delete every second). 
+- Or tail a collection with flowctl: + + ```bash + flowctl auth login + flowctl collections read --collection <your-prefix>/transactions --uncommitted | head + ``` + + Each document includes a `_meta` object; `_meta.op` is `c` (create), `u` (update), or `d` (delete) — this is exactly what `main.py` reads. + +## Query the Iceberg lakehouse (PyIceberg / DuckDB) + +`main.py` loads the materialized `transactions` table from the AWS Glue catalog and reconstructs the current state from the CDC operations. + +Install the query dependencies and set the AWS / namespace environment variables (`main.py` reads them via `python-dotenv`, so an `.env` file also works): + +```bash +pip install -r requirements.txt + +export AWS_REGION=<your-aws-region> +export AWS_ACCESS_KEY_ID=<your-access-key-id> +export AWS_SECRET_ACCESS_KEY=<your-secret-access-key> +export NAMESPACE=<the-glue-namespace-from-the-materialization> + +python main.py +``` + +What `main.py` does: + +- Connects to the AWS Glue catalog and lists namespaces and tables. +- Loads `{NAMESPACE}.transactions` and scans it into a pandas DataFrame. +- Parses `flow_document` JSON to extract `_meta.op` for each row. +- Filters out deletes (`d`), applies updates (`u`) by keeping the last document per `transaction_id`, and aggregates total transaction amount per `user_id`. + +> Because the materialization is append-aware CDC, the raw Iceberg table contains the full change history. `main.py` shows the standard pattern for collapsing that history into the latest state — the same logic you would express in DuckDB SQL against the Iceberg tables. + +## Next steps + +- Add an Estuary [derivation](https://docs.estuary.dev/concepts/derivations/) (SQL, TypeScript, or Python) to compute the latest-state or per-user aggregates inside Estuary instead of in `main.py`. +- Point a different SQL engine at the same Iceberg tables (DuckDB's `iceberg` extension, Trino, Spark, Athena) — the lakehouse is engine-agnostic. 
+- Swap the local Postgres for a production database; the capture, collections, and materialization stay identical. + +## References -Read the blog post at: https://estuary.dev/building-streaming-lakehouse-flow-iceberg/ +- Blog: [Building a Streaming Lakehouse with Estuary and Iceberg](https://estuary.dev/building-streaming-lakehouse-flow-iceberg/) +- [Estuary documentation](https://docs.estuary.dev) +- [PostgreSQL capture connector](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/) +- [Apache Iceberg materialization connector](https://docs.estuary.dev/reference/Connectors/materialization-connectors/apache-iceberg/) +- [flowctl CLI](https://docs.estuary.dev/concepts/flowctl/) +- [PyIceberg](https://py.iceberg.apache.org/) · [DuckDB Iceberg extension](https://duckdb.org/docs/extensions/iceberg.html) From f83a25c897e86a877baaf30afd895a55058eb068 Mon Sep 17 00:00:00 2001 From: Dani Palma <dani@estuary.dev> Date: Fri, 26 Jun 2026 10:39:00 -0300 Subject: [PATCH 2/2] fix: correct latent bugs in example projects - dekaf-kcat: point consume.sh at the live Dekaf host (dekaf.fly.dev was retired) - postgres-cloudsql: add sqlalchemy and python-dotenv (imported by datagen.py) - streaming-lakehouse: add python-dotenv (imported by main.py) - shipments-stateful derivation: match the generator's "Out For Delivery" status casing - shipments_eta: pass MongoDB env vars through docker-compose instead of dead localhost/mongo defaults - shipments_eta: eta.sql uses arrayStringConcat on the flattened delays__reason column Sync the affected READMEs to the corrected behavior. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --- dekaf-kcat/README.md | 4 +-- dekaf-kcat/consume.sh | 2 +- postgres-cloudsql-simple-capture/README.md | 2 +- .../datagen/requirements.txt | 4 ++- .../customer-metrics.flow.py | 6 ++-- shipments_eta/README.md | 29 ++++--------------- shipments_eta/docker-compose.yml | 10 ++++--- shipments_eta/tinybird/eta.sql | 2 +- .../requirements.txt | 3 +- 9 files changed, 24 insertions(+), 38 deletions(-) diff --git a/dekaf-kcat/README.md b/dekaf-kcat/README.md index e5c5d26..0e45ed1 100644 --- a/dekaf-kcat/README.md +++ b/dekaf-kcat/README.md @@ -31,7 +31,7 @@ Source ──capture──▶ Estuary collection ──Dekaf (Kafka API)── ## Running it -The bundled `consume.sh` ships with the legacy bootstrap host `dekaf.fly.dev:9092`, which no longer resolves. Update that one line to Estuary's current Dekaf endpoint, `dekaf.estuary-data.com:9092`, so the script reads: +The script as committed: ```bash kcat -C \ @@ -64,7 +64,7 @@ Flags explained: | `-X sasl.username` | `{}` | Public demo placeholder (use your Dekaf task name for private collections) | | `-X sasl.password` | (empty) | Public demo placeholder (use your Estuary access token for private collections) | -> **Note on the bootstrap host:** Estuary's current Dekaf endpoint is `dekaf.estuary-data.com:9092` (with the schema registry at `https://dekaf.estuary-data.com`). The bundled `consume.sh` still references the legacy host `dekaf.fly.dev:9092`, which no longer resolves — switch that line to `dekaf.estuary-data.com:9092`. See the [Dekaf reading guide](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/) for the authoritative endpoint and connection settings. +> **Note on the bootstrap host:** the script uses Estuary's production Dekaf endpoint, `dekaf.estuary-data.com:9092` (with the schema registry at `https://dekaf.estuary-data.com`). 
A legacy host, `dekaf.fly.dev:9092`, appeared in older versions of this example but no longer resolves — use `dekaf.estuary-data.com:9092`. See the [Dekaf reading guide](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/) for the authoritative endpoint and connection settings. ## Reading your own collections diff --git a/dekaf-kcat/consume.sh b/dekaf-kcat/consume.sh index b0a4dd6..875e2e1 100644 --- a/dekaf-kcat/consume.sh +++ b/dekaf-kcat/consume.sh @@ -1,5 +1,5 @@ kcat -C \ - -b dekaf.fly.dev:9092 \ + -b dekaf.estuary-data.com:9092 \ -t demo/wikipedia/recentchange-sampled \ -X security.protocol=sasl_ssl \ -X sasl.mechanisms=PLAIN \ diff --git a/postgres-cloudsql-simple-capture/README.md b/postgres-cloudsql-simple-capture/README.md index 5d15747..16d6ffa 100644 --- a/postgres-cloudsql-simple-capture/README.md +++ b/postgres-cloudsql-simple-capture/README.md @@ -36,7 +36,7 @@ Because Estuary is fully managed, the Postgres database must be reachable from t - **`postgres/init.sql`** — runs once on first container start. Creates the `flow_capture` user with `REPLICATION` (password `password`), grants `pg_read_all_data` / `pg_write_all_data`, creates the `public.flow_watermarks` table, creates the `flow_publication` publication (with `publish_via_partition_root = true`), adds `public.flow_watermarks` and `public.sales` to it, and creates the `public.sales` table. - **`datagen/datagen.py`** — connects to a **Google Cloud SQL** instance using the [Cloud SQL Python Connector](https://github.com/GoogleCloudPlatform/cloud-sql-python-connector) (`pg8000` driver), creates the `sales` table if needed, and loops once per second performing weighted random operations: 70% inserts, 20% deletes, 10% updates. This generates the CDC event stream the Estuary capture consumes. - **`datagen/Dockerfile`** — `python:3.12` image that installs requirements and runs `python -u datagen.py`. 
-- **`datagen/requirements.txt`** — `Faker==25.1.0` and `cloud-sql-python-connector[pg8000]`. Note: `datagen.py` also imports `sqlalchemy` and `python-dotenv`, which aren't pinned here — install them too (`pip install SQLAlchemy python-dotenv`) or add them to this file before running the generator outside Docker. +- **`datagen/requirements.txt`** — `Faker==25.1.0`, `cloud-sql-python-connector[pg8000]`, `SQLAlchemy`, and `python-dotenv`. ### `sales` table schema diff --git a/postgres-cloudsql-simple-capture/datagen/requirements.txt b/postgres-cloudsql-simple-capture/datagen/requirements.txt index d79db43..03d128f 100644 --- a/postgres-cloudsql-simple-capture/datagen/requirements.txt +++ b/postgres-cloudsql-simple-capture/datagen/requirements.txt @@ -1,2 +1,4 @@ Faker==25.1.0 -cloud-sql-python-connector[pg8000] \ No newline at end of file +cloud-sql-python-connector[pg8000] +SQLAlchemy +python-dotenv \ No newline at end of file diff --git a/python-derivations/shipments-stateful/customer-metrics.flow.py b/python-derivations/shipments-stateful/customer-metrics.flow.py index bd8d293..74aee30 100644 --- a/python-derivations/shipments-stateful/customer-metrics.flow.py +++ b/python-derivations/shipments-stateful/customer-metrics.flow.py @@ -166,7 +166,7 @@ def _handle_deletion( # Decrement the appropriate counters based on what we knew customer.total_shipments = max(0, customer.total_shipments - 1) - if previous_status in ('In Transit', 'At Checkpoint', 'Out for Delivery'): + if previous_status in ('In Transit', 'At Checkpoint', 'Out For Delivery'): customer.active_shipments = max(0, customer.active_shipments - 1) elif previous_status == 'Delivered': customer.delivered_count = max(0, customer.delivered_count - 1) @@ -193,10 +193,10 @@ def _process_shipment( # Handle status transitions. The key insight is that we need to # properly account for a shipment moving between states. 
is_active_status = current_status in ( - 'In Transit', 'At Checkpoint', 'Out for Delivery' + 'In Transit', 'At Checkpoint', 'Out For Delivery' ) was_active_status = previous_status in ( - 'In Transit', 'At Checkpoint', 'Out for Delivery' + 'In Transit', 'At Checkpoint', 'Out For Delivery' ) if previous_status else False # Update active shipment count based on status transition diff --git a/shipments_eta/README.md b/shipments_eta/README.md index 05b5070..f865501 100644 --- a/shipments_eta/README.md +++ b/shipments_eta/README.md @@ -52,34 +52,15 @@ Data flow in Estuary terms: ## Step 1 — Generate the MongoDB CDC workload -`datagen/datagen.py` reads its MongoDB connection from environment variables (`MONGODB_USER`, `MONGODB_PASSWORD`, `MONGODB_HOST`) and builds a `mongodb+srv://` connection string. Point it at your MongoDB (Atlas recommended). - -The bundled `docker-compose.yml` hardcodes placeholder values in its `environment:` block (`MONGODB_HOST: "localhost"`, `MONGODB_USER: "mongo"`, `MONGODB_PASSWORD: "mongo"`, plus an unused `MONGODB_PORT`) that will **not** reach an Atlas cluster. Edit those values to your MongoDB credentials/host first: - -```yaml -# docker-compose.yml - environment: - MONGODB_HOST: "cluster0.xxxxx.mongodb.net" # your Atlas cluster host - MONGODB_USER: "<your-mongodb-user>" - MONGODB_PASSWORD: "<your-mongodb-password>" -``` - -Then build and run the generator: +`datagen/datagen.py` reads its MongoDB connection from environment variables (`MONGODB_USER`, `MONGODB_PASSWORD`, `MONGODB_HOST`) and builds a `mongodb+srv://` connection string. The bundled `docker-compose.yml` passes these through from your shell (`${VAR:-default}` substitution), so export them and start the generator: ```bash # From the shipments_eta/ directory. 
-docker compose up --build -``` - -Alternatively, run it without Docker — `datagen.py` picks up the exported env vars directly: - -```bash -cd datagen -pip install -r requirements.txt export MONGODB_USER=<your-mongodb-user> export MONGODB_PASSWORD=<your-mongodb-password> export MONGODB_HOST=<your-cluster-host> # e.g. cluster0.xxxxx.mongodb.net -python -u datagen.py + +docker compose up --build ``` `datagen.py` builds the connection string as: @@ -101,7 +82,7 @@ Watch the logs to confirm inserts/updates are happening: docker compose logs -f datagen ``` -> Note: the bundled `docker-compose.yml` defines only the generator and ships with placeholder `environment` values (`localhost` / `mongo` / `mongo`) plus an unused `MONGODB_PORT`. On the Docker path, Compose's `environment` values override your shell, so edit them in the file as shown above rather than relying on `export`. Because `datagen.py` connects with `mongodb+srv://`, the port is ignored. If you run MongoDB locally instead of Atlas, expose it to Estuary's managed connectors with a publicly reachable host or an ngrok TCP tunnel (`ngrok tcp 27017`). +> Note: the bundled `docker-compose.yml` defines only the generator and passes `MONGODB_HOST`, `MONGODB_USER`, and `MONGODB_PASSWORD` through from your shell via `${VAR:-default}` substitution, so the `export`s above take effect. The defaults are non-functional placeholders — set all three. Because `datagen.py` connects with `mongodb+srv://`, no port is needed. If you run MongoDB locally instead of Atlas, expose it to Estuary's managed connectors with a publicly reachable host or an ngrok TCP tunnel (`ngrok tcp 27017`). ## Step 2 — Configure the Estuary MongoDB capture @@ -166,7 +147,7 @@ The transformations run as ClickHouse SQL inside Tinybird: - `route_performance` — congestion insights aggregated per origin–destination route. - `shipment_status_distribution` — status counts per route. - `top_delayed_customers` — customers with the most cumulative delay minutes. 
-- **`eta.sql`** is a standalone query showing the core ETA recompute: join `shipments` to `traffic_weather` on `route_id` for `status = 'In Transit'` and return original vs. updated ETA with delay reasons. Note: as committed it references `arrayJoin(s.delays_reason, ', ')`, but the Tinybird Data Source flattens that field to the `Array(String)` column `delays__reason` (double underscore) — to run it against the Data Source, use `arrayStringConcat(s.delays__reason, ', ')`.
+- **`eta.sql`** is a standalone query showing the core ETA recompute: join `shipments` to `traffic_weather` on `route_id` for `status = 'In Transit'` and return original vs. updated ETA with delay reasons. It uses `arrayStringConcat(s.delays__reason, ', ')` to match the flattened `Array(String)` column `delays__reason` in `shipments.datasource`.
 
 Publish the relevant Pipe nodes as API endpoints (`Shipping.json`, `route_perf.json`, `route_staus_stats.json`) so the dashboard can query them.
 
diff --git a/shipments_eta/docker-compose.yml b/shipments_eta/docker-compose.yml
index 2c24144..72fb518 100644
--- a/shipments_eta/docker-compose.yml
+++ b/shipments_eta/docker-compose.yml
@@ -5,7 +5,9 @@ services:
     restart: unless-stopped
     environment:
-      MONGODB_HOST: "localhost"
-      MONGODB_PORT: "27017"
-      MONGODB_USER: "mongo"
-      MONGODB_PASSWORD: "mongo"
+      # datagen.py builds a mongodb+srv:// connection string (no port needed).
+      # Export MONGODB_HOST / MONGODB_USER / MONGODB_PASSWORD before `docker compose up`;
+      # the placeholder defaults below are non-functional and must be overridden.
+      MONGODB_HOST: "${MONGODB_HOST:-your-cluster.mongodb.net}"
+      MONGODB_USER: "${MONGODB_USER:-your-user}"
+      MONGODB_PASSWORD: "${MONGODB_PASSWORD:-your-password}"
diff --git a/shipments_eta/tinybird/eta.sql b/shipments_eta/tinybird/eta.sql
index 5c89389..f4324ee 100644
--- a/shipments_eta/tinybird/eta.sql
+++ b/shipments_eta/tinybird/eta.sql
@@ -4,7 +4,7 @@ SELECT
     s.current_location_longitude,
     s.expected_delivery_date AS original_eta,
     s.expected_delivery_date + INTERVAL t.impact_on_ETA_minutes MINUTE AS updated_eta,
-    arrayJoin(s.delays_reason, ', ') AS delay_reasons
+    arrayStringConcat(s.delays__reason, ', ') AS delay_reasons
 FROM
     shipments AS s
 LEFT JOIN
diff --git a/streaming-lakehouse-iceberg-duckdb/requirements.txt b/streaming-lakehouse-iceberg-duckdb/requirements.txt
index 61cc59c..0224f9b 100644
--- a/streaming-lakehouse-iceberg-duckdb/requirements.txt
+++ b/streaming-lakehouse-iceberg-duckdb/requirements.txt
@@ -1,3 +1,4 @@
 pyiceberg==0.6.1
 boto3==1.34.134
-pandas
\ No newline at end of file
+pandas
+python-dotenv
\ No newline at end of file