Airbyte Worker Toolkit

Analyze, estimate, and optimize Airbyte Data Worker usage for customers and prospects.

What It Does

The Worker Toolkit helps Account Executives (AEs) and Solutions Engineers (SEs) answer one question: "How many Data Workers does this customer need?"

It provides four workflows:

Mode 1 -- Analyze current customer usage
  Input:  Org ID or Workspace ID
  Output: P99 workers, hourly heatmap, connection inventory

Mode 2 -- Estimate prospect needs
  Input:  Questionnaire or export
  Output: Worker estimate with ranges

Mode 3 -- Optimize schedules
  Input:  Org ID + contracted workers
  Output: Cron reschedule recommendations to reduce peak

Mode 4 -- Generate PDF reports
  Input:  Customer name or org ID
  Output: Branded PDF (scheduling optimization or capacity analysis)

Worker Model

All plans use the same universal Data Worker model:

Workers = (Concurrent API connections / 5) + (Concurrent DB connections / 2)
  • API connectors (Salesforce, GitHub, Jira, etc.): 5 per worker
  • DB/File connectors (Postgres, S3, Snowflake, etc.): 2 per worker
  • Capacity is combinable -- API and DB share the same workers
  • Billing uses P99 (99th percentile) of hourly peak workers
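
As a minimal illustration of this formula (not the toolkit's worker_calculator module; the function and constant names below are hypothetical):

# Sketch only: worker demand for a given moment of concurrency.
import math

API_PER_WORKER = 5   # API connectors per worker (Salesforce, GitHub, Jira, ...)
DB_PER_WORKER = 2    # DB/File connectors per worker (Postgres, S3, Snowflake, ...)

def workers_needed(concurrent_api: int, concurrent_db: int) -> float:
    """Fractional worker demand when these syncs run at the same time."""
    return concurrent_api / API_PER_WORKER + concurrent_db / DB_PER_WORKER

demand = workers_needed(concurrent_api=7, concurrent_db=3)   # 7/5 + 3/2 = 2.9
print(f"{demand:.2f} workers ({math.ceil(demand)} to cover the peak)")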

Installation

Prerequisites

  • Python 3.11+
  • Claude Code CLI with MCP tools configured:
    • PyAirbyte MCP (for Airbyte Cloud API access)
    • Metabase MCP (for billing meter data -- optional but recommended)
  • Airbyte Cloud API credentials (AIRBYTE_CLOUD_CLIENT_ID, AIRBYTE_CLOUD_CLIENT_SECRET)

Setup

# Navigate to the worker-toolkit directory
cd worker-toolkit

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your Airbyte Cloud API credentials

Environment Variables

Create a .env file with:

AIRBYTE_CLOUD_CLIENT_ID=your_client_id
AIRBYTE_CLOUD_CLIENT_SECRET=your_client_secret
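
For illustration, a script could pick these credentials up with python-dotenv (a sketch; the toolkit's own loading code may differ):

# Sketch only: load credentials from .env into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

client_id = os.environ["AIRBYTE_CLOUD_CLIENT_ID"]
client_secret = os.environ["AIRBYTE_CLOUD_CLIENT_SECRET"]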

Usage

With Claude Code (Recommended)

The primary interface is the /workers slash command in Claude Code:

/workers

This activates the Worker Toolkit agent, which guides you through all four modes interactively, using MCP tools to fetch data from Airbyte Cloud and Metabase.

Claude Code Setup

  1. Copy the command file to your Claude Code project:

    mkdir -p .claude/commands
    cp <path-to-worker-toolkit>/.claude/commands/workers.md .claude/commands/
  2. Ensure your Claude Code has MCP servers configured:

    • PyAirbyte MCP -- for mcp__pyairbyte__* tools
    • Metabase MCP -- for mcp__metabase__execute_query (database_id=2, BigQuery)
  3. Run /workers in Claude Code and follow the prompts.

With Interactive CLI

source .venv/bin/activate
python3 worker_optimizer.py

This launches an interactive menu with arrow-key navigation for all four modes.

With CLI Flags (Mode 4 Reports)

# Generate report for one customer
source .env && python3 generate_scheduling_reports.py --use-airbyte-api --customers "CustomerName"

# Generate reports for all over-utilized customers
source .env && python3 generate_scheduling_reports.py --use-airbyte-api --over-utilized-only

# List all customers with utilization status
python3 generate_scheduling_reports.py --dry-run

Modes in Detail

Mode 1: Calculate Current Usage

Analyzes actual worker usage for an existing customer.

Data Sources (priority order):

  1. Metabase billing meter (authoritative) -- queries the organization_data_worker_usage_daily and workspace_data_worker_usage_hourly tables
  2. Airbyte Cloud API (fallback) -- fetches job history and calculates concurrency overlap

What it produces:

  • P99 worker usage (the billing metric)
  • 30-day daily worker trend
  • Hourly peak/quiet hour pattern
  • Connection inventory with API/DB classification
  • Utilization percentage vs contracted workers
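
To make the API fallback concrete, here is a rough sketch of the minute-by-minute overlap analysis and the P99 roll-up (illustration only; the actual job_overlap_analyzer.py and worker_calculator.py differ in detail):

# Sketch only: per-minute concurrency -> hourly peaks -> P99.
from collections import defaultdict
from datetime import datetime, timedelta

API_PER_WORKER, DB_PER_WORKER = 5, 2

def p99_workers(jobs):
    """jobs: list of (start, end, kind) tuples, where kind is 'api' or 'db'."""
    # 1. Count concurrent jobs per minute, split by connector kind.
    per_minute = defaultdict(lambda: {"api": 0, "db": 0})
    for start, end, kind in jobs:
        t = start.replace(second=0, microsecond=0)
        while t < end:
            per_minute[t][kind] += 1
            t += timedelta(minutes=1)

    # 2. Convert concurrency to worker demand; keep the peak of each hour.
    hourly_peaks = defaultdict(float)
    for minute, counts in per_minute.items():
        demand = counts["api"] / API_PER_WORKER + counts["db"] / DB_PER_WORKER
        hour = minute.replace(minute=0)
        hourly_peaks[hour] = max(hourly_peaks[hour], demand)

    # 3. P99 of the hourly peaks approximates the billing metric.
    peaks = sorted(hourly_peaks.values())
    return peaks[min(len(peaks) - 1, int(0.99 * len(peaks)))]

jobs = [
    (datetime(2024, 1, 1, 15, 0), datetime(2024, 1, 1, 15, 45), "api"),
    (datetime(2024, 1, 1, 15, 10), datetime(2024, 1, 1, 16, 5), "db"),
]
print(p99_workers(jobs))   # 0.7 -> 1/5 + 1/2 during the overlap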

Mode 2: Estimate Prospect Needs

Three sub-modes for different prospect scenarios:

  • 2A: Questionnaire -- for a new prospect with no Airbyte usage; accuracy: estimate (statistical model)
  • 2B: Cloud/OSS Export -- for a prospect with an existing Airbyte deployment; accuracy: high (actual job data)
  • 2C: Quick Estimate -- for ballpark pricing discussions; accuracy: low-medium (ranges)
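
As an illustration of a quick ranged estimate in the spirit of Mode 2C, here is a hypothetical sketch (the concurrency fractions are assumptions, not the toolkit's statistical model):

# Sketch only: ballpark worker range from connection counts.
API_PER_WORKER, DB_PER_WORKER = 5, 2

def quick_estimate(api_connections: int, db_connections: int,
                   low_concurrency: float = 0.3, high_concurrency: float = 0.7):
    """(low, high) worker range, assuming a fraction of connections
    sync simultaneously during the busiest hour."""
    low = (api_connections * low_concurrency / API_PER_WORKER
           + db_connections * low_concurrency / DB_PER_WORKER)
    high = (api_connections * high_concurrency / API_PER_WORKER
            + db_connections * high_concurrency / DB_PER_WORKER)
    return round(low, 1), round(high, 1)

print(quick_estimate(api_connections=20, db_connections=10))   # (2.7, 6.3)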

Mode 3: Optimize Worker Usage

Finds the minimum set of schedule changes needed to reduce peak worker usage.

How it works:

  1. Fetches Metabase billing data for ground truth peak hours
  2. Fetches connection details and job durations from Airbyte API
  3. Runs a greedy minimum-move optimization algorithm
  4. Outputs specific Quartz cron expressions for rescheduled connections
  5. Shows before/after comparison

Example output:

3 moves needed to reduce peak from 1.50 to 0.90 workers:
  1. demandbase_jira     15:00 -> 00:00  (0 0 0 * * ?)  8h52m
  2. shield_hubspot_OP   15:00 -> 00:00  (0 0 0 * * ?)  1h32m
  3. shield_hubspot_NYC  15:00 -> 00:00  (0 0 0 * * ?)  1h31m
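
The real optimizer weighs job durations and Metabase peak-hour data; as a rough illustration of the greedy minimum-move idea and the Quartz cron output, here is a simplified sketch (the hour-bucket model and function names are hypothetical, not the actual optimize_worker_usage_fast.py or cron_generator.py):

# Sketch only: move connections out of the peak hour until the target is met.
API_PER_WORKER, DB_PER_WORKER = 5, 2
PER_WORKER = {"api": API_PER_WORKER, "db": DB_PER_WORKER}

def hourly_demand(connections):
    """connections: {name: {"hour": 0-23, "kind": "api"|"db"}} -> demand per hour."""
    demand = [0.0] * 24
    for conn in connections.values():
        demand[conn["hour"]] += 1 / PER_WORKER[conn["kind"]]
    return demand

def quartz_cron(hour: int) -> str:
    """Daily Quartz cron at the top of the given hour, e.g. '0 0 0 * * ?'."""
    return f"0 0 {hour} * * ?"

def greedy_reschedule(connections, target_peak):
    """Move one connection at a time from the peak hour to the quietest hour
    until the peak drops to the target (or no move helps)."""
    moves = []
    while True:
        demand = hourly_demand(connections)
        peak_hour = max(range(24), key=lambda h: demand[h])
        if demand[peak_hour] <= target_peak:
            break
        quiet_hour = min(range(24), key=lambda h: demand[h])
        candidates = [n for n, c in connections.items() if c["hour"] == peak_hour]
        # Move the connection that frees the most capacity in the peak hour.
        name = max(candidates, key=lambda n: 1 / PER_WORKER[connections[n]["kind"]])
        freed = 1 / PER_WORKER[connections[name]["kind"]]
        if demand[quiet_hour] + freed >= demand[peak_hour]:
            break  # no remaining move can lower the peak
        connections[name]["hour"] = quiet_hour
        moves.append((name, peak_hour, quiet_hour, quartz_cron(quiet_hour)))
    return moves

Each returned move pairs a connection with its new hour and a daily Quartz cron string, matching the shape of the example output above.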

Mode 4: Generate PDF Reports

Produces branded Airbyte PDF reports for customers and auto-selects the report type based on utilization:

  • >= 85% utilization -- Scheduling Optimization report: peak hours, connections to reschedule, cron recommendations
  • < 85% utilization -- Capacity Analysis report: headroom, growth potential, expansion recommendations

Sub-modes:

  • 4A: Single customer report (fuzzy name search)
  • 4B: Batch reports (filter by utilization)
  • 4C: List available customers
  • 4D: Non-Metabase customers (Cloud Sales Assist, Free tier, trials)

Project Structure

worker-toolkit/
├── worker_optimizer.py              # Interactive CLI menu (all modes)
├── generate_scheduling_reports.py   # Mode 4 report generator (CLI)
├── requirements.txt                 # Python dependencies
├── .claude/commands/workers.md      # Claude Code /workers command
├── src/                             # Core modules
│   ├── config.py                    # Worker model config (API/5 + DB/2)
│   ├── worker_calculator.py         # Worker calculation engine
│   ├── connector_classifier.py      # API vs DB connector classification
│   ├── job_overlap_analyzer.py      # Minute-by-minute concurrency analysis
│   ├── metabase_billing_data.py     # Metabase billing meter queries
│   ├── metabase_worker_data.py      # Metabase node_calculator queries
│   ├── airbyte_cloud_data.py        # Org mappings + API integration
│   ├── scheduling_optimization_report.py  # PDF report (over-utilized)
│   ├── underutilization_report.py   # PDF report (under-utilized)
│   ├── non_metabase_customer_report.py    # PDF report (non-Metabase)
│   ├── optimize_worker_usage_fast.py      # Fast optimization algorithm
│   ├── cron_generator.py            # Quartz cron expression utilities
│   └── ...                          # Additional utilities
├── mode2/                           # Prospect estimation scripts
│   ├── mode2_prospect_questionnaire.py    # Mode 2A
│   ├── mode2_analyze_oss_export.py        # Mode 2B
│   ├── mode2_custom_estimate.py           # Mode 2C
│   └── mode2_oss_export_script.py         # Export script for prospects
├── tests/                           # Test suite
└── examples/                        # Example reports and templates

Key Metabase Tables

When the Metabase MCP is available, the toolkit queries these BigQuery tables (database_id=2):

  • airbyte_warehouse.organization_data_worker_usage_daily -- daily peak workers (P99 billing)
  • airbyte_warehouse.workspace_data_worker_usage_hourly -- hourly worker patterns (peak hours)
  • airbyte_warehouse.account -- account metadata (name, ARR, owner)
  • airbyte_warehouse.organization -- org metadata (plan, connections, lifecycle)
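
For reference, a hedged sketch of the kind of query that might be sent through mcp__metabase__execute_query; the table name comes from the list above, but the column names are assumptions and may not match the actual warehouse schema:

# Sketch only: 30-day daily worker usage for one organization.
P99_QUERY = """
SELECT
  usage_date,                       -- assumed column name
  peak_data_workers                 -- assumed column name
FROM airbyte_warehouse.organization_data_worker_usage_daily
WHERE organization_id = @org_id     -- assumed column name
  AND usage_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY usage_date
"""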

Contributing

  1. All plans use the same worker model -- do not add plan-specific logic
  2. Metabase billing data is the authoritative source for worker usage
  3. Use job overlap analysis (not connection counts) for concurrency calculations
  4. Customer-specific scripts go in FinalReports/{CustomerName}/, not in src/
