feat(amp-worker-gc): standalone GC job for control-plane scheduling #1989

Open
mitchhs12 wants to merge 10 commits into main from mitchhs12/gc-job-extraction
mitchhs12 commented Mar 17, 2026

Summary

Extracts garbage collection from the worker's compaction task into a standalone job type managed by the controller via the job ledger. This decouples GC from compaction, so GC can run whether or not materialization jobs are active.

Changes

  • New crate amp-worker-gc: Job descriptor, idempotency key, error types, OTel metrics, and collection algorithm (stream expired files → delete metadata → delete physical files)
  • Worker dispatch: New Gc variant in JobDescriptor enum with execution wiring in job_impl.rs
  • Controller scheduling: Background task that periodically scans active physical table revisions and schedules GC jobs via the job ledger, respecting per-location last_success_at interval checks
  • Config toggle: GcSchedulingConfig with enabled (default false) and interval (default 60s) — GC scheduling is off by default for safe rollout
  • Integration tests: 3 tests verifying the full GC pipeline against real Postgres + filesystem
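
The three-phase collection order listed above (stream expired files → delete metadata → delete physical files) can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in amp-worker-gc: `Store`, `GcReport`, and `collect` are hypothetical names, the metadata database and object store are replaced by plain collections, and the report fields only borrow the counter names mentioned in the commit messages.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for the metadata DB and the file store.
pub struct Store {
    /// Metadata rows: file id -> expiry timestamp (seconds). Assumed shape.
    pub metadata: BTreeMap<u64, u64>,
    /// Ids of files that physically exist.
    pub files: BTreeSet<u64>,
}

/// Per-run tallies, named after the counters listed in the PR.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct GcReport {
    pub expired_files_found: u64,
    pub metadata_entries_deleted: u64,
    pub files_deleted: u64,
    pub files_not_found: u64,
}

/// Runs one collection pass for everything that expired at or before `now`.
pub fn collect(store: &mut Store, now: u64) -> GcReport {
    let mut report = GcReport::default();

    // Phase 1: gather the ids of expired files.
    let expired: Vec<u64> = store
        .metadata
        .iter()
        .filter(|(_, expires_at)| **expires_at <= now)
        .map(|(id, _)| *id)
        .collect();
    report.expired_files_found = expired.len() as u64;

    // Phase 2: delete metadata first, so nothing can still resolve a file
    // that is about to disappear (the rationale is an assumption here).
    for id in &expired {
        if store.metadata.remove(id).is_some() {
            report.metadata_entries_deleted += 1;
        }
    }

    // Phase 3: delete the physical files; an already-missing file is
    // counted rather than treated as an error.
    for id in &expired {
        if store.files.remove(id) {
            report.files_deleted += 1;
        } else {
            report.files_not_found += 1;
        }
    }

    report
}
```

In this sketch a second pass over the same store finds nothing expired and returns an all-zero report, which is the property that makes re-running a GC job harmless.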

mitchhs12 force-pushed the mitchhs12/gc-job-extraction branch from c571586 to 74eb098 March 18, 2026 15:13
mitchhs12 marked this pull request as ready for review March 18, 2026 16:13
…eduling

Extract garbage collection from the worker's compaction task into a standalone
job type managed by the controller via the job ledger. This decouples GC from
compaction so it can run independently regardless of whether materialization
jobs are active.

New crate: amp-worker-gc with job descriptor, idempotency key, and collection
algorithm. Controller schedules GC jobs per active physical table revision on
a 60s interval. Workers execute them using the same stream-expired → delete-
metadata → delete-files algorithm as the existing Collector.
Add GcSchedulingConfig with `enabled` (default false) and `interval`
(default 60s) fields so GC scheduling can be toggled without code changes.
The controller only spawns the GC scheduling task when enabled, preventing
unintended GC job creation on deployment.
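
A minimal sketch of how such a toggle might look, assuming the two fields are a plain `bool` and a `std::time::Duration` (the real struct may well use a different duration representation or deserialization attributes). `gc_task_interval` is a hypothetical helper that only illustrates the "spawn when enabled" gate.

```rust
use std::time::Duration;

/// Sketch of the config described above; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GcSchedulingConfig {
    pub enabled: bool,
    pub interval: Duration,
}

impl Default for GcSchedulingConfig {
    fn default() -> Self {
        Self {
            // Off by default, so deploying the controller schedules nothing.
            enabled: false,
            interval: Duration::from_secs(60),
        }
    }
}

/// Hypothetical gate: returns the tick interval when the scheduling task
/// should be spawned, and `None` when GC scheduling is disabled.
pub fn gc_task_interval(cfg: &GcSchedulingConfig) -> Option<Duration> {
    cfg.enabled.then_some(cfg.interval)
}
```

A caller would spawn the background task only when this returns `Some`, which keeps the disabled path free of side effects.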
…egration tests

- Add GcMetrics (expired_files_found, metadata_entries_deleted, files_deleted,
  files_not_found) with OpenTelemetry counters keyed by location_id
- Add last_success_at check in schedule_gc_jobs() to respect the configured
  interval between GC runs per location (RFC compliance)
- Add 3 integration tests in tests/src/tests/it_gc.rs verifying the full
  collection algorithm against real Postgres + filesystem
- Update config.sample.toml with [gc_scheduling] section
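
The last_success_at check could be expressed roughly as below. This is a sketch under assumptions: `gc_is_due` is a hypothetical name, timestamps are modelled as `std::time::SystemTime`, and the handling of a timestamp that lies in the future (treated as "not due yet") is a choice made for the example, not something the PR states.

```rust
use std::time::{Duration, SystemTime};

/// Returns true when a location should get a new GC job.
///
/// - No previous success: due immediately.
/// - Previous success at least `interval` ago: due.
/// - Previous success later than `now` (clock skew): not due.
pub fn gc_is_due(
    last_success_at: Option<SystemTime>,
    now: SystemTime,
    interval: Duration,
) -> bool {
    match last_success_at {
        None => true,
        Some(last) => match now.duration_since(last) {
            Ok(elapsed) => elapsed >= interval,
            // `duration_since` errors when `last` is after `now`.
            Err(_) => false,
        },
    }
}
```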
The workspace_crates_match_amp_crates_list test validates that the
hardcoded AMP_CRATES list matches actual workspace members. Adding the
new amp-worker-gc crate to the workspace requires updating this list.
…e_ref tracing

- Add GcSchedulerMetrics to controller scheduler with gc_jobs_dispatched_total,
  gc_jobs_skipped_in_flight_total, and gc_jobs_skipped_too_recent_total counters
- Add table_ref field to GC job execution tracing span, recorded from the
  revision path after lookup
- Pass Meter to Scheduler for metrics initialization
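
One way to picture how the three counters relate to scheduling outcomes is sketched below. Plain `u64` fields stand in for the OpenTelemetry counters, `GcDecision`, `GcSchedulerCounters`, and `decide` are hypothetical names, and the precedence (in-flight checked before too-recent) is an assumption.

```rust
/// Outcome of considering one location for a GC job.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GcDecision {
    Dispatch,
    SkipInFlight,
    SkipTooRecent,
}

/// Stand-in for the scheduler's counters; field names follow the PR.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct GcSchedulerCounters {
    pub gc_jobs_dispatched_total: u64,
    pub gc_jobs_skipped_in_flight_total: u64,
    pub gc_jobs_skipped_too_recent_total: u64,
}

impl GcSchedulerCounters {
    /// Increments exactly one counter per decision.
    pub fn record(&mut self, decision: GcDecision) {
        match decision {
            GcDecision::Dispatch => self.gc_jobs_dispatched_total += 1,
            GcDecision::SkipInFlight => self.gc_jobs_skipped_in_flight_total += 1,
            GcDecision::SkipTooRecent => self.gc_jobs_skipped_too_recent_total += 1,
        }
    }
}

/// Classifies a location. Assumed precedence: an in-flight job wins over
/// the cooldown check.
pub fn decide(in_flight: bool, too_recent: bool) -> GcDecision {
    if in_flight {
        GcDecision::SkipInFlight
    } else if too_recent {
        GcDecision::SkipTooRecent
    } else {
        GcDecision::Dispatch
    }
}
```

Because every decision increments exactly one counter, the three totals in this sketch always sum to the number of locations considered.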
… entries

Instead of paginating through all active revisions and scheduling no-op
GC jobs, query gc_manifest for distinct location_ids with expired entries.
This reduces unnecessary job ledger writes and worker notifications.

Also removes the active-only filter so deactivated revisions are GC-eligible.
GC is only concerned with the state of the gc_manifest, not the revision's
active status.
… entries

Instead of paginating through all active revisions and scheduling no-op
GC jobs, query gc_manifest for distinct location_ids with expired entries.
This reduces unnecessary job ledger writes and worker notifications.

Also removes the active-only filter so deactivated revisions are GC-eligible.
GC is only concerned with the state of the gc_manifest, not the revision's
active status.

Add gc::locations_with_expired_entries() query to metadata-db and 3 integration
tests verifying pre-filter behavior, empty results, and deactivated revisions.
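
The pre-filter amounts to "distinct location_ids that have at least one expired manifest entry". The sketch below shows that selection over an in-memory slice. It is not the metadata-db query itself: `ManifestEntry`, its fields, and `expired_location_ids` are hypothetical, and expiry is modelled as a plain integer timestamp.

```rust
use std::collections::BTreeSet;

/// Hypothetical shape of one gc_manifest row.
pub struct ManifestEntry {
    pub location_id: i64,
    pub expires_at: u64,
}

/// Returns the distinct locations with at least one entry that expired at
/// or before `now`. Note there is no filter on whether a revision is
/// active, mirroring the change described above.
pub fn expired_location_ids(manifest: &[ManifestEntry], now: u64) -> BTreeSet<i64> {
    manifest
        .iter()
        .filter(|entry| entry.expires_at <= now)
        .map(|entry| entry.location_id)
        .collect()
}
```

Only the locations returned here would receive a GC job, so locations with nothing to collect cause no ledger writes or worker notifications.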
…ests

Replace raw pgtemp usage with the testlib MetadataDbFixture for consistency
with other integration tests. Extract shared setup into a GcTestCtx wrapper
struct with helper methods (register_revision, register_file, file_ids,
gc_context, data_store). Remove pgtemp direct dependency from tests crate.
mitchhs12 force-pushed the mitchhs12/gc-job-extraction branch from c95850e to ec68363 March 20, 2026 16:21
Add two integration tests covering the scheduler's in-flight skip
and recently-completed cooldown logic in schedule_gc_jobs(), plus
helper methods on GcTestCtx to reduce boilerplate.
mitchhs12 requested a review from JohnSwan1503 March 20, 2026 19:29