Default publication calibration to A100 #1068

Open
MaxGhenis wants to merge 1 commit into main from codex/default-publication-a100

Conversation

@MaxGhenis
Contributor

Summary

  • Default full publication calibration fits to A100-40GB instead of T4 for both regional and national weights.
  • Update the workflow-dispatch fallback and Makefile defaults to match.
  • Add a source-level contract test so the production default does not drift back to T4.
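
A source-level contract test of the kind described in the last bullet can be sketched as follows. This is a minimal illustration, not the test added in this PR: the parameter names `regional_gpu` and `national_gpu`, the `SAMPLE_SOURCE` excerpt, and the `gpu_defaults` helper are all hypothetical, since the PR description does not show the contents of modal_app/pipeline.py or the test file.

```python
import re

# Hypothetical excerpt standing in for modal_app/pipeline.py. The real
# parameter names and layout are not shown in the PR description.
SAMPLE_SOURCE = '''
def run_pipeline(
    regional_gpu: str = "A100-40GB",
    national_gpu: str = "A100-40GB",
):
    ...
'''

# Matches annotated string parameters whose name contains "gpu",
# capturing the parameter name and its default value.
GPU_DEFAULT = re.compile(r'(\w*gpu\w*)\s*:\s*str\s*=\s*"([^"]+)"')


def gpu_defaults(source: str) -> dict[str, str]:
    """Map each gpu-like string parameter in the source to its default."""
    return dict(GPU_DEFAULT.findall(source))


def test_publication_defaults_use_a100() -> None:
    defaults = gpu_defaults(SAMPLE_SOURCE)
    # Guard against the pattern silently matching nothing.
    assert defaults, "expected at least one GPU default in the source"
    for name, value in defaults.items():
        assert value == "A100-40GB", f"{name} drifted to {value}"
```

In a real repository the test would read the pipeline file from disk instead of using an inline sample. Checking the source text rather than importing the module keeps the test cheap and avoids pulling in GPU or cloud dependencies.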

Why

The current 2024 publication package has 20,680 targets × 5,159,570 records. Both T4 fits failed before epoch 1 with CUDA out-of-memory (OOM) errors:

  • regional T4 call fc-01KS2TFYAN0VG2985M3ED4WRNH
  • national T4 call fc-01KS2TFYDDSZKRQFWJ36EPCEKQ
  • failure: tried to allocate another 3.65 GiB with only 3.33 GiB free on a 14.56 GiB T4

The A100-40GB rerun from the same saved package passed that point and reached epoch 100+ for both fits.
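
The numbers in the failure message can be worked through as a rough check. The sketch below uses only the three figures reported above. The function name `oom_shortfall` is illustrative, and treating "total minus free" as memory held by the fit is an assumption, since allocator caching and other processes also count toward it.

```python
def oom_shortfall(total_gib: float, free_gib: float, requested_gib: float) -> tuple[float, float]:
    """Return (memory needed at the failing allocation, shortfall vs. capacity)."""
    in_use = total_gib - free_gib          # memory already occupied on the device
    needed = in_use + requested_gib        # footprint if the allocation had succeeded
    return needed, needed - total_gib      # a positive shortfall means OOM


# Figures from the T4 failure message.
needed, shortfall = oom_shortfall(total_gib=14.56, free_gib=3.33, requested_gib=3.65)
print(f"needed at failure point: {needed:.2f} GiB")   # about 14.88 GiB
print(f"shortfall on the T4:     {shortfall:.2f} GiB")  # about 0.32 GiB
```

This only bounds the footprint at the point of failure. Later allocations in the fit may need more, so it does not by itself establish the peak requirement; the A100-40GB rerun reaching epoch 100+ is the evidence that 40 GB is sufficient.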

Local checks

uv run pytest tests/unit/test_pipeline_source_contracts.py -q
uv run ruff check modal_app/pipeline.py tests/unit/test_pipeline_source_contracts.py
uv run ruff format --check modal_app/pipeline.py tests/unit/test_pipeline_source_contracts.py
git diff --check upstream/main..HEAD

MaxGhenis force-pushed the codex/default-publication-a100 branch from 38363ae to 3d7fefd on May 20, 2026 at 14:08