v0.1.0: monorepo merge — unified training + inference, canonical model, archive-free inference, vendored baselines#41

Draft
xuefei-wang wants to merge 269 commits into vanvalenlab:master from xuefei-wang:master

@xuefei-wang xuefei-wang commented May 30, 2026

Summary

Merges the separate training repository (deepcelltypes-cell-type-assignment-pytorch)
into this repo and replaces the legacy CellTypeCLIPModel inference path with the
current canonical model. This is the v0.1.0 release cut.

Before this PR, vanvalenlab/deepcell-types was inference-only — it shipped
CellTypeCLIPModel, the dct_kit/ helpers, and a top-level __init__ that
exported just predict. After it, a single package covers training and
inference: inference stays a plain pip install deepcell-types, the full
training pipeline lives behind a [train] extra, and the four paper comparison
baselines are vendored behind per-baseline extras.

⚠️ Breaking changes — see below.

Canonical model

model.py is rewritten around CellTypeAnnotator; CellTypeCLIPModel /
CellTypeDataEncoder are removed. Canonical training defaults (scripts/train.py,
click-based CLI): --resnet_channels 48, --domain_weight 0.1,
--best_metric macro_f1.

  • Mean-intensity injection — per-cell mean marker intensity is scattered
    into a marker-position vector and injected as a CLS residual. The output
    projection is zero-init, so warm-starting from a checkpoint preserves
    predictions at step 0.
  • DANN domain adaptation via a gradient-reversal head, on by default
    (--domain_weight 0.1; 0 disables it).
  • Adapter-style fine-tuning: --freeze_backbone trains only the
    mean-intensity branches on top of an existing checkpoint; --unfreeze_ct_head
    additionally co-adapts the CT head / CLS token / final norm without unfreezing
    the transformer backbone.
  • Padding-channel positions are explicitly zeroed (masked_fill) through the
    channel encoder, fusion, and mean-intensity paths so masked tokens contribute
    exactly zero rather than leaking bias/spatial_feat into the transformer.
  • Self-describing checkpoints: scripts/train.py bundles ct2idx, n_heads,
    and compat_marker0_zero into the checkpoint, and inference asserts the
    vocabulary ordering matches (a permuted vocabulary previously passed the
    count-only check and silently mislabeled cells).
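
The vocabulary-ordering assertion from the last bullet can be sketched as below. This is a minimal illustration, not the repo's implementation: the helper name `assert_vocab_matches` and the example cell-type names are hypothetical, and `ct2idx` is assumed to be a plain name-to-index dict.

```python
def assert_vocab_matches(ckpt_ct2idx: dict, runtime_ct2idx: dict) -> None:
    """Raise if the vocabularies differ in size, names, or index order."""
    if len(ckpt_ct2idx) != len(runtime_ct2idx):
        raise ValueError(
            f"vocabulary size mismatch: {len(ckpt_ct2idx)} vs {len(runtime_ct2idx)}"
        )
    # A count-only check stops here; the per-name comparison below is what
    # catches a permuted vocabulary of the same size.
    mismatched = sorted(
        name
        for name in ckpt_ct2idx.keys() | runtime_ct2idx.keys()
        if ckpt_ct2idx.get(name) != runtime_ct2idx.get(name)
    )
    if mismatched:
        raise ValueError(f"cell-type index mismatch for: {mismatched}")


# Hypothetical vocabularies: same names, same size, different order.
ckpt = {"B cell": 0, "T cell": 1, "Macrophage": 2}
permuted = {"T cell": 0, "B cell": 1, "Macrophage": 2}

assert_vocab_matches(ckpt, dict(ckpt))  # identical ordering passes

try:
    assert_vocab_matches(ckpt, permuted)
    permuted_rejected = False
except ValueError:
    permuted_rejected = True
```

Comparing the index of every name, rather than only the vocabulary size, is what turns a silent mislabel into a load-time error.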

Canonical-only inference

  • Archive-free by default: the marker / cell-type registry ships as a small
    packaged vocab.json snapshot, so pip install deepcell-types +
    download_model() is enough to run predict() — the multi-GB TissueNet zarr
    archive is no longer required (pass zarr_path= / set
    DEEPCELL_TYPES_ZARR_PATH only if you need it). Verified identical
    predictions with vs. without the archive on the paper checkpoint.
  • Post-hoc abstention on by default (ct_abstention_k=0.2), bucketed
    per-FOV everywhere (CLI, Python API, library): cells below an IQR fence on
    the FOV confidence distribution are relabeled to the "Unknown" sentinel
    (skipped when k is disabled or the FOV has <4 cells).
  • Custom preprocessing hook: predict(..., preprocess=...) overrides the
    per-FOV normalization without retraining, backed by a bounded op library
    (apply_config, make_preprocessor, DEFAULT_CONFIG) and a
    composition-guided adaptation loop (skills/preproc-adapt/).
  • The bright-spot clip percentile (DCTConfig.PERCENTILE_THRESHOLD) is now
    99.9, matching the recipe the training archive was built with (was 99.0,
    a carryover from the original packaging).
  • predict(return_probabilities=True) returns a PredictionResult dataclass
    with the full per-cell softmax matrix, cell indices, and the pre-abstention
    argmax labels (cell_types_raw).
  • _torch_load_weights loads with weights_only=True and emits a loud warning
    if it has to fall back to unsafe pickle on an older torch; a missing
    checkpoint raises a clear FileNotFoundError pointing at download_model().
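
A dependency-free sketch of the per-FOV abstention rule, assuming the fence has the form Q1 - k*IQR on the per-cell confidence (the form given in the abstention commit message further down). The function names and the linear-interpolation quantile are assumptions made for illustration; the library's own quantile method may differ.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already-sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def abstain_per_fov(labels, confidences, k, sentinel="Unknown", min_cells=4):
    """Relabel cells whose confidence falls below Q1 - k*IQR for this FOV."""
    # Skipped when k is disabled or the FOV has fewer than min_cells cells.
    if k is None or k <= 0 or len(labels) < min_cells:
        return list(labels)
    ordered = sorted(confidences)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    fence = q1 - k * (q3 - q1)
    return [sentinel if c < fence else lab for lab, c in zip(labels, confidences)]


# One FOV with a single low-confidence cell (made-up numbers).
labels = ["T cell", "B cell", "T cell", "Macrophage", "B cell"]
conf = [0.95, 0.90, 0.92, 0.20, 0.88]
relabeled = abstain_per_fov(labels, conf, k=0.2)
unfiltered = abstain_per_fov(labels, conf, k=0)
```

Because the fence is computed from each FOV's own confidence distribution, a uniformly low-confidence FOV does not abstain on every cell; only outliers relative to that FOV are relabeled.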

New public API

  • predict, DCTConfig, PredictionResult, preprocess_fov, apply_config,
    make_preprocessor, and DEFAULT_CONFIG are importable from deepcell_types
    directly. preprocess_fov(raw, mask, native_mpp, channel_names) →
    PreprocessedFov is the standalone preprocessing entry point.

Monorepo: training pipeline

  • deepcell_types.training ships from this repo behind pip install "deepcell-types[train]": config.py, dataset.py, archive.py,
    annotations.py, baseline_features.py, gold_metadata.py, losses.py,
    metrics.py, patch.py, utils.py, abstention.py.
  • Scripts under scripts/: train.py, pretrain.py, predict.py,
    generate_openai_embeddings.py, generate_splits.py, split_val_for_test.py,
    plus the release-archive gate (validate_archive_contract.py,
    check_release_archive.sh).
  • Canonical split manifests committed under splits/
    (fov_split{,_valsubset,_test}.json + README), so the published
    train/val/test partition is reproducible from the repo.
  • Experiment logging is plain Python logging — no Weights & Biases dependency
    anywhere (--enable_wandb is gone; confusion matrices save locally as PNGs).
  • zarr>=3.1 pulls the Python floor up to 3.11 for the train extra.

Baselines

  • Four paper comparison baselines vendored under deepcell_types/baselines/
    (cellsighter, maps, nimbus, xgb), invoked through the unified runner
    python -m deepcell_types.baselines <name>, each with a self-contained
    install extra (baseline-cellsighter, baseline-maps, baseline-nimbus,
    baseline-xgboost).
  • Each baseline ships a README documenting every deviation from its upstream
    source; third-party licenses are tracked in deepcell_types/baselines/NOTICE.
  • extract_features_from_zarr(missing_value=...) lets each baseline choose its
    absent-marker sentinel: MAPS / CellSighter keep 0.0; XGBoost can pass
    np.nan so absent markers route through XGBoost's learned missing direction
    instead of being conflated with "present, intensity 0.0". The feature matrix
    records a present_markers mask and the cache stays missing-value-agnostic.
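
The sentinel choice in the last bullet can be illustrated with a small NumPy sketch. The function name and signature here are hypothetical stand-ins, and the arrays are made-up data; the point is only that the cached features stay untouched and the sentinel is applied per run from the `present_markers` mask.

```python
import numpy as np


def apply_missing_value_sketch(features, present_markers, missing_value=0.0):
    """Return a copy with the columns of absent markers set to missing_value."""
    out = np.array(features, dtype=float, copy=True)
    absent = ~np.asarray(present_markers, dtype=bool)
    out[:, absent] = missing_value
    return out


# Two cells, three markers; the third marker is absent from this dataset.
cached = np.array([[0.0, 1.5, 0.0],
                   [2.0, 0.0, 0.0]])
present = np.array([True, True, False])

for_maps = apply_missing_value_sketch(cached, present, 0.0)    # MLP/CNN input
for_xgb = apply_missing_value_sketch(cached, present, np.nan)  # XGBoost input
```

With NaN as the sentinel, a present marker measured at 0.0 (first cell, first column) stays 0.0, while the absent third column becomes NaN, so the two cases are no longer conflated.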

Breaking changes

  • CellTypeCLIPModel removed. No shim — use from deepcell_types import predict, DCTConfig.
  • All predict() arguments after mpp are keyword-only, preventing
    accidental transposition of the adjacent string arguments. device= is the
    preferred spelling (device_num= remains a deprecated alias).
  • predict(num_workers=...) default is now 0 (was 24) — 24 workers
    OOM'd machines with <64 GB RAM.
  • Abstention on by default changes returned labels vs. the unfiltered argmax
    of prior releases; pass ct_abstention_k=0 to recover raw argmax.
  • Clip percentile 99.0 → 99.9 shifts ~5% of predicted labels; on a
    held-out test-split sample it reproduces the canonical predictions slightly
    better (92.5% vs 91.9% argmax agreement).
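
To make the keyword-only change concrete, here is a toy function with the same shape. The parameter names other than `mpp`, `device`, and `num_workers` are hypothetical, as is the positional order; this is not the real `predict()` signature.

```python
def predict_sketch(raw, mask, channel_names, mpp, *, model_name="model",
                   device="cpu", num_workers=0):
    """Toy stand-in: everything after mpp must be passed by keyword."""
    return {"model_name": model_name, "device": device, "num_workers": num_workers}


# Keyword call: unambiguous even though both arguments are strings.
settings = predict_sketch("raw", "mask", ["CD3"], 0.5,
                          device="cuda:0", model_name="my-model")

# Positional call after mpp: rejected instead of silently swapping the strings.
try:
    predict_sketch("raw", "mask", ["CD3"], 0.5, "cuda:0", "my-model")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```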

Packaging / infra

  • Package data now ships vocab.json, channel_mapping.yaml, and
    training/config/*.yaml (incl. combined_celltypes.yaml), which were
    previously outside the package tree and absent after pip install.
  • tifffile declared in the [train] extra.
  • CI workflow added (.github/workflows/ci.yml); inference vs. [train] test
    boundary enforced.
  • LICENSE text matches the OSI Apache 2.0 text exactly (#42, "LIC: Revert
    licence text to exactly match OSI Apache 2"); NOTICE aligned to the
    vanvalenlab convention.

Tests

35 test modules under tests/ (plus tests/baselines/) covering canonical
inference, abstention CLI, checkpoint round-trip, dataset/split/sampler
behavior, preprocessing + the preprocess hook, losses, hierarchical eval,
archive-contract validation, baseline feature splits, and vendored-baseline
equivalence against upstream.

See CHANGELOG.md
for the full 0.1.0 entry and migration notes.

xuefei-wang and others added 30 commits May 11, 2026 07:18
From reviews/2026-05-10-2345/simplification.md H1+H2 and complexity.md H2:

- Delete _zarr_group_filesystem_path and _read_v3_1d_array from
  training/utils.py. Both were verbatim copies of annotations.py's
  group_filesystem_path / read_v3_1d_array with zero callers across
  the repo (verified by grep). The annotations.py versions are the
  canonical ones imported by training/dataset.py.

- Delete the three pass-through static shim methods on FullImageDataset
  (_group_filesystem_path, _read_v3_1d_array, _centroid_to_cell_idx_fast).
  None were called anywhere — adding zero value, only obscuring that
  the real helpers live in annotations.py. Note: _build_centroid_tree
  is kept (also flagged but not in the HIGH list).

- Backport the zstd-level-aware codec read from dct_kit/config.py into
  annotations.py:read_v3_1d_array. The old training-side copy hardcoded
  Zstd(level=0) while the inference side correctly reads level from
  the codec config. With archives written at a non-zero compression
  level the training-side read would silently produce garbage. Both
  paths now share the level-aware contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es (Theme F)

config.py and utils.py had grown to 1.3k and 1.5k LOC, mixing archive
fingerprinting, patch extraction, metric trackers, baseline IO, and the
core TissueNetConfig/RNG/log helpers in one place each. Carve four
focused modules out (verbatim, no logic changes):

- training/archive.py: zarr v3 alpha metadata patch, archive metadata
  / array fingerprinting, FOV-key discovery, and the per-process caches.
- training/patch.py: per-cell patch extraction
  (compute_distance_transform, extract_patch_from_zarr, extract_patch).
- training/metrics.py: confusion-matrix hierarchy adjustment,
  MP per-marker reduction, MPMetricsTracker, LossesAndMetrics,
  build_label_remap.
- training/baseline_features.py: baseline classifier feature extraction
  pipeline (_conf_mat_summary, compute_baseline_metrics,
  save_baseline_predictions, _extract_all_dataset_features,
  extract_features_from_zarr, _get_cell_data_from_ds).

Re-exports at the bottom of config.py and utils.py keep all
tests/scripts working unchanged (230 passed, 1 skipped, matching the
pre-split baseline). dataset.py is updated to import directly from
the new homes for cached_archive_metadata_fingerprint and extract_patch.

Two non-mechanical touches required to keep monkey-patch-based tests
green:
- baseline_features.extract_features_from_zarr looks up
  _discover_fov_keys and _extract_all_dataset_features via the
  config / utils modules at call time, so tests that monkeypatch
  those symbols on the legacy modules still take effect after the
  split. _FINGERPRINT_CACHE / _FOV_KEYS_CACHE dicts are re-exported
  from config.py for the same reason (test_dataset_cache mutates them).
- metrics.LossesAndMetrics.compute defers import of _conf_mat_summary
  to method-call time to avoid a metrics <-> baseline_features import
  cycle (baseline_features needs adjust_conf_mat_hierarchy at module
  load).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
From reviews/2026-05-10-2345/docs.md HIGH findings:

- README: add a "Training" section describing the [train] extra and the
  four main entry points under scripts/. Move "Download the model"
  after "Installation" (was non-executable in reading order).

- docs/index.md: add a "Training" section explaining that training-only
  code lives under deepcell_types.training, gated behind the [train]
  extra, with pointers to scripts/{train,predict,pretrain,
  benchmark_gold_standard,ingest_gold_to_zarr}.py. Fix the long-standing
  "sorce" typo.

- docs/site/tutorial.md: bump the example archive placeholder from
  tissuenet-v8.zarr → tissuenet-v9.zarr to match DCTConfig's probe
  order (v9 is the canonical contemporary archive).

The docs.md HIGH for the broken `from utils import download_training_data`
import in docs/site/API-key.md was fixed in 88b95f9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five MEDIUM/HIGH findings from reviews/2026-05-10-2345 in one batch:

- complexity H1: TissueNetConfig.get_marker_positivity() and
  marker_positivity_labels[] now share a single LazyMarkerPositivityDict.
  Previously the plain-dict cache populated by get_marker_positivity()
  was discarded the first time marker_positivity_labels was accessed
  (the property replaced the field), causing wasted I/O and divergent
  caches. _marker_positivity_cache is now Optional[LazyMP...] and
  lazily constructed on first access; get_marker_positivity routes
  through marker_positivity_labels for a single source of truth.

- numerical M1: MarkerEmbeddingLayer.forward zeros output for
  padding positions (ch_idx == -1). Without this, F.normalize(proj(0))
  yielded a unit-norm direction equal to F.normalize(proj.bias) — a
  non-trivial embedding flowing into the transformer for tokens that
  should be invisible.

- numerical M2: CellTypeAnnotator.forward zeros spatial features
  for padding positions BEFORE the fusion concat. Otherwise padding
  tokens enter self.fusion with [0, spatial_feat] and emerge as
  W_spatial @ spatial_feat + bias.

- API M1: rename predict(tissue_exclude=...) → predict(tissue_filter=...).
  The old name was inverted — "tissue_exclude='colon'" actually meant
  "filter TO colon-associated cell types". The deprecated alias stays
  (keyword-only) and emits DeprecationWarning; passing both raises
  TypeError.

- API M3: predict(return_probabilities=True) returns a
  PredictionResult dataclass with cell_types, probabilities (full per-
  cell softmax matrix), and cell_indices. Default behaviour
  unchanged (returns list[str]). PredictionResult and DCTConfig are
  now hoisted to top-level so `from deepcell_types import
  PredictionResult, DCTConfig` works.
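
The leak described under numerical M1 can be reproduced with a NumPy stand-in for the PyTorch ops. The weights, shapes, and function name below are hypothetical; the sketch only shows that a normalized linear projection of an all-zero token equals the normalized bias, and that an out-of-place mask removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # hypothetical projection weight
b = rng.normal(size=4)       # hypothetical projection bias


def project_and_normalize(x):
    """Linear projection followed by L2 normalization of each row."""
    out = x @ W.T + b
    return out / np.linalg.norm(out, axis=-1, keepdims=True)


tokens = rng.normal(size=(2, 3))
tokens[1] = 0.0                         # second token is padding (all-zero input)
padding_mask = np.array([False, True])  # True marks padding, i.e. ch_idx == -1

unmasked = project_and_normalize(tokens)
# Out-of-place zeroing, analogous to masked_fill on padding_mask.unsqueeze(-1).
masked = np.where(padding_mask[:, None], 0.0, unmasked)
```

Here `unmasked[1]` is the unit-norm bias direction, which is exactly the non-trivial embedding the fix removes; `masked[1]` is all zeros and the real token is unchanged.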

Tests: 233 passed, 1 skipped. Added 3 new tests covering
return_probabilities, tissue_exclude DeprecationWarning, and the
both-args TypeError.
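
A minimal sketch of the alias handling described in API M1, assuming a small resolver helper (the helper name is hypothetical; the real logic may live inline in `predict()`):

```python
import warnings


def resolve_tissue_filter(tissue_filter=None, tissue_exclude=None):
    """Map the deprecated tissue_exclude alias onto tissue_filter."""
    if tissue_exclude is not None:
        if tissue_filter is not None:
            raise TypeError("pass tissue_filter or tissue_exclude, not both")
        warnings.warn(
            "tissue_exclude is deprecated; use tissue_filter",
            DeprecationWarning,
            stacklevel=2,
        )
        return tissue_exclude
    return tissue_filter


# The old spelling still works but warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    via_alias = resolve_tissue_filter(tissue_exclude="colon")

# Passing both spellings is an error.
try:
    resolve_tissue_filter(tissue_filter="colon", tissue_exclude="colon")
    both_rejected = False
except TypeError:
    both_rejected = True
```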

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- tests M3: add a regression anchor in test_train_loop_smoke.py that
  asserts scripts/train.py still contains the AMP scheduler-gate
  predicate. The 2-line _run_gated_step helper is faithful to the
  production behavior but a silent drift would otherwise let the
  emulator tests pass while real training desynchronizes OneCycleLR.

- tests M2: same idea for test_zero_channel_masking.py. The unit-test
  helper is a verbatim copy of __getitem__'s masking block; a refactor
  could let the copy drift. New test asserts
  training/dataset.py still contains _zero_channel_cache and
  fov_zero_mask.

- docs M4: add CHANGELOG.md documenting the 0.0.1 → 0.1.0 release
  (canonical-only refactor, training subpackage, breaking removal of
  CellTypeCLIPModel, deprecated tissue_exclude alias, num_workers=0
  default, TissueNetConfig env-var default). Bump version in
  pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
complexity H8: replace FullImageDataset.indices' positional 8-tuple
with a CellIndexRecord NamedTuple. Named fields make grep / refactor
safe (no more record[6] / record[5] magic numbers across 10+ call
sites). NamedTuple IS a tuple, so positional access still works for
backward compat with serialized caches that stored raw 8-tuples.
Production call sites in dataset.py now use .ct_label_standard,
.dataset_name, .fov_name, .ds_idx, .domain accessors. Mock-index
constructors in tests/{test_v2,test_samplers,test_stratified_splits,
test_dataset_splits}.py updated to build CellIndexRecord instances.
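
The backward-compatibility claim (a NamedTuple is still a tuple) can be checked with a small sketch. Only `ct_label_standard`, `dataset_name`, `fov_name`, `ds_idx`, and `domain` come from the commit message; the field order and the three fields marked hypothetical are assumptions.

```python
from typing import NamedTuple


class CellIndexRecord(NamedTuple):
    # Field order is assumed; fields marked "hypothetical" are placeholders.
    ds_idx: int
    dataset_name: str
    fov_name: str
    cell_idx: int        # hypothetical
    centroid_y: float    # hypothetical
    centroid_x: float    # hypothetical
    ct_label_standard: str
    domain: int


# A raw 8-tuple as an old serialized cache might have stored it.
legacy = (0, "dataset_a", "fov_001", 17, 12.5, 40.0, "T cell", 1)
record = CellIndexRecord(*legacy)

# Named access replaces magic indices, but positional access still works.
label_by_name = record.ct_label_standard
label_by_index = record[6]
```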

complexity H7: introduce DataLoaderConfig dataclass + matching
create_dataloader_from_config(zarr_dir, dct_config, cfg) wrapper.
Lets new callers pass a single discoverable object instead of 20+
keyword arguments. The legacy keyword signature of create_dataloader
is preserved verbatim so train.py / predict.py / tests don't need
any change. Field defaults mirror create_dataloader's defaults
exactly — DataLoaderConfig() is equivalent to no-override.

Tests: 235 passed, 1 skipped (analysis-only env failure unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse training pipeline into deepcell-types (canonical-only)
…ne submodule rebase

Three independent bugs surfaced when running training against the current
master HEAD from a fresh workspace install:

1. tissue_idx kwarg mismatch (scripts/train.py:121, scripts/predict.py:208 + 334)
   scripts pass `tissue_idx=batch_data.tissue_idx` to
   `CellTypeAnnotator.forward(...)`, but the model's forward signature is
   `(sample, spatial_context, ch_idx, padding_mask, ct_exclude=None,
   return_attn_weights=False, domain_idx=None)` — no `tissue_idx`. The
   tissue-FiLM MP head experiment was rolled back (see memory
   `v10_mp_expansion_tissue_negative.md`) and the model dropped the
   parameter, but the scripts kept passing it. Result: every training /
   prediction run dies at the first forward pass with
   `TypeError: ...got an unexpected keyword argument 'tissue_idx'`.
   Fix: drop the kwarg at all three call sites. `batch_data.tissue_idx`
   is still populated by the dataloader and remains available to anyone
   who needs it downstream — the model just doesn't consume it.

2. Circular import between training/utils.py and training/baseline_features.py
   utils.py re-exports four symbols from baseline_features.py at module
   level for backward compat. baseline_features.py also imports private
   helpers (`_atomic_np_savez` etc.) from utils.py. When utils.py is
   imported first (training path) the cycle resolves fine, but when
   baseline_features.py is imported first (baseline path — e.g.
   `import xgb.run`), the partially-initialized utils.py reaches back to
   `baseline_features._extract_all_dataset_features` before that name is
   defined, and ImportError fires.
   Fix: convert the re-exports to a module-level `__getattr__` so the
   lookup is deferred until actual access, by which point both modules
   have finished initializing. Existing callers
   (`from deepcell_types.training.utils import save_baseline_predictions`,
   verified in tests/test_v2.py) keep working.

3. Submodule rebase (baselines/{maps,cellsighter,xgboost,nimbus})
   Each baseline's pyproject.toml listed `deepcelltypes @ git+...
   deepcelltypes-cell-type-assignment-pytorch.git` as a dep; that URL
   now resolves to the renamed research workspace (no longer a Python
   package) and `uv pip install` fails with a metadata-name mismatch.
   Each baseline also imported from `deepcelltypes.{config,utils,dataset}`
   — the pre-refactor flat layout. Companion commits on each submodule's
   `fix/post-refactor-imports` branch replace the dep URL with a plain
   `deepcell-types` and rebase imports onto
   `deepcell_types.training.{config,utils,dataset,metrics,baseline_features}`.
   This parent commit bumps the submodule pointers to those branch tips.
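
The deferred re-export from fix 2 relies on module-level `__getattr__` (PEP 562). The sketch below simulates the two modules in memory with hypothetical names (`utils_demo`, `baseline_features_demo`) so it runs standalone; in the real package the hook would simply be a `def __getattr__(name)` at the bottom of utils.py.

```python
import importlib
import sys
import types

# Stand-in for baseline_features.py with one re-exported symbol.
baseline_features = types.ModuleType("baseline_features_demo")
baseline_features.save_baseline_predictions = lambda: "saved"
sys.modules["baseline_features_demo"] = baseline_features

# Names that utils re-exports, mapped to the module that really defines them.
_LAZY_REEXPORTS = {"save_baseline_predictions": "baseline_features_demo"}


def _lazy_getattr(name):
    """PEP 562 hook: resolve a re-exported name on first access only."""
    target = _LAZY_REEXPORTS.get(name)
    if target is None:
        raise AttributeError(f"module 'utils_demo' has no attribute {name!r}")
    return getattr(importlib.import_module(target), name)


# Stand-in for utils.py: no eager import of baseline_features at load time.
utils = types.ModuleType("utils_demo")
utils.__getattr__ = _lazy_getattr
sys.modules["utils_demo"] = utils

# Existing call sites keep working; the lookup happens at this point.
from utils_demo import save_baseline_predictions
```

Because nothing is looked up until a caller actually asks for the name, import order between the two modules no longer matters.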

End-to-end verification: with the three fixes, a fresh workspace `uv sync`
+ smoke training (`scripts/train.py` with the v10 split + svd_512_v6
embeddings) gets through model build, GPU allocation, and reaches batch 0
of epoch 0. The xgboost baseline imports cleanly after
`uv pip install -e baselines/xgboost`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…port

fix(train,predict,utils): tissue_idx kwarg + circular import + baseline submodule rebase
uv.lock is regenerated on every branch switch when pyproject.toml
shapes differ (master vs training), so keeping it tracked produces
constant churn. /reviews/ holds local /deep-review outputs.

Untouched since Oct 2024 and broken since the kit was inlined in
0a8108e: it COPYs a non-existent top-level requirements.txt and
pip-installs the deleted deepcelltypes-kit/ directory. No CI, docs,
or scripts reference it.
…us v0.0.5

model.py: replace channel_feat[padding_mask] = 0.0 (in-place under AMP
autocast on a tensor in the backward graph) with an out-of-place
masked_fill on padding_mask.unsqueeze(-1). Eliminates the latent
"a leaf Variable that requires grad has been used in an in-place
operation" risk and the gradient corruption it would cause on padding
rows.

metrics.py: make MPMetricsTracker.compute symmetric across mp_macro_f1
and mp_macro_accuracy w.r.t. vacuous markers (n_pos_gt == 0 and
n_pos_pred == 0). f1s already appended np.nan + used nanmean; accuracies
appended a real value + used mean, asymmetrically inflating macro
accuracy. Now both go through nanmean with np.nan sentinels, so the two
headline MP numbers come from the same denominator.
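
A small pure-Python sketch of the symmetric treatment, with made-up labels. The function names are hypothetical; the only behavior taken from the commit is that a vacuous marker (no positives in ground truth or prediction) contributes NaN to both metrics and both macros use a NaN-skipping mean.

```python
import math


def nanmean(values):
    """Mean over the non-NaN entries; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def per_marker_scores(y_true, y_pred):
    """Return (f1, accuracy) for one marker, or (nan, nan) if vacuous."""
    n_pos_gt, n_pos_pred = sum(y_true), sum(y_pred)
    if n_pos_gt == 0 and n_pos_pred == 0:
        return float("nan"), float("nan")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    f1 = 2 * tp / (n_pos_gt + n_pos_pred)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return f1, acc


markers = {
    "CD3": ([1, 0, 1, 0], [1, 0, 0, 0]),
    "CD20": ([0, 0, 0, 0], [0, 0, 0, 0]),  # vacuous marker
}
scores = [per_marker_scores(t, p) for t, p in markers.values()]
mp_macro_f1 = nanmean([f for f, _ in scores])
mp_macro_accuracy = nanmean([a for _, a in scores])
```

Under the old asymmetric scheme the vacuous marker would have added a trivially perfect accuracy of 1.0 and lifted the macro accuracy in this example from 0.75 to 0.875.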

scripts/benchmark_gold_standard.py: support nimbus-inference v0.0.5 +
the actual Pan-Multiplex gold-standard directory layout. The script
previously called Nimbus.prepare_normalization_dict (removed in v0.0.5)
and assumed per-subset labels/ + raw/ dirs; the real layout has
<subset>/fovs/ plus a single central gold_standard_groundtruth.csv.
Now: prepare_normalization_dict is invoked on MultiplexDataset (its
v0.0.5 home), discover_gold_standard_subsets accepts both layouts and
pivots the central CSV per-FOV when needed, the segmentation naming
convention probes both Pan-Multiplex (<fov>.ome.tif) and legacy
DeepCell (<fov>_whole_cell.tiff) names, and the image suffix is
autodetected. Smoke against /data/xwang3/nimbus_gold_standard/
gold_standard_labelled completes end-to-end (macro F1 0.7400,
micro 0.8382, 56 markers, 939K cell-marker pairs).

Pytest baseline unchanged: 255 passed / 1 skipped / 1 failed (the
known post-PR-#62 analysis.validate_mp_refinement path drift).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After PR #62 split the monorepo, `analysis/` lives in the research
workspace and is no longer importable from this repo's pytest session.
The stage7 synthetic-gold-validation test imports
`analysis.validate_mp_refinement` and fails collection with
`ModuleNotFoundError: No module named 'analysis'` unless the workspace
is on PYTHONPATH.

Guard the import with `pytest.importorskip(...)` so the suite reports
skipped instead of failed in the default sibling-repo-only invocation.
Bumps the sibling pytest baseline to 255 passed / 2 skipped / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… --zarr_dir defaults

train.py / predict.py / pretrain.py used to default --zarr_dir to
DATA_DIR / "tissuenet-caitlin-labels.zarr", forcing users to set
DATA_DIR to the wrapper directory and have the scripts append the
inner archive name. Switch to default --zarr_dir = DATA_DIR so the
env var holds the actual archive root directly; this matches both
how TissueNetConfig(zarr_path=...) is invoked elsewhere and how the
baseline runners take their --zarr_dir.

The three baseline submodules (xgboost, cellsighter, maps) make the
same change on their --zarr_dir defaults; pointers are bumped here.
The cellsighter submodule also includes a smoke-safety fix
(best_macro_acc=-inf so the first val pass always saves a checkpoint
even when macro_accuracy is exactly 0.0); the xgboost submodule
includes a label-tightening fix for tiny subset smokes where
GroupShuffleSplit can leave compact labels with zero examples in
inner_train (rejected by modern xgboost.sklearn.XGBClassifier).

Smoke verification on the v10 7-dataset subset (post-tier-3-repair
archive) — all 3 baselines + main model now complete end-to-end:
  - main train.py (cuda:0)         train_macro_acc=0.0268, best ckpt saved
  - xgb baseline (CPU)             macro=0.2209, CSV + model.json saved
  - cellsighter baseline (cuda:1)  macro=0.0131, CSV + .pth saved
  - maps baseline (cuda:2)         val_loss=5.96, CSV + .pth + stats saved

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CellTypeAnnotator.forward zeroed padding rows in two places. The first
(`channel_feat[padding_mask] = 0.0`) was switched to an out-of-place
`masked_fill` in 782c611 to avoid an in-place write on a tensor that's
in the AMP autocast backward graph. The second site
(`spatial_expanded[padding_mask] = 0.0`) was left as an in-place write,
guarded by a defensive `.clone()` on the preceding `expand()` view.

That guard is correct today, but the asymmetry is a trap: anyone who
removes the `.clone()` thinking it's redundant will silently reintroduce
the same AMP-graph hazard the earlier fix addressed. Switching to the
same masked_fill pattern removes the trap and drops the now-unneeded
clone — masked_fill materializes the expand() view into a fresh tensor.

Pytest unchanged: 255 passed / 2 skipped / 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop LoRA from MarkerEmbeddingLayer (and CLI flags / auto-detect):
  the trainable projection makes a LoRA adapter mathematically redundant
  (proj.W + lora_B @ lora_A collapses into proj_eff.W). Confirmed on v8
  that LoRA-8 ties exactly with no-LoRA.
- Add --mean_intensity_mode {none|cls_residual|per_channel|both} to
  CellTypeAnnotator: scatter per-cell mean marker intensity into a global
  marker-position vector and inject as a CLS residual and/or per-channel
  feature. Zero-init the output projection so warm-start from a baseline
  ckpt preserves predictions at step 0.
- Add --freeze_backbone to train.py: requires_grad=False on everything
  except intensity_cls_branch/intensity_per_channel_proj. Use with
  --pretrained_path to train a cheap mean-intensity adapter on top of an
  existing ckpt.
- Final-eval val-cap automation: when --max_val_samples is set (cheap
  per-epoch val), final eval is rebuilt with no cap so the headline test
  number is apples-to-apples vs baselines (which never cap their val).
- Auto-detect mean_intensity_mode from ckpt keys in predict.py,
  benchmark_gold_standard.py, and deepcell_types.predict.
- Make pretrained loading tolerate numpy-scalar metadata
  (torch.load weights_only=False for pretrained_path).
- Add scripts/fold_lora_into_proj.py utility to fold legacy LoRA weights
  into proj.weight so old ckpts load against the LoRA-free model.
- Change canonical defaults: --resnet_channels 48, --domain_weight 0.1,
  --mean_intensity_mode cls_residual, --best_metric macro_f1.
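
The scatter-and-inject step and the zero-init argument can be sketched in NumPy. Everything about the branch architecture here (one tanh layer, the shapes, the variable names) is an assumption for illustration; the two facts taken from the commit are that intensities are scattered into a global marker-position vector and that the output projection starts at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_markers, d_model, n_classes = 6, 8, 3


def scatter_mean_intensity(ch_idx, mean_intensity, n_markers):
    """Place each channel's mean intensity at its global marker position."""
    vec = np.zeros(n_markers)
    valid = ch_idx >= 0  # -1 marks padding channels
    vec[ch_idx[valid]] = mean_intensity[valid]
    return vec


W_in = rng.normal(size=(d_model, n_markers))    # branch input layer, random init
W_out = np.zeros((d_model, d_model))            # output projection, zero-init
W_head = rng.normal(size=(n_classes, d_model))  # stand-in classifier head

cls = rng.normal(size=d_model)                  # stand-in CLS embedding
ch_idx = np.array([2, 0, 5, -1])                # last channel is padding
mean_int = np.array([0.7, 0.1, 0.4, 0.0])

vec = scatter_mean_intensity(ch_idx, mean_int, n_markers)
residual = W_out @ np.tanh(W_in @ vec)          # exactly zero at step 0

logits_before = W_head @ cls
logits_after = W_head @ (cls + residual)
```

Since the residual is exactly zero until `W_out` receives its first update, a warm-started checkpoint produces the same logits with and without the branch at step 0.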

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(baselines): NaN missing-value support + XGBoost submodule bump

Adds a ``missing_value: float = 0.0`` knob to
``extract_features_from_zarr`` so each baseline can pick its preferred
sentinel for absent markers:

- MAPS / CellSighter continue to receive 0.0 (default; their MLP/CNN
  layers don't tolerate NaN).
- XGBoost now receives NaN (per submodule update), routing absent
  markers through XGBoost's ``missing=NaN`` learned per-split default
  direction rather than conflating them with "marker present, mean
  intensity 0.0".

Implementation:
- ``_extract_all_dataset_features`` records a per-dataset
  ``present_markers`` bool mask alongside features/labels/cell_sizes.
- ``extract_features_from_zarr`` accumulates per-split block metadata
  (``{split}_block_sizes``, ``{split}_block_absent``) and applies
  ``_apply_missing_value`` post-extraction (and post-cache-load). The
  cache stays missing-value-agnostic — same .npz/pickle serves a MAPS
  run and an XGBoost run.
- Cache version bumped 5 -> 6 so legacy caches without
  ``present_markers`` are rebuilt automatically.

Submodule bump (baselines/xgboost):
- fix(tuning): carve FOV-grouped inner-val for early stopping instead
  of leaking the test set into best_iteration.
- feat(missing): pass missing_value=np.nan from run.py and tuning.py.

Tests:
- tests/test_baseline_feature_splits.py: 5 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(submodule): bump baselines/xgboost to include FOV-grouped Optuna fix

dbca43c switches the Optuna inner-val split from cell-level
StratifiedShuffleSplit to FOV-grouped GroupShuffleSplit so
hyperparameter selection sees the same FOV-generalisation gap as the
reported test set (and drops the singleton duplication workaround).

See xuefei-wang/deepcelltypes-xgboost#3 for the full change.
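
To show what FOV-grouped splitting guarantees, here is a dependency-free stand-in for scikit-learn's GroupShuffleSplit. The function name, the validation fraction, and the FOV ids are hypothetical; the submodule itself uses the scikit-learn class.

```python
import random


def fov_grouped_split(fov_ids, val_fraction=0.25, seed=0):
    """Split cell indices so that every FOV lands wholly in train or val."""
    fovs = sorted(set(fov_ids))
    random.Random(seed).shuffle(fovs)
    n_val = max(1, round(len(fovs) * val_fraction))
    val_fovs = set(fovs[:n_val])
    train_idx = [i for i, f in enumerate(fov_ids) if f not in val_fovs]
    val_idx = [i for i, f in enumerate(fov_ids) if f in val_fovs]
    return train_idx, val_idx


# Ten cells drawn from four FOVs.
fov_ids = ["fov_a"] * 3 + ["fov_b"] * 2 + ["fov_c"] * 4 + ["fov_d"] * 1
train_idx, val_idx = fov_grouped_split(fov_ids)

train_fovs = {fov_ids[i] for i in train_idx}
held_out_fovs = {fov_ids[i] for i in val_idx}
```

A cell-level shuffle would scatter cells from one FOV across both sides, letting the model see near-duplicates of its validation cells during training; grouping by FOV removes that leak.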

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(submodule): bump baselines/xgboost for train_best_model retighten

57b4997 adds label-space re-tightening in train_best_model to handle
the case where GroupShuffleSplit on small splits leaves some classes
absent from inner-train (mirrors run.py:178-204). At full scale this
is a no-op; on small splits it allows the tuned XGBoost path to
complete without an XGBClassifier label-space rejection.

Surfaced during smoke testing on a 4-FOV split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(submodule): re-pin baselines/xgboost to the merged main HEAD

xuefei-wang/deepcelltypes-xgboost#3 merged as 6cda78d (squash). Re-pin
to the merged HEAD on main so the pointer tracks an actual branch tip
rather than the pre-squash branch commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xuefei-wang/deepcelltypes-xgboost#4 widens the tuning.py --metric
click.Choice to also accept macro_f1 / weighted_f1, matching the
research workspace's headline metric. Bumps the submodule pointer
to the merged commit (470a74d).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#6)

Pulls xuefei-wang/deepcelltypes-xgboost#5 — each saved XGB ckpt now writes
a sidecar <model>.remap.json with the post-GSS → ct2idx mapping, so
out-of-band evaluators don't need to replay the GroupShuffleSplit to
recover the label-space.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls xuefei-wang/deepcelltypes-maps#3 — MAPS output head now covers all
51 archive ct2idx classes instead of just classes seen in train, removing
the 5–10 pp macro-F1 artifact from classes with zero train support.
Existing v10 ckpts unaffected (eval-side reads n_out from the ckpt).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…backbone (#8)

The existing --freeze_backbone freezes every parameter except the
mean-intensity branches (intensity_cls_branch / intensity_per_channel_proj).
The CT classifier head, CLS token, and final norm — all CT-task layers —
stay frozen.

That's a tight definition of "adapter only" but it leaves a known
limitation: the CT head can't adapt to whatever the mean-intensity branch
adds to the CLS embedding, so the model's improvement is capped by the
pretrained head's biases.

Add --unfreeze_ct_head (default off). With this flag set alongside
--freeze_backbone, the freeze policy additionally re-enables:

  - ct_head            (the CT classifier MLP, ~105K params)
  - final_norm         (LayerNorm before heads, ~512 params)
  - cls_token          (the trainable CLS embedding parameter)

The heavy backbone (transformer 3.2M, per-channel encoder 130K, marker
embedder LoRA 175K, spatial encoder 57K) stays frozen as before.

Use case: train Frozen-CLS variants where you want the new mean-intensity
side-input AND the CT head to co-adapt without unleashing the full
transformer. Brings the trainable share from ~3% to ~6% of total
parameters.
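
The freeze policy can be expressed as a predicate over parameter names. The prefixes come from the commit message; the helper name and the example parameter names are hypothetical, and the sketch assumes `--unfreeze_ct_head` has no effect unless `--freeze_backbone` is also set.

```python
ADAPTER_PREFIXES = ("intensity_cls_branch", "intensity_per_channel_proj")
CT_HEAD_PREFIXES = ("ct_head", "final_norm", "cls_token")


def is_trainable(param_name, freeze_backbone, unfreeze_ct_head=False):
    """Decide whether a parameter keeps requires_grad=True."""
    if not freeze_backbone:
        return True
    prefixes = ADAPTER_PREFIXES
    if unfreeze_ct_head:
        prefixes = prefixes + CT_HEAD_PREFIXES
    return param_name.startswith(prefixes)


# Hypothetical parameter names, as named_parameters() might report them.
names = [
    "transformer.layers.0.attn.in_proj_weight",
    "intensity_cls_branch.out.weight",
    "ct_head.0.weight",
    "final_norm.weight",
    "cls_token",
]
adapter_only = [n for n in names if is_trainable(n, True)]
with_ct_head = [n for n in names if is_trainable(n, True, unfreeze_ct_head=True)]
```

In a PyTorch training script the predicate would be applied as `p.requires_grad = is_trainable(name, ...)` while iterating `model.named_parameters()`.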

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `scripts/predict.py --ct_abstention_k` and the module-form
`deepcell_types.predict()` now default to k=0.5 — the v10 published
headline operating point (≈9% of cells abstained, +5pp macro_F1 on kept
cells; clears every baseline including XGBoost-tuned on the held-out
test split).

Module-form predict() additions:
- `ct_abstention_k=0.5` parameter, k<=0 / None disables.
- Per-FOV IQR fence Q1 - k*IQR on max-softmax (the whole FOV is a single
  tissue×modality group). compute_iqr_fence already guards n<4.
- Abstained cells get the sentinel label "Unknown" in `cell_types`.
  Original argmax preserved in PredictionResult.cell_types_raw, with a
  boolean PredictionResult.abstained mask alongside.
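
The fence described in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the helper names `quantile` and `abstain`, the linear-interpolated quartiles, and the strict `<` comparison are all hypothetical, and the real `compute_iqr_fence` may differ on each point.

```python
# Sketch of the per-FOV IQR fence on max-softmax confidence.
# Assumptions: linear-interpolated quartiles and a strict "<" comparison.
def quantile(sorted_vals, q):
    """Linear-interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def abstain(labels, confidences, k=0.5, unknown="Unknown"):
    """Relabel cells whose confidence falls below Q1 - k*IQR."""
    if k is None or k <= 0 or len(confidences) < 4:
        # Disabled, or too few cells to estimate quartiles (the n<4 guard).
        return list(labels), [False] * len(labels)
    vals = sorted(confidences)
    q1, q3 = quantile(vals, 0.25), quantile(vals, 0.75)
    fence = q1 - k * (q3 - q1)
    mask = [c < fence for c in confidences]
    out = [unknown if m else lab for lab, m in zip(labels, mask)]
    return out, mask


raw = ["T cell", "B cell", "T cell", "Macrophage", "B cell"]
conf = [0.90, 0.92, 0.94, 0.96, 0.30]
cell_types, abstained = abstain(raw, conf, k=0.5)
```

Here only the last cell (confidence 0.30) falls below the fence, so it becomes "Unknown" while `raw` keeps the original argmax labels.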

scripts/predict.py:
- --ct_abstention_k default flipped from None to 0.5. Set 0 or a negative
  value to disable. Help text updated to point at
  docs/reports/ct_iqr_abstention_test.md instead of the older audit doc.
- Guard tightened to `k > 0` so the disable contract is explicit.

Tests:
- Replaced test_default_no_abstention_column with two new tests:
  test_default_k_0_5_abstention_is_on (≈10% abstained on synthetic frame)
  and test_disable_abstention_with_nonpositive_k (k<=0 / None as no-op).
- All 24 existing canonical-inference + abstention tests pass; the
  1-cell test_predict_* cases trip compute_iqr_fence's n<4 guard, so
  no abstention fires and assertions hold.

Backwards-compat note: callers that don't pass `ct_abstention_k=` will
now see "Unknown" labels appear for low-confidence cells. To restore the
pre-change behaviour, pass `ct_abstention_k=0` (or None) at the call
site, or `--ct_abstention_k 0` on the CLI.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…,score) (#10)

benchmark_gold_standard.py evaluates marker-positivity at a single
hardcoded threshold (default 0.5). Per-marker threshold tuning needs the
raw per-cell scores, which the script wasn't persisting.

When the DCT_GOLD_PREDS_CSV env var is set, after the inference pass
the script writes a flat CSV of (fov, cell_id, channel, pred_score) for
every prediction the model produced. Downstream callers can then apply
oracle CV per-marker τ (or any other threshold-tuning protocol) without
re-running the model — see analysis/rescore_gold_oracle_cv.py in the
research workspace, which consumes this CSV to produce the
final_*_gold_metrics_learned.json adaptive-τ tables.

No behaviour change when the env var is unset (no file written).
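
The opt-in dump can be sketched as follows. This is an illustrative pattern only: the function name `maybe_dump_scores` is hypothetical, and the only details taken from the message are the environment variable name and the four column names.

```python
import csv
import os


def maybe_dump_scores(rows, env_var="DCT_GOLD_PREDS_CSV"):
    """Write (fov, cell_id, channel, pred_score) rows if the env var is set.

    Returns the path written, or None when the variable is unset.
    """
    path = os.environ.get(env_var)
    if not path:
        return None  # unset: no file written, behaviour unchanged
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["fov", "cell_id", "channel", "pred_score"])
        writer.writerows(rows)
    return path
```

A downstream threshold-tuning script can then read the CSV and sweep per-marker thresholds without touching the model.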

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module docstring previously documented `k=None (off): default`. The
production default in both `scripts/predict.py:81` and
`deepcell_types/predict.py:242` is `k=0.5`. Updated the Pareto-sweep
note to reflect the current operating point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the data-provenance audit, the --min_channels=3 filter is vacuous on
the labeled v10 corpus — the 622 archive FOVs excluded from v10 are
unlabeled (no standardized_source annotations), not channel-filtered. The
filter logic in dataset.py / baseline_features.py is retained and gated on
`min_channels > 0`, so callers who pass it explicitly still get the
behavior; only the default changes from 3 → 0 across the 4 CLI scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntime

After dropping the --min_channels default 3 → 0 (the filter is a no-op on
the labeled v10 corpus), load_fov_splits was strict-failing on the
recorded vs runtime min_channels metadata. Add min_channels to
_ADVISORY_SPLIT_METADATA_KEYS so the mismatch logs a warning instead of
raising. Restores load-compatibility with all existing fov_split_v10*.json
files (which carry min_channels=3 in their metadata).
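
The strict-versus-advisory distinction might look like the sketch below. It is a hedged illustration: `check_split_metadata` and the `seed` key used in the usage note are hypothetical, and only `min_channels` is confirmed by this message as an advisory key.

```python
import logging

logger = logging.getLogger(__name__)

# Only min_channels is confirmed by the commit message as advisory.
ADVISORY_KEYS = {"min_channels"}


def check_split_metadata(recorded: dict, runtime: dict) -> list:
    """Raise on strict mismatches; warn about and collect advisory ones."""
    advisory = []
    for key, expected in recorded.items():
        actual = runtime.get(key)
        if actual == expected:
            continue
        if key in ADVISORY_KEYS:
            logger.warning(
                "split metadata %s: recorded %r, runtime %r", key, expected, actual
            )
            advisory.append(key)
        else:
            raise ValueError(
                f"split metadata mismatch on {key!r}: {expected!r} != {actual!r}"
            )
    return advisory
```

Under this sketch, `check_split_metadata({"min_channels": 3}, {"min_channels": 0})` returns `["min_channels"]` after logging a warning, while a mismatch on any other key (for example a hypothetical `seed`) still raises.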

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e #79) (#11)

* docs(abstention): fix stale docstring — k=0.5 is the default

Module docstring previously documented `k=None (off): default`. The
production default in both `scripts/predict.py:81` and
`deepcell_types/predict.py:242` is `k=0.5`. Updated the Pareto-sweep
note to reflect the current operating point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* scripts: drop --min_channels default 3 → 0 (filter is a no-op on v10)

Per the data-provenance audit, the --min_channels=3 filter is vacuous on
the labeled v10 corpus — the 622 archive FOVs excluded from v10 are
unlabeled (no standardized_source annotations), not channel-filtered. The
filter logic in dataset.py / baseline_features.py is retained and gated on
`min_channels > 0`, so callers who pass it explicitly still get the
behavior; only the default changes from 3 → 0 across the 4 CLI scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(splits): tolerate min_channels mismatch between split file and runtime

After dropping the --min_channels default 3 → 0 (the filter is a no-op on
the labeled v10 corpus), load_fov_splits was strict-failing on the
recorded vs runtime min_channels metadata. Add min_channels to
_ADVISORY_SPLIT_METADATA_KEYS so the mismatch logs a warning instead of
raising. Restores load-compatibility with all existing fov_split_v10*.json
files (which carry min_channels=3 in their metadata).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(predict): FOV-grouped train sampler for --learn_mp_thresholds (#79)

Root cause: when --learn_mp_thresholds is on, predict.py builds the train
loader with use_weighted_sampler=False, which falls back to shuffle=True.
Each random batch of 256 hits ~256 unique FOVs, and FullImageDataset's
_get_zarr_arrays runs `raw_np = raw_zarr[:]` on the first hit per worker
(populating _zero_channel_cache) even when the FOV exceeds the per-worker
numpy budget — a full ~1 GB cold zarr load per FOV. With 8 spawn workers
× prefetch=4 × random FOVs, batch 0 waits on terabytes of cold zarr reads
and effectively never arrives. Training avoids this entirely because
FOVGroupedSampler keeps each worker on one FOV at a time.

Fix: add SequentialFOVGroupedSampler — uniform-coverage counterpart to
FOVGroupedSampler with the same cache-locality guarantee — and a
fov_grouped_train flag on create_dataloader to enable it. predict.py
passes the flag when --learn_mp_thresholds is set, so the original
issue-#79 invocation now runs to completion.
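
The cache-locality idea can be sketched as an index ordering that visits every sample once, one FOV at a time. This is not the actual `SequentialFOVGroupedSampler`; the function below is a hypothetical simplification that ignores batching and the per-worker split.

```python
from collections import defaultdict


def sequential_fov_grouped_indices(fov_ids):
    """Yield every sample index exactly once, one FOV at a time."""
    groups = defaultdict(list)
    for idx, fov in enumerate(fov_ids):
        groups[fov].append(idx)
    for fov in groups:  # dicts preserve first-seen order
        yield from groups[fov]


fov_ids = ["a", "b", "a", "c", "b", "a"]
order = list(sequential_fov_grouped_indices(fov_ids))
```

Because consecutive indices share a FOV, a worker that has loaded one zarr array can serve many samples from it before touching the next, which is the property a shuffled loader loses.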

Smoke (8 workers, spawn, v10 test split): first 5 batches in 5 min
(cold-load), subsequent batches stream from per-worker numpy cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ty tissue buckets

Three cleanups from the 2026-05-19 triple-review low-priority list:

(1) deepcell_types/model.py constructor defaults aligned with CLI:
    - resnet_base_channels: 32 → 48 (canonical paper recipe, matches
      --resnet_channels default in scripts/train.py + predict.py)
    - mean_intensity_mode: "none" → "cls_residual" (canonical paper recipe,
      matches --mean_intensity_mode default in scripts/train.py)
    Direct callers like `CellTypeAnnotator(...)` without explicit kwargs
    were previously silently building the pre-v10 model variant.

(2) deepcell_types/training/config.py:_compute_all_mappings:
    Drop tissues whose tissue_celltype_mapping ends up with an empty
    allowed-CT set. Previously 4 tissues (esophagus, immune,
    musculoskeletal, colon) had keys created on first sighting but never
    populated, since their only FOVs lacked standardized_source
    annotations. Empty sets are a bug attractor under
    --apply_tissue_mask (the mask becomes all -Inf → NaN softmax). Now
    the empty entries are filtered out and --apply_tissue_mask just
    skips unmapped tissues.
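
The failure mode in (2) is easy to reproduce in isolation. The sketch below is an illustration only: `masked_softmax` is a hypothetical stand-in for the real masking code, and "lung" is an invented tissue name; "colon" is one of the four empty tissues named above.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over `logits` with disallowed classes set to -inf."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # -inf - (-inf) is nan
    total = sum(exps)
    return [e / total for e in exps]


# An empty allowed set masks every class, so the softmax is all NaN.
poisoned = masked_softmax([1.0, 2.0, 3.0], set())

# Dropping empty entries up front avoids ever building such a mask.
tissue_celltype_mapping = {"lung": {0, 2}, "colon": set()}
cleaned = {t: cts for t, cts in tissue_celltype_mapping.items() if cts}
healthy = masked_softmax([1.0, 2.0, 3.0], cleaned["lung"])
```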

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pinned commit a21b97f was the head of merged PR #2; main tip b5447d1
is the merge commit. Same effective content, but the pointer now
matches main rather than a non-tip commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xuefei-wang and others added 30 commits June 22, 2026 14:24
Rebased onto current master (atop the rebased #52). MAPS adapter:
- epoch schedule --max_epochs 500 / --min_epochs 250 / --patience 100 with
  early stopping on a FOV-grouped inner-validation loss (reported test set
  never feeds selection);
- DCT-safe normalization default (train-set z-score then /255), with a
  /255-only ablation via --no_znorm; reproducibility metadata recorded;
- stale normalization-default comment corrected to match the code.

Final tree taken from the pre-rebase tip (81ac2ad); the original branch's
merge-commit resolution (DCT-safe README wording) is preserved here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- model.py: thread ct_head_width / ct_head_depth through the annotator so the
  residual-MLP head size is no longer hardcoded at 512/4.
- predict.py: _infer_ct_head_params() derives head arch/width/depth from a
  checkpoint state_dict when the config omits them, so config-less resMLP
  checkpoints load; master's ct_out_key / vocab-guard logic is preserved.
- retrain_head.py: record ct_head_width/depth + stage-2 provenance in the
  deployable checkpoint config.
- scripts/predict.py: build the model through the inferred head params.

Salvaged from PR #55 / stale #41. The released v0.1.0 checkpoint is a legacy MLP
head and still loads unchanged. Note: scripts/predict.py carries a near-duplicate
of _infer_ct_head_params (follow-up: dedupe into deepcell_types/predict.py).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review of #56:
- scripts/predict.py carried a near-duplicate of _infer_ct_head_params;
  import the canonical helper from deepcell_types.predict instead (single
  source of truth, no drift).
- Wrap the call so a self-inconsistent checkpoint (config says resmlp but
  the state_dict lacks ct_head.inp.0.weight) raises a clear ValueError
  instead of a bare KeyError, matching the deepcell_types.predict path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs(baselines): correct stale faithfulness claims in baseline READMEs
…ning

feat(maps): DCT-safe training schedule and normalization controls
feat(model): configurable resMLP head + config-less head-shape inference (salvage of #55)
Release-readiness fixes to the user-facing predict() path:

- Abstention is now opt-in: ct_abstention_k defaults to None (raw argmax)
  instead of 0.2. The old default silently relabelled low-confidence cells
  to "Unknown" in the plain list[str] return at a benchmark-tuned operating
  point; k=0.2 still reproduces the paper operating point.
- Mask all-zero channels in PatchDataset, matching the training dataloader
  (which attention-masks them). A listed marker that is all-zero on a FOV
  was previously fed as a present zero token with a real marker embedding,
  an input the model was trained never to see. Dropping it is equivalent to
  training's attention mask (padding channels are inert for the CLS).
- Reject non-finite (NaN/inf) raw up front instead of silently labelling
  every cell as class 0 via a poisoned softmax.
- Size patch tensors to the real channel count instead of MAX_NUM_CHANNELS;
  padding tokens are provably inert, so this is numerically identical while
  avoiding the per-channel ResNet + quadratic-transformer work over padding.
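
The up-front rejection of non-finite input can be sketched as below. The function name `check_finite` and the nested-list layout (channels of flattened pixels) are assumptions for illustration; with NumPy arrays the equivalent check would be `np.isfinite(raw).all()`.

```python
import math


def check_finite(raw):
    """Raise ValueError if a channels-by-pixels nested list has NaN or inf."""
    for c, channel in enumerate(raw):
        if not all(math.isfinite(v) for v in channel):
            raise ValueError(f"raw channel {c} contains NaN or inf values")


check_finite([[0.0, 1.5], [2.0, 3.0]])  # finite input passes silently
```

Failing early keeps a single NaN from propagating through the softmax and producing a confident-looking but meaningless label for every cell.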

Tests: abstention-opt-in default, non-finite rejection, all-zero channel
masking, padding numerical-inertness, and a config/preprocessing
MPP+percentile parity assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README: the quickstart downloaded the latest checkpoint but then called
  predict(model_name="deepcell-types_2026-05-17"), which resolves to a file
  download_model() never wrote -> FileNotFoundError on copy-paste. Capture the
  path download_model() returns and pass it straight to predict(). Also add the
  DEEPCELL_ACCESS_TOKEN prerequisite to the model-download section (previously
  only documented elsewhere).
- docs/index.md "Recognized channels" limitation said the registry comes from
  the zarr archive; with 0.1.0 it ships in the packaged vocab.json by default
  and the archive is an override. Corrected to match.
- CHANGELOG: abstention entry now describes the opt-in (default None) behavior;
  added entries for all-zero-channel masking, the non-finite input guard, and
  the real-channel-width tensor sizing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Close the highest-value test gaps the review flagged:

- test_auth.py (new): the download/integrity/extraction layer (utils/_auth.py)
  was entirely untested. Cover the md5/sha256 hash dispatch + bad-length error,
  extract_archive's zip-slip / tar-traversal / tar-symlink rejection and the
  benign-archive happy path, fetch_data's cache-hit and missing-token branches
  (no network), and the model-registry digest shapes.
- compat_marker0_zero now has a behavioral test asserting it zeros marker-0's
  mean-intensity column (the released-checkpoint parity contract), via a hook on
  the intensity CLS branch — the branch's final layer is zero-initialized, so the
  contract must be checked on the branch input, not the fresh model's ct_logits.
- An end-to-end numeric regression pin: a fixed-seed checkpoint on a fixed FOV
  must reproduce a golden softmax fingerprint and be deterministic across calls,
  so preprocessing/forward drift fails instead of shipping green.
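
The two behaviours covered by test_auth.py can be sketched as follows. Both helpers are hypothetical stand-ins for the code in utils/_auth.py: choosing the algorithm from the hex digest length is an assumption inferred from the "bad-length error" wording, and the traversal check shows one common way to reject zip-slip style member names.

```python
import hashlib
import os


def hash_for_digest(digest: str):
    """Pick the hash constructor from the hex digest length."""
    if len(digest) == 32:
        return hashlib.md5
    if len(digest) == 64:
        return hashlib.sha256
    raise ValueError(f"unsupported digest length: {len(digest)}")


def is_safe_member(dest_dir: str, member_name: str) -> bool:
    """True if extracting `member_name` stays inside `dest_dir`."""
    root = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(root, member_name))
    return os.path.commonpath([root, target]) == root
```

A member such as `../evil.txt` or an absolute path resolves outside the destination directory, so an extractor built on this check would refuse it before writing anything.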

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/evaluate_on_test.sh required embeddings/svd_512.npz (absent from the
repo, regenerable only via OpenAI + the multi-GB archive), so the headline
test-split evaluation was not runnable out of the box — even though
load_state_dict overwrites the marker embeddings with the checkpoint's, making
the SVD file's values unused (only its shape matters).

- scripts/predict.py: load the checkpoint before the marker embeddings and add
  _resolve_marker_embeddings(), which builds a correctly-shaped zeros
  placeholder from the checkpoint when --svd_embeddings_path is omitted.
- evaluate_on_test.sh: the SVD path is now optional (passed only when set), and
  the default MODEL_CKPT points at the download_model() cache (~/.deepcell/
  models) instead of a local dct-final-ckpt path.
- test_scripts_predict.py: unit-test the placeholder / delegate / error paths
  (pure, no archive or network needed).

Not run end-to-end here (needs the registration-gated archive in DATA_DIR); a
single confirmatory real-archive run is recommended before trusting the numbers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
deepcell_types.baselines.maps.run does `import click` at module top, so
tests/baselines/test_maps_normalization.py (which imports normalize_features
from it) raised a ModuleNotFoundError collection error — not a clean skip — on
an inference-only install with no [train]/baseline-maps extra. This turned the
inference-only CI job red (pre-existing on master). The baselines conftest's
hand-maintained collect_ignore list was missing this entry; add a click gate
matching the other baseline-test gates.

Verified by simulating an inference-only collection (extra-only packages hidden
via a meta-path blocker): `pytest --collect-only tests` now exits 0 with the
maps-normalization module excluded and no remaining collection errors.
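
A dependency gate of this kind can be written in a `conftest.py` roughly as below. This is a sketch rather than the repository's conftest: the relative file name assumes the conftest sits next to the test module.

```python
import importlib.util

# Hypothetical conftest.py fragment. pytest skips collecting any path listed
# in collect_ignore, so the module is excluded instead of erroring on import.
collect_ignore = []
if importlib.util.find_spec("click") is None:
    collect_ignore.append("test_maps_normalization.py")
```

Checking `find_spec` avoids importing the optional package just to learn whether it is installed.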

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…w findings

Addresses code-review findings on PR #57:
- dataset.py: use the same `max(axis=1) == 0` criterion as the training
  dataloader for all-zero channel masking (the previous `(== 0).all(axis=1)`
  diverged for negative-valued input, breaking the claimed training parity).
- dataset.py: fix the comment's reference to a non-existent test file; the
  real test is test_channel_padding_is_numerically_inert.
- dataset.py: broaden the all-masked error message — channels can now be
  dropped for being unmatched, duplicate, or all-zero, not only unmatched.
- test_canonical_inference.py: pin the opt-in `ct_abstention_k=None` default
  via a signature check (the behavioural assertions use a uniform input that
  never trips the IQR fence, so they alone don't catch a default regression).
- scripts/predict.py: note in --ct_abstention_k help that the batch CLI
  deliberately defaults abstention ON (paper reproduction) while the predict()
  library API defaults it OFF.
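
The divergence between the two all-zero criteria in the first item is easiest to see on a small example. The sketch below uses plain lists in place of arrays, with each channel flattened to one row; the function names are hypothetical.

```python
def all_zero_by_max(channels):
    """Training-style criterion: a channel is masked when its max is 0."""
    return [max(ch) == 0 for ch in channels]


def all_zero_by_equality(channels):
    """Earlier inference criterion: every value equals 0."""
    return [all(v == 0 for v in ch) for ch in channels]


channels = [
    [0.0, 0.0, 0.0],    # truly empty
    [-1.0, 0.0, -0.5],  # non-positive but not all zero
    [0.0, 2.0, 0.0],    # has signal
]
by_max = all_zero_by_max(channels)
by_eq = all_zero_by_equality(channels)
```

The two criteria agree on non-negative data and differ only on the second channel, which is why the mismatch surfaced for negative-valued input.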

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…=None)

The tutorial still described abstention as on-by-default (k=0.2) and told
users to disable it with k=0, contradicting PR #57's change making it opt-in.
Now documents the None default (raw argmax for every cell) and shows k=0.2 to
enable the paper's headline operating point.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Release-readiness fixes: README quickstart, abstention opt-in, inference parity, tests
Drop docs/reviews/ (internal multi-agent review + ablation reports kept only
for provenance; nothing in code/docs references them) and the dev-only
`if __name__ == "__main__"` runner appended to tests/test_v2.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The [analysis] extra documented figure scripts (plot_tsne.py,
plot_experiment_results.py) that do not exist in the repo, and nothing imports
its only deps (seaborn, openpyxl). Remove the extra, its mention in `all`, and
the dangling allowlist entries in tests/conftest.py. Retarget a package-data
comment to HierarchicalLoss (the real consumer of combined_celltypes.yaml).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- scripts/train.py: docstring said "DANN disabled by default" but
  --domain_weight defaults to 0.1 (DANN enabled).
- Remove references to internal-monorepo files not shipped in the public repo
  (analysis/ct_abstention_iqr.py, preprocess_for_training.py,
  analysis.test_split_summary, dct-final-ckpt/).
- training/utils.py: BatchData.tissue_idx docstring described a removed
  'index 0 = null token' scheme; the code now raises on a missing tissue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove public-but-uncalled API surface (0.1.0 is unreleased, so never shipped):
TissueNetConfig.{get_excluded_ct_indices, get_channel_embedding,
get_celltype_embedding, combined_celltype_mapping, color_mapping, core_tree,
lineage_mapping, validate}, the now-unused yaml import, and
create_dataloader_from_config (plus its dataset re-export and __all__ entry).
DataLoaderConfig is kept (exercised by the test suite).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- splits/fov_split_test_current.json: replace a leaked author-machine zarr_path
  (/data/xwang3/...) with the $DATA_DIR placeholder used by the other three.
- splits/README.md: document fov_split_test_current.json (the actual default
  headline-eval split) and the prior- vs current-archive fingerprint split.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- scripts/pretrain.py: usage/runtime hints used bare `python pretrain.py` /
  `python train.py`, which only run from scripts/; align to `python scripts/...`
  to match the README.
- baselines/nimbus/run.py: reword the in-code TODO documenting the centroid
  scale_factor overlap as a 'Known limitation' note (the limitation is real and
  documented intentionally; not unfinished work to flag for release).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All four methods (DCT, MAPS, CellSighter, XGBoost) now use the main
DeepCell-Types sampler by default — sqrt-inverse-frequency with a
1000-count floor — so the baseline-vs-DCT comparison no longer confounds
the method with its class-balancing scheme. Each method's own faithful
sampler stays available as an opt-in ablation.

- samplers.py: factor the DCT weight formula into a shared label-array
  helper `compute_sample_weights_dct()`; `compute_sample_weights()` now
  delegates to it (byte-identical weights — DCT main path unchanged).
- CellSighter: default `--class_balance` equal -> sqrt (`sqrt` already
  maps to the DCT sampler in `create_dataloader`); the faithful
  equal-proportion + size_data recipe stays as the `equal` ablation.
- MAPS: add `--class_balance {dct,full_inv_freq,none}`, default `dct`;
  `full_inv_freq` is the faithful mahmoodlab/MAPS `n/count` sampler.
- XGBoost (plain + tuned): add `--class_balance {dct,none}`, default
  `dct`, applied as a per-row `sample_weight` in `fit()` (the tree analog
  of the neural samplers); `none` restores faithful unweighted XGBoost.
- READMEs updated to reflect the new default + the faithful ablations.
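
A sqrt-inverse-frequency weighting with a count floor can be sketched as below. The exact formula and scaling in `compute_sample_weights_dct()` are not shown in this message, so the expression `1/sqrt(max(count, floor))` and the helper name are assumptions chosen to illustrate the idea.

```python
from collections import Counter
from math import sqrt


def sqrt_inv_freq_weights(labels, floor=1000):
    """Per-sample weight 1/sqrt(max(class_count, floor)) (assumed formula)."""
    counts = Counter(labels)
    return [1.0 / sqrt(max(counts[y], floor)) for y in labels]


labels = ["a"] * 4000 + ["b"] * 1000 + ["c"] * 10
weights = sqrt_inv_freq_weights(labels)
```

Under this assumed formula, a class with 10 cells receives the same per-sample weight as a class with 1000 cells, so the floor stops the rarest classes from being upweighted further.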

Code-only change; baseline numbers must be regenerated by retraining
with the new default before they land in any figure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…it metadata)

Follow-up to the release-cleanup audit on this PR:

- scripts/train.py: align the DEFAULT_LOSS_WEIGHTS fallback (domain 0.0 -> 0.1)
  with the documented "DANN enabled by default" / --domain_weight default.
  Behavior-neutral for the CLI (the per-run loss_weights dict always overrides
  "domain" with --domain_weight), fixing only the fallback used by a programmatic
  forward_one_batch(loss_weights=None) caller.
- scripts/evaluate_on_test.sh: drop the surviving internal `dct-final-ckpt/`
  reference from the header comment; point at the public deepcell_types.baselines
  path instead.
- tests/conftest.py: remove the deleted `[analysis]` extra from the optional-
  extras comment.
- splits/fov_split_test_current.json: correct stale metadata (num_val_fovs
  431 -> 129; add num_heldout_fovs: 302) to match the actual val/heldout keys
  (verified: val=129, bit-identical to fov_split_test.json; heldout=302).

Full suite: 358 passed, 1 skipped. ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… provenance

Two fixes from the PR #60 deep-review.

M1 — XGB sample_weight scale. `compute_sample_weights_dct` returns raw
sqrt-inverse-frequency weights (mean several×, up to ~47× on floored rare
classes). A WeightedRandomSampler is scale-invariant, but XGBoost consumes
sample_weight as an absolute per-row multiplier on the summed gradient/hessian
per leaf, so the raw weights silently inflate hessian mass and weaken
reg_lambda / min_child_weight relative to the unweighted run — confounding
class balance with reduced regularization. Add a `normalize` kwarg
(default False, preserving the resampling and main-model paths bit-for-bit;
verified: test_samplers 24/24 still pass) and opt the three XGB sites
(run.py, tuning.py objective + train_best_model) into normalize=True so the
dct-vs-none ablation isolates balancing.
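
Mean-1 normalization keeps the relative class weights while leaving the total gradient and hessian mass equal to that of an unweighted run. A minimal sketch, assuming a plain list of weights and a hypothetical helper name:

```python
def normalize_to_mean_one(weights):
    """Rescale weights so their mean is 1, keeping the ratios unchanged."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


raw_weights = [4.0, 4.0, 12.0, 20.0]
normalized = normalize_to_mean_one(raw_weights)
```

The normalized values would then be passed as the per-row `sample_weight` in `fit()`; a resampling-based sampler does not need this step because it only depends on the ratios.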

M3 — provenance. Because all baselines now default to the shared sampler, two
prediction CSVs trained under different schemes are byte-schema-identical.
`save_baseline_predictions` now writes the active class_balance (and size_data
for CellSighter) to a sidecar `*.meta.json` — a sidecar, not a CSV column, so
the prediction schema and downstream softmax-column selection are unchanged.
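
The sidecar idea can be sketched as below. This is a hypothetical helper, not `save_baseline_predictions` itself: the function name and the exact file naming (replacing `.csv` with `.meta.json`) are assumptions.

```python
import json
from pathlib import Path


def write_sidecar(csv_path, class_balance, size_data=None):
    """Write `<stem>.meta.json` next to a predictions CSV and return its path."""
    meta = {"class_balance": class_balance}
    if size_data is not None:
        meta["size_data"] = size_data  # only meaningful for CellSighter
    sidecar = Path(csv_path).with_suffix(".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```

Keeping the provenance in a separate file means any code that selects softmax columns from the CSV by position or name keeps working unchanged.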

Also warn when CellSighter `--size_data` is set under a non-`equal` scheme
(silently inert otherwise), and soften the sampler docstring (rare-tail-is-
unbalanced note; drop the unverified "236 cells" figure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…defaults

PR #60 added `--class_balance` to the maps/xgboost/xgboost-tune commands but
did not update the frozen-option snapshot tests, so test_xgboost_*_frozen and
test_maps_*_frozen failed with "Extra items in the left set: 'class_balance'"
(CI-blocking). Add `class_balance` to XGBOOST_OPTS, XGBOOST_TUNE_OPTS, and
MAPS_OPTS, and lock the unified-sampler defaults (xgb/xgboost-tune/maps -> dct,
cellsighter -> sqrt) via default-value assertions so a silent default flip is
caught. Verified: 9/9 frozen-option tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…servative claim

From the PR #60 deep-review docs findings:
- Note that CellSighter's `--class_balance sqrt` and MAPS/XGB's `dct` select
  the identical shared scheme (the equivalence was only in a code docstring,
  so a user scripting "same sampler everywhere" hit a click "Invalid value"
  error).
- Document that the 1000-count floor leaves the rare tail effectively
  unbalanced (the scheme mainly rebalances the head), so "balanced" is not
  overstated for the region macro-F1 weights most.
- Hedge the XGB README's "had made the XGBoost rare-class macro number
  conservative" — an unmeasured causal claim — to "we expect ... (direction
  not yet measured)", and document the new mean-1 sample_weight normalization.
- Replace bare "WeightedRandomSampler" with a note that the main model wraps it
  as FOVGroupedSampler (identical draw distribution).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
create_dataloader_from_config (its only functional consumer) was removed
earlier in this PR, leaving DataLoaderConfig as an exported-but-inert dataclass
with no ergonomic call path. Remove it for a clean release surface; the full
keyword API of create_dataloader remains the single way to configure a loader.

- dataloader.py: drop the class and its now-orphaned `dataclass`/`typing`
  imports; update the module docstring.
- dataset.py: drop the back-compat re-export and the `__all__` entry.
- test_training_import_order.py: drop the DataLoaderConfig import assertions.

Full suite: 358 passed, 1 skipped. ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpler

feat(baselines): unify all baselines on the DCT sampler by default
Rebased onto master (which now has #60's --class_balance / DCT-sampler default).
Adds an opt-in --val_split_file to MAPS, CellSighter, XGBoost (plain + tuned) so
each trains on the FULL --split_file train and selects its checkpoint / early-stop
/ Optuna-trial on an EXTERNAL validation set — the 'val' FOVs of --val_split_file,
capped to 200k cells at seed 42 (mirroring dataloader.py:269-273 max_val_samples),
scored with the canonical hierarchical ct_macro_f1 (metrics.py:399-419). The
reported set stays --split_file 'val'. Legacy inner-val carve is unchanged when
the flag is absent.
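
The seeded validation cap can be sketched as below. This is an illustration of the idea only: `cap_validation` is a hypothetical helper, and the actual subsampling in dataloader.py may use a different random generator, so the selected cells would not match.

```python
import random


def cap_validation(indices, max_samples=200_000, seed=42):
    """Return at most `max_samples` indices, chosen reproducibly."""
    if len(indices) <= max_samples:
        return list(indices)
    rng = random.Random(seed)  # local generator: global state is untouched
    return sorted(rng.sample(list(indices), max_samples))


subset = cap_validation(range(1000), max_samples=100)
```

Fixing the seed means every baseline is scored on the same capped validation cells, so selection metrics stay comparable across methods.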

This matches how the main DCT model selects (val_macro_f1 on the canonical
302-FOV validation, 200k cap), making baseline model-selection consistent with
DCT instead of each baseline self-carving a different 10% inner-val.

Combined-state fix (needed once #58's eval_set_external path meets #60's
sample_weight): thread class_balance through xgb/tuning.py
_run_canonical_val_tuning -> run_tuning / train_best_model, and apply
compute_sample_weights_dct in the eval_set_external branch — otherwise the
tuned-XGB canonical run would tune with DCT weights but ship an unweighted final
model. Also adds --features_cache to xgboost-tune for cache reuse, and lists
val_split_file in the frozen CLI option snapshots.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…selection

feat(baselines): --val_split_file canonical external-val selection (consistent with DCT)
chore: release-readiness cleanup (cruft, stale refs, dead API, split docs)