v0.1.0: monorepo merge — unified training + inference, canonical model, archive-free inference, vendored baselines by xuefei-wang · Pull Request #41 · vanvalenlab/deepcell-types

xuefei-wang · 2026-05-30T16:48:37Z

Summary

Merges the separate training repository (deepcelltypes-cell-type-assignment-pytorch)
into this repo and replaces the legacy CellTypeCLIPModel inference path with the
current canonical model. This is the v0.1.0 release cut.

Before this PR, vanvalenlab/deepcell-types was inference-only — it shipped
CellTypeCLIPModel, the dct_kit/ helpers, and a top-level __init__ that
exported just predict. After it, a single package covers training and
inference: inference stays a plain pip install deepcell-types, the full
training pipeline lives behind a [train] extra, and the four paper comparison
baselines are vendored behind per-baseline extras.

⚠️ Breaking changes — see below.

Canonical model

model.py is rewritten around CellTypeAnnotator; CellTypeCLIPModel /
CellTypeDataEncoder are removed. Canonical training defaults (scripts/train.py,
click-based CLI): --resnet_channels 48, --domain_weight 0.1,
--best_metric macro_f1.

- Mean-intensity injection: per-cell mean marker intensity is scattered
  into a marker-position vector and injected as a CLS residual. The output
  projection is zero-init, so warm-starting from a checkpoint preserves
  predictions at step 0.
- DANN domain adaptation via a gradient-reversal head, on by default
  (--domain_weight 0.1; 0 disables it).
- Adapter-style fine-tuning: --freeze_backbone trains only the
  mean-intensity branches on top of an existing checkpoint; --unfreeze_ct_head
  additionally co-adapts the CT head / CLS token / final norm without unfreezing
  the transformer backbone.
- Padding-channel positions are explicitly zeroed (masked_fill) through the
  channel encoder, fusion, and mean-intensity paths so masked tokens contribute
  exactly zero rather than leaking bias/spatial_feat into the transformer.
- Self-describing checkpoints: scripts/train.py bundles ct2idx, n_heads,
  and compat_marker0_zero into the checkpoint, and inference asserts the
  vocabulary ordering matches (a permuted vocabulary previously passed the
  count-only check and silently mislabeled cells).
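The difference between a count-only check and an ordering check can be shown with a minimal sketch. Only the ct2idx name comes from the description above; check_vocab, the example labels, and the choice of ValueError are assumptions for illustration, not the package's actual code.

```python
def check_vocab(ckpt_ct2idx: dict[str, int], runtime_ct2idx: dict[str, int]) -> None:
    """Raise if the runtime vocabulary does not match the checkpoint's exactly."""
    # A count-only check passes for any permutation of the same labels.
    if len(ckpt_ct2idx) != len(runtime_ct2idx):
        raise ValueError("vocabulary size mismatch")
    # Comparing the full name -> index mapping also catches reordering.
    if ckpt_ct2idx != runtime_ct2idx:
        moved = sorted(
            name for name, idx in ckpt_ct2idx.items()
            if runtime_ct2idx.get(name) != idx
        )
        raise ValueError(f"vocabulary ordering mismatch for: {moved}")


# Hypothetical vocabularies: same three labels, two of them swapped.
trained = {"B cell": 0, "T cell": 1, "Macrophage": 2}
permuted = {"T cell": 0, "B cell": 1, "Macrophage": 2}

check_vocab(trained, dict(trained))  # identical mapping: passes silently
try:
    check_vocab(trained, permuted)
except ValueError as err:
    mismatch_message = str(err)
```

Both vocabularies have three entries, so a length comparison alone would accept the permuted one and every "B cell" prediction would be reported as "T cell".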

Canonical-only inference

- Archive-free by default: the marker / cell-type registry ships as a small
  packaged vocab.json snapshot, so pip install deepcell-types +
  download_model() is enough to run predict(); the multi-GB TissueNet zarr
  archive is no longer required (pass zarr_path= / set
  DEEPCELL_TYPES_ZARR_PATH only if you need it). Verified identical
  predictions with vs. without the archive on the paper checkpoint.
- Post-hoc abstention on by default (ct_abstention_k=0.2), bucketed
  per-FOV everywhere (CLI, Python API, library): cells below an IQR fence on
  the FOV confidence distribution are relabeled to the "Unknown" sentinel
  (skipped when k is disabled or the FOV has <4 cells).
- Custom preprocessing hook: predict(..., preprocess=...) overrides the
  per-FOV normalization without retraining, backed by a bounded op library
  (apply_config, make_preprocessor, DEFAULT_CONFIG) and a
  composition-guided adaptation loop (skills/preproc-adapt/).
- The bright-spot clip percentile (DCTConfig.PERCENTILE_THRESHOLD) is now
  99.9, matching the recipe the training archive was built with (was 99.0,
  a carryover from the original packaging).
- predict(return_probabilities=True) returns a PredictionResult dataclass
  with the full per-cell softmax matrix, cell indices, and the pre-abstention
  argmax labels (cell_types_raw).
- _torch_load_weights loads with weights_only=True and emits a loud warning
  if it has to fall back to unsafe pickle on an older torch; a missing
  checkpoint raises a clear FileNotFoundError pointing at download_model().
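The per-FOV abstention rule can be sketched in a few lines. This assumes the fence is Q1 - k*IQR on each cell's maximum softmax probability, as the commit log for this PR describes; abstain_per_fov is a hypothetical helper, and the quartile interpolation method is an assumption, so exact fence values may differ from the packaged implementation.

```python
import statistics

UNKNOWN = "Unknown"


def abstain_per_fov(labels, confidences, k):
    """Relabel low-confidence cells of ONE FOV to the "Unknown" sentinel."""
    # Disabled (k is None or <= 0), or too few cells to estimate quartiles.
    if k is None or k <= 0 or len(confidences) < 4:
        return list(labels)
    # Assumption: linear-interpolation quartiles over the FOV's confidences.
    q1, _, q3 = statistics.quantiles(confidences, n=4, method="inclusive")
    fence = q1 - k * (q3 - q1)
    return [UNKNOWN if c < fence else lab for lab, c in zip(labels, confidences)]


# Hypothetical FOV with five cells; the last one has very low confidence.
fov_labels = ["T cell", "T cell", "B cell", "T cell", "B cell"]
fov_conf = [0.90, 0.92, 0.91, 0.93, 0.20]
filtered = abstain_per_fov(fov_labels, fov_conf, k=0.5)
```

Because the fence is computed from the FOV's own confidence distribution, the same absolute confidence can be kept in one FOV and relabeled in another.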

New public API

predict, DCTConfig, PredictionResult, preprocess_fov, apply_config,
make_preprocessor, and DEFAULT_CONFIG are importable from deepcell_types
directly. preprocess_fov(raw, mask, native_mpp, channel_names) → PreprocessedFov is the standalone preprocessing entry point.

Monorepo: training pipeline

- deepcell_types.training ships from this repo behind pip install "deepcell-types[train]": config.py, dataset.py, archive.py,
  annotations.py, baseline_features.py, gold_metadata.py, losses.py,
  metrics.py, patch.py, utils.py, abstention.py.
- Scripts under scripts/: train.py, pretrain.py, predict.py,
  generate_openai_embeddings.py, generate_splits.py, split_val_for_test.py,
  plus the release-archive gate (validate_archive_contract.py,
  check_release_archive.sh).
- Canonical split manifests committed under splits/
  (fov_split{,_valsubset,_test}.json + README), so the published
  train/val/test partition is reproducible from the repo.
- Experiment logging is plain Python logging, with no Weights & Biases dependency
  anywhere (--enable_wandb is gone; confusion matrices save locally as PNGs).
- zarr>=3.1 pulls the Python floor up to 3.11 for the train extra.

Baselines

- Four paper comparison baselines vendored under deepcell_types/baselines/
  (cellsighter, maps, nimbus, xgb), invoked through the unified runner
  python -m deepcell_types.baselines <name>, each with a self-contained
  install extra (baseline-cellsighter, baseline-maps, baseline-nimbus,
  baseline-xgboost).
- Each baseline ships a README documenting every deviation from its upstream
  source; third-party licenses are tracked in deepcell_types/baselines/NOTICE.
- extract_features_from_zarr(missing_value=...) lets each baseline choose its
  absent-marker sentinel: MAPS / CellSighter keep 0.0; XGBoost can pass
  np.nan so absent markers route through XGBoost's learned missing direction
  instead of being conflated with "present, intensity 0.0". The feature matrix
  records a present_markers mask and the cache stays missing-value-agnostic.
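How one cached feature matrix can serve baselines with different sentinels is easiest to see in a small sketch. It assumes the cache stores raw features plus a per-marker presence mask; apply_missing_value and the example values are illustrative and do not reproduce the package's own helper.

```python
import math


def apply_missing_value(features, present_markers, missing_value=0.0):
    """Return a copy of `features` with absent-marker columns set to the sentinel."""
    return [
        [value if present else missing_value
         for value, present in zip(row, present_markers)]
        for row in features
    ]


# Two cells, three markers; the third marker was not imaged in this dataset.
cached = [[0.4, 0.0, 0.0], [1.2, 0.7, 0.0]]
present = [True, True, False]

for_maps = apply_missing_value(cached, present)                # keeps 0.0
for_xgb = apply_missing_value(cached, present, float("nan"))   # NaN sentinel
```

In the NaN variant the second marker of the first cell stays 0.0 (present, zero intensity) while the third becomes NaN (absent), which is the distinction the 0.0 sentinel cannot express.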

Breaking changes

- CellTypeCLIPModel removed. No shim; use from deepcell_types import predict, DCTConfig.
- All predict() arguments after mpp are keyword-only, preventing
  accidental transposition of the adjacent string arguments. device= is the
  preferred spelling (device_num= remains a deprecated alias).
- predict(num_workers=...) default is now 0 (was 24); 24 workers
  OOM'd machines with <64 GB RAM.
- Abstention on by default changes returned labels vs. the unfiltered argmax
  of prior releases; pass ct_abstention_k=0 to recover raw argmax.
- Clip percentile 99.0 → 99.9 shifts ~5% of predicted labels; on a
  held-out test-split sample it reproduces the canonical predictions slightly
  better (92.5% vs 91.9% argmax agreement).

Packaging / infra

- Package data now ships vocab.json, channel_mapping.yaml, and
  training/config/*.yaml (incl. combined_celltypes.yaml), which were
  previously outside the package tree and absent after pip install.
- tifffile declared in the [train] extra.
- CI workflow added (.github/workflows/ci.yml); inference vs. [train] test
  boundary enforced.
- LICENSE text matches the OSI Apache 2.0 text exactly (LIC: Revert licence text to exactly match OSI Apache 2 #42); NOTICE
  aligned to the vanvalenlab convention.

Tests

35 test modules under tests/ (plus tests/baselines/) covering canonical
inference, abstention CLI, checkpoint round-trip, dataset/split/sampler
behavior, preprocessing + the preprocess hook, losses, hierarchical eval,
archive-contract validation, baseline feature splits, and vendored-baseline
equivalence against upstream.

See CHANGELOG.md for the full 0.1.0 entry and migration notes.

From reviews/2026-05-10-2345/simplification.md H1+H2 and complexity.md H2:
- Delete _zarr_group_filesystem_path and _read_v3_1d_array from training/utils.py. Both were verbatim copies of annotations.py's group_filesystem_path / read_v3_1d_array with zero callers across the repo (verified by grep). The annotations.py versions are the canonical ones imported by training/dataset.py.
- Delete the three pass-through static shim methods on FullImageDataset (_group_filesystem_path, _read_v3_1d_array, _centroid_to_cell_idx_fast). None were called anywhere; they added zero value and only obscured that the real helpers live in annotations.py. Note: _build_centroid_tree is kept (also flagged but not in the HIGH list).
- Backport the zstd-level-aware codec read from dct_kit/config.py into annotations.py:read_v3_1d_array. The old training-side copy hardcoded Zstd(level=0) while the inference side correctly reads level from the codec config. With archives written at a non-zero compression level the training-side read would silently produce garbage. Both paths now share the level-aware contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…es (Theme F)
config.py and utils.py had grown to 1.3k and 1.5k LOC, mixing archive fingerprinting, patch extraction, metric trackers, baseline IO, and the core TissueNetConfig/RNG/log helpers in one place each. Carve four focused modules out (verbatim, no logic changes):
- training/archive.py: zarr v3 alpha metadata patch, archive metadata / array fingerprinting, FOV-key discovery, and the per-process caches.
- training/patch.py: per-cell patch extraction (compute_distance_transform, extract_patch_from_zarr, extract_patch).
- training/metrics.py: confusion-matrix hierarchy adjustment, MP per-marker reduction, MPMetricsTracker, LossesAndMetrics, build_label_remap.
- training/baseline_features.py: baseline classifier feature extraction pipeline (_conf_mat_summary, compute_baseline_metrics, save_baseline_predictions, _extract_all_dataset_features, extract_features_from_zarr, _get_cell_data_from_ds).
Re-exports at the bottom of config.py and utils.py keep all tests/scripts working unchanged (230 passed, 1 skipped, matching the pre-split baseline). dataset.py is updated to import directly from the new homes for cached_archive_metadata_fingerprint and extract_patch.
Two non-mechanical touches required to keep monkey-patch-based tests green:
- baseline_features.extract_features_from_zarr looks up _discover_fov_keys and _extract_all_dataset_features via the config / utils modules at call time, so tests that monkeypatch those symbols on the legacy modules still take effect after the split. _FINGERPRINT_CACHE / _FOV_KEYS_CACHE dicts are re-exported from config.py for the same reason (test_dataset_cache mutates them).
- metrics.LossesAndMetrics.compute defers import of _conf_mat_summary to method-call time to avoid a metrics <-> baseline_features import cycle (baseline_features needs adjust_conf_mat_hierarchy at module load).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

From reviews/2026-05-10-2345/docs.md HIGH findings:
- README: add a "Training" section describing the [train] extra and the four main entry points under scripts/. Move "Download the model" after "Installation" (was non-executable in reading order).
- docs/index.md: add a "Training" section explaining that training-only code lives under deepcell_types.training, gated behind the [train] extra, with pointers to scripts/{train,predict,pretrain,benchmark_gold_standard,ingest_gold_to_zarr}.py. Fix the long-standing "sorce" typo.
- docs/site/tutorial.md: bump the example archive placeholder from tissuenet-v8.zarr → tissuenet-v9.zarr to match DCTConfig's probe order (v9 is the canonical contemporary archive).
The docs.md HIGH for the broken `from utils import download_training_data` import in docs/site/API-key.md was fixed in 88b95f9.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five MEDIUM/HIGH findings from reviews/2026-05-10-2345 in one batch:
- complexity H1: TissueNetConfig.get_marker_positivity() and marker_positivity_labels[] now share a single LazyMarkerPositivityDict. Previously the plain-dict cache populated by get_marker_positivity() was discarded the first time marker_positivity_labels was accessed (the property replaced the field), causing wasted I/O and divergent caches. _marker_positivity_cache is now Optional[LazyMP...] and lazily constructed on first access; get_marker_positivity routes through marker_positivity_labels for a single source of truth.
- numerical M1: MarkerEmbeddingLayer.forward zeros output for padding positions (ch_idx == -1). Without this, F.normalize(proj(0)) yielded a unit-norm direction equal to F.normalize(proj.bias), a non-trivial embedding flowing into the transformer for tokens that should be invisible.
- numerical M2: CellTypeAnnotator.forward zeros spatial features for padding positions BEFORE the fusion concat. Otherwise padding tokens enter self.fusion with [0, spatial_feat] and emerge as W_spatial @ spatial_feat + bias.
- API M1: rename predict(tissue_exclude=...) → predict(tissue_filter=...). The old name was inverted: "tissue_exclude='colon'" actually meant "filter TO colon-associated cell types". The deprecated alias stays (keyword-only) and emits DeprecationWarning; passing both raises TypeError.
- API M3: predict(return_probabilities=True) returns a PredictionResult dataclass with cell_types, probabilities (full per-cell softmax matrix), and cell_indices. Default behaviour unchanged (returns list[str]). PredictionResult and DCTConfig are now hoisted to top-level so `from deepcell_types import PredictionResult, DCTConfig` works.
Tests: 233 passed, 1 skipped. Added 3 new tests covering return_probabilities, tissue_exclude DeprecationWarning, and the both-args TypeError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
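The alias contract in API M1 (old keyword still accepted, warning emitted, both-at-once rejected) can be sketched as below. predict_demo is a hypothetical stand-in that shows only the argument handling; it is not the signature of deepcell_types.predict.

```python
import warnings


def predict_demo(image, *, tissue_filter=None, tissue_exclude=None):
    """Hypothetical stand-in that resolves the deprecated keyword alias."""
    if tissue_exclude is not None:
        if tissue_filter is not None:
            # Supplying both spellings is ambiguous, so refuse outright.
            raise TypeError("pass tissue_filter or tissue_exclude, not both")
        warnings.warn(
            "tissue_exclude is deprecated; use tissue_filter",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        tissue_filter = tissue_exclude
    return tissue_filter


# Old spelling still works but warns; new spelling is silent.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    via_alias = predict_demo(None, tissue_exclude="colon")
    via_new_name = predict_demo(None, tissue_filter="colon")
```

Making both parameters keyword-only (the bare `*`) means a caller cannot reach the alias positionally, so the alias can later be removed without changing argument positions.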

- tests M3: add a regression anchor in test_train_loop_smoke.py that asserts scripts/train.py still contains the AMP scheduler-gate predicate. The 2-line _run_gated_step helper is faithful to the production behavior but a silent drift would otherwise let the emulator tests pass while real training desynchronizes OneCycleLR.
- tests M2: same idea for test_zero_channel_masking.py. The unit-test helper is a verbatim copy of __getitem__'s masking block; a refactor could let the copy drift. New test asserts training/dataset.py still contains _zero_channel_cache and fov_zero_mask.
- docs M4: add CHANGELOG.md documenting the 0.0.1 → 0.1.0 release (canonical-only refactor, training subpackage, breaking removal of CellTypeCLIPModel, deprecated tissue_exclude alias, num_workers=0 default, TissueNetConfig env-var default). Bump version in pyproject.toml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

complexity H8: replace FullImageDataset.indices' positional 8-tuple with a CellIndexRecord NamedTuple. Named fields make grep / refactor safe (no more record[6] / record[5] magic numbers across 10+ call sites). NamedTuple IS a tuple, so positional access still works for backward compat with serialized caches that stored raw 8-tuples. Production call sites in dataset.py now use .ct_label_standard, .dataset_name, .fov_name, .ds_idx, .domain accessors. Mock-index constructors in tests/{test_v2,test_samplers,test_stratified_splits,test_dataset_splits}.py updated to build CellIndexRecord instances.
complexity H7: introduce DataLoaderConfig dataclass + matching create_dataloader_from_config(zarr_dir, dct_config, cfg) wrapper. Lets new callers pass a single discoverable object instead of 20+ keyword arguments. The legacy keyword signature of create_dataloader is preserved verbatim so train.py / predict.py / tests don't need any change. Field defaults mirror create_dataloader's defaults exactly, so DataLoaderConfig() is equivalent to no-override.
Tests: 235 passed, 1 skipped (analysis-only env failure unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
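The backward-compatibility claim for H8 rests on NamedTuple instances being ordinary tuples. A reduced sketch follows; CellRecordDemo uses five of the field names mentioned above, but the field order, the types, and the omission of the remaining three fields are assumptions made for illustration.

```python
from typing import NamedTuple


class CellRecordDemo(NamedTuple):
    # Hypothetical subset and ordering; the real CellIndexRecord has 8 fields.
    ds_idx: int
    dataset_name: str
    fov_name: str
    ct_label_standard: str
    domain: int


# A raw tuple as an older serialized cache might have stored it.
legacy = (3, "dataset_a", "fov_007", "T cell", 1)
record = CellRecordDemo(*legacy)

by_name = record.fov_name   # named access for new call sites
by_position = record[2]     # positional access keeps working
```

Since the record compares equal to the raw tuple and supports indexing, code that still treats cached entries as plain tuples does not need to change.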

Collapse training pipeline into deepcell-types (canonical-only)

…ne submodule rebase
Three independent bugs surfaced when running training against the current master HEAD from a fresh workspace install:
1. tissue_idx kwarg mismatch (scripts/train.py:121, scripts/predict.py:208 + 334): scripts pass `tissue_idx=batch_data.tissue_idx` to `CellTypeAnnotator.forward(...)`, but the model's forward signature is `(sample, spatial_context, ch_idx, padding_mask, ct_exclude=None, return_attn_weights=False, domain_idx=None)`, with no `tissue_idx`. The tissue-FiLM MP head experiment was rolled back (see memory `v10_mp_expansion_tissue_negative.md`) and the model dropped the parameter, but the scripts kept passing it. Result: every training / prediction run dies at the first forward pass with `TypeError: ...got an unexpected keyword argument 'tissue_idx'`. Fix: drop the kwarg at all three call sites. `batch_data.tissue_idx` is still populated by the dataloader and remains available to anyone who needs it downstream; the model just doesn't consume it.
2. Circular import between training/utils.py and training/baseline_features.py: utils.py re-exports four symbols from baseline_features.py at module level for backward compat. baseline_features.py also imports private helpers (`_atomic_np_savez` etc.) from utils.py. When utils.py is imported first (training path) the cycle resolves fine, but when baseline_features.py is imported first (baseline path, e.g. `import xgb.run`), the partially-initialized utils.py reaches back to `baseline_features._extract_all_dataset_features` before that name is defined, and ImportError fires. Fix: convert the re-exports to a module-level `__getattr__` so the lookup is deferred until actual access, by which point both modules have finished initializing. Existing callers (`from deepcell_types.training.utils import save_baseline_predictions`, verified in tests/test_v2.py) keep working.
3. Submodule rebase (baselines/{maps,cellsighter,xgboost,nimbus}): each baseline's pyproject.toml listed `deepcelltypes @ git+... deepcelltypes-cell-type-assignment-pytorch.git` as a dep; that URL now resolves to the renamed research workspace (no longer a Python package) and `uv pip install` fails with a metadata-name mismatch. Each baseline also imported from `deepcelltypes.{config,utils,dataset}`, the pre-refactor flat layout. Companion commits on each submodule's `fix/post-refactor-imports` branch replace the dep URL with a plain `deepcell-types` and rebase imports onto `deepcell_types.training.{config,utils,dataset,metrics,baseline_features}`. This parent commit bumps the submodule pointers to those branch tips.
End-to-end verification: with the three fixes, a fresh workspace `uv sync` + smoke training (`scripts/train.py` with the v10 split + svd_512_v6 embeddings) gets through model build, GPU allocation, and reaches batch 0 of epoch 0. The xgboost baseline imports cleanly after `uv pip install -e baselines/xgboost`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
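The module-level `__getattr__` fix in item 2 relies on PEP 562. The sketch below simulates it in a single file with two in-memory modules; provider_demo, legacy_demo, and _LAZY_EXPORTS are hypothetical names, and in a real package the hook would simply be written as `def __getattr__(name):` at the top level of utils.py.

```python
import importlib
import sys
import types

# Stand-in for baseline_features.py: the module that really defines the symbol.
provider = types.ModuleType("provider_demo")
provider.save_baseline_predictions = lambda: "saved"
sys.modules["provider_demo"] = provider

# Stand-in for utils.py: it re-exports the symbol lazily instead of importing
# it at module load, so neither module needs the other to be fully initialized.
legacy = types.ModuleType("legacy_demo")
_LAZY_EXPORTS = {"save_baseline_predictions": "provider_demo"}


def _lazy_getattr(name):
    # PEP 562: called only when normal attribute lookup on the module fails.
    if name in _LAZY_EXPORTS:
        return getattr(importlib.import_module(_LAZY_EXPORTS[name]), name)
    raise AttributeError(f"module 'legacy_demo' has no attribute {name!r}")


legacy.__getattr__ = _lazy_getattr
sys.modules["legacy_demo"] = legacy

# Existing import statements keep working; the lookup happens here, on access.
from legacy_demo import save_baseline_predictions
```

The re-exported name never lands in the legacy module's namespace at import time, which is why the import order of the two modules stops mattering.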

…port fix(train,predict,utils): tissue_idx kwarg + circular import + baseline submodule rebase

uv.lock is regenerated on every branch switch when pyproject.toml shapes differ (master vs training), so keeping it tracked produces constant churn. /reviews/ holds local /deep-review outputs.

Untouched since Oct 2024 and broken since the kit was inlined in 0a8108e: it COPYs a non-existent top-level requirements.txt and pip-installs the deleted deepcelltypes-kit/ directory. No CI, docs, or scripts reference it.

…us v0.0.5
model.py: replace channel_feat[padding_mask] = 0.0 (in-place under AMP autocast on a tensor in the backward graph) with an out-of-place masked_fill on padding_mask.unsqueeze(-1). Eliminates the latent "a leaf Variable that requires grad has been used in an in-place operation" risk and the gradient corruption it would cause on padding rows.
metrics.py: make MPMetricsTracker.compute symmetric across mp_macro_f1 and mp_macro_accuracy w.r.t. vacuous markers (n_pos_gt == 0 and n_pos_pred == 0). f1s already appended np.nan + used nanmean; accuracies appended a real value + used mean, asymmetrically inflating macro accuracy. Now both go through nanmean with np.nan sentinels, so the two headline MP numbers come from the same denominator.
scripts/benchmark_gold_standard.py: support nimbus-inference v0.0.5 + the actual Pan-Multiplex gold-standard directory layout. The script previously called Nimbus.prepare_normalization_dict (removed in v0.0.5) and assumed per-subset labels/ + raw/ dirs; the real layout has <subset>/fovs/ plus a single central gold_standard_groundtruth.csv. Now: prepare_normalization_dict is invoked on MultiplexDataset (its v0.0.5 home), discover_gold_standard_subsets accepts both layouts and pivots the central CSV per-FOV when needed, the segmentation naming convention probes both Pan-Multiplex (<fov>.ome.tif) and legacy DeepCell (<fov>_whole_cell.tiff) names, and the image suffix is autodetected.
Smoke against /data/xwang3/nimbus_gold_standard/gold_standard_labelled completes end-to-end (macro F1 0.7400, micro 0.8382, 56 markers, 939K cell-marker pairs). Pytest baseline unchanged: 255 passed / 1 skipped / 1 failed (the known post-PR-#62 analysis.validate_mp_refinement path drift).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
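The metrics.py change can be illustrated without NumPy. In this sketch, macro_scores and the per-marker tuples are hypothetical; it only demonstrates the idea of giving vacuous markers a NaN sentinel in both lists so that macro F1 and macro accuracy average over the same markers.

```python
import math


def nanmean(values):
    """Mean over the non-NaN entries; NaN if there are none."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def macro_scores(per_marker):
    """per_marker: list of (f1, accuracy, n_pos_gt, n_pos_pred) tuples."""
    f1s, accs = [], []
    for f1, acc, n_pos_gt, n_pos_pred in per_marker:
        vacuous = n_pos_gt == 0 and n_pos_pred == 0
        # Vacuous markers get a NaN sentinel in BOTH lists, so the two macro
        # numbers are averaged over the same set of markers.
        f1s.append(float("nan") if vacuous else f1)
        accs.append(float("nan") if vacuous else acc)
    return nanmean(f1s), nanmean(accs)


# Third marker is vacuous: no positives in ground truth or predictions.
markers = [(0.8, 0.9, 40, 38), (0.6, 0.7, 25, 30), (0.0, 1.0, 0, 0)]
macro_f1, macro_acc = macro_scores(markers)

# What a plain mean over accuracies (the old, asymmetric path) would report.
old_style_acc = sum(acc for _, acc, _, _ in markers) / len(markers)
```

With these made-up numbers the vacuous marker's trivially perfect accuracy lifts the plain mean above the NaN-aware one, which is the inflation the commit describes.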

After PR #62 split the monorepo, `analysis/` lives in the research workspace and is no longer importable from this repo's pytest session. The stage7 synthetic-gold-validation test imports `analysis.validate_mp_refinement` and fails collection with `ModuleNotFoundError: No module named 'analysis'` unless the workspace is on PYTHONPATH. Guard the import with `pytest.importorskip(...)` so the suite reports skipped instead of failed in the default sibling-repo-only invocation. Bumps the sibling pytest baseline to 255 passed / 2 skipped / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… --zarr_dir defaults
train.py / predict.py / pretrain.py used to default --zarr_dir to DATA_DIR / "tissuenet-caitlin-labels.zarr", forcing users to set DATA_DIR to the wrapper directory and have the scripts append the inner archive name. Switch to default --zarr_dir = DATA_DIR so the env var holds the actual archive root directly; this matches both how TissueNetConfig(zarr_path=...) is invoked elsewhere and how the baseline runners take their --zarr_dir.
The three baseline submodules (xgboost, cellsighter, maps) make the same change on their --zarr_dir defaults; pointers are bumped here. The cellsighter submodule also includes a smoke-safety fix (best_macro_acc=-inf so the first val pass always saves a checkpoint even when macro_accuracy is exactly 0.0); the xgboost submodule includes a label-tightening fix for tiny subset smokes where GroupShuffleSplit can leave compact labels with zero examples in inner_train (rejected by modern xgboost.sklearn.XGBClassifier).
Smoke verification on the v10 7-dataset subset (post-tier-3-repair archive): all 3 baselines + main model now complete end-to-end:
- main train.py (cuda:0) train_macro_acc=0.0268, best ckpt saved
- xgb baseline (CPU) macro=0.2209, CSV + model.json saved
- cellsighter baseline (cuda:1) macro=0.0131, CSV + .pth saved
- maps baseline (cuda:2) val_loss=5.96, CSV + .pth + stats saved
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CellTypeAnnotator.forward zeroed padding rows in two places. The first (`channel_feat[padding_mask] = 0.0`) was switched to an out-of-place `masked_fill` in 782c611 to avoid an in-place write on a tensor that's in the AMP autocast backward graph. The second site (`spatial_expanded[padding_mask] = 0.0`) was left as an in-place write, guarded by a defensive `.clone()` on the preceding `expand()` view. That guard is correct today, but the asymmetry is a trap: anyone who removes the `.clone()` thinking it's redundant will silently reintroduce the same AMP-graph hazard the earlier fix addressed. Switching to the same masked_fill pattern removes the trap and drops the now-unneeded clone — masked_fill materializes the expand() view into a fresh tensor. Pytest unchanged: 255 passed / 2 skipped / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop LoRA from MarkerEmbeddingLayer (and CLI flags / auto-detect): the trainable projection makes a LoRA adapter mathematically redundant (proj.W + lora_B @ lora_A collapses into proj_eff.W). Confirmed on v8 that LoRA-8 ties exactly with no-LoRA.
- Add --mean_intensity_mode {none|cls_residual|per_channel|both} to CellTypeAnnotator: scatter per-cell mean marker intensity into a global marker-position vector and inject as a CLS residual and/or per-channel feature. Zero-init the output projection so warm-start from a baseline ckpt preserves predictions at step 0.
- Add --freeze_backbone to train.py: requires_grad=False on everything except intensity_cls_branch/intensity_per_channel_proj. Use with --pretrained_path to train a cheap mean-intensity adapter on top of an existing ckpt.
- Final-eval val-cap automation: when --max_val_samples is set (cheap per-epoch val), final eval is rebuilt with no cap so the headline test number is apples-to-apples vs baselines (which never cap their val).
- Auto-detect mean_intensity_mode from ckpt keys in predict.py, benchmark_gold_standard.py, and deepcell_types.predict.
- Make pretrained loading tolerate numpy-scalar metadata (torch.load weights_only=False for pretrained_path).
- Add scripts/fold_lora_into_proj.py utility to fold legacy LoRA weights into proj.weight so old ckpts load against the LoRA-free model.
- Change canonical defaults: --resnet_channels 48, --domain_weight 0.1, --mean_intensity_mode cls_residual, --best_metric macro_f1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
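Why a LoRA adapter on a trainable projection is redundant, and what folding it means, comes down to one identity: (W + B A) x = W x + B (A x). The sketch below checks it on tiny matrices in plain Python. fold_lora and the matrix values are illustrative, and no LoRA scaling factor is assumed because the commit message mentions none; the actual fold_lora_into_proj.py script may differ.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fold_lora(weight, lora_b, lora_a):
    """Return W_eff = W + B @ A (no scaling factor is assumed here)."""
    return add(weight, matmul(lora_b, lora_a))


W = [[1.0, 2.0], [3.0, 4.0]]   # 2x2 projection weight
B = [[1.0], [2.0]]             # 2x1 adapter factor (rank 1)
A = [[0.5, -1.0]]              # 1x2 adapter factor
x = [[2.0], [3.0]]             # column-vector input

W_eff = fold_lora(W, B, A)
separate = add(matmul(W, x), matmul(B, matmul(A, x)))   # W x + B (A x)
folded = matmul(W_eff, x)                               # (W + B A) x
```

After folding, a checkpoint needs only the single effective weight, so it can be loaded by a model that no longer has adapter parameters.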

* feat(baselines): NaN missing-value support + XGBoost submodule bump
  Adds a ``missing_value: float = 0.0`` knob to ``extract_features_from_zarr`` so each baseline can pick its preferred sentinel for absent markers:
  - MAPS / CellSighter continue to receive 0.0 (default; their MLP/CNN layers don't tolerate NaN).
  - XGBoost now receives NaN (per submodule update), routing absent markers through XGBoost's ``missing=NaN`` learned per-split default direction rather than conflating them with "marker present, mean intensity 0.0".
  Implementation:
  - ``_extract_all_dataset_features`` records a per-dataset ``present_markers`` bool mask alongside features/labels/cell_sizes.
  - ``extract_features_from_zarr`` accumulates per-split block metadata (``{split}_block_sizes``, ``{split}_block_absent``) and applies ``_apply_missing_value`` post-extraction (and post-cache-load). The cache stays missing-value-agnostic: the same .npz/pickle serves a MAPS run and an XGBoost run.
  - Cache version bumped 5 -> 6 so legacy caches without ``present_markers`` are rebuilt automatically.
  Submodule bump (baselines/xgboost):
  - fix(tuning): carve FOV-grouped inner-val for early stopping instead of leaking the test set into best_iteration.
  - feat(missing): pass missing_value=np.nan from run.py and tuning.py.
  Tests:
  - tests/test_baseline_feature_splits.py: 5 passed.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(submodule): bump baselines/xgboost to include FOV-grouped Optuna fix
  dbca43c switches the Optuna inner-val split from cell-level StratifiedShuffleSplit to FOV-grouped GroupShuffleSplit so hyperparameter selection sees the same FOV-generalisation gap as the reported test set (and drops the singleton duplication workaround). See xuefei-wang/deepcelltypes-xgboost#3 for the full change.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(submodule): bump baselines/xgboost for train_best_model retighten
  57b4997 adds label-space re-tightening in train_best_model to handle the case where GroupShuffleSplit on small splits leaves some classes absent from inner-train (mirrors run.py:178-204). At full scale this is a no-op; on small splits it allows the tuned XGBoost path to complete without an XGBClassifier label-space rejection. Surfaced during smoke testing on a 4-FOV split.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(submodule): re-pin baselines/xgboost to the merged main HEAD
  xuefei-wang/deepcelltypes-xgboost#3 merged as 6cda78d (squash). Re-pin to the merged HEAD on main so the pointer tracks an actual branch tip rather than the pre-squash branch commit.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

xuefei-wang/deepcelltypes-xgboost#4 widens the tuning.py --metric click.Choice to also accept macro_f1 / weighted_f1, matching the research workspace's headline metric. Bumps the submodule pointer to the merged commit (470a74d). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#6) Pulls xuefei-wang/deepcelltypes-xgboost#5 — each saved XGB ckpt now writes a sidecar <model>.remap.json with the post-GSS → ct2idx mapping, so out-of-band evaluators don't need to replay the GroupShuffleSplit to recover the label-space. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pulls xuefei-wang/deepcelltypes-maps#3 — MAPS output head now covers all 51 archive ct2idx classes instead of just classes seen in train, removing the 5–10 pp macro-F1 artifact from classes with zero train support. Existing v10 ckpts unaffected (eval-side reads n_out from the ckpt). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…backbone (#8) The existing --freeze_backbone freezes every parameter except the mean-intensity branches (intensity_cls_branch / intensity_per_channel_proj). The CT classifier head, CLS token, and final norm — all CT-task layers — stay frozen. That's a tight definition of "adapter only" but it leaves a known limitation: the CT head can't adapt to whatever the mean-intensity branch adds to the CLS embedding, so the model's improvement is capped by the pretrained head's biases. Add --unfreeze_ct_head (default off). With this flag set alongside --freeze_backbone, the freeze policy additionally re-enables: - ct_head (the CT classifier MLP, ~105K params) - final_norm (LayerNorm before heads, ~512 params) - cls_token (the trainable CLS embedding parameter) The heavy backbone (transformer 3.2M, per-channel encoder 130K, marker embedder LoRA 175K, spatial encoder 57K) stays frozen as before. Use case: train Frozen-CLS variants where you want the new mean-intensity side-input AND the CT head to co-adapt without unleashing the full transformer. Brings the trainable share from ~3% to ~6% of total parameters. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both `scripts/predict.py --ct_abstention_k` and the module-form `deepcell_types.predict()` now default to k=0.5, the v10 published headline operating point (≈9% of cells abstained, +5pp macro_F1 on kept cells; clears every baseline including XGBoost-tuned on the held-out test split).
Module-form predict() additions:
- `ct_abstention_k=0.5` parameter, k<=0 / None disables.
- Per-FOV IQR fence Q1 - k*IQR on max-softmax (the whole FOV is a single tissue×modality group). compute_iqr_fence already guards n<4.
- Abstained cells get the sentinel label "Unknown" in `cell_types`. Original argmax preserved in PredictionResult.cell_types_raw, with a boolean PredictionResult.abstained mask alongside.
scripts/predict.py:
- --ct_abstention_k default flipped from None to 0.5. Set 0 or a negative value to disable. Help text updated to point at docs/reports/ct_iqr_abstention_test.md instead of the older audit doc.
- Guard tightened to `k > 0` so the disable contract is explicit.
Tests:
- Replaced test_default_no_abstention_column with two new tests: test_default_k_0_5_abstention_is_on (≈10% abstained on synthetic frame) and test_disable_abstention_with_nonpositive_k (k<=0 / None as no-op).
- All 24 existing canonical-inference + abstention tests pass; the 1-cell test_predict_* cases trip compute_iqr_fence's n<4 guard, so no abstention fires and assertions hold.
Backwards-compat note: callers that don't pass `ct_abstention_k=` will now see "Unknown" labels appear for low-confidence cells. To restore the pre-change behaviour, pass `ct_abstention_k=0` (or None) at the call site, or `--ct_abstention_k 0` on the CLI.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…,score) (#10) benchmark_gold_standard.py evaluates marker-positivity at a single hardcoded threshold (default 0.5). Per-marker threshold tuning needs the raw per-cell scores, which the script wasn't persisting. When the DCT_GOLD_PREDS_CSV env var is set, after the inference pass the script writes a flat CSV of (fov, cell_id, channel, pred_score) for every prediction the model produced. Downstream callers can then apply oracle CV per-marker τ (or any other threshold-tuning protocol) without re-running the model — see analysis/rescore_gold_oracle_cv.py in the research workspace, which consumes this CSV to produce the final_*_gold_metrics_learned.json adaptive-τ tables. No behaviour change when the env var is unset (no file written). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
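The opt-in dump described above follows a common pattern: do nothing unless an environment variable names an output path. A minimal sketch, assuming the four columns listed in the commit message; maybe_write_scores is a hypothetical helper and not the code in benchmark_gold_standard.py.

```python
import csv
import os


def maybe_write_scores(rows, env_var="DCT_GOLD_PREDS_CSV"):
    """Write (fov, cell_id, channel, pred_score) rows if the env var is set."""
    path = os.environ.get(env_var)
    if not path:
        return None  # unset: no file is written, behaviour is unchanged
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["fov", "cell_id", "channel", "pred_score"])
        writer.writerows(rows)
    return path
```

Keeping the raw scores in a flat file lets a later step sweep thresholds per marker without another inference pass.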

Module docstring previously documented `k=None (off): default`. The production default in both `scripts/predict.py:81` and `deepcell_types/predict.py:242` is `k=0.5`. Updated the Pareto-sweep note to reflect the current operating point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per the data-provenance audit, the --min_channels=3 filter is vacuous on the labeled v10 corpus — the 622 archive FOVs excluded from v10 are unlabeled (no standardized_source annotations), not channel-filtered. The filter logic in dataset.py / baseline_features.py is retained and gated on `min_channels > 0`, so callers who pass it explicitly still get the behavior; only the default changes from 3 → 0 across the 4 CLI scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ntime

After dropping the --min_channels default 3 → 0 (the filter is a no-op on the labeled v10 corpus), load_fov_splits was strict-failing on the recorded vs runtime min_channels metadata. Add min_channels to _ADVISORY_SPLIT_METADATA_KEYS so the mismatch logs a warning instead of raising. Restores load-compatibility with all existing fov_split_v10*.json files (which carry min_channels=3 in their metadata).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
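The strict-versus-advisory distinction can be sketched as below. This is an illustration only: `check_split_metadata` and `ADVISORY_KEYS` are hypothetical names standing in for whatever `load_fov_splits` and `_ADVISORY_SPLIT_METADATA_KEYS` actually do, and the choice of `ValueError` for strict keys is an assumption.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for _ADVISORY_SPLIT_METADATA_KEYS.
ADVISORY_KEYS = frozenset({"min_channels"})


def check_split_metadata(recorded, runtime, advisory_keys=ADVISORY_KEYS):
    """Warn on advisory mismatches, raise on strict ones; return warned keys."""
    warned = []
    for key, expected in recorded.items():
        if key not in runtime or runtime[key] == expected:
            continue
        msg = (
            f"split metadata mismatch for {key!r}: "
            f"recorded {expected!r}, runtime {runtime[key]!r}"
        )
        if key in advisory_keys:
            logger.warning(msg)  # advisory: tolerate and keep loading
            warned.append(key)
        else:
            raise ValueError(msg)  # strict: refuse to load
    return warned
```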

…e #79) (#11)

* docs(abstention): fix stale docstring: k=0.5 is the default

  Module docstring previously documented `k=None (off): default`. The production default in both `scripts/predict.py:81` and `deepcell_types/predict.py:242` is `k=0.5`. Updated the Pareto-sweep note to reflect the current operating point.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* scripts: drop --min_channels default 3 → 0 (filter is a no-op on v10)

  Per the data-provenance audit, the --min_channels=3 filter is vacuous on the labeled v10 corpus: the 622 archive FOVs excluded from v10 are unlabeled (no standardized_source annotations), not channel-filtered. The filter logic in dataset.py / baseline_features.py is retained and gated on `min_channels > 0`, so callers who pass it explicitly still get the behavior; only the default changes from 3 → 0 across the 4 CLI scripts.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(splits): tolerate min_channels mismatch between split file and runtime

  After dropping the --min_channels default 3 → 0 (the filter is a no-op on the labeled v10 corpus), load_fov_splits was strict-failing on the recorded vs runtime min_channels metadata. Add min_channels to _ADVISORY_SPLIT_METADATA_KEYS so the mismatch logs a warning instead of raising. Restores load-compatibility with all existing fov_split_v10*.json files (which carry min_channels=3 in their metadata).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(predict): FOV-grouped train sampler for --learn_mp_thresholds (#79)

  Root cause: when --learn_mp_thresholds is on, predict.py builds the train loader with use_weighted_sampler=False, which falls back to shuffle=True. Each random batch of 256 hits ~256 unique FOVs, and FullImageDataset's _get_zarr_arrays runs `raw_np = raw_zarr[:]` on the first hit per worker (populating _zero_channel_cache) even when the FOV exceeds the per-worker numpy budget, which is a full ~1 GB cold zarr load per FOV. With 8 spawn workers × prefetch=4 × random FOVs, batch 0 waits on terabytes of cold zarr reads and effectively never arrives. Training avoids this entirely because FOVGroupedSampler keeps each worker on one FOV at a time.

  Fix: add SequentialFOVGroupedSampler, a uniform-coverage counterpart to FOVGroupedSampler with the same cache-locality guarantee, and a fov_grouped_train flag on create_dataloader to enable it. predict.py passes the flag when --learn_mp_thresholds is set, so the original issue-#79 invocation now runs to completion.

  Smoke (8 workers, spawn, v10 test split): first 5 batches in 5 min (cold-load), subsequent batches stream from per-worker numpy cache.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
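The cache-locality argument can be made concrete with a small pure-Python sketch. It is not the repository's `SequentialFOVGroupedSampler`; `fov_grouped_order` is a hypothetical helper that only shows the idea of visiting every sample once while staying on one FOV at a time, so each batch touches far fewer FOVs than a shuffled order would.

```python
from collections import defaultdict


def fov_grouped_order(fov_ids):
    """Yield every sample index exactly once, one FOV at a time."""
    by_fov = defaultdict(list)
    for idx, fov in enumerate(fov_ids):
        by_fov[fov].append(idx)
    # dicts preserve insertion order, so FOVs come out in first-seen order
    for indices in by_fov.values():
        yield from indices


def fovs_touched_per_batch(order, fov_ids, batch_size):
    """Count distinct FOVs per batch (a proxy for cold zarr loads)."""
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    return [len({fov_ids[i] for i in batch}) for batch in batches]
```

With interleaved FOV ids, the ungrouped order touches every FOV in each batch, while the grouped order touches one.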

…ty tissue buckets

Three cleanups from the 2026-05-19 triple-review low-priority list:

(1) deepcell_types/model.py constructor defaults aligned with CLI:
- resnet_base_channels: 32 → 48 (canonical paper recipe, matches --resnet_channels default in scripts/train.py + predict.py)
- mean_intensity_mode: "none" → "cls_residual" (canonical paper recipe, matches --mean_intensity_mode default in scripts/train.py)

Direct callers like `CellTypeAnnotator(...)` without explicit kwargs were previously silently building the pre-v10 model variant.

(2) deepcell_types/training/config.py:_compute_all_mappings: Drop tissues whose tissue_celltype_mapping ends up with an empty allowed-CT set. Previously 4 tissues (esophagus, immune, musculoskeletal, colon) had keys created on first sighting but never populated, since their only FOVs lacked standardized_source annotations. Empty sets are a bug attractor under --apply_tissue_mask (the mask becomes all-Inf → NaN softmax). Now the empty entries are filtered out and --apply_tissue_mask just skips unmapped tissues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
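Why an empty allowed-cell-type set is dangerous can be shown with a short NumPy sketch. The helpers are hypothetical and the additive `-inf` masking is an assumption about how the tissue mask is applied; the point is only that an all-`-inf` row has no finite maximum, so the softmax becomes NaN.

```python
import numpy as np


def masked_softmax(logits, allowed_idx):
    """Softmax restricted to allowed classes via an additive -inf mask."""
    mask = np.full(logits.shape, -np.inf)
    mask[np.array(sorted(allowed_idx), dtype=int)] = 0.0
    z = logits + mask
    with np.errstate(invalid="ignore"):
        e = np.exp(z - z.max())  # all -inf: (-inf) - (-inf) = NaN
        return e / e.sum()


def drop_empty_tissues(tissue_to_allowed):
    """Filter out tissues whose allowed set was never populated."""
    return {t: s for t, s in tissue_to_allowed.items() if s}
```

Filtering the empty entries up front means the masked path is simply never taken for those tissues.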

Pinned commit a21b97f was the head of merged PR #2; main tip b5447d1 is the merge commit. Same effective content, but the pointer now matches main rather than a non-tip commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rebased onto current master (atop the rebased #52).

MAPS adapter:
- epoch schedule --max_epochs 500 / --min_epochs 250 / --patience 100 with early stopping on a FOV-grouped inner-validation loss (reported test set never feeds selection);
- DCT-safe normalization default (train-set z-score then /255), with a /255-only ablation via --no_znorm; reproducibility metadata recorded;
- stale normalization-default comment corrected to match the code.

Final tree taken from the pre-rebase tip (81ac2ad); the original branch's merge-commit resolution (DCT-safe README wording) is preserved here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- model.py: thread ct_head_width / ct_head_depth through the annotator so the residual-MLP head size is no longer hardcoded at 512/4.
- predict.py: _infer_ct_head_params() derives head arch/width/depth from a checkpoint state_dict when the config omits them, so config-less resMLP checkpoints load; master's ct_out_key / vocab-guard logic is preserved.
- retrain_head.py: record ct_head_width/depth + stage-2 provenance in the deployable checkpoint config.
- scripts/predict.py: build the model through the inferred head params.

Salvaged from PR #55 / stale #41. The released v0.1.0 checkpoint is a legacy MLP head and still loads unchanged.

Note: scripts/predict.py carries a near-duplicate of _infer_ct_head_params (follow-up: dedupe into deepcell_types/predict.py).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review of #56:
- scripts/predict.py carried a near-duplicate of _infer_ct_head_params; import the canonical helper from deepcell_types.predict instead (single source of truth, no drift).
- Wrap the call so a self-inconsistent checkpoint (config says resmlp but the state_dict lacks ct_head.inp.0.weight) raises a clear ValueError instead of a bare KeyError, matching the deepcell_types.predict path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
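A simplified sketch of config-less head inference and the KeyError-to-ValueError wrapping follows. It is not the repository's `_infer_ct_head_params`: the function name, the `ct_head_arch` config key, and the assumption that the first dimension of `ct_head.inp.0.weight` equals the head width are all illustrative guesses, and depth inference is omitted because the message does not describe the key layout it relies on.

```python
RESMLP_INPUT_KEY = "ct_head.inp.0.weight"  # key named in the commit message


def infer_head_width_sketch(state_dict, config=None):
    """Derive head arch and width from a state_dict when config omits them."""
    config = config or {}
    arch = config.get("ct_head_arch")  # hypothetical config key
    if arch is None:
        arch = "resmlp" if RESMLP_INPUT_KEY in state_dict else "mlp"
    if arch != "resmlp":
        return {"arch": arch, "width": None}  # legacy MLP head: nothing to infer
    try:
        weight = state_dict[RESMLP_INPUT_KEY]
    except KeyError as exc:
        # Self-inconsistent checkpoint: config says resmlp, weights disagree.
        raise ValueError(
            f"config declares a resmlp head but the state_dict lacks {RESMLP_INPUT_KEY!r}"
        ) from exc
    # Linear weights are (out_features, in_features); assume out_features == width.
    width = config.get("ct_head_width", int(weight.shape[0]))
    return {"arch": "resmlp", "width": width}
```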

docs(baselines): correct stale faithfulness claims in baseline READMEs

…ning feat(maps): DCT-safe training schedule and normalization controls

feat(model): configurable resMLP head + config-less head-shape inference (salvage of #55)

Release-readiness fixes to the user-facing predict() path:
- Abstention is now opt-in: ct_abstention_k defaults to None (raw argmax) instead of 0.2. The old default silently relabelled low-confidence cells to "Unknown" in the plain list[str] return at a benchmark-tuned operating point; k=0.2 still reproduces the paper operating point.
- Mask all-zero channels in PatchDataset, matching the training dataloader (which attention-masks them). A listed marker that is all-zero on a FOV was previously fed as a present zero token with a real marker embedding, an input the model was trained never to see. Dropping it is equivalent to training's attention mask (padding channels are inert for the CLS).
- Reject non-finite (NaN/inf) raw up front instead of silently labelling every cell as class 0 via a poisoned softmax.
- Size patch tensors to the real channel count instead of MAX_NUM_CHANNELS; padding tokens are provably inert, so this is numerically identical while avoiding the per-channel ResNet + quadratic-transformer work over padding.

Tests: abstention-opt-in default, non-finite rejection, all-zero channel masking, padding numerical-inertness, and a config/preprocessing MPP+percentile parity assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
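The non-finite guard is easy to motivate with NumPy, whose `argmax` returns index 0 for an all-NaN vector, so a poisoned softmax labels every cell as class 0 without any error. The sketch below uses a hypothetical `require_finite` helper; the actual check and error message in `predict()` may differ.

```python
import numpy as np


def require_finite(raw):
    """Fail fast on NaN/inf input instead of predicting from poisoned values."""
    raw = np.asarray(raw, dtype=float)
    if not np.isfinite(raw).all():
        raise ValueError("raw contains NaN or inf; refusing to predict")
    return raw


# Without a guard, NaN logits collapse silently onto class 0:
poisoned_logits = np.full(5, np.nan)
silent_label = int(np.argmax(poisoned_logits))
```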

- README: the quickstart downloaded the latest checkpoint but then called predict(model_name="deepcell-types_2026-05-17"), which resolves to a file download_model() never wrote -> FileNotFoundError on copy-paste. Capture the path download_model() returns and pass it straight to predict(). Also add the DEEPCELL_ACCESS_TOKEN prerequisite to the model-download section (previously only documented elsewhere).
- docs/index.md "Recognized channels" limitation said the registry comes from the zarr archive; with 0.1.0 it ships in the packaged vocab.json by default and the archive is an override. Corrected to match.
- CHANGELOG: abstention entry now describes the opt-in (default None) behavior; added entries for all-zero-channel masking, the non-finite input guard, and the real-channel-width tensor sizing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Close the highest-value test gaps the review flagged:
- test_auth.py (new): the download/integrity/extraction layer (utils/_auth.py) was entirely untested. Cover the md5/sha256 hash dispatch + bad-length error, extract_archive's zip-slip / tar-traversal / tar-symlink rejection and the benign-archive happy path, fetch_data's cache-hit and missing-token branches (no network), and the model-registry digest shapes.
- compat_marker0_zero now has a behavioral test asserting it zeros marker-0's mean-intensity column (the released-checkpoint parity contract), via a hook on the intensity CLS branch. The branch's final layer is zero-initialized, so the contract must be checked on the branch input, not the fresh model's ct_logits.
- An end-to-end numeric regression pin: a fixed-seed checkpoint on a fixed FOV must reproduce a golden softmax fingerprint and be deterministic across calls, so preprocessing/forward drift fails instead of shipping green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
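For readers unfamiliar with the two behaviours the new test_auth.py covers, here is a hedged standard-library sketch. Neither function is the code in utils/_auth.py; dispatching on hex-digest length (32 for md5, 64 for sha256) and the `commonpath` containment check are assumptions about one reasonable way to implement hash dispatch and zip-slip rejection.

```python
import hashlib
import os


def file_digest_matches(path, expected_hex):
    """Pick md5 or sha256 from the digest length, then compare file hashes."""
    algos = {32: hashlib.md5, 64: hashlib.sha256}
    try:
        hasher = algos[len(expected_hex)]()
    except KeyError:
        raise ValueError(
            f"unsupported digest length {len(expected_hex)}; expected 32 or 64"
        ) from None
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_hex.lower()


def is_safe_member(dest_dir, member_name):
    """Reject archive members that would resolve outside dest_dir (zip-slip)."""
    dest = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest, member_name))
    return os.path.commonpath([dest, target]) == dest
```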

scripts/evaluate_on_test.sh required embeddings/svd_512.npz (absent from the repo, regenerable only via OpenAI + the multi-GB archive), so the headline test-split evaluation was not runnable out of the box, even though load_state_dict overwrites the marker embeddings with the checkpoint's, making the SVD file's values unused (only its shape matters).

- scripts/predict.py: load the checkpoint before the marker embeddings and add _resolve_marker_embeddings(), which builds a correctly-shaped zeros placeholder from the checkpoint when --svd_embeddings_path is omitted.
- evaluate_on_test.sh: the SVD path is now optional (passed only when set), and the default MODEL_CKPT points at the download_model() cache (~/.deepcell/models) instead of a local dct-final-ckpt path.
- test_scripts_predict.py: unit-test the placeholder / delegate / error paths (pure, no archive or network needed).

Not run end-to-end here (needs the registration-gated archive in DATA_DIR); a single confirmatory real-archive run is recommended before trusting the numbers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
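The placeholder / delegate / error paths can be sketched as follows. This is not the repository's `_resolve_marker_embeddings()`: the function name, the `marker_embeddings` state_dict key, and the injected `load` callable are hypothetical, chosen only to show why a zeros array of the right shape suffices when the checkpoint's values overwrite it later.

```python
import numpy as np


def resolve_marker_embeddings_sketch(state_dict, svd_path=None,
                                     key="marker_embeddings", load=None):
    """Return embeddings from svd_path, or a shape-matched zeros placeholder."""
    if svd_path is not None:
        if load is None:
            raise ValueError("a loader is required when svd_path is given")
        return load(svd_path)  # delegate path: caller supplies the reader
    if key not in state_dict:
        raise ValueError(f"checkpoint has no {key!r} entry to size a placeholder from")
    # Placeholder path: values are irrelevant because the checkpoint's
    # embeddings overwrite them on load; only the shape has to match.
    return np.zeros(np.shape(state_dict[key]), dtype=np.float32)
```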

deepcell_types.baselines.maps.run does `import click` at module top, so tests/baselines/test_maps_normalization.py (which imports normalize_features from it) raised a ModuleNotFoundError collection error — not a clean skip — on an inference-only install with no [train]/baseline-maps extra. This turned the inference-only CI job red (pre-existing on master). The baselines conftest's hand-maintained collect_ignore list was missing this entry; add a click gate matching the other baseline-test gates. Verified by simulating an inference-only collection (extra-only packages hidden via a meta-path blocker): `pytest --collect-only tests` now exits 0 with the maps-normalization module excluded and no remaining collection errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
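A conftest gate of the kind described might look like the sketch below. `collect_ignore` is pytest's standard conftest hook for excluding paths from collection; the `gated_ignores` helper and the gate table are hypothetical and do not reproduce the repository's actual conftest.

```python
import importlib.util


def _installed(module_name):
    """True when a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def gated_ignores(gates, have=_installed):
    """Return test files whose required optional dependency is missing."""
    return [path for path, module in gates.items() if not have(module)]


# In a conftest.py, pytest skips collecting any path listed here
# (paths are relative to the conftest's own directory).
collect_ignore = gated_ignores({"test_maps_normalization.py": "click"})
```

Gating at collection time avoids the import-time ModuleNotFoundError, which pytest reports as a collection error and not as a skip.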

…w findings

Addresses code-review findings on PR #57:
- dataset.py: use the same `max(axis=1) == 0` criterion as the training dataloader for all-zero channel masking (the previous `(== 0).all(axis=1)` diverged for negative-valued input, breaking the claimed training parity).
- dataset.py: fix the comment's reference to a non-existent test file; the real test is test_channel_padding_is_numerically_inert.
- dataset.py: broaden the all-masked error message. Channels can now be dropped for being unmatched, duplicate, or all-zero, not only unmatched.
- test_canonical_inference.py: pin the opt-in `ct_abstention_k=None` default via a signature check (the behavioural assertions use a uniform input that never trips the IQR fence, so they alone don't catch a default regression).
- scripts/predict.py: note in --ct_abstention_k help that the batch CLI deliberately defaults abstention ON (paper reproduction) while the predict() library API defaults it OFF.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
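The divergence between the two masking criteria shows up only when a channel contains negative values and no positive ones. The sketch below assumes a (channels, pixels) layout, which is a guess about how dataset.py flattens each channel; the two expressions themselves are quoted from the message above.

```python
import numpy as np


def masked_by_max(flat):
    """Training-dataloader criterion: channel max equals zero."""
    return flat.max(axis=1) == 0


def masked_by_equality(flat):
    """Previous inference criterion: every value equals zero."""
    return (flat == 0).all(axis=1)


# Rows are channels (assumed layout): all-zero, non-positive, and positive signal.
flat = np.array([
    [0.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 2.0, 0.0],
])
```

Here the second channel is masked by the max criterion but kept by the equality criterion, which is exactly the parity gap the fix closes.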

…=None)

The tutorial still described abstention as on-by-default (k=0.2) and told users to disable it with k=0, contradicting PR #57's change making it opt-in. Now documents the None default (raw argmax for every cell) and shows k=0.2 to enable the paper's headline operating point.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Release-readiness fixes: README quickstart, abstention opt-in, inference parity, tests

Drop docs/reviews/ (internal multi-agent review + ablation reports kept only for provenance; nothing in code/docs references them) and the dev-only `if __name__ == "__main__"` runner appended to tests/test_v2.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The [analysis] extra documented figure scripts (plot_tsne.py, plot_experiment_results.py) that do not exist in the repo, and nothing imports its only deps (seaborn, openpyxl). Remove the extra, its mention in `all`, and the dangling allowlist entries in tests/conftest.py. Retarget a package-data comment to HierarchicalLoss (the real consumer of combined_celltypes.yaml). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- scripts/train.py: docstring said "DANN disabled by default" but --domain_weight defaults to 0.1 (DANN enabled).
- Remove references to internal-monorepo files not shipped in the public repo (analysis/ct_abstention_iqr.py, preprocess_for_training.py, analysis.test_split_summary, dct-final-ckpt/).
- training/utils.py: BatchData.tissue_idx docstring described a removed 'index 0 = null token' scheme; the code now raises on a missing tissue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove public-but-uncalled API surface (0.1.0 is unreleased, so never shipped): TissueNetConfig.{get_excluded_ct_indices, get_channel_embedding, get_celltype_embedding, combined_celltype_mapping, color_mapping, core_tree, lineage_mapping, validate}, the now-unused yaml import, and create_dataloader_from_config (plus its dataset re-export and __all__ entry). DataLoaderConfig is kept (exercised by the test suite). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- splits/fov_split_test_current.json: replace a leaked author-machine zarr_path (/data/xwang3/...) with the $DATA_DIR placeholder used by the other three.
- splits/README.md: document fov_split_test_current.json (the actual default headline-eval split) and the prior- vs current-archive fingerprint split.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- scripts/pretrain.py: usage/runtime hints used bare `python pretrain.py` / `python train.py`, which only run from scripts/; align to `python scripts/...` to match the README.
- baselines/nimbus/run.py: reword the in-code TODO documenting the centroid scale_factor overlap as a 'Known limitation' note (the limitation is real and documented intentionally; not unfinished work to flag for release).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

All four methods (DCT, MAPS, CellSighter, XGBoost) now use the main DeepCell-Types sampler by default (sqrt-inverse-frequency with a 1000-count floor), so the baseline-vs-DCT comparison no longer confounds the method with its class-balancing scheme. Each method's own faithful sampler stays available as an opt-in ablation.

- samplers.py: factor the DCT weight formula into a shared label-array helper `compute_sample_weights_dct()`; `compute_sample_weights()` now delegates to it (byte-identical weights, so the DCT main path is unchanged).
- CellSighter: default `--class_balance` equal -> sqrt (`sqrt` already maps to the DCT sampler in `create_dataloader`); the faithful equal-proportion + size_data recipe stays as the `equal` ablation.
- MAPS: add `--class_balance {dct,full_inv_freq,none}`, default `dct`; `full_inv_freq` is the faithful mahmoodlab/MAPS `n/count` sampler.
- XGBoost (plain + tuned): add `--class_balance {dct,none}`, default `dct`, applied as a per-row `sample_weight` in `fit()` (the tree analog of the neural samplers); `none` restores faithful unweighted XGBoost.
- READMEs updated to reflect the new default + the faithful ablations.

Code-only change; baseline numbers must be regenerated by retraining with the new default before they land in any figure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
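One plausible reading of "sqrt-inverse-frequency with a 1000-count floor" is sketched below. The exact formula in `compute_sample_weights_dct()` is not given in this log, so the expression sqrt(N / max(count, floor)) and the function name are assumptions; the sketch only illustrates how a count floor gives every class below the floor the same weight.

```python
import numpy as np


def sqrt_inv_freq_weights_sketch(labels, floor=1000):
    """Per-sample weight sqrt(N / max(class_count, floor)); formula is assumed."""
    labels = np.asarray(labels)
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    floored = np.maximum(counts, floor)  # classes rarer than the floor are clamped
    per_class = np.sqrt(labels.size / floored)
    return per_class[inverse.ravel()]  # broadcast class weights back to samples
```

Under this reading, a class with 10 cells and a class with 1000 cells receive identical weights, so the scheme mostly rebalances the abundant classes.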

…it metadata)

Follow-up to the release-cleanup audit on this PR:
- scripts/train.py: align the DEFAULT_LOSS_WEIGHTS fallback (domain 0.0 -> 0.1) with the documented "DANN enabled by default" / --domain_weight default. Behavior-neutral for the CLI (the per-run loss_weights dict always overrides "domain" with --domain_weight), fixing only the fallback used by a programmatic forward_one_batch(loss_weights=None) caller.
- scripts/evaluate_on_test.sh: drop the surviving internal `dct-final-ckpt/` reference from the header comment; point at the public deepcell_types.baselines path instead.
- tests/conftest.py: remove the deleted `[analysis]` extra from the optional-extras comment.
- splits/fov_split_test_current.json: correct stale metadata (num_val_fovs 431 -> 129; add num_heldout_fovs: 302) to match the actual val/heldout keys (verified: val=129, bit-identical to fov_split_test.json; heldout=302).

Full suite: 358 passed, 1 skipped. ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… provenance

Two fixes from the PR #60 deep-review.

M1: XGB sample_weight scale. `compute_sample_weights_dct` returns raw sqrt-inverse-frequency weights (mean several×, up to ~47× on floored rare classes). A WeightedRandomSampler is scale-invariant, but XGBoost consumes sample_weight as an absolute per-row multiplier on the summed gradient/hessian per leaf, so the raw weights silently inflate hessian mass and weaken reg_lambda / min_child_weight relative to the unweighted run, confounding class balance with reduced regularization. Add a `normalize` kwarg (default False, preserving the resampling and main-model paths bit-for-bit; verified: test_samplers 24/24 still pass) and opt the three XGB sites (run.py, tuning.py objective + train_best_model) into normalize=True so the dct-vs-none ablation isolates balancing.

M3: provenance. Because all baselines now default to the shared sampler, two prediction CSVs trained under different schemes are byte-schema-identical. `save_baseline_predictions` now writes the active class_balance (and size_data for CellSighter) to a sidecar `*.meta.json`. It is a sidecar, not a CSV column, so the prediction schema and downstream softmax-column selection are unchanged.

Also warn when CellSighter `--size_data` is set under a non-`equal` scheme (silently inert otherwise), and soften the sampler docstring (rare-tail-is-unbalanced note; drop the unverified "236 cells" figure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
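Both fixes can be illustrated with a short sketch. The helper names are hypothetical, and the sidecar naming (`preds.csv` to `preds.meta.json`) is an assumption about how the `*.meta.json` pattern maps onto a prediction file; dividing by the mean is the mean-1 normalization the message describes.

```python
import json
from pathlib import Path

import numpy as np


def normalize_mean_one(weights):
    """Rescale weights to mean 1 so total gradient/hessian mass is unchanged."""
    w = np.asarray(weights, dtype=float)
    return w / w.mean()


def sampling_probabilities(weights):
    """A weighted sampler only sees relative weights, so scale cancels out."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()


def write_sidecar_sketch(csv_path, **provenance):
    """Record training provenance next to a predictions CSV, not inside it."""
    csv_path = Path(csv_path)
    meta_path = csv_path.with_name(csv_path.stem + ".meta.json")  # assumed naming
    meta_path.write_text(json.dumps(provenance, indent=2, sort_keys=True))
    return meta_path
```

Normalization leaves the sampler's draw distribution untouched, which is why it matters only for consumers such as XGBoost that read the weights as absolute multipliers.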

…defaults

PR #60 added `--class_balance` to the maps/xgboost/xgboost-tune commands but did not update the frozen-option snapshot tests, so test_xgboost_*_frozen and test_maps_*_frozen failed with "Extra items in the left set: 'class_balance'" (CI-blocking). Add `class_balance` to XGBOOST_OPTS, XGBOOST_TUNE_OPTS, and MAPS_OPTS, and lock the unified-sampler defaults (xgb/xgboost-tune/maps -> dct, cellsighter -> sqrt) via default-value assertions so a silent default flip is caught.

Verified: 9/9 frozen-option tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…servative claim

From the PR #60 deep-review docs findings:
- Note that CellSighter's `--class_balance sqrt` and MAPS/XGB's `dct` select the identical shared scheme (the equivalence was only in a code docstring, so a user scripting "same sampler everywhere" hit click Invalid value).
- Document that the 1000-count floor leaves the rare tail effectively unbalanced (the scheme mainly rebalances the head), so "balanced" is not overstated for the region macro-F1 weights most.
- Hedge the XGB README's "had made the XGBoost rare-class macro number conservative" (an unmeasured causal claim) to "we expect ... (direction not yet measured)", and document the new mean-1 sample_weight normalization.
- Replace bare "WeightedRandomSampler" with a note that the main model wraps it as FOVGroupedSampler (identical draw distribution).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

create_dataloader_from_config (its only functional consumer) was removed earlier in this PR, leaving DataLoaderConfig as an exported-but-inert dataclass with no ergonomic call path. Remove it for a clean release surface; the full keyword API of create_dataloader remains the single way to configure a loader.

- dataloader.py: drop the class and its now-orphaned `dataclass`/`typing` imports; update the module docstring.
- dataset.py: drop the back-compat re-export and the `__all__` entry.
- test_training_import_order.py: drop the DataLoaderConfig import assertions.

Full suite: 358 passed, 1 skipped. ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mpler feat(baselines): unify all baselines on the DCT sampler by default

Rebased onto master (which now has #60's --class_balance / DCT-sampler default).

Adds an opt-in --val_split_file to MAPS, CellSighter, XGBoost (plain + tuned) so each trains on the FULL --split_file train and selects its checkpoint / early-stop / Optuna-trial on an EXTERNAL validation set: the 'val' FOVs of --val_split_file, capped to 200k cells at seed 42 (mirroring dataloader.py:269-273 max_val_samples), scored with the canonical hierarchical ct_macro_f1 (metrics.py:399-419). The reported set stays --split_file 'val'. Legacy inner-val carve is unchanged when the flag is absent.

This matches how the main DCT model selects (val_macro_f1 on the canonical 302-FOV validation, 200k cap), making baseline model-selection consistent with DCT instead of each baseline self-carving a different 10% inner-val.

Combined-state fix (needed once #58's eval_set_external path meets #60's sample_weight): thread class_balance through xgb/tuning.py _run_canonical_val_tuning -> run_tuning / train_best_model, and apply compute_sample_weights_dct in the eval_set_external branch. Otherwise the tuned-XGB canonical run would tune with DCT weights but ship an unweighted final model.

Also adds --features_cache to xgboost-tune for cache reuse, and lists val_split_file in the frozen CLI option snapshots.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
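The seeded validation cap can be sketched as below. The function name is hypothetical and the use of `numpy.random.default_rng` is an assumption; the message says only that the cap is 200k cells at seed 42, so the repository's exact RNG and selection order may differ.

```python
import numpy as np


def cap_validation_indices(n_cells, max_val_samples=200_000, seed=42):
    """Deterministically subsample validation cells when above the cap."""
    if n_cells <= max_val_samples:
        return np.arange(n_cells)  # under the cap: keep every cell
    rng = np.random.default_rng(seed)
    picked = rng.choice(n_cells, size=max_val_samples, replace=False)
    return np.sort(picked)  # sorted for stable downstream indexing
```

Fixing the seed means every baseline is scored on the same validation cells, which is what makes their model selection comparable.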

…selection feat(baselines): --val_split_file canonical external-val selection (consistent with DCT)

chore: release-readiness cleanup (cruft, stale refs, dead API, split docs)

xuefei-wang and others added 30 commits May 11, 2026 07:18

Merge pull request #39 from xuefei-wang/refactor/canonical-only-monorepo

3e178b7

Collapse training pipeline into deepcell-types (canonical-only)

Merge pull request #4 from xuefei-wang/fix/tissue-idx-and-circular-im…

ede32a2

…port fix(train,predict,utils): tissue_idx kwarg + circular import + baseline submodule rebase

chore(gitignore): ignore uv.lock and /reviews/ artifacts

8df93c6

uv.lock is regenerated on every branch switch when pyproject.toml shapes differ (master vs training), so keeping it tracked produces constant churn. /reviews/ holds local /deep-review outputs.

chore: drop orphan Dockerfile

bfdb9ae

Untouched since Oct 2024 and broken since the kit was inlined in 0a8108e: it COPYs a non-existent top-level requirements.txt and pip-installs the deleted deepcelltypes-kit/ directory. No CI, docs, or scripts reference it.

Merge branch 'master' of https://github.com/xuefei-wang/deepcell-types

1804166

xuefei-wang and others added 30 commits June 22, 2026 14:24

Merge pull request #52 from xuefei-wang/fix/baseline-doc-faithfulness

d5f4ab9

docs(baselines): correct stale faithfulness claims in baseline READMEs

Merge pull request #53 from xuefei-wang/feat/maps-paper-faithful-trai…

577f6e3

…ning feat(maps): DCT-safe training schedule and normalization controls

Merge pull request #56 from xuefei-wang/salvage/pr55-sampler-resmlp

7d470a0

feat(model): configurable resMLP head + config-less head-shape inference (salvage of #55)

Merge pull request #57 from xuefei-wang/fix/release-readiness

4a1acd7

Release-readiness fixes: README quickstart, abstention opt-in, inference parity, tests

Merge pull request #60 from xuefei-wang/feat/baselines-unified-dct-sa…

370b3c2

…mpler feat(baselines): unify all baselines on the DCT sampler by default

Merge pull request #58 from xuefei-wang/feat/baselines-canonical-val-…

94f8c2c

…selection feat(baselines): --val_split_file canonical external-val selection (consistent with DCT)

Merge pull request #59 from xuefei-wang/chore/release-cleanup

9d0ccf0

chore: release-readiness cleanup (cruft, stale refs, dead API, split docs)

xuefei-wang wants to merge 269 commits into vanvalenlab:master from xuefei-wang:master


2 participants
