feat: v2 overhaul — correctness fixes, self-conditioning/CFG, discrete mode, DPO, perf #18

Merged
devnull37 merged 3 commits into main from feature/dimba-v2-overhaul on May 28, 2026

@devnull37
Owner

Summary

Major correctness + research upgrade of DIMBA over the v1 concept paper. Validated end-to-end (compileall clean + 13/13 runtime smoke). 39 files, +7,326 / -701.

Correctness fixes

  • Conditioning leak removed: the prompt is clean context (clean-prefix + pooled prompt), never the target, and the loss is response-only. (The leak was present in the v1 paper.)
  • Real zero-terminal-SNR cosine schedule (Lin et al. 2023) — previously a docstring-only claim.
  • Bidirectional Mamba denoiser; genuine Mamba-2 kernel preference with graceful fallback.
  • SimpleMamba2 rewritten — stable negative-A, per-channel input (was collapsing the inner dim), no double norm/residual; underflow→NaN guard.
  • Correct x0-DDIM sampler (+ v-prediction); fixed FiLM identity-init, the 3-tuple forward(), get_model_config, and the denoise_step helper reference.
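As a rough illustration of the zero-terminal-SNR item above, the following standalone sketch builds a clipped cosine schedule and applies the rescaling described by Lin et al. (2023). It is not the code from this PR: the function names, the `s` offset, and the 0.999 beta clip are assumptions chosen for illustration.

```python
import math


def cosine_alpha_bar(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> list[float]:
    """Cumulative signal fraction alpha_bar_t for a cosine schedule with clipped betas."""
    def f(t: int) -> float:
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        # The clip keeps beta < 1, which is what leaves a small non-zero terminal SNR.
        beta = min(1.0 - f(t + 1) / f(t), max_beta)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def rescale_zero_terminal_snr(alpha_bar: list[float]) -> list[float]:
    """Shift and scale sqrt(alpha_bar) so the last step is exactly 0 and the first is preserved."""
    sqrt_ab = [math.sqrt(a) for a in alpha_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    scaled = [(x - last) * first / (first - last) for x in sqrt_ab]
    return [x * x for x in scaled]


ab = cosine_alpha_bar(1000)
fixed = rescale_zero_terminal_snr(ab)
print(ab[-1] > 0.0, fixed[-1] == 0.0)  # True True
```

With a zero terminal SNR the final timestep is pure noise, so an epsilon-prediction target carries no information there; this is presumably one reason the sampler items above are stated in terms of x0 and v-prediction.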

Research upgrades

  • Self-conditioning, classifier-free guidance, min-SNR-γ weighting, cross-entropy / rounding anchor.
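Two of the upgrades listed above reduce to short formulas. The sketch below shows min-SNR-γ loss weighting and the classifier-free guidance combination in their commonly published forms; the function names and the default γ = 5 are illustrative assumptions, not the API added in this PR.

```python
def min_snr_weight(snr: float, gamma: float = 5.0, prediction: str = "x0") -> float:
    """Per-timestep loss weight with the SNR clipped at gamma."""
    clipped = min(snr, gamma)
    if prediction == "x0":
        return clipped
    if prediction == "eps":
        return clipped / snr
    if prediction == "v":
        return clipped / (snr + 1.0)
    raise ValueError(f"unknown prediction type: {prediction}")


def cfg_combine(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: move from the unconditional output toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Low-noise steps (large SNR) are capped at gamma; noisy steps keep their raw SNR weight.
print(min_snr_weight(100.0), min_snr_weight(0.5))  # 5.0 0.5
# scale = 1.0 recovers the conditional prediction; larger values extrapolate past it.
print(cfg_combine([1.0, 2.0], [0.0, 0.0], 2.0))  # [2.0, 4.0]
```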

New capabilities

  • Discrete / masked + hybrid diffusion (corruption.py, masked_sampling.py, DIMBA.predict_token_logits).
  • DPO / IPO / SimPO + diffusion-ELBO / VRPO surrogate; pluggable verifiable rewards for GRPO (token-overlap demoted to a warned legacy option).
  • Vectorized parallel selective-scan, torch.compile helper, MLX backend skeleton.
  • ELBO best-of-K reranking.
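For readers unfamiliar with the preference objectives named above, here is a minimal sketch of the DPO, IPO, and SimPO losses for a single preference pair, written from their published formulas. The function names and default hyperparameters are assumptions. In a diffusion setting the sequence log-probabilities would presumably be replaced by ELBO estimates, which this sketch does not attempt to compute.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def preference_margin(pol_w: float, pol_l: float, ref_w: float, ref_l: float) -> float:
    """Policy-vs-reference log-ratio of the chosen sample minus that of the rejected sample."""
    return (pol_w - ref_w) - (pol_l - ref_l)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    return -log_sigmoid(beta * margin)


def ipo_loss(margin: float, tau: float = 0.1) -> float:
    # IPO regresses the margin toward 1 / (2 * tau) instead of pushing it to infinity.
    return (margin - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(logp_w: float, len_w: int, logp_l: float, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    # Reference-free: length-normalized log-probabilities with a target margin gamma.
    return -log_sigmoid(beta * (logp_w / len_w - logp_l / len_l) - gamma)


margin = preference_margin(pol_w=-10.0, pol_l=-14.0, ref_w=-11.0, ref_l=-12.0)
print(margin)  # 3.0
```

A zero margin gives a DPO loss of log 2, and the loss falls as the policy prefers the chosen sample more strongly than the reference does.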

Infra & docs

  • GitHub Actions CI (py3.10/3.12, CPU torch, pytest + black + mypy), pre-commit, benchmark script, new tests.
  • README (What's-New + corrected claims), docs/IMPROVEMENT_PLAN.md, docs/RESEARCH_DIRECTIONS.md, docs/OVERHAUL_STATUS.md, CHANGELOG.md.

Validation

  • python -m compileall clean across the package.
  • 13/13 end-to-end runtime smoke: all model modes, sampling + CFG, masked hook, corruption, masked sampling, parallel-scan parity (9.5e-7).
  • CI re-runs the full pytest suite on clean Linux runners with working torch.
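The parallel-scan parity check mentioned above can be illustrated on a scalar linear recurrence h_t = a_t * h_{t-1} + b_t, which is the core of a selective scan. The sketch below is a pure-Python toy under that assumption, not the vectorized implementation from this PR: it compares a sequential loop against a Hillis-Steele style scan built on an associative combine.

```python
def sequential_scan(a: list[float], b: list[float]) -> list[float]:
    """Reference recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left: tuple[float, float], right: tuple[float, float]) -> tuple[float, float]:
    """Compose two affine maps h -> a*h + b, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def parallel_scan(a: list[float], b: list[float]) -> list[float]:
    """Inclusive scan in O(log n) passes; updates within one pass are independent of each other."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    # With h_0 = 0, the offset term of each prefix composition is h_t.
    return [offset for _, offset in elems]


a = [0.9, 0.5, 0.8, 0.7, 0.95]
b = [1.0, -2.0, 0.5, 3.0, 0.1]
max_diff = max(abs(x - y) for x, y in zip(sequential_scan(a, b), parallel_scan(a, b)))
print(max_diff < 1e-12)  # True
```

The two paths round differently because the parallel form reorders the arithmetic, so parity is checked against a tolerance rather than for exact equality; in float32 a gap on the order of the reported 9.5e-7 would be unsurprising.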

Notes

  • Built via 6 parallel agents + coupled core surgery; the agents cross-validated each other's work.
  • scripts/train_interactive.py (in-progress WIP) intentionally excluded.
  • Follow-ups: first-class masked-mode training script + a [MASK] token; cross-attention conditioning; real speed/quality benchmarks once compute lands.

🤖 Generated with Claude Code

…DPO, perf

Major correctness + research upgrade over the v1 concept paper.

Correctness:
- Remove prompt-conditioning leak (clean-prefix context + pooled prompt; response-only loss)
- Implement real zero-terminal-SNR cosine schedule (Lin et al. 2023)
- Bidirectional Mamba denoiser; prefer genuine Mamba-2 kernels with fallback
- Rewrite SimpleMamba2 (stable negative-A, per-channel input, no double norm/residual)
- Correct x0-DDIM sampler (+ v-prediction); fix FiLM init, 3-tuple forward, get_model_config
- Fix denoise_step helper reference; guard SimpleMamba2 scan against underflow NaN

Research upgrades:
- Self-conditioning, classifier-free guidance, min-SNR weighting, cross-entropy/rounding anchor

New capabilities:
- Discrete/masked + hybrid diffusion (corruption, masked sampling, predict_token_logits)
- DPO/IPO/SimPO + diffusion-ELBO/VRPO surrogate; pluggable verifiable rewards for GRPO
- Vectorized parallel selective-scan, torch.compile helper, MLX backend skeleton
- ELBO best-of-K reranking

Infra/docs: GitHub Actions CI, pre-commit, benchmark, tests, CHANGELOG, README,
IMPROVEMENT_PLAN, RESEARCH_DIRECTIONS, OVERHAUL_STATUS.

Validated: compileall clean + 13/13 runtime smoke (venv python).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

devnull37 and others added 2 commits May 27, 2026 19:13
Scale the encoded signal to ~unit variance before diffusion so the schedule's SNR
is meaningful (a la Stable Diffusion's 0.18215). Embeddings initialized at std 0.02
against unit-variance noise were crushing the effective SNR at every timestep, which
is the crux of why latent/continuous text diffusion is harder to train.

- DIMBA.latent_scale folded into encode_latent/decode_latent (round-trips exactly);
  default 1/embed_init_std for the embedding path, 1.0 for the projector/VAE path.
- DIMBA.calibrate_latent_scale(batch): measure the encoded-signal std and set the
  factor (recommended before training in latent/VAE mode).
- Configurable TokenEmbedding init_std; latent_scale + embed_init_std in model config.
- Tests + end-to-end smoke updated (14/14 OK).
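The effect of the scale factor described above can be sketched with plain lists. The helpers below are hypothetical stand-ins, not `DIMBA.calibrate_latent_scale` or `encode_latent` themselves: they measure the standard deviation of std-0.02 values, rescale to roughly unit variance, and show how the effective SNR at a given `alpha_bar` changes.

```python
import random
import statistics


def measure_scale(values: list[float]) -> float:
    """Factor that brings the encoded values to roughly unit standard deviation."""
    return 1.0 / statistics.pstdev(values)


def scale_latent(values: list[float], scale: float) -> list[float]:
    return [v * scale for v in values]


def unscale_latent(latents: list[float], scale: float) -> list[float]:
    return [z / scale for z in latents]


def effective_snr(alpha_bar: float, signal_std: float) -> float:
    """SNR of sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise, with unit-variance noise."""
    return alpha_bar * signal_std ** 2 / (1.0 - alpha_bar)


rng = random.Random(0)
embeddings = [rng.gauss(0.0, 0.02) for _ in range(2000)]  # mimics an init std of 0.02

scale = measure_scale(embeddings)
latents = scale_latent(embeddings, scale)
restored = unscale_latent(latents, scale)

# At alpha_bar = 0.5 the schedule nominally gives SNR = 1, but std-0.02 signal sees about 4e-4.
print(effective_snr(0.5, 0.02), effective_snr(0.5, 1.0))
```

In floating point the multiply-then-divide round trip here is exact only up to rounding error, so a test of it should compare against a small tolerance.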

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the stale paper-era guide (which documented the conditioning leak and the
MSE-only loss as *the* procedure) with an accurate v2 reference: current data flow,
model API conventions (no leak, 3-tuple forward, latent_scale round-trip, calibrate),
the three diffusion modes, training via compute_dimba_losses, inference, post-training,
the torch-teardown / venv-python / MPS environment gotchas, the file map, and the
current PR status + open follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@devnull37 devnull37 merged commit bb1bbba into main May 28, 2026
0 of 4 checks passed