feat: v2 overhaul — correctness fixes, self-conditioning/CFG, discrete mode, DPO, perf #18
Merged
…DPO, perf

Major correctness + research upgrade over the v1 concept paper.

Correctness:
- Remove prompt-conditioning leak (clean-prefix context + pooled prompt; response-only loss)
- Implement real zero-terminal-SNR cosine schedule (Lin et al. 2023)
- Bidirectional Mamba denoiser; prefer genuine Mamba-2 kernels with fallback
- Rewrite SimpleMamba2 (stable negative-A, per-channel input, no double norm/residual)
- Correct x0-DDIM sampler (+ v-prediction); fix FiLM init, 3-tuple forward, get_model_config
- Fix denoise_step helper reference; guard SimpleMamba2 scan against underflow NaN

Research upgrades:
- Self-conditioning, classifier-free guidance, min-SNR weighting, cross-entropy/rounding anchor

New capabilities:
- Discrete/masked + hybrid diffusion (corruption, masked sampling, predict_token_logits)
- DPO/IPO/SimPO + diffusion-ELBO/VRPO surrogate; pluggable verifiable rewards for GRPO
- Vectorized parallel selective-scan, torch.compile helper, MLX backend skeleton
- ELBO best-of-K reranking

Infra/docs: GitHub Actions CI, pre-commit, benchmark, tests, CHANGELOG, README, IMPROVEMENT_PLAN, RESEARCH_DIRECTIONS, OVERHAUL_STATUS.

Validated: compileall clean + 13/13 runtime smoke (venv python).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
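For readers unfamiliar with the zero-terminal-SNR fix named in the commit above, here is a minimal standard-library sketch of the rescaling described by Lin et al. (2023), applied to a cosine schedule with clipped betas. The function names, the offset `s = 0.008`, and the `max_beta = 0.999` clip are assumptions chosen for illustration. They are not taken from the DIMBA code, which may implement the schedule differently.

```python
import math


def cosine_alphas_cumprod(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> list[float]:
    """Cumulative signal level alpha_bar_t for t = 1..num_steps (cosine form, clipped betas)."""

    def f(t: float) -> float:
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2

    alphas_cumprod, running = [], 1.0
    for t in range(1, num_steps + 1):
        # Clipping beta keeps the last step from being singular, but it also
        # leaves a small amount of signal at t = T (non-zero terminal SNR).
        beta = min(1.0 - f(t) / f(t - 1), max_beta)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def rescale_zero_terminal_snr(alphas_cumprod: list[float]) -> list[float]:
    """Shift and scale sqrt(alpha_bar) so the final step carries no signal."""
    sqrt_ab = [math.sqrt(a) for a in alphas_cumprod]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is exactly zero, then rescale so the first
    # value is (up to rounding) unchanged.
    rescaled = [(v - last) * first / (first - last) for v in sqrt_ab]
    return [v * v for v in rescaled]


def snr(alpha_bar: float) -> float:
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar); requires alpha_bar < 1."""
    return alpha_bar / (1.0 - alpha_bar)


if __name__ == "__main__":
    before = cosine_alphas_cumprod(1000)
    after = rescale_zero_terminal_snr(before)
    print("terminal SNR before rescale:", snr(before[-1]))
    print("terminal SNR after rescale: ", snr(after[-1]))
```

With the terminal SNR at exactly zero, the last training timestep is pure noise, which matches what the sampler starts from at inference. An epsilon-prediction target is uninformative at that step, which is one reason the rescaling is usually paired with x0- or v-prediction.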
Scale the encoded signal to ~unit variance before diffusion so the schedule's SNR is meaningful (a la Stable Diffusion's 0.18215). Embeddings initialized at std 0.02 against unit-variance noise were crushing the effective SNR at every timestep, which is the crux of why latent/continuous text diffusion is harder to train.

- DIMBA.latent_scale folded into encode_latent/decode_latent (round-trips exactly); default 1/embed_init_std for the embedding path, 1.0 for the projector/VAE path.
- DIMBA.calibrate_latent_scale(batch): measure the encoded-signal std and set the factor (recommended before training in latent/VAE mode).
- Configurable TokenEmbedding init_std; latent_scale + embed_init_std in model config.
- Tests + end-to-end smoke updated (14/14 OK).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
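The calibration idea in the commit above can be sketched in a few lines of standard-library Python. This is an illustration only: `measure_latent_scale`, `scale_latent`, and `unscale_latent` are hypothetical stand-ins, not the actual `DIMBA.calibrate_latent_scale`, `encode_latent`, or `decode_latent` methods, and the latents are synthetic Gaussian samples. With a signal std of 0.02 against unit-variance noise, the signal variance is 0.0004, so the effective SNR at any timestep is 2500 times lower than the schedule's nominal value. Multiplying by 1/std removes that gap.

```python
import random
import statistics


def measure_latent_scale(latents: list[float]) -> float:
    """Return the factor that brings the measured latent std to ~1 (hypothetical helper)."""
    std = statistics.pstdev(latents)
    if std == 0.0:
        raise ValueError("latents have zero variance; cannot calibrate")
    return 1.0 / std


def scale_latent(latents: list[float], scale: float) -> list[float]:
    """Apply the factor on the way into the diffusion process."""
    return [v * scale for v in latents]


def unscale_latent(latents: list[float], scale: float) -> list[float]:
    """Undo the factor on the way out, so encode/decode round-trips."""
    return [v / scale for v in latents]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in for embeddings initialized at std 0.02.
    raw = [random.gauss(0.0, 0.02) for _ in range(10_000)]
    scale = measure_latent_scale(raw)
    scaled = scale_latent(raw, scale)
    print("measured scale:", scale)  # close to 1 / 0.02 = 50
    print("std after scaling:", statistics.pstdev(scaled))
```

In this floating-point sketch the round trip is only approximate (multiply, then divide), so a tolerance-based comparison is the safer check.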
Replace the stale paper-era guide (which documented the conditioning leak and the MSE-only loss as *the* procedure) with an accurate v2 reference: current data flow, model API conventions (no leak, 3-tuple forward, latent_scale round-trip, calibrate), the three diffusion modes, training via compute_dimba_losses, inference, post-training, the torch-teardown / venv-python / MPS environment gotchas, the file map, and the current PR status + open follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Major correctness + research upgrade of DIMBA over the v1 concept paper. Validated end-to-end (compileall clean + 13/13 runtime smoke). 39 files, +7,326 / -701.

Correctness fixes
- SimpleMamba2 rewritten: stable negative-A, per-channel input (was collapsing the inner dim), no double norm/residual; underflow→NaN guard.
- …forward(), get_model_config, and the denoise_step helper reference.

Research upgrades

New capabilities
- …corruption.py, masked_sampling.py, DIMBA.predict_token_logits).
- …torch.compile helper, MLX backend skeleton.

Infra & docs
- README (What's-New + corrected claims), docs/IMPROVEMENT_PLAN.md, docs/RESEARCH_DIRECTIONS.md, docs/OVERHAUL_STATUS.md, CHANGELOG.md.

Validation
- python -m compileall clean across the package.

Notes
- scripts/train_interactive.py (in-progress WIP) intentionally excluded.
- …[MASK] token; cross-attention conditioning; real speed/quality benchmarks once compute lands.

🤖 Generated with Claude Code