BoggersTheLanguageModel is a production-grade continuous attractor language model built without attention, transformers, or traditional LLM methods. State follows a physical trajectory; meaning is path-dependent. The architecture is driven by the Propagate → Relax → Break → Evolve cycle that powers the TS-OS.
Primary repository: github.com/BoggersTheFish/BoggersTheLLM. Alternate mirror: github.com/BoggersTheFish/idekatp. The product name is BoggersTheLanguageModel.
Docs: docs/README.md indexes all guides; docs/BOGGERS_THE_LANGUAGE_MODEL_AUDIT.md is the full architecture / training audit; docs/PROJECT_STATUS.md summarizes what is implemented and what to do next; docs/DEVELOPMENT_ROADMAP.md is the phased roadmap.
State dimension D splits into num_waves channels of wave_dim each (D = num_waves × wave_dim). Slices are contiguous in the last dimension.
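For orientation, a minimal sketch of that layout (shapes only; the variable names are illustrative, not the model's API):

```python
import torch

num_waves, wave_dim = 4, 32                  # D = num_waves * wave_dim = 128
W, D = 8, num_waves * wave_dim

S = torch.randn(W, D)                        # one window state
# Contiguous slices along the last dimension, one per wave channel
waves = [S[:, i * wave_dim : (i + 1) * wave_dim] for i in range(num_waves)]
assert all(w.shape == (W, wave_dim) for w in waves)
```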
Corpus / token stream
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ TorchAttractorLanguageModel (sandbox.py) │
│ │
│ embed_window / embed_windows_batch → (W, D) or (B, W, D) │
│ │ │
│ ▼ │
│ run_window_dynamics() — outer loop (≤ max_window_steps) │
│ • Positional coupling (+ Phase 1 C·mask after inner step) │
│ • Optional GOAT per-position signal │
│ • Optional anchor pull (readout top-k embeds, detached) │
│ • Inner step: S ← S − dt·∇E with E = Σ_i energy_head_i(wave_i) │
│ + optional λ·tension_window(S) + anchor distance terms │
│ • Optional anchor-guided freeze: zero ∇ on converged wave slices │
│ • Phase 2 tension breaks + renorm │
│ │ │
│ ▼ │
│ readout_window_logits(S) — (B,W,D) full state │
│ → optional Linear(D,D) per position if --readout-fusion │
│ → readout_window: W·D → vocab │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────┐ ┌────────────────────────────┐
│ Decoding (training-aligned) │ │ LLMSubstrateNode │
│ model.generate() → │ │ (llm_substrate_node.py) │
│ forward_training_window → │ │ Propagate → Evolve hook │
│ readout_window_logits │ │ │
└─────────────────────────────┘ └────────────────────────────┘
Token path (evolve_token): per-wave WaveDynamics + wave_interaction, OR
VectorizedWindowDynamics.step on (num_waves, wave_dim) when model.dynamics.mhd is set.
Legacy: state_cache uses readout(D), not readout_window_logits — prefer generate + checkpoint loader.
No attention. No transformer blocks. No external foundation weights.
Core learnable pieces (high level):
- Window relaxation: one small MLP per wave maps wave slice → scalar energy; sum defines the attractor potential (plus optional tension / anchor terms).
- Token-time dynamics: `wave_dynamics` (`WaveDynamics` × num_waves) plus `wave_interaction` mixing across waves; `--dynamics vectorized` attaches `VectorizedWindowDynamics` on `wave_dim` for the `evolve_token` path (`step(S, signal)`).
- Readout: `readout_window` maps the flattened window to vocab; use `readout_window_logits(S)` so optional readout fusion runs.
- Tension: `compute_tension_window(S)` on the window tensor; `compute_tension` on fast/slow + logits inside `evolve_token`.
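A minimal sketch of the window relaxation inner step from the list above (per-wave energy MLPs summed into one scalar potential, then a single gradient step on the state); the module and layer sizes are illustrative stand-ins, not the actual classes in sandbox.py:

```python
import torch
import torch.nn as nn

class ToyWindowRelaxation(nn.Module):
    """Illustrative only: sum of per-wave energy heads defines the attractor potential."""
    def __init__(self, num_waves=4, wave_dim=32):
        super().__init__()
        self.num_waves, self.wave_dim = num_waves, wave_dim
        self.energy_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(wave_dim, wave_dim), nn.Tanh(), nn.Linear(wave_dim, 1))
            for _ in range(num_waves)
        )

    def energy(self, S):                      # S: (B, W, D)
        waves = S.split(self.wave_dim, dim=-1)
        return sum(head(w).sum() for head, w in zip(self.energy_heads, waves))

    def inner_step(self, S, dt=0.05):
        S = S.detach().requires_grad_(True)
        E = self.energy(S)
        (grad,) = torch.autograd.grad(E, S)
        return S - dt * grad                  # S <- S - dt * grad E

relax = ToyWindowRelaxation()
S = torch.randn(2, 8, 128)
for _ in range(32):                           # outer loop capped at max_window_steps
    S = relax.inner_step(S)
```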
See docs/PROJECT_STATUS.md for a frank “where we are” checklist and docs/README.md for the doc index.
The attractor relaxation horizon was increased from 16 to 32 steps.
This parameter controls how many outer relaxation iterations the window-state attractor dynamics performs during training.
Previous:
MAX_WINDOW_STEPS = 16
Current:
MAX_WINDOW_STEPS = 32
This change allows the latent state trajectory to settle deeper into its attractor basin before computing trajectory contrastive loss.
No architecture or loss changes were introduced.
For a dated record, see CHANGELOG.md (2026-04-04) and docs/architecture_changes.md.
| File / Directory | Wave | Purpose |
|---|---|---|
| `sandbox.py` | Phase 0+ | BoggersTheLanguageModel core: training, generate, load_model_from_checkpoint, _save_checkpoint |
| `phase05_config.py` | 0.5 | Phase05Config: tension, anchor terms, batch CSV, adaptive dt, anchor freeze, neg-def diffusion flag |
| `phase1_config.py` | 1 | Phase1Config: multi-head drift, window interaction matrix C, diversity loss |
| `phase2_config.py` | 2 | Phase2Config: directional breaks, residual mixing, C reg, head tension weights |
| `smoke_test.py` | Phase 0 | 5-assertion integration test (dynamics + TSCore wave cycle) |
| `tests/test_embed_windows_batch.py` | — | Parity check: embed_windows_batch vs stacking embed_window per row |
| `wave_a_tokenizer.py` | A | tiktoken BPE helpers; training uses sandbox._build_tokenizer() |
| `dynamics_vectorized.py` | B | VectorizedWindowDynamics: step(S, signal) only; forward disabled; run_window_dynamics_vectorized → model.run_window_dynamics |
| `state_cache.py` | C | Deprecated for decoding: generate_with_cache delegates to model.generate; cache.logits() still uses legacy readout(D). Prefer model.generate. |
| `scripts/generate_sample.py` | — | Load checkpoint via sandbox.load_model_from_checkpoint → model.generate (training-parity readout) |
| `scripts/ts_workflow_smoke.py` | — | Smoke: forward_training_window + model.generate + simple/vectorized run_window_dynamics |
| `data_pipeline.py` | D | Streaming sharded DataLoader (txt / JSONL, multi-worker) |
| `data/generate_corpus.py` | — | Deterministic synthetic .txt corpus (tiktoken-sized); CLI + sandbox fallback |
| `data/hf_remote_corpus.py` | — | TinyStories / FineWeb-Edu → cached .txt for training (--dataset-source) |
| `data/__init__.py` | — | Package marker for data.generate_corpus imports |
| `llm_substrate_node.py` | E | Registers model as a native TSCore node; Evolve hook |
| `goat_memory_transitions.py` | F | GOAT-TS ACTIVE → DORMANT → DEEP token state transitions |
| `inference_server.py` | G | FastAPI — /v1/completions; loads checkpoints with sandbox.load_model_from_checkpoint (vectorized dynamics + dt + Lorentz); model.generate |
| `Dockerfile` | G | CPU Docker image (swap whl URL for CUDA wheel) |
| `docker-compose.yml` | G | One-command deploy |
| `eval_harness.py` | H | Perplexity + tension metrics + 11-tick WaveCycleRunner |
| `vendor/GOAT-TS` | — | Constraint-graph engine (submodule) |
| `vendor/TS-Core` | — | UniversalLivingGraph + WaveCycleRunner (submodule) |
| `vendor/ts-llm` | — | Tokenizer, hierarchical dynamics, attractor LLM package (submodule) |
| `docs/README.md` | — | Index of all docs in docs/ |
| `evaluation/prompts.py` | — | EVAL_PROMPTS — fixed strings for end-of-epoch model.generate samples (logs/eval_epoch_*.txt) |
| `benchmarks/training_throughput.json` | — | Written by scripts/profile_training_step.py — wall-clock step time, batches/sec, tokens/sec |
| `scripts/profile_training_step.py` | — | Torch profiler + throughput benchmark (sandbox / ts-llm training step) |
| `docs/runs/apr2026_3epoch_cpu_example/` | — | Full 3-epoch CPU transcript + metrics (TinyStories, ~55 min) |
| `docs/PROJECT_STATUS.md` | — | Current implementation status, gaps, recommended next steps |
| `docs/API_DISCOVERY.md` | — | Vendored TS-OS entrypoints + sandbox.py integration surface |
| `docs/BASELINE.md` | — | Phase 0 baseline recording instructions |
| `docs/architecture_changes.md` | — | Dated architecture / default changes (e.g. relaxation horizon) |
| `docs/DEVELOPMENT_ROADMAP.md` | — | Phased roadmap (measurement, throughput, multi-wave behavior) |
| `docs/BOGGERS_THE_LANGUAGE_MODEL_AUDIT.md` | — | Full technical audit of sandbox.py model, dynamics, losses, metrics |
| `scripts/plot_phase05_metrics.py` | — | Plots --phase05-batch-metrics-csv columns (incl. Phase 1–2 extras) |
- Python 3.10+
- PyTorch (CPU or CUDA)
git clone --recurse-submodules https://github.com/BoggersTheFish/BoggersTheLLM.git
cd BoggersTheLLM
python3 -m venv .venv
source .venv/bin/activate # after this, `python` usually works; without venv use `python3`
pip install -r requirements.txt

On Ubuntu / Linux Mint, only python3 may be installed. Either activate .venv as above or call `python3 sandbox.py` instead of `python sandbox.py`.

python3 sandbox.py

With trajectory contrastive loss (recommended), LR schedule, metrics CSV, and a fixed epoch count:
python3 sandbox.py \
--epoch-metrics-csv metrics.csv \
--lr 0.001 \
--lr-decay-every 15 \
--lr-gamma 0.7 \
--val-fraction 0.1 \
--max-epochs 30

The default data/corpus.txt (or an empty path) may trigger a synthetic text fallback so integration tests always have enough tokens. For a real corpus and metrics that reflect language modeling (not the random-like ~506 val perplexity on a toy file), use a Hub dataset or your own large .txt file / directory.
Option A — TinyStories via Hugging Face (recommended first real run)
Requires datasets (included in requirements.txt). The first run downloads data into data/cache/hf/. Set HF_TOKEN (env) for higher Hub rate limits and faster downloads when unauthenticated warnings appear.
A0 — Full rows, large token count (GPU-oriented). With default --hf-max-rows 50000 and --hf-max-chars 0, TinyStories can materialize tens of millions of tokens — one epoch may take many hours on CPU. Use for serious GPU runs or overnight jobs.
A1 — CPU-sized slice (recommended first “real” run on laptop). Cap UTF-8 size so one epoch stays ~20–25 minutes on a typical CPU at batch 64 (~3.9 h for 10 epochs). The cache file name changes when --hf-max-chars changes.
pip install -r requirements.txt
mkdir -p checkpoints/meaningful_run
python3 sandbox.py \
--dataset-source tinystories \
--hf-max-rows 50000 \
--hf-max-chars 1500000 \
--tokenizer tiktoken \
--vocab-cap 8192 \
--val-fraction 0.1 \
--window-size 8 \
--state-dim 128 \
--num-waves 4 \
--vectorized-num-heads 4 \
--batch-size 64 \
--max-epochs 10 \
--lr 0.001 \
--grad-clip 1.0 \
--lr-decay-every 5 \
--lr-gamma 0.8 \
--epoch-metrics-csv metrics_meaningful.csv \
--eval-results-json eval_meaningful.json \
--checkpoint-dir checkpoints/meaningful_run \
--save-every 500

End-state reference (Apr 2026, one machine): val_CE ~4.8, train_CE ~3.9 after 10 epochs; generations recognizable as story-like but not polished. A committed snapshot of metrics_meaningful.csv and eval_meaningful.json (plus a column/field glossary) lives under docs/runs/meaningful_apr2026/; narrative: docs/TRAINING_RUN_LOG.md.
That reference run used --lr 0.001 and --token-aux-ce 0.2 (defaults). If train_CE and val_CE rise together over epochs while dynamics diagnostics stay healthy, the trajectory objective is likely overpowering token CE — prefer --lr 3e-4 (or 1e-4) and --token-aux-ce 0.5 with --grad-clip 1.0, and compare train_CE / val_CE (not raw mean_loss, which scales with aux weights). See docs/FAILURE_ANALYSIS.md and A1c below.
A1c — Rebalanced loss (recommended when CE drifts up). Same TinyStories slice and model shape as A1; smaller trajectory batch (more batches/epoch on CPU), lower LR, stronger readout_window CE aux:
python3 sandbox.py --device cpu \
--dataset-source tinystories \
--hf-max-rows 50000 --hf-max-chars 1500000 \
--tokenizer tiktoken --vocab-cap 8192 \
--state-dim 128 --num-waves 4 --window-size 8 \
--num-dynamics-steps 16 \
--max-epochs 5 --trajectory-batch-size 32 \
--lr 0.0003 \
--token-aux-ce 0.5 \
--grad-clip 1.0 \
--val-fraction 0.1 \
--epoch-metrics-csv metrics_rebalanced.csv

Wall time is roughly ~28 min/epoch on a typical laptop CPU (~10k batches/epoch at batch 32). Use --num-dynamics-steps 32 to match the current MAX_WINDOW_STEPS default if you want deeper relaxation (slower per step).
A1b — Same recipe, fewer epochs (quick sanity check). Use --max-epochs 3 (and omit CSV/JSON flags if you only need console metrics). Wall time is roughly ~55 minutes on a typical CPU at batch 64 for this corpus slice. A verbatim log (progress bars, checkpoints, Phase 0 baseline block, sample generations, debug dynamics) is committed under docs/runs/apr2026_3epoch_cpu_example/ (README, full output). That example was captured at git d65dd64 with dynamics_steps=16; current defaults use --num-dynamics-steps 32 unless you pass 16 explicitly.
A2 — Larger budget (optional GOAT + substrate). Omit --hf-max-chars (or raise it) and scale --max-epochs / GPU batch size as appropriate.
pip install -r requirements.txt
mkdir -p checkpoints/real_run
python3 sandbox.py \
--dataset-source tinystories \
--tokenizer tiktoken \
--val-fraction 0.1 \
--max-epochs 50 \
--use-goat-memory \
--use-substrate \
--lr 0.001 \
--lr-decay-every 15 \
--lr-gamma 0.7 \
--epoch-metrics-csv metrics_real.csv \
--eval-results-json eval_results.json \
--checkpoint-dir checkpoints/real_run

This writes a final ckpt_step*.pt under --checkpoint-dir and an eval_results.json with the same token-level val split as training (val_ce, val_ppl, val_windows, checkpoint path). Perplexity should move off the untrained baseline as loss decreases.
Option B — FineWeb-Edu subset (streaming)
Uses the sample-10BT config and stops after --hf-max-rows rows (default 50k):
python3 sandbox.py \
--dataset-source fineweb-edu \
--hf-max-rows 20000 \
--tokenizer tiktoken \
--val-fraction 0.1 \
--max-epochs 50 \
--eval-results-json eval_results.json \
--checkpoint-dir checkpoints/fineweb_run

Option C — Your own corpus (TS-OS export or any large text)
Place UTF-8 text at data/corpus.txt, or pass --corpus /path/to/dir (merges .txt / .jsonl). To disable automatic synthetic fallback when the file is missing or tiny:
python3 sandbox.py --corpus data/my_corpus.txt --no-synthetic-fallback --tokenizer tiktoken ...

Materialize HF data only (no training)
python3 data/hf_remote_corpus.py tinystories --max-rows 50000 --cache-dir data/cache/hf
# The script prints the path to the generated .txt; pass it to --corpus:
python3 sandbox.py --corpus PATH_PRINTED_ABOVE --tokenizer tiktoken ...

Full Wave H harness (TSCore before/after tension + the same val perplexity) on the same cached file:
python3 eval_harness.py \
--dataset-source tinystories \
--tokenizer tiktoken \
--model-checkpoint checkpoints/real_run/ckpt_step0001234.pt \
--output eval_results_wave_h.jsonVerifies dynamics, tension, training step, and TSCore wave cycle all pass:
python3 smoke_test.py

Training uses a single batched embedding for trajectory windows (embed_windows_batch on (B, W) token ids). To verify it matches row-wise embed_window (same numerics as the old torch.stack loop):

python3 tests/test_embed_windows_batch.py

Prints max_abs_diff and asserts allclose at 1e-6. Requires PyTorch (same env as training).

python3 eval_harness.py --val-fraction 0.2 --max-ticks 11 --output eval_results.json

Use the same Hugging Face corpus as training (ignores --corpus when set):
python3 eval_harness.py --dataset-source tinystories --tokenizer tiktoken \
--val-fraction 0.2 --max-ticks 11 --output eval_results.json

(--wave-cycles is a deprecated alias for --max-ticks.)
pip install fastapi uvicorn
python inference_server.py --host 0.0.0.0 --port 8000

Or with Docker:
docker compose up
# Service name: boggers-language-model (see docker-compose.yml)

Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness + tension metrics |
| POST | /v1/completions | OpenAI-compatible text completion |
| POST | /v1/generate | Direct generate call |
| GET | /metrics/tension | Last window tension curve |
| POST | /ts/propagate | Trigger one TSCore wave propagation |
| GET | /ts/tension | Current TSCore graph tension |
This section is the operational manual: what to run, how training consumes data, how validation stays honest, how to reproduce runs, and how the main scripts relate to each other. For a compact flag list, see the CLI reference below.
python3 sandbox.py is the full training and baseline run: it builds the tokenizer and TorchAttractorLanguageModel, loads (or synthesizes) the corpus, constructs an AttractorDataPipeline when possible, runs num_epochs of optimization, prints sample generations, and optionally writes checkpoints and CSV metrics.
Default data mode is streaming (recommended). The whole corpus file (or merged directory of .txt / .jsonl) is read as one string, encoded once into a single token sequence, and training uses all sliding windows (context, target) along that sequence. Legacy line-based mode (--no-streaming-dataset) tokenizes each non-empty line separately and skips lines shorter than window_size + 1; it also uses --epoch-copies to repeat the training line list each epoch. In stream mode, --epoch-copies is ignored on purpose: repetition is controlled only by --max-epochs and by shuffled window sampling, not by duplicating tokens in memory.
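To make the stream-mode windowing concrete, a small illustrative sketch (a stand-in for the pipeline's internals, not its code):

```python
W = 8                                          # --window-size
tokens = list(range(100))                      # stand-in for tokenizer.encode(full_text)
# Every sliding window: a context of W tokens predicts the next token
windows = [(tokens[i : i + W], tokens[i + W]) for i in range(len(tokens) - W)]
assert len(windows) == len(tokens) - W
```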
Device and checkpoints. --device auto picks CUDA when available. _save_checkpoint stores model_state, optimizer_state, step/epoch, legacy config (model geometry for loading), and training_config (full CLI hyperparameters: window_size, state_dim, num_waves, max_window_steps, batch_size, lr, tokenizer, dataset, seed, max_epochs, …). load_model_from_checkpoint merges training_config when present and warns if it is missing or incomplete. --resume-checkpoint restores weights and optimizer state. --checkpoint-dir and --save-every control where and how often numbered checkpoints are written.
End-of-epoch fixed prompts. After each epoch (before CSV metrics), the training loop runs model.generate(prompt, max_tokens=120) for every string in evaluation/prompts.EVAL_PROMPTS, prints each result, and appends logs/eval_epoch_{epoch}.txt (1-based epoch index). This is separate from the final Phase 0 baseline block, which still uses BASELINE_PROMPT_1–3 in sandbox.py.
Integrations. --use-substrate attaches LLMSubstrateNode so language tension can drive TSCore propagation and logging. --use-goat-memory enables GoatMemoryManager and injects a per-position signal into window dynamics. By default --dynamics vectorized replaces model.dynamics with VectorizedWindowDynamics on wave_dim (from dynamics_vectorized.py; falls back to non-vectorized if import fails). --dynamics simple leaves model.dynamics unset so evolve_token uses wave_dynamics (per-wave WaveDynamics) plus wave_interaction; the window path is unchanged (energy descent in run_window_dynamics). The legacy SimpleAttractorDynamics class remains in code for compatibility but is not the default.
Corpus path. --dataset-path wins over --corpus if both are set; otherwise the default is data/corpus.txt. Paths may be a single file, a directory (all .txt / .jsonl / .json under it, merged), or .jsonl with text / content / sentence fields concatenated.
Hugging Face datasets (--dataset-source). With --dataset-source tinystories or fineweb-edu, the sandbox ignores --corpus / --dataset-path for loading and materializes text into data/cache/hf/ (override with --hf-cache-dir). Limits: --hf-max-rows, --hf-max-chars; --hf-refresh rebuilds the cache file. Requires pip install datasets (listed in requirements.txt).
Automatic fallback. If no text files resolve for the path, or if the tokenized sequence is shorter than 20 * window_size tokens, training prints Corpus too small — generating synthetic corpus..., writes a temporary UTF-8 file using data/generate_corpus.py’s generate_corpus() (target length at least 20k tiktoken GPT-2 tokens, scaled up slightly with window size), trains from that file, and deletes the temp file after the run. This keeps local experiments runnable without hand-curating a large corpus. Pass --no-synthetic-fallback to fail fast instead (recommended when you expect a real corpus file to be present).
Manual corpus generation. To persist synthetic text instead of a temp file:
python3 data/generate_corpus.py --out data/generated.txt --tokens 20000 --seed 42
python3 sandbox.py --corpus data/generated.txt

generate_corpus grows paragraphs until the tiktoken GPT-2 encoding length reaches --tokens. Counts may differ slightly from the sandbox tokenizer (--tokenizer tiktoken vs fallback), but the generated file is always large enough for the small-corpus threshold above.
Token-level split with a gap. Validation is a suffix of the token stream. Between the last training token and the first validation token the code skips window_size tokens so no sliding window’s context crosses into the other split (no train/val leakage). Each side must still have at least window_size + 1 tokens to form one window.
Minimum validation windows. The code targets at least 50 validation windows (MIN_VAL_WINDOWS). After the first split, if there are fewer than 50 val windows, it recomputes the split once using an effective hold-out fraction of at least (50 + window_size) / total_tokens, then logs final val_fraction (actual len(val_tokens) / total_tokens). If 50 windows are still impossible (corpus too small), it prints Validation set too small (X windows). Metrics will be noisy. and, at startup, WARNING: validation unreliable when val_windows < 50. Treat val_ce / perplexity on tiny val sets as qualitative only.
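A sketch of that split logic (token-level suffix hold-out with a window_size gap); the function and numbers are illustrative, not the sandbox code:

```python
def split_with_gap(tokens, val_fraction, window_size):
    """Hold out a suffix for validation, skipping window_size tokens so no
    sliding window's context crosses the train/val boundary (as described above)."""
    n_val = int(len(tokens) * val_fraction)
    train = tokens[: len(tokens) - n_val - window_size]   # training prefix
    val = tokens[len(tokens) - n_val :]                   # validation suffix
    return train, val

train_toks, val_toks = split_with_gap(list(range(1000)), val_fraction=0.1, window_size=8)
val_windows = max(0, len(val_toks) - 8)                   # windows available on the val side
```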
What you see at startup (stream mode). Lines such as total_tokens=, train_tokens=, val_tokens=, train_windows=, val_windows=, and final val_fraction=… summarize the run. For statistically stable validation on modest corpora, prefer --val-fraction 0.2–0.3 or more tokens.
Global seed. --seed seeds Python’s random module at process start.
Per-epoch batch order (stream mode). AttractorDataPipeline.epoch_batches(epoch_index=epoch) uses random.Random(seed + epoch_index) to shuffle window start indices. Each yield is a 3-tuple (contexts, targets, target_states_batch): the third element is None unless the pipeline was constructed with precomputed train_target_states (trajectory-guided training). Fixing --seed fixes the entire sequence of batch orders across epochs; changing the epoch index changes the shuffle, so epochs are not identical copies of the same ordering.
No stream duplication. Training does not multiply the train token list by epoch_copies in stream mode; multiple passes are real epochs over reshuffled windows.
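A sketch of the per-epoch deterministic shuffle described above, mirroring the Random(seed + epoch_index) idea (illustrative only):

```python
import random

def epoch_order(num_windows, seed, epoch_index):
    # Random(seed + epoch_index): reproducible across runs, different per epoch
    idx = list(range(num_windows))
    random.Random(seed + epoch_index).shuffle(idx)
    return idx

# Same seed and epoch index give the same order; a different epoch index reshuffles.
assert epoch_order(10, seed=42, epoch_index=0) == epoch_order(10, seed=42, epoch_index=0)
```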
| Command | Purpose |
|---|---|
| `python3 smoke_test.py` | Fast integration check: model + one dynamics pass + training step + TSCore wave cycle. Run after install or refactors. |
| `python3 tests/test_embed_windows_batch.py` | Confirms batched window embedding matches per-row embed_window (prints max abs diff). |
| `python3 eval_harness.py …` | Perplexity (teacher-forced forward_training_window), mean tension, trajectory contrast, optional TSCore WaveCycleRunner metrics; writes JSON. Text samples use model.generate. Use --dataset-source tinystories / fineweb-edu to match Hub training data. |
| `python3 inference_server.py` | FastAPI: /v1/completions via model.generate, load_model_from_checkpoint for vectorized ckpts. Needs pip install fastapi uvicorn. |
| `python3 scripts/profile_training_step.py` | Torch profiler on one training step; writes benchmarks/training_throughput.json and prints === Throughput === (see --throughput-iters). |
| `python3 data/generate_corpus.py --out PATH --tokens N` | Offline synthetic corpus for tests or a fixed data/generated.txt. |
docker compose up builds and runs the inference image from Dockerfile / docker-compose.yml. Swap the PyTorch wheel in the Dockerfile for CUDA if you need GPU in the container. The compose service name is noted in docker-compose.yml (see Quick start).
- Architecture and tension: Architecture overview and Tension semantics.
- Loss function: Training objective.
- Per-flag list: CLI reference and Epoch metrics CSV columns.
- Scaling and practical training: Scaling and training tips.
- Lightweight diagnostics: Debug mode.
- Module-by-module history: Wave-by-wave implementation log.
Data and validation
- Prefer stream mode (default): one token sequence and sliding windows scale to large corpora. Use `--dataset-source tinystories` or `fineweb-edu` for a first serious run, or a large UTF-8 `--corpus` file / directory.
- For stable val CE / perplexity, use enough hold-out tokens: `--val-fraction` 0.1–0.3. If startup warns `val_windows < 50`, treat metrics as noisy until you add text or increase the fraction.
- `--no-synthetic-fallback` fails fast if the corpus is missing or tiny — useful before long GPU jobs.
Batch size, window, and steps
- Trajectory mode requires `--trajectory-batch-size` ≥ 2 (negatives are drawn inside the batch). Larger batches stabilise contrastive training but cost more memory.
- Window size (`--window-size`): wider context increases compute per step roughly linearly in W (embedding is W×D; dynamics run up to `--num-dynamics-steps` / `--max-window-steps` outer steps). Start with the default or 8; increase when data and VRAM allow.
- `--num-dynamics-steps` / `--max-window-steps`: hard cap on outer steps per window. Optional `--convergence-epsilon` (with `--min-attractor-steps`, default 2) can stop early only for batch size 1 when state change or tension is stable; the default epsilon is 0. Training batches with B > 1 always use the full outer step count. Watch `mean_final_step_tension` in the epoch CSV.
Throughput and hardware
- Use `--device cuda` when available. On CUDA, `torch.set_float32_matmul_precision("high")` is set, and `torch.compile` targets only the inner step (`dyn._step` for vectorized, `dyn._step_rows` for simple) — not the outer window loop. The first epoch can be slower while kernels warm up.
- `python3 scripts/profile_training_step.py` records Chrome/TensorBoard traces and, after profiling, times `--throughput-iters` plain optimizer steps (default 32). It prints `step_time_ms`, `batches/sec`, `tokens/sec` and overwrites `benchmarks/training_throughput.json` (repo root). The default `--max-window-steps` in the script is 32 to match `sandbox.py`.
- `--dynamics vectorized` (default) can help on GPU when token-time `evolve_token` uses `MultiHeadDynamics`; ensure `wave_dim` is divisible by `--vectorized-num-heads`. `--dynamics simple` is fine for CPU smoke tests (per-wave `WaveDynamics` on the token path).
Optimisation and stability
- `--lr` 1e-3 with `StepLR` (`--lr-decay-every`, `--lr-gamma`) is a reasonable starting point; if `train_CE` and `val_CE` rise over epochs while dynamics look stable, try `--lr 3e-4` to `1e-4` and `--token-aux-ce 0.5` (README A1c). Lower LR if loss spikes or `grad_norm` explodes (see `--debug`).
- Keep at least one of `--token-aux-ce` or `--readout-aux-alpha` on in trajectory mode so the readout heads receive gradients (the script warns if both are zero).
- Phase 1 window interaction (`--phase1-enable-window-interaction`) plus Phase 2 `--phase2-interaction-reg-weight` and optional `--phase2-interaction-decay-tau` help keep the coupling matrix C from drifting; enable them when you see unstable window norms.
- Checkpoints: `--checkpoint-dir` + `--save-every` for long runs; `--resume-checkpoint` restores weights and optimizer. Newer code may add parameters — use `strict=False` in custom loaders if needed.
Logging for analysis
- `--epoch-metrics-csv`: one row per epoch (loss, CE, tension, TSCore fields).
- `logs/eval_epoch_{epoch}.txt`: fixed-prompt `model.generate` samples each epoch (from `evaluation/prompts.EVAL_PROMPTS`); distinct from `--baseline-out` / `BASELINE_PROMPT_*` at the end of training.
- `--phase05-batch-metrics-csv` (with `--phase05-log-metrics` implied): per-batch diagnostics; plot with `scripts/plot_phase05_metrics.py`. When metrics logging is off, the runtime skips heavy tracing arrays and keeps only the tension values needed for control flow.
Pass --debug to print a small number of [debug] lines at meaningful points (no per-step spam):
| When | What you see |
|---|---|
| After resume (if used) | Starting epoch index and global_step. |
| After model → device | Parameter count, state_dim, train window size, max window steps. |
| After integrations | Dynamics class name, torch.compile outcome, substrate / GOAT on or off. |
| Before the training loop | Epoch range, starting global_step, loss_mode. |
| Pipeline | Streaming on/off, batch size, estimated batches per epoch (or legacy fallback). |
| Each epoch | Estimated batches, report_every (matches progress snippet cadence), current LR. |
| First batch only (trajectory mode) | Loss, gradient L2 norm, whether readout logits are all finite. |
| End of each epoch | Batch count, approximate window-updates, batches per second. |
| After training | Final global_step, last epoch id, last mean loss and train CE. |
--quick-test --debug prints one line before the sanity checks then exits.
Use this when verifying a new machine, a resumed run, or tracking down NaNs; for full traces use --phase05-batch-metrics-csv instead.
- Synced workspace from `origin/main` (GitHub); tagged `phase-0-baseline`
- Added git submodules: `vendor/GOAT-TS`, `vendor/TS-Core`, `vendor/ts-llm`
- Verified all entrypoints; documented in `docs/API_DISCOVERY.md`
- `smoke_test.py`: 5 assertions all pass on CPU in ~2 s
Training and inference use sandbox._build_tokenizer(mode, vocab_cap), which loads AttractorTokenizer from vendor/ts-llm:
- `--tokenizer tiktoken` — gpt2 BPE up to `--vocab-cap` (default 32768)
- `--tokenizer fallback` — same BPE, vocab capped at 512 for fast iteration; GPT-2 ids ≥ 512 fold as `id % 512` (not dropped, not all clamped to 511), so prompts stay more distinguishable under a small cap
The model is constructed with vocab_size = tok.n_vocab; model.tokenizer is set for encode / decode. sandbox.FULL_VOCAB remains an empty legacy shim for old imports.
wave_a_tokenizer.py still exposes make_vocab_and_tokenizer() for scripts that want a standalone helper.
import sandbox as sb
tok = sb._build_tokenizer("tiktoken", 32768)
model = sb.TorchAttractorLanguageModel(tok.n_vocab, train_window_size=6)
model.tokenizer = tok

dynamics_vectorized.py provides VectorizedWindowDynamics, selected by default via --dynamics vectorized. It is attached to model.dynamics with state_dim=wave_dim (per-wave channel width). The unified step(S, signal) → S API is used on the token path (evolve_token) when model.dynamics.mhd is present.
run_window_dynamics does not call dynamics.step each outer step: it applies positional coupling, optional GOAT signal, optional anchor pull, then one inner energy-gradient step (learned energy_heads, optional tension / anchor distance), then Phase 1 global interaction. Static tensors (positional weights, C * mask, GOAT bonus) are cached for the outer loop.
With --dynamics simple, model.dynamics stays None; evolve_token uses wave_dynamics + wave_interaction instead.
- Wraps `MultiHeadDynamics` from `vendor/ts-llm` (low-rank diffusion per head + cross-head coupling); the window step uses a cubic nonlinearity (the simple path uses `tanh`).
- `forward` on `VectorizedWindowDynamics` is disabled (`NotImplementedError`); use `model.run_window_dynamics` or `run_window_dynamics_vectorized`, which temporarily swaps dynamics and calls `model.run_window_dynamics` (optional `**kwargs` forwarded to `run_window_dynamics`).
- `torch.compile`: on CUDA, only `dyn._step` (vectorized) or `dyn._step_rows` (simple) is compiled — not the full dynamics module. `get_compiled()` caches compiled `_step` by shape key for smoke tests / benchmarks.
- Parity tests: both paths produce finite outputs; the equations differ by design.
For text generation, use TorchAttractorLanguageModel.generate() (same path as training: forward_training_window → readout_window_logits / effective_temperature). scripts/generate_sample.py and inference_server.py both load checkpoints with sandbox.load_model_from_checkpoint (rebuilds VectorizedWindowDynamics with vectorized_dt, use_lorentz before load_state_dict).
state_cache.py remains for experiments. generate_with_cache is a shim that calls model.generate ( FutureWarning ). cache.logits() still uses readout(fast/slow), not readout_window_logits, so logits do not match trajectory training.
- `step(token_id)` — same window embedding + `run_window_dynamics` on `(1, W, D)` (dynamics aligned; the readout head is not)
- `logits()` / `generate_with_cache` — deprecated for production decoding
- Self-test: `python3 state_cache.py` uses `forward_training_window` + `model.generate`
# Preferred
text = model.generate("the cat sat", max_tokens=30, temperature=1.0, top_k=28)
# Legacy (warnings; delegates to model.generate)
from state_cache import AttractorStateCache, generate_with_cache
cache = AttractorStateCache(model)
text = generate_with_cache(model, cache, prompt="the cat sat", max_tokens=30)

data_pipeline.py feeds training with stream-based tokenization by default (no dependence on individual lines being long enough):
- Stream mode (default): the corpus is read as full text (whole `.txt` files; `.jsonl` records concatenated), then `tokenizer.encode(full_text)` produces one continuous token sequence. Sliding windows `(context, target)` use `tokens[i : i+W]` → target `tokens[i+W]`. Each epoch shuffles all window start indices, then batches.
- `train_token_ids=` — sandbox passes the train split after the token-level train/val cut with a gap of `window_size` tokens between train and val so no sliding window shares context across the split.
- `train_target_states=` — optional `(n_windows, W, D)` float tensor on CPU, one row per sliding window in stream order (window starting at token index `j` for `j = 0 … len(tokens)−W−1`). When set, `epoch_batches` yields `(contexts, targets, batch_targets)` with `batch_targets` shaped `(B, W, D)` aligned to the same shuffled window indices as `contexts`. Line-based mode does not support this field (the third tuple element is always `None`).
- Trajectory guidance (sandbox CLI): in stream training, you can precompute targets with `--trajectory-guidance-from-embed` or load `--trajectory-guidance-states PATH.pt`, then set `--trajectory-guidance-nudge-scale` (per–outer-step nudge in `run_window_dynamics`) and/or `--trajectory-guidance-mse-weight` (MSE term in the trajectory loss). See Training objective.
- Shuffle: `epoch_batches(epoch_index=epoch)` uses `Random(seed + epoch_index)` so each epoch has a deterministic but different batch order (reproducible runs).
- Stream mode ignores `--epoch-copies` (use `--max-epochs` instead); the token stream is never duplicated.
- Legacy line mode: `streaming_dataset=False` keeps per-line encoding (short lines dropped); `shuffle_buffer` refills between batch groups.
- Multi-shard round-robin (`shard_id` / `num_shards`) for data-parallel workers.
- Too few tokens (`len < window_size + 1`) raises a clear "Corpus too small after tokenization" error.
- Synthetic fallback: if the corpus path yields no files or fewer than `20 * window_size` tokens after encoding, `sandbox.py` generates a temporary corpus via `data/generate_corpus.py` (see Usage guide).
- Startup logging: stream mode prints `total_tokens`, `train_tokens`, `val_tokens`, `train_windows`, `val_windows`, and the `final val_fraction` after any minimum-val-window adjustment.
from data_pipeline import AttractorDataPipeline
pipe = AttractorDataPipeline(
sources=["data/corpus.txt"], model=model, batch_size=16, streaming_dataset=True
)
for contexts, targets, target_states in pipe.epoch_batches():
# target_states is None unless train_target_states= was passed to AttractorDataPipeline
loss, _ = model.trajectory_contrastive_loss_and_logits(
contexts, targets, target_states=target_states
)

llm_substrate_node.py closes the language → TS-OS feedback loop:
- Registers `"llm_substrate"` as a native node in TSCore with an edge from `"ts_native"`
- `on_batch(model)` — reads `_last_window_tension_curve`, normalises it to `[0,1]`, pushes it to the node activation, and calls `ts.propagate_wave()` (skipped when language tension is below `high_tension_threshold`)
- When TSCore tension exceeds `evolve_threshold`, calls `ts.factory_evolve()` (appends a stability node — a self-improvement tick)
- Optional HTTP POST to `LLM_HOOK_URL` (BoggersTheAI Evolve endpoint) — fire-and-forget, never blocks training
- With `--use-substrate`, each epoch logs `evolves`, `last_ts_tension`, and active vs idle batch counts; if TSCore never fired (all batches below threshold), a single per-epoch warning is printed. The same substrate fields are appended to `--epoch-metrics-csv` as `tscore_evolves` and `tscore_last_tension`.
from llm_substrate_node import LLMSubstrateNode
substrate = LLMSubstrateNode(model)
# After each training batch:
substrate.on_batch(model)

goat_memory_transitions.py wires GOAT-TS-style per-token memory into training when you pass --use-goat-memory:
- One `Node` per vocabulary index (`vocab_size`); labels are string token IDs
- After each batch, `GoatMemoryManager.tick(contexts)` updates activations and ACTIVE / DORMANT / DEEP transitions
- During window dynamics, `_single_window_step` builds a `(B, W, D)` signal from `activation_bonus(token_id)` at each position (broadcast across `D`), so GOAT affects the actual forward pass — not only `get_signal()` on the legacy single-token path
- `sweep_config()` — tunable knobs for automated sweeps
State machine (per token):
high usage → ACTIVE (activation ≥ 0.5)
low usage → DORMANT (activation < 0.1)
3 ticks at DORMANT → DEEP (excluded from bigram bias)
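A sketch of that per-token state machine using the thresholds above (an illustrative helper, not the GoatMemoryManager implementation):

```python
def next_state(state, activation, dormant_ticks):
    """ACTIVE / DORMANT / DEEP transitions as described above (illustrative)."""
    if activation >= 0.5:
        return "ACTIVE", 0
    if activation < 0.1:
        dormant_ticks += 1
        # Three consecutive DORMANT ticks demote the token to DEEP
        return ("DEEP" if dormant_ticks >= 3 else "DORMANT"), dormant_ticks
    return state, dormant_ticks                # mid-range activation keeps the current state
```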
inference_server.py exposes the model via FastAPI:
- OpenAI-compatible `/v1/completions` (drop-in for any client that targets the OpenAI API)
- `model.generate()` for completions (training-parity `readout_window_logits` path)
- Checkpoints: `sandbox.load_model_from_checkpoint` (same as `scripts/generate_sample.py`) so vectorized weights load correctly
- TSCore sidecar: `/ts/propagate` and `/ts/tension` endpoints
- Thread-safe: `threading.Lock()` wraps generate calls
- `Dockerfile` (CPU; swap whl URL for CUDA) + `docker-compose.yml` with healthcheck and volume mounts for checkpoints and corpus
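A minimal client sketch for the completions endpoint; the request fields shown (prompt, max_tokens) follow the OpenAI-style convention and are an assumption here, so check inference_server.py for the exact schema:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "the cat sat", "max_tokens": 30},  # assumed fields; see inference_server.py
    timeout=60,
)
print(resp.json())
```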
# Test without FastAPI installed:
python inference_server.py --self-test

eval_harness.py provides the full evaluation loop:
- `compute_perplexity(model, dataset)` — token-level PPL = exp(mean CE) on teacher-forced `forward_training_window` logits (standard next-token CE; not `generate`)
- `compute_mean_tension(model, dataset)` — mean final window tension across batches
- `compute_traj_contrast(model, dataset)` — mean trajectory contrastive loss
- `run_wave_cycle(model, substrate, dataset, max_ticks=11)` — feeds language batches into TSCore, runs `run_until_stable(max_ticks)`, returns the before/after tension delta and evolve count
Evaluation calls that run trajectory forwards now use a non-mutating path for repulsion memory bookkeeping, so diagnostics do not alter subsequent training behavior.
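The perplexity number itself is just the exponential of the mean next-token cross-entropy; a small sketch of that definition (not the harness's exact code):

```python
import math
import torch.nn.functional as F

def perplexity(logits, targets):
    """logits: (N, vocab) teacher-forced next-token logits; targets: (N,) token ids."""
    ce = F.cross_entropy(logits, targets)        # mean next-token cross-entropy
    return math.exp(ce.item())                   # PPL = exp(mean CE)
```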
Phase 0 eval results (untrained model, 51-line corpus):
| Metric | Value |
|---|---|
| Baseline val PPL | 506 |
| Mean window tension | 0.318 |
| TSCore tension (before WaveCycle) | 0.149 |
| TSCore tension (after 11-tick WaveCycle) | 0.0005 |
| Evolve events triggered | 5 |
TSCore converges cleanly. High PPL is expected for an untrained model — the harness is the measurement instrument.
Configuration: Phase05Config in phase05_config.py, passed to TorchAttractorLanguageModel(..., phase05=...) (CLI: --phase05-*).
- `--phase05-log-metrics`: collect window-trace and token-evolve diagnostics used for the batch CSV and logged scalars. When disabled, the outer loop avoids accumulating tension curves, step diagnostics, and break-tracing arrays.
- `--phase05-batch-metrics-csv PATH`: append one row per training batch (implies log metrics). The column list is `PHASE05_BATCH_CSV_HEADER` in `sandbox.py` (tension components, stagnation, trajectory margin, break counts, Phase 1–2 extensions).
- `--phase05-enforce-negdef-diffusion`: strictly negative-definite diffusion in the simple dynamics path.
- `--phase05-adaptive-window-dt`: EMA-scaled positional timestep from window tension.
- `--phase05-tension-w w1,w2,w3`: override the weights in `T_total = w1·T_energy + w2·T_align + w3·T_entropy`.
- `--phase05-multi-negative` / `--phase05-num-negatives` / `--phase05-traj-temperature`: trajectory contrastive negatives and temperature.
- Trajectory-guided targets (also configured via `Phase05Config`): `--trajectory-guidance-nudge-scale` and `--trajectory-guidance-mse-weight` act when the data pipeline supplies precomputed `train_target_states`; see Wave D and the CLI reference.
Configuration: Phase1Config in phase1_config.py (CLI: --phase1-*).
- `--phase1-num-heads`, `--phase1-head-dim-mode {shared,split}`: parallel drift heads; split mode partitions D across heads (`D % H == 0`).
- `--phase1-enable-window-interaction`: learnable `C ∈ ℝ^{W×W}` applied as `einsum('bid,ij->bjd', S, C)` after each local step (scaled by `--phase1-interaction-scale`).
- `--phase1-head-diversity-weight`: auxiliary penalty on the mean pairwise cosine similarity of head drift directions.
- `--phase1-enable-per-head-tension`: when logging, record the mean per-head geometry tension (split layout).
Configuration: Phase2Config in phase2_config.py (CLI: --phase2-*). No attention or token–token scoring; head-level weighting only.
| Area | Behaviour |
|---|---|
| Breaks | Default: escape along normalised state − prev_state, step size α = break_base_strength · clamp((T_target − T)/T_target, min, max); tiny delta norm falls back to random unit direction. --phase2-disable-directional-break restores Gaussian jitter. |
| Rejection | --phase2-enable-break-rejection: revert a break if tension increases and row cosine alignment worsens. |
| Mixing | state + sigmoid(gate)·W_mix(concat heads) when residual mixing is on; --phase2-disable-residual-mixing uses linear mix only. --phase2-mixing-gate-init sets initial gate (~0.1 default). |
| Window C | Optional --phase2-interaction-decay-tau: multiply C by an exp(−\|i−j\|/τ) distance mask; --phase2-interaction-reg-weight adds a ‖C−I‖² regulariser to the loss (requires window interaction). |
| Head weights | --phase2-enable-head-tension-weighting: combine head drifts with softmax(−T_head) (requires per-head tension signal in the dynamics path). |
| Memory hook | --phase2-store-break-memory: store last pre/post break window states on the model for future reuse. |
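A sketch of the default directional break from the table above (escape along the normalised state delta with a tension-scaled step size, falling back to a random direction for tiny deltas); names and defaults are illustrative, not the exact Phase 2 code:

```python
import torch
import torch.nn.functional as F

def directional_break(state, prev_state, T, T_target=1.0,
                      base_strength=0.1, min_scale=0.1, max_scale=1.0):
    delta = state - prev_state
    if delta.norm() < 1e-8:                        # tiny delta: fall back to a random direction
        delta = torch.randn_like(state)
    direction = F.normalize(delta, dim=-1)
    # alpha = base_strength * clamp((T_target - T) / T_target, min, max), per the table above
    alpha = base_strength * torch.clamp(
        torch.tensor((T_target - T) / T_target), min_scale, max_scale
    )
    return state + alpha * direction
```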
Batch CSV (with --phase05-batch-metrics-csv) gains Phase 2 fields when breaks occur: phase2_break_direction_norm_mean, phase2_break_applied_alpha_mean, phase2_break_delta_tension_mean, phase2_break_delta_alignment_mean, phase2_head_weight_entropy, phase2_interaction_reg_loss.
Checkpoints: new parameters (for example mixing_gate_raw, phase1_window_C) are not in older checkpoints; load with strict=False or retrain.
Resume reliability: optimizer state is restored after model device placement and optimizer tensors are migrated to the active device, preventing Adam CPU/CUDA state mismatch on resumed training.
import sandbox as sb
tok = sb._build_tokenizer("fallback", 512) # or "tiktoken", 32768
model = sb.TorchAttractorLanguageModel(
vocab_size=tok.n_vocab,
state_dim=512,
train_window_size=6,
max_window_steps=32,
phase05=sb.Phase05Config(),
phase1=sb.Phase1Config(),
phase2=sb.Phase2Config(),
)
model.tokenizer = tok
# Training path
wids = model.window_ids_from_sequence(token_ids)
S = model.embed_window(wids) # (W, D)
# Batched: context_tensor (B, W) long -> (B, W, D), equivalent to stacking embed_window rows
# S_b = model.embed_windows_batch(context_tensor)
S, logs, _ = model.run_window_dynamics(S, context_ids=wids) # GOAT uses token ids if enabled
# Optional trajectory guidance: target_states (B, W, D) same shape as batched S;
# nudge each outer step when phase05 trajectory_guidance_nudge_scale > 0
# S, _, _ = model.run_window_dynamics(S_b, context_ids=contexts, target_states=targets_tensor)
train_logits = model.readout_window_logits(S) # primary training readout (optional fusion)
# Batched trajectory contrastive loss (optional precomputed targets per window)
loss, logits = model.trajectory_contrastive_loss_and_logits(
contexts, targets, target_states=maybe_precomputed_B_W_D
)
# Generation (readout_window_logits path)
text = model.generate("the quick brown fox", max_tokens=40)
# Load a saved checkpoint (vectorized dynamics rebuilt from config + state_dict)
# model = sb.load_model_from_checkpoint("checkpoints/.../ckpt_step0000001.pt", tokenizer_mode="tiktoken", vocab_cap=32768, device=torch.device("cpu"))
# Prompt comparison (trajectory distance)
sb.compare_prompts(model, "cats eat fish", "fish eat cats")

Default: trajectory contrastive loss with optional auxiliary terms.
L = L_traj + w_token · L_token_aux + α · L_readout_aux + w_guide · L_traj_mse + …
L_traj = mean(ReLU(0.2 − cos(pred, teacher) + cos(pred, negative)))
pred = evolved(context window)
teacher = stop-gradient next state along the same predicted trajectory (consecutive outer-step
states; no second run_window_dynamics on a shifted window)
negative = shuffled teacher in batch
L_token_aux = CE on readout_window_logits(pred window) vs target (--token-aux-ce, default 0.2)
L_readout_aux = CE on readout(final token row of pred) vs target (--readout-aux-alpha, default 0.15)
L_traj_mse = MSE(pred, T) when batch target states T (B, W, D) are provided and
--trajectory-guidance-mse-weight > 0 (Phase05Config.trajectory_guidance_mse_weight).
During evolution, each outer step can apply S ← S + β (T.detach() − S) before coupling
/ energy descent when --trajectory-guidance-nudge-scale β > 0 and T is passed into
run_window_dynamics / trajectory_contrastive_loss_and_logits. Teacher path is not nudged.
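A sketch of the hinge form of L_traj above, with the negative taken as a shuffled teacher inside the batch; shapes and names are illustrative, not the training code:

```python
import torch
import torch.nn.functional as F

def traj_contrastive(pred, teacher, margin=0.2):
    """pred, teacher: (B, W, D) window states; teacher is treated as stop-gradient."""
    teacher = teacher.detach()
    negative = teacher[torch.randperm(teacher.size(0))]          # shuffled teacher within the batch
    pos = F.cosine_similarity(pred.flatten(1), teacher.flatten(1), dim=-1)
    neg = F.cosine_similarity(pred.flatten(1), negative.flatten(1), dim=-1)
    return F.relu(margin - pos + neg).mean()                     # mean(ReLU(0.2 - cos_pos + cos_neg))
```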
Primary next-token signal: readout_window_logits(S) → readout_window (training + model.generate). Avoid calling readout_window alone if --readout-fusion is enabled. The single-vector readout head is for aux loss (--readout-aux-alpha), tension entropy, and legacy next_token_logits / state_cache — not the default decoding path. Use --loss-mode ce for classic next-token CE only.
Per-token path (evolve_token): after each inner step, compute_tension returns:
T = |ΔE_state| + λ · (1 − cos(fast, slow)) + μ · H(readout_logits)
| Tension | Behaviour |
|---|---|
| T < tol | Early exit — attractor is stable |
| T > high | Directional break (Phase 2 default) or Gaussian jitter (--phase2-disable-directional-break) |
| T > break_thresh | Same break family on the token path |
Window path (run_window_dynamics): after each outer iteration, compute_tension_window (alias compute_window_tension) uses neighbor energy drift + misalignment + optional readout entropy (see WINDOW_TENSION_USE_ENTROPY in sandbox.py). The outer loop runs at most max_window_steps times. Early convergence (convergence_epsilon > 0, after --min-attractor-steps) applies only when the window batch dimension B == 1 so multi-sample batches always run the full step count. If epsilon is 0, all max_window_steps are used. Tension still drives low-tension escape, high-tension breaks, GOAT transitions, and high-T row renorm inside each step. Phase 2 breaks use state − prev_state (F.normalize) with tension-scaled magnitude; see Phase 2.
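A sketch of the token-path tension formula above (the weights λ and μ are illustrative placeholders):

```python
import torch
import torch.nn.functional as F

def token_tension(delta_energy, fast, slow, readout_logits, lam=1.0, mu=0.1):
    """delta_energy: scalar tensor; fast, slow: (D,) states; readout_logits: (vocab,)."""
    # T = |dE_state| + lam * (1 - cos(fast, slow)) + mu * H(readout_logits)
    misalign = 1.0 - F.cosine_similarity(fast, slow, dim=-1)
    probs = F.softmax(readout_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return delta_energy.abs() + lam * misalign + mu * entropy
```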
python3 sandbox.py [options] # or: source .venv/bin/activate && python sandbox.py
Data & tokenizer:
--corpus PATH Training text (default: data/corpus.txt)
--dataset-path PATH Alias for --corpus (takes precedence if set)
--dataset-source {local,tinystories,fineweb-edu} Hugging Face corpus (requires `datasets`); ignores --corpus
--hf-cache-dir PATH HF materialized text cache (default: data/cache/hf)
--hf-max-rows N HF rows to read (default: 50000)
--hf-max-chars N Optional total character cap (0 = none)
--hf-refresh Rebuild HF cache file
--no-synthetic-fallback Error if corpus missing/too small instead of temp synthetic text
--val-fraction FLOAT Token-level val hold-out in stream mode (default: 0.05). Use ~0.3 if you need many val windows; 0 = off.
--tokenizer {tiktoken,fallback} BPE mode (default: fallback)
--vocab-cap INT Max BPE vocab when using tiktoken mode (default: 32768)
--seq-len INT Alias for --window-size
--batch-size INT Alias for --trajectory-batch-size
--shuffle-buffer INT Pipeline shuffle buffer (line-based mode only; default: 2048)
--no-streaming-dataset Legacy line-based corpus (short lines dropped). Default: stream whole file as tokens.
Training:
--window-size INT Context window W (default: 6)
--num-dynamics-steps INT, --max-window-steps INT
Max outer attractor steps per window (default: 32)
--convergence-epsilon FLOAT Early exit when B=1 and ‖ΔS‖ or |ΔT_mean| below this after min steps (0 = all outer steps; B>1 ignores early exit)
--min-attractor-steps INT Minimum outer steps before early exit may trigger (default: 2, ≥2)
--trajectory-batch-size INT Batch size for trajectory mode (default: 64, need ≥2)
--loss-mode {trajectory,ce}
--trajectory-intermediate-ce-weight W Add W * mean CE over outer dynamics steps (default: 0)
--trajectory-guidance-nudge-scale BETA Per outer step: S <- S + beta * (T - S) when batch targets T exist (default: 0)
--trajectory-guidance-mse-weight W Add W * MSE(S_pred, T) when batch targets T exist (default: 0)
--trajectory-guidance-states PATH.pt Load [n_train_windows, W, D] aligned with stream windows (mutually exclusive with from-embed)
--trajectory-guidance-from-embed Precompute T from token embeddings (stream training only)
--trajectory-guidance-embed-batch-size N Batch size for from-embed precompute (default: 256)
--token-aux-ce FLOAT Aux CE on readout_window_logits path in trajectory mode (default: 0.2)
--readout-aux-alpha FLOAT Aux CE on single-state readout (default: 0.15; 0 = off)
--grad-clip FLOAT Optional global grad-norm clip (default: off)
--lr, --lr-decay-every, --lr-gamma
--epoch-copies INT Repeat training lines per epoch
--max-epochs N, --epochs N Number of training epochs (default: 3)
--seed INT
Device & checkpointing:
--device auto|cpu|cuda|cuda:N
--resume-checkpoint PATH
--save-every N Save every N optimizer steps (0 = final only)
--checkpoint-dir PATH Default: ./checkpoints
Integrations:
--use-substrate TSCore LLMSubstrateNode after each batch
--use-goat-memory GoatMemoryManager + window-path signal injection
--use-lorentz Lorentzian positional coupling in window dynamics (vectorized path only; default: off)
--dynamics {simple,vectorized} Default: vectorized (MultiHeadDynamics); simple = legacy single-matrix drift
Phase 0.5 (instrumentation):
--phase05-log-metrics Per-batch diagnostics + window trace for CSV
--phase05-batch-metrics-csv PATH Append batch rows (implies log-metrics); see PHASE05_BATCH_CSV_HEADER
--phase05-enforce-negdef-diffusion
--phase05-adaptive-window-dt
--phase05-tension-w w1,w2,w3
--phase05-multi-negative Trajectory: multi-shuffle negatives
--phase05-num-negatives K (with multi-negative; default 4)
--phase05-traj-temperature FLOAT
Phase 1 (multi-head + window C):
--phase1-num-heads H
--phase1-head-dim-mode {shared,split}
--phase1-interaction-scale FLOAT
--phase1-enable-window-interaction
--phase1-head-diversity-weight FLOAT
--phase1-enable-per-head-tension
Phase 2 (breaks + stable routing):
--phase2-disable-directional-break Legacy Gaussian breaks
--phase2-break-base-strength, --phase2-break-min-scale, --phase2-break-max-scale
--phase2-break-t-target FLOAT Reference T in α scaling
--phase2-enable-break-rejection
--phase2-disable-residual-mixing Linear W_mix only
--phase2-mixing-gate-init FLOAT
--phase2-interaction-reg-weight FLOAT ‖C−I‖² on loss (needs window interaction)
--phase2-interaction-decay-tau TAU exp(−|i−j|/τ) mask on C
--phase2-enable-head-tension-weighting
--phase2-store-break-memory
Logging:
--epoch-metrics-csv PATH Per-epoch CSV (see below)
--eval-results-json PATH After training: val CE, val PPL, checkpoint path (same val split as training)
--log-hard-batch-loss-above FLOAT
--baseline-out PATH Phase-0 baseline snapshot text file
Misc:
--quick-test Window sanity checks, exit
--debug Concise [debug] lines at setup, each epoch, first-batch grad norm (trajectory)
When --epoch-metrics-csv is set, each row includes: epoch, loss_mode, mean_loss, train_ce (mean batch CE from the training readout path — readout_window_logits — during the epoch), val_ce (held-out mean_cross_entropy_eval, empty if no val), train_traj_contrast (last training batch trajectory loss snapshot), val_traj_contrast (full val-set mean when val exists), mean_final_step_tension, max_batch_loss, lr, global_step, tscore_evolves, tscore_last_tension (0 if substrate disabled).
Per-batch CSV (--phase05-batch-metrics-csv): separate file; one row per optimizer step with window tension curves, trajectory margins, break counters, and (when enabled) Phase 1 / Phase 2 columns — see PHASE05_BATCH_CSV_HEADER in sandbox.py. Plot with python3 scripts/plot_phase05_metrics.py PATH --out DIR.
Validation perplexity: PPL_val = exp(val_ce) when val_ce is finite.
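A small sketch for turning the per-epoch CSV into validation perplexity (assumes pandas is installed; column names as listed above):

```python
import math
import pandas as pd

df = pd.read_csv("metrics.csv")                         # --epoch-metrics-csv output
df["val_ppl"] = df["val_ce"].map(
    lambda ce: math.exp(ce) if pd.notna(ce) else float("nan")
)
print(df[["epoch", "train_ce", "val_ce", "val_ppl"]])
```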
Use the same --seed, corpus, and hyperparameters; only add --use-goat-memory for the treatment run. Log with --epoch-metrics-csv and compare val_ce / mean_loss curves (or exp(val_ce) for val perplexity).
On tiny corpora (under a few thousand tokens), val CE and GOAT A/B deltas are integration checks only, not evidence of real model quality.
mkdir -p experiments/goat_ab
# Baseline
python3 sandbox.py \
--corpus data/corpus.txt \
--val-fraction 0.3 \
--seed 42 \
--device cpu \
--tokenizer fallback \
--max-epochs 30 \
--epoch-metrics-csv experiments/goat_ab/baseline_42.csv
# +GOAT
python3 sandbox.py \
--corpus data/corpus.txt \
--val-fraction 0.3 \
--seed 42 \
--device cpu \
--tokenizer fallback \
--use-goat-memory \
--max-epochs 30 \
--epoch-metrics-csv experiments/goat_ab/goat_42.csv

Clone with:

git clone --recurse-submodules https://github.com/BoggersTheFish/BoggersTheLLM.git

If already cloned:

git submodule update --init --recursive

| Path | Repo | Branch | Role |
|---|---|---|---|
| `vendor/GOAT-TS` | GOAT-TS | main | Constraint-graph engine, tension semantics, memory transitions |
| `vendor/TS-Core` | TS-Core | master | UniversalLivingGraph, WaveCycleRunner (Rust + Python fallback) |
| `vendor/ts-llm` | ts-llm | main | Tokenizer (tiktoken), hierarchical fast/slow dynamics, attractor LLM package |
To update a submodule to latest:
git -C vendor/<name> pull
git add vendor/<name>
git commit -m "chore: bump <name> submodule"First real training run on TinyStories (120 stories). Mean trajectory loss: 2.3759 Val trajectory contrast: 0.112059 Checkpoint: ckpt_step0000274.pt