TrainKeeper is a minimal-decision, high-signal toolkit for building reproducible, debuggable, and efficient ML training systems. It adds guardrails inside training loops without replacing your existing stack.
Most training failures happen silently inside execution loops: non-determinism, data drift, unstable gradients, and inconsistent environments. TrainKeeper addresses these with zero-config, composable modules.
- Zero-Surprise Reproducibility: automatic seed setting, environment capture, and git state locking.
- Data Integrity: schema inference and drift detection that catch data bugs before training wastes GPU hours.
- Distributed Made Easy: auto-configured DDP and FSDP with a single line of code.
- Resource Efficiency: GPU memory profiling and smart checkpointing that respects disk limits.
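For context, "git state locking" amounts to recording the exact commit and whether the working tree is dirty alongside each run. A minimal sketch of the idea using plain `subprocess` (illustrative only, not TrainKeeper's implementation):

```python
import subprocess

def capture_git_state() -> dict:
    """Record the current commit hash and whether the working tree has uncommitted changes."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {"commit": commit, "dirty": bool(status.strip())}

print(capture_git_state())  # e.g. {'commit': 'abc123...', 'dirty': False}
```

TrainKeeper's `@run_reproducible(auto_capture_git=True)` decorator, shown in the quickstart below, handles this capture for you: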
```python
from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    # your training code
    print("TrainKeeper is running.")

if __name__ == "__main__":
    train()
```

Minimal check:
```bash
pip install trainkeeper
tk --help
```

Key files: `experiment.yaml`, `run.json`, `seeds.json`, `system.json`, `env.txt`, `run.sh`
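The JSON artifacts are plain files, so you can inspect them without any tooling. A quick sketch using only the standard library (the run directory path below is hypothetical; substitute your own):

```python
import json
from pathlib import Path

run_dir = Path("runs/exp-aaa")  # hypothetical path; use your actual run directory
for name in ("run.json", "seeds.json", "system.json"):
    path = run_dir / name
    if path.exists():
        data = json.loads(path.read_text())
        # Assuming each file holds a JSON object; print its top-level keys.
        print(f"{name}: {sorted(data.keys())}")
```

For comparing two runs end to end, `tk compare` (below) does this for you.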
```bash
tk init
tk run -- python train.py
tk compare exp-aaa exp-bbb
tk repro-summary scenario_runs/
```

```bash
pip install trainkeeper
```

Optional extras:
| Extra | Provides |
|---|---|
| `trainkeeper[torch]` | PyTorch helpers |
| `trainkeeper[vision]` | vision benchmarks |
| `trainkeeper[nlp]` | NLP benchmarks |
| `trainkeeper[tabular]` | tabular benchmarks |
| `trainkeeper[wandb]` | W&B integration |
| `trainkeeper[mlflow]` | MLflow integration |
| `trainkeeper[dashboard]` | NEW: interactive Streamlit dashboard |
| `trainkeeper[all]` | all features |
Launch a beautiful, interactive dashboard to explore your experiments:
```bash
pip install trainkeeper[dashboard]
tk dashboard
```

Features:
- Experiment Explorer: browse and filter all experiments with metadata
- Metric Comparison: interactive Plotly charts comparing metrics across runs
- Data Drift Analysis: visualize schema changes and data quality
- System Monitor: track GPU usage, dependencies, and reproducibility score
The dashboard provides a modern, gradient-based UI with:
- Real-time filtering and search
- Interactive visualizations
- Reproducibility scoring
- Export capabilities
Open http://localhost:8501 after running `tk dashboard` to access the interface.
Stop fighting with `torch.distributed`. TrainKeeper handles the setup for you:
```python
from trainkeeper.distributed import distributed_training, wrap_model_ddp, wrap_model_fsdp

with distributed_training() as dist_config:
    model = MyModel()
    model = wrap_model_ddp(model, dist_config)  # That's it!

    # Or for large models:
    # model = wrap_model_fsdp(model, dist_config)

    # Your training code works exactly the same
```

Features:
- Auto-detects `torchrun`, SLURM, or manual setup
- DDP support with one function call
- NEW: FSDP (Fully Sharded Data Parallel) support for large models; just replace `wrap_model_ddp` with `wrap_model_fsdp`
- Smart distributed checkpointing
- Distributed sampler creation
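For reference, the boilerplate that `distributed_training()` and `wrap_model_ddp` hide looks roughly like the standard `torch.distributed` setup below. This is a sketch assuming `torchrun`'s environment-variable convention, not TrainKeeper's internals:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def manual_ddp_setup(model: torch.nn.Module) -> torch.nn.Module:
    """The boilerplate the helpers hide: process group init, device placement, DDP wrap."""
    dist.init_process_group(backend="nccl")      # torchrun provides RANK, WORLD_SIZE, MASTER_ADDR/PORT
    local_rank = int(os.environ["LOCAL_RANK"])   # set per process by torchrun
    torch.cuda.set_device(local_rank)
    return DDP(model.to(local_rank), device_ids=[local_rank])
```

With TrainKeeper, that setup collapses to the context manager shown above, and launching stays the same: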
```bash
# Single GPU → Multi-GPU with ZERO code changes
torchrun --nproc_per_node=4 train.py
```

The #1 pain point in deep learning = solved.
```python
from trainkeeper.gpu_profiler import GPUProfiler

profiler = GPUProfiler()
profiler.start()

for batch in dataloader:
    profiler.step("forward")
    loss = model(batch)
    profiler.step("backward")
    loss.backward()

report = profiler.stop()
print(report.summary())  # Get actionable recommendations!
```

What you get:
- Memory leak detection
- Automatic optimization recommendations
- Peak/average/fragmentation analysis
- Optimal batch size finder
Example output:

```
Recommendations:
  1. Memory fragmentation detected (35%). Try:
     • torch.cuda.empty_cache() periodically
  2. Consider gradient checkpointing to trade compute for memory
```
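The raw numbers behind a report like this come from CUDA's memory statistics. A minimal sketch of how comparable figures can be collected with plain PyTorch (the fragmentation estimate is a simple reserved-vs-allocated heuristic, not necessarily TrainKeeper's formula):

```python
import torch

def gpu_memory_snapshot(device: int = 0) -> dict:
    """Collect basic CUDA memory figures comparable to the profiler report."""
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    peak = torch.cuda.max_memory_allocated(device)
    # Heuristic: memory held by the caching allocator but not currently
    # assigned to tensors is a rough proxy for fragmentation.
    fragmentation = 1 - allocated / reserved if reserved else 0.0
    return {
        "allocated_mb": allocated / 2**20,
        "reserved_mb": reserved / 2**20,
        "peak_mb": peak / 2**20,
        "fragmentation": round(fragmentation, 2),
    }

if torch.cuda.is_available():
    print(gpu_memory_snapshot())
```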
Never run out of disk space again. Automatic cleanup based on your metrics:
```python
from trainkeeper.checkpoint_manager import CheckpointManager

manager = CheckpointManager(
    keep_best=3,       # Keep top 3 by metric
    keep_last=2,       # Keep 2 most recent
    metric="val_acc",
    mode="max",        # Higher is better
    compress=True,     # Auto-compress old checkpoints
)

# During training
manager.save(
    model=model,
    optimizer=optimizer,
    epoch=epoch,
    metrics={"val_acc": 0.95, "loss": 0.05},
)
# Old checkpoints automatically cleaned up!
```

Features:
- Automatic cleanup (keep best N + last N)
- Optional compression (gzip)
- Checkpoint integrity hashing
- Cloud sync ready (S3, GCS, Azure)
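Checkpoint integrity hashing is conceptually simple: store a digest of each file so later loads can detect corruption. A sketch of the idea with `hashlib` (illustrative; the checkpoint path is hypothetical and TrainKeeper's on-disk format may differ):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

ckpt = Path("checkpoints/epoch_010.pt")  # hypothetical checkpoint path
if ckpt.exists():
    print(ckpt.name, sha256_of(ckpt))
```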
| Module | Purpose |
|---|---|
| `experiment` | reproducible runs + environment capture |
| `datacheck` | schema inference + drift detection |
| `trainutils` | efficient training primitives |
| `debugger` | hooks + failure snapshots |
| `monitor` | runtime metrics + prediction drift |
| `pkg` | export helpers |
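As an illustration of what the `datacheck` module targets, schema drift can be detected by comparing a reference snapshot of dtypes and null counts against a new batch. A minimal pandas sketch (not TrainKeeper's implementation; the column names are made up):

```python
import pandas as pd

reference = pd.DataFrame({"age": [31, 45, 27], "income": [50_000.0, 72_000.0, 41_000.0]})
incoming = pd.DataFrame({"age": ["31", "45", "27"], "income": [50_000.0, None, 41_000.0]})

# Schema drift: a column disappeared or its dtype changed between snapshots.
for col in reference.columns:
    if col not in incoming.columns:
        print(f"missing column: {col}")
    elif reference[col].dtype != incoming[col].dtype:
        print(f"dtype drift in {col}: {reference[col].dtype} -> {incoming[col].dtype}")

# Data quality: new nulls appearing in previously complete columns.
new_nulls = incoming.isna().sum() - reference.isna().sum()
print(new_nulls[new_nulls > 0])
```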
The following scenario suites and the system-test runner live in the repository only and are not included in the PyPI package:
- `scenarios/scenario1_reproducibility/`
- `scenarios/scenario2_data_integrity/`
- `scenarios/scenario3_training_robustness/`
- `system_tests/runner.py`
| Scenario | Purpose | Key output | Outcome |
|---|---|---|---|
| reproducibility | deterministic runs | run.json, metrics.json | consistent hashing |
| data integrity | silent data bugs | schema + drift reports | detected corruptions |
| robustness | model instability | debug reports | captured failures |
Run the cross-scenario validation suite:
```bash
tk system-check
```

Outputs: `scenario_results/system_summary.md` and `scenario_results/unified_failure_matrix.json`.
```bash
tk init
tk run -- python train.py
tk replay <exp-id> -- python train.py
tk compare <exp-a> <exp-b>
tk doctor
tk repro-summary <runs-dir>
tk system-check
tk dashboard        # NEW: Launch interactive dashboard
```

Examples:
- `examples/quickstart.py`
- `examples/datacheck_drift.py`
- `examples/official_demo.py`
- `examples/demo.py`
See `benchmarks/` for the baseline suite and real pipelines.
- `docs/architecture.md`
- `docs/benchmark_plan.md`
- `docs/benchmarks.md`
- `docs/hypotheses.md`
- `docs/research_problem.md`
- `docs/packaging.md`
```bash
python -m build
twine check dist/*
tk system-check
```
See docs/architecture.md for the system overview and component boundaries.
```bash
pip install -e .[dev,torch]
pytest
mkdocs serve
```

We welcome issues and PRs. Please:
- open an issue with the problem or proposal
- keep changes scoped and tested
- run `pytest` and `tk system-check` before submitting

