
TrainKeeper


Production-Grade Training Guardrails for PyTorch

Reproducible • Debuggable • Distributed • Efficient


TrainKeeper is a minimal-decision, high-signal toolkit for building reproducible, debuggable, and efficient ML training systems. It adds guardrails inside training loops without replacing your existing stack.

⚡️ Why TrainKeeper?

Most failures happen silently inside execution loops: non-determinism, data drift, unstable gradients, and inconsistent environments. TrainKeeper addresses these with zero-config, composable modules:

  • 🔒 Zero-Surprise Reproducibility: Automatic seed setting, environment capture, and git state locking (see the sketch after this list).
  • 🛑️ Data Integrity: Schema inference and drift detection that catch data bugs before training wastes GPU hours.
  • 🚅 Distributed Made Easy: Auto-configured DDP and FSDP with a single line of code.
  • 📉 Resource Efficiency: GPU memory profiling and smart checkpointing that respects disk limits.
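
For context, here is roughly what the reproducibility guardrail automates when done by hand. A minimal sketch in plain PyTorch; TrainKeeper additionally captures the environment and git state:

import os
import random

import numpy as np
import torch

# Seed every RNG that can affect a training run.
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Force deterministic kernels where PyTorch supports them.
torch.use_deterministic_algorithms(True)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required for deterministic cuBLAS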

🚀 Quick start

from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    # your training code
    print("TrainKeeper is running.")

if __name__ == "__main__":
    train()

Minimal check:

pip install trainkeeper
tk --help

Outputs per run

  • experiment.yaml, run.json
  • seeds.json, system.json, env.txt, run.sh
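
A quick way to inspect what was captured. The file names come from the list above; the run directory and the key layout inside each file are hypothetical here:

import json
from pathlib import Path

run_dir = Path("runs/exp-aaa")  # hypothetical run directory
for name in ("run.json", "seeds.json", "system.json"):
    data = json.loads((run_dir / name).read_text())
    print(name, "->", sorted(data)[:5])  # peek at the top-level keys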

Typical workflow

tk init
tk run -- python train.py
tk compare exp-aaa exp-bbb
tk repro-summary scenario_runs/

📦 Install

pip install trainkeeper

Optional extras:

  • trainkeeper[torch] PyTorch helpers
  • trainkeeper[vision] vision benchmarks
  • trainkeeper[nlp] NLP benchmarks
  • trainkeeper[tabular] tabular benchmarks
  • trainkeeper[wandb] W&B integration
  • trainkeeper[mlflow] MLflow integration
  • trainkeeper[dashboard] NEW: Interactive Streamlit dashboard
  • trainkeeper[all] All features

🎨 Interactive Dashboard (NEW in v0.3.0)

Launch a beautiful, interactive dashboard to explore your experiments:

pip install trainkeeper[dashboard]
tk dashboard

Features:

  • πŸ” Experiment Explorer: Browse and filter all experiments with metadata
  • πŸ“ˆ Metric Comparison: Interactive Plotly charts comparing metrics across runs
  • 🌊 Data Drift Analysis: Visualize schema changes and data quality
  • πŸ’» System Monitor: Track GPU usage, dependencies, and reproducibility score

The dashboard provides a modern, gradient-based UI with:

  • Real-time filtering and search
  • Interactive visualizations
  • Reproducibility scoring
  • Export capabilities

Open http://localhost:8501 after running tk dashboard to access the interface.

🚀 Production-Grade Features (NEW)

Distributed Training Made Easy

Stop fighting with torch.distributed. TrainKeeper handles everything:

from trainkeeper.distributed import distributed_training, wrap_model_ddp, wrap_model_fsdp

with distributed_training() as dist_config:
    model = MyModel()
    model = wrap_model_ddp(model, dist_config)  # That's it!
    # Or for large models:
    # model = wrap_model_fsdp(model, dist_config)
    # Your training code works exactly the same

Features:

  • 🔄 Auto-detects torchrun, SLURM, or manual setup
  • 🎯 DDP support with one function call
  • 🚀 NEW: FSDP (Fully Sharded Data Parallel) support for large models
    • Just replace wrap_model_ddp with wrap_model_fsdp!
  • 💾 Smart distributed checkpointing
  • 📊 Distributed sampler creation (plain-PyTorch equivalent sketched below)

# Single GPU → Multi-GPU with ZERO code changes
torchrun --nproc_per_node=4 train.py
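
For reference, the sampler creation that TrainKeeper wraps looks like this in plain PyTorch, assuming an initialized process group (dataset and num_epochs are placeholders):

from torch.utils.data import DataLoader, DistributedSampler

# Each rank gets a disjoint shard of the dataset.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle across ranks every epoch
    for batch in loader:
        ...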

GPU Memory Profiler

GPU memory is the #1 pain point in deep learning. TrainKeeper makes it measurable and actionable:

from trainkeeper.gpu_profiler import GPUProfiler

profiler = GPUProfiler()
profiler.start()

for batch in dataloader:
    profiler.step("forward")
    loss = model(batch)
    profiler.step("backward")
    loss.backward()

report = profiler.stop()
print(report.summary())  # Get actionable recommendations!

What you get:

  • πŸ” Memory leak detection
  • πŸ’‘ Automatic optimization recommendations
  • πŸ“Š Peak/average/fragmentation analysis
  • 🎯 Optimal batch size finder
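
The mechanism behind leak detection is easy to sketch in plain PyTorch (illustrative only, not TrainKeeper's internals): sample allocated memory each step and flag steady growth after warmup.

import torch

samples = []
for step in range(100):
    ...  # one training step
    samples.append(torch.cuda.memory_allocated())

# Steady growth after warmup usually means something is holding references,
# e.g. accumulating loss tensors instead of calling loss.item().
growth = samples[-1] - samples[10]
if growth > 50 * 1024**2:  # more than ~50 MB of drift after warmup
    print(f"Possible leak: +{growth / 1024**2:.0f} MB since step 10")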

Example output:

💡 Recommendations:
  1. Memory fragmentation detected (35%). Try:
     • torch.cuda.empty_cache() periodically
  2. Consider gradient checkpointing to trade compute for memory
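
Recommendation 2 refers to standard PyTorch gradient checkpointing, which recomputes activations during the backward pass instead of storing them. A minimal sketch (blocks is a placeholder for the modules in your forward path):

from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    for block in blocks:
        # Activations inside `block` are recomputed on backward, saving memory.
        x = checkpoint(block, x, use_reentrant=False)
    return x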

Smart Checkpoint Manager

Never run out of disk space again. Automatic cleanup based on your metrics:

from trainkeeper.checkpoint_manager import CheckpointManager

manager = CheckpointManager(
    keep_best=3,       # Keep top 3 by metric
    keep_last=2,       # Keep 2 most recent
    metric="val_acc",
    mode="max",        # Higher is better
    compress=True      # Auto-compress old checkpoints
)

# During training
manager.save(
    model=model,
    optimizer=optimizer,
    epoch=epoch,
    metrics={"val_acc": 0.95, "loss": 0.05}
)
# Old checkpoints automatically cleaned up!

Features:

  • 🧹 Automatic cleanup (keep best N + last N)
  • 📦 Optional compression (gzip)
  • 🔐 Checkpoint integrity hashing
  • ☁️ Cloud sync ready (S3, GCS, Azure)
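
Integrity hashing is simple to picture; an illustrative sketch (not the manager's internal API) that computes a SHA-256 per checkpoint file, to be stored alongside it and verified before loading:

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Verify before loading:
# assert sha256_of("ckpt.pt") == recorded_hash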

Core modules

  • experiment: reproducible runs + environment capture
  • datacheck: schema inference + drift detection
  • trainutils: efficient training primitives
  • debugger: hooks + failure snapshots
  • monitor: runtime metrics + prediction drift
  • pkg: export helpers
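
To make the datacheck idea concrete, here is an illustrative schema + drift check (the function and thresholds are assumptions, not datacheck's API):

import pandas as pd

def check_drift(reference: pd.DataFrame, current: pd.DataFrame) -> list[str]:
    issues = []
    for col in reference.columns:
        if col not in current.columns:
            issues.append(f"missing column: {col}")
        elif reference[col].dtype != current[col].dtype:
            issues.append(f"dtype changed: {col}")
        elif pd.api.types.is_numeric_dtype(reference[col]):
            # Crude drift signal: mean shifted by more than 3 reference stds.
            if abs(current[col].mean() - reference[col].mean()) > 3 * reference[col].std():
                issues.append(f"mean shift: {col}")
    return issues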

Scenarios & system tests (repo-only)

These are not included in the PyPI package.

  • scenarios/scenario1_reproducibility/
  • scenarios/scenario2_data_integrity/
  • scenarios/scenario3_training_robustness/
  • system_tests/runner.py

Scenarios summary

Scenario          Purpose              Key output               Outcome
reproducibility   deterministic runs   run.json, metrics.json   consistent hashing
data integrity    silent data bugs     schema + drift reports   detected corruptions
robustness        model instability    debug reports            captured failures

System hardening

Run the cross-scenario validation suite:

tk system-check

Outputs: scenario_results/system_summary.md and scenario_results/unified_failure_matrix.json.

CLI

tk init
tk run -- python train.py
tk replay <exp-id> -- python train.py
tk compare <exp-a> <exp-b>
tk doctor
tk repro-summary <runs-dir>
tk system-check
tk dashboard  # NEW: Launch interactive dashboard

Examples

  • examples/quickstart.py
  • examples/datacheck_drift.py
  • examples/official_demo.py
  • examples/demo.py

Benchmarks

See benchmarks/ for the baseline suite and real pipelines.

Documentation

  • docs/architecture.md
  • docs/benchmark_plan.md
  • docs/benchmarks.md
  • docs/hypotheses.md
  • docs/research_problem.md
  • docs/packaging.md

How it works (diagram)

TrainKeeper architecture for training-time guardrails

Release checklist

  • python -m build
  • twine check dist/*
  • tk system-check

Architecture diagram

See docs/architecture.md for the system overview and component boundaries.

Development

pip install -e .[dev,torch]
pytest
mkdocs serve

Contributing

We welcome issues and PRs. Please:

  • open an issue with the problem or proposal
  • keep changes scoped and tested
  • run pytest and tk system-check before submitting