Merged
1 change: 0 additions & 1 deletion `.gitignore`

```
@@ -38,6 +38,5 @@ pytest-*.xml
*.log
*.txt
*.pyz
*.png
*.metadata
*.json
```
253 changes: 253 additions & 0 deletions `AGENTS.md`
# AGENTS.md

## Project overview

PepSeqPred is a residue-level epitope prediction pipeline for peptide and protein workflows.

This repository is organized around:
- `src/pepseqpred/apps/` for user-facing CLIs
- `src/pepseqpred/core/` for reusable pipeline logic
- `scripts/hpc/` for SLURM batch execution on GPU clusters
- `tests/` for `unit`, `integration`, and `e2e` coverage
- `envs/`, `localdata/`, `notebooks/`, and `dist/` as supporting project directories

Primary goals when working in this repo:
- preserve scientific reproducibility
- keep training and evaluation behavior stable unless explicitly asked to change it
- prefer minimal, targeted edits
- avoid expensive or risky compute by default
- maintain compatibility with existing CLIs, scripts, and downstream outputs

## Repository structure

### Application entrypoints

The main CLIs are:
- `pepseqpred-esm`
- `pepseqpred-labels`
- `pepseqpred-predict`
- `pepseqpred-preprocess`
- `pepseqpred-train-ffnn`
- `pepseqpred-train-ffnn-optuna`

These map to files in `src/pepseqpred/apps/`.

### Core package layout

Important subpackages under `src/pepseqpred/core/`:
- `data/` for dataset loading
- `embeddings/` for ESM based embedding generation
- `io/` for logging and file writing helpers
- `labels/` for label generation logic
- `models/` for model definitions
- `predict/` for inference
- `preprocess/` for preprocessing workflows
- `train/` for DDP, splitting, metrics, thresholds, trainer logic, seeds, and class weighting

### HPC scripts

Batch scripts live in `scripts/hpc/`. These are part of the intended workflow, especially for:
- embedding generation
- label generation
- preprocessing
- prediction
- FFNN training
- FFNN Optuna tuning

Treat these scripts as first-class project interfaces, not throwaway helpers.

## General working rules for any agents (Codex, Claude Code, etc.)

Before editing:
- inspect the relevant files first
- understand the existing CLI and core flow before proposing changes
- prefer the smallest possible diff
- do not rename modules, scripts, CLI flags, or output files unless the task requires it
- do not introduce dependencies unless clearly justified

While editing:
- follow the existing package structure
- preserve current naming conventions and CLI semantics
- preserve public script behavior unless the user explicitly asks for a behavior change
- keep functions explicit and readable
- add or update docstrings when behavior changes
- avoid unrelated refactors or cosmetic churn

After editing:
- run the smallest relevant validation first
- report exactly what changed
- note anything you could not validate

## Reproducibility and experiment safety

This is research code. Changes can silently invalidate experiments.

Always preserve:
- deterministic seed handling
- train, validation, and test split semantics
- masking behavior for uncertain labels
- metric calculation behavior
- checkpoint and result artifact formats, unless explicitly changing schema
- per run and per trial traceability
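Deterministic splitting with a fixed seed is the backbone of several items on this list. A minimal sketch, assuming the splits are cut from shuffled sample IDs (the function name and fractions are illustrative; the real logic lives in `src/pepseqpred/core/train/`):

```python
import random


def deterministic_split(ids, seed=42, frac_train=0.8, frac_val=0.1):
    """Shuffle sample IDs with a fixed seed and cut train/val/test splits.

    Illustrative sketch only: sorting first makes the result independent of
    input order, and a local Random instance avoids touching global RNG state.
    """
    rng = random.Random(seed)
    shuffled = sorted(ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * frac_train)
    n_val = int(len(shuffled) * frac_val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Because the seed and the sort are both explicit, repeated calls with the same inputs reproduce the same splits, which is exactly the property a casual edit can silently break.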

Do not:
- change default seeds casually
- change label meaning or preprocessing behavior without documenting it
- mix outputs from different experiments into ambiguous files
- overwrite prior results when a new output path is safer

If a change affects training or evaluation, explicitly check for:
- data leakage
- split leakage
- rank-specific side effects
- output collisions across repeated runs or trials
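A cheap leakage check is to assert that the splits are disjoint on sample identity; a sketch under the assumption that splits are lists of hashable IDs (the real checks may key on protein or family IDs instead):

```python
def assert_disjoint_splits(train_ids, val_ids, test_ids):
    """Fail loudly if any sample ID appears in more than one split."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    overlaps = {
        "train/val": train & val,
        "train/test": train & test,
        "val/test": val & test,
    }
    leaked = {name: ids for name, ids in overlaps.items() if ids}
    if leaked:
        raise ValueError(f"split leakage detected: {leaked}")
```

Running a check like this after any change to splitting logic is far cheaper than discovering inflated metrics later.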

## Distributed training and HPC guardrails

PepSeqPred training is designed around multi-GPU DistributedDataParallel and SLURM-based execution.

When touching training or Optuna code:
- assume jobs may run on at least 4 GPUs through SLURM
- be careful with `torch.distributed` collectives, barriers, and rank-scoped logic
- ensure shared artifacts are only written by the correct rank
- avoid introducing deadlocks
- do not make changes that multiply compute cost unexpectedly
- preserve scheduler-friendly behavior

Prefer:
- local dry runs
- tiny subsets
- reduced epoch smoke tests
- single-rank validation where possible before recommending full-scale runs

Do not assume:
- local laptop training is practical
- interactive GPU access exists
- paths outside repo root are portable unless already established by project scripts
- that `sbatch` works locally; it will fail outside a SLURM environment

## Data and artifact handling

Never modify raw or source data in place.

Prefer:
- writing derived outputs to new paths
- append-safe logs and result files
- explicit artifact names that encode experiment identity
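An artifact name that encodes experiment identity can be built with a small helper; the fields below are illustrative, not the project's actual naming schema:

```python
def run_artifact_name(experiment, seed, trial=None, ext="csv"):
    """Build a filename that records which experiment produced the artifact."""
    parts = [experiment, f"seed{seed}"]
    if trial is not None:
        # zero-pad trial numbers so filenames sort in trial order
        parts.append(f"trial{trial:03d}")
    return "_".join(parts) + f".{ext}"
```

Names built this way make output collisions across repeated runs or Optuna trials much easier to spot.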

Be careful with:
- checkpoint directories
- CSV summaries
- Optuna trial outputs
- per-rank logging
- temporary files on shared scratch storage

If a schema or file format must change:
- make the change explicit
- update readers and writers together
- document the migration clearly

## Validation expectations

Use the repo’s configured tooling where practical.

Default validation order:
1. `ruff check .`
2. targeted `pytest` invocation for affected tests
3. broader `pytest` if the change is cross-cutting
4. only then consider heavier runtime checks

Important:
- do not run long HPC style training jobs unless explicitly asked
- do not present expensive end-to-end training as routine validation
- for training code, prefer smoke tests over full experiments
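A smoke test usually just means running the normal entrypoint with drastically reduced settings. A sketch with hypothetical config keys (the real CLI flags may differ):

```python
# Hypothetical cheap settings for a quick sanity run; key names are illustrative.
SMOKE_OVERRIDES = {"epochs": 1, "max_samples": 64, "num_workers": 0}


def apply_smoke_overrides(config):
    """Return a copy of config with expensive settings replaced by cheap ones."""
    merged = dict(config)  # copy so the caller's config is untouched
    merged.update(SMOKE_OVERRIDES)
    return merged
```

The point is to exercise the full code path (data loading, forward pass, checkpointing) while keeping runtime in seconds rather than hours.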

If validation is incomplete:
- say what was not run
- say why
- identify the main remaining risks

## Commands

Common commands:
- install package: `pip install -e .`
- install dev tools: `pip install -e .[dev]`
- run tests: `pytest`
- lint: `ruff check .`
- format: `ruff format .`

Available CLIs:
- `pepseqpred-esm`
- `pepseqpred-labels`
- `pepseqpred-predict`
- `pepseqpred-preprocess`
- `pepseqpred-train-ffnn`
- `pepseqpred-train-ffnn-optuna`

## Testing guidance

The repo has:
- `tests/unit/`
- `tests/integration/`
- `tests/e2e/`

Prefer:
- unit tests for isolated logic changes
- integration tests for CLI to core interactions
- e2e only when a full pipeline boundary changed

Do not expand test scope unnecessarily if a small targeted test is enough.
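A targeted unit test can be as small as one function with the matching marker. In the sketch below, `validate_threshold` is a hypothetical stand-in for a real helper, used only to show the shape of such a test:

```python
import pytest


def validate_threshold(t):
    """Hypothetical helper: reject thresholds outside the open interval (0, 1)."""
    if not 0.0 < t < 1.0:
        raise ValueError(f"threshold must be in (0, 1), got {t}")
    return t


@pytest.mark.unit
def test_validate_threshold_rejects_out_of_range():
    with pytest.raises(ValueError):
        validate_threshold(1.5)
    assert validate_threshold(0.5) == 0.5
```

A test this size pins the fixed behavior without dragging in fixtures or pipeline state.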

## Documentation expectations

When behavior changes, update the relevant:
- docstrings
- CLI help text
- comments near tricky distributed logic
- any usage examples affected by the change

Note:
- the current root `README.md` is minimal, so do not assume broader user documentation already exists
- if you add a major new workflow, include enough inline guidance for future contributors

## What not to change without explicit approval

Do not, unless clearly requested:
- redesign package structure
- replace DDP or SLURM workflows
- alter default experiment semantics
- change model architecture defaults broadly
- change preprocessing formulas or label logic
- rewrite output schemas
- remove test categories
- introduce large framework migrations

## Preferred task workflow

For most tasks:
1. inspect relevant app, core, test, and script files
2. identify the smallest safe fix
3. implement minimally
4. run focused validation
5. summarize edits, validation, and remaining risks

## Directory specific notes

### `src/pepseqpred/apps/`
- preserve CLI compatibility
- do not break argument names or defaults without explicit instruction
- keep orchestration logic thin when possible

### `src/pepseqpred/core/train/`
- highest risk area
- be conservative with splits, seeds, metrics, thresholds, and DDP behavior
- verify rank-aware writes and collective calls carefully

### `scripts/hpc/`
- preserve SLURM semantics
- avoid hard-coding user-specific assumptions unless they are already part of the script conventions
- comment any scheduler-related changes clearly

### `tests/`
- add targeted coverage for bug fixes
- do not rewrite unrelated fixtures or tests just for style
101 changes: 101 additions & 0 deletions `CONTRIBUTING.md`
# Contributing to PepSeqPred

This document defines required contribution workflow, naming conventions, and pull request expectations for this repository.

## Core Rules

- Do not develop directly on `main`.
- All changes must be made on a separate branch and merged via pull request.
- Branch, issue, and commit names must follow the conventions below.
- Keep titles and descriptions short, clear, and specific.

## Required Contribution Workflow

1. Create or confirm an issue for the work.
2. Create a branch from the latest `main` using the branch naming rules.
3. Implement the change and add/update relevant tests.
4. Run required local checks.
5. Open a pull request into `main` with required summary and verification details.

## Branch Naming Conventions

Use lowercase and hyphen-separated descriptions.

Accepted patterns:
- `feat/short-description`
- `fix/short-description`
- `docs/short-description`
- `chore/short-description`
- `test/short-description`
- `refactor/short-description`

You can be more specific by including the number of the associated issue, as in the examples below.

Examples:
- `feat/add-sharded-embedding-index-logging`
- `fix/issue-42-threshold-range-validation`
- `docs/update-readme-pipeline-section`
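The accepted patterns can be captured in a single regular expression, which is handy for a pre-push hook or CI check; a sketch:

```python
import re

# feat|fix|docs|chore|test|refactor, a slash, then lowercase hyphen-separated words
BRANCH_RE = re.compile(r"^(feat|fix|docs|chore|test|refactor)/[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_branch_name(name):
    """True when a branch name follows the repository convention."""
    return BRANCH_RE.fullmatch(name) is not None
```

The `[a-z0-9]+(-[a-z0-9]+)*` tail also accepts issue-number segments such as `issue-42`.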

## Issue Naming and Content

Issue title format:
- `<type>: short description`

Examples:
- `bug: label shard mismatch across embedding keys`
- `docs: add hpc setup troubleshooting`
- `chore: tighten local test gating in README`

Issue body requirements:
- `Summary`: a short statement of the problem or request.
- `Done when`: acceptance criteria, if applicable.

## Commit Message Conventions

Commit title format:
- `<type>: short description`

Examples:
- `bug: fix id-family key validation in labels builder`
- `chore: remove unused import from prediction cli`
- `docs: add contributing workflow and naming rules`

Commit guidance:
- Keep the first line concise and specific.
- Keep one logical change per commit where possible.

## Pull Request Requirements

All pull requests to `main` must include:
- A concise summary of what changed.
- Linked issue(s) (for example, `Fixes #42`).
- A concise "How to verify" section with exact commands.
- Any new or updated unit, integration, or e2e tests needed to verify behavior changes.

PRs should not include changes unrelated to the issue unless they are minor; use your own discretion.

## Verification Expectations Before PR

Run these checks locally before opening a PR:

```bash
ruff check .
pytest -m "unit or integration or e2e"
```

If behavior changed, include targeted test commands in the PR verification section, along with expected outcomes.

## PR Checklist

- [ ] Branch name follows convention.
- [ ] Issue title/body follow convention (`Summary` and `Done when` included when applicable).
- [ ] Commit messages follow `<type>: short description`.
- [ ] No development occurred directly on `main`.
- [ ] PR includes concise summary and reproducible verification steps.
- [ ] Relevant unit/integration/e2e tests were added or updated.

## Maintainer Support and Escalation

- Use GitHub issues for normal development questions, bug reports, and feature requests.
- Use email for private or sensitive matters that should not be posted publicly.
- Maintainer contact: [Jeffrey Hoelzel](mailto:jmh2338@nau.edu) or [Jason Ladner](mailto:jason.ladner@nau.edu).