Merged
1 change: 0 additions & 1 deletion `.gitignore`

```
@@ -38,6 +38,5 @@ pytest-*.xml
*.log
*.txt
*.pyz
*.png
*.metadata
*.json
```
253 changes: 253 additions & 0 deletions `AGENTS.md`
# AGENTS.md

## Project overview

PepSeqPred is a residue-level epitope prediction pipeline for peptide and protein workflows.

This repository is organized around:
- `src/pepseqpred/apps/` for user-facing CLIs
- `src/pepseqpred/core/` for reusable pipeline logic
- `scripts/hpc/` for SLURM batch execution on GPU clusters
- `tests/` for `unit`, `integration`, and `e2e` coverage
- `envs/`, `localdata/`, `notebooks/`, and `dist/` as supporting project directories

Primary goals when working in this repo:
- preserve scientific reproducibility
- keep training and evaluation behavior stable unless explicitly asked to change it
- prefer minimal, targeted edits
- avoid expensive or risky compute by default
- maintain compatibility with existing CLIs, scripts, and downstream outputs

## Repository structure

### Application entrypoints

The main CLIs are:
- `pepseqpred-esm`
- `pepseqpred-labels`
- `pepseqpred-predict`
- `pepseqpred-preprocess`
- `pepseqpred-train-ffnn`
- `pepseqpred-train-ffnn-optuna`

These map to files in `src/pepseqpred/apps/`.

### Core package layout

Important subpackages under `src/pepseqpred/core/`:
- `data/` for dataset loading
- `embeddings/` for ESM based embedding generation
- `io/` for logging and file writing helpers
- `labels/` for label generation logic
- `models/` for model definitions
- `predict/` for inference
- `preprocess/` for preprocessing workflows
- `train/` for DDP, splitting, metrics, thresholds, trainer logic, seeds, and class weighting

### HPC scripts

Batch scripts live in `scripts/hpc/`. These are part of the intended workflow, especially for:
- embedding generation
- label generation
- preprocessing
- prediction
- FFNN training
- FFNN Optuna tuning

Treat these scripts as first-class project interfaces, not throwaway helpers.

## General working rules for any agents (Codex, Claude Code, etc.)

Before editing:
- inspect the relevant files first
- understand the existing CLI and core flow before proposing changes
- prefer the smallest possible diff
- do not rename modules, scripts, CLI flags, or output files unless the task requires it
- do not introduce dependencies unless clearly justified

While editing:
- follow the existing package structure
- preserve current naming conventions and CLI semantics
- preserve public script behavior unless the user explicitly asks for a behavior change
- keep functions explicit and readable
- add or update docstrings when behavior changes
- avoid unrelated refactors or cosmetic churn

After editing:
- run the smallest relevant validation first
- report exactly what changed
- note anything you could not validate

## Reproducibility and experiment safety

This is research code. Changes can silently invalidate experiments.

Always preserve:
- deterministic seed handling
- train, validation, and test split semantics
- masking behavior for uncertain labels
- metric calculation behavior
- checkpoint and result artifact formats, unless explicitly changing schema
- per run and per trial traceability
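Deterministic splitting with a fixed seed is the backbone of several items on this list. A minimal sketch, assuming the splits are cut from shuffled sample IDs (the function name and fractions are illustrative; the real logic lives in `src/pepseqpred/core/train/`):

```python
import random


def deterministic_split(ids, seed=42, frac_train=0.8, frac_val=0.1):
    """Shuffle sample IDs with a fixed seed and cut train/val/test splits.

    Illustrative sketch only: sorting first makes the result independent of
    input order, and a local Random instance avoids touching global RNG state.
    """
    rng = random.Random(seed)
    shuffled = sorted(ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * frac_train)
    n_val = int(len(shuffled) * frac_val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Because the seed and the sort are both explicit, repeated calls with the same inputs reproduce the same splits, which is exactly the property a casual edit can silently break.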

Do not:
- change default seeds casually
- change label meaning or preprocessing behavior without documenting it
- mix outputs from different experiments into ambiguous files
- overwrite prior results when a new output path is safer

If a change affects training or evaluation, explicitly check for:
- data leakage
- split leakage
- rank-specific side effects
- output collisions across repeated runs or trials
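A cheap leakage check is to assert that the splits are disjoint on sample identity; a sketch under the assumption that splits are lists of hashable IDs (the real checks may key on protein or family IDs instead):

```python
def assert_disjoint_splits(train_ids, val_ids, test_ids):
    """Fail loudly if any sample ID appears in more than one split."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    overlaps = {
        "train/val": train & val,
        "train/test": train & test,
        "val/test": val & test,
    }
    leaked = {name: ids for name, ids in overlaps.items() if ids}
    if leaked:
        raise ValueError(f"split leakage detected: {leaked}")
```

Running a check like this after any change to splitting logic is far cheaper than discovering inflated metrics later.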

## Distributed training and HPC guardrails

PepSeqPred training is designed around multi-GPU DistributedDataParallel and SLURM-based execution.

When touching training or Optuna code:
- assume jobs may run on at least 4 GPUs through SLURM
- be careful with `torch.distributed` collectives, barriers, and rank-scoped logic
- ensure shared artifacts are only written by the correct rank
- avoid introducing deadlocks
- do not make changes that multiply compute cost unexpectedly
- preserve scheduler-friendly behavior

Prefer:
- local dry runs
- tiny subsets
- reduced epoch smoke tests
- single-rank validation where possible before recommending full-scale runs

Do not assume:
- local laptop training is practical
- interactive GPU access exists
- paths outside repo root are portable unless already established by project scripts
- that `sbatch` works locally; it will fail outside a SLURM environment

## Data and artifact handling

Never modify raw or source data in place.

Prefer:
- writing derived outputs to new paths
- append-safe logs and result files
- explicit artifact names that encode experiment identity
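An artifact name that encodes experiment identity can be built with a small helper; the fields below are illustrative, not the project's actual naming schema:

```python
def run_artifact_name(experiment, seed, trial=None, ext="csv"):
    """Build a filename that records which experiment produced the artifact."""
    parts = [experiment, f"seed{seed}"]
    if trial is not None:
        # zero-pad trial numbers so filenames sort in trial order
        parts.append(f"trial{trial:03d}")
    return "_".join(parts) + f".{ext}"
```

Names built this way make output collisions across repeated runs or Optuna trials much easier to spot.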

Be careful with:
- checkpoint directories
- CSV summaries
- Optuna trial outputs
- per-rank logging
- temporary files on shared scratch storage

If a schema or file format must change:
- make the change explicit
- update readers and writers together
- document the migration clearly

## Validation expectations

Use the repo’s configured tooling where practical.

Default validation order:
1. `ruff check .`
2. targeted `pytest` invocation for affected tests
3. broader `pytest` if the change is cross-cutting
4. only then consider heavier runtime checks

Important:
- do not run long HPC style training jobs unless explicitly asked
- do not present expensive end-to-end training as routine validation
- for training code, prefer smoke tests over full experiments
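A smoke test usually just means running the normal entrypoint with drastically reduced settings. A sketch with hypothetical config keys (the real CLI flags may differ):

```python
# Hypothetical cheap settings for a quick sanity run; key names are illustrative.
SMOKE_OVERRIDES = {"epochs": 1, "max_samples": 64, "num_workers": 0}


def apply_smoke_overrides(config):
    """Return a copy of config with expensive settings replaced by cheap ones."""
    merged = dict(config)  # copy so the caller's config is untouched
    merged.update(SMOKE_OVERRIDES)
    return merged
```

The point is to exercise the full code path (data loading, forward pass, checkpointing) while keeping runtime in seconds rather than hours.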

If validation is incomplete:
- say what was not run
- say why
- identify the main remaining risks

## Commands

Common commands:
- install package: `pip install -e .`
- install dev tools: `pip install -e .[dev]`
- run tests: `pytest`
- lint: `ruff check .`
- format: `ruff format .`

Available CLIs:
- `pepseqpred-esm`
- `pepseqpred-labels`
- `pepseqpred-predict`
- `pepseqpred-preprocess`
- `pepseqpred-train-ffnn`
- `pepseqpred-train-ffnn-optuna`

## Testing guidance

The repo has:
- `tests/unit/`
- `tests/integration/`
- `tests/e2e/`

Prefer:
- unit tests for isolated logic changes
- integration tests for CLI to core interactions
- e2e only when a full pipeline boundary changed

Do not expand test scope unnecessarily if a small targeted test is enough.
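A targeted unit test can be as small as one function with the matching marker. In the sketch below, `validate_threshold` is a hypothetical stand-in for a real helper, used only to show the shape of such a test:

```python
import pytest


def validate_threshold(t):
    """Hypothetical helper: reject thresholds outside the open interval (0, 1)."""
    if not 0.0 < t < 1.0:
        raise ValueError(f"threshold must be in (0, 1), got {t}")
    return t


@pytest.mark.unit
def test_validate_threshold_rejects_out_of_range():
    with pytest.raises(ValueError):
        validate_threshold(1.5)
    assert validate_threshold(0.5) == 0.5
```

A test this size pins the fixed behavior without dragging in fixtures or pipeline state.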

## Documentation expectations

When behavior changes, update the relevant:
- docstrings
- CLI help text
- comments near tricky distributed logic
- any usage examples affected by the change

Note:
- the current root `README.md` is minimal, so do not assume broader user documentation already exists
- if you add a major new workflow, include enough inline guidance for future contributors

## What not to change without explicit approval

Do not, unless clearly requested:
- redesign package structure
- replace DDP or SLURM workflows
- alter default experiment semantics
- change model architecture defaults broadly
- change preprocessing formulas or label logic
- rewrite output schemas
- remove test categories
- introduce large framework migrations

## Preferred task workflow

For most tasks:
1. inspect relevant app, core, test, and script files
2. identify the smallest safe fix
3. implement minimally
4. run focused validation
5. summarize edits, validation, and remaining risks

## Directory specific notes

### `src/pepseqpred/apps/`
- preserve CLI compatibility
- do not break argument names or defaults without explicit instruction
- keep orchestration logic thin when possible

### `src/pepseqpred/core/train/`
- highest risk area
- be conservative with splits, seeds, metrics, thresholds, and DDP behavior
- verify rank-aware writes and collective calls carefully

### `scripts/hpc/`
- preserve SLURM semantics
- avoid hard-coding user-specific assumptions unless they are already part of the script conventions
- comment any scheduler-related changes clearly

### `tests/`
- add targeted coverage for bug fixes
- do not rewrite unrelated fixtures or tests just for style
101 changes: 101 additions & 0 deletions `CONTRIBUTING.md`
# Contributing to PepSeqPred

This document defines required contribution workflow, naming conventions, and pull request expectations for this repository.

## Core Rules

- Do not develop directly on `main`.
- All changes must be made on a separate branch and merged via pull request.
- Branch, issue, and commit names must follow the conventions below.
- Keep titles and descriptions short, clear, and specific.

## Required Contribution Workflow

1. Create or confirm an issue for the work.
2. Create a branch from the latest `main` using the branch naming rules.
3. Implement the change and add/update relevant tests.
4. Run required local checks.
5. Open a pull request into `main` with required summary and verification details.

## Branch Naming Conventions

Use lowercase and hyphen-separated descriptions.

Accepted patterns:
- `feat/short-description`
- `fix/short-description`
- `docs/short-description`
- `chore/short-description`
- `test/short-description`
- `refactor/short-description`

You can be more specific by including the number of the associated issue, as in the examples below.

Examples:
- `feat/add-sharded-embedding-index-logging`
- `fix/issue-42-threshold-range-validation`
- `docs/update-readme-pipeline-section`
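The accepted patterns can be captured in a single regular expression, which is handy for a pre-push hook or CI check; a sketch:

```python
import re

# feat|fix|docs|chore|test|refactor, a slash, then lowercase hyphen-separated words
BRANCH_RE = re.compile(r"^(feat|fix|docs|chore|test|refactor)/[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_branch_name(name):
    """True when a branch name follows the repository convention."""
    return BRANCH_RE.fullmatch(name) is not None
```

The `[a-z0-9]+(-[a-z0-9]+)*` tail also accepts issue-number segments such as `issue-42`.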

## Issue Naming and Content

Issue title format:
- `<type>: short description`

Examples:
- `bug: label shard mismatch across embedding keys`
- `docs: add hpc setup troubleshooting`
- `chore: tighten local test gating in README`

Issue body requirements:
- `Summary`: a short statement of the problem or request.
- `Done when`: acceptance criteria, if applicable.

## Commit Message Conventions

Commit title format:
- `<type>: short description`

Examples:
- `bug: fix id-family key validation in labels builder`
- `chore: remove unused import from prediction cli`
- `docs: add contributing workflow and naming rules`

Commit guidance:
- Keep the first line concise and specific.
- Keep one logical change per commit where possible.

## Pull Request Requirements

All pull requests to `main` must include:
- A concise summary of what changed.
- Linked issue(s) (for example, `Fixes #42`).
- A concise "How to verify" section with exact commands.
- Any new or updated unit, integration, or e2e tests needed to verify behavior changes.

PRs should not include changes unrelated to the issue unless they are minor; use your own discretion.

## Verification Expectations Before PR

Run these checks locally before opening a PR:

```bash
ruff check .
pytest -m "unit or integration or e2e"
```

If behavior changed, include targeted test commands in the PR verification section, along with expected outcomes.

## PR Checklist

- [ ] Branch name follows convention.
- [ ] Issue title/body follow convention (`Summary` and `Done when` included when applicable).
- [ ] Commit messages follow `<type>: short description`.
- [ ] No development occurred directly on `main`.
- [ ] PR includes concise summary and reproducible verification steps.
- [ ] Relevant unit/integration/e2e tests were added or updated.

## Maintainer Support and Escalation

- Use GitHub issues for normal development questions, bug reports, and feature requests.
- Use email for private or sensitive matters that should not be posted publicly.
- Maintainer contact: [Jeffrey Hoelzel](mailto:jmh2338@nau.edu) or [Jason Ladner](mailto:jason.ladner@nau.edu).