PHEAT is the Protein Heavy-atom Energy and Analysis Toolkit. It is a library-first Python tool for converting PDB, mmCIF, or BinaryCIF structures to a compact atom representation, reconstructing structures from residue geometry, building reduced side-chain centroid models, computing radius-of-gyration metrics, and computing approximate protein energy scores.
The CLI is intentionally thin: every command delegates to importable backend functions
under pheat.
The package core is dependency-light and can be imported with the system Python:
PYTHONPATH=src python3 - <<'PY'
from pheat import structure_from_residue_geometry
print(len(structure_from_residue_geometry({"sequence": "AG"}).atoms))
PY
For editable local development without optional scientific or notebook dependencies:
python -m pip install -e .
For the full pip-managed development stack, install the all extra:
python -m pip install -e ".[all]"
python -m pytest
The all extra is intended to cover the same feature and test surface as the
Miniforge environment where pip has a satisfiable package set: scientific
conversion/scoring dependencies, BinaryCIF input through PHEAT/msgpack/numpy,
OpenMM/PDBFixer, executable-backed scoring hooks for ambertools-sander and gromacs-mdrun,
training/table-generation dependencies, JupyterLab/Mol* widget
support, the optional FastAPI web app, tests, and linting tools. On Python 3.9, the pip
extras skip OpenMM/PDBFixer because current pip packages do not provide a
compatible pdbfixer/openmm>=8.2 set; use Python 3.10+ or the Miniforge
environment for the openmm-prepared path. AmberTools and GROMACS executables are
available through the conda environment, not pip extras.
Common local checks are available through make:
make test
make coverage
make coverage-html
make lint
make typecheck
make package
make check
make coverage writes terminal coverage and coverage.xml. HTML coverage is
generated only when requested with make coverage-html. make package builds the
source distribution and wheel so packaged schemas and manifests can be checked
outside an editable install.
The public documentation is built with MkDocs, Material for MkDocs, and mkdocstrings. Its styling is matched to the Blankenberg Lab Astro site while remaining self-contained in this repository.
Preview the docs locally with:
make docs-deps
make docs-serve
Build the static site with make docs, which runs mkdocs build --strict.
Direct mkdocs serve also works when the docs requirements are installed.
The Cloudflare Workers static-assets deployment uses the generated site/
directory. Configure the Cloudflare build with:
Build command:
python -m pip install --upgrade pip && python -m pip install -r docs/requirements.txt && python -m mkdocs build --strict
Deploy command:
npx wrangler@4 deploy
The production documentation URL is:
https://pheat.tools.blankenberglab.org/
For conda/Miniforge development, activate the environment first and then update it from the repository root:
mamba env update -n "$CONDA_DEFAULT_ENV" -f environment.yml
python -m pytest
The Makefile assumes the development environment is already active and uses
python, jupyter, and npm from PATH. The conda environment installs the
package as -e ".[all]" after resolving the scientific stack from conda-forge,
including OpenMM/PDBFixer support used by the deterministic openmm-prepared
development scoring path.
Example reports, executed notebooks, and Mol* browser assets are generated artifacts rather than committed files:
make examples
Run these targets from an already-active development environment. The environment
includes nodejs, and make molstar wraps pheat molstar install to download
the pinned Mol* viewer bundle into PHEAT's platform-aware runtime cache. Use
pheat molstar status to inspect the active asset location.
The optional web app accepts a PDB or mmCIF upload and runs the same residue-geometry roundtrip comparison used by the backend tests and 2MU7 report. Install the web extra, build local Mol* assets, and start the app:
python -m pip install -e ".[web]"
make molstar
pheat web
For the full development environment, pip install -e ".[all]" or
environment.yml includes the web dependencies. make web installs the pinned
local Mol* assets in PHEAT's runtime cache and starts the app with the active environment's python. By
default it binds local-only on 127.0.0.1, tries http://127.0.0.1:8000/ first,
and chooses a random available port in 8001-8999 if 8000 is already in use.
Override the preferred port with make web WEB_PORT=8888.
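The port fallback described above can be sketched with the standard library. This is a minimal illustration, not PHEAT's implementation: the function names `port_is_free` and `choose_port` are hypothetical, and the real app may probe ports differently.

```python
import random
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def choose_port(host: str = "127.0.0.1", preferred: int = 8000,
                fallback: range = range(8001, 9000)) -> int:
    """Try the preferred port first, then random ports from the fallback range."""
    if port_is_free(host, preferred):
        return preferred
    candidates = list(fallback)
    random.shuffle(candidates)  # random order, so repeated launches spread out
    for port in candidates:
        if port_is_free(host, port):
            return port
    raise RuntimeError("no free port in the fallback range")


print(choose_port())
```

A probe like this is only advisory: another process can take the port between the check and the server's own bind, so a server should still handle a bind failure at startup.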
Bind a different interface explicitly when needed:
pheat web --host 127.0.0.1
pheat web --host 0.0.0.0
make web WEB_HOST=0.0.0.0
make web WEB_HOST=192.168.1.25
0.0.0.0 means all IPv4 interfaces and can make the app reachable from other
computers on the network; use it only on trusted networks. The server prints an
Open PHEAT web app: URL that most terminals render as a clickable browser link.
The first screen is the upload tool. It has a workflow selector for a single
default roundtrip, one configurable roundtrip, or the combinatorial omega/tau/theta
and chi-limit sweep. Built-in deterministic scorers are selected by default, and
the optional OpenMM-prepared scorer can be selected explicitly. Results include
original score totals, reconstructed score totals, Kabsch RMSDs, radius-of-gyration
comparisons, aligned original/reconstructed PDB downloads, optional mmCIF downloads,
residue-geometry JSON, reconstructed atom-structure JSON, metrics JSON, and an embedded
Mol* alignment viewer. Generated
upload artifacts are written under .pheat-cache/web/. The viewer uses Mol*
semantic original/reconstructed coloring, includes quick toggles for each
structure, can switch between ribbon and all-atom Mol* representations, and has a
recolor control that reapplies the initial colors without reloading the PDB data.
Hidden structures dim when toggled off; the selected representation mode is
outlined while the other mode remains fully clickable.
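As a rough picture of how large the combinatorial sweep is, the sketch below enumerates every subset of the omega, tau, and theta fields against three chi limits. It assumes the web sweep uses the same chi limits (all, 1, and 2) as the 2MU7 combinatorial example script; the dictionary keys `store_angles` and `max_chi` are hypothetical labels, not PHEAT's internal option names.

```python
from itertools import combinations, product

ANGLE_FIELDS = ("omega", "tau", "theta")
CHI_LIMITS = ("all", 1, 2)  # assumed to match the 2MU7 example script


def sweep_configurations():
    """Return one configuration per (angle subset, chi limit) pair."""
    subsets = [
        combo
        for size in range(len(ANGLE_FIELDS) + 1)  # size 0 is the empty subset
        for combo in combinations(ANGLE_FIELDS, size)
    ]
    return [
        {"store_angles": subset, "max_chi": chi}
        for subset, chi in product(subsets, CHI_LIMITS)
    ]


configs = sweep_configurations()
print(len(configs))  # 8 subsets x 3 chi limits = 24
```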
pheat pdb-to-structure input.pdb -o structure.json
pheat pdb-to-structure input.pdb -o all-atom-structure.json --hydrogens preserve
pheat pdb-to-structure input.pdb -o bonded-structure.json --store-bonds all
pheat mmcif-to-structure input.cif -o structure.json
pheat bcif-to-structure input.bcif.gz -o structure.json
pheat structure-to-pdb structure.json -o roundtrip.pdb
pheat structure-to-mmcif structure.json -o roundtrip.cif
pheat structure-to-geometry structure.json -o residue-geometry.json
pheat pdb-to-geometry input.pdb -o residue-geometry.json
pheat mmcif-to-geometry input.cif -o residue-geometry.json
pheat bcif-to-geometry input.bcif.gz -o residue-geometry.json
pheat pdb-to-geometry input.pdb -o residue-geometry-degrees.json --angle-units degrees
pheat pdb-to-geometry input.pdb -o residue-geometry-chi1.json --max-chi 1
pheat pdb-to-geometry input.pdb -o residue-geometry-full.json --store-angles all
pheat pdb-to-geometry input.pdb -o residue-geometry-lengths.json --store-lengths all
pheat structure-to-geometry structure.json -o residue-geometry-full.json --store-angles omega,tau,theta
pheat geometry-to-structure residue-geometry.json -o structure.json --pdb-output rebuilt.pdb
pheat geometry-to-structure residue-geometry.json -o all-atom-structure.json --hydrogens generate
pheat geometry-to-structure residue-geometry.json -o structure.json --mmcif-output rebuilt.cif
pheat geometry-to-structure residue-geometry.json -o structure.json --geometry-table geometry-tables.json
pheat geometry-to-structure residue-geometry.json -o structure.json --geometry-table ccd-sidechain-geometry-v1
pheat geometry-to-structure residue-geometry.json -o structure.json --include-terminal-oxt
pheat geometry-to-structure residue-geometry-degrees.json -o structure.json --angle-units degrees
pheat score input.pdb --model generic
pheat score input.pdb --model pheat-dfire --table-set packaged:protein-heavy-30id-xray-aqueous-v0
pheat training tables describe --table-set packaged:protein-heavy-30id-xray-aqueous-v0
pheat scoring validate-options --model gromacs-mdrun
pheat gromacs validate-options --gromacs-forcefield amber19sb
pheat radius-of-gyration input.pdb
pheat rg structure.json --mode unweighted
pheat rg structure.json --atom-set ca
pheat rmsd original.pdb reconstructed.pdb
pheat rmsd original.pdb reconstructed.pdb --atom-set ca --alignment-atom-set ca
pheat examples list
pheat sources list
pheat sources fetch wwpdb-ccd-full --destination .pheat-cache/sources/ccd/full
pheat sources fetch rcsb-ccd-bcif --destination .pheat-cache/sources/ccd/bcif
pheat archive download --ids-file ids.txt --dry-run
pheat archive snapshots list
pheat archive snapshots ids rcsb-current-bcif --output-root .pheat-cache/pdb-archive/rcsb-current-bcif
pheat training corpus inventory --snapshot-root .pheat-cache/pdb-archive/rcsb-current-bcif -o .pheat-cache/training/inventory.jsonl
pheat training corpus select --inventory .pheat-cache/training/inventory.jsonl --output-root .pheat-cache/training/sets/protein-heavy-30id --corpus-id protein-heavy-30id --corpus-version v1
pheat training corpus describe --training-set .pheat-cache/training/sets/protein-heavy-30id
pheat training tables build --training-set .pheat-cache/training/sets/protein-heavy-30id --output-root .pheat-cache/training/tables/protein-heavy --burial-method both --table-set-id protein-heavy-30id --table-set-version v1
pheat training tables describe --table-set .pheat-cache/training/tables/protein-heavy/score-tables.json
pheat training tables build --training-set .pheat-cache/training/sets/protein-heavy-30id --output-root .pheat-cache/training/tables/protein-heavy-contacts --burial-method contacts --table-set-id protein-heavy-30id-contacts --table-set-version v1
pheat training tables build --training-set .pheat-cache/training/sets/protein-heavy-30id --output-root .pheat-cache/training/tables/protein-heavy-sasa --burial-method sasa --table-set-id protein-heavy-30id-sasa --table-set-version v1
pheat reference run-unattended --reference-root .pheat-cache/reference-builds --snapshot-root .pheat-cache/pdb-archive/rcsb-current-bcif --workers auto --overwrite
pheat reference build-decoys --training-set .pheat-cache/reference-builds/sets/protein-heavy-30id-xray-aqueous-v0 --output-root .pheat-cache/reference-builds/decoys/protein-heavy-30id-xray-aqueous-v0-pheat-torsion --recipes pheat-torsion-v1
pheat reference build-scores --training-set .pheat-cache/reference-builds/sets/protein-heavy-30id-xray-aqueous-v0 --output-root .pheat-cache/reference-builds/tables/protein-heavy-30id-xray-aqueous-v0
pheat reference extract-features --training-set .pheat-cache/reference-builds/sets/protein-heavy-30id-xray-aqueous-v0 --decoys .pheat-cache/reference-builds/decoys/protein-heavy-30id-xray-aqueous-v0-pheat-torsion/decoys.jsonl -o .pheat-cache/reference-builds/features/v0/aqueous-features.jsonl
pheat reference train-ml --features .pheat-cache/reference-builds/features/v0/aqueous-features.jsonl -o .pheat-cache/reference-builds/models/v0/pheat-ml-linear-aqueous.json
pheat reference package-scoring-assets --reference-root .pheat-cache/reference-builds --artifact-version v0 --destination-root src/pheat/data/scoring/v0 --overwrite
pheat geometry tables list
pheat geometry tables build-backbone --training-set .pheat-cache/training/sets/protein-heavy-30id --output-root .pheat-cache/training/geometry/protein-heavy-30id-backbone
pheat geometry tables build-cdl --training-set .pheat-cache/training/sets/protein-heavy-30id --output-root .pheat-cache/training/geometry/protein-heavy-30id-cdl --phi-psi-bin-size 10 --min-bin-count 20
pheat geometry tables import-cdl --input cdl-like-table.json --output-root .pheat-cache/training/geometry/imported-cdl
pheat geometry tables build-sidechain-ccd --ccd-full .pheat-cache/sources/ccd/full/components.cif.gz --output-root .pheat-cache/training/geometry/ccd-sidechains
pheat geometry tables validate --table-set .pheat-cache/training/geometry/protein-heavy-30id-backbone/geometry-tables.json
pheat web
CLI commands print immediate human-readable status to stderr, including the
PHEAT version, command path, and key inputs/outputs. Structured command results
remain on stdout so JSON output can still be piped to other tools. Use
--quiet before the subcommand to suppress terminal startup/progress status,
--log FILE to append timestamped status/progress/error lines to a text log,
and -v or -vv for more diagnostics:
pheat --quiet archive snapshots list
pheat --log pheat-download.log archive snapshots download rcsb-current-bcif -y
pheat -vv archive download --ids-file ids.txt --dry-run
pheat --version
pheat archive download builds local RCSB/wwPDB coordinate corpora with
provenance manifests. The default output root is ignored local cache space:
pheat archive download --ids-file ids.txt --dry-run
pheat archive download --ids-file ids.txt --yes
pheat archive download --ids-file ids.txt --yes --no-progress-redraw
pheat archive download --all-current --format cif --max-entries 100 --dry-run
pheat archive snapshots list
pheat archive snapshots describe wwpdb-current-mmcif
pheat archive snapshots download wwpdb-current-mmcif --max-entries 100 --dry-run
pheat archive snapshots download rcsb-current-bcif --prefetch-metadata --metadata-source rcsb-api --yes
pheat archive snapshots verify wwpdb-current-mmcif
pheat archive snapshots relocate rcsb-current-bcif --output-root .pheat-cache/pdb-archive/rcsb-current-bcif --write
pheat archive snapshots metadata rcsb-current-bcif --output-root .pheat-cache/pdb-archive/rcsb-current-bcif
By default it writes under .pheat-cache/pdb-archive/:
- raw/ for coordinate files.
- processed/, analysis/, and failed/ for later corpus-building stages.
- manifests/ for ids.txt, filters.json, files.jsonl, and api-schemas.json.
The command can reuse existing coordinate files from another directory without redownloading them:
pheat archive download --ids-file ids.txt \
--reuse-raw-dir /data/pdb/mmcif \
--dry-run
By default, reused files are copied into the new archive so the snapshot is
self-contained. Use --reuse-mode reference only when a manifest that points at
another raw-file location is acceptable.
Schema/API provenance is recorded without storing remote schema bodies. The
api-schemas.json manifest records the inspected RCSB schema URL, retrieval
timestamp, content SHA-256, content length, HTTP cache headers when present,
embedded version/license metadata when present, and stored: false. PHEAT uses
direct HTTP JSON requests for this utility; it does not depend on the external
rcsb-api package. Coordinate files are written atomically and local SHA-256
checksums are recorded in files.jsonl.
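One common way to get an atomic write with a recorded checksum is to write to a temporary file in the destination directory and rename it into place. The sketch below shows that general pattern only; `write_atomically` is a hypothetical helper and does not claim to match PHEAT's own code.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def write_atomically(destination: Path, payload: bytes) -> str:
    """Write payload to destination via a temp file and return its SHA-256."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives beside the destination so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, destination)  # readers see the old file or the new one, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return hashlib.sha256(payload).hexdigest()
```

The returned digest is what a manifest row would store, so a later verification pass can recompute the hash from disk and compare.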
Archive downloads report processed files, failures, elapsed time, average
files/sec, average downloaded bytes/sec, and estimated time remaining. Interactive
terminals use a dynamic redraw line by default; redirected output, CI logs, and
--no-progress-redraw use append-only progress lines. --progress-interval
controls file-count updates, and --progress-seconds controls elapsed-time
updates. --log FILE always records append-only timestamped lines even when the
terminal display uses redraw.
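The reported rates and the remaining-time estimate follow from simple arithmetic on running counters. The sketch below is illustrative; `progress_summary` and its field names are hypothetical, and it assumes the estimate extrapolates from the average file rate so far.

```python
def progress_summary(processed: int, total: int,
                     downloaded_bytes: int, elapsed_seconds: float) -> dict:
    """Compute average rates and a naive estimated time remaining."""
    if elapsed_seconds > 0:
        files_per_sec = processed / elapsed_seconds
        bytes_per_sec = downloaded_bytes / elapsed_seconds
    else:
        files_per_sec = bytes_per_sec = 0.0
    remaining = max(total - processed, 0)
    # No estimate is possible until at least one file has been processed.
    eta_seconds = remaining / files_per_sec if files_per_sec > 0 else None
    return {
        "files_per_sec": files_per_sec,
        "bytes_per_sec": bytes_per_sec,
        "eta_seconds": eta_seconds,
    }


print(progress_summary(processed=50, total=200,
                       downloaded_bytes=1_000_000, elapsed_seconds=10.0))
```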
Snapshot presets are named download plans for common public coordinate archives.
PHEAT currently includes current-holdings presets for gzipped mmCIF
(wwpdb-current-mmcif), gzipped legacy PDB (wwpdb-current-pdb), and gzipped
BinaryCIF (rcsb-current-bcif). Snapshot downloads use per-snapshot default
roots under .pheat-cache/pdb-archive/, record the snapshot ID in
filters.json, and can be verified later against the SHA-256 checksums in
files.jsonl. The rcsb-current-bcif files can be consumed directly by PHEAT
when the scientific or all extras are installed. BinaryCIF coordinate input uses
PHEAT's native atom-site decoder and label asym IDs for chain identifiers.
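Checksum verification against a JSONL manifest can be sketched as follows. The field names `path` and `sha256` are assumptions made for illustration; the actual files.jsonl schema may differ, and `pheat archive snapshots verify` remains the supported way to check a snapshot.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large coordinate files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, root: Path) -> list:
    """Return manifest paths that are missing or whose checksum differs."""
    problems = []
    with manifest_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            target = root / record["path"]  # hypothetical field name
            if not target.is_file() or sha256_of_file(target) != record["sha256"]:
                problems.append(record["path"])
    return problems
```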
If a snapshot directory is moved, pheat archive snapshots relocate resolves
files by basename under the local raw/ directory and can rewrite
files.jsonl/filters.json with relative paths. pheat archive snapshots metadata writes compact normalized entry metadata for downstream training and
reference selection. Snapshot downloads can also use --prefetch-metadata to
populate that metadata cache immediately after coordinate files are downloaded;
this is the preferred one-pass archival mode for snapshots intended to be reused
for reference builds. Metadata extraction reports batch progress to stderr and
the existing global --log FILE captures the same progress lines for later
auditing.
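Resolving moved files by basename can be pictured with a small sketch. It assumes manifest rows carry a `path` field and that basenames are unique under raw/; both are illustrative assumptions, and `relocate_records` is a hypothetical helper rather than the command's implementation.

```python
from pathlib import Path, PurePosixPath


def relocate_records(records, snapshot_root: Path):
    """Rewrite each record's path to a snapshot-relative path found by basename."""
    raw_dir = snapshot_root / "raw"
    by_name = {p.name: p for p in raw_dir.rglob("*") if p.is_file()}
    relocated, missing = [], []
    for record in records:
        name = PurePosixPath(record["path"]).name  # hypothetical field name
        found = by_name.get(name)
        if found is None:
            missing.append(record["path"])
            continue
        updated = dict(record)  # leave the caller's record untouched
        updated["path"] = found.relative_to(snapshot_root).as_posix()
        relocated.append(updated)
    return relocated, missing
```

Storing the rewritten paths relative to the snapshot root is what lets the directory be moved again later without another rewrite.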
For a reusable BinaryCIF snapshot on network-attached storage while downloading through local staging space in the current directory:
pheat archive snapshots download rcsb-current-bcif \
--output-root /path/to/pheat-archive/rcsb-current-bcif \
--staging-dir ./pheat-bcif-staging \
--cleanup-staging \
-y
This downloads pending files into ./pheat-bcif-staging, computes SHA-256,
promotes each verified file into raw/ under the snapshot root, verifies the
promoted copy, records the final paths and checksums in manifests/files.jsonl,
and removes successfully promoted staged files. Existing final files are skipped
when they match a prior manifest checksum; stale files that disagree with the
manifest are downloaded again. Verify the reusable snapshot later with:
pheat archive snapshots verify rcsb-current-bcif \
--output-root /path/to/pheat-archive/rcsb-current-bcif
PHEAT's training commands default to all-heavy protein scoring (--domain protein-heavy) and to the reusable rcsb-current-bcif snapshot. Broader
heavy-atom domains are opt-in with --domain all-heavy or --domain full for
models that support them. Full-corpus table outputs are not bundled in this pass;
the commands below create reproducible artifacts under local cache or a
user-selected output root.
make training-snapshot-ids
make training-decoys
make training-inventory
make training-select
make training-tables
make training-tables-contacts
make training-tables-sasa
make training-validate
make training-validate-contacts
make training-validate-sasa
The corresponding CLI commands are available without Make:
pheat training decoys list
pheat training decoys fetch 3drobot --output-root .pheat-cache/training/decoys --yes
pheat training corpus inventory \
--snapshot-root .pheat-cache/pdb-archive/rcsb-current-bcif \
--domain protein-heavy \
-o .pheat-cache/training/inventory.jsonl
pheat training corpus select \
--inventory .pheat-cache/training/inventory.jsonl \
--output-root .pheat-cache/training/sets/protein-heavy-30id \
--sequence-identity 0.30 \
--corpus-id protein-heavy-30id \
--corpus-version v1
pheat training corpus describe \
--training-set .pheat-cache/training/sets/protein-heavy-30id
pheat training tables build \
--training-set .pheat-cache/training/sets/protein-heavy-30id \
--output-root .pheat-cache/training/tables/protein-heavy \
--models pheat-dfire,pheat-goap,pheat-mj,pheat-hydropathy,pheat-backbone,pheat-rotamer,pheat-hbond,pheat-rg \
--burial-method both \
--sasa-backend auto \
--table-set-id protein-heavy-30id \
--table-set-version v1
pheat training tables describe \
--table-set .pheat-cache/training/tables/protein-heavy/score-tables.json
pheat training tables build \
--training-set .pheat-cache/training/sets/protein-heavy-30id \
--output-root .pheat-cache/training/tables/protein-heavy-contacts \
--models pheat-dfire,pheat-goap,pheat-mj,pheat-hydropathy,pheat-backbone,pheat-rotamer,pheat-hbond,pheat-rg \
--burial-method contacts \
--table-set-id protein-heavy-30id-contacts \
--table-set-version v1
pheat training tables build \
--training-set .pheat-cache/training/sets/protein-heavy-30id \
--output-root .pheat-cache/training/tables/protein-heavy-sasa \
--models pheat-dfire,pheat-goap,pheat-mj,pheat-hydropathy,pheat-backbone,pheat-rotamer,pheat-hbond,pheat-rg \
--burial-method sasa \
--sasa-backend auto \
--table-set-id protein-heavy-30id-sasa \
--table-set-version v1
pheat training tables build writes score-tables.json with
format: "pheat.score-table-set". The table format is profile-native:
--burial-method both writes both protein-heavy-30id-contacts and
protein-heavy-30id-sasa into one file, and single-method builds write one profile
with the same shape. Score both profiles side by side with:
pheat score input.pdb \
--model pheat-hydropathy \
--table-set .pheat-cache/training/tables/protein-heavy/score-tables.json \
--profiles protein-heavy-30id-contacts,protein-heavy-30id-sasa
SASA builds use --sasa-backend auto, which uses MIT-licensed FreeSASA.
Contact builds remain dependency-light and are useful for systems where SASA
packages are unavailable.
The source tree includes initial v0 score assets as compressed .json.xz files under
src/pheat/data/scoring/v0 so installed packages can exercise trained scoring
without an external reference-build archive. Use them with packaged:<id>:
protein-heavy-30id-xray-aqueous-v0, protein-heavy-30id-xray-membrane-v0,
pheat-ml-linear-aqueous-v0, or pheat-ml-linear-membrane-v0. The packaged
assets are about 1.7 MiB total and are functional testing artifacts, not final
scientific defaults.
pheat training corpus select can use a local RCSB-style sequence-cluster file
with --sequence-clusters; otherwise it falls back to deterministic internal
sequence-identity clustering. --sequence-identity accepts fractions or
percent-style values such as 0.30, 30, or 30%; generated labels use the
safe 30id/12p5id form for paths and profile IDs.
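The accepted threshold spellings and the path-safe label can be illustrated with a short sketch. The rule that bare values up to 1 are fractions and larger values are percents is an assumption made here to separate 0.30 from 30; `parse_sequence_identity` and `identity_label` are hypothetical names, not PHEAT functions.

```python
from decimal import Decimal


def parse_sequence_identity(value) -> Decimal:
    """Return the threshold as a percentage, accepting 0.30, 30, or 30%."""
    text = str(value).strip()
    if text.endswith("%"):
        return Decimal(text[:-1])
    number = Decimal(text)
    # Assumption: bare values <= 1 are fractions; anything larger is a percent.
    return number * 100 if number <= 1 else number


def identity_label(value) -> str:
    """Build a path-safe label such as 30id or 12p5id."""
    percent = parse_sequence_identity(value).normalize()
    # Fixed-point formatting avoids exponent notation such as 3E+1.
    return format(percent, "f").replace(".", "p") + "id"


for raw in ("0.30", "30", "30%", "12.5"):
    print(raw, "->", identity_label(raw))
```

Decimal is used instead of float so that 0.30 times 100 is exactly 30 and no rounding noise leaks into the label.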
Generated corpus and score-table JSON separate schema version from artifact
identity. Corpus manifests include artifact_id, artifact_version, a safe
artifact_label, selected-entry checksums, source snapshot provenance when the
inventory came from a PHEAT archive snapshot, and optional --include-file,
--exclude-file, and --holdout-file member-list checksums. Score-table sets
record the source corpus artifact and entry checksum; generated profile IDs use
the corpus ID plus burial method, for example protein-heavy-30id-contacts.
Create a new corpus or table artifact version whenever the PDB snapshot,
selection filters, inclusion/exclusion lists, clustering threshold, scoring
definitions, or PHEAT build changes in a way that should remain distinguishable.
The pheat reference commands wrap the full local reference-build workflow used
to derive future PHEAT-owned scoring assets. Defaults are deliberately strict for
the current protein-heavy build: the reusable rcsb-current-bcif snapshot,
X-ray structures only (--method x-ray), 30% sequence-identity clustering
(--sequence-identity 0.30), maximum resolution 2.5 Å, and artifact version
v0. This initial v0 label is intentional while the first packaged scoring
assets are functional but still awaiting broader manual and scientific review.
make reference-unattended \
SNAPSHOT_ROOT=.pheat-cache/pdb-archive/rcsb-current-bcif \
REFERENCE_ROOT=.pheat-cache/reference-builds \
REFERENCE_WORKERS=auto \
REFERENCE_OVERWRITE=1
The unattended target runs the complete pheat reference run-unattended workflow:
input registration, snapshot metadata extraction, inventory, aqueous/membrane
selection, canary decoy/feature validation, full decoy generation, score-table
builds, feature extraction, ML fitting, and validation. It defaults to artifact
version v0, writes stage logs under $(REFERENCE_ROOT)/runs/v0/logs, and can
move previous outputs into backups/<timestamp>/ with
REFERENCE_BACKUP_EXISTING=1.
The currently bundled initial v0 assets were packaged from a completed local
reference run. That run selected 21,721 aqueous and 2,373 membrane X-ray
protein-heavy chains, accepted 5,454 aqueous and 554 membrane torsion-space
decoys, and extracted 27,175 aqueous and 2,927 membrane native/decoy feature
rows. Snapshot metadata had three missing RCSB metadata records (4M4C, 9KZM,
and 9MBW). Large source snapshots, decoys, feature JSONL files, and logs remain
external archive artifacts; only the packageable score/model JSON is committed as
compressed .json.xz payloads with compressed and uncompressed SHA-256 checksums
recorded in the packaged scoring manifest.
Use pheat reference audit-version --reference-root .pheat-cache/reference-builds --artifact-version v0 or make reference-audit to check that active manifests
and packageable outputs consistently use the expected artifact version. See
docs/reference-build-v0.md for the current build
settings and comparison against the previous comparable run.
Individual stages remain available for debugging:
make reference-fetch
make reference-metadata
make reference-inventory
make reference-select-aqueous
make reference-select-membrane
make reference-decoys
make reference-scores
make reference-features
make reference-ml-linear
make reference-validate
reference-fetch records the selected coordinate snapshot and decoy benchmark
metadata. External benchmark payloads such as 3DRobot, CASP, I-TASSER, and
Rosetta decoy files are treated as local-use-only unless their license is
reviewed; PHEAT records source URLs, command settings, SHA-256 checksums, byte
counts, and registration/download dates, but does not redistribute payloads
whose license is unclear. Local payloads can be registered with --local-file DATASET=PATH or --local-dir DATASET=PATH; direct downloads require
--include-payloads --payload-url DATASET=URL.
reference-metadata writes compact normalized snapshot metadata to
$(SNAPSHOT_ROOT)/manifests/metadata.jsonl before inventory. It records method,
resolution in Angstroms, deposition/revision dates, X-ray refinement and
validation summaries, composition counts, protein entity/chain sequence
metadata, RCSB sequence-cluster IDs when available, and normalized
aqueous/membrane/computed-model flags. The default source is auto: PHEAT uses
the RCSB Data API when available and falls back to local BinaryCIF metadata for
offline snapshots. Metadata manifests record source URLs, build dates, checksums,
and the fields intentionally omitted by default, such as coordinates, raw API
payloads, full citations, full crystallization text, and per-residue validation
details.
reference-inventory builds a JSONL inventory from a local snapshot using
multiple workers (--workers auto by default) and automatically consumes the
snapshot metadata file when present. reference-select writes a training-corpus
manifest plus selected.jsonl, holdout.jsonl, split files, and an audit
report. The aqueous subset keeps non-membrane entries and warns about entries
without explicit solvent metadata during this prototype phase; the membrane
subset only keeps entries with membrane annotations. RCSB sequence-cluster
metadata is used for the selected sequence-identity threshold when present;
otherwise PHEAT falls back to deterministic internal sequence comparison. Both
subsets keep relative, relocatable paths where possible and record input
manifest checksums.
reference build-decoys creates PHEAT-owned deterministic decoys from selected
native chains. The default pheat-torsion-v1 profile perturbs PHEAT
residue-geometry degrees of freedom, reconstructs heavy atoms, aligns decoys to
their native chain for inspection, and records all-heavy RMSD, C-alpha RMSD,
radius-of-gyration ratio, geometry-integrity score, acceptance status, seed, and
SHA-256 for every accepted or rejected candidate. Older coordinate-noise recipes
remain available as smoke-test recipes but are not the default for reference
training. reference build-scores builds native PHEAT score tables, including
contact and optional SASA profiles. reference extract-features, reference train-ml, and reference validate create
feature rows, a lightweight pheat-ml-linear baseline, and native-vs-decoy
separation summaries. For provisional ML experiments, reference extract-features --max-entries N samples the first N selected native entries
and matching decoys while recording that cap in the feature manifest. Use
reference package-scoring-assets to refresh bundled compressed score/model
assets from a completed reference build. Use reference promote with a review note to copy a packageable generated artifact
into a reviewed destination; manifests containing local-use-only or
unpackageable payload metadata are blocked from promotion.
PHEAT source code is licensed under MIT. The wwPDB/RCSB Chemical Component
Dictionary definition CIFs consulted for modified residue templates are PDB
archive data files made available under the
CC0 1.0 Universal Public Domain Dedication
according to the RCSB PDB usage policy.
The consulted files are the CCD definition CIFs for
SEC,
PYL,
MSE,
HYP,
LYZ,
SEP,
TPO,
PTR, and
PCA, plus the
wwPDB CCD documentation and
RCSB download documentation.
Those CCD files are not vendored, redistributed, or packaged with PHEAT; they
were used as reference data for component IDs, parent relationships, atom names,
connectivity, and rounded idealized residue templates. Use of those references
does not imply wwPDB/RCSB endorsement.
PHEAT can also fetch local CCD caches for user-generated geometry tables:
wwpdb-ccd-full downloads the full components.cif.gz CCD file, and
rcsb-ccd-bcif downloads the compact cca.bcif/ccb.bcif atom and bond
subsets. The full CCD mmCIF file is the preferred source for deriving
side-chain reconstruction geometry because it contains ideal/model-coordinate
and bond-distance fields used to compute lengths and angles. The compact
BinaryCIF subsets are useful for lightweight atom/bond connectivity validation,
but they do not replace the full CCD geometry fields. Fetch commands write a
pheat-source-provenance.json file with URLs, timestamps, SHA-256 checksums,
file sizes, license metadata, and PHEAT version.
Published conformation-dependent library (CDL) references are documented as non-downloadable literature/source references. PHEAT's current context-dependent backbone tables are generated from selected local corpora; they are not the official Phenix/CCTBX CDL tables.
Mass-weighted radius of gyration uses a compact built-in representative atomic-mass
table for common PDB heavy elements. The values are derived from CIAAW Standard
Atomic Weights 2024 and cross-checked against the NIST Atomic Weights and Isotopic
Compositions reference database. CIAAW/IUPAC website content is copyright-marked
with attribution conditions for republication and commercial-use restrictions;
NIST notes that Standard Reference Data and other NIST works can carry different
copyright/licensing terms. PHEAT treats both atomic-weight references as citation-only
inputs for this purpose: no CIAAW or NIST atomic-weight pages or data files are
downloaded, vendored, redistributed, or packaged. pheat sources list records
those entries as reference-only, and pheat sources fetch refuses to fetch them.
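For orientation, the mass-weighted radius of gyration is the square root of the mass-weighted mean squared distance from the center of mass: Rg = sqrt(sum(m_i * |r_i - r_cm|^2) / sum(m_i)). The sketch below computes it in plain Python. The small mass dictionary uses abridged standard atomic weights for illustration only and is not PHEAT's built-in table; `radius_of_gyration` is a hypothetical helper, not the library function.

```python
import math

# Illustrative abridged atomic weights; not PHEAT's packaged table.
ATOMIC_MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def radius_of_gyration(coords, elements=None):
    """Return Rg for 3D coords; mass-weighted when elements are given."""
    if elements is None:
        masses = [1.0] * len(coords)  # unweighted mode: every atom counts equally
    else:
        masses = [ATOMIC_MASS[element] for element in elements]
    total = sum(masses)
    center = [
        sum(m * c[axis] for m, c in zip(masses, coords)) / total
        for axis in range(3)
    ]
    weighted_sq = sum(
        m * sum((c[axis] - center[axis]) ** 2 for axis in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total)


coords = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(radius_of_gyration(coords))                # unweighted
print(radius_of_gyration(coords, ["C", "O"]))    # mass-weighted
```

With unequal masses the center of mass moves toward the heavier atom, so the weighted value for the two-atom case is slightly smaller than the unweighted one.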
RCSB Search API schema documents used by the archive corpus utility are not
vendored or packaged. When inspected, PHEAT records only provenance such as the
URL, retrieval timestamp, content SHA-256, and embedded metadata. The Search API
OpenAPI document declares Apache 2.0 in its own info.license field; RCSB API
data and PDB archive data remain governed by the RCSB usage policy and its CC0
statement plus external-resource caveats.
Snapshot metadata extraction uses the RCSB Data API GraphQL endpoint for compact entry/entity/validation/cluster fields when network access is available. PHEAT stores normalized metadata rows and provenance, not raw RCSB API responses or downloaded API schemas.
PHEAT cites Miyazawa-Jernigan contact potentials, Kyte-Doolittle hydropathy,
FreeSASA, Zhang Lab decoy datasets, and CASP download areas as method or
benchmark references. The implementation does not vendor original MJ tables,
external hydropathy data files, decoy payloads, or CASP payloads. The generated
pheat-mj and other trained score-table outputs are PHEAT-owned artifacts built
from user-selected corpora and record their own provenance in
pheat.score-table-set metadata.
from pheat import (
filter_structure_for_domain,
kabsch_align,
kabsch_rmsd,
load_mmcif,
load_pdb,
residue_angle_specs,
score_model_option_specs,
score_structure,
validate_external_scoring_options,
validate_scoring_options,
write_pdb,
)
from pheat.metrics import structure_radius_of_gyration, structure_rmsd
structure = load_pdb("input.pdb")
mmcif_structure = load_mmcif("input.cif")
protein_heavy, coverage = filter_structure_for_domain(structure, domain="protein-heavy")
write_pdb(structure, "protein-heavy.pdb", domain="protein-heavy")
result = score_structure(structure, model="generic")
external_check = validate_external_scoring_options(model="gromacs-mdrun")
generic_options = validate_scoring_options("generic", {"domain": "protein-heavy"})
angle_specs = residue_angle_specs("MAG", stored_angles="omega")
print(result.total)
print(external_check["ok"])
print(generic_options["ok"])
print(coverage["scored_atom_count"], len(protein_heavy.atoms))
print(angle_specs[0]["angle_name"])
print(score_model_option_specs("gromacs-mdrun")[0]["name"])
print(len(mmcif_structure.atoms))
print(structure_radius_of_gyration(structure)["values"])
print(structure_rmsd(structure, structure)["value"])
coords = [atom.coord for atom in structure.atoms]
aligned = kabsch_align(coords, coords)
print(kabsch_rmsd(coords, coords, aligned_target=aligned))

pheat rmsd and structure_rmsd default to all matched heavy atoms. Use
--atom-set ca for C-alpha-only RMSD; this matches atom name CA, not calcium
element records. --alignment-atom-set controls which matched atoms define the
Kabsch superposition, so callers can align on C-alpha atoms and measure
all-heavy RMSD, or align and measure on the same atom set.
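
The split between alignment atoms and measured atoms can be illustrated with a small NumPy sketch. This is not PHEAT's implementation: kabsch_transform, rmsd, and the synthetic coordinates are hypothetical stand-ins, and NumPy is assumed to be installed. The superposition is fitted on atoms named CA only, and the RMSD is then measured over all matched heavy atoms.

```python
import numpy as np


def kabsch_transform(mobile, target):
    """Return (rotation, translation) that best superposes mobile onto target."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    covariance = (mobile - mobile_center).T @ (target - target_center)
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = target_center - rotation @ mobile_center
    return rotation, translation


def rmsd(a, b):
    """Root-mean-square deviation between two matched coordinate arrays."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


# Hypothetical matched heavy atoms for four residues (N, CA, C, O each).
atom_names = ["N", "CA", "C", "O"] * 4
rng = np.random.default_rng(0)
reference = rng.normal(scale=5.0, size=(len(atom_names), 3))

# Build a rotated and shifted copy to act as the second structure.
angle = np.deg2rad(40.0)
spin = np.array(
    [
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
mobile = reference @ spin.T + np.array([3.0, -2.0, 1.0])

# Fit on atom name CA only, then apply the transform to every heavy atom.
ca_mask = np.array([name == "CA" for name in atom_names])
rotation, translation = kabsch_transform(mobile[ca_mask], reference[ca_mask])
aligned = mobile @ rotation.T + translation

ca_rmsd = rmsd(aligned[ca_mask], reference[ca_mask])
all_heavy_rmsd = rmsd(aligned, reference)
print(f"{ca_rmsd:.6f} {all_heavy_rmsd:.6f}")
```

Because the second structure here is an exact rigid copy, both values are numerically zero; with real reconstructed coordinates the C-alpha fit and the all-heavy measurement would generally differ.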
examples/notebook/2mu7_roundtrip_energy_rmsd_molstar.ipynb demonstrates the committed
2MU7 heavy-atom to residue geometry to heavy-atom roundtrip. It computes energy comparisons,
radius-of-gyration comparisons, optional OpenMM-prepared scores, all-heavy,
backbone, and C-alpha Kabsch RMSDs, and a Mol* alignment visualization through ipymolstar.
Run make examples-notebook-executed to create an executed copy under
examples/notebook/executed/.
examples/2mu7_combinatorial_roundtrip.py runs the same 2MU7 roundtrip across every
subset of stored omega, tau, and theta fields, with chi limits of all, 1, and
2, across both fixed PHEAT reconstruction geometry and the packaged
CCD-derived side-chain geometry table. It writes aligned initial/reconstructed
PDBs, optional aligned mmCIFs, energy comparisons, radius-of-gyration comparisons, RMSDs,
summary.json, summary.csv, and report.html under
examples/roundtrip/2mu7_combinatorial/.
The default sweep produces 48 cases: 8 optional-angle combinations x 3 chi
limits x 2 reconstruction geometry variants.
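
The case count can be checked with a short enumeration. This is an illustrative sketch rather than the example script itself; the variable names are hypothetical, and the geometry variant labels simply reuse the fixed keyword and the packaged table ID mentioned in this README.

```python
from itertools import combinations, product

optional_angles = ("omega", "tau", "theta")

# Every subset of the optional stored angles, including the empty subset.
angle_subsets = [
    subset
    for size in range(len(optional_angles) + 1)
    for subset in combinations(optional_angles, size)
]

chi_limits = ("all", 1, 2)
geometry_variants = ("fixed", "ccd-sidechain-geometry-v1")

cases = list(product(angle_subsets, chi_limits, geometry_variants))
print(len(angle_subsets), len(cases))
```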
The HTML report lists the original all-heavy scores once and reports reconstructed
score totals for each roundtrip case. It also embeds the aligned PDB pairs into an
interactive Mol* viewer loaded from PHEAT-managed local assets installed by
pheat molstar install, so the report can be opened directly from disk without a
CDN or runtime network dependency after make examples. The viewer uses semantic original/reconstructed
coloring, can switch between ribbon and all-atom Mol* representations, and
includes a recolor control that reapplies the initial colors without reloading
the embedded PDB data. Hidden structures dim when toggled off; the selected
representation mode is outlined while the other mode remains fully clickable.
Pass --write-mmcif to the example script when aligned mmCIF artifacts should be
written alongside the default PDB artifacts. Pass --geometry-variants fixed to
generate only the fixed-geometry cases, or provide comma-separated packaged table
IDs/paths to compare additional reconstruction geometry tables.
Use either pip install -e ".[all]" or the active conda environment from
environment.yml for JupyterLab and Mol* notebook widget support. Both install
paths include ipymolstar and molviewspec, which provide a Mol* anywidget
Jupyter viewer for local molecular data.
The current implementation distinguishes between production plumbing and approximate
scoring. PDB, mmCIF, and BinaryCIF parsing writes canonical atom-structure JSON.
By default PHEAT drops hydrogens and records the dropped count so artifacts stay
heavy-atom compact. Use --hydrogens preserve to keep source H/D/T atoms, or
--hydrogens generate on supported workflows to add hydrogens through the optional
OpenMM path. The JSON atom_scope field reports whether an artifact is heavy or
all.
Optional top-level bond storage is off by default. Use --store-bonds declared,
--store-bonds template, or --store-bonds all to include zero-based atom-index
bond records with coordinate-measured Angstrom lengths. Declared bonds come from
source connectivity such as PDB CONECT and mmCIF struct_conn; template bonds use
PHEAT's supported protein/CCD residue templates. PHEAT does not infer generic bonds
by distance in this pass.
Atom-structure JSON preserves heterogens, record metadata, and explicit disulfide
connectivity from SSBOND, CYS SG-to-SG CONECT records, or mmCIF
struct_conn disulfide annotations. Disulfides are preserved as
connectivity annotations only: PHEAT does not infer them from sulfur distance and
does not fit sulfur atoms to disulfide geometry during residue-geometry reconstruction.
Atom-structure JSON can be converted back to
PDB or mmCIF, extracted to best-effort residue-geometry JSON, or reconstructed from
residue-geometry JSON into atom-structure JSON plus optional PDB or mmCIF output.
Residue-geometry JSON uses radians by default; pass
--angle-units degrees for degree-valued residue-geometry input or output. Optional backbone
geometry storage is compact by default; pass --store-angles omega,tau,theta or
--store-angles all when exporting residue-geometry JSON if those fields should be stored.
Pass --store-lengths all, backbone, sidechain, or explicit ATOM-ATOM keys
to store per-residue measured bond lengths in Angstroms; reconstruction uses stored
lengths before geometry tables or built-in defaults.
Pass --max-chi N to keep only the first N side-chain chi angles per residue;
max_chi=0 suppresses chi angles, max_chi=1 keeps only chi1, and the default
has no chi limit.
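
The chi limit behaves like a prefix truncation of each residue's ordered chi list. The helper below is a hypothetical sketch of that rule, not a PHEAT function; it treats None as "no limit", which is an assumption about how the default would be represented.

```python
def limit_chi(chi_angles, max_chi=None):
    """Keep only the first max_chi chi angles; None means no limit."""
    if max_chi is None:
        return list(chi_angles)
    if max_chi < 0:
        raise ValueError("max_chi must be zero or positive")
    return list(chi_angles[:max_chi])


# Hypothetical lysine-like chi values in radians, ordered chi1..chi4.
lys_chi = [-1.10, 3.05, 3.10, 1.20]

print(limit_chi(lys_chi))             # default: all four angles kept
print(limit_chi(lys_chi, max_chi=1))  # chi1 only
print(limit_chi(lys_chi, max_chi=0))  # chi angles suppressed
```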
The Python API function residue_angle_specs(...) reports the PHEAT residue-angle
fields available for a sequence, including phi/psi, residue-template chi angles,
and optional omega/tau/theta fields. It returns PHEAT-native metadata such as
residue_index, residue_name, angle_name, category, applies_to, and
required_atoms; it does not expose optimizer-specific aliases. The optional
selective_chi_map argument can restrict named chi angles by residue, and
max_chi is then applied as a numeric ceiling.
Residue-geometry extraction and reconstruction support all 20 canonical amino
acids plus SEC, PYL, MSE, HYP, LYZ, SEP, TPO, PTR, and PCA.
Hydroxylysine uses the wwPDB Chemical Component Dictionary code LYZ; HYL is
accepted as an input alias and normalizes to LYZ. One-letter shorthand is
available for SEC (U) and PYL (O); other modified residues require their
three-letter CCD names. Ring templates are closed for canonical PRO, PHE,
TYR, HIS, and TRP, and for modified HYP, PCA, and PYL. Modified
residue side-chain templates are idealized CCD/PDB-name-compatible heavy-atom
reconstructions, not rotamer-library or force-field minimization.
Modified residues are reconstructable, but remain outside canonical residue-specific
statistical terms; generic and heavy-mm paths use element-level terms where available.
PHEAT includes deterministic built-in scorers for testing pipelines and comparative
experiments, plus optional OpenMM, AmberTools, and GROMACS-backed paths. Compare original vs reconstructed
scores within the same model; do not compare absolute totals across different
models because their scales and terms are different. In the Python API,
supported_models() lists every recognized model ID, while available_models()
lists only models runnable in the active environment. model_capabilities()
reports the same distinction with optional dependency details; for example,
openmm-prepared is supported everywhere but available only when OpenMM can be
imported, and ambertools-sander is available only when tleap and sander
are on PATH; gromacs-mdrun is available only when gmx is on PATH.
Capability records and every energy-result metadata payload include an
implementation block that states whether the model is native PHEAT code,
an optional Python backend, or an external executable backend.
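
The supported-versus-available distinction can be approximated outside PHEAT with standard-library probes. The sketch below is an assumption about how such checks might look, not the logic behind available_models() or model_capabilities(); backend_availability is a hypothetical name, and find_spec only tests whether an openmm module can be located, not whether it imports cleanly.

```python
import importlib.util
import shutil


def backend_availability():
    """Probe the active environment for the optional scoring backends."""
    return {
        # Importable Python package.
        "openmm-prepared": importlib.util.find_spec("openmm") is not None,
        # Both AmberTools executables must be on PATH.
        "ambertools-sander": all(
            shutil.which(exe) is not None for exe in ("tleap", "sander")
        ),
        # GROMACS wrapper executable on PATH.
        "gromacs-mdrun": shutil.which("gmx") is not None,
    }


for model, available in backend_availability().items():
    print(f"{model}: {'available' if available else 'supported, not available'}")
```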
Use score_model_option_specs(model) to inspect accepted scorer options and
validate_scoring_options(model, options, require_executables=False) to validate
API option dictionaries without running scoring or requiring external executables.
The pheat-geometry-integrity scorer reports its diagnostic tolerances, per-term
weights, Huber delta, and cis-or-trans planarity target in result metadata.
| Model | Implementation | What it computes | Units | Main caveat |
|---|---|---|---|---|
| generic | Native PHEAT | Element-based steric clash and short-range contact score for broad PDB/mmCIF coverage. | arbitrary | Smoke-test score, not a physical or statistical potential. |
| pheat-dfire | Native PHEAT | PHEAT canonical-residue distance-contact heuristic plus side-chain burial, inspired by DFIRE. | arbitrary | Does not use the original DFIRE parameter table or reference-state calculation. |
| pheat-goap | Native PHEAT | pheat-dfire base score plus a PHEAT residue-orientation heuristic from CA-CB or N-CA vectors, inspired by GOAP. | arbitrary | Does not use the original GOAP parameter tables. |
| pheat-mj | Native PHEAT | PHEAT-generated Miyazawa-Jernigan-style residue contact score. | arbitrary | Valid for supported protein residues; original MJ parameter tables are not redistributed. |
| pheat-hydropathy | Native PHEAT | Kyte-Doolittle hydropathy/burial compatibility score using contact density or optional SASA. | arbitrary | SASA scoring requires a SASA backend; contact-density scoring is an approximation. |
| pheat-backbone | Native PHEAT | Backbone torsion plausibility from extracted phi/psi/omega geometry. | arbitrary | Requires ordered protein backbone atoms. |
| pheat-rotamer | Native PHEAT | Side-chain chi/rotamer plausibility by residue type. | arbitrary | Gly/Ala have no side-chain rotamer term; incomplete side chains score partially. |
| pheat-hbond | Native PHEAT | Heavy-atom donor/acceptor contact geometry and buried-polar term. | arbitrary | Protonation is inferred from heavy atoms and remains ambiguous. |
| pheat-rg | Native PHEAT | Expected-radius-of-gyration compactness penalty. Defaults to C-alpha, unweighted Rg with placeholder coefficients. | arbitrary | Shape score only; fit coefficients from an in-domain corpus before interpreting as a calibrated potential. |
| pheat-ml-linear | Native PHEAT | Lightweight linear combination of PHEAT score features. | arbitrary | Only meaningful with a trained table set from an in-domain corpus. |
| pheat-coarse-protein-folding-v1 | Native PHEAT | Coarse folding objective with end-to-end compactness, hydrophobic burial, contact, decoded torsion, aromatic, disulfide, steric, and geometry-integrity terms. | arbitrary | Heuristic lower-is-better objective for staged folding/reranking; not a physical free energy or trained statistical potential. |
| pheat-geometry-integrity | Native PHEAT | Robust coordinate-geometry plausibility score for backbone bonds, peptide C-N links, C-alpha chirality, peptide planarity, and proline ring closure. | arbitrary | Geometry-quality diagnostic only; missing atoms are skipped with warnings and the score is not a thermodynamic energy. |
| heavy-mm | Native PHEAT | Heavy-atom Lennard-Jones-like, simple charge, and backbone bond-length penalty terms. | arbitrary | Heavy-atoms-only approximation, not AMBER/OpenMM force-field energy. |
| openmm-prepared | External Python backend | OpenMM AMBER potential after internal OpenMM/PDBFixer preparation. | kJ/mol | Optional dependency path; requires OpenMM to run, uses PDBFixer when available, and may add hydrogens and missing terminal/heavy atoms internally for scoring without modifying input artifacts. |
| ambertools-sander | External executable backend | AmberTools tleap plus sander single-point AMBER molecular mechanics energy after preparation. | kcal/mol | Requires AmberTools executables and a parameterizable prepared protein; not a folding free energy. |
| gromacs-mdrun | External executable backend | GROMACS pdb2gmx, grompp, mdrun -rerun, and energy validation/reranking energy after topology preparation. | kJ/mol | Requires the gmx executable and a parameterizable protein; defaults to amber19sb, unsolvated rerun scoring, and is not a folding free energy. |
Result metadata for the built-in generic, pheat-dfire, pheat-goap, pheat-mj,
pheat-hydropathy, pheat-backbone, pheat-rotamer, pheat-hbond,
pheat-rg, pheat-ml-linear, pheat-coarse-protein-folding-v1,
pheat-geometry-integrity, and heavy-mm models labels the score
scale as arbitrary unless an exact external parameter source is added and
verified. The pheat-dfire score is generated from PHEAT's built-in
hydrophobicity, element-contact, coarse distance-bin, and side-chain burial
constants; pheat-goap adds a local orientation-vector term. Original DFIRE
and GOAP papers are cited as method inspiration only. pheat-rg currently uses
the placeholder form expected_rg = a * residue_count ** b and reports the
squared standardized deviation from that expectation; table sets can override
atom_set, mode, a, b, and sigma_fraction once fitted coefficients are
available.
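
The placeholder pheat-rg form can be written out as a few lines of arithmetic. This sketch assumes that sigma_fraction scales the expected Rg to give the standard deviation, which the text above does not state explicitly; the coefficient values are made up for illustration and are not PHEAT defaults.

```python
def rg_penalty(observed_rg, residue_count, a, b, sigma_fraction):
    """Squared standardized deviation of observed Rg from a power-law expectation."""
    expected_rg = a * residue_count ** b
    # Assumption: the spread is a fixed fraction of the expected Rg.
    sigma = sigma_fraction * expected_rg
    z = (observed_rg - expected_rg) / sigma
    return z * z


# Hypothetical coefficients, not fitted values.
penalty = rg_penalty(
    observed_rg=22.0, residue_count=100, a=2.0, b=0.5, sigma_fraction=0.1
)
print(round(penalty, 6))
```

With these numbers the expected Rg is 20.0, the assumed sigma is 2.0, and an observed Rg of 22.0 sits one standard deviation away, so the penalty is 1.0.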
pheat-coarse-protein-folding-v1 accepts optional decoded torsion angles in
radians from the Python API as decoded_torsions={"0_phi": -1.0, "1_chi1": 0.5}
or from the CLI with --decoded-torsions torsions.json, where the file is a JSON
object keyed by zero-based residue index and angle name. Non-numeric or
non-finite torsion values are ignored and counted in result metadata.
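
A decoded-torsions file of that shape could be filtered as sketched below. The function names are hypothetical and this is not PHEAT's loader; it assumes that booleans count as non-numeric and that keys always have the form "<index>_<angle name>".

```python
import json
import math


def split_torsion_key(key):
    """Split '0_phi' into (0, 'phi'): zero-based residue index and angle name."""
    index_text, angle_name = key.split("_", 1)
    return int(index_text), angle_name


def filter_decoded_torsions(text):
    """Return (kept torsions in radians, count of ignored entries)."""
    kept, ignored = {}, 0
    for key, value in json.loads(text).items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and math.isfinite(value):
            kept[split_torsion_key(key)] = float(value)
        else:
            ignored += 1
    return kept, ignored


# Python's json module accepts NaN by default, so it must be filtered here.
payload = '{"0_phi": -1.0, "1_chi1": 0.5, "2_psi": "n/a", "3_phi": NaN}'
kept, ignored = filter_decoded_torsions(payload)
print(kept, ignored)
```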
Scoring defaults to --domain protein-heavy, which ignores waters, ions,
ligands, nucleic acids, and hydrogens for PHEAT's protein-oriented scores.
Use --domain all-heavy or --domain full explicitly for broader heavy-atom
experiments. The same domain names are available from Python for explicit
structure filtering and PDB serialization: protein-heavy writes supported
protein heavy atoms, all-heavy keeps nonprotein heavy atoms, and full keeps
all atoms already present in the PHEAT structure object. Every energy-result
metadata payload reports the selected domain and atom/residue coverage. It also
reports an input_contract for the selected score model: the expected structure
type, accepted atom scopes, compatible domains, required atom families, hydrogen
handling, table usage, burial dependence, and whether the scorer operates
directly on coordinates, derived torsions, feature vectors, or an internally
prepared force-field system. Torsion or residue-geometry workflows should
reconstruct an atom structure first and score that coordinate structure unless a
future torsion-native scorer explicitly declares a different contract.
PHEAT's internal chain_id field is a string and can preserve full mmCIF chain
identifiers in atom-structure JSON, residue-geometry JSON, and mmCIF output. Legacy
PDB files have a one-character chain ID column. For that reason, direct PDB output
rejects chain IDs longer than one character unless --allow-pdb-chain-truncation
is selected; prefer mmCIF output when preserving full author or label chain IDs
matters. mmCIF input uses author chain/residue IDs by default and can read label
IDs with --chain-id-source label.
Radius-of-gyration calculations are geometric summary metrics, not energy terms.
Unweighted Rg measures the root-mean-square distance of supplied heavy atoms from
their coordinate centroid. Mass-weighted Rg uses the same coordinates with a
center of mass and mass-weighted squared distances. Unknown elements fall back to
carbon mass and are reported in the JSON payload's unknown_elements list.
Rg accepts the same atom-set names as RMSD: all-heavy by default, backbone,
or ca. The ca atom set matches atom name CA, not calcium element records,
and is useful for backbone-trace compactness.
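
The two Rg modes can be sketched in plain Python. This is an illustration of the definitions above, not PHEAT's structure_radius_of_gyration; the function name, the (element, coordinate) tuple layout, and the small mass table are assumptions made for the example.

```python
import math

# Standard atomic weights for the common protein heavy elements.
ATOMIC_MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def radius_of_gyration(atoms, mass_weighted=False):
    """atoms is a list of (element, (x, y, z)); returns (rg, unknown_elements)."""
    weights, unknown = [], []
    for element, _ in atoms:
        if not mass_weighted:
            weights.append(1.0)
        elif element in ATOMIC_MASS:
            weights.append(ATOMIC_MASS[element])
        else:
            # Unknown elements fall back to carbon mass and are reported.
            weights.append(ATOMIC_MASS["C"])
            unknown.append(element)
    total = sum(weights)
    # Centroid (unweighted) or center of mass (mass-weighted).
    center = [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]
    mean_square = sum(
        w * sum((xyz[axis] - center[axis]) ** 2 for axis in range(3))
        for w, (_, xyz) in zip(weights, atoms)
    ) / total
    return math.sqrt(mean_square), unknown


atoms = [("C", (0.0, 0.0, 0.0)), ("X", (2.0, 0.0, 0.0))]
print(radius_of_gyration(atoms))
print(radius_of_gyration(atoms, mass_weighted=True))
```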
Residue-geometry files are versioned with format: "pheat.residue-geometry-structure" and carry
angle_units as a required top-level field. Supported values are radians and
degrees; radians are emitted by default. Dihedrals are stored as conventional
signed torsion angles, so trans peptide omega values are near +/-180 degrees
rather than near zero. Per-residue chi arrays are ordered as
[chi1, chi2, ...], recorded by the top-level chi_order: "chi1_to_chiN" field.
When exporting residue geometry, --max-chi N truncates each residue's chi array to the
first N entries in that order; omitting it stores every extractable chi angle.
For supported modified residues, chi arrays follow the same template order and may
include template-specific torsions for the modification, such as phosphate or
pyrrolysine extension atoms.
Per-residue omega, tau, and theta are optional stored fields:
- omega: peptide-bond dihedral CA(i)-C(i)-N(i+1)-CA(i+1).
- tau: intra-residue bond angle N(i)-CA(i)-C(i).
- theta: peptide-link bond angle CA(i)-C(i)-N(i+1).
When those fields are absent during reconstruction, PHEAT falls back to its idealized backbone geometry constants.
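
The three optional fields are ordinary internal coordinates, so they can be measured from four or three atom positions. The helpers below are a generic sketch using the usual signed-dihedral formula, not PHEAT's extraction code; the coordinates are a made-up planar arrangement chosen so the results are easy to check.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def bond_angle(a, b, c):
    """Angle at b between b->a and b->c, in radians (tau and theta)."""
    u, v = _sub(a, b), _sub(c, b)
    cosine = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.acos(max(-1.0, min(1.0, cosine)))


def dihedral(p0, p1, p2, p3):
    """Signed torsion about the p1-p2 bond, in radians (omega)."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    b1_length = math.sqrt(_dot(b1, b1))
    y = _dot(_cross(n1, n2), b1) / b1_length
    x = _dot(n1, n2)
    return math.atan2(y, x)


# Hypothetical planar peptide link: CA(i), C(i), N(i+1), CA(i+1).
ca_i, c_i, n_next = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
ca_next_trans = (1.0, -1.0, 0.0)
ca_next_cis = (1.0, 1.0, 0.0)

omega_trans = dihedral(ca_i, c_i, n_next, ca_next_trans)
omega_cis = dihedral(ca_i, c_i, n_next, ca_next_cis)
theta = bond_angle(ca_i, c_i, n_next)
print(abs(math.degrees(omega_trans)), math.degrees(omega_cis), math.degrees(theta))
```

The trans arrangement gives an omega magnitude of 180 degrees and the cis arrangement gives 0 degrees, matching the signed-torsion convention described for stored omega values.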
Reconstruction uses the fixed Engh-Huber-style geometry profile by default. An
opt-in pheat.geometry-table-set can provide replacement reconstruction targets:
pheat geometry tables list
pheat geometry-to-structure residue-geometry.json \
-o structure.json \
--geometry-table ccd-sidechain-geometry-v1
pheat geometry tables build-backbone \
--training-set .pheat-cache/training/sets/protein-heavy-30id \
--output-root .pheat-cache/training/geometry/protein-heavy-30id-backbone
pheat geometry tables build-cdl \
--training-set .pheat-cache/training/sets/protein-heavy-30id \
--output-root .pheat-cache/training/geometry/protein-heavy-30id-cdl \
--phi-psi-bin-size 10 \
--min-bin-count 20
pheat geometry tables import-cdl \
--input cdl-like-table.json \
--output-root .pheat-cache/training/geometry/imported-cdl
pheat geometry-to-structure residue-geometry.json \
-o structure.json \
--geometry-table .pheat-cache/training/geometry/protein-heavy-30id-backbone/geometry-tables.json

Backbone geometry tables are PHEAT-owned artifacts generated from a selected local
corpus and record source corpus checksums, filters, PHEAT version, and command
arguments. They store default/residue-level bond targets and phi/psi-binned
tau/theta targets; table-mode reconstruction uses those binned targets only when
the residue supplies phi and psi and tau/theta were not stored explicitly.
build-cdl creates a PHEAT-generated conformation-dependent backbone profile
from the same selected local corpus. It bins residue phi/psi space, records
backbone bond-length and bond-angle targets, and can group observations as
gly-pro-general, canonical, or per-residue; stored per-residue bond lengths
and stored tau/theta still take precedence during reconstruction. The builder
does not vendor the official Phenix/CCTBX CDL tables. import-cdl accepts a
JSON CDL-like bin table and writes a normal PHEAT geometry-table-set while
recording the input path, SHA-256 checksum, and optional source-license string.
The --smoothing kernel option is recorded for generated table provenance, but
current runtime lookup uses the nearest matching phi/psi bin. CCD side-chain
geometry tables can be generated from the full wwPDB CCD
components.cif.gz file or from per-component CCD CIF files with
pheat geometry tables build-sidechain-ccd; the current builder uses PHEAT's
placement order and fills or validates bond lengths, angles, and element symbols
from CCD bond/model-coordinate data. The compact CCD BinaryCIF atom/bond subsets
are accepted as a connectivity-only input and warn that PHEAT template geometry
defaults are being used. PHEAT bundles the small derived
ccd-sidechain-geometry-v1 table as packaged runtime data under
src/pheat/data/geometry; raw CCD source files remain external cache/archive
artifacts and are not packaged.
Top-level disulfide_bonds entries preserve explicit CYS-CYS connectivity across
atom-structure and residue-geometry JSON. They do not add chi values or disulfide-specific
torsions; cysteine still stores its normal chi1 side-chain angle.
These optional backbone fields are coupled. Storing only omega can make a
roundtrip RMSD worse than using the ideal trans fallback because the real peptide
twist is then applied inside an otherwise idealized tau/theta frame. In the
committed 2MU7 combinatorial example, all-chi backbone RMSD is 0.9321 A with no
optional geometry, 1.0169 A with omega alone, 0.4932 A with omega,tau,
and 0.4047 A with omega,tau,theta. The omega values are still preserved
correctly; the difference reflects mixed real/ideal internal-coordinate geometry.
Draft 2020-12 schemas are bundled for the canonical atom-structure,
residue-geometry-structure, centroid-structure, energy-result,
radius-of-gyration-result, residue-angle-specs,
score-model-option-specs, scoring-options-validation, score-table-set,
geometry-table-set, and training-corpus JSON
formats:
from pheat.schemas import load_schema
residue_geometry_schema = load_schema("residue-geometry-structure")

The bundled schema $id values use stable public URLs under
https://pheat.tools.blankenberglab.org/schemas/. The same schema files are
published with the documentation site, for example
https://pheat.tools.blankenberglab.org/schemas/residue-geometry-structure.schema.json.
Saved atom-structure and residue-geometry JSON artifacts must use the current format
string and version: 1; file and JSON-string loaders reject other versions. Python
dictionary shorthand, such as {"sequence": "AG"}, remains available for direct API
construction.
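
The header checks described for saved residue-geometry artifacts can be sketched as follows. This is a hypothetical pre-check, not PHEAT's loader or a substitute for validating against the bundled schema; it only looks at the format string, version, and angle_units fields named in this README.

```python
import json

RESIDUE_GEOMETRY_FORMAT = "pheat.residue-geometry-structure"
SUPPORTED_VERSION = 1
ANGLE_UNITS = ("radians", "degrees")


def check_residue_geometry_header(text):
    """Parse a JSON string and reject unexpected format, version, or units."""
    payload = json.loads(text)
    if payload.get("format") != RESIDUE_GEOMETRY_FORMAT:
        raise ValueError(f"unexpected format: {payload.get('format')!r}")
    if payload.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version: {payload.get('version')!r}")
    if payload.get("angle_units") not in ANGLE_UNITS:
        raise ValueError("angle_units must be 'radians' or 'degrees'")
    return payload


saved = json.dumps(
    {"format": RESIDUE_GEOMETRY_FORMAT, "version": 1, "angle_units": "radians"}
)
print(check_residue_geometry_header(saved)["angle_units"])
```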
Model JSON serialization rounds floating-point values to 12 decimal places to keep committed artifacts stable across supported platforms without changing in-memory geometry or scoring calculations.
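
Rounding at serialization time can be sketched as a recursive pass over the payload just before it is written. The round_floats helper is hypothetical and only illustrates the idea; the in-memory values are left untouched because a new structure is returned.

```python
import json


def round_floats(value, digits=12):
    """Return a copy of a JSON-like structure with floats rounded."""
    if isinstance(value, float):
        return round(value, digits)
    if isinstance(value, dict):
        return {key: round_floats(item, digits) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [round_floats(item, digits) for item in value]
    return value


in_memory = {"coord": [0.1 + 0.2, 1.0 / 3.0], "serial": 7}
serialized = json.dumps(round_floats(in_memory))
print(serialized)
```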
OpenMM remains optional for the dependency-light core. The openmm,
training-full, dev, and all extras include OpenMM/PDBFixer on Python
3.10+, and the Python 3.11 Miniforge environment includes the same path for
local development. training intentionally omits OpenMM/PDBFixer for lighter
corpus/table workflows. The explicit openmm-prepared path may add missing
terminal atoms and hydrogens internally for scoring without modifying input
artifacts. PHEAT uses a fixed preparation seed for this path so regenerated
example artifacts are reproducible within a given OpenMM/PDBFixer version.
Successful OpenMM-prepared scores are reported in kJ/mol.
AmberTools and GROMACS scoring are executable-based and should be installed through
conda or another system distribution, not pip extras. The repository
environment.yml includes ambertools and gromacs; pip install .[all] installs the
Python optional dependencies but cannot provide tleap, sander, or gmx. Score a
heavy-atom or partial structure through AMBER preparation with:
pheat score input.pdb --model ambertools-sander --prepare auto
pheat score input.pdb --model ambertools-sander --prepare write \
--prepared-output prepared.pdb --ambertools-work-dir ambertools-run \
--external-timeout 300

AmberTools solvent mode defaults to vacuum. When --amber-solvent gb is
selected, PHEAT writes set default PBRadii mbondi3 into the generated tleap
input and records amber_pbradii: "mbondi3" in result metadata so GB setup is
auditable and reproducible.
GROMACS scoring is available as gromacs-mdrun. The default force field is
amber19sb, selected as the current native GROMACS protein-oriented default for
PHEAT validation/reranking; --gromacs-water auto resolves to none for the
default unsolvated score and to opc when --gromacs-solvate is selected.
The default run mode is rerun, which evaluates the prepared coordinates with
gmx mdrun -rerun instead of treating zero-step MD as a single-point score.
Unsolvated scoring still centers the prepared molecule in a GROMACS box so the
Verlet cutoff/PBC machinery is valid; it does not add water unless
--gromacs-solvate is selected.
PHEAT checks the active GROMACS force-field directory before running pdb2gmx
and reports the installed force-field names when the requested one is missing.
Some conda-forge GROMACS builds may not yet bundle amber19sb; in that case,
install a GROMACS/GMXLIB force-field set that provides it or select an installed
alternative such as --gromacs-forcefield amber99sb-ildn.
pheat score input.pdb --model gromacs-mdrun --prepare auto
pheat score input.pdb --model gromacs-mdrun \
--gromacs-forcefield amber19sb \
--gromacs-run-mode rerun \
--external-timeout 300 \
--gromacs-work-dir gromacs-run \
--keep-gromacs-files
pheat score all-atom.pdb --model gromacs-mdrun \
--domain full \
--hydrogens preserve \
--prepare never \
--prep-cache-dir .pheat-cache/external-prep \
--prep-cache-mode readwrite
pheat gromacs prepare input.pdb -o prepared.gro --topology topol.top
pheat gromacs minimize input.pdb -o minimized.gro --score-output minimize-score.json
pheat gromacs validate input.pdb --json gromacs-validation.json

GROMACS can also be used with --gromacs-run-mode minimize or
minimize-rerun; those modes intentionally change coordinates and should be
interpreted separately from pure rerun validation. GROMACS totals are comparable
only when the structures use the same force field, water/solvation setting,
termini/protonation policy, preparation path, and run mode.
External AmberTools and GROMACS commands accept --external-timeout SECONDS,
which applies to each subprocess invocation and fails with the captured stdout/stderr
tail when a command exceeds the limit. AmberTools command failures also include
the tail of leap.log or sander.out when those files were written, which helps
diagnose parameterization and geometry problems without preserving the whole
working directory. Use pheat scoring validate-options or pheat gromacs validate-options to check selected options before launching a run; validation
catches unsupported enum values, missing executables, missing GROMACS force
fields in the active installation, and invalid cache configuration. The same
validation is available to Python callers through
validate_external_scoring_options(...).
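
A per-invocation timeout with an output tail in the error can be sketched with the standard library. This is an assumption about the general pattern, not PHEAT's subprocess wrapper; run_with_timeout and the 400-character tail length are hypothetical, and the Python interpreter stands in for an external executable.

```python
import subprocess
import sys


def _tail(stream, limit=400):
    """Return the last characters of captured output, tolerating None or bytes."""
    if stream is None:
        return ""
    if isinstance(stream, bytes):
        stream = stream.decode("utf-8", errors="replace")
    return stream[-limit:]


def run_with_timeout(command, timeout_seconds):
    """Run one subprocess; raise with the output tail on timeout or failure."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"{command[0]} exceeded {timeout_seconds} s\n"
            f"stdout tail: {_tail(exc.stdout)}\nstderr tail: {_tail(exc.stderr)}"
        ) from exc
    if completed.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed with code {completed.returncode}\n"
            f"stderr tail: {_tail(completed.stderr)}"
        )
    return completed


completed = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_seconds=30)
print(completed.stdout.strip())
```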
GROMACS run settings are exposed for validation and reranking experiments:
--gromacs-minimize-steps, --gromacs-emtol, --gromacs-emstep,
--gromacs-box-distance, --gromacs-cutoff, --gromacs-coulombtype,
--gromacs-vdwtype, --gromacs-nstlist, --gromacs-pbc,
--gromacs-comm-mode, --gromacs-grompp-maxwarn, and repeated
--gromacs-mdrun-flag values. For example, pass
--gromacs-mdrun-flag=-ntomp --gromacs-mdrun-flag 4 to request four OpenMP
threads from mdrun.
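
The = form matters for values that begin with a dash. The sketch below uses a stand-alone argparse parser with a repeated option to show the pattern; it is not PHEAT's CLI definition, and the assembled command list is only illustrative. A parser of this kind would typically read a bare -ntomp as another option, while the attached --gromacs-mdrun-flag=-ntomp form is taken as a value.

```python
import argparse

parser = argparse.ArgumentParser(prog="sketch")
# Each occurrence appends one token to the list, preserving order.
parser.add_argument("--gromacs-mdrun-flag", action="append", default=None)

args = parser.parse_args(
    ["--gromacs-mdrun-flag=-ntomp", "--gromacs-mdrun-flag", "4"]
)
extra_flags = args.gromacs_mdrun_flag or []

# Illustrative only: forward the collected tokens to an mdrun invocation.
mdrun_command = ["gmx", "mdrun", *extra_flags]
print(mdrun_command)
```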
--prep-cache-dir plus --prep-cache-mode off|readwrite|readonly|refresh
records and optionally reuses external preparation artifacts. AmberTools records
cache metadata but still runs tleap, because its coordinate file is
candidate-specific. GROMACS can reuse a cached topology only for
--prepare never inputs that already include hydrogens and keep the same atom
order; runtime MDP and mdrun settings are intentionally not part of the topology
cache key. Use --domain full if the input contains hydrogens that must be retained.
Because PHEAT readers drop hydrogens by default, CLI cache-reuse runs also need
--hydrogens preserve. For heavy-atom default scoring, auto-preparation remains
the safer path and the cache reports itself as disabled rather than silently
reusing an incompatible topology.
PHEAT can validate corpus specs and build small local reference-corpus manifests from ID lists, local archives, or dry-run archived snapshot templates. The tiny demo uses only local fixtures and is intended as a workflow check:
pheat reference validate-spec examples/corpora/user_defined_ids_demo.yml
pheat reference build --corpus-spec examples/corpora/user_defined_ids_demo.yml --output-root .pheat-cache/corpora/user-defined-demo --overwrite

Related docs:
- docs/corpus-specs.md
- docs/reference-manifests.md
- docs/ccd-heterogen-annotation.md