Basic Overview

Using RCSB data, train a transformer to predict protein structure and intrinsically disordered regions (IDRs). Quickly evaluate models on an array of performance metrics.

Overview of Files

get_pdbs.py : Download mmCIF files, extract atom positions & protein sequence.
amino_expert.py : Utilize "standard aminos" to assist with amino-atom-index interconversions.
parse_seq.py : Align full protein sequence with known atom sequence. Data pre-processing.
transformer.py : Final data processing & Transformer training. Code for concurrent trainings & CUDA acceleration.
eval_model.py : Take model save-states & obtain training & validation accuracies at different time points.
logging_setup.py : Initialize & handle our logging framework.

Highlights

Atom-Level Intrinsically Disordered Region Identification
- Used mmCIF parsing to extract atom positions, then compared these against the known amino acid sequence and the "ideal aminos". Atoms that were "supposed" to be present but had no recorded position were classified as intrinsically disordered.
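The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `IDEAL_ATOMS` table, the `find_missing_atoms` function, and the input shapes are all hypothetical, and only two residue types are listed.

```python
# Hypothetical "ideal" heavy-atom templates for two residues; a real table
# would cover every standard amino acid.
IDEAL_ATOMS = {
    "GLY": ["N", "CA", "C", "O"],
    "ALA": ["N", "CA", "C", "O", "CB"],
}


def find_missing_atoms(sequence, observed):
    """Return (residue_index, atom_name) pairs that are expected but absent.

    sequence: three-letter residue names for the full protein sequence.
    observed: dict mapping residue index -> set of atom names that have
              coordinates in the structure file.
    """
    missing = []
    for index, residue in enumerate(sequence):
        seen = observed.get(index, set())
        for atom in IDEAL_ATOMS[residue]:
            if atom not in seen:
                missing.append((index, atom))
    return missing


sequence = ["GLY", "ALA", "GLY"]
observed = {
    0: {"N", "CA", "C", "O"},
    1: {"N", "CA", "C", "O"},  # CB has no recorded position
    # residue 2 has no coordinates at all
}
disordered = find_missing_atoms(sequence, observed)
```

Here `disordered` holds the side-chain atom missing from residue 1 plus every atom of residue 2, which is the set this approach would label as disordered.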
CUDA Acceleration with PyTorch
- Utilized PyTorch with CUDA to enable massive parallelization on GPUs. Significantly reduces training & inference time.
Memory Mapping for Large Datasets
- Utilized memory mapping so that large datasets stay on disk and only the samples needed for the current batch are moved to the target device. Allows datasets far larger than a single batch without holding the entire dataset in memory and without constantly opening and closing files.
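As an illustration of the idea, the sketch below maps a binary file of atom positions once and decodes only the records a batch asks for. It uses the standard-library `mmap` module as a stand-in; the project may well use a different mechanism (for example an array-level memory map), and the record layout, file name, and function names here are assumptions.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<3f")  # assumed layout: x, y, z as little-endian float32


def write_positions(path, positions):
    """Write one fixed-size record per atom position."""
    with open(path, "wb") as handle:
        for xyz in positions:
            handle.write(RECORD.pack(*xyz))


def read_batch(mapped, start, count):
    """Decode `count` records beginning at record index `start`."""
    return [
        RECORD.unpack_from(mapped, (start + i) * RECORD.size)
        for i in range(count)
    ]


path = os.path.join(tempfile.mkdtemp(), "positions.bin")
write_positions(path, [(float(i), 0.0, 1.0) for i in range(1000)])

with open(path, "rb") as handle:
    # Map the file once; the OS loads pages lazily as a batch touches them.
    mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    batch = read_batch(mapped, start=500, count=2)
    mapped.close()
```

Only the two requested records are decoded into Python objects; the rest of the file is never read into process memory by this code.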
Multi-Node Management
- We manage concurrent training of multiple models, even when worker nodes share a Network File System (NFS). Training several models at the same time speeds up hyper-parameter search.
Big-Data Management
- Pre-processing results are saved to separate files, so one large pre-processing run can be completed once and never repeated, which significantly speeds up later runs. Only relevant information is saved, which increases how many proteins can be stored. A set of protein codes can be specified and used to train or evaluate models.
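A run-once cache of this kind might look like the sketch below. The JSON format, the `load_or_preprocess` function, and the per-protein file naming are assumptions made for illustration; the project's actual on-disk format is not described here.

```python
import json
import os
import tempfile


def load_or_preprocess(protein_id, cache_dir, preprocess):
    """Return cached data for protein_id, calling preprocess only on a miss."""
    cache_path = os.path.join(cache_dir, f"{protein_id}.json")
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            return json.load(handle)
    data = preprocess(protein_id)
    with open(cache_path, "w") as handle:
        json.dump(data, handle)
    return data


calls = []


def fake_preprocess(protein_id):
    """Stand-in for the expensive download-and-parse step."""
    calls.append(protein_id)
    return {"id": protein_id, "sequence": "GA"}


cache_dir = tempfile.mkdtemp()
first = load_or_preprocess("1ABC", cache_dir, fake_preprocess)
second = load_or_preprocess("1ABC", cache_dir, fake_preprocess)
```

The second call is served from disk, so `fake_preprocess` runs only once for that protein code.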
Extensive Logging
- To enable quick debugging in multi-node environments, we use extensive logging to track errors as they occur. We 'fail gracefully', so this large project can handle unexpected cases & continue training even if exceptions occur.
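One common way to get this behaviour is to wrap each unit of work in a `try`/`except` that logs the traceback and moves on. The sketch below assumes per-protein processing; the logger name and the `process_all` and `flaky` functions are hypothetical.

```python
import logging

logger = logging.getLogger("pipeline_sketch")  # hypothetical logger name


def process_all(protein_ids, process_one):
    """Process each protein, logging failures instead of aborting the run."""
    succeeded, failed = [], []
    for protein_id in protein_ids:
        try:
            process_one(protein_id)
            succeeded.append(protein_id)
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("Failed to process %s; skipping", protein_id)
            failed.append(protein_id)
    return succeeded, failed


def flaky(protein_id):
    """Stand-in worker that fails on one specific input."""
    if protein_id == "BAD1":
        raise ValueError("unparseable structure")


succeeded, failed = process_all(["1ABC", "BAD1", "2XYZ"], flaky)
```

The failing entry is recorded and skipped, and the remaining proteins are still processed.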
Model Saves
- Save models during training at a configurable frequency. Allows a model to be recovered after a node failure. Also enables retroactive performance analysis by iterating through every saved model & testing its performance at that stage of training.
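The save-then-replay pattern can be sketched independently of any framework. The example below uses `pickle` on a plain dictionary purely as a stand-in for a real model save; the function names, file naming, and toy "training step" are all assumptions.

```python
import os
import pickle
import tempfile


def train_with_checkpoints(state, steps, save_every, out_dir, train_step):
    """Run `steps` training steps, saving the state every `save_every` steps."""
    saved = []
    for step in range(1, steps + 1):
        state = train_step(state)
        if step % save_every == 0:
            path = os.path.join(out_dir, f"step_{step:06d}.pkl")
            with open(path, "wb") as handle:
                pickle.dump(state, handle)
            saved.append(path)
    return state, saved


def evaluate_checkpoints(paths, metric):
    """Reload each checkpoint in order and score it with `metric`."""
    scores = []
    for path in paths:
        with open(path, "rb") as handle:
            scores.append(metric(pickle.load(handle)))
    return scores


out_dir = tempfile.mkdtemp()
final_state, paths = train_with_checkpoints(
    {"weight": 0},
    steps=6,
    save_every=2,
    out_dir=out_dir,
    train_step=lambda s: {"weight": s["weight"] + 1},  # toy update
)
history = evaluate_checkpoints(paths, metric=lambda s: s["weight"])
```

Replaying the saved states in order yields one score per checkpoint, which is what makes after-the-fact training curves possible.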
ESM Embedding Integration
- Utilizes an existing architecture (ESMFold) for amino acid embedding. Accelerates training by shortening model warm-up.
Extensive Model Analysis
- Utilizes a range of established structure-comparison metrics, including RMSD, TM-score and GDT.
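For reference, the core formulas behind these three metrics are short. The sketch below evaluates them on coordinates that are assumed to be already superimposed and matched one-to-one; full implementations also search over superpositions, which is omitted here. The function names are hypothetical, and the 0.5 floor on `d0` for short chains is an assumption of this sketch.

```python
import math


def distances(predicted, reference):
    """Per-atom Euclidean distances between matched coordinates."""
    return [math.dist(p, r) for p, r in zip(predicted, reference)]


def rmsd(predicted, reference):
    """Root-mean-square deviation over all matched atoms."""
    d = distances(predicted, reference)
    return math.sqrt(sum(x * x for x in d) / len(d))


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean fraction of atoms within 1, 2, 4 and 8 angstroms."""
    d = distances(predicted, reference)
    fractions = [sum(x <= c for x in d) / len(d) for c in cutoffs]
    return sum(fractions) / len(cutoffs)


def tm_score(predicted, reference):
    """TM-score term sum for a fixed superposition."""
    length = len(reference)
    d0 = 0.5  # assumed floor for short chains
    if length > 15:
        d0 = max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)
    d = distances(predicted, reference)
    return sum(1.0 / (1.0 + (x / d0) ** 2) for x in d) / length


# Toy chain: 20 points along x; the prediction is offset by 3 angstroms in y.
reference = [(3.8 * i, 0.0, 0.0) for i in range(20)]
predicted = [(x, y + 3.0, z) for x, y, z in reference]
scores = (
    rmsd(predicted, reference),
    gdt_ts(predicted, reference),
    tm_score(predicted, reference),
)
```

RMSD grows with error, while GDT and TM-score are bounded by 1 and reach it only for a perfect match.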
Protein Rendering
- Framework established for rendering proteins from their predicted positions. Takes an existing mmCIF file & modifies the position of each atom.

How to install

Clone the repository onto your device.
Install the required packages with pip install -r requirements.txt
Install PyTorch with CUDA support. Documentation for this installation can be found here. Helpful debugging documentation can be found here.

Please Note: Some packages may not be listed in requirements.txt and will require manual download. This code was run with Python 3.11.

How to use

Data Processing

Run parse_seq.py to automatically download and parse files. Note that you may need to run transformer.py first so that the required directories are created automatically.

Protein IDs can be specified in PDBs/protein-ids.txt. Protein IDs are the 4-character codes used to download a protein's structure from the RCSB.
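A small sketch of reading such a list is shown below. It assumes one ID per line, which may not match the real file layout, and the function names are hypothetical. The URL pattern is the public RCSB file-download address; whether the project builds its URLs this way is also an assumption.

```python
import re

# Assumed validation rule: exactly four alphanumeric characters.
PDB_ID_PATTERN = re.compile(r"^[0-9A-Za-z]{4}$")


def parse_protein_ids(text):
    """Return upper-cased IDs from text with one ID per line (assumed layout)."""
    ids = []
    for line in text.splitlines():
        candidate = line.strip().upper()
        if not candidate:
            continue  # skip blank lines
        if not PDB_ID_PATTERN.match(candidate):
            raise ValueError(f"Not a 4-character protein ID: {candidate!r}")
        ids.append(candidate)
    return ids


def mmcif_url(protein_id):
    """Build the RCSB download URL for one entry's mmCIF file."""
    return f"https://files.rcsb.org/download/{protein_id}.cif"


ids = parse_protein_ids("1abc\n\n 4HHB \n")
urls = [mmcif_url(protein_id) for protein_id in ids]
```

Validating IDs up front means a typo in the list is reported before any download starts.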

Pre-processed data will automatically be saved to PDBs/pre_processed_data. If downloading many proteins (>1,000) it's advisable to run this as soon as possible, as the data pre-processing may take a while.

Training

To train a model, run transformer.py. If running from the terminal, you may optionally add arguments.

transformer.py <Node name> <reprocess> <data size> <model heads> <model depth>

Node name: Name of the node this model is training on. Useful when concurrently training multiple models.
Reprocess: Either 'y' or 'f'. If 'y', will run parse_seq.py to re-download and pre-process all data. If 'f', will skip this step.
Data size: Number of proteins from your pre-processed data you would like to include. Allows for debugging with a small sample without altering previously completed data pre-processing.
Model heads: Number of heads in our multi-head transformer architecture.
Model depth: Number of attention layers applied in succession.
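To make the positional arguments concrete, here is a hedged sketch of how they could be parsed. It is not taken from transformer.py: the `parse_args` function, the dictionary keys, and every default value are hypothetical.

```python
# Hypothetical defaults, in the same order as the documented arguments.
DEFAULTS = ["local", "f", "100", "4", "2"]


def parse_args(argv):
    """Parse the five optional positional arguments (script name excluded)."""
    if len(argv) > len(DEFAULTS):
        raise ValueError("Expected at most 5 arguments")
    # Fill any trailing arguments that were not supplied with defaults.
    values = list(argv) + DEFAULTS[len(argv):]
    node_name, reprocess, data_size, heads, depth = values
    if reprocess not in ("y", "f"):
        raise ValueError("reprocess must be 'y' or 'f'")
    return {
        "node_name": node_name,
        "reprocess": reprocess == "y",
        "data_size": int(data_size),
        "heads": int(heads),
        "depth": int(depth),
    }


# Equivalent to running: transformer.py node-a y
config = parse_args(["node-a", "y"])
```

Because the arguments are positional, a later argument can only be given if all earlier ones are given too.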

Future Work

Rotation space
- Use relative rotations of bonds instead of absolute coordinates. Prevents punishing model for producing correct predictions which are just rotated.
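The motivation can be shown with a toy example. The sketch below rotates a small structure and compares two views of it: raw coordinates, which change, and pairwise distances, which do not. Pairwise distances are used here only as a simple stand-in for an internal representation; they are not the bond-rotation scheme proposed above, and all names are hypothetical.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def pairwise_distances(points):
    """Distances between every pair of points (a rotation-invariant view)."""
    return [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


def coordinate_rmsd(a, b):
    """RMSD on raw coordinates, with no superposition step."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


structure = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 1.0)]
rotated = rotate_z(structure, math.pi / 2)

absolute_error = coordinate_rmsd(structure, rotated)
internal_error = max(
    abs(a - b)
    for a, b in zip(pairwise_distances(structure), pairwise_distances(rotated))
)
```

A loss on raw coordinates penalizes the rotated copy even though its shape is identical, while the internal view reports essentially no difference.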
Smaller proteins
- Train our model exclusively on proteins which are shorter than our sequence length. Investigate what happens to model performance as the protein size increases.
Energy Minimization
- Include an energy minimization step at the end to make results more reasonable.
Multi-GPU runs
- Integrate Ray or another multi-node framework.
Investigate Model Performance on:
- Proteins in similar families
- "Toy" models (Test model on simplest alpha helix, simplest beta sheet, etc.)
- IDR performance (modify the model so it does not train on IDRs, but still interpret its output for them)

Contributors

👨‍💻 Mark Stevens @MarkAStevens04
👩‍💻 Adelina Patlatii @AdelinaPatlatii
👨‍💻 Eilia Zomorrodian @E-zomorod
👩‍💻 Shabnam Tajik @ShbnmTjk
