okaris / progen2

ProGen: Language Modeling for Protein Engineering

Input

  • Model id (string). Default: "progen2-small"
  • Device to run model on (string). Default: "cuda:0"
  • Random number generator seed (integer). Default: 42
  • Use deterministic RNG (boolean). Default: true
  • Top-p (nucleus) sampling probability (number). Default: 0.9
  • Sampling temperature (number). Default: 0.8
  • Maximum length of generated text (integer). Default: 1024
  • Number of samples to generate (integer). Default: 2
  • Use mixed precision (boolean). Default: true
  • Context to use for generation (string). Default: "1"
  • Run sanity check (boolean). Default: true
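
The temperature and sampling-probability inputs interact, so a small worked example may help. The following is a minimal, illustrative sketch of temperature scaling followed by top-p (nucleus) filtering, written in plain Python. It is not the code this model runs; the function names are hypothetical and chosen only for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Draw one token index using temperature scaling and top-p filtering."""
    rng = rng or random.Random(42)  # fixed seed only to make the sketch repeatable
    probs = softmax_with_temperature(logits, temperature)
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the top-p cut; a temperature near 1 with a top-p near 1 leaves the distribution almost unchanged.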

Run time and cost

This model costs approximately $0.050 per run on Replicate (about 20 runs per $1), though the cost varies with your inputs. It is open source, so you can also run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 36 seconds, but prediction time varies significantly with the inputs.
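
As a rough budgeting aid, the figures above can be turned into a quick estimate. This is a back-of-the-envelope sketch that assumes the approximate per-run cost holds for every run, which the page itself says it may not; the helper names are hypothetical.

```python
COST_PER_RUN_USD = 0.050  # approximate per-run cost quoted above


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Number of runs one dollar buys at the given per-run cost."""
    return round(1 / cost_per_run)


def estimated_cost(num_runs, cost_per_run=COST_PER_RUN_USD):
    """Estimated total cost in USD for num_runs predictions."""
    return round(num_runs * cost_per_run, 2)
```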

Readme

ProGen2

Official release of the ProGen2 models (151M, 764M, 2.7B, 6.4B) for Protein Engineering (see paper).

Models

Model           Size  Checkpoint
progen2-small   151M  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-small.tar.gz
progen2-medium  764M  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-medium.tar.gz
progen2-oas     764M  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-oas.tar.gz
progen2-base    764M  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-base.tar.gz
progen2-large   2.7B  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-large.tar.gz
progen2-BFD90   2.7B  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-BFD90.tar.gz
progen2-xlarge  6.4B  https://storage.googleapis.com/sfr-progen-research/checkpoints/progen2-xlarge.tar.gz
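
All checkpoint URLs in the table follow the same pattern, so they can be derived from the model name. The sketch below assumes that pattern keeps holding for the listed models; `checkpoint_url` is a hypothetical helper, not part of the ProGen2 code.

```python
CHECKPOINT_BASE = "https://storage.googleapis.com/sfr-progen-research/checkpoints"

# Model names and sizes as listed in the table above.
MODEL_SIZES = {
    "progen2-small": "151M",
    "progen2-medium": "764M",
    "progen2-oas": "764M",
    "progen2-base": "764M",
    "progen2-large": "2.7B",
    "progen2-BFD90": "2.7B",
    "progen2-xlarge": "6.4B",
}


def checkpoint_url(model):
    """Return the checkpoint archive URL for a listed model name."""
    if model not in MODEL_SIZES:
        raise ValueError(f"unknown model: {model!r}")
    return f"{CHECKPOINT_BASE}/{model}.tar.gz"
```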

Setup

# get the code
git clone https://github.com/salesforce/progen
cd progen/progen2

# download and unpack a checkpoint
model=progen2-large
wget -P "checkpoints/${model}" "https://storage.googleapis.com/sfr-progen-research/checkpoints/${model}.tar.gz"
tar -xvf "checkpoints/${model}/${model}.tar.gz" -C "checkpoints/${model}/"

# create and activate a virtual environment, then install dependencies
python3.8 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip setuptools
pip3 install -r requirements.txt

# sample sequences
python3 sample.py --model "${model}" --t 0.8 --p 0.9 --max-length 1024 --num-samples 2 --context "1"

# log-likelihood (GenBank: TMF32756.1)
python3 likelihood.py --model ${model} --context "1MGHGVSRPPVVTLRPAVLDDCPVLWRWRNDPETRQASVDEREIPVDTHTRWFEETLKRFDRKLFIVSADGVDAGMVRLDIQDRDAAVSVNIAPEWRGRGVGPRALGCLSREAFGPLALLRMSAVVKRENAASRIAFERAGFTVVDTGGPLLHSSKARLHVVAAIQARMGSTRLPGKVLVSIAGRPTIQRIAERLAVCQELDAVAVSTSVENRDDAIADLAAHLGLVCVRGSETDLIERLGRTAARTGADALVRITADCPLVDPALVDRVVGVWRRSAGRLEYVSNVFPPTFPDGLDVEVLSRTVLERLDREVSDPFFRESLTAYVREHPAAFEIANVEHPEDLSRLRWTMDYPEDLAFVEAVYRRLGNQGEIFGMDDLLRLLEWSPELRDLNRCREDVTVERGIRGTGYHAALRARGQAP2"
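
In the commands above, the sampling context is "1" and the scored sequence is wrapped as "1" + residues + "2". The sketch below builds the same command lines from Python under the assumption that "1" and "2" act as start and end markers, which is inferred from these examples rather than stated here. The helper names are hypothetical; only the script names and flags come from the commands above.

```python
START_TOKEN = "1"  # assumption: marks the start of a sequence
END_TOKEN = "2"    # assumption: marks the end of a sequence


def wrap_sequence(sequence):
    """Wrap an amino-acid string in the start and end markers."""
    residues = sequence.strip().upper()
    if not residues.isalpha():
        raise ValueError("sequence must contain letters only")
    return f"{START_TOKEN}{residues}{END_TOKEN}"


def likelihood_command(model, sequence):
    """Argument list for scoring one sequence with likelihood.py."""
    return ["python3", "likelihood.py", "--model", model,
            "--context", wrap_sequence(sequence)]


def sample_command(model, temperature=0.8, top_p=0.9, max_length=1024,
                   num_samples=2, context=START_TOKEN):
    """Argument list for sampling with sample.py, mirroring the flags above."""
    return ["python3", "sample.py", "--model", model,
            "--t", str(temperature), "--p", str(top_p),
            "--max-length", str(max_length),
            "--num-samples", str(num_samples),
            "--context", context]
```

Either list can be passed to `subprocess.run(cmd, check=True)` from inside the `progen/progen2` directory once the checkpoint and virtual environment are in place.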

Citation

If you find our code or paper useful, please cite:

@article{ProGen2,
  title={ProGen2: Exploring the Boundaries of Protein Language Models},
  author={Nijkamp, Erik and Ruffolo, Jeffrey and Weinstein, Eli N. and Naik, Nikhil and Madani, Ali},
  journal={arXiv},
  year={2022}
}

License

Our code and models are BSD-3 licensed. See LICENSE.txt for details.

Ethics

Predicting the fitness of a protein sequence and capturing the distribution of natural proteins for generative purposes could be a powerful tool for protein design. If our technique or a future iteration thereof is adopted broadly, care should be taken in terms of the end use-cases of these designed samples and downstream effects to ensure safe, non-nefarious, and ethical applications. For projects in any domain, active oversight during project initiation, experimental optimization, and deployment phases should be put in place to ensure safe usage and limitation of unintended harmful effects.