Update README.md
Rishab Jain authored Aug 25, 2021
1 parent 07c6bbe commit 48af3a6
Showing 1 changed file with 54 additions and 46 deletions.
100 changes: 54 additions & 46 deletions README.md
@@ -14,16 +14,18 @@

---
- [About](#About)
- [Assets](#Assets)
- [Benchmark Results](#Benchmark-Results)
- [Benchmark Sequences](#Benchmark-Sequences)
- [ICOR Tool](#Tool)
- [Scripts](#Scripts)
- [Summaries](#Summaries)
- [Resources](#Resources)
- [Assets](#Assets)
- [Benchmark Results](#Benchmark-Results)
- [Benchmark Sequences](#Benchmark-Sequences)
- [ICOR Tool](#Tool)
- [Models](#Models)
- [Optimizers](#Optimizers)
- [Scripts](#Scripts)
- [Resources](#Resources)
- [Dependencies](#Dependencies)
---

## About
### About
Because there are 61 sense codons but only 20 standard amino acids, most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important part of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias from a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing the sequential information of genes to be learned. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context captured by the RNN may yield codon selection that is more similar to the host genome, thereby improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, the ICOR tool improved the codon adaptation index by 41.69% compared to the original sequences. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.

### Assets
@@ -36,7 +38,7 @@ Assets including images and branding for the ICOR tool, hosted on the [biotools
- `naive_benchmarks` which consists of the benchmark results for the naively optimized sequences.
- `original_benchmarks` which consists of the benchmark results for the original, unoptimized sequences.
- `super_naive_benchmarks` which consists of the benchmark results for the super naively optimized sequences.
- `genscript_benchmarks` which consists of the
- `genscript_benchmarks` which consists of the benchmark results for the [Genscript Gensmart™](https://www.genscript.com/gensmart-free-gene-codon-optimization.html) optimized sequences.

### Benchmark Sequences
`benchmark_sequences` is a folder that contains sequences for benchmarking purposes, each in the FASTA format:
@@ -47,34 +49,48 @@ Assets including images and branding for the ICOR tool, hosted on the [biotools
- `icor` which consists of 40 DNA sequences optimized by the ICOR optimizer.
- `naive` which consists of 40 DNA sequences optimized by the naive optimizer.
- `super_naive` consists of 40 DNA sequences optimized by the super naive optimizer.
- `genscript` consists of 40 DNA sequences optimized by the Genscript Gensmart tool.
- `genscript` consists of 40 DNA sequences optimized by the [Genscript Gensmart](https://www.genscript.com/gensmart-free-gene-codon-optimization.html) tool.

### Tool
The ICOR tool has been divided into four directories: models, optimizers, resources, and scripts. At the base of the directory sits the `run_icor.ipynb` file: an interactive notebook to optimize a sequence utilizing the trained ICOR model. Supporting files were used to train, evaluate, and test the ICOR model. Descriptions for these can be found below:
The ICOR tool has been divided into four directories: models, optimizers, resources, and scripts. In the `/tool/optimizers` directory sits the `icor_optimizer.py` file: an interactive script to optimize a sequence utilizing the trained ICOR model.

> Note: as of 8/24/2021, the ICOR optimizer Python script has a bug: it runs, but it does not output the correct sequence. The other script, `run_icor_from_mat`, does work and outputs the correct sequence given a `.mat` file as input. However, a user would normally either supply a FASTA file or paste in a sequence. This script currently accepts the pasted sequence, but the optimizer portion does not work as expected: it outputs a sequence, but not the correct one. Since the same model is used for inference in the `run_icor_from_mat` script, I have determined that the issue is caused not by the model file but by the encoding done in this script. I still have one or two things to try that I believe will solve this issue.
Supporting files were used to train, evaluate, and test the ICOR model. Descriptions for these can be found below:

#### Models
The models directory contains the trained ICOR model in the [ONNX](https://onnx.ai) (open-neural-network-exchange) format. Below is a preview of the model architecture:

<div style="text-align: right">
<img src="/assets/icor-small-visualization.png">
The ICOR model was trained in the MATLAB environment. For more details on the model architecture, please review our manuscript file in the base of the repository. Upon submission, this will be changed to a DOI/bioRxiv link.
</div>


`benchmark_genes.pdf`
> A document that contains all of the benchmarking genes and descriptions of them.
## Scripts
The following is a description of the purpose for each script in the repository.

`reformat_seqs.py`
> Iterate through each file in a directory and reformat the sequence uniformly.
#### Optimizers
`brute_force_optimizer.py`
> Naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It generates 10,000 sequences and chooses the one with the highest CAI.
`icor_optimizer.py`
> ICOR optimizer outputs a text file given an input sequence of amino acids or DNA. It is an interactive Python command-line script that runs inference with the ICOR model.
`naive_optimizer.py`
> Naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It selects codons to match the natural frequency that occurs within E. coli. This is what many tools in the industry use as well. This tool/script is built upon the `ecoli_codon_frequencies.csv` file in the summaries directory.
`super_naive_optimizer.py`
> Super naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It randomly selects a codon given an amino acid, making it a very naive approach.
`naive_optimizer.py`
> Naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It selects codons to match the natural frequency that occurs within E. coli. This is what many tools in the industry use as well. This tool/script is built upon the `ecoli_codon_frequencies.csv` file in the summaries directory.
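
The three baseline strategies described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the repository's scripts. The codon table is a made-up placeholder covering only three amino acids (the real weights live in `ecoli_codon_frequencies.csv`), the function names are hypothetical, and the brute-force variant is assumed to draw its candidates uniformly at random. The CAI here is the usual geometric mean of relative adaptiveness values, computed against the toy table.

```python
import math
import random

# Illustrative codon table for three amino acids. These weights are
# placeholders, NOT the values from ecoli_codon_frequencies.csv.
CODON_WEIGHTS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "F": {"TTT": 0.6, "TTC": 0.4},
}


def super_naive(protein, rng):
    """Pick any synonymous codon uniformly at random for each amino acid."""
    return "".join(rng.choice(sorted(CODON_WEIGHTS[aa])) for aa in protein)


def naive(protein, rng):
    """Pick codons in proportion to their (assumed) host frequency."""
    out = []
    for aa in protein:
        codons = sorted(CODON_WEIGHTS[aa])
        weights = [CODON_WEIGHTS[aa][c] for c in codons]
        out.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(out)


def cai(dna):
    """Simplified CAI: geometric mean of w = f(codon) / max f(synonymous codons)."""
    codon_to_w = {}
    for table in CODON_WEIGHTS.values():
        best = max(table.values())
        for codon, freq in table.items():
            codon_to_w[codon] = freq / best
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return math.exp(sum(math.log(codon_to_w[c]) for c in codons) / len(codons))


def brute_force(protein, rng, n_candidates=10_000):
    """Sample random candidates (an assumption) and keep the highest-CAI one."""
    candidates = (super_naive(protein, rng) for _ in range(n_candidates))
    return max(candidates, key=cai)


if __name__ == "__main__":
    rng = random.Random(0)
    for optimizer in (super_naive, naive, brute_force):
        seq = optimizer("MKF", rng)
        print(optimizer.__name__, seq, round(cai(seq), 3))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing optimizers on the same benchmark sequences.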
#### Scripts
The following is a description of the purpose for each script in the repository.

`convert_to_cds.py`
> Takes an input of DNA sequences and fetches their CDS only from the NCBI nuccore database. Rewrites files with CDS.
`csv_to_seqs.py`
> Takes an input of a CSV from the GenScript Gensmart tool and writes them into files containing the sequences in the FASTA format.
`reformat_seqs.py`
> Iterate through each file in a directory and reformat the sequence uniformly.
`run_benchmark.ipynb`
> An interactive notebook that helps benchmark a directory containing FASTA sequences across the following metrics:
@@ -83,35 +99,27 @@ The following is a description of the purpose for each script in the repository.
- CFD (known un-optimized gene that reduces efficiency)
- Negative CIS elements
- Negative repeat elements

`run_icor_from_mat.ipynb`
> A notebook that accepts a `.mat` file containing a single variable called "XTrain" of the cell array type. The cell array used in the experiments had shape 42266x1.
> Note: as of 8/24/2021 this script successfully outputs the ICOR optimized sequence and it does indeed match the correct ICOR optimization.
## Summaries
The following is a description of the purpose for each summary in the summaries folder.

`benchmark_genes.csv`
> Description of each benchmark gene used. Also available above, in the README file.
`codon_map.xlsx`
> Contains the codon map used for the AA2Codons dictionary.
`super_naive_benchmarks.csv`
> Contains the benchmarks for super_naively-created sequences.
`naive_benchmarks.csv`
> Contains the benchmarks for naively-created sequences.
`original_benchmarks.csv`
> Contains the benchmarks for the original sequences.
`ICOR_benchmarks.csv`
> Contains the benchmarks for the ICOR-optimized sequences.
#### Resources
The following is a description of the purpose for each resource in the resources folder.

`Benchmarking Results & Comparison - ICOR Codon Optimization.pdf`
> Contains an overview of the benchmarks, comparing each of the "tools" on each benchmark. Consult this sheet to see how the metrics differ between the tools.
`benchmark_genes.pdf`
> A table for all of the benchmark genes used for validation.
`codon_map.xlsx`
> Contains the codon map used for the AA2Codons dictionary.
`ecoli_codon_frequencies.csv`
> Contains the codon frequency weights for each codon/amino acid used in the E. coli genomes. The naive tool was built upon these frequencies.
`ecoli_codon_frequencies.xlsx`
> Contains the codon frequencies found in E. coli for each amino acid. The naive tool was built upon these frequencies.
## Dependencies
### Dependencies
- Python 3.9.4
- biopython
- numpy
@@ -120,4 +128,4 @@ The following is a description of the purpose for each summary in the summaries
- re
- selenium
- Chrome (chromedriver does not seem to work with Chromium; an actual Chrome installation is needed)
- [AA -> Codons dict](https://www.mathworks.com/help/bioinfo/ref/aa2nt.html)
- [AA -> Codons dict](https://www.mathworks.com/help/bioinfo/ref/aa2nt.html)
