Protein Secondary Structure Prediction

This project predicts the secondary structure of proteins using a deep learning model. The model is trained to classify each amino acid in a protein sequence into one of three categories: alpha helix (H), beta strand (E), and coil (C), which covers loops and irregular elements. The finer eight-state classes, such as beta bridge (B), 3-helix (G), pi helix (I), turn (T), and bend (S), are not predicted separately.

Project Overview

Protein secondary structure prediction is a critical task in bioinformatics, as it provides valuable insights into protein function and stability. This project leverages a Convolutional Neural Network (CNN) to make accurate predictions based on protein sequences.

Features

Convolutional Neural Network (CNN): Utilizes CNNs to capture patterns in protein sequences.
Three-state Secondary Structure: Predicts each residue as helix (H), beta strand (E), or coil (C, loops and irregular elements). Eight-state classes such as beta bridge (B), 3-helix (G), pi helix (I), turn (T), and bend (S) are not predicted separately.
Customizable: Easily adaptable to new datasets or extended to more states or features.
Model Persistence: Saves the trained model in Keras native format for easy reuse.
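
The README does not show how sequences are turned into CNN input. A common choice, assumed here rather than taken from the training code, is to one-hot encode each residue over the 20 standard amino acids and zero-pad to a fixed length. The helper name `one_hot_encode` and the `max_len` parameter are hypothetical.

```python
# Minimal sketch of one-hot encoding a protein sequence for a 1D CNN.
# The alphabet order and helper names are assumptions for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len):
    """Return a max_len x 20 matrix; rows past the end of seq stay all-zero."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq):
        # Raises KeyError for any residue outside the standard alphabet.
        matrix[pos][AA_INDEX[aa]] = 1
    return matrix


encoded = one_hot_encode("ACD", max_len=5)
```

Each row has a single 1 at the residue's index, so a convolution kernel sliding over consecutive rows sees a short window of neighbouring residues.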

Dataset

The dataset used in this project includes protein sequences and their corresponding secondary structures. It consists of the following columns:

pdb_id: PDB ID of the protein.
chain_code: Chain code of the peptide.
seq: Sequence of amino acids.
sst8: Eight-state (Q8) secondary structure (not used in this project).
sst3: Three-state (Q3) secondary structure.
len: Length of the peptide.
has_nonstd_aa: Indicates whether the peptide contains non-standard amino acids, which are marked with * in the sequence.

Non-standard amino acids are removed to ensure consistency.
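
One way to read the cleaning step above is to drop every row whose sequence contains a character outside the 20 standard amino acids. The sketch below does that with the standard library only. The two sample rows are made up for illustration and are not from the dataset, and `load_clean_rows` is a hypothetical helper, not a function from this repository.

```python
import csv
import io

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Made-up rows that only mimic the column layout described above.
SAMPLE_CSV = """pdb_id,chain_code,seq,sst8,sst3,len,has_nonstd_aa
1ABC,A,ACDEF,CHHHC,CHHHC,5,False
2XYZ,B,AC*EF,CEEEC,CEEEC,5,True
"""


def load_clean_rows(handle):
    """Keep (seq, sst3) pairs with only standard residues and matching lengths."""
    rows = []
    for row in csv.DictReader(handle):
        seq, sst3 = row["seq"], row["sst3"]
        if set(seq) <= STANDARD and len(seq) == len(sst3):
            rows.append((seq, sst3))
    return rows


clean = load_clean_rows(io.StringIO(SAMPLE_CSV))
```

Checking that `seq` and `sst3` have the same length is a cheap sanity test, since each residue needs exactly one label.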

Installation

Clone the Repository:

```
git clone https://github.com/yourusername/protein-secondary-structure-prediction.git
cd protein-secondary-structure-prediction
```

Create a Virtual Environment (optional but recommended):

```
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
```

Install Dependencies:
```
pip install -r requirements.txt
```
If you don't have a requirements.txt file, use the following command to install the necessary packages:
```
pip install numpy pandas tensorflow scikit-learn matplotlib
```

Usage

Prepare the Dataset:

Ensure the dataset (protein_sequences.csv) is placed in the data/ directory. The project uses a preprocessed dataset, so no additional preprocessing is required.

Run the Training Script:

You can train the model by running the train_model.ipynb notebook or executing the train_model.py script:
```
python train_model.py
```
The script will train the model and save it as protein_secondary_structure_model.keras in the root directory.
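
Three-state predictors are usually compared by per-residue accuracy, often called Q3 accuracy. The README does not say which metric the training script reports, so the function below is only an illustrative sketch; `q3_accuracy` is a hypothetical name.

```python
def q3_accuracy(true_structures, predicted_structures):
    """Fraction of residues whose predicted state matches the true state."""
    correct = 0
    total = 0
    for true, pred in zip(true_structures, predicted_structures):
        if len(true) != len(pred):
            raise ValueError("true and predicted strings must have equal length")
        correct += sum(t == p for t, p in zip(true, pred))
        total += len(true)
    return correct / total if total else 0.0


# Two toy label strings: 8 of the 9 residues agree.
score = q3_accuracy(["CHHHC", "CEEC"], ["CHHCC", "CEEC"])
```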

Predict Secondary Structure:

To predict the secondary structure of a new protein sequence, you can use the predict_secondary_structure.py script or run the predict_secondary_structure.ipynb notebook:

```
python predict_secondary_structure.py
```

Example usage in Python:

```
from predict_secondary_structure import predict_structure

# Example protein sequence
new_sequence = "ACDEFGHIKLMNPQRSTVWY"

# Predict secondary structure
predicted_structure = predict_structure(new_sequence)
print(f'Predicted secondary structure: {predicted_structure}')
```

The output will be a string representing the secondary structure, e.g., "CCCCCHHHHHHHCCCEEEEE".
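
A string like this is typically produced by taking the most probable class for each residue. The sketch below assumes the model outputs one probability row per residue and that the classes are ordered H, E, C. Neither detail is confirmed by this README, and `decode_predictions` is a hypothetical helper.

```python
# Assumed order of the model's three output units.
CLASSES = "HEC"


def decode_predictions(probabilities):
    """Map each row of class probabilities to the letter of its largest entry."""
    letters = []
    for row in probabilities:
        best = max(range(len(CLASSES)), key=lambda i: row[i])
        letters.append(CLASSES[best])
    return "".join(letters)


# Toy probabilities for a three-residue sequence.
decoded = decode_predictions([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
])
```

If the sequence was padded before prediction, the padded positions should be cut off before decoding so the output has one letter per input residue.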

File Structure

data/: Directory containing the dataset.
train_model.ipynb: Jupyter notebook for training the model.
train_model.py: Python script for training the model.
predict_secondary_structure.ipynb: Jupyter notebook for predicting secondary structures.
predict_secondary_structure.py: Python script for predicting secondary structures.
protein_secondary_structure_model.keras: Saved Keras model.
README.md: This readme file.

Contributing

Contributions are welcome! Please fork the repository and create a pull request with your changes. Ensure your code adheres to the project’s coding standards and is well-documented.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

The dataset used in this project was derived from https://cdn.rcsb.org/etl/kabschSander/ss.txt.gz, curated by https://github.com/zyxue/pdb-secondary-structure, and downloaded from Kaggle.
Inspiration for the project structure was drawn from various bioinformatics resources and tutorials.

Contact

For any questions or feedback, please open an issue in the repository or contact the project maintainers at [[email protected]].

Feel free to adapt and extend this project for your research or educational purposes. Happy coding!

Repository: Pranjal-Bioinfo/Protein-Secondary-structure-Prediction-
