Cancer Prediction from DNA Methylation Site Identification

DNA Sequence Cancer Diagnosis is a machine learning project that aims to predict cancer by identifying DNA methylation sites. It uses multidimensional information within gene sequences to improve prediction accuracy over traditional methods such as whole-genome bisulfite sequencing (WGBS).

Overview

DNA methylation plays a crucial role in regulating gene expression by influencing DNA stability and chromosome structure. This project addresses the limitations of traditional methods and uses machine learning to predict DNA methylation modification sites efficiently.

Features

Predicts DNA methylation modification sites using machine learning.
Encodes gene sequences with three methods: Binary encoding of Position Feature (BPF), Coding of nucleic acid chemical properties (NCP), and Coding of Dinucleotide physical and chemical properties (DPCP).
Uses a Convolutional Neural Network (CNN) named MEDCNN for multidimensional feature encoding.
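As a rough illustration of the first two encodings, here is a minimal sketch in plain Python. The lookup tables are assumptions based on the values commonly used for BPF (one-hot in A, C, G, T order) and NCP (ring structure, functional group, hydrogen bonding); the tables used by this repository may differ. DPCP is left out because it depends on a table of dinucleotide property values that this README does not give. The names `encode` and `encode_bpf_ncp` are hypothetical.

```python
# Assumed BPF table: one-hot code per base, in A, C, G, T order.
BPF = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

# Assumed NCP table: (ring structure, functional group, hydrogen bonding).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def encode(sequence, table):
    """Map each nucleotide to its code; returns one row per position."""
    rows = []
    for position, base in enumerate(sequence.upper()):
        if base not in table:
            raise ValueError(f"unsupported base {base!r} at position {position}")
        rows.append(table[base])
    return rows


def encode_bpf_ncp(sequence):
    """Concatenate BPF and NCP codes per position: 4 + 3 = 7 values per base."""
    return [b + n for b, n in zip(encode(sequence, BPF), encode(sequence, NCP))]


features = encode_bpf_ncp("TGTGG")
print(len(features), len(features[0]))  # 5 7
```

Each sequence becomes a matrix with one row per nucleotide, which is the kind of input a 1D convolution can slide over.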

Getting Started

Prerequisites

Python
Libraries: NumPy, Pandas, Scikit-learn, TensorFlow, Keras

Installation

```bash
git clone https://github.com/devatraj/DNA-Sequence-Cancer-Diagnosis.git
cd DNA-Sequence-Cancer-Diagnosis
pip install -r requirements.txt
```

Usage

Prepare your dataset with entries containing an index, a label, and a gene sequence.

Run the preprocessing script:

```bash
python preprocess.py --input_data dataset.csv --output_features features.npy --output_labels labels.npy
```

Train the model:

```bash
python train.py --features features.npy --labels labels.npy --epochs 10
```

Make predictions:

```bash
python predict.py --input_sequence new_sequence.txt
```

Data

The dataset includes entries with an index, label (0 for not methylated, 1 for methylated), and gene sequences.

Example:

Index | Label | Text
0     | 1     | TGTGGAGGGAGAGGCGCGGGCGGGAACTGCTGCTGTGCATG
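To check a dataset before preprocessing, something like the following could be used. It is a sketch that assumes the data is stored as CSV with a header row named `index,label,text`; the actual column names and delimiter expected by `preprocess.py` are not stated here, and `load_records` is a hypothetical helper.

```python
import csv
import io

VALID_BASES = set("ACGT")


def load_records(handle):
    """Read (index, label, sequence) rows and reject malformed entries."""
    records = []
    for row in csv.DictReader(handle):
        label = int(row["label"])
        sequence = row["text"].strip().upper()
        if label not in (0, 1):
            raise ValueError(f"row {row['index']}: label must be 0 or 1")
        if not sequence or set(sequence) - VALID_BASES:
            raise ValueError(f"row {row['index']}: unexpected characters in sequence")
        records.append((int(row["index"]), label, sequence))
    return records


# The example entry from the table above, written as assumed CSV.
sample = io.StringIO("index,label,text\n0,1,TGTGGAGGGAGAGGCGCGGGCGGGAACTGCTGCTGTGCATG\n")
records = load_records(sample)
print(records[0][1], len(records[0][2]))  # 1 41
```

With a real file, pass `open("dataset.csv", newline="")` in place of the in-memory sample.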

Model Training

The model is a Convolutional Neural Network (CNN) named MEDCNN. The three encoding methods (BPF, NCP, DPCP) represent each gene sequence as a multidimensional feature matrix, which the network takes as input.
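The MEDCNN architecture is not described in this README, so the following is only a generic 1D CNN sketch in Keras showing how an encoded sequence matrix could feed a binary classifier. The layer sizes, the input length of 41 (the length of the example sequence in the Data section), and the 7 channels (assuming BPF and NCP codes are concatenated per position) are all placeholder assumptions, and `build_cnn_sketch` is a hypothetical name.

```python
SEQ_LEN = 41     # assumed: length of the example sequence in the Data section
N_CHANNELS = 7   # assumed: 4 BPF values + 3 NCP values per nucleotide


def build_cnn_sketch(seq_len=SEQ_LEN, n_channels=N_CHANNELS):
    """Build a small 1D CNN for binary classification (not the real MEDCNN)."""
    # Imported here so this file can be loaded without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(seq_len, n_channels)),
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # probability of methylation
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_cnn_sketch().summary()` prints the layer shapes. Training would then use `model.fit(features, labels, epochs=10)` on arrays shaped `(n_samples, 41, 7)` and `(n_samples,)`.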

Evaluation

The model's performance is evaluated based on metrics such as accuracy, precision, recall, and F1 score.
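For reference, the four metrics can be computed from the confusion-matrix counts as in the sketch below. It treats label 1 (methylated) as the positive class and returns 0.0 where a ratio is undefined; both choices are assumptions, and the repository may instead rely on Scikit-learn. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(metrics)
```

In this toy example there are 2 true positives, 1 true negative, 1 false positive, and 1 false negative, so accuracy is 3/5 and precision, recall, and F1 are each 2/3.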

Results

Provide insights into the model's predictions and any patterns discovered.

Contributing

Feel free to contribute by submitting issues, proposing new features, or enhancing the existing ones.

License

This project is licensed under the Apache License.

Acknowledgements

Thank you to the research community for advancements in computational methods for DNA methylation prediction.