Rule Based Learning for Transcriptional Regulation

This is the GitHub repo for the hackseq19 project: Rule Based Learning for Transcriptional Regulation!

Rationale

Gene regulatory sites, such as Transcription Factor Binding Sites (TFBS) and promoters, are extremely important regions within both eukaryotic and prokaryotic genomes. Predicting whether or not a site acts as a regulatory element is an important, yet surprisingly difficult, task. In recent years, much work has focused on building machine learning (ML) approaches for automatically detecting these genomic regions. In this hackathon, we hope to experiment with some of these tools.

Goals

Our goals during hackseq19 are to:

  • a) Build an accurate classifier for a given gene regulation dataset.
  • b) Build an interpretable classifier that outputs useful rules describing each dataset.

We will experiment with many different classifiers, including decision trees, random forests, support vector machines, and neural networks. Accuracy is measured using F1 score, which we can visualize on our leaderboard (see below). Interpretability is measured by how clearly we can deduce rules from our dataset. An example rule:

IF Position[2] == "G" AND Position[3] == "C" THEN Class == "TFBS"
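As a rough illustration of how such a rule could be applied and scored, here is a minimal Python sketch. It is not the project's actual code: the function names `rule_predict` and `f1_score`, the "non-TFBS" label, the toy sequences, and the choice of 0-based positions are all assumptions made for this example.

```python
def rule_predict(seq: str) -> str:
    """Apply the example rule to one sequence.

    Assumption: Position[i] is read as a 0-based index into the sequence.
    """
    if len(seq) > 3 and seq[2] == "G" and seq[3] == "C":
        return "TFBS"
    return "non-TFBS"  # hypothetical name for the negative class


def f1_score(y_true, y_pred, positive="TFBS") -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up toy data, only to exercise the two functions above.
sequences = ["ATGCAT", "CCGCTA", "ATGAAT", "TTGCGG"]
labels = ["TFBS", "TFBS", "TFBS", "non-TFBS"]

predictions = [rule_predict(s) for s in sequences]
score = f1_score(labels, predictions)
print(predictions, round(score, 3))
```

On this toy data the rule gets two true positives, one false positive, and one false negative, so precision and recall are both 2/3. A single short rule like this is easy to read but will usually trade some F1 against a less interpretable model.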

Data

Our leaderboard page is available here. You are required to sign in using your Google account. Once signed in, you can choose your username and submit files to the leaderboard. The leaderboard is adapted from the website used by my Natural Language Processing course professor.

Datasets:

  • 1 Human Chromosome #1 TFBS
  • 1 E. coli K-12 TFBS
  • 2 E. coli K-12 Promoter Region
  • 1 Pokemon

These come from a variety of sources, including gene regulation databases and previous Kaggle competitions.

Results

The following graph shows our progress in improving classifier accuracy over the course of hackseq19. The x-axis is measured in hours since the start of the hackathon, and the y-axis is the F1 score. We annotated the times when we noticeably improved our position on the leaderboard. Dashed lines represent our "oracle", the highest recorded accuracy in the literature. As you can see, we beat the oracle score for Human SP1 TFBS!

Team Members

Team Lead:
Alex Sweeten

Participants:
Aris Grout

Chahat Upreti

Jade Chen

Kate Gibson

Oriol Fornes

Priyanka Mishra

Shawn Hsueh

Zakhar Krekhno