For this project, you will implement and run the featurization and random forest procedure described in Yu, et al. (Cell Systems, 2016) on the S. cerevisiae (baker's yeast) data from Costanzo, et al. (Science, 2010).
The input to your algorithm is a matrix of genetic interaction scores for pairs of genes and a hierarchy of gene sets. The genetic interactions are stored as a square NumPy matrix, with a companion file that lists the gene names for the rows/columns. The hierarchy is stored in a tab-separated text file, where each line lists the genes (leaves) in a set (internal node) of the hierarchy.
You can find a small example dataset for your project in data/examples.
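As a rough illustration of the input format described above, here is a minimal loading sketch. It rests on several assumptions that are not confirmed by this document: that the matrix is saved in NumPy's .npy format, that the gene-name file has one name per line, and that every tab-separated field on a hierarchy line is a gene name (no leading set identifier). The file names, gene names, and the helper functions load_interactions and load_hierarchy are hypothetical; check the files in data/examples and adjust accordingly.

```python
import os
import tempfile

import numpy as np


def load_interactions(matrix_path, genes_path):
    """Load a square interaction matrix and its row/column gene names.

    Assumes a .npy matrix and a gene file with one name per line.
    """
    scores = np.load(matrix_path)
    with open(genes_path) as f:
        genes = [line.strip() for line in f if line.strip()]
    # The matrix should be square, with one gene name per row/column.
    if scores.ndim != 2 or scores.shape[0] != scores.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {scores.shape}")
    if len(genes) != scores.shape[0]:
        raise ValueError("gene list length does not match matrix size")
    return scores, genes


def load_hierarchy(path):
    """Read one gene set per line, assuming every tab-separated field is a gene."""
    gene_sets = []
    with open(path) as f:
        for line in f:
            members = [g for g in line.rstrip("\n").split("\t") if g]
            if members:
                gene_sets.append(set(members))
    return gene_sets


# Tiny round-trip demo on made-up data (not the files in data/examples).
with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "scores.npy")
    genes_path = os.path.join(tmp, "genes.txt")
    hierarchy_path = os.path.join(tmp, "hierarchy.tsv")

    np.save(matrix_path, np.array([[0.0, -1.2, 0.3],
                                   [-1.2, 0.0, 0.8],
                                   [0.3, 0.8, 0.0]]))
    with open(genes_path, "w") as f:
        f.write("GENE_A\nGENE_B\nGENE_C\n")
    with open(hierarchy_path, "w") as f:
        f.write("GENE_A\tGENE_B\nGENE_A\tGENE_B\tGENE_C\n")

    scores, genes = load_interactions(matrix_path, genes_path)
    gene_sets = load_hierarchy(hierarchy_path)
```

Validating the shape and the gene-list length at load time is a design choice meant to catch mismatched files early, before any features are computed from them.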
You will need to download real data for your project and process it into the same format as the example data. You will create an S. cerevisiae hierarchy from the Gene Ontology.
For genetic interaction data, Yu, et al. used the ~3 million interactions from Costanzo, et al. (Science, 2010). However, because ~3 million interactions is probably too many to process in a short period of time, please use the data from Collins, et al. (Nature, 2007) instead. I've already preprocessed the data, and you can download it from this link.