Data, code, and results for our paper GO Big or Go Home: A New Gene Ontology Subset that Improves Plant Gene Function Prediction (Link). This repository contains two subsets, GO Big maize and GO Big plant, in OWL and OBO file formats, along with the methods used to generate them. Each subset has its own directory with a README that provides more information. GO terms were collected manually for the GO Big maize subset and automatically for the GO Big plant subset.
If you use any of our code or results in a scientific publication, please cite the paper.
Background: Gene function prediction datasets help researchers consider possible functions for uncharacterized genes, supporting hypothesis generation, candidate gene prioritization, and many other applications. Many such datasets are based on the Gene Ontology (GO) function graph. For plants, this can be problematic because the most specific GO terms available are often derived from the biology of non-plant taxa (e.g., terms specific to nerve function are unlikely to map to plant biological processes, given that plants lack nerves). To balance functional specificity against relevance to plant biology, researchers often restrict annotations to the GO Slim plant subset. By design, however, that subset consists of very general terms, which limits its utility for tasks such as specific hypothesis generation. Worse yet, researchers sometimes simply discard terms that are not relevant to plant biology, rather than traversing the GO graph to select the most specific term in that hierarchy that is compatible with plant biology.
Results: We created GO Big, a Gene Ontology subset type, to improve the biological relevance of gene function predictions for taxon-specific biology applications. GO Big plant subsets retain maximal functional specificity for hypothesis generation while limiting terms to those applicable to the biology of plants. In brief, we used a curatorial approach to generate two GO Big subsets: a general subset derived from terms with experimentally validated functions across Viridiplantae species, and a species-specific subset for maize (Zea mays ssp. mays).
Conclusion: Annotating genes with assignments that better reflect the biology of a taxon can pave the way for more biologically accurate and testable hypotheses for genes of interest. The subsets produced here can help plant biologists limit genome-wide gene function prediction sets to functions possible for plant genes. The process used to generate GO Big subsets is described in detail so that others can create GO Big subsets for additional taxon sets, including ones for protists, fungi, and other phylogenetic categories.
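The Background mentions traversing the GO graph to find the most specific term that is still compatible with plant biology, instead of discarding an incompatible term outright. The sketch below illustrates that idea in Python on a toy graph. It is not the code used in this repository: the function names, the `parents` mapping, and the term labels are all hypothetical, and a real workflow would load terms and relationships from the OBO or OWL files.

```python
from collections import deque


def ancestors(term, parents):
    """Return every ancestor of `term` in a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def nearest_subset_ancestors(term, parents, subset):
    """Map `term` to the most specific term(s) in `subset` at or above it."""
    if term in subset:
        return {term}
    found = set()
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in subset:
                found.add(parent)      # stop climbing this branch
            else:
                queue.append(parent)   # keep looking further up
    # Drop any hit that is itself an ancestor of another hit.
    redundant = set()
    for hit in found:
        redundant |= ancestors(hit, parents) & found
    return found - redundant


# Toy graph with made-up labels (hypothetical, not real GO identifiers).
parents = {
    "neuron_signal": ["cell_signal"],
    "cell_signal": ["cell_communication"],
    "cell_communication": ["biological_process"],
}
subset = {"biological_process", "cell_communication", "cell_signal"}

print(nearest_subset_ancestors("neuron_signal", parents, subset))
# {'cell_signal'}
```

In this toy case the nerve-specific term is replaced by its closest plant-compatible parent rather than being dropped, so the annotation keeps as much specificity as the subset allows.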