Google Summer of Code Ideas 2019

About Google Summer of Code

Google Summer of Code is a summer program that offers students stipends to develop software for open source projects.

How to apply

Get familiar with UCSC Xena and its codebase. Next, take on some help wanted issues and submit a pull request. This will help us see how we might work with you during the summer.

Finally, either develop your project proposal based on one of the ideas or come up with your own. If you are a prospective student interested in doing your Google Summer of Code (GSoC) project with us, please contact us as soon as possible. We will do our best to assist and guide you in the formulation of your GSoC project proposal.

If you have any questions about any of the ideas, please join our Google Group or send us a private email.

Project ideas

Refactor chart view (Refactor a large number of functionalities)

Implement more extensive Google Analytics coverage (Low-hanging fruit)

Update GDC data ingestion pipeline and run (Infrastructure/Automation)

Modify Xena Hub to be able to accept Microsoft Excel Files (Core development)

Enhance Xena Hub to directly use .h5 single-cell RNA-seq data files (Core development)

Change how Xena Hub stores metadata for phenotypic datasets (Core development)

Refactor chart view

Background

We have two main visualizations: our primary Visual Spreadsheet and the chart view, which draws bar charts, box plots, and scatter plots. Users select columns of data in the Visual Spreadsheet, which become options for the x- and y-axis in the chart view.

Currently we use highcharts.js for our chart view, which does not follow the architecture of the rest of our site. This is especially frustrating since it does not use our current state model. Maintaining these differences takes time and energy from our team.

Goal

Refactor this code to use a React-based library instead. The functionality will remain exactly the same, including all statistical tests we currently perform. Students will need to evaluate possible libraries to determine which one we should use. This will include determining the requirements a library must meet to work in our codebase.

An additional goal, if there is time, would be to add a violin plot to our current charts. This would likely require incorporating another library, which would need to be vetted. This GitHub issue has some leads on libraries.

Ideally, the project would also add this feature: https://github.com/ucscXena/ucsc-xena-client/issues/163

Difficulty: Difficult

Required Skills: JavaScript, React.js, Rx.js, some knowledge of basic statistics including the t-test, ANOVA, Wilcoxon tests, and Spearman and Pearson correlation

Mentors: Brian Craft, Mary Goldman

Implement more extensive Google Analytics coverage

Background

While we have some general knowledge of how users move through our application, we need more specifics. In particular, we need to determine which features and datasets users typically use to know where to focus our future development efforts and to give more detailed reports to grant agencies.

Goal

Implement more extensive Google Analytics coverage: track more of the application's functionality and record which datasets are being viewed.

Difficulty: Easy

Required Skills: Javascript

Mentors: Brian Craft, Mary Goldman

Update GDC data ingestion pipeline and run

Background

The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). The GDC obtains validated datasets from NCI programs in which the strategies for tissue collection couple quantity with high quality. The GDC encourages data sharing in support of precision medicine. Tools are provided to guide data submissions by researchers and institutions.

While we already visualize all the public data in the GDC, our copy is approximately a year old, and the GDC APIs have changed since then. We need to update our pipeline to accommodate these changes.

Goal

Update our pipeline to download the data from the GDC and transform it so that it can be loaded into Xena.
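As a rough illustration of the download step, the sketch below builds a query URL for the GDC files endpoint using the nested filter syntax of the GDC API. The function names (`build_filters`, `build_files_query`) are hypothetical, and the field names and example values are assumptions that should be checked against the current GDC API documentation, since the API has changed since our pipeline was written. No request is sent here; the URL would be passed to an HTTP client.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata (verify against current GDC docs).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_filters(project_id, data_type):
    """Build a GDC-style nested filter: open-access files of one data type in one project."""
    def is_in(field, values):
        return {"op": "in", "content": {"field": field, "value": values}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", [project_id]),  # assumed field name
            is_in("data_type", [data_type]),                  # assumed field name
            is_in("access", ["open"]),                        # public data only
        ],
    }


def build_files_query(project_id, data_type, size=100):
    """Return a GET URL listing matching files; the caller performs the request."""
    params = {
        "filters": json.dumps(build_filters(project_id, data_type)),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


# Hypothetical usage: list up to five expression files for one TCGA project.
example_url = build_files_query("TCGA-BRCA", "Gene Expression Quantification", size=5)
```

The file identifiers returned by such a query would then feed the download and transformation stages of the pipeline.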

Pointers

GDC data: https://gdc-portal.nci.nih.gov/search/s?facetTab=cases

GDC API: https://gdc.cancer.gov/developers/gdc-application-programming-interface-api

Our current GDC pipeline: https://github.com/yunhailuo/xena-GDC-ETL

Difficulty: Medium

Required Skills: Python

Mentors: Jing Zhu, Mary Goldman

Modify Xena Hub to be able to accept Microsoft Excel Files

Background

To visualize their own data in Xena, users download a Hub onto their laptop, load their data into the Hub, and then point a web browser at their laptop Hub. Currently users must upload a .tsv (tab-delimited) file, but they typically have a Microsoft Excel file (.xls or .xlsx). Many users do not know how to convert their Excel file to tab-delimited.

Goal

Support users loading their Microsoft Excel file.
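One possible shape for this feature is to read the spreadsheet with an existing Excel library and then reuse the Hub's current tab-delimited loader. The sketch below shows only the second half of that idea: writing rows that have already been read from a worksheet as tab-delimited text. It is written in Python purely for illustration (the Hub itself is Clojure and Java), and `rows_to_tsv` is a hypothetical helper, not existing Hub code.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize spreadsheet rows (lists of cell values) as tab-delimited text.

    Empty cells (None) become empty fields, so every row keeps its column count.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical worksheet contents: a header row, then one row per sample.
sheet_rows = [
    ["sample", "age", "sample_type"],
    ["S1", 42, "Primary Tumor"],
    ["S2", None, "Normal"],
]
tsv_text = rows_to_tsv(sheet_rows)
```

A real implementation would also need to decide how to handle multiple worksheets, merged cells, and Excel date values, none of which are covered here.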

An additional goal, if there is time, would be to support the loading of files from GEO, a popular database for microarray data.

Difficulty: Medium

Required Skills: Clojure, Java

Mentors: Brian Craft, Mary Goldman, Jing Zhu, Holly Beale

Enhance Xena Hub to directly use .h5 single-cell RNA-seq data files

Background

Very large single-cell RNA-seq data files generated on the 10x Genomics platform (such as the 1.3 million cell dataset) are in .h5 format. Currently, we convert .h5 files to text format offline, which essentially turns the sparse data into a much larger dense matrix, and then load the dense matrix into Xena Hub. The .h5 files are already indexed and much smaller in size. The converted dense matrix is much bigger, and the wrangling process takes time. Xena Hub is implemented in Clojure, and Xena Browser in JavaScript.
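To make the sparse versus dense distinction concrete, here is a minimal sketch assuming the compressed sparse column (CSC) layout, in which a matrix is held as three arrays: `data` (nonzero values), `indices` (their row numbers), and `indptr` (where each column starts). That 10x .h5 files use this layout is an assumption to verify against the file format documentation. The helper names are hypothetical, and plain Python lists stand in for the arrays that would be read from the file.

```python
def column_values(data, indices, indptr, col):
    """Return {row_index: value} for the nonzero entries of one column.

    Only the slice belonging to this column is touched, which is why an
    indexed sparse file can answer a per-cell query without densifying.
    """
    start, end = indptr[col], indptr[col + 1]
    return dict(zip(indices[start:end], data[start:end]))


def to_dense(data, indices, indptr, n_rows):
    """Expand the sparse arrays into a full row-major matrix, zeros included."""
    n_cols = len(indptr) - 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for row, value in column_values(data, indices, indptr, col).items():
            dense[row][col] = value
    return dense


# Toy matrix with 4 genes (rows) and 3 cells (columns):
#   cell0: gene1 = 2
#   cell1: gene0 = 5, gene3 = 1
#   cell2: gene3 = 7
data = [2, 5, 1, 7]
indices = [1, 0, 3, 3]
indptr = [0, 1, 3, 4]

cell1 = column_values(data, indices, indptr, 1)
dense = to_dense(data, indices, indptr, n_rows=4)
```

In this layout a per-column lookup is cheap, while a per-row lookup has to scan the `indices` array, so the direction of the queries the Hub needs to serve would influence the design.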

Goal

Enhance Xena Hub to use .h5 single-cell RNA-seq data files directly, so that Xena Hub queries run against the .h5 files without an offline conversion step. We also anticipate some small changes on the Xena Browser side.

Difficulty: Difficult

Required Skills: Clojure, Java, Javascript

Mentors: Brian Craft, Jing Zhu

Refactor the way Xena Hub stores metadata for phenotypic datasets

Goal

Change the way Xena Hub stores metadata for phenotypic data from a strict database schema to a blob of JSON text. We also anticipate that changes will be needed on the Xena Browser side to query the metadata.
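The idea of replacing fixed columns with a JSON blob can be sketched as follows. This is an illustration only: it uses Python with an in-memory SQLite database as a stand-in for the Hub's actual storage, and the table name, function names, and metadata fields are all hypothetical.

```python
import json
import sqlite3


def store_metadata(conn, dataset, metadata):
    """Save a dataset's metadata as one JSON text value instead of fixed columns."""
    conn.execute(
        "INSERT OR REPLACE INTO dataset_meta (name, meta) VALUES (?, ?)",
        (dataset, json.dumps(metadata)),
    )


def load_metadata(conn, dataset):
    """Return the decoded metadata dict, or None if the dataset is unknown."""
    row = conn.execute(
        "SELECT meta FROM dataset_meta WHERE name = ?", (dataset,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_meta (name TEXT PRIMARY KEY, meta TEXT)")

# Hypothetical phenotype metadata; new keys need no schema migration.
phenotype_meta = {
    "sample_type": {"valueType": "category", "order": ["Primary Tumor", "Normal"]},
    "age": {"valueType": "float", "units": "years"},
}
store_metadata(conn, "clinical.tsv", phenotype_meta)
loaded = load_metadata(conn, "clinical.tsv")
```

The trade-off is that the database can no longer validate or index individual metadata fields, so any querying of them moves into application code, which is consistent with the expected changes on the Xena Browser side.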

Difficulty: Difficult

Required Skills: Clojure, Java, Javascript

Mentors: Brian Craft, Jing Zhu

Perform genome-wide analysis on Xena

Background

Currently the Xena KM (Kaplan-Meier) and Chart analyses are popular, but users can only run these analyses one column at a time. Users have asked for functionality to run the same analysis on a genome-wide scale, or for all the lncRNA genes. The goal is to develop genome-wide analysis functionality for Xena. Possible approaches to explore are running the analysis in Xena Browser, or running it on a local Xena Hub or a cloud resource and communicating the results back to users through Xena Browser. Genomics data is stored in public Xena Hubs, and the user interface is in Xena Browser.
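Whatever approach is chosen, the core of a genome-wide analysis is running the same per-column test over every gene and ranking the results. The sketch below illustrates that loop using Pearson correlation against a numeric phenotype as a simple stand-in for the KM and Chart statistics. The function names, gene names, and values are hypothetical, and a real implementation would also need p-values and multiple-testing correction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def genome_wide_scan(expression, phenotype):
    """Run the same test for every gene and rank genes by absolute correlation.

    `expression` maps gene name -> per-sample values, in the same sample
    order as `phenotype`. Genes with no variation are skipped.
    """
    results = []
    for gene, values in expression.items():
        r = pearson(values, phenotype)
        if r is not None:
            results.append((gene, r))
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results


# Hypothetical data: four genes measured in four samples.
expression = {
    "GENE_A": [1, 2, 3, 4],
    "GENE_B": [4, 3, 2, 1],
    "GENE_C": [5, 5, 5, 5],  # constant, so it is skipped
    "GENE_D": [1, 3, 2, 4],
}
phenotype = [2, 4, 6, 8]
ranked = genome_wide_scan(expression, phenotype)
```

With tens of thousands of genes, the cost is dominated by fetching the data rather than by the arithmetic, which is why the choice between running in Xena Browser and running next to the data in a Hub or cloud resource matters.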