
Google Summer of Code Ideas 2019

jingchunzhu edited this page Mar 11, 2019 · 43 revisions

About Google Summer of Code

Google Summer of Code is a summer program that offers students stipends to develop software for open source projects.

How to apply

Get familiar with UCSC Xena and its codebase. The browser client JavaScript codebase is at https://github.com/ucscXena/ucsc-xena-client. The data hub Clojure codebase is at https://github.com/ucscXena/ucsc-xena-server. Next, take on some of the client "help wanted" issues and hub "help wanted" issues and submit a pull request. This will help us see how we might work with you during the summer.

Finally, either develop your project proposal based on one of the ideas or come up with your own. If you are a prospective student interested in doing your Google Summer of Code (GSoC) project with us, please contact us as soon as possible. We will do our best to assist and guide you in the formulation of your GSoC project proposal.

If you have any questions about any of the ideas, please join our Google Group or send us a private email.

Project ideas

Refactor chart view (refactoring of a large number of functionalities)

Implement more extensive Google Analytics coverage (low-hanging fruit)

Update GDC data ingestion pipeline and run (infrastructure/automation)

Modify Xena Hub to be able to accept Microsoft Excel Files (core development)

Enhance Xena Hub to directly use h5 single-cell RNAseq datafile (core development)

Change how Xena Hub stores metadata for phenotypic dataset (core development)

Perform genome-wide analysis on Xena (core development)

Develop BRCAness View (new functionality)


Refactor chart view

Background

We have two main visualizations: our primary Visual Spreadsheet and the charts view which draws bar charts, box plots and scatter plots. Users select columns of data in the Visual Spreadsheet, which become options for the x- and y-axis in the chart view.

Currently we use highcharts.js for our chart view, which does not follow the architecture of the rest of our site. This is especially frustrating since it does not use our current state model. Maintaining these differences takes time and energy from our team.

Goal

Refactor this code to use a React-based library instead. The functionality will remain exactly the same, including all statistical tests we currently perform. Students will need to evaluate possible libraries to determine which one we should use. This includes determining the requirements a library must meet to work in our codebase.

An additional goal, if there is time, would be to add a violin plot to our current chart view. This would likely require incorporating another library, which would need to be vetted. This GitHub issue has some leads on libraries.

Ideally, the student would also add this feature: https://github.com/ucscXena/ucsc-xena-client/issues/163

Difficulty: Difficult

Required Skills: JavaScript, React.js, Rx.js, some knowledge of basic statistics including the t-test, ANOVA, Wilcoxon tests, and Spearman and Pearson correlation

Mentors: Brian Craft, Mary Goldman


Implement more extensive Google Analytics coverage

Background

While we have some general knowledge of how users move through our application, we need more specifics. In particular, we need to determine which features and datasets users typically use to know where to focus our future development efforts and to give more detailed reports to grant agencies.

Goal

Implement more extensive Google Analytics coverage, tracking more of the application's functionality and recording which datasets are being viewed.

Difficulty: Easy

Required Skills: Javascript

Mentors: Brian Craft, Mary Goldman


Update GDC data ingestion pipeline and run

Background

The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET).

While we already visualize all the public data from TCGA and TARGET in the GDC, our copy is approximately a year old. New data is available on the GDC, and the GDC API is likely to have changed. We need to update our pipeline to accommodate both of these changes.

Goal

Update our pipeline to download the public tier data from the GDC and transform it so that it can be loaded into Xena.
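As a rough illustration of the "transform" step, the sketch below merges per-sample measurements into a single gene-by-sample, tab-delimited matrix. It assumes the downloaded files have already been parsed into dictionaries. The function names, the "gene" header label, the sample names, and the "NA" placeholder are all hypothetical and are not taken from the current pipeline.

```python
import csv
import io


def merge_samples(per_sample):
    """Merge {sample_id: {gene_id: value}} into rows of a gene-by-sample matrix."""
    samples = sorted(per_sample)
    # Union of gene identifiers across all samples, in a stable order.
    genes = sorted({g for values in per_sample.values() for g in values})
    rows = [["gene"] + samples]
    for gene in genes:
        # A gene missing from a sample becomes "NA" (an assumption of this sketch).
        rows.append([gene] + [str(per_sample[s].get(gene, "NA")) for s in samples])
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical input: two samples with partially overlapping genes.
example = {
    "sample_A": {"TP53": 5.2, "BRCA1": 3.1},
    "sample_B": {"TP53": 4.8},
}
matrix = merge_samples(example)
tsv_text = to_tsv(matrix)
```

A real pipeline would also need to handle the download itself, identifier mapping, and any value transformations, none of which are shown here.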


Pointers

GDC data: https://gdc-portal.nci.nih.gov/search/s?facetTab=cases

GDC API: https://gdc.cancer.gov/developers/gdc-application-programming-interface-api

Our current GDC pipeline: https://github.com/yunhailuo/xena-GDC-ETL

Difficulty: Medium

Required Skills: Python

Mentors: Jing Zhu, Mary Goldman


Modify Xena Hub to be able to accept Microsoft Excel Files

Background

To visualize their own data in Xena, users download a Hub onto their laptop, load their data into the Hub, and then point a web browser at the laptop Hub. Currently users must upload a .tsv (tab-delimited) file, but they typically have a Microsoft Excel file (.xls or .xlsx). Many users do not know how to convert their Excel file to tab-delimited format.


Goal

Support users loading their Microsoft Excel file.

An additional goal, if there is time, would be to support the loading of files from GEO, a popular database for microarray data.

Difficulty: Medium

Required Skills: Clojure, Java

Mentors: Brian Craft, Mary Goldman, Jing Zhu, Holly Beale


Xena Hub directly uses h5 single-cell RNAseq datafile

Background

Very large single-cell RNA-seq data files generated on the 10x Genomics platform (such as the 1.3 million cell dataset) are in h5 (HDF5) format. Currently, we convert .h5 files to text format offline, which essentially turns the sparse data into a much larger dense matrix, and then load the dense matrix into the Xena Hub. The .h5 files are already indexed and much smaller; the converted dense matrix is much bigger, and the wrangling process takes time. The Xena Hub is implemented in Clojure, and the Xena Browser in JavaScript.

HDF5 Feature Barcode Matrix Format https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices

Goal

Enhance the Xena Hub to use .h5 single-cell RNA-seq data files directly, so that Hub queries can be answered from the h5 files themselves. We also anticipate some small changes on the Xena Browser side.

An .h5 file produced by 10x Genomics needs to be transposed before it can be used for our purpose. We have already made some of the transposed files.
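To make the transposition concrete, here is a minimal pure-Python sketch. It assumes the matrix is held in compressed sparse column (CSC) form, with `data`, `indices`, and `indptr` arrays and one column per cell, as described in the 10x documentation linked above. In that layout, reading all of one cell's values is cheap, while reading one gene across all cells means scanning every column. After transposing, each gene's values are contiguous. The function names and the tiny matrix are illustrative only; the Hub itself is written in Clojure.

```python
def csc_column(data, indices, indptr, col):
    """Return {row_index: value} for one column of a CSC matrix."""
    start, end = indptr[col], indptr[col + 1]
    return {indices[k]: data[k] for k in range(start, end)}


def csc_transpose(data, indices, indptr, n_rows):
    """Transpose a CSC matrix; the result is again in CSC form."""
    n_cols = len(indptr) - 1
    # Count nonzeros per original row; rows become columns after transposing.
    counts = [0] * n_rows
    for r in indices:
        counts[r] += 1
    t_indptr = [0] * (n_rows + 1)
    for r in range(n_rows):
        t_indptr[r + 1] = t_indptr[r] + counts[r]
    t_data = [0] * len(data)
    t_indices = [0] * len(data)
    next_slot = t_indptr[:-1].copy()  # next free position for each new column
    for c in range(n_cols):
        for k in range(indptr[c], indptr[c + 1]):
            r = indices[k]
            slot = next_slot[r]
            t_data[slot] = data[k]
            t_indices[slot] = c
            next_slot[r] += 1
    return t_data, t_indices, t_indptr


# Toy 3-gene x 4-cell matrix (rows = genes, columns = cells):
#   gene0: 1 0 0 2
#   gene1: 0 0 3 0
#   gene2: 0 4 0 5
data = [1, 4, 3, 2, 5]
indices = [0, 2, 1, 0, 2]
indptr = [0, 1, 2, 3, 5]

cell3 = csc_column(data, indices, indptr, 3)        # one cell, all genes
t_data, t_indices, t_indptr = csc_transpose(data, indices, indptr, n_rows=3)
gene2 = csc_column(t_data, t_indices, t_indptr, 2)  # one gene, all cells
```

Only nonzero entries are stored in either orientation, which is why the sparse file stays much smaller than the dense text matrix.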

Difficulty: Difficult

Required Skills: Clojure, Java, Javascript

Mentors: Brian Craft, Jing Zhu


Refactor the way Xena Hub stores metadata for phenotypic dataset

Goal

Change the way the Xena Hub stores metadata for phenotypic data, from a strict database schema to a blob of JSON text. We also anticipate that changes will be needed on the Xena Browser side to query the metadata.
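The general idea of a JSON blob can be sketched as follows, using Python's built-in `sqlite3` and `json` modules as a stand-in for the Hub's actual database and Clojure code. The table name, column names, dataset name, field name, and metadata keys are all hypothetical.

```python
import json
import sqlite3

# In-memory database standing in for the Hub's store (an assumption).
conn = sqlite3.connect(":memory:")
# One free-form text column holds all metadata, instead of one column per attribute.
conn.execute("CREATE TABLE field_meta (dataset TEXT, field TEXT, meta TEXT)")

meta = {"valueType": "category", "codes": ["Negative", "Positive"]}
conn.execute(
    "INSERT INTO field_meta VALUES (?, ?, ?)",
    ("example_phenotype", "ER_status", json.dumps(meta)),
)


def load_meta(conn, dataset, field):
    """Fetch and decode the JSON metadata for one field, or None if absent."""
    row = conn.execute(
        "SELECT meta FROM field_meta WHERE dataset = ? AND field = ?",
        (dataset, field),
    ).fetchone()
    return json.loads(row[0]) if row else None


loaded = load_meta(conn, "example_phenotype", "ER_status")
```

With this layout, adding a new metadata attribute does not require a schema migration. The trade-off is that the database no longer validates or indexes individual attributes.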

Difficulty: Difficult

Required Skills: Clojure, Java, Javascript

Mentors: Brian Craft, Jing Zhu


Perform genome-wide analysis on Xena

Background

Currently, Xena's KM (Kaplan-Meier) and chart analyses are popular, but users can only run these analyses one column at a time. Users have asked for the ability to run the same analysis on a genome-wide scale, or for all lncRNA genes. The project is to develop genome-wide analysis functionality for Xena. Possible approaches to explore are running the analysis in the Xena Browser, or running it on a local Xena Hub or a cloud resource and communicating the results back to users through the Xena Browser. Genomics data is stored in public Xena Hubs; the user interface is in the Xena Browser.

Goal

Run genome-wide analysis on Xena
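The shape of such an analysis can be sketched as "apply one per-column statistic to every gene, then rank". The example below uses Pearson correlation against a numeric phenotype purely as an illustration. The gene names, values, and function names are made up. A real implementation would also have to address where the computation runs, how data is streamed from the Hub, and multiple-testing correction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def genome_wide(expression, phenotype):
    """Correlate every gene with the phenotype and rank by absolute correlation."""
    results = []
    for gene, values in expression.items():
        r = pearson(values, phenotype)
        if not math.isnan(r):  # skip genes with no variation
            results.append((gene, r))
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results


# Hypothetical data: three genes measured in four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.5],
    "geneC": [2.0, 2.0, 2.0, 2.0],
}
phenotype = [10.0, 20.0, 30.0, 40.0]
ranked = genome_wide(expression, phenotype)
```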

Difficulty: Difficult, Risky

Required Skills: Clojure, Java, Javascript, very strong independent research skill, highly resourceful

Mentors: Jing Zhu


BRCAness View

Background

We have been developing a new visualization to help clinicians determine what is high BRCAness. Here is some education about BRCAness.

Goal

Implement this as a new visualization, following the mock-up. The double density or histogram calculation is similar to the one in the Transcript View.
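One plausible reading of the "double" histogram is two distributions binned on a shared x-axis so that they can be drawn against each other. The sketch below shows that calculation under this assumption. The function name and data are hypothetical, and it is not the Transcript View code, which lives in the JavaScript client.

```python
def shared_bin_histograms(group_a, group_b, n_bins):
    """Bin two groups of values using the same equal-width bins."""
    lo = min(min(group_a), min(group_b))
    hi = max(max(group_a), max(group_b))
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / n_bins or 1.0

    def counts(values):
        out = [0] * n_bins
        for v in values:
            # The maximum value falls on the right edge; clamp it into the last bin.
            i = min(int((v - lo) / width), n_bins - 1)
            out[i] += 1
        return out

    return lo, width, counts(group_a), counts(group_b)


# Hypothetical scores for two groups of samples.
group_a = [0.0, 1.0, 1.5, 2.0]
group_b = [2.5, 3.0, 4.0]
lo, width, hist_a, hist_b = shared_bin_histograms(group_a, group_b, n_bins=4)
```

Sharing the bin edges between the two groups is what makes the two histograms directly comparable when drawn on one axis.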

Difficulty: Medium

Required Skills: Javascript, React.js, Rx.js

Mentors: Hannah Allegakoen, Brian Craft, Mary Goldman