Google Summer of Code Ideas 2019
Google Summer of Code is a summer program that offers students stipends to develop software for open source projects.
Get familiar with UCSC Xena and its codebase. Next, take on some help wanted issues and submit a pull request. This will help us see how we might work with you during the summer.
Finally, either develop your project proposal based on one of the ideas or come up with your own. If you are a prospective student interested in doing your Google Summer of Code (GSoC) project with us, please contact us as soon as possible. We will do our best to assist and guide you in the formulation of your GSoC project proposal.
If you have any questions about any of the ideas, please join our Google Group or send us a private email.
Refactor chart view (Refactor/Large amount of functionality)
Implement more extensive Google Analytics coverage (Low-hanging fruit)
Update GDC data ingestion pipeline and run (Infrastructure/Automation)
Modify Xena Hub to be able to accept Microsoft Excel files (Core development)
Enhance Xena Hub to directly use .h5 single-cell RNA-seq data files (Core development)
Change how Xena Hub stores metadata for phenotypic datasets (Core development)
We have two main visualizations: our primary Visual Spreadsheet and the charts view which draws bar charts, box plots and scatter plots. Users select columns of data in the Visual Spreadsheet, which become options for the x- and y-axis in the chart view.
Currently we use highcharts.js for our chart view, which does not follow the architecture of the rest of our site. This is especially frustrating since it does not use our current state model. Maintaining these differences takes time and energy from our team.
Refactor this code to use a React-based library instead. The functionality will remain exactly the same, including all statistical tests we currently perform. Students will need to evaluate possible libraries to determine which we should use. This will include determining what the requirements are for a library to work in our codebase.
An additional goal, if there is time, would be to add a violin plot to our current chart. This would likely require incorporating another library, which would need to be vetted. This GitHub issue has some leads on libraries.
Ideally, the project would also add this feature: https://github.com/ucscXena/ucsc-xena-client/issues/163
Required Skills: JavaScript, React.js, Rx.js, and some knowledge of basic statistics, including the t-test, ANOVA, the Wilcoxon test, and Spearman and Pearson correlation
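Because the refactor must preserve the results of the existing statistical tests, it may help to keep small reference implementations around to compare outputs before and after the change. The following is a minimal sketch in Python (chosen only for illustration; the Xena client itself is JavaScript) of Pearson and Spearman correlation. The function names are hypothetical and this is not the code Xena uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A candidate chart library's output can then be checked against these values on a few hand-built columns, including columns with ties.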
While we have some general knowledge of how users move through our application, we need more specifics. In particular, we need to determine which features and datasets users typically use to know where to focus our future development efforts and to give more detailed reports to grant agencies.
Implement more extensive Google Analytics coverage that tracks more of the application's functionality and records which datasets are being viewed.
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). The GDC obtains validated datasets from NCI programs in which the strategies for tissue collection couple quantity with high quality. The GDC encourages data sharing in support of precision medicine. Tools are provided to guide data submissions by researchers and institutions.
While we already visualize all the public data in the GDC, our copy is approximately a year old, and the GDC APIs have changed since then. We need to update our pipeline to accommodate these changes.
Update our pipeline to download the data from the GDC and transform it so that it can be loaded into Xena.
GDC data: https://gdc-portal.nci.nih.gov/search/s?facetTab=cases
GDC API: https://gdc.cancer.gov/developers/gdc-application-programming-interface-api
Our current GDC pipeline: https://github.com/yunhailuo/xena-GDC-ETL
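As a rough illustration of the download step, the sketch below builds a query for the GDC `files` endpoint without sending it. The endpoint URL, the nested `filters` syntax, and the `fields`, `format`, and `size` parameters follow the GDC API as we understand it, but since the API has changed over time they should be checked against the current GDC documentation. The function names and the chosen fields are hypothetical, and this is not how xena-GDC-ETL is necessarily written.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint for file metadata queries; verify against GDC docs.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, size=100):
    """Build query parameters selecting files of one data type in one project."""
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "data_type", "value": [data_type]},
            },
        ],
    }
    return {
        # The filter tree is sent as a JSON-encoded string.
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(size),
    }


def build_files_url(project_id, data_type, size=100):
    """Return a GET URL for the query; no network request is made here."""
    return GDC_FILES_ENDPOINT + "?" + urlencode(
        build_files_query(project_id, data_type, size)
    )
```

For example, `build_files_url("TCGA-BRCA", "Gene Expression Quantification")` yields a URL that could be fetched with any HTTP client; the returned file IDs would then drive the download and transformation steps.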
To visualize their own data in Xena, users download a Hub onto their laptop, load their data into the Hub, and then point a web browser at the local Hub. Currently users must upload a .tsv (tab-delimited) file, but they typically have a Microsoft Excel file (.xls or .xlsx). Many users do not know how to convert their Excel file to tab-delimited format.
Support users loading their Microsoft Excel file.
An additional goal, if there is time, would be to support the loading of files from GEO, a popular database for microarray data.
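One possible shape for the Excel support is to read the first worksheet and emit the tab-delimited text the Hub already accepts. The sketch below is in Python for illustration only (the Hub is written in Clojure). It assumes the third-party openpyxl package for reading, which handles .xlsx but not the older .xls format, and the function names are hypothetical.

```python
import csv
import io


def read_xlsx_rows(path):
    """Yield rows of cell values from the first worksheet of an .xlsx file.

    Assumes openpyxl is installed; it is imported lazily so the TSV
    writer below can be used without it.
    """
    import openpyxl  # third-party dependency, an assumption of this sketch

    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    worksheet = workbook.worksheets[0]
    for row in worksheet.iter_rows(values_only=True):
        yield row


def rows_to_tsv(rows):
    """Serialize rows as tab-delimited text, writing empty cells as ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

Calling `rows_to_tsv(read_xlsx_rows("phenotypes.xlsx"))` (a made-up file name) would produce text that could be handed to the existing .tsv loading path. Multi-sheet workbooks, merged cells, and date formatting would each need an explicit decision.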
Very large single-cell RNA-seq data files generated using the 10x Genomics platform (such as the 1.3 million cell dataset) are in .h5 format. Currently, we convert .h5 files to text format offline, which essentially expands the sparse data into a much larger dense matrix, and then load the dense matrix into Xena Hub. The .h5 files are already indexed and much smaller, whereas the converted dense matrix is much bigger and the wrangling process takes time. Xena Hub is implemented in Clojure, and Xena Browser in JavaScript.
Enhance Xena Hub to directly use .h5 single-cell RNA-seq data files, so that Xena hubs can answer queries from the .h5 files without conversion. We also anticipate some small changes on the Xena Browser side.
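To make the sparse-versus-dense point concrete, the sketch below reads single columns and rows out of a compressed sparse column (CSC) matrix without building the full dense matrix. It assumes the 10x .h5 layout stores the matrix as `data`, `indices`, and `indptr` arrays in CSC order with genes as rows and cells as columns; that layout should be verified against the 10x documentation. The code is Python for illustration only (the Hub is Clojure), uses plain lists in place of arrays read from the file, and the function names are hypothetical.

```python
def csc_column(data, indices, indptr, n_rows, col):
    """Return one column (one cell, under the assumed layout) as a dense list."""
    start, end = indptr[col], indptr[col + 1]
    dense = [0] * n_rows
    # Only the stored non-zero entries of this column are touched.
    for value, row in zip(data[start:end], indices[start:end]):
        dense[row] = value
    return dense


def csc_row(data, indices, indptr, n_cols, row):
    """Return one row (one gene, under the assumed layout) as a dense list.

    CSC is indexed by column, so a row query has to scan every column's
    stored entries.
    """
    dense = [0] * n_cols
    for col in range(n_cols):
        for k in range(indptr[col], indptr[col + 1]):
            if indices[k] == row:
                dense[col] = data[k]
                break
    return dense


# A 3-gene by 4-cell example:
# [[0, 2, 0, 0],
#  [5, 0, 0, 1],
#  [0, 3, 0, 0]]
EXAMPLE_DATA = [5, 2, 3, 1]
EXAMPLE_INDICES = [1, 0, 2, 1]
EXAMPLE_INDPTR = [0, 1, 3, 3, 4]
```

Under this layout, a per-cell query is cheap while a per-gene query scans all stored entries, so a Hub that mostly serves gene-wise queries might want a row-oriented index alongside the file. That trade-off is worth examining in a proposal.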
Change the way Xena Hub stores metadata for phenotypic data, from a strict database schema to a blob of JSON-format text. We also anticipate that changes will be needed on the Xena Browser side to query the metadata.
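The idea of replacing a fixed schema with a JSON blob can be sketched as a single text column keyed by dataset. The example below uses Python's built-in sqlite3 module purely as a stand-in for the Hub's actual database, and the table, column, and function names are hypothetical.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory store with one JSON text column per dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset_metadata ("
        "dataset TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
    )
    return conn


def save_metadata(conn, dataset, metadata):
    """Serialize the metadata dict to JSON and upsert it for the dataset."""
    conn.execute(
        "INSERT OR REPLACE INTO dataset_metadata (dataset, metadata) "
        "VALUES (?, ?)",
        (dataset, json.dumps(metadata, sort_keys=True)),
    )
    conn.commit()


def load_metadata(conn, dataset):
    """Return the stored metadata as a dict, or None if the dataset is unknown."""
    row = conn.execute(
        "SELECT metadata FROM dataset_metadata WHERE dataset = ?", (dataset,)
    ).fetchone()
    return None if row is None else json.loads(row[0])
```

With this shape, new metadata fields need no schema migration, but the database can no longer filter on individual fields, which is one reason the Browser-side queries would need to change.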