Google Summer of Code Ideas 2019
Google Summer of Code is a summer program that offers students stipends to develop software for open source projects.
Get familiar with UCSC Xena and its codebase. The browser client JavaScript code base is at https://github.com/ucscXena/ucsc-xena-client. The data hub Clojure code base is at https://github.com/ucscXena/ucsc-xena-server. Next, take on some client help wanted issues and hub help wanted issues and submit a pull request. This will help us see how we might work with you during the summer.
Finally, either develop your project proposal based on one of the ideas or come up with your own. If you are a prospective student interested in doing your Google Summer of Code (GSoC) project with us, please contact us as soon as possible. We will do our best to assist and guide you in the formulation of your GSoC project proposal.
If you have any questions about any of the ideas, please join our Google Group or send us a private email.
Refactor chart view (Refactor large number of functionalities)
Implement more extensive Google Analytics coverage (Low-hanging fruit)
Update GDC data ingestion pipeline and run (Infrastructure/Automation)
Modify Xena Hub to be able to accept Microsoft Excel Files (Core development)
Enhance Xena Hub to directly use h5 single-cell RNAseq datafile (Core development)
Perform genome-wide analysis on Xena (Core development)
Develop BRCAness View (New functionality)
We have two main visualizations: our primary Visual Spreadsheet and the chart view, which draws bar charts, box plots, and scatter plots. Users select columns of data in the Visual Spreadsheet, which become options for the x- and y-axis in the chart view.
Currently we use highcharts.js for our chart view, which does not follow the architecture of the rest of our site. This is especially frustrating since it does not use our current state model. Maintaining these differences takes time and energy from our team.
Refactor this code to use a React-based library instead. The functionality will remain exactly the same, including all statistical tests we currently perform. Students will need to evaluate possible libraries to determine which one we should use. This includes determining the requirements a library must meet to work in our codebase.
An additional goal, if there is time, would be to add a violin plot to our current chart view. This would likely require incorporating another library, which would need to be vetted. This github issue has some leads on libraries.
Ideally, the project would also add this feature: https://github.com/ucscXena/ucsc-xena-client/issues/163
Required Skills: JavaScript, React.js, Rx.js, some knowledge of basic statistics including t-test, ANOVA, Wilcoxon, Spearman, and Pearson
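As a reference point for the statistics listed above, here is a minimal sketch of a two-sample t-test statistic. It is written in Python for brevity, not in the client's JavaScript, and it uses Welch's unequal-variance form, which is an assumption: the chart view may use a different variant. The function name `welch_t` is hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Each group needs at least two values and a non-zero combined variance,
    otherwise the divisions below fail.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Sample variances (n - 1 denominator), each scaled by its sample size.
    var_a = statistics.variance(group_a) / n_a
    var_b = statistics.variance(group_b) / n_b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df
```

A refactor that keeps "exactly the same" functionality could check the new chart code against small reference calculations like this one.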
While we have some general knowledge of how users move through our application, we need more specifics. In particular, we need to determine which features and datasets users typically use to know where to focus our future development efforts and to give more detailed reports to grant agencies.
Implement more extensive Google Analytics coverage, including more functionality and determining which datasets are being viewed.
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET).
While we already visualize all the public data from TCGA and TARGET in the GDC, our copy is approximately a year old. New data is available on the GDC, and the GDC APIs are likely to have changed. We need to update our pipeline to accommodate both of these changes.
Update our pipeline to download the public tier data from the GDC and transform it so that it can be loaded into Xena.
GDC data: https://portal.gdc.cancer.gov/repository
GDC API: https://gdc.cancer.gov/developers/gdc-application-programming-interface-api
Our current GDC pipeline: https://github.com/yunhailuo/xena-GDC-ETL
More detail on the project: https://github.com/ucscXena/ucsc-xena-client/wiki/update-GDC-data-on-Xena
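To give a feel for the download step, the following sketch builds a query URL for open-access files in one project. It is an illustration under assumptions, not the existing pipeline's code: the endpoint, filter syntax, and field names follow the GDC API documentation as we understand it and may have changed, and the function name `build_open_files_query` is hypothetical. The sketch only constructs the URL and does not send a request.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint for file metadata queries.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_open_files_query(project_id, data_type, size=100):
    """Build a GDC files query URL for open-access files of one data type."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}},
            # "open" restricts the results to the public tier.
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
            {"op": "in", "content": {"field": "data_type", "value": [data_type]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)
```

For example, `build_open_files_query("TCGA-BRCA", "Gene Expression Quantification")` yields a URL whose response would list candidate files to download and transform.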
To visualize their own data in Xena, users download a Hub onto their laptop, load their data into the Hub, and then point a web browser at their laptop Hub. Currently users must upload a .tsv (tab-delimited) file, but they typically have a Microsoft Excel file (.xls or .xlsx). Many users do not know how to convert their Excel file to tab-delimited.
Support users loading their Microsoft Excel file.
An additional goal, if there is time, would be to support the loading of files from GEO, a popular database for microarray data.
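The conversion users currently struggle with can be sketched as follows. This is an illustration in Python, not Hub code (the Hub is written in Clojure). It assumes the third-party `openpyxl` package, which reads .xlsx but not legacy .xls files, and it reads only the active sheet. The function names `rows_to_tsv` and `excel_to_tsv` are hypothetical.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize an iterable of row tuples as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Empty spreadsheet cells arrive as None; write them as empty fields.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def excel_to_tsv(path):
    """Read the active sheet of an .xlsx workbook and return it as TSV text."""
    # Imported here so rows_to_tsv works without the third-party dependency.
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True, data_only=True)
    sheet = workbook.active
    return rows_to_tsv(sheet.iter_rows(values_only=True))
```

Passing `data_only=True` asks for cached cell values instead of formulas, which is what a data matrix needs. A real implementation would also have to decide how to handle multiple sheets, merged cells, and date cells.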
Very large single-cell RNA-seq data files generated on the 10x Genomics platform (such as the 1.3 million cell dataset) are in h5 (HDF5) format. Currently, we convert .h5 files to text format offline, which expands the sparse data into a much larger dense matrix, and then load the dense matrix into the Xena Hub. The .h5 files are already indexed and much smaller; the converted dense matrix is much bigger, and the wrangling process takes time. The Xena Hub is implemented in Clojure, and the Xena Browser in JavaScript.
HDF5 Feature Barcode Matrix Format https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices
Enhance Xena Hub to directly use .h5 single-cell RNA-seq data files: Xena Hubs should be able to answer queries from h5 files directly. We also anticipate some small changes on the Xena Browser side.
The .h5 files produced by 10x Genomics need to be transposed before they can be used for our purpose. We have already made some of the transposed files.
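The reason for the transpose can be sketched with plain lists. The sketch assumes the compressed sparse column layout (`data`, `indices`, `indptr`) described on the format page linked above, with one column per cell barcode, so reading one cell is cheap but reading one gene across all cells is not. Transposing makes each gene a column. This is pure Python for illustration; the function names are hypothetical and a real implementation would read the arrays from the HDF5 file.

```python
def csc_column(data, indices, indptr, n_rows, col):
    """Expand one column of a CSC matrix into a dense list of length n_rows."""
    dense = [0] * n_rows
    for k in range(indptr[col], indptr[col + 1]):
        dense[indices[k]] = data[k]
    return dense


def transpose_csc(data, indices, indptr, n_rows, n_cols):
    """Return the CSC arrays (data, indices, indptr) of the transposed matrix."""
    # Count the stored values in each original row; each row becomes a column.
    counts = [0] * n_rows
    for row in indices:
        counts[row] += 1
    t_indptr = [0] * (n_rows + 1)
    for row in range(n_rows):
        t_indptr[row + 1] = t_indptr[row] + counts[row]
    next_slot = t_indptr[:-1]  # slicing copies, so t_indptr is not modified
    t_data = [0] * len(data)
    t_indices = [0] * len(data)
    for col in range(n_cols):
        for k in range(indptr[col], indptr[col + 1]):
            slot = next_slot[indices[k]]
            t_data[slot] = data[k]
            t_indices[slot] = col
            next_slot[indices[k]] += 1
    return t_data, t_indices, t_indptr
```

After the transpose, `csc_column` on the new arrays returns one gene's values across all cells by touching only that gene's stored entries, without building the full dense matrix.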
Currently the Xena KM and Chart analyses are popular, but users can only run them one column at a time. Users have asked for the ability to run the same analysis on a genome-wide scale, or for all the lncRNA genes. Develop genome-wide analysis functionality for Xena. Possible approaches to explore are running the analysis in the Xena Browser, or running it on a local Xena Hub or a cloud resource and communicating the results back to users through the Xena Browser. Genomics data is stored in public Xena Hubs. The user interface is the Xena Browser.
Run genome-wide analysis on Xena
Required Skills: Clojure, Java, JavaScript, very strong independent research skills, highly resourceful
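The shape of a genome-wide scan can be sketched as a loop that repeats a one-column analysis for every gene and ranks the results. The example uses Pearson correlation against a single phenotype column as a stand-in; the KM analysis would need a different statistic. It is written in Python for illustration, the in-memory dictionary stands in for data that would come from a Xena Hub, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def genome_wide_scan(expression, phenotype):
    """Correlate every gene with the phenotype and rank by absolute correlation.

    `expression` maps a gene name to its values across samples, in the same
    sample order as `phenotype`.
    """
    results = []
    for gene, values in expression.items():
        r = pearson(values, phenotype)
        if not math.isnan(r):  # skip genes with no variation
            results.append((gene, r))
    results.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return results
```

The per-gene loop is independent for each gene, which is what makes it a candidate for running on a Hub or cloud resource and returning only the ranked list to the Browser.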
We have been developing a new visualization to help clinicians determine what is high BRCAness. Here is some education about BRCAness.
Implement this as a new visualization (see the mock-up). The double density or histogram calculation is similar to the one in the Transcript View.
This would be a new visualization but still within the larger structure of our website, similar to our Transcript View (https://xenabrowser.net/transcripts/).
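One possible reading of the "double density" calculation is two density curves drawn over a shared axis, one per group. The sketch below computes such curves with a Gaussian kernel density estimate. This is an assumption for illustration only: we have not confirmed how the Transcript View computes its densities, the bandwidth must be chosen by the caller, and the function names are hypothetical.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def double_density(group_a, group_b, grid, bandwidth):
    """Return the two density curves on a shared grid, ready for plotting."""
    return gaussian_kde(group_a, grid, bandwidth), gaussian_kde(group_b, grid, bandwidth)
```

Evaluating both groups on the same grid keeps the curves directly comparable when they are overlaid.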