update GDC data on Xena
- update GDC data on Xena
- update the GDC data wrangling pipeline
This summer project will bring the GDC data on Xena up to date with the latest GDC release (release 15 as of March 11, 2019; see https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/). The current GDC data on Xena was last updated about 1.5 years ago, at release 10.
We only need to import the open access data.
screenshot from https://portal.gdc.cancer.gov/repository
Looking at https://portal.gdc.cancer.gov/repository, it seems that Xena is missing the public data from the FM program, all of which belongs to the phenotypic and clinical data category. There is also significantly more new data from the TARGET program.
GDC now has new data types that are missing from Xena. New code is needed for these data types and should follow the pattern of the existing wrangling code:
- gene expression data STAR - Counts
- GISTIC - Copy Number Score
- DNAcopy
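As a starting point for the new data types, here is a minimal sketch of how an open-access file query could be built for each of them against the GDC `files` endpoint. It is not taken from the existing wrangling code. The function names are hypothetical, and the filter field `analysis.workflow_type` and the requested `fields` are assumptions that should be checked against the current GDC API documentation.

```python
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# The three new data types listed above, used here as workflow type values
# (assumption: GDC exposes them under the "analysis.workflow_type" field).
NEW_WORKFLOW_TYPES = ["STAR - Counts", "GISTIC - Copy Number Score", "DNAcopy"]


def build_open_access_filter(workflow_type):
    """Build a GDC filter: open-access files of a single workflow type."""
    return {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
            {
                "op": "in",
                "content": {
                    "field": "analysis.workflow_type",
                    "value": [workflow_type],
                },
            },
        ],
    }


def build_files_query_url(workflow_type, size=10):
    """Return a GET URL for the files endpoint (hypothetical helper)."""
    params = {
        "filters": json.dumps(build_open_access_filter(workflow_type)),
        "fields": "file_id,file_name,cases.project.project_id",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Print one query URL per new data type; nothing is downloaded here.
    for workflow in NEW_WORKFLOW_TYPES:
        print(build_files_query_url(workflow))
```

Restricting `access` to `open` in the same filter keeps the query in line with the requirement to import only open access data.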
You can see the current Xena GDC data at https://gdc.xenahubs.net. Some of the GDC API calls used by the existing code might be out of date; we have not investigated this. We suggest that the student run the existing code to find out which API calls are broken and need updating.
The current TCGA phenotype data on Xena was wrangled from two sources: the GDC API and XML files. We would like to include only data from the API path, which will simplify the existing code.
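Before running the full pipeline, a quick reachability check of the main GDC endpoints can narrow down where problems are. The sketch below is only a rough first pass under stated assumptions: the endpoint list and helper names are hypothetical, and a 200 response does not prove that the fields or filters used by the existing code are still valid.

```python
import urllib.error
import urllib.request

GDC_API_BASE = "https://api.gdc.cancer.gov"

# Assumed set of endpoints worth probing; adjust to what the pipeline calls.
ENDPOINTS = ["status", "projects", "cases", "files"]


def http_status(url, timeout=30):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code (e.g. 404).
        return err.code
    except OSError:
        # Covers URLError and timeouts: no usable response at all.
        return None


def check_endpoints(endpoints, fetch=http_status):
    """Map each endpoint name to the status code returned by fetch."""
    return {name: fetch(f"{GDC_API_BASE}/{name}") for name in endpoints}


def broken(results):
    """List endpoint names whose status is anything other than 200."""
    return sorted(name for name, code in results.items() if code != 200)
```

Calling `broken(check_endpoints(ENDPOINTS))` performs the live requests and returns the names that did not answer with 200. The `fetch` parameter exists so the logic can be exercised without network access.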
The current wrangling code is at https://github.com/yunhailuo/xena-GDC-ETL.
Python