
update GDC data on Xena

jingchunzhu edited this page Mar 11, 2019 · 25 revisions

Goals and Milestones of the project

  1. update GDC data on Xena
  2. update the GDC data wrangling pipeline

About the GDC data

This summer project will bring the GDC data on Xena up to date with the latest GDC release (release 15 as of March 11, 2019; see https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/). The GDC data currently on Xena was last updated about 1.5 years ago, at release 10.
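As a rough sketch of how a script could check which release the GDC API is currently serving, the snippet below reads the public `status` endpoint and pulls the release number out of the response. The helper names (`fetch_status`, `parse_release_number`) are hypothetical, and the exact wording of the `data_release` string is an assumption that should be checked against a live response.

```python
import json
import re
import urllib.request

GDC_STATUS_URL = "https://api.gdc.cancer.gov/status"


def fetch_status(url=GDC_STATUS_URL, timeout=30):
    """Return the decoded JSON body of the GDC status endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def parse_release_number(status):
    """Extract the release number from a status payload.

    Assumes a 'data_release' string that starts like 'Data Release 15.0 ...';
    returns None when the field is missing or has another shape.
    """
    text = status.get("data_release", "")
    match = re.search(r"Data Release\s+(\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None
```

Calling `parse_release_number(fetch_status())` and comparing the result with the release recorded for the Xena hub would show how far behind the hub is.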

We only need to import the open access data.
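One way to restrict a file listing to open access data is to pass an `access` filter to the GDC `files` endpoint. The sketch below only builds the query URL; the helper names and the chosen `fields` are illustrative assumptions, not what the existing wrangling code does.

```python
import json
import urllib.parse

GDC_FILES_URL = "https://api.gdc.cancer.gov/files"


def open_access_filter(project_id):
    """Build a GDC filter: open access files belonging to one project."""
    return {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
        ],
    }


def files_query_url(project_id, size=100):
    """Return a GET URL for the files endpoint with the filter applied."""
    params = {
        "filters": json.dumps(open_access_filter(project_id)),
        "fields": "file_id,file_name,data_type,access",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_URL + "?" + urllib.parse.urlencode(params)
```

For example, `files_query_url("TCGA-BRCA", size=5)` gives a URL that can be opened in a browser to inspect the first five open access files for that project.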

Figure: GDC data access (screenshot from https://portal.gdc.cancer.gov/repository)

Looking at https://portal.gdc.cancer.gov/repository, it seems that Xena is missing the public data from the FM program, all of which belongs to the phenotypic and clinical data category. There is also significantly more new data from the TARGET program.

There are new types of data on GDC that are missing on Xena. New code is needed for these data types and should follow the pattern of the existing wrangling code.

  • gene expression data STAR - Counts
  • GISTIC - Copy Number Score
  • DNAcopy
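To locate files for the three data types listed above, one option is to filter the `files` endpoint on `analysis.workflow_type`. The sketch below assumes the list entries match GDC workflow type values exactly, which should be verified on the portal; the function names are hypothetical.

```python
# Assumed to match GDC 'analysis.workflow_type' values; verify on the portal.
NEW_WORKFLOW_TYPES = ("STAR - Counts", "GISTIC - Copy Number Score", "DNAcopy")


def new_data_type_filter(workflow_types=NEW_WORKFLOW_TYPES):
    """Build a GDC filter for open access files of the given workflow types."""
    return {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
            {
                "op": "in",
                "content": {
                    "field": "analysis.workflow_type",
                    "value": list(workflow_types),
                },
            },
        ],
    }


def group_by_workflow(hits):
    """Group file IDs from a files-endpoint 'hits' list by workflow type."""
    grouped = {}
    for hit in hits:
        workflow = (hit.get("analysis") or {}).get("workflow_type", "unknown")
        grouped.setdefault(workflow, []).append(hit["file_id"])
    return grouped
```

Grouping the hits this way makes it easy to see which of the new data types actually has open access files in a given project before writing a wrangler for it.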

Current GDC data on Xena

You can see the current Xena GDC data at https://gdc.xenahubs.net. Some of the GDC API calls used by the existing code might be out of date; we have not investigated this. We suggest that the student run the existing code to find out which API calls are broken and need updating.
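Before running the full pipeline, a quick probe of the main GDC API endpoints can rule out the simplest failures. This is a minimal sketch with hypothetical helper names; it only checks that each endpoint answers with HTTP 200, not that the parameters used by the existing code are still valid.

```python
import urllib.error
import urllib.request

GDC_API_BASE = "https://api.gdc.cancer.gov"
ENDPOINTS = ("status", "projects", "cases", "files")


def http_status(url, timeout=30):
    """Return the HTTP status code for url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def find_broken_endpoints(endpoints=ENDPOINTS, get_status=http_status):
    """Map each endpoint that does not answer with 200 to the code it returned."""
    broken = {}
    for name in endpoints:
        code = get_status(f"{GDC_API_BASE}/{name}")
        if code != 200:
            broken[name] = code
    return broken
```

The `get_status` parameter exists so the check can be exercised with a stub instead of real network calls. An empty result from `find_broken_endpoints()` means the endpoints respond, and any remaining failures are more likely in request parameters or response parsing.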

The current TCGA phenotype data on Xena was wrangled from two sources: the GDC API and XML files. We would like to include only data obtained through the API, which will simplify the existing code.
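If phenotype data comes only from the API, each case record arrives as nested JSON that has to be flattened into one row per case. The following sketch assumes a record shaped like a `cases` endpoint hit with `demographic` and `diagnoses` expanded; the function name and the choice to keep only the first diagnosis are illustrative assumptions.

```python
def flatten_case(case):
    """Flatten one nested case record into a single-level dict.

    Keeps 'submitter_id', every 'demographic' field, and the fields of the
    first entry in 'diagnoses' (an illustrative simplification).
    """
    row = {"submitter_id": case.get("submitter_id")}
    for key, value in (case.get("demographic") or {}).items():
        row[f"demographic.{key}"] = value
    diagnoses = case.get("diagnoses") or [{}]
    for key, value in diagnoses[0].items():
        row[f"diagnoses.{key}"] = value
    return row
```

Rows produced this way share dotted column names, so they can be written directly to a tab-separated phenotype matrix.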

About the code

The current wrangling code is at https://github.com/yunhailuo/xena-GDC-ETL.

Programming language

Python