This document summarizes how to set up Terra and get access to services that are required to run the pipeline.
- You will need to request access to the following Terra workspaces:
Members of the DepMap Omics team should give you access once you request it on Terra.
- For the mutation pipeline you will also need to request dbGaP access (required for TCGA workflows). See CCLE/new hiree section on Asana for details.
- Acquire access to required billing projects (e.g. broad-firecloud-ccle). See CCLE/new hiree section on Asana for details.
- Get access to the following Terra groups:
- depmap_ccle_data
- depmap-pipelines
- ccle-pipeline
- If you need to get access to the data delivered by Genomics Platform (GP) at the Broad, use the following links:
- RNA Broad
- WGS Broad
- WGS Broad hg38 cram
- Request access to the data bucket `gs://cclebams/`
- Taiga is a platform that allows the Cancer Data Science team to store and share data. In order to access and upload data, you will need to log in to Taiga with your Broad Google account and set up your token for the Python client.
- We are currently using a relational database, Gumbo, to track our cell lines' metadata and release status. In order to interact with Gumbo through Python, follow the instructions and install the Gumbo client here.
- In order to use the internal-only functions for loading, post-processing, and uploading DepMap data, you need to install the depmap_omics_upload repo.
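Before moving on, it can help to confirm that the access steps above actually took effect. The following is a minimal sketch, not part of the repo: it assumes the Taiga Python client reads its token from `~/.taiga/token` (adjust the path if your setup differs) and that `gsutil` is installed and authenticated with your Broad Google account. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

DATA_BUCKET = "gs://cclebams/"  # data bucket named in the list above
# Assumption: default token location used by the Taiga Python client.
TAIGA_TOKEN_PATH = Path.home() / ".taiga" / "token"


def has_taiga_token(token_path=TAIGA_TOKEN_PATH):
    """Return True if the token file exists and is not empty."""
    path = Path(token_path)
    return path.is_file() and path.read_text().strip() != ""


def can_list_bucket(bucket=DATA_BUCKET, run=subprocess.run):
    """Return True if `gsutil ls` succeeds on the bucket.

    `run` is injectable so the check can be exercised without gsutil.
    """
    if run is subprocess.run and shutil.which("gsutil") is None:
        return False  # gsutil is not installed or not on PATH
    result = run(["gsutil", "ls", bucket], capture_output=True, text=True)
    return result.returncode == 0
```

Calling `has_taiga_token()` and `can_list_bucket()` with no arguments checks the default token path and the bucket above. A `False` result only tells you that something is missing, not why, so follow up with the relevant access request.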
In order to run the processing pipelines written in WDL, you will need to set up workspaces on Terra:
- You first need a Terra account with correct billing setup. See here for a tutorial on getting started.
- If you haven't already, create a workspace under the billing project of your choice. If you need to process both RNA and WGS data, we recommend creating one workspace for each.
- Import the WDL scripts by following the links to Dockstore and clicking on "Launch with Terra" (note: you'll need both *_pipeline and *_aggregate for each data type):
- DepMap's workspace configurations are saved after each data release under `data/`. We recommend using configurations from the latest quarter. For example, if the latest release is `21Q4`, you should be able to find the configurations in https://github.com/broadinstitute/depmap_omics/blob/master/data/21Q4/RNAconfig/all_configs.json and https://github.com/broadinstitute/depmap_omics/blob/master/data/21Q4/WGSconfig/all_configs.json for RNA and WGS, respectively.
- Set up the right inputs and outputs for your workflows according to `inputs_[WORKFLOW_NAME].json` and `outputs_[WORKFLOW_NAME].json` files, which are under the same directory as `all_configs.json`.
- Load your samples to the sample table so that their bam and bam index google storage filepaths get listed in the right data column to WGS_pipeline and RNA pipeline (e.g. internal_bam_filepath contains hg38 aligned bam files whereas hg19_bam_filepath contains hg19 aligned bam files).
- Create a sample set with the set of samples you want to analyse. Make sure the name of this sample set on Terra is the same as `SAMPLESETNAME` in `config_global.py`.
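The workspace steps above can be sketched with a few standard-library helpers. This is an illustration under stated assumptions rather than the repo's own tooling: the helper names are hypothetical, the paths assume a local clone of depmap_omics, and the load-file headers (`entity:sample_id`, and `membership:sample_set_id` followed by `sample`) follow Terra's TSV upload convention as we understand it, so check them against the Terra documentation before uploading.

```python
import csv
from pathlib import Path


def release_config_dir(repo_root, release, data_type):
    """Return the config directory for a release, e.g. data/21Q4/RNAconfig."""
    if data_type not in ("RNA", "WGS"):
        raise ValueError(f"unknown data type: {data_type!r}")
    return Path(repo_root) / "data" / release / f"{data_type}config"


def workflow_config_files(config_dir, workflow_name):
    """Return the inputs and outputs JSON paths for one workflow."""
    config_dir = Path(config_dir)
    return (
        config_dir / f"inputs_{workflow_name}.json",
        config_dir / f"outputs_{workflow_name}.json",
    )


def write_sample_tsv(path, samples, columns):
    """Write a sample load file; samples maps sample id -> {column: value}."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["entity:sample_id", *columns])
        for sample_id, values in samples.items():
            # Missing columns are written as empty cells.
            writer.writerow([sample_id, *(values.get(c, "") for c in columns)])


def write_sample_set_tsv(path, sample_set_name, sample_ids):
    """Write a membership load file linking each sample to one sample set."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumption: header layout expected by Terra for set membership.
        writer.writerow(["membership:sample_set_id", "sample"])
        for sample_id in sample_ids:
            writer.writerow([sample_set_name, sample_id])
```

For example, `release_config_dir(".", "21Q4", "RNA")` points at `data/21Q4/RNAconfig`, and passing the value of `SAMPLESETNAME` as `sample_set_name` keeps the Terra sample set name and the config in agreement.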
Once this is done, you can launch your Jupyter notebook server and run the `*_CCLE` Jupyter notebooks corresponding to our RNA pipeline and WGS pipeline (older versions for WES (CN and mutations) are available in a previous commit labelled 20Q4).
Remark:
- you will need to use the `postProcesssing()` functions for post processing instead of the CCLE ones in the `dm_omics.py` module.
- you will need to change some of the variables in `config_global.py` and `config_prod.py`.
- you won't be able to run the functions conditional on the `isCCLE` boolean. You can however reimplement them to create your own pipeline.
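One way to picture the last remark is a pipeline whose internal-only steps sit behind a boolean flag. The sketch below is purely illustrative: none of these names come from depmap_omics, and `PipelineConfig` only stands in for the role that `SAMPLESETNAME` and `isCCLE` play in the config files.

```python
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    # Hypothetical stand-ins for the config variables mentioned above.
    sample_set_name: str    # plays the role of SAMPLESETNAME
    is_ccle: bool = False   # plays the role of isCCLE


def run_post_processing(table, config, shared_steps, internal_steps=()):
    """Apply shared steps always; apply internal steps only if is_ccle is set."""
    steps = list(shared_steps)
    if config.is_ccle:
        steps.extend(internal_steps)
    for step in steps:
        table = step(table)
    return table


def drop_missing(table):
    """Example shared step: remove entries whose value is None."""
    return {key: value for key, value in table.items() if value is not None}


def internal_upload(table):
    """Placeholder for an internal-only step that external users cannot run."""
    raise RuntimeError("internal-only step")
```

With `is_ccle=False`, `run_post_processing` never reaches `internal_upload`, so an external user can supply their own replacement steps in `shared_steps` without touching the gated ones.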