This document summarizes how to set up Terra and get access to services that are required to run the pipeline.

For Internal Users

Getting Terra Access

  1. You will need to request access to the following Terra workspaces:

     Members of the DepMap Omics team should give you access once you request it on Terra.

  2. For the mutation pipeline you will also need to request dbGaP access (required for TCGA workflows). See the CCLE/new hire section on Asana for details.
  3. Acquire access to the required billing projects (e.g. broad-firecloud-ccle). See the CCLE/new hire section on Asana for details.
  4. Get access to the following Terra groups:
  • depmap_ccle_data
  • depmap-pipelines
  • ccle-pipeline
  5. If you need access to the data delivered by the Genomics Platform (GP) at the Broad, use the following links:
  6. Request access to the data bucket gs://cclebams/ (a quick access check is sketched below).
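Once your bucket access request has been approved, the following is a minimal sketch (not part of the pipeline itself) for confirming that you can read gs://cclebams/, assuming the google-cloud-storage Python client and that you have already authenticated with application-default credentials:

```python
from google.cloud import storage

# Assumes application-default credentials are available, e.g. after running
# `gcloud auth application-default login` with your Broad account.
client = storage.Client()

# List a few objects to confirm read access to the delivery bucket.
for blob in client.list_blobs("cclebams", max_results=5):
    print(blob.name)
```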

Additional Python dependencies:

  • Taiga is a platform that allows the Cancer Data Science team to store and share data. In order to access and upload data, you will need to log in to Taiga with your Broad Google account and set up your token for the Python client (a quick check of your setup is sketched after this list).
  • We are currently using a relational database, Gumbo, to track our cell lines' metadata and release status. In order to interact with Gumbo through Python, follow the instructions to install the Gumbo client here.
  • In order to use the internal-only functions involved in loading, post-processing, and uploading DepMap data, you need to install the depmap_omics_upload repo.
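As a quick way to verify that your Taiga token is configured, here is a minimal sketch assuming the taigapy client's TaigaClient.get interface; the dataset name and file below are placeholders, so substitute a dataset you actually have access to.

```python
from taigapy import TaigaClient

tc = TaigaClient()  # picks up the token you configured for the Python client

# Placeholder dataset -- replace with one you have access to. This call fails
# fast if the token is missing or invalid.
df = tc.get(name="example-dataset-id", version=1, file="example_file")
print(df.shape)
```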

For External Users

In order to run the processing pipelines written in WDL, you will need to set up workspaces on Terra:

Creating your Terra Workspaces:

  1. You first need a Terra account with correct billing setup. See here for a tutorial on getting started.
  2. If you haven't already, create a workspace under the billing project of your choice. If you need to process both RNA and WGS data, we recommend creating one workspace for each.
  3. Import the WDL scripts by following the links to Dockstore and clicking "Launch with Terra" (note: you'll need both *_pipeline and *_aggregate for each data type):
  4. DepMap's workspace configurations are saved after each data release under data/. We recommend using configurations from the latest quarter. For example, if the latest release is 21Q4, you should be able to find the configurations in https://github.com/broadinstitute/depmap_omics/blob/master/data/21Q4/RNAconfig/all_configs.json and https://github.com/broadinstitute/depmap_omics/blob/master/data/21Q4/WGSconfig/all_configs.json for RNA and WGS, respectively.
  5. Set up the right inputs and outputs for your workflows according to the inputs_[WORKFLOW_NAME].json and outputs_[WORKFLOW_NAME].json files, which are in the same directory as all_configs.json.
  6. Load your samples into the sample table so that their bam and bam index Google Storage filepaths are listed in the right data columns for WGS_pipeline and RNA pipeline (e.g. internal_bam_filepath contains hg38-aligned bam files, whereas hg19_bam_filepath contains hg19-aligned bam files). A sketch of building such a sample table is shown after this list.
  7. Create a sample set containing the samples you want to analyze. Make sure the name of this sample set on Terra is the same as SAMPLESETNAME in config_global.py.
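As referenced in step 6, below is a minimal sketch of building a Terra-style sample table as a TSV with pandas. Terra expects the first column header to be entity:sample_id; the sample IDs, bucket paths, and the bam index column name (internal_bai_filepath) are placeholders rather than names taken from the DepMap workspaces.

```python
import pandas as pd

# Build a Terra sample table; the first column header must be "entity:sample_id".
samples = pd.DataFrame(
    {
        "entity:sample_id": ["SAMPLE-0001", "SAMPLE-0002"],
        # hg38-aligned bams, as described in step 6
        "internal_bam_filepath": [
            "gs://my-bucket/SAMPLE-0001.hg38.bam",
            "gs://my-bucket/SAMPLE-0002.hg38.bam",
        ],
        # the bam index column name is a placeholder -- match your workspace's schema
        "internal_bai_filepath": [
            "gs://my-bucket/SAMPLE-0001.hg38.bam.bai",
            "gs://my-bucket/SAMPLE-0002.hg38.bam.bai",
        ],
    }
)
samples.to_csv("sample.tsv", sep="\t", index=False)

# Upload sample.tsv through the Terra UI (Data tab -> Import Data), then create
# a sample set whose name matches SAMPLESETNAME in config_global.py.
```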

Once this is done, you can launch your Jupyter notebook server and run the *_CCLE Jupyter notebooks corresponding to our RNA and WGS pipelines (older versions for WES CN and mutations are available in a previous commit, labeled 20Q4).

Remark:

  1. You will need to use the postProcessing() functions for post-processing instead of the CCLE ones in the dm_omics.py module.
  2. You will need to change some of the variables in config_global.py and config_prod.py (see the illustrative sketch below).
  3. You won't be able to run the functions conditional on the isCCLE boolean. You can, however, reimplement them to create your own pipeline.
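For remark 2, the snippet below is purely illustrative: SAMPLESETNAME is mentioned above, but the other variable name is a hypothetical placeholder, so check config_global.py and config_prod.py for the actual variables you need to override.

```python
# In your copy of config_global.py (placeholder values -- adapt to your setup):
SAMPLESETNAME = "my_sample_set"  # must match the name of your sample set on Terra

# Hypothetical example of a workspace-specific variable you may need to change;
# the real names live in config_global.py / config_prod.py.
WORKSPACE_NAME = "my-billing-project/my-rna-workspace"
```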