| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| optimus_v1.3.6 | October 10, 2019 | Elizabeth Kiernan | Please file GitHub issues in skylab or contact Kylee Degatano |

Optimus Pipeline Overview

Diagram

Introduction to the Optimus Workflow

Optimus is a pipeline developed by the Data Coordination Platform (DCP) of the Human Cell Atlas (HCA) Project that supports processing of any 3' single-cell expression data generated with the 10x Genomics v2 and v3 assays. It is an alignment and transcriptome quantification pipeline that corrects cell barcodes, aligns reads to the genome, corrects Unique Molecular Identifiers (UMIs), generates an expression matrix in a UMI-aware manner, detects empty droplets, calculates summary metrics for genes and cells, returns read outputs in BAM format, and returns cell gene expression in NumPy matrix, Zarr, and Loom file formats. Special care is taken to keep all reads that may be useful to the downstream user, such as unaligned reads or reads with uncorrectable barcodes. This design gives the downstream user the flexibility to apply alternative filtering or to use the data for novel methodological development.

Optimus has been validated for analyzing both human and mouse data sets. More details about the human validation can be found in the original file.

Quick Start Table

| Pipeline Features | Description | Source |
| --- | --- | --- |
| Assay Type | 10x Single Cell Expression (v2 and v3) | 10x Genomics |
| Overall Workflow | Quality control module and transcriptome quantification module | Code available from GitHub |
| Workflow Language | WDL | openWDL |
| Genomic Reference Sequence | GRCh38 human genome primary sequence and M21 (GRCm38.p6) mouse genome primary sequence | GENCODE Human and Mouse |
| Transcriptomic Reference Annotation | V27 GENCODE human transcriptome and M21 mouse transcriptome | GENCODE Human and Mouse |
| Aligner | STAR (v.2.5.3) | Dobin, et al., 2013 |
| Transcript Quantification | Utilities for processing large-scale single cell datasets | Sctools |
| Data Input File Format | File format in which sequencing data is provided | FASTQ |
| Data Output File Format | File formats in which Optimus output is provided | BAM, Zarr version 2, Python numpy arrays (internal), Loom (generated with Loompy v.2.0.17) |

Set-up

Optimus Installation and Requirements

The Optimus pipeline code can be downloaded by cloning the GitHub repository Skylab. For the latest release of Optimus, please see the release tags prefixed with "optimus" here.

Optimus can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. Optimus can also be run in Terra, a cloud-based analysis platform. In this featured workspace the user will find the Optimus pipeline, configurations, required reference data and other inputs, and example testing data.

Inputs

Optimus pipeline inputs are detailed in a JSON file, such as in this example.

Sample Data Input

Each 10X v2 and v3 3’ sequencing experiment generates triplets of fastq files for any given sample:

  1. A forward read file (r1_fastq), which contains the unique molecular identifier (UMI) and cell barcode sequences
  2. A reverse read file (r2_fastq), which contains the alignable genomic information from the mRNA transcript
  3. An index fastq (i1_fastq) that contains the sample barcodes, when provided by the sequencing facility

Note: Optimus is currently a single sample pipeline, but can take in multiple sets of fastqs for a sample that has been split over lanes of sequencing.

Additional Reference Inputs

The json file also contains metadata for the following reference information:

  • Whitelist: a list of known cell barcodes from 10X genomics
  • Tar_star_reference: TAR file containing a species-specific reference genome and GTF; it is generated using the StarMkRef.wdl
  • Sample_id: a unique name describing the biological sample or replicate that corresponds with the original fastq files
  • Annotations_gtf: a GTF containing gene annotations used for gene tagging (must match the GTF in the STAR reference)
  • Chemistry: an optional description of whether data was generated with V2 or V3 chemistry
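
As a rough illustration of how the sample and reference inputs above fit together, the following Python sketch assembles an inputs dictionary and serializes it to JSON. The `Optimus.<name>` key layout follows the usual Cromwell convention of prefixing inputs with the workflow name, but the exact key names, the `gs://example-bucket/...` paths, and the helper `build_optimus_inputs` are assumptions made for this example; consult the example JSON in the repository for the authoritative names.

```python
import json
from typing import Dict, List, Optional


def build_optimus_inputs(
    sample_id: str,
    r1_fastq: List[str],
    r2_fastq: List[str],
    whitelist: str,
    tar_star_reference: str,
    annotations_gtf: str,
    i1_fastq: Optional[List[str]] = None,
    chemistry: Optional[str] = None,
) -> Dict[str, object]:
    """Assemble inputs for a single sample (key names are assumed, not confirmed)."""
    # A sample split over several lanes contributes one R1/R2 pair per lane,
    # so the two lists must line up element by element.
    if len(r1_fastq) != len(r2_fastq):
        raise ValueError("r1_fastq and r2_fastq must have the same length")
    if i1_fastq is not None and len(i1_fastq) != len(r1_fastq):
        raise ValueError("i1_fastq must match r1_fastq in length when provided")

    inputs: Dict[str, object] = {
        "Optimus.sample_id": sample_id,
        "Optimus.r1_fastq": r1_fastq,
        "Optimus.r2_fastq": r2_fastq,
        "Optimus.whitelist": whitelist,
        "Optimus.tar_star_reference": tar_star_reference,
        "Optimus.annotations_gtf": annotations_gtf,
    }
    # The index fastq and the chemistry description are optional.
    if i1_fastq is not None:
        inputs["Optimus.i1_fastq"] = i1_fastq
    if chemistry is not None:
        inputs["Optimus.chemistry"] = chemistry
    return inputs


# Hypothetical paths for one sample sequenced over two lanes.
example = build_optimus_inputs(
    sample_id="sample_1",
    r1_fastq=["gs://example-bucket/L001_R1.fastq.gz", "gs://example-bucket/L002_R1.fastq.gz"],
    r2_fastq=["gs://example-bucket/L001_R2.fastq.gz", "gs://example-bucket/L002_R2.fastq.gz"],
    whitelist="gs://example-bucket/whitelist.txt",
    tar_star_reference="gs://example-bucket/star_reference.tar",
    annotations_gtf="gs://example-bucket/annotations.gtf",
)
inputs_json = json.dumps(example, indent=2)
```

The length checks reflect the note above that Optimus processes a single sample whose fastq sets may be split over lanes.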

Running Optimus

  • The Optimus.wdl in the pipelines/optimus folder of the HCA Skylab repository implements the workflow by importing individual modules ("tasks" written in WDL) from the Skylab Library folder.

Optimus Modules Summary

Here we describe the modules ("tasks") of the Optimus Pipeline; the code and library of tasks are available through GitHub.

Overall, the workflow:

  1. Converts R2 fastq file (containing alignable genomic information) to an unaligned BAM (UBAM)
  2. Corrects and attaches 10X Barcodes using the R1 Fastq file
  3. Aligns reads to the genome with STAR v.2.5.3a
  4. Annotates aligned reads with gene information
  5. Corrects UMIs
  6. Calculates summary metrics
  7. Produces a UMI-aware expression matrix
  8. Detects empty droplets
  9. Returns a GA4GH-compliant BAM and metric matrix in Zarr or Loom formats

Special care is taken to flag but avoid the removal of reads that are not aligned or that do not contain recognizable barcodes. This design (which differs from many pipelines currently available) allows use of the entire dataset by those who may want to use alternative filtering or leverage the data for methodological development associated with the data processing.

1. Converting R2 Fastq File to UBAM

Unlike fastq files, BAM files enable researchers to keep track of important metadata throughout all data processing steps. The first step of Optimus is to convert the R2 fastq file, containing the alignable genomic information, to an unaligned BAM (UBAM) file.

2. Correcting and Attaching Cell Barcodes

Although the function of the cell barcodes is to identify unique cells, barcode errors can arise during sequencing (such as incorporation of the barcode into contaminating DNA or sequencing and PCR errors), making it difficult to distinguish unique cells from artifactual appearances of the barcode. The Attach10xBarcodes task uses sc-tools v.0.3.4 to evaluate barcode errors by comparing the R1 fastq sequences against a whitelist of known barcode sequences. The task then appends the UMI and Cell Barcode sequences from the R1 fastq to the UBAM sequence as tags, properly labeling the genomic information for alignment.

The output is a UBAM file containing the reads with correct barcodes, including barcodes that came within an edit distance (Levenshtein distance) of one of a whitelisted barcode sequence and were corrected by this tool. Correct barcodes are assigned a “CB” tag. Uncorrectable barcodes (with more than one error) are preserved and given a “CR” (Cell barcode Raw) tag. Cell barcode quality scores are also preserved in the file under the “CY” tag.
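
The whitelist correction described above can be sketched in a few lines of Python. This is not the sctools implementation, only an illustration under stated assumptions: barcodes have a fixed length, so a Levenshtein distance of one between a raw barcode and a whitelist entry of the same length can only be a single substitution; a raw barcode that is one substitution away from more than one whitelist entry is treated as uncorrectable; and the raw sequence is written to CR on every read. The function names `correct_barcode` and `barcode_tags` are hypothetical.

```python
from typing import Dict, Optional, Set


def correct_barcode(raw: str, whitelist: Set[str]) -> Optional[str]:
    """Return the whitelisted barcode for `raw`, or None if it is uncorrectable."""
    if raw in whitelist:
        return raw
    # Try every single-base substitution and collect the whitelist hits.
    candidates = set()
    for i, base in enumerate(raw):
        for alt in "ACGT":
            if alt != base:
                candidate = raw[:i] + alt + raw[i + 1:]
                if candidate in whitelist:
                    candidates.add(candidate)
    # Assumption of this sketch: an ambiguous match is left uncorrected.
    if len(candidates) == 1:
        return candidates.pop()
    return None


def barcode_tags(raw: str, quality: str, whitelist: Set[str]) -> Dict[str, str]:
    """Build the cell barcode tags for one read."""
    tags = {"CR": raw, "CY": quality}  # raw sequence and its quality string
    corrected = correct_barcode(raw, whitelist)
    if corrected is not None:
        tags["CB"] = corrected  # only reads with a valid barcode get CB
    return tags
```

A read whose barcode cannot be corrected therefore keeps its CR and CY tags but has no CB tag, which is what lets downstream users apply their own filtering.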

To facilitate subsequent processing steps, the pipeline then scatters and splits the corrected UBAM files into groups according to cell barcode.

3. Alignment

Optimus uses the STAR alignment task to map barcoded reads in the UBAM file to the genome primary assembly reference (see table above for version information). This task uses STAR (Spliced Transcripts Alignment to a Reference; Dobin, et al., 2013), a standard, splice-aware RNA-seq alignment tool.

4. Gene Annotation

The TagGeneExon task then uses Drop-seq tools v.1.13 to annotate each read with the type of sequence to which it aligns. These annotations include INTERGENIC, INTRONIC, and EXONIC, and are stored using the XF BAM tag. In cases where the alignment corresponds to an intron or exon, the name of the gene that overlaps the alignment is associated with the read and stored using the GE BAM tag.
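
The tagging logic can be illustrated with a heavily simplified Python sketch. It is not how Drop-seq tools is implemented: it ignores strand, spliced alignments, and reads that overlap several genes, and the `Gene` structure and `annotate_read` function are hypothetical names introduced only for this example. Coordinates are assumed to be half-open intervals on a single chromosome.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    exons: List[Tuple[int, int]]  # half-open (start, end) intervals


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def annotate_read(read_start: int, read_end: int, genes: List[Gene]) -> Dict[str, str]:
    """Return the XF (and, where applicable, GE) tags for one aligned read."""
    intronic_gene: Optional[str] = None
    for gene in genes:
        if not overlaps(read_start, read_end, gene.start, gene.end):
            continue
        # Any overlap with an exon takes priority over an intronic hit.
        if any(overlaps(read_start, read_end, s, e) for s, e in gene.exons):
            return {"XF": "EXONIC", "GE": gene.name}
        if intronic_gene is None:
            intronic_gene = gene.name
    if intronic_gene is not None:
        return {"XF": "INTRONIC", "GE": intronic_gene}
    # No gene overlaps the read, so no GE tag is written.
    return {"XF": "INTERGENIC"}
```

Giving exonic overlaps priority over intronic ones is a design choice of this sketch, made so that a read touching an exon of any gene is never reported as intronic.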

5. UMI Correction

UMIs are designed to distinguish unique transcripts present in the cell at lysis from those arising from PCR amplification of these same transcripts. But, like cell barcodes, UMIs can also be incorrectly sequenced or amplified. Optimus uses the UmiCorrection task to apply a network-based, "directional" method (Smith, et al., 2017) to account for such errors using Umi-tools v.0.0.1.
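
The idea behind the directional method can be sketched as follows. This is an illustrative reimplementation based on the description in Smith, et al. (2017), not the code that the UmiCorrection task runs. It assumes the input is the read count per UMI within one group of reads that should share molecules (for example, one cell barcode and gene), that UMIs differing by one base are linked when the more abundant one has at least `2 * n - 1` reads for a neighbor with `n` reads, and that every UMI in a network is corrected to the most abundant UMI that seeds it. The function names are hypothetical.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts: Dict[str, int]) -> List[List[str]]:
    """Group UMIs into networks; the first UMI of each group is its seed."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    # Directed edge a -> b: b is plausibly an error copy of a.
    children: Dict[str, List[str]] = {u: [] for u in umis}
    for a in umis:
        for b in umis:
            if a != b and hamming(a, b) == 1 and umi_counts[a] >= 2 * umi_counts[b] - 1:
                children[a].append(b)

    assigned = set()
    groups = []
    for seed in umis:
        if seed in assigned:
            continue
        # Collect everything reachable from the seed along directed edges.
        group = [seed]
        assigned.add(seed)
        stack = [seed]
        while stack:
            node = stack.pop()
            for child in children[node]:
                if child not in assigned:
                    assigned.add(child)
                    group.append(child)
                    stack.append(child)
        groups.append(group)
    return groups


def correction_map(umi_counts: Dict[str, int]) -> Dict[str, str]:
    """Map every observed UMI to the seed UMI of its network."""
    return {umi: group[0] for group in directional_groups(umi_counts) for umi in group}
```

The count threshold is what makes the method directional: two equally abundant UMIs that differ by one base are kept apart, because neither is likely to be a sequencing error of the other.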

6. Summary Metric Calculation

The Metrics task calls the SequenceDataWithMoleculeTagMetrics.wdl to calculate summary metrics, which are used to assess the quality of the data output each time this pipeline is run. This task uses [sctools v.0.3.3](https://github.com/HumanCellAtlas/sctools). These metrics are included in the Zarr and [Loom](link to Loom schema) output files.

7. Expression Matrix Construction

The Optimus Count task evaluates every read in the BAM file and creates a UMI-aware expression matrix using Drop-seq tools. This matrix contains the number of molecules that were observed for each cell barcode and for each gene. The task discards any read that maps to more than one gene, and counts any remaining reads provided the triplet of cell barcode, molecule barcode, and gene name is unique, indicating the read originates from a single transcript present at the time of cell lysis.
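
The counting rule above can be illustrated with a short Python sketch. It is a simplified stand-in for the Count task, not its actual code: each read is assumed to be represented as a hypothetical `(cell_barcode, umi, genes)` tuple, where `genes` lists the gene names the read was tagged with, and the result is a dictionary of counts rather than a sparse matrix.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, List[str]]  # (cell barcode, UMI, gene names)


def count_molecules(reads: Iterable[Read]) -> Dict[Tuple[str, str], int]:
    """Count unique molecules per (cell barcode, gene) pair."""
    seen = set()
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for cell_barcode, umi, genes in reads:
        # Reads tagged with more than one gene are discarded; reads with no
        # gene cannot contribute to any entry of the matrix.
        if len(genes) != 1:
            continue
        triplet = (cell_barcode, umi, genes[0])
        # A repeated triplet is a PCR duplicate of a molecule already counted.
        if triplet in seen:
            continue
        seen.add(triplet)
        counts[(cell_barcode, genes[0])] += 1
    return dict(counts)
```

Because each triplet is counted once, the value for a cell and gene is the number of distinct molecules observed, independent of how many reads each molecule produced.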

8. Identification of Empty Droplets

Empty droplets are lipid droplets that did not encapsulate a cell during 10X sequencing, but instead acquired cell-free RNA (secreted RNA or RNA released during cell lysis; Lun, et al., 2018) from the solution in which the cells resided. This ambient RNA can serve as a substrate for reverse transcription, leading to a small number of background reads. The Optimus pipeline calls the RunEmptyDrops task, which uses the dropletUtils v.0.1.1 R package to flag cell barcodes that represent empty droplets rather than cells. These metrics are stored in the output Zarr and [Loom](link to Loom schema) files.

9. Outputs

Output files of the pipeline include:

  1. Cell x Gene unnormalized expression matrix
  2. Unfiltered, sorted BAM file with [barcode and downstream analysis Tags](link to UBAM Tag Description)
  3. Cell metadata, including cell metrics
  4. Gene metadata, including gene metrics

The following table lists the types of files produced by the pipeline.

| Output Name | Filename, if applicable | Output Type | Output Format | Notes/Description | Store in Data Store? | Tool |
| --- | --- | --- | --- | --- | --- | --- |
| pipeline_version | | Version of the processing pipeline run on this data | String | This is passed from the processing WDL to the adapter pipelines to be put into the metadata in the HCA | Yes, in metadata | Lira |
| bam | merged.bam | aligned bam | bam | coordinate sorted | Yes | A few tools; need to address this provenance |
| matrix | sparse_counts.npz | GenexCell expression matrix | Numpy array | | Yes | sctools |
| matrix_row_index | sparse_counts_row_index.npy | Index of cells in expression matrix | Numpy array index | | Yes | sctools |
| matrix_col_index | sparse_counts_col_index.npy | Index of genes in expression matrix | Numpy array index | | Yes | sctools |
| cell_metrics | merged-cell-metrics.csv.gz | cell metrics | compressed csv | Matrix of metrics by cells | Yes | sctools |
| gene_metrics | merged-gene-metrics.csv.gz | gene metrics | compressed csv | Matrix of metrics by genes | Yes | sctools |
| cell_calls | empty_drops_result.csv | cell calls | csv | | Yes | emptyDrops |
| zarr_output_files | {unique_id}.zarr!.zattrs | | zarr store? sparse matrix? | | Yes | |
| loom_output_file | output.loom | Loom | Loom | Loom file with expression data and metadata | N/A | N/A |

Versioning

| Optimus Release Version | Date | Release Note |
| --- | --- | --- |
| v1.3.6 (current) | 09/23/2019 | Optimus now optionally outputs a Loom formatted count matrix, with the default being true. |
| v1.3.3 | 08/29/2019 | This version and newer have been validated to additionally support Mouse data. The gene expression per cell is now counted by gencode geneID instead of gene name. There is an additional output mapping geneID to gene name provided. This is a breaking change. |
| v1.0.0 | 03/30/2019 | Initial pipeline release. Validated on hg38 gencodev27. |

Have Suggestions?

We have a document dedicated to open issues! Please help us make our tools better.