Pipeline Version	Date Updated	Documentation Author	Questions or Feedback
optimus_v1.3.6	October 10, 2019	Elizabeth Kiernan	Please file GitHub issues in skylab or contact Kylee Degatano

Optimus Pipeline Overview

Introduction to the Optimus Workflow

Optimus is a pipeline developed by the Data Coordination Platform (DCP) of the Human Cell Atlas (HCA) Project that supports processing of any 3' single-cell expression data generated with the 10X Genomics V2 and V3 assays. It is an alignment and transcriptome quantification pipeline that corrects Cell Barcodes, aligns reads to the genome, corrects Unique Molecular Identifiers (UMIs), generates an expression matrix in a UMI-aware manner, detects empty droplets, calculates summary metrics for genes and cells, returns read outputs in BAM format, and returns cell gene expression in numpy matrix, Zarr, and Loom file formats. Special care is taken to keep all reads that may be useful to the downstream user, such as unaligned reads or reads with uncorrectable barcodes. This design gives the downstream user the flexibility to apply alternative filtering or to use the data for novel methodological development.

Optimus has been validated for analyzing both human and mouse data sets. More details about the human validation can be found in the original file.

Quick Start Table

Pipeline Features	Description	Source
Assay Type	10x Single Cell Expression (v2 and v3)	10x Genomics
Overall Workflow	Quality control module and transcriptome quantification module	Code available from GitHub
Workflow Language	WDL	openWDL
Genomic Reference Sequence	GRCh38 human genome primary sequence and M21 (GRCm38.p6) mouse genome primary sequence	GENCODE Human and Mouse
Transcriptomic Reference Annotation	V27 GenCode human transcriptome and M21 mouse transcriptome	GENCODE Human and Mouse
Aligner	STAR (v.2.5.3a)	Dobin, et al., 2013
Transcript Quantification	Utilities for processing large-scale single cell datasets	Sctools
Data Input File Format	File format in which sequencing data is provided	FASTQ
Data Output File Format	File formats in which Optimus output is provided	BAM, Zarr version 2, Python numpy arrays (internal), Loom (generated with Loompy v.2.0.17)

Set-up

Optimus Installation and Requirements

The Optimus pipeline code can be downloaded by cloning the GitHub repository Skylab. For the latest release of Optimus, please see the release tags prefixed with "optimus" here.

Optimus can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. Optimus can also be run in Terra, a cloud-based analysis platform. In this featured workspace the user will find the Optimus pipeline, configurations, required reference data and other inputs, and example testing data.

Inputs

Optimus pipeline inputs are detailed in a json file, such as in this example.

Sample Data Input

Each 10X v2 and v3 3’ sequencing experiment generates triplets of fastq files for any given sample:

A forward read fastq (r1_fastq) that contains the unique molecular identifier (UMI) and cell barcode sequences
A reverse read fastq (r2_fastq) that contains the alignable genomic information from the mRNA transcript
An index fastq (i1_fastq) that contains the sample barcodes, when provided by the sequencing facility

Note: Optimus is currently a single sample pipeline, but can take in multiple sets of fastqs for a sample that has been split over lanes of sequencing.

Additional Reference Inputs

The json file also contains metadata for the following reference information:

Whitelist: a list of known cell barcodes from 10X Genomics
Tar_star_reference: TAR file containing a species-specific reference genome and GTF; it is generated using the StarMkRef.wdl
Sample_id: a unique name describing the biological sample or replicate that corresponds with the original fastq files
Annotations_gtf: a GTF containing gene annotations used for gene tagging (must match the GTF in the STAR reference)
Chemistry: an optional description of whether data was generated with V2 or V3 chemistry
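As an illustration of the inputs described above, the following minimal sketch assembles an inputs dictionary in Python and prints it as JSON. The `Optimus.` key prefix, the lowercase key spellings, the bucket paths, and the `"V2"` chemistry value are assumptions made for this example; the example JSON in the repository remains the authoritative reference for the key names.

```python
import json


def build_optimus_inputs(sample_id, r1_fastq, r2_fastq, whitelist,
                         tar_star_reference, annotations_gtf,
                         i1_fastq=None, chemistry=None, workflow="Optimus"):
    """Assemble an inputs dictionary keyed as '<workflow>.<input_name>'.

    The key naming scheme is an assumption for illustration only.
    """
    # One R1/R2 pair is expected per sequencing lane of the single sample.
    if not r1_fastq or len(r1_fastq) != len(r2_fastq):
        raise ValueError("r1_fastq and r2_fastq need one entry per lane")
    if i1_fastq is not None and len(i1_fastq) != len(r1_fastq):
        raise ValueError("i1_fastq, when given, needs one entry per lane")

    values = {
        "sample_id": sample_id,
        "r1_fastq": list(r1_fastq),
        "r2_fastq": list(r2_fastq),
        "whitelist": whitelist,
        "tar_star_reference": tar_star_reference,
        "annotations_gtf": annotations_gtf,
    }
    # Optional inputs are only written when supplied.
    if i1_fastq is not None:
        values["i1_fastq"] = list(i1_fastq)
    if chemistry is not None:
        values["chemistry"] = chemistry
    return {f"{workflow}.{name}": value for name, value in values.items()}


# Hypothetical paths for one sample sequenced across two lanes.
example_inputs = build_optimus_inputs(
    sample_id="sample_1",
    r1_fastq=["gs://hypothetical-bucket/sample_1_L001_R1.fastq.gz",
              "gs://hypothetical-bucket/sample_1_L002_R1.fastq.gz"],
    r2_fastq=["gs://hypothetical-bucket/sample_1_L001_R2.fastq.gz",
              "gs://hypothetical-bucket/sample_1_L002_R2.fastq.gz"],
    whitelist="gs://hypothetical-bucket/whitelist.txt",
    tar_star_reference="gs://hypothetical-bucket/star_reference.tar",
    annotations_gtf="gs://hypothetical-bucket/annotations.gtf",
    chemistry="V2",
)
print(json.dumps(example_inputs, indent=2))
```

Passing the fastq files as parallel lists mirrors the note above that one sample may be split over several lanes.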

Running Optimus

The Optimus.wdl in the pipelines/optimus folder of the HCA Skylab repository implements the workflow by importing individual modules ("tasks" written in WDL) from the Skylab Library folder.

Optimus Modules Summary

Here we describe the modules ("tasks") of the Optimus Pipeline; the code and library of tasks are available through GitHub.

Overall, the workflow:

Converts R2 fastq file (containing alignable genomic information) to an unaligned BAM (UBAM)
Corrects and attaches 10X Barcodes using the R1 Fastq file
Aligns reads to the genome with STAR v.2.5.3a
Annotates genes with aligned reads
Corrects UMIs
Calculates summary metrics
Produces a UMI-aware expression matrix
Detects empty droplets
Returns a GA4GH compliant BAM and metric matrix in Zarr or Loom formats

Special care is taken to flag but avoid the removal of reads that are not aligned or that do not contain recognizable barcodes. This design (which differs from many pipelines currently available) allows use of the entire dataset by those who may want to use alternative filtering or leverage the data for methodological development associated with the data processing.

1. Converting R2 Fastq File to UBAM

Unlike fastq files, BAM files enable researchers to keep track of important metadata throughout all data processing steps. The first step of Optimus is to convert the R2 fastq file, containing the alignable genomic information, to an unaligned BAM (UBAM) file.

2. Correcting and Attaching Cell Barcodes

Although the function of the cell barcodes is to identify unique cells, barcode errors can arise during sequencing (such as incorporation of the barcode into contaminating DNA or sequencing and PCR errors), making it difficult to distinguish unique cells from artifactual appearances of the barcode. The Attach10xBarcodes task uses sctools v.0.3.4 to evaluate barcode errors by comparing the R1 fastq sequences against a whitelist of known barcode sequences. The task then appends the UMI and Cell Barcode sequences from the R1 fastq to the UBAM sequence as tags, properly labeling the genomic information for alignment.

The output is a UBAM file containing the reads with correct barcodes, including barcodes that came within one edit distance (Levenshtein distance) of matching the whitelist of barcode sequences and were corrected by this tool. Correct barcodes are assigned a “CB” tag. Uncorrectable barcodes (with more than one error) are preserved and given a “CR” (Cell barcode Raw) tag. Cell barcode quality scores are also preserved in the file under the “CY” tag.
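The whitelist correction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the sctools implementation: it handles single-base substitutions only (so it covers a subset of Levenshtein distance one), it drops variants that are one substitution away from more than one whitelisted barcode, and it records the CR and CY tags on every read. The function names are hypothetical.

```python
def build_correction_map(whitelist, alphabet="ACGTN"):
    """Map each whitelisted barcode and its unambiguous
    single-substitution variants to the whitelisted barcode."""
    correction = {}
    ambiguous = set()
    for barcode in whitelist:
        for i, original in enumerate(barcode):
            for base in alphabet:
                if base == original:
                    continue
                variant = barcode[:i] + base + barcode[i + 1:]
                # A variant reachable from two different whitelisted
                # barcodes cannot be corrected with confidence.
                if variant in correction and correction[variant] != barcode:
                    ambiguous.add(variant)
                correction[variant] = barcode
    for variant in ambiguous:
        del correction[variant]
    # Exact whitelist matches always map to themselves.
    for barcode in whitelist:
        correction[barcode] = barcode
    return correction


def tag_read(raw_barcode, barcode_quality, correction):
    """Return the barcode tags for one read as a dictionary."""
    tags = {"CR": raw_barcode, "CY": barcode_quality}
    corrected = correction.get(raw_barcode)
    if corrected is not None:
        tags["CB"] = corrected  # only correctable barcodes receive CB
    return tags


correction = build_correction_map(["AAAA", "CCCC"])
print(tag_read("AAAT", "FFFF", correction))  # one substitution from AAAA
print(tag_read("AATT", "FFFF", correction))  # two errors, no CB tag
```

Precomputing the variants trades memory for a constant-time lookup per read, which matters when every read in the R1 fastq must be checked.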

To facilitate subsequent processing steps, the pipeline then scatters and splits the corrected UBAM files into groups according to cell barcode.

3. Alignment

Optimus uses the STAR alignment task to map barcoded reads in the UBAM file to the genome primary assembly reference (see table above for version information). This task uses STAR (Spliced Transcripts Alignment to a Reference; Dobin, et al., 2013) a standard, splice-aware, RNA-seq alignment tool.

4. Gene Annotation

The TagGeneExon task then uses Drop-seq tools v.1.13 to annotate each read with the type of sequence to which it aligns. These annotations include INTERGENIC, INTRONIC, and EXONIC, and are stored using the XF BAM tag. In cases where the read aligns to an intron or exon, the name of the gene that overlaps the alignment is associated with the read and stored using the GE BAM tag.
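To make the tagging logic concrete, here is a simplified Python sketch that classifies a single alignment against a list of gene models. It is not how Drop-seq tools is implemented: it ignores strand and contig, uses half-open coordinates, and resolves overlaps with several genes by preferring the first exonic hit. The `annotate_alignment` function and the gene dictionary layout are hypothetical.

```python
def annotate_alignment(start, end, genes):
    """Return XF/GE tags for an alignment spanning [start, end).

    genes: list of dicts with 'name', 'start', 'end', and 'exons'
    (a list of (start, end) tuples), all on the same contig.
    """
    def overlaps(feature_start, feature_end):
        return feature_start < end and start < feature_end

    intronic_hit = None
    for gene in genes:
        if not overlaps(gene["start"], gene["end"]):
            continue
        # Any overlap with an exon makes the read exonic for this gene.
        if any(overlaps(s, e) for s, e in gene["exons"]):
            return {"XF": "EXONIC", "GE": gene["name"]}
        # Inside the gene body but outside every exon: intronic.
        if intronic_hit is None:
            intronic_hit = gene["name"]
    if intronic_hit is not None:
        return {"XF": "INTRONIC", "GE": intronic_hit}
    # No gene overlaps the alignment, so no GE tag is written.
    return {"XF": "INTERGENIC"}


gene_models = [
    {"name": "GENE_A", "start": 100, "end": 500,
     "exons": [(100, 200), (400, 500)]},
]
print(annotate_alignment(150, 180, gene_models))
print(annotate_alignment(250, 300, gene_models))
print(annotate_alignment(600, 650, gene_models))
```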

5. UMI Correction

UMIs are designed to distinguish unique transcripts present in the cell at lysis from those arising from PCR amplification of these same transcripts. But, like cell barcodes, UMIs can also be incorrectly sequenced or amplified. Optimus uses the UmiCorrection task to apply a network-based, "directional" method (Smith, et al., 2017) to account for such errors using UMI-tools v.0.0.1.
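The directional method of Smith, et al. (2017) can be sketched as follows. This is a standalone illustration, not a call into UMI-tools: it takes the observed read counts for the UMIs of one group of reads (how Optimus groups reads before correction is not specified here), links UMI a to UMI b when they differ by one base and count(a) >= 2 * count(b) - 1, and assigns every UMI reachable from a high-count UMI to that UMI. The function names are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts):
    """Map each observed UMI to the representative UMI of its network."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))

    # Directed edge a -> b: one mismatch and a is sufficiently more
    # abundant than b to plausibly be its error-free parent.
    adjacency = {u: [] for u in umis}
    for a in umis:
        for b in umis:
            if (a != b and len(a) == len(b) and hamming(a, b) == 1
                    and umi_counts[a] >= 2 * umi_counts[b] - 1):
                adjacency[a].append(b)

    assigned = {}
    for root in umis:
        if root in assigned:
            continue
        assigned[root] = root
        queue = deque([root])
        # Breadth-first walk along directed edges from the root.
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in assigned:
                    assigned[neighbor] = root
                    queue.append(neighbor)
    return assigned


counts = {"ACGT": 10, "ACGA": 2, "ACCA": 1, "TTTT": 5}
print(directional_groups(counts))
```

In this example ACGA and ACCA collapse onto ACGT through a chain of single-base steps, while TTTT remains its own molecule. Two UMIs with similar counts that differ by one base are kept apart, because neither is abundant enough to be treated as the source of the other.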

6. Summary Metric Calculation

The Metrics task calls the SequenceDataWithMoleculeTagMetrics.wdl to calculate summary metrics, which are used to assess the quality of the data output each time this pipeline is run. This task uses [sctools v.0.3.3](https://github.com/HumanCellAtlas/sctools). These metrics are included in the Zarr and [Loom](link to Loom schema) output files.

7. Expression Matrix Construction

The Optimus Count task evaluates every read in the BAM file and creates a UMI-aware expression matrix using Drop-seq tools. This matrix contains the number of molecules that were observed for each cell barcode and for each gene. The task discards any read that maps to more than one gene, and counts any remaining reads provided the triplet of cell barcode, molecule barcode, and gene name is unique, indicating the read originates from a single transcript present at the time of cell lysis.
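The counting rule above can be expressed as a short Python sketch. It assumes each read has already been reduced to a tuple of corrected cell barcode, corrected UMI, and the list of gene names its alignment overlaps; that input layout and the `count_molecules` name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count unique molecules per (cell barcode, gene).

    reads: iterable of (cell_barcode, umi, genes) tuples, where genes
    is a list of gene names overlapping the alignment.
    """
    seen = set()
    counts = defaultdict(int)
    for cell_barcode, umi, genes in reads:
        # Discard reads with no gene or with more than one gene.
        if len(set(genes)) != 1:
            continue
        triplet = (cell_barcode, umi, genes[0])
        # A repeated triplet is another read of the same molecule.
        if triplet in seen:
            continue
        seen.add(triplet)
        counts[(cell_barcode, genes[0])] += 1
    return dict(counts)


example_reads = [
    ("CELL1", "AAAA", ["GENE_A"]),
    ("CELL1", "AAAA", ["GENE_A"]),            # PCR duplicate
    ("CELL1", "CCCC", ["GENE_A"]),            # second molecule
    ("CELL1", "GGGG", ["GENE_A", "GENE_B"]),  # multi-gene, discarded
    ("CELL2", "AAAA", ["GENE_B"]),
]
print(count_molecules(example_reads))
```

Here CELL1 ends up with two GENE_A molecules and CELL2 with one GENE_B molecule; the duplicate and the multi-gene read do not contribute.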

8. Identification of Empty Droplets

Empty droplets are lipid droplets that did not encapsulate a cell during 10X sequencing, but instead acquired cell-free RNA (secreted RNA or RNA released during cell lysis; Lun, et al., 2018) from the solution in which the cells resided. This ambient RNA can serve as a substrate for reverse transcription, leading to a small number of background reads. The Optimus pipeline calls the RunEmptyDrops task, which uses the DropletUtils v.0.1.1 R package to flag cell barcodes that represent empty droplets rather than cells. These metrics are stored in the output Zarr and [Loom](link to Loom schema) files.

9. Outputs

Output files of the pipeline include:

Cell x Gene unnormalized expression matrix
Unfiltered, sorted BAM file with [barcode and downstream analysis Tags](link to UBAM Tag Description)
Cell metadata, including cell metrics
Gene metadata, including gene metrics

The following table lists the types of files produced by the pipeline.

Output Name	Filename, if applicable	Output Type	Output Format	Notes/Description	Store in Data Store?	Tool
pipeline_version		Version of the processing pipeline run on this data	String	This is passed from the processing WDL to the adapter pipelines to be put into the metadata in the HCA	Yes, in metadata	Lira
bam	merged.bam	aligned bam	bam	coordinate sorted	Yes	A few tools; need to address this provenance
matrix	sparse_counts.npz	Cell x Gene expression matrix	Numpy array		Yes	sctools
matrix_row_index	sparse_counts_row_index.npy	Index of cells in expression matrix	Numpy array index		Yes	sctools
matrix_col_index	sparse_counts_col_index.npy	Index of genes in expression matrix	Numpy array index		Yes	sctools
cell_metrics	merged-cell-metrics.csv.gz	cell metrics	compressed csv	Matrix of metrics by cells	Yes	sctools
gene_metrics	merged-gene-metrics.csv.gz	gene metrics	compressed csv	Matrix of metrics by genes	Yes	sctools
cell_calls	empty_drops_result.csv	cell calls	csv		Yes	emptyDrops
zarr_output_files	{unique_id}.zarr!.zattrs		zarr store? sparse matrix?		Yes
loom_output_file	output.loom	Loom	Loom	Loom file with expression data and metadata	N/A	N/A
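The compressed CSV metric outputs listed in the table can be inspected with the Python standard library alone. The sketch below assumes that the first column of the file holds the cell barcode (or gene identifier) and that the remaining columns are metrics; the column names `n_reads` and `n_molecules` and the demo file contents are invented for the example and are not taken from the pipeline.

```python
import csv
import gzip
import os
import tempfile


def read_metrics(path):
    """Read a gzip-compressed CSV into a dict keyed by the first column."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        return {row[0]: dict(zip(header[1:], row[1:]))
                for row in reader if row}


# Self-contained demo: write a tiny stand-in file, then read it back.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "merged-cell-metrics.csv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["", "n_reads", "n_molecules"])  # hypothetical columns
        writer.writerow(["AAACCTGAGAAACCAT", "120", "95"])
        writer.writerow(["AAACCTGAGAAACCGC", "8", "7"])
    demo_metrics = read_metrics(demo_path)

print(demo_metrics["AAACCTGAGAAACCAT"]["n_molecules"])
```

Values are returned as strings; convert them to numbers as needed once the actual column names are known.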

Versioning

Optimus Release Version	Date	Release Note
v1.3.6 (current)	09/23/2019	Optimus now optionally outputs a Loom formatted count matrix, with the default being true.
v1.3.3	08/29/2019	This version and newer have been validated to additionally support Mouse data. The gene expression per cell is now counted by gencode geneID instead of gene name. There is an additional output mapping geneID to gene name provided. This is a breaking change.
v1.0.0	03/30/2019	Initial pipeline release. Validated on hg38 gencodev27.

Have Suggestions?

We have a document dedicated to open issues! Please help us make our tools better.
