Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
---|---|---|---|
optimus_v1.3.6 | October 10, 2019 | Elizabeth Kiernan | Please file GitHub issues in skylab or contact Kylee Degatano |
Optimus is a pipeline developed by the Data Coordination Platform (DCP) of the Human Cell Atlas (HCA) Project that supports processing of any 3' single-cell expression data generated with the 10x Genomics v2 and v3 assays. It is an alignment and transcriptome quantification pipeline that corrects Cell Barcodes, aligns reads to the genome, corrects Unique Molecular Identifiers (UMIs), generates an expression matrix in a UMI-aware manner, detects empty droplets, calculates summary metrics for genes and cells, returns read outputs in BAM format, and returns cell gene expression in numpy matrix, Zarr, and Loom file formats. Special care is taken to keep all reads that may be useful to the downstream user, such as unaligned reads or reads with uncorrectable barcodes. This design provides flexibility to the downstream user and allows for alternative filtering or leveraging the data for novel methodological development.
Optimus has been validated for analyzing both human and mouse data sets. More details about the human validation can be found in the original file.
Pipeline Features | Description | Source |
---|---|---|
Assay Type | 10x Single Cell Expression (v2 and v3) | 10x Genomics |
Overall Workflow | Quality control module and transcriptome quantification module | Code available from GitHub |
Workflow Language | WDL | openWDL |
Genomic Reference Sequence | GRCh38 human genome primary sequence and M21 (GRCm38.p6) mouse genome primary sequence | GENCODE Human and Mouse |
Transcriptomic Reference Annotation | V27 GenCode human transcriptome and M21 mouse transcriptome | GENCODE Human and Mouse |
Aligner | STAR (v.2.5.3) | Dobin, et al., 2013 |
Transcript Quantification | Utilities for processing large-scale single cell datasets | Sctools |
Data Input File Format | File format in which sequencing data is provided | FASTQ |
Data Output File Format | File formats in which Optimus output is provided | BAM, Zarr version 2, Python numpy arrays (internal), Loom (generated with Loompy v.2.0.17) |
The Optimus pipeline code can be downloaded by cloning the GitHub repository Skylab. For the latest release of Optimus, please see the release tags prefixed with "optimus" here.
Optimus can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. Optimus can also be run in Terra, a cloud-based analysis platform. In this featured workspace the user will find the Optimus pipeline, configurations, required reference data and other inputs, and example testing data.
Optimus pipeline inputs are detailed in a json file, such as in this example.
Each 10X v2 and v3 3’ sequencing experiment generates triplets of fastq files for any given sample:
- A forward read file (r1_fastq), which contains the unique molecular identifier (UMI) and cell barcode sequences
- A reverse read file (r2_fastq), which contains the alignable genomic information from the mRNA transcript
- An index fastq (i1_fastq) that contains the sample barcodes, when provided by the sequencing facility
Note: Optimus is currently a single sample pipeline, but can take in multiple sets of fastqs for a sample that has been split over lanes of sequencing.
The json file also contains metadata for the following reference information:
- Whitelist: a list of known cell barcodes from 10X genomics
- Tar_star_reference: TAR file containing a species-specific reference genome and GTF; it is generated using the StarMkRef.wdl
- Sample_id: a unique name describing the biological sample or replicate that corresponds with the original fastq files
- Annotations_gtf: a GTF containing gene annotations used for gene tagging (must match the GTF in the STAR reference)
- Chemistry: an optional description of whether data was generated with V2 or V3 chemistry
The Optimus.wdl in the pipelines/optimus folder of the HCA Skylab repository implements the workflow by importing individual modules ("tasks" written in WDL) from the Skylab Library folder.
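As a rough illustration of the inputs listed above, the following Python sketch assembles an input JSON for one sample split over two sequencing lanes. The `Optimus.`-prefixed key names, the file paths, and the `tenX_v2` chemistry value are assumptions made for illustration; the example JSON in the repository remains the reference for the exact names the workflow expects.

```python
import json


def build_optimus_inputs(sample_id, r1_fastq, r2_fastq, whitelist,
                         tar_star_reference, annotations_gtf,
                         i1_fastq=None, chemistry=None):
    """Assemble workflow inputs as a dict (key names are assumed, not confirmed)."""
    # Every lane must contribute one R1 and one R2 file.
    if len(r1_fastq) != len(r2_fastq):
        raise ValueError("each r1_fastq needs a matching r2_fastq")
    # The index fastq is optional, but if given it must cover every lane.
    if i1_fastq is not None and len(i1_fastq) != len(r1_fastq):
        raise ValueError("each i1_fastq needs a matching r1_fastq")

    inputs = {
        "Optimus.sample_id": sample_id,
        "Optimus.r1_fastq": list(r1_fastq),
        "Optimus.r2_fastq": list(r2_fastq),
        "Optimus.whitelist": whitelist,
        "Optimus.tar_star_reference": tar_star_reference,
        "Optimus.annotations_gtf": annotations_gtf,
    }
    if i1_fastq is not None:
        inputs["Optimus.i1_fastq"] = list(i1_fastq)
    if chemistry is not None:
        inputs["Optimus.chemistry"] = chemistry
    return inputs


# Hypothetical paths for a single sample sequenced over two lanes.
example_inputs = build_optimus_inputs(
    sample_id="sample_1",
    r1_fastq=["lane1_R1.fastq.gz", "lane2_R1.fastq.gz"],
    r2_fastq=["lane1_R2.fastq.gz", "lane2_R2.fastq.gz"],
    whitelist="whitelist.txt",
    tar_star_reference="star_reference.tar",
    annotations_gtf="annotations.gtf",
    chemistry="tenX_v2",  # assumed label for v2 chemistry
)
example_json = json.dumps(example_inputs, indent=2)
```

Writing `example_json` to a file would give a document in the shape Cromwell accepts as workflow inputs, provided the key names match those declared in the WDL.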
Here we describe the modules ("tasks") of the Optimus pipeline; the code and library of tasks are available through GitHub.
Overall, the workflow:
- Converts the R2 fastq file (containing alignable genomic information) to an unaligned BAM (UBAM)
- Corrects and attaches 10X Barcodes using the R1 Fastq file
- Aligns reads to the genome with STAR v.2.5.3a
- Annotates aligned reads with the genes they overlap
- Corrects UMIs
- Calculates summary metrics
- Produces a UMI-aware expression matrix
- Detects empty droplets
- Returns a GA4GH compliant BAM and metric matrix in Zarr or Loom formats
Special care is taken to flag but avoid the removal of reads that are not aligned or that do not contain recognizable barcodes. This design (which differs from many pipelines currently available) allows use of the entire dataset by those who may want to use alternative filtering or leverage the data for methodological development associated with the data processing.
Unlike fastq files, BAM files enable researchers to keep track of important metadata throughout all data processing steps. The first step of Optimus is to convert the R2 fastq file, containing the alignable genomic information, to an unaligned BAM (UBAM) file.
Although the function of the cell barcodes is to identify unique cells, barcode errors can arise during sequencing (such as incorporation of the barcode into contaminating DNA or sequencing and PCR errors), making it difficult to distinguish unique cells from artifactual appearances of the barcode. The Attach10xBarcodes task uses sc-tools v.0.3.4 to evaluate barcode errors by comparing the R1 fastq sequences against a whitelist of known barcode sequences. The task then appends the UMI and Cell Barcode sequences from the R1 fastq to the UBAM sequence as tags, properly labeling the genomic information for alignment.
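To make the tagging step concrete, here is a minimal Python sketch that splits an R1 read into its cell barcode and UMI. It assumes the 10x read layout of a 16-base cell barcode followed by a 10-base (v2) or 12-base (v3) UMI. The function name, the `v2`/`v3` labels, and the `UR`/`UY` tag names for the raw UMI are illustrative assumptions, and this is not the code that Attach10xBarcodes runs.

```python
CELL_BARCODE_LENGTH = 16               # assumed 10x cell barcode length
UMI_LENGTH = {"v2": 10, "v3": 12}      # assumed UMI length per chemistry


def split_r1(sequence, quality, chemistry="v2"):
    """Return raw barcode and UMI tags extracted from one R1 read."""
    umi_end = CELL_BARCODE_LENGTH + UMI_LENGTH[chemistry]
    if len(sequence) < umi_end or len(quality) < umi_end:
        raise ValueError("R1 read is too short for the chosen chemistry")
    return {
        "CR": sequence[:CELL_BARCODE_LENGTH],         # raw cell barcode
        "CY": quality[:CELL_BARCODE_LENGTH],          # cell barcode qualities
        "UR": sequence[CELL_BARCODE_LENGTH:umi_end],  # raw UMI (assumed tag)
        "UY": quality[CELL_BARCODE_LENGTH:umi_end],   # UMI qualities (assumed tag)
    }
```

In a real pipeline these values would be written as tags on the matching R2 record in the UBAM, so that the barcode information travels with the alignable read.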
The output is a UBAM file containing the reads with correct barcodes, including barcodes that came within one edit distance (Levenshtein distance) of matching the whitelist of barcode sequences and were corrected by this tool. Correct barcodes are assigned a “CB” tag. Uncorrectable barcodes (with more than one error) are preserved and given a “CR” (Cell barcode Raw) tag. Cell barcode quality scores are also preserved in the file under the “CY” tag.
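The whitelist correction described above can be sketched in a few lines of Python. This is a simplified illustration, not the sc-tools implementation: it considers only single-base substitutions (one kind of edit at distance one), it leaves a barcode uncorrected when it is equally close to two whitelist entries, and it assumes the raw sequence is always kept under `CR` while `CB` is added only when a corrected barcode exists. The function names are hypothetical.

```python
def build_correction_map(whitelist):
    """Map every single-substitution variant of a whitelist barcode to that barcode."""
    whitelist = set(whitelist)
    corrections = {}
    ambiguous = set()
    for barcode in whitelist:
        for position, original_base in enumerate(barcode):
            for base in "ACGTN":
                if base == original_base:
                    continue
                variant = barcode[:position] + base + barcode[position + 1:]
                if variant in whitelist:
                    continue  # an exact whitelist match never needs correcting
                if variant in corrections and corrections[variant] != barcode:
                    ambiguous.add(variant)  # one mismatch away from two barcodes
                corrections[variant] = barcode
    for variant in ambiguous:
        del corrections[variant]
    return corrections


def tag_cell_barcode(raw_barcode, whitelist, corrections):
    """Return the barcode tags for one read: CR always, CB only when resolvable."""
    tags = {"CR": raw_barcode}
    if raw_barcode in whitelist:
        tags["CB"] = raw_barcode
    elif raw_barcode in corrections:
        tags["CB"] = corrections[raw_barcode]
    return tags
```

Precomputing the variants keeps the per-read work to a dictionary lookup, which matters when the same whitelist is applied to many millions of reads.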
To facilitate subsequent processing steps, the pipeline then scatters and splits the corrected UBAM files into groups according to cell barcode.
Optimus uses the STAR alignment task to map barcoded reads in the UBAM file to the genome primary assembly reference (see table above for version information). This task uses STAR (Spliced Transcripts Alignment to a Reference; Dobin, et al., 2013), a standard, splice-aware RNA-seq alignment tool.
The TagGeneExon task then uses Drop-seq tools v.1.13 to annotate each read with the type of sequence to which it aligns. These annotations include INTERGENIC, INTRONIC, and EXONIC, and are stored using the XF BAM tag. When the alignment falls within an intron or exon, the name of the gene that overlaps the alignment is associated with the read and stored using the GE BAM tag.
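The idea behind the XF and GE tags can be illustrated with a small Python sketch. It is a deliberate simplification and not how Drop-seq tools works internally: it ignores strand, spliced alignments, and reads that overlap several genes, and the gene record layout (`name`, `start`, `end`, `exons`) is a hypothetical structure chosen for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def annotate_alignment(start, end, genes):
    """Return XF (and GE when applicable) tags for one aligned interval."""
    # Exonic overlap takes priority over intronic overlap.
    for gene in genes:
        if any(overlaps(start, end, e_start, e_end)
               for e_start, e_end in gene["exons"]):
            return {"XF": "EXONIC", "GE": gene["name"]}
    # Inside a gene body but outside every exon: intronic.
    for gene in genes:
        if overlaps(start, end, gene["start"], gene["end"]):
            return {"XF": "INTRONIC", "GE": gene["name"]}
    # No gene overlaps the alignment, so there is no GE tag.
    return {"XF": "INTERGENIC"}
```

For example, with a gene spanning 100 to 500 whose exons are (100, 200) and (400, 500), an alignment at 250 to 300 would be tagged INTRONIC with that gene's name in GE.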
UMIs are designed to distinguish unique transcripts present in the cell at lysis from those arising from PCR amplification of these same transcripts. But, like cell barcodes, UMIs can also be incorrectly sequenced or amplified. Optimus uses the UmiCorrection task to apply a network-based, "directional" method (Smith, et al., 2017) to account for such errors using Umi-tools v.0.0.1.
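A minimal Python sketch of the directional method, as described by Smith, et al. (2017), may help clarify how UMI errors are collapsed. UMIs observed for one cell and gene form a graph: an edge runs from UMI a to UMI b when they differ by one base and count(a) >= 2 * count(b) - 1, and each connected network reachable from a high-count UMI is treated as one molecule. This is an illustration written from the published description, not the UmiCorrection task's code; the function names are hypothetical and all UMIs are assumed to have the same length.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_umi_groups(umi_counts):
    """Group UMIs into networks; each key is the representative (root) UMI."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    # Directed edge a -> b: one mismatch apart and a is sufficiently more abundant.
    adjacency = {
        a: [b for b in umis
            if a != b
            and hamming(a, b) == 1
            and umi_counts[a] >= 2 * umi_counts[b] - 1]
        for a in umis
    }
    assigned = set()
    groups = {}
    for root in umis:
        if root in assigned:
            continue
        assigned.add(root)
        groups[root] = [root]
        queue = deque([root])
        while queue:  # breadth-first walk along directed edges
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in assigned:
                    assigned.add(neighbor)
                    groups[root].append(neighbor)
                    queue.append(neighbor)
    return groups
```

Each group's root can then stand in for every UMI in its network, so that low-count variants produced by sequencing or PCR errors are not counted as separate molecules.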
The Metrics task calls the SequenceDataWithMoleculeTagMetrics.wdl to calculate summary metrics, which are used to assess the quality of the data output each time this pipeline is run. This task uses [sctools v.0.3.3](https://github.com/HumanCellAtlas/sctools). These metrics are included in the Zarr and [Loom](link to Loom schema) output files.
The Optimus Count task evaluates every read in the BAM file and creates a UMI-aware expression matrix using Drop-seq tools. This matrix contains the number of molecules that were observed for each cell barcode and for each gene. The task discards any read that maps to more than one gene, and counts any remaining reads provided the triplet of cell barcode, molecule barcode, and gene name is unique, indicating the read originates from a single transcript present at the time of cell lysis.
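The counting rule described above can be sketched in Python as follows. The input format, an iterable of (cell barcode, molecule barcode, gene names) tuples, is a hypothetical simplification of the tagged BAM records, and the sketch is not the Count task's implementation.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count unique molecules per (cell barcode, gene) pair.

    `reads` yields (cell_barcode, umi, genes) tuples, where `genes` is the
    list of gene names the read maps to.
    """
    seen_triplets = set()
    counts = defaultdict(int)
    for cell_barcode, umi, genes in reads:
        if len(genes) != 1:
            continue  # discard reads mapping to no gene or to several genes
        triplet = (cell_barcode, umi, genes[0])
        if triplet in seen_triplets:
            continue  # same molecule observed again (a PCR duplicate)
        seen_triplets.add(triplet)
        counts[(cell_barcode, genes[0])] += 1
    return dict(counts)
```

The resulting dictionary is a sparse form of the cell-by-gene matrix: any (cell barcode, gene) pair that is absent has a count of zero.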
Empty droplets are lipid droplets that did not encapsulate a cell during 10X sequencing, but instead acquired cell-free RNA (secreted RNA or RNA released during cell lysis; Lun, et al., 2018) from the solution in which the cells resided. This ambient RNA can serve as a substrate for reverse transcription, leading to a small number of background reads. The Optimus pipeline calls the RunEmptyDrops task, which uses the DropletUtils v.0.1.1 R package to flag cell barcodes that represent empty droplets rather than cells. These metrics are stored in the output Zarr and [Loom](link to Loom schema) files.
Output files of the pipeline include:
- Cell x Gene unnormalized expression matrix
- Unfiltered, sorted BAM file with [barcode and downstream analysis Tags](link to UBAM Tag Description)
- Cell metadata, including cell metrics
- Gene metadata, including gene metrics
The following table lists the types of files produced by the pipeline.
Output Name | Filename, if applicable | Output Type | Output Format | Notes/Description | Store in Data Store? | Tool |
---|---|---|---|---|---|---|
pipeline_version | | Version of the processing pipeline run on this data | String | This is passed from the processing WDL to the adapter pipelines to be put into the metadata in the HCA | Yes, in metadata | Lira |
bam | merged.bam | aligned bam | bam | coordinate sorted | Yes | A few tools; need to address this provenance |
matrix | sparse_counts.npz | GenexCell expression matrix | Numpy array | | Yes | sctools |
matrix_row_index | sparse_counts_row_index.npy | Index of cells in expression matrix | Numpy array index | | Yes | sctools |
matrix_col_index | sparse_counts_col_index.npy | Index of genes in expression matrix | Numpy array index | | Yes | sctools |
cell_metrics | merged-cell-metrics.csv.gz | cell metrics | compressed csv | Matrix of metrics by cells | Yes | sctools |
gene_metrics | merged-gene-metrics.csv.gz | gene metrics | compressed csv | Matrix of metrics by genes | Yes | sctools |
cell_calls | empty_drops_result.csv | cell calls | csv | | Yes | emptyDrops |
zarr_output_files | {unique_id}.zarr!.zattrs | | zarr store? sparse matrix? | | Yes | |
loom_output_file | output.loom | Loom | Loom | Loom file with expression data and metadata | N/A | N/A |
Optimus Release Version | Date | Release Note |
---|---|---|
v1.3.6 (current) | 09/23/2019 | Optimus now optionally outputs a Loom formatted count matrix, with the default being true. |
v1.3.3 | 08/29/2019 | This version and newer have been validated to additionally support Mouse data. The gene expression per cell is now counted by gencode geneID instead of gene name. There is an additional output mapping geneID to gene name provided. This is a breaking change. |
v1.0.0 | 03/30/2019 | Initial pipeline release. Validated on hg38 gencodev27. |
We have a document dedicated to open issues! Please help us make our tools better.