Barcodes
zUMIs provides three main options for selecting relevant barcodes:
- automatic detection
- number of barcodes with most reads
- barcode list annotation
Here is more information on each of the modes:
zUMIs infers which barcodes mark good cells from the observed sequences. To this end, we fit a mixture of k normal distributions to the number of reads per barcode (BC) using the R package mclust, where k is determined empirically by mclust via the Bayesian Information Criterion (BIC). We reason that only the normal distribution with the largest mean contains barcodes that identify reads originating from intact cells. To exclude spurious barcodes, we discard all barcodes that fall in the lower 1% tail of this distribution.
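The final filtering step can be sketched in a few lines of Python. This is only an illustration, not the zUMIs implementation: the mean and standard deviation are hypothetical stand-ins for the parameters that mclust would estimate for the distribution with the largest mean, the read counts are made up, and whether zUMIs applies any transformation to the counts beforehand is not covered here.

```python
from statistics import NormalDist


def filter_lower_tail(reads_per_bc, mean, sd, tail=0.01):
    """Keep barcodes at or above the lower `tail` quantile of N(mean, sd)."""
    # Read count below which a barcode falls into the lower tail.
    cutoff = NormalDist(mu=mean, sigma=sd).inv_cdf(tail)
    return {bc: n for bc, n in reads_per_bc.items() if n >= cutoff}


# Hypothetical read counts per barcode.
reads_per_bc = {"GGGGCA": 10500, "TATTGT": 9800, "GCACGG": 7400, "CAATAA": 120}

# Hypothetical fitted parameters (in zUMIs these would come from mclust).
kept = filter_lower_tail(reads_per_bc, mean=10000.0, sd=1000.0)
print(sorted(kept))
```

With these assumed parameters the 1% quantile lies a little below 7700 reads, so the two barcodes with fewer reads are dropped.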
zUMIs will tabulate all observed barcode sequences and their frequencies. The user-specified number of barcodes will then be selected in descending order of read count.
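As a rough illustration of this mode, the selection amounts to counting reads per barcode and keeping the most frequent ones. The function name and the input list below are hypothetical and are not part of zUMIs.

```python
from collections import Counter


def top_barcodes(barcodes, n):
    """Return the n most frequent barcodes, most frequent first."""
    counts = Counter(barcodes)
    return [bc for bc, _ in counts.most_common(n)]


# Hypothetical barcode sequence extracted from each read.
observed = ["GGGGCA"] * 5 + ["TATTGT"] * 3 + ["GCACGG"] * 2 + ["CAATAA"]

selected = top_barcodes(observed, 2)
print(selected)
```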
If expected barcodes are known a priori, it is usually advisable to provide these. The format should be a plain text file without headers, where each line contains the exact barcode sequence.
For instance:
GGGGCA
TATTGT
GCACGG
CAATAA
CGCGTG
Attention: If you have specified a 6-mer in the barcode range (e.g. 1-6), this annotation should also contain 6-mer reference barcodes!
If you are using several barcode ranges in zUMIs, the expected barcode list should contain the concatenated strings of all possible expected barcode combinations!
For instance, take the above cell barcodes that should all have the same plate barcode:
CGTACTAGGGGGCA
CGTACTAGTATTGT
CGTACTAGGCACGG
CGTACTAGCAATAA
CGTACTAGCGCGTG
Attention: Make sure the annotation always contains reference barcodes with correct length (sum of all specified barcode lengths)!
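A concatenated barcode list like the one above could be generated with a short script. This is a sketch under assumptions: the helper name is hypothetical, and the parts are joined in the order they are passed, which is assumed to match the order of the barcode ranges (plate barcode first, as in the example above).

```python
from itertools import product


def concatenated_whitelist(*barcode_sets):
    """Concatenate every combination of one barcode from each set, in order."""
    return ["".join(parts) for parts in product(*barcode_sets)]


plate_barcodes = ["CGTACTAG"]
cell_barcodes = ["GGGGCA", "TATTGT", "GCACGG", "CAATAA", "CGCGTG"]

whitelist = concatenated_whitelist(plate_barcodes, cell_barcodes)

# Every entry must have the summed length of all barcode ranges (8 + 6 here).
expected_length = len(plate_barcodes[0]) + len(cell_barcodes[0])
if any(len(bc) != expected_length for bc in whitelist):
    raise ValueError("barcode with unexpected length in whitelist")

# Plain text, no header, one barcode per line.
print("\n".join(whitelist))
```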
When automatic detection is combined with a barcode whitelist, zUMIs will use its automatic BC detection as described above and make sure that each BC is part of the given BC whitelist. If your reference whitelist barcodes are shorter than the barcode extracted from the sequence reads, zUMIs will still try to match them up with a grep command. Note that this may become slow if you have many cells and whitelisted barcodes.
Example: You are using 10xGenomics data with 16bp RT-barcode + 8bp i7 index (-> zUMIs internal BC will be 24 bp) but only give 16bp RT barcodes in the whitelist. The matching up will still work in this case.
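The idea of matching shorter whitelist entries against longer observed barcodes can be sketched as a substring search. This is a Python analogue for illustration only, not the grep call zUMIs runs, and the sequences below are made up.

```python
def match_to_whitelist(observed, whitelist):
    """Keep observed barcodes that contain a whitelisted barcode as a substring."""
    return [bc for bc in observed if any(ref in bc for ref in whitelist)]


# Hypothetical 16 bp RT barcode whitelist and 24 bp observed barcodes
# (16 bp RT barcode followed by an 8 bp i7 index).
whitelist = ["AAACCTGAGAAACCAT"]
observed = [
    "AAACCTGAGAAACCAT" + "GGTTTACT",  # contains a whitelisted barcode
    "TTTTTTTTTTTTTTTT" + "GGTTTACT",  # does not
]

matched = match_to_whitelist(observed, whitelist)
print(matched)
```

Each observed barcode is compared against each whitelist entry, so the work grows with the product of both list sizes. This is in line with the note above that matching may become slow for many cells and whitelisted barcodes.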
For some scRNA-seq protocols, the same cell may be observed with several barcode sequences. Examples are:
- SPLiT-seq: Round 1 RT barcode for the same cell differs if using oligo-dT and random hexamer priming together.
- 10xGenomics: i7 library barcode is actually a mix of 4 primers with distinct sequences to improve sequencer quality.
zUMIs can combine the reads that belong together when the user provides an annotation file in the barcode_sharing: field of the YAML config file. The file should have the following format:
- A commented-out header line (starting with #) defining which portion of the full zUMIs barcode to match up (e.g. #17-26 if bases 17-26 contain the barcode portion of interest)
- Tab-separated barcodes that belong together. Each line should contain all the barcodes that belong together, with as many columns as necessary, e.g.:
GGTTTACT CTAAACGG TCGGCGTC AACCGTAA
TTTCATGA ACGTCCCT CGCATGTG GAAGGAAC
...
or
AACGTGAT GATAGACA
CGCTGATC GTCGTAGA
...
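To illustrate how such a file could be interpreted, the following sketch parses the header and the barcode groups and rewrites the matching portion of a full barcode. It rests on several assumptions that are not stated above: positions in the header are treated as 1-based and inclusive, the first column of each line is used as the representative barcode, and the function names and the #17-24 header are hypothetical.

```python
def parse_sharing(text):
    """Parse a barcode sharing annotation into ((start, end), mapping)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header such as "#17-24": portion of the full barcode to match up.
    start, end = (int(x) for x in lines[0].lstrip("#").split("-"))
    mapping = {}
    for ln in lines[1:]:
        group = ln.split("\t")
        for bc in group:
            # Assumption: the first column represents the whole group.
            mapping[bc] = group[0]
    return (start, end), mapping


def collapse(full_bc, span, mapping):
    """Replace the shared portion of a full barcode by its representative."""
    start, end = span  # assumed 1-based, inclusive
    part = full_bc[start - 1:end]
    return full_bc[:start - 1] + mapping.get(part, part) + full_bc[end:]


annotation = (
    "#17-24\n"
    "GGTTTACT\tCTAAACGG\tTCGGCGTC\tAACCGTAA\n"
    "TTTCATGA\tACGTCCCT\tCGCATGTG\tGAAGGAAC\n"
)
span, mapping = parse_sharing(annotation)

# Hypothetical 24 bp barcode: 16 bp RT barcode followed by an 8 bp i7 index.
collapsed = collapse("AAACCTGAGAAACCAT" + "TCGGCGTC", span, mapping)
print(collapsed)
```

Barcodes whose selected portion is not listed in the file are returned unchanged.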