Multiplexing is done by adding artificial sequences to molecules, to encode their sample of origin, so that multiple samples can be sequenced together (to save costs). Demultiplexing is to retreive this information from the sequence reads, and recode it as an addition to the read name or by grouping the reads in separate files. Some multiplexing strategies are supported natively by some sequencers, but a lot of custom designs also exist.
In this example, we will install TagDust 2, download nanoCAGE data from DDBJ, and demultiplex it. These commands have been tested on a virtual machine running Linux.
mkdir -p $HOME/bin
Log out, and log in again. On Linux systems such as Debian, $HOME/bin
is now in the $PATH
.
Here we are downloading version 2.33. This is the first version where the option -show_finger_seq
is available.
wget http://downloads.sourceforge.net/project/tagdust/tagdust-2.33.tar.gz
tar xvfz tagdust-2.33.tar.gz
cd tagdust-2.33
./configure
make
cp src/tagdust $HOME/bin
cd
These are Read 1 and Read 2 from a HiSeq lane where nanoCAGE libraries have been multiplexed using a barcode that spans the first 6 bases of Read1. Illumina's TruSeq indexing system has not been used here, which is why we need to demultiplex by hand. The reason for using barcodes in Read 1 instead of TruSeq indexes is to allow for multiplexing the library construction at an early stage.
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/DRA004/DRA004180/DRX044600/DRR049557_1.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/DRA004/DRA004180/DRX044600/DRR049557_2.fastq.bz2
The sequence of Read 1 should match with BBBBBBNNNNNNNNTATAGGG
, where B
is a barcode base, N
is a UMI base, and TATAGGG
is a linker sequence that
is the same in every read. The rest of the bases are cDNA sequences to be aligned
on the genome.
The architecture files declare the structure of the reads in a machine-readable
format, where B
, F
, S
and R
bloks declare barcodes, fingerprints (UMIs),
spacers (linkers), and the rna sequence to be aligned. More details can be
found in the user manual (the doc/User-Manual.pdf
file in TagDust's source
archive) or in Lassmann T., 2015.
cat > DRR049557.arch <<__END__
tagdust -1 B:ACATGA,ATCATA,CACGTG,CGATGA,GAGATA,GCTCTC,GTATGA,TCGATA,AGTAGC,ATCGCA,CACTCT,CTGAGC,GAGCGT,GCTGCA,TATAGC,CACGAT,CTGACG -2 F:NNNNNNNN -3 S:TATAGGG -4 R:N
tagdust -1 R:N
__END__
tagdust -t 4 -show_finger_seq -ref tagdust.fa -arch DRR049557.arch -o DRR049557 DRR049557_1.fastq.bz2 DRR049557_2.fastq.bz2
-t 4
allocates 4 cores to the task. You can adapt this number to the machine
where you run that command. Note that even when using multiple cores, TagDust may
take hours before finishing. Look at the timestamps in the example output
below for an example.
The file tagdust.fa
contains sequences of nanoCAGE linkers and of Nextera sequencing
linkers.
As per Illumina's Illumina Adapter Sequences Document , here is the copyright notice for the sequences that have a name starting with "Nextera".
Oligonucleotide sequences © 2016 Illumina, Inc. All rights reserved. Derivative works created by Illumina customers are authorized for use with Illumina instruments and products only. All other uses are strictly prohibited.
Here is the list associating barcodes to sample names. It was constructed using the information submitted to DDBJ (https://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=DRX044600).
cat > DRR049557.samples.txt <<__END__
ACATGA HaCaT_r1
ATCATA HaCaT_r2
CACGTG HaCaT_r3
CGATGA c33a_r1
GAGATA c33a_r2
GCTCTC c33a_r3
GTATGA Hela_r1
TCGATA Hela_r2
AGTAGC Hela_r3
ATCGCA Caski_r1
CACTCT Caski_r2
CTGAGC Caski_r3
GAGCGT SiHa_r1
GCTGCA SiHa_r2
TATAGC SiHa_r3
CACGAT W12E_r1
CTGACG W12E_r2
__END__
Read the list and rename with a loop.
cat DRR049557.samples.txt |
while read from to
do
prename -f "s/$from/$to/" *$from*
done
The following files should be produced:
DRR049557_BC_Caski_r1_READ1.fq
DRR049557_BC_Caski_r1_READ2.fq
DRR049557_BC_Caski_r2_READ1.fq
DRR049557_BC_Caski_r2_READ2.fq
DRR049557_BC_Caski_r3_READ1.fq
DRR049557_BC_Caski_r3_READ2.fq
DRR049557_BC_HaCaT_r1_READ1.fq
DRR049557_BC_HaCaT_r1_READ2.fq
DRR049557_BC_HaCaT_r2_READ1.fq
DRR049557_BC_HaCaT_r2_READ2.fq
DRR049557_BC_HaCaT_r3_READ1.fq
DRR049557_BC_HaCaT_r3_READ2.fq
DRR049557_BC_Hela_r1_READ1.fq
DRR049557_BC_Hela_r1_READ2.fq
DRR049557_BC_Hela_r2_READ1.fq
DRR049557_BC_Hela_r2_READ2.fq
DRR049557_BC_Hela_r3_READ1.fq
DRR049557_BC_Hela_r3_READ2.fq
DRR049557_BC_SiHa_r1_READ1.fq
DRR049557_BC_SiHa_r1_READ2.fq
DRR049557_BC_SiHa_r2_READ1.fq
DRR049557_BC_SiHa_r2_READ2.fq
DRR049557_BC_SiHa_r3_READ1.fq
DRR049557_BC_SiHa_r3_READ2.fq
DRR049557_BC_W12E_r1_READ1.fq
DRR049557_BC_W12E_r1_READ2.fq
DRR049557_BC_W12E_r2_READ1.fq
DRR049557_BC_W12E_r2_READ2.fq
DRR049557_BC_c33a_r1_READ1.fq
DRR049557_BC_c33a_r1_READ2.fq
DRR049557_BC_c33a_r2_READ1.fq
DRR049557_BC_c33a_r2_READ2.fq
DRR049557_BC_c33a_r3_READ1.fq
DRR049557_BC_c33a_r3_READ2.fq
DRR049557_un_READ1.fq
DRR049557_un_READ2.fq
The tagdust
command also produces a summary file, DRR049557_logfile.txt
.
In our test run, its content was:
[2016-07-07 09:29:55] Tagdust 2.33, Copyright (C) 2013 Timo Lassmann <[email protected]>
[2016-07-07 09:29:55] cmd: tagdust -t 4 -show_finger_seq -ref tagdust.fa -arch DRR049557.arch -o DRR049557 DRR049557_1.fastq.bz2 DRR049557_2.fastq.bz2
[2016-07-07 09:29:55] Start Run
--------------------------------------------------
[2016-07-07 09:29:55] Looking at file:DRR049557_1.fastq.bz2
[2016-07-07 09:29:55] Searching for best architecture in file 'DRR049557.arch'
[2016-07-07 09:30:26] Using: -1 B:ACATGA,ATCATA,CACGTG,CGATGA,GAGATA,GCTCTC,GTATGA,TCGATA,AGTAGC,ATCGCA,CACTCT,CTGAGC,GAGCGT,GCTGCA,TATAGC,CACGAT,CTGACG -2 F:NNNNNNNN -3 S:TATAGGG -4 R:N
[2016-07-07 09:30:26] 1.00 Confidence.
[2016-07-07 09:30:26] Looking at file:DRR049557_2.fastq.bz2
[2016-07-07 09:30:26] Searching for best architecture in file 'DRR049557.arch'
[2016-07-07 09:30:53] Using: -1 R:N
[2016-07-07 09:30:53] 1.00 Confidence.
[2016-07-07 09:31:10] Determining threshold for read0.
[2016-07-07 09:31:20] Long sequence found. Need to realloc model...
[2016-07-07 09:34:31] Selected Threshold:: 2.658036
[2016-07-07 09:34:31] Determining threshold for read1.
[2016-07-07 09:34:41] Long sequence found. Need to realloc model...
[2016-07-07 09:34:42] Selected Threshold:: 3.051751
[2016-07-07 09:35:01] Detected casava 1.8 format.
[2016-07-08 00:24:06] Done.
[2016-07-08 00:24:06] DRR049557_1.fastq.bz2 Input file 0.
[2016-07-08 00:24:06] DRR049557_2.fastq.bz2 Input file 1.
[2016-07-08 00:24:06] 217752577 total input reads
[2016-07-08 00:24:06] 3.05 selected threshold
[2016-07-08 00:24:06] 187155993 successfully extracted
[2016-07-08 00:24:06] 85.9% extracted
[2016-07-08 00:24:06] 12041093 problems with architecture
[2016-07-08 00:24:06] 1251417 barcode / UMI not found
[2016-07-08 00:24:06] 0 too short
[2016-07-08 00:24:06] 10318 low complexity
[2016-07-08 00:24:06] 17293756 match artifacts:
[2016-07-08 00:24:06] 1750534 RT_(without_random_bases)
[2016-07-08 00:24:06] 15346 empty_(TS_linker_+_RT_reverse-complemented)
[2016-07-08 00:24:06] 470963 Nextera_501
[2016-07-08 00:24:06] 108 Nextera_502
[2016-07-08 00:24:06] 1349 Nextera_503
[2016-07-08 00:24:06] 391 Nextera_504
[2016-07-08 00:24:06] 14 Nextera_505
[2016-07-08 00:24:06] 1 Nextera_506
[2016-07-08 00:24:06] 73 Nextera_507
[2016-07-08 00:24:06] 3 Nextera_508
[2016-07-08 00:24:06] 11706684 Nextera_701
[2016-07-08 00:24:06] 374315 Nextera_702
[2016-07-08 00:24:06] 2972225 Nextera_703
[2016-07-08 00:24:06] 16 Nextera_704
[2016-07-08 00:24:06] 302 Nextera_705
[2016-07-08 00:24:06] 151 Nextera_706
[2016-07-08 00:24:06] 416 Nextera_707
[2016-07-08 00:24:06] 28 Nextera_708
[2016-07-08 00:24:06] 137 Nextera_709
[2016-07-08 00:24:06] 16 Nextera_710
[2016-07-08 00:24:06] 445 Nextera_711
[2016-07-08 00:24:06] 169 Nextera_712
[2016-07-08 00:24:06] 60 Nextera_501_Reversed:
[2016-07-08 00:24:06] 2 Nextera_502_Reversed:
[2016-07-08 00:24:06] 2 Nextera_503_Reversed:
[2016-07-08 00:24:06] 3 Nextera_504_Reversed:
[2016-07-08 00:24:06] 1 Nextera_505_Reversed:
[2016-07-08 00:24:06] 2 Nextera_703_Reversed:
The resulting FASTQ reads look like this:
@DRR049557.7 HWI-ST549:177:C6Y04ACXX:2:1101:1316:1986 length=51;FP:GTCAGGGG;RQ:30.19
GGGGAATCAGGGTTCGATTCCGGAGAGGGN
+
JJDDDDDDDDDDDDDDDEDDEDDDDDDDD#
The sequence of the fingerprint is indicated by the FP:
tag in the semicolon-separated part of the read name.