Home
-
gsufsort supports multiple files in a directory INPUT (option --dir):
For example, dataset/input.txt was split into 5 files in dataset/input/:
ls dataset/input/
input-1.txt input-2.txt input-3.txt input-4.txt input-5.txt
One can index all files in dataset/input/ with options --dir and --sa:
./gsufsort dataset/input/ --dir --sa
If no argument is given for --output, the default output filename is ./all.
## gsufsort ##
## store_to_disk ##
all.4.sa 20198960 bytes (n = 5049740)
Compare the output all.4.sa with dataset/input.txt.4.sa:
diff all.4.sa dataset/input.txt.4.sa -s
Files all.4.sa and dataset/input.txt.4.sa are identical
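The sizes in the log are consistent with the 4 in all.4.sa being the number of bytes per suffix array entry: 20198960 / 4 = 5049740 = n. The following Python sketch checks that arithmetic and reads such a file as an array of fixed-width integers. The helper names `entry_count` and `read_int_array` are hypothetical, and the little-endian unsigned layout is an assumption that is not stated on this page.

```python
import os
import struct
import tempfile


def entry_count(size_in_bytes, bytes_per_entry):
    """Number of fixed-width integers stored in a file of the given size."""
    if size_in_bytes % bytes_per_entry != 0:
        raise ValueError("file size is not a multiple of the entry width")
    return size_in_bytes // bytes_per_entry


def read_int_array(path, bytes_per_entry=4):
    """Read a file of fixed-width unsigned integers.

    Assumption: entries are little-endian; adjust "<" if the layout differs.
    """
    code = {4: "I", 8: "Q"}[bytes_per_entry]
    with open(path, "rb") as fh:
        data = fh.read()
    n = entry_count(len(data), bytes_per_entry)
    return list(struct.unpack("<%d%s" % (n, code), data))


# Sizes reported in the log above for all.4.sa.
print(entry_count(20198960, 4))  # 5049740

# Round trip on a tiny synthetic file (not real gsufsort output).
demo = [5, 3, 1, 0, 4, 2]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<6I", *demo))
print(read_int_array(tmp.name))  # [5, 3, 1, 0, 4, 2]
os.remove(tmp.name)
```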
We can also output the concatenation of all files to all.1.str (option --str):
./gsufsort dataset/input/ --dir --str
## gsufsort ##
## store_to_disk ##
all.1.str 5049740 bytes (n = 5049740)
When we compare all.1.str with dataset/input.txt, there is an extra byte in all.1.str, corresponding to the terminator # added to the concatenated string:
ls all.1.str dataset/input.txt -la
-rw-rw-r--. 1 louza louza 5049740 Jun 26 09:54 all.1.str
-rw-rw-r--. 1 louza louza 5049739 Jun 25 10:19 dataset/input.txt
diff all.1.str dataset/input.txt -s
10001d10000
<
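The one-byte difference (5049739 + 1 = 5049740) can be reproduced on toy data. This is a minimal Python sketch, assuming that the files are joined in sorted filename order and that a single terminator byte is appended at the end; the function `concatenate_directory`, the toy strings, and the choice of `#` as the stored byte are illustrative assumptions, not a description of gsufsort's internals.

```python
import os
import tempfile

TERMINATOR = b"#"  # assumption: one terminator byte, shown as '#' in the text


def concatenate_directory(directory, terminator=TERMINATOR):
    """Join all files of a directory in sorted name order, then append the terminator."""
    parts = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "rb") as fh:
            parts.append(fh.read())
    return b"".join(parts) + terminator


# Toy data standing in for dataset/input.txt and its split parts.
original = b"banana\nanaba\nanan\n"
pieces = [b"banana\n", b"anaba\n", b"anan\n"]

with tempfile.TemporaryDirectory() as tmp:
    for k, piece in enumerate(pieces, start=1):
        with open(os.path.join(tmp, "input-%d.txt" % k), "wb") as fh:
            fh.write(piece)
    joined = concatenate_directory(tmp)

# The joined string is exactly one byte longer than the original file.
print(len(original), len(joined))
print(joined[:-1] == original)
```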
-
Warning: when using options --txt, --fasta or --fastq together with --dir, make sure that INPUT contains only valid files; otherwise the program may crash.
-
gsufsort can also output (option --qs) the Quality Scores (QS) permuted according to the BWT symbols:
-
This option is valid only for .fastq or .fq files.
For example, given the first DNA read in dataset/reads.fastq
:
head -4 dataset/reads.fastq
@HWI-ST928:79:C0GNWACXX:6:1101:1184:2104 1:N:0:TAAGGCGATATCCTCT
AGTTAGGACTATTCGAACATTATGTCACAAACGTGATGTCACAAAGCCGAATTGTCTGGAGTTAAGACTATACGAACATTATGAAACAAACGTGATGTCAC
+
@C@FDEDDHHGHHJIIGGHJJIJGIJIHGIIFGEFIIJJJGHIGGF@DHEHIIIIJIIGGIIIGE@CEEHHEE@B?AAECDDCDDCCCBB<=<?<?CCC>A
Then, run:
./gsufsort dataset/reads.fastq --docs 1 --bwt --qs
## gsufsort ##
## store_to_disk ##
dataset/reads.fastq.bwt 103 bytes (n = 103)
dataset/reads.fastq.bwt.qs 103 bytes (n = 103)
The QS permuted sequence is written to dataset/reads.fastq.bwt.qs:
tail dataset/reads.fastq.bwt.qs
ACCHHD@ICGIIHCDJJBIHBI@DGGFGEC?JFAGHE>CIGCIJ?GFEH@BICDIDEJDEEI<EGDI?JII<FG@IH@EEJHCGJHID=GJ<IIIICAHGH
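To illustrate what "permuted according to the BWT symbols" means, here is a small Python sketch on a toy read. It assumes a single read followed by one terminator `$`, a naive suffix array, and a placeholder quality `!` for the terminator; these choices, and the function names, are hypothetical. The real output above has 103 bytes for a 101-base read, so gsufsort evidently adds two extra symbols, which this sketch does not model.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_and_qs(read, quals, terminator="$", filler="!"):
    """Return the BWT of read+terminator and the qualities in the same order.

    Position i of the permuted qualities holds the quality of the base
    that appears at position i of the BWT.
    """
    if len(read) != len(quals):
        raise ValueError("read and quality string must have the same length")
    text = read + terminator
    qs = quals + filler  # placeholder quality for the terminator (assumption)
    sa = suffix_array(text)
    # text[i - 1] with i == 0 wraps around to the terminator.
    bwt = "".join(text[i - 1] for i in sa)
    bwt_qs = "".join(qs[i - 1] for i in sa)
    return bwt, bwt_qs


bwt, bwt_qs = bwt_and_qs("ACCA", "ABCD")
print(bwt)     # AC$CA
print(bwt_qs)  # DC!BA
```

Each quality score travels with its base, so equal BWT symbols that are grouped together bring their scores next to each other as well.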
gsufsort can invert the QS permuted sequence together with the BWT (options --ibwt --qs):
./gsufsort --ibwt --qs dataset/reads.fastq.bwt
See the resulting file:
less +1 dataset/reads.fastq.iqs
@C@FDEDDHHGHHJIIGGHJJIJGIJIHGIIFGEFIIJJJGHIGGF@DHEHIIIIJIIGGIIIGE@CEEHHEE@B?AAECDDCDDCCCBB<=<?<?CCC>A
Compare the output with the original file:
head -4 dataset/reads.fastq | sed -n 4~4p - | diff -s dataset/reads.fastq.iqs -
Files dataset/reads.fastq.iqs and - are identical
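The inversion can be sketched in Python with the standard LF-mapping, carrying each quality score along with its BWT symbol. This is a simplified illustration, assuming a single read with one terminator `$` that is smaller than every base; the function name is hypothetical and this is not gsufsort's actual implementation. The inputs `AC$CA` and `DC!BA` are the BWT of the toy text `ACCA$` and the qualities `ABCD` permuted the same way (with `!` as a placeholder for the terminator).

```python
def invert_bwt_with_qs(bwt, bwt_qs):
    """Recover the read and its qualities from a BWT and BWT-ordered qualities.

    Assumption: exactly one terminator, smaller than all other symbols,
    so row 0 of the sorted rotations starts with the terminator.
    """
    n = len(bwt)
    # A stable sort of BWT positions by symbol yields the first column;
    # lf[pos] is the row whose first symbol is the occurrence bwt[pos].
    order = sorted(range(n), key=lambda i: bwt[i])
    lf = [0] * n
    for rank, pos in enumerate(order):
        lf[pos] = rank

    read, quals = [], []
    row = 0  # bwt[0] is the last base of the read
    for _ in range(n - 1):  # n - 1 bases; the terminator is skipped
        read.append(bwt[row])
        quals.append(bwt_qs[row])
        row = lf[row]
    # Symbols were collected from the last base to the first.
    return "".join(reversed(read)), "".join(reversed(quals))


read, quals = invert_bwt_with_qs("AC$CA", "DC!BA")
print(read)   # ACCA
print(quals)  # ABCD
```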