Felipe A. Louza edited this page Jun 26, 2020 · 5 revisions

Additional features

Multiple files

  • gsufsort can index multiple files stored in a directory INPUT (option --dir).

For example, dataset/input.txt was split into 5 files in dataset/input/:

ls dataset/input/
input-1.txt  input-2.txt  input-3.txt  input-4.txt  input-5.txt

One can index all files in dataset/input/ with options --dir and --sa:

./gsufsort dataset/input/ --dir --sa

If no argument is given for --output, the default output filename is ./all.

## gsufsort ##
## store_to_disk ##
all.4.sa 	20198960 bytes (n = 5049740)

Compare the output all.4.sa with dataset/input.txt.4.sa:

diff all.4.sa dataset/input.txt.4.sa -s
Files all.4.sa and dataset/input.txt.4.sa are identical
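
How dataset/input/ was produced is not stated here. As a rough sketch, a line-aligned split with the same file names could be made with standard tools as shown below. The helper name split_into_parts and the throwaway demo file are illustrative assumptions, and the source file is assumed to end with a newline. The final check confirms that the pieces, concatenated in name order, reproduce the original.

```shell
# Sketch only: split a text file into N pieces of whole lines, named
# input-1.txt ... input-N.txt (N <= 9), then check the pieces against the original.
# split_into_parts is a hypothetical helper, not part of gsufsort.
split_into_parts() {
    src=$1; outdir=$2; parts=$3
    mkdir -p "$outdir"
    total=$(wc -l < "$src")
    awk -v n="$parts" -v total="$total" -v dir="$outdir" '
        { part = int((NR - 1) * n / total) + 1     # 1-based index of the piece
          print > (dir "/input-" part ".txt") }' "$src"
}

# Demo on a throwaway file standing in for dataset/input.txt
tmp=$(mktemp -d)
seq 1 100 > "$tmp/input.txt"
split_into_parts "$tmp/input.txt" "$tmp/input" 5
ls "$tmp/input"

# Concatenating the pieces in name order should give back the original
if cat "$tmp"/input/input-?.txt | cmp -s - "$tmp/input.txt"; then
    echo "pieces match the original"
fi
```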

We can also output the concatenation of all files to all.1.str (option --str):

./gsufsort dataset/input/ --dir --str
## gsufsort ##
## store_to_disk ##
all.1.str 5049740  bytes (n = 5049740)

When we compare all.1.str with dataset/input.txt, all.1.str has one extra byte, which corresponds to the terminator # added to the concatenated string:

ls all.1.str dataset/input.txt -la
-rw-rw-r--. 1 louza louza 5049740 Jun 26 09:54 all.1.str
-rw-rw-r--. 1 louza louza 5049739 Jun 25 10:19 dataset/input.txt
diff all.1.str dataset/input.txt -s
10001d10000
<
  • Warning: when options --txt, --fasta or --fastq are used together with --dir, make sure that INPUT contains only valid files; otherwise the program may crash.
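
The one-byte size difference between all.1.str and dataset/input.txt reported in the listing can also be checked directly. The following is a minimal sketch; size_of and extra_bytes are hypothetical helper names, and the demo uses throwaway files in which the extra byte is an arbitrary stand-in for the terminator.

```shell
# Sketch only: report how many more bytes the first file has than the second.
# size_of and extra_bytes are hypothetical helpers, not part of gsufsort.
size_of() { wc -c < "$1" | tr -d ' '; }
extra_bytes() { echo $(( $(size_of "$1") - $(size_of "$2") )); }

# Demo on throwaway files: "concat" is "orig" plus one trailing byte,
# standing in for all.1.str and dataset/input.txt.
tmp_b=$(mktemp -d)
printf 'ACGT\nTTGA\n' > "$tmp_b/orig"
{ cat "$tmp_b/orig"; printf '#'; } > "$tmp_b/concat"
extra_bytes "$tmp_b/concat" "$tmp_b/orig"    # prints 1
```

With the sizes listed above (5049740 and 5049739 bytes), extra_bytes all.1.str dataset/input.txt would likewise report 1.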

Quality score (QS) sequences

  • gsufsort can also output (option --qs) the Quality Scores (QS) permuted according to the BWT symbols.

  • This option is valid only for .fastq or .fq files.

For example, given the first DNA read in dataset/reads.fastq:

head -4 dataset/reads.fastq 
@HWI-ST928:79:C0GNWACXX:6:1101:1184:2104 1:N:0:TAAGGCGATATCCTCT
AGTTAGGACTATTCGAACATTATGTCACAAACGTGATGTCACAAAGCCGAATTGTCTGGAGTTAAGACTATACGAACATTATGAAACAAACGTGATGTCAC
+
@C@FDEDDHHGHHJIIGGHJJIJGIJIHGIIFGEFIIJJJGHIGGF@DHEHIIIIJIIGGIIIGE@CEEHHEE@B?AAECDDCDDCCCBB<=<?<?CCC>A

Then, run:

./gsufsort dataset/reads.fastq --docs 1 --bwt --qs
## gsufsort ##
## store_to_disk ##
dataset/reads.fastq.bwt	103 bytes (n = 103)
dataset/reads.fastq.bwt.qs	103 bytes (n = 103)

The QS permuted sequence is written to dataset/reads.fastq.bwt.qs:

tail dataset/reads.fastq.bwt.qs 
ACCHHD@ICGIIHCDJJBIHBI@DGGFGEC?JFAGHE>CIGCIJ?GFEH@BICDIDEJDEEI<EGDI?JII<FG@IH@EEJHCGJHID=GJ<IIIICAHGH
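
To illustrate what "permuted according to the BWT symbols" means, here is a small sketch on a toy read. It is not gsufsort's algorithm: it sorts all suffixes naively, and bwt_with_qs is a hypothetical helper. It also assumes that the read is terminated by $ and that the terminator is given the placeholder quality !; how gsufsort itself stores the QS of terminator symbols is not described on this page.

```shell
# Sketch only: naive BWT of one read, with the quality string permuted the
# same way. Not gsufsort's algorithm; bwt_with_qs is a hypothetical helper.
# Assumptions: '$' terminates the read and gets the placeholder quality '!'.
bwt_with_qs() {
    SEQ=$1 QS=$2 awk 'BEGIN {
        s = ENVIRON["SEQ"] "$"      # read plus terminator
        q = ENVIRON["QS"] "!"       # qualities plus placeholder for the terminator
        n = length(s)
        for (i = 1; i <= n; i++) {
            p = (i == 1) ? n : i - 1    # position just before suffix i (cyclic)
            # one line per suffix: suffix, preceding symbol, its quality
            printf "%s\t%s\t%s\n", substr(s, i), substr(s, p, 1), substr(q, p, 1)
        }
    }' | LC_ALL=C sort | awk -F '\t' '
        { bwt = bwt $2; qs = qs $3 }    # collect columns in sorted suffix order
        END { print bwt; print qs }'
}

bwt_with_qs GATTACA IHGFEDC
# prints:
# ACTGA$TA
# CDFIE!GH
```

Each quality character travels with its base: the first BWT symbol is the final A of the read, so the first permuted quality is C, the score of that A.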

gsufsort can also invert the QS permuted sequence together with the BWT (options --ibwt --qs):

./gsufsort --ibwt --qs dataset/reads.fastq.bwt

See the resulting file:

less +1 dataset/reads.fastq.iqs
@C@FDEDDHHGHHJIIGGHJJIJGIJIHGIIFGEFIIJJJGHIGGF@DHEHIIIIJIIGGIIIGE@CEEHHEE@B?AAECDDCDDCCCBB<=<?<?CCC>A

Compare the output with the original file:

head -4 dataset/reads.fastq | sed -n 4~4p - | diff -s dataset/reads.fastq.iqs -
Files dataset/reads.fastq.iqs and - are identical
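
The sed address 4~4p used above is a GNU extension. Where GNU sed is not available, the quality lines can be extracted with awk instead. This is a minimal sketch, assuming each FASTQ record occupies exactly four lines; fastq_quals is a hypothetical helper name.

```shell
# Sketch only: print every 4th line (the quality line of each FASTQ record).
# awk's NR % 4 == 0 is a portable equivalent of GNU sed's 4~4p.
# fastq_quals is a hypothetical helper, not part of gsufsort.
fastq_quals() { awk 'NR % 4 == 0' "$@"; }

# Demo on a throwaway two-record FASTQ file
tmp_d=$(mktemp -d)
printf '@r1\nACGT\n+\nIIHH\n@r2\nTTGA\n+\nFFGG\n' > "$tmp_d/reads.fastq"
fastq_quals "$tmp_d/reads.fastq"
# prints:
# IIHH
# FFGG
```

Called without a file argument, the helper reads standard input, so it can replace the sed step in the pipeline: head -4 dataset/reads.fastq | fastq_quals | diff -s dataset/reads.fastq.iqs -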