bamToFastq

Given aligned reads in bam format, this workflow will backextract to generate fastq files based on the readgroups. By default, the files are named based on the readgroup IDs, but a custom naming scheme based on other readgroup fields can be provided. This takes the form of mixing text with {FD} symbols, where FD is a readgroup FD (SM, PU, etc)

Overview

Dependencies

Usage

Cromwell

java -jar cromwell.jar run bamToFastq.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter	Value	Description
`bamFile`	File	A BAM file with one or more readgroups
`outputFilePrefix`	String	prefix to use to identify bam summary files

Optional workflow parameters:

Parameter	Value	Default	Description
`finalOutputDirectory`	String	""	Optional final output directory. Copy out fastq files using a task
`fileNaming`	String	"{ID}"	The naming scheme for the extracted FASTQs

Optional task parameters:

Parameter	Value	Default	Description
`examineBam.memory`	Int	24	Memory allocated for this job
`examineBam.timeout`	Int	12	Time in hours before task timeout
`examineBam.modules`	String	"samtools/1.16.1"	Required environment modules
`nameCheck.modules`	String	"samtools/1.16.1"	Required environment modules
`nameCheck.memory`	Int	24	Memory allocated for this job
`nameCheck.timeout`	Int	12	Time in hours before task timeout
`unsortBam.memory`	Int	24	Memory allocated for this job
`unsortBam.timeout`	Int	12	Time in hours before task timeout
`unsortBam.modules`	String	"samtools/1.16.1"	Required environment modules
`backExtractByRG.modules`	String	"picard/2.21.2"	Required environment modules
`backExtractByRG.memory`	Int	24	Memory allocated for this job
`backExtractByRG.timeout`	Int	96	Time in hours before task timeout
`haveCustomDirRename.modules`	String	""	Required environment modules
`haveCustomDirRename.memory`	Int	24	Memory allocated for this job
`haveCustomDirRename.timeout`	Int	12	Time in hours before task timeout
`haveCustomDirReview.modules`	String	"fastqc/0.11.9"	Required environment modules
`haveCustomDirReview.memory`	Int	24	Memory allocated for this job
`haveCustomDirReview.timeout`	Int	12	Time in hours before task timeout
`copyOutFastq.memory`	Int	24	Memory allocated for this job
`copyOutFastq.timeout`	Int	12	Time in hours before task timeout
`composeLog.memory`	Int	4	Memory allocated for this job
`composeLog.timeout`	Int	2	Time in hours before task timeout
`noCustomDirRename.modules`	String	""	Required environment modules
`noCustomDirRename.memory`	Int	24	Memory allocated for this job
`noCustomDirRename.timeout`	Int	12	Time in hours before task timeout
`noCustomDirReview.modules`	String	"fastqc/0.11.9"	Required environment modules
`noCustomDirReview.memory`	Int	24	Memory allocated for this job
`noCustomDirReview.timeout`	Int	12	Time in hours before task timeout
`summarize.memory`	Int	24	Memory allocated for this job
`summarize.timeout`	Int	12	Time in hours before task timeout

Outputs

Output	Type	Description	Labels
`bamFileFlagstat`	File	A TXT file containing flag information about the BAM file	vidarr_label: bamFileFlagstat
`fastqc`	Array[File]?	An optional array of FastQC report files	vidarr_label: fastqc
`fastqs`	Array[File]?	An optional array of fastq.gz files	vidarr_label: fastqs
`summary`	File	Summary file generated by processing fastQC and samstat report data	vidarr_label: summary
`copyLog`	File?	log file from copy out task, provisioned if a custom output dir requested	vidarr_label: copyLog

Commands

This section lists command(s) run by bamToFastq workflow

Running bamToFastq

Given aligned reads in bam format, this workflow will backextract to generate fastq files based on the readgroups. By default, the files are named based on the readgroup IDs, but a custom naming scheme based on other readgroup fields can be provided. This takes the form of mixing text with {FD} symbols, where FD is a readgroup FD (SM, PU, etc)

Un-sort bam (sort by read name)

     samtools sort --threads 8 -n ~{bamFileToSort} -O bam -o unsorted.bam

Examine Bam: collect stats with samtools

This runs samtools, collecting read group data and stats

     set -euo pipefail
     samtools view -H ~{bamFile} | grep -G "^@RG" > readgroups.tsv
     samtools stats --threads 8 ~{bamFile} > ~{prefix}.samstats.txt

nameCheck: ensure we do not have collision among output files

Use read groups from the input bam file and naming schema so that the names for the outputs are unique.

         
         samtools view -H ~{bamFile} | grep -G "^@RG" > readgroups.tsv
         python3<<CODE
 
         import difflib
         import re
         with open(r"readgroups.tsv",'r') as rg_data:
             rg = rg_data.read()
        
         fileNaming = "~{fileNaming}"
         
         def validate(valid="true"):
             isValid = open("isValid.txt", 'w')
             isValid.write(valid)
             isValid.close()
         
         validate()
 
             # Check No. 1:
             # Comparing RG tags from readgroups file and fileNaming
             # Exits if a tag is missing
 
         inputs = re.findall(r"[^{\}]+(?=})", fileNaming)
         tags = list(set(re.findall(r"[^\t:]+(?=:)", rg)))
 
         if all(i in tags for i in inputs) == False:  
             print("FAILED: Missing input tag")
             d = difflib.Differ()
             diff = d.compare(tags, inputs)
             errorMessage = '\n'.join(diff)
             validate("false")
             raise ValueError("Input tag does not exist:" + '\n' + errorMessage)
 
             # Check No. 2:
             # Predicting modified file name:
             # Exits if non-unique names are created
 
         rgArray = []
         dictArray = []
 
         for line in rg.splitlines():
             data = line[1:].split()
             rgArray.append(data)
 
         for row in rgArray:
             del row[0]
             for i in range(len(row)):
                 k = row[i].split(":")
                 row[i] = k
             dictArray.append({row[i][0]: row[i][1] for i in range(len(row))})
 
         fastqNames = []
         rgData = {}
 
         for j in range(len(dictArray)):
             idData = []
             for readNum in [1, 2]:
                 newName = fileNaming.format_map(dictArray[j])
                 predictedFileName = f"{newName}_{readNum}.fastq.gz"
                 if predictedFileName in fastqNames:
                     validate("false")
                     raise ValueError("File name results in non-unique names")
                 else:
                     fastqNames.append(predictedFileName)
                     idData.append(predictedFileName)
             rgData[dictArray[j]["ID"]]=idData
 
         print(rgData)
         
         CODE

Backextract fastq files from bam file per readgroup.

We use picard tools for this (SamToFastq)

             java -Xmx20g -jar $PICARD_ROOT/picard.jar SamToFastq \
             INPUT=~{bamFile} \
             RG_TAG="ID" \
             OUTPUT_DIR=. \
             OUTPUT_PER_RG=true \
             COMPRESS_OUTPUTS_PER_RG=true \
             NON_PF=true \
             RE_REVERSE=true \
             VALIDATION_STRINGENCY=LENIENT
 
             ls *.fastq.gz > outfilenames

Rename Fastqs according to selected naming schema

the default naming scheme is to use the RG IDs. If an alternate naming pattern was provided then files will be renamed

             python3<<CODE
 
             import os
             import ast
             import re
 
             rgData = ast.literal_eval("~{rgData}")
             f = "~{sep=' ' rawFastqs}"
             fastqs = f.split()
 
             for fastq in fastqs:
                path = os.getcwd()
                ## get the filename, without the extension
                fastqID = re.sub(".fastq.gz","",os.path.basename(fastq))
                # determines read number
                readNum = int(fastqID[-1])
                ### get rid of the last two characters _N
                fastqID = fastqID[:-2]
                ### get the new name from the rgData dictionary
                newName = rgData[fastqID][(readNum-1)]
                ### format the file name
                formattedFileName = f"{path}/{newName}"
                ### rename the files
                os.rename(fastq, formattedFileName)
             
             CODE
 
             ls *.fastq.gz > outfilenames

Generate fastqc reports for each fastq file

             ## run fastqc, this will generate an html report and zipped files, all in the working directory
             fastqc ~{fastq} --outdir .
             ### locate the data file in the zipped output
             datafile=`unzip -l ~{id}_fastqc.zip | grep fastqc_data | sed 's/.* //'`
             ### pull the text data out to a file
             unzip -p ~{id}_fastqc.zip $datafile > ~{id}_fastqc_data.txt
 
             ls *fastqc* > outfilenames
             ls *fastqc_data* > datafilenames

Summarize samstats

           ### samstats
           bamtotal=`cat ~{samstats} | grep ^SN | cut -f 2- | grep "raw total sequences:" | cut -f2`
           bampaired=`cat ~{samstats} | grep ^SN | cut -f 2- | grep "reads paired:" | cut -f2`
           lnavg=`cat ~{samstats}| grep ^SN | cut -f 2- | grep "average length:" | cut -f2`
           lnmax=`cat ~{samstats}| grep ^SN | cut -f 2- | grep "maximum length:" | cut -f2`
           insertsize=`cat ~{samstats} | grep ^SN | cut -f 2- | grep "insert size average:" | cut -f2`
 
           echo "INPUT bam file stats" > "~{id}.summary.txt"
           echo -e "bam total\t$bamtotal" >> "~{id}.summary.txt"
           echo -e "bam paired\t$bampaired" >> "~{id}.summary.txt"
           echo -e "bam mean readlength\t$lnavg" >> "~{id}.summary.txt"
           echo -e "bam max readlength\t$lnmax" >> "~{id}.summary.txt"
           echo -e "bam insertsize\t$insertsize" >> "~{id}.summary.txt"
 
           echo ""  >> "~{id}.summary.txt"
           echo "fastq ReadCounts"  >> "~{id}.summary.txt"
           totalreads=0
           files="~{sep=' ' fastqc}"
           for f in $files
           do
             id=`basename $f "_fastqc_data.txt"`
             count=`cat $f | grep "Total Sequences" | cut -f2`
             echo -e "$id\t$count"  >> "~{id}.summary.txt"
             totalreads=$(($totalreads + $count)) 
           done
           echo -e "fastq total\t$totalreads"  >> "~{id}.summary.txt"
 
           echo ""  >> "~{id}.summary.txt"
           echo "fastq Sequence Length Distributions"  >> "~{id}.summary.txt"
           for f in $files
           do
             id=`basename $f "_fastqc_data.txt"`	
             echo $id  >> "~{id}.summary.txt"
             cat $f | grep "Sequence Length Distribution" -A 999999999 | grep "Sequence Duplication Levels" -B 999999999 | grep -v ">>" >> "~{id}.summary.txt"
           done

Copy out files if output directory specified

           set -euxo pipefail
           if [[ -e ~{outputDir} && -d ~{outputDir} ]]; then
             cp ~{Fastq} ~{outputDir}
             echo "File ~{basename(Fastq)} copied to ~{outputDir}"
           else
             echo "Final Output Directory was not configured, ~{basename(Fastq)} was not provisioned properly"
           fi

Compose log file from messages produced by copy out shard

           python3 <<CODE
           import json
           import re
           m_lines = re.split(",", "~{sep=',' messages}")
           with open("copy_out.log", "w") as log:
               for line in m_lines:
                   log.write(line + "\n")
           log.close()
           CODE

Support

For support, please file an issue on the Github project or send an email to [email protected] .

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
tests		tests
CHANGELOG.md		CHANGELOG.md
Jenkinsfile		Jenkinsfile
README.md		README.md
bamToFastq.dot		bamToFastq.dot
bamToFastq.png		bamToFastq.png
bamToFastq.wdl		bamToFastq.wdl
commands.txt		commands.txt
vidarrbuild.json		vidarrbuild.json
vidarrtest-regression.json.in		vidarrtest-regression.json.in

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bamToFastq

Overview

Dependencies

Usage

Cromwell

Inputs

Required workflow parameters:

Optional workflow parameters:

Optional task parameters:

Outputs

Commands

Un-sort bam (sort by read name)

Examine Bam: collect stats with samtools

nameCheck: ensure we do not have collision among output files

Backextract fastq files from bam file per readgroup.

Rename Fastqs according to selected naming schema

Generate fastqc reports for each fastq file

Summarize samstats

Copy out files if output directory specified

Compose log file from messages produced by copy out shard

Support

About

Releases 2

Packages

Contributors 4

Languages

oicr-gsi/bamToFastq

Folders and files

Latest commit

History

Repository files navigation

bamToFastq

Overview

Dependencies

Usage

Cromwell

Inputs

Required workflow parameters:

Optional workflow parameters:

Optional task parameters:

Outputs

Commands

Un-sort bam (sort by read name)

Examine Bam: collect stats with samtools

nameCheck: ensure we do not have collision among output files

Backextract fastq files from bam file per readgroup.

Rename Fastqs according to selected naming schema

Generate fastqc reports for each fastq file

Summarize samstats

Copy out files if output directory specified

Compose log file from messages produced by copy out shard

Support

About

Topics

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 4

Languages

Packages