From e10fc577e1d01730b8344aec0045ad078bebd925 Mon Sep 17 00:00:00 2001 From: Jason Gilliland Date: Thu, 31 Mar 2016 12:26:42 -0400 Subject: [PATCH 01/64] Fix a few typos in man page --- bwa.1 | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bwa.1 b/bwa.1 index ee4e4f35..206e32a9 100644 --- a/bwa.1 +++ b/bwa.1 @@ -302,7 +302,7 @@ Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments. .TP .B -C -Append append FASTA/Q comment to SAM output. This option can be used to +Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output. @@ -316,7 +316,7 @@ supplementary alignments. Mark shorter split hits as secondary (for Picard compatibility). .TP .BI -v \ INT -Control the verbose level of the output. This option has not been fully +Control the verbosity level of the output. This option has not been fully supported throughout BWA. Ideally, a value 0 for disabling all the output to stderr; 1 for outputting errors only; 2 for warnings and errors; 3 for all normal messages; 4 or higher for debugging. When this option takes value From d6636daead897301eaa7bbb5e8560a9bda5a25d7 Mon Sep 17 00:00:00 2001 From: Jason Gilliland Date: Thu, 31 Mar 2016 12:39:24 -0400 Subject: [PATCH 02/64] Include in man page info on read types Previously this info had only been present in the help text emitted from `bwa mem` without arguments, but I think it helpful to include in man page. --- bwa.1 | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/bwa.1 b/bwa.1 index 206e32a9..be63a5ab 100644 --- a/bwa.1 +++ b/bwa.1 @@ -107,6 +107,8 @@ appropriate algorithm will be chosen automatically. .IR clipPen ] .RB [ -U .IR unpairPen ] +.RB [ -x +.IR readType ] .RB [ -R .IR RGline ] .RB [ -H @@ -256,6 +258,18 @@ Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as and scores a paired as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether we should force pairing. A larger value leads to more aggressive read pair. [17] +.TP +.BI -x \ STR +Read type. Changes multiple parameters unless overriden [null] + pacbio: +.B -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 +(PacBio reads to ref) + ont2d: +.B -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 +(Oxford Nanopore 2D-reads to ref) + intractg: +.B -B9 -O16 -L5 +(intra-species contigs to ref) .TP .B INPUT/OUTPUT OPTIONS: From 00e59ddbaef5644190ac5f35829a28a0ddc49483 Mon Sep 17 00:00:00 2001 From: "Rintze M. Zelle" Date: Fri, 10 Jun 2016 11:15:53 -0400 Subject: [PATCH 03/64] Update README.md Correct header formatting, fix some typos in headers --- README.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index 9ac49bbe..2d186965 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ [![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) [![Build Status](https://drone.io/github.com/lh3/bwa/status.png)](https://drone.io/github.com/lh3/bwa/latest) -##Getting started +## Getting started git clone https://github.com/lh3/bwa.git cd bwa; make @@ -8,7 +8,7 @@ ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz -##Introduction +## Introduction BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: @@ -24,7 +24,7 @@ reference genome (the **index** command). Alignment algorithms are invoked with different sub-commands: **aln/samse/sampe** for BWA-backtrack, **bwasw** for BWA-SW and **mem** for the BWA-MEM algorithm. -##Availability +## Availability BWA is released under [GPLv3][1]. The latest source code is [freely available at github][2]. Released packages can [be downloaded][3] at @@ -37,7 +37,7 @@ In addition to BWA, this self-consistent package also comes with bwa-associated and 3rd-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs, adapter triming, duplicate marking, HLA typing and associated data files. -##Seeking helps +## Seeking help The detailed usage is described in the man page available together with the source code. You can use `man ./bwa.1` to view the man page in a terminal. The @@ -46,7 +46,7 @@ have questions about BWA, you may [sign up the mailing list][6] and then send the questions to [bio-bwa-help@sourceforge.net][7]. You may also ask questions in forums such as [BioStar][8] and [SEQanswers][9]. -##Citing BWA +## Citing BWA * Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. *Bioinformatics*, **25**, 1754-1760. [PMID: @@ -63,7 +63,7 @@ in forums such as [BioStar][8] and [SEQanswers][9]. Please note that the last reference is a preprint hosted at [arXiv.org][13]. I do not have plan to submit it to a peer-reviewed journal in the near future. -##Frequently asked questions (FAQs) +## Frequently asked questions (FAQs) 1. [What types of data does BWA work with?](#type) 2. [Why does a read appear multiple times in the output SAM?](#multihit) @@ -73,7 +73,7 @@ do not have plan to submit it to a peer-reviewed journal in the near future. 6. [Does BWA work with ALT contigs in the GRCh38 release?](#altctg) 7. [Can I just run BWA-MEM against GRCh38+ALT without post-processing?](#postalt) -####1. What types of data does BWA work with? +#### 1. What types of data does BWA work with? BWA works with a variety types of DNA sequence data, though the optimal algorithm and setting may vary. The following list gives the recommended @@ -108,7 +108,7 @@ errors given longer query sequences as the chance of missing all seeds is small. As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore reads with a sequencing error rate over 20%. -####2. Why does a read appear multiple times in the output SAM? +#### 2. Why does a read appear multiple times in the output SAM? BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene fusion or a long deletion, a read bridging the break point may have two hits, @@ -116,18 +116,18 @@ occupying two lines in the SAM output. With the default setting of BWA-MEM, one and only one line is primary and is soft clipped; other lines are tagged with 0x800 SAM flag (supplementary alignment) and are hard clipped. -####3. Does BWA work on reference sequences longer than 4GB in total? +#### 3. Does BWA work on reference sequences longer than 4GB in total? Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over 4GB. However, individual chromosome should not be longer than 2GB. -####4. Why can one read in a pair has high mapping quality but the other has zero? +#### 4. Why can one read in a pair have a high mapping quality but the other has zero? This is correct. Mapping quality is assigned for individual read, not for a read pair. It is possible that one read can be mapped unambiguously, but its mate falls in a tandem repeat and thus its accurate position cannot be determined. -####5. How can a BWA-backtrack alignment stands out of the end of a chromosome? +#### 5. How can a BWA-backtrack alignment stand out of the end of a chromosome? Internally BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this @@ -135,7 +135,7 @@ case, BWA-backtrack will flag the read as unmapped (0x4), but you will see position, CIGAR and all the tags. A similar issue may occur to BWA-SW alignment as well. BWA-MEM does not have this problem. -####6. Does BWA work with ALT contigs in the GRCh38 release? +#### 6. Does BWA work with ALT contigs in the GRCh38 release? Yes, since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT. BWA-backtrack and BWA-SW don't properly support ALT mapping as of now. Please @@ -143,7 +143,7 @@ see [README-alt.md][18] for details. Briefly, it is recommended to use [bwakit][17], the binary release of BWA, for generating the reference genome and for mapping. -####7. Can I just run BWA-MEM against GRCh38+ALT without post-processing? +#### 7. Can I just run BWA-MEM against GRCh38+ALT without post-processing? If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM without post-processing. The alignments produced this way are very close to From 8cc02badd9e77162ac646938de70521758e4e2da Mon Sep 17 00:00:00 2001 From: James Blachly Date: Tue, 9 Aug 2016 17:05:51 -0400 Subject: [PATCH 04/64] Forbid literal TAB control characters in @RG line --- bwa.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/bwa.c b/bwa.c index c78ebb67..6f1e9a85 100644 --- a/bwa.c +++ b/bwa.c @@ -414,10 +414,14 @@ char *bwa_set_rg(const char *s) if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] the read group line is not started with @RG\n", __func__); goto err_set_rg; } + if (strstr(s, "\t") != NULL) { + if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] the read group line contained literal characters -- replace with escaped tabs: \\t\n", __func__); + goto err_set_rg; + } rg_line = strdup(s); bwa_escape(rg_line); if ((p = strstr(rg_line, "\tID:")) == 0) { - if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] no ID at the read group line\n", __func__); + if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] no ID within the read group line\n", __func__); goto err_set_rg; } p += 4; From 9dfb9708dba9e100db1b30c7c9caae1fb898e84c Mon Sep 17 00:00:00 2001 From: Jacob Pfeil Date: Mon, 7 Nov 2016 16:52:54 -0800 Subject: [PATCH 05/64] Add -M option to bwakit --- bwakit/run-bwamem | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bwakit/run-bwamem b/bwakit/run-bwamem index 462bafe6..1dc61aa1 100755 --- a/bwakit/run-bwamem +++ b/bwakit/run-bwamem @@ -5,7 +5,7 @@ use warnings; use Getopt::Std; my %opts = (t=>1); -getopts("PSadskHo:R:x:t:", \%opts); +getopts("MPSadskHo:R:x:t:", \%opts); die(' Usage: run-bwamem [options] [file2] @@ -24,6 +24,7 @@ Options: -o STR prefix for output files [inferred from -S for BAM input, don\'t shuffle -s sort the output alignment (via samtools; requring more RAM) -k keep temporary files generated by typeHLA + -M mark shorter split hits as secondary Examples: @@ -143,7 +144,7 @@ if ($is_bam) { $cmd = "cat $ARGV[1] \\\n"; } -my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : ""); +my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : "") . (defined($opts{M})? "-M " : ""); $bwa_opts .= join(" ", @RG_lines) . " -C " if @RG_lines > 0; $cmd .= " | $root/trimadap 2> $prefix.log.trim \\\n" if defined($opts{a}); From af87825525e3eb203e2dd8a993c02cbfc8c777ad Mon Sep 17 00:00:00 2001 From: John Marshall Date: Tue, 25 Apr 2017 13:51:52 +0100 Subject: [PATCH 06/64] Replace u_int32_t by the more common uint32_t The u_int32_t type comes from , which is often included as a byproduct of other headers on glibc platforms, but not on FreeBSD or Alpine Linux. Use 's uint32_t instead, which is used elsewhere in bwa source code. --- bwt_lite.c | 2 +- bwtgap.c | 2 +- bwtgap.h | 6 +++--- bwtindex.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/bwt_lite.c b/bwt_lite.c index f7946f54..eb16da64 100644 --- a/bwt_lite.c +++ b/bwt_lite.c @@ -48,7 +48,7 @@ bwtl_t *bwtl_seq2bwtl(int len, const uint8_t *seq) } { // generate cnt_table for (i = 0; i != 256; ++i) { - u_int32_t j, x = 0; + uint32_t j, x = 0; for (j = 0; j != 4; ++j) x |= (((i&3) == j) + ((i>>2&3) == j) + ((i>>4&3) == j) + (i>>6 == j)) << (j<<3); b->cnt_table[i] = x; diff --git a/bwtgap.c b/bwtgap.c index 08bc1f48..7dde443d 100644 --- a/bwtgap.c +++ b/bwtgap.c @@ -58,7 +58,7 @@ static inline void gap_push(gap_stack_t *stack, int i, bwtint_t k, bwtint_t l, i q->stack = (gap_entry_t*)realloc(q->stack, sizeof(gap_entry_t) * q->m_entries); } p = q->stack + q->n_entries; - p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l; + p->info = (uint32_t)score<<21 | i; p->k = k; p->l = l; p->n_mm = n_mm; p->n_gapo = n_gapo; p->n_gape = n_gape; p->n_ins = n_ins; p->n_del = n_del; p->state = state; diff --git a/bwtgap.h b/bwtgap.h index 7dd61659..9085b626 100644 --- a/bwtgap.h +++ b/bwtgap.h @@ -5,9 +5,9 @@ #include "bwtaln.h" typedef struct { // recursion stack - u_int32_t info; // score<<21 | i - u_int32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6; - u_int32_t n_ins:16, n_del:16; + uint32_t info; // score<<21 | i + uint32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6; + uint32_t n_ins:16, n_del:16; int last_diff_pos; bwtint_t k, l; // (k,l) is the SA region of [i,n-1] } gap_entry_t; diff --git a/bwtindex.c b/bwtindex.c index fb322313..571d087d 100644 --- a/bwtindex.c +++ b/bwtindex.c @@ -119,7 +119,7 @@ bwt_t *bwt_pac2bwt(const char *fn_pac, int use_is) } rope_destroy(r); } - bwt->bwt = (u_int32_t*)calloc(bwt->bwt_size, 4); + bwt->bwt = (uint32_t*)calloc(bwt->bwt_size, 4); for (i = 0; i < bwt->seq_len; ++i) bwt->bwt[i>>4] |= buf[i] << ((15 - (i&15)) << 1); free(buf); From beff2e27f82c818335fba209a63149eb477e6ea3 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Tue, 25 Apr 2017 13:54:31 +0100 Subject: [PATCH 07/64] Add missing include Fixes compilation on non-glibc platforms, e.g., FreeBSD and Alpine Linux. --- kthread.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kthread.c b/kthread.c index 1b13494f..780de19c 100644 --- a/kthread.c +++ b/kthread.c @@ -1,4 +1,5 @@ #include +#include #include #include From 6442d009cd2ec8c20f13b3657be1dd47ff1a4e03 Mon Sep 17 00:00:00 2001 From: Brad Langhorst Date: Thu, 27 Apr 2017 08:40:48 -0400 Subject: [PATCH 08/64] updates grch38 URL simple URL update for grch38 fixes #111, #115 grch37 link is still valid --- bwakit/run-gen-ref | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bwakit/run-gen-ref b/bwakit/run-gen-ref index 3ed63b2b..58fab68a 100755 --- a/bwakit/run-gen-ref +++ b/bwakit/run-gen-ref @@ -2,7 +2,7 @@ root=`dirname $0` -url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz" +url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz" url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz" if [ $# -eq 0 ]; then From 1972a978133110e36af05a6bc2e27881f51443c4 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Wed, 10 May 2017 19:47:12 +0100 Subject: [PATCH 09/64] Only realloc(pac,...) if it needs to be made larger Similarly to the realloc(pac,...) within add1(), only bother to call realloc() if appending the reverse complemented sequence requires more space than is currently in the pac/m_pac buffer. Avoids realloc(pac,0) (and a "Failed to allocate 0 bytes at bntseq.c" message from wrap_realloc()) in the corner case of an empty reference FASTA file. Fixes #54. --- bntseq.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bntseq.c b/bntseq.c index 465e3832..65f7e93a 100644 --- a/bntseq.c +++ b/bntseq.c @@ -299,9 +299,9 @@ int64_t bns_fasta2bntseq(gzFile fp_fa, const char *prefix, int for_only) // read sequences while (kseq_read(seq) >= 0) pac = add1(seq, bns, pac, &m_pac, &m_seqs, &m_holes, &q); if (!for_only) { // add the reverse complemented sequence - m_pac = (bns->l_pac * 2 + 3) / 4 * 4; - pac = realloc(pac, m_pac/4); - memset(pac + (bns->l_pac+3)/4, 0, (m_pac - (bns->l_pac+3)/4*4) / 4); + int64_t ll_pac = (bns->l_pac * 2 + 3) / 4 * 4; + if (ll_pac > m_pac) pac = realloc(pac, ll_pac/4); + memset(pac + (bns->l_pac+3)/4, 0, (ll_pac - (bns->l_pac+3)/4*4) / 4); for (l = bns->l_pac - 1; l >= 0; --l, ++bns->l_pac) _set_pac(pac, bns->l_pac, 3-_get_pac(pac, l)); } From 6a5caf764f4123b49fff8504291c72c372705514 Mon Sep 17 00:00:00 2001 From: Gwen Ives Date: Tue, 9 Dec 2014 10:24:46 +0100 Subject: [PATCH 10/64] Fixed BWTIncFree() memory leaks [Based on PR #37 with the additions below.] Don't free non-malloced items in BWTFree(). If BWTCreate() is ever called with a non-NULL decodeTable, BWTFree() will need to not free decodeTable -- see FIXME comment. Close packedFile in BWTIncConstructFromPacked() in the normal case. Ignore it in error cases, as they immediately call exit() anyway. --- bwt_gen.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/bwt_gen.c b/bwt_gen.c index 76f28c99..5b563d06 100644 --- a/bwt_gen.c +++ b/bwt_gen.c @@ -338,6 +338,7 @@ BWT *BWTCreate(const bgint_t textLength, unsigned int *decodeTable) bwt->decodeTable = (unsigned*)calloc(DNA_OCC_CNT_TABLE_SIZE_IN_WORD, sizeof(unsigned int)); GenerateDNAOccCountTable(bwt->decodeTable); } else { + // FIXME Prevent BWTFree() from freeing decodeTable in this case bwt->decodeTable = decodeTable; } @@ -1538,6 +1539,8 @@ BWTInc *BWTIncConstructFromPacked(const char *inputFileName, bgint_t initialMaxB (long)bwtInc->numberOfIterationDone, (long)processedTextLength); } } + + fclose(packedFile); return bwtInc; } @@ -1545,8 +1548,6 @@ void BWTFree(BWT *bwt) { if (bwt == 0) return; free(bwt->cumulativeFreq); - free(bwt->bwtCode); - free(bwt->occValue); free(bwt->occValueMajor); free(bwt->decodeTable); free(bwt); @@ -1555,8 +1556,10 @@ void BWTFree(BWT *bwt) void BWTIncFree(BWTInc *bwtInc) { if (bwtInc == 0) return; - free(bwtInc->bwt); + BWTFree(bwtInc->bwt); free(bwtInc->workingMemory); + free(bwtInc->cumulativeCountInCurrentBuild); + free(bwtInc->packedShift); free(bwtInc); } From 7a77cc478565d152e46fb8e9b4256891555cc058 Mon Sep 17 00:00:00 2001 From: Ram Yalamanchili Date: Thu, 18 May 2017 15:23:05 -0700 Subject: [PATCH 11/64] Fix output file parameter for samtools sort samtools sort v0.7.15+ requires -o flag to specify the output file --- bwakit/run-bwamem | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bwakit/run-bwamem b/bwakit/run-bwamem index 462bafe6..93e88bd7 100755 --- a/bwakit/run-bwamem +++ b/bwakit/run-bwamem @@ -163,7 +163,7 @@ if (-f "$ARGV[0].alt" && !defined($opts{P})) { } my $t_sort = $opts{t} < 4? $opts{t} : 4; -$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n"; +$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - -o $prefix.aln.bam;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n"; if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) { $cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n"; From 298284d754aa946b24a9d40d12cd911c57e8eacd Mon Sep 17 00:00:00 2001 From: Heng Li Date: Thu, 8 Jun 2017 16:06:13 -0400 Subject: [PATCH 12/64] r1144: output debugging message to stderr --- bwt_gen.c | 2 +- main.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/bwt_gen.c b/bwt_gen.c index 76f28c99..f06c1a3e 100644 --- a/bwt_gen.c +++ b/bwt_gen.c @@ -1602,7 +1602,7 @@ void bwt_bwtgen2(const char *fn_pac, const char *fn_bwt, int block_size) { BWTInc *bwtInc; bwtInc = BWTIncConstructFromPacked(fn_pac, block_size, block_size); - printf("[bwt_gen] Finished constructing BWT in %u iterations.\n", bwtInc->numberOfIterationDone); + fprintf(stderr, "[bwt_gen] Finished constructing BWT in %u iterations.\n", bwtInc->numberOfIterationDone); BWTSaveBwtCodeAndOcc(bwtInc->bwt, fn_bwt, 0); BWTIncFree(bwtInc); } diff --git a/main.c b/main.c index b652cf08..a0f25fa4 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.15-r1142-dirty" +#define PACKAGE_VERSION "0.7.15-r1144-dirty" #endif int bwa_fa2pac(int argc, char *argv[]); From ab3a92bc736ae96518895275df6b656724ee6b88 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Mon, 26 Jun 2017 10:45:13 +0100 Subject: [PATCH 13/64] Prevent Clang warnings on abs() and fabs() calls In the bwa.c and bwase.c calls, rlen is an int64_t returned from bns_get_seq() and is the number of reference bases covered by the alignment; l_query/len is an int and the query length of the alignment; and the result is an int given to an int parameter of ksw_global[2](). As even the result is int and as rlen is effectively bounded by the maximum length of a reference sequence, we maintain the status quo in this code and simply cast rlen to int to silence Clang's "use llabs()" (llabs() would not be a great answer given an int64_t anyway). The bwtsw2_pair.c call needs to remain fabs() so both divisions are done in floating point; cast to double to prevent Clang suggesting changing the call to integer abs(). --- bwa.c | 4 ++-- bwase.c | 4 +++- bwtsw2_pair.c | 2 +- 3 files changed, 6 insertions(+), 4 deletions(-) diff --git a/bwa.c b/bwa.c index c78ebb67..1ca5634e 100644 --- a/bwa.c +++ b/bwa.c @@ -144,9 +144,9 @@ uint32_t *bwa_gen_cigar2(const int8_t mat[25], int o_del, int e_del, int o_ins, max_del = (int)((double)(((l_query+1)>>1) * mat[0] - o_del) / e_del + 1.); max_gap = max_ins > max_del? max_ins : max_del; max_gap = max_gap > 1? max_gap : 1; - w = (max_gap + abs(rlen - l_query) + 1) >> 1; + w = (max_gap + abs((int)rlen - l_query) + 1) >> 1; w = w < w_? w : w_; - min_w = abs(rlen - l_query) + 3; + min_w = abs((int)rlen - l_query) + 3; w = w > min_w? w : min_w; // NW alignment if (bwa_verbose >= 4) { diff --git a/bwase.c b/bwase.c index 77c50db9..37e4274b 100644 --- a/bwase.c +++ b/bwase.c @@ -173,13 +173,15 @@ bwa_cigar_t *bwa_refine_gapped_core(bwtint_t l_pac, const ubyte_t *pacseq, int l ubyte_t *rseq; int64_t k, rb, re, rlen; int8_t mat[25]; + int w; bwa_fill_scmat(1, 3, mat); rb = *_rb; re = rb + len + ref_shift; assert(re <= l_pac); rseq = bns_get_seq(l_pac, pacseq, rb, re, &rlen); assert(re - rb == rlen); - ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > abs(rlen - len) * 1.5? SW_BW : abs(rlen - len) * 1.5, n_cigar, &cigar32); + w = abs((int)rlen - len) * 1.5; + ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > w? SW_BW : w, n_cigar, &cigar32); assert(*n_cigar > 0); if ((cigar32[*n_cigar - 1]&0xf) == 1) cigar32[*n_cigar - 1] = (cigar32[*n_cigar - 1]>>4<<4) | 3; // change endding ins to soft clipping if ((cigar32[0]&0xf) == 1) cigar32[0] = (cigar32[0]>>4<<4) | 3; // change beginning ins to soft clipping diff --git a/bwtsw2_pair.c b/bwtsw2_pair.c index b945128b..f48dcb6d 100644 --- a/bwtsw2_pair.c +++ b/bwtsw2_pair.c @@ -236,7 +236,7 @@ void bsw2_pair(const bsw2opt_t *opt, int64_t l_pac, const uint8_t *pac, int n, b double diff; G[0] = hits[i]->hits[0].G + a[1].G; G[1] = hits[i+1]->hits[0].G + a[0].G; - diff = fabs(G[0] - G[1]) / (opt->a + opt->b) / ((hits[i]->hits[0].len + a[1].len + hits[i+1]->hits[0].len + a[0].len) / 2.); + diff = fabs((double)(G[0] - G[1])) / (opt->a + opt->b) / ((hits[i]->hits[0].len + a[1].len + hits[i+1]->hits[0].len + a[0].len) / 2.); if (diff > 0.05) a[G[0] > G[1]? 0 : 1].G = 0; } if (a[0].G == 0 || a[1].G == 0) { // one proper pair only From d28da05f36671071440c909726a4cf43a5837bba Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jos=C3=A9=20Manuel=20Abu=C3=ADn=20Mosquera?= Date: Tue, 24 Mar 2015 10:35:46 +0100 Subject: [PATCH 14/64] Added option to write output to file --- fastmap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fastmap.c b/fastmap.c index 7b7145d9..fcdadd1a 100644 --- a/fastmap.c +++ b/fastmap.c @@ -130,7 +130,7 @@ int main_mem(int argc, char *argv[]) aux.opt = opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:W:x:G:h:y:K:X:H:")) >= 0) { + while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:f:W:x:G:h:y:K:X:H:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == '1') no_mt_io = 1; else if (c == 'x') mode = optarg; @@ -158,6 +158,7 @@ int main_mem(int argc, char *argv[]) else if (c == 's') opt->split_width = atoi(optarg), opt0.split_width = 1; else if (c == 'G') opt->max_chain_gap = atoi(optarg), opt0.max_chain_gap = 1; else if (c == 'N') opt->max_chain_extend = atoi(optarg), opt0.max_chain_extend = 1; + else if (c == 'f') xreopen(optarg, "wb", stdout); else if (c == 'W') opt->min_chain_weight = atoi(optarg), opt0.min_chain_weight = 1; else if (c == 'y') opt->max_mem_intv = atol(optarg), opt0.max_mem_intv = 1; else if (c == 'C') aux.copy_comment = 1; From bc58bf265d76734e650c051322485137de41a3e7 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Tue, 27 Jun 2017 17:05:44 +0100 Subject: [PATCH 15/64] Add "bwa mem -o FILE" synonym and documentation As -o is still free in the mem command, use that standard option to specify an output file (and keep -f to parallel bwase/bwape commands). Document it in the usage and man page. --- bwa.1 | 8 ++++++++ fastmap.c | 5 +++-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/bwa.1 b/bwa.1 index 11d3854d..a0b59bc5 100644 --- a/bwa.1 +++ b/bwa.1 @@ -277,6 +277,14 @@ If ARG starts with @, it is interpreted as a string and gets inserted into the output SAM header; otherwise, ARG is interpreted as a file with all lines starting with @ in the file inserted into the SAM header. [null] .TP +.BI -o \ FILE +Write the output SAM file to +.IR FILE . +For compatibility with other BWA commands, this option may also be given as +.B -f +.IR FILE . +[standard ouptut] +.TP .BI -T \ INT Don't output alignment with score lower than .IR INT . diff --git a/fastmap.c b/fastmap.c index fcdadd1a..c2ad90a4 100644 --- a/fastmap.c +++ b/fastmap.c @@ -130,7 +130,7 @@ int main_mem(int argc, char *argv[]) aux.opt = opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:f:W:x:G:h:y:K:X:H:")) >= 0) { + while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == '1') no_mt_io = 1; else if (c == 'x') mode = optarg; @@ -158,7 +158,7 @@ int main_mem(int argc, char *argv[]) else if (c == 's') opt->split_width = atoi(optarg), opt0.split_width = 1; else if (c == 'G') opt->max_chain_gap = atoi(optarg), opt0.max_chain_gap = 1; else if (c == 'N') opt->max_chain_extend = atoi(optarg), opt0.max_chain_extend = 1; - else if (c == 'f') xreopen(optarg, "wb", stdout); + else if (c == 'o' || c == 'f') xreopen(optarg, "wb", stdout); else if (c == 'W') opt->min_chain_weight = atoi(optarg), opt0.min_chain_weight = 1; else if (c == 'y') opt->max_mem_intv = atol(optarg), opt0.max_mem_intv = 1; else if (c == 'C') aux.copy_comment = 1; @@ -266,6 +266,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " -p smart pairing (ignoring in2.fq)\n"); fprintf(stderr, " -R STR read group header line such as '@RG\\tID:foo\\tSM:bar' [null]\n"); fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n"); + fprintf(stderr, " -o FILE sam file to output results to [stdout]\n"); fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore .alt file)\n"); fprintf(stderr, " -5 always take the leftmost alignment on a read as primary\n"); fprintf(stderr, "\n"); From 5ede7eb98172ae20967223065c7bd35f0f4804ce Mon Sep 17 00:00:00 2001 From: "Rintze M. Zelle" Date: Fri, 10 Jun 2016 11:15:53 -0400 Subject: [PATCH 16/64] Update README.md Correct header formatting, fix some typos in headers --- README.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index 9ac49bbe..2d186965 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ [![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) [![Build Status](https://drone.io/github.com/lh3/bwa/status.png)](https://drone.io/github.com/lh3/bwa/latest) -##Getting started +## Getting started git clone https://github.com/lh3/bwa.git cd bwa; make @@ -8,7 +8,7 @@ ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz -##Introduction +## Introduction BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: @@ -24,7 +24,7 @@ reference genome (the **index** command). Alignment algorithms are invoked with different sub-commands: **aln/samse/sampe** for BWA-backtrack, **bwasw** for BWA-SW and **mem** for the BWA-MEM algorithm. -##Availability +## Availability BWA is released under [GPLv3][1]. The latest source code is [freely available at github][2]. Released packages can [be downloaded][3] at @@ -37,7 +37,7 @@ In addition to BWA, this self-consistent package also comes with bwa-associated and 3rd-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs, adapter triming, duplicate marking, HLA typing and associated data files. -##Seeking helps +## Seeking help The detailed usage is described in the man page available together with the source code. You can use `man ./bwa.1` to view the man page in a terminal. The @@ -46,7 +46,7 @@ have questions about BWA, you may [sign up the mailing list][6] and then send the questions to [bio-bwa-help@sourceforge.net][7]. You may also ask questions in forums such as [BioStar][8] and [SEQanswers][9]. -##Citing BWA +## Citing BWA * Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. *Bioinformatics*, **25**, 1754-1760. [PMID: @@ -63,7 +63,7 @@ in forums such as [BioStar][8] and [SEQanswers][9]. Please note that the last reference is a preprint hosted at [arXiv.org][13]. I do not have plan to submit it to a peer-reviewed journal in the near future. -##Frequently asked questions (FAQs) +## Frequently asked questions (FAQs) 1. [What types of data does BWA work with?](#type) 2. [Why does a read appear multiple times in the output SAM?](#multihit) @@ -73,7 +73,7 @@ do not have plan to submit it to a peer-reviewed journal in the near future. 6. [Does BWA work with ALT contigs in the GRCh38 release?](#altctg) 7. [Can I just run BWA-MEM against GRCh38+ALT without post-processing?](#postalt) -####1. What types of data does BWA work with? +#### 1. What types of data does BWA work with? BWA works with a variety types of DNA sequence data, though the optimal algorithm and setting may vary. The following list gives the recommended @@ -108,7 +108,7 @@ errors given longer query sequences as the chance of missing all seeds is small. As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore reads with a sequencing error rate over 20%. -####2. Why does a read appear multiple times in the output SAM? +#### 2. Why does a read appear multiple times in the output SAM? BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene fusion or a long deletion, a read bridging the break point may have two hits, @@ -116,18 +116,18 @@ occupying two lines in the SAM output. With the default setting of BWA-MEM, one and only one line is primary and is soft clipped; other lines are tagged with 0x800 SAM flag (supplementary alignment) and are hard clipped. -####3. Does BWA work on reference sequences longer than 4GB in total? +#### 3. Does BWA work on reference sequences longer than 4GB in total? Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over 4GB. However, individual chromosome should not be longer than 2GB. -####4. Why can one read in a pair has high mapping quality but the other has zero? +#### 4. Why can one read in a pair have a high mapping quality but the other has zero? This is correct. Mapping quality is assigned for individual read, not for a read pair. It is possible that one read can be mapped unambiguously, but its mate falls in a tandem repeat and thus its accurate position cannot be determined. -####5. How can a BWA-backtrack alignment stands out of the end of a chromosome? +#### 5. How can a BWA-backtrack alignment stand out of the end of a chromosome? Internally BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this @@ -135,7 +135,7 @@ case, BWA-backtrack will flag the read as unmapped (0x4), but you will see position, CIGAR and all the tags. A similar issue may occur to BWA-SW alignment as well. BWA-MEM does not have this problem. -####6. Does BWA work with ALT contigs in the GRCh38 release? +#### 6. Does BWA work with ALT contigs in the GRCh38 release? Yes, since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT. BWA-backtrack and BWA-SW don't properly support ALT mapping as of now. Please @@ -143,7 +143,7 @@ see [README-alt.md][18] for details. Briefly, it is recommended to use [bwakit][17], the binary release of BWA, for generating the reference genome and for mapping. -####7. Can I just run BWA-MEM against GRCh38+ALT without post-processing? +#### 7. Can I just run BWA-MEM against GRCh38+ALT without post-processing? If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM without post-processing. The alignments produced this way are very close to From a61b1dc6108b68d412055306bfca0fb133be37f3 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Thu, 29 Jun 2017 15:36:46 +0100 Subject: [PATCH 17/64] Expand tot_seqs variable to long long This counter is only used in a log message. Fixes #131. --- bwape.c | 5 +++-- bwase.c | 5 +++-- bwtaln.c | 5 +++-- 3 files changed, 9 insertions(+), 6 deletions(-) diff --git a/bwape.c b/bwape.c index f2d3d8e0..fa4f7f7c 100644 --- a/bwape.c +++ b/bwape.c @@ -624,7 +624,8 @@ ubyte_t *bwa_paired_sw(const bntseq_t *bns, const ubyte_t *_pacseq, int n_seqs, void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const fn_fa[2], pe_opt_t *popt, const char *rg_line) { extern bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa); - int i, j, n_seqs, tot_seqs = 0; + int i, j, n_seqs; + long long tot_seqs = 0; bwa_seq_t *seqs[2]; bwa_seqio_t *ks[2]; clock_t t; @@ -711,7 +712,7 @@ void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const f for (j = 0; j < 2; ++j) bwa_free_read_seq(n_seqs, seqs[j]); - fprintf(stderr, "[bwa_sai2sam_pe_core] %d sequences have been processed.\n", tot_seqs); + fprintf(stderr, "[bwa_sai2sam_pe_core] %lld sequences have been processed.\n", tot_seqs); last_ii = ii; } diff --git a/bwase.c b/bwase.c index 37e4274b..18e86718 100644 --- a/bwase.c +++ b/bwase.c @@ -507,7 +507,8 @@ void bwase_initialize() void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_fa, int n_occ, const char *rg_line) { extern bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa); - int i, n_seqs, tot_seqs = 0, m_aln; + int i, n_seqs, m_aln; + long long tot_seqs = 0; bwt_aln1_t *aln = 0; bwa_seq_t *seqs; bwa_seqio_t *ks; @@ -565,7 +566,7 @@ void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_f fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); bwa_free_read_seq(n_seqs, seqs); - fprintf(stderr, "[bwa_aln_core] %d sequences have been processed.\n", tot_seqs); + fprintf(stderr, "[bwa_aln_core] %lld sequences have been processed.\n", tot_seqs); } // destroy diff --git a/bwtaln.c b/bwtaln.c index 20b01cd3..a348fdfa 100644 --- a/bwtaln.c +++ b/bwtaln.c @@ -158,7 +158,8 @@ bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa) void bwa_aln_core(const char *prefix, const char *fn_fa, const gap_opt_t *opt) { - int i, n_seqs, tot_seqs = 0; + int i, n_seqs; + long long tot_seqs = 0; bwa_seq_t *seqs; bwa_seqio_t *ks; clock_t t; @@ -218,7 +219,7 @@ void bwa_aln_core(const char *prefix, const char *fn_fa, const gap_opt_t *opt) fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC); bwa_free_read_seq(n_seqs, seqs); - fprintf(stderr, "[bwa_aln_core] %d sequences have been processed.\n", tot_seqs); + fprintf(stderr, "[bwa_aln_core] %lld sequences have been processed.\n", tot_seqs); } // destroy From 690649872b456f17e8f5cf2d52108f8e7eaf9d96 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Fri, 30 Jun 2017 12:46:56 +0100 Subject: [PATCH 18/64] Copy the whole kstring_t even if it contains NULs FASTQ files containing NULs are invalid but should not cause bwa to crash, as it does if the quality line contains a NUL. Fixes #122. --- bwa.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/bwa.c b/bwa.c index 1ca5634e..bc9d3732 100644 --- a/bwa.c +++ b/bwa.c @@ -30,13 +30,23 @@ static inline void trim_readno(kstring_t *s) s->l -= 2, s->s[s->l] = 0; } +static inline char *dupkstring(const kstring_t *str, int dupempty) +{ + char *s = (str->l > 0 || dupempty)? malloc(str->l + 1) : NULL; + if (!s) return NULL; + + memcpy(s, str->s, str->l); + s[str->l] = '\0'; + return s; +} + static inline void kseq2bseq1(const kseq_t *ks, bseq1_t *s) { // TODO: it would be better to allocate one chunk of memory, but probably it does not matter in practice - s->name = strdup(ks->name.s); - s->comment = ks->comment.l? strdup(ks->comment.s) : 0; - s->seq = strdup(ks->seq.s); - s->qual = ks->qual.l? strdup(ks->qual.s) : 0; - s->l_seq = strlen(s->seq); + s->name = dupkstring(&ks->name, 1); + s->comment = dupkstring(&ks->comment, 0); + s->seq = dupkstring(&ks->seq, 1); + s->qual = dupkstring(&ks->qual, 0); + s->l_seq = ks->seq.l; } bseq1_t *bseq_read(int chunk_size, int *n_, void *ks1_, void *ks2_) From bcb475cd033b3641d2de19ad6fb70da7bdd54f77 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Fri, 30 Jun 2017 13:36:16 +0100 Subject: [PATCH 19/64] In bsw2_stat(), fail to infer when no pairs are within boundaries Just as we set r.failed and return early if there are no high-quality read pairs at all, also do so if there are none within the calculated boundaries. This avoids returning avg=NAN std=NAN, which leads to malloc(INFINITY) and a crash; fixes #108. --- bwtsw2_pair.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/bwtsw2_pair.c b/bwtsw2_pair.c index b945128b..fe8489cb 100644 --- a/bwtsw2_pair.c +++ b/bwtsw2_pair.c @@ -70,6 +70,12 @@ bsw2pestat_t bsw2_stat(int n, bwtsw2_t **buf, kstring_t *msg, int max_ins) for (i = x = 0, r.avg = 0; i < k; ++i) if (isize[i] >= r.low && isize[i] <= r.high) r.avg += isize[i], ++x; + if (x == 0) { + ksprintf(msg, "[%s] fail to infer the insert size distribution: no pairs within boundaries.\n", __func__); + free(isize); + r.failed = 1; + return r; + } r.avg /= x; for (i = 0, r.std = 0; i < k; ++i) if (isize[i] >= r.low && isize[i] <= r.high) From 3b96dcee7fbc647eceab92f62b71739bb272b027 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Sun, 25 Jun 2017 08:02:42 -0700 Subject: [PATCH 20/64] Mem should set the mate cigar tag. --- bwamem.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/bwamem.c b/bwamem.c index 535606b6..69be7691 100644 --- a/bwamem.c +++ b/bwamem.c @@ -809,6 +809,19 @@ static inline int get_rlen(int n_cigar, const uint32_t *cigar) return l; } +static inline void add_cigar(const mem_opt_t *opt, mem_aln_t *p, kstring_t *str, int which) +{ + int i; + if (p->n_cigar) { // aligned + for (i = 0; i < p->n_cigar; ++i) { + int c = p->cigar[i]&0xf; + if (!(opt->flag&MEM_F_SOFTCLIP) && !p->is_alt && (c == 3 || c == 4)) + c = which? 4 : 3; // use hard clipping for supplementary alignments + kputw(p->cigar[i]>>4, str); kputc("MIDSH"[c], str); + } + } else kputc('*', str); // having a coordinate but unaligned (e.g. when copy_mate is true) +} + void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m_) { int i, l_name; @@ -835,14 +848,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq kputs(bns->anns[p->rid].name, str); kputc('\t', str); // RNAME kputl(p->pos + 1, str); kputc('\t', str); // POS kputw(p->mapq, str); kputc('\t', str); // MAPQ - if (p->n_cigar) { // aligned - for (i = 0; i < p->n_cigar; ++i) { - int c = p->cigar[i]&0xf; - if (!(opt->flag&MEM_F_SOFTCLIP) && !p->is_alt && (c == 3 || c == 4)) - c = which? 4 : 3; // use hard clipping for supplementary alignments - kputw(p->cigar[i]>>4, str); kputc("MIDSH"[c], str); - } - } else kputc('*', str); // having a coordinate but unaligned (e.g. when copy_mate is true) + add_cigar(opt, p, str, which); } else kputsn("*\t0\t0\t*", 7, str); // without coordinte kputc('\t', str); @@ -899,6 +905,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq kputsn("\tNM:i:", 6, str); kputw(p->NM, str); kputsn("\tMD:Z:", 6, str); kputs((char*)(p->cigar + p->n_cigar), str); } + if (m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); } if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); } if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); } if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); } From c56c2e4677d64cc301167b7345b88210e83762e0 Mon Sep 17 00:00:00 2001 From: Andreas Tille Date: Sun, 12 Jul 2015 14:59:23 +0200 Subject: [PATCH 21/64] Several trivial Debian patches MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [Andreas Tille] Check exact number of arguments of bwtupdate Fix spelling [Fabian Klötzl] change printing as size_t is unsigned --- bwtindex.c | 2 +- fastmap.c | 2 +- malloc_wrap.c | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/bwtindex.c b/bwtindex.c index 571d087d..e1322f8e 100644 --- a/bwtindex.c +++ b/bwtindex.c @@ -175,7 +175,7 @@ void bwt_bwtupdate_core(bwt_t *bwt) int bwa_bwtupdate(int argc, char *argv[]) // the "bwtupdate" command { bwt_t *bwt; - if (argc < 2) { + if (argc != 2) { fprintf(stderr, "Usage: bwa bwtupdate \n"); return 1; } diff --git a/fastmap.c b/fastmap.c index c2ad90a4..06199991 100644 --- a/fastmap.c +++ b/fastmap.c @@ -258,7 +258,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " -E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [%d,%d]\n", opt->e_del, opt->e_ins); fprintf(stderr, " -L INT[,INT] penalty for 5'- and 3'-end clipping [%d,%d]\n", opt->pen_clip5, opt->pen_clip3); fprintf(stderr, " -U INT penalty for an unpaired read pair [%d]\n\n", opt->pen_unpaired); - fprintf(stderr, " -x STR read type. Setting -x changes multiple parameters unless overriden [null]\n"); + fprintf(stderr, " -x STR read type. Setting -x changes multiple parameters unless overridden [null]\n"); fprintf(stderr, " pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)\n"); fprintf(stderr, " ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)\n"); fprintf(stderr, " intractg: -B9 -O16 -L5 (intra-species contigs to ref)\n"); diff --git a/malloc_wrap.c b/malloc_wrap.c index 100b8cb6..6165e043 100644 --- a/malloc_wrap.c +++ b/malloc_wrap.c @@ -13,7 +13,7 @@ void *wrap_calloc(size_t nmemb, size_t size, void *p = calloc(nmemb, size); if (NULL == p) { fprintf(stderr, - "[%s] Failed to allocate %zd bytes at %s line %u: %s\n", + "[%s] Failed to allocate %zu bytes at %s line %u: %s\n", func, nmemb * size, file, line, strerror(errno)); exit(EXIT_FAILURE); } @@ -25,7 +25,7 @@ void *wrap_malloc(size_t size, void *p = malloc(size); if (NULL == p) { fprintf(stderr, - "[%s] Failed to allocate %zd bytes at %s line %u: %s\n", + "[%s] Failed to allocate %zu bytes at %s line %u: %s\n", func, size, file, line, strerror(errno)); exit(EXIT_FAILURE); } @@ -37,7 +37,7 @@ void *wrap_realloc(void *ptr, size_t size, void *p = realloc(ptr, size); if (NULL == p) { fprintf(stderr, - "[%s] Failed to allocate %zd bytes at %s line %u: %s\n", + "[%s] Failed to allocate %zu bytes at %s line %u: %s\n", func, size, file, line, strerror(errno)); exit(EXIT_FAILURE); } @@ -49,7 +49,7 @@ char *wrap_strdup(const char *s, char *p = strdup(s); if (NULL == p) { fprintf(stderr, - "[%s] Failed to allocate %zd bytes at %s line %u: %s\n", + "[%s] Failed to allocate %zu bytes at %s line %u: %s\n", func, strlen(s), file, line, strerror(errno)); exit(EXIT_FAILURE); } From 89a5c0c5c8179b622887d1cc684f4307043c66c7 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Sun, 30 Jul 2017 18:47:03 -0400 Subject: [PATCH 22/64] moved BWTFree() into BWTIncFree() BWTFree() is not for build. Only BWTIncFree() is needed. --- bwt_gen.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/bwt_gen.c b/bwt_gen.c index acea99b6..3adcc220 100644 --- a/bwt_gen.c +++ b/bwt_gen.c @@ -1544,19 +1544,13 @@ BWTInc *BWTIncConstructFromPacked(const char *inputFileName, bgint_t initialMaxB return bwtInc; } -void BWTFree(BWT *bwt) -{ - if (bwt == 0) return; - free(bwt->cumulativeFreq); - free(bwt->occValueMajor); - free(bwt->decodeTable); - free(bwt); -} - void BWTIncFree(BWTInc *bwtInc) { if (bwtInc == 0) return; - BWTFree(bwtInc->bwt); + free(bwtInc->bwt->cumulativeFreq); + free(bwtInc->bwt->occValueMajor); + free(bwtInc->bwt->decodeTable); + free(bwtInc->bwt); free(bwtInc->workingMemory); free(bwtInc->cumulativeCountInCurrentBuild); free(bwtInc->packedShift); From 24a8d66481333bde7583ba336bc2f355075cdf3f Mon Sep 17 00:00:00 2001 From: Heng Li Date: Sun, 30 Jul 2017 18:49:06 -0400 Subject: [PATCH 23/64] removed drone.io --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index 2d186965..5f673a26 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,4 @@ [![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) -[![Build Status](https://drone.io/github.com/lh3/bwa/status.png)](https://drone.io/github.com/lh3/bwa/latest) ## Getting started git clone https://github.com/lh3/bwa.git From 47d9fb27d74423d4da7c54fcf029a558083bd19a Mon Sep 17 00:00:00 2001 From: Heng Li Date: Sun, 30 Jul 2017 22:35:38 -0400 Subject: [PATCH 24/64] Release 0.7.16-r1180 --- NEWS.md | 19 ++++++++++++++++++- bwa.1 | 29 ++++++++++++++++++++++++----- fastmap.c | 5 +++-- main.c | 2 +- 4 files changed, 46 insertions(+), 9 deletions(-) diff --git a/NEWS.md b/NEWS.md index 9a63bef9..02a1c0ce 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,7 +1,24 @@ +Release 0.7.16 (30 July 2017) +----------------------------- + +This release added a couple of minor features and incorporated multiple pull +requests, including: + + * Added option -5, which is useful to some Hi-C pipelines. + + * Fixed an error with samtools sorting (#129). Updated download link for + GRCh38 (#123). Fixed README MarkDown formatting (#70). Addressed multiple + issues via a collected pull request #139 by @jmarshall. Avoid malformatted + SAM header when -R is used with TAB (#84). Output mate CIGAR (#138). + +(0.7.16: 30 July 2017, r1180) + + + Release 0.7.15 (31 May 2016) ---------------------------- -Fixed a long existing bug which potentially leads underestimated insert size +Fixed a long existing bug which potentially leads to underestimated insert size upper bound. This bug should have little effect in practice. (0.7.15: 31 May 2016, r1140) diff --git a/bwa.1 b/bwa.1 index da158d84..94c2c58e 100644 --- a/bwa.1 +++ b/bwa.1 @@ -1,4 +1,4 @@ -.TH bwa 1 "31 May 2016" "bwa-0.7.15-r1140" "Bioinformatics tools" +.TH bwa 1 "30 July 2017" "bwa-0.7.16-r1180" "Bioinformatics tools" .SH NAME .PP bwa - Burrows-Wheeler Alignment Tool @@ -261,16 +261,20 @@ more aggressive read pair. [17] .TP .BI -x \ STR Read type. Changes multiple parameters unless overriden [null] - pacbio: +.RS +.TP 10 +.BR pacbio : .B -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref) - ont2d: +.TP +.BR ont2d : .B -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref) - intractg: +.TP +.BR intractg : .B -B9 -O16 -L5 (intra-species contigs to ref) - +.RE .TP .B INPUT/OUTPUT OPTIONS: .TP @@ -299,6 +303,21 @@ For compatibility with other BWA commands, this option may also be given as .IR FILE . [standard ouptut] .TP +.B -5 +For split alignment, mark the segment with the smallest coordinate as the +primary. This option may help some Hi-C pipelines. By default, BWA-MEM marks +highest scoring segment as primary. +.TP +.B -K \ INT +Process +.I INT +input bases in each batch regardless of the number of threads in use +.RI [10000000* nThreads ]. +By default, the batch size is proportional to the number of threads in use. +Because the inferred insert size distribution slightly depends on the batch +size, using different number of threads may produce different output. +Specifying this option helps reproducibility. +.TP .BI -T \ INT Don't output alignment with score lower than .IR INT . diff --git a/fastmap.c b/fastmap.c index 06199991..d89c4959 100644 --- a/fastmap.c +++ b/fastmap.c @@ -268,9 +268,10 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n"); fprintf(stderr, " -o FILE sam file to output results to [stdout]\n"); fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore .alt file)\n"); - fprintf(stderr, " -5 always take the leftmost alignment on a read as primary\n"); + fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordiate as primary\n"); + fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n"); fprintf(stderr, "\n"); - fprintf(stderr, " -v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose); + fprintf(stderr, " -v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose); fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T); fprintf(stderr, " -h INT[,INT] if there are 80%% of the max score, output all in XA [%d,%d]\n", opt->max_XA_hits, opt->max_XA_hits_alt); fprintf(stderr, " -a output all alignments for SE or unpaired PE\n"); diff --git a/main.c b/main.c index a0f25fa4..21bff9bb 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.15-r1144-dirty" +#define PACKAGE_VERSION "0.7.16-r1180" #endif int bwa_fa2pac(int argc, char *argv[]); From a9c688ac923ea63c14f4a9dbf74a4aedb4d65459 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 31 Jul 2017 07:55:35 -0400 Subject: [PATCH 25/64] r1181: fixed a segfault (#145) due to (#138) --- bwamem.c | 2 +- main.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/bwamem.c b/bwamem.c index 69be7691..d7c00fe7 100644 --- a/bwamem.c +++ b/bwamem.c @@ -905,7 +905,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq kputsn("\tNM:i:", 6, str); kputw(p->NM, str); kputsn("\tMD:Z:", 6, str); kputs((char*)(p->cigar + p->n_cigar), str); } - if (m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); } + if (m && m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); } if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); } if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); } if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); } diff --git a/main.c b/main.c index 21bff9bb..e2b7cf13 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.16-r1180" +#define PACKAGE_VERSION "0.7.16a-r1181" #endif int bwa_fa2pac(int argc, char *argv[]); From 340babdd671eeeb3c7bfbf2e4ad1e761ece94983 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 11 Sep 2017 09:58:41 -0400 Subject: [PATCH 26/64] r1185: don't lower supp mapq with -5 --- bwamem.c | 3 ++- main.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/bwamem.c b/bwamem.c index d7c00fe7..a4e90686 100644 --- a/bwamem.c +++ b/bwamem.c @@ -1026,7 +1026,8 @@ void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, if (p->secondary >= 0) q->sub = -1; // don't output sub-optimal score if (l && p->secondary < 0) // if supplementary q->flag |= (opt->flag&MEM_F_NO_MULTI)? 0x10000 : 0x800; - if (l && !p->is_alt && q->mapq > aa.a[0].mapq) q->mapq = aa.a[0].mapq; + if (!(opt->flag&MEM_F_PRIMARY5) && l && !p->is_alt && q->mapq > aa.a[0].mapq) + q->mapq = aa.a[0].mapq; // lower mapq for supplementary mappings, unless -5 is applied ++l; } if (aa.n == 0) { // no alignments good enough; then write an unaligned record diff --git a/main.c b/main.c index e2b7cf13..189e7237 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.16a-r1181" +#define PACKAGE_VERSION "0.7.16a-r1185-dirty" #endif int bwa_fa2pac(int argc, char *argv[]); From b58281621136a0ce2a66837ba509716c727b9387 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 26 Sep 2017 09:35:19 -0400 Subject: [PATCH 27/64] r1187: a typo in command line help --- bwamem.c | 4 ++-- bwamem.h | 1 + fastmap.c | 8 +++++--- main.c | 2 +- 4 files changed, 9 insertions(+), 6 deletions(-) diff --git a/bwamem.c b/bwamem.c index a4e90686..eeaf4f43 100644 --- a/bwamem.c +++ b/bwamem.c @@ -1026,8 +1026,8 @@ void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, if (p->secondary >= 0) q->sub = -1; // don't output sub-optimal score if (l && p->secondary < 0) // if supplementary q->flag |= (opt->flag&MEM_F_NO_MULTI)? 0x10000 : 0x800; - if (!(opt->flag&MEM_F_PRIMARY5) && l && !p->is_alt && q->mapq > aa.a[0].mapq) - q->mapq = aa.a[0].mapq; // lower mapq for supplementary mappings, unless -5 is applied + if (!(opt->flag & MEM_F_KEEP_SUPP_MAPQ) && l && !p->is_alt && q->mapq > aa.a[0].mapq) + q->mapq = aa.a[0].mapq; // lower mapq for supplementary mappings, unless -5 or -q is applied ++l; } if (aa.n == 0) { // no alignments good enough; then write an unaligned record diff --git a/bwamem.h b/bwamem.h index 4e2b1d8c..fb572b28 100644 --- a/bwamem.h +++ b/bwamem.h @@ -20,6 +20,7 @@ typedef struct __smem_i smem_i; #define MEM_F_SOFTCLIP 0x200 #define MEM_F_SMARTPE 0x400 #define MEM_F_PRIMARY5 0x800 +#define MEM_F_KEEP_SUPP_MAPQ 0x1000 typedef struct { int a, b; // match score and mismatch penalty diff --git a/fastmap.c b/fastmap.c index d89c4959..41c29cb0 100644 --- a/fastmap.c +++ b/fastmap.c @@ -130,7 +130,7 @@ int main_mem(int argc, char *argv[]) aux.opt = opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { + while ((c = getopt(argc, argv, "51qpaMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == '1') no_mt_io = 1; else if (c == 'x') mode = optarg; @@ -147,7 +147,8 @@ int main_mem(int argc, char *argv[]) else if (c == 'S') opt->flag |= MEM_F_NO_RESCUE; else if (c == 'Y') opt->flag |= MEM_F_SOFTCLIP; else if (c == 'V') opt->flag |= MEM_F_REF_HDR; - else if (c == '5') opt->flag |= MEM_F_PRIMARY5; + else if (c == '5') opt->flag |= MEM_F_PRIMARY5 | MEM_F_KEEP_SUPP_MAPQ; // always apply MEM_F_KEEP_SUPP_MAPQ with -5 + else if (c == 'q') opt->flag |= MEM_F_KEEP_SUPP_MAPQ; else if (c == 'c') opt->max_occ = atoi(optarg), opt0.max_occ = 1; else if (c == 'd') opt->zdrop = atoi(optarg), opt0.zdrop = 1; else if (c == 'v') bwa_verbose = atoi(optarg); @@ -268,7 +269,8 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n"); fprintf(stderr, " -o FILE sam file to output results to [stdout]\n"); fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore .alt file)\n"); - fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordiate as primary\n"); + fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordinate as primary\n"); + fprintf(stderr, " -q don't modify mapQ of supplementary alignments\n"); fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n"); fprintf(stderr, "\n"); fprintf(stderr, " -v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose); diff --git a/main.c b/main.c index 189e7237..063384e7 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.16a-r1185-dirty" +#define PACKAGE_VERSION "0.7.16a-r1187-dirty" #endif int bwa_fa2pac(int argc, char *argv[]); From 9f26bfcc7780753129b60717ecab0ebba6f04b7c Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 23 Oct 2017 13:10:17 -0400 Subject: [PATCH 28/64] Released bwa-0.7.17 (r1188) --- NEWS.md | 11 +++++++++++ bwa.1 | 9 +++++++-- main.c | 2 +- 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/NEWS.md b/NEWS.md index 02a1c0ce..151e024b 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,14 @@ +Release 0.7.17 (23 October 2017) +-------------------------------- + +This release adds option -q to preserve the mapping quality of split alignment +with a lower alignment score than the primary alignment. Option -5 +automatically applies -q as well. + +(0.7.17: 23 October 2017, r1188) + + + Release 0.7.16 (30 July 2017) ----------------------------- diff --git a/bwa.1 b/bwa.1 index 94c2c58e..d30ac331 100644 --- a/bwa.1 +++ b/bwa.1 @@ -1,4 +1,4 @@ -.TH bwa 1 "30 July 2017" "bwa-0.7.16-r1180" "Bioinformatics tools" +.TH bwa 1 "23 October 2017" "bwa-0.7.17-r1188" "Bioinformatics tools" .SH NAME .PP bwa - Burrows-Wheeler Alignment Tool @@ -303,9 +303,14 @@ For compatibility with other BWA commands, this option may also be given as .IR FILE . [standard ouptut] .TP +.B -q + Don't reduce the mapping quality of split alignment of lower alignment score. +.TP .B -5 For split alignment, mark the segment with the smallest coordinate as the -primary. This option may help some Hi-C pipelines. By default, BWA-MEM marks +primary. It automatically applies option +.B -q +as well. This option may help some Hi-C pipelines. By default, BWA-MEM marks highest scoring segment as primary. .TP .B -K \ INT diff --git a/main.c b/main.c index 063384e7..50ae755e 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.16a-r1187-dirty" +#define PACKAGE_VERSION "0.7.17-r1188" #endif int bwa_fa2pac(int argc, char *argv[]); From ab64326ae6634ab8561bc207b13938d47fb6238b Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 6 Nov 2017 19:40:42 -0500 Subject: [PATCH 29/64] claim bwa-mem long-read mode deprecated --- README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/README.md b/README.md index 5f673a26..dcb78735 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,12 @@ [![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) +[![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) +[![GitHub Downloads](https://img.shields.io/github/downloads/lh3/minimap2/total.svg?style=flat)](https://github.com/lh3/bwa/releases) + +**Note: [minimap2][minimap2] has replaced BWA-MEM for PacBio and Nanopore read +alignment.** It is ~50 times as fast, more accurate, produces better base-level +alignment and works with long RNA-seq reads and long-read assemblies. +[minimap2]: https://github.com/lh3/minimap2 + ## Getting started git clone https://github.com/lh3/bwa.git From 824e00e7ac2fd7031c29c6f9a4e6657862487cec Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 6 Nov 2017 19:41:53 -0500 Subject: [PATCH 30/64] formatting change --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index dcb78735..ffe25fca 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,7 @@ **Note: [minimap2][minimap2] has replaced BWA-MEM for PacBio and Nanopore read alignment.** It is ~50 times as fast, more accurate, produces better base-level alignment and works with long RNA-seq reads and long-read assemblies. + [minimap2]: https://github.com/lh3/minimap2 ## Getting started From 1a2b0c001af4919f2c6c313f3e6d607a1a27c048 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 6 Nov 2017 19:43:02 -0500 Subject: [PATCH 31/64] fixed a typo in link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ffe25fca..270ddb06 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ [![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) [![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) -[![GitHub Downloads](https://img.shields.io/github/downloads/lh3/minimap2/total.svg?style=flat)](https://github.com/lh3/bwa/releases) +[![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat)](https://github.com/lh3/bwa/releases) **Note: [minimap2][minimap2] has replaced BWA-MEM for PacBio and Nanopore read alignment.** It is ~50 times as fast, more accurate, produces better base-level From eef04271e619f18ae4dd2aaf1cec307b08f1c6cd Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 6 Nov 2017 19:49:19 -0500 Subject: [PATCH 32/64] wording changes --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 270ddb06..34b49030 100644 --- a/README.md +++ b/README.md @@ -3,8 +3,8 @@ [![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat)](https://github.com/lh3/bwa/releases) **Note: [minimap2][minimap2] has replaced BWA-MEM for PacBio and Nanopore read -alignment.** It is ~50 times as fast, more accurate, produces better base-level -alignment and works with long RNA-seq reads and long-read assemblies. +alignment.** It retains all major BWA-MEM features, but is ~50 times as fast, +more versatile, more accurate and produces better base-level alignment. [minimap2]: https://github.com/lh3/minimap2 From fde1fd7ce3f52e74907169b927922a12a9feb000 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 6 Nov 2017 19:51:01 -0500 Subject: [PATCH 33/64] emphasize pacbio and nanopore --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 34b49030..23762b88 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ [![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) [![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat)](https://github.com/lh3/bwa/releases) -**Note: [minimap2][minimap2] has replaced BWA-MEM for PacBio and Nanopore read +**Note: [minimap2][minimap2] has replaced BWA-MEM for __PacBio and Nanopore__ read alignment.** It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment. From eb7dbc14295962bbef4542969c3f594c1523ca4c Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 2 Apr 2018 10:43:41 -0400 Subject: [PATCH 34/64] optionally write XB to include alignment score request from 4DN-DCIC --- bwamem.c | 5 ++++- bwamem.h | 1 + bwamem_extra.c | 4 ++++ fastmap.c | 3 ++- 4 files changed, 11 insertions(+), 2 deletions(-) diff --git a/bwamem.c b/bwamem.c index d7c00fe7..da102b57 100644 --- a/bwamem.c +++ b/bwamem.c @@ -932,7 +932,10 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq if (p->alt_sc > 0) ksprintf(str, "\tpa:f:%.3f", (double)p->score / p->alt_sc); } - if (p->XA) { kputsn("\tXA:Z:", 6, str); kputs(p->XA, str); } + if (p->XA) { + kputsn((opt->flag&MEM_F_XB)? "\tXB:Z:" : "\tXA:Z:", 6, str); + kputs(p->XA, str); + } if (s->comment) { kputc('\t', str); kputs(s->comment, str); } if ((opt->flag&MEM_F_REF_HDR) && p->rid >= 0 && bns->anns[p->rid].anno != 0 && bns->anns[p->rid].anno[0] != 0) { int tmp; diff --git a/bwamem.h b/bwamem.h index 4e2b1d8c..e40ebcce 100644 --- a/bwamem.h +++ b/bwamem.h @@ -20,6 +20,7 @@ typedef struct __smem_i smem_i; #define MEM_F_SOFTCLIP 0x200 #define MEM_F_SMARTPE 0x400 #define MEM_F_PRIMARY5 0x800 +#define MEM_F_XB 0x1000 typedef struct { int a, b; // match score and mismatch penalty diff --git a/bwamem_extra.c b/bwamem_extra.c index d2e37f43..fb42abdb 100644 --- a/bwamem_extra.c +++ b/bwamem_extra.c @@ -126,6 +126,10 @@ char **mem_gen_alt(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac kputc("MIDSHN"[t.cigar[k]&0xf], &str); } kputc(',', &str); kputw(t.NM, &str); + if (opt->flag & MEM_F_XB) { + kputc(',', &str); + kputw(t.score, &str); + } kputc(';', &str); free(t.cigar); kputsn(str.s, str.l, &aln[r]); diff --git a/fastmap.c b/fastmap.c index d89c4959..9f37d4e0 100644 --- a/fastmap.c +++ b/fastmap.c @@ -130,7 +130,7 @@ int main_mem(int argc, char *argv[]) aux.opt = opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { + while ((c = getopt(argc, argv, "51paMCSPVYjuk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == '1') no_mt_io = 1; else if (c == 'x') mode = optarg; @@ -148,6 +148,7 @@ int main_mem(int argc, char *argv[]) else if (c == 'Y') opt->flag |= MEM_F_SOFTCLIP; else if (c == 'V') opt->flag |= MEM_F_REF_HDR; else if (c == '5') opt->flag |= MEM_F_PRIMARY5; + else if (c == 'u') opt->flag |= MEM_F_XB; else if (c == 'c') opt->max_occ = atoi(optarg), opt0.max_occ = 1; else if (c == 'd') opt->zdrop = atoi(optarg), opt0.zdrop = 1; else if (c == 'v') bwa_verbose = atoi(optarg); From 27dd1da7020151be9358fce67cbe2c10801c348d Mon Sep 17 00:00:00 2001 From: Imran Haque Date: Sat, 7 Apr 2018 07:47:32 +0000 Subject: [PATCH 35/64] Exclude empty SAM lines in bounds check for ALT postprocessor --- bwakit/bwa-postalt.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bwakit/bwa-postalt.js b/bwakit/bwa-postalt.js index bfc41900..e00d3dc7 100644 --- a/bwakit/bwa-postalt.js +++ b/bwakit/bwa-postalt.js @@ -283,7 +283,7 @@ function bwa_postalt(args) // process SAM var buf2 = [], hla = {}; file = args.length - getopt.ind >= 2? new File(args[getopt.ind+1]) : new File(); - while (file.readline(buf) >= 0) { + while (file.readline(buf) > 0) { var m, line = buf.toString(); if (line.charAt(0) == '@') { // print and then skip the header line From 20d0a13092aa4cb73230492b05f9697d5ef0b88e Mon Sep 17 00:00:00 2001 From: Heng Li Date: Wed, 23 Jan 2019 13:23:25 -0500 Subject: [PATCH 36/64] r1198: exit if .alt is malformatted Resolves #232 --- bntseq.c | 8 +++++++- main.c | 2 +- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/bntseq.c b/bntseq.c index 65f7e93a..bf67bd44 100644 --- a/bntseq.c +++ b/bntseq.c @@ -197,7 +197,13 @@ bntseq_t *bns_restore(const char *prefix) } while (c != '\n' && c != EOF) c = fgetc(fp); i = 0; - } else str[i++] = c; // FIXME: potential segfault here + } else { + if (i >= 1022) { + fprintf(stderr, "[E::%s] sequence name longer than 1023 characters. Abort!\n", __func__); + exit(1); + } + str[i++] = c; + } } kh_destroy(str, h); fclose(fp); diff --git a/main.c b/main.c index 24412da3..ff36b71a 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.17-r1194-dirty" +#define PACKAGE_VERSION "0.7.17-r1198-dirty" #endif int bwa_fa2pac(int argc, char *argv[]); From 38e751a40a2f6816f54e5fdb29e0b2accace616e Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 9 Jul 2019 10:25:57 -0400 Subject: [PATCH 37/64] mention bwa-mem2 --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 23762b88..58a4fad4 100644 --- a/README.md +++ b/README.md @@ -5,8 +5,11 @@ **Note: [minimap2][minimap2] has replaced BWA-MEM for __PacBio and Nanopore__ read alignment.** It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment. +A beta version of [BWA-MEM2][bwa-mem2] has been released for short-read mapping. +BWA-MEM2 is about twice as fast as BWA-MEM and outputs near identical alignments. [minimap2]: https://github.com/lh3/minimap2 +[bwa-mem2]: https://github.com/bwa-mem2/bwa-mem2 ## Getting started From 3a2b0c05ddeacea2785bbd3935cc3d3861f28751 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 9 Jul 2019 15:44:34 -0400 Subject: [PATCH 38/64] updated badges --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 58a4fad4..59f62a42 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ [![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) -[![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) -[![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat)](https://github.com/lh3/bwa/releases) +[![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg?label=SF%20downloads)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) +[![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat&label=GitHub%20downloads)](https://github.com/lh3/bwa/releases) +[![BioConda Install](https://img.shields.io/conda/dn/bioconda/bwa.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/bwa) **Note: [minimap2][minimap2] has replaced BWA-MEM for __PacBio and Nanopore__ read alignment.** It retains all major BWA-MEM features, but is ~50 times as fast, From 2a1ae7b6f34a96ea25be007ac9d91e57e9d32284 Mon Sep 17 00:00:00 2001 From: David Seifert Date: Wed, 26 Feb 2020 13:24:17 +0100 Subject: [PATCH 39/64] Fix building against GCC 10 * GCC 10 defaults to `-fno-common`, which makes C behave more like C++ in that you can only ever have one definition of an object per executable. --- rle.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rle.h b/rle.h index 0d594846..4f8946dc 100644 --- a/rle.h +++ b/rle.h @@ -30,7 +30,7 @@ extern "C" { *** 43+3 codec *** ******************/ -const uint8_t rle_auxtab[8]; +extern const uint8_t rle_auxtab[8]; #define RLE_MIN_SPACE 18 #define rle_nptr(block) ((uint16_t*)(block)) From 436aceb4effdb0ff92650e5efe4361930f95d8dc Mon Sep 17 00:00:00 2001 From: Heng Li Date: Wed, 1 Jul 2020 23:02:01 -0400 Subject: [PATCH 40/64] added MIT license to some non-GPL source files --- bntseq.c | 7 +++---- bntseq.h | 6 +++--- bwa.c | 26 ++++++++++++++++++++++++++ bwa.h | 26 ++++++++++++++++++++++++++ bwamem.c | 26 ++++++++++++++++++++++++++ bwamem.h | 26 ++++++++++++++++++++++++++ bwamem_extra.c | 26 ++++++++++++++++++++++++++ bwamem_pair.c | 26 ++++++++++++++++++++++++++ bwt.h | 6 ++++-- bwtindex.c | 7 +++---- fastmap.c | 26 ++++++++++++++++++++++++++ utils.c | 6 +++--- utils.h | 7 +++---- 13 files changed, 201 insertions(+), 20 deletions(-) diff --git a/bntseq.c b/bntseq.c index bf67bd44..63806896 100644 --- a/bntseq.c +++ b/bntseq.c @@ -1,6 +1,8 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -22,9 +24,6 @@ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ - -/* Contact: Heng Li */ - #include #include #include diff --git a/bntseq.h b/bntseq.h index 03671d68..92e3afc8 100644 --- a/bntseq.h +++ b/bntseq.h @@ -1,6 +1,8 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -23,8 +25,6 @@ SOFTWARE. */ -/* Contact: Heng Li */ - #ifndef BWT_BNTSEQ_H #define BWT_BNTSEQ_H diff --git a/bwa.c b/bwa.c index 75ccdf44..f35b4be7 100644 --- a/bwa.c +++ b/bwa.c @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #include #include #include diff --git a/bwa.h b/bwa.h index aa21725e..96689504 100644 --- a/bwa.h +++ b/bwa.h @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #ifndef BWA_H_ #define BWA_H_ diff --git a/bwamem.c b/bwamem.c index deb56d66..966e8d0f 100644 --- a/bwamem.c +++ b/bwamem.c @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #include #include #include diff --git a/bwamem.h b/bwamem.h index fe537fa1..0a0e3bb4 100644 --- a/bwamem.h +++ b/bwamem.h @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #ifndef BWAMEM_H_ #define BWAMEM_H_ diff --git a/bwamem_extra.c b/bwamem_extra.c index fb42abdb..0d64e79c 100644 --- a/bwamem_extra.c +++ b/bwamem_extra.c @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #include #include "bwa.h" #include "bwamem.h" diff --git a/bwamem_pair.c b/bwamem_pair.c index 9beb84b6..ef795218 100644 --- a/bwamem_pair.c +++ b/bwamem_pair.c @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #include #include #include diff --git a/bwt.h b/bwt.h index c71d6b5e..daba87fd 100644 --- a/bwt.h +++ b/bwt.h @@ -1,6 +1,8 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -23,7 +25,7 @@ SOFTWARE. */ -/* Contact: Heng Li */ +/* Contact: Heng Li */ #ifndef BWA_BWT_H #define BWA_BWT_H diff --git a/bwtindex.c b/bwtindex.c index e1322f8e..6a27ae10 100644 --- a/bwtindex.c +++ b/bwtindex.c @@ -1,6 +1,8 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -22,9 +24,6 @@ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ - -/* Contact: Heng Li */ - #include #include #include diff --git a/fastmap.c b/fastmap.c index 53d2b71f..c865426c 100644 --- a/fastmap.c +++ b/fastmap.c @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #include #include #include diff --git a/utils.c b/utils.c index 29832617..21d52819 100644 --- a/utils.c +++ b/utils.c @@ -1,6 +1,8 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -22,8 +24,6 @@ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ - -/* Contact: Heng Li */ #define FSYNC_ON_FLUSH #include diff --git a/utils.h b/utils.h index 11966b8d..9d807846 100644 --- a/utils.h +++ b/utils.h @@ -1,6 +1,8 @@ /* The MIT License - Copyright (c) 2008 Genome Research Ltd (GRL). + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -22,9 +24,6 @@ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ - -/* Contact: Heng Li */ - #ifndef LH3_UTILS_H #define LH3_UTILS_H From 13b5637fe6bd678b5756247a857c7ed9460b8e77 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Thu, 2 Jul 2020 13:49:42 -0400 Subject: [PATCH 41/64] added the MIT license header to main.c --- main.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/main.c b/main.c index ff36b71a..a77ba53a 100644 --- a/main.c +++ b/main.c @@ -1,3 +1,29 @@ +/* The MIT License + + Copyright (c) 2018- Dana-Farber Cancer Institute + 2009-2018 Broad Institute, Inc. + 2008-2009 Genome Research Ltd. (GRL) + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. +*/ #include #include #include "kstring.h" From e5985723ccf1ba6095fb7fc33e13ad9d27fb9fe5 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Thu, 6 Aug 2020 10:08:29 -0700 Subject: [PATCH 42/64] Document "-u" --- fastmap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fastmap.c b/fastmap.c index c865426c..226f5a14 100644 --- a/fastmap.c +++ b/fastmap.c @@ -312,6 +312,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " specify the mean, standard deviation (10%% of the mean if absent), max\n"); fprintf(stderr, " (4 sigma from the mean if absent) and min of the insert size distribution.\n"); fprintf(stderr, " FR orientation only. [inferred]\n"); + fprintf(stderr, " -u output XB instead of XA; XB is XA with the alignment score added\n"); fprintf(stderr, "\n"); fprintf(stderr, "Note: Please read the man page for detailed description of the command line and options.\n"); fprintf(stderr, "\n"); From e050ec9a10cd8ca630f891de8cf666a26966becb Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Thu, 6 Aug 2020 10:10:58 -0700 Subject: [PATCH 43/64] Output mapping quality with XB --- bwamem_extra.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/bwamem_extra.c b/bwamem_extra.c index 0d64e79c..c47b93f4 100644 --- a/bwamem_extra.c +++ b/bwamem_extra.c @@ -155,6 +155,8 @@ char **mem_gen_alt(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac if (opt->flag & MEM_F_XB) { kputc(',', &str); kputw(t.score, &str); + kputc(',', &str); + kputw(t.mapq, &str); } kputc(';', &str); free(t.cigar); From ab9e0e0226e256df8fc5bd53a40d538637648be1 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Wed, 12 Aug 2020 08:51:19 -0700 Subject: [PATCH 44/64] Update fastmap.c --- fastmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fastmap.c b/fastmap.c index 226f5a14..db45208d 100644 --- a/fastmap.c +++ b/fastmap.c @@ -312,7 +312,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " specify the mean, standard deviation (10%% of the mean if absent), max\n"); fprintf(stderr, " (4 sigma from the mean if absent) and min of the insert size distribution.\n"); fprintf(stderr, " FR orientation only. [inferred]\n"); - fprintf(stderr, " -u output XB instead of XA; XB is XA with the alignment score added\n"); + fprintf(stderr, " -u output XB instead of XA; XB is XA with the alignment score added.\n"); fprintf(stderr, "\n"); fprintf(stderr, "Note: Please read the man page for detailed description of the command line and options.\n"); fprintf(stderr, "\n"); From 50122bed2a9524eeaebe44ca0903b9bdc95a8ae5 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Wed, 12 Aug 2020 09:01:44 -0700 Subject: [PATCH 45/64] allow the user to specify the XA drop ratio via the -z option --- fastmap.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/fastmap.c b/fastmap.c index c865426c..fcb2e3cd 100644 --- a/fastmap.c +++ b/fastmap.c @@ -156,7 +156,7 @@ int main_mem(int argc, char *argv[]) aux.opt = opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "51qpaMCSPVYjuk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { + while ((c = getopt(argc, argv, "51qpaMCSPVYjuk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:z:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == '1') no_mt_io = 1; else if (c == 'x') mode = optarg; @@ -198,6 +198,7 @@ int main_mem(int argc, char *argv[]) if (*p != 0 && ispunct(*p) && isdigit(p[1])) opt->max_XA_hits_alt = strtol(p+1, &p, 10); } + else if (c == 'z') opt->XA_drop_ratio = atof(optarg); else if (c == 'Q') { opt0.mapQ_coef_len = 1; opt->mapQ_coef_len = atoi(optarg); @@ -302,7 +303,12 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, "\n"); fprintf(stderr, " -v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose); fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T); - fprintf(stderr, " -h INT[,INT] if there are 80%% of the max score, output all in XA [%d,%d]\n", opt->max_XA_hits, opt->max_XA_hits_alt); + fprintf(stderr, " -h INT[,INT] if there are %.2f%% of the max score, output all in XA [%d,%d]\n", + opt->XA_drop_ratio * 100.0, + opt->max_XA_hits, opt->max_XA_hits_alt); + fprintf(stderr, " A second value may be given for alternate sequences.\n"); + fprintf(stderr, " -z FLOAT The fraction of the max score to use with -h [%f].\n", opt->XA_drop_ratio); + fprintf(stderr, " specify the mean, standard deviation (10%% of the mean if absent), max\n"); fprintf(stderr, " -a output all alignments for SE or unpaired PE\n"); fprintf(stderr, " -C append FASTA/FASTQ comment to SAM output\n"); fprintf(stderr, " -V output the reference FASTA header in the XR tag\n"); From 84b2abb675fb8f5be923059a9a14b42fd1b8cc96 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Tue, 18 Aug 2020 23:01:43 -0700 Subject: [PATCH 46/64] Clarify bwa mem -5 option Clarify that the -5 bwa mem option chooses the alignment that starts earliest in the read relative to the read/sequencing order, not genomic coordinate order --- fastmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fastmap.c b/fastmap.c index c865426c..d2ee943c 100644 --- a/fastmap.c +++ b/fastmap.c @@ -296,7 +296,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n"); fprintf(stderr, " -o FILE sam file to output results to [stdout]\n"); fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore .alt file)\n"); - fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordinate as primary\n"); + fprintf(stderr, " -5 for split alignment, take the alignment with the smallest query (not genomic) coordinate as primary\n"); fprintf(stderr, " -q don't modify mapQ of supplementary alignments\n"); fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n"); fprintf(stderr, "\n"); From b9accf95adba7d23fad8fd328f041c7b309e235c Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 22 Feb 2021 23:26:03 -0500 Subject: [PATCH 47/64] debug flag to measure memory --- bwa.c | 1 + bwa.h | 4 +++- bwamem.c | 2 ++ fastmap.c | 3 ++- utils.c | 15 +++++++++++++-- utils.h | 5 +++-- 6 files changed, 24 insertions(+), 6 deletions(-) diff --git a/bwa.c b/bwa.c index f35b4be7..8aacde31 100644 --- a/bwa.c +++ b/bwa.c @@ -40,6 +40,7 @@ #endif int bwa_verbose = 3; +int bwa_dbg = 0; char bwa_rg_id[256]; char *bwa_pg; diff --git a/bwa.h b/bwa.h index 96689504..95c324b7 100644 --- a/bwa.h +++ b/bwa.h @@ -43,6 +43,8 @@ #define BWTALGO_BWTSW 2 #define BWTALGO_IS 3 +#define BWA_DBG_QNAME 0x1 + typedef struct { bwt_t *bwt; // FM-index bntseq_t *bns; // information on the reference sequences @@ -58,7 +60,7 @@ typedef struct { char *name, *comment, *seq, *qual, *sam; } bseq1_t; -extern int bwa_verbose; +extern int bwa_verbose, bwa_dbg; extern char bwa_rg_id[256]; #ifdef __cplusplus diff --git a/bwamem.c b/bwamem.c index 966e8d0f..104d0c95 100644 --- a/bwamem.c +++ b/bwamem.c @@ -1223,10 +1223,12 @@ static void worker2(void *data, int i, int tid) mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a, w->n_processed + i); if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]); mem_reg2sam(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0); + fprintf(stderr, "Q\t%s\t%.3f\n", w->seqs[i].name, peakrss() / 1073741824.0); free(w->regs[i].a); } else { if (bwa_verbose >= 4) printf("=====> Finalizing read pair '%s' <=====\n", w->seqs[i<<1|0].name); mem_sam_pe(w->opt, w->bns, w->pac, w->pes, (w->n_processed>>1) + i, &w->seqs[i<<1], &w->regs[i<<1]); + fprintf(stderr, "Q\t%s\t%.3f\n", w->seqs[i<<1|0].name, peakrss() / 1073741824.0); free(w->regs[i<<1|0].a); free(w->regs[i<<1|1].a); } } diff --git a/fastmap.c b/fastmap.c index c865426c..7c2f3a24 100644 --- a/fastmap.c +++ b/fastmap.c @@ -156,7 +156,7 @@ int main_mem(int argc, char *argv[]) aux.opt = opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "51qpaMCSPVYjuk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) { + while ((c = getopt(argc, argv, "51qpaMCSPVYjuk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:F:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == '1') no_mt_io = 1; else if (c == 'x') mode = optarg; @@ -192,6 +192,7 @@ int main_mem(int argc, char *argv[]) else if (c == 'C') aux.copy_comment = 1; else if (c == 'K') fixed_chunk_size = atoi(optarg); else if (c == 'X') opt->mask_level = atof(optarg); + else if (c == 'F') bwa_dbg = atoi(optarg); else if (c == 'h') { opt0.max_XA_hits = opt0.max_XA_hits_alt = 1; opt->max_XA_hits = opt->max_XA_hits_alt = strtol(optarg, &p, 10); diff --git a/utils.c b/utils.c index 21d52819..9ceb1be1 100644 --- a/utils.c +++ b/utils.c @@ -279,17 +279,28 @@ int err_gzclose(gzFile file) * Timer * *********/ -double cputime() +double cputime(void) { struct rusage r; getrusage(RUSAGE_SELF, &r); return r.ru_utime.tv_sec + r.ru_stime.tv_sec + 1e-6 * (r.ru_utime.tv_usec + r.ru_stime.tv_usec); } -double realtime() +double realtime(void) { struct timeval tp; struct timezone tzp; gettimeofday(&tp, &tzp); return tp.tv_sec + tp.tv_usec * 1e-6; } + +long peakrss(void) +{ + struct rusage r; + getrusage(RUSAGE_SELF, &r); +#ifdef __linux__ + return r.ru_maxrss * 1024; +#else + return r.ru_maxrss; +#endif +} diff --git a/utils.h b/utils.h index 9d807846..d4d45500 100644 --- a/utils.h +++ b/utils.h @@ -84,8 +84,9 @@ extern "C" { int err_fclose(FILE *stream); int err_gzclose(gzFile file); - double cputime(); - double realtime(); + double cputime(void); + double realtime(void); + long peakrss(void); void ks_introsort_64 (size_t n, uint64_t *a); void ks_introsort_128(size_t n, pair64_t *a); From f1d1fd7c42e16f6006da77bf81e402eb531dbac2 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 23 Feb 2021 10:43:41 -0500 Subject: [PATCH 48/64] output more memory information --- bwamem.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/bwamem.c b/bwamem.c index 104d0c95..859aff4e 100644 --- a/bwamem.c +++ b/bwamem.c @@ -1205,11 +1205,13 @@ static void worker1(void *data, int i, int tid) if (!(w->opt->flag&MEM_F_PE)) { if (bwa_verbose >= 4) printf("=====> Processing read '%s' <=====\n", w->seqs[i].name); w->regs[i] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i].l_seq, w->seqs[i].seq, w->aux[tid]); + fprintf(stderr, "Q1\t%s\t%.3f\t%d\n", w->seqs[i].name, peakrss() / 1073741824.0, tid); } else { if (bwa_verbose >= 4) printf("=====> Processing read '%s'/1 <=====\n", w->seqs[i<<1|0].name); w->regs[i<<1|0] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|0].l_seq, w->seqs[i<<1|0].seq, w->aux[tid]); if (bwa_verbose >= 4) printf("=====> Processing read '%s'/2 <=====\n", w->seqs[i<<1|1].name); w->regs[i<<1|1] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|1].l_seq, w->seqs[i<<1|1].seq, w->aux[tid]); + fprintf(stderr, "Q1\t%s\t%.3f\t%d\n", w->seqs[i<<1|0].name, peakrss() / 1073741824.0, tid); } } @@ -1223,12 +1225,12 @@ static void worker2(void *data, int i, int tid) mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a, w->n_processed + i); if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]); mem_reg2sam(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0); - fprintf(stderr, "Q\t%s\t%.3f\n", w->seqs[i].name, peakrss() / 1073741824.0); + fprintf(stderr, "Q2\t%s\t%.3f\n", w->seqs[i].name, peakrss() / 1073741824.0); free(w->regs[i].a); } else { if (bwa_verbose >= 4) printf("=====> Finalizing read pair '%s' <=====\n", w->seqs[i<<1|0].name); mem_sam_pe(w->opt, w->bns, w->pac, w->pes, (w->n_processed>>1) + i, &w->seqs[i<<1], &w->regs[i<<1]); - fprintf(stderr, "Q\t%s\t%.3f\n", w->seqs[i<<1|0].name, peakrss() / 1073741824.0); + fprintf(stderr, "Q2\t%s\t%.3f\n", w->seqs[i<<1|0].name, peakrss() / 1073741824.0); free(w->regs[i<<1|0].a); free(w->regs[i<<1|1].a); } } From 110bf9b8edd05f05de6b37467745f284b72e36ef Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 9 Mar 2021 00:52:20 -0500 Subject: [PATCH 49/64] deprecate bwasw --- main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/main.c b/main.c index a77ba53a..22ccd405 100644 --- a/main.c +++ b/main.c @@ -58,7 +58,7 @@ static int usage() fprintf(stderr, "\n"); fprintf(stderr, "Program: bwa (alignment via Burrows-Wheeler transformation)\n"); fprintf(stderr, "Version: %s\n", PACKAGE_VERSION); - fprintf(stderr, "Contact: Heng Li \n\n"); + fprintf(stderr, "Contact: Heng Li \n\n"); fprintf(stderr, "Usage: bwa [options]\n\n"); fprintf(stderr, "Command: index index sequences in the FASTA format\n"); fprintf(stderr, " mem BWA-MEM algorithm\n"); @@ -67,7 +67,7 @@ static int usage() fprintf(stderr, " aln gapped/ungapped alignment\n"); fprintf(stderr, " samse generate alignment (single ended)\n"); fprintf(stderr, " sampe generate alignment (paired ended)\n"); - fprintf(stderr, " bwasw BWA-SW for long queries\n"); + fprintf(stderr, " bwasw BWA-SW for long queries (DEPRECATED)\n"); fprintf(stderr, "\n"); fprintf(stderr, " shm manage indices in shared memory\n"); fprintf(stderr, " fa2pac convert FASTA to PAC format\n"); From 34374c56139b7a08b3ef7dae38aca36f43f3cdd1 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 9 Mar 2021 08:56:52 -0500 Subject: [PATCH 50/64] Removed the debug output; resolves #320 --- bwamem.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/bwamem.c b/bwamem.c index 859aff4e..966e8d0f 100644 --- a/bwamem.c +++ b/bwamem.c @@ -1205,13 +1205,11 @@ static void worker1(void *data, int i, int tid) if (!(w->opt->flag&MEM_F_PE)) { if (bwa_verbose >= 4) printf("=====> Processing read '%s' <=====\n", w->seqs[i].name); w->regs[i] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i].l_seq, w->seqs[i].seq, w->aux[tid]); - fprintf(stderr, "Q1\t%s\t%.3f\t%d\n", w->seqs[i].name, peakrss() / 1073741824.0, tid); } else { if (bwa_verbose >= 4) printf("=====> Processing read '%s'/1 <=====\n", w->seqs[i<<1|0].name); w->regs[i<<1|0] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|0].l_seq, w->seqs[i<<1|0].seq, w->aux[tid]); if (bwa_verbose >= 4) printf("=====> Processing read '%s'/2 <=====\n", w->seqs[i<<1|1].name); w->regs[i<<1|1] = mem_align1_core(w->opt, w->bwt, w->bns, w->pac, w->seqs[i<<1|1].l_seq, w->seqs[i<<1|1].seq, w->aux[tid]); - fprintf(stderr, "Q1\t%s\t%.3f\t%d\n", w->seqs[i<<1|0].name, peakrss() / 1073741824.0, tid); } } @@ -1225,12 +1223,10 @@ static void worker2(void *data, int i, int tid) mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a, w->n_processed + i); if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]); mem_reg2sam(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0); - fprintf(stderr, "Q2\t%s\t%.3f\n", w->seqs[i].name, peakrss() / 1073741824.0); free(w->regs[i].a); } else { if (bwa_verbose >= 4) printf("=====> Finalizing read pair '%s' <=====\n", w->seqs[i<<1|0].name); mem_sam_pe(w->opt, w->bns, w->pac, w->pes, (w->n_processed>>1) + i, &w->seqs[i<<1], &w->regs[i<<1]); - fprintf(stderr, "Q2\t%s\t%.3f\n", w->seqs[i<<1|0].name, peakrss() / 1073741824.0); free(w->regs[i<<1|0].a); free(w->regs[i<<1|1].a); } } From fbfffc9031a3a2d40bdb0a5157691bcbf2b2d586 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Thu, 22 Apr 2021 17:35:37 -0400 Subject: [PATCH 51/64] added code of conduct --- code_of_conduct.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) create mode 100644 code_of_conduct.md diff --git a/code_of_conduct.md b/code_of_conduct.md new file mode 100644 index 00000000..b7175c1f --- /dev/null +++ b/code_of_conduct.md @@ -0,0 +1,30 @@ +## Contributor Code of Conduct + +As contributors and maintainers of this project, we pledge to respect all +people who contribute through reporting issues, posting feature requests, +updating documentation, submitting pull requests or patches, and other +activities. + +We are committed to making participation in this project a harassment-free +experience for everyone, regardless of level of experience, gender, gender +identity and expression, sexual orientation, disability, personal appearance, +body size, race, age, or religion. + +Examples of unacceptable behavior by participants include the use of sexual +language or imagery, derogatory comments or personal attacks, trolling, public +or private harassment, insults, or other unprofessional conduct. + +Project maintainers have the right and responsibility to remove, edit, or +reject comments, commits, code, wiki edits, issues, and other contributions +that are not aligned to this Code of Conduct. Project maintainers or +contributors who do not follow the Code of Conduct may be removed from the +project team. + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported by opening an issue or contacting the maintainer via email. + +This Code of Conduct is adapted from the [Contributor Covenant][cc], [version +1.0.0][v1]. + +[cc]: http://contributor-covenant.org/ +[v1]: http://contributor-covenant.org/version/1/0/0/ From 765fac1070b8edd8759dacbaa7a0e535ff7f1430 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Sun, 9 May 2021 22:54:58 +0100 Subject: [PATCH 52/64] Convert Travis CI to GitHub Actions Make an equivalent GitHub Actions workflow that tests compilation with both GCC and Clang. --- .github/workflows/ci.yaml | 21 +++++++++++++++++++++ .travis.yml | 5 ----- 2 files changed, 21 insertions(+), 5 deletions(-) create mode 100644 .github/workflows/ci.yaml delete mode 100644 .travis.yml diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml new file mode 100644 index 00000000..b77bc162 --- /dev/null +++ b/.github/workflows/ci.yaml @@ -0,0 +1,21 @@ +name: CI + +on: + push: + branches: + - master + pull_request: + +jobs: + build: + runs-on: ubuntu-latest + strategy: + matrix: + compiler: [gcc, clang] + + steps: + - name: Checkout bwa + uses: actions/checkout@v2 + + - name: Compile with ${{ matrix.compiler }} + run: make CC=${{ matrix.compiler }} diff --git a/.travis.yml b/.travis.yml deleted file mode 100644 index e386b272..00000000 --- a/.travis.yml +++ /dev/null @@ -1,5 +0,0 @@ -language: c -compiler: - - gcc - - clang -script: make From fe209ff1ca8e72e739aa375a91085d8522445495 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 10 May 2021 12:47:13 -0400 Subject: [PATCH 53/64] updated to Github CI status --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 59f62a42..b1e0e0d9 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa) +[![Build Status](https://github.com/lh3/bwa/actions/workflows/ci.yaml/badge.svg) [![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg?label=SF%20downloads)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) [![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat&label=GitHub%20downloads)](https://github.com/lh3/bwa/releases) [![BioConda Install](https://img.shields.io/conda/dn/bioconda/bwa.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/bwa) From 3ddd7b87d41f89a404d95f083fb37c369f70d783 Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 10 May 2021 12:48:45 -0400 Subject: [PATCH 54/64] fixed wrong Markdown format in README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b1e0e0d9..de4f8f70 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[![Build Status](https://github.com/lh3/bwa/actions/workflows/ci.yaml/badge.svg) +[![Build Status](https://github.com/lh3/bwa/actions/workflows/ci.yaml/badge.svg)](https://github.com/lh3/bwa/actions) [![SourceForge Downloads](https://img.shields.io/sourceforge/dt/bio-bwa.svg?label=SF%20downloads)](https://sourceforge.net/projects/bio-bwa/files/?source=navbar) [![GitHub Downloads](https://img.shields.io/github/downloads/lh3/bwa/total.svg?style=flat&label=GitHub%20downloads)](https://github.com/lh3/bwa/releases) [![BioConda Install](https://img.shields.io/conda/dn/bioconda/bwa.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/bwa) From d8dd308a1f686221d3c7c4af7ba57e191d2cb62a Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Wed, 21 Jul 2021 11:12:27 -0700 Subject: [PATCH 55/64] Add the mate mapping quality tag --- bwamem.c | 1 + 1 file changed, 1 insertion(+) diff --git a/bwamem.c b/bwamem.c index 966e8d0f..03e2a059 100644 --- a/bwamem.c +++ b/bwamem.c @@ -932,6 +932,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq kputsn("\tMD:Z:", 6, str); kputs((char*)(p->cigar + p->n_cigar), str); } if (m && m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); } + if (m) { kputsn("\tMQ:i:", 6, str); kputw(m->mapq, str);} if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); } if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); } if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); } From 6b18630a62eef0a422c590bb8626460f06a8c356 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Tue, 14 Dec 2021 08:02:05 -0700 Subject: [PATCH 56/64] Add the header line to the output SAM In particular, this defines the output SAM to be unsorted BUT also query grouped. The latter is very important to explicitly define so downstream tools that don't make assumptions know that reads from the same template are grouped. --- bwa.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/bwa.c b/bwa.c index 8aacde31..104c95c5 100644 --- a/bwa.c +++ b/bwa.c @@ -406,10 +406,17 @@ int bwa_idx2mem(bwaidx_t *idx) void bwa_print_sam_hdr(const bntseq_t *bns, const char *hdr_line) { - int i, n_SQ = 0; + int i, n_HD = 0, n_SQ = 0; extern char *bwa_pg; + if (hdr_line) { + // check for HD line const char *p = hdr_line; + if ((p = strstr(p, "@HD")) != 0) { + ++n_HD; + } + // check for SQ lines + p = hdr_line; while ((p = strstr(p, "@SQ\t")) != 0) { if (p == hdr_line || *(p-1) == '\n') ++n_SQ; p += 4; @@ -423,6 +430,9 @@ void bwa_print_sam_hdr(const bntseq_t *bns, const char *hdr_line) } } else if (n_SQ != bns->n_seqs && bwa_verbose >= 2) fprintf(stderr, "[W::%s] %d @SQ lines provided with -H; %d sequences in the index. Continue anyway.\n", __func__, n_SQ, bns->n_seqs); + if (n_HD == 0) { + err_printf("@HD\tVN:1.5\tSO:unsorted\tGO:query\n"); + } if (hdr_line) err_printf("%s\n", hdr_line); if (bwa_pg) err_printf("%s\n", bwa_pg); } From 4bf3cdf94847079517ec5342c347889a5859b8a9 Mon Sep 17 00:00:00 2001 From: Nils Homer Date: Fri, 18 Feb 2022 13:36:09 -0800 Subject: [PATCH 57/64] Update bwa.1 --- bwa.1 | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bwa.1 b/bwa.1 index d30ac331..eefb235a 100644 --- a/bwa.1 +++ b/bwa.1 @@ -154,8 +154,8 @@ batch of reads. The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different part of a query sequence. This is a crucial feature -for long sequences. However, some tools such as Picard's markDuplicates does -not work with split alignments. One may consider to use option +for long sequences. However, some tools may not work with split alignments. +One may consider to use option .B -M to flag shorter split hits as secondary. From ceaaa6d9cbb17b2766a4a73ed2289be12ba4f09b Mon Sep 17 00:00:00 2001 From: John Marshall Date: Sun, 15 Dec 2019 18:14:49 +0000 Subject: [PATCH 58/64] Use $CPPFLAGS and $LDFLAGS if they are set The bwa makefile doesn't set these two itself, but the environment or make command line might set any of CC/CPPFLAGS/CFLAGS/LDFLAGS/LIBS. Use $(CPPFLAGS) when compiling and $(LDFLAGS) when linking so they can be used to customise the build. Remove $(DFLAGS) from link commands as these preprocessor options are irrelevant for linking. --- Makefile | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Makefile b/Makefile index 71514357..37751bb3 100644 --- a/Makefile +++ b/Makefile @@ -22,15 +22,15 @@ endif .SUFFIXES:.c .o .cc .c.o: - $(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $< -o $@ + $(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $(CPPFLAGS) $< -o $@ all:$(PROG) bwa:libbwa.a $(AOBJS) main.o - $(CC) $(CFLAGS) $(DFLAGS) $(AOBJS) main.o -o $@ -L. -lbwa $(LIBS) + $(CC) $(CFLAGS) $(LDFLAGS) $(AOBJS) main.o -o $@ -L. -lbwa $(LIBS) bwamem-lite:libbwa.a example.o - $(CC) $(CFLAGS) $(DFLAGS) example.o -o $@ -L. -lbwa $(LIBS) + $(CC) $(CFLAGS) $(LDFLAGS) example.o -o $@ -L. -lbwa $(LIBS) libbwa.a:$(LOBJS) $(AR) -csru $@ $(LOBJS) @@ -39,7 +39,7 @@ clean: rm -f gmon.out *.o a.out $(PROG) *~ *.a depend: - ( LC_ALL=C ; export LC_ALL; makedepend -Y -- $(CFLAGS) $(DFLAGS) -- *.c ) + ( LC_ALL=C ; export LC_ALL; makedepend -Y -- $(CFLAGS) $(DFLAGS) $(CPPFLAGS) -- *.c ) # DO NOT DELETE THIS LINE -- make depend depends on it. From 2160a0c7de630bf764da189b5d4c296762560ce2 Mon Sep 17 00:00:00 2001 From: clintval Date: Thu, 19 May 2022 13:43:22 -0400 Subject: [PATCH 59/64] Document that the XB tag now contains the mapping quality too --- fastmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fastmap.c b/fastmap.c index 15f0aa42..be7ba0ef 100644 --- a/fastmap.c +++ b/fastmap.c @@ -319,7 +319,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " specify the mean, standard deviation (10%% of the mean if absent), max\n"); fprintf(stderr, " (4 sigma from the mean if absent) and min of the insert size distribution.\n"); fprintf(stderr, " FR orientation only. [inferred]\n"); - fprintf(stderr, " -u output XB instead of XA; XB is XA with the alignment score added.\n"); + fprintf(stderr, " -u output XB instead of XA; XB is XA with the alignment score and mapping quality added.\n"); fprintf(stderr, "\n"); fprintf(stderr, "Note: Please read the man page for detailed description of the command line and options.\n"); fprintf(stderr, "\n"); From ab01ab4490f4aa8aa7e8fc638dd49e37a3cf286c Mon Sep 17 00:00:00 2001 From: John Marshall Date: Fri, 17 Jun 2022 18:42:07 +0100 Subject: [PATCH 60/64] Make _mm_load_si128() explicit The previous code implicitly caused a load; change it so the load intrinsic is explicitly invoked, as the others are. (This in fact makes no difference to the generated code.) --- ksw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ksw.c b/ksw.c index 9793e5eb..92fd1e06 100644 --- a/ksw.c +++ b/ksw.c @@ -267,7 +267,7 @@ kswr_t ksw_i16(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_de h = _mm_load_si128(H0 + slen - 1); // h={2,5,8,11,14,17,-1,-1} in the above example h = _mm_slli_si128(h, 2); for (j = 0; LIKELY(j < slen); ++j) { - h = _mm_adds_epi16(h, *S++); + h = _mm_adds_epi16(h, _mm_load_si128(S++)); e = _mm_load_si128(E + j); h = _mm_max_epi16(h, e); h = _mm_max_epi16(h, f); From 165e524e5cec747734607944f851cc4908c6a41b Mon Sep 17 00:00:00 2001 From: John Marshall Date: Fri, 17 Jun 2022 19:34:14 +0100 Subject: [PATCH 61/64] On ARM, rewrite SSE2 SIMD calls using Neon intrinsics Many Intel intrinsics have a corresponding Neon equivalent. Other cases are more interesting: * Neon's vmaxvq directly selects the maximum entry in a vector, so can be used to implement both the __max_16/__max_8 macros and the _mm_movemask_epi8 early loop exit. Introduce additional helper macros alongside __max_16/__max_8 so that the early loop exit can similarly be implemented differently on the two platforms. * Full-width shifts can be done via vextq. This is defined close to the ksw_u8()/ksw_i16() functions (rather than in neon_sse.h) as it implicitly uses one of their local variables. * ksw_i16() uses saturating *signed* 16-bit operations apart from _mm_subs_epu16; presumably the data is effectively still signed but we wish to keep it non-negative. The ARM intrinsics are more careful about type checking, so this requires an extra U16() helper macro. --- Makefile | 2 +- ksw.c | 34 ++++++++++++++++++++++++++++++---- neon_sse.h | 33 +++++++++++++++++++++++++++++++++ 3 files changed, 64 insertions(+), 5 deletions(-) create mode 100644 neon_sse.h diff --git a/Makefile b/Makefile index 37751bb3..fbd87192 100644 --- a/Makefile +++ b/Makefile @@ -78,7 +78,7 @@ fastmap.o: bwa.h bntseq.h bwt.h bwamem.h kvec.h malloc_wrap.h utils.h kseq.h is.o: malloc_wrap.h kopen.o: malloc_wrap.h kstring.o: kstring.h malloc_wrap.h -ksw.o: ksw.h malloc_wrap.h +ksw.o: ksw.h neon_sse.h malloc_wrap.h main.o: kstring.h malloc_wrap.h utils.h malloc_wrap.o: malloc_wrap.h maxk.o: bwa.h bntseq.h bwt.h bwamem.h kseq.h malloc_wrap.h diff --git a/ksw.c b/ksw.c index 92fd1e06..ba0e0926 100644 --- a/ksw.c +++ b/ksw.c @@ -26,7 +26,11 @@ #include #include #include +#if defined __x86_64__ #include +#elif defined __ARM_NEON +#include "neon_sse.h" +#endif #include "ksw.h" #ifdef USE_MALLOC_WRAPPERS @@ -108,6 +112,11 @@ kswq_t *ksw_qinit(int size, int qlen, const uint8_t *query, int m, const int8_t return q; } +#if defined __ARM_NEON +// This macro implicitly uses each function's `zero` local variable +#define _mm_slli_si128(a, n) (vextq_u8(zero, (a), 16 - (n))) +#endif + kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del, int _o_ins, int _e_ins, int xtra) // the first gap costs -(_o+_e) { int slen, i, m_b, n_b, te = -1, gmax = 0, minsc, endsc; @@ -115,6 +124,7 @@ kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del __m128i zero, oe_del, e_del, oe_ins, e_ins, shift, *H0, *H1, *E, *Hmax; kswr_t r; +#if defined __x86_64__ #define __max_16(ret, xx) do { \ (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 8)); \ (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 4)); \ @@ -123,6 +133,14 @@ kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del (ret) = _mm_extract_epi16((xx), 0) & 0x00ff; \ } while (0) +// Given entries with arbitrary values, return whether they are all 0x00 +#define allzero_16(xx) (_mm_movemask_epi8(_mm_cmpeq_epi8((xx), zero)) == 0xffff) + +#elif defined __ARM_NEON +#define __max_16(ret, xx) (ret) = vmaxvq_u8((xx)) +#define allzero_16(xx) (vmaxvq_u8((xx)) == 0) +#endif + // initialization r = g_defr; minsc = (xtra&KSW_XSUBO)? xtra&0xffff : 0x10000; @@ -143,7 +161,7 @@ kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del } // the core loop for (i = 0; i < tlen; ++i) { - int j, k, cmp, imax; + int j, k, imax; __m128i e, h, t, f = zero, max = zero, *S = q->qp + target[i] * slen; // s is the 1st score vector h = _mm_load_si128(H0 + slen - 1); // h={2,5,8,11,14,17,-1,-1} in the above example h = _mm_slli_si128(h, 1); // h=H(i-1,-1); << instead of >> because x64 is little-endian @@ -182,8 +200,7 @@ kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del _mm_store_si128(H1 + j, h); h = _mm_subs_epu8(h, oe_ins); f = _mm_subs_epu8(f, e_ins); - cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_subs_epu8(f, h), zero)); - if (UNLIKELY(cmp == 0xffff)) goto end_loop16; + if (UNLIKELY(allzero_16(_mm_subs_epu8(f, h)))) goto end_loop16; } } end_loop16: @@ -236,6 +253,7 @@ kswr_t ksw_i16(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_de __m128i zero, oe_del, e_del, oe_ins, e_ins, *H0, *H1, *E, *Hmax; kswr_t r; +#if defined __x86_64__ #define __max_8(ret, xx) do { \ (xx) = _mm_max_epi16((xx), _mm_srli_si128((xx), 8)); \ (xx) = _mm_max_epi16((xx), _mm_srli_si128((xx), 4)); \ @@ -243,6 +261,14 @@ kswr_t ksw_i16(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_de (ret) = _mm_extract_epi16((xx), 0); \ } while (0) +// Given entries all either 0x0000 or 0xffff, return whether they are all 0x0000 +#define allzero_0f_8(xx) (!_mm_movemask_epi8((xx))) + +#elif defined __ARM_NEON +#define __max_8(ret, xx) (ret) = vmaxvq_s16(vreinterpretq_s16_u8((xx))) +#define allzero_0f_8(xx) (vmaxvq_u16(vreinterpretq_u16_u8((xx))) == 0) +#endif + // initialization r = g_defr; minsc = (xtra&KSW_XSUBO)? xtra&0xffff : 0x10000; @@ -290,7 +316,7 @@ kswr_t ksw_i16(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_de _mm_store_si128(H1 + j, h); h = _mm_subs_epu16(h, oe_ins); f = _mm_subs_epu16(f, e_ins); - if(UNLIKELY(!_mm_movemask_epi8(_mm_cmpgt_epi16(f, h)))) goto end_loop8; + if(UNLIKELY(allzero_0f_8(_mm_cmpgt_epi16(f, h)))) goto end_loop8; } } end_loop8: diff --git a/neon_sse.h b/neon_sse.h new file mode 100644 index 00000000..bd55f8b4 --- /dev/null +++ b/neon_sse.h @@ -0,0 +1,33 @@ +#ifndef NEON_SSE_H +#define NEON_SSE_H + +#include + +typedef uint8x16_t __m128i; + +static inline __m128i _mm_load_si128(const __m128i *ptr) { return vld1q_u8((const uint8_t *) ptr); } +static inline __m128i _mm_set1_epi32(int n) { return vreinterpretq_u8_s32(vdupq_n_s32(n)); } +static inline void _mm_store_si128(__m128i *ptr, __m128i a) { vst1q_u8((uint8_t *) ptr, a); } + +static inline __m128i _mm_adds_epu8(__m128i a, __m128i b) { return vqaddq_u8(a, b); } +static inline __m128i _mm_max_epu8(__m128i a, __m128i b) { return vmaxq_u8(a, b); } +static inline __m128i _mm_set1_epi8(int8_t n) { return vreinterpretq_u8_s8(vdupq_n_s8(n)); } +static inline __m128i _mm_subs_epu8(__m128i a, __m128i b) { return vqsubq_u8(a, b); } + +#define M128I(a) vreinterpretq_u8_s16((a)) +#define UM128I(a) vreinterpretq_u8_u16((a)) +#define S16(a) vreinterpretq_s16_u8((a)) +#define U16(a) vreinterpretq_u16_u8((a)) + +static inline __m128i _mm_adds_epi16(__m128i a, __m128i b) { return M128I(vqaddq_s16(S16(a), S16(b))); } +static inline __m128i _mm_cmpgt_epi16(__m128i a, __m128i b) { return UM128I(vcgtq_s16(S16(a), S16(b))); } +static inline __m128i _mm_max_epi16(__m128i a, __m128i b) { return M128I(vmaxq_s16(S16(a), S16(b))); } +static inline __m128i _mm_set1_epi16(int16_t n) { return vreinterpretq_u8_s16(vdupq_n_s16(n)); } +static inline __m128i _mm_subs_epu16(__m128i a, __m128i b) { return UM128I(vqsubq_u16(U16(a), U16(b))); } + +#undef M128I +#undef UM128I +#undef S16 +#undef U16 + +#endif From ac612b6f5a079750684ef67044fd131d209cc69f Mon Sep 17 00:00:00 2001 From: John Marshall Date: Wed, 22 Jun 2022 18:47:04 +0100 Subject: [PATCH 62/64] On other platforms, emulate SSE2 SIMD calls using scalar code --- Makefile | 2 +- ksw.c | 10 +++++ scalar_sse.h | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 130 insertions(+), 1 deletion(-) create mode 100644 scalar_sse.h diff --git a/Makefile b/Makefile index fbd87192..54805363 100644 --- a/Makefile +++ b/Makefile @@ -78,7 +78,7 @@ fastmap.o: bwa.h bntseq.h bwt.h bwamem.h kvec.h malloc_wrap.h utils.h kseq.h is.o: malloc_wrap.h kopen.o: malloc_wrap.h kstring.o: kstring.h malloc_wrap.h -ksw.o: ksw.h neon_sse.h malloc_wrap.h +ksw.o: ksw.h neon_sse.h scalar_sse.h malloc_wrap.h main.o: kstring.h malloc_wrap.h utils.h malloc_wrap.o: malloc_wrap.h maxk.o: bwa.h bntseq.h bwt.h bwamem.h kseq.h malloc_wrap.h diff --git a/ksw.c b/ksw.c index ba0e0926..22c8d10b 100644 --- a/ksw.c +++ b/ksw.c @@ -30,6 +30,8 @@ #include #elif defined __ARM_NEON #include "neon_sse.h" +#else +#include "scalar_sse.h" #endif #include "ksw.h" @@ -139,6 +141,10 @@ kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del #elif defined __ARM_NEON #define __max_16(ret, xx) (ret) = vmaxvq_u8((xx)) #define allzero_16(xx) (vmaxvq_u8((xx)) == 0) + +#else +#define __max_16(ret, xx) (ret) = m128i_max_u8((xx)) +#define allzero_16(xx) (m128i_allzero((xx))) #endif // initialization @@ -267,6 +273,10 @@ kswr_t ksw_i16(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_de #elif defined __ARM_NEON #define __max_8(ret, xx) (ret) = vmaxvq_s16(vreinterpretq_s16_u8((xx))) #define allzero_0f_8(xx) (vmaxvq_u16(vreinterpretq_u16_u8((xx))) == 0) + +#else +#define __max_8(ret, xx) (ret) = m128i_max_s16((xx)) +#define allzero_0f_8(xx) (m128i_allzero((xx))) #endif // initialization diff --git a/scalar_sse.h b/scalar_sse.h new file mode 100644 index 00000000..9c2c326d --- /dev/null +++ b/scalar_sse.h @@ -0,0 +1,119 @@ +#ifndef SCALAR_SSE_H +#define SCALAR_SSE_H + +#include +#include +#include + +typedef union m128i { + uint8_t u8[16]; + int16_t i16[8]; +} __m128i; + +static inline __m128i _mm_set1_epi32(int32_t n) { + assert(n >= 0 && n <= 255); + __m128i r; memset(&r, n, sizeof r); return r; +} + +static inline __m128i _mm_load_si128(const __m128i *ptr) { __m128i r; memcpy(&r, ptr, sizeof r); return r; } +static inline void _mm_store_si128(__m128i *ptr, __m128i a) { memcpy(ptr, &a, sizeof a); } + +static inline int m128i_allzero(__m128i a) { + static const char zero[] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"; + return memcmp(&a, zero, sizeof a) == 0; +} + +static inline __m128i _mm_slli_si128(__m128i a, int n) { + int i; + memmove(&a.u8[n], &a.u8[0], 16 - n); + for (i = 0; i < n; i++) a.u8[i] = 0; + return a; +} + +static inline __m128i _mm_adds_epu8(__m128i a, __m128i b) { + int i; + for (i = 0; i < 16; i++) { + uint16_t aa = a.u8[i]; + aa += b.u8[i]; + a.u8[i] = (aa < 256)? aa : 255; + } + return a; +} + +static inline __m128i _mm_max_epu8(__m128i a, __m128i b) { + int i; + for (i = 0; i < 16; i++) + if (a.u8[i] < b.u8[i]) a.u8[i] = b.u8[i]; + return a; +} + +static inline uint8_t m128i_max_u8(__m128i a) { + uint8_t max = 0; + int i; + for (i = 0; i < 16; i++) + if (max < a.u8[i]) max = a.u8[i]; + return max; +} + +static inline __m128i _mm_set1_epi8(int8_t n) { __m128i r; memset(&r, n, sizeof r); return r; } + +static inline __m128i _mm_subs_epu8(__m128i a, __m128i b) { + int i; + for (i = 0; i < 16; i++) { + int16_t aa = a.u8[i]; + aa -= b.u8[i]; + a.u8[i] = (aa >= 0)? aa : 0; + } + return a; +} + +static inline __m128i _mm_adds_epi16(__m128i a, __m128i b) { + int i; + for (i = 0; i < 8; i++) { + int32_t aa = a.i16[i]; + aa += b.i16[i]; + a.i16[i] = (aa < 32768)? aa : 32767; + } + return a; +} + +static inline __m128i _mm_cmpgt_epi16(__m128i a, __m128i b) { + int i; + for (i = 0; i < 8; i++) + a.i16[i] = (a.i16[i] > b.i16[i])? 0xffff : 0x0000; + return a; +} + +static inline __m128i _mm_max_epi16(__m128i a, __m128i b) { + int i; + for (i = 0; i < 8; i++) + if (a.i16[i] < b.i16[i]) a.i16[i] = b.i16[i]; + return a; +} + +static inline __m128i _mm_set1_epi16(int16_t n) { + __m128i r; + r.i16[0] = r.i16[1] = r.i16[2] = r.i16[3] = + r.i16[4] = r.i16[5] = r.i16[6] = r.i16[7] = n; + return r; +} + +static inline int16_t m128i_max_s16(__m128i a) { + int16_t max = -32768; + int i; + for (i = 0; i < 8; i++) + if (max < a.i16[i]) max = a.i16[i]; + return max; +} + +static inline __m128i _mm_subs_epu16(__m128i a, __m128i b) { + int i; + for (i = 0; i < 8; i++) { + int32_t aa = a.i16[i]; + aa -= b.i16[i]; + a.i16[i] = (aa >= 0)? aa : 0; + } + return a; +} + +#endif From c77ace7059e0376d33c818f7d77b63ecf3739fa9 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Mon, 27 Jun 2022 14:15:59 +0100 Subject: [PATCH 63/64] Use native SSE2 intrinsics on i386 as well as x86-64 Make the native SSE2 code conditional on __SSE2__, which is defined by GCC/Clang/etc on x86-64 by default and on i386 with -msse2 etc. --- ksw.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ksw.c b/ksw.c index 22c8d10b..1e584e9d 100644 --- a/ksw.c +++ b/ksw.c @@ -26,7 +26,7 @@ #include #include #include -#if defined __x86_64__ +#if defined __SSE2__ #include #elif defined __ARM_NEON #include "neon_sse.h" @@ -126,7 +126,7 @@ kswr_t ksw_u8(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_del __m128i zero, oe_del, e_del, oe_ins, e_ins, shift, *H0, *H1, *E, *Hmax; kswr_t r; -#if defined __x86_64__ +#if defined __SSE2__ #define __max_16(ret, xx) do { \ (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 8)); \ (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 4)); \ @@ -259,7 +259,7 @@ kswr_t ksw_i16(kswq_t *q, int tlen, const uint8_t *target, int _o_del, int _e_de __m128i zero, oe_del, e_del, oe_ins, e_ins, *H0, *H1, *E, *Hmax; kswr_t r; -#if defined __x86_64__ +#if defined __SSE2__ #define __max_8(ret, xx) do { \ (xx) = _mm_max_epi16((xx), _mm_srli_si128((xx), 8)); \ (xx) = _mm_max_epi16((xx), _mm_srli_si128((xx), 4)); \ From 94248a8ceadc867e3ea69f859f5f467c68ea146e Mon Sep 17 00:00:00 2001 From: Martin Tzvetanov Grigorov Date: Wed, 10 Aug 2022 14:59:32 +0300 Subject: [PATCH 64/64] Add CI job for Ubuntu 20.04 aarch64 Signed-off-by: Martin Tzvetanov Grigorov --- .github/workflows/ci.yaml | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index b77bc162..8eb2f062 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -15,7 +15,32 @@ jobs: steps: - name: Checkout bwa - uses: actions/checkout@v2 + uses: actions/checkout@v3 - name: Compile with ${{ matrix.compiler }} run: make CC=${{ matrix.compiler }} + + build-aarch64: + runs-on: ubuntu-latest + strategy: + matrix: + compiler: [gcc, clang] + + steps: + - name: Checkout bwa + uses: actions/checkout@v3 + + - name: Compile with ${{ matrix.compiler }} + uses: uraimo/run-on-arch-action@v2 + with: + arch: aarch64 + distro: ubuntu20.04 + githubToken: ${{ github.token }} + dockerRunArgs: | + --volume "${PWD}:/bwa" + install: | + apt-get update -q -y + apt-get install -q -y make ${{ matrix.compiler }} zlib1g-dev + run: | + cd /bwa + make CC=${{ matrix.compiler }}