Skip to content

Commit

Permalink
Updated seqann like rrwick#76 (comment)
Browse files Browse the repository at this point in the history
  • Loading branch information
kpalin committed Jan 23, 2019
2 parents 6875ad4 + 109e437 commit 49bd5fd
Show file tree
Hide file tree
Showing 694 changed files with 10,141 additions and 5,073 deletions.
42 changes: 42 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,13 @@ Porechop is a tool for finding and removing adapters from [Oxford Nanopore](http
Porechop also supports demultiplexing of Nanopore reads that were barcoded with the [Native Barcoding Kit](https://store.nanoporetech.com/native-barcoding-kit-1d.html), [PCR Barcoding Kit](https://store.nanoporetech.com/pcr-barcoding-kit-96.html) or [Rapid Barcoding Kit](https://store.nanoporetech.com/rapid-barcoding-sequencing-kit.html).


### Oct 2018 update: Porechop is officially unsupported

While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered [issues](https://github.com/rrwick/Porechop/issues) and [pull requests](https://github.com/rrwick/Porechop/pulls) reveal that it already has been for some time). I've added a [known issues](#known-issues) section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though :stuck_out_tongue:) try to rewrite it from a blank canvas to address its faults.





# Table of contents

Expand All @@ -24,6 +31,7 @@ Porechop also supports demultiplexing of Nanopore reads that were barcoded with
* [Verbose output](#verbose-output)
* [Known adapters](#known-adapters)
* [Full usage](#full-usage)
* [Known issues](#known-issues)
* [Acknowledgements](#acknowledgements)
* [License](#license)

Expand Down Expand Up @@ -331,6 +339,40 @@ Help:



# Known issues

### Adapter search

Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues:

* As the number of kits/barcodes has grown, adapter-search part of the Porechop's pipeline has become increasingly slow.
* Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc).
* Many ONT adapters share common sequence with each other, making false positive adapter finds possible.

A simpler solution (and in hindsight what I should have done) would be to require the kit and/or adapters from the user. E.g. `porechop --sqk-lsk109` or `porechop --start_adapt ACGCTAGCATACGT`.


### Performance

Porechop uses [SeqAn](https://github.com/seqan/seqan) to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as [Edlib](https://github.com/Martinsos/edlib).

Another performance issue is that Porechop uses [ctypes](https://docs.python.org/3/library/ctypes.html) to interface with its C++ code. Function calls with ctypes can have a bit of overhead, which means that Porechop cannot use threads very efficiently (it spends too much of its time in the Python code, which is intrinsically non-parallel).


### Barcode demultiplexing

I added demultiplexing to Porechop as an afterthought – it was already looking for barcodes to trim, so why not sort reads by barcodes too? This turned out to be a very useful feature, but in hindsight I think it might be simpler (and easier to maintain) if trimming and demultiplexing functionality were in separate tools.


### Base-space problems

I've encountered a couple of issues where adapter sequences are not properly basecalled, resulting in inconsistent sequence. Porechop trims in base-space, so this is a somewhat intractable problem. See [issue #40](https://github.com/rrwick/Porechop/issues/40) for an example.

These have made me wonder if the adapter trimming should be done in signal-space instead, though that would be a more complex problem to solve. I hope that in the future ONT can integrate this kind of functionality directly into their basecallers.




# Acknowledgements

Porechop was inspired by (and largely coded during) [Porecamp Australia 2017](https://porecamp-au.github.io/). Thanks to the organisers and attendees who helped me realise that a Nanopore adapter trimmer might be a useful tool! I later met David Stoddart from Oxford Nanopore at London Calling 2017, and he helped me get many of the adapter sequences right.
Expand Down
57 changes: 42 additions & 15 deletions porechop/adapters.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,9 @@ def get_barcode_name(self):
Gets the barcode name for the output files. We want a concise name, so it looks at all
options and chooses the shortest.
"""
possible_names = [self.name, self.start_sequence[0]]
possible_names = [self.name]
if self.start_sequence:
possible_names.append(self.start_sequence[0])
if self.end_sequence:
possible_names.append(self.end_sequence[0])
barcode_name = sorted(possible_names, key=lambda x: len(x))[0]
Expand Down Expand Up @@ -76,28 +78,36 @@ def get_barcode_name(self):
start_sequence=('SQK-NSK007_Y_Top', 'AATGTACTTCGTTCAGTTACGTATTGCT'),
end_sequence=('SQK-NSK007_Y_Bottom', 'GCAATACGTAACTGAACGAAGT')),


Adapter('Rapid',
start_sequence=('Rapid_adapter', 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
start_sequence=('Rapid_adapter',
'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),

Adapter('RBK004_upstream',
start_sequence=('RBK004_upstream', 'AATGTACTTCGTTCAGTTACGGCTTGGGTGTTTAACC')),


Adapter('SQK-MAP006',
start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
start_sequence=('SQK-MAP006_Y_Top_SK63', 'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
end_sequence= ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),

Adapter('SQK-MAP006 Short',
Adapter('SQK-MAP006 short',
start_sequence=('SQK-MAP006_Short_Y_Top_LI32', 'CGGCGTCTGCTTGGGTGTTTAACCT'),
end_sequence= ('SQK-MAP006_Short_Y_Bottom_LI33', 'GGTTAAACACCCAAGCAGACGCCG')),


# The PCR adapters are used both in PCR DNA kits and some cDNA kits.
Adapter('PCR adapters 1',
start_sequence=('PCR_1_start', 'ACTTGCCTGTCGCTCTATCTTC'),
end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),
end_sequence= ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),

Adapter('PCR tail 1',
start_sequence=('PCR_tail_1_start', 'TTAACCTTTCTGTTGGTGCTGATATTGC'),
end_sequence= ('PCR_tail_1_end', 'GCAATATCAGCACCAACAGAAAGGTTAA')),
Adapter('PCR adapters 2',
start_sequence=('PCR_2_start', 'TTTCTGTTGGTGCTGATATTGC'),
end_sequence= ('PCR_2_end', 'GCAATATCAGCACCAACAGAAA')),

Adapter('PCR tail 2',
start_sequence=('PCR_tail_2_start', 'TTAACCTACTTGCCTGTCGCTCTATCTTC'),
end_sequence= ('PCR_tail_2_end', 'GAAGATAGAGCGACAGGCAAGTAGGTTAA')),
Adapter('PCR adapters 3',
start_sequence=('PCR_3_start', 'TACTTGCCTGTCGCTCTATCTTC'),
end_sequence= ('PCR_3_end', 'GAAGATAGAGCGACAGGCAAGTA')),


# 1D^2 kit adapters are interesting. ONT provided the following sequences on their site:
Expand All @@ -117,6 +127,11 @@ def get_barcode_name(self):
# adapter sequences here.


Adapter('cDNA SSP',
start_sequence=('cDNA_SSP', 'TTTCTGTTGGTGCTGATATTGCTGCCATTACGGCCGGG'),
end_sequence= ('cDNA_SSP_rev', 'CCCGGCCGTAATGGCAGCAATATCAGCACCAACAGAAA')),


# Some barcoding kits (like the native barcodes) use the rev comp barcode at the start
# of the read and the forward barcode at the end of the read.
Adapter('Barcode 1 (reverse)',
Expand Down Expand Up @@ -461,11 +476,23 @@ def make_full_native_barcode_adapter(barcode_num):
end_sequence=('NB' + '%02d' % barcode_num + '_end', end_full_seq))


def make_full_rapid_barcode_adapter(barcode_num):
def make_old_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK001
barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
start_barcode_seq = barcode.start_sequence[1]

start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'TATTGCT' + start_barcode_seq + \
'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'

return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, old)',
start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))


def make_new_full_rapid_barcode_adapter(barcode_num): # applies to SQK-RBK004
barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
start_barcode_seq = barcode.start_sequence[1]

start_full_seq = 'AATGTACTTCGTTCAGTTACGTATTGCT' + start_barcode_seq + 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'GCTTGGGTGTTTAACC' + start_barcode_seq + \
'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'

return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence)',
return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, new)',
start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))
27 changes: 27 additions & 0 deletions porechop/include/seqan/LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
Copyright (c) 2006-2018, Knut Reinert, FU Berlin
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Knut Reinert or the FU Berlin nor the names of
its contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL KNUT REINERT OR THE FU BERLIN BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.

5 changes: 4 additions & 1 deletion porechop/include/seqan/align.h
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// ==========================================================================
// SeqAn - The Library for Sequence Analysis
// ==========================================================================
// Copyright (c) 2006-2016, Knut Reinert, FU Berlin
// Copyright (c) 2006-2018, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
Expand Down Expand Up @@ -53,6 +53,7 @@
#include <algorithm>

#include <seqan/basic.h>
#include <seqan/simd.h>
#include <seqan/modifier.h> // ModifiedAlphabet<>.
#include <seqan/align/align_metafunctions.h>
#include <seqan/graph_align.h> // TODO(holtgrew): We should not have to depend on this.
Expand All @@ -76,6 +77,8 @@

#include <seqan/align/fragment.h>

#include <seqan/align/aligned_sequence_concept.h>

#include <seqan/align/gaps_base.h>
#include <seqan/align/gaps_iterator_base.h>

Expand Down
15 changes: 12 additions & 3 deletions porechop/include/seqan/align/align_base.h
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// ==========================================================================
// SeqAn - The Library for Sequence Analysis
// ==========================================================================
// Copyright (c) 2006-2016, Knut Reinert, FU Berlin
// Copyright (c) 2006-2018, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
Expand Down Expand Up @@ -476,7 +476,7 @@ detach(Align<TSource, TSpec> & me)

template <typename TFile, typename TSource, typename TSpec>
inline void
write(TFile & target,
_write(TFile & target,
Align<TSource, TSpec> const & source)
{
typedef Align<TSource, TSpec> const TAlign;
Expand Down Expand Up @@ -557,6 +557,15 @@ write(TFile & target,
writeValue(target, '\n');
}

template <typename TFile, typename TSource, typename TSpec>
[[deprecated("Old-style I/O. Use stream operator << instead.")]]
inline void
write(TFile & target,
Align<TSource, TSpec> const & source)
{
_write(target, source);
}

// ----------------------------------------------------------------------------
// Function clearClipping()
// ----------------------------------------------------------------------------
Expand Down Expand Up @@ -607,7 +616,7 @@ operator<<(TStream & target,
Align<TSource, TSpec> const & source)
{
typename DirectionIterator<TStream, Output>::Type it = directionIterator(target, Output());
write(it, source);
_write(it, source);
return target;
}

Expand Down
2 changes: 1 addition & 1 deletion porechop/include/seqan/align/align_cols.h
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// ==========================================================================
// SeqAn - The Library for Sequence Analysis
// ==========================================================================
// Copyright (c) 2006-2016, Knut Reinert, FU Berlin
// Copyright (c) 2006-2018, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
Expand Down
2 changes: 1 addition & 1 deletion porechop/include/seqan/align/align_config.h
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// ==========================================================================
// SeqAn - The Library for Sequence Analysis
// ==========================================================================
// Copyright (c) 2006-2016, Knut Reinert, FU Berlin
// Copyright (c) 2006-2018, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
Expand Down
57 changes: 38 additions & 19 deletions porechop/include/seqan/align/align_interface_wrapper.h
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// ==========================================================================
// SeqAn - The Library for Sequence Analysis
// ==========================================================================
// Copyright (c) 2006-2016, Knut Reinert, FU Berlin
// Copyright (c) 2006-2018, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
Expand Down Expand Up @@ -58,14 +58,20 @@ namespace seqan
// Function _alignWrapperSequential(); Score; StringSet vs. StringSet
// ----------------------------------------------------------------------------

template <typename TString1, typename TSpec1,
typename TString2, typename TSpec2,
template <typename TSetH,
typename TSetV,
typename TScoreValue, typename TScoreSpec,
typename TAlignConfig,
typename TGapModel>
typename TGapModel,
std::enable_if_t<And<And<Is<ContainerConcept<TSetH>>,
Is<ContainerConcept<typename Value<TSetH>::Type>>>,
And<Is<ContainerConcept<TSetV>>,
Is<ContainerConcept<typename Value<TSetV>::Type>>>
>::VALUE,
int> = 0>
inline auto
_alignWrapperSequential(StringSet<TString1, TSpec1> const & stringsH,
StringSet<TString2, TSpec2> const & stringsV,
_alignWrapperSequential(TSetH const & stringsH,
TSetV const & stringsV,
Score<TScoreValue, TScoreSpec> const & scoringScheme,
TAlignConfig const & config,
TGapModel const & /*gaps*/)
Expand All @@ -92,14 +98,20 @@ _alignWrapperSequential(StringSet<TString1, TSpec1> const & stringsH,
// Function _alignWrapperSequential(); Score; String vs. StringSet
// ----------------------------------------------------------------------------

template <typename TString1,
typename TString2, typename TSpec,
template <typename TSeqH,
typename TSetV,
typename TScoreValue, typename TScoreSpec,
typename TAlignConfig,
typename TGapModel>
typename TGapModel,
std::enable_if_t<And<And<Is<ContainerConcept<TSeqH>>,
Not<Is<ContainerConcept<typename Value<TSeqH>::Type>>>>,
And<Is<ContainerConcept<TSetV>>,
Is<ContainerConcept<typename Value<TSetV>::Type>>>
>::VALUE,
int> = 0>
inline auto
_alignWrapperSequential(TString1 const & stringH,
StringSet<TString2, TSpec> const & stringsV,
_alignWrapperSequential(TSeqH const & stringH,
TSetV const & stringsV,
Score<TScoreValue, TScoreSpec> const & scoringScheme,
TAlignConfig const & config,
TGapModel const & /*gaps*/)
Expand All @@ -125,21 +137,27 @@ _alignWrapperSequential(TString1 const & stringH,
// Function _alignWrapperSequential(); Gaps
// ----------------------------------------------------------------------------

template <typename TSequenceH, typename TGapsSpecH, typename TSetSpecH,
typename TSequenceV, typename TGapsSpecV, typename TSetSpecV,
template <typename TSetH,
typename TSetV,
typename TScoreValue, typename TScoreSpec,
typename TAlignConfig,
typename TGapModel>
typename TGapModel,
std::enable_if_t<And<And<Is<ContainerConcept<TSetH>>,
Is<AlignedSequenceConcept<typename Value<TSetH>::Type>>>,
And<Is<ContainerConcept<TSetV>>,
Is<AlignedSequenceConcept<typename Value<TSetV>::Type>>>
>::VALUE,
int> = 0>
inline auto
_alignWrapperSequential(StringSet<Gaps<TSequenceH, TGapsSpecH>, TSetSpecH> & gapSeqSetH,
StringSet<Gaps<TSequenceV, TGapsSpecV>, TSetSpecV> & gapSeqSetV,
_alignWrapperSequential(TSetH & gapSeqSetH,
TSetV & gapSeqSetV,
Score<TScoreValue, TScoreSpec> const & scoringScheme,
TAlignConfig const & config,
TGapModel const & /*gaps*/)

{
typedef typename Size<TSequenceH>::Type TSize;
typedef typename Position<TSequenceH>::Type TPosition;
typedef typename Size<TSetH>::Type TSize;
typedef typename Position<TSetH>::Type TPosition;
typedef TraceSegment_<TPosition, TSize> TTraceSegment;

String<TScoreValue> results;
Expand Down Expand Up @@ -167,7 +185,8 @@ _alignWrapperSequential(StringSet<Gaps<TSequenceH, TGapsSpecH>, TSetSpecH> & gap
template <typename... TArgs>
inline auto _alignWrapper(TArgs && ...args)
{
#ifdef SEQAN_SIMD_ENABLED
// NOTE(marehr): ume_simd is currently not working, thus falling back to sequential case
#if defined(SEQAN_SIMD_ENABLED) && !defined(SEQAN_UMESIMD_ENABLED)
return _alignWrapperSimd(std::forward<TArgs>(args)...);
#else
return _alignWrapperSequential(std::forward<TArgs>(args)...);
Expand Down
2 changes: 1 addition & 1 deletion porechop/include/seqan/align/align_iterator_base.h
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// ==========================================================================
// SeqAn - The Library for Sequence Analysis
// ==========================================================================
// Copyright (c) 2006-2016, Knut Reinert, FU Berlin
// Copyright (c) 2006-2018, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
Expand Down
Loading

0 comments on commit 49bd5fd

Please sign in to comment.