Finish Release-2.3.0
Added new feature to filter seq tool so that it handles paired end input.  NOTE that this has required a change in the interface.

Updated documentation

Tweaked the makefiles to remove some redundancy in the compiler flags.
Daniel Mapleson committed Dec 6, 2016
2 parents 6d72b35 + 10157cd commit d2826fe
Showing 17 changed files with 378 additions and 301 deletions.
33 changes: 32 additions & 1 deletion NEWS
@@ -1,9 +1,40 @@

KAT Changelog

==========================================

V2.3.0 - ????

Modified sequence filtering tool so that it can now handle filtering of paired end reads.

==========================================

V2.2.0 - 28th October 2016

Improved README and documentation

Better checking of python dependencies

Enforcing static linking of kat_jellyfish and kat libraries to executable

Better checking of sequence files, now kat can detect fasta and fastq files for kmer counting, even without a known extension.


==========================================

V2.1.1 - 26th July 2016

Updated documentation

Removed dependency on sphinx extensions in order to keep both readthedocs and homebrew happy.

Fixed representation of labels and plotting coordinates based on whether the matrix should be interpreted transposed or not.

Added shebang to python library files, in order for homebrew to be happy including them in the bin directory.

===========================================

V2.1.0 - ????
V2.1.0 - 21st June 2016

Added filtering tools for filtering k-mer hashes and sequences based on presence
or absence of a set of k-mers.
2 changes: 1 addition & 1 deletion README.md
@@ -118,7 +118,7 @@ If you use KAT in your work and wish to cite us please use the following citatio

Daniel Mapleson, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, and Bernardo J. Clavijo.
**KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies.**
Bioinformatics, 2016. doi: 10.1093/bioinformatics/btw663
Bioinformatics, 2016. [doi: 10.1093/bioinformatics/btw663](http://bioinformatics.oxfordjournals.org/content/early/2016/10/20/bioinformatics.btw663.abstract)


##Authors:
2 changes: 1 addition & 1 deletion configure.ac
@@ -4,7 +4,7 @@

# Autoconf initialistion. Sets package name version and contact details
AC_PREREQ([2.68])
AC_INIT([kat],[2.2.0],[https://github.com/TGAC/KAT/issues],[kat],[https://github.com/TGAC/KAT])
AC_INIT([kat],[2.3.0],[https://github.com/TGAC/KAT/issues],[kat],[https://github.com/TGAC/KAT])
AC_CONFIG_SRCDIR([src/kat.cc])
AC_CONFIG_AUX_DIR([build-aux])
AC_CONFIG_MACRO_DIR([m4])
17 changes: 9 additions & 8 deletions deps/jellyfish-2.2.0/Makefile.am
@@ -8,16 +8,16 @@ pkgconfigdir = $(libdir)/pkgconfig
pkgconfig_DATA = kat_jellyfish.pc

AM_LDFLAGS = -lpthread # $(VALGRIND_LIBS)
AM_CPPFLAGS = -Wall -Wnon-virtual-dtor -Wno-deprecated-declarations -I$(top_srcdir) -I$(top_srcdir)/include -g -O3 $(VALGRIND_CFLAGS)
AM_CXXFLAGS = $(ALL_CXXFLAGS) -g -O3
AM_CPPFLAGS = -I$(top_srcdir) -I$(top_srcdir)/include $(VALGRIND_CFLAGS)
AM_CXXFLAGS = $(ALL_CXXFLAGS) -Wall -Wnon-virtual-dtor -Wno-deprecated-declarations -g -O3

noinst_HEADERS = $(YAGGO_SOURCES)
bin_PROGRAMS =
dist_bin_SCRIPTS =
data_DATA =
BUILT_SOURCES = $(YAGGO_SOURCES)
CLEANFILES =
DISTCLEANFILES =
DISTCLEANFILES =

# Yaggo automatic rules with silencing
V_YAGGO = $(V_YAGGO_$(V))
@@ -51,7 +51,6 @@ bin_kat_jellyfish_SOURCES = sub_commands/jellyfish.cc \
jellyfish/merge_files.cc
bin_kat_jellyfish_LDFLAGS = $(AM_LDFLAGS) $(STATIC_FLAGS)


YAGGO_SOURCES += sub_commands/count_main_cmdline.hpp \
sub_commands/info_main_cmdline.hpp \
sub_commands/dump_main_cmdline.hpp \
@@ -181,7 +180,6 @@ bin_test_all_SOURCES = unit_tests/test_main.cc \
unit_tests/test_offsets_key_value.cc \
unit_tests/test_simple_circular_buffer.cc \
unit_tests/test_rectangular_binary_matrix.cc \
unit_tests/test_mer_dna.cc \
unit_tests/test_large_hash_array.cc \
unit_tests/test_mer_overlap_sequence_parser.cc \
unit_tests/test_file_header.cc \
@@ -193,18 +191,21 @@ bin_test_all_SOURCES = unit_tests/test_main.cc \
unit_tests/test_text_dumper.cc \
unit_tests/test_dumpers.cc \
unit_tests/test_mapped_file.cc \
unit_tests/test_int128.cc \
unit_tests/test_mer_dna_bloom_counter.cc \
unit_tests/test_whole_sequence_parser.cc \
unit_tests/test_allocators_mmap.cc \
unit_tests/test_cooperative_pool2.cc \
unit_tests/test_generator_manager.cc \
unit_tests/test_atomic_bits_array.cc \
unit_tests/test_stdio_filebuf.cc

#unit_tests/test_mer_dna.cc
#unit_tests/test_int128.cc

bin_test_all_SOURCES += jellyfish/backtrace.cc

bin_test_all_CPPFLAGS = -DJSON_IS_AMALGAMATION=1
bin_test_all_CXXFLAGS = $(AM_CXXFLAGS) -I$(top_srcdir)/unit_tests/gtest/include -I$(top_srcdir)/unit_tests -I$(top_srcdir)/include
bin_test_all_CPPFLAGS = -DJSON_IS_AMALGAMATION=1 -I$(top_srcdir)/unit_tests/gtest/include -I$(top_srcdir)/unit_tests -I$(top_srcdir)/include
bin_test_all_CXXFLAGS = $(ALL_CXXFLAGS) -g -O3
bin_test_all_LDADD = libgtest.la $(LDADD)
YAGGO_SOURCES += unit_tests/test_main_cmdline.hpp
noinst_HEADERS += unit_tests/test_main.hpp
1 change: 0 additions & 1 deletion deps/jellyfish-2.2.0/configure.ac
@@ -9,7 +9,6 @@ AC_PROG_LIBTOOL

# Change default compilation flags
AC_SUBST([ALL_CXXFLAGS], [-std=c++0x])
CXXFLAGS="-std=c++0x $CXXFLAGS"
AC_LANG(C++)
AC_PROG_CXX

1 change: 1 addition & 0 deletions doc/Makefile
@@ -45,6 +45,7 @@ DISTFILES = $(DPS)/index.rst \
$(DPS)/kmer.rst \
$(DPS)/using.rst \
$(DPS)/walkthrough.rst \
$(DPS)/faq.rst \
$(DPI)/ccoli_comp.png \
$(DPI)/ccoli_gcp.png \
$(DPI)/ccoli_hist.png \
1 change: 1 addition & 0 deletions doc/Makefile.in
@@ -45,6 +45,7 @@ DISTFILES = $(DPS)/index.rst \
$(DPS)/kmer.rst \
$(DPS)/using.rst \
$(DPS)/walkthrough.rst \
$(DPS)/faq.rst \
$(DPI)/ccoli_comp.png \
$(DPI)/ccoli_gcp.png \
$(DPI)/ccoli_hist.png \
4 changes: 2 additions & 2 deletions doc/source/conf.py
@@ -52,9 +52,9 @@
# built documents.
#
# The short X.Y version.
version = '2.2.0'
version = '2.3.0'
# The full version, including alpha/beta/rc tags.
release = '2.2.0'
release = '2.3.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
44 changes: 44 additions & 0 deletions doc/source/faq.rst
@@ -0,0 +1,44 @@

.. _faq:

Frequently Asked Questions
==========================

Can KAT handle gzipped sequence files?
--------------------------------------

Yes, but only via named pipes. For example, say we wanted to run ``kat hist`` on a
gzipped paired end dataset; we can use a named pipe to do this as follows::

mkfifo pe_dataset.fq && gunzip -c pe_dataset_?.fq.gz > pe_dataset.fq &
kat hist -o pe_dataset.hist pe_dataset.fq

Where ``pe_dataset_?.fq.gz`` represents ``pe_dataset_1.fq.gz`` and ``pe_dataset_2.fq.gz``.

For those unfamiliar with named pipes, the first line creates a named pipe (a special
file) in your working directory called ``pe_dataset.fq`` and then, in the background,
writes the gunzipped data into it, so anything reading from the named pipe receives
decompressed data. To be clear, this means you do not have to decompress the gzipped
files to disk; decompression happens on the fly as the data is consumed by KAT.

Thanks to John Davey for suggesting this.
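
The same on-the-fly decompression can be sketched in Python. This is an illustrative
sketch only and is not part of KAT: the function names ``stream_gzip_through_fifo`` and
``read_via_fifo`` are hypothetical, and the reader in the main thread merely stands in for
a consumer such as ``kat hist``. It assumes a POSIX system, since ``os.mkfifo`` is not
available on Windows.

```python
import gzip
import os
import shutil
import tempfile
import threading


def stream_gzip_through_fifo(gz_paths, fifo_path):
    """Decompress each gzipped file in turn and write it into the named pipe."""
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "wb") as fifo:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as src:
                shutil.copyfileobj(src, fifo)


def read_via_fifo(gz_paths):
    """Return the concatenated decompressed bytes, read through a named pipe."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "pe_dataset.fq")
    os.mkfifo(fifo_path)  # equivalent of `mkfifo pe_dataset.fq`
    writer = threading.Thread(
        target=stream_gzip_through_fifo, args=(gz_paths, fifo_path)
    )
    writer.start()  # equivalent of running `gunzip -c ... > pe_dataset.fq &`
    try:
        # The reader plays the role of the consuming program.
        with open(fifo_path, "rb") as reader:
            data = reader.read()
    finally:
        writer.join()
        shutil.rmtree(workdir)
    return data
```

Nothing is decompressed to disk here: the writer thread only produces data as fast as
the reader drains the pipe, which mirrors the behaviour described above.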


Why is jellyfish bundled with KAT?
----------------------------------

We require a stable interface to the k-mer hash arrays produced by jellyfish; hence,
we are reliant on a particular version of jellyfish to guarantee that KAT works
correctly. Instead of potentially requiring the user to install multiple jellyfish instances
on their machine, we bundle our own version, with all jellyfish binaries prefixed
with ``kat_`` in order to avoid any naming clashes with official jellyfish releases.


I downloaded a release from github but it doesn't contain the configure script. What gives?
--------------------------------------------------------------------------------------------

GitHub offers a feature which allows you to download source code bundles for all
releases. However, these bundles do not contain the distributable form of KAT, i.e.
they are not produced by calling ``make dist``. Although they come from the same
place, you can distinguish the proper distributable form of KAT from the GitHub
bundles because it has the name ``kat-x.x.x.tar.gz``.
12 changes: 9 additions & 3 deletions doc/source/index.rst
@@ -25,6 +25,7 @@ The K-mer counting itself, a critical element for all KAT tools, is accomplished
using
kmer
walkthrough
faq

.. _system:

@@ -46,7 +47,9 @@ Citing
The KAT paper is currently in submission. In the meantime, if you use our software
and wish to cite us please use our bioRxiv preprint:

`Daniel Mapleson et al. 2016. KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies. bioRxiv doi: 10.1101/064733 <http://biorxiv.org/content/early/2016/07/19/064733>`_
Daniel Mapleson, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, and Bernardo J. Clavijo.
**KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies.**
Bioinformatics, 2016. `doi: 10.1093/bioinformatics/btw663 <http://bioinformatics.oxfordjournals.org/content/early/2016/10/20/bioinformatics.btw663.abstract>`_



@@ -55,8 +58,11 @@ and wish to cite us please use our bioRxiv preprint:
Issues
======

Should you discover any issues with spectre, or wish to request a new feature please raise a `ticket here <https://github.com/TGAC/KAT/issues>`_.
Alternatively, contact Daniel Mapleson at: [email protected]; or Bernardo Clavijo at: [email protected]
Should you discover any issues with KAT, or wish to request a new feature please raise a `ticket here <https://github.com/TGAC/KAT/issues>`_.
Alternatively, contact Daniel Mapleson at: [email protected]; or Bernardo Clavijo at: [email protected].
However, please consult the :ref:`Frequently Asked Questions <faq>` page first in case your
question is already answered there.



.. _availability:
9 changes: 5 additions & 4 deletions doc/source/using.rst
@@ -1,12 +1,12 @@
.. _using:

Using KAT
================
=========

KAT is a C++ program containing a number of subtools which can be used in
isolation or as part of a pipeline. Typing ``kat --help`` will show a
list of the available subtools. Each subtool has its own help system which you
can access by typing ``kat <subtool> --help``.
can access by typing ``kat <subtool> --help``.


HIST
@@ -153,14 +153,15 @@ Sequence filtering
The user loads a k-mer hash and then filters sequences (either in or out) depending
on whether those sequences contain the k-mer or not. The user can also apply a
threshold requiring X% of k-mers to be in the sequence before filtering is applied.
The user can also use this tool for filtering paired end reads, and for subsampling.

Basic usage::

kat filter seq [options] <seq_file> (<k-mer_hash>)
kat filter seq [options] --seq <seq_file> <k-mer_hash>

Applications:

* Contamination extraction from read file or assembly file
* Contamination extraction from read files or assembly files, extraction of organelles, subsampling high-coverage regions


Plotting tools
13 changes: 11 additions & 2 deletions doc/source/walkthrough.rst
@@ -209,7 +209,7 @@ will not be very useful for scaffolding.
Contamination detection and extraction
--------------------------------------

Breaking WGS data into k-mers provides a nice way of identifying contamination, or
Breaking WGS data into k-mers provides a nice way of identifying contamination, organelles or
otherwise unexpected content, in your reads or assemblies. This section will walk
you through how you might be able to identify and extract contamination in your
data.
@@ -260,10 +260,19 @@ This produces a k-mer hash containing only those k-mer found in the defined regi
We can get the reads (or assembled contigs) associated with these k-mers by
running the following command::

kat filter seq --threshold=0.5 <path_to_seq_file_to_filter> <filtered_k-mer hash>
kat filter seq --threshold=0.5 --seq=<path_to_seq_file_to_filter> --seq2=<path_to_seq_file_to_filter_2> <filtered_k-mer hash>

The example above assumes you want to filter a paired end library, although if you
want to filter single end data or an assembly you can do this by simply dropping
the ``--seq2`` option.

BLASTing some of the sequences removed by the filtering might then identify the contaminant.

You can also use this tool for subsampling the extracted data. This can be useful
for reducing the representation of highly expressed sequences. To do this, add the
``--frequency`` option and set a value indicating what fraction of the reads to keep:
1.0 keeps all, 0.0 discards all, and 0.5 keeps roughly half of the sequences.
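
To make the interplay between the threshold and the frequency concrete, here is a
minimal Python sketch. It is not KAT's implementation: the function names, the rule that
a pair matches when either mate reaches the threshold, and the use of a seeded random
number generator are all assumptions made for illustration. It also ignores reverse
complements, which a real k-mer tool would normally take into account.

```python
import random


def kmer_hit_fraction(seq, kmers, k):
    """Fraction of the k-mers in seq that are present in the kmers set."""
    total = len(seq) - k + 1
    if total <= 0:
        return 0.0  # sequence is shorter than k, so it contains no k-mers
    hits = sum(1 for i in range(total) if seq[i:i + k] in kmers)
    return hits / total


def filter_pairs(pairs, kmers, k, threshold=0.5, frequency=1.0, seed=0):
    """Keep matching read pairs, subsampled at the given frequency.

    Assumption: a pair matches if either mate has at least `threshold`
    of its k-mers in the set. Matching pairs are then kept with
    probability `frequency` (1.0 keeps all, 0.0 keeps none).
    """
    rng = random.Random(seed)
    kept = []
    for read1, read2 in pairs:
        best = max(
            kmer_hit_fraction(read1, kmers, k),
            kmer_hit_fraction(read2, kmers, k),
        )
        # rng.random() lies in [0, 1), so frequency=1.0 always keeps a match.
        if best >= threshold and rng.random() < frequency:
            kept.append((read1, read2))
    return kept
```

Keeping or discarding both mates together is what keeps the two output files in sync,
which is the reason paired end input needs to be handled as pairs rather than as two
independent files.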


In assemblies
~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion lib/Makefile.am
@@ -5,7 +5,7 @@ pkgconfig_DATA = kat.pc

lib_LTLIBRARIES = libkat.la

libkat_la_LDFLAGS = -version-info 2:2:0
libkat_la_LDFLAGS = -version-info 2:3:0
libkat_la_SOURCES = \
src/gnuplot_i.cc \
src/matrix_metadata_extractor.cc \
2 changes: 1 addition & 1 deletion lib/src/gnuplot_i.cc
@@ -409,7 +409,7 @@ Gnuplot::~Gnuplot()
#elif defined(unix) || defined(__unix) || defined(__unix__) || defined(__APPLE__)
if (pclose(gnucmd) == -1)
#endif
throw GnuplotException("Problem closing communication to gnuplot");
std::cerr << "Problem closing communication to gnuplot" << std::endl;
}


2 changes: 1 addition & 1 deletion src/Makefile.am
@@ -5,7 +5,7 @@ AUTOMAKE_OPTIONS = subdir-objects
bin_PROGRAMS = kat

# Executable inputs
kat_CXXFLAGS = -g -O3 -fwrapv -Wall -Wextra -Wno-deprecated-declarations -Wno-unused-function -Wno-unused-parameter -Wno-unused-variable -Wno-unused-command-line-argument -ansi -pedantic -std=c++11 @AM_CXXFLAGS@ @CXXFLAGS@
kat_CXXFLAGS = -g -O3 -fwrapv -Wall -Wextra -Wno-deprecated-declarations -Wno-unused-function -Wno-unused-parameter -Wno-unused-variable -Wno-unused-command-line-argument -ansi -pedantic -std=c++11 @AM_CXXFLAGS@

kat_CPPFLAGS = \
-isystem $(top_srcdir)/deps/seqan-library-2.0.0/include \