Finish Release-2.3.2
KAT now allows users to pipe in gzipped files via process substitution.

Updated distribution analysis script to make it more robust and added calculations for estimated genome size and level of heterozygous content.

Fixed a bug in spectra hist plot title.

Fixed a bug in the help message for '-n' option in kat comp.
Daniel Mapleson committed Mar 7, 2017
2 parents acbaa32 + 4a2eb07 commit 7666a08
Showing 23 changed files with 807 additions and 437 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -40,3 +40,4 @@ Makefile.in
*.pc
*.la
*.out
/tags
3 changes: 2 additions & 1 deletion configure.ac
@@ -4,7 +4,7 @@

# Autoconf initialisation. Sets package name, version and contact details
AC_PREREQ([2.68])
AC_INIT([kat],[2.3.1],[https://github.com/TGAC/KAT/issues],[kat],[https://github.com/TGAC/KAT])
AC_INIT([kat],[2.3.2],[https://github.com/TGAC/KAT/issues],[kat],[https://github.com/TGAC/KAT])
AC_CONFIG_SRCDIR([src/kat.cc])
AC_CONFIG_AUX_DIR([build-aux])
AC_CONFIG_MACRO_DIR([m4])
@@ -176,6 +176,7 @@ else
if [[ -z "${BOOST_TIMER_STATIC_LIB}" ]] || [[ -z "${BOOST_CHRONO_STATIC_LIB}" ]] || [[ -z "${BOOST_FILESYSTEM_STATIC_LIB}" ]] || [[ -z "${BOOST_PROGRAM_OPTIONS_STATIC_LIB}" ]] || [[ -z "${BOOST_SYSTEM_STATIC_LIB}" ]]; then
AC_MSG_WARN([Not all static boost libraries could be found. Will use dynamic libraries instead.])
BOOST_LIBS="${BOOST_DYN_LIBS}"
dynboost="yes"
else
BOOST_LIBS="${BOOST_STATIC_LIBS}"
fi
4 changes: 2 additions & 2 deletions doc/source/conf.py
@@ -52,9 +52,9 @@
# built documents.
#
# The short X.Y version.
version = '2.3.1'
version = '2.3.2'
# The full version, including alpha/beta/rc tags.
release = '2.3.1'
release = '2.3.2'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
19 changes: 15 additions & 4 deletions doc/source/faq.rst
@@ -4,10 +4,11 @@
Frequently Asked Questions
==========================

Can KAT handle gzipped sequence files?
--------------------------------------
Can KAT handle compressed sequence files?
-----------------------------------------

Yes, but only via named pipes. For example, say we wanted to run ``kat hist`` using
Yes, via named pipes. Anonymous named pipes (process substitution) are also supported.
For example, say we wanted to run ``kat hist`` using
a gzipped paired-end dataset; we can use a named pipe to do this as follows::

mkfifo pe_dataset.fq && gunzip -c pe_dataset_?.fq.gz > pe_dataset.fq &
@@ -21,7 +22,17 @@ consuming from the named pipe will take data that has been gunzipped first. To be
clear, this means you do not have to decompress the gzipped files to disk; this happens
on the fly as the data is consumed by KAT.

Thanks to John Davey for suggesting this.
Alternatively, using process substitution we could write the previous example more
concisely in a single line like this::

kat hist -o oe_dataset.hist <(gunzip -c pe_dataset_?.fq.gz)

As a more complex example, the KAT comp tool can be driven in spectra-cn mode using
both compressed paired end reads and a compressed assembly as follows::

kat comp -o oe_spectra_cn <(gunzip -c pe_dataset_?.fq.gz) <(gunzip -c asm.fa.gz)

Thanks to John Davey and Torsten Seemann for suggesting this.


Why is jellyfish bundled with KAT?
Binary file added doc/source/images/distanalysis_console.png
Binary file added doc/source/images/distanalysis_plot.png
35 changes: 33 additions & 2 deletions doc/source/walkthrough.rst
@@ -308,8 +308,9 @@ Genome assembly analysis using k-mer spectra
--------------------------------------------

One of the most frequently used tools in KAT is the so-called "assembly spectra
copy number plots" or spectra-cn. We use these as a fast first analysis for assembly coherence
to the data in the reads they are representing. Basically we represent how many elements
copy number plots" or spectra-cn. We use these as a fast first analysis to check
assembly coherence against
the content within reads that were used to produce the assembly. Basically we represent how many elements
of each frequency on the read’s spectrum ended up not included in the assembly, included
once, included twice etc.

@@ -374,6 +375,36 @@ duplications, inclusion of extra variation, etc:
:scale: 33%


Distribution decomposition analysis
-----------------------------------

It's useful to be able to fit distributions to each peak in a k-mer histogram or spectra-cn matrix
in order to work out how many distinct k-mers can be associated with those distributions. By counting
k-mers in this way we can make predictions about genome size, heterozygosity rates (if diploid) and
assembly completeness. To this end we bundle a script with KAT called ``kat_distanalysis.py``. It takes
either a spectra-cn matrix file or a KAT histogram file as input, then identifies peaks
and fits a distribution to each one. For spectra-cn matrix files it also identifies peaks
for each copy number in the assembly.

The user can help to get correct predictions out of the tool by providing an approximate frequency for
the homozygous part of the distribution. By default, this is assumed to be the last peak. For example,
this command::

kat_distanalysis.py --plot spectra-cn.mx

might produce the following output for this tetraploid genome:

.. image:: images/distanalysis_console.png
:scale: 100%

.. image:: images/distanalysis_plot.png
:scale: 100%
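
As a rough illustration of how counting k-mers can lead to a genome size prediction, the sketch below applies a common back-of-the-envelope estimate: total k-mer occurrences divided by the homozygous peak frequency. It is not taken from ``kat_distanalysis.py``, which fits distributions to each peak. The function name `estimateGenomeSize`, the `minFreq` error cutoff and the histogram layout are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <map>

// Hypothetical helper (not part of KAT): estimates genome size from a k-mer
// histogram that maps frequency -> number of distinct k-mers seen that often.
// Frequencies below minFreq are ignored as probable sequencing errors.
double estimateGenomeSize(const std::map<uint32_t, uint64_t>& hist,
                          uint32_t homozygousPeak,
                          uint32_t minFreq) {
    if (homozygousPeak == 0) {
        return 0.0;  // no usable peak, so no estimate
    }
    double totalOccurrences = 0.0;
    for (const auto& entry : hist) {
        if (entry.first < minFreq) {
            continue;  // skip the low-frequency error tail
        }
        // Each distinct k-mer at this frequency contributes `frequency` occurrences.
        totalOccurrences += static_cast<double>(entry.first) *
                            static_cast<double>(entry.second);
    }
    // Dividing by the homozygous coverage converts occurrences to genome positions.
    return totalOccurrences / static_cast<double>(homozygousPeak);
}
```

For instance, with 1000 distinct k-mers at frequency 1, 50 at frequency 10 and 100 at frequency 20, a homozygous peak of 20 and a cutoff of 5, the estimate is (10 x 50 + 20 x 100) / 20 = 125. Real data would need a more careful treatment of errors and repeats than this simple cutoff.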






Finding repetitive regions in assemblies
----------------------------------------

1 change: 1 addition & 0 deletions lib/.gitignore
@@ -4,3 +4,4 @@
*.lo
*.o
*.kat-2.1.pc.swp
/tags
1 change: 1 addition & 0 deletions lib/include/.gitignore
@@ -0,0 +1 @@
/tags
1 change: 1 addition & 0 deletions lib/include/kat/.gitignore
@@ -0,0 +1 @@
/tags
6 changes: 6 additions & 0 deletions lib/include/kat/jellyfish_helper.hpp
@@ -169,6 +169,12 @@ namespace kat {
*/
static void printHeader(const file_header& header, ostream& out);

/**
* Checks if path refers to a pipe rather than a real file
* @param filename Path to input file
* @return Whether or not the path refers to a pipe
*/
static bool isPipe(const path& filename);

/**
* Returns whether or not the specified file path looks like it belongs to
1 change: 1 addition & 0 deletions lib/src/.gitignore
@@ -3,3 +3,4 @@
*.o
/.deps/
/.libs/
/tags
9 changes: 5 additions & 4 deletions lib/src/input_handler.cc
@@ -64,13 +64,12 @@ void kat::InputHandler::validateInput() {
}
}

if (!bfs::exists(p)) {
if (!JellyfishHelper::isPipe(p) && !bfs::exists(p)) {
BOOST_THROW_EXCEPTION(InputFileException() << InputFileErrorInfo(string(
"Could not find input file at: ") + p.string() + "; please check the path and try again."));
}

InputMode m = JellyfishHelper::isSequenceFile(p) ? InputMode::COUNT : InputMode::LOAD;

if (start) {
mode = m;
}
@@ -109,8 +108,10 @@ void kat::InputHandler::validateMerLen(const uint16_t merLen) {
string kat::InputHandler::pathString() {

string s;
uint16_t index = 1;
for(auto& p : input) {
s += p.string() + " ";
string msg = JellyfishHelper::isPipe(p) ? string("<pipe>") : p.string();
s += msg + " ";
}
return boost::trim_right_copy(s);
}
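
The masking behaviour of the revised `pathString()` can be sketched without Boost as follows. This is a simplified stand-in, not the KAT implementation: `describeInputs` and `startsWith` are hypothetical names, inputs are plain strings rather than `path` objects, and the pipe test simply mirrors the `/proc` and `/dev` prefix check introduced in this commit.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if `s` begins with `prefix`.
bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical stand-in for pathString(): joins the inputs with single
// spaces, replacing anything that looks like a pipe with "<pipe>".
std::string describeInputs(const std::vector<std::string>& inputs) {
    std::string out;
    for (const auto& p : inputs) {
        const bool pipe = startsWith(p, "/proc") || startsWith(p, "/dev");
        if (!out.empty()) {
            out += ' ';  // separator only between entries, so no trim is needed
        }
        out += pipe ? std::string("<pipe>") : p;
    }
    return out;
}
```

Adding the separator only between entries avoids the trailing space that the original loop removes afterwards with `boost::trim_right_copy`.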
@@ -131,7 +132,7 @@ void kat::InputHandler::count(const uint16_t threads) {
hashCounter = make_shared<HashCounter>(hashSize, merLen * 2, 7, threads);
hashCounter->do_size_doubling(!disableHashGrow);

cout << "Input is a sequence file. Counting kmers for " << pathString() << "...";
cout << "Input " << index << " is a sequence file. Counting kmers for input " << index << " (" << pathString() << ") ...";
cout.flush();

hash = JellyfishHelper::countSeqFile(input, *hashCounter, canonical, threads);
9 changes: 9 additions & 0 deletions lib/src/jellyfish_helper.cc
@@ -50,6 +50,7 @@ using jellyfish::Offsets;
using jellyfish::quadratic_reprobes;

#include <kat/jellyfish_helper.hpp>
#include <boost/algorithm/string/predicate.hpp>
using kat::JellyfishHelper;

/**
@@ -255,6 +256,11 @@ void kat::JellyfishHelper::dumpHash(LargeHashArrayPtr ary, file_header& header,
dumper.dump(ary);
}

bool kat::JellyfishHelper::isPipe(const path& filename) {
return boost::starts_with(filename.string(), "/proc") || boost::starts_with(filename.string(), "/dev");
}


/**
* Returns whether or not the specified file path looks like it belongs to
* a sequence file (either FastA or FastQ). Gzipped sequence files are
@@ -264,6 +270,9 @@ void kat::JellyfishHelper::dumpHash(LargeHashArrayPtr ary, file_header& header,
*/
bool kat::JellyfishHelper::isSequenceFile(const path& filename) {

// If we have a pipe as input then assume we are working with a sequence file
if (JellyfishHelper::isPipe(filename)) return true;

string ext = filename.extension().string();

// Actually we can't handle gz files directly, so turning this off for now
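
The `isPipe` check added in this commit is a path-prefix heuristic: anything beginning with `/proc` or `/dev` is treated as a pipe. The sketch below contrasts that heuristic with one possible alternative, asking the operating system via POSIX `stat` and `S_ISFIFO`. The alternative is an assumption offered for comparison, not something this commit does, and both function names are hypothetical.

```cpp
#include <string>
#include <sys/stat.h>

// Mirrors the prefix heuristic from this commit using std::string only.
// Note that it also matches unrelated paths such as "/devel/reads.fq".
bool looksLikePipeByPrefix(const std::string& p) {
    return p.compare(0, 5, "/proc") == 0 || p.compare(0, 4, "/dev") == 0;
}

// Hypothetical alternative: report whether the path currently resolves to a
// FIFO according to the OS. Returns false if the path cannot be stat'ed.
bool isFifoByStat(const std::string& p) {
    struct stat st;
    if (::stat(p.c_str(), &st) != 0) {
        return false;  // missing path or permission problem
    }
    return S_ISFIFO(st.st_mode);
}
```

A `stat`-based test would also recognise a FIFO created with `mkfifo` in the working directory, which the prefix check does not. It does require the path to be resolvable at the time of the call, so whether it suits KAT's input validation order is an open question.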
13 changes: 10 additions & 3 deletions m4/ax_boost_chrono.m4
@@ -82,6 +82,13 @@ AC_DEFUN([AX_BOOST_CHRONO],
AC_DEFINE(HAVE_BOOST_CHRONO,,[define if the Boost::Chrono library is available])
BOOSTLIBDIR=`echo $BOOST_LDFLAGS | sed -e 's/@<:@^\/@:>@*//'`
ax_lib=""
ax_static_lib=""
no_find="yes"
no_link="yes"
link_chrono="no"
link_chrono_static="no"
LDFLAGS_SAVE=$LDFLAGS
if test "x$ax_boost_user_chrono_lib" = "x"; then
for libextension in `ls $BOOSTLIBDIR/libboost_chrono*.so* $BOOSTLIBDIR/libboost_chrono*.dylib* 2>/dev/null | sed 's,.*/,,' | sed -e 's;^lib\(boost_chrono.*\)\.so.*$;\1;' -e 's;^lib\(boost_chrono.*\)\.dylib.*$;\1;'` ; do
@@ -133,11 +140,11 @@ AC_DEFUN([AX_BOOST_CHRONO],
AC_MSG_ERROR(Could not find any version boost_chrono to link to)
fi
if [[ "$link_chrono" = "no" ]]; then
AC_MSG_WARN(Could not dynamic link against $ax_lib)
if [[ "$link_chrono" == "no" ]]; then
AC_MSG_WARN(Could not dynamic link against boost_chrono)
fi
if [[ "$link_chrono_static" == "no" ]]; then
AC_MSG_WARN(Could not static link against $ax_static_lib)
AC_MSG_WARN(Could not static link against boost_chrono)
fi
if [[ "$no_link" == "yes" ]]; then
AC_MSG_ERROR(Could not link against any boost_chrono lib)
13 changes: 10 additions & 3 deletions m4/ax_boost_filesystem.m4
@@ -84,6 +84,13 @@ AC_DEFUN([AX_BOOST_FILESYSTEM],
AC_DEFINE(HAVE_BOOST_FILESYSTEM,,[define if the Boost::Filesystem library is available])
BOOSTLIBDIR=`echo $BOOST_LDFLAGS | sed -e 's/@<:@^\/@:>@*//'`
ax_lib=""
ax_static_lib=""
no_find="yes"
no_link="yes"
link_filesystem="yes"
link_filesystem_static="yes"
LDFLAGS_SAVE=$LDFLAGS
if test "x$ax_boost_user_filesystem_lib" = "x"; then
for libextension in `ls $BOOSTLIBDIR/libboost_filesystem*.so* $BOOSTLIBDIR/libboost_filesystem*.dylib* 2>/dev/null | sed 's,.*/,,' | sed -e 's;^lib\(boost_filesystem.*\)\.so.*$;\1;' -e 's;^lib\(boost_filesystem.*\)\.dylib.*$;\1;'` ; do
@@ -135,11 +142,11 @@ AC_DEFUN([AX_BOOST_FILESYSTEM],
AC_MSG_ERROR(Could not find any version boost_filesystem to link to)
fi
if [[ "$link_filesystem" = "no" ]]; then
AC_MSG_WARN(Could not dynamic link against $ax_lib)
if [[ "$link_filesystem" == "no" ]]; then
AC_MSG_WARN(Could not dynamic link against boost_filesystem)
fi
if [[ "$link_filesystem_static" == "no" ]]; then
AC_MSG_WARN(Could not static link against $ax_static_lib)
AC_MSG_WARN(Could not static link against boost_filesystem)
fi
if [[ "$no_link" == "yes" ]]; then
AC_MSG_ERROR(Could not link against any boost_filesystem lib)
13 changes: 10 additions & 3 deletions m4/ax_boost_program_options.m4
@@ -76,6 +76,13 @@ AC_DEFUN([AX_BOOST_PROGRAM_OPTIONS],
AC_DEFINE(HAVE_BOOST_PROGRAM_OPTIONS,,[define if the Boost::Program_Options library is available])
BOOSTLIBDIR=`echo $BOOST_LDFLAGS | sed -e 's/@<:@^\/@:>@*//'`
ax_lib=""
ax_static_lib=""
no_find="yes"
no_link="yes"
link_program_options="no"
link_program_options_static="no"
LDFLAGS_SAVE=$LDFLAGS
if test "x$ax_boost_user_program_options_lib" = "x"; then
for libextension in `ls $BOOSTLIBDIR/libboost_program_options*.so* $BOOSTLIBDIR/libboost_program_options*.dylib* 2>/dev/null | sed 's,.*/,,' | sed -e 's;^lib\(boost_program_options.*\)\.so.*$;\1;' -e 's;^lib\(boost_program_options.*\)\.dylib.*$;\1;'` ; do
@@ -127,11 +134,11 @@ AC_DEFUN([AX_BOOST_PROGRAM_OPTIONS],
AC_MSG_ERROR(Could not find any version boost_program_options to link to)
fi
if [[ "$link_program_options" = "no" ]]; then
AC_MSG_WARN(Could not dynamic link against $ax_lib)
if [[ "$link_program_options" == "no" ]]; then
AC_MSG_WARN(Could not dynamic link against boost_program_options)
fi
if [[ "$link_program_options_static" == "no" ]]; then
AC_MSG_WARN(Could not static link against $ax_static_lib)
AC_MSG_WARN(Could not static link against boost_program_options)
fi
if [[ "$no_link" == "yes" ]]; then
AC_MSG_ERROR(Could not link against any boost_program_options lib)
17 changes: 12 additions & 5 deletions m4/ax_boost_system.m4
@@ -82,6 +82,13 @@ AC_DEFUN([AX_BOOST_SYSTEM],
AC_DEFINE(HAVE_BOOST_SYSTEM,,[define if the Boost::System library is available])
BOOSTLIBDIR=`echo $BOOST_LDFLAGS | sed -e 's/@<:@^\/@:>@*//'`
ax_lib=""
ax_static_lib=""
no_find="yes"
no_link="yes"
link_system="no"
link_system_static="no"
LDFLAGS_SAVE=$LDFLAGS
if test "x$ax_boost_user_system_lib" = "x"; then
for libextension in `ls $BOOSTLIBDIR/libboost_system*.so* $BOOSTLIBDIR/libboost_system*.dylib* 2>/dev/null | sed 's,.*/,,' | sed -e 's;^lib\(boost_system.*\)\.so.*$;\1;' -e 's;^lib\(boost_system.*\)\.dylib.*$;\1;'` ; do
@@ -133,19 +140,19 @@ AC_DEFUN([AX_BOOST_SYSTEM],
AC_MSG_ERROR(Could not find any version boost_system to link to)
fi
if [[ "$link_system" = "no" ]]; then
AC_MSG_WARN(Could not dynamic link against $ax_lib)
if [[ "$link_system" == "no" ]]; then
AC_MSG_WARN(Could not dynamic link against boost_system)
fi
if [[ "$link_system_static" == "no" ]]; then
AC_MSG_WARN(Could not static link against $ax_static_lib)
AC_MSG_WARN(Could not static link against boost_system)
fi
if [[ "$no_link" == "yes" ]]; then
AC_MSG_ERROR(Could not link against any boost_system lib)
fi
fi
CPPFLAGS="$CPPFLAGS_SAVED"
CPPFLAGS="$CPPFLAGS_SAVED"
LDFLAGS="$LDFLAGS_SAVED"
fi
fi
])
13 changes: 10 additions & 3 deletions m4/ax_boost_timer.m4
@@ -83,6 +83,13 @@ AC_DEFUN([AX_BOOST_TIMER],
AC_DEFINE(HAVE_BOOST_TIMER,,[define if the Boost::Timer library is available])
BOOSTLIBDIR=`echo $BOOST_LDFLAGS | sed -e 's/@<:@^\/@:>@*//'`
ax_lib=""
ax_static_lib=""
no_find="yes"
no_link="yes"
link_timer="no"
link_timer_static="no"
LDFLAGS_SAVE=$LDFLAGS
if test "x$ax_boost_user_timer_lib" = "x"; then
for libextension in `ls $BOOSTLIBDIR/libboost_timer*.so* $BOOSTLIBDIR/libboost_timer*.dylib* 2>/dev/null | sed 's,.*/,,' | sed -e 's;^lib\(boost_timer.*\)\.so.*$;\1;' -e 's;^lib\(boost_timer.*\)\.dylib.*$;\1;'` ; do
@@ -134,11 +141,11 @@ AC_DEFUN([AX_BOOST_TIMER],
AC_MSG_ERROR(Could not find any version boost_timer to link to)
fi
if [[ "$link_timer" = "no" ]]; then
AC_MSG_WARN(Could not dynamic link against $ax_lib)
if [[ "$link_timer" == "no" ]]; then
AC_MSG_WARN(Could not dynamic link against boost_timer)
fi
if [[ "$link_timer_static" == "no" ]]; then
AC_MSG_WARN(Could not static link against $ax_static_lib)
AC_MSG_WARN(Could not static link against boost_timer)
fi
if [[ "$no_link" == "yes" ]]; then
AC_MSG_ERROR(Could not link against any boost_timer lib)