This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

ENH: avoid full dense matrix for parallel beta #2040

Open · wants to merge 5 commits into base: master
17 changes: 11 additions & 6 deletions qiime/beta_diversity.py
@@ -34,7 +34,7 @@
import warnings
warnings.filterwarnings('ignore', 'Not using MPI as mpi4py not found')

-from numpy import asarray
+from numpy import vstack
Yoshiki Vázquez-Baeza (Member) commented:
I would probably import numpy as np, but this is also ok.

Daniel McDonald (Member, Author) replied:

No strong feelings here; can do if you'd like, but the iteration cycle with Jenkins is long.

Yoshiki Vázquez-Baeza (Member) replied:

True, no strong feelings either ... Just ignore that comment.


import cogent.maths.distance_transform as distance_transform
from biom import load_table

@@ -144,8 +144,6 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,

otu_table = load_table(input_path)

-otumtx = asarray([v for v in otu_table.iter_data(axis='sample')])

if tree_path:
tree = parse_newick(open(tree_path, 'U'),
PhyloNode)
@@ -173,6 +171,7 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
% (metric, ', '.join(list_known_metrics())))
exit(1)
if rowids is None:
+otumtx = otu_table.matrix_data.T.toarray()
Daniel McDonald (Member, Author) commented:

This can probably just use the row-by-row approach done below. Fairly certain there is never a need to do a dense conversion here.
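A minimal sketch of that row-by-row alternative, reusing otu_table, rowidx, and metric_f from the surrounding code and shown for the non-phylogenetic case; the loop shape is illustrative rather than the PR's final code:

from numpy import vstack

# Sketch only: densify just the two samples being compared, rather than
# materializing the whole table via matrix_data.T.toarray().
sample_ids = otu_table.ids()
samp_a = otu_table.data(sample_ids[rowidx])  # dense 1-D vector for one sample
for i in range(len(sample_ids)):
    samp_b = otu_table.data(sample_ids[i])
    pair = vstack([samp_a, samp_b])          # shape: (2, n_observations)
    dissim = metric_f(pair)[0, 1]            # pairwise metric on the 2 x N block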

# standard, full way
if is_phylogenetic:
dissims = metric_f(otumtx, otu_table.ids(axis='observation'),
@@ -198,6 +197,7 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
metric_f.__name__ == 'binary_dist_chisq':
warnings.warn('dissimilarity ' + metric_f.__name__ +
' is not parallelized, calculating the whole matrix...')
+otumtx = otu_table.matrix_data.T.toarray()
row_dissims.append(metric_f(otumtx)[rowidx])
else:
try:
@@ -208,18 +208,23 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
sample_ids = otu_table.ids()
observation_ids = otu_table.ids(axis='observation')
for i in range(len(sample_ids)):
+samp_a = otu_table.data(sample_ids[rowidx])
josenavas (Member) commented:
@wasade after looking at the code with @ElDeveloper, it turns out these lines inside the AttributeError handler will never be executed for the unifrac distances, as line 204 does not raise any error.

IIRC, the current implementation of Unifrac in cogent is FastUnifrac. FastUnifrac is able to compute unifrac fast because it pre-computes a lot of data structures; that is why these lines never get executed for Unifrac.

One way of doing this would be to create a for loop that gives the code an OTU table with 2 samples each time, and then construct the distance matrix ourselves here (see the sketch below). On one hand, this will kill the "Fast" part of Unifrac, since it will recompute those structures for each pair of samples. On the other hand, we will be able to aggressively parallelize this, e.g. by creating a job for each pair of samples, maximizing the parallelization during those pre-computations.

Other options require more thought on how to parallelize this, instead of using the naïve per-row parallelization.
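A rough sketch of that per-pair scheme, assuming the biom 2.x Table.filter API; iter_pair_tables is a hypothetical helper name and the job submission/collection machinery is elided:

from itertools import combinations

def iter_pair_tables(table):
    # Sketch of the proposed scheme: one tiny 2-sample table per pair,
    # each of which could be dispatched as an independent job. This
    # forfeits FastUnifrac's shared precomputation but parallelizes
    # across all pairs.
    for sid_a, sid_b in combinations(table.ids(), 2):
        pair_table = table.filter([sid_a, sid_b], axis='sample',
                                  inplace=False)
        yield sid_a, sid_b, pair_table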

Daniel McDonald (Member, Author) replied:

Fast UniFrac does not require a dense representation. It does cache information on the tree, but it doesn't use the zeros. However, if enabling Fast UniFrac to scale within qiime requires modifying pycogent, then I agree it is not worth pursuing.

Do you know why one_sample_unweighted_unifrac is being used instead of dist_unweighted_unifrac? My impression was that the latter would work, and it is what I was expecting the metric in use to be. It is also Fast UniFrac.


josenavas (Member) replied:

Sorry @wasade

No idea why that function is used...

+samp_b = otu_table.data(sample_ids[i])
+samp_data = vstack([samp_a, samp_b])

if is_phylogenetic:

dissim = metric_f(
-otumtx[[rowidx, i], :], observation_ids,
+samp_data, observation_ids,
tree, [sample_ids[rowidx], sample_ids[i]],
make_subtree=(not full_tree))[0, 1]
else:
-dissim = metric_f(otumtx[[rowidx, i], :])[0, 1]
+dissim = metric_f(samp_data)[0, 1]
dissims.append(dissim)
row_dissims.append(dissims)
else:
# do whole row at once
-dissims = row_metric(otumtx,
+dissims = row_metric(otu_table,
otu_table.ids(axis='observation'),
tree, otu_table.ids(), rowid,
make_subtree=(not full_tree))
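To make the memory trade-off of the change above concrete, a toy comparison with illustrative sizes (not taken from the PR):

import numpy as np
from scipy.sparse import rand

# toy table: 50,000 observations x 1,000 samples at ~1% density
sparse_counts = rand(50000, 1000, density=0.01, format='csc')

# old path: densify everything up front
# 1000 x 50000 float64 values -> ~400 MB resident
full = sparse_counts.T.toarray()

# new path: only the current pair is ever dense
# 2 x 50000 float64 values -> ~0.8 MB resident
samp_a = sparse_counts[:, 0].toarray().ravel()
samp_b = sparse_counts[:, 1].toarray().ravel()
pair = np.vstack([samp_a, samp_b])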
21 changes: 16 additions & 5 deletions qiime/parse.py
@@ -22,6 +22,7 @@
from numpy import concatenate, repeat, zeros, nan, asarray
from numpy.random import permutation

+import biom
from skbio.stats.ordination import OrdinationResults
from skbio.parse.record_finder import LabeledRecordFinder
from cogent.parse.tree import DndParser
@@ -566,18 +567,28 @@ def make_envs_dict(abund_mtx, sample_names, taxon_names):
    sample_names is a list, length = num rows
    taxon_names is a list, length = num columns
    """
-    num_samples, num_seqs = abund_mtx.shape
+    if isinstance(abund_mtx, biom.Table):
+        num_seqs, num_samples = abund_mtx.shape
+    else:
+        num_samples, num_seqs = abund_mtx.shape

    if (num_samples, num_seqs) != (len(sample_names), len(taxon_names)):
        raise ValueError(
            "Shape of matrix %s doesn't match # samples and # taxa (%s and %s)" %
            (abund_mtx.shape, num_samples, num_seqs))
    envs_dict = {}
    sample_names = asarray(sample_names)
-    for i, taxon in enumerate(abund_mtx.T):
-
-        nonzeros = taxon.nonzero()  # this removes zero values to reduce memory
-        envs_dict[taxon_names[i]] = dict(zip(sample_names[nonzeros],
-                                             taxon[nonzeros]))
+    if isinstance(abund_mtx, biom.Table):
+        for i, v in enumerate(abund_mtx.matrix_data):
+            envs_dict[taxon_names[i]] = dict(zip(sample_names[v.indices],
+                                                 v.data))
+    else:
+        for i, taxon in enumerate(abund_mtx.T):
+
+            nonzeros = taxon.nonzero()  # this removes zero values to reduce memory
+            envs_dict[taxon_names[i]] = dict(zip(sample_names[nonzeros],
+                                                 taxon[nonzeros]))
    return envs_dict
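For context on the matrix_data branch above: iterating a scipy sparse matrix (the type behind biom's matrix_data, oriented observations by samples) yields one sparse row per observation, and each row's indices and data attributes hold exactly the nonzero column positions and values, which is what keeps zeros out of envs_dict. A standalone toy illustration:

import numpy as np
from scipy.sparse import csr_matrix

# observations x samples, matching biom's matrix_data orientation
counts = csr_matrix(np.array([[0, 3, 0],
                              [2, 0, 1]]))
sample_names = ['S1', 'S2', 'S3']

for i, row in enumerate(counts):
    # row.indices -> columns holding nonzeros; row.data -> their values
    nonzero = dict(zip([sample_names[j] for j in row.indices],
                       [int(v) for v in row.data]))
    print(i, nonzero)   # 0 {'S2': 3}, then 1 {'S1': 2, 'S3': 1}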

