ENH: avoid full dense matrix for parallel beta #2040
base: master
Changes from all commits
23369b2
590c6ae
584af9b
636891f
b959ecb
@@ -34,7 +34,7 @@
 import warnings
 warnings.filterwarnings('ignore', 'Not using MPI as mpi4py not found')

-from numpy import asarray
+from numpy import vstack
 import cogent.maths.distance_transform as distance_transform
 from biom import load_table

@@ -144,8 +144,6 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
     otu_table = load_table(input_path)

-    otumtx = asarray([v for v in otu_table.iter_data(axis='sample')])
-
     if tree_path:
         tree = parse_newick(open(tree_path, 'U'),
                             PhyloNode)
@@ -173,6 +171,7 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
                              % (metric, ', '.join(list_known_metrics())))
                 exit(1)
         if rowids is None:
+            otumtx = otu_table.matrix_data.T.toarray()

This probably can just use the row by row stuff done below. Fairly certain there is never a need to do a dense conversion here.
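A minimal sketch of that suggestion, for illustration only: it assumes biom's Table.data(sample_id) returns one sample's counts as a dense vector (the same call the per-row loop further down in this diff relies on), so only the samples a given job actually touches are ever densified. The helper name is hypothetical and not part of this PR.

    # Hypothetical helper: densify only selected samples instead of calling
    # otu_table.matrix_data.T.toarray() on the whole table.
    from numpy import vstack

    def dense_rows(otu_table, wanted_sample_ids):
        # otu_table.data(sample_id) yields a single sample's counts as a
        # dense vector, so the full sample-by-observation matrix is never built.
        return vstack([otu_table.data(s) for s in wanted_sample_ids])

Each dissimilarity then only pays for the rows it uses, e.g. dense_rows(otu_table, [sample_ids[rowidx], sample_ids[i]]).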
             # standard, full way
             if is_phylogenetic:
                 dissims = metric_f(otumtx, otu_table.ids(axis='observation'),
@@ -198,6 +197,7 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
                         metric_f.__name__ == 'binary_dist_chisq':
                     warnings.warn('dissimilarity ' + metric_f.__name__ +
                                   ' is not parallelized, calculating the whole matrix...')
+                    otumtx = otu_table.matrix_data.T.toarray()
                     row_dissims.append(metric_f(otumtx)[rowidx])
                 else:
                     try:
@@ -208,18 +208,23 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
                         sample_ids = otu_table.ids()
                         observation_ids = otu_table.ids(axis='observation')
                         for i in range(len(sample_ids)):
+                            samp_a = otu_table.data(sample_ids[rowidx])

josenavas: @wasade after looking at the code with @ElDeveloper, it turns out these lines are never reached for Unifrac. IIRC, the current implementation of Unifrac in cogent is FastUnifrac. FastUnifrac is able to compute unifrac fast because it pre-computes a lot of data structures; that is why for Unifrac these lines never get executed.

One way of doing this would be to create a for loop and give the code an OTU table with 2 samples each time, and then construct the distance matrix ourselves in here. On one hand, this will kill the "Fast" part of Unifrac, since it will recompute those structures for each pair of samples. On the other hand, we will be able to aggressively parallelize this, e.g. creating a job for each pair of samples, maximizing the parallelization during those pre-computations. Other options require more thought on how to parallelize this, instead of using the naïve per-row parallelization.

wasade: Fast Unifrac does not require a dense representation. It does cache. Do you know why one_sample_unweighted_unifrac is being used instead of

Sorry @wasade, no idea why that function is used...
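A rough sketch of the per-pair idea described above (hypothetical helper, not part of this PR). It reuses the phylogenetic metric_f call signature visible in this diff and assumes otu_table, tree and full_tree are the same objects used in single_file_beta; the trade-off noted above still applies, since FastUnifrac's pre-computed structures are rebuilt for every pair.

    from itertools import combinations
    from numpy import vstack, zeros

    def pairwise_dissims(otu_table, metric_f, tree, full_tree):
        # One independent unit of work per pair of samples; each unit touches
        # only two rows of the sparse table, so pairs can be farmed out to
        # separate jobs and the results assembled into the full matrix here.
        sample_ids = otu_table.ids()
        observation_ids = otu_table.ids(axis='observation')
        dmat = zeros((len(sample_ids), len(sample_ids)))
        for a, b in combinations(range(len(sample_ids)), 2):
            samp_data = vstack([otu_table.data(sample_ids[a]),
                                otu_table.data(sample_ids[b])])
            dissim = metric_f(samp_data, observation_ids, tree,
                              [sample_ids[a], sample_ids[b]],
                              make_subtree=(not full_tree))[0, 1]
            dmat[a, b] = dmat[b, a] = dissim
        return dmat

This trades FastUnifrac's caching for an embarrassingly parallel loop over pairs, which is exactly the tension discussed in the thread above.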
+                            samp_b = otu_table.data(sample_ids[i])
+                            samp_data = vstack([samp_a, samp_b])

                             if is_phylogenetic:
                                 dissim = metric_f(
-                                    otumtx[[rowidx, i], :], observation_ids,
+                                    samp_data, observation_ids,
                                     tree, [sample_ids[rowidx], sample_ids[i]],
                                     make_subtree=(not full_tree))[0, 1]
                             else:
-                                dissim = metric_f(otumtx[[rowidx, i], :])[0, 1]
+                                dissim = metric_f(samp_data)[0, 1]
                             dissims.append(dissim)
                         row_dissims.append(dissims)
                     else:
                         # do whole row at once
-                        dissims = row_metric(otumtx,
+                        dissims = row_metric(otu_table,
                                              otu_table.ids(axis='observation'),
                                              tree, otu_table.ids(), rowid,
                                              make_subtree=(not full_tree))
I would probably import numpy as np, but this is also ok.

No strong feelings here, can do if you'd like, but the iteration with Jenkins is large.

True, no strong feelings either ... Just ignore that comment.
Yoshiki Vázquez-Baeza
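For reference, the two import styles discussed in this last thread are interchangeable here (the comment presumably refers to the from numpy import vstack change above); the diff uses the first form. The sample values below are made up purely for illustration.

    import numpy as np
    from numpy import vstack

    samp_a = np.array([1, 0, 3])
    samp_b = np.array([0, 2, 1])

    # style used in the diff
    samp_data = vstack([samp_a, samp_b])

    # style suggested in the review; identical result
    samp_data = np.vstack([samp_a, samp_b])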