
ENH: avoid full dense matrix for parallel beta #2040

Open · wants to merge 5 commits into base: master

Conversation

@wasade (Member) commented Jun 4, 2015

A back-of-the-envelope calculation puts the amount of memory necessary to compute UniFrac over the EMP table at ~3TB (75k samples, 5 million OTUs, double-precision floats). The dense representation is not necessary for the pairwise parallelized beta diversity metrics; this change addresses that issue.
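
For reference, the arithmetic behind that estimate:

    samples = 75000
    otus = 5000000
    bytes_per_double = 8
    total_bytes = samples * otus * bytes_per_double
    print(total_bytes / 1e12)  # ~3.0 TB for the dense matrix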

Row metrics and non-parallel runs will still use the dense matrix, as I believe avoiding it there will require lower-level changes.

@ElDeveloper, able to review?

@ElDeveloper (Member)

The code looks fine; let's wait until we see the tests. Also, we would need a mention in the ChangeLog, and at least someone else to review this, since I'm not the maintainer of the library and I'm not familiar with the intricacies of this pipeline.

This will be super useful to have!!!

@@ -34,7 +34,7 @@
import warnings
warnings.filterwarnings('ignore', 'Not using MPI as mpi4py not found')

-from numpy import asarray
+from numpy import vstack
@ElDeveloper (Member):

I would probably import numpy as np, but this is also ok.

@wasade (Member, Author):

No strong feelings here; I can do that if you'd like, but the iteration time with Jenkins is long.

@ElDeveloper (Member):

True, no strong feelings either ... Just ignore that comment.


@ghost commented Jun 5, 2015

Build results will soon be (or already are) available at: http://ci.qiime.org/job/qiime-github-pr/1658/

@wasade (Member, Author) commented Jun 5, 2015

@ElDeveloper, able to pull this down and see if it helps with the EMP table? Curious to see what happens in a real-world scenario.

@ElDeveloper (Member)

I'm planning on doing some testing with this branch later today or next week, I'll report back for sure! Thanks for putting this in place!


@wasade (Member, Author) commented Jun 5, 2015

Excellent, thanks!


@ElDeveloper (Member)

Thanks @wasade, I just tested this out and it came up with the error you see below:

Traceback (most recent call last):
  File "/home/yova1074/.virtualenvs/qiime-dev/bin/beta_diversity.py", line 152, in <module>
    main()
  File "/home/yova1074/.virtualenvs/qiime-dev/bin/beta_diversity.py", line 145, in main
    opts.output_dir, opts.rows, full_tree=opts.full_tree)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/qiime/beta_diversity.py", line 227, in single_file_beta
    otumtx = otu_table.matrix_data.T.toarray()
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/scipy/sparse/base.py", line 793, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

@wasade (Member, Author) commented Jun 6, 2015

Parallel, and UniFrac? If so, I do not understand how this code works.

@ElDeveloper (Member)

Yes, parallel and UniFrac. Using 8GB.


@wasade (Member, Author) commented Jun 6, 2015

I don't understand why that branch is being taken in that code, unfortunately... @gregcaporaso, do you know?

@wasade (Member, Author) commented Jun 6, 2015

@ElDeveloper, that branch implies that one_sample_unifrac is being used, not dist_unweighted_unifrac. Walking backwards, the metric used in your traceback is pulled from this function, and using this unifrac metric. I was expecting this method for pulling the metric, and that it would pull this method for unifrac. Parallelizing the other aspects of this is going to be more involved unfortunately, though it may be as simple as replicating this for loop.

@ElDeveloper (Member)

@wasade, thanks for checking!

@@ -208,17 +208,23 @@ def single_file_beta(input_path, metrics, tree_path, output_dir,
sample_ids = otu_table.ids()
observation_ids = otu_table.ids(axis='observation')
for i in range(len(sample_ids)):
samp_a = otu_table.data(sample_ids[rowidx])
@josenavas (Member):

@wasade after looking at the code with @ElDeveloper, it turns out these lines inside the AttributeError handler will never be executed for the unifrac distances, as line 204 does not raise any error.

IIRC, the current implementation of UniFrac in cogent is FastUnifrac. FastUnifrac is able to compute unifrac fast because it pre-computes a lot of data structures; that is why these lines never get executed for UniFrac.

One way of doing this would be to create a for loop that gives the code an OTU table with 2 samples each time, and then construct the distance matrix ourselves here (sketched below). On one hand, this will kill the "Fast" part of UniFrac, since it will recompute those structures for each pair of samples. On the other hand, we will be able to aggressively parallelize this, e.g. creating a job for each pair of samples, maximizing the parallelization during those pre-computations.

Other options require more thought on how to parallelize this, instead of using the naïve per-row parallelization.
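
For illustration, a minimal sketch of that per-pair loop (assuming a biom-like filter method; pairwise_beta and metric_f are placeholders, not the qiime API):

    from itertools import combinations

    def pairwise_beta(otu_table, metric_f, sample_ids):
        # Hand the metric a 2-sample table at a time; each pair could
        # instead be dispatched as its own cluster job.
        dists = {}
        for a, b in combinations(sample_ids, 2):
            pair_table = otu_table.filter([a, b], axis='sample', inplace=False)
            dists[(a, b)] = metric_f(pair_table)
        return dists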

@wasade (Member, Author):

Fast Unifrac does not require a dense representation. It does cache information on the tree, but it doesn't use the zeros. However, if enabling fast unifrac to scale within qiime requires modification of pycogent, then I agree it is not worth pursuing.

Do you know why one_sample_unweighted_unifrac is being used instead of dist_unweighted_unifrac? My impression was that the latter would work and is what I was expecting the metric in use to be. It is also fast unifrac.


@josenavas (Member):

Sorry @wasade

No idea why that function is used...

tree, otu_table.ids(), rowid,
make_subtree=(not full_tree))
row_dissims.append(dissims)
# do element by element
@wasade (Member, Author):

@ElDeveloper, wanna give this another spin? I verified with pdb which branches were taken, and the output (by md5sum) is identical with and without this change on the test data.

@gregcaporaso, I'm really not sure if this change is safe for all the metrics, but for some reason dist_unweighted_unifrac and dist_weighted_unifrac were not being treated like row metrics and instead operated on the full table because of the try/except. The comment here, "do element by element", is misleading as best I can tell, as the original code looked like it was doing it row by row. So I think this is correct?
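
To make the try/except under discussion concrete, a paraphrase of the dispatch (schematic, not the actual qiime source):

    def resolve_one_sample_metric(metrics_module, metric_name):
        # Optimistic lookup of a one_sample_* function (e.g.
        # one_sample_unweighted_unifrac). When it succeeds, single_file_beta
        # takes the whole-table path and densifies the matrix; only an
        # AttributeError drops it into the per-row loop where the sparse
        # representation can be kept. The unifrac metrics resolved
        # successfully, which is why the sparse branch never ran for them.
        try:
            return getattr(metrics_module, 'one_sample_' + metric_name)
        except AttributeError:
            return None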

@ElDeveloper (Member):

@wasade, I think this might be it, thanks for having a look. I started it and it's been running for a while in 8GB, while before I needed to use 64GB. 🍺 👍


Member:

💥

@ghost commented Jul 23, 2015

Build results will soon be (or already are) available at: http://ci.qiime.org/job/qiime-github-pr/1665/

@ElDeveloper (Member)

@wasade, it seems like this may have fixed the original problem; however, the running time is much slower now. I'm basing this on the fact that in the previous execution that I did, the jobs finished within 48 hours. This time it took longer than I had initially anticipated and exceeded the walltime limit 😭:

/home/yova1074/.bash_profile: line 57: /home/yova1074/.virtualenvs/qiime-dev/bin/postactivate: No such file or directory
=>> PBS: job killed: walltime 432039 exceeded limit 432000

@wasade (Member, Author) commented Jul 29, 2015

That's unexpected. Did any of the individual jobs complete?


@ElDeveloper (Member)

No, sadly not a single one of them.


@wasade (Member, Author) commented Jul 29, 2015

Is cputime reasonably similar to walltime?


@wasade (Member, Author) commented Jul 29, 2015

My guess is that each call to unifrac is doing some expensive caching, whereas previously the caching was done once. This is going to get even more dirty.

@wasade (Member, Author) commented Jul 29, 2015

Just a note: @ElDeveloper and I exchanged messages out-of-thread, and cputime is ~= walltime, so the jobs were burning the full time.

This approach results in 2 * N**2 calls to _fast_unifrac_setup, in which the bulk of the time is spent in cogent's TreeNode.copy. (Note, I do not understand the source of the coefficient at this time, but that's minor right now.) The existing trajectory through this code resulted in 2 * N calls to _fast_unifrac_setup. In the small test case from qiime_test_data, the runtime difference is 0.9s vs 8.7s.
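
Making the call-count difference concrete (illustration only; the coefficient of 2 turns out to come from running both weighted and unweighted, per the next comment):

    n = 1000                      # number of samples
    pairwise_setups = 2 * n ** 2  # this branch: setup per pair, per metric
    row_setups = 2 * n            # previous path: setup per row, per metric
    print(pairwise_setups / row_setups)  # 1000x more _fast_unifrac_setup calls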

Continuing the investigation...

@wasade (Member, Author) commented Jul 29, 2015

In true n00b form, I now know where the 2 coefficient comes from: running both weighted and unweighted during testing.

Anyway, found a reasonable way forward. It increases setup time, but I'm only operating on a small test dataset so it is not clear to me if the setup time increase will substantially increase overall runtime. The benefit of course is that we can avoid a dense representation.

...what would be really nice though is if fast_unifrac could just consume a scipy.sparse.csr_matrix for envs, as the 2d dict currently being used is very heavyweight when the number of nonzero entries is large.

Commit coming momentarily
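
For context, fast_unifrac's envs input is (to my understanding) a 2d dict keyed by OTU and then by sample; a rough sketch of the conversion a csr_matrix could replace (csr_to_envs is a hypothetical helper, not existing cogent/qiime code):

    def csr_to_envs(matrix, otu_ids, sample_ids):
        # matrix: scipy.sparse.csr_matrix of OTUs (rows) by samples (columns).
        # This builds one inner dict per OTU and one entry per nonzero value;
        # at EMP scale that is millions of Python dicts versus the three flat
        # arrays backing the csr_matrix itself.
        envs = {}
        for row in range(matrix.shape[0]):
            start, end = matrix.indptr[row], matrix.indptr[row + 1]
            cols = matrix.indices[start:end]
            vals = matrix.data[start:end]
            if len(cols):
                envs[otu_ids[row]] = dict(
                    (sample_ids[c], v) for c, v in zip(cols, vals))
        return envs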

@ElDeveloper (Member)

Thanks for the info @wasade. I just started the processing again, we'll see what happens! Agreed on your proposal about fast_unifrac; not sure where that fits in for the future of skbio.


@ghost commented Jul 30, 2015

Build results will soon be (or already are) available at: http://ci.qiime.org/job/qiime-github-pr/1667/

@ElDeveloper (Member)

For the EMP table, using 100 workers with 8GB of RAM, all workers failed with this MemoryError:

/home/yova1074/.bash_profile: line 57: /home/yova1074/.virtualenvs/qiime-dev/bin/postactivate: No such file or directory
Traceback (most recent call last):
  File "/home/yova1074/.virtualenvs/qiime-dev/bin/beta_diversity.py", line 152, in <module>
    main()
  File "/home/yova1074/.virtualenvs/qiime-dev/bin/beta_diversity.py", line 145, in main
    opts.output_dir, opts.rows, full_tree=opts.full_tree)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/qiime/beta_diversity.py", line 230, in single_file_beta
    make_subtree=(not full_tree))
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/qiime/beta_metrics.py", line 88, in result
    tree, envs, weighted=weighted, metric=metric, **kwargs)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/cogent/maths/unifrac/fast_unifrac.py", line 523, in fast_unifrac_one_sample
    branch_lengths, nodes, t = _fast_unifrac_setup(t, envs, make_subtree)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/cogent/maths/unifrac/fast_unifrac.py", line 189, in _fast_unifrac_setup
    count_array, unique_envs, env_to_index, node_to_index = index_envs(envs, node_index)
  File "/home/yova1074/.virtualenvs/qiime-dev/lib/python2.7/site-packages/cogent/maths/unifrac/fast_tree.py", line 113, in index_envs
    result = zeros((num_nodes, num_envs), array_constructor)
MemoryError

@wasade (Member, Author) commented Jul 30, 2015

@rob-knight, does fast unifrac require a dense representation of the OTU table? This exception looks like it is forcing that. If so, then I'm not sure how to proceed here without making changes in cogent. That, or it's possible that parallel_beta_diversity.py isn't driving fast unifrac correctly?

@rob-knight

No, but it does require a dense representation of the two vectors representing a pair of environments. Trimming this vector to remove shared zeroes would require trimming the tree, which might be computationally expensive.


@wasade (Member, Author) commented Jul 31, 2015

@rob-knight @ElDeveloper, then I think we're in a bind. If I understand it correctly, we either do it pairwise and eat the non-trivial compute overhead of _fast_unifrac_setup, or we pass in all samples as the envs, forcing a dense matrix to be produced in index_envs (i.e., ~3TB of memory). The compute overhead on top of fast unifrac itself for doing pairwise appears to be at least O(K^2 * (N * log N)), where N is the number of tips in the tree (7.5 million; this is open ref, with the expense driven by TreeNode.copy prior to pruning) and K is the number of samples.

Doing some empirical tests (notebook here: https://gist.github.com/wasade/22b96005a18636ef158a) of cogent's TreeNode.copy with the Greengenes OTU trees, we can fit the growth pretty well (r^2 = 0.99). Extrapolating to 7.5 million tips in the EMP tree, and assuming 30k samples, we're looking at on the order of 56 million CPU hours of overhead.

It is my understanding that this is blocking EMP analysis. There are at least two issues that I believe need to be resolved:

  • introduce a fast TreeNode.copy, or modify index_tree to only index a subtree (thereby avoiding the need for a copy)
  • modify fast_unifrac to allow operating pairwise on the envs instead of on all of them in bulk, to reduce the overhead associated with spinning up its machinery

In order to resolve the first point, which I think is doable, we'll need to backport all of fast unifrac.

I've been optimistic about each commit that goes in on this PR, but it seems like every one just unravels a deeper problem. So hopefully we can resolve this with the next fix...
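
For a sense of scale, the arithmetic behind that estimate (the per-pair setup cost below is an assumed placeholder chosen to be consistent with the ~56 million CPU hour figure above; the real number came from the fitted TreeNode.copy timings):

    k = 30000                            # samples
    pairs = k * (k - 1) // 2             # ~4.5e8 sample pairs
    setup_seconds_per_pair = 450.0       # ASSUMED placeholder per-pair cost
    cpu_hours = pairs * setup_seconds_per_pair / 3600
    print(pairs, cpu_hours)              # ~4.5e8 pairs, ~5.6e7 CPU hours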

@rob-knight

Part of the problem is the large number of envs, so one possibility is to subset the environmental samples and run the computations such that each is compared to each other one at least once, then aggregate the table at the end, which is much less computationally expensive, right? I realize this requires some custom aggregation code and doesn't use the current parallelization model.
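
A minimal sketch of that blocking scheme (assumed scheduling code, not part of qiime): split the samples into blocks and run one job per block pair, so every pair of samples co-occurs in at least one job and the setup cost is amortized over many pairs; the per-job sub-matrices are then stitched into the full distance matrix.

    from itertools import combinations

    def block_jobs(sample_ids, block_size):
        # Yield groups of sample ids such that every pair of samples appears
        # together in at least one group; each group becomes one worker job.
        blocks = [sample_ids[i:i + block_size]
                  for i in range(0, len(sample_ids), block_size)]
        for block in blocks:                  # covers the single-block case
            yield block
        for a, b in combinations(blocks, 2):  # across-block (and within) pairs
            yield a + b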


@wasade (Member, Author) commented Jul 31, 2015

That's a good idea -- find a balancing point between 2 envs and 30k envs. The nice thing is this would not require backporting and can be done in the parallel_beta_diversity code such that it can leverage the parallelization already in place. On the other hand, that code is dead, so I wonder if the better bet is to port, bench, and optimize first (particularly as I think there are some workarounds for the copy) to position us better for a subsequent port to skbio.
