-
-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up connect_edges #725
base: main
Are you sure you want to change the base?
Conversation
1979185
to
7fa33cb
Compare
Thanks for the contribution! For the WIP items:
|
Don't think I've ever looked at the internals of that code although we do expose the functionality in holoviews. |
@philippjfr could you point out to me where that's exposed? I'd like to see an example of this argument being used. I found the holoviews function |
While CI and installation stuff gets worked out, do I have the devs blessing to change the edge id tests? |
I wish I could give you that blessing, but I won't have a chance to look at this code until Tuesday to see whether it was the code or the tests that was wrong before. Sorry about that! |
@jbednar, any chance you've been able to look at those tests? |
Sorry for the delay! I did look at them, but it's a mess! it looks like different decisions were made between HoloViews and Datashader about what to do when there is an id field available, and we haven't yet worked out what to do about that, how to maintain backwards compatibility , and how to deal with it. I had hoped we could address this before our release today, but other deadlines prevented us from fully working that out, so we have had to postpone it again to next week, after the release. If we find a problem we'll fix it and make a new release, merging your improvements either way. So we haven't forgotten about this; it's just delayed more... |
No problem! Yeah, the current implementation was much more complicated than I was expecting. I would note the current datashader function does return faulty values if the data doesn't match up exactly with the expectations. For example, import pandas as pd
import numpy as np
import datashader as ds
from datashader.layout import circular_layout
from datashader.bundling import connect_edges
n, m = 100, 1000
nodes = pd.DataFrame(["node"+str(i) for i in range(n)], columns=['name'])
edges = pd.DataFrame(np.random.randint(0,len(nodes), size=(m, 2)),
columns=['source', 'target'])
circular = circular_layout(nodes, uniform=False)
print(connect_edges(circular, edges).head())
print(connect_edges(circular, edges, include_edge_id=True).head())
# x y
# 0 0.149448 0.143471
# 1 0.037398 0.310265
# 2 NaN NaN
# 3 0.999698 0.517384
# 4 0.002216 0.452981
# edge_id x y
# 0 0.149448 0.143471 3.739811e-02
# 1 0.149448 0.310265 4.594811e-322
# 2 NaN NaN NaN
# 3 0.999698 0.517384 2.215678e-03
# 4 0.999698 0.452981 1.185758e-322 Additionally, it doesn't throw an error if you don't provide information it needs (like node positions). I would've opened an issue, but figured this PR fixes it anyways. |
I reran the Appveyor tests on this PR to try to see what is failing. Some old failures have gone away (presumably issues fixed on master) but the tests that are failing do seem to relate to this PR. In particular It is odd that this issue only appears on Windows but on that platform at least, either the unit test definition or the data generated using the code in this PR is incorrect. That is a shame as I would love to see this PR merged: those are some very nice speedups! |
Actually, I'm not convinced that unit test is being on travis which would explain why only appveyor is failing right now! |
IIRC, different environments ran different sets of tests. There was some discussion of this in #724, but I'm not sure if the issue has been solved since. |
Sorry for the long delay, we're interested in this work, could you push an empty commit or a trivial change to get the test suite to run again? Thanks ! |
Ha, no worries. I may not be able to promise much time to this now, but took a quick look over the code again. Looking at the CI, it seems like those tests still need to be modified. It's clearly wrong that there are edges with |
@ivirshup I believe that you will be attending the NumFOCUS Project Summit next week so perhaps we can have a chat about this work then? |
Sure. Did we get a list of people who are going to be there? |
There is a |
Currently,
connect_edges
is pretty slow. This speeds it up a lotHere's an example of the speed up
New implementation
I've labelled this as a work in progress for a few reasons:
njit(parallel=True)
ornjit(cache=True)
, but I don't see those being used a lot in this codebase. Should I set one of those? Is there a preference for one over the other?test_directly_connect_with_weights[True]
andtest_directly_connect_without_weights[True]
undertest_bundling.py
are failing, but the tests might be wrong. Here's what I mean:Given an edge data frame that look like this:
I would think the output should have edge ids
[0, 1, 2, 3]
. However, the test says they should have ids[1, 2, 3, 4]
(link). It looks to me like theedge_id
is being set by id of the target node because the node dataframe that's passed also has anid
field.