Random Projection Forest #215
Conversation
Yes, I think it will be better to pass these args via opts
I don't have a strong preference, but I think it will be more consistent if we rename it to `fit`.
Thank you @krstopro! This is a great starting point. The big question is whether we can fully convert this to `defn`.
@krstopro I had some time to look into this and I think we may be able to move this to `defn`.
Given 2 is much harder than 1, perhaps I would start with it first. :)
One last comment: I took a look at the cost of implementing task item 2, and this parameter exists in scikit-learn as `leaf_size`. For example, imagine we are storing 0..9 in the kd-tree. With leafsize=1, we will have:

…

With leafsize=2 and leafsize=3, we have the same results too. The only difference is with leafsize=4:

…

And we can easily achieve this by changing the algorithm to stop earlier. The hardest part would be knowing how many children there are on the left side of each node.
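A rough sketch of the stop-earlier idea in plain Elixir (a hypothetical helper, not the Scholar KDTree code): keep splitting on the median only while a half of the current segment would still hold at least the requested number of points, and stop the recursion at that depth.

```elixir
defmodule StopEarlySketch do
  # Number of median splits we can perform before half of a segment would
  # drop below min_leaf_size; the larger half keeps the extra point.
  def depth(size, min_leaf_size, level \\ 0) do
    half = div(size, 2)

    if half < min_leaf_size do
      level
    else
      depth(half + rem(size, 2), min_leaf_size, level + 1)
    end
  end
end

# StopEarlySketch.depth(10, 1) #=> 4
# StopEarlySketch.depth(10, 4) #=> 1 (stop right after splitting the root)
```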
Thanks for the feedback @josevalim. This might be possible to write fully within `defn`.
There might be a problem with querying such a tree (or even forest) in parallel if the leaves are of different sizes. I am avoiding this by making the forest complete and making the leaf size equal for all leaves (the left subtree might be larger by 1 than the right subtree, and when querying the right leaf we might query 1 element from the left leaf; this is no issue at all). Update: my bad, …
Ah, my comment was written from the perspective that leafsize is a maximum, not a minimum. But it is easy to flip it around and make it a minimum.
Perhaps a separate algorithm is indeed best, because the current one leads to left-balanced trees, which will put many more elements on the left side. The bounded algorithm is not complicated. The hardest part is deriving the formulas for the pivot (both `bounded_segment_begin` and `bounded_subtree_size`), but perhaps there are already definitions for these elsewhere.
{median, left_indices, right_indices} =
  Nx.Defn.jit_apply(&split(&1, &2, &3), [tensor, indices, hyperplane])

medians = Nx.put_slice(medians, [0, node], Nx.new_axis(median, 1))
Move this inside `split`. We should have as many operations inside `defn` as possible.
This is easy to do, but I don't see exactly why it is better, given that `put_slice` is an `Nx` function. Can I ask for an explanation?
Nx in Elixir land needs to preserve the immutable semantics of the language. So every time you receive an output, the output will be fully copied into memory. In this case, with the two operations in the snippet, you are making a full copy of `median`, `left_indices`, `right_indices`, and `medians`. By moving this operation inside, at least you don't need to copy `median`.

This is exactly why moving the whole thing to a `defn` is beneficial: you don't have to worry about the copying of variables in a loop.
I see, thanks.
Yes!
As far as I understand, they are already validated and accepted as options. I would only pass a default of 1 to `min_leaf_size`. So overall, this looks great. The biggest question is indeed only the `defn` part.
Yes, options are already implemented.
Honestly, I am not sure how important this is at the moment. Currently it takes seconds for 100K points on CPU; I am using an M1 Mac, so I cannot test it with CUDA.
The issue is precisely that going in and out of CUDA is even more expensive. Plus, even on CPU, KDTrees are 30x faster and use considerably less memory in their `defn` version.
There is another optimization I should implement: instead of computing the projections each time in … Also, I have the …
@krstopro btw, can you confirm that, for a given size N and min_leaf_size, this is what your trees look like: …
Something different. In the random projection forest as implemented here, the nodes do not correspond to points in the dataset.
In any case, for what it's worth, here is an implementation of `fit_bounded`:

test "sizes" do
  IO.inspect(fit_bounded(Nx.iota({21, 1}), 20, min_leaf_size: 2))
  IO.inspect(fit_bounded(Nx.iota({20, 1}), 20, min_leaf_size: 2))
  IO.inspect(fit_bounded(Nx.iota({19, 1}), 20, min_leaf_size: 2))
  IO.inspect(fit_bounded(Nx.iota({18, 1}), 20, min_leaf_size: 2))
end

import Nx.Defn

# Level (depth) of node i in a 0-indexed complete binary tree: floor(log2(i + 1)).
deftransformp bounded_level(%Nx.Tensor{type: {:u, 32}} = i) do
  Nx.subtract(31, Nx.count_leading_zeros(Nx.add(i, 1)))
end

# Number of levels for the given size and min_leaf_size:
# stop splitting once half of the current segment would fall below min_leaf_size.
deftransformp levels(size, min_leaf_size, levels) do
  mid = div(size, 2)

  if mid < min_leaf_size do
    levels
  else
    levels(mid + rem(size, 2), min_leaf_size, levels + 1)
  end
end

defn fit_bounded(tensor, amplitude, opts \\ []) do
  opts = keyword!(opts, min_leaf_size: 1)
  {size, dims} = Nx.shape(tensor)
  levels = levels(size, opts[:min_leaf_size], 0)
  band = amplitude + 1
  tags = Nx.broadcast(Nx.u32(0), {size})

  # `band * tags` shifts each node's points into its own value range (assuming the
  # values are within [0, amplitude]), so one argsort handles all nodes of a level.
  {level, tags, _tensor, _band} =
    while {level = Nx.u32(0), tags, tensor, band}, level < levels - 1 do
      k = rem(level, dims)
      indices = Nx.argsort(tensor[[.., k]] + band * tags, type: :u32, stable: true)
      tags = update_tags(tags, indices, level)
      {level + 1, tags, tensor, band}
    end

  k = rem(level, dims)
  Nx.argsort(tensor[[.., k]] + band * tags, type: :u32, stable: true)
end

defnp update_tags(tags, indices, level) do
  pos = Nx.argsort(indices, type: :u32)
  pivot = bounded_segment_begin(tags) + bounded_subtree_size(left_child(tags))

  # pos is each point's rank in the segmented sort; the first 2^level - 1 positions
  # are medians pinned at earlier levels, the point exactly at the pivot becomes this
  # node's median, and everything else moves down to the left or right child.
  Nx.select(
    pos < (1 <<< level) - 1,
    tags,
    Nx.select(
      pos < pivot,
      left_child(tags),
      Nx.select(
        pos > pivot,
        right_child(tags),
        tags
      )
    )
  )
end

# Number of points in the subtree rooted at node i: the points not pinned at the
# levels above are split as evenly as possible among the nodes of i's level, with
# the leftmost nodes taking the extras.
defnp bounded_subtree_size(i) do
  {size} = Nx.shape(i)
  denominator = 1 <<< bounded_level(i)
  top = denominator - 1
  shared = div(size - top, denominator)
  remaining = rem(size - top, denominator)
  shared + Nx.select(i - top < remaining, 1, 0)
end

# Position in the sorted order where node i's segment of points begins:
# the pinned medians come first, then the segments of i's level from left to right.
defnp bounded_segment_begin(i) do
  {size} = Nx.shape(i)
  denominator = 1 <<< bounded_level(i)
  top = denominator - 1
  left = i - top
  shared = div(size - top, denominator) * left
  remaining = rem(size - top, denominator)
  top + shared + Nx.min(remaining, left)
end

deftransform left_child(i) when is_integer(i), do: 2 * i + 1
deftransform left_child(%Nx.Tensor{} = t), do: Nx.add(Nx.multiply(2, t), 1)

deftransform right_child(i) when is_integer(i), do: 2 * i + 2
deftransform right_child(%Nx.Tensor{} = t), do: Nx.add(Nx.multiply(2, t), 2)

It is pretty much the same as our KDTree, except with the new `bounded_segment_begin` and `bounded_subtree_size` formulas.
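As a quick sanity check of the pivot formulas, here is the arithmetic worked out by hand for the `Nx.iota({21, 1})` case from the test above (my own verification, not part of the original snippet):

```elixir
# Root (node 0, level 0): denominator = 1, top = 0, so bounded_segment_begin(0) = 0.
# Its left child (node 1, level 1): denominator = 2, top = 1,
#   shared = div(21 - 1, 2) = 10 and remaining = 0, so bounded_subtree_size(1) = 10.
# Pivot of the root = 0 + 10 = 10: the point ranked 10th (0-indexed) out of 21
# stays at the root as its median, and 10 points go to each child.
```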
Alright, the bounded version has been added. There are still some TODOs (including documentation and unit tests), but it seems to be working fine. What I am not sure about is how to handle `amplitude`; see the comment below.
For example, if … then … Similarly, for kd-trees, I am not exactly sure how to proceed when this happens. Any thoughts?
@josevalim The updated code is here. The unbounded version is removed and a predict function has been added. There are a few things I am not sure about.
One thing I noticed while comparing the unbounded and bounded versions of fit: after running …, doing … and … won't give exactly the same result, which I found slightly surprising. I don't know if this is a bug or just an XLA thing.
@krstopro This is just down to an "XLA thing". If I run the same code on the GPU, I actually get the exact same results, even though the CPU diverges slightly. I believe it might be related to the way XLA optimizes memory access at the lowest levels. f64 will also give the exact same results for both implementations, but I wouldn't worry too much about that.
Yeah, I also thought it has to do with how XLA works at a low level. Thanks for confirming.
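For the record, the usual culprit in this kind of divergence is that f32 addition is not associative, so a backend is free to produce slightly different results depending on how it orders a reduction. A minimal illustration (my own example, unrelated to the PR's code):

```elixir
x = Nx.tensor(16_777_216.0, type: :f32)  # 2^24: the next integer is not representable in f32

Nx.add(Nx.add(x, 1.0), 1.0)  #=> 16777216.0  (each +1.0 rounds away)
Nx.add(x, Nx.add(1.0, 1.0))  #=> 16777218.0  (2^24 + 2 is representable)
```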
defmodule Scholar.Neighbors.RandomProjectionForest do
  @moduledoc """
  Random Projection Forest.
Suggested change:
- Random Projection Forest.
+ Random Projection Forest for approximate k-nearest neighbors.
Or alternatively:
- Random Projection Forest.
+ Random Projection Forest for (approximate) nearest neighbor searches.
Or similar. :D
  end

  @doc """
  Computes the leaf indices for every point in the input tensor.
  @doc """
  Computes the leaf indices for every point in the input tensor.
  If the input tensor contains n points, then the result has shape {n, num_trees, leaf_size}.
Suggested change:
- If the input tensor contains n points, then the result has shape {n, num_trees, leaf_size}.
+ If the input tensor contains n points, then the result has shape `{n, num_trees, leaf_size}`.
+ These are the indices of each tree in the forest that are closest to the input data.
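For readers wondering how such leaf indices are typically consumed for approximate k-NN, here is a sketch with assumed names (`data` is the fitted dataset and `leaves` the `{n, num_trees, leaf_size}` result described above); this is not Scholar's actual API:

```elixir
defmodule ForestQuerySketch do
  import Nx.Defn

  defn knn_from_leaves(data, query, leaves, opts \\ []) do
    opts = keyword!(opts, k: 5)

    # Merge the candidates from all trees into one axis per query point
    # (duplicates across trees are not removed in this sketch).
    {n, _num_trees, _leaf_size} = Nx.shape(leaves)
    candidates = Nx.reshape(leaves, {n, :auto})

    # Exact squared distances from each query point to its own candidates.
    diff = Nx.take(data, candidates) - Nx.new_axis(query, 1)
    dist = Nx.sum(diff * diff, axes: [-1])

    # Keep the k closest candidates per query point.
    order = Nx.argsort(dist, axis: 1)
    top_k = Nx.slice_along_axis(order, 0, opts[:k], axis: 1)
    Nx.take_along_axis(candidates, top_k, axis: 1)
  end
end
```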
Beautiful work @krstopro! I have only added some minor comments and we can ship this soon.
Quick question: what is the next step from here? Which algorithm do you think we could implement on top of them?
Happy holidays!
Thanks!
I have submitted another pull request for that, see #226.
Thanks, same to you!
Yeah, I think the next step should be NNDescent rather than Annoy.
💚 💙 💜 💛 ❤️
Adds Random Projection Forest as in Random projection trees and low dimensional manifolds.

Things I am not sure about:

1. The name `grow`. Perhaps I should have used `fit` as in the rest of `Scholar`?
2. Should `num_trees` and `min_leaf_size` be passed as options (and validated)?

Docstrings are missing; I will gladly add them when the questions above are answered. More unit tests might be needed as well.
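For context, the basic building block from the referenced paper, projecting the points onto a random direction and splitting at the median, looks roughly like this in Nx (a sketch under my own naming, not the code in this PR):

```elixir
defmodule RPTreeSketch do
  import Nx.Defn

  # One random-projection split: draw a random direction, project all points
  # onto it, and send the lower half of the projections left, the rest right.
  defn split(points, key) do
    {_n, dims} = Nx.shape(points)
    {direction, key} = Nx.Random.normal(key, shape: {dims})
    proj = Nx.dot(points, direction)
    order = Nx.argsort(proj)
    half = div(Nx.axis_size(points, 0), 2)
    median = Nx.sort(proj)[half]
    {median, order[0..(half - 1)], order[half..-1//1], key}
  end
end

# Usage:
#   key = Nx.Random.key(42)
#   {median, left, right, key} = RPTreeSketch.split(Nx.iota({10, 3}, type: :f32), key)
```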