
Random Projection Forest #215

Merged: merged 31 commits on Dec 24, 2023

Conversation

@krstopro (Member) commented Dec 9, 2023

Adds Random Projection Forest as in "Random projection trees and low dimensional manifolds".
Things I am not sure about:

  • The function used to construct the forest is called grow. Perhaps I should have used fit, as in the rest of Scholar?
  • Should the arguments num_trees and min_leaf_size be passed as options (and validated)?

Docstrings are missing; I will gladly add them once the questions above are answered. More unit tests might be needed as well.

@msluszniak (Contributor) commented Dec 10, 2023

> Should the arguments num_trees and min_leaf_size be passed as options (and validated)?

Yes, I think it will be better to pass these args via opts.

> The function used to construct the forest is called grow. Perhaps I should have used fit, as in the rest of Scholar?

I don't have a strong preference, but I think it will be more consistent if we rename it to fit.

@josevalim (Contributor)

Thank you @krstopro! This is a great starting point. The big question is whether we can fully convert this to defn or not. Unfortunately I will only be able to look at the paper much later, so I don't have an answer right now. If you have any thoughts, please let us know!

@josevalim (Contributor)

@krstopro I had some time to look into this and I think we may be able to move this to defn. The paper says that kdtree and rptree are the same, except they choose a different splitting rule. However, rptree also wants to customize the min_size. Perhaps we could submit both of these changes to the current KDTree implementation?

  1. First allow the splitting rule to be passed as an anonymous function to both fit_bounded and fit_unbounded.

  2. Change the construction algorithm and the KDTree representation to store a min_size and build leaf nodes of at least min_size.

Given that 2 is much harder than 1, perhaps I would start with it first. :)
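For item 1, a minimal sketch of what the splitting rule could look like as an anonymous function (the split_rule option is hypothetical, and dims and hyperplane are assumed to be in scope):

  # Hypothetical sketch: a rule maps the points (and the level) to the
  # scalar values the median split is computed on.
  kd_rule = fn tensor, level -> tensor[[.., rem(level, dims)]] end
  rp_rule = fn tensor, _level -> Nx.dot(tensor, hyperplane) end
  forest = fit_bounded(tensor, amplitude, split_rule: rp_rule, min_leaf_size: 5)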

@josevalim (Contributor)

One last comment: I took a look at the cost of implementing task item 2, and this parameter exists in scikit-learn as leafsize. All entries inside a leaf of a regular KDTree are brute-forced in the query algorithm.

The implementation for fit_bounded is actually quite trivial: we just need to compute the number of effective levels to traverse. That's because the leafsize does not actually change the result, just how we traverse it.

For example, imagine we are storing 0..9 into the kdtree. With leafsize=1, we will have:

         6
     3       8
   1   5   7   9
  0 2 4

With leafsize=2 and leafsize=3, we get the same result too. The only difference is with leafsize=4:

         6
     3       8
  012 45   7   9

And we can easily achieve this by changing the algorithm to stop earlier. The hardest part would be knowing how many children there are on the left side of 3 (and on its right side), but the original algorithm already implements this as subtree_size. So overall we need to:

  1. Change fit_unbounded to accept a leafsize (@krstopro you already did this in your algorithm, so perhaps you can submit a PR?)

  2. Change fit_bounded to accept a leafsize (I will submit a PR for this soon)

  3. Change querying to consider the leafsize. The trick is to detect whether we are in the last level. If so, you read subtree_size entries and compare them using brute force.
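As a sketch of step 3 (the function name is hypothetical), the leaf-level brute force could look like:

  # Hypothetical sketch: at the last level, compare the query against all
  # subtree_size points stored in the leaf by brute force.
  defnp brute_force_leaf(query, leaf_points) do
    # squared Euclidean distance from the query to every point in the leaf
    distances = Nx.sum((leaf_points - query) ** 2, axes: [1])
    # leaf candidates ordered by distance; the caller merges them as needed
    Nx.argsort(distances)
  end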

@krstopro (Member Author) commented Dec 11, 2023

Thanks for the feedback @josevalim.

It might be possible to write this fully within defn; I went with the unbounded version first to check the accuracy, and it seems fine.
I am still trying the fully defn version, but there are some issues with it.

> One last comment: I took a look at the cost of implementing task item 2, and this parameter exists in scikit-learn as leafsize. All entries inside a leaf of a regular KDTree are brute-forced in the query algorithm.

There might be a problem with querying such a tree (or even a forest) in parallel if the leaves are of different sizes. I am avoiding this by making the forest complete and making the leaf size equal for all leaves (the left subtree might be larger by 1 than the right subtree, and when querying the right leaf we might query 1 element from the left leaf; this is no issue at all).

Update: my bad, subtree_size should be able to account for this. Apologies.
Update: Or not. Not sure about this. :)

@josevalim (Contributor)

Ah, my comment was written from the perspective that leafsize is a maximum, not a minimum. But it is easy to flip it around and make it a minimum.

> There might be a problem with querying such a tree (or even a forest) in parallel if the leaves are of different sizes.

Perhaps a separate algorithm is indeed best, because the current one produces left-balanced trees, which put many more elements on the left side.

The bounded algorithm is not complicated. The hardest part is deriving the formulas for the pivot (both bounded_segment_begin and bounded_subtree_size), but perhaps there are already definitions for these elsewhere.

{median, left_indices, right_indices} =
  Nx.Defn.jit_apply(&split(&1, &2, &3), [tensor, indices, hyperplane])

medians = Nx.put_slice(medians, [0, node], Nx.new_axis(median, 1))
Contributor

Move this inside split. We should have as many operations inside defn as possible.

Member Author

This is easy to do, but I don't see exactly why it is better, given that put_slice is an Nx function. Can I ask for an explanation?

Contributor

Nx in Elixir land needs to preserve the immutable semantics of the language, so every time you receive an output, it is fully copied into memory. In this case, with the two operations in the snippet, you are making a full copy of median, left_indices, right_indices, and medians. By moving this operation inside, at least you don't need to copy median.

This is exactly why moving the whole thing to a defn is beneficial: you don't have to worry about copying variables in a loop.
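Concretely, a sketch of the suggestion (split/3 as in the snippet above; the wrapper name is hypothetical):

  # Sketch: write the median on-device inside the defn, so only the final
  # outputs cross the Nx boundary and median is never copied out.
  defn split_and_store(tensor, indices, hyperplane, medians, node) do
    {median, left_indices, right_indices} = split(tensor, indices, hyperplane)
    medians = Nx.put_slice(medians, [0, node], Nx.new_axis(median, 1))
    {medians, left_indices, right_indices}
  end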

Member Author

I see, thanks.

@josevalim (Contributor)

> The function used to construct the forest is called grow. Perhaps I should have used fit, as in the rest of Scholar?

Yes!

> Should the arguments num_trees and min_leaf_size be passed as options (and validated)?

As far as I understand, they are already validated and accepted as options. I would only give min_leaf_size a default of 1.
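For example, with the keyword! validation used elsewhere in this PR (a sketch, not the final option list):

  # num_trees is required; min_leaf_size defaults to 1.
  opts = keyword!(opts, [:num_trees, min_leaf_size: 1])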

So overall, this looks great. The biggest question is indeed only the defn parts.

@krstopro (Member Author) commented Dec 11, 2023

> As far as I understand, they are already validated and accepted as options. I would only give min_leaf_size a default of 1.

Yes, options are already implemented.

> So overall, this looks great. The biggest question is indeed only the defn parts.

Honestly, I am not sure how important this is at the moment. Currently it takes seconds for 100K points on CPU; I am using an M1 Mac, so I cannot test it with CUDA.

@josevalim (Contributor)

> Currently it takes seconds for 100K points on CPU; I am using an M1 Mac, so I cannot test it with CUDA.

The issue is precisely that going in and out of CUDA is even more expensive. Plus, even on CPU, KDTrees are 30x faster and use considerably less memory in their defn version.

@krstopro (Member Author)

There is another optimization I should implement. Instead of computing the projections each time in split, compute them once per level for the entire tensor. This is possible because a single hyperplane is used on each level (this wasn't the original approach).
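For illustration, a sketch of that optimization (treating hyperplanes as a per-level stack is my assumption):

  # Project the entire dataset onto this level's hyperplane once; every
  # split at this level then works on slices of proj instead of
  # re-projecting its own points.
  proj = Nx.dot(tensor, hyperplanes[level])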

Also, I have the predict function ready. I was thinking of putting it inside the k-NN graph construction module, but I changed my mind and will put it here.

@josevalim (Contributor) commented Dec 11, 2023

@krstopro btw, can you confirm that, for a given size N and min_leaf_size, this is what your trees look like:

N = 20
                        11
            6                           16
 1 2 3 4 5    7 8 9 10      12 13 14 15    17 18 19 20

N = 19
                        10
           5                            15
 1 2 3 4      6 7 8 9       11 12 13 14    16 17 18 19

N = 18
                        10
           5                            15
 1 2 3 4      6 7 8 9       11 12 13 14    16 17 18

@krstopro (Member Author) commented Dec 11, 2023

> @krstopro btw, can you confirm that, for a given size N and min_leaf_size, this is what your trees look like

Something different. In the random projection forest as implemented here, the nodes do not correspond to points in the dataset. The indices field specifies how the points in the leaves are arranged. For the example above, if all the hyperplanes are positive, indices will be 0 1 ... 19. The first five points belong to the first leaf, the second five to the second, etc.
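To illustrate the layout just described (N = 20 with leaves of size 5, in the all-positive case):

  # indices = 0 1 ... 19 is read off in contiguous blocks of leaf size 5:
  # leaf 0 -> indices[0..4]   = 0 1 2 3 4
  # leaf 1 -> indices[5..9]   = 5 6 7 8 9
  # leaf 2 -> indices[10..14] = 10 11 12 13 14
  # leaf 3 -> indices[15..19] = 15 16 17 18 19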

@josevalim (Contributor) commented Dec 11, 2023

In any case, for what it's worth, here is an implementation of fit_bounded that generates a balanced tree (instead of a left-balanced one) with leaf nodes of at least min_leaf_size:

  test "sizes" do
    IO.inspect(fit_bounded(Nx.iota({21, 1}), 20, min_leaf_size: 2))
    IO.inspect(fit_bounded(Nx.iota({20, 1}), 20, min_leaf_size: 2))
    IO.inspect(fit_bounded(Nx.iota({19, 1}), 20, min_leaf_size: 2))
    IO.inspect(fit_bounded(Nx.iota({18, 1}), 20, min_leaf_size: 2))
  end

  import Nx.Defn

  deftransformp bounded_level(%Nx.Tensor{type: {:u, 32}} = i) do
    Nx.subtract(31, Nx.count_leading_zeros(Nx.add(i, 1)))
  end

  deftransformp levels(size, min_leaf_size, levels) do
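    # Recurse on the larger half (ceil(size/2)) until a further split
    # would produce a child smaller than min_leaf_size.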
    mid = div(size, 2)

    if mid < min_leaf_size do
      levels
    else
      levels(mid + rem(size, 2), min_leaf_size, levels + 1)
    end
  end

  defn fit_bounded(tensor, amplitude, opts \\ []) do
    opts = keyword!(opts, min_leaf_size: 1)
    {size, dims} = Nx.shape(tensor)
    levels = levels(size, opts[:min_leaf_size], 0)
    band = amplitude + 1
    tags = Nx.broadcast(Nx.u32(0), {size})

    {level, tags, _tensor, _band} =
      while {level = Nx.u32(0), tags, tensor, band}, level < levels - 1 do
        k = rem(level, dims)
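        # band exceeds the data amplitude, so sorting by value + band * tags
        # keeps each node's points contiguous, ordered by coordinate k within
        # the node.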
        indices = Nx.argsort(tensor[[.., k]] + band * tags, type: :u32, stable: true)
        tags = update_tags(tags, indices, level)
        {level + 1, tags, tensor, band}
      end

    k = rem(level, dims)
    Nx.argsort(tensor[[.., k]] + band * tags, type: :u32, stable: true)
  end

  defnp update_tags(tags, indices, level) do
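    # Rank each point within the node-contiguous order; the first
    # 2^level - 1 positions are pivots fixed at earlier levels and keep
    # their tags, the point at pivot stays as this node's median, and
    # the rest move to the left or right child.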
    pos = Nx.argsort(indices, type: :u32)
    pivot = bounded_segment_begin(tags) + bounded_subtree_size(left_child(tags))

    Nx.select(
      pos < (1 <<< level) - 1,
      tags,
      Nx.select(
        pos < pivot,
        left_child(tags),
        Nx.select(
          pos > pivot,
          right_child(tags),
          tags
        )
      )
    )
  end

  defnp bounded_subtree_size(i) do
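    # Number of points in the subtree rooted at node i: the 2^level - 1
    # nodes above hold one point each; the rest are split evenly across
    # the 2^level subtrees at this level, leftmost ones taking one extra.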
    {size} = Nx.shape(i)
    denominator = 1 <<< bounded_level(i)
    top = denominator - 1
    shared = div(size - top, denominator)
    remaining = rem(size - top, denominator)
    shared + Nx.select(i - top < remaining, 1, 0)
  end

  defnp bounded_segment_begin(i) do
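    # Rank at which node i's segment begins: the points held by nodes
    # above, plus the full segments of this level's subtrees left of i.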
    {size} = Nx.shape(i)
    denominator = 1 <<< bounded_level(i)
    top = denominator - 1
    left = i - top
    shared = div(size - top, denominator) * left
    remaining = rem(size - top, denominator)
    top + shared + Nx.min(remaining, left)
  end

  deftransform left_child(i) when is_integer(i), do: 2 * i + 1
  deftransform left_child(%Nx.Tensor{} = t), do: Nx.add(Nx.multiply(2, t), 1)

  deftransform right_child(i) when is_integer(i), do: 2 * i + 2
  deftransform right_child(%Nx.Tensor{} = t), do: Nx.add(Nx.multiply(2, t), 2)

It is pretty much the same as our KDTree, except with new bounded_segment_begin and bounded_subtree_size implementations. It could even be used to implement a new sort of KDTree; however, the traversal is a bit more complex, as you would need to ask, for the leaf nodes, what their size is.

@krstopro (Member Author) commented Dec 16, 2023

Alright, the bounded version has been added. There are still some TODOs (including documentation and unit tests), but it seems to be working fine. What I am not sure about is how to handle amplitude; see the comment below.

@krstopro (Member Author) commented Dec 16, 2023

For example, if

key = Nx.Random.key(12)
{x, _key} = Nx.Random.uniform(key, shape: {135, 1})

then x[79] = 0.12850868701934814 and x[133] = 0.1285092830657959 (they are very close). The last hyperplane in the forest created with fit_bounded(x, min_leaf_size: 1, num_trees: 1, key: key) will be -1.2941265106201172, and the projections of x[79] and x[133] will be -0.1663064956665039 and -0.166307270526886 respectively.
Furthermore, amplitude = 1.2841360569000244 and band = 2.2841360569000244 as they are calculated now, and the tags are 43 for both points on level = 6. We then get that band * tags + proj is equal to 98.05154418945312 for both points.

Similarly, for kd-trees, amplitude = 0.992280125617981 and band = 1.992280125617981, and if the tags are 43 for both points (some other number might work as well), we get that tensor[[.., 0]] + band * tags is equal to 85.79655456542969 for both of them.

I am not exactly sure how to proceed when this happens. Any thoughts?

@krstopro (Member Author)

@josevalim The updated code is here. The unbounded version is removed and the predict function is added. There are a few things I am not sure about.

@krstopro (Member Author)

One thing I noticed while comparing the unbounded and bounded versions of fit. After running

Nx.default_backend(EXLA.Backend)
key = Nx.Random.key(12)
{matrix, _key} = Nx.Random.uniform(key, type: :f32, shape: {10, 10})
{vector, _key} = Nx.Random.normal(key, type: :f32, shape: {10})

doing

Nx.dot(matrix, vector)

and

Nx.dot(Nx.new_axis(matrix, 0), [2], Nx.new_axis(vector, 0), [1])[[0, .., 0]]

won't give exactly the same result, which I found slightly surprising. I don't know if this is a bug or just an XLA thing.

@polvalente (Contributor)

> won't give exactly the same result, which I found slightly surprising. I don't know if this is a bug or just an XLA thing.

@krstopro This is just down to an "XLA thing". If I run the same code on the GPU, I actually get exactly the same results, even though the CPU diverges slightly. I believe it might be related to the way XLA optimizes memory access down at the lowest levels.

f64 will also give exactly the same results for both implementations, but I wouldn't worry too much about that.
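For intuition, a toy example of why reduction order matters in f32 (not specific to XLA or to this code):

  # f32 addition is not associative, so two mathematically equal
  # reductions can differ in the last bits depending on evaluation order.
  a = Nx.tensor([1.0, 1.0e8, -1.0e8], type: :f32)
  Nx.sum(a)              # 0.0 if reduced left to right: (1 + 1.0e8) rounds to 1.0e8
  Nx.sum(Nx.reverse(a))  # 1.0 if reduced left to right: (-1.0e8 + 1.0e8) is exactly 0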

@krstopro (Member Author)

> @krstopro This is just down to an "XLA thing". If I run the same code on the GPU, I actually get exactly the same results, even though the CPU diverges slightly. I believe it might be related to the way XLA optimizes memory access down at the lowest levels.

Yeah, I also thought it had to do with how XLA works at a low level. Thanks for confirming.

Krsto Proroković added 2 commits December 23, 2023 13:30
Co-authored-by: Mateusz Sluszniak <[email protected]>
@@ -0,0 +1,363 @@
defmodule Scholar.Neighbors.RandomProjectionForest do
  @moduledoc """
  Random Projection Forest.
Contributor

Suggested change:
- Random Projection Forest.
+ Random Projection Forest for approximate k-nearest neighbors.

Contributor

Or alternatively:

Suggested change:
- Random Projection Forest.
+ Random Projection Forest for (approximate) nearest neighbor searches.

Or similar. :D

  end

  @doc """
  Computes the leaf indices for every point in the input tensor.
Contributor

Suggested change:
- Computes the leaf indices for every point in the input tensor.
+ Computes the leaf indices for every point in the input tensor.


  @doc """
  Computes the leaf indices for every point in the input tensor.
  If the input tensor contains n points, then the result has shape {n, num_trees, leaf_size}.
Contributor

Suggested change:
- If the input tensor contains n points, then the result has shape {n, num_trees, leaf_size}.
+ If the input tensor contains n points, then the result has shape `{n, num_trees, leaf_size}`.
+ These are the indices of each tree in the forest that are closest to the input data.

@josevalim (Contributor) left a comment

Beautiful work @krstopro! I have only added some minor comments and we can ship this soon.

Quick question: what is the next step from here? Which algorithm do you think we could implement on top of them?

Happy holidays!

@krstopro (Member Author)

@josevalim

> Beautiful work @krstopro! I have only added some minor comments and we can ship this soon.

Thanks!

> Quick question: what is the next step from here? Which algorithm do you think we could implement on top of them?

I have submitted another pull request for that, see #226.
It implements NN-expansion on top of the random projection forest. This is similar to the algorithm used in LargeVis. Also, PaCMAP uses Annoy, which is based on random projection trees (though in Annoy they are constructed differently than here), so we might be able to proceed with the dimensionality reduction methods. Ideally, we would have NNDescent on top of the random projection forest, as done in PyNNDescent. I am still thinking about implementing it (I need to see if we can avoid looping over all points using while), but NN-expansion might work just fine as it is simple and quick.

> Happy holidays!

Thanks, same for you too!

@msluszniak (Contributor)

Yeah, I think that the next step should be NNDescent rather than Annoy.

@josevalim merged commit d541516 into elixir-nx:main on Dec 24, 2023
2 checks passed
@josevalim (Contributor)

💚 💙 💜 💛 ❤️

@krstopro changed the title from "Add Random Projection Forests" to "Random Projection Forest" on Apr 21, 2024