
Random Projection Forest #215

Merged: merged 31 commits on Dec 24, 2023

Conversation

@krstopro (Member) commented Dec 9, 2023

Adds Random Projection Forest as in "Random projection trees and low dimensional manifolds".
Things I am not sure about:

  • The function used to construct the forest is called grow. Perhaps I should have used fit, as in the rest of Scholar?
  • Should the arguments num_trees and min_leaf_size be passed as options (and validated)?

Docstrings are missing; I will gladly add them once the questions above are answered. More unit tests might be needed as well.

@msluszniak (Contributor) commented Dec 10, 2023

> Should the arguments num_trees and min_leaf_size be passed as options (and validated)?

Yes, I think it will be better to pass these args via opts.

> The function used to construct the forest is called grow. Perhaps I should have used fit, as in the rest of Scholar?

I don't have a strong preference, but I think it will be more consistent if we rename it to fit.

@josevalim (Contributor)

Thank you @krstopro! This is a great starting point. The big question is whether we can fully convert this to defn or not. Unfortunately I will only be able to look at the paper much later, so I don't have an answer right now. If you have any thoughts, please let us know!

@josevalim (Contributor)

@krstopro I had some time to look into this and I think we may be able to move this to defn. The paper says that kdtree and rptree are the same, except they choose a different splitting rule. However, rptree also wants to customize the min_size. Perhaps we could submit both of these changes to the current KDTree implementation?

  1. First allow the splitting rule to be passed as an anonymous function to both fit_bounded and fit_unbounded.

  2. Change the construction algorithm and the KDTree representation to store a min_size and build leaf nodes of at least min_size.

Given that 2 is much harder than 1, perhaps I would start with it first. :)
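For item 1, a minimal sketch of what the splitting rule could look like as an anonymous function (the split_rule option is hypothetical, and dims and hyperplane are assumed to be in scope):

  # Hypothetical sketch: a rule maps the points (and the level) to the
  # scalar values the median split is computed on.
  kd_rule = fn tensor, level -> tensor[[.., rem(level, dims)]] end
  rp_rule = fn tensor, _level -> Nx.dot(tensor, hyperplane) end
  forest = fit_bounded(tensor, amplitude, split_rule: rp_rule, min_leaf_size: 5)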

@josevalim (Contributor)

One last comment: I took a look at the cost of implementing task item 2, and this parameter exists in scikit-learn as leafsize. All entries inside a leaf of a regular KDTree are brute-forced in the query algorithm.

The implementation for fit_bounded is actually quite trivial: we just need to compute the number of effective levels to traverse. That's because the leafsize does not actually change the result, just how we traverse it.

For example, imagine we are storing 0..9 into the kdtree. With leafsize=1, we will have:

         6
     3       8
   1   5   7   9
  0 2 4

With leafsize=2 and leafsize=3, we get the same result too. The only difference is with leafsize=4:

         6
     3       8
  012 45   7   9

And we can easily achieve this by changing the algorithm to stop earlier. The hardest part would be knowing how many children there are on the left side of 3 (and on its right side), but the original algorithm already implements this as subtree_size. So overall we need to:

  1. Change fit_unbounded to accept a leafsize (@krstopro you already did this in your algorithm, so perhaps you can submit a PR?)

  2. Change fit_bounded to accept a leafsize (I will submit a PR for this soon)

  3. Change querying to consider the leafsize. The trick is to detect whether we are in the last level. If so, you read subtree_size entries and compare them using brute force.
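As a sketch of step 3 (the function name is hypothetical), the leaf-level brute force could look like:

  # Hypothetical sketch: at the last level, compare the query against all
  # subtree_size points stored in the leaf by brute force.
  defnp brute_force_leaf(query, leaf_points) do
    # squared Euclidean distance from the query to every point in the leaf
    distances = Nx.sum((leaf_points - query) ** 2, axes: [1])
    # leaf candidates ordered by distance; the caller merges them as needed
    Nx.argsort(distances)
  end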

@krstopro (Member Author) commented Dec 11, 2023

Thanks for the feedback @josevalim.

It might be possible to write this fully within defn; I went with the unbounded version first to check the accuracy, and it seems fine.
I am still trying the fully defn version, but there are some issues with it.

> One last comment: I took a look at the cost of implementing task item 2, and this parameter exists in scikit-learn as leafsize. All entries inside a leaf of a regular KDTree are brute-forced in the query algorithm.

There might be a problem with querying such a tree (or even a forest) in parallel if the leaves are of different sizes. I am avoiding this by making the forest complete and making the leaf size equal for all leaves (the left subtree might be larger by 1 than the right subtree, and when querying the right leaf we might query 1 element from the left leaf; this is no issue at all).

Update: my bad, subtree_size should be able to account for this. Apologies.
Update: Or not. Not sure about this. :)

@josevalim (Contributor)

Ah, my comment was written from the perspective that leafsize is a maximum, not a minimum. But it is easy to flip it around and make it a minimum.

> There might be a problem with querying such a tree (or even a forest) in parallel if the leaves are of different sizes.

Perhaps a separate algorithm is indeed best, because the current one produces left-balanced trees, which put many more elements on the left side.

The bounded algorithm is not complicated. The hardest part is deriving the formulas for the pivot (both bounded_segment_begin and bounded_subtree_size), but perhaps there are already definitions for these elsewhere.

{median, left_indices, right_indices} =
  Nx.Defn.jit_apply(&split(&1, &2, &3), [tensor, indices, hyperplane])

medians = Nx.put_slice(medians, [0, node], Nx.new_axis(median, 1))
Contributor

Move this inside split. We should have as many operations inside defn as possible.

Member Author

This is easy to do, but I don't see exactly why it is better, given that put_slice is an Nx function. Can I ask for an explanation?

Contributor

Nx in Elixir land needs to preserve the immutable semantics of the language, so every time you receive an output, it is fully copied into memory. In this case, with the two operations in the snippet, you are making a full copy of median, left_indices, right_indices, and medians. By moving this operation inside, at least you don't need to copy median.

This is exactly why moving the whole thing to a defn is beneficial: you don't have to worry about copying variables in a loop.
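Concretely, a sketch of the suggestion (split/3 as in the snippet above; the wrapper name is hypothetical):

  # Sketch: write the median on-device inside the defn, so only the final
  # outputs cross the Nx boundary and median is never copied out.
  defn split_and_store(tensor, indices, hyperplane, medians, node) do
    {median, left_indices, right_indices} = split(tensor, indices, hyperplane)
    medians = Nx.put_slice(medians, [0, node], Nx.new_axis(median, 1))
    {medians, left_indices, right_indices}
  end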

Member Author

I see, thanks.

@josevalim (Contributor)

> The function used to construct the forest is called grow. Perhaps I should have used fit, as in the rest of Scholar?

Yes!

> Should the arguments num_trees and min_leaf_size be passed as options (and validated)?

As far as I understand, they are already validated and accepted as options. I would only give min_leaf_size a default of 1.
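For example, with the keyword! validation used elsewhere in this PR (a sketch, not the final option list):

  # num_trees is required; min_leaf_size defaults to 1.
  opts = keyword!(opts, [:num_trees, min_leaf_size: 1])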

So overall, this looks great. The biggest question is indeed only the defn parts.

@krstopro (Member Author) commented Dec 11, 2023

> As far as I understand, they are already validated and accepted as options. I would only give min_leaf_size a default of 1.

Yes, options are already implemented.

> So overall, this looks great. The biggest question is indeed only the defn parts.

Honestly, I am not sure how important this is at the moment. Currently it takes seconds for 100K points on CPU; I am using an M1 Mac, so I cannot test it with CUDA.

@josevalim (Contributor)

> Currently it takes seconds for 100K points on CPU; I am using an M1 Mac, so I cannot test it with CUDA.

The issue is precisely that going in and out of CUDA is even more expensive. Plus, even on CPU, KDTrees are 30x faster and use considerably less memory in their defn version.

@krstopro (Member Author)

There is another optimization I should implement. Instead of computing the projections each time in split, compute them once per level for the entire tensor. This is possible because a single hyperplane is used on each level (this wasn't the original approach).
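For illustration, a sketch of that optimization (treating hyperplanes as a per-level stack is my assumption):

  # Project the entire dataset onto this level's hyperplane once; every
  # split at this level then works on slices of proj instead of
  # re-projecting its own points.
  proj = Nx.dot(tensor, hyperplanes[level])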

Also, I have the predict function ready. I was thinking of putting it inside the k-NN graph construction module, but I changed my mind and will put it here.

@josevalim (Contributor) commented Dec 11, 2023

@krstopro btw, can you confirm that, for a given size N and min_leaf_size, this is what your trees look like:

N = 20
                        11
            6                           16
 1 2 3 4 5    7 8 9 10      12 13 14 15    17 18 19 20

N = 19
                        10
           5                            15
 1 2 3 4      6 7 8 9       11 12 13 14    16 17 18 19

N = 18
                        10
           5                            15
 1 2 3 4      6 7 8 9       11 12 13 14    16 17 18

@krstopro (Member Author) commented Dec 11, 2023

> @krstopro btw, can you confirm that, for a given size N and min_leaf_size, this is what your trees look like

Something different. In the random projection forest as implemented here, the nodes do not correspond to points in the dataset. The indices field specifies how the points in the leaves are arranged. For the example above, if all the hyperplanes are positive, indices will be 0 1 ... 19. The first five points belong to the first leaf, the second five to the second, etc.
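To illustrate the layout just described (N = 20 with leaves of size 5, in the all-positive case):

  # indices = 0 1 ... 19 is read off in contiguous blocks of leaf size 5:
  # leaf 0 -> indices[0..4]   = 0 1 2 3 4
  # leaf 1 -> indices[5..9]   = 5 6 7 8 9
  # leaf 2 -> indices[10..14] = 10 11 12 13 14
  # leaf 3 -> indices[15..19] = 15 16 17 18 19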

@josevalim (Contributor) commented Dec 11, 2023

In any case, for what it's worth, here is an implementation of fit_bounded that generates a balanced tree (instead of a left-balanced one) with leaf nodes of at least min_leaf_size:

  test "sizes" do
    IO.inspect(fit_bounded(Nx.iota({21, 1}), 20, min_leaf_size: 2))
    IO.inspect(fit_bounded(Nx.iota({20, 1}), 20, min_leaf_size: 2))
    IO.inspect(fit_bounded(Nx.iota({19, 1}), 20, min_leaf_size: 2))
    IO.inspect(fit_bounded(Nx.iota({18, 1}), 20, min_leaf_size: 2))
  end

  import Nx.Defn

  deftransformp bounded_level(%Nx.Tensor{type: {:u, 32}} = i) do
    Nx.subtract(31, Nx.count_leading_zeros(Nx.add(i, 1)))
  end

  deftransformp levels(size, min_leaf_size, levels) do
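    # Recurse on the larger half (ceil(size/2)) until a further split
    # would produce a child smaller than min_leaf_size.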
    mid = div(size, 2)

    if mid < min_leaf_size do
      levels
    else
      levels(mid + rem(size, 2), min_leaf_size, levels + 1)
    end
  end

  defn fit_bounded(tensor, amplitude, opts \\ []) do
    opts = keyword!(opts, min_leaf_size: 1)
    {size, dims} = Nx.shape(tensor)
    levels = levels(size, opts[:min_leaf_size], 0)
    band = amplitude + 1
    tags = Nx.broadcast(Nx.u32(0), {size})

    {level, tags, _tensor, _band} =
      while {level = Nx.u32(0), tags, tensor, band}, level < levels - 1 do
        k = rem(level, dims)
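        # band exceeds the data amplitude, so sorting by value + band * tags
        # keeps each node's points contiguous, ordered by coordinate k within
        # the node.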
        indices = Nx.argsort(tensor[[.., k]] + band * tags, type: :u32, stable: true)
        tags = update_tags(tags, indices, level)
        {level + 1, tags, tensor, band}
      end

    k = rem(level, dims)
    Nx.argsort(tensor[[.., k]] + band * tags, type: :u32, stable: true)
  end

  defnp update_tags(tags, indices, level) do
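    # Rank each point within the node-contiguous order; the first
    # 2^level - 1 positions are pivots fixed at earlier levels and keep
    # their tags, the point at pivot stays as this node's median, and
    # the rest move to the left or right child.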
    pos = Nx.argsort(indices, type: :u32)
    pivot = bounded_segment_begin(tags) + bounded_subtree_size(left_child(tags))

    Nx.select(
      pos < (1 <<< level) - 1,
      tags,
      Nx.select(
        pos < pivot,
        left_child(tags),
        Nx.select(
          pos > pivot,
          right_child(tags),
          tags
        )
      )
    )
  end

  defnp bounded_subtree_size(i) do
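    # Number of points in the subtree rooted at node i: the 2^level - 1
    # nodes above hold one point each; the rest are split evenly across
    # the 2^level subtrees at this level, leftmost ones taking one extra.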
    {size} = Nx.shape(i)
    denominator = 1 <<< bounded_level(i)
    top = denominator - 1
    shared = div(size - top, denominator)
    remaining = rem(size - top, denominator)
    shared + Nx.select(i - top < remaining, 1, 0)
  end

  defnp bounded_segment_begin(i) do
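    # Rank at which node i's segment begins: the points held by nodes
    # above, plus the full segments of this level's subtrees left of i.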
    {size} = Nx.shape(i)
    denominator = 1 <<< bounded_level(i)
    top = denominator - 1
    left = i - top
    shared = div(size - top, denominator) * left
    remaining = rem(size - top, denominator)
    top + shared + Nx.min(remaining, left)
  end

  deftransform left_child(i) when is_integer(i), do: 2 * i + 1
  deftransform left_child(%Nx.Tensor{} = t), do: Nx.add(Nx.multiply(2, t), 1)

  deftransform right_child(i) when is_integer(i), do: 2 * i + 2
  deftransform right_child(%Nx.Tensor{} = t), do: Nx.add(Nx.multiply(2, t), 2)

It is pretty much the same as our KDTree, except with new bounded_segment_begin and bounded_subtree_size implementations. It could even be used to implement a new sort of KDTree; however, the traversal is a bit more complex, as you would need to ask, for the leaf nodes, what their size is.

@krstopro (Member Author) commented Dec 16, 2023

Alright, the bounded version has been added. There are still some TODOs (including documentation and unit tests), but it seems to be working fine. What I am not sure about is how to handle amplitude; see the comment below.

@krstopro (Member Author) commented Dec 16, 2023

For example, if

key = Nx.Random.key(12)
{x, _key} = Nx.Random.uniform(key, shape: {135, 1})

then x[79] = 0.12850868701934814 and x[133] = 0.1285092830657959 (they are very close). The last hyperplane in the forest created with fit_bounded(x, min_leaf_size: 1, num_trees: 1, key: key) will be -1.2941265106201172, and the projections of x[79] and x[133] will be -0.1663064956665039 and -0.166307270526886 respectively.
Furthermore, amplitude = 1.2841360569000244 and band = 2.2841360569000244 as they are calculated now, and the tags are 43 for both points on level = 6. We then get that band * tags + proj is equal to 98.05154418945312 for both points.

Similarly, for kd-trees, amplitude = 0.992280125617981 and band = 1.992280125617981, and if the tags are 43 for both points (some other number might work as well), we get that tensor[[.., 0]] + band * tags is equal to 85.79655456542969 for both of them.

I am not exactly sure how to proceed when this happens. Any thoughts?

@krstopro (Member Author)

@josevalim The updated code is here. The unbounded version is removed and the predict function is added. There are a few things I am not sure about.

@krstopro (Member Author)

One thing I noticed while comparing the unbounded and bounded versions of fit. After running

Nx.default_backend(EXLA.Backend)
key = Nx.Random.key(12)
{matrix, _key} = Nx.Random.uniform(key, type: :f32, shape: {10, 10})
{vector, _key} = Nx.Random.normal(key, type: :f32, shape: {10})

doing

Nx.dot(matrix, vector)

and

Nx.dot(Nx.new_axis(matrix, 0), [2], Nx.new_axis(vector, 0), [1])[[0, .., 0]]

won't give exactly the same result, which I found slightly surprising. I don't know if this is a bug or just an XLA thing.

@polvalente (Contributor)

> won't give exactly the same result, which I found slightly surprising. I don't know if this is a bug or just an XLA thing.

@krstopro This is just down to an "XLA thing". If I run the same code on the GPU, I actually get exactly the same results, even though the CPU diverges slightly. I believe it might be related to the way XLA optimizes memory access down at the lowest levels.

f64 will also give exactly the same results for both implementations, but I wouldn't worry too much about that.
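For intuition, a toy example of why reduction order matters in f32 (not specific to XLA or to this code):

  # f32 addition is not associative, so two mathematically equal
  # reductions can differ in the last bits depending on evaluation order.
  a = Nx.tensor([1.0, 1.0e8, -1.0e8], type: :f32)
  Nx.sum(a)              # 0.0 if reduced left to right: (1 + 1.0e8) rounds to 1.0e8
  Nx.sum(Nx.reverse(a))  # 1.0 if reduced left to right: (-1.0e8 + 1.0e8) is exactly 0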

@krstopro (Member Author)

> @krstopro This is just down to an "XLA thing". If I run the same code on the GPU, I actually get exactly the same results, even though the CPU diverges slightly. I believe it might be related to the way XLA optimizes memory access down at the lowest levels.

Yeah, I also thought it had to do with how XLA works at a low level. Thanks for confirming.

Krsto Proroković added 2 commits December 23, 2023 13:30
Co-authored-by: Mateusz Sluszniak <[email protected]>
@@ -0,0 +1,363 @@
defmodule Scholar.Neighbors.RandomProjectionForest do
  @moduledoc """
  Random Projection Forest.
Contributor

Suggested change:
- Random Projection Forest.
+ Random Projection Forest for approximate k-nearest neighbors.

Contributor

Or alternatively:

Suggested change:
- Random Projection Forest.
+ Random Projection Forest for (approximate) nearest neighbor searches.

Or similar. :D

  end

  @doc """
  Computes the leaf indices for every point in the input tensor.
Contributor

Suggested change:
- Computes the leaf indices for every point in the input tensor.
+ Computes the leaf indices for every point in the input tensor.


  @doc """
  Computes the leaf indices for every point in the input tensor.
  If the input tensor contains n points, then the result has shape {n, num_trees, leaf_size}.
Contributor

Suggested change:
- If the input tensor contains n points, then the result has shape {n, num_trees, leaf_size}.
+ If the input tensor contains n points, then the result has shape `{n, num_trees, leaf_size}`.
+ These are the indices of each tree in the forest that are closest to the input data.

@josevalim (Contributor) left a comment

Beautiful work @krstopro! I have only added some minor comments and we can ship this soon.

Quick question: what is the next step from here? Which algorithm do you think we could implement on top of them?

Happy holidays!

@krstopro (Member Author)

@josevalim

> Beautiful work @krstopro! I have only added some minor comments and we can ship this soon.

Thanks!

> Quick question: what is the next step from here? Which algorithm do you think we could implement on top of them?

I have submitted another pull request for that, see #226.
It implements NN-expansion on top of the random projection forest. This is similar to the algorithm used in LargeVis. Also, PaCMAP uses Annoy, which is based on random projection trees (though in Annoy they are constructed differently than here), so we might be able to proceed with the dimensionality reduction methods. Ideally, we would have NNDescent on top of the random projection forest, as done in PyNNDescent. I am still thinking about implementing it (I need to see if we can avoid looping over all points using while), but NN-expansion might work just fine as it is simple and quick.

> Happy holidays!

Thanks, same for you too!

@msluszniak (Contributor)

Yeah, I think that the next step should be NNDescent rather than Annoy.

@josevalim merged commit d541516 into elixir-nx:main on Dec 24, 2023
2 checks passed
@josevalim (Contributor)

💚 💙 💜 💛 ❤️

@krstopro changed the title from "Add Random Projection Forests" to "Random Projection Forest" on Apr 21, 2024