Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample function to List module, add log and exp functions to Float module #772

Merged
merged 7 commits into from
Dec 21, 2024

Conversation

ethanthoma
Copy link
Contributor

I implemented the sample function based on the discussion in issue #683

This PR contains an implementation of Algo L

I also tested Algo R but it was slower

Algo L requires log and exp, hence their addition to the float modulet

@ethanthoma
Copy link
Contributor Author

I would say the only concern is that replacing a value in the reservoir takes O(n) since its random replacement of a linked list. I also have an implementation using a dict instead if that is preferred?

@ethanthoma
Copy link
Contributor Author

For the dict version I do this:

          let reservoir =
            reservoir
            |> map2(range(0, k - 1), _, fn(a, b) { #(a, b) })
            |> dict.from_list

          let w = float.exp(log_random() /. int.to_float(k))

          do_sample(list, reservoir, k, k, w)
          |> dict.fold([], fn(acc, _, v) { [v, ..acc] })

which feels expensive but I also I think you'll pay more for the random replacement cost of List

@ethanthoma
Copy link
Contributor Author

I did a simple comparison for keeping reservoir as a List or changing it to a dict

I did 10_000 iterations of sampling 500 samples from a list of 1_000

List: 1:48 minutes
Dict: 0:08 minutes

I just used the builtin Linux time which IK is not benchmark precision but I think that's a sufficiently large difference for it to not matter

@ethanthoma
Copy link
Contributor Author

Another change is converting the reservoir to a dict before checking length

The most common case is likely to be taking less samples than the input length, so we can make that the "fast" path

The current implementation checks the length of the reservoir before we convert it to a dict

This is so that we don't pay the cost of turning the list to a dict when the sample size is larger than the input size

But since this is the "slow" path, we can make that more expensive to cheapen the length check cost for the "fast" path

@ethanthoma
Copy link
Contributor Author

I tested it with samples of length 500_000 and it made no difference lol

Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This will be super useful. I've left some small notes inline 🙏

CHANGELOG.md Outdated Show resolved Hide resolved
src/gleam/float.gleam Show resolved Hide resolved
src/gleam/float.gleam Outdated Show resolved Hide resolved
src/gleam/float.gleam Outdated Show resolved Hide resolved
}

fn log_random() -> Float {
let assert Ok(random) = float.log(float.random() +. 2.220446049250313e-16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the significance of this number float?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the result from float.random to be non-zero for log. I kind of forgot why that specific value, something related to float error. Should I use something like this value instead: rust f64::MIN_POSITIVE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched it to use the MIN_POSITIVE value

src/gleam/list.gleam Show resolved Hide resolved
src/gleam/list.gleam Outdated Show resolved Hide resolved
}
}

fn log_random() -> Float {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this function do? The name doesn't help me understand, so I'm not sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algo L takes the log of a uniform random value but float.random is inclusive of 0. I need to add a small value to it as log(0) is undefined (a Result(Nil)) to assert its not an error. I can inline it if you prefer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use it in 3 places across sample and sample_loop so I thought it would be less noisy to extract it

@ethanthoma
Copy link
Contributor Author

I updated per your feedback @lpil. Let me know if there are any other changes you'd like!

@ethanthoma
Copy link
Contributor Author

Would you like me to squash the commits? Any other changes you would like? @lpil

Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful work! Thank you very much!!

@lpil lpil merged commit 9d76bea into gleam-lang:main Dec 21, 2024
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants