
Question about Reproducing Figure 4 - Inference Time vs Vocabulary Size #1

Open
wowfingerlicker opened this issue Nov 28, 2024 · 4 comments

Comments

@wowfingerlicker

I am currently trying to reproduce the results shown in Figure 4 - Inference Time vs Vocabulary Size from your project. I have a couple of questions regarding the methodology used for this figure:

  1. What inference framework was utilized to measure the inference time?

  2. Was the embedding layer resized to the specific vocabulary size before testing the inference speed?

Thanks

@wowfingerlicker
Author

In my experiments, the NSL × inference time curve is monotonically decreasing and does not exhibit an inflection point like the one shown in your Figure 4 - Time Optimal Vocabulary.

[image: NSL × inference time curve from the reporter's experiments]

@gautierdag
Owner

gautierdag commented Nov 28, 2024

Hi, thanks for the questions!

  1. We just used multiple runs on the same hardware and tracked time naively. As long as you pick a unit of time, like iterations/ms or batches/ms, and keep it constant across all experiments, you should find, not surprisingly, that time increases as vocab size increases. We used an internal Meta repo for the inference, but to avoid possible caching optimisations, make sure to use random tokens in each batch (see the sketch after this list).

  2. A few special tokens are negligible. What matters is that you adjust the number of rows in the embedding layer to match the size of the vocabulary being tested.
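For reference, here is a minimal timing sketch of what I mean, not the internal setup we actually used: it assumes a PyTorch model whose embedding (and output projection) was built with exactly `vocab_size` rows, and the batch shape, run count, and function name are all illustrative.

```python
import time
import torch


def time_inference(model, vocab_size, batch_size=8, seq_len=512, n_runs=20, device="cuda"):
    """Naively time forward passes; returns average seconds per batch."""
    model = model.to(device).eval()
    times = []
    with torch.no_grad():
        for _ in range(n_runs):
            # Random token IDs in every batch, so no caching optimisation
            # can kick in from reusing identical inputs.
            tokens = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
            model(tokens)
            if device == "cuda":
                torch.cuda.synchronize()
            times.append(time.perf_counter() - start)
    # Whatever unit you report (here seconds per batch), keep it constant
    # across all vocabulary sizes you compare.
    return sum(times) / len(times)
```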

To obtain the optimal trade-off, like in Figure 4, also make sure that you normalise both time usage and NSL first, at the same vocabulary size.
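To make that normalisation step concrete, here is a rough sketch of how the product curve could be computed from your own measurements (NumPy only; the reference vocabulary size and names are illustrative): both curves are divided by their value at the same reference vocabulary size, and the minimum of the product is the time-optimal vocabulary.

```python
import numpy as np


def time_optimal_vocab(vocab_sizes, nsl, time_per_batch, ref_vocab=32_000):
    """Normalise NSL and time at a common reference vocab size, multiply,
    and return the vocab size where the product is smallest."""
    vocab_sizes = np.asarray(vocab_sizes)
    nsl = np.asarray(nsl, dtype=float)
    time_per_batch = np.asarray(time_per_batch, dtype=float)

    # Index of the measurement closest to the chosen reference vocab size.
    ref = np.argmin(np.abs(vocab_sizes - ref_vocab))

    nsl_norm = nsl / nsl[ref]
    time_norm = time_per_batch / time_per_batch[ref]
    product = nsl_norm * time_norm

    return vocab_sizes[np.argmin(product)], product
```

If the product is still monotonically decreasing over your range, the minimum will simply sit at your largest measured vocabulary size, which is consistent with not seeing an inflection point yet.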

@wowfingerlicker
Author

Thanks for your reply. Both time usage and NSL have been normalized in my test. However, when I multiply these two metrics together, the result remains monotonic up to a vocabulary size of 290k.

@wowfingerlicker
Author

"to avoid possible caching optimisations, make sure to use random tokens in each batch"

-- I should try this suggestion, thanks :)
