Merging new tokens into parts #778

RitwikGupta · 2025-01-09T12:03:04Z

❓ The question

I am attempting to pre-train OLMo on new data. I have tokenized the data with the OLMo tokenizer, which produced millions of small npy files.

You merge your tokens into large parts. How is this done? When I naively concatenate the npy files, I get nasty CUDA device-side assertions.

Is there a proper way to introduce new data for pre-training?
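For reference, here is a minimal sketch of an array-level merge, as opposed to concatenating the files byte for byte. Every file written by `np.save` starts with its own header, so joining the raw bytes leaves header data in the middle of the token stream. The sketch assumes (this is not confirmed anywhere in this issue) that the training loader wants one flat, header-less array of `uint16` token IDs per part. `merge_token_files`, `vocab_size`, and the output path are hypothetical names used only for illustration. It also checks the ID range, since out-of-range token IDs are one common cause of device-side assertions in an embedding lookup.

```python
import numpy as np


def merge_token_files(paths, out_path, vocab_size, dtype=np.uint16):
    """Merge many small .npy token arrays into one flat, header-less file.

    Assumption: the consumer reads the result as a raw 1-D array of `dtype`.
    Returns the total number of tokens written.
    """
    paths = list(paths)  # iterated twice below
    if vocab_size - 1 > np.iinfo(dtype).max:
        raise ValueError(f"vocab_size {vocab_size} does not fit in {np.dtype(dtype)}")

    # First pass: count tokens so the output is allocated once.
    total = sum(np.load(p, mmap_mode="r").size for p in paths)
    if total == 0:
        raise ValueError("no tokens found in input files")

    out = np.memmap(out_path, dtype=dtype, mode="w+", shape=(total,))
    offset = 0
    for p in paths:
        arr = np.load(p).reshape(-1)  # np.load strips the .npy header
        if arr.size == 0:
            continue
        # Reject IDs the embedding table cannot index.
        if arr.min() < 0 or arr.max() >= vocab_size:
            raise ValueError(f"{p}: token id outside [0, {vocab_size})")
        out[offset:offset + arr.size] = arr.astype(dtype)
        offset += arr.size

    out.flush()
    del out  # close the memmap
    return total
```

The merged part can be read back with `np.memmap(out_path, dtype=np.uint16, mode="r")`. Whether this layout matches what the OLMo data loader expects should be checked against the repository before training on it.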

RitwikGupta added the type/question label Jan 9, 2025
