
Request for pre-tokenizer that creates words based on length alone. #1697

Open
filbeofITK opened this issue Dec 10, 2024 · 2 comments · May be fixed by #1713
Comments

@filbeofITK

Hello! I would like to request a fast pre-tokenizer that simply splits the input into contiguous segments of a pre-defined length. I know this is not a common need in NLP, but it is necessary for my use case: I'm processing DNA data, which has no spaces or separators of any kind, so I'm trying to use fixed-length tokens.

Implementing this would probably take less than half an hour for someone who actually knows Rust and the backend, but I don't want to learn a new language for this.

Biggest thanks!
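
For concreteness, the requested behavior amounts to something like this (a minimal sketch; the example sequence and chunk length are made up):

```python
def chunk(seq: str, k: int) -> list[str]:
    """Split seq into contiguous chunks of length k; the last chunk may be shorter."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

print(chunk("ACGTACGTAC", 4))  # ['ACGT', 'ACGT', 'AC']
```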

jonvet linked a pull request (#1713) on Jan 5, 2025 that will close this issue

jonvet commented Jan 11, 2025

Hey @filbeofITK, you're right that this is not common in NLP, but is it common in the biomedical domain? Are there any papers where other people use something like this? If so, we might be able to merge this PR.
Otherwise, if this is more of a unique research effort, you might be better off implementing this pre-tokenizer yourself in Python, or checking out the feature branch in the linked PR.
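
For the pure-Python route, a minimal sketch using the `tokenizers` custom pre-tokenizer hook might look like the following (the class name and the example length are illustrative, and a Python callback like this carries exactly the per-call overhead the original request is trying to avoid):

```python
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class FixedLengthPreTokenizer:
    """Split each input string into contiguous chunks of `length` characters."""

    def __init__(self, length: int):
        self.length = length

    def fixed_length_split(
        self, i: int, normalized_string: NormalizedString
    ) -> List[NormalizedString]:
        # Slicing a NormalizedString keeps offset alignment with the original text.
        n = len(str(normalized_string))
        return [
            normalized_string[start:start + self.length]
            for start in range(0, n, self.length)
        ]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.fixed_length_split)


# Attach to an existing Tokenizer instance, e.g.:
# tokenizer.pre_tokenizer = PreTokenizer.custom(FixedLengthPreTokenizer(length=6))
```

Note that, as far as I know, a tokenizer carrying a custom Python component like this cannot be serialized with the usual save/load machinery, which is another argument for a Rust-side implementation like the one in the linked PR.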

@filbeofITK (Author)

Well, the DNA Transformer uses this technique, and naturally my research would build on it. I have a custom tokenizer in Python already, but unfortunately it's rather slow.
