Another Implementation (faster and more efficient) of BPE Training Algorithm #1400
Comments
Sounds nice, waiting for the translated document and more benchmarks!

I will finish the blog translation as soon as possible. Also, do you have any suggestions for more benchmarks? I'm not quite sure what kind of testing I should do, and the current code interface doesn't exactly align with the implementation in tokenizers (in fact, this code is from an assignment in my NLP class).

Would try with different hardware, different datasets, and check this benchmark.
Efficient BPE Implementation
Introduction

The BPE algorithm, as an unsupervised subword segmentation algorithm, is widely used in NLP, especially in the large language models represented by GPT (which use its variant, byte-level BPE). However, most current open-source implementations of the algorithm train slowly on Chinese. The main reason is that, although the input of the original BPE algorithm is the whole sentence, for Latin-script languages such as English a certain approximation can be made to significantly reduce the time complexity: first pre-split the text into words on spaces, and then run the BPE algorithm within each word. It is called an approximation because, unlike the original algorithm, it forbids merges across spaces; for English this works well in most cases. Most importantly, it compresses the worst case, since merges are confined to individual, typically short, words.

But this approximation does not perform well in all languages. Obviously it doesn't work for languages like Chinese and Japanese, which cannot be pre-split on spaces. And even among Latin-script languages, some, like German, have quite long words, so the precondition that lets this approximation significantly reduce the time complexity no longer holds. So here I have implemented an optimized version of the BPE algorithm without this approximation. Even as a pure Python implementation, it is substantially better in terms of speed and memory footprint than the Rust version implemented in the Hugging Face Tokenizers library. Note that the version implemented in Tokenizers is not the original naive algorithm either; it already updates its statistics incrementally (see below).
Training Algorithm

Starting point

The biggest problem with the original BPE algorithm is that, after merging just one token pair, all the statistics need to be recounted. For a language like Chinese, which has a large symbol inventory of its own, each merge actually changes very little of the previous round's statistics. Moreover, each merge also requires traversing the whole corpus, so a lot of time is wasted on retrieval. A better approach is therefore to modify only the data that actually need to change, so that once the modification is done the statistics can be used directly in the next round to find and merge the best symbol pair. The implementation in the HuggingFace Tokenizers library also follows this principle; however, its efficiency can be improved further.

Optimization

To make it possible to modify only the parts that need to be modified, without redundant retrieval operations, we essentially need to solve three core problems: how to find the most frequent token pair quickly, how to represent the merging status, and how to merge symbol pairs while updating only the affected statistics.
Here, I'll start by listing the data structures I chose (these are the names used in the code; their roles are described in the following sections):

- corpus: the training text, kept as a flat sequence of symbols;
- seg_status: a uint8 array of the same length as the corpus, recording the current merging status;
- word_pair_len: a dictionary mapping each token pair to its current frequency;
- pair_pos: a dictionary mapping each token pair to the list of starting positions where it has been counted;
- pair_freq_queue: a priority queue of (frequency, pair) entries.
Find the most frequent token pair

Using a priority queue to find the highest-frequency symbol pair is a very natural choice. However, there are different options for what information to store in the priority queue.
I notice that a similar strategy can be used to check whether the frequency of a token pair is still valid (i.e. that the frequency cached in the priority queue is consistent with the current statistics). A schematic of the code is as follows:

```python
# Pop until the cached frequency matches the current count (lazy validation);
# stale entries are pushed back with their updated frequency.
while True:
    cached_freq, pair = pair_freq_queue.pop()
    cur_freq = -word_pair_len[pair]
    if cached_freq == cur_freq:
        break
    else:
        pair_freq_queue.push((cur_freq, pair))
```

Merging Status Representation

In my implementation, I don't use a Word class to represent merged token pairs. Instead, I use an array of uint8 to represent the merging status.
For the string "apple", a possible merging process is illustrated in the sketch below.

The effect of this representation is that, even if a token has length 1, so that its start and end positions coincide, the same procedure can be used to look up the tokens immediately before and after it. Thus the data in the status array can be updated uniformly.
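A minimal sketch of how such a status array could work, assuming the array stores the current token length at each token's first and last byte (this is my reading of the description, not necessarily the exact layout of the linked repository):

```python
# Minimal sketch (assumption: seg_status[i] holds the length of the token whose first
# or last byte sits at position i; interior cells may hold stale values, never read).
import numpy as np

corpus = b"apple"
seg_status = np.ones(len(corpus), dtype=np.uint8)  # initially every byte is a 1-byte token

def merge_at(start: int) -> None:
    """Merge the token starting at `start` with the token immediately after it."""
    left_len = int(seg_status[start])
    new_len = left_len + int(seg_status[start + left_len])
    # Only the two boundary cells change: O(1) per merged position.
    seg_status[start] = new_len
    seg_status[start + new_len - 1] = new_len

def neighbors(start: int):
    """Starts of the previous and next tokens (None at the corpus edges), in O(1)."""
    length = int(seg_status[start])
    prev_start = start - int(seg_status[start - 1]) if start > 0 else None
    next_start = start + length if start + length < len(corpus) else None
    return prev_start, next_start

merge_at(1)           # "p" + "p"  -> "pp"
merge_at(0)           # "a" + "pp" -> "app"
print(seg_status)     # [3 2 3 1 1] -- the middle cell is stale but never consulted
print(neighbors(3))   # token "l" at byte 3: previous token starts at 0 ("app"), next at 4 ("e")
```

A length-1 token has its first and last byte in the same cell, so the same lookups work without any special case, which appears to be the point made above.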
Merge symbol pairs

At the initialization stage, the starting positions of all adjacent symbol pairs are counted and stored in a dictionary (pair_pos), and the length of each position list at that point is assigned to word_pair_len. After that, each merge walks the recorded positions of the chosen pair, updates the merge-status array in place, and adjusts the affected pair statistics (special cases are discussed below).
Special Situations

During the merge process, there are two special cases that arise and need to be handled.
In essence, the second special case is also an instance of the first, and the modification needed to handle it is simple.
Memory Compression

In the process above, when a pair is merged, the positions previously recorded for pairs that involved the merged symbols become stale; they are not removed immediately, so the recorded data keeps growing. At this point, a memory compression mechanism can be introduced to control memory growth. The principle is very simple: each time a pair is taken from the priority queue, check the proportion of stale entries in its position list, and rebuild the list if that proportion is too high. Compared with the original BPE training process, trading some space for time is unavoidable if we want to reach our optimization goal, but with the memory compression mechanism the final space complexity can still be kept on the order of O(N).
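A sketch of that compaction check, under the same assumptions as above (the helper still_starts_pair and the 0.5 threshold are hypothetical, only to make the idea concrete):

```python
# Hypothetical sketch: once too many recorded positions of a pair have gone stale,
# rebuild its position list and drop the dead entries.
def compress_positions(pair, pair_pos, word_pair_len, still_starts_pair, waste_ratio=0.5):
    positions = pair_pos[pair]
    live = word_pair_len[pair]                       # current true count of this pair
    if positions and (len(positions) - live) / len(positions) > waste_ratio:
        pair_pos[pair] = [p for p in positions if still_starts_pair(p, pair)]
```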
If I understand correctly, I need to train BPE on big.txt with different hardware? I tested it on my MacBook Pro (Intel i7-7700HQ (8) @ 2.80GHz, 16GB RAM), using a single thread.
Note that how to use multiple threads in my implementation is still to be discussed. By the way, the blog was translated with the help of DeepL (not translated directly through DeepL), so it might not read very naturally.
Hi, any further discussion?

Hey! Sorry, I'm a bit low on bandwidth; I need to read the blog post and take some time to check this out! 🚀

Very exciting otherwise! 🤗

Have not had the time yet, sorry.

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@ArthurZucker Is there any other help I can offer?

Actually, if you could open a PR it would be amazing! 🤗
I will give it a try |
I've got a problem with handling the max token length. In my implementation, I use a vector of u8 to store the merging status, which limits the maximum length of a single token to 256. It works in most cases; even in some very demanding situations, I think a u16 limit would be enough. So I'm asking what I should do about this. @ArthurZucker
Memory should not be that much of a problem, so I would keep usize. Or does it affect speed too much?
Theoretically there is little difference in speed. But if you consider GB-level corpus data, 8 times the memory overhead relative to the text size is still something to worry about, I think. If the server doesn't have enough memory, this is likely to push a lot of data into swap, which can significantly impact performance. Also, I'm not quite sure how tokenizers does parallelism now, e.g. how to get a mutex lock on some global data, so I'll implement the single-threaded version in my fork first.
Or, we could add a runtime dispatch on its dtype according to max_token_length, and let users decide which to use.
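A rough sketch of what that dispatch could look like (illustrative only; the real change would live in the Rust trainer, and the names here are placeholders):

```python
# Illustrative only: pick the narrowest unsigned dtype that can represent max_token_length.
import numpy as np

def status_dtype(max_token_length: int):
    for dt in (np.uint8, np.uint16, np.uint32):
        if max_token_length <= np.iinfo(dt).max:
            return dt
    return np.uint64

print(status_dtype(255))     # uint8
print(status_dtype(10_000))  # uint16
```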
@ArthurZucker Progress is much faster than I thought it would be. I've now passed the 3 built-in test cases. After tidying up the code and adding comments, I'll upload it to my fork first. Further parallel optimizations will follow, as well as adding dropout support. But I think we need to add some special case tests, like how to handle strings like "00000000000".
In the process of modifying the BPE training code in tokenizers, I think I've found the main reason the original implementation is slow: Word's merge method is inefficiently implemented. Frequent remove and insert operations on large vectors are very costly. Let the number of symbols in a word be N, and the number of token-pair positions hit be M; the complexity of one call to the merge function is then roughly O(N·M). Luckily, tokenizers has a good pre-tokenization mechanism that keeps N small and the overall complexity at a manageable level. But after my detailed comparison today, I think this merge method is still the main difference between the original implementation and mine. In my implementation, each position of the token pair is modified with only O(1) work, so for one "Word" the merge operation is reduced to O(M). Moreover, changing the Vec to a linked list still doesn't reduce the original implementation to O(M), because each call to merge traverses the entire vector of symbols, which by itself implies O(N) complexity.

```rust
pub(super) fn merge(
    &mut self,
    c1: u32,
    c2: u32,
    replacement: u32,
    max_length: usize,
) -> Vec<(Pair, i32)> {
    let mut changes: Vec<(Pair, i32)> = vec![];
    let mut i = 0;
    loop {
        if i >= self.symbols.len() {
            break;
        }
        // Found a pair
        if self.symbols[i].c == c1 && i + 1 < self.symbols.len() && self.symbols[i + 1].c == c2 {
            let first = self.symbols[i];
            let second = self.symbols[i + 1];
            // Remove in place
            let new_s = Symbol {
                c: replacement,
                prev: first.prev,
                next: second.next,
                len: first.len + second.len,
            };
            // If there are other characters before the pair
            if i > 0 {
                changes.push(((self.symbols[i - 1].c, first.c), -1));
                if self.symbols[i - 1].len + new_s.len < max_length {
                    changes.push(((self.symbols[i - 1].c, replacement), 1));
                }
            }
            self.symbols.insert(i, new_s); // Insert replacement before first char of pair
            self.symbols.remove(i + 1); // Remove first char of pair
            self.symbols.remove(i + 1); // And then the second
            // If there are other characters after the pair
            if i < self.symbols.len() - 1 {
                changes.push(((second.c, self.symbols[i + 1].c), -1));
                if self.symbols[i + 1].len + new_s.len < max_length {
                    changes.push(((replacement, self.symbols[i + 1].c), 1));
                }
            }
        }
        i += 1;
    }
    changes
}
```
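To make the cost concrete, here is a tiny illustration (mine, not from the thread) of why repeated insert/remove in a contiguous array hurts: each edit is O(N), so doing about N/3 of them grows quadratically with N.

```python
# Tiny illustration: front insert/remove in a contiguous array is O(N) per operation,
# so total time grows roughly quadratically with N (doubling N ~ quadruples the time).
import timeit

def churn(n: int) -> None:
    v = list(range(n))
    for _ in range(n // 3):   # roughly one edit per few symbols, as in a merge pass
        v.insert(0, -1)       # shifts the whole tail: O(N)
        v.pop(0)              # shifts it back: O(N)

for n in (10_000, 20_000, 40_000):
    print(n, f"{timeit.timeit(lambda: churn(n), number=1):.3f} s")
```

A linked-list-style O(1) edit per hit position, as described above, avoids exactly this blow-up.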
Wow, that sounds great. If we can just modify that single function, that would be pretty impressive!
I found out about this article after reading on
@AugustasMacijauskas Thank you for your attention. I have written about the improvement in detail in my previous reply. However, I found it still a bit difficult to add it behind the existing interface of tokenizers. There are two main problems:

1. The algorithm I proposed before limits the maximum length of a single token to 256 chars, which does not meet the interface's semantic requirements (and lifting it requires some modification to the design of the algorithm I proposed).
2. The tokenizers BPE algorithm includes support for adding prefixes and suffixes to consecutive characters, which I hadn't considered before and haven't figured out how to handle yet.

Since I have limited energy at the moment, I haven't made much progress yet. Of course, you can take a look at my original repository (albeit with very few comments) to help you understand my implementation. Feel free to open an issue in my repository if you have any questions!
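For reference, I believe the prefix/suffix support mentioned above corresponds to the continuing_subword_prefix and end_of_word_suffix options of the existing BPE trainer; a minimal usage sketch (the training file and option values are just placeholders):

```python
# Minimal sketch of the trainer options that the prefix/suffix support refers to
# (values and the training file are illustrative).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=1000,
    continuing_subword_prefix="##",   # prepended to tokens that continue a word
    end_of_word_suffix="</w>",        # appended to tokens that end a word
)
tokenizer.train(["big.txt"], trainer)
```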
Oh, by the way, tiktoken is also written in Rust (with just 600 lines). So it might be more feasible to introduce the Rust part of tiktoken directly into tokenizers?
When you say "introduce the rust part of tiktoken directly into tokenizers", isn't that irrelevant, since tiktoken only has code for inference, while what you propose regards tokenizer training?

Sorry, I didn't look closely at tiktoken; I thought it contained both training and inference code. By the way, one thing worth noting is that BPE's training and inference processes are similar, and in my own attempts this improved method can also be used for inference, although I didn't notice much performance gain, probably because of the small test data.

Yeah, tiktoken only contains inference code. Either way, thank you for your answers. I'll take some more time to process the code you proposed and I might come back if I have some more questions.

I'll check if it's possible to include the rust part that makes it faster in tiktoken here. I think they have a super efficient regex thing. Will check.
That's a good idea. This actually made me realize that it'd be great to profile each part of the tokenization process separately for both tiktoken and huggingface to see what improvements can be made. Essentially, the running times for regex splitting and then computing the tokens based on the vocab should be profiled, but maybe more fine-grained profiling could be useful too. I could try looking into this, or is it well known that the regex splitting is the bottleneck? Also, not sure how much of a difference this makes, but tiktoken operates on byte level instead of string level. Any chance that this leads to performance improvements?
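A rough starting point for such a comparison could be an end-to-end timing like the sketch below (my sketch; per-stage numbers for splitting vs. merging would require hooking into each library's internals, and big.txt is just the file mentioned earlier in the thread):

```python
# Rough end-to-end comparison, not a rigorous benchmark: regex splitting and merging
# are measured together here.
import time

import tiktoken
from tokenizers import Tokenizer

text = open("big.txt", encoding="utf-8").read()

tt = tiktoken.get_encoding("gpt2")
hf = Tokenizer.from_pretrained("gpt2")

start = time.perf_counter()
tt_ids = tt.encode(text)
mid = time.perf_counter()
hf_ids = hf.encode(text).ids
end = time.perf_counter()

print(f"tiktoken:   {mid - start:.2f} s ({len(tt_ids)} tokens)")
print(f"tokenizers: {end - mid:.2f} s ({len(hf_ids)} tokens)")
```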
Oh, and they use simple Python multiprocessing to introduce parallelism instead of doing it on the Rust side.
Well, we want to make sure our rust users also benefit from parallelism!

Oh, right, I somehow overlooked the fact that it's used as a standalone library as well, not just Python bindings 😅

No worries! But if it can be improved I am all for it!
I have also recently come up with a similar algorithm as a fun exercise. I've described how it works step by step in my GitHub repo: Efficient BPE Tokenization from Scratch. The full implementation is also there. The idea is similar, but it does not need valid-position tracking or merge-status tracking. It uses only 2 data structures: a modified version of a priority queue and a modified version of a linked list. The detailed explanation is in the readme of the repo.
@marta1994 Great work and great animation!!! 🤗 Your article is much more beginner-friendly. I notice that you only have a performance comparison with a "naive" version; could you also add the performance of the Hugging Face tokenizer? Because of how suboptimal that implementation is, even a pure Python BPE trainer can achieve comparable performance. As for implementation details, your _update_left_token updates the position structure and the priority queue eagerly at every affected position:

```python
def _update_left_token(self, input_index, token_index, merge_stat, new_token):
    positions = self._positions[input_index]
    left_token_index = positions.get_previous_index(token_index)
    if left_token_index == None:
        return
    pair = (positions.get_by_index(left_token_index), merge_stat.pair[0])
    self._remove_position_from_pair(merge_stat, pair, input_index, left_token_index)  # O(log M)
    new_pair = (pair[0], new_token)
    self._add_position_to_pair(new_pair, input_index, left_token_index)  # O(log M)
```

In my implementation, by contrast, the reduction of the real position list and the update of the priority queue are lazy; this strategy is more efficient. I also think that a doubly linked list is not a perfect solution. If implemented in C++, it would cause severe memory fragmentation and be unfriendly to the CPU cache. Even if the memory fragmentation issue were resolved with methods such as a memory pool, the memory usage of this approach would still be significantly higher than using a merging-status array (3×8 bytes for a node versus 1 byte for the status of a char in the corpus).
@Yikai-Liao I wouldn't call O(log(M)) a bad complexity. It is true that O(1) is better, but the difference in order of magnitude is small. It is also an interesting idea to store pair edits for a pair of tokens being merged and only apply them once all occurrences of a specific pair have been replaced. This way we can merge updates for the same pairs (e.g. instead of removing positions for a pair "ab" 3 times and updating the heap 3 times, we can do it once). The complexity would not change, because you would still do heapify for every distinct pair, but it can improve performance in practice, because I imagine in the real world you can have a lot of repeating pairs in this case; e.g. if you are merging "ha" and "pp", you would have a lot of " " characters to the left and "y" characters to the right. I am curious to test that and compare; I think it is worth optimizing. Also, it is important how easily understandable an algorithm is. It makes it easier to test and spot bugs, so in some cases tradeoffs between the most optimal complexity / memory usage and ease of abstraction are justifiable. One example where you could simplify your implementation is to remove the unused pairs in the [update](https://github.com/Yikai-Liao/efficient_bpe/blob/main/ebpe.py#L362) method, and not only add the new pairs (as I've explained in the second paragraph). It will ensure the hashset is valid when you need to retrieve the value. It also most likely will not worsen the time complexity, because currently you still potentially do multiple heappush/heappop calls to retrieve the max value, which is certainly not O(1). This will allow you to make the code clearer by getting rid of the extra validity checks when popping from the queue. Most important, I am glad that I found your implementation and can have this discussion with you!
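A minimal sketch of that batching idea (my own illustration, not code from either repository), assuming changes is the list of (pair, delta) count updates produced while one pair is being replaced across the corpus:

```python
# Sketch: collapse repeated count updates for the same pair, then touch the heap once
# per distinct pair instead of once per occurrence.
import heapq
from collections import defaultdict

def apply_changes(changes, pair_counts, heap):
    """changes: iterable of (pair, delta) produced while merging one pair everywhere."""
    delta_by_pair = defaultdict(int)
    for pair, delta in changes:
        delta_by_pair[pair] += delta
    for pair, delta in delta_by_pair.items():
        pair_counts[pair] += delta
        heapq.heappush(heap, (-pair_counts[pair], pair))  # lazy push, validated on pop
```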
@marta1994 Thanks for your response. I've now thought of ways to optimize my previous implementation, although I haven't turned them into code yet. My goal now is to prototype this algorithm as soon as possible and implement its Rust version to merge into tokenizers. The optimized version I'm envisioning addresses the issues that previously prevented me from implementing the code required by the existing interface of the huggingface BPE trainer, and by my estimation it should be superior to the previous one in the constant factors of both space and time complexity. The issues are the ones mentioned earlier: the 256-char limit on the length of a single token, and support for the subword prefix and end-of-word suffix.
Tomorrow I'll try to implement a prototype in Python, and then start working on the Rust version. As for pair_pos and word_pair_len, I did consider using a set. But my early tests in Python showed that using a set directly is slower than using an array, even though they have the same time complexity, i.e. the latter has a smaller constant factor. In particular, to add multithreading support, I need an ordered sequence of positions. Although I could use something like a B-Tree set in Rust, I still think a plain sorted array is the better choice: not removing is definitely faster than removing. In my new implementation, I will no longer need to use seg_status to mark split statuses. Ultimately, only the corpus and pair_pos will be kept at O(N) space complexity, which means the new implementation will use less space. Since this new implementation basically solves all the problems I had before, I now can't wait to implement it and share it with the community.
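A quick constant-factor check of the list-vs-set point (illustrative only; both operations are amortized O(1), the difference is just the constant):

```python
# Record one million positions in a list vs in a set and compare the wall time.
import timeit

N = 1_000_000

def fill_list():
    a = []
    for i in range(N):
        a.append(i)

def fill_set():
    s = set()
    for i in range(N):
        s.add(i)

print(f"list.append: {timeit.timeit(fill_list, number=5):.3f} s")
print(f"set.add:     {timeit.timeit(fill_set, number=5):.3f} s")
```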
I have finished the single-threaded prototype implementation in Python, with the other features I mentioned above: https://github.com/Yikai-Liao/efficient_bpe/blob/main/ebpe_v2.py. It should be easy to extend to a multi-threaded version, by just separating pos_list into multiple slices with guaranteed intervals. I'm sorry to say that instead of organizing my code in an object-oriented way, I wrote several functions with a large number of parameters in order to implement my idea as quickly as possible. But this time I've added a lot of English comments while writing the code, which should make it easier to understand. I'll add the algorithm to the repository in the form of a README as soon as I can. Feel free to share any suggestions for improvement! Looking forward to the day when the tokenizers BPE trainer reaches SOTA in all kinds of situations!
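One way the "slices with guaranteed intervals" idea could look (my own sketch, not code from ebpe_v2.py): split a sorted position list so that adjacent chunks are separated by at least min_gap, letting each thread merge its chunk without touching bytes owned by another.

```python
# Sketch: cut a sorted position list into roughly n_chunks pieces, moving each cut
# forward until the gap to the previous position is at least min_gap.
def split_with_gaps(pos_list, n_chunks, min_gap):
    target = max(1, len(pos_list) // n_chunks)
    chunks, start = [], 0
    while True:
        cut = start + target
        if cut >= len(pos_list):
            break
        while cut < len(pos_list) and pos_list[cut] - pos_list[cut - 1] < min_gap:
            cut += 1
        if cut >= len(pos_list):
            break
        chunks.append(pos_list[start:cut])
        start = cut
    chunks.append(pos_list[start:])
    return chunks

print(split_with_gaps([0, 3, 6, 9, 40, 43, 46, 80], n_chunks=3, min_gap=10))
# [[0, 3, 6, 9], [40, 43, 46], [80]]
```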
Good news: I have finished the single-threaded version in Rust, and it passes all the original tests! https://github.com/Yikai-Liao/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs There are still a few things that need to be accomplished next.
@ArthurZucker should we open another issue or reactivate this one? I'm not very familiar with how to contribute code to large open source projects like tokenizers. |
Sorry for not answering sooner! 🤗 But we can try! I am adding a new BPE here: #1712 that should be more efficient as well!
Early this year, I wrote a new implementation of the BPE algorithm in pure Python, which is faster than the version in Tokenizers.
I hope this implementation can help tokenizers further improve BPE training performance.
I have written a blog post in Chinese about this implementation. I will try to translate it into English if there is any need. By the way, the code is quite short in my opinion, at merely about 400 lines.
Here is the code: https://github.com/Yikai-Liao/efficient_bpe