BPE (byte-pair encoding): a tokenizer algorithm used for text preprocessing in NLP
When two or more merge operations
lt + rt → ltrt
in BPE have the same occurrence count, break ties as follows: choose the merge operation whose rt comes first in alphabetical order; if there is still a tie, choose the merge operation whose lt comes first in alphabetical order.
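The tie-breaking rule above can be sketched as a single selection over the pair counts (the function name and the `pair_counts` dictionary interface are illustrative assumptions, not a required interface):

```python
def best_merge(pair_counts):
    """Pick the next merge: highest occurrence count first; ties are
    broken by the alphabetically first rt, then the alphabetically
    first lt. `pair_counts` maps (lt, rt) -> occurrence count."""
    # Sorting key: negated count (so larger counts come first), then rt, then lt.
    return min(pair_counts, key=lambda p: (-pair_counts[p], p[1], p[0]))
```

Negating the count lets a single `min` implement "largest count, then smallest rt, then smallest lt" without a custom comparator.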
The list of merge operations learned by BPE follows the format below (which will be written to the voc file):
lt1 rt1
lt2 rt2
...
The default size of the vocabulary is 10,000.
python a1.py --learn_bpe --inpath {path to input text} --outpath {path to output text} --vocab {path to vocab file} --vocab_size {size}
The above command will learn the BPE merge operations. Given a training text specified by the --inpath argument (e.g., trn), it generates the BPE-tokenized text in the output file specified by the --outpath argument (e.g., bpe_trn) and outputs the list of ordered merge operations in the file specified by the --vocab argument (e.g., voc). The size of the vocabulary to be learned is specified by the --vocab_size argument, which will default to 10,000 if this argument is not specified.
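The learning step described above can be sketched as the standard BPE loop: repeatedly count adjacent symbol pairs, pick the best pair under the tie-breaking rule, and merge it throughout the corpus. The function name, the word-frequency input, and character-level initial symbols are assumptions for illustration, not the assignment's required interface:

```python
from collections import Counter

def learn_bpe(word_freqs, vocab_size):
    """Learn up to `vocab_size` ordered merge operations.
    `word_freqs` maps a whitespace token to its corpus frequency (an
    assumed input format; the real script reads a text file)."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(w): c for w, c in word_freqs.items()}
    merges = []
    while len(merges) < vocab_size:
        # Count every adjacent (lt, rt) pair, weighted by word frequency.
        pairs = Counter()
        for syms, c in corpus.items():
            for lt, rt in zip(syms, syms[1:]):
                pairs[(lt, rt)] += c
        if not pairs:
            break  # nothing left to merge
        # Highest count; ties broken by alphabetical rt, then lt.
        lt, rt = min(pairs, key=lambda p: (-pairs[p], p[1], p[0]))
        merges.append((lt, rt))
        # Apply the chosen merge everywhere in the corpus.
        new_corpus = {}
        for syms, c in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and syms[i] == lt and syms[i + 1] == rt:
                    out.append(lt + rt)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + c
        corpus = new_corpus
    return merges
```

Each learned `(lt, rt)` pair would then be written as one `lt rt` line of the voc file, in the order learned.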
python a1.py --apply_bpe --inpath {path to input text} --outpath {path to output text} --vocab {path to vocab file}
The above command will apply the BPE merge operations specified by the --vocab argument (e.g., voc). Given a text specified by the --inpath argument (e.g., tst), it generates the BPE-tokenized text in the output file specified by the --outpath argument (e.g., bpe_tst).
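Applying the learned merges to new text is simpler than learning them: each merge from the voc file is replayed, in order, over every whitespace token. A sketch for a single token (the function name is an assumption):

```python
def apply_bpe(word, merges):
    """Tokenize one whitespace token with an ordered list of learned
    (lt, rt) merge operations, i.e., the contents of the voc file."""
    syms = list(word)  # start from individual characters
    for lt, rt in merges:
        out, i = [], 0
        while i < len(syms):
            # Merge every adjacent occurrence of (lt, rt) into ltrt.
            if i + 1 < len(syms) and syms[i] == lt and syms[i + 1] == rt:
                out.append(lt + rt)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms
```

Because the merges are replayed in the order they were learned, a test token is segmented exactly as it would have been during training.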