Reducing Vocabulary Size

When talking about vocabulary in NLP, we mean the set of unique tokens in a corpus; the vocabulary size is the number of these tokens.

In source code, we can see many words that occur often across projects: language keywords (while, private, int, etc.), names of common library methods (compareTo, split, toString, etc.), and common identifier names (counter, i, tmp, etc.). The N most frequent words account for k% of all word occurrences in the corpus.

However, there are also words that are more project-specific and that significantly increase the vocabulary size as the corpus grows. Some examples include:

  • Less common identifier names that consist of multiple words (__new_class_creator, PlaybackClassListenerFailure, etc.)
  • Domain- or project-specific terms (VanillaOption, dnaAnalyzer)
  • Words in languages other than English, including those that contain non-ASCII characters (comentario, ввести)
  • Automatically generated numbers (0xff3434, 2342423424L)
  • Words that do not exist in natural language dictionaries due to typos (Englihs)

The graph below shows how the vocabulary size depends on the size of the corpus.

TBA

For language modeling and classification tasks, a vocabulary size of a few million is a significant cost compared to the gain obtained from increasing the size of the corpus. It would be beneficial to be able to scale these tasks to larger corpora while keeping the vocabulary size reasonable.

We consider several ways to reduce the vocabulary size of the corpus.

Reducing non-English vocabulary

Identifiers in source code are normally named with English words. However, despite being common practice, this is not always the case. Moreover, for language modeling we also consider comments and string literals, where non-English words, including those containing non-ASCII characters, are even more frequent.

TBA: probably a graph should go here showing how many non-English words there are in the corpus.

Removing non-English words can be done on several levels:

  • Removing a whole project
  • Removing a whole file
  • Removing a file section
  • Removing a word

It is indeed the case that non-English vocabulary is sometimes found throughout a project. More often, however, it is concentrated in a few files: localization files, files with a great number of string literals in different languages, etc. In these cases removing whole projects does not make sense. TBD: check if it is always the case that we remove all the files in the project.

Removing individual files seems to be a good idea, since we treat all files independently and do not exploit the file structure or the relations between files in a project. So if, for example, a project contains a localization file which is the only file with a significant amount of non-English vocabulary, it will be the only file removed. However, when using this approach instead of removing whole projects, sometimes all files of a project end up being removed anyway. (show some stats)

Removing file sections is generally a bad idea because we treat files as sequences of tokens, and removing part of such a sequence means losing information. However, there are parts of files whose removal would not cause much information loss. One example is a multiline comment describing a class/method/field in a language other than English, or a large string literal.

Removing individual words means we avoid losing any English text that could still be made use of.

We combine some of these approaches to get the best ratio of reduced vocabulary and lost tokens. We proceed as follows: we remove 'non-English files' from the corpus and replace remaining 'non-English sections' and 'non-English words' with corresponding placeholders.

In order to proceed with this, we need to define what we call a 'non-English word' and 'non-English file'.

A word is non-English if:

  • It contains non-ASCII characters (note that, according to this definition, for simplicity, words like Café and Naïve are currently considered non-English).

OR

  • it is not found in the English dictionaries, AND
  • it is found in at least one of the non-English dictionaries, AND
  • it has more than three letters

Dictionaries Used: TBA
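To make the definition concrete, here is a minimal sketch of the word-level check in Python. The dictionary objects (`english_dict`, `non_english_dicts`) are hypothetical stand-ins for the dictionaries listed above (TBA):

```python
# Sketch of the word-level check defined above. `english_dict` and
# `non_english_dicts` are hypothetical stand-ins for the (TBA) dictionaries.

def is_non_english_word(word, english_dict, non_english_dicts):
    # Rule 1: any non-ASCII character makes the word non-English
    # (so, for now, words like Café and Naïve count as non-English).
    if any(ord(ch) > 127 for ch in word):
        return True

    # Rule 2: not in the English dictionaries, found in at least one
    # non-English dictionary, and longer than three letters.
    lowered = word.lower()
    return (lowered not in english_dict
            and any(lowered in d for d in non_english_dicts)
            and len(word) > 3)
```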

Whether a file is non-English is checked as follows:

The number and the percentage of non-English words are counted separately for identifiers, comments, and string literals, and compared with certain predefined thresholds. We differentiate identifiers, comments and string literals because we assume that removing the same number of tokens from each of these categories results in a different loss of information (smallest loss for comments, largest for identifiers). (Maybe some example from the corpus here?)

The thresholds were determined empirically (need to specify how?). **More details and exact values TBA**
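A rough sketch of the file-level check, assuming the word-level predicate above; the threshold values below are placeholders, not the actual empirically found ones:

```python
# Hypothetical thresholds, for illustration only; the real values are TBA.
THRESHOLDS = {
    'identifiers':     {'max_count': 10, 'max_fraction': 0.05},
    'comments':        {'max_count': 50, 'max_fraction': 0.30},
    'string_literals': {'max_count': 30, 'max_fraction': 0.20},
}

def is_non_english_file(words_by_category, is_non_english_word):
    """`words_by_category` maps 'identifiers' / 'comments' / 'string_literals'
    to the list of words extracted from that part of the file;
    `is_non_english_word` is the word-level predicate sketched earlier
    (e.g. a functools.partial of it with the dictionaries bound)."""
    for category, words in words_by_category.items():
        non_english_count = sum(1 for w in words if is_non_english_word(w))
        fraction = non_english_count / len(words) if words else 0.0
        limits = THRESHOLDS[category]
        if (non_english_count > limits['max_count']
                or fraction > limits['max_fraction']):
            return True
    return False
```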

Examples of an 'English' file just above the threshold and a non-English file just below it in the Appendix?

In the files that remained:

  • Non-English words are replaced with a <non-English> placeholder.
  • Blocks of non-English text are replaced with a <non-eng-contents> placeholder.

Graph TBA (Dependence of reduced vocab on number of tokens we removed)

Cutting off infrequently used words

A simple but effective technique to keep the vocabulary size limited is to replace all infrequent words with an out-of-vocabulary placeholder. We introduce a min_freq parameter: if the number of occurrences of a word in the corpus is greater than or equal to min_freq, we keep the word. Accordingly, when we want to keep all tokens, we set this parameter to 0.
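A minimal sketch of this cutoff; the name of the placeholder token is an assumption:

```python
from collections import Counter

OOV = '<oov>'  # the exact name of the out-of-vocabulary placeholder is an assumption

def apply_min_freq(tokens, min_freq):
    """Replace every token occurring fewer than min_freq times with the OOV
    placeholder. With min_freq == 0, all tokens are kept."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_freq else OOV for tok in tokens]

# apply_min_freq(['i', 'tmp', 'i', 'dnaAnalyzer'], min_freq=2)
# -> ['i', '<oov>', 'i', '<oov>']
```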

Below, the dependency of the vocabulary size on the value of the min_freq parameter is shown. TBA

We see that already for low values of min_freq, the vocabulary is reduced significantly. As in the case of non-English vocabulary, we want to find a trade-off between the reduced size of the vocabulary and the amount of information lost.

The graph below shows the number of OOV tokens in the corpus after removing infrequent words, depending on the value of the min_freq parameter. TBA

Converting tokens to lower-case

Tokens like parser, Parser and PARSER would normally have similar semantics. However, for a model which operates on the word level, these are three different entries in the vocabulary and three different words with no connection to each other. Therefore it makes sense to lowercase all letters while preserving information about capitalization. In this case, the tokens above could be represented as follows:

  • parser -> parser (without changes)
  • Parser -> <capitalized> parser
  • PARSER -> <uppercase> parser
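A minimal sketch of this encoding, handling only the three cases above; the placeholder names follow the list:

```python
def encode_case(token):
    """Lowercase a token, prepending a placeholder that records its original
    case. Only the three cases from the list above are handled here."""
    if len(token) > 1 and token.isupper():
        return ['<uppercase>', token.lower()]
    if token[0].isupper() and token[1:].islower():
        return ['<capitalized>', token.lower()]
    return [token]

# encode_case('parser') -> ['parser']
# encode_case('Parser') -> ['<capitalized>', 'parser']
# encode_case('PARSER') -> ['<uppercase>', 'parser']
```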

Word splitting

Basic splitting

Splitting using a few trivial algorithms (see the sketch after the list):

  • camelCase -> camel case
  • snake_case -> snake case
  • splitting with numbers: cpp2java -> cpp 2 java
  • number splitting: 123e-1 -> 1 2 3 <e> - 1
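As a rough illustration, the first three rules can be implemented with a couple of regular expressions; splitting numbers into single digits (the last rule) is not covered by this sketch:

```python
import re

# Insert a break at lower-to-upper camelCase boundaries and around digit runs.
_BOUNDARY = re.compile(r'(?<=[a-z0-9])(?=[A-Z])'
                       r'|(?<=[A-Za-z])(?=[0-9])'
                       r'|(?<=[0-9])(?=[A-Za-z])')

def basic_split(token):
    """Split an identifier on snake_case, camelCase and letter/digit
    boundaries; case handling is assumed to happen separately."""
    subwords = []
    for part in token.split('_'):          # snake_case -> snake case
        if part:
            subwords.extend(_BOUNDARY.sub(' ', part).lower().split())
    return subwords

# basic_split('camelCase')  -> ['camel', 'case']
# basic_split('snake_case') -> ['snake', 'case']
# basic_split('cpp2java')   -> ['cpp', '2', 'java']
```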

Subword-frequency-based heuristics

There are cases where the camel-case or underscore conventions are not followed when identifiers are formed from multiple words:

samecasesplitting -> same case splitting

Currently, a heuristic is used that chooses a splitting among the set of all possible splittings. The heuristic favors splittings with the following properties:

  • the resulting subwords should occur individually as often as possible in the dataset
  • the length of the subwords should be close to the average length of words in the vocabulary (~5 in our case), with a higher penalty for shorter words
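The exact scoring function is not specified here; the following sketch only illustrates the shape of such a heuristic, with made-up weights:

```python
AVG_WORD_LEN = 5  # approximate average word length in our vocabulary

def score_splitting(subwords, freq):
    """Score a candidate splitting (higher is better). `freq` maps a subword
    to its stand-alone frequency in the dataset. The concrete form of the
    score and the penalty weights are illustrative assumptions."""
    frequency_term = sum(freq.get(s, 0) for s in subwords) / len(subwords)
    length_penalty = 0.0
    for s in subwords:
        deviation = AVG_WORD_LEN - len(s)
        # shorter-than-average subwords are penalized twice as hard
        length_penalty += 2 * deviation if deviation > 0 else -deviation
    return frequency_term - length_penalty

def best_splitting(candidates, freq):
    """Pick the highest-scoring splitting among all candidate splittings."""
    return max(candidates, key=lambda subwords: score_splitting(subwords, freq))

# best_splitting([['samecasesplitting'], ['same', 'case', 'splitting']], freq)
# would favor the second candidate if 'same', 'case' and 'splitting'
# occur frequently on their own.
```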

BPE

The general idea of it and the benefits it gives us: smaller vocabulary, no OOV during testing [2][3].

Resulting vocabulary after BPE.

Words excluded from BPE + the number of unique characters that form the splittable vocabulary + at most* N_MERGES

  • A merge does not result in an increased vocabulary when at least one of the subwords being merged exists only in the current combination, so it will not exist in the vocabulary on its own after the merge (when this holds for both subwords, the resulting vocabulary even decreases by one).

A trivial example explains it. Suppose there are two words in a corpus: id, which occurs once, and if, which occurs twice. After the initial splitting into characters, the vocabulary is V0 = {i, f, d}. After the first merge, (i, f) -> if, the vocabulary size remains the same, V1 = {i, d, if}, since f occurs only in if. After the second merge, (i, d) -> id, the vocabulary decreases to V2 = {if, id}, since both i and d occur only in id.
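A toy BPE merge loop in the spirit of [3] reproduces this example; this is only a sketch, not the implementation actually used:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency. `vocab` maps
    a space-separated symbol sequence to its count, e.g. {'i f': 2, 'i d': 1}."""
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs

def merge_pair(pair, vocab):
    """Merge the given symbol pair in every word of the vocabulary."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): count
            for word, count in vocab.items()}

vocab = {'i f': 2, 'i d': 1}          # the toy corpus: 'if' twice, 'id' once
for _ in range(2):
    best = get_pair_stats(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print(best, vocab)
# ('i', 'f') {'if': 2, 'i d': 1}  -> symbol vocabulary {i, d, if}, still size 3
# ('i', 'd') {'if': 2, 'id': 1}   -> symbol vocabulary {if, id}, size 2
```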

Split words representation

After a word is split into subwords, it needs to be represented in some way. Subwords cannot simply be separated with a space because they are semantically different from separate words. Moreover, the information that the subwords belong to one word would be lost, and it would not be possible to restore the original data. Therefore the representation should be unambiguous (meaning there should be only one way to convert it back to the original word) and concise in terms of the number of tokens needed.

The following two methods are proposed:

Use of subword separators

The general idea is to insert a separator before every subword. The type of separator depends on how the subword following it was connected to the preceding subword. Obviously, the separator after the last subword is not needed. The separator before the first subword can be omitted as well, unless there are leading underscore(s). If some subwords are capitalized/upper-cased, the corresponding placeholders are inserted.

Below it is shown how the word test_BestPlay|er|s_mod|if|ied is represented. | shows where the word would be split by same-case splitting methods.

test <us_sep> <cap> best <cc_sep> play <sc_sep> er <sc_sep> s <us_sep> mod <sc_sep> if <sc_sep> ied
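A minimal sketch that serializes an already-split word into this representation. The input format (each subword annotated with the separator that connects it to the previous subword and an optional case placeholder) is an assumption, and, following the example above, a subword introduced by <cc_sep> carries no explicit <cap> token:

```python
def to_separator_representation(subwords):
    """`subwords` is a list of (text, separator, case_token) triples:
    `separator` is None for the first subword or one of '<us_sep>',
    '<cc_sep>', '<sc_sep>'; `case_token` is None or e.g. '<cap>'."""
    tokens = []
    for text, separator, case_token in subwords:
        if separator:
            tokens.append(separator)
        if case_token:
            tokens.append(case_token)
        tokens.append(text)
    return ' '.join(tokens)

# test_BestPlay|er|s_mod|if|ied:
split = [('test', None, None),
         ('best', '<us_sep>', '<cap>'),   # '_Best': underscore + capitalized
         ('play', '<cc_sep>', None),      # camelCase boundary
         ('er', '<sc_sep>', None),
         ('s', '<sc_sep>', None),
         ('mod', '<us_sep>', None),
         ('if', '<sc_sep>', None),
         ('ied', '<sc_sep>', None)]

assert to_separator_representation(split) == (
    'test <us_sep> <cap> best <cc_sep> play <sc_sep> er <sc_sep> s '
    '<us_sep> mod <sc_sep> if <sc_sep> ied')
```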
Use of subword boundaries

The general idea of this approach is not to put separators between subwords that were separated by same-case splitting methods, while everything else remains the same. However, for this to work, word boundaries should be added at the beginning and at the end. The end boundary is always the same, whereas for the beginning boundary the separator is used.

For the same example test_BestPlay|er|s_mod|if|ied the representation would be the following:

<cc> test <us> <cap> best <cc> play er s <us> mod if ied </cc>

For more examples of how different words are represented see: Appendix B. Subword Representation Examples

Below, for different vocabulary sizes, you can see the difference in the number of tokens in the corpus when using the two representations mentioned above. Graph TBA

Later we will discuss the difference between representations from the perspective of language model training.

References

[1] Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories. IEEE Press, 207–216.

[2] Philip Gage. 1994. A New Algorithm for Data Compression. C Users Journal, 12(2):23–38.

[3] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909.
