Skip to content

4.4 Fast replacement in strings(en)

Claude Roux edited this page Oct 19, 2022 · 1 revision

Ultra-fast replacement in a string of characters

Version française

When looking for a piece of code to replace all the substrings in a string, we often find the following algorithm:

//"sub" in "s" is replaced by "rep".

string s_replacestring(string& s, string& sub, string& rep) {

    long subsz = sub.size();
    long strsz = s.size();
    long from = 0;

    long foundHere;
    while ((foundHere = s.find(sub, from)) != string::npos) {
        if (foundHere != from)
            neo += s.substr(from, foundHere - from);
        neo += rep;
        from = foundHere + subsz;
    }

    if (from < rsz)
        neo += s.substr(from, strsz - from);

    return neo;
}

Roughly speaking, it is a question of detecting in a string all the occurrences of "sub" in order to build a new string where "sub" is replaced by "rep".

It works very well but it's not very fast. The "find", especially of the type "std::string", is far from optimal.

AVX

There is another way to speed up this piece of code... AVX instructions. Remember that these instructions allow 128-bit registers to be manipulated in a single cycle. There are more recent versions such as AVX2, which allows the manipulation of 256-bit registers. AVX512, the latest version, even allows the use of 512-bit registers....

256 bits, how many characters is that?

A character is encoded on 8 bits, which makes: 32 characters per 256-bit block (8x32= 256).

The idea here is to go through our string of characters in blocks of 32 characters at a time to determine if our "sub" is present in a single operation.

If the sub is a single character...

Of course, this is not necessarily the most frequent case (although), but it will serve as a foundation for the rest of our improvements...

The Algo

The algorithm is actually very simple. We read our string by block of 32 characters that we transform each time into a 256-bit number. Then, we check in one single cycle if the character we are looking for is present in this block of 32 characters. If it is not present, 32 more characters are read. If it is present, the replacement is carried out.

//" c" is the character we're looking for in "src" 

char c = src[0];

//A 256-bit number is built, each byte is equal to "c".
__m256i firstchar = _mm256_set1_epi8(c);

//We then scan our string by block of 32 characters
for (; (i + 31) < lensrc; i += 32) {
    //"s" points to the current block
    s=USTR(src)+i;
    
    //We then transform these 32 bytes into a 256-bit number
    __m256i current_bytes = _mm256_loadu_si256((const __m256i *)s);
    //We check if "current_byte" contains "c" somewhere
    current_bytes = _mm256_cmpeq_epi8(firstchar, current_bytes);
    //If current_byte is different from 0, it means that "c" is present somewhere
    if (_mm256_movemask_epi8(current_bytes)) {
       q = _mm256_movemask_epi8(current_bytes);
       if (q) {
          //The last bit on the right returns the position where this first match occurs
	  j = _bit_scan_forward(q);
        for (; j < 32; j++) {
            if (*s == c) {
               foundHere = i + j;
               if (from != foundHere)
                   neo += src.substr(from,foundHere-from);
                neo += replace;
               from = foundHere+lensearch;
            }
            ++s;
        }
    }
}

As we can see on this piece of code, the algorithm is not very complicated and is finally very similar to the one we presented above. On the other hand, the system very quickly eliminates entire sections of the string that it knows in advance will not contain the desired character.

Okay, one character is cool, but if "sub" is larger

In fact, it is possible to extend this method to two or even four characters. In this case, the constraint will be much stronger to detect the locations where the substring is potentially present, which would allow us to eliminate even more quickly the candidates for change.

Two characters

For two characters, we simply combine them into a 16-bit integer....

 int16_t cc = search[1];
cc <<<== 8;
cc |= search[0];
// THE TRICK, we create our FILTER on a 16-bit block....

__m256i firstchar = _mm256_set1_epi16(cc);
…

Then, for the comparison, we use "_mm256_cmpeq_epi16" instead of "_mm256_cmpeq_epi8", in order to perform a 16-bit block comparison instead of 8.

Four characters and more

There is still room for further constraint. Finding a 4-character sequence is even more specific than a sequence of two. We then use a 32-bit integer....

  int32_t cc = search[3];
  for (shift = 2; shift >= 0; shift--) {
         cc <<<== 8;
          cc |= search[shift];
   }

// THE TRICK, we create our FILTER on a 32-bit block....

__m256i firstchar = _mm256_set1_epi32(cc);
…

The comparison is then carried out with "_mm256_cmpeq_epi32" instead of "_mm256_cmpeq_epi16".

Except.....

Except that if you try to do a bit-to-bit comparison, the sequences must be perfectly aligned for it to work.

Therefore, a simple test is no longer enough. For 2 characters, 2 tests are required and for the 4 characters, 4 tests are required, each time with a "shift"....

for (; (i + 31) < lensrc; i += 32) {
    shift = 0;
    //We will test 4 times, with each time a shift of 1 to the left
    for (shift = 0; shift < 4; shift ++) {
        //Our shift
        s=USTR(src) + i + shift;
        
        // Reading 32 characters at a time
        current_bytes = _mm256_loadu_si256((const __m256i *)s);
        
        //We check if our 4-character sequence appears in this segment
        current_bytes = _mm256_cmpeq_epi32(firstchar, current_bytes);
        
        //If yes, we stop 
        q = _mm256_movemask_epi8(current_bytes);
        if (q)
            break;
    }
    
    if (shift != 4) {
        //And we do our loop, up to 28... (28+4 == 32)
        //For two characters, we would loop up to 30...
        j = _bit_scan_forward(q); //we look for the first match location
        for (; j < 29; j++) {
            if (*s == c) {
                if (charcomp(s,USTR(search),lensearch)) {
                    //We take the "shift" into account
                    foundHere = i + j + shift;
                    if (from != foundHere)
                        neo += src.substr(from,foundHere-from);
                    neo += replace;
                    from = foundHere+lensearch;
                }
            }
            ++s;
        }
    }
}

Conclusion

Note that this code is only efficient if the string in which the replacements are made is large. But in this case, this method can be two to three times faster than a conventional method.

This code is part of Tamgu and is implemented in "https://github.com/naver/tamgu/blob/master/src/conversion.cxx".

Clone this wiki locally