-
Notifications
You must be signed in to change notification settings - Fork 13
4.4 Fast replacement in strings(en)
When looking for a piece of code to replace all the substrings in a string, we often find the following algorithm:
//"sub" in "s" is replaced by "rep".
string s_replacestring(string& s, string& sub, string& rep) {
long subsz = sub.size();
long strsz = s.size();
long from = 0;
long foundHere;
while ((foundHere = s.find(sub, from)) != string::npos) {
if (foundHere != from)
neo += s.substr(from, foundHere - from);
neo += rep;
from = foundHere + subsz;
}
if (from < rsz)
neo += s.substr(from, strsz - from);
return neo;
}
Roughly speaking, it is a question of detecting in a string all the occurrences of "sub" in order to build a new string where "sub" is replaced by "rep".
It works very well but it's not very fast. The "find", especially of the type "std::string", is far from optimal.
There is another way to speed up this piece of code... AVX instructions. Remember that these instructions allow 128-bit registers to be manipulated in a single cycle. There are more recent versions such as AVX2, which allows the manipulation of 256-bit registers. AVX512, the latest version, even allows the use of 512-bit registers....
A character is encoded on 8 bits, which makes: 32 characters per 256-bit block (8x32= 256).
The idea here is to go through our string of characters in blocks of 32 characters at a time to determine if our "sub" is present in a single operation.
Of course, this is not necessarily the most frequent case (although), but it will serve as a foundation for the rest of our improvements...
The algorithm is actually very simple. We read our string by block of 32 characters that we transform each time into a 256-bit number. Then, we check in one single cycle if the character we are looking for is present in this block of 32 characters. If it is not present, 32 more characters are read. If it is present, the replacement is carried out.
//" c" is the character we're looking for in "src"
char c = src[0];
//A 256-bit number is built, each byte is equal to "c".
__m256i firstchar = _mm256_set1_epi8(c);
//We then scan our string by block of 32 characters
for (; (i + 31) < lensrc; i += 32) {
//"s" points to the current block
s=USTR(src)+i;
//We then transform these 32 bytes into a 256-bit number
__m256i current_bytes = _mm256_loadu_si256((const __m256i *)s);
//We check if "current_byte" contains "c" somewhere
current_bytes = _mm256_cmpeq_epi8(firstchar, current_bytes);
//If current_byte is different from 0, it means that "c" is present somewhere
if (_mm256_movemask_epi8(current_bytes)) {
q = _mm256_movemask_epi8(current_bytes);
if (q) {
//The last bit on the right returns the position where this first match occurs
j = _bit_scan_forward(q);
for (; j < 32; j++) {
if (*s == c) {
foundHere = i + j;
if (from != foundHere)
neo += src.substr(from,foundHere-from);
neo += replace;
from = foundHere+lensearch;
}
++s;
}
}
}
As we can see on this piece of code, the algorithm is not very complicated and is finally very similar to the one we presented above. On the other hand, the system very quickly eliminates entire sections of the string that it knows in advance will not contain the desired character.
In fact, it is possible to extend this method to two or even four characters. In this case, the constraint will be much stronger to detect the locations where the substring is potentially present, which would allow us to eliminate even more quickly the candidates for change.
For two characters, we simply combine them into a 16-bit integer....
int16_t cc = search[1];
cc <<<== 8;
cc |= search[0];
// THE TRICK, we create our FILTER on a 16-bit block....
__m256i firstchar = _mm256_set1_epi16(cc);
…
Then, for the comparison, we use "_mm256_cmpeq_epi16" instead of "_mm256_cmpeq_epi8", in order to perform a 16-bit block comparison instead of 8.
There is still room for further constraint. Finding a 4-character sequence is even more specific than a sequence of two. We then use a 32-bit integer....
int32_t cc = search[3];
for (shift = 2; shift >= 0; shift--) {
cc <<<== 8;
cc |= search[shift];
}
// THE TRICK, we create our FILTER on a 32-bit block....
__m256i firstchar = _mm256_set1_epi32(cc);
…
The comparison is then carried out with "_mm256_cmpeq_epi32" instead of "_mm256_cmpeq_epi16".
Except that if you try to do a bit-to-bit comparison, the sequences must be perfectly aligned for it to work.
Therefore, a simple test is no longer enough. For 2 characters, 2 tests are required and for the 4 characters, 4 tests are required, each time with a "shift"....
for (; (i + 31) < lensrc; i += 32) {
shift = 0;
//We will test 4 times, with each time a shift of 1 to the left
for (shift = 0; shift < 4; shift ++) {
//Our shift
s=USTR(src) + i + shift;
// Reading 32 characters at a time
current_bytes = _mm256_loadu_si256((const __m256i *)s);
//We check if our 4-character sequence appears in this segment
current_bytes = _mm256_cmpeq_epi32(firstchar, current_bytes);
//If yes, we stop
q = _mm256_movemask_epi8(current_bytes);
if (q)
break;
}
if (shift != 4) {
//And we do our loop, up to 28... (28+4 == 32)
//For two characters, we would loop up to 30...
j = _bit_scan_forward(q); //we look for the first match location
for (; j < 29; j++) {
if (*s == c) {
if (charcomp(s,USTR(search),lensearch)) {
//We take the "shift" into account
foundHere = i + j + shift;
if (from != foundHere)
neo += src.substr(from,foundHere-from);
neo += replace;
from = foundHere+lensearch;
}
}
++s;
}
}
}
Note that this code is only efficient if the string in which the replacements are made is large. But in this case, this method can be two to three times faster than a conventional method.
This code is part of Tamgu and is implemented in "https://github.com/naver/tamgu/blob/master/src/conversion.cxx".