Regular expressions

IS4 edited this page Jul 23, 2024 · 8 revisions

Regular expressions recognize strings that match a given pattern. PawnPlus exposes the regex features of C++ through a small set of functions that support matching (finding a pattern in a string), extraction (identifying submatches in a pattern), and replacement (substituting a matched portion of a string with a new substring).

Regex options

All regex functions accept a regex_options bit-field. The regex_options enum combines the C++ types syntax_option_type and match_flag_type, so it controls both the grammar and the way patterns are matched. PawnPlus changes or adds the following flags:

  • regex_default uses the modified ECMAScript regular expression grammar. This grammar is used by default if no other is specified.
  • regex_lex uses an alternative grammar provided by lex (through a fork), which in turn implements Lua patterns. For differences, see below.
  • regex_percents changes the character used in escape sequences from \ to % (in both pattern and replacement).
  • regex_start_at_pos affects matching when pos is greater than 0, i.e. when matching starts in the middle of the string. By default, the string is still assumed to start at offset 0, so ^ will not match at pos; with this option, any characters before pos are effectively ignored.
  • regex_cached can increase the efficiency of matching for more complex patterns. Usually, every function call has to construct the regex object from the pattern, but with this option, the object is stored in a map and can be retrieved later when the same pattern and options are used.
  • regex_cached_addr uses the memory address of the pattern instead of its value, so the whole pattern does not have to be compared to retrieve the cached regex object, which saves time. However, this option should be used only when the pattern is immutable in its location, such as when stored in a static const variable or a long-lived ConstString: value. If the value of the pattern changes after caching, the regex may not be regenerated to reflect the changes.

Caching

All syntax options support caching the intermediate representation of the pattern, via regex_cached (by value) and regex_cached_addr (by address).

Do not use regex caching for highly variable patterns whose value is different each time (or, in the case of regex_cached_addr, for patterns constructed on the stack or stored in a temporary String: value): the cached regex instance will likely never be reused, yet it remains in memory. The examples below show when and how to use caching:

// Pattern comes from a literal string
str_match(s, "pattern", .options = regex_cached); // imperfect - statically-known string; no need to cache by value
str_match(s, "pattern", .options = regex_cached_addr); // optimal - address is constant (in the AMX instance)
str_match(s, str_new("pattern"), .options = regex_cached_addr); // bad - temporary string; address is different each time so no cache hits
str_match(s, str_new_static("pattern"), .options = regex_cached_addr); // still bad - good use of str_new_static, but it is still a temporary string
str_match(s, str_new_static("pattern"), .options = regex_cached); // okay in principle - value is always the same (but why construct the string at all?)
str_match(s, str_format("pattern %d", random(100000)), .options = regex_cached); // bad - cache is almost never hit

// Pattern comes from the stack
new p[] = "pattern";
str_match(s, p, .options = regex_cached_addr); // very bad - string is on the stack, so the address may be different each time and could conflict with another one
str_match(s, p, .options = regex_cached); // okay in principle - value is always the same (but do not copy a string literal to a variable when you don't need to modify it!)
format(p, sizeof(p), "pattern %d", random(100000));
str_match(s, p, .options = regex_cached); // bad - different pattern each time again
str_match(s, p); // okay - just don't use caching when the pattern is different

Combine caching with regex_optimize to speed up matching when the cached instance is reused.
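The by-value caching described above can be pictured as a map from the pattern text and its options to a compiled regex. The following standalone C++ sketch illustrates the idea only; it is not PawnPlus code, and the RegexCache class and its members are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <utility>

// Hypothetical by-value cache: compiled regexes keyed by (pattern text, syntax flags).
class RegexCache {
public:
    // Returns the compiled regex for the pattern, compiling it only on the first request.
    const std::regex& get(const std::string& pattern,
                          std::regex::flag_type flags = std::regex::ECMAScript) {
        const Key key{pattern, static_cast<unsigned>(flags)};
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            // Cache miss: compile once (with the optimize hint) and keep the result.
            it = cache_.emplace(key, std::regex(pattern, flags | std::regex::optimize)).first;
        }
        return it->second;
    }

    // Number of compiled regexes currently held in memory.
    std::size_t size() const { return cache_.size(); }

private:
    using Key = std::pair<std::string, unsigned>;
    std::map<Key, std::regex> cache_;
};
```

Requesting "a+b" twice returns the same compiled object, while every distinct pattern adds a new entry that is never released. This is why a pattern that changes on each call only grows the cache without ever hitting it.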

Grammar

The grammar can be changed through the options. All grammars support the custom locale extension.

  • regex_default uses the modified ECMAScript grammar.
  • regex_basic uses the basic POSIX grammar.
  • regex_extended uses the extended POSIX grammar.
  • regex_awk uses the grammar used by awk.
  • regex_grep uses the grammar used by grep, effectively the same as regex_basic with the addition of \n as an alternation separator in addition to |.
  • regex_egrep uses the grammar used by grep -E, effectively the same as regex_extended with the addition of \n as an alternation separator.
  • regex_lex uses the grammar and matching used by Lua patterns (see the Lex section below).

Note that the replacement syntax, used by str_replace and similar functions, is plugin-specific and unchanged by these options.

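To show how much the choice of grammar matters, here is a standalone C++ sketch that compiles the same pattern under three of the standard C++ grammars. It is not PawnPlus code; the helper name found is hypothetical, and it assumes the PawnPlus grammar options map onto the C++ flags of the same name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern`, compiled with `grammar`, occurs in `text`.
bool found(const std::string& text, const std::string& pattern,
           std::regex::flag_type grammar) {
    return std::regex_search(text, std::regex(pattern, grammar));
}
```

With the pattern "a|b", the ECMAScript and extended POSIX grammars treat | as alternation, so the text "b" matches. The basic POSIX grammar treats | as an ordinary character, so only the literal text "a|b" matches.
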
Locale

Any regular expression may be prepended by a locale/encoding identifier in the form of (?$identifier). This specifier needs to be present at the beginning of the pattern, and identifier must be a valid encoding identifier, otherwise the pattern will be treated as invalid. If unspecified, the default locale is used (set by pp_locale).

The default syntax supports the following character classes, with interpretation depending on the configured locale:

Expression Meaning
\d or [:digit:] Any digit.
\w Any letter or digit or _.
\s or [:space:] Any whitespace character.
[:alpha:] Any letter.
[:graph:] Any visible character (alphanumeric or punctuation).
[:alnum:] Any letter or digit.
[:blank:] Any space character used for word separation (such as a space or \t).
[:cntrl:] Any (non-printable) control character.
[:lower:] Any lowercase character.
[:upper:] Any uppercase character.
[:print:] Any printable character (opposite of [:cntrl:]).
[:punct:] Any visible character that is not alphanumeric.
[:xdigit:] Any hexadecimal digit.

Additional classes may be defined by the current locale. Note that the […]-enclosed forms need to be in an explicit character set (e.g. [[:alpha:]] or [^[:alpha:]]) to be recognized.
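The bracket rule can be checked against the C++ regex library directly. The sketch below is standalone C++, not PawnPlus code; find_first is a hypothetical helper, and the default C++ locale is assumed rather than one selected through (?$identifier) or pp_locale.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: the first substring of `text` matching `pattern`, or "" if none.
std::string find_first(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

For the text "abc 123", both "[[:digit:]]+" and "\d+" find "123", and "[^[:alpha:]]+" finds " 123". Written without the outer brackets, "[:alpha:]" is parsed as a plain set of the characters ':', 'a', 'l', 'p' and 'h', so it finds nothing in "xyz".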

Lex

The Lua-style pattern matching has several notable differences, both from other grammars and from Lua as well:

  • Unlike in Lua, \ is used as the escape character by default, but it can be changed to % via regex_percents.
  • \b and \f have the same meaning as %b and %f in Lua (balanced sequence and frontier).
  • Alternation and look-aheads/look-behinds are not supported. ^ and $ are special only at the beginning and at the end of the pattern, respectively.
  • Character classes use the same comparisons as the other grammars (so \w behaves as it does elsewhere), but \a, \g, \c, \p, \l, \u, and \x are available as well (named after the first letter of the corresponding class). Any other character after \ matches that character exactly, bypassing regex_icase and regex_collate.
  • regex_not_bow and regex_not_eow affect the \f pattern: if used, the frontier cannot be matched at the beginning/end of the string.
  • regex_optimize and regex_any have no effect, because there is no intermediate representation or non-determinism.

Matching

Matching is the simplest application of regular expressions. The function str_match can be used to test if a string contains the provided pattern.

assert str_match(@("apple banana orange"), \"\b(apple|banana)\b");

The function does not require the whole string to match the pattern; it looks for the first occurrence. ^ and $ can be used to anchor the pattern at the beginning and at the end of the string. The pos parameter can be used to specify the starting offset of matching and to obtain the ending offset of the match.
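The interplay between pos, the ^ anchor and regex_start_at_pos can be imitated with the C++ regex library. This is a standalone sketch, not PawnPlus code: match_from is a hypothetical helper, and using match_prev_avail to model the default behaviour is an assumption about how such an option could be realized.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: search `text` from offset `pos` and return the offset just past
// the match, or std::string::npos when nothing matches.
std::size_t match_from(const std::string& text, const std::string& pattern,
                       std::size_t pos, bool start_at_pos) {
    if (pos > text.size()) return std::string::npos;
    const std::regex re(pattern);
    auto flags = std::regex_constants::match_default;
    if (!start_at_pos && pos > 0) {
        // Assumption: the default mode tells the engine that text precedes `pos`,
        // so ^ cannot match there.
        flags |= std::regex_constants::match_prev_avail;
    }
    std::smatch m;
    const auto start = text.begin() + static_cast<std::ptrdiff_t>(pos);
    if (!std::regex_search(start, text.end(), m, re, flags)) return std::string::npos;
    // position() is relative to the start of the searched range, so add `pos` back.
    return pos + static_cast<std::size_t>(m.position(0) + m.length(0));
}
```

For the text "apple banana", searching for "^banana" from offset 6 fails in the default mode, because offset 6 is not the start of the string. With start_at_pos set it succeeds and reports 12 as the end offset.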

Extraction

str_extract looks for a pattern in a string and constructs a new list holding all the groups captured from the string, or List:0 in case of no match. Because pos is updated to the end of each match, this function can also be used to iterate over all occurrences of the pattern:

new pos = 0;
new String:str = @("apple banana orange");
new List:l;
while((l = str_extract(str, \"\b[[:alpha:]]+\b", .pos=pos)))
{
    print_s(list_get_str_s(l, 0));
    list_delete(l);
}
//prints all words in the string
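For comparison, the same iteration over all occurrences can be written against the C++ regex library. This is a standalone sketch rather than PawnPlus code; extract_all is a hypothetical helper that collects only the whole match of each occurrence.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the whole match of every occurrence of `pattern` in `text`.
std::vector<std::string> extract_all(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> words;
    // The iterator resumes each search where the previous match ended.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        words.push_back(it->str(0));
    }
    return words;
}
```

Called with "apple banana orange" and the pattern \b[[:alpha:]]+\b, it returns the three words in order.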

Replacement

All occurrences of a pattern in a string can be replaced with another string, a list, or a function.

str_replace can be used to specify a single replacement string. If you use capturing groups, you can refer to them via $1, $2 (or \1, \2) etc. in the replacement string.

assert str_replace(@("apple banana orange"), \"\b([[:alpha:]]+)\b", "word($1)") == @("word(apple) word(banana) word(orange)");
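The $1 reference works the same way in the default replacement format of the C++ regex library, as this standalone sketch shows. It is not PawnPlus code, and wrap_words is a hypothetical helper. Only the $1 form is used here, since \1 is not part of the default C++ format syntax.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: wrap every word of `text` in "word(...)" using a group reference.
std::string wrap_words(const std::string& text) {
    static const std::regex word(R"(\b([[:alpha:]]+)\b)");
    // $1 expands to the text captured by the first group of each match.
    return std::regex_replace(text, word, "word($1)");
}
```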

Sometimes, it is useful to replace multiple things at once, but be able to select the specific replacement based on the pattern. For this usage, it is possible to use str_replace_list:

new List:l = list_new_args_str("1", "0");
print_s(str_replace_list(@("apple"), "(..)|(.)", l)); //110
list_delete(l);

For each match, the function finds the first range of successfully matched groups and uses the position of its first group (counting from 0) as the index into the list.

The individual replacement strings can reference the match:

new List:l = list_new_args_str("($1)", "[$1]");
print_s(str_replace_list(@("(abc)[def]"), \"\[(.*?)\]|\((.*?)\)", l)); //[abc](def)
list_delete(l);

However, if a single alternative contains more than one group, the function cannot determine this from the pattern, so you have to add an empty string to the list in place of each additional group:

new List:l = list_new_args_str("($2:$1)", "", "[$2:$1]");
print_s(str_replace_list(@("(a:b)[c:d]"), \"\[(.*?):(.*?)\]|\((.*?):(.*?)\)", l)); //[b:a](d:c)
list_delete(l);

If the second alternative (\((.*?):(.*?)\)) is matched, the function determines that the replacement index is 2 (because the two preceding groups did not match), so the list has to be padded with "" at index 1 (which will never be selected on its own).
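The index selection can be reproduced with a short standalone C++ sketch. It is not the PawnPlus implementation: replace_by_list and expand are hypothetical helpers, and the sketch assumes, as the outputs above suggest, that $1 in a replacement refers to the first matched group rather than to group 1 of the whole pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Expand $1..$9 in `fmt`, numbering groups relative to `first` (the first matched group).
static std::string expand(const std::smatch& m, std::size_t first, const std::string& fmt) {
    std::string out;
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        if (fmt[i] == '$' && i + 1 < fmt.size() && fmt[i + 1] >= '1' && fmt[i + 1] <= '9') {
            const std::size_t group = first + static_cast<std::size_t>(fmt[i + 1] - '1');
            if (group < m.size()) out += m[group].str();
            ++i;  // skip the digit
        } else {
            out += fmt[i];
        }
    }
    return out;
}

// Replace each match with the list entry selected by the first group that matched.
std::string replace_by_list(const std::string& text, const std::string& pattern,
                            const std::vector<std::string>& replacements) {
    const std::regex re(pattern);
    std::string out;
    auto last = text.begin();
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // copy the text between matches
        std::size_t first = 1;
        while (first < m.size() && !m[first].matched) ++first;
        const std::size_t index = first - 1;  // list index chosen for this match
        if (first < m.size() && index < replacements.size())
            out += expand(m, first, replacements[index]);
        else
            out += m[0].str();  // no usable replacement: keep the match unchanged
        last = m[0].second;
    }
    out.append(last, text.end());
    return out;
}
```

In the two-group example, "(a:b)" is matched through groups 3 and 4, so the first matched group is 3 and the chosen index is 2. That is why the list needs the "" placeholder at index 1.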

The most powerful type of replacement uses a function to generate the replacement string. str_replace_func accepts a public function that will be called for every occurrence of the pattern inside the string, passing the values of all groups to it:

forward String:regex_func(gr1[], gr1_size);
public String:regex_func(gr1[], gr1_size)
{
    new tmp = gr1[1];
    gr1[1] = gr1[0], gr1[0] = tmp;
    return str_new(gr1);
}

main()
{
    print_s(str_replace_func(@("abcd"), "..", pawn_nameof(regex_func))); //badc
}

The function receives every group from the pattern together with its length; the first group is the whole match. It must return a valid dynamic string. More complex operations can also be performed on the string, such as capitalizing each word:

forward String:regex_to_upper(gr1[], gr1_size);
public String:regex_to_upper(gr1[], gr1_size)
{
    return str_to_upper(str_new_arr(gr1, 1)) + str_new(gr1[1]);
}

main()
{
    print_s(str_replace_func(@("apple banana orange"), "\\w+", pawn_nameof(regex_to_upper))); //Apple Banana Orange
}
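
The callback style of replacement can also be sketched in standalone C++ on top of the regex library. This is not PawnPlus code: replace_with, swap_pair and capitalize are hypothetical names, and the callback receives the whole match object instead of separate group arrays.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: replace every match with the string returned by `callback`.
std::string replace_with(const std::string& text, const std::string& pattern,
                         const std::function<std::string(const std::smatch&)>& callback) {
    const std::regex re(pattern);
    std::string out;
    auto last = text.begin();
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        out.append(last, (*it)[0].first);  // text between the previous match and this one
        out += callback(*it);              // replacement generated for this match
        last = (*it)[0].second;
    }
    out.append(last, text.end());
    return out;
}

// Swap the first two characters of the match.
std::string swap_pair(const std::smatch& m) {
    std::string s = m.str(0);
    if (s.size() >= 2) std::swap(s[0], s[1]);
    return s;
}

// Uppercase the first character of the match.
std::string capitalize(const std::smatch& m) {
    std::string s = m.str(0);
    if (!s.empty()) s[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[0])));
    return s;
}
```

Calling replace_with on "abcd" with the pattern ".." and swap_pair yields "badc". Using "\w+" with capitalize on "apple banana orange" yields "Apple Banana Orange", mirroring the two Pawn examples.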