Regular expressions
Regular expressions can be used to recognize strings that match a certain pattern. PawnPlus exposes the regex features of C++ in a simple yet useful set of functions that support matching (finding a pattern in a string), extraction (identifying submatches in a pattern), and replacement (substituting a matched portion of a string with a new substring).
All regex functions accept a `regex_options` bit-field. The `regex_options` enum is based on a combination of the C++ types `syntax_option_type` and `match_flag_type`, which makes it possible to change the grammar or the way patterns are matched. The custom additions to those flags are:

- `regex_default` uses the modified ECMAScript regular expression grammar. This grammar is used by default if no other is specified.
- `regex_lex` uses an alternative grammar provided by lex (through a fork), which in turn implements Lua patterns. For differences, see below.
- `regex_percents` changes the character used in escape sequences from `\` to `%` (in both pattern and replacement).
- `regex_start_at_pos` affects situations when `pos` is greater than 0, i.e. when matching starts in the middle of the string. By default, the string is assumed to start at 0, so `^` will not match at `pos`; with this option, any characters before `pos` are effectively ignored.
- `regex_cached` can increase the efficiency of matching for more complex patterns. Usually, every function call has to construct the regex object from the pattern, but with this option, the object is stored in a map and can be retrieved later when the same pattern and options are used.
- `regex_cached_addr` uses the memory address of the pattern instead of its value, so the whole pattern does not have to be compared to retrieve the cached regex object, which saves time. However, this option should be used only when the pattern is immutable in its location, such as when stored in a `static const` variable or a long-lived `ConstString:` value. If the value of the pattern changes after caching, the regex may not be regenerated to reflect the changes.

All syntax options support caching the intermediate representation of the pattern, via `regex_cached` (by value) and `regex_cached_addr` (by address).

Do not use regex caching for highly variable patterns whose value is different each time (or, in the case of `regex_cached_addr`, for patterns constructed on the stack or stored in a temporary `String:` value), as the cached regex instance will likely not be reused and will remain in memory. The examples below show when and how to use caching:

```pawn
// Pattern comes from a literal string
str_match(s, "pattern", .options = regex_cached); // imperfect - statically-known string; no need to cache by value
str_match(s, "pattern", .options = regex_cached_addr); // optimal - address is constant (in the AMX instance)
str_match(s, str_new("pattern"), .options = regex_cached_addr); // bad - temporary string; address is different each time so no cache hits
str_match(s, str_new_static("pattern"), .options = regex_cached_addr); // still bad - good use of str_new_static, but it is still a temporary string
str_match(s, str_new_static("pattern"), .options = regex_cached); // okay in principle - value is always the same (but why construct the string at all?)
str_match(s, str_format("pattern %d", random(100000)), .options = regex_cached); // bad - cache is almost never hit

// Pattern comes from the stack
new p[] = "pattern";
str_match(s, p, .options = regex_cached_addr); // very bad - string is on the stack, so the address may be different each time and could conflict with another one
str_match(s, p, .options = regex_cached); // okay in principle - value is always the same (but do not copy a string literal to a variable when you don't need to modify it!)
format(p, sizeof(p), "pattern %d", random(100000));
str_match(s, p, .options = regex_cached); // bad - different pattern each time again
str_match(s, p); // okay - just don't use caching when the pattern is different
```

Efficient regex caching should be used together with `regex_optimize` to improve speed when the instance is reused.
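
As a minimal sketch of this recommendation, the pattern below is kept in a `static const` array so that its value and address stay fixed, and it is cached by address together with `regex_optimize`. The name `word_pattern` and the sample string are hypothetical, and the include line assumes the PawnPlus include file is available under its usual name alongside the normal server includes.

```pawn
#include <PawnPlus>

// Hypothetical pattern; static const keeps it immutable and at a fixed address,
// which is what regex_cached_addr relies on.
static const word_pattern[] = "[[:alpha:]]+";

main()
{
    new String:s = @("apple banana orange");

    // The regex object is built on the first call and looked up by the
    // pattern's address on later calls.
    if(str_match(s, word_pattern, .options = regex_cached_addr | regex_optimize))
    {
        print_s(@("found a word"));
    }
}
```

Caching by address suits this case because the pattern never changes; for a pattern built at run time, leaving caching off is the safer choice.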
The grammar can be changed through the options. All grammars support the custom locale extension.

- `regex_default` uses the modified ECMAScript grammar.
- `regex_basic` uses the basic POSIX grammar.
- `regex_extended` uses the extended POSIX grammar.
- `regex_awk` uses the grammar used by `awk`.
- `regex_grep` uses the grammar used by `grep`, effectively the same as `regex_basic` with the addition of `\n` as an alternation separator.
- `regex_egrep` uses the grammar used by `grep -E`, effectively the same as `regex_extended` with the addition of `\n` as an alternation separator in addition to `|`.
- `regex_lex` uses the grammar and matching used by Lua patterns. More below.

Note that the replacement syntax, used by `str_replace` and similar functions, is plugin-specific and unchanged by these options.

Any regular expression may be prefixed with a locale/encoding identifier in the form of `(?$identifier)`. This specifier needs to be present at the beginning of the pattern, and `identifier` must be a valid encoding identifier, otherwise the pattern will be treated as invalid. If unspecified, the default locale is used (set by `pp_locale`).
The default syntax supports the following character classes, with interpretation depending on the configured locale:

| Expression | Meaning |
|---|---|
| `\d` or `[:digit:]` | Any digit. |
| `\w` | Any letter or digit or `_`. |
| `\s` or `[:space:]` | Any whitespace character. |
| `[:alpha:]` | Any letter. |
| `[:graph:]` | Any visible character (alphanumeric or punctuation). |
| `[:alnum:]` | Any letter or digit. |
| `[:blank:]` | Any space character used for word separation (such as a space or `\t`). |
| `[:cntrl:]` | Any (non-printable) control character. |
| `[:lower:]` | Any lowercase character. |
| `[:upper:]` | Any uppercase character. |
| `[:print:]` | Any printable character (opposite of `[:cntrl:]`). |
| `[:punct:]` | Any visible character that is not alphanumeric. |
| `[:xdigit:]` | Any hexadecimal digit. |

Additional classes may be defined by the current locale. Note that the `[…]`-enclosed forms need to be in an explicit character set (e.g. `[[:alpha:]]` or `[^[:alpha:]]`) to be recognized.
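
The difference between the nested and the bare form can be sketched as follows. This is an illustration rather than a tested program: the sample string is hypothetical, and the behavior of the bare form is an assumption based on how C++ regex grammars usually parse a single pair of brackets.

```pawn
#include <PawnPlus>

main()
{
    new String:s = @("2024");

    // Recognized: the class sits inside an explicit character set.
    if(str_match(s, "[[:digit:]]+"))
    {
        print_s(@("[[:digit:]]+ matched"));
    }

    // Assumption: without the outer brackets this is expected to be read as an
    // ordinary set of the characters ':', 'd', 'i', 'g' and 't',
    // none of which occur in "2024".
    if(!str_match(s, "[:digit:]+"))
    {
        print_s(@("[:digit:]+ did not match"));
    }
}
```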
The Lua-style pattern matching has several notable differences, both from the other grammars and from Lua itself:

- Unlike in Lua, `\` is used as the escape character by default, but it can be changed to `%` via `regex_percents`.
- `\b` and `\f` have the same meaning as in Lua (balanced sequence and frontier).
- Alternatives and look-aheads/look-behinds cannot be used. `^` and `$` are not special anywhere in the pattern other than at its beginning/end.
- Character classes use the same comparisons as the other grammars (so `\w` behaves the same as in the other grammars), but `\a`, `\g`, `\c`, `\p`, `\l`, `\u`, and `\x` are available as well (named after the first letter of the corresponding class names). Using any other character after `\` matches that character exactly, bypassing `regex_icase` and `regex_collate`.
- `regex_not_bow` and `regex_not_eow` affect the `\f` pattern: if used, the frontier cannot be matched at the beginning/end of the string.
- `regex_optimize` and `regex_any` have no effect, because there is no intermediate representation or nondeterminism.

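
A short sketch of the two escape styles described in the list above, using only `str_match` with the `regex_lex` and `regex_percents` options. The sample string is hypothetical and the include line assumes the usual PawnPlus setup.

```pawn
#include <PawnPlus>

main()
{
    new String:s = @("order 42 shipped");

    // Lua-style grammar with the default escape character: \d is the digit class.
    // The \"..." form is a raw string, so the backslash reaches the pattern as-is.
    if(str_match(s, \"\d+", .options = regex_lex))
    {
        print_s(@("digits found (backslash escapes)"));
    }

    // The same pattern written with % as the escape character, as in Lua itself.
    if(str_match(s, "%d+", .options = regex_lex | regex_percents))
    {
        print_s(@("digits found (percent escapes)"));
    }
}
```

Using `regex_percents` can make patterns copied from Lua code usable without rewriting their escapes.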
Matching is the simplest application of regular expressions. The function `str_match` can be used to test if a string contains the provided pattern.

```pawn
assert str_match(@("apple banana orange"), \"\b(apple|banana)\b");
```

The function doesn't match the whole string against the pattern, only the first occurrence. `^` and `$` can be used to anchor the pattern at the beginning and at the end of the string. The `pos` parameter can be used to specify the starting offset of matching and to obtain the ending offset of the match.
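
The role of `pos`, and its interaction with `regex_start_at_pos`, might look like the sketch below. It follows the descriptions on this page and has not been run; the sample string and offsets are hypothetical, and the include line assumes the usual PawnPlus setup.

```pawn
#include <PawnPlus>

main()
{
    new String:s = @("apple banana orange");

    // pos is passed by name; after a successful match it holds the ending offset.
    new pos = 0;
    if(str_match(s, "banana", .pos = pos))
    {
        print_s(str_format("match ends at offset %d", pos));
    }

    // Start at offset 6, the 'b' of "banana". By default ^ still refers to
    // offset 0 of the string, so this is not expected to match.
    pos = 6;
    if(!str_match(s, "^banana", .pos = pos))
    {
        print_s(@("^ did not match at pos"));
    }

    // With regex_start_at_pos, characters before pos are ignored,
    // so ^ is expected to match at offset 6.
    pos = 6;
    if(str_match(s, "^banana", .pos = pos, .options = regex_start_at_pos))
    {
        print_s(@("^ matched at pos"));
    }
}
```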
`str_extract` looks for a pattern in a string and constructs a new list holding all the groups captured from the string, or returns `List:0` in case of no match. This function can also be used to iterate over all occurrences of the pattern:

```pawn
new pos = 0;
new String:str = @("apple banana orange");
new List:l;
while((l = str_extract(str, \"\b[[:alpha:]]+\b", .pos=pos)))
{
    print_s(list_get_str_s(l, 0));
    list_delete(l);
}
// prints all words in the string
```

All occurrences of a pattern in a string can be replaced with another string, a list, or a function.

`str_replace` can be used to specify a single replacement string. If you use capturing groups, you can refer to them via `$1`, `$2` (or `\1`, `\2`) etc. in the replacement string.

```pawn
assert str_replace(@("apple banana orange"), \"\b([[:alpha:]]+)\b", "word($1)") == @("word(apple) word(banana) word(orange)");
```

Sometimes it is useful to replace multiple things at once while selecting the specific replacement based on which part of the pattern matched. For this purpose, use `str_replace_list`:

```pawn
new List:l = list_new_args_str("1", "0");
print_s(str_replace_list(@("apple"), "(..)|(.)", l)); //110
list_delete(l);
```

The function finds the first range of successfully matched groups and gets the list index corresponding to the first group.

The individual replacement strings can reference the match:

```pawn
new List:l = list_new_args_str("($1)", "[$1]");
print_s(str_replace_list(@("(abc)[def]"), \"\[(.*?)\]|\((.*?)\)", l)); //[abc](def)
list_delete(l);
```

However, if a single alternative contains more than one group, the function cannot determine this from the pattern, so you have to add an empty string in the list as a placeholder for each extra group:

```pawn
new List:l = list_new_args_str("($2:$1)", "", "[$2:$1]");
print_s(str_replace_list(@("(a:b)[c:d]"), \"\[(.*?):(.*?)\]|\((.*?):(.*?)\)", l)); //[b:a](d:c)
list_delete(l);
```

If the second alternative (`\((.*?):(.*?)\)`) is encountered, the function determines that the replacement index is 2 (the number of preceding unmatched groups), so the list has to be padded with `""` (which will never be selected on its own).
The most powerful type of replacement uses a function to generate the replacement string. `str_replace_func` accepts a public function that will be called for every occurrence of the pattern inside the string, passing the values of all groups to it:

```pawn
forward String:regex_func(gr1[], gr1_size);
public String:regex_func(gr1[], gr1_size)
{
    new tmp = gr1[1];
    gr1[1] = gr1[0], gr1[0] = tmp;
    return str_new(gr1);
}

main()
{
    print_s(str_replace_func(@("abcd"), "..", pawn_nameof(regex_func))); //badc
}
```

The function is provided with all groups from the pattern, together with their lengths. The first group is the whole match. The function is supposed to return a valid dynamic string. More complex operations can be performed on the string, such as converting to uppercase.

```pawn
forward String:regex_to_upper(gr1[], gr1_size);
public String:regex_to_upper(gr1[], gr1_size)
{
    return str_to_upper(str_new_arr(gr1, 1)) + str_new(gr1[1]);
}

main()
{
    print_s(str_replace_func(@("apple banana orange"), "\\w+", pawn_nameof(regex_to_upper))); //Apple Banana Orange
}
```