Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQueryのREGEPX_関数のエスケープ処理: ベンチマークつき #60

Open
takegue opened this issue Mar 26, 2023 · 0 comments

Comments

@takegue
Copy link
Owner

takegue commented Mar 26, 2023

BigQueryでは、REGEXP_CONTAINS などの REGEXP_系の関数が、第二引数として正規表現文字列を受け付ける。
この値はテーブルの値をそのまま渡すことも可能であり、その場合、テーブルの文字列が安全でないと問題が生じることになる。

REGEXP_CONTAINS を テーブルの値から利用する場合には適切なエスケープ操作が必要になるが
この記事ではエスケープ操作の3種の実装を行い結果を比較し、最も効率的な計算方法について調査する。

実験コード

-- JSによる書き換え
CREATE TEMPORARY FUNCTION escape_regex_pattern1(pattern STRING)
RETURNS STRING
LANGUAGE js AS r"""
  return pattern.replace(/[.*+?^${}()|[\]\\]/g,  '\\$&');
""";

--  SQLのみでのエスケープ (REPALCE版)
CREATE TEMPORARY FUNCTION escape_regex_pattern2(pattern STRING)
RETURNS STRING
AS (
  REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    pattern,
    '\\', '\\\\\\'),
    '|', '\\|'),
    '(', '\\('),
    ')', '\\)'),
    '{', '\\{'),
    '}', '\\}'),
    '$', '\\$'),
    '^', '\\^'),
    '*', '\\*'),
    '+', '\\+'),
    '?', '\\?'),
    '.', '\\.')
)
;

--  SQLのみでのエスケープ (REGEXP_REPLACE)
CREATE TEMPORARY FUNCTION escape_regex_pattern3(pattern STRING)
RETURNS STRING
AS (
  REGEXP_REPLACE(pattern, r'([.*+?^${}()|[\]\\])', r'\\\1')
)
;

-- テストコード
WITH test_cases AS (
  SELECT '.*+' AS pattern, r'\.\*\+' AS expected_result UNION ALL
  SELECT '$.[]()' AS pattern, r'\$\.\[\]\(\)' AS expected_result UNION ALL
  SELECT 'hello, world!' AS pattern, r'hello, world!' AS expected_result UNION ALL
  SELECT '\\[\\]' AS pattern, r'\\\[\\\]' AS expected_result UNION ALL
  SELECT '' AS pattern, '' AS expected_result UNION ALL
  SELECT REPEAT('abcdefghijklmnopqrstuvwxyz', 200) AS pattern, REPEAT('abcdefghijklmnopqrstuvwxyz', 200) AS expected_result
)
, benchmark__dataset as (
  select * from test_cases, unnest(generate_array(1, 100000))
)
, benchmark1 as (
 select escape_regex_pattern1(pattern) = expected_result as assert from benchmark__dataset 
)
, benchmark2 as (
 select escape_regex_pattern2(pattern) = expected_result as assert from benchmark__dataset 
)
, benchmark3 as (
 select escape_regex_pattern3(pattern) = expected_result as assert from benchmark__dataset 
)

select countif(assert) from benchmark1
-- select countif(assert) from benchmark2
-- select countif(assert) from benchmark3
-- select pattern, escape_regex_pattern2(pattern), expected_result from test_cases

ベンチマーク結果

benchmark1

image

benchmark2

image

benchmark3

image

結論

パフォーマンスは REPLACE > REGEXP_REPLACE > JS の順番に良いことがわかった。
REPALCEは複数(x14-5回程度)呼び出ししていることもあり、REGEXP_REPLACEとREPLACEの間で反転すると思ったが
REPLACEの方が性能として良い結果となった。

@takegue takegue changed the title BigQueryにおけるパフォーマン BigQueryのREGEPX_関数のエスケープ処理: ベンチマークつき Mar 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant