Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise string trimming on JavaScript #736

Merged
merged 2 commits into from
Nov 26, 2024

Conversation

richard-viney
Copy link
Contributor

@richard-viney richard-viney commented Nov 12, 2024

Optimises the regexes used for string trimming on JavaScript be to 1.5-3x faster, depending on the input and trimming function used.

@richard-viney richard-viney force-pushed the js-optimise-string-trim branch 2 times, most recently from b510efa to 5303d38 Compare November 12, 2024 17:06
@richard-viney richard-viney marked this pull request as ready for review November 12, 2024 17:17
Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that regex doesn't iterate faster than a JS loop.

Is it faster due to not using regex or is it because string slicing is being used instead of string replacement?

@richard-viney richard-viney force-pushed the js-optimise-string-trim branch 2 times, most recently from 2952fc8 to da7f50e Compare November 20, 2024 11:50
@richard-viney
Copy link
Contributor Author

On Node.js 22.11 the speedup comes from both not using a regex, and also from using string slicing instead of string replacement:

Approach Time (N = 500 million)
Regexp.replace() (current approach) 19.2 s
RegExp.exec() then String.substring() 14.6 s
Manual trimming (this PR) 2.7 s

However, turns out it this also very runtime dependent. While the initial code on this PR was an order of magnitude faster on Node/Deno, it's actually slower on Bun. This seems to be because Bun handles a regex replace better than it can JIT optimise the hand-written trimming code.

I've therefore changed this PR to the following:

  1. Remove g from the trim regexes as it's unnecessary and slows things down quite a bit on all runtimes.
  2. Use a single unified regex for trim().
  3. Use regex based trimming on the Bun runtime.
  4. Use manual JS trimming code on the Node/Deno runtimes.
  5. The exception is trim_start(), because in this case Bun runs the manual JS trimming code faster than it runs a regex.

This gives the best performance on all runtimes.

If taking different code paths based on the JS runtime is undesirable then what we could do instead is only implement points 1, 2, and potentially 5 as listed above. This would be simpler, still give a meaningful speedup on all runtimes (at least 2-3x), but does leave a fair amount of performance on the table.

Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to have any runtime specific checks in the stdlib, especially since we'd need to have checks for the browsers (which are the main places Gleam JS use used) so let's stick with the optimised regex for all platforms.

@richard-viney richard-viney force-pushed the js-optimise-string-trim branch from da7f50e to 5ba0cc9 Compare November 20, 2024 18:50
@richard-viney
Copy link
Contributor Author

Have changed to only apply the regex-based improvements.


If in future anyone comes across this thread searching for a boost to trimming performance on JS and mainly cares about V8-based runtimes, try the following code to get trimming performance that can be many times faster:

export function trim(string) {
  const start_index = find_non_whitespace_char(string);

  let end_index = rfind_non_whitespace_char(string) + 1;
  if (end_index < start_index) {
    end_index = start_index;
  }

  return string.substring(start_index, end_index);
}

export function trim_start(string) {
  return string.substring(find_non_whitespace_char(string));
}

export function trim_end(string) {
  return string.substring(0, rfind_non_whitespace_char(string) + 1);
}

function isUnicodeWhitespace(c) {
  return (
    c === "\u0020" || // Space
    c === "\u0009" || // Horizontal tab
    c === "\u000A" || // Line feed
    c === "\u000B" || // Vertical tab
    c === "\u000C" || // Form feed
    c === "\u000D" || // Carriage return
    c === "\u0085" || // Next line
    c === "\u2028" || // Line separator
    c === "\u2029" // Paragraph separator
  );
}

function find_non_whitespace_char(string) {
  let i = 0;

  for (; i < string.length; i++) {
    if (!isUnicodeWhitespace(string[i])) {
      break;
    }
  }

  return i;
}

function rfind_non_whitespace_char(string) {
  let i = string.length - 1;

  for (; i >= 0; i--) {
    if (!isUnicodeWhitespace(string[i])) {
      break;
    }
  }

  return i;
}

@richard-viney richard-viney force-pushed the js-optimise-string-trim branch from 5ba0cc9 to 5e9c98b Compare November 25, 2024 22:23
Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!!

@lpil lpil merged commit b0afb8f into gleam-lang:main Nov 26, 2024
7 checks passed
@richard-viney richard-viney deleted the js-optimise-string-trim branch November 26, 2024 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants