Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toml parsing is too slow #39

Open
kuangyunsheng opened this issue Jul 11, 2017 · 7 comments
Open

toml parsing is too slow #39

kuangyunsheng opened this issue Jul 11, 2017 · 7 comments

Comments

@kuangyunsheng
Copy link

I compare toml-node and js-yaml to load toml/yaml file with a large array config.

demo.toml:

[[arr]]
name = "abcdefg"

demo.yaml:
arr:

  • name: "abcdefg"

and, repeat the same element 1000 times in each file.

then parse them each other with toml-node and js-yaml, the result is:
toml costs 662 ms
yaml costs 33 ms

the toml parsing is so slow

@kuangyunsheng
Copy link
Author

and I try to parse the same toml file with toml-j0.4, it only cost 45 ms.

@felix9
Copy link
Contributor

felix9 commented Sep 1, 2017

This looks like it's mostly compile() being slow, and I think the best fix is to rewrite it.

@BinaryMuse do you mind if I rewrite compile.js? I'd probably also add more testing.

@BinaryMuse
Copy link
Owner

I opened an PR here to begin to address this: #42

Some interesting findings so far.

@felix9
Copy link
Contributor

felix9 commented Sep 2, 2017

ok, it looks like the grammar is slow.

This is a benchmark on my laptop of parsing a realistic cargo.toml file from the Rust Cargo project:

parse cargo with js-yaml       x 9,848 ops/sec ±2.25% (85 runs sampled)
parse cargo with json          x 64,939 ops/sec ±0.80% (88 runs sampled)
parse cargo with toml-j0.4     x 2,833 ops/sec ±1.87% (84 runs sampled)
parse cargo with toml-node     x 170 ops/sec ±1.07% (80 runs sampled)

I modified toml.pegjs to remove all actions, so it's just doing parsing, and it's still worse than toml-j0.4 (which also uses pegjs).

parse cargo with toml-node     x 329 ops/sec ±1.79% (85 runs sampled)

So I'm now going to try to figure out what's slow in the grammar

@felix9
Copy link
Contributor

felix9 commented Sep 2, 2017

pegjs doesn't generate very efficient parsers. In particular, every rule is a function call, So rules like:

line
  = S* expr:expression S* comment* (NL+ / EOF)
S                = [ \t]

will call the S function in a loop to satisfy the *.
Also, the pegjs --cache option increases the cost of each S call.

Rules like this are faster:

line
  = ws? expr:expression ws? comment* (NL+ / EOF)
ws                = [ \t]+

pegjs still matches [ \t] one char at a time, but it's in a while loop in the ws function.

There doesn't seem to be a way right now to get pegjs to use a /[ \t]+/ regex match instead of repeating /[ \t]/ in a loop.

So... rewriting the grammar slightly to reduce function calls will probably yield a fairly substantial speed increase, and then I'll look at other problems.

@felix9
Copy link
Contributor

felix9 commented Sep 3, 2017

I've gotten part of the way to toml-j0.4 performance for the cargo.toml case just by refactoring the grammar.

But there's a large performance gap that's due to toml-node calling line() and column() all the time. I'm not sure what to do about that yet. Options:

  1. Instead of line() and column(), use offset(), which is cheap, and then recompute line/column in the error reporter. Unfortunately, pegjs-0.10 gets rid of offset() and replaces it with location(), which is as expensive as line() and column().
  2. Fold compile into parse, so we don't have to save line/column.

1 seems fairly easy, I think I'll try that first.

felix9 added a commit to felix9/toml-node that referenced this issue Sep 3, 2017
felix9 added a commit to felix9/toml-node that referenced this issue Sep 3, 2017
@felix9
Copy link
Contributor

felix9 commented Sep 3, 2017

#44 and #45 combined makes toml-node about half the speed of toml-j0.4 for parsing my cargo.toml test case.

Also removing the pegjs --cache option will make toml-node about the same speed as toml-j0.4. I think the pegjs caching is unnecessary for this grammar, but I haven't worked through it fully yet.

There are more things to improve; toml parsing should be about as fast as yaml parsing, and potentially faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants