toml parsing is too slow #39

kuangyunsheng · 2017-07-11T03:27:19Z

I compared toml-node and js-yaml by loading a TOML file and a YAML file that each contain a large array.

demo.toml:

[[arr]]
name = "abcdefg"

demo.yaml:

arr:
  - name: "abcdefg"

Then I repeated the same element 1000 times in each file.

Parsing each file with toml-node and js-yaml respectively, the result is:
toml costs 662 ms
yaml costs 33 ms

The TOML parsing is roughly 20 times slower.
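A minimal sketch of how the comparison above could be reproduced. The helper names (`buildToml`, `buildYaml`, `timeParse`) are hypothetical and not taken from the original report, and the YAML layout assumes the file is a list of mappings under `arr`.

```javascript
// Build the demo.toml content: the same [[arr]] table repeated `count` times.
function buildToml(count) {
  return '[[arr]]\nname = "abcdefg"\n\n'.repeat(count);
}

// Build the demo.yaml content: one `arr` key holding `count` list items.
// Assumption: each element is a mapping with a single `name` key.
function buildYaml(count) {
  return "arr:\n" + '  - name: "abcdefg"\n'.repeat(count);
}

// Time a single call of parse(text) and return the elapsed milliseconds
// together with the parsed result.
function timeParse(parse, text) {
  const start = process.hrtime.bigint();
  const result = parse(text);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { ms, result };
}
```

With the `toml` and `js-yaml` packages installed, `timeParse(require("toml").parse, buildToml(1000))` and `timeParse(require("js-yaml").load, buildYaml(1000))` should give numbers comparable to the ones above. A single run includes JIT warm-up, so repeated runs give a fairer picture.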

kuangyunsheng · 2017-07-11T03:34:34Z

I also tried parsing the same TOML file with toml-j0.4; it took only 45 ms.

felix9 · 2017-09-01T11:38:01Z

This looks like it's mostly compile() being slow, and I think the best fix is to rewrite it.

@BinaryMuse do you mind if I rewrite compile.js? I'd probably also add more testing.

BinaryMuse · 2017-09-01T18:13:44Z

I opened a PR to begin to address this: #42

Some interesting findings so far.

felix9 · 2017-09-02T17:51:24Z

ok, it looks like the grammar is slow.

This is a benchmark on my laptop of parsing a realistic cargo.toml file from the Rust Cargo project:

parse cargo with js-yaml       x 9,848 ops/sec ±2.25% (85 runs sampled)
parse cargo with json          x 64,939 ops/sec ±0.80% (88 runs sampled)
parse cargo with toml-j0.4     x 2,833 ops/sec ±1.87% (84 runs sampled)
parse cargo with toml-node     x 170 ops/sec ±1.07% (80 runs sampled)

I modified toml.pegjs to remove all actions, so it's just doing parsing, and it's still worse than toml-j0.4 (which also uses pegjs).

parse cargo with toml-node     x 329 ops/sec ±1.79% (85 runs sampled)

So I'm now going to try to figure out what's slow in the grammar.

felix9 · 2017-09-02T19:42:39Z

pegjs doesn't generate very efficient parsers. In particular, every rule is a function call, so rules like:

line
  = S* expr:expression S* comment* (NL+ / EOF)
S                = [ \t]

will call the S function in a loop to satisfy the *.
Also, the pegjs --cache option increases the cost of each S call.

Rules like this are faster:

line
  = ws? expr:expression ws? comment* (NL+ / EOF)
ws                = [ \t]+

pegjs still matches [ \t] one char at a time, but it's in a while loop in the ws function.

There doesn't seem to be a way right now to get pegjs to use a /[ \t]+/ regex match instead of repeating /[ \t]/ in a loop.

So... rewriting the grammar slightly to reduce function calls will probably yield a fairly substantial speed increase, and then I'll look at other problems.
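To make the difference concrete, here is a hand-written model of the three strategies discussed above. It is not pegjs output; the function names are hypothetical and the code only mimics the call pattern.

```javascript
// Models the rule `S = [ \t]`: one call matches at most one character.
// Returns the new position, or -1 when the rule fails.
function matchS(input, pos) {
  const ch = input[pos];
  return ch === " " || ch === "\t" ? pos + 1 : -1;
}

// Models `S*` at the call site: one rule call per whitespace character,
// plus one final failing call.
function skipWithSStar(input, pos) {
  let next;
  while ((next = matchS(input, pos)) !== -1) pos = next;
  return pos;
}

// Models `ws?` with `ws = [ \t]+`: a single call, with the loop inside it.
function skipWithWs(input, pos) {
  while (input[pos] === " " || input[pos] === "\t") pos++;
  return pos;
}

// What a sticky regex match of /[ \t]+/ would look like if the generator
// supported it (it does not, per the comment above).
function skipWithRegex(input, pos) {
  const re = /[ \t]+/y;
  re.lastIndex = pos;
  const m = re.exec(input);
  return m ? pos + m[0].length : pos;
}
```

All three return the same position; they differ only in how many function calls are made per run of whitespace.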

felix9 · 2017-09-03T00:19:15Z

I've gotten part of the way to toml-j0.4 performance for the cargo.toml case just by refactoring the grammar.

But there's a large performance gap that's due to toml-node calling line() and column() all the time. I'm not sure what to do about that yet. Options:

1. Instead of line() and column(), use offset(), which is cheap, and then recompute line/column in the error reporter. Unfortunately, pegjs-0.10 gets rid of offset() and replaces it with location(), which is as expensive as line() and column().
2. Fold compile into parse, so we don't have to save line/column.

1 seems fairly easy, I think I'll try that first.
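A sketch of the first option: store only the character offset while parsing, and convert it to a line and column when an error actually has to be reported. The function name `offsetToLineColumn` is hypothetical, and the 1-based numbering is an assumption chosen to resemble the line()/column() values.

```javascript
// Convert a 0-based character offset into a 1-based { line, column } pair.
// Only "\n" is treated as a line terminator; with "\r\n" input the "\r"
// counts as the last column of its line.
function offsetToLineColumn(text, offset) {
  let line = 1;
  let lineStart = 0; // offset of the first character of the current line
  const end = Math.min(offset, text.length);
  for (let i = 0; i < end; i++) {
    if (text[i] === "\n") {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: offset - lineStart + 1 };
}
```

The scan is linear in the offset, but it runs once per reported error instead of once per parsed node.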

felix9 · 2017-09-03T19:41:27Z

#44 and #45 combined make toml-node about half the speed of toml-j0.4 for parsing my cargo.toml test case.

Also removing the pegjs --cache option will make toml-node about the same speed as toml-j0.4. I think the pegjs caching is unnecessary for this grammar, but I haven't worked through it fully yet.
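A simplified model of why a per-rule cache can cost more than it saves. This is not the code pegjs generates; `makeParser` and its counters are hypothetical. It memoizes the single-character rule S by input position and counts how often the cache is actually hit.

```javascript
// Build a tiny parser for `S*` where `S = [ \t]`, with optional memoization.
function makeParser(input, useCache) {
  const cache = new Map(); // position -> { nextPos, result }
  const stats = { calls: 0, hits: 0, stores: 0 };
  let pos = 0;

  function parseS() {
    stats.calls++;
    if (useCache) {
      const cached = cache.get(pos);
      if (cached !== undefined) {
        stats.hits++;
        pos = cached.nextPos;
        return cached.result;
      }
    }
    const start = pos;
    let result = null; // null stands for "no match"
    const ch = input[pos];
    if (ch === " " || ch === "\t") {
      result = ch;
      pos++;
    }
    if (useCache) {
      // Every call pays for a lookup and a store, even when the entry
      // is never read again.
      cache.set(start, { nextPos: pos, result });
      stats.stores++;
    }
    return result;
  }

  // S*: keep calling the rule until it fails.
  function parseSStar() {
    const matched = [];
    let r;
    while ((r = parseS()) !== null) matched.push(r);
    return matched;
  }

  return { parseSStar, stats, getPos: () => pos };
}

const input = " \t  \t x";
const plain = makeParser(input, false);
const cached = makeParser(input, true);
plain.parseSStar();
cached.parseSStar();
```

In this left-to-right scan the cached parser stores an entry on every call and never reads one back, so the cache is pure overhead. Memoization only pays off when backtracking re-invokes a rule at a position it has already tried, which is the property that would need checking for the real grammar.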

There are more things to improve; toml parsing should be about as fast as yaml parsing, and potentially faster.

BinaryMuse mentioned this issue Sep 1, 2017

Address performance #42

Open

felix9 added a commit to felix9/toml-node that referenced this issue Sep 3, 2017

faster grammar for BinaryMuse#39

7093967

felix9 added a commit to felix9/toml-node that referenced this issue Sep 3, 2017

use offset() instead of line()/column() for issue BinaryMuse#39

ae25000

bd82 mentioned this issue Jan 28, 2019

Suggestion: A new infrastructure for this Toml -> Json compiler #56

Open

omus mentioned this issue Jan 4, 2024

Avoid cloning cached registries julia-actions/add-julia-registry#25

Merged
