Skip to content

Latest commit

 

History

History

pgen

Parser generator

Not only other parser generators for web weren't written here, but they lack a set of features we really need:

  • Type-safety: API of generated parser should be typed without any
  • AST from grammar: converting untyped trees to AST is unsafe and boring
  • TBD CST: pretty-printer has to keep comments /**/, underscores in numbers 1_234 and other features that are nowhere represented in AST.
  • Named lexemes: good error messages shouldn't report an identifier as "a-z, A-Z, 0-9, or _".
  • TBD Error recovery: programming languages should report more than one error at a time.
  • TBD Incremental: reparse shouldn't take time proprtional to size of the file.
  • High-order rules A<B>: duplicated code leads to increased chance to make a mistake, and high-order rules are required for duplication.
  • TBD No stack overflow on large expressions: nested constructions might lead to stack overflow.
  • Space skipping: manually annotating grammar with spaces is error-prone and boring.

Comparison to peggy

pgen mostly follows grammar of peggy with a few notable differences.

  • Capitalized rules Foo = ... create AST nodes with { $: 'Foo' }.
  • Rules have to end with semicolon ;.
  • Inline semantic actions { return 42; } are not supported. We can't infer types of AST when there is some inlined JavaScript code, because JS is untyped.
  • High-order rules A<B> = ... were added.
  • Space skipping was added. It uses space rule.
  • Lexification operator # was added.
  • Character classes do not support modifiers [a-z]i.

Syntax reference

  • Non-AST rule defintion rule = ...;
  • AST rule defintion Rule = .... Returns an object with { $: 'Rule', loc: Loc } with rest of the fields defined with named clauses in right-hand side.
  • Display override for error messaging Id "identifier" = ...;
  • High-order rule defintion inter<A, B> = ...; and call inter<expression, ",">
  • Left-biased choice "A" / "B". Will match the first matching clause.
  • Sequence foo bar baz. All clauses should match in sequence.
  • Named clauses "if" "(" expr:expression ")" stmts:statements. Sequence operator generates an object, and named clauses become its fields { expr: ..., stmts: ... }.
  • Picked clause "if" "(" @expression ")". Sequence operator returns only a single value of picked clause.
  • Single clause sequence a = b. Works as a = @b.
  • Negative lookahead !x. Fails if x matches. Doesn't consume input.
  • Positive lookahead &x. Passes if x matches. Doesn't consume input.
  • Stringification $x. Ignores AST computed by x, returns string that x matched.
  • Lexification #x. Does not skip spaces inside of x. If x calls some other rules, doesn't skip spaces there either.
  • Repeat x*.
  • Repeat at least once x+.
  • Optional x?.
  • String "abc".
  • Character class [a-z_]. Supports ranges a-z. Supports negation [^a-z].

Implicit syntax

  • Spaces are skipped after every terminal: "string", [a-z]
  • Spaces are skipped after lexification operator #x
  • Spaces are not skipped inside lexification operator #x.
  • Spaces are skipped at the start, before rest of the parsing will happen
  • If not the whole input was consumed, error will be emitted