Would scryer-prolog be usable to write parsers for lightweight markup languages? #1599

alexpdp7 · 2022-09-03T15:00:26Z

I'm researching ways of writing parsers for languages such as AsciiDoc or Markdown.

Even though I loved Prolog back in the day, I never used it for parsing, and until recently I had forgotten that one of its original purposes was writing parsers. Once I was reminded of this, I did some research and came to think that Prolog and DCGs would let me write parsers for, say, AsciiDoc in a mostly declarative way (neither AsciiDoc nor Markdown is context-free, and AsciiDoc has a particularly complex structure¹).

The README for Scryer seems to indicate that string handling is particularly efficient, so I'm thinking about starting my experiments on Prolog parsing with it. My idea would be to create a framework that allows me to write a grammar in Prolog, then write some scaffolding that applies that grammar to a file, then outputs a JSON AST of the parsed document. Probably I would add a post-processing step to annotate the AST with line/column information².

Ideally, I think I'd like to write a tool that lets you do something like:

$ parser grammar.pl file.ext

Which loads up the rules in grammar.pl, applies them to file.ext and outputs a JSON AST.

...

Is this a bad/crazy idea? Would this be reasonably implementable with Scryer as it exists today?

¹ AsciiDoc parameterizes how escaping/parsing is done in a block (see https://docs.asciidoctor.org/asciidoc/latest/subs/apply-subs-to-blocks/#the-subs-attribute ).
² If I understand correctly, if my AST has nodes for all the characters in the parsed file, I should be able to add the position annotations to the AST without having to clutter the Prolog "grammar" with code to generate these annotations.


aarroyoc · 2022-09-12T17:56:01Z

Hi Alex!

I think you're the same person who asked a similar question on Reddit. I believe it was already shown that it is possible to write parsers for grammars that go beyond context-free. Also, remember that DCGs are just an abstraction that gets translated to standard Prolog code. Plain Prolog is Turing-complete, so any parser can be written in it, even though some of them will be trickier.

As for myself, I've written several parsers in Prolog (some of them in Scryer) and, for me, it's way better than the Lex/Yacc workflow. It's very easy to compose rules and to add both simple checks and more advanced ones (which require calls to Prolog code). However, it's not perfect. A problem I have found in some of my grammars is that, while the first parsing solution is indeed the one I want, other solutions are left over. Be careful about leaving choicepoints: on larger samples, the parser might look like it succeeded when in reality it didn't. Another problem you might have is that, while in theory the order of some things doesn't matter, in practice it does, and it can make parsing much faster or slower. Two of my most complex grammars are Teruel, a Jinja-like template language: https://github.com/aarroyoc/teruel/ and MIPSie, a MIPS assembly emulator: https://github.com/aarroyoc/mipsie/. Use them as inspiration if you want, but the code quality is not there yet.
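To make the point about leftover solutions concrete, here is a minimal sketch, not taken from either project. The names key_value//2, parses/2 and first_parse/3 are hypothetical, and the grammar is deliberately ambiguous whenever the value itself contains "=".

```prolog
:- use_module(library(dcgs)).

% key_value//2: "Key=Value" (hypothetical example grammar).
% seq//1 from library(dcgs) matches any sequence, so an input with
% two "=" signs can be split in more than one way.
key_value(K, V) --> seq(K), "=", seq(V).

% Collect every parse of Cs, to check whether the grammar is ambiguous.
parses(Cs, Ps) :-
    findall(K-V, phrase(key_value(K, V), Cs), Ps).

% Commit to the first parse explicitly instead of leaving choicepoints.
first_parse(Cs, K, V) :-
    once(phrase(key_value(K, V), Cs)).

% ?- parses("a=b=c", Ps).
%    Ps = ["a"-"b=c","a=b"-"c"].
```

Checking with findall/3 during development that a grammar has exactly one solution for the test inputs is one cheap way to catch this; once/1 only hides the ambiguity, so tightening the grammar is usually the better fix.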

Finally, why Scryer Prolog? Scryer treats strings very efficiently and makes writing DCGs for them very easy. Other systems also handle strings fairly efficiently, but there they are an opaque type and lose the property that you can reason about them as lists. Scryer is one of the few systems whose internal representation of strings is compact and efficient while still exposing strings to the program as plain lists of characters (atoms).
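A small sketch of what "strings as lists of characters" means in practice, assuming Scryer's default double_quotes flag (chars). The predicate names string_is_list/0 and strip_prefix/3 are made up for illustration.

```prolog
:- use_module(library(lists)).

% With double_quotes set to chars, "abc" is the list [a,b,c].
string_is_list :- "abc" = [a, b, c].

% Ordinary list predicates therefore work on strings, and in more
% than one direction.
strip_prefix(Prefix, Cs, Rest) :-
    append(Prefix, Rest, Cs).

% ?- strip_prefix("== ", "== Title", Rest).
%    Rest = "Title".
% ?- strip_prefix(Prefix, "== Title", "Title").
%    Prefix = "== ".
```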

Just my two cents

4 replies

alexpdp7 Sep 12, 2022
Author

Hi!

Yes, indeed that's me on Reddit. Yes, I think Prolog is worth exploring for those languages.

For example, the "subs" attribute in AsciiDoc means that the parsing in a block is "parameterized" by a (simple) semantic piece of code before it. With Lex/Yacc and most traditional parsers, I need to introduce procedural code and "hack" the parser to cater for this. With Prolog, I can write this logic declaratively in Prolog- while it will not be, say, a "clean" BNF grammar, it will be much easier to understand than digging into the internals of the parsers.
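As a rough sketch of that idea (not real AsciiDoc): the block syntax "[subs=MODE]" followed by a newline and a body, the two modes, and all nonterminal names below are simplifications invented for illustration. The mode parsed from the attribute is simply passed as an argument to the nonterminal that parses the body.

```prolog
:- use_module(library(dcgs)).
:- use_module(library(lists)).

% block//1: the attribute line selects how the body is parsed.
block(block(Mode, Body)) -->
    "[subs=", mode(Mode), "]\n",
    body(Mode, Body).

mode(none)   --> "none".
mode(quotes) --> "quotes".

% subs=none: the body is kept verbatim.
body(none, [text(Cs)]) --> seq(Cs).
% subs=quotes: *...* spans become strong/1 nodes.
body(quotes, Is) --> inlines(Is).

inlines([]) --> [].
inlines([I|Is]) --> strong(I), inlines(Is).
inlines([text([C|Cs])|Is]) --> plain_char(C), plain(Cs), after_text(Is).

% A text node may only be followed by the end or a strong span, so
% adjacent text nodes (and thus redundant solutions) are ruled out.
after_text([]) --> [].
after_text([I|Is]) --> strong(I), inlines(Is).

strong(strong(Cs)) --> "*", plain(Cs), "*".

% Greedy: the recursive clause comes first.
plain([C|Cs]) --> plain_char(C), plain(Cs).
plain([]) --> [].

plain_char(C) --> [C], { \+ member(C, "*") }.

% ?- phrase(block(B), "[subs=quotes]\na *b* c").
%    B = block(quotes,[text("a "),strong("b"),text(" c")]).
% ?- phrase(block(B), "[subs=none]\na *b* c").
%    B = block(none,[text("a *b* c")]).
```

In this sketch an unterminated "*" makes the whole parse fail; a real grammar would need a fallback that treats it as plain text.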

Yeah, I suspect I will need to be careful about adding cuts, etc. for disambiguation- I fear that's one of the places where I will sweat the most. I am not especially concerned about performance- Scryer's strings sound like a great idea, and the files I will be parsing are "book sections", so maybe a few hundred lines of text each, with not a lot of markup, so I suspect performance will be good enough.

I have already been able to parse some basic text (a "the bat eats a cat" English sentence into subject, verb, object), now I'm struggling with making this a command-line program and outputting a JSON AST.

Your code was helpful:

parse(Filename, MipsCode, MipsState) :-
    pio:phrase_from_file(lines(MipsCodeLines), Filename),
    meta::map(no_comment_line, MipsCodeLines, MipsCodeLinesClean),
    filter_empty_lines(MipsCodeLinesClean, MipsLines),
    mips_asm(MipsLines, MipsCode, MipsState).

I hadn't realized that the first argument to phrase_from_file is not just a "grammar", but also a way to get the AST (in the MipsCodeLines variable, right?)! I think that solves the first issue I had; getting the AST as JSON will be next.

Thanks!

Álex

alexpdp7 Sep 12, 2022
Author

Yeah:

:- use_module(library(dcgs)).
:- use_module(library(pio)).
:- use_module(library(serialization/json)).

sentence(s(NP,VP)) --> noun_phrase(NP), verb_phrase(VP).
noun_phrase(np(D,N)) --> det(D), noun(N).
verb_phrase(vp(V,NP)) --> verb(V), noun_phrase(NP).
det(d("the")) --> "the".
det(d("a")) --> "a".
noun(n("bat")) --> "bat".
noun(n("cat")) --> "cat".
verb(v("eats")) --> "eats".

parse(F, J):-phrase_from_file(sentence(S), F), phrase(json_chars(J), S).

phrase_from_file works, now I need to figure out how to get phrase/json_chars to work.

edit: ah, I think I need to map my s, np, etc. to json_*.

edit 2: more notes to self: phrase(json_chars(null), J)..

alexpdp7 Sep 12, 2022
Author

Ahhhh! Cracked it, thanks! https://github.com/alexpdp7/prolog-parsing/blob/main/simple.pro

I don't like to_json, but I suppose I can make it prettier using univ :)

Prolog I studied 20 years ago is coming back to me :)

triska Sep 12, 2022

Awesome, thank you a lot for sharing your code!

From a quick glance, it seems that to_json could benefit from using DCGs to describe the intended JSON output (as a list of characters).
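A minimal sketch of what that could look like for the s/np/vp tree from the grammar earlier in the thread. It is not the code in simple.pro; the JSON shape ({"tag":[...]} objects) and the helper names node//3, leaf//2 and jstring//1 are assumptions, and no JSON string escaping is done, so it only suits words without quotes or backslashes.

```prolog
:- use_module(library(dcgs)).

% json//1 describes the JSON text for a parse tree.
json(s(NP, VP))  --> node("s", NP, VP).
json(np(D, N))   --> node("np", D, N).
json(vp(V, NP))  --> node("vp", V, NP).
json(d(W))       --> leaf("d", W).
json(n(W))       --> leaf("n", W).
json(v(W))       --> leaf("v", W).

% An inner node becomes {"tag":[child1,child2]}.
node(Tag, A, B) -->
    "{", jstring(Tag), ":[", json(A), ",", json(B), "]}".

% A leaf becomes {"tag":"word"}.
leaf(Tag, W) --> "{", jstring(Tag), ":", jstring(W), "}".

% No escaping: assumes Cs contains no quotes or backslashes.
jstring(Cs) --> "\"", seq(Cs), "\"".

% ?- phrase(json(np(d("the"), n("bat"))), Cs).
%    Cs = "{\"np\":[{\"d\":\"the\"},{\"n\":\"bat\"}]}".
```

Because the output is itself described by a DCG, the result is again a plain list of characters that can be written out or composed with other nonterminals.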

alexpdp7 · 2022-09-15T11:21:07Z

For the moment, I think the answer is "yes". I got Scryer Prolog to parse a simple grammar and spit out JSON, and I think https://github.com/rla/prolog-markdown is proof enough. I'll have to conduct more experiments, though.

0 replies

alexpdp7 · 2022-09-20T18:51:31Z

An update.

https://github.com/alexpdp7/prolog-parsing/blob/main/asciidoc_poc.pro

This has the "minor" issues that the plunit tests don't work under Scryer, and that Scryer doesn't have flatten/2 (but then, I shouldn't be using it- I'm lazy), but that's a parser for a minimal "hard" subset of AsciiDoc- and my only experience with Prolog until recently was a semester in University (not writing parsers) like 20 years ago.

I've moved to SWI because the tooling is a bit more complete, but if I spend more time on this I'll definitely fix the small compat issues and do some benchmarking!

8 replies

alexpdp7 Sep 20, 2022
Author

Oh, I thought there was a way of using seqq/2 as part of a DCG. I'm actually already doing stuff which should be replaced by phrase, so I think I might do this in a couple of places. Thanks!

pmoura Sep 20, 2022

This has the "minor" issues that the plunit tests don't work under Scryer, ...

You can use lgtunit with Scryer Prolog (and most Prolog systems) to define and run tests.

alexpdp7 Sep 20, 2022
Author

Oh, yes! I was considering using Logtalk- I understand that would also let me easily check whether my parser works on SWI, Scryer... and Tau Prolog (running this in the browser is also an interesting use case), so I was tempted to do that. It's just that I was a bit afraid of adding too many different elements when I'm an absolute noob!

I'll take that into consideration. I'm pondering whether to expand my parser to some further "hard" AsciiDoc syntax (to prove more thoroughly that Prolog is the way to parse this), or stop, polish everything and publish some tutorials about this.

gitonthescene Oct 11, 2022

@alexpdp7 Not directly Prolog related, but you might be interested in unified as something to target your AST to.

alexpdp7 Oct 12, 2022
Author

Oh yeah, that sounds really really interesting. A standardized "text" AST for spellcheckers et al. to target is a wonderful idea and that would make things much simpler. I'm actually tempted to do a quick AsciiDoc adapter for that using the "cheap trick" (convert to HTML, extract text and match positions).

OTOH, BTW, I think I have a decent v1 of the article I was talking about https://github.com/alexpdp7/prolog-asciidoc/blob/main/parsing-asciidoc-in-prolog.adoc .

infogulch · 2022-10-12T18:24:45Z

infogulch
Oct 12, 2022

If we're talking about general parsing tools, I think tree-sitter would be a good way to interface with a wide variety of pre-built grammars.

Ideally you'd have a Prolog program that takes an evaluated tree-sitter grammar (a JSON document) and produces a Prolog DCG that you can use with phrase/3 to convert text into an AST according to the grammar.

I'm not sure where to start with this but I suspect that it would be a great way to interact with an existing large community.

7 replies

gitonthescene Oct 13, 2022

That sounds like implementing tools with tree-sitter as opposed to a format to build AST pipelines.

infogulch Oct 13, 2022

tree-sitter is an incremental parser that outputs ASTs, so I'm not sure what the distinction is.

gitonthescene Oct 13, 2022

Oh, okay. That’s what I thought. Unified is less about the parser than the output format of the parsed input. The standardization allows you to string together parsers developed independently into pipelines that perform various transformations.

infogulch Oct 13, 2022

Unified appears to be a collection of independently developed parsers, transformers, and printers for JS, perhaps organized around a standard set of APIs. That's pretty neat for JS, but doesn't really help Prolog aside from being a good example of such a system.

tree-sitter may still be useful here because its output is a regularized JSON document that describes the grammar of the language, which can be utilized in a much wider variety of contexts.

gitonthescene Oct 13, 2022

I’m not sure what you mean by “help”. But in any event I think we’re mostly on the same page now. I only meant to offer a suggestion I thought alexpdp7 would find useful.
