Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax highlighting hook #11

Open
aminya opened this issue Oct 30, 2020 · 9 comments
Open

Syntax highlighting hook #11

aminya opened this issue Oct 30, 2020 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@aminya
Copy link

aminya commented Oct 30, 2020

Marked supports providing a custom highlight function to colour the code. It would be nice if markdown-wasm provides a similar feature. This means that every time a <code> or a <pre><code> is seen, it should pass the code to the hook for external rendering.

Here is how marked allows it:
https://github.com/atom-community/atom-ide-markdown-service/blob/9a95f3815bbcf2507a31235a9d2f9102192f92b2/src/renderer.ts#L33-L42

@aminya
Copy link
Author

aminya commented Jun 25, 2021

@rsms Could you shed some light on this issue? Is this possible given the current codebase?

@rsms rsms added the enhancement New feature or request label Jun 25, 2021
@rsms
Copy link
Owner

rsms commented Jun 25, 2021

That's a cool idea. It is likely to be a complicated addition as this is a WASM module, not JavaScript. Even if a js function is called as a callback, you will likely be paying a pretty steep perf cost.

@aminya
Copy link
Author

aminya commented Jun 26, 2021

So do you think that editing the HTML code blocks (adding colors) after getting the initial markdown would have better performance?

rsms added a commit that referenced this issue Jul 1, 2021
@rsms
Copy link
Owner

rsms commented Jul 1, 2021

@aminya Okay, I've got a proposal in the main branch. It was quite a bit of effort as I had to do some maintenance on some tools (wasmc and emsdk docker image) but this approach actually works really well and is reasonably performant.

2938af6 adds onCodeBlock to ParseOptions with the following signature:

interface ParseOptions {
  /**
   * onCodeBlock is an optional callback which if provided is called for each code block.
   * langname holds the "language tag", if any, of the block.
   *
   * The returned value is inserted into the resulting HTML verbatim, without HTML escaping.
   * Thus, you should take care of properly escaping any special HTML characters.
   *
   * If the function returns null or undefined, or an exception occurs, the body will be
   * included as-is after going through HTML escaping.
   *
   * Note that use of this callback has an adverse impact on performance as it casues
   * calls and data to be bridged between WASM and JS on every invocation.
   */
  onCodeBlock? :(langname :string, body :UTF8Bytes) => Uint8Array|string|null|undefined
  // ...
}

Try it like this:

$ git checkout https://github.com/rsms/markdown-wasm.git
$ cd markdown-wasm
$ git checkout 571f5ceb09730208edcf8abb19444ed24432c337
$ npm install
$ npm run build
$ node example/example.js

Example program:

const fs = require("fs")
const md = require("../dist/markdown.node.js")
const source = fs.readFileSync(__dirname + "/example.md")
const outbuf = md.parse(source, {
  bytes: true,
  onCodeBlock(lang, body) {
    return body.toString().toUpperCase()
  },
})
fs.writeFileSync("example.html", outbuf)

Since WASM maintains an isolated "heap of memory", every call to the filter function implies one copy of the bytes (from a JS heap Uint8Array buffer to the WASM heap). Additionally if you do body.toString() to get a JS string (or decode the body, which is a view into the WASM heap) you pay 3x additional "tolls" of copying (WASM heap to JS heap as a UTF-16 string + return value JS string to JS heap Uint8Array + JS heap Uint8Array to WASM heap.) This is unavoidable if you deal with JS strings. I'm looking into the possibility to directly encode a JS string into WASM heap memory (see withUTF16Str in wlib.js) but it means that the data needs to be translated into UTF8 and copied anyways, but in WASM, which is likely going to be slower.

Anyhow, the performance is still really good. On my 2018 MacBook Pro, nodejs v16.1:

  • without onCodeBlock: avg parse time: 31.0us
  • with onCodeBlock: avg parse time: 470.6us

(repro with node example/example.js -bench)

@rsms rsms self-assigned this Jul 1, 2021
@rsms
Copy link
Owner

rsms commented Jul 1, 2021

Released alongside a security fix in v1.2.0 on NPM

@rsms
Copy link
Owner

rsms commented Jul 1, 2021

@aminya Let me know if this is a working solution for you and we can close the issue.

@aminya
Copy link
Author

aminya commented Jul 1, 2021

Thank you so much! ❤️

The only issue that I have now is that onCodeBlock is a synchronous function, while my highlighting function is asynchronous. I might be able to use make-synchronous or deasync, but I was wondering if it is possible to make it async.

Regarding the string conversion, this video might help.

@rsms
Copy link
Owner

rsms commented Jul 1, 2021

There's really no reasonably-affordable way to make it async as parsing is inherently synchronous. I'm surprised your syntax highlighter is async since it's really a CPU-bound thing with no I/O involved on a single-threaded platform (nodejs.) Perhaps look into using a different syntax highlighter? Alt you can do what is pretty common and apply post-processing to the output HTML (just grep for <code> blocks)

@aminya
Copy link
Author

aminya commented Jul 1, 2021

This code runs inside Atom, and the syntax highlighting is done by SuperString and TreeSitter, which are native code and run in parallel. If it was in JavaScript, there was no reason to make the function async, but this isn't the case here.

Anyways, thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants