-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
command line script for node #84
base: main
Are you sure you want to change the base?
Conversation
iloveitaly
commented
Aug 12, 2023
- build: adding rehype-parse for node dom parsing
- feat: use rehype-node when running on the cli
- feat: remove top-level element to fix node dom parsing
- feat: cli script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I don’t have a lot of time to dive through this right now, but added a few quick questions inline. I will try to get back with a more detailed review later.
cmd.js
Outdated
const inputHTML = stdinReadSync() | ||
|
||
if(!inputHTML) { | ||
console.error('no HTML provided over stdin'); | ||
process.exit(1); | ||
} | ||
|
||
convertDocsHtmlToMarkdown(inputHTML).then(markdown => { | ||
process.stdout.write(markdown); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all needs exception handling. A CLI tool meant to general usage should avoid crashing with a stack trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it would be better to expose the raw exception until we know what type of exceptions can occur. IMHO it's frusterating when a generic error is given from a CLI tool, it requires that I go and edit out the catch block to actually understand what went wrong and if there's anything simple I can do to fix it.
Will add this in though!
lib/fix-google-html.js
Outdated
// this seems to only occur when running via nodejs (not the browser): there's a top-level <b> tag | ||
// this is only called once with the parent node | ||
function removeRootBoldWrapper(node) { | ||
// there are some cases, like translating <b>something something</b> where we don't want to remove the root node | ||
if(node.children.length === 1 && node.children[0].tagName === 'b') { | ||
return | ||
} | ||
|
||
for(let i = 0; i < node.children.length; i++) { | ||
const child = node.children[i]; | ||
|
||
if(child.tagName === 'b') { | ||
node.children.splice(i, 1, ...child.children); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain more about what the purpose here is? Why are you concerned particularly with <b>
elements here, and what circumstances are you encountering? If this is a CLI-specific issue, it should probably be handled in the CLI portion of the code, not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more in the comments! Let me know if this still isn't clear.
README.md
Outdated
However, what you really want it run this after copying text from Google Docs. To do this, you'll need to extract the HTML | ||
on the clipboard. Here's a script to do this: | ||
|
||
```shell | ||
swift - <<EOF | npx google-docs-to-markdown | pbcopy | ||
import Cocoa | ||
|
||
let type = NSPasteboard.PasteboardType.html | ||
|
||
guard let string = NSPasteboard.general.string(forType:type) else { | ||
fputs("Could not find string data of type '\(type)' on the system pasteboard\n", stderr) | ||
exit(1) | ||
} | ||
|
||
print(string) | ||
EOF | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it needs to be made more clear that this is a macOS-specific example, and will not work on other platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjusted the comment to make this clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more detailed notes inline in addition to my previous question.
This also needs to at least run the unit tests in Node. That should hopefully be as simple as:
- Add a new script in
package.json
namedtest-unit-node
(or something like that) that just runsmocha
(maybe needs some other options/settings). - Add that script to the main
test
script. - Add a CI job to run it as well:
google-docs-to-markdown/.github/workflows/ci.yml
Lines 28 to 64 in 5801163
unit-tests: # Use MacOS so we can test in Safari. runs-on: macos-latest steps: - uses: actions/checkout@v3 - name: Install Node.js uses: actions/setup-node@v3 with: node-version: 16 cache: npm - name: Install dependencies run: npm ci - name: Unit Tests (in Browser) run: npm run test-unit e2e-tests: # Use MacOS so we can test in Safari. runs-on: macos-latest steps: - uses: actions/checkout@v3 - name: Install Node.js uses: actions/setup-node@v3 with: node-version: 16 cache: npm - name: Install dependencies run: npm ci - name: End-to-End Tests run: npm run test-e2e
Ideally, there should be some basic end-to-end tests for the CLI, too (e.g. run node cmd.js
with some input), but I think we can iterate towards that.
Could you also please add an engines
section to package.json
that specifies Node.js >=16.20.0
? I think we might be able to do an earlier 16.x version, but 16.20+ is definitely supported. (Or change #83 to do this instead and we’ll merge that.)
cmd.js
Outdated
if (process.stdin.isTTY) { | ||
return ''; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why give up if the session is interactive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is stop accepting new input and send it through for processing. ^D didn't work for me by default. Maybe there's a better trick here or fixing up the stdin processing would be better.
cmd.js
Outdated
} | ||
|
||
if (!n) break | ||
data += b.toString('utf8', 0, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think this is critical for a first pass on the CLI feature, but it might be good to either add a CLI option to specify the encoding or add sniffing via jschardet (or both).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not aware of node:stream/consumers
! Thanks for pointing that out.
import {default as rehypeDom} from 'rehype-dom-parse'; | ||
import {default as rehypeNode} from 'rehype-parse'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we’d be better off here to require just rehype-parse
, and substitute rehype-dom-parse
during Webpack compilation. (Then rehype-dom-parse
can also just be a dev-time dependency.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with webpack—what's the best way to do this? Can you submit a change for this PR for it?
@@ -35,10 +37,14 @@ function doubleBlankLinesBeforeHeadings (previous, next, _parent, _state) { | |||
return undefined; | |||
} | |||
|
|||
const isNode = typeof process !== 'undefined' && process.versions && process.versions.node; | |||
const rehypeParse = isNode ? rehypeNode : rehypeDom; | |||
const rehypeParseOptions = isNode ? {fragment: true, verbose: true, emitParseErrors: true} : {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an important reason to have different parser options on Node vs. browsers? I know the default for fragment
is different in the two parsers, but I think it’s probably clearer and less error-prone to just specify it explicitly in both cases.
Why set verbose
? This adds line/character information to the nodes AST nodes the parser outputs, which we don’t use.
Why set emitParseErrors
here if we are not showing the errors we get? Is this a debugging flag that should have been removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember exactly, but I believe the existing parser just didn't work at all/well when I was testing.
Thanks for all of the great comments! I won't have much more time to devote here until this script breaks for me (which could happen soon with weird gdocs formatting). I'll reply then! Feel free to commit to/improve this PR if you have the time in the meantime. |
Co-authored-by: Rob Brackett <[email protected]>
Fixed most/all of the comments outside the mocha/test comment. Will handle that another time! |