command line script for node #84

Open: wants to merge 12 commits into base: main
34 changes: 30 additions & 4 deletions README.md
@@ -27,18 +27,44 @@ First make sure you have Node.js installed. Then:
```sh
> npm run build
```

…and the built output will be in the `dist` folder.

To start a server with live rebuilding, run:

```sh
> npm start
```

Then point your browser to `http://localhost:9000` to see the site. It will automatically rebuild whenever you change any files.


## CLI Usage

```shell
echo "<b>some html</b>" | npx google-docs-to-markdown
```

However, what you really want is to run this after copying text from Google Docs. To do this, you'll need to extract the HTML
from the clipboard. Here's a script for macOS that does this:

```shell
swift - <<EOF | npx google-docs-to-markdown | pbcopy
import Cocoa

let type = NSPasteboard.PasteboardType.html

guard let string = NSPasteboard.general.string(forType:type) else {
    fputs("Could not find string data of type '\(type)' on the system pasteboard\n", stderr)
    exit(1)
}

print(string)
EOF
```

You can then bind this script to a keyboard shortcut using Raycast or a similar launcher.

## Contributors

This project is open source, and gets better with the hard work and collaboration of multiple people. Thanks to the following for their contributions:
23 changes: 23 additions & 0 deletions cmd.js
@@ -0,0 +1,23 @@
#!/usr/bin/env node

import {convertDocsHtmlToMarkdown} from './lib/convert.js';
import { buffer as readBuffer } from 'node:stream/consumers';

// in order to debug this tool over the command line, you can read a file with broken input locally
// ex: fs.readFileSync('./local-file.html', 'utf8');

const rawInput = await readBuffer(process.stdin);
const inputHTML = rawInput.toString('utf-8');

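if (!inputHTML) {
  console.error('no HTML provided over stdin');
  process.exit(1);
}

try {
  // Await the promise so that a rejected conversion reaches the catch block.
  const markdown = await convertDocsHtmlToMarkdown(inputHTML);
  process.stdout.write(markdown);
} catch (error) {
  console.error(`Error converting HTML to markdown: ${error.message}`);
  // Signal failure to the calling shell or pipeline.
  process.exitCode = 1;
}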
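On the error handling in `cmd.js`: because `convertDocsHtmlToMarkdown` returns a promise, a surrounding `try`/`catch` only sees a rejection if the promise is awaited inside the `try` block. A minimal sketch of that pattern, where `failingConvert` and `convertOrReport` are hypothetical stand-ins and not code from this PR:

```javascript
// Hypothetical stand-in for an async converter that always fails.
async function failingConvert(html) {
  throw new Error(`cannot convert: ${html}`);
}

// Awaiting inside the try block turns the rejection into a thrown error,
// so the catch block runs. Without `await`, the promise would reject
// after the try block had already finished.
async function convertOrReport(html) {
  try {
    return await failingConvert(html);
  } catch (error) {
    return `Error converting HTML to markdown: ${error.message}`;
  }
}
```

Calling `convertOrReport('<b>x</b>')` resolves to the error message string instead of producing an unhandled rejection.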
12 changes: 9 additions & 3 deletions lib/convert.js
@@ -1,12 +1,14 @@
import fixGoogleHtml from './fix-google-html.js';
// rehype-dom-parse is a lightweight version of rehype-parse that leverages
// browser APIs -- reduces bundle size by ~200 kB!
-import parse from 'rehype-dom-parse';
+import {default as rehypeDom} from 'rehype-dom-parse';
+import {default as rehypeNode} from 'rehype-parse';
Owner, commenting on lines +4 to +5:

I think we’d be better off here to require just rehype-parse, and substitute rehype-dom-parse during Webpack compilation. (Then rehype-dom-parse can also just be a dev-time dependency.)

Contributor Author (@iloveitaly, Oct 29, 2023):

I'm not familiar with webpack—what's the best way to do this? Can you submit a change for this PR for it?
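
One possible way to do the substitution suggested above, sketched as a fragment of a Webpack configuration. It assumes the project's Webpack config can take a `resolve.alias` entry. The variable name `browserParserAlias` is made up for illustration, and the project's actual config may be organized differently:

```javascript
// Hypothetical fragment to merge into the existing Webpack config object.
// The trailing `$` makes the alias match only the exact module name
// 'rehype-parse', not deeper import paths under that package.
const browserParserAlias = {
  resolve: {
    alias: {
      'rehype-parse$': 'rehype-dom-parse',
    },
  },
};
```

With an alias like this, `lib/convert.js` could import only `rehype-parse`, and the browser bundle would still get the DOM-based parser.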

import { all } from 'rehype-remark';
import rehype2remarkWithSpaces from './rehype-to-remark-with-spaces.js';
import remarkGfm from 'remark-gfm';
import stringify from 'remark-stringify';
import { unified } from 'unified';
import { default as logTree } from './log-tree.js';

/** @typedef {import("mdast-util-to-markdown").State} as MdastState */
/** @typedef {import("unist").Node} UnistNode */
@@ -35,10 +37,14 @@ function doubleBlankLinesBeforeHeadings (previous, next, _parent, _state) {
return undefined;
}

const isNode = typeof process !== 'undefined' && process.versions?.node;
const rehypeParse = isNode ? rehypeNode : rehypeDom;
const rehypeParseOptions = isNode ? {fragment: true, verbose: true, emitParseErrors: true} : {}
Owner:

Is there an important reason to have different parser options on Node vs. browsers? I know the default for fragment is different in the two parsers, but I think it’s probably clearer and less error-prone to just specify it explicitly in both cases.

Why set verbose? This adds line/character information to the AST nodes the parser outputs, which we don’t use.

Why set emitParseErrors here if we are not showing the errors we get? Is this a debugging flag that should have been removed?

Contributor Author:

I can't remember exactly, but I believe the existing parser just didn't work at all/well when I was testing.
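
A sketch of what specifying the option explicitly could look like, assuming `fragment` is the only option the conversion actually needs (both `rehype-parse` and `rehype-dom-parse` accept it, with different defaults). `rehypeParseOptions` is the name used in the diff. Dropping `verbose` and `emitParseErrors` is an assumption based on the review questions, not a confirmed decision:

```javascript
// One options object for both parsers, so the result does not depend on
// each parser's own default for `fragment`.
const rehypeParseOptions = { fragment: true };

// Intended use (not executed here):
//   unified().use(rehypeParse, rehypeParseOptions)
```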


const processor = unified()
-.use(parse)
+.use(rehypeParse, rehypeParseOptions)
 .use(fixGoogleHtml)
-// .use(require('./lib/log-tree').default)
+// .use(logTree)
.use(rehype2remarkWithSpaces, {
handlers: {
// Preserve sup/sub markup; most Markdowns have no markup for it.
21 changes: 20 additions & 1 deletion lib/fix-google-html.js
@@ -123,6 +123,24 @@ export function fixNestedLists (node) {
});
}

// Google Docs wraps the entire document in a <b> tag, which the parsing library does not remove.
// This seems to occur only when running under Node.js (not in the browser).
// This function is called exactly once, with the root node as its input.
function removeRootBoldWrapper(node) {
  // In some cases, such as converting `<b>something something</b>`, the <b> is
  // the actual content, so the root's only child must not be unwrapped.
  if (node.children.length === 1 && node.children[0].tagName === 'b') {
    return;
  }

  for (let i = 0; i < node.children.length; i++) {
    const child = node.children[i];

    // Replace the <b> element in place with its own children.
    if (child.tagName === 'b') {
      node.children.splice(i, 1, ...child.children);
    }
  }
}

/**
* Google Docs does italics/bolds/etc on `<span>`s with style attributes, but
* rehype-remark does not pick up on those. Instead, transform them into
@@ -259,7 +277,7 @@ export function unwrapLineBreaks (node) {
/**
* Moves linebreaks outside of anchor elements,
* if the linebreak is the first and/or last child of the anchor.
* @param {RehypeNode} node
* @param {RehypeNode} node
*/
export function moveLinebreaksOutsideOfAnchors (node) {
visit(node, isAnchor, (node, index, parent) => {
@@ -579,6 +597,7 @@ function fixChecklists (node) {
*/
export default function fixGoogleHtml () {
return (tree, _file) => {
removeRootBoldWrapper(tree);
unInlineStyles(tree);
createCodeBlocks(tree);
moveSpaceOutsideSensitiveChildren(tree);
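
An alternative to the child-count heuristic in `removeRootBoldWrapper`, sketched under an assumption that is not verified in this PR: that the wrapper Google Docs adds is a `<b>` whose `id` starts with `docs-internal-guid-`. The helper names `isDocsRootWrapper` and `unwrapDocsRoot` are hypothetical, and the tree below is a hand-built hast-style object, not real parser output:

```javascript
// Assumption: the Docs clipboard wrapper looks like <b id="docs-internal-guid-...">.
function isDocsRootWrapper(node) {
  const id = node.properties?.id;
  return node.type === 'element' &&
    node.tagName === 'b' &&
    typeof id === 'string' &&
    id.startsWith('docs-internal-guid-');
}

// Replace each wrapper among the root's direct children with its own children.
function unwrapDocsRoot(tree) {
  tree.children = tree.children.flatMap(child =>
    isDocsRootWrapper(child) ? child.children : [child]
  );
  return tree;
}

// A genuine <b> (no Docs id) is kept; the wrapper is unwrapped.
const tree = {
  type: 'root',
  children: [
    { type: 'element', tagName: 'b', properties: {}, children: [{ type: 'text', value: 'bold' }] },
    {
      type: 'element', tagName: 'b', properties: { id: 'docs-internal-guid-1234' },
      children: [{ type: 'element', tagName: 'p', properties: {}, children: [] }],
    },
  ],
};
unwrapDocsRoot(tree);
console.log(tree.children.map(child => child.tagName)); // [ 'b', 'p' ]
```

Keying on the `id` would avoid special-casing input such as `<b>some html</b>`, at the cost of depending on markup that Google Docs could change.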