forked from langchain-ai/langchainjs
Make document transformers runnable, adds HTML to text document transformer (langchain-ai#2149)
* Remove precommit compilation hook
* Adds HTML to text document transformer
* Docs, entrypoint
* Typo fix
1 parent 1c7f87e, commit 7691e10
Showing 14 changed files with 251 additions and 22 deletions.
@@ -1,4 +1,4 @@
 #!/usr/bin/env sh
 . "$(dirname -- "$0")/_/husky.sh"

-yarn run precommit
+# yarn run precommit
32 changes: 32 additions & 0 deletions
...ras/modules/data_connection/document_transformers/integrations/html-to-text.mdx
@@ -0,0 +1,32 @@
# html-to-text

When ingesting HTML documents for later retrieval, we are often interested only in the text content of the webpage rather than its markup.
Stripping HTML tags from documents with the `HtmlToTextTransformer` can result in more content-rich chunks, making retrieval more effective.

## Setup

You'll need to install the [`html-to-text`](https://www.npmjs.com/package/html-to-text) npm package:

```bash npm2yarn
npm install html-to-text
```

Though not required for the transformer itself, the usage example below requires [`cheerio`](https://www.npmjs.com/package/cheerio) for scraping:

```bash npm2yarn
npm install cheerio
```

## Usage

The example below scrapes a Hacker News thread, splits it on HTML tags so that chunks follow the semantic structure the tags provide,
then extracts the text content from the individual chunks:

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_transformers/html_to_text.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>

## Customization

You can pass the transformer any [arguments accepted by the `html-to-text` package](https://www.npmjs.com/package/html-to-text) to customize how it works.
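
As a rough illustration of the kind of arguments the Customization note refers to, the sketch below builds an options object using `wordwrap` and `selectors`, two options documented by the `html-to-text` package. The local `SketchOptions` and `SelectorRule` types are hypothetical stand-ins so the snippet runs without any dependency, and the chosen values are illustrative assumptions, not settings taken from this commit.

```typescript
// Hypothetical local types covering only the options used below. In real
// code, import `HtmlToTextOptions` from "html-to-text" instead.
type SelectorRule = {
  selector: string;
  format?: string;
  options?: Record<string, unknown>;
};
type SketchOptions = {
  wordwrap?: number | false | null;
  selectors?: SelectorRule[];
};

const options: SketchOptions = {
  // Disable hard line wrapping so chunk text is not broken at a fixed column.
  wordwrap: false,
  selectors: [
    // Keep link text but drop the URL that would otherwise follow it.
    { selector: "a", options: { ignoreHref: true } },
    // Omit images entirely.
    { selector: "img", format: "skip" },
  ],
};

// With the packages installed, the object would be handed to the constructor:
// const transformer = new HtmlToTextTransformer(options);
```

Turning off wrapping is one plausible choice when the text is going to be embedded rather than displayed, since the inserted newlines carry no meaning for retrieval.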
@@ -0,0 +1,90 @@
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { HtmlToTextTransformer } from "langchain/document_transformers/html_to_text";

const loader = new CheerioWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881"
);

const docs = await loader.load();

const splitter = RecursiveCharacterTextSplitter.fromLanguage("html");
const transformer = new HtmlToTextTransformer();

const sequence = splitter.pipe(transformer);

const newDocuments = await sequence.invoke(docs);

console.log(newDocuments);

/*
  [
    Document {
      pageContent: 'Hacker News new | past | comments | ask | show | jobs | submit login What Lights\n' +
        'the Universe’s Standard Candles? (quantamagazine.org) 75 points by Amorymeltzer\n' +
        '5 months ago | hide | past | favorite | 6 comments delta_p_delta_x 5 months ago\n' +
        '| next [–] Astrophysical and cosmological simulations are often insightful.\n' +
        "They're also very cross-disciplinary; besides the obvious astrophysics, there's\n" +
        'networking and sysadmin, parallel computing and algorithm theory (so that the\n' +
        'simulation programs are actually fast but still accurate), systems design, and\n' +
        'even a bit of graphic design for the visualisations.Some of my favourite\n' +
        'simulation projects:- IllustrisTNG:',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'that the simulation programs are actually fast but still accurate), systems\n' +
        'design, and even a bit of graphic design for the visualisations.Some of my\n' +
        'favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/-\n' +
        'SWIFT: https://swift.dur.ac.uk/- CO5BOLD:\n' +
        'https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations\n' +
        'of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)-\n' +
        'AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the\n' +
        'simulations in the article, too. froeb 5 months ago | parent | next [–]\n' +
        'Supernova simulations are especially interesting too. I have heard them\n' +
        'described as the only time in physics when all 4 of the fundamental forces are\n' +
        'important. The explosion can be quite finicky too. If I remember right, you\n' +
        "can't get supernova to explode",
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'heard them described as the only time in physics when all 4 of the fundamental\n' +
        'forces are important. The explosion can be quite finicky too. If I remember\n' +
        "right, you can't get supernova to explode properly in 1D simulations, only in\n" +
        'higher dimensions. This was a mystery until the realization that turbulence is\n' +
        'necessary for supernova to trigger--there is no turbulent flow in 1D. andrewflnr\n' +
        "5 months ago | prev | next [–] Whoa. I didn't know the accretion theory of Ia\n" +
        'supernovae was dead, much less that it had been since 2011. andreareina 5 months\n' +
        'ago | prev | next [–] This seems to be the paper',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'andreareina 5 months ago | prev | next [–] This seems to be the paper\n' +
        'https://academic.oup.com/mnras/article/517/4/5260/6779709 andreareina 5 months\n' +
        "ago | prev [–] Wouldn't double detonation show up as variance in the brightness?\n" +
        'yencabulator 5 months ago | parent [–] Or widening of the peak. If one type Ia\n' +
        'supernova goes 1,2,3,2,1, the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3\n' +
        '0+1=1 Guidelines | FAQ | Lists |',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    },
    Document {
      pageContent: 'the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3 0+1=1 Guidelines | FAQ |\n' +
        'Lists | API | Security | Legal | Apply to YC | Contact Search:',
      metadata: {
        source: 'https://news.ycombinator.com/item?id=34817881',
        loc: [Object]
      }
    }
  ]
*/
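
The example above composes two stages, a splitter and a transformer, into one sequence. The dependency-free sketch below shows the same split-then-strip idea in miniature. Every name in it (`Doc`, `Step`, `pipe`, `splitOnParagraphs`, `stripTags`) is hypothetical, and it is synchronous for brevity; the real `RecursiveCharacterTextSplitter` and `HtmlToTextTransformer` are asynchronous and far more thorough than these regex stand-ins.

```typescript
// Hypothetical, simplified document shape (not the LangChain `Document` class).
type Doc = { pageContent: string; metadata: Record<string, unknown> };
type Step = (docs: Doc[]) => Doc[];

// Compose two steps left to right, mirroring `splitter.pipe(transformer)`.
function pipe(first: Step, second: Step): Step {
  return (docs) => second(first(docs));
}

// Stand-in splitter: emits one chunk per <p>...</p> element.
const splitOnParagraphs: Step = (docs) => {
  const chunks: Doc[] = [];
  for (const doc of docs) {
    const matches: string[] = doc.pageContent.match(/<p>[\s\S]*?<\/p>/g) || [];
    for (const chunk of matches) {
      chunks.push({ pageContent: chunk, metadata: { ...doc.metadata } });
    }
  }
  return chunks;
};

// Stand-in transformer: naive regex tag removal, not a real HTML parser.
const stripTags: Step = (docs) =>
  docs.map((doc) => ({
    pageContent: doc.pageContent.replace(/<[^>]+>/g, "").trim(),
    metadata: { ...doc.metadata },
  }));

const sequence = pipe(splitOnParagraphs, stripTags);
const result = sequence([
  {
    pageContent: "<div><p>Hello <b>world</b></p><p>Second</p></div>",
    metadata: { source: "example" },
  },
]);

// Logs the two stripped chunks: "Hello world" and "Second".
console.log(result.map((doc) => doc.pageContent));
```

Splitting before stripping matters in this sketch for the same reason as in the example: once the tags are gone, the splitter has no structural boundaries left to cut on.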
@@ -0,0 +1,18 @@
import { htmlToText } from "html-to-text";
import type { HtmlToTextOptions } from "html-to-text";
import { Document } from "../document.js";
import { MappingDocumentTransformer } from "../schema/document.js";

export class HtmlToTextTransformer extends MappingDocumentTransformer {
  constructor(protected options: HtmlToTextOptions = {}) {
    super(options);
  }

  async _transformDocument(document: Document): Promise<Document> {
    const extractedContent = htmlToText(document.pageContent, this.options);
    return new Document({
      pageContent: extractedContent,
      metadata: { ...document.metadata },
    });
  }
}
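
One detail in `_transformDocument` worth spelling out is `metadata: { ...document.metadata }`, which makes a shallow copy of the metadata. The sketch below uses a made-up metadata object to show what that copy does and does not isolate. The nested `loc` shape is an assumption, modeled loosely on the `loc: [Object]` entries in the example output rather than on anything this commit defines.

```typescript
type Metadata = Record<string, unknown>;
type Loc = { lines: { from: number; to: number } };

// Hypothetical input metadata with one flat key and one nested object.
const original: Metadata = {
  reliable: false,
  loc: { lines: { from: 1, to: 4 } },
};

// Same spread pattern as in the transformer above.
const copy: Metadata = { ...original };

// Reassigning a top-level key on the copy leaves the original untouched.
copy.reliable = true;

// Nested objects are shared by reference, so this change is visible in both.
(copy.loc as Loc).lines.from = 99;

console.log(original.reliable); // false
console.log((original.loc as Loc).lines.from); // 99
```

In other words, the transformed document gets its own top-level metadata object, but code that mutates nested metadata values downstream would still affect the input document.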
47 changes: 47 additions & 0 deletions
langchain/src/document_transformers/tests/html_to_text.int.test.ts
@@ -0,0 +1,47 @@
import { expect, test } from "@jest/globals";

import { HtmlToTextTransformer } from "../html_to_text.js";
import { Document } from "../../document.js";

test("Test HTML to text transformer", async () => {
  const webpageText = `<!DOCTYPE html>
<html>
  <head>
    <title>🦜️🔗 LangChain</title>
    <style>
      body {
        font-family: Arial, sans-serif;
      }
      h1 {
        color: darkblue;
      }
    </style>
  </head>
  <body>
    <div>
      <h1>🦜️🔗 LangChain</h1>
      <p>⚡ Building applications with LLMs through composability ⚡</p>
    </div>
    <div>
      As an open source project in a rapidly developing field, we are extremely open to contributions.
    </div>
  </body>
</html>`;
  const documents = [
    new Document({
      pageContent: webpageText,
    }),
    new Document({
      pageContent: "<div>Mitochondria is the powerhouse of the cell.</div>",
      metadata: { reliable: false },
    }),
  ];
  const transformer = new HtmlToTextTransformer();
  const newDocuments = await transformer.transformDocuments(documents);
  expect(newDocuments.length).toBe(2);
  expect(newDocuments[0].pageContent.length).toBeLessThan(webpageText.length);
  expect(newDocuments[1].pageContent).toBe(
    "Mitochondria is the powerhouse of the cell."
  );
  expect(newDocuments[1].metadata).toEqual({ reliable: false });
});