
Commit

Make document transformers runnable, adds HTML to text document transformer (langchain-ai#2149)

* Remove precommit compilation hook

* Adds HTML to text document transformer

* Docs, entrypoint

* Typo fix
jacoblee93 authored Aug 3, 2023
1 parent 1c7f87e commit 7691e10
Showing 14 changed files with 251 additions and 22 deletions.
2 changes: 1 addition & 1 deletion .husky/pre-commit
@@ -1,4 +1,4 @@
#!/usr/bin/env sh
. "$(dirname -- "$0")/_/husky.sh"

yarn run precommit
# yarn run precommit
32 changes: 32 additions & 0 deletions (new documentation page for the html-to-text transformer)
@@ -0,0 +1,32 @@
# html-to-text

When ingesting HTML documents for later retrieval, we are often interested only in the actual content of the page rather than the HTML markup itself.
Stripping HTML tags from documents with the HtmlToTextTransformer can result in more content-rich chunks, making retrieval more effective.

## Setup

You'll need to install the [`html-to-text`](https://www.npmjs.com/package/html-to-text) npm package:

```bash npm2yarn
npm install html-to-text
```

Though not required by the transformer itself, the usage examples below require [`cheerio`](https://www.npmjs.com/package/cheerio) for scraping:

```bash npm2yarn
npm install cheerio
```

## Usage

The example below scrapes a Hacker News thread, splits it on HTML tags so that chunks are grouped by the semantic information those tags carry,
and then extracts the plain text content from the individual chunks:

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_transformers/html_to_text.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>

## Customization

You can pass the transformer any [arguments accepted by the `html-to-text` package](https://www.npmjs.com/package/html-to-text) to customize how it works.
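
For instance, a minimal sketch (the option values here are illustrative; the `wordwrap` and `selectors` keys follow the `html-to-text` package's documented options format) that disables word wrapping and skips images and link URLs:

```typescript
import { HtmlToTextTransformer } from "langchain/document_transformers/html_to_text";

// Sketch: html-to-text options are forwarded through the transformer's constructor.
// The option names below (wordwrap, selectors) follow the html-to-text docs;
// adjust them to match your installed version of the package.
const transformer = new HtmlToTextTransformer({
  wordwrap: false,
  selectors: [
    { selector: "img", format: "skip" },
    { selector: "a", options: { ignoreHref: true } },
  ],
});
```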
90 changes: 90 additions & 0 deletions examples/src/document_transformers/html_to_text.ts
@@ -0,0 +1,90 @@
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { HtmlToTextTransformer } from "langchain/document_transformers/html_to_text";

const loader = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881"
);

const docs = await loader.load();

const splitter = RecursiveCharacterTextSplitter.fromLanguage("html");
const transformer = new HtmlToTextTransformer();

const sequence = splitter.pipe(transformer);

const newDocuments = await sequence.invoke(docs);

console.log(newDocuments);

/*
[
Document {
pageContent: 'Hacker News new | past | comments | ask | show | jobs | submit login What Lights\n' +
'the Universe’s Standard Candles? (quantamagazine.org) 75 points by Amorymeltzer\n' +
'5 months ago | hide | past | favorite | 6 comments delta_p_delta_x 5 months ago\n' +
'| next [–] Astrophysical and cosmological simulations are often insightful.\n' +
"They're also very cross-disciplinary; besides the obvious astrophysics, there's\n" +
'networking and sysadmin, parallel computing and algorithm theory (so that the\n' +
'simulation programs are actually fast but still accurate), systems design, and\n' +
'even a bit of graphic design for the visualisations.Some of my favourite\n' +
'simulation projects:- IllustrisTNG:',
metadata: {
source: 'https://news.ycombinator.com/item?id=34817881',
loc: [Object]
}
},
Document {
pageContent: 'that the simulation programs are actually fast but still accurate), systems\n' +
'design, and even a bit of graphic design for the visualisations.Some of my\n' +
'favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/-\n' +
'SWIFT: https://swift.dur.ac.uk/- CO5BOLD:\n' +
'https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations\n' +
'of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)-\n' +
'AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the\n' +
'simulations in the article, too. froeb 5 months ago | parent | next [–]\n' +
'Supernova simulations are especially interesting too. I have heard them\n' +
'described as the only time in physics when all 4 of the fundamental forces are\n' +
'important. The explosion can be quite finicky too. If I remember right, you\n' +
"can't get supernova to explode",
metadata: {
source: 'https://news.ycombinator.com/item?id=34817881',
loc: [Object]
}
},
Document {
pageContent: 'heard them described as the only time in physics when all 4 of the fundamental\n' +
'forces are important. The explosion can be quite finicky too. If I remember\n' +
"right, you can't get supernova to explode properly in 1D simulations, only in\n" +
'higher dimensions. This was a mystery until the realization that turbulence is\n' +
'necessary for supernova to trigger--there is no turbulent flow in 1D. andrewflnr\n' +
"5 months ago | prev | next [–] Whoa. I didn't know the accretion theory of Ia\n" +
'supernovae was dead, much less that it had been since 2011. andreareina 5 months\n' +
'ago | prev | next [–] This seems to be the paper',
metadata: {
source: 'https://news.ycombinator.com/item?id=34817881',
loc: [Object]
}
},
Document {
pageContent: 'andreareina 5 months ago | prev | next [–] This seems to be the paper\n' +
'https://academic.oup.com/mnras/article/517/4/5260/6779709 andreareina 5 months\n' +
"ago | prev [–] Wouldn't double detonation show up as variance in the brightness?\n" +
'yencabulator 5 months ago | parent [–] Or widening of the peak. If one type Ia\n' +
'supernova goes 1,2,3,2,1, the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3\n' +
'0+1=1 Guidelines | FAQ | Lists |',
metadata: {
source: 'https://news.ycombinator.com/item?id=34817881',
loc: [Object]
}
},
Document {
pageContent: 'the sum of two could go 1+0=1 2+1=3 3+2=5 2+3=5 1+2=3 0+1=1 Guidelines | FAQ |\n' +
'Lists | API | Security | Legal | Apply to YC | Contact Search:',
metadata: {
source: 'https://news.ycombinator.com/item?id=34817881',
loc: [Object]
}
}
]
*/
3 changes: 3 additions & 0 deletions langchain/.gitignore
@@ -310,6 +310,9 @@ document_loaders/fs/notion.d.ts
document_loaders/fs/unstructured.cjs
document_loaders/fs/unstructured.js
document_loaders/fs/unstructured.d.ts
document_transformers/html_to_text.cjs
document_transformers/html_to_text.js
document_transformers/html_to_text.d.ts
document_transformers/openai_functions.cjs
document_transformers/openai_functions.js
document_transformers/openai_functions.d.ts
8 changes: 8 additions & 0 deletions langchain/package.json
@@ -322,6 +322,9 @@
"document_loaders/fs/unstructured.cjs",
"document_loaders/fs/unstructured.js",
"document_loaders/fs/unstructured.d.ts",
"document_transformers/html_to_text.cjs",
"document_transformers/html_to_text.js",
"document_transformers/html_to_text.d.ts",
"document_transformers/openai_functions.cjs",
"document_transformers/openai_functions.js",
"document_transformers/openai_functions.d.ts",
@@ -1446,6 +1449,11 @@
"import": "./document_loaders/fs/unstructured.js",
"require": "./document_loaders/fs/unstructured.cjs"
},
"./document_transformers/html_to_text": {
"types": "./document_transformers/html_to_text.d.ts",
"import": "./document_transformers/html_to_text.js",
"require": "./document_transformers/html_to_text.cjs"
},
"./document_transformers/openai_functions": {
"types": "./document_transformers/openai_functions.d.ts",
"import": "./document_transformers/openai_functions.js",
3 changes: 3 additions & 0 deletions langchain/scripts/create-entrypoints.js
@@ -127,6 +127,8 @@ const entrypoints = {
"document_loaders/fs/notion": "document_loaders/fs/notion",
"document_loaders/fs/unstructured": "document_loaders/fs/unstructured",
// document_transformers
"document_transformers/html_to_text":
"document_transformers/html_to_text",
"document_transformers/openai_functions":
"document_transformers/openai_functions",
// chat_models
@@ -287,6 +289,7 @@ const requiresOptionalDependency = [
"document_loaders/fs/csv",
"document_loaders/fs/notion",
"document_loaders/fs/unstructured",
"document_transformers/html_to_text",
"chat_models/googlevertexai",
"chat_models/googlepalm",
"sql_db",
18 changes: 18 additions & 0 deletions langchain/src/document_transformers/html_to_text.ts
@@ -0,0 +1,18 @@
import { htmlToText } from "html-to-text";
import type { HtmlToTextOptions } from "html-to-text";
import { Document } from "../document.js";
import { MappingDocumentTransformer } from "../schema/document.js";

export class HtmlToTextTransformer extends MappingDocumentTransformer {
constructor(protected options: HtmlToTextOptions = {}) {
super(options);
}

async _transformDocument(document: Document): Promise<Document> {
const extractedContent = htmlToText(document.pageContent, this.options);
return new Document({
pageContent: extractedContent,
metadata: { ...document.metadata },
});
}
}
29 changes: 12 additions & 17 deletions langchain/src/document_transformers/openai_functions.ts
@@ -4,14 +4,14 @@ import type { JsonSchema7ObjectType } from "zod-to-json-schema/src/parsers/objec

import { Document } from "../document.js";
import { BaseChain } from "../chains/base.js";
import { BaseDocumentTransformer } from "../schema/document.js";
import { MappingDocumentTransformer } from "../schema/document.js";
import {
TaggingChainOptions,
createTaggingChain,
} from "../chains/openai_functions/index.js";
import { ChatOpenAI } from "../chat_models/openai.js";

export class MetadataTagger extends BaseDocumentTransformer {
export class MetadataTagger extends MappingDocumentTransformer {
protected taggingChain: BaseChain;

constructor(fields: { taggingChain: BaseChain }) {
@@ -29,21 +29,16 @@ export class MetadataTagger extends BaseDocumentTransformer {
}
}

async transformDocuments(documents: Document[]): Promise<Document[]> {
const newDocuments = [];
for (const document of documents) {
const taggingChainResponse = await this.taggingChain.call({
[this.taggingChain.inputKeys[0]]: document.pageContent,
});
const extractedMetadata =
taggingChainResponse[this.taggingChain.outputKeys[0]];
const newDocument = new Document({
pageContent: document.pageContent,
metadata: { ...extractedMetadata, ...document.metadata },
});
newDocuments.push(newDocument);
}
return newDocuments;
async _transformDocument(document: Document): Promise<Document> {
const taggingChainResponse = await this.taggingChain.call({
[this.taggingChain.inputKeys[0]]: document.pageContent,
});
const extractedMetadata =
taggingChainResponse[this.taggingChain.outputKeys[0]];
return new Document({
pageContent: document.pageContent,
metadata: { ...extractedMetadata, ...document.metadata },
});
}
}

47 changes: 47 additions & 0 deletions langchain/src/document_transformers/tests/html_to_text.int.test.ts
@@ -0,0 +1,47 @@
import { expect, test } from "@jest/globals";

import { HtmlToTextTransformer } from "../html_to_text.js";
import { Document } from "../../document.js";

test("Test HTML to text transformer", async () => {
const webpageText = `<!DOCTYPE html>
<html>
<head>
<title>🦜️🔗 LangChain</title>
<style>
body {
font-family: Arial, sans-serif;
}
h1 {
color: darkblue;
}
</style>
</head>
<body>
<div>
<h1>🦜️🔗 LangChain</h1>
<p>⚡ Building applications with LLMs through composability ⚡</p>
</div>
<div>
As an open source project in a rapidly developing field, we are extremely open to contributions.
</div>
</body>
</html>`;
const documents = [
new Document({
pageContent: webpageText,
}),
new Document({
pageContent: "<div>Mitochondria is the powerhouse of the cell.</div>",
metadata: { reliable: false },
}),
];
const transformer = new HtmlToTextTransformer();
const newDocuments = await transformer.transformDocuments(documents);
expect(newDocuments.length).toBe(2);
expect(newDocuments[0].pageContent.length).toBeLessThan(webpageText.length);
expect(newDocuments[1].pageContent).toBe(
"Mitochondria is the powerhouse of the cell."
);
expect(newDocuments[1].metadata).toEqual({ reliable: false });
});
1 change: 1 addition & 0 deletions langchain/src/load/import_constants.ts
@@ -75,6 +75,7 @@ export const optionalImportEntrypoints = [
"langchain/document_loaders/fs/csv",
"langchain/document_loaders/fs/notion",
"langchain/document_loaders/fs/unstructured",
"langchain/document_transformers/html_to_text",
"langchain/chat_models/googlevertexai",
"langchain/chat_models/googlepalm",
"langchain/sql_db",
3 changes: 3 additions & 0 deletions langchain/src/load/import_type.d.ts
@@ -223,6 +223,9 @@ export interface OptionalImportMap {
"langchain/document_loaders/fs/unstructured"?:
| typeof import("../document_loaders/fs/unstructured.js")
| Promise<typeof import("../document_loaders/fs/unstructured.js")>;
"langchain/document_transformers/html_to_text"?:
| typeof import("../document_transformers/html_to_text.js")
| Promise<typeof import("../document_transformers/html_to_text.js")>;
"langchain/chat_models/googlevertexai"?:
| typeof import("../chat_models/googlevertexai.js")
| Promise<typeof import("../chat_models/googlevertexai.js")>;
34 changes: 31 additions & 3 deletions langchain/src/schema/document.ts
@@ -1,5 +1,6 @@
import { BaseCallbackConfig } from "../callbacks/manager.js";
import { Document } from "../document.js";
import { Serializable } from "../load/serializable.js";
import { Runnable } from "./runnable.js";

/**
* Abstract base class for document transformation systems.
@@ -11,13 +12,40 @@ import { Serializable } from "../load/serializable.js";
* One example of this is a text splitter that splits a large document into
* many smaller documents.
*/
export abstract class BaseDocumentTransformer extends Serializable {
export abstract class BaseDocumentTransformer<
RunInput extends Document[] = Document[],
RunOutput extends Document[] = Document[]
> extends Runnable<RunInput, RunOutput> {
lc_namespace = ["langchain", "document_transformers"];

/**
* Transform a list of documents.
* @param documents A sequence of documents to be transformed.
* @returns A list of transformed documents.
*/
abstract transformDocuments(documents: Document[]): Promise<Document[]>;
abstract transformDocuments(documents: RunInput): Promise<RunOutput>;

invoke(
input: RunInput,
_options?: Partial<BaseCallbackConfig>
): Promise<RunOutput> {
return this.transformDocuments(input);
}
}

/**
* Class for document transformers that return exactly one transformed document
* for each input document.
*/
export abstract class MappingDocumentTransformer extends BaseDocumentTransformer {
async transformDocuments(documents: Document[]): Promise<Document[]> {
const newDocuments = [];
for (const document of documents) {
const transformedDocument = await this._transformDocument(document);
newDocuments.push(transformedDocument);
}
return newDocuments;
}

abstract _transformDocument(document: Document): Promise<Document>;
}
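
As a sketch of how the new base class is meant to be subclassed (the `UppercaseTransformer` below is hypothetical and not part of this commit, written as if it lived alongside the other transformers in `langchain/src/document_transformers/`), a one-to-one transformer only implements `_transformDocument`; `transformDocuments`, `invoke`, and `pipe` are inherited:

```typescript
import { Document } from "../document.js";
import { MappingDocumentTransformer } from "../schema/document.js";

// Hypothetical transformer: uppercases each document's content while
// preserving its metadata. Only _transformDocument is implemented;
// transformDocuments and the Runnable methods (invoke, pipe) are
// inherited from MappingDocumentTransformer and BaseDocumentTransformer.
export class UppercaseTransformer extends MappingDocumentTransformer {
  async _transformDocument(document: Document): Promise<Document> {
    return new Document({
      pageContent: document.pageContent.toUpperCase(),
      metadata: { ...document.metadata },
    });
  }
}
```

Such a transformer then composes like the html-to-text example above, e.g. `splitter.pipe(new UppercaseTransformer())`.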
2 changes: 1 addition & 1 deletion langchain/src/text_splitter.ts
@@ -319,7 +319,7 @@ export class RecursiveCharacterTextSplitter

static fromLanguage(
language: SupportedTextSplitterLanguage,
options: Partial<RecursiveCharacterTextSplitterParams>
options?: Partial<RecursiveCharacterTextSplitterParams>
) {
return new RecursiveCharacterTextSplitter({
...options,
1 change: 1 addition & 0 deletions langchain/tsconfig.json
@@ -132,6 +132,7 @@
"src/document_loaders/fs/csv.ts",
"src/document_loaders/fs/notion.ts",
"src/document_loaders/fs/unstructured.ts",
"src/document_transformers/html_to_text.ts",
"src/document_transformers/openai_functions.ts",
"src/chat_models/base.ts",
"src/chat_models/openai.ts",
