Switch processing order in Fonduer tutorial #78

Open · wants to merge 2 commits into base: master
134 changes: 72 additions & 62 deletions hardware/max_storage_temp_tutorial.ipynb
@@ -20,8 +20,8 @@
"The tutorial is broken into several parts, each covering a phase of the `Fonduer` pipeline (as outlined in the [paper](https://arxiv.org/abs/1703.05028)), and the iterative KBC process:\n",
"\n",
"1. KBC Initialization\n",
"2. Candidate Generation and Multimodal Featurization\n",
"3. Probabilistic Relation Classification\n",
"2. Candidate Generation\n",
"3. Training a Multimodal LSTM for KBC\n",
"4. Error Analysis and Iterative KBC\n",
"\n",
"In addition, we show how users can iteratively improve labeling functions to improve relation extraction quality.\n",
@@ -174,7 +174,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"docs = session.query(Document).order_by(Document.name).all()\n",
@@ -201,12 +203,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Phase 2: Mention Extraction, Candidate Extraction Multimodal Featurization\n",
"# Phase 2: Candidate Generation\n",
"\n",
"Given the unified data model from Phase 1, `Fonduer` extracts relation\n",
"candidates based on user-provided **matchers** and **throttlers**. Then,\n",
"`Fonduer` leverages the multimodality information captured in the unified data\n",
"model to provide multimodal features for each candidate.\n",
"candidates based on user-provided **matchers** and **throttlers**.\n",
"\n",
"## 2.1 Mention Extraction\n",
"\n",
@@ -562,59 +562,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Multimodal Featurization\n",
"Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
"\n",
"### Featurize with `Fonduer`'s optimized Postgres Featurizer\n",
"We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
"At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix.\n",
"\n",
"View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fonduer.features import Featurizer\n",
"\n",
"featurizer = Featurizer(session, [PartTemp])\n",
"%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
"%time F_train = featurizer.get_feature_matrices(train_cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(F_train[0].shape)\n",
"%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
"%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
"print(F_dev[0].shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
"%time F_test = featurizer.get_feature_matrices(test_cands)\n",
"print(F_test[0].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. Note that Phase 1 and 2 are relatively static and typically are only executed once during the KBC process.\n",
"# Phase 3: Training a Multimodal LSTM for KBC\n",
"In this phase, `Fonduer` first trains the generative model based on user-defined **labeling function**. Next, `Fonduer` trains the discriminative model with the trained generative model and multimodal features.\n",
"\n",
"# Phase 3: Probabilistic Relation Classification\n",
"In this phase, `Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
"## 3.1 Training the Generative Model\n",
"`Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
"\n",
"In the wild, hand-labeled training data is rare and expensive. A common scenario is to have access to tons of unlabeled training data, and have some idea of how to label them programmatically. For example:\n",
"* We may be able to think of text patterns that would indicate a part and polarity mention are related, for example the word \"temperature\" appearing between them.\n",
@@ -623,6 +577,7 @@
"\n",
"Using data programming, we can then train machine learning models to learn which features are the most important in classifying candidates.\n",
"\n",
"\n",
"### Loading Gold Data\n",
"For convenience in error analysis and evaluation, we have already annotated the dev and test set for this tutorial, and we'll now load it using an externally-defined helper function: `gold`. Technically, this is also a labeling function similar to labeling functions we will be writing later. The only difference is that (gold) labels are written to `GoldLabel` table while (non-gold) labels are written to `Label` table. If you're interested in the example implementation details, please see the script we now load:"
]
@@ -905,13 +860,68 @@
"\n",
"In fact, it is probably somewhat overfit to this set. However this is fine, since in the next, we'll train a more powerful end extraction model which will generalize beyond the development set, and which we will evaluate on a blind test set (i.e. one we never looked at during development).\n",
"\n",
"## 3.2 Training the Discriminative Model\n",
"\n",
"### Training the Discriminative Model\n",
"Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. `Fonduer` also leverage the multimodality information captured in the unified data model to provide multimodal features for each candidate. For this tutorial, we will be training a simple--but fairly effective--logistic regression model.\n",
"\n",
"Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. For this tutorial, we will be training a simple--but fairly effective--logistic regression model.\n",
"We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multimodal Featurization\n",
"Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
"\n",
"We now annotate the candidates in our training, dev, and test sets with features. The optimized Postgres `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
"\n",
"We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention.\n",
"View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fonduer.features import Featurizer\n",
"\n",
"featurizer = Featurizer(session, [PartTemp])\n",
"%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
"%time F_train = featurizer.get_feature_matrices(train_cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(F_train[0].shape)\n",
"%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
"%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
"print(F_dev[0].shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
"%time F_test = featurizer.get_feature_matrices(test_cands)\n",
"print(F_test[0].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this multimodal featurization and phase 1, 2 are relatively static and typically are only executed once during the KBC process.\n",
"\n",
"### Model Training with Emmental\n",
"In `Fonduer`, we use a new machine learning framework [Emmental](https://github.com/SenWu/emmental) to support all model training."
]
},
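The collapsed cells below carry out this training. As a condensed sketch of the flow (the task name, batch size, and embedding dimension are illustrative, `train_cands`, `F_train`, and `train_marginals` come from the surrounding cells, and exact `create_task`/`FonduerDataset` signatures may differ across `Fonduer`/Emmental versions):

```python
import emmental
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer import Meta
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

ATTRIBUTE = "max_storage_temp"  # illustrative task name

emmental.init(Meta.log_path)  # point Emmental's logging at Fonduer's log path

# Build a word-embedding module over the training candidates' vocabulary.
emb_layer = EmbeddingModule(word_counter=collect_word_counter(train_cands), word_dim=300)

# Wrap candidates, features, and the noisy marginal labels as an Emmental dataset.
train_dataset = FonduerDataset(
    ATTRIBUTE, train_cands[0], F_train[0], emb_layer.word2id, train_marginals
)
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=train_dataset,
    split="train",
    batch_size=100,
)

# One task: a logistic-regression head over the multimodal feature matrix.
tasks = create_task(
    ATTRIBUTE, 2, F_train[0].shape[1], 2, emb_layer, model="LogisticRegression"
)
model = EmmentalModel(name=ATTRIBUTE)
for task in tasks:
    model.add_task(task)

# Train the discriminative model on the probabilistic labels.
EmmentalLearner().learn(model, [train_dataloader])
```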
Expand Down Expand Up @@ -1458,7 +1468,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.9"
}
},
"nbformat": 4,
8 changes: 3 additions & 5 deletions hardware_image/transistor_image_tutorial.ipynb
@@ -187,12 +187,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Phase 2: Mention Extraction, Candidate Extraction Multimodal Featurization\n",
"# Phase 2: Candidate Generation\n",
"\n",
"Given the unified data model from Phase 1, `Fonduer` extracts relation\n",
"candidates based on user-provided **matchers** and **throttlers**. Then,\n",
"`Fonduer` leverages the multimodality information captured in the unified data\n",
"model to provide multimodal features for each candidate.\n",
"candidates based on user-provided **matchers** and **throttlers**.\n",
"\n",
"## 2.1 Mention Extraction\n",
"\n",
@@ -416,7 +414,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.9"
}
},
"nbformat": 4,
129 changes: 68 additions & 61 deletions wiki/president_place_of_birth_tutorial.ipynb
@@ -37,9 +37,8 @@
"The tutorial is broken into several parts, each covering a phase of the `Fonduer` pipeline (as outlined in the [paper](https://arxiv.org/abs/1703.05028)), and the iterative KBC process:\n",
"\n",
"1. KBC Initialization\n",
"2. Candidate Generation and Multimodal Featurization\n",
"3. Probabilistic Relation Classification\n",
"4. Error Analysis and Iterative KBC\n",
"2. Candidate Generation\n",
"3. Training a Multimodal LSTM for KBC\n",
"\n",
"In addition, we show how users can iteratively improve labeling functions to improve relation extraction quality.\n",
"\n",
@@ -198,12 +197,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Phase 2: Mention Extraction, Candidate Extraction Multimodal Featurization\n",
"# Phase 2: Candidate Generation\n",
"\n",
"Given the unified data model from Phase 1, `Fonduer` extracts relation\n",
"candidates based on user-provided **matchers** and **throttlers**. Then,\n",
"`Fonduer` leverages the multimodality information captured in the unified data\n",
"model to provide multimodal features for each candidate.\n",
"candidates based on user-provided **matchers** and **throttlers**.\n",
"\n",
"## 2.1 Mention Extraction\n",
"\n",
@@ -542,59 +539,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Multimodal Featurization\n",
"Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
"At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix.\n",
"\n",
"### Featurize with `Fonduer`'s optimized Postgres Featurizer\n",
"We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
"# Phase 3: Training a Multimodal LSTM for KBC\n",
"In this phase, `Fonduer` first trains the generative model based on user-defined **labeling function**. Next, `Fonduer` trains the discriminative model with the trained generative model and multimodal features.\n",
"\n",
"View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fonduer.features import Featurizer\n",
"\n",
"featurizer = Featurizer(session, [PresidentnamePlaceofbirth])\n",
"%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
"%time F_train = featurizer.get_feature_matrices(train_cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(F_train[0].shape)\n",
"%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
"%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
"print(F_dev[0].shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
"%time F_test = featurizer.get_feature_matrices(test_cands)\n",
"print(F_test[0].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. Note that Phase 1 and 2 are relatively static and typically are only executed once during the KBC process.\n",
"\n",
"# Phase 3: Probabilistic Relation Classification\n",
"In this phase, `Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
"## 3.1 Training the Generative Model\n",
"`Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
"\n",
"In the wild, hand-labeled training data is rare and expensive. A common scenario is to have access to tons of unlabeled training data, and have some idea of how to label them programmatically. For example:\n",
"* We may have knowledge about typical place constructs, such as combinations of strings with words like 'New','County' or 'City'.\n",
@@ -958,12 +909,68 @@
"In fact, it is probably somewhat overfit to this set. However this is fine, since in the next, we'll train a more powerful end extraction model which will generalize beyond the development set, and which we will evaluate on a blind test set (i.e. one we never looked at during development).\n",
"\n",
"\n",
"### Training the Discriminative Model\n",
"## 3.2 Training the Discriminative Model\n",
"\n",
"Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. For this tutorial, we will be training a simple--but fairly effective--logistic regression model.\n",
"\n",
"We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention.\n",
"We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention."
]
},
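For reference, the training marginals referred to here are produced by the generative model from section 3.1. A minimal sketch using Snorkel's `LabelModel` (assuming `L_train` is the label matrix built by the `Labeler`; the epoch count is illustrative):

```python
from snorkel.labeling.model import LabelModel

# Fit the generative model to the label matrix from the labeling functions.
gen_model = LabelModel(cardinality=2)  # binary task: true vs. false mention
gen_model.fit(L_train[0], n_epochs=500, log_freq=100)

# Probabilistic training labels ("marginals") for the discriminative model.
train_marginals = gen_model.predict_proba(L_train[0])
```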
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multimodal Featurization\n",
"Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
"\n",
"We now annotate the candidates in our training, dev, and test sets with features. The optimized Postgres `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
"\n",
"View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fonduer.features import Featurizer\n",
"\n",
"featurizer = Featurizer(session, [PresidentnamePlaceofbirth])\n",
"%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
"%time F_train = featurizer.get_feature_matrices(train_cands)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(F_train[0].shape)\n",
"%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
"%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
"print(F_dev[0].shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
"%time F_test = featurizer.get_feature_matrices(test_cands)\n",
"print(F_test[0].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this multimodal featurization and phase 1, 2 are relatively static and typically are only executed once during the KBC process.\n",
"\n",
"### Model Training with Emmental\n",
"In `Fonduer`, we use a new machine learning framework [Emmental](https://github.com/SenWu/emmental) to support all model training."
]
},
Expand Down Expand Up @@ -1166,7 +1173,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.9"
}
},
"nbformat": 4,