From f7a02d3aa0e67ca2edb9f146421023dd0a18ab5d Mon Sep 17 00:00:00 2001
From: YasushiMiyata
Date: Tue, 14 Jul 2020 09:09:21 +0900
Subject: [PATCH 1/2] Switch processing order in Fonduer tutorials
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Changes to be committed:
	modified:   hardware/max_storage_temp_tutorial.ipynb
	modified:   hardware_image/transistor_image_tutorial.ipynb
	modified:   wiki/president_place_of_birth_tutorial.ipynb

(1) Move the Multimodal Featurization step so that it directly precedes
training of the Discriminative Model.
(2) Revise the tutorial text to match the pipeline outline in the Fonduer
paper.

The reason for (1) is that Fonduer’s Multimodal Featurization feeds the
Discriminative Model, not the Generative Model. The previous ordering made
the difference between the Discriminative Model and the Generative Model
confusing. Regarding (2), the previous version apparently grouped the
static steps ahead of the iterative ones; however, the Generative Model,
the Discriminative Model, and Fonduer’s Multimodal Featurization are
easier to understand if the tutorial is reordered and the text is revised
to match.
---
 hardware/max_storage_temp_tutorial.ipynb | 134 ++++++++++--------
 .../transistor_image_tutorial.ipynb | 8 +-
 wiki/president_place_of_birth_tutorial.ipynb | 129 +++++++++--------
 3 files changed, 143 insertions(+), 128 deletions(-)

diff --git a/hardware/max_storage_temp_tutorial.ipynb b/hardware/max_storage_temp_tutorial.ipynb
index c9044b0..da212e4 100644
--- a/hardware/max_storage_temp_tutorial.ipynb
+++ b/hardware/max_storage_temp_tutorial.ipynb
@@ -20,8 +20,8 @@
 "The tutorial is broken into several parts, each covering a phase of the `Fonduer` pipeline (as outlined in the [paper](https://arxiv.org/abs/1703.05028)), and the iterative KBC process:\n",
 "\n",
 "1. KBC Initialization\n",
- "2. Candidate Generation and Multimodal Featurization\n",
- "3. Probabilistic Relation Classification\n",
+ "2. Candidate Generation\n",
+ "3. Training a Multimodal LSTM for KBC\n",
 "4. Error Analysis and Iterative KBC\n",
 "\n",
 "In addition, we show how users can iteratively improve labeling functions to improve relation extraction quality.\n",
 "\n",
@@ -174,7 +174,9 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "metadata": {},
+ "metadata": {
+ "scrolled": true
+ },
 "outputs": [],
 "source": [
 "docs = session.query(Document).order_by(Document.name).all()\n",
@@ -201,12 +203,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# Phase 2: Mention Extraction, Candidate Extraction Multimodal Featurization\n",
+ "# Phase 2: Candidate Generation\n",
 "\n",
 "Given the unified data model from Phase 1, `Fonduer` extracts relation\n",
- "candidates based on user-provided **matchers** and **throttlers**. Then,\n",
- "`Fonduer` leverages the multimodality information captured in the unified data\n",
- "model to provide multimodal features for each candidate.\n",
+ "candidates based on user-provided **matchers** and **throttlers**.\n",
 "\n",
 "## 2.1 Mention Extraction\n",
 "\n",
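The **matchers** named in the new "Phase 2: Candidate Generation" heading above are declarative span filters that select mentions. As a minimal sketch of that step, not part of the patch itself: `MentionNgrams` and `RegexMatchSpan` are real Fonduer classes, but the `n_max` value and the regex below are illustrative rather than the tutorial's exact code.

```python
from fonduer.candidates import MentionNgrams
from fonduer.candidates.matchers import RegexMatchSpan

# Enumerate candidate spans: every n-gram of length 1-2 in each document.
temp_ngrams = MentionNgrams(n_max=2)

# Keep only spans that look like a temperature value such as "150" or "-65"
# (illustrative regex; the tutorial's own matcher is more precise).
temp_matcher = RegexMatchSpan(rgx=r"-?\d{1,3}", longest_match_only=False)
```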
@@ -562,59 +562,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## 2.2 Multimodal Featurization\n",
- "Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
- "\n",
- "### Featurize with `Fonduer`'s optimized Postgres Featurizer\n",
- "We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
+ "At the end of this phase, `Fonduer` has generated the set of candidates.\n",
 "\n",
- "View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from fonduer.features import Featurizer\n",
- "\n",
- "featurizer = Featurizer(session, [PartTemp])\n",
- "%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
- "%time F_train = featurizer.get_feature_matrices(train_cands)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(F_train[0].shape)\n",
- "%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
- "%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
- "print(F_dev[0].shape)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
- "%time F_test = featurizer.get_feature_matrices(test_cands)\n",
- "print(F_test[0].shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. Note that Phase 1 and 2 are relatively static and typically are only executed once during the KBC process.\n",
+ "# Phase 3: Training a Multimodal LSTM for KBC\n",
+ "In this phase, `Fonduer`, first, trains the generative model based on user-defined **labeling function**. Next, `Fonduer` trains the discriminative model with the trained generative model and multimodal features.\n",
 "\n",
- "# Phase 3: Probabilistic Relation Classification\n",
- "In this phase, `Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
+ "## 3.1 Training the Generative Model\n",
+ "`Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
 "\n",
 "In the wild, hand-labeled training data is rare and expensive. A common scenario is to have access to tons of unlabeled training data, and have some idea of how to label them programmatically. For example:\n",
 "* We may be able to think of text patterns that would indicate a part and polarity mention are related, for example the word \"temperature\" appearing between them.\n",
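To make the labeling-function bullets above concrete, here is a minimal sketch in the spirit of the functions this tutorial defines later. `get_row_ngrams` is a real helper from `fonduer.utils.data_model_utils`; the specific rule and the `c.temp` attribute (the temperature mention of a `PartTemp` candidate) follow the hardware tutorial's conventions.

```python
from fonduer.utils.data_model_utils import get_row_ngrams

# Label conventions used throughout the Fonduer tutorials.
ABSTAIN = -1
FALSE = 0
TRUE = 1

def LF_storage_row(c):
    """Vote TRUE when the temperature mention sits in a table row that
    also contains the word 'storage'; otherwise abstain."""
    return TRUE if "storage" in get_row_ngrams(c.temp) else ABSTAIN
```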
@@ -623,6 +577,7 @@
 "\n",
 "Using data programming, we can then train machine learning models to learn which features are the most important in classifying candidates.\n",
 "\n",
+ "\n",
 "### Loading Gold Data\n",
 "For convenience in error analysis and evaluation, we have already annotated the dev and test set for this tutorial, and we'll now load it using an externally-defined helper function: `gold`. Technically, this is also a labeling function similar to labeling functions we will be writing later. The only difference is that (gold) labels are written to `GoldLabel` table while (non-gold) labels are written to `Label` table. If you're interested in the example implementation details, please see the script we now load:"
 ]
 },
@@ -905,13 +860,68 @@
 "\n",
 "In fact, it is probably somewhat overfit to this set. However this is fine, since in the next, we'll train a more powerful end extraction model which will generalize beyond the development set, and which we will evaluate on a blind test set (i.e. one we never looked at during development).\n",
 "\n",
+ "## 3.2 Training the Discriminative Model\n",
 "\n",
- "### Training the Discriminative Model\n",
+ "Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. `Fonduer` also leverages the multimodality information captured in the unified data model to provide multimodal features for each candidate. For this tutorial, we will be training a simple--but fairly effective--logistic regression model.\n",
 "\n",
- "Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. For this tutorial, we will be training a simple--but fairly effective--logistic regression model.\n",
 "\n",
- "We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention.\n",
+ "We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Multimodal Featurization\n",
+ "Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
+ "\n",
+ "We now annotate the candidates in our training, dev, and test sets with features. The optimized Postgres `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
+ "\n",
+ "View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from fonduer.features import Featurizer\n",
+ "\n",
+ "featurizer = Featurizer(session, [PartTemp])\n",
+ "%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
+ "%time F_train = featurizer.get_feature_matrices(train_cands)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(F_train[0].shape)\n",
+ "%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
+ "%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
+ "print(F_dev[0].shape)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
+ "%time F_test = featurizer.get_feature_matrices(test_cands)\n",
+ "print(F_test[0].shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that this multimodal featurization, like Phases 1 and 2, is relatively static and is typically executed only once during the KBC process.\n",
 "\n",
+ "### Model Training with Emmental\n",
 "In `Fonduer`, we use a new machine learning framework [Emmental](https://github.com/SenWu/emmental) to support all model training."
 ]
 },
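The Emmental sentence above is the last context line of this hunk, so the training cells themselves are not visible in the diff. As a rough, illustrative sketch of what the discriminative step optimizes (logistic regression on soft targets; this is not the Emmental API or the tutorial's actual code, and the tensors below are stand-ins):

```python
import torch
import torch.nn.functional as F

# Stand-ins: X plays the role of a dense-ified feature matrix like
# F_train[0].todense(); p plays the role of the generative model's
# marginal probabilities P(candidate is true).
X = torch.randn(1000, 5000)
p = torch.rand(1000)

w = torch.zeros(X.shape[1], requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.01)

for _ in range(100):
    opt.zero_grad()
    logits = X @ w + b
    # Soft-label cross-entropy: candidates the generative model is
    # confident about pull the weights harder than uncertain ones.
    loss = F.binary_cross_entropy_with_logits(logits, p)
    loss.backward()
    opt.step()
```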
@@ -1458,7 +1468,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.6.5"
+ "version": "3.6.9"
 }
 },
 "nbformat": 4,
diff --git a/hardware_image/transistor_image_tutorial.ipynb b/hardware_image/transistor_image_tutorial.ipynb
index 6548f11..3dbde5a 100644
--- a/hardware_image/transistor_image_tutorial.ipynb
+++ b/hardware_image/transistor_image_tutorial.ipynb
@@ -187,12 +187,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# Phase 2: Mention Extraction, Candidate Extraction Multimodal Featurization\n",
+ "# Phase 2: Candidate Generation\n",
 "\n",
 "Given the unified data model from Phase 1, `Fonduer` extracts relation\n",
- "candidates based on user-provided **matchers** and **throttlers**. Then,\n",
- "`Fonduer` leverages the multimodality information captured in the unified data\n",
- "model to provide multimodal features for each candidate.\n",
+ "candidates based on user-provided **matchers** and **throttlers**.\n",
 "\n",
 "## 2.1 Mention Extraction\n",
 "\n",
@@ -416,7 +414,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.6.5"
+ "version": "3.6.9"
 }
 },
 "nbformat": 4,
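Alongside matchers, the Phase 2 hunks above also mention **throttlers**, which prune implausible candidate pairs before they reach featurization. A minimal sketch follows; `is_horz_aligned` and `CandidateExtractor` are real Fonduer APIs, but treating horizontal alignment as the pruning rule is an illustrative assumption, not the tutorial's exact logic.

```python
from fonduer.utils.data_model_utils import is_horz_aligned

def temp_throttler(c):
    """Discard part/temperature pairs whose spans are not horizontally
    aligned on the page; only surviving pairs become candidates."""
    (part, temp) = c
    return is_horz_aligned((part, temp))

# A throttler is passed to the CandidateExtractor alongside the relation,
# e.g. CandidateExtractor(session, [PartTemp], throttlers=[temp_throttler]).
```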
diff --git a/wiki/president_place_of_birth_tutorial.ipynb b/wiki/president_place_of_birth_tutorial.ipynb
index 8741fdd..d41b437 100644
--- a/wiki/president_place_of_birth_tutorial.ipynb
+++ b/wiki/president_place_of_birth_tutorial.ipynb
@@ -37,9 +36,8 @@
 "The tutorial is broken into several parts, each covering a phase of the `Fonduer` pipeline (as outlined in the [paper](https://arxiv.org/abs/1703.05028)), and the iterative KBC process:\n",
 "\n",
 "1. KBC Initialization\n",
- "2. Candidate Generation and Multimodal Featurization\n",
- "3. Probabilistic Relation Classification\n",
- "4. Error Analysis and Iterative KBC\n",
+ "2. Candidate Generation\n",
+ "3. Training a Multimodal LSTM for KBC\n",
 "\n",
 "In addition, we show how users can iteratively improve labeling functions to improve relation extraction quality.\n",
 "\n",
@@ -198,12 +197,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# Phase 2: Mention Extraction, Candidate Extraction Multimodal Featurization\n",
+ "# Phase 2: Candidate Generation\n",
 "\n",
 "Given the unified data model from Phase 1, `Fonduer` extracts relation\n",
- "candidates based on user-provided **matchers** and **throttlers**. Then,\n",
- "`Fonduer` leverages the multimodality information captured in the unified data\n",
- "model to provide multimodal features for each candidate.\n",
+ "candidates based on user-provided **matchers** and **throttlers**.\n",
 "\n",
 "## 2.1 Mention Extraction\n",
 "\n",
@@ -542,59 +539,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## 2.2 Multimodal Featurization\n",
- "Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n",
+ "At the end of this phase, `Fonduer` has generated the set of candidates.\n",
 "\n",
- "### Featurize with `Fonduer`'s optimized Postgres Featurizer\n",
- "We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n",
+ "# Phase 3: Training a Multimodal LSTM for KBC\n",
+ "In this phase, `Fonduer`, first, trains the generative model based on user-defined **labeling function**. Next, `Fonduer` trains the discriminative model with the trained generative model and multimodal features.\n",
 "\n",
- "View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from fonduer.features import Featurizer\n",
- "\n",
- "featurizer = Featurizer(session, [PresidentnamePlaceofbirth])\n",
- "%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
- "%time F_train = featurizer.get_feature_matrices(train_cands)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(F_train[0].shape)\n",
- "%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
- "%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
- "print(F_dev[0].shape)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
- "%time F_test = featurizer.get_feature_matrices(test_cands)\n",
- "print(F_test[0].shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At the end of this phase, `Fonduer` has generated the set of candidates and the feature matrix. 
Note that Phase 1 and 2 are relatively static and typically are only executed once during the KBC process.\n", - "\n", - "# Phase 3: Probabilistic Relation Classification\n", - "In this phase, `Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n", + "## 3.1 Training the Generative Model\n", + "`Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n", "\n", "In the wild, hand-labeled training data is rare and expensive. A common scenario is to have access to tons of unlabeled training data, and have some idea of how to label them programmatically. For example:\n", "* We may have knowledge about typical place constructs, such as combinations of strings with words like 'New','County' or 'City'.\n", @@ -958,12 +909,68 @@ "In fact, it is probably somewhat overfit to this set. However this is fine, since in the next, we'll train a more powerful end extraction model which will generalize beyond the development set, and which we will evaluate on a blind test set (i.e. one we never looked at during development).\n", "\n", "\n", - "### Training the Discriminative Model\n", + "## 3.2 Training the Discriminative Model\n", "\n", "Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. For this tutorial, we will be training a simple--but fairly effective--logistic regression model.\n", "\n", - "We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention.\n", + "We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Multimodal Featurization\n", + "Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. \n", + "\n", + "We now annotate the candidates in our training, dev, and test sets with features. The optimized Postgres `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.\n", + "\n", + "View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/stable/user/features.html#fonduer.features.Featurizer)." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from fonduer.features import Featurizer\n",
+ "\n",
+ "featurizer = Featurizer(session, [PresidentnamePlaceofbirth])\n",
+ "%time featurizer.apply(split=0, train=True, parallelism=PARALLEL)\n",
+ "%time F_train = featurizer.get_feature_matrices(train_cands)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(F_train[0].shape)\n",
+ "%time featurizer.apply(split=1, parallelism=PARALLEL)\n",
+ "%time F_dev = featurizer.get_feature_matrices(dev_cands)\n",
+ "print(F_dev[0].shape)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%time featurizer.apply(split=2, parallelism=PARALLEL)\n",
+ "%time F_test = featurizer.get_feature_matrices(test_cands)\n",
+ "print(F_test[0].shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that this multimodal featurization, like Phases 1 and 2, is relatively static and is typically executed only once during the KBC process.\n",
 "\n",
+ "### Model Training with Emmental\n",
 "In `Fonduer`, we use a new machine learning framework [Emmental](https://github.com/SenWu/emmental) to support all model training."
 ]
 },
@@ -1166,7 +1173,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.6.5"
+ "version": "3.6.9"
 }
 },
 "nbformat": 4,

From b406d3897451f160f987bc91277ddb7c1a899f09 Mon Sep 17 00:00:00 2001
From: YasushiMiyata
Date: Fri, 17 Jul 2020 11:38:12 +0900
Subject: [PATCH 2/2] Minor fixes to sentences in the PR.

Changes to be committed:
	modified:   hardware/max_storage_temp_tutorial.ipynb
	modified:   wiki/president_place_of_birth_tutorial.ipynb
---
 hardware/max_storage_temp_tutorial.ipynb | 2 +-
 wiki/president_place_of_birth_tutorial.ipynb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hardware/max_storage_temp_tutorial.ipynb b/hardware/max_storage_temp_tutorial.ipynb
index da212e4..76a0915 100644
--- a/hardware/max_storage_temp_tutorial.ipynb
+++ b/hardware/max_storage_temp_tutorial.ipynb
@@ -565,7 +565,7 @@
 "At the end of this phase, `Fonduer` has generated the set of candidates.\n",
 "\n",
 "# Phase 3: Training a Multimodal LSTM for KBC\n",
- "In this phase, `Fonduer`, first, trains the generative model based on user-defined **labeling function**. Next, `Fonduer` trains the discriminative model with the trained generative model and multimodal features.\n",
+ "In this phase, `Fonduer` first trains the generative model based on user-defined **labeling functions**. 
Next, `Fonduer` trains the discriminative model using the probabilistic labels produced by the generative model together with the multimodal features.\n",
 "\n",
 "## 3.1 Training the Generative Model\n",
 "`Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
diff --git a/wiki/president_place_of_birth_tutorial.ipynb b/wiki/president_place_of_birth_tutorial.ipynb
index d41b437..7e75561 100644
--- a/wiki/president_place_of_birth_tutorial.ipynb
+++ b/wiki/president_place_of_birth_tutorial.ipynb
@@ -542,7 +542,7 @@
 "At the end of this phase, `Fonduer` has generated the set of candidates.\n",
 "\n",
 "# Phase 3: Training a Multimodal LSTM for KBC\n",
- "In this phase, `Fonduer`, first, trains the generative model based on user-defined **labeling function**. Next, `Fonduer` trains the discriminative model with the trained generative model and multimodal features.\n",
+ "In this phase, `Fonduer` first trains the generative model based on user-defined **labeling functions**. Next, `Fonduer` trains the discriminative model using the probabilistic labels produced by the generative model together with the multimodal features.\n",
 "\n",
 "## 3.1 Training the Generative Model\n",
 "`Fonduer` applies user-defined **labeling functions**, which express various heuristics, patterns, and [weak supervision](http://hazyresearch.github.io/snorkel/blog/weak_supervision.html) strategies to label our data, to each of the candidates to create a label matrix that is used by our data programming engine.\n",
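Since both patches revolve around the generative-model step, a compact sketch of it may help readers of this series. `Labeler`, `apply`, and `get_label_matrices` are real Fonduer APIs that mirror the `Featurizer` calls shown in the diffs, and the Snorkel `LabelModel` usage follows recent Fonduer tutorials; `LF_storage_row` is the illustrative labeling function sketched earlier, and exact imports can vary by version.

```python
from fonduer.supervision import Labeler
from snorkel.labeling.model import LabelModel

# Apply the labeling functions over the training split to build the
# label matrix (one row per candidate, one column per labeling function).
labeler = Labeler(session, [PartTemp])
labeler.apply(split=0, lfs=[[LF_storage_row]], train=True, parallelism=PARALLEL)
L_train = labeler.get_label_matrices(train_cands)

# Fit the generative label model on the noisy, possibly conflicting votes
# and emit probabilistic labels for the discriminative model to train on.
label_model = LabelModel(cardinality=2)
label_model.fit(L_train[0], n_epochs=500, seed=1234)
train_marginals = label_model.predict_proba(L_train[0])
```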