first commit

surajiyer · Jul 26, 2020 · commit b652c23

Showing 9 changed files with 552 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,105 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+.pytest_cache/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+.static_storage/
+.media/
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (C) 2017 Ines Montani
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,107 @@
+# spacybert: Bert inference for spaCy
+[spaCy v2.0](https://spacy.io/usage/v2) extension and pipeline component for loading BERT sentence / document embedding metadata into `Doc`, `Span` and `Token` objects. The BERT backend itself is provided by the [Hugging Face transformers](https://github.com/huggingface/transformers) library.
+
+## Installation
+`spacybert` requires `spacy` v2.2.1 or higher, but below v3.0.0.
+
+## Usage
+### Getting BERT embeddings for a single-language dataset
+```
+import spacy
+from spacybert import BertInference
+nlp = spacy.load('en')
+```
+
+Then either use BertInference as part of a pipeline,
+```
+bert = BertInference(
+    from_pretrained='path/to/pretrained_bert_weights_dir',
+    set_extension=False)
+nlp.add_pipe(bert, last=True)
+```
+or standalone, outside the pipeline:
+```
+bert = BertInference(
+    from_pretrained='path/to/pretrained_bert_weights_dir',
+    set_extension=True)
+```
+With `set_extension=True`, `bert_repr` is registered as a property extension on the `Doc`, `Span` and `Token` spaCy objects, and its value is computed when `doc._.bert_repr` is accessed. With `set_extension=False`, `bert_repr` is registered as an attribute extension with a default value of `None`, which is filled in when the component runs as part of a pipeline.
+
+Then get the BERT representation / embedding:
+```
+doc = nlp("This is a test")
+print(doc._.bert_repr)  # <-- torch.Tensor
+```
+
+### Getting BERT embeddings for a multi-language dataset
+```
+import spacy
+from spacy_langdetect import LanguageDetector
+from spacybert import MultiLangBertInference
+
+nlp = spacy.load('en')
+nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
+bert = MultiLangBertInference(
+    from_pretrained={
+        'en': 'path/to/en_pretrained_bert_weights_dir',
+        'nl': 'path/to/nl_pretrained_bert_weights_dir'
+    },
+    set_extension=False)
+nlp.add_pipe(bert, after='language_detector')
+
+texts = [
+    "This is a test",  # English
+    "Dit is een test"  # Dutch
+]
+for doc in nlp.pipe(texts):
+    print(doc._.bert_repr)  # <-- torch.Tensor
+```
+When `language_detector` detects a language for which no pre-trained weights are specified, `doc._.bert_repr` is `None` by default.
+
+## Available attributes
+The extension sets attributes on the `Doc`, `Span` and `Token`. You can change the attribute name on initializing the extension.
+| attribute | type | description |
+|-|-|-|
+| `Doc._.bert_repr` | `torch.Tensor` | Document BERT embedding |
+| `Span._.bert_repr` | `torch.Tensor` | Span BERT embedding |
+| `Token._.bert_repr` | `torch.Tensor` | Token BERT embedding |
+| | | |
+
+## Settings
+On initialization of `BertInference`, you can define the following:
+
+| name | type | default | description |
+|-|-|-|-|
+| `from_pretrained` | `str` | `None` | Path to Bert model directory or name of HuggingFace transformers pre-trained Bert weights, e.g., `bert-base-uncased` |
+| `attr_name` | `str` | `'bert_repr'` | Name of the BERT embedding attribute to set on the `._` namespace |
+| `max_seq_len` | `int` | 512 | Max sequence length for input to Bert |
+| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Strategy to generate single sentence embedding from multiple word embeddings. See below for the various pooling strategies available. |
+| `set_extension` | `bool` | `True` | If `True`, then `'bert_repr'` is set as a property extension for the `Doc`, `Span` and `Token` spacy objects. If `False`, the `'bert_repr'` is set as an attribute extension with a default value (`None`) which gets filled correctly when called in a pipeline. Set it to `False` if you want to use this extension in a spacy pipeline. |
+| `force_extension` | `bool` | `True` | If `True`, re-registers the extension attribute even if one with the same name already exists, e.g., when the component is initialized again |
+
+On initialization of `MultiLangBertInference`, you can define the following:
+
+| name | type | default | description |
+|-|-|-|-|
+| `from_pretrained` | `Dict[LANG_ISO_639_1, str]` | `None` | Mapping from two-letter language codes to the path of a model directory or the name of HuggingFace transformers pre-trained BERT weights |
+| `attr_name` | `str` | `'bert_repr'` | Same as in BertInference |
+| `max_seq_len` | `int` | 512 | Same as in BertInference |
+| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Same as in BertInference |
+| `set_extension` | `bool` | `True` | Same as in BertInference |
+| `force_extension` | `bool` | `True` | Same as in BertInference |
+
+## Pooling strategies
+| strategy | description |
+|-|-|
+| `REDUCE_MEAN` | Element-wise average the word embeddings |
+| `REDUCE_MAX` | Element-wise maximum of the word embeddings |
+| `REDUCE_MEAN_MAX` | Apply both `'REDUCE_MEAN'` and `'REDUCE_MAX'` and concatenate. So if the original word embedding is of dimensions `(768,)`, then the output will have shape `(1536,)` |
+| `CLS_TOKEN`, `FIRST_TOKEN` | Take the embedding of only the first `[CLS]` token |
+| `SEP_TOKEN`, `LAST_TOKEN` | Take the embedding of only the last `[SEP]` token |
+| `None` | No reduction is applied and a matrix of embeddings per word in the sentence is returned |
+
+## Roadmap
+This extension is still experimental. Possible future updates include:
+* Getting document representations from state-of-the-art NLP models other than Google's BERT.
+* Method for computing similarity between `Doc`, `Span` and `Token` objects using the `bert_repr` tensor.
+* Getting representation from multiple / other layers in the models.
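
Note on the README's "Pooling strategies" table: the reductions it lists can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the package's implementation. `pool` is a hypothetical helper, it works on nested lists rather than `torch.Tensor`, and it assumes the first and last rows are the `[CLS]` and `[SEP]` embeddings and that no padding rows are present.

```python
from typing import List, Optional

# One row per word-piece token; assumed order: [CLS], words..., [SEP].
Matrix = List[List[float]]


def pool(token_embeddings: Matrix, strategy: Optional[str] = "REDUCE_MEAN"):
    """Hypothetical helper: reduce per-token embeddings to a single vector."""
    if strategy is None:
        # No reduction: return the full matrix of per-token embeddings.
        return token_embeddings
    # Transpose so that each tuple holds one embedding dimension across tokens.
    columns = list(zip(*token_embeddings))
    mean = [sum(col) / len(col) for col in columns]
    maximum = [max(col) for col in columns]
    if strategy == "REDUCE_MEAN":
        return mean
    if strategy == "REDUCE_MAX":
        return maximum
    if strategy == "REDUCE_MEAN_MAX":
        # Concatenation doubles the size, e.g. (768,) becomes (1536,).
        return mean + maximum
    if strategy in ("CLS_TOKEN", "FIRST_TOKEN"):
        return token_embeddings[0]
    if strategy in ("SEP_TOKEN", "LAST_TOKEN"):
        return token_embeddings[-1]
    raise ValueError(f"Unknown pooling strategy: {strategy!r}")


# Toy input: 3 tokens with 2-dimensional embeddings.
embeddings = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
mean_vec = pool(embeddings, "REDUCE_MEAN")          # [3.0, 2.0]
max_vec = pool(embeddings, "REDUCE_MAX")            # [5.0, 4.0]
mean_max_vec = pool(embeddings, "REDUCE_MEAN_MAX")  # [3.0, 2.0, 5.0, 4.0]
cls_vec = pool(embeddings, "CLS_TOKEN")             # [1.0, 4.0]
sep_vec = pool(embeddings, "LAST_TOKEN")            # [5.0, 2.0]
```

Whether the package includes the special tokens in the mean and max, or masks padding first, is not visible in this diff, so treat those details as open.
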
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,4 @@
+torch>=1.4.0
+transformers>=3.0.0
+spacy>=2.2.1,<3.0.0
+spacy-langdetect>=0.1.2
diff --git a/setup.py b/setup.py
@@ -0,0 +1,39 @@
+from pathlib import Path
+from setuptools import setup, find_packages
+
+package_name = 'spacybert'
+root = Path(__file__).parent.resolve()
+
+# Read in package meta from about.py
+about_path = root / package_name / 'about.py'
+with about_path.open('r', encoding='utf8') as f:
+    about = {}
+    exec(f.read(), about)
+
+# Get readme
+readme_path = root / 'README.md'
+with readme_path.open('r', encoding='utf8') as f:
+    readme = f.read()
+
+install_requires = [
+    'torch>=1.4.0',
+    'transformers>=3.0.0',
+    'spacy>=2.2.1,<3.0.0',
+    'spacy-langdetect>=0.1.2'
+]
+test_requires = ['pytest']
+
+setup(
+    name=package_name,
+    description=about['__summary__'],
+    long_description=readme,
+    author=about['__author__'],
+    author_email=about['__email__'],
+    url=about['__uri__'],
+    version=about['__version__'],
+    license=about['__license__'],
+    packages=find_packages(),
+    install_requires=install_requires,
+    tests_require=test_requires,
+    zip_safe=False,
+)
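
Note on `setup.py`: it reads package metadata by running `about.py` with `exec` and then looking up six keys in the resulting dictionary. The sketch below shows that pattern in isolation. `spacybert/about.py` itself is not shown in the visible part of this diff, so `about_source` and all of its values are placeholders, and `load_about` is a hypothetical helper.

```python
# Keys that setup.py looks up in the `about` dictionary.
REQUIRED_KEYS = [
    "__summary__", "__author__", "__email__",
    "__uri__", "__version__", "__license__",
]

# Placeholder stand-in for the contents of about.py (all values are made up).
about_source = """
__summary__ = "placeholder summary"
__author__ = "placeholder author"
__email__ = "placeholder@example.com"
__uri__ = "https://example.com/placeholder"
__version__ = "0.0.0"
__license__ = "MIT"
"""


def load_about(source: str) -> dict:
    """Hypothetical helper: execute metadata source and return its names."""
    namespace = {}
    exec(source, namespace)
    # exec() adds a __builtins__ entry to the globals dict; drop it.
    namespace.pop("__builtins__", None)
    return namespace


about = load_about(about_source)
# Checking up front avoids a bare KeyError inside the setup() call.
missing = [key for key in REQUIRED_KEYS if key not in about]
```

Reading the file this way keeps the version in one place without importing the package, which would otherwise pull in `torch` and `spacy` at install time.
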