Japanese Tokenizer RuntimeError: Already borrowed #13733

Open
joelbartlett20 opened this issue Jan 24, 2025 · 0 comments

joelbartlett20 commented Jan 24, 2025

  • The Japanese tokenizer began raising RuntimeError: Already borrowed when handling multi-threaded requests.
  • We first noticed this error on 2025-01-15; we had not seen it before that date.
  • Tokenizers for other languages still handle multi-threading: in the example below, English and Chinese succeed while Japanese fails.

How to reproduce the behaviour

python -m spacy download en_core_web_md
python -m spacy download zh_core_web_md
python -m spacy download ja_core_news_md

Then open a python3 shell and paste in the following:

import spacy
from multiprocessing.dummy import Pool as ThreadPool

data = ['string'] * 10
langs = [
    ("English", spacy.load("en_core_web_md")),
    ("Chinese", spacy.load("zh_core_web_md")),
    ("Japanese", spacy.load("ja_core_news_md"))
]

for (lang, model) in langs:
    print(lang)
    pool = ThreadPool(10)
    pool.map(model, data)
    pool.close()
    pool.join()

Output:

English
[string, string, string, string, string, string, string, string, string, string]
Chinese
[string, string, string, string, string, string, string, string, string, string]
Japanese
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/home/jbartlett/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbartlett/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/home/jbartlett/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/jbartlett/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/jbartlett/data-airflow/workspace/ml-services/slack-ml/.venv/lib/python3.11/site-packages/spacy/language.py", line 1037, in __call__
    doc = self._ensure_doc(text)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbartlett/data-airflow/workspace/ml-services/slack-ml/.venv/lib/python3.11/site-packages/spacy/language.py", line 1128, in _ensure_doc
    return self.make_doc(doc_like)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbartlett/data-airflow/workspace/ml-services/slack-ml/.venv/lib/python3.11/site-packages/spacy/language.py", line 1120, in make_doc
    return self.tokenizer(text)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/jbartlett/data-airflow/workspace/ml-services/slack-ml/.venv/lib/python3.11/site-packages/spacy/lang/ja/__init__.py", line 56, in __call__
    sudachipy_tokens = self.tokenizer.tokenize(text)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Already borrowed
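
Possible workaround while this is open: the sketch below serializes calls into the pipeline with a lock, so only one thread is inside the tokenizer at a time. It assumes the error comes from the underlying SudachiPy tokenizer object being entered by two threads at once, which the traceback suggests but does not confirm. SerializedCallable and map_serialized are hypothetical helper names for illustration, not part of spaCy.

```python
import threading
from multiprocessing.dummy import Pool as ThreadPool


class SerializedCallable:
    """Wrap a callable so that only one thread can invoke it at a time."""

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Other threads block here until the current call has returned.
        with self._lock:
            return self._func(*args, **kwargs)


def map_serialized(func, items, threads=10):
    """Map func over items from a thread pool, one call at a time."""
    safe = SerializedCallable(func)
    with ThreadPool(threads) as pool:
        return pool.map(safe, items)


# Intended use with the repro above (assumes the models are installed):
#   model = spacy.load("ja_core_news_md")
#   docs = map_serialized(model, data)
```

This gives up parallelism inside the pipeline call itself, so it trades the crash for throughput; it is not a fix for the underlying behaviour.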

Info about spaCy

python -m spacy info --markdown

  • spaCy version: 3.7.5
  • Platform: Linux-6.8.0-1021-aws-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • Pipelines: ja_core_news_lg (3.7.0), ja_core_news_md (3.7.0), en_core_web_md (3.7.1), zh_core_web_sm (3.7.0), ja_core_news_sm (3.7.0), zh_core_web_md (3.7.0)