Fine-tuning the BERT model #24
-
I want to use contextualSpellCheck for my project. Is it possible to fine-tune the BERT model for a specific domain and then use that fine-tuned model for contextual spell checking? Can you please provide some insights or an example on this?
Replies: 8 comments
-
ContextualSpellCheck relies on the 🤗Transformers library to provide the model. So you can follow their fine-tuning pipeline and then load the updated model by passing its local path to contextualSpellCheck, as in:

contextualSpellCheck/examples/ja_example.py, lines 5 to 8 in 96cb79e

A snippet of the model loading in contextualSpellCheck is below:

contextualSpellCheck/contextualSpellCheck/contextualSpellCheck.py, lines 112 to 119 in 96cb79e

If you look at L113 and L119 above, they depend on 🤗Transformers to load the model, either from 🤗Transformers cloud storage or from a local path. Depending on 🤗Transformers lets us support pretrained models in other languages and keeps the library open to community contributions of custom models. I hope this solves your issue.
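For illustration, a minimal sketch of loading a locally fine-tuned model (the folder `./my_fine_tuned_bert` is hypothetical, and this assumes the spaCy v3 `add_pipe` factory with the `model_name` config key from the README; the path can be anything 🤗Transformers `from_pretrained` accepts):

```python
import spacy
import contextualSpellCheck  # registers the "contextual spellchecker" factory

nlp = spacy.load("en_core_web_sm")

# Point model_name at the local folder produced by the fine-tuning scripts
# instead of a model id from the 🤗 hub.
nlp.add_pipe(
    "contextual spellchecker",
    config={"model_name": "./my_fine_tuned_bert"},
)

doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
print(doc._.outcome_spellCheck)
```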
-
Thanks a lot for the detailed information :). I am also interested in working on real-word errors (RWE) in my project. Do you have any suggestions for tackling this problem?
-
I am happy the solution worked for you! Regarding RWE, it is a difficult problem to deal with. I did some reading before starting this project on both RWE and NWE (non-word errors). In my understanding, a spell-corrector pipeline can be divided as follows:

1. misspell identification (detection)
2. candidate generation
3. candidate ranking and selection

RWE and NWE are very similar except for step 1. If I remember correctly, RWE misspells are identified using n-grams and frequency of occurrence; a toy sketch of that idea follows at the end of this reply. I did not find any work which uses ML models to identify misspells (for either RWE or NWE), so that frequency-based approach may be a good starting point for RWE. I have the following resources in my backlog, in case you find them useful:

If you would like to contribute, that would be great!
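To make the n-gram idea concrete, here is a toy sketch (the corpus, threshold, and helper function are all illustrative, not from contextualSpellCheck): a real-word error is a valid vocabulary word that is improbable in its context, so we flag words that are known overall but rare after their left neighbour.

```python
from collections import Counter

# Tiny illustrative corpus; a real detector would use large n-gram counts.
corpus = "a piece of cake a piece of advice keep the peace world peace".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def suspicious(prev: str, word: str, threshold: float = 0.05) -> bool:
    """Flag `word` if it rarely follows `prev` despite being a known word."""
    if unigrams[word] == 0:  # unknown word -> non-word error, not RWE
        return False
    cond = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return cond < threshold

print(suspicious("of", "cake"))   # False: "of cake" is attested in the corpus
print(suspicious("of", "peace"))  # True: "peace" is a real word, but unseen after "of"
```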
-
Also, I will close this issue for now. If you face any issue with fine-tuning, please check the 🤗Transformers repo for more documentation, or raise an issue there if you encounter a bug while fine-tuning.
-
Hello @R1j1t, I could fine-tune the BERT model as shown in the [example](https://github.com/huggingface/transformers/tree/master/examples/language-modeling), and the model is saved as pytorch_model.bin (413 MB). Then I tried to use this model with contextualSpellCheck, but I am facing the following error:

Can you please help me resolve this issue?
-
I got it working. The following files must be in the same folder, and the path should point to that folder (not to pytorch_model.bin):

vocab.txt - vocabulary file
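For reference, a typical output folder from the 🤗Transformers fine-tuning scripts looks like the sketch below (the folder name is hypothetical), and `from_pretrained` takes the folder path rather than the weights file:

```python
# Hypothetical output directory of the fine-tuning script:
#   my_fine_tuned_bert/
#   ├── config.json        # model configuration
#   ├── pytorch_model.bin  # fine-tuned weights (~413 MB for BERT-base)
#   └── vocab.txt          # vocabulary file
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("./my_fine_tuned_bert")  # folder, not .bin
tokenizer = AutoTokenizer.from_pretrained("./my_fine_tuned_bert")
```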
-
Hi @R1j1t, do you know how the loss is calculated when fine-tuning BertForMaskedLM using run_language_modeling.py? While fine-tuning, we can only see the loss and perplexity, so knowing how they are computed would be useful. Do you have any better ideas for fine-tuning BERT (apart from BertForMaskedLM) for spelling correction?
-
@naturecreator please check the relevant documentation or code to understand this.
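For anyone landing here later: in 🤗Transformers, the masked-LM loss is a cross-entropy computed only over the masked positions (label positions set to -100 are ignored), and the perplexity printed by the script is exp(loss). A minimal sketch (the sentence and mask position are illustrative):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
labels = inputs["input_ids"].clone()

mask_pos = 4  # position of "fox" in this tokenization (assumption for this sentence)
inputs["input_ids"][0, mask_pos] = tokenizer.mask_token_id
labels[0, torch.arange(labels.size(1)) != mask_pos] = -100  # ignored by the loss

outputs = model(**inputs, labels=labels)
print(outputs.loss)             # cross-entropy over the masked position only
print(torch.exp(outputs.loss))  # the perplexity the script reports is exp(loss)
```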