Support for AutoModelWithLMHead #13

Open
smeylan opened this issue Dec 18, 2020 · 0 comments
smeylan commented Dec 18, 2020

I would like to adapt this library to work with user-contributed multilingual models from the transformers library.

I tried adding another model class in a fork to handle AutoModelWithLMHead models, here: https://github.com/smeylan/lm-scorer/blob/master/lm_scorer/models/automodel.py. It simply substitutes the transformers model class (GPT2LMHeadModel -> AutoModelWithLMHead).

I am running into two (possibly related) issues with this approach.

First, it errors out on the line sent_logits[:, self.tokenizer.pad_token_id] = float("-inf") with what looks like an off-by-one indexing error.

/content/drive/MyDrive/Repos/lm-scorer/lm_scorer/models/automodel.py in _tokens_log_prob_for_batch(self, text)
     66             # logits.shape = [len(text[sent_index]) + 1, vocab_size]
     67             sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
---> 68             sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
     69             # ids_scores.shape = [seq_len + 1]
     70             sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
IndexError: index 52001 is out of bounds for dimension 1 with size 52001
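For context, the out-of-range id suggests the multilingual tokenizer assigns its pad token an id equal to the size of the model's output dimension (a pad token defined on top of the 52001-entry vocabulary, so it has no column in the logits). A minimal pure-Python sketch of a defensive version of that masking step (the helper name is hypothetical, and plain lists stand in for the logits tensor):

```python
# Simulated shapes from the traceback: the logits' last dimension has
# size 52001, but the tokenizer reports pad_token_id == 52001.
vocab_size = 52001
pad_token_id = 52001

def mask_pad(logits_row, pad_id):
    """Set the pad-token logit to -inf, guarding against ids that fall
    outside the model's output dimension (the failure mode above)."""
    if pad_id is None or pad_id >= len(logits_row):
        return logits_row  # pad token has no column in the logits: skip
    logits_row = list(logits_row)
    logits_row[pad_id] = float("-inf")
    return logits_row

row = [0.0] * vocab_size
masked = mask_pad(row, pad_token_id)  # no IndexError: out-of-range id is skipped
```

The same guard would apply to the tensor version: check pad_token_id against sent_logits.size(1) before indexing.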

If I comment out that line and let it continue, I get back probabilities, but they look odd: the probabilities of the first token and of the endoftext token are both very low compared to the English model on a matched sentence. For example, compare French

([-13.103885650634766,
  -7.141622066497803,
  -2.2347683906555176,
  -6.366621017456055,
  -1.1687631607055664,
  -3.626580238342285,
  -10.760506629943848],
 [2532, 5985, 327, 375, 295, 7536, 50257],
 ['Le', 'Ġchat', 'Ġest', 'Ġsur', 'Ġle', 'Ġtoit', '<|endoftext|>'])

vs. English

([-2.4790897369384766,
  -9.218439102172852,
  -2.2219443321228027,
  -5.678627967834473,
  -0.41474056243896484,
  -4.27750301361084,
  -2.19716739654541,
  -5.7754011154174805],
 [464, 3797, 318, 319, 262, 9753, 13, 50256],
 ['The', 'Ġcat', 'Ġis', 'Ġon', 'Ġthe', 'Ġroof', '.', '<|endoftext|>'])
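For a rough comparison, summing the per-token scores pasted above gives the French sentence a much lower total log probability than the English one, driven largely by the first token and the endoftext token. A quick sketch to reproduce that comparison from the listed numbers (the mean partly corrects for the extra '.' token in the English tokenization):

```python
# Per-token log probabilities copied from the two outputs above.
french_lp = [-13.103885650634766, -7.141622066497803, -2.2347683906555176,
             -6.366621017456055, -1.1687631607055664, -3.626580238342285,
             -10.760506629943848]
english_lp = [-2.4790897369384766, -9.218439102172852, -2.2219443321228027,
              -5.678627967834473, -0.41474056243896484, -4.27750301361084,
              -2.19716739654541, -5.7754011154174805]

def total_and_mean(log_probs):
    """Total and per-token mean log probability for a scored sentence."""
    total = sum(log_probs)
    return total, total / len(log_probs)

fr_total, fr_mean = total_and_mean(french_lp)   # roughly -44.40 total
en_total, en_mean = total_and_mean(english_lp)  # roughly -32.26 total
```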

The same also holds for German (i.e. it follows the same pattern as French), so I don't think it's a model-specific problem.

Any help figuring out how AutoModelWithLMHead might differ from GPT2LMHeadModel would be appreciated!
