Support for AutoModelWithLMHead #13

Open
smeylan opened this issue Dec 18, 2020 · 0 comments
smeylan commented Dec 18, 2020

I would like to adapt this library to work with user-contributed multilingual models from the transformers library.

I tried adding another model class in a fork to handle AutoModelWithLMHead models, here: https://github.com/smeylan/lm-scorer/blob/master/lm_scorer/models/automodel.py. It simply substitutes the transformers model class (GPT2LMHeadModel -> AutoModelWithLMHead).

I am running into two (possibly related) issues with this approach.

First, it errors out on the line sent_logits[:, self.tokenizer.pad_token_id] = float("-inf") with what looks like an off-by-one indexing error.

/content/drive/MyDrive/Repos/lm-scorer/lm_scorer/models/automodel.py in _tokens_log_prob_for_batch(self, text)
     66             # logits.shape = [len(text[sent_index]) + 1, vocab_size]
     67             sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
---> 68             sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
     69             # ids_scores.shape = [seq_len + 1]
     70             sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
IndexError: index 52001 is out of bounds for dimension 1 with size 52001
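For context, the out-of-range id suggests the multilingual tokenizer assigns its pad token an id equal to the size of the model's output dimension (a pad token defined on top of the 52001-entry vocabulary, so it has no column in the logits). A minimal pure-Python sketch of a defensive version of that masking step (the helper name is hypothetical, and plain lists stand in for the logits tensor):

```python
# Simulated shapes from the traceback: the logits' last dimension has
# size 52001, but the tokenizer reports pad_token_id == 52001.
vocab_size = 52001
pad_token_id = 52001

def mask_pad(logits_row, pad_id):
    """Set the pad-token logit to -inf, guarding against ids that fall
    outside the model's output dimension (the failure mode above)."""
    if pad_id is None or pad_id >= len(logits_row):
        return logits_row  # pad token has no column in the logits: skip
    logits_row = list(logits_row)
    logits_row[pad_id] = float("-inf")
    return logits_row

row = [0.0] * vocab_size
masked = mask_pad(row, pad_token_id)  # no IndexError: out-of-range id is skipped
```

The same guard would apply to the tensor version: check pad_token_id against sent_logits.size(1) before indexing.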

If I comment out that line and let it continue, I get back probabilities, but they look odd: the probabilities of the first token and of the endoftext token are both very low compared to the English model on a matched sentence. For example, compare French

([-13.103885650634766,
  -7.141622066497803,
  -2.2347683906555176,
  -6.366621017456055,
  -1.1687631607055664,
  -3.626580238342285,
  -10.760506629943848],
 [2532, 5985, 327, 375, 295, 7536, 50257],
 ['Le', 'Ġchat', 'Ġest', 'Ġsur', 'Ġle', 'Ġtoit', '<|endoftext|>'])

vs. English

([-2.4790897369384766,
  -9.218439102172852,
  -2.2219443321228027,
  -5.678627967834473,
  -0.41474056243896484,
  -4.27750301361084,
  -2.19716739654541,
  -5.7754011154174805],
 [464, 3797, 318, 319, 262, 9753, 13, 50256],
 ['The', 'Ġcat', 'Ġis', 'Ġon', 'Ġthe', 'Ġroof', '.', '<|endoftext|>'])
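For a rough comparison, summing the per-token scores pasted above gives the French sentence a much lower total log probability than the English one, driven largely by the first token and the endoftext token. A quick sketch to reproduce that comparison from the listed numbers (the mean partly corrects for the extra '.' token in the English tokenization):

```python
# Per-token log probabilities copied from the two outputs above.
french_lp = [-13.103885650634766, -7.141622066497803, -2.2347683906555176,
             -6.366621017456055, -1.1687631607055664, -3.626580238342285,
             -10.760506629943848]
english_lp = [-2.4790897369384766, -9.218439102172852, -2.2219443321228027,
              -5.678627967834473, -0.41474056243896484, -4.27750301361084,
              -2.19716739654541, -5.7754011154174805]

def total_and_mean(log_probs):
    """Total and per-token mean log probability for a scored sentence."""
    total = sum(log_probs)
    return total, total / len(log_probs)

fr_total, fr_mean = total_and_mean(french_lp)   # roughly -44.40 total
en_total, en_mean = total_and_mean(english_lp)  # roughly -32.26 total
```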

The same also holds for German (i.e. it follows the same pattern as French), so I don't think it's a model-specific problem.

Any help figuring out how AutoModelWithLMHead might differ from GPT2LMHeadModel would be appreciated!
