Replace Luke with MLuke in Notebook/ConLL-2003 #166
Comments
Hi @mrpeerat, thank you for reporting the problem. It seems that the problem can be solved by setting This should have been set by default when instantiating the tokenizer from the huggingface hub... |
Hi again, the bug is fixed. Thank you for the suggestion, @Ryou0634. Errors:
ValueError Traceback (most recent call last)
File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/sequence_labeling.py:692, in classification_report(y_true, y_pred, digits, suffix, output_dict, mode, sample_weight, zero_division, scheme)
File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/sequence_labeling.py:130, in precision_recall_fscore_support(y_true, y_pred, average, warn_for, beta, sample_weight, zero_division, suffix)
File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/v1.py:122, in _precision_recall_fscore_support(y_true, y_pred, average, warn_for, beta, sample_weight, zero_division, scheme, suffix, extract_tp_actual_correct)
File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/v1.py:101, in check_consistent_length(y_true, y_pred)
ValueError: Found input variables with inconsistent numbers of samples:
Thank you. |
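For anyone hitting the same `ValueError`: seqeval raises it from `check_consistent_length` when the gold and predicted label lists differ in the number of sentences or in the length of some sentence. A minimal sketch for locating the offending sentences before calling `classification_report`. The helper name `find_length_mismatches` and the toy labels are hypothetical and not part of the notebook.

```python
from typing import List, Tuple


def find_length_mismatches(
    y_true: List[List[str]], y_pred: List[List[str]]
) -> List[Tuple[int, int, int]]:
    """Return (sentence_index, len_true, len_pred) for every mismatched sentence."""
    if len(y_true) != len(y_pred):
        raise ValueError(
            f"different number of sentences: {len(y_true)} gold vs {len(y_pred)} predicted"
        )
    return [
        (i, len(t), len(p))
        for i, (t, p) in enumerate(zip(y_true, y_pred))
        if len(t) != len(p)
    ]


# Toy data: the second predicted sentence is one label short.
gold = [["B-PER", "O"], ["O", "O", "B-LOC"]]
pred = [["B-PER", "O"], ["O", "B-LOC"]]
mismatches = find_length_mismatches(gold, pred)
print(mismatches)
```

An empty result means every sentence has matching lengths, so the problem would lie elsewhere.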
It seems the notebook is not compatible with mLUKE, because the notebook was created for LUKE and there have been significant updates in the codebase in the repository... (I guess that something is not compatible with MLukeTokenizer in preprocessing) You may consider using |
I got the same results of
|
After editing |
@chantera |
I finally figured out why the performances of For the evaluation script,
inputs["entity_attention_mask"] = entity_ids != 0
However, To ensure that this causes the discrepancy, I added the following code in the notebook.
inputs = tokenizer(texts, entity_spans=entity_spans, return_tensors="pt", padding=True)
###
inputs["entity_attention_mask"] = torch.zeros_like(inputs["entity_attention_mask"])
###
inputs = inputs.to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
all_logits.extend(outputs.logits.tolist())
This worked and gave 94.23 F1 score. I will send a PR to fix this soon. |
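A minimal sketch of how a mask derived as `entity_ids != 0` can disagree with the mask a tokenizer returns. The toy IDs below are invented so that the two conventions diverge; they are not read from the mLUKE entity vocabulary, and the helper names are hypothetical.

```python
from typing import List


def mask_from_ids(entity_ids: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Derive an attention mask by treating every ID equal to pad_id as padding."""
    return [[int(i != pad_id) for i in row] for row in entity_ids]


def count_disagreements(mask_a: List[List[int]], mask_b: List[List[int]]) -> int:
    """Count the positions at which two equally shaped masks differ."""
    return sum(
        a != b
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
    )


# Toy batch of two examples; the second has one padded span slot.
tokenizer_mask = [[1, 1, 1], [1, 1, 0]]
# Hypothetical case: the ID used for real spans equals the padding ID.
entity_ids = [[0, 0, 0], [0, 0, 0]]

derived_mask = mask_from_ids(entity_ids)
print(derived_mask)
print(count_disagreements(tokenizer_mask, derived_mask))
```

If the count is non-zero, the model sees a different mask at inference than it did under the other convention, which is the kind of mismatch described in the comment above.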
Note that the preprocessing in the notebook did not cause the problem.
def data_to_instance(self, words: List[str], labels: List[str], sentence_boundaries: List[int], doc_index: str):
    subword_lengths = [len(self.tokenizer.tokenize(w)) for w in words]
    total_subword_length = sum(subword_lengths)
    max_token_length = self.max_num_subwords
    max_mention_length = self.max_mention_length
    entities = {}
    for s, e in zip(sentence_boundaries[:-1], sentence_boundaries[1:]):
        for ent in Entities([labels[s:e]], scheme=self.iob_scheme).entities[0]:
            entities[(ent.start + s, ent.end + s)] = ent.tag
    for i in range(len(sentence_boundaries) - 1):
        sentence_start, sentence_end = sentence_boundaries[i:i+2]
        if total_subword_length <= max_token_length:
            context_start = 0
            context_end = len(words)
        else:
            context_start = sentence_start
            context_end = sentence_end
            cur_length = sum(subword_lengths[context_start:context_end])
            while True:
                if context_start > 0:
                    if cur_length + subword_lengths[context_start - 1] <= max_token_length:
                        cur_length += subword_lengths[context_start - 1]
                        context_start -= 1
                    else:
                        break
                if context_end < len(words):
                    if cur_length + subword_lengths[context_end] <= max_token_length:
                        cur_length += subword_lengths[context_end]
                        context_end += 1
                    else:
                        break
        text = ""
        for word in words[context_start:sentence_start]:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            text += word
            text += " "
        sentence_words = words[sentence_start:sentence_end]
        sentence_subword_lengths = subword_lengths[sentence_start:sentence_end]
        word_start_char_positions = []
        word_end_char_positions = []
        for word in sentence_words:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            word_start_char_positions.append(len(text))
            text += word
            word_end_char_positions.append(len(text))
            text += " "
        for word in words[sentence_end:context_end]:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            text += word
            text += " "
        text = text.rstrip()
        entity_spans = []
        original_word_spans = []
        original_entity_spans = []
        labels = []
        for word_start in range(len(sentence_words)):
            for word_end in range(word_start, len(sentence_words)):
                if sum(sentence_subword_lengths[word_start:word_end + 1]) <= max_mention_length:
                    entity_spans.append(
                        (word_start_char_positions[word_start], word_end_char_positions[word_end])
                    )
                    original_word_spans.append(
                        (word_start, word_end + 1)
                    )
                    original_entity_span = (word_start + sentence_start, word_end + 1 + sentence_start)
                    labels.append(entities.get(original_entity_span, NON_ENTITY))
                    original_entity_spans.append(original_entity_span)
        self.tokenizer.tokenizer.task = "entity_span_classification"
        inputs = self.tokenizer.tokenizer(text, entity_spans=entity_spans)
        word_ids = self.tokenizer.tokenizer.convert_ids_to_tokens(inputs["input_ids"])
        entity_ids = inputs["entity_ids"]
        split_size = math.ceil(len(entity_ids) / self.max_entity_length)
        for i in range(split_size):
            entity_size = math.ceil(len(entity_ids) / split_size)
            start = i * entity_size
            end = start + entity_size
            fields = {
                "word_ids": TextField([Token(w) for w in word_ids], token_indexers=self.token_indexers),
                "entity_start_positions": TensorField(np.array(inputs["entity_start_positions"][start:end])),
                "entity_end_positions": TensorField(np.array(inputs["entity_end_positions"][start:end])),
                "original_entity_spans": TensorField(np.array(original_entity_spans[start:end]), padding_value=-1),
                "labels": ListField([LabelField(l) for l in labels[start:end]]),
                "doc_id": MetadataField(doc_index),
                "input_words": MetadataField(words),
                "entity_ids": TensorField(np.array(entity_ids[start:end]), padding_value=0),
                "entity_position_ids": TensorField(np.array(inputs["entity_position_ids"][start:end])),
            }
            yield Instance(fields)
|
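The span enumeration step in `data_to_instance` can be tried in isolation. A minimal sketch, assuming only a list of per-word subword counts; the function name `enumerate_spans` and the toy lengths are hypothetical, and the spans are returned as word indices with an exclusive end, like `original_word_spans`.

```python
from typing import List, Tuple


def enumerate_spans(subword_lengths: List[int], max_mention_length: int) -> List[Tuple[int, int]]:
    """List every (start, end) word span whose subword count fits the limit."""
    spans = []
    for word_start in range(len(subword_lengths)):
        for word_end in range(word_start, len(subword_lengths)):
            # Same condition as in data_to_instance: total subwords in the span.
            if sum(subword_lengths[word_start:word_end + 1]) <= max_mention_length:
                spans.append((word_start, word_end + 1))
    return spans


# Three words tokenized into 1, 2 and 1 subwords; spans may cover at most 3 subwords.
spans = enumerate_spans([1, 2, 1], max_mention_length=3)
print(spans)
```

The number of candidate spans grows roughly quadratically with sentence length, which is why the original code splits them into chunks of at most `max_entity_length`.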
Hi!
I'm trying to run MLuke on https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb by replacing
studio-ousia/luke-large-finetuned-conll-2003
with studio-ousia/mluke-large-lite-finetuned-conll-2003
and changing LukeTokenizer
to MLukeTokenizer.
Everything looks fine until the block:
The error is:
AttributeError Traceback (most recent call last)
Cell In [8], line 12
10 inputs = inputs.to("cuda")
11 with torch.no_grad():
---> 12 outputs = model(**inputs)
13 all_logits.extend(outputs.logits.tolist())
File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
1098 # If we don't have any hooks, we want to skip the rest of the logic in
1099 # this function, and just call forward.
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/transformers/models/luke/modeling_luke.py:1588, in LukeForEntitySpanClassification.forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, entity_start_positions, entity_end_positions, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1571 outputs = self.luke(
1572 input_ids=input_ids,
1573 attention_mask=attention_mask,
(...)
1584 return_dict=True,
1585 )
1586 hidden_size = outputs.last_hidden_state.size(-1)
-> 1588 entity_start_positions = entity_start_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
1589 start_states = torch.gather(outputs.last_hidden_state, -2, entity_start_positions)
1590 entity_end_positions = entity_end_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
AttributeError: 'NoneType' object has no attribute 'unsqueeze'
Thank you.
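The traceback shows `entity_start_positions` arriving at `forward` as `None`, meaning the tokenizer output did not include it. A minimal sketch of a guard that reports missing span inputs before the model is called. The key names are taken from the `forward` signature in the traceback; the helper name `missing_span_inputs` and the toy dictionaries are hypothetical.

```python
from typing import Any, List, Mapping

# Entity-related arguments listed in the forward signature shown in the traceback.
REQUIRED_SPAN_KEYS = (
    "entity_ids",
    "entity_position_ids",
    "entity_start_positions",
    "entity_end_positions",
)


def missing_span_inputs(inputs: Mapping[str, Any]) -> List[str]:
    """Return the required keys that are absent from inputs or set to None."""
    return [key for key in REQUIRED_SPAN_KEYS if inputs.get(key) is None]


# Toy stand-ins for tokenizer output; real outputs would hold tensors.
incomplete = {"input_ids": [[0, 1, 2]], "entity_ids": [[2]], "entity_position_ids": [[[1]]]}
complete = dict(incomplete, entity_start_positions=[[1]], entity_end_positions=[[1]])

print(missing_span_inputs(incomplete))
print(missing_span_inputs(complete))
```

Running such a check right after tokenization turns the opaque `AttributeError` into a message that names the missing keys.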