Ablation studies #45

Closed
wants to merge 26 commits into from
Commits
83378ee
Ablation studies (#40)
zausin33 Jan 9, 2025
2e9c543
updated config for better evaluation in ablation studies
Jan 9, 2025
b430d82
add MultiRC dataset
Hajuj Jan 11, 2025
a0b1e48
add preprocessed data
Hajuj Jan 12, 2025
6833713
Testing on arithmetic dataset, Pearson and Spearman metrics using sci…
AnamarijaKozina Jan 15, 2025
c4686cf
removed unused method
Jan 15, 2025
c937db6
Added debug dataset
Jan 16, 2025
b72e707
Merge branch 'main' into ablation_studies
Jan 16, 2025
940828a
fixed merge error
Jan 16, 2025
15c8954
Fix eval and test dataset path, add yaml files
Thorben010 Jan 16, 2025
50bf412
fixed test
Jan 16, 2025
50be234
added special name again for ablation studies
Jan 16, 2025
08cfc25
fixed debug dataset loading
Jan 16, 2025
62b346c
merged with nlp_task
Jan 16, 2025
873fd1a
fixed tests
Jan 16, 2025
23b83b0
Merge branch 'nlp_task' into ablation_studies
Jan 16, 2025
e4acdd9
finished multirc merge
Jan 16, 2025
75aac1f
changed device number back to 1
Jan 16, 2025
878c63e
added ntl with default tokenizer
Jan 22, 2025
706590b
chore: uv compataibility (#46)
jannisborn Jan 22, 2025
9102dba
fixed AbsDiffNumberTokenLoss
Thorben010 Jan 23, 2025
004269a
Merge remote-tracking branch 'origin/ablation_studies' into ablation_…
Thorben010 Jan 23, 2025
4688936
improved config
Thorben010 Jan 23, 2025
7ede55c
Main rjokes (#47)
zausin33 Jan 23, 2025
dc3a925
Ensure torchrun compatibility (#48)
jannisborn Jan 27, 2025
75fffe8
fix: use output dir as experiment id to avoid conflicts in evaluate p…
NinaWie Jan 27, 2025
1 change: 1 addition & 0 deletions config/dataset_args/debug.yaml
@@ -0,0 +1 @@
dataset_name: debug
2 changes: 2 additions & 0 deletions config/dataset_args/multirc.yaml
@@ -0,0 +1,2 @@
dataset_name: multirc
compute_number_metrics: false
1 change: 1 addition & 0 deletions config/dataset_args/rjokes.yaml
@@ -0,0 +1 @@
dataset_name: rjokes
2 changes: 1 addition & 1 deletion config/model_args/vanilla_t5.yaml
@@ -1,3 +1,3 @@
 name: vanilla_t5
-config_name: t5-base
+config_name: t5-small
 number_encoding: none
4 changes: 4 additions & 0 deletions config/model_args/vanilla_t5_custom_tokenizer.yaml
@@ -0,0 +1,4 @@
name: vanilla_t5_custom_tokenizer
config_name: t5-small
number_encoding: none
tokenizer_type: custom
5 changes: 2 additions & 3 deletions config/model_args/vanilla_t5_ntl.yaml
@@ -1,7 +1,6 @@
 name: vanilla_t5_ntl
-config_name: t5-base
+config_name: t5-small
 number_encoding: none
 number_token_loss: true
 number_token_loss_weight: 0.3
-number_token_loss_with_wasserstein: false
-#number_token_loss_function:
+number_token_loss_with_wasserstein: true
7 changes: 7 additions & 0 deletions config/model_args/vanilla_t5_ntl_default_tokenizer.yaml
@@ -0,0 +1,7 @@
name: vanilla_t5_ntl_default_tokenizer
config_name: t5-small
number_encoding: none
number_token_loss: true
number_token_loss_weight: 0.3
number_token_loss_with_wasserstein: true
tokenizer_type: auto
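The keys `number_token_loss_weight: 0.3` and `number_token_loss_with_wasserstein: true` suggest a weighted auxiliary loss added to the usual cross-entropy. The following is a minimal sketch under that assumption, not the repository's implementation: the function names `wasserstein_ntl` and `combined_loss` are hypothetical, and the digit vocabulary is assumed to be the ten tokens 0 to 9. For a one-hot target, the Wasserstein-1 distance between the predicted distribution and the target reduces to the expected absolute difference.

```python
def wasserstein_ntl(probs, target):
    """Wasserstein-1 distance between a predicted digit distribution and a one-hot target.

    probs: probabilities for the digit tokens 0..9 (assumed to sum to 1).
    target: the ground-truth digit.
    """
    return sum(p * abs(digit - target) for digit, p in enumerate(probs))


def combined_loss(ce_loss, ntl_loss, weight=0.3):
    """Assumed combination: cross-entropy plus the weighted number token loss."""
    return ce_loss + weight * ntl_loss


# Two predictions that put the same mass (0.5) on the correct digit 5.
near = [0.0] * 10
near[4], near[5] = 0.5, 0.5   # remaining mass on the neighbouring digit 4
far = [0.0] * 10
far[9], far[5] = 0.5, 0.5     # remaining mass on the distant digit 9

print(wasserstein_ntl(near, 5))  # small penalty: the wrong mass is one step away
print(wasserstein_ntl(far, 5))   # larger penalty: the wrong mass is four steps away
print(combined_loss(1.0, wasserstein_ntl(far, 5)))
```

Cross-entropy alone would score `near` and `far` identically, because both assign probability 0.5 to the correct token; the distance term is what separates them.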
2 changes: 1 addition & 1 deletion config/model_args/vanilla_t5_regression_head.yaml
@@ -1,4 +1,4 @@
 name: vanilla_t5_regression_head
-config_name: t5-base
+config_name: t5-small
 number_encoding: none_regression_head
 log_scale_embeddings: false
6 changes: 4 additions & 2 deletions config/run_specific_config/config.yaml
@@ -1,7 +1,9 @@
 training_args:
-  trial:
+  trial: ablation_studies
   special_name:
+  max_steps: 2500000
+  load_best_model_at_end: false
 
 model_args:
   model_name_or_path: google-t5/t5-small
-  config_name: t5-small
+  config_name: t5-small
2 changes: 1 addition & 1 deletion config/training_args/train.yaml
@@ -5,7 +5,7 @@ lr_scheduler_kwargs:
   factor: 0.5
   patience: 5
 weight_decay: 0.01
-num_train_epochs: 2000
+max_steps: 2500000
 save_total_limit: 2
 save_steps: 25000
 eval_steps: 25000
1,820 changes: 1,820 additions & 0 deletions data/multirc/data/preprocessed/test_clean.jsonl

Large diffs are not rendered by default.

5,131 changes: 5,131 additions & 0 deletions data/multirc/data/preprocessed/train_clean.jsonl

Large diffs are not rendered by default.

953 changes: 953 additions & 0 deletions data/multirc/data/preprocessed/val_clean.jsonl

Large diffs are not rendered by default.

166 changes: 166 additions & 0 deletions data/multirc/data/test.jsonl

Large diffs are not rendered by default.

456 changes: 456 additions & 0 deletions data/multirc/data/train.jsonl

Large diffs are not rendered by default.

83 changes: 83 additions & 0 deletions data/multirc/data/val.jsonl

Large diffs are not rendered by default.

69 changes: 69 additions & 0 deletions data/multirc/preprocess_data.py
@@ -0,0 +1,69 @@
import json


def create_clean_jsonl(input_file, output_file):
    """
    Reads the original MultiRC-style JSONL (with passage -> multiple questions -> answers)
    and writes a new JSONL file in the format:

    {
        "question": <string>,
        "answer": <string>
    }

    - "question": we combine the passage text and the question text into one.
    - "answer": we concatenate all correct answers (label=1) from that question.
    """

    with open(input_file, 'r', encoding='utf-8') as fin, \
            open(output_file, 'w', encoding='utf-8') as fout:

        for line in fin:
            line = line.strip()
            if not line:
                continue

            # Parse the original record
            record = json.loads(line)
            passage_text = record["passage"]["text"]
            questions = record["passage"]["questions"]

            # For each question in this passage
            for q in questions:
                question_text = q["question"]
                answers = q["answers"]

                # Gather all the correct answers (label=1)
                correct_answers = [
                    ans["text"] for ans in answers
                    if ans.get("label", 0) == 1
                ]

                # If no correct answers, you could skip or store empty
                if not correct_answers:
                    final_answer = ""
                else:
                    # Join multiple correct answers with " | " or any delimiter
                    final_answer = " | ".join(correct_answers)

                # Build the "question" field by including passage + question
                combined_question = (
                    f"{passage_text.strip()}\n\n"
                    f"Question: {question_text.strip()}"
                )

                # The "answer" field (here, just the correct answers)
                out_record = {
                    "question": combined_question,
                    "answer": final_answer
                }

                fout.write(json.dumps(out_record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    input_path = "../../data/multirc/data/val.jsonl"
    output_path = "../../data/multirc/data/preprocessed/val_clean.jsonl"

    create_clean_jsonl(input_path, output_path)
    print(f"Finished writing to {output_path}")
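To make the transformation in `preprocess_data.py` concrete, here is a small in-memory sketch of what happens to a single input line. The helper name `flatten_record` and the sample record are hypothetical and made up for illustration; only the nested keys (`passage`, `text`, `questions`, `answers`, `label`) and the `" | "` delimiter come from the script itself.

```python
import json


def flatten_record(record):
    """Apply the script's per-record logic to one parsed JSONL line (hypothetical helper)."""
    passage_text = record["passage"]["text"]
    rows = []
    for q in record["passage"]["questions"]:
        # Keep only answers labelled correct (label == 1), as the script does.
        correct = [a["text"] for a in q["answers"] if a.get("label", 0) == 1]
        rows.append({
            "question": f"{passage_text.strip()}\n\nQuestion: {q['question'].strip()}",
            # Joining an empty list yields "", matching the script's empty-answer branch.
            "answer": " | ".join(correct),
        })
    return rows


# Made-up record in the nested shape the script expects.
sample = {
    "passage": {
        "text": "Tom has 3 apples and buys 2 more. ",
        "questions": [
            {
                "question": "How many apples does Tom have now?",
                "answers": [
                    {"text": "5", "label": 1},
                    {"text": "five", "label": 1},
                    {"text": "3", "label": 0},
                ],
            },
            {
                "question": "Who sold the apples?",
                "answers": [{"text": "Tom", "label": 0}],
            },
        ],
    }
}

rows = flatten_record(sample)
for row in rows:
    print(json.dumps(row, ensure_ascii=False))
```

One passage with two questions becomes two output rows. The second row has an empty `answer` because none of its candidates is labelled correct, which is the case the script's comment flags as something that could also be skipped.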
43,246 changes: 43,246 additions & 0 deletions data/rjokes-dataset/data/dev.jsonl

Large diffs are not rendered by default.

43,246 changes: 43,246 additions & 0 deletions data/rjokes-dataset/data/dev.tsv

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions data/rjokes-dataset/data/dev_distribution.json
@@ -0,0 +1,14 @@
{
    "1": 10382,
    "0": 14962,
    "3": 4194,
    "2": 7715,
    "6": 825,
    "5": 1590,
    "4": 2540,
    "9": 172,
    "7": 461,
    "10": 160,
    "8": 239,
    "11": 6
}
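The distribution file stores its keys as strings in no particular order, which makes it hard to read at a glance. A minimal sketch for summarising it follows, assuming the keys are integer labels and the values are example counts; the variable names are illustrative and the dictionary is copied from `dev_distribution.json` above.

```python
# Contents of dev_distribution.json as shown in the diff.
dev_distribution = {
    "1": 10382, "0": 14962, "3": 4194, "2": 7715, "6": 825, "5": 1590,
    "4": 2540, "9": 172, "7": 461, "10": 160, "8": 239, "11": 6,
}

# Convert keys to integers so that sorting is numeric ("10" would sort before "2" as a string).
by_label = {int(label): count for label, count in dev_distribution.items()}
total = sum(by_label.values())

for label in sorted(by_label):
    share = by_label[label] / total
    print(f"{label:>2}: {by_label[label]:>6} ({share:.1%})")
print("total:", total)
```

The counts sum to 43,246, the same as the number of lines added in `dev.jsonl`, and the mass is concentrated on the lowest labels.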
43,246 changes: 43,246 additions & 0 deletions data/rjokes-dataset/data/test.jsonl

Large diffs are not rendered by default.

43,246 changes: 43,246 additions & 0 deletions data/rjokes-dataset/data/test.tsv

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions data/rjokes-dataset/data/test_distribution.json
@@ -0,0 +1,14 @@
{
    "0": 14942,
    "7": 446,
    "1": 10336,
    "2": 7908,
    "4": 2571,
    "3": 4069,
    "6": 829,
    "5": 1558,
    "8": 240,
    "10": 163,
    "9": 179,
    "11": 5
}