[New Task] Add AlpacaEval LC #139
Hi! Ideally we would add it as a community task for now, and once we have non-regression tests on the results, we'll move it to the extended tasks. However, since it uses LLM-as-a-judge, we would first want to move the LLM-as-a-judge code that @NathanHB developed for MT-Bench into the metrics, and allow selecting among several judges. (We want this to be homogeneous for easier debugging.) If you are interested in this, you can start with it; otherwise you can wait for us to add it, as it should be integrated soon.
I saw the PR, it looks great, and homogeneity definitely makes sense. Adding AlpacaEval might require a few changes for homogenization, though. The pipeline for AlpacaEval at a high level is:
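(Roughly, as a sketch in code; every name below is a hypothetical placeholder, not an existing lighteval or alpaca_eval API:)

```python
from typing import Callable, Sequence

def alpaca_eval_pipeline(
    generate: Callable[[str], str],         # the evaluated model
    judge_logprob: Callable[[str], float],  # returns P(model output preferred)
    instructions: Sequence[str],
    reference_outputs: Sequence[str],
) -> float:
    # 1. Generate one output per instruction with the evaluated model.
    outputs = [generate(x) for x in instructions]
    # 2. Build a judge prompt pairing each output with a reference output.
    prompts = [
        f"Instruction: {x}\nOutput A: {ref}\nOutput B: {out}\nWhich is better?"
        for x, out, ref in zip(instructions, outputs, reference_outputs)
    ]
    # 3. Query the judge, and
    # 4. turn its answer into a per-sample preference (AlpacaEval uses the
    #    logprobs of the verdict tokens rather than generated text).
    preferences = [judge_logprob(p) for p in prompts]
    # 5. Aggregate per-sample preferences into a corpus-level win rate
    #    (the length-controlled variant debiases for length at this step).
    return sum(preferences) / len(preferences)
```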
I only had a quick skim through the MT-Bench PR, but my understanding is that steps 2, 3, and 4 would all require slight changes to the current judge abstraction. I'm curious to hear your thoughts! I won't have the time to do such a homogenization myself, and in any case I guess you'd prefer choosing the right abstraction yourselves! But I'm happy to help if there's interest in supporting AlpacaEval, e.g., by writing some minimal implementation.
Thanks for detailing these steps!
Hi! Thanks for your interest in lighteval!
Step 5 should be easy to add. We have a system that lets you plug in functions acting on the whole corpus instead of on individual samples. For example:

```python
mt_bench_metric = SampleLevelMetricGrouping(
    metric=["single_turn", "multi_turn"],
    higher_is_better=True,
    category=MetricCategory.GENERATIVE_MULTI_TURN,
    use_case=MetricUseCase.SUMMARIZATION,
    sample_level_fn=LlmAsJudge(
        judge_model_name="gpt-3.5-turbo",
        template_path="src/lighteval/tasks/extended/mt_bench/judge_prompts.jsonl",
    ).compute_multi_turn,
    corpus_level_fn={
        "single_turn": np.mean,
        "multi_turn": np.mean,
    },
)
```

Here, each sample is evaluated by the judge, and the whole corpus is evaluated using the mean over all samples. We could replace `np.mean` with any function operating on the whole corpus. That would make a metric for AlpacaEval look something like:

```python
alpaca_metric = SampleLevelMetric(
    metric="lc_alpaca",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.SUMMARIZATION,
    sample_level_fn=LlmAsJudge(
        judge_model_name="gpt-4",
        template_path="path/to/alpaca_judge_template.jsonl",
    ).compute_alpaca,
    corpus_level_fn=length_controlled_mean,
)
```
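(`length_controlled_mean` above is just a placeholder. As a very rough sketch of what it could compute, assuming per-sample preferences and length differences are available at the corpus level, and ignoring the richer GLM the real length-controlled metric fits:)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_controlled_mean(preferences, length_diffs):
    """Hypothetical corpus-level aggregation: debias the win rate for
    output length. `preferences` are 0/1 judge votes for the model;
    `length_diffs` are len(model output) - len(reference output).
    Assumes both outcomes occur at least once in the corpus."""
    X = np.asarray(length_diffs, dtype=float).reshape(-1, 1)
    y = np.asarray(preferences)
    clf = LogisticRegression().fit(X, y)
    # Counterfactual win rate if both outputs had equal length.
    return float(clf.predict_proba([[0.0]])[0, 1])
```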
Great to know that there's a place for a corpus-level function; I can write a minimal implementation.
Hi @YannDubs!
Hey @clefourrier! So the current JudgeOpenAI still seems pretty specialized to MT-Bench: it makes a few assumptions that will not hold for AlpacaEval, and more generally for other LLM-as-a-judge benchmarks. For example:
Points 1 and 2 are what we did in AlpacaEval, but we switched to logprobs of tokens as it's cheaper and statistically more efficient. Do you want different classes (say, an MTBenchJudge class and an AlpacaEvalJudge class), or different parameters in the main Judge class? I can implement something minimal next weekend, but it will probably be easier if you end up writing the final abstraction that you would like to keep!
Tagging @NathanHB since he worked on it the most, but IMO it would be great to have the option to pass different parameters in the main Judge class; we'd then load it with different metric definitions, like the above examples for mt_bench_metric vs. alpaca_metric.
Hi @YannDubs! Having multiple parameters passed to the judge would be our preferred way, for example using a parameter to switch between logprobs and text. Don't hesitate to tell us if you have more questions!
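(A minimal sketch of what that parameterization could look like; the class and method names here are illustrative, not lighteval's actual API:)

```python
from dataclasses import dataclass

@dataclass
class LlmAsJudge:
    judge_model_name: str
    template_path: str
    use_logprobs: bool = False  # switch between logprob- and text-based scoring

    def score(self, judge_prompt: str) -> float:
        # Both branches are stubs standing in for a call to the judge model.
        if self.use_logprobs:
            # AlpacaEval-style: derive a preference probability from the
            # logprobs of the verdict tokens.
            return self._preference_from_logprobs(judge_prompt)
        # MT-Bench-style: parse a numeric rating out of the judge's reply text.
        return self._rating_from_text(judge_prompt)

    def _preference_from_logprobs(self, judge_prompt: str) -> float:
        raise NotImplementedError("would query the judge API with logprobs enabled")

    def _rating_from_text(self, judge_prompt: str) -> float:
        raise NotImplementedError("would parse e.g. '[[7]]' from the judge's reply")
```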
Great library, a light library for all the main evals was really needed! 💯
I just came across this line: is there any interest in adding length-controlled AlpacaEval to lighteval? If so, I'm happy to help, e.g., if you want a minimal script that doesn't depend on `alpaca_eval`. Let me know!