The benchmarks are code-related tasks focused on measuring how well models can process large context windows.
They differ from other popular benchmarks both in how large they allow the context to be and in how realistic they aim to be: the datasets are built from real-world repos, and the tasks replicate real-world scenarios rather than synthetic, "evaluation-focused" use cases.
It is particularly relevant to our case because:
- it's a great way to evaluate a model's code-assistant capabilities
- the approach to building the benchmark suite could be expanded to additional tasks and programming languages, while keeping the focus on realistic tasks and large contexts. This would make the suite itself more useful and help evaluate models across a broader range of capabilities.
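
To make the "realistic tasks over real repositories, with very large context" idea concrete, here is a minimal sketch of how a single task instance and its evaluation could be structured. All names here (`RepoTask`, `build_prompt`, `evaluate`, the `generate` callable) are hypothetical illustrations, not the benchmark's actual data format or API:

```python
from dataclasses import dataclass

# Hypothetical shape of a single large-context task instance; field names are
# illustrative only and do not reflect the benchmark's real schema.
@dataclass
class RepoTask:
    repo_url: str                   # real-world repository the task is drawn from
    context_files: dict[str, str]   # path -> file contents supplied as context
    instruction: str                # e.g. "implement the missing method in module X"
    reference: str                  # reference solution used for scoring

def build_prompt(task: RepoTask) -> str:
    """Concatenate the repository context and the instruction into one prompt.

    For realistic repos this easily reaches hundreds of thousands of tokens,
    which is exactly the regime these benchmarks target.
    """
    context = "\n\n".join(
        f"# file: {path}\n{body}" for path, body in task.context_files.items()
    )
    return f"{context}\n\n# task:\n{task.instruction}"

def evaluate(tasks: list[RepoTask], generate) -> float:
    """Score a model, passed in as a `generate(prompt) -> str` callable.

    Exact match against the reference is used here only for brevity; real
    suites typically score via test execution or similarity metrics.
    """
    hits = sum(
        generate(build_prompt(t)).strip() == t.reference.strip() for t in tasks
    )
    return hits / len(tasks)
```

The same structure should extend naturally to new task types or languages: adding a task kind mostly means adding a new way to build `context_files` and a matching scoring rule, which is what makes the suite attractive to grow over time.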