Stars
Visualization of cache-optimized matrix multiplication
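For context, the cache optimization such visualizations typically depict is blocked (tiled) multiplication: the matrices are processed in small tiles that fit in cache so the inner loops can reuse data before it is evicted. A minimal illustrative sketch, not taken from that repo (block size and function name are arbitrary):

```python
# Illustrative sketch of blocked ("tiled") matrix multiplication.
# Working on BLOCK x BLOCK tiles keeps the active data small enough to
# stay in cache while the inner loops reuse it.
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, block: int = 64) -> np.ndarray:
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # multiply one pair of tiles and accumulate into the output tile
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, k0:k0 + block] @ B[k0:k0 + block, j0:j0 + block]
                )
    return C
```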
A comprehensive set of LLM benchmark scores and provider prices.
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
Benchmarking Benchmark Leakage in Large Language Models
aider is AI pair programming in your terminal
The central repo for Creole-based NLU and NLG work
Make awesome display tables using Python.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
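To give a sense of what "all in Python" means in practice, a minimal Gradio app wraps a plain Python function in a web UI; the function and labels below are placeholders:

```python
# Minimal Gradio app: wrap a Python function in a web UI and launch it locally.
import gradio as gr

def greet(name: str) -> str:
    return f"Hello, {name}!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch()  # serves the app at http://127.0.0.1:7860 by default
```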
Hallucinations (Confabulations) Document-Based Benchmark for RAG
An extremely fast Python package and project manager, written in Rust.
Machine Learning Engineering Open Book
BigCodeBench: Benchmarking Code Generation Towards AGI
Website for hosting the Open Foundation Models Cheat Sheet.
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
A benchmark to evaluate language models on questions I've previously asked them to solve.
Doing simple retrieval from LLMs at various context lengths to measure accuracy
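The underlying "needle in a haystack" setup can be sketched as follows; this is an illustrative reconstruction rather than the repo's actual code, and the filler text, needle, and function names are made up:

```python
# Sketch of a needle-in-a-haystack retrieval test: bury a known fact ("needle")
# at a chosen depth inside long filler text, then ask the model to retrieve it.
NEEDLE = "The secret passphrase is 'blue-harvest-42'."
QUESTION = "What is the secret passphrase?"

def build_prompt(filler_sentence: str, total_sentences: int, depth: float) -> str:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of the haystack."""
    haystack = [filler_sentence] * total_sentences
    haystack.insert(int(depth * total_sentences), NEEDLE)
    context = " ".join(haystack)
    return f"{context}\n\nQuestion: {QUESTION}\nAnswer:"

prompt = build_prompt("The sky was a pale shade of grey that morning.", 2000, depth=0.5)
# The prompt is then sent to the model under test and the reply is checked for
# the passphrase, giving a retrieval-accuracy score at this length and depth.
```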
Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
A natural language interface for computers
Scrape and export data from the Open LLM Leaderboard.
List of papers on hallucination detection in LLMs.
This repo includes ChatGPT prompt curation to use ChatGPT and other LLM tools better.