`auto-bench` is a flexible tool for benchmarking LLMs on Hugging Face Inference Endpoints. It provides an automated way to deploy models, run load tests, and analyze performance across different hardware configurations.
- Automated deployment of models to Hugging Face Inference Endpoints
- Configurable load testing scenarios using K6
- Support for various GPU instances
- Detailed performance metrics collection and analysis
- Easy-to-use Python API for creating and running benchmarks
`auto-bench` relies on Grafana K6 to run load tests and collect metrics. The following metrics are collected (the short sketch after this list shows how they can be derived from raw timings):
- Inter-token latency: Time to generate each new output token for a user querying the system. It translates to the “speed” perceived by the end user.
- Time to First Token: Time the user has to wait before seeing the first token of their answer. Lower waiting times are essential for real-time interactions.
- End-to-end latency: The overall time the system takes to generate the full response to the user.
- Throughput: The number of tokens per second the system can generate across all requests.
- Successful requests: The number of requests the system was able to honor within the benchmark timeframe.
- Error rate: The percentage of requests that ended in error because the system could not process them in time or failed to process them.
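To make these definitions concrete, here is a minimal Python sketch (not auto-bench code; the `RequestTrace` structure is a made-up stand-in for per-request timing data) showing how each metric can be derived from raw token timestamps:

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request timing data (in seconds): when the request was
    sent and when each generated token arrived. Purely illustrative, not
    auto-bench's internal format."""
    sent_at: float
    token_times: list[float]
    ok: bool = True  # False if the request errored or timed out


def _avg(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def summarize(traces: list[RequestTrace], window_s: float) -> dict:
    """Aggregate the metrics described above over one benchmark window."""
    ok = [t for t in traces if t.ok and t.token_times]
    ttft = [t.token_times[0] - t.sent_at for t in ok]      # Time to First Token
    e2e = [t.token_times[-1] - t.sent_at for t in ok]      # End-to-end latency
    itl = [                                                 # Inter-token latency
        (t.token_times[-1] - t.token_times[0]) / (len(t.token_times) - 1)
        for t in ok
        if len(t.token_times) > 1
    ]
    total_tokens = sum(len(t.token_times) for t in ok)
    return {
        "time_to_first_token_s": _avg(ttft),
        "end_to_end_latency_s": _avg(e2e),
        "inter_token_latency_s": _avg(itl),
        "throughput_tok_per_s": total_tokens / window_s,    # across all requests
        "successful_requests": len(ok),
        "error_rate_pct": 100 * (1 - len(ok) / len(traces)) if traces else 0.0,
    }
```

For example, a request sent at t = 0.0 s whose three tokens arrive at 0.40 s, 0.45 s, and 0.50 s has a Time to First Token of 0.40 s, an end-to-end latency of 0.50 s, and an inter-token latency of 0.05 s.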
To get started with `auto-bench`, follow these steps:
- Clone the repository:
  `git clone https://github.com/andrewrreed/auto-bench.git`
- Set up a virtual environment and activate it:
  `python -m venv .venv`
  `source .venv/bin/activate`
- Build the custom K6 binary with SSE support:
  `make build-k6`
- Install the required Python packages:
  `poetry install`
Check out the Getting Started Notebook to get familiar with basic usage.
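To give a flavor of the kind of loop auto-bench automates (deploy a model, load test it with the custom K6 binary, collect the metrics), here is a rough sketch. All names in it (the config fields, the `load_test.js` scenario script, the `ENDPOINT_URL` variable, and the location of the built `k6` binary) are illustrative assumptions rather than the project's actual interface; the Getting Started Notebook shows the real API.

```python
import json
import os
import subprocess
from dataclasses import dataclass


@dataclass
class BenchConfig:
    """Illustrative benchmark settings; auto-bench's real configuration differs."""
    model_id: str       # e.g. a Hugging Face model repo id
    instance_type: str  # GPU instance to benchmark
    vus: int            # concurrent virtual users for the K6 scenario
    duration: str       # K6 test duration, e.g. "2m"


def run_load_test(endpoint_url: str, cfg: BenchConfig) -> dict:
    """Run the custom K6 binary (built by `make build-k6`) against a deployed
    endpoint and return K6's end-of-test summary."""
    subprocess.run(
        [
            "./k6", "run",                       # assumed path of the custom binary
            "--vus", str(cfg.vus),
            "--duration", cfg.duration,
            "--summary-export", "summary.json",  # K6 writes aggregate metrics here
            "load_test.js",                      # placeholder K6 scenario script
        ],
        env={**os.environ, "ENDPOINT_URL": endpoint_url},  # readable via __ENV in the script
        check=True,
    )
    with open("summary.json") as f:
        return json.load(f)
```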
For questions or suggestions, please open an issue on the GitHub repository.