A Python backend for benchmarking prompts across different LLM providers such as OpenAI, Anthropic (Claude), and Google. This tool helps evaluate how different language models perform on specific tasks.
- Benchmark prompts across multiple LLM providers
- Support for various language models:
  - OpenAI (GPT-4, GPT-4o, GPT-4o-mini)
  - Anthropic (Claude-3.5-Sonnet, Claude-3.7-Sonnet)
  - Google (Gemini 2.0 Flash, Gemini 2.5 Pro)
  - Meta (Llama-3.2, Llama-3.3)
  - DeepSeek (V3, R1)
  - Ollama (local open-source models)
- Multiple evaluation methods:
  - Exact match comparison
  - Semantic similarity
  - Python code execution with output validation
  - Regular expression matching
  - LLM-based evaluation (using AI to score responses)
- Comprehensive benchmark reports with model rankings
- Asynchronous API calls for efficient benchmarking
```bash
# Basic installation
pip install llm-benchmark-backend

# With semantic similarity evaluation support
pip install llm-benchmark-backend[semantic_similarity]

# For development
pip install llm-benchmark-backend[dev]
```
Or install from source:
```bash
git clone https://github.com/yourusername/llm-benchmark-backend.git
cd llm-benchmark-backend
pip install -e .
```
To use the different LLM providers, you need to set the corresponding API keys as environment variables:
```bash
# OpenAI
export OPENAI_API_KEY=your-openai-key

# Anthropic
export ANTHROPIC_API_KEY=your-anthropic-key

# Google
export GOOGLE_API_KEY=your-google-key

# Other providers
export DEEPSEEK_API_KEY=your-deepseek-key
export LLAMA_API_KEY=your-llama-key
```
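If you prefer to configure keys from Python (for example in a notebook or test harness), setting the same environment variables with the standard library works too; a minimal sketch:

```python
import os

# Set provider API keys programmatically instead of exporting them in the shell.
# Replace the placeholder values with real keys before running a benchmark.
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["GOOGLE_API_KEY"] = "your-google-key"
```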
For Ollama (local models), ensure the Ollama service is running. You can customize the API endpoint:
```bash
export OLLAMA_API_BASE=http://your-ollama-host:11434/api
```
Run a benchmark using a JSON configuration file:
```bash
llm-benchmark examples/math_benchmark.json
```
Additional options:
```bash
# Save results to a specific file (default: results/<benchmark>_results.json)
llm-benchmark examples/math_benchmark.json -o custom_results.json

# Use a specific evaluator
llm-benchmark examples/math_benchmark.json -e llm

# Verbose output
llm-benchmark examples/math_benchmark.json --verbose

# Validate a configuration file without running the benchmark
llm-benchmark examples/math_benchmark.json --validate-only

# List available providers
llm-benchmark --list-providers

# List available evaluators
llm-benchmark --list-evaluators

# Check if required API keys are set
llm-benchmark --check-api-keys
```
You can also use the benchmark programmatically:
```python
import asyncio
from llm_benchmark_backend.benchmark_runner import run_benchmark

async def main():
    results = await run_benchmark("examples/math_benchmark.json", "results.json")
    print(f"Overall average score: {results['summary']['overall']['average_score']:.4f}")

if __name__ == "__main__":
    asyncio.run(main())
```
By default, benchmark results are automatically saved to the `results/` directory in the project root:
- Each benchmark run creates a file named `<benchmark_name>_results.json`
- Results contain detailed scoring, model responses, and evaluation reasoning
- Custom output paths can be specified with the `-o` flag
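As a quick illustration, a saved results file can be inspected with plain Python; this sketch assumes the file mirrors the dictionary returned by `run_benchmark` in the example above:

```python
import json

# Inspect a saved results file (path follows the math example above).
with open("results/math_benchmark_results.json") as f:
    results = json.load(f)

# Same nested keys as in the programmatic example.
print(f"Overall average score: {results['summary']['overall']['average_score']:.4f}")
```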
Benchmark configurations are defined in JSON format. Here's an example (the `model` key under `evaluator_config` is optional and selects the model used by the LLM evaluator):

```json
{
  "benchmark_name": "Simple Math in Python",
  "purpose": "Evaluate the ability of a language model to perform simple math operations",
  "base_prompt": "{{purpose}}\n{{instructions}}\n{{statement}}",
  "evaluator": "llm",
  "evaluator_config": {
    "model": "openai:gpt-4"
  },
  "models": [
    "anthropic~claude-3-sonnet-20240229",
    "openai~gpt-4",
    "google~gemini-pro"
  ],
  "prompts": [
    {
      "dynamic_variables": {
        "purpose": "<purpose>Calculate the result</purpose>",
        "instructions": "<instructions>Output Python code only</instructions>",
        "statement": "<statement>add 5 and 5</statement>"
      },
      "expectation": "10.0"
    }
  ]
}
```
See the `examples/` directory for more comprehensive benchmark configurations.
You can extend the system with custom LLM providers (see the sketch after this list):
- Create a new provider class that inherits from `BaseProvider`
- Implement the `generate` method
- Register the provider in `llm_providers/__init__.py`
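As a rough sketch of what such a provider might look like (the import path and the exact `BaseProvider`/`generate` signatures below are assumptions, so check the source for the real interface):

```python
# Hypothetical example; the actual BaseProvider interface may differ.
from llm_benchmark_backend.llm_providers import BaseProvider  # assumed import path


class EchoProvider(BaseProvider):
    """Toy provider that echoes the prompt back, handy for wiring tests."""

    async def generate(self, model: str, prompt: str, **kwargs) -> str:
        # A real provider would call its SDK or HTTP API here.
        return f"[{model}] {prompt}"
```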
The LLM evaluator uses AI to assess response quality on a scale of 0-1. It provides:
- Detailed reasoning for each score
- Semantic understanding of responses
- Flexible comparison criteria
Usage (the `model` field is optional and specifies which model performs the evaluation):

```json
{
  "evaluator": "llm",
  "evaluator_config": {
    "model": "openai:gpt-4"
  }
}
```
Or via command line:
```bash
llm-benchmark example.json -e llm
```
The other built-in evaluators are:
- `exact_match`: Perfect string matching
- `semantic_similarity`: Embeddings-based comparison
- `execute_python_code`: Runs and validates Python output
- `regex`: Regular expression matching
Similarly, you can add custom evaluation methods (see the sketch after this list):
- Create a new evaluator class that inherits from `BaseEvaluator`
- Implement the `evaluate` method
- Register the evaluator in `evaluators/__init__.py`
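For example, a minimal substring-match evaluator might look like the following sketch (again, the import path and the `evaluate` signature are assumptions):

```python
# Hypothetical example; the actual BaseEvaluator interface may differ.
from llm_benchmark_backend.evaluators import BaseEvaluator  # assumed import path


class ContainsEvaluator(BaseEvaluator):
    """Scores 1.0 if the expected text appears anywhere in the response."""

    def evaluate(self, response: str, expectation: str, **kwargs) -> float:
        return 1.0 if expectation in response else 0.0
```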
The repository currently includes example benchmarks with questions and answers sourced from Hugging Face:
- math benchmark
- reasoning benchmark
- thought-rl benchmark
- MMLU benchmark
- Opencoder benchmark

Further benchmarks are coming soon.
This project is licensed under the MIT License - see the LICENSE file for details.