A Python backend for benchmarking prompts across different LLM providers such as OpenAI, Anthropic (Claude), and Google. This tool helps evaluate how different language models perform on specific tasks.
- Benchmark prompts across multiple LLM providers
- Support for various language models:
  - OpenAI (GPT-4, GPT-4o, GPT-4o-mini)
  - Anthropic (Claude-3.5-Sonnet, Claude-3.7-Sonnet)
  - Google (Gemini 2.0 Flash, Gemini 2.5 Pro)
  - Meta (Llama-3.2, Llama-3.3)
  - DeepSeek (V3, R1)
  - Ollama (local open-source models)
- Multiple evaluation methods:
  - Exact match comparison
  - Semantic similarity
  - Python code execution with output validation
  - Regular expression matching
  - LLM-based evaluation (using AI to score responses)
- Comprehensive benchmark reports with model rankings
- Asynchronous API calls for efficient benchmarking
```bash
# Basic installation
pip install llm-benchmark-backend

# With semantic similarity evaluation support
pip install llm-benchmark-backend[semantic_similarity]

# For development
pip install llm-benchmark-backend[dev]
```
Or install from source:
```bash
git clone https://github.com/yourusername/llm-benchmark-backend.git
cd llm-benchmark-backend
pip install -e .
```
To use the different LLM providers, you need to set the corresponding API keys as environment variables:
```bash
# OpenAI
export OPENAI_API_KEY=your-openai-key

# Anthropic
export ANTHROPIC_API_KEY=your-anthropic-key

# Google
export GOOGLE_API_KEY=your-google-key

# Other providers
export DEEPSEEK_API_KEY=your-deepseek-key
export LLAMA_API_KEY=your-llama-key
```
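If you prefer to configure keys from Python (for example in a notebook or test harness), setting the same environment variables with the standard library works too; a minimal sketch:

```python
import os

# Set provider API keys programmatically instead of exporting them in the shell.
# Replace the placeholder values with real keys before running a benchmark.
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["GOOGLE_API_KEY"] = "your-google-key"
```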
For Ollama (local models), ensure the Ollama service is running. You can customize the API endpoint:
```bash
export OLLAMA_API_BASE=http://your-ollama-host:11434/api
```
Run a benchmark using a JSON configuration file:
```bash
llm-benchmark examples/math_benchmark.json
```
Additional options:
```bash
# Save results to a specific file (default: results/<benchmark>_results.json)
llm-benchmark examples/math_benchmark.json -o custom_results.json

# Use a specific evaluator
llm-benchmark examples/math_benchmark.json -e llm

# Verbose output
llm-benchmark examples/math_benchmark.json --verbose

# Validate a configuration file without running the benchmark
llm-benchmark examples/math_benchmark.json --validate-only

# List available providers
llm-benchmark --list-providers

# List available evaluators
llm-benchmark --list-evaluators

# Check if required API keys are set
llm-benchmark --check-api-keys
```
You can also use the benchmark programmatically:
```python
import asyncio
from llm_benchmark_backend.benchmark_runner import run_benchmark

async def main():
    results = await run_benchmark("examples/math_benchmark.json", "results.json")
    print(f"Overall average score: {results['summary']['overall']['average_score']:.4f}")

if __name__ == "__main__":
    asyncio.run(main())
```
By default, benchmark results are automatically saved to the `results/` directory in the project root:
- Each benchmark run creates a file named `<benchmark_name>_results.json`
- Results contain detailed scoring, model responses, and evaluation reasoning
- Custom output paths can be specified with the `-o` flag
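As a quick illustration, a saved results file can be inspected with plain Python; this sketch assumes the file mirrors the dictionary returned by `run_benchmark` in the example above:

```python
import json

# Inspect a saved results file (path follows the math example above).
with open("results/math_benchmark_results.json") as f:
    results = json.load(f)

# Same nested keys as in the programmatic example.
print(f"Overall average score: {results['summary']['overall']['average_score']:.4f}")
```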
Benchmark configurations are defined in JSON format. Here's an example (the `model` key under `evaluator_config` is optional and selects the model used by the LLM evaluator):

```json
{
  "benchmark_name": "Simple Math in Python",
  "purpose": "Evaluate the ability of a language model to perform simple math operations",
  "base_prompt": "{{purpose}}\n{{instructions}}\n{{statement}}",
  "evaluator": "llm",
  "evaluator_config": {
    "model": "openai:gpt-4"
  },
  "models": [
    "anthropic~claude-3-sonnet-20240229",
    "openai~gpt-4",
    "google~gemini-pro"
  ],
  "prompts": [
    {
      "dynamic_variables": {
        "purpose": "<purpose>Calculate the result</purpose>",
        "instructions": "<instructions>Output Python code only</instructions>",
        "statement": "<statement>add 5 and 5</statement>"
      },
      "expectation": "10.0"
    }
  ]
}
```
See the `examples/` directory for more comprehensive benchmark configurations.
You can extend the system with custom LLM providers (see the sketch after this list):
- Create a new provider class that inherits from `BaseProvider`
- Implement the `generate` method
- Register the provider in `llm_providers/__init__.py`
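As a rough sketch of what such a provider might look like (the import path and the exact `BaseProvider`/`generate` signatures below are assumptions, so check the source for the real interface):

```python
# Hypothetical example; the actual BaseProvider interface may differ.
from llm_benchmark_backend.llm_providers import BaseProvider  # assumed import path


class EchoProvider(BaseProvider):
    """Toy provider that echoes the prompt back, handy for wiring tests."""

    async def generate(self, model: str, prompt: str, **kwargs) -> str:
        # A real provider would call its SDK or HTTP API here.
        return f"[{model}] {prompt}"
```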
The LLM evaluator uses AI to assess response quality on a scale of 0-1. It provides:
- Detailed reasoning for each score
- Semantic understanding of responses
- Flexible comparison criteria
Usage (the `model` field is optional and specifies which model performs the evaluation):

```json
{
  "evaluator": "llm",
  "evaluator_config": {
    "model": "openai:gpt-4"
  }
}
```
Or via command line:
```bash
llm-benchmark example.json -e llm
```
The other built-in evaluators are:
- `exact_match`: Perfect string matching
- `semantic_similarity`: Embeddings-based comparison
- `execute_python_code`: Runs and validates Python output
- `regex`: Regular expression matching
Similarly, you can add custom evaluation methods (see the sketch after this list):
- Create a new evaluator class that inherits from `BaseEvaluator`
- Implement the `evaluate` method
- Register the evaluator in `evaluators/__init__.py`
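For example, a minimal substring-match evaluator might look like the following sketch (again, the import path and the `evaluate` signature are assumptions):

```python
# Hypothetical example; the actual BaseEvaluator interface may differ.
from llm_benchmark_backend.evaluators import BaseEvaluator  # assumed import path


class ContainsEvaluator(BaseEvaluator):
    """Scores 1.0 if the expected text appears anywhere in the response."""

    def evaluate(self, response: str, expectation: str, **kwargs) -> float:
        return 1.0 if expectation in response else 0.0
```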
The repository currently includes example benchmarks with questions and answers sourced from Hugging Face:
- math benchmark
- reasoning benchmark
- thought-rl benchmark
- MMLU benchmark
- Opencoder benchmark

Further benchmarks are coming soon.
This project is licensed under the MIT License - see the LICENSE file for details.