LLM Benchmark Backend

A Python backend for benchmarking prompts using different LLM providers such as Ollama, OpenAI, Claude, and Google. This tool helps evaluate the performance of different language models on specific tasks.

Features

  • Benchmark prompts across multiple LLM providers
  • Support for various language models:
    • OpenAI (GPT-4, GPT-4o, GPT-4o-mini)
    • Anthropic (Claude-3.5-Sonnet, Claude-3.7-Sonnet)
    • Google (Gemini 2.0 Flash, Gemini 2.5 Pro)
    • Meta (Llama-3.2, Llama-3.3)
    • DeepSeek (V3, R1)
    • Ollama (local open-source models)
  • Multiple evaluation methods:
    • Exact match comparison
    • Semantic similarity
    • Python code execution with output validation
    • Regular expression matching
    • LLM-based evaluation (using AI to score responses)
  • Comprehensive benchmark reports with model rankings
  • Asynchronous API calls for efficient benchmarking

Installation

# Basic installation
pip install llm-benchmark-backend

# With semantic similarity evaluation support
pip install llm-benchmark-backend[semantic_similarity]

# For development
pip install llm-benchmark-backend[dev]

Or install from source:

git clone https://github.com/yourusername/llm-benchmark-backend.git
cd llm-benchmark-backend
pip install -e .

API Keys

To use the different LLM providers, you need to set the corresponding API keys as environment variables:

# OpenAI
export OPENAI_API_KEY=your-openai-key

# Anthropic
export ANTHROPIC_API_KEY=your-anthropic-key

# Google
export GOOGLE_API_KEY=your-google-key

# Other providers
export DEEPSEEK_API_KEY=your-deepseek-key
export LLAMA_API_KEY=your-llama-key

For Ollama (local models), ensure the Ollama service is running. You can customize the API endpoint:

export OLLAMA_API_BASE=http://your-ollama-host:11434/api
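
The CLI can verify keys for you (see --check-api-keys below); if you prefer to check from Python, here is a minimal sketch using only the variable names above:

import os

# Environment variable names listed above; Ollama needs no API key.
PROVIDER_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "DEEPSEEK_API_KEY",
    "LLAMA_API_KEY",
]

for name in PROVIDER_KEYS:
    print(f"{name}: {'set' if os.environ.get(name) else 'missing'}")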

Usage

Command Line Interface

Run a benchmark using a JSON configuration file:

llm-benchmark examples/math_benchmark.json

Additional options:

# Save results to a specific file (default: results/<benchmark>_results.json)
llm-benchmark examples/math_benchmark.json -o custom_results.json

# Use specific evaluator
llm-benchmark examples/math_benchmark.json -e llm

# Verbose output
llm-benchmark examples/math_benchmark.json --verbose

# Validate a configuration file without running the benchmark
llm-benchmark examples/math_benchmark.json --validate-only

# List available providers
llm-benchmark --list-providers

# List available evaluators
llm-benchmark --list-evaluators

# Check if required API keys are set
llm-benchmark --check-api-keys
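
The flags above can generally be combined in a single invocation, for example:

# LLM evaluator, custom output file, verbose logging
llm-benchmark examples/math_benchmark.json -e llm -o custom_results.json --verbose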

Python API

You can also use the benchmark programmatically:

import asyncio
from llm_benchmark_backend.benchmark_runner import run_benchmark

async def main():
    results = await run_benchmark("examples/math_benchmark.json", "results.json")
    print(f"Overall average score: {results['summary']['overall']['average_score']:.4f}")

if __name__ == "__main__":
    asyncio.run(main())
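
Building on the same run_benchmark call, a slightly larger sketch runs several configurations in sequence (the second config path is illustrative; the result keys follow the example above):

import asyncio
from llm_benchmark_backend.benchmark_runner import run_benchmark

async def main():
    # Illustrative config paths; any file under examples/ works the same way.
    configs = ["examples/math_benchmark.json", "examples/reasoning_benchmark.json"]
    for config in configs:
        output = config.split("/")[-1].replace(".json", "_results.json")
        results = await run_benchmark(config, output)
        score = results["summary"]["overall"]["average_score"]
        print(f"{config}: overall average score {score:.4f}")

if __name__ == "__main__":
    asyncio.run(main())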

Results

By default, benchmark results are automatically saved to the results/ directory in the project root:

  • Each benchmark run creates a file named <benchmark_name>_results.json
  • Contains detailed scoring, model responses, and evaluation reasoning
  • Custom output paths can be specified with the -o flag
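
The result files are plain JSON, so they can be inspected with standard tooling. A minimal sketch (the file name is illustrative; the summary keys follow the Python API example above):

import json

# Default location: results/<benchmark_name>_results.json
with open("results/math_benchmark_results.json") as f:
    results = json.load(f)

print(f"Overall average score: {results['summary']['overall']['average_score']:.4f}")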

Configuration Format

Benchmark configurations are defined in JSON format. Here's an example (the model entry in evaluator_config is optional and selects the model used by the LLM evaluator):

{
  "benchmark_name": "Simple Math in Python",
  "purpose": "Evaluate the ability of a language model to perform simple math operations",
  "base_prompt": "{{purpose}}\n{{instructions}}\n{{statement}}",
  "evaluator": "llm",
  "evaluator_config": {
    "model": "openai:gpt-4"  // Optional: specify model for LLM evaluator
  },
  "models": [
    "anthropic~claude-3-sonnet-20240229",
    "openai~gpt-4",
    "google~gemini-pro"
  ],
  "prompts": [
    {
      "dynamic_variables": {
        "purpose": "<purpose>Calculate the result</purpose>",
        "instructions": "<instructions>Output Python code only</instructions>",
        "statement": "<statement>add 5 and 5</statement>"
      },
      "expectation": "10.0"
    }
  ]
}
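
The base_prompt acts as a template whose {{placeholders}} are filled from each prompt's dynamic_variables. A minimal sketch of that substitution for the prompt above (illustrative only; the actual runner may render templates differently):

base_prompt = "{{purpose}}\n{{instructions}}\n{{statement}}"
dynamic_variables = {
    "purpose": "<purpose>Calculate the result</purpose>",
    "instructions": "<instructions>Output Python code only</instructions>",
    "statement": "<statement>add 5 and 5</statement>",
}

rendered = base_prompt
for name, value in dynamic_variables.items():
    rendered = rendered.replace("{{" + name + "}}", value)

print(rendered)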

See the examples/ directory for more comprehensive benchmark configurations.

Adding Custom Providers

You can extend the system with custom LLM providers:

  1. Create a new provider class that inherits from BaseProvider
  2. Implement the generate method
  3. Register the provider in llm_providers/__init__.py
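
A minimal sketch of such a provider follows. The BaseProvider interface is not reproduced here, so the import path and the generate signature below are assumptions based on the description above (an asynchronous generate that takes a prompt and returns the model's text); check llm_providers/__init__.py for the actual interface.

# Hypothetical example; adjust to the real BaseProvider interface.
from llm_benchmark_backend.llm_providers import BaseProvider  # assumed import path

class EchoProvider(BaseProvider):
    """Toy provider that echoes the prompt back, useful for dry runs."""

    async def generate(self, prompt: str, model: str) -> str:
        # A real provider would call the vendor's API here (asynchronously).
        return f"[{model}] {prompt}"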

Built-in Evaluators

LLM Evaluator

The LLM evaluator uses a language model to score response quality on a scale of 0 to 1. It provides:

  • Detailed reasoning for each score
  • Semantic understanding of responses
  • Flexible comparison criteria

Usage (the model entry is optional and specifies which model performs the evaluation):

{
  "evaluator": "llm",
  "evaluator_config": {
    "model": "openai:gpt-4"  // Optional: specify which model to use for evaluation
  }
}

Or via command line:

llm-benchmark example.json -e llm

Other Evaluators

  • exact_match: Exact string matching
  • semantic_similarity: Embeddings-based comparison
  • execute_python_code: Runs and validates Python output
  • regex: Regular expression matching
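
Any of these can be selected with the evaluator field in the benchmark configuration, or with the -e flag shown earlier, for example:

{
  "evaluator": "exact_match"
}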

Adding Custom Evaluators

Similarly, you can add custom evaluation methods:

  1. Create a new evaluator class that inherits from BaseEvaluator
  2. Implement the evaluate method
  3. Register the evaluator in evaluators/__init__.py
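
A minimal sketch of a custom evaluator follows. As with providers, the BaseEvaluator interface is not reproduced here, so the import path and the evaluate signature are assumptions; the returned score follows the 0-1 scale used by the built-in evaluators.

# Hypothetical example; adjust to the real BaseEvaluator interface.
from llm_benchmark_backend.evaluators import BaseEvaluator  # assumed import path

class CaseInsensitiveMatch(BaseEvaluator):
    """Toy evaluator: full score when response and expectation match, ignoring case."""

    def evaluate(self, response: str, expectation: str) -> float:
        # Assumed signature; the real interface may also return reasoning.
        return 1.0 if response.strip().lower() == expectation.strip().lower() else 0.0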

Benchmarks

Several example benchmarks with questions and answers sourced from Hugging Face are included:

  • math benchmark
  • reasoning benchmark
  • thought-rl benchmark
  • MMLU benchmark
  • OpenCoder benchmark

Further benchmarks are coming soon.

License

This project is licensed under the MIT License - see the LICENSE file for details.
