A comprehensive framework for evaluating GenAI applications.
This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.
- Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
- Evaluation Types: Response, Context, Tool Call, Overall Conversation, and Script-based evaluation
- LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM and others
- API Integration: Direct integration with external API for real-time data generation (if enabled)
- Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
- Flexible Configuration: Configurable environment & metric metadata
- Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- Early Validation: Catch configuration errors before expensive LLM calls
- Statistical Analysis: Statistics for every metric with score distribution analysis
# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git
# Local Development
pip install uv
uv sync
# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"
# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>
Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.
# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"
export API_KEY="your-api-endpoint-key"
# Ensure API is running at configured endpoint
# Default: http://localhost:8080
# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Use configuration with api.enabled: false
# Pre-fill response, contexts & tool_calls data in YAML
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml
- Ragas -- docs on Ragas website
  - Response Evaluation
  - Context Evaluation
- Custom
  - Response Evaluation
    - answer_correctness
    - intent_eval -- Evaluates whether the response demonstrates the expected intent or purpose
  - Tool Evaluation
    - tool_eval -- Validates tool calls and arguments with regex pattern matching
- Script-based
  - Action Evaluation
    - script:action_eval -- Executes verification scripts to validate actions (e.g., infrastructure changes)
- DeepEval -- docs on DeepEval website
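Metrics are referenced throughout the configuration with a framework-prefixed identifier (framework:metric_name). As an illustrative sketch (the particular mix of metrics below is arbitrary), metric selections look like this:
# Illustrative metric selection using the framework:metric_name naming scheme
turn_metrics:
  - "ragas:response_relevancy"            # Ragas response evaluation
  - "custom:tool_eval"                    # Custom tool call evaluation
conversation_metrics:
  - "deepeval:conversation_completeness"  # DeepEval conversation-level metric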
# Core evaluation parameters
core:
  # Maximum number of threads; set to null for the Python default.
  # 50 is usually fine on a typical laptop. Check your Judge-LLM service for its maximum requests per minute.
  max_threads: 50
# Judge-LLM Configuration
llm:
  provider: openai            # openai, watsonx, azure, gemini etc.
  model: gpt-4o-mini          # Model name for the provider
  temperature: 0.0            # Generation temperature
  max_tokens: 512             # Maximum tokens in response
  timeout: 300                # Request timeout in seconds
  num_retries: 3              # Retry attempts
# Lightspeed API Configuration for Real-time Data Generation
api:
  enabled: true                        # Enable/disable API calls
  api_base: http://localhost:8080      # Base API URL
  endpoint_type: streaming             # streaming or query endpoint
  timeout: 300                         # API request timeout in seconds
  
  provider: openai                     # LLM provider for API queries (optional)
  model: gpt-4o-mini                   # Model to use for API queries (optional)
  no_tools: null                       # Whether to bypass tools (optional)
  system_prompt: null                  # Custom system prompt (optional)
# Metrics Configuration with thresholds and defaults
metrics_metadata:
  turn_level:
    "ragas:response_relevancy":
      threshold: 0.8
      description: "How relevant the response is to the question"
      default: true   # Used by default when turn_metrics is null
    "ragas:faithfulness":
      threshold: 0.8
      description: "How faithful the response is to the provided context"
      default: false  # Only used when explicitly specified
      
    "custom:intent_eval":
      threshold: 1  # Binary evaluation (0 or 1)
      description: "Intent alignment evaluation using custom LLM evaluation"
      
    "custom:tool_eval":
      description: "Tool call evaluation comparing expected vs actual tool calls (regex for arguments)"
  
  conversation_level:
    "deepeval:conversation_completeness":
      threshold: 0.8
      description: "How completely the conversation addresses user intentions"
# Output Configuration
output:
  output_dir: ./eval_output
  base_filename: evaluation
  enabled_outputs:          # Enable specific output types
    - csv                   # Detailed results CSV
    - json                  # Summary JSON with statistics
    - txt                   # Human-readable summary
# Visualization Configuration
visualization:
  figsize: [12, 8]            # Graph size (width, height)
  dpi: 300                    # Image resolution
  enabled_graphs:
    - "pass_rates"            # Pass rate bar chart
    - "score_distribution"    # Score distribution box plot
    - "conversation_heatmap"  # Heatmap of conversation performance
    - "status_breakdown"      # Pie chart for pass/fail/error breakdown# Judge-LLM Google Gemini
llm:
  provider: "gemini"
  model: "gemini-1.5-pro"    
  temperature: 0.0  
  max_tokens: 512  
  timeout: 120        
  num_retries: 3
# Embeddings for Judge-LLM
# provider: "huggingface" or "openai"
# model: model name
# provider_kwargs: additional arguments,
#   for examples see https://docs.ragas.io/en/stable/references/embeddings/#ragas.embeddings.HuggingfaceEmbeddings
embedding:
  provider: "huggingface"
  model: "sentence-transformers/all-mpnet-base-v2"
  provider_kwargs:
    # cache_folder: <path_for_downloaded_model>
    model_kwargs:
      device: "cpu"
...
- conversation_group_id: "test_conversation"
  description: "Sample evaluation"
  
  # Optional: Environment setup/cleanup scripts, when API is enabled
  setup_script: "scripts/setup_env.sh"      # Run before conversation
  cleanup_script: "scripts/cleanup_env.sh"  # Run after conversation
  
  # Conversation-level metrics   
  conversation_metrics:
    - "deepeval:conversation_completeness"
  
  conversation_metrics_metadata:
    "deepeval:conversation_completeness":
      threshold: 0.8
  
  turns:
    - turn_id: id1
      query: What is OpenShift Virtualization?
      response: null                    # Populated by API if enabled, otherwise provide
      contexts:
        - OpenShift Virtualization is an extension of the OpenShift ...
      attachments: []                   # Attachments (Optional)
      expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
      expected_intent: "explain a concept"  # Expected intent for intent evaluation
      
      # Per-turn metrics (overrides system defaults)
      turn_metrics:
        - "ragas:faithfulness"
        - "custom:answer_correctness"
        - "custom:intent_eval"
      
      # Per-turn metric configuration
      turn_metrics_metadata:
        "ragas:faithfulness": 
          threshold: 0.9  # Override system default
      # turn_metrics: null (omitted) → Use system defaults (metrics with default=true)
      
    - turn_id: id2
      query: Skip this turn evaluation
      turn_metrics: []                  # Skip evaluation for this turn
    - turn_id: id3
      query: Create a namespace called test-ns
      verify_script: "scripts/verify_namespace.sh"  # Script-based verification
      turn_metrics:
        - "script:action_eval"          # Script-based evaluation (if API is enabled)- Real-time data generation: Queries are sent to external API
- Dynamic responses: responseandtool_callsfields populated by API
- Conversation context: Conversation context is maintained across turns
- Authentication: Use API_KEYenvironment variable
- Data persistence: Saves amended response/tool_callsdata to output directory so it can be used with API disabled
- Static data mode: Use pre-filled responseandtool_callsdata
- Faster execution: No external API calls
- Reproducible results: Same data used across runs
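As a rough sketch of the two modes side by side (reusing the keys from the system configuration above; values are illustrative):
# API enabled: response and tool_calls are generated at evaluation time
api:
  enabled: true
  api_base: http://localhost:8080

# API disabled: pre-fill response, contexts & tool_calls in the evaluation data
api:
  enabled: false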
| Field | Type | Required | Description | 
|---|---|---|---|
| conversation_group_id | string | ✅ | Unique identifier for conversation | 
| description | string | ❌ | Optional description | 
| setup_script | string | ❌ | Path to setup script (Optional, used when API is enabled) | 
| cleanup_script | string | ❌ | Path to cleanup script (Optional, used when API is enabled) | 
| conversation_metrics | list[string] | ❌ | Conversation-level metrics (Optional, if override is required) | 
| conversation_metrics_metadata | dict | ❌ | Conversation-level metric config (Optional, if override is required) | 
| turns | list[TurnData] | ✅ | List of conversation turns | 
| Field | Type | Required | Description | API Populated | 
|---|---|---|---|---|
| turn_id | string | ✅ | Unique identifier for the turn | ❌ | 
| query | string | ✅ | The question/prompt to evaluate | ❌ | 
| response | string | 📋 | Actual response from system | ✅ (if API enabled) | 
| contexts | list[string] | 📋 | Context information for evaluation | ✅ (if API enabled) | 
| attachments | list[string] | ❌ | Attachments | ❌ | 
| expected_response | string | 📋 | Expected response for comparison | ❌ | 
| expected_intent | string | 📋 | Expected intent for intent evaluation | ❌ | 
| expected_tool_calls | list[list[dict]] | 📋 | Expected tool call sequences | ❌ | 
| tool_calls | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) | 
| verify_script | string | 📋 | Path to verification script | ❌ | 
| turn_metrics | list[string] | ❌ | Turn-specific metrics to evaluate | ❌ | 
| turn_metrics_metadata | dict | ❌ | Turn-specific metric configuration | ❌ | 
📋 Required based on metrics: Some fields are required only when using specific metrics
Examples
expected_response: Required for custom:answer_correctness
expected_intent: Required for custom:intent_eval
expected_tool_calls: Required for custom:tool_eval
verify_script: Required for script:action_eval (used when API is enabled)
response: Required for most metrics (auto-populated if API enabled)
| Override Value | Behavior | 
|---|---|
| null (or omitted) | Use system defaults (metrics with default: true) | 
| [] (empty list) | Skip evaluation for this turn | 
| ["metric1", ...] | Use specified metrics only | 
expected_tool_calls:
  -
    - tool_name: oc_get           # Tool name
      arguments:                  # Tool arguments
        kind: pod
        name: openshift-light*    # Regex patterns supported for flexible matching
The framework supports script-based evaluations. Note: scripts only execute when the API is enabled; they are designed to test against actual environment changes.
- Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
- Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
- Verify scripts: Run per turn for the script:action_eval metric (e.g., validate whether a pod has been created)
# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: turn_id
      query: Create a new cluster
      verify_script: ./scripts/verify_cluster.sh
      turn_metrics:
        - script:action_eval
Script Path Resolution
Script paths in evaluation data can be specified in multiple ways:
- Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
- Absolute Paths: Used as-is
- Home Directory Paths: Expands to user's home directory
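A sketch of the three forms (script names and directories below are hypothetical):
setup_script: scripts/setup_env.sh            # relative: resolved against the evaluation data YAML location
verify_script: /opt/eval/verify_namespace.sh  # absolute: used as-is
cleanup_script: ~/eval/cleanup_env.sh         # home directory: expanded to the user's home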
# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"
# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"
# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"
# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"
# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"
- CSV: Detailed results with status, scores, reasons
- JSON: Summary statistics with score distributions
- TXT: Human-readable summary
- PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
- PASS/FAIL/ERROR: Status based on thresholds
- Actual Reasons: DeepEval provides LLM-generated explanations; custom metrics provide detailed reasoning
- Score Statistics: Mean, median, standard deviation, min/max for every metric
uv sync --group dev
make format
make pylint
make pyright
make docstyle
make check-types
uv run pytest tests --cov=src
For a detailed walkthrough of the new agent-evaluation framework, refer to lsc_agent_eval/README.md.
For generating answers (optional), refer to README-generate-answers.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributions welcome - see development setup above for code quality tools.