This project implements a Ground-Truth-Guided Evaluation (GTGE) framework using a multi-agent Crew AI system to improve the self-correction capabilities of Large Language Models (LLMs).
LLMs often generate plausible but incorrect responses and lack robust self-correction abilities. This project introduces a novel multi-agent system using Crew AI to:
- Evaluate LLM responses against ground-truth answers
- Refine incorrect outputs using a Refiner Agent
- Improve overall accuracy across various benchmarks
- Crew AI Integration: Multi-agent architecture with Evaluator and Refiner agents
- Ground-Truth Benchmarking: Uses known correct answers to guide corrections
- Single-Pass Correction: Enhances efficiency without iterative loops
- Support for Gemini Models: Tested on Gemini 2.0 Flash, Flash-Lite, 1.5 Flash, 1.5 Flash-8B, and 1.5 Pro
Evaluated on six diverse tasks:
- GSM8K (Math Reasoning)
- SVAMP (Arithmetic)
- HotpotQA (Multi-hop QA)
- Sports (Commonsense)
- LLC (Symbolic Reasoning)
- Domestic Robot (Instruction Following)
The GTGE framework demonstrated the following improvements:
Model | Dataset | Baseline Accuracy | GTGE Accuracy |
---|---|---|---|
Gemini 1.5 Pro | GSM8K | 87.0% | 96.0% |
Gemini 2.0 Flash | SVAMP | 86.0% | 94.5% |
Gemini 1.5 Flash | HotpotQA | 66.0% | 73.0% |
Gemini 2.0 F-Lite | Sports | 74.0% | 82.0% |
Gemini 1.5 Pro | LLC | 64.0% | 96.0% |
- Up to 12% improvement in accuracy
- Higher correction and consistency rates compared to baseline and confidence-based approaches
- Stronger correlation between confidence and correctness in structured tasks
- Python
- Crew AI
- Gemini API (via prompt-based LLM interface)
- Matplotlib, Pandas, NumPy
├── agents/ # Crew AI agent definitions
├── benchmarks/ # Datasets and task definitions
├── core/ # GTGE evaluation and refinement logic
├── results/ # Plots, accuracy logs, visualizations
└── calm_core.py # Entry point
- Clone the repository
- Install requirements
pip install -r requirements.txt
- Run the main script
python calm_core.py
For questions or collaborations, reach out via LinkedIn or check the GitHub repository.
Made with ❤️ by Roshan George