Less Code to Load Data In and Out of DeepEval's Ecosystem :)
If you're using any of the features below, you'll likely see around a 50% reduction in the code required, especially for the ETL involved in formatting data in and out of DeepEval's ecosystem. This includes:
🏟️ Arena-GEval
The first LLM-arena-as-a-Judge metric: it runs a blinded experiment and randomly swaps contestant positions for a fair verdict on which LLM output is better.
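For reference, here's a minimal sketch of what this looks like. The contestant names and criteria below are illustrative, and the exact API may differ slightly from this sketch, so check the docs:

from deepeval.metrics import ArenaGEval
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams

# Each contestant maps to an LLMTestCase produced by a different model
test_case = ArenaTestCase(
    contestants={
        "Model A": LLMTestCase(input="Explain RAG.", actual_output="..."),
        "Model B": LLMTestCase(input="Explain RAG.", actual_output="..."),
    },
)
metric = ArenaGEval(
    name="Helpfulness",
    criteria="Pick the contestant whose output answers the input more helpfully.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
metric.measure(test_case)
print(metric.winner, metric.reason)  # contestant identities are blinded during judging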
Docs: https://deepeval.com/docs/metrics-arena-g-eval
⚙️ You can now run component-level evals by simply running a for loop against your dataset of goldens.
Run your loop, call your agent as many times as you need, and get your evaluation results. No more forcing outputs into non-test-case-friendly formats; DeepEval will automatically find your LLM traces and run evals on them.
import asyncio

from somewhere import your_async_llm_app  # Replace with your async LLM app
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="...")])

for golden in dataset.evals_iterator():
    # Create a task to invoke your async LLM app
    task = asyncio.create_task(your_async_llm_app(golden.input))
    dataset.evaluate(task)
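For DeepEval to find traces, the components inside your app need to be traced. A minimal sketch, assuming the @observe decorator and update_current_span from DeepEval's tracing module (your component logic will differ):

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

@observe(metrics=[AnswerRelevancyMetric()])
async def generate(query: str) -> str:
    response = "..."  # call your LLM here
    # Attach a test case to this component's span so evals can run on it
    update_current_span(test_case=LLMTestCase(input=query, actual_output=response))
    return response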
Docs: https://deepeval.com/docs/evaluation-component-level-llm-evals
💬 Conversation simulator is now based on goldens.
Previously you had to define a list of user intentions, profile items, and a ton of other configs to juggle. Now you define a list of goldens: a standardized set of scenarios for which turns are generated.
from deepeval.dataset import ConversationalGolden
from deepeval.test_case import Turn
from deepeval.simulator import ConversationSimulator

# Create a ConversationalGolden
conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
)

# Define chatbot callback
async def chatbot_callback(input):
    return Turn(role="assistant", content=f"Chatbot response to: {input}")

# Run simulation
simulator = ConversationSimulator(model_callback=chatbot_callback)
conversational_test_cases = simulator.simulate(goldens=[conversation_golden])
print(conversational_test_cases)
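From there, the simulated conversations can be scored like any other conversational test cases. A sketch, assuming ConversationalGEval accepts a name and criteria like the other G-Eval metrics (swap in whichever conversational metric fits your use case):

from deepeval import evaluate
from deepeval.metrics import ConversationalGEval

# Hypothetical criteria for the scenario above
metric = ConversationalGEval(
    name="Purchase Success",
    criteria="Determine whether the assistant successfully guided the user to purchasing a ticket.",
)
evaluate(test_cases=conversational_test_cases, metrics=[metric])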
Docs: https://deepeval.com/docs/conversation-simulator
We also updated our docs, with more improvements to come 🎉