Python SDK for Plum AI.
Install the SDK from PyPI:

```bash
pip install plum-sdk
```

The Plum SDK allows you to upload training examples, generate and define metric questions, and evaluate your LLM's performance.
```python
from plum_sdk import PlumClient, IOPair

# Initialize the SDK with your API key
api_key = "YOUR_API_KEY"
plum_client = PlumClient(api_key)

# Or initialize with a custom base URL
# plum_client = PlumClient(api_key, base_url="https://custom.api.url/v1")

# Create training examples
training_examples = [
    IOPair(
        input="What is the capital of France?",
        output="The capital of France is Paris."
    ),
    IOPair(
        input="How do I make pasta?",
        output="1. Boil water\n2. Add salt\n3. Cook pasta until al dente"
    ),
    IOPair(
        id="custom_id_123",
        input="What is machine learning?",
        output="Machine learning is a branch of artificial intelligence that focuses on building systems that learn from data."
    )
]

# Define your system prompt
system_prompt = "You are a helpful assistant that provides accurate and concise answers."

# Upload the data
response = plum_client.upload_data(training_examples, system_prompt)
print(f"Dataset uploaded with ID: {response.id}")
```

You can also upload training examples to update an existing dataset by providing the dataset ID:
```python
# Update an existing dataset with new training examples
existing_dataset_id = "data:0:123456"  # ID from a previous dataset
new_training_examples = [
    IOPair(
        input="What is the capital of Italy?",
        output="The capital of Italy is Rome."
    ),
    IOPair(
        input="How do you say 'hello' in Spanish?",
        output="'Hello' in Spanish is 'Hola'."
    )
]

# This will overwrite the existing dataset with the new data
response = plum_client.upload_data(
    training_examples=new_training_examples,
    system_prompt=system_prompt,
    dataset_id=existing_dataset_id
)
print(f"Dataset updated with ID: {response.id}")
```

You can add additional training examples to an existing dataset:
```python
# Add a single example to an existing dataset
dataset_id = "data:0:123456"  # ID from a previous upload_data response
response = plum_client.upload_pair(
    dataset_id=dataset_id,
    input_text="What is the tallest mountain in the world?",
    output_text="Mount Everest is the tallest mountain in the world, with a height of 8,848.86 meters (29,031.7 feet).",
    labels=["geography", "mountains"]  # Optional labels for categorization
)
print(f"Added pair with ID: {response.pair_id}")
```

If you want to add a single example but don't have an existing dataset ID, you can use `upload_pair_with_prompt`. This method will either find an existing dataset with the same system prompt or create a new one:
```python
# Add a single example with a system prompt - will auto-create or find matching dataset
response = plum_client.upload_pair_with_prompt(
    input_text="What is the capital of Japan?",
    output_text="The capital of Japan is Tokyo.",
    system_prompt_template="You are a helpful assistant that provides accurate and concise answers.",
    labels=["geography", "capitals"]  # Optional labels
)
print(f"Added pair with ID: {response.pair_id} to dataset: {response.dataset_id}")
```

Once you have uploaded a dataset, generate evaluation metrics from your system prompt and evaluate the dataset against them:

```python
# Generate evaluation metrics based on your system prompt
metrics_response = plum_client.generate_metric_questions(system_prompt)
print(f"Generated metrics with ID: {metrics_response.metrics_id}")

# Evaluate your dataset
evaluation_response = plum_client.evaluate(
    data_id=response.id,  # Dataset ID from upload_data response
    metrics_id=metrics_response.metrics_id
)
print(f"Evaluation completed with ID: {evaluation_response.eval_results_id}")
```

You can filter which pairs to evaluate using `pair_query` parameters:
```python
# Evaluate only the latest 50 pairs
evaluation_response = plum_client.evaluate(
    data_id=dataset_id,
    metrics_id=metrics_id,
    latest_n_pairs=50
)

# Evaluate only pairs with specific labels
evaluation_response = plum_client.evaluate(
    data_id=dataset_id,
    metrics_id=metrics_id,
    pair_labels=["geography"]
)

# Evaluate only pairs created in the last hour (3600 seconds)
evaluation_response = plum_client.evaluate(
    data_id=dataset_id,
    metrics_id=metrics_id,
    last_n_seconds=3600
)

# Combine multiple filters
evaluation_response = plum_client.evaluate(
    data_id=dataset_id,
    metrics_id=metrics_id,
    latest_n_pairs=50,
    pair_labels=["geography", "capitals"],  # Only pairs tagged with both "geography" AND "capitals" labels
    last_n_seconds=1800  # Last 30 minutes
)

# Evaluate synthetic data instead of seed data
evaluation_response = plum_client.evaluate(
    data_id=synthetic_data_id,
    metrics_id=metrics_id,
    is_synthetic=True,
    latest_n_pairs=100
)
```

Generate synthetic training examples from your seed data:
```python
# Basic augmentation - generates 3x the original dataset size
augment_response = plum_client.augment(
    seed_data_id=dataset_id,
    multiple=3
)
print(f"Generated synthetic data with ID: {augment_response['synthetic_data_id']}")

# Advanced augmentation with filtering and a target metric
augment_response = plum_client.augment(
    seed_data_id=dataset_id,
    multiple=2,
    eval_results_id=evaluation_response.eval_results_id,
    latest_n_pairs=50,          # Only use the latest 50 pairs for augmentation
    pair_labels=["geography"],  # Only use pairs with these labels
    target_metric="accuracy"    # Target a specific metric for redrafting
)
```

The SDK will raise exceptions for non-200 responses:
```python
from plum_sdk import PlumClient
import requests

try:
    plum_client = PlumClient(api_key="YOUR_API_KEY")
    response = plum_client.upload_data(training_examples, system_prompt)
    print(response)
except requests.exceptions.HTTPError as e:
    print(f"Error uploading data: {e}")
```

Retrieve datasets and individual pairs from Plum:
```python
# Get a complete dataset with all its pairs
dataset = plum_client.get_dataset(dataset_id="data:0:123456")
print(f"Dataset {dataset.id} contains {len(dataset.data)} pairs")
print(f"System prompt: {dataset.system_prompt}")

# Iterate through all pairs in the dataset
for pair in dataset.data:
    print(f"Pair {pair.id}: {pair.input[:50]}...")
    if pair.metadata and pair.metadata.labels:
        print(f"  Labels: {pair.metadata.labels}")

# Get a synthetic dataset instead of seed data
synthetic_dataset = plum_client.get_dataset(
    dataset_id="synthetic:0:789012",
    is_synthetic=True
)

# Get a specific pair from a dataset
pair = plum_client.get_pair(
    dataset_id="data:0:123456",
    pair_id="pair_abc123"
)
print(f"Input: {pair.input}")
print(f"Output: {pair.output}")
if pair.metadata:
    print(f"Created at: {pair.metadata.created_at}")
    print(f"Labels: {pair.metadata.labels}")
```

List and retrieve evaluation metrics:
```python
# List all available metrics
metrics_list = plum_client.list_metrics()
print(f"Total metrics available: {metrics_list.total_count}")

# Browse through all metrics
for metrics_id, metric_details in metrics_list.metrics.items():
    print(f"\nMetrics ID: {metrics_id}")
    print(f"Created at: {metric_details.created_at}")
    print(f"Number of questions: {metric_details.metric_count}")

    # Show each metric question
    for definition in metric_details.definitions:
        print(f"  - {definition.name}: {definition.description}")

# Get detailed information about a specific metric
metric_details = plum_client.get_metric(metrics_id="metrics:0:456789")
print(f"Metric {metric_details.metrics_id} has {metric_details.metric_count} questions")
if metric_details.system_prompt:
    print(f"Associated system prompt: {metric_details.system_prompt}")

# Show all metric definitions
for definition in metric_details.definitions:
    print(f"Question ID: {definition.id}")
    print(f"Name: {definition.name}")
    print(f"Description: {definition.description}")
```

`PlumClient` constructor parameters:

- `api_key` (str): Your Plum API key
- `base_url` (str, optional): Custom base URL for the Plum API (defaults to `https://beta.getplum.ai/v1`)
`PlumClient` methods:

- `upload_data(training_examples: List[IOPair], system_prompt: str, dataset_id: Optional[str] = None) -> UploadResponse`: Uploads training examples and a system prompt to create a new dataset or update an existing one. If `dataset_id` is provided, overwrites the existing dataset; otherwise creates a new dataset.
- `upload_pair(dataset_id: str, input_text: str, output_text: str, pair_id: Optional[str] = None, labels: Optional[List[str]] = None) -> PairUploadResponse`: Adds a single input-output pair to an existing dataset.
- `upload_pair_with_prompt(input_text: str, output_text: str, system_prompt_template: str, pair_id: Optional[str] = None, labels: Optional[List[str]] = None) -> PairUploadResponse`: Adds a single input-output pair to a dataset, creating the dataset if it doesn't exist.
- `generate_metric_questions(system_prompt: str) -> MetricsQuestions`: Automatically generates evaluation metric questions based on a system prompt.
- `define_metric_questions(questions: List[str]) -> MetricsResponse`: Defines custom evaluation metric questions (see the sketch after this list).
- `evaluate(data_id: str, metrics_id: str, latest_n_pairs: Optional[int] = None, pair_labels: Optional[List[str]] = None, last_n_seconds: Optional[int] = None, is_synthetic: bool = False) -> EvaluationResponse`: Evaluates uploaded data against defined metrics and returns detailed scoring results.
- `augment(seed_data_id: Optional[str] = None, multiple: int = 1, eval_results_id: Optional[str] = None, latest_n_pairs: Optional[int] = None, pair_labels: Optional[List[str]] = None, target_metric: Optional[str] = None) -> dict`: Augments seed data to generate synthetic training examples.
- `get_dataset(dataset_id: str, is_synthetic: bool = False) -> Dataset`: Retrieves a complete dataset with all its pairs by ID.
- `get_pair(dataset_id: str, pair_id: str, is_synthetic: bool = False) -> IOPair`: Retrieves a specific pair from a dataset by its ID.
- `list_metrics() -> MetricsListResponse`: Lists all available evaluation metrics with their definitions.
- `get_metric(metrics_id: str) -> DetailedMetricsResponse`: Retrieves detailed information about a specific metric by ID.
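`define_metric_questions` is the only method above not covered by the usage examples. Here is a minimal sketch based on the documented signature; it reuses the `plum_client` from earlier examples, and the question strings are illustrative rather than taken from the library:

```python
# Define custom evaluation questions instead of generating them from a system prompt.
# The question strings below are illustrative examples.
custom_metrics = plum_client.define_metric_questions(
    questions=[
        "Is the answer factually accurate?",
        "Is the answer concise and free of unnecessary filler?",
    ]
)
print(f"Defined metrics with ID: {custom_metrics.metrics_id}")

# The returned metrics_id can then be passed to evaluate(), just like a generated one
```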
The SDK exposes the following data types.

`IOPair`: A dataclass representing a single example of your application's interaction with an LLM:

- `input` (str): The input text
- `output` (str): The output text produced by your LLM
- `id` (Optional[str]): Optional custom identifier for the example
`UploadResponse`: Response from uploading training data:

- `id` (str): Unique identifier for the created dataset

`PairUploadResponse`: Response from uploading a pair to a dataset:

- `dataset_id` (str): ID of the dataset the pair was added to
- `pair_id` (str): Unique identifier for the uploaded pair

`MetricsQuestions`: Contains generated evaluation metrics:

- `metrics_id` (str): Unique identifier for the metrics
- `definitions` (List[str]): List of generated metric questions

`MetricsResponse`: Response from defining custom metrics:

- `metrics_id` (str): Unique identifier for the defined metrics

`EvaluationResponse`: Contains evaluation results:

- `eval_results_id` (str): Unique identifier for the evaluation results
- `scores` (List[MetricScore]): Detailed scoring information for each metric
- `pair_count` (int): Number of pairs that were evaluated
- `dataset_id` (Optional[str]): ID of the dataset that was evaluated
- `created_at` (Optional[str]): Timestamp when the evaluation was created
`MetricScore`: Contains detailed scoring information for a single metric:

- `metric` (str): Name of the metric
- `mean_score` (float): Average score across all evaluated pairs
- `std_dev` (float): Standard deviation of scores
- `ci_low` (float): Lower bound of the confidence interval
- `ci_high` (float): Upper bound of the confidence interval
- `ci_confidence` (float): Confidence level (e.g., 0.95 for 95%)
- `median_score` (float): Median score across all evaluated pairs
- `min_score` (float): Minimum score observed
- `max_score` (float): Maximum score observed
- `lowest_scoring_pairs` (List[ScoringPair]): Pairs that received the lowest scores

`ScoringPair`: Contains information about a specific scoring result:

- `pair_id` (str): Unique identifier for the pair
- `score_reason` (str): Explanation of why the pair received its score
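The evaluation examples earlier only print `eval_results_id`; the sketch below shows how these three types fit together when inspecting results. It reuses `plum_client`, `dataset_id`, and `metrics_id` from the earlier examples, and the 0.7 threshold is illustrative:

```python
# Inspect evaluation results via EvaluationResponse, MetricScore, and ScoringPair.
# Field names follow the descriptions above; the 0.7 threshold is illustrative.
evaluation_response = plum_client.evaluate(
    data_id=dataset_id,
    metrics_id=metrics_id
)

print(f"Evaluated {evaluation_response.pair_count} pairs")
for score in evaluation_response.scores:
    print(f"{score.metric}: mean={score.mean_score:.2f} "
          f"({score.ci_confidence:.0%} CI [{score.ci_low:.2f}, {score.ci_high:.2f}])")

    # Dig into the weakest pairs for metrics that score poorly
    if score.mean_score < 0.7:
        for scoring_pair in score.lowest_scoring_pairs:
            print(f"  Pair {scoring_pair.pair_id}: {scoring_pair.score_reason}")
```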
`Dataset`: Contains a complete dataset with all its pairs:

- `id` (str): Unique identifier for the dataset
- `data` (List[IOPair]): List of all input-output pairs in the dataset
- `system_prompt` (Optional[str]): The system prompt associated with the dataset
- `created_at` (Optional[str]): Timestamp when the dataset was created

`IOPair` (as returned by `get_dataset` and `get_pair`): Represents a single input-output pair:

- `id` (str): Unique identifier for the pair
- `input` (str): The input text
- `output` (str): The output text
- `metadata` (Optional[IOPairMeta]): Additional metadata about the pair
- `input_media` (Optional[bytes]): Optional media content
- `use_media_mime_type` (Optional[str]): MIME type for media content
- `human_critique` (Optional[str]): Human feedback on the pair
- `target_metric` (Optional[str]): Target metric for evaluation

`IOPairMeta`: Metadata for an input-output pair:

- `created_at` (Optional[str]): Timestamp when the pair was created
- `labels` (Optional[List[str]]): List of labels associated with the pair

`MetricsListResponse`: Response containing all available metrics:

- `metrics` (Dict[str, DetailedMetricsResponse]): Dictionary mapping metric IDs to detailed metric information
- `total_count` (int): Total number of available metrics

`DetailedMetricsResponse`: Detailed information about a specific metric:

- `metrics_id` (str): Unique identifier for the metric
- `definitions` (List[MetricDefinition]): List of all metric questions/definitions
- `system_prompt` (Optional[str]): System prompt associated with the metric
- `metric_count` (int): Number of questions in the metric
- `created_at` (Optional[str]): Timestamp when the metric was created

`MetricDefinition`: Individual metric question definition:

- `id` (str): Unique identifier for the metric question
- `name` (str): Display name of the metric question
- `description` (str): Detailed description of what the metric evaluates