Azure OpenAI graders are a new set of evaluation tools in the Microsoft Foundry SDK that evaluate the performance of AI models and their outputs. These graders include:
| Grader | What it measures | Required parameters | Output |
|---|---|---|---|
| `label_model` | Classifies text into predefined categories | `model`, `input`, `labels`, `passing_labels` | Pass/fail based on label |
| `score_model` | Assigns a numeric score based on criteria | `model`, `input`, `range`, `pass_threshold` | 0-1 float |
| `string_check` | Exact or pattern string matching | `input`, `reference`, `operation` | Pass/fail |
| `text_similarity` | Similarity between two text strings | `input`, `reference`, `evaluation_metric`, `pass_threshold` | 0-1 float |
You can run graders locally or remotely. Each grader assesses specific aspects of AI models and their outputs.
## Using Azure OpenAI graders
Azure OpenAI graders provide flexible evaluation using LLM-based or deterministic approaches:
- **Model-based graders** (`label_model`, `score_model`): use an LLM to evaluate outputs.
- **Deterministic graders** (`string_check`, `text_similarity`): use algorithmic comparison.
The following sections show an example configuration for each grader. For details on running evaluations and configuring data sources, see Run evaluations from the SDK.
## Example input
Your test dataset should contain the fields referenced in your grader configurations.
```jsonl
{"query": "What is the weather like today?", "response": "It's sunny and warm with clear skies.", "ground_truth": "Today is sunny with temperatures around 75°F."}
{"query": "Summarize the meeting notes.", "response": "The team discussed Q3 goals and assigned action items.", "ground_truth": "Meeting covered quarterly objectives and task assignments."}
```
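The `{{item.field}}` placeholders used in the grader configurations below refer to fields of each dataset row. The following sketch illustrates the idea with a hypothetical local `render` helper (the service performs its own templating, so this is for intuition only, assuming simple string substitution):

```python
import json
import re

# Hypothetical helper: substitute {{item.<field>}} placeholders with
# values from a single JSONL row. Missing fields become empty strings.
def render(template: str, item: dict) -> str:
    return re.sub(
        r"\{\{item\.(\w+)\}\}",
        lambda m: str(item.get(m.group(1), "")),
        template,
    )

row = json.loads(
    '{"query": "What is the weather like today?", '
    '"response": "It\'s sunny and warm with clear skies."}'
)
print(render("Statement: {{item.query}}", row))
# Statement: What is the weather like today?
```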
## Label grader
The label grader (label_model) uses an LLM to classify text into predefined categories. Use it for sentiment analysis, content classification, or any multi-class labeling task.
```python
{
    "type": "label_model",
    "name": "sentiment_check",
    "model": model_deployment,
    "input": [
        {"role": "developer", "content": "Classify the sentiment as 'positive', 'neutral', or 'negative'"},
        {"role": "user", "content": "Statement: {{item.query}}"},
    ],
    "labels": ["positive", "neutral", "negative"],
    "passing_labels": ["positive", "neutral"],
}
```
Output: Returns the assigned label from your defined set. The grader passes if the label is in passing_labels.
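The pass rule can be sketched locally with a hypothetical helper (not part of the SDK): the grader passes exactly when the model-assigned label appears in `passing_labels`.

```python
# Hypothetical helper illustrating the label_model pass rule.
def label_passes(assigned_label: str, passing_labels: list[str]) -> bool:
    # Pass when the model-assigned label is one of passing_labels.
    return assigned_label in passing_labels

print(label_passes("neutral", ["positive", "neutral"]))   # True
print(label_passes("negative", ["positive", "neutral"]))  # False
```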
## Score grader
The score grader (score_model) uses an LLM to assign a numeric score to model outputs, reflecting quality, correctness, or similarity to a reference. Use it for nuanced evaluation requiring reasoning.
```python
{
    "type": "score_model",
    "name": "quality_score",
    "model": model_deployment,
    "input": [
        {"role": "system", "content": "Rate the response quality from 0 to 1. 1 = perfect, 0 = completely wrong."},
        {"role": "user", "content": "Response: {{item.response}}\nGround Truth: {{item.ground_truth}}"},
    ],
    "pass_threshold": 0.7,
    "range": [0, 1]
}
```
Output: Returns a float score (for example, 0.85). The grader passes if the score meets or exceeds pass_threshold.
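The threshold comparison is inclusive, which a hypothetical local helper (not part of the SDK) makes explicit:

```python
# Hypothetical helper illustrating the score_model pass rule.
def score_passes(score: float, pass_threshold: float) -> bool:
    # Pass when the score meets or exceeds the threshold (inclusive).
    return score >= pass_threshold

print(score_passes(0.85, 0.7))  # True
print(score_passes(0.7, 0.7))   # True (meets the threshold)
print(score_passes(0.69, 0.7))  # False
```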
## String check grader
The string check grader (string_check) performs deterministic string comparisons. Use it for exact match validation where responses must match a reference exactly.
```python
{
    "type": "string_check",
    "name": "exact_match",
    "input": "{{item.response}}",
    "reference": "{{item.ground_truth}}",
    "operation": "eq",
}
```
Operations:
| Operation | Description |
|---|---|
| `eq` | Exact match (case-sensitive) |
| `ne` | Not equal |
| `like` | Pattern match with wildcards |
| `ilike` | Case-insensitive pattern match |
Output: Returns a score of 1 for match, 0 for no match.
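Because these operations are deterministic, they can be approximated locally. The sketch below is an illustrative reimplementation, not the service's code; it assumes `*`-style wildcards for `like`/`ilike`, which may differ from the service's pattern syntax.

```python
from fnmatch import fnmatchcase

# Illustrative local reimplementation of the four string_check operations.
def string_check(input_text: str, reference: str, operation: str) -> bool:
    if operation == "eq":      # exact match, case-sensitive
        return input_text == reference
    if operation == "ne":      # not equal
        return input_text != reference
    if operation == "like":    # wildcard pattern, case-sensitive
        return fnmatchcase(input_text, reference)
    if operation == "ilike":   # wildcard pattern, case-insensitive
        return fnmatchcase(input_text.lower(), reference.lower())
    raise ValueError(f"unknown operation: {operation}")

print(string_check("Hello world", "Hello world", "eq"))  # True
print(string_check("Hello world", "hello*", "ilike"))    # True
```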
## Text similarity grader
The text similarity grader (text_similarity) compares two text strings using similarity metrics. Use it for open-ended or paraphrase matching where exact match is too strict.
```python
{
    "type": "text_similarity",
    "name": "similarity_check",
    "input": "{{item.response}}",
    "reference": "{{item.ground_truth}}",
    "evaluation_metric": "bleu",
    "pass_threshold": 0.8,
}
```
Metrics:
| Metric | Description |
|---|---|
| `fuzzy_match` | Approximate string matching using edit distance |
| `bleu` | N-gram overlap score, commonly used for translation |
| `gleu` | Google's variant of BLEU with sentence-level scoring |
| `meteor` | Alignment-based metric considering synonyms and paraphrases |
| `cosine` | Cosine similarity on vectorized text |
| `rouge_*` | N-gram overlap variants (`rouge_1`, `rouge_2`, ..., `rouge_l`) |
Output: Returns a similarity score as a float (higher means more similar). The grader passes if the score meets or exceeds pass_threshold.
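As a rough local stand-in for `fuzzy_match`, an edit-distance-style ratio can be computed with Python's standard library. This is an assumption for illustration; the service's metric may differ in detail.

```python
from difflib import SequenceMatcher

# Illustrative stand-in for fuzzy_match: a 0-1 similarity ratio
# based on matching subsequences (edit-distance flavored).
def fuzzy_similarity(input_text: str, reference: str) -> float:
    return SequenceMatcher(None, input_text, reference).ratio()

score = fuzzy_similarity(
    "The team discussed Q3 goals and assigned action items.",
    "Meeting covered quarterly objectives and task assignments.",
)
# Pass when the score meets or exceeds pass_threshold.
print(score, score >= 0.8)
```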
## Example output
Graders return results with pass/fail status. Key output fields:
```json
{
    "type": "score_model",
    "name": "quality_score",
    "score": 0.85,
    "passed": true
}
```
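Because every grader result carries a `passed` field, results are easy to aggregate. A minimal sketch, assuming results are plain dictionaries in the shape shown above (`pass_rate` is a hypothetical helper, not part of the SDK):

```python
# Hypothetical aggregation: fraction of grader results that passed.
def pass_rate(results: list[dict]) -> float:
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("passed")) / len(results)

results = [
    {"type": "score_model", "name": "quality_score", "score": 0.85, "passed": True},
    {"type": "score_model", "name": "quality_score", "score": 0.55, "passed": False},
]
print(pass_rate(results))  # 0.5
```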