AI Agent Workflow Evaluation Framework (v1.0.0)
A deterministic, CI/CD-ready framework for measuring how accurately AI agents follow complex, multi-step workflow instructions. It features a three-tier hybrid evaluation engine, progressive scoring, and dual-metric reporting to distinguish between agent failures and environmental blockers.
The Problem
When an AI agent executes a complex multi-phase workflow, you have no reliable way to know:
- Which instructions were followed vs. skipped.
- Where in the workflow compliance consistently breaks down (context decay).
- Whether a change to the workflow text actually improved agent behavior.
- Whether the agent failed to follow an instruction, or if the instruction was impossible to execute in the current environment.
Without measurement, you’re guessing. This framework gives you a reproducible, deterministic score.
1. Core Architecture
The framework operates on a strict separation of concerns:
- Specification (`checkpoints/<workflow>.yaml`): A declarative list of verifiable instructions, explicit environment constraints, and required evidence sources.
- Telemetry (`scripts/collect-trace.sh`): A hook that captures a JSONL trace of every tool call, including nested sub-agent spans.
- Evidence Collection (`scripts/collect-evidence.sh`): A pre-evaluation script that gathers external state (Git, issue trackers, CI status) into a static JSON object.
- Evaluation Engine (`/evaluate-workflow`): A hybrid engine that grades the execution against the specification.
The Three-Tier Evaluation Engine
To eliminate LLM hallucination on deterministic facts, evaluation is routed based on the complexity of the assertion:
- Tier 1 (State Assertions): 100% code-based. Checks binary states (e.g., “Branch exists”, “PR is draft”, “Label is present”) using the static API/Git evidence. Zero LLM involvement.
- Tier 2 (Sequence Assertions): Code-based rule engine. Validates temporal ordering of tool calls and trace events (e.g., “Read tool called on skill.md before Bash tool was executed”).
- Tier 3 (Semantic Judgment): LLM-based evaluation. Used exclusively for qualitative assessments (e.g., “PR description clearly explains the Why”).
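The tier-based routing can be pictured as a small dispatcher. This is an illustrative sketch only — the function names and checkpoint fields (`expect_key`, `expect_value`, `before`, `after`) are assumptions, not the framework's documented API:

```python
def check_state(cp: dict, evidence: dict) -> str:
    # Tier 1: binary state assertion against static API/Git evidence, zero LLM.
    return "PASS" if evidence.get(cp["expect_key"]) == cp["expect_value"] else "FAIL"

def check_sequence(cp: dict, trace: list[dict]) -> str:
    # Tier 2: temporal ordering of tool calls in the captured trace.
    names = [event["tool_name"] for event in trace]
    try:
        return "PASS" if names.index(cp["before"]) < names.index(cp["after"]) else "FAIL"
    except ValueError:
        return "FAIL"  # one of the tools never appeared in the trace

def evaluate(cp: dict, evidence: dict, trace: list[dict]) -> str:
    # Route deterministic assertions to code; only Tier 3 reaches an LLM.
    if cp["tier"] == 1:
        return check_state(cp, evidence)
    if cp["tier"] == 2:
        return check_sequence(cp, trace)
    raise NotImplementedError("Tier 3 requires the LLM judge")
```

The key design property is that Tiers 1 and 2 are pure code paths, so their results are reproducible run-to-run.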
2. The Checkpoint Specification
The YAML file is the source of truth for a workflow’s evaluation.
Authoring Rules
- ID Stability: Checkpoint IDs (e.g., `P1-01`) are immutable. Never reuse or renumber them; old reports reference these IDs.
- Append-Only: New checkpoints get the next available integer (e.g., `P1-04`).
- Implicit Ordering: Do not number steps. The YAML document order dictates the `position_index`, which the engine assigns automatically (1 to N) to track middle-of-workflow dropout.
- Declarative Routing: Every checkpoint must explicitly declare its `evidence_sources`.
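The implicit-ordering rule amounts to numbering checkpoints at load time. A minimal sketch (`assign_positions` is a hypothetical helper, not part of the framework):

```python
def assign_positions(checkpoints: list[dict]) -> list[dict]:
    # position_index mirrors YAML document order, assigned 1..N by the engine;
    # IDs stay stable even when checkpoints are appended later.
    for i, cp in enumerate(checkpoints, start=1):
        cp["position_index"] = i
    return checkpoints

cps = assign_positions([{"id": "P1-01"}, {"id": "P1-02"}, {"id": "P2-01"}])
```

Because IDs are append-only while positions are recomputed, historical reports keyed by ID remain valid even as the workflow grows.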
Specification Schema
`checkpoints/feature-workflow.yaml`:

```yaml
workflow: "feature-implementation"
spec_version: "1.0.0"

# --- Explicit Environment Constraints ---
# Used to detect when instructions are impossible to satisfy
environment_constraints:
  - id: ENV-01
    constraint: "CI checks do not run on draft PRs"
    affects: [P3-02]
  - id: ENV-02
    constraint: "Protected branches require linear history (no merge commits)"
    affects: [P4-01]

# --- Checkpoints ---
checkpoints:
  - id: P1-01
    tier: 1
    instruction: "Update task status to 'In Progress' before starting"
    severity: critical
    verification:
      method: api_state
      evidence_sources: [issue_tracker_state]
      details: "Issue tracker shows 'In Progress' state"

  - id: P2-01
    tier: 2
    instruction: "Read the phase skill file before writing code"
    severity: high
    verification:
      method: tool_sequence
      evidence_sources: [trace_file]
      details: "Read tool called on skill file prior to Bash/Write tools"

  - id: P3-01
    tier: 3
    instruction: "PR description must include WHAT, WHY, and HOW sections"
    severity: medium
    verification:
      method: content_analysis
      evidence_sources: [pr_body]
    # Progressive Partial Scoring (Mode A - Strict)
    partial_criteria:
      - condition: "Has WHAT and WHY, but missing HOW"
        weight: 0.7
      - condition: "Has WHAT only"
        weight: 0.3
```
3. The Evaluation Pipeline
A. Pre-flight Linting
Before evaluation begins, the framework statically analyzes the YAML to catch broken workflows early:
- Validates the schema and ensures `evidence_sources` are present.
- Checks for orphaned IDs in `environment_constraints`.
- Flags obvious logical paradoxes and duplicated checkpoint IDs.
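The lint pass can be sketched as a few checks over the parsed spec. These checks follow the rules listed above, but the function name and error-message format are assumptions, not the framework's actual linter:

```python
def lint_spec(spec: dict) -> list[str]:
    """Return a list of human-readable spec errors (empty list = clean)."""
    errors: list[str] = []
    checkpoints = spec.get("checkpoints", [])

    # Duplicated checkpoint IDs
    seen: set[str] = set()
    for cp in checkpoints:
        if cp["id"] in seen:
            errors.append(f"duplicate checkpoint id: {cp['id']}")
        seen.add(cp["id"])

    # Every checkpoint must explicitly declare evidence_sources
    for cp in checkpoints:
        if not cp.get("verification", {}).get("evidence_sources"):
            errors.append(f"{cp['id']}: missing evidence_sources")

    # Orphaned IDs in environment_constraints.affects
    for env in spec.get("environment_constraints", []):
        for target in env.get("affects", []):
            if target not in seen:
                errors.append(f"{env['id']}: affects unknown checkpoint {target}")

    return errors
```

Running this before any agent execution means a broken spec fails fast in CI, rather than surfacing mid-evaluation.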
B. Execution & Telemetry (Nested Traces)
A `PostToolUse` hook captures tool usage. To handle modern sub-agent delegation, it supports a lightweight OpenTelemetry-inspired JSONL structure:

```jsonl
{"timestamp":"2026-02-19T10:00:00Z","tool_name":"Task","span_type":"parent","trace_id":"abc123"}
{"timestamp":"2026-02-19T10:00:05Z","tool_name":"Read","span_type":"child","parent_trace_id":"abc123","args":"..."}
```
C. Evaluation & State Resolution
The engine evaluates each checkpoint and assigns one of the following states:
- PASS: Full point value.
- PARTIAL:
  - Mode A (Strict): If `partial_criteria` is defined, the evaluator MUST match one condition and use its weight.
  - Mode B (Default): If no criteria are defined, the weight defaults to 0.5, but the evaluator MUST populate the `notes` field explaining the deduction.
- FAIL: 0 points. The agent failed to comply.
- NOT_APPLICABLE (N/A): Excluded from scoring. Trigger conditions not met (e.g., traces unavailable).
- BLOCKED_BY_ENVIRONMENT: Excluded from the Compliance Score. Assigned if a checkpoint in a constraint's `affects` list fails AND that `environment_constraint` is active.
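The BLOCKED_BY_ENVIRONMENT rule reduces to a small reclassification step: a raw FAIL is upgraded when an active constraint claims the checkpoint. A minimal sketch, with illustrative names (the engine's internals are not documented here):

```python
def resolve_state(checkpoint_id: str, raw_result: str,
                  constraints: list[dict], active_env_ids: set[str]) -> str:
    """Reclassify a FAIL as BLOCKED_BY_ENVIRONMENT if an active constraint covers it."""
    if raw_result == "FAIL":
        for env in constraints:
            if env["id"] in active_env_ids and checkpoint_id in env.get("affects", []):
                return "BLOCKED_BY_ENVIRONMENT"
    return raw_result
```

This is what lets the report distinguish "the agent ignored the instruction" from "the instruction was impossible in this environment".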
D. Quality Assurance Protocols
- Optional Tier 3 Confidence Intervals: For A/B testing prompt changes, the user can run the LLM evaluator multiple times (e.g., `--eval-runs=3`). The framework reports the score range and variance to prove the score delta isn't just LLM temperature noise.
- Golden Set Calibration: A user manually annotates a "Golden Set" of 5 historical traces. The framework evaluates them; if the Tier 3 LLM evaluator's agreement with human ground truth drops below 90%, it flags a Calibration Warning.
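Aggregating repeated Tier 3 runs into a range/mean/variance summary is straightforward. In this sketch the confidence threshold (`max_spread`) is an assumed value, not one documented by the framework:

```python
import statistics

def summarize_runs(scores: list[float], max_spread: float = 5.0) -> dict:
    """Collapse N evaluator runs into the range/mean/variance reported upstream."""
    spread = max(scores) - min(scores)
    return {
        "runs": len(scores),
        "score_range": [min(scores), max(scores)],
        "mean": statistics.mean(scores),
        # Sample variance; 0.0 when only a single run was requested.
        "variance": statistics.variance(scores) if len(scores) > 1 else 0.0,
        "status": "HIGH_CONFIDENCE" if spread <= max_spread else "LOW_CONFIDENCE",
    }
```

A prompt change whose score delta exceeds the reported spread can then be treated as a real improvement rather than sampling noise.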
4. Scoring Formulas (Dual-Metric Reporting)
Agents should not be penalized for bad Standard Operating Procedures (SOPs). The framework calculates two distinct scores:
1. Compliance Score (Agent Health): How well the agent followed achievable instructions.
Compliance_Score = (Sum of Earned Weights / Sum of Possible Weights) * 100
(Note: Possible Weights strictly exclude N/A and BLOCKED checkpoints)
2. Specification Health (Workflow Health): What percentage of instructions were actually possible given the environment.
Spec_Health = ((Total Checkpoints - BLOCKED Checkpoints) / Total Checkpoints) * 100
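The two formulas can be sketched in code. One assumption here: each checkpoint contributes a possible weight of 1.0 (the document does not specify severity-based weighting), and the full `BLOCKED_BY_ENVIRONMENT` state name is used where the JSON report abbreviates it to `BLOCKED`:

```python
def dual_scores(results: list[dict]) -> tuple[float, float]:
    """Return (compliance_percent, spec_health_percent) for one evaluation run."""
    excluded = ("NOT_APPLICABLE", "BLOCKED_BY_ENVIRONMENT")
    scorable = [r for r in results if r["result"] not in excluded]

    # Compliance: earned vs possible weights over achievable checkpoints only.
    earned = sum(r.get("earned_weight", 0.0) for r in scorable)
    possible = float(len(scorable))  # assumes 1.0 possible weight per checkpoint
    compliance = 100.0 * earned / possible if possible else 0.0

    # Spec health: fraction of instructions that were possible at all.
    total = len(results)
    blocked = sum(1 for r in results if r["result"] == "BLOCKED_BY_ENVIRONMENT")
    spec_health = 100.0 * (total - blocked) / total if total else 0.0
    return compliance, spec_health
```

Note the asymmetry: a BLOCKED checkpoint lowers Spec Health (the workflow asked for something impossible) but leaves the Compliance Score untouched (the agent is not penalized).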
5. JSON Output Schema
The evaluator saves a markdown report and a machine-readable JSON file designed for historical diffing and regression detection.
```json
{
  "schema_version": "1.0.0",
  "workflow": "feature-implementation",
  "spec_version": "1.0.0",
  "evaluated_at": "2026-02-19T21:00:00Z",
  "scores": {
    "compliance_percent": 88.5,
    "spec_health_percent": 95.0,
    "rating": "Good"
  },
  "summary": {
    "total_checkpoints": 20,
    "passed": 16,
    "partial": 2,
    "failed": 1,
    "na": 0,
    "blocked": 1
  },
  "evaluator_audit": {
    "undocumented_passes": 0,
    "undocumented_fails": 0,
    "tier_3_checkpoints_count": 5,
    "tier_3_without_evidence_citation": 0
  },
  "tier_3_confidence": {
    "protocol_enabled": true,
    "runs": 3,
    "score_range": [87.0, 89.5],
    "mean": 88.25,
    "variance": 1.5,
    "status": "HIGH_CONFIDENCE"
  },
  "position_dropout_analysis": {
    "quartile_1": 100.0,
    "quartile_2": 95.0,
    "quartile_3": 75.0,
    "quartile_4": 80.0
  },
  "checkpoints": [
    {
      "id": "P3-01",
      "position_index": 12,
      "tier": 3,
      "severity": "medium",
      "result": "PARTIAL",
      "earned_weight": 0.7,
      "notes": "PR contains WHAT and WHY, but HOW section is absent. Matched partial_criteria index 0."
    },
    {
      "id": "P3-02",
      "position_index": 13,
      "tier": 1,
      "severity": "high",
      "result": "BLOCKED",
      "earned_weight": 0,
      "notes": "Agent failed to trigger CI. Flagged as BLOCKED due to ENV-01 constraint (PR is in Draft state)."
    }
  ]
}
```