FEB 2026

AI Agent Workflow Evaluation Framework (v1.0.0)

A deterministic, CI/CD-ready framework for measuring how accurately AI agents follow complex, multi-step workflow instructions. Features a three-tier hybrid evaluation engine, progressive scoring, and dual-metric reporting to distinguish agent failures from environmental blockers.

Figure: circular gauge showing PASS, PARTIAL, and FAIL segments representing the AI workflow compliance score.


The Problem

When an AI agent executes a complex multi-phase workflow, you have no reliable way to know:

  • Which instructions were followed vs. skipped.
  • Where in the workflow compliance consistently breaks down (context decay).
  • Whether a change to the workflow text actually improved agent behavior.
  • Whether the agent failed to follow an instruction, or if the instruction was impossible to execute in the current environment.

Without measurement, you’re guessing. This framework gives you a reproducible, deterministic score.

1. Core Architecture

The framework operates on a strict separation of concerns:

  1. Specification (checkpoints/<workflow>.yaml): A declarative list of verifiable instructions, explicit environment constraints, and required evidence sources.
  2. Telemetry (scripts/collect-trace.sh): A hook that captures a JSONL trace of every tool call, including nested sub-agent spans.
  3. Evidence Collection (scripts/collect-evidence.sh): A pre-evaluation script that gathers external state (Git, Issue Trackers, CI status) into a static JSON object.
  4. Evaluation Engine (/evaluate-workflow): A hybrid engine that grades the execution against the specification.

The Three-Tier Evaluation Engine

To eliminate LLM hallucination on deterministic facts, evaluation is routed based on the complexity of the assertion:

  • Tier 1 (State Assertions): 100% code-based. Checks binary states (e.g., “Branch exists”, “PR is draft”, “Label is present”) using the static API/Git evidence. Zero LLM involvement.
  • Tier 2 (Sequence Assertions): Code-based rule engine. Validates temporal ordering of tool calls and trace events (e.g., “Read tool called on skill.md before Bash tool was executed”).
  • Tier 3 (Semantic Judgment): LLM-based evaluation. Used exclusively for qualitative assessments (e.g., “PR description clearly explains the ‘Why’”).
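The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual internals: the function name, the evidence dictionary shape, and the `order` field on Tier 2 checkpoints are all assumptions made for this example.

```python
# Illustrative sketch: deterministic tiers (1-2) run as pure functions over
# static evidence; only Tier 3 would ever reach an LLM judge.
def route_checkpoint(checkpoint, evidence):
    """Dispatch a checkpoint dict to a verifier based on its tier."""
    tier = checkpoint["tier"]
    if tier == 1:
        # State assertion: look up a boolean fact in the static evidence
        # (assumed shape: {source_name: {checkpoint_id: bool}}).
        source = checkpoint["verification"]["evidence_sources"][0]
        return bool(evidence.get(source, {}).get(checkpoint["id"], False))
    if tier == 2:
        # Sequence assertion: check temporal ordering of tool calls in the
        # trace. The ("before", "after") pair is a hypothetical field.
        trace = evidence.get("trace_file", [])
        before, after = checkpoint["verification"]["order"]
        i = next((k for k, e in enumerate(trace) if e["tool_name"] == before), None)
        j = next((k for k, e in enumerate(trace) if e["tool_name"] == after), None)
        return i is not None and j is not None and i < j
    # Tier 3 requires the LLM evaluator; deliberately not implemented here.
    raise NotImplementedError("Tier 3 requires the LLM evaluator")
```

The key design point is that Tiers 1 and 2 never touch a model: they are ordinary code paths that can be unit-tested and will return the same answer on every run.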

2. The Checkpoint Specification

The YAML file is the source of truth for a workflow’s evaluation.

Authoring Rules

  • ID Stability: Checkpoint IDs (e.g., P1-01) are immutable. Never reuse or renumber them. Old reports reference these IDs.
  • Append-Only: New checkpoints get the next available integer (e.g., P1-04).
  • Implicit Ordering: Do not number steps. The YAML document order dictates the position_index, which the engine assigns automatically (1 to N) to track middle-of-workflow dropout.
  • Declarative Routing: Every checkpoint must explicitly declare its evidence_sources.

Specification Schema

checkpoints/feature-workflow.yaml

workflow: "feature-implementation"
spec_version: "1.0.0"

# --- Explicit Environment Constraints ---
# Used to detect when instructions are impossible to satisfy
environment_constraints:
  - id: ENV-01
    constraint: "CI checks do not run on draft PRs"
    affects: [P3-02]
  - id: ENV-02
    constraint: "Protected branches require linear history (no merge commits)"
    affects: [P4-01]

# --- Checkpoints ---
checkpoints:
  - id: P1-01
    tier: 1
    instruction: "Update task status to 'In Progress' before starting"
    severity: critical
    verification:
      method: api_state
      evidence_sources: [issue_tracker_state]
      details: "Issue tracker shows 'In Progress' state"

  - id: P2-01
    tier: 2
    instruction: "Read the phase skill file before writing code"
    severity: high
    verification:
      method: tool_sequence
      evidence_sources: [trace_file]
      details: "Read tool called on skill file prior to Bash/Write tools"

  - id: P3-01
    tier: 3
    instruction: "PR description must include WHAT, WHY, and HOW sections"
    severity: medium
    verification:
      method: content_analysis
      evidence_sources: [pr_body]
    # Progressive Partial Scoring (Mode A - Strict)
    partial_criteria:
      - condition: "Has WHAT and WHY, but missing HOW"
        weight: 0.7
      - condition: "Has WHAT only"
        weight: 0.3

3. The Evaluation Pipeline

A. Pre-flight Linting

Before evaluation begins, the framework statically analyzes the YAML to catch broken workflows early:

  1. Validates schema and ensures evidence_sources are present.
  2. Checks for orphaned IDs in environment_constraints.
  3. Flags obvious logical paradoxes or duplicated checkpoint IDs.
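The first two lint checks can be sketched as straightforward dictionary walks over the parsed YAML. This is an illustrative reimplementation, not the framework's own linter; the function name and message strings are invented for the example.

```python
# Sketch of the pre-flight linter: duplicate IDs, missing evidence_sources,
# and orphaned IDs referenced by environment_constraints.
def lint_spec(spec):
    """Return a list of human-readable problems found in a parsed spec dict."""
    problems, seen = [], set()
    for cp in spec.get("checkpoints", []):
        if cp["id"] in seen:
            problems.append(f"duplicate checkpoint id: {cp['id']}")
        seen.add(cp["id"])
        if not cp.get("verification", {}).get("evidence_sources"):
            problems.append(f"{cp['id']}: missing evidence_sources")
    for env in spec.get("environment_constraints", []):
        for ref in env.get("affects", []):
            if ref not in seen:
                problems.append(f"{env['id']}: affects orphaned id {ref}")
    return problems
```

Running this before any agent execution means a broken spec fails fast in CI instead of producing a misleading score.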

B. Execution & Telemetry (Nested Traces)

A PostToolUse hook captures tool usage. To handle modern sub-agent delegation, it supports a lightweight OpenTelemetry-inspired JSONL structure:

{"timestamp":"2026-02-19T10:00:00Z","tool_name":"Task","span_type":"parent","trace_id":"abc123"}
{"timestamp":"2026-02-19T10:00:05Z","tool_name":"Read","span_type":"child","parent_trace_id":"abc123","args":"..."}

C. Evaluation & State Resolution

The engine evaluates each checkpoint and assigns one of the following states:

  • PASS: Full point value.
  • PARTIAL:
    • Mode A (Strict): If partial_criteria is defined, the evaluator MUST match one condition and use its weight.
    • Mode B (Default): If no criteria are defined, defaults to 0.5 weight, but the evaluator MUST populate the notes field explaining the deduction.
  • FAIL: 0 points. The agent failed to comply.
  • NOT_APPLICABLE (N/A): Excluded from scoring. Trigger conditions not met (e.g., traces unavailable).
  • BLOCKED_BY_ENVIRONMENT: Excluded from Compliance Score. Assigned if a checkpoint in the affects list fails AND its corresponding environment_constraint is active.
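The BLOCKED_BY_ENVIRONMENT rule is the one place where two inputs interact, so it is worth pinning down. A sketch of the resolution step, where the `active` flag on a constraint is an assumption of this example (the post does not specify how constraint activity is detected):

```python
# Sketch: a raw FAIL is reclassified as BLOCKED_BY_ENVIRONMENT when an
# active environment constraint lists the checkpoint in its affects array.
def resolve_state(checkpoint_id, raw_result, constraints):
    """Map a raw PASS/PARTIAL/FAIL result to its final reported state."""
    if raw_result == "FAIL":
        for env in constraints:
            # "active" is a hypothetical field meaning the constraint held
            # during this run (e.g., the PR really was a draft).
            if env.get("active") and checkpoint_id in env.get("affects", []):
                return "BLOCKED_BY_ENVIRONMENT"
    return raw_result
```

Note that only failures are ever reclassified: a PASS on an affected checkpoint still counts as a PASS.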

D. Quality Assurance Protocols

  • Optional Tier 3 Confidence Intervals: For A/B testing prompt changes, the user can run the LLM evaluator multiple times (e.g., --eval-runs=3). The framework reports the score range and variance to prove the score delta isn’t just LLM temperature noise.
  • Golden Set Calibration: A user manually annotates a “Golden Set” of 5 historical traces. The framework evaluates them. If the Tier 3 LLM evaluator’s agreement with human ground-truth drops below 90%, it flags a Calibration Warning.
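The confidence-interval aggregation is just descriptive statistics over repeated runs. A sketch using the standard library, where the variance threshold for the status label is an invented value, not one the framework documents:

```python
from statistics import mean, pvariance

# Sketch of --eval-runs aggregation: report range, mean, and variance so a
# score delta can be separated from LLM sampling noise.
def summarize_runs(scores, variance_threshold=2.0):
    """Summarize repeated Tier 3 evaluation scores."""
    var = pvariance(scores)
    return {
        "runs": len(scores),
        "score_range": [min(scores), max(scores)],
        "mean": mean(scores),
        "variance": var,
        # Threshold is illustrative only.
        "status": "HIGH_CONFIDENCE" if var <= variance_threshold else "NOISY",
    }
```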

4. Scoring Formulas (Dual-Metric Reporting)

Agents should not be penalized for bad Standard Operating Procedures (SOPs). The framework calculates two distinct scores:

1. Compliance Score (Agent Health): How well the agent followed achievable instructions.

Compliance_Score = (Sum of Earned Weights / Sum of Possible Weights) * 100

(Note: Possible Weights strictly exclude N/A and BLOCKED checkpoints)

2. Specification Health (Workflow Health): What percentage of instructions were actually possible given the environment.

Spec_Health = ((Total Checkpoints - BLOCKED Checkpoints) / Total Checkpoints) * 100
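Both formulas can be computed from the per-checkpoint results alone. A sketch, assuming each checkpoint's maximum weight is 1.0 unless a (hypothetical) `max_weight` field says otherwise:

```python
# Sketch of the dual-metric formulas: compliance excludes N/A and BLOCKED
# checkpoints from the denominator; spec health counts BLOCKED against the spec.
def score(checkpoints):
    """checkpoints: list of dicts with 'result' and 'earned_weight'."""
    excluded = ("NOT_APPLICABLE", "BLOCKED_BY_ENVIRONMENT")
    scorable = [c for c in checkpoints if c["result"] not in excluded]
    possible = sum(c.get("max_weight", 1.0) for c in scorable)
    earned = sum(c["earned_weight"] for c in scorable)
    blocked = sum(1 for c in checkpoints if c["result"] == "BLOCKED_BY_ENVIRONMENT")
    total = len(checkpoints)
    return {
        "compliance_percent": round(earned / possible * 100, 1) if possible else 0.0,
        "spec_health_percent": round((total - blocked) / total * 100, 1) if total else 0.0,
    }
```

The split matters in practice: an agent that passes every achievable checkpoint under a broken SOP scores 100% compliance but a low spec health, pointing the fix at the workflow text rather than the agent.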

5. JSON Output Schema

The evaluator saves a markdown report and a machine-readable JSON file designed for historical diffing and regression detection.

{
  "schema_version": "1.0.0",
  "workflow": "feature-implementation",
  "spec_version": "1.0.0",
  "evaluated_at": "2026-02-19T21:00:00Z",

  "scores": {
    "compliance_percent": 88.5,
    "spec_health_percent": 95.0,
    "rating": "Good"
  },

  "summary": {
    "total_checkpoints": 20,
    "passed": 16,
    "partial": 2,
    "failed": 1,
    "na": 0,
    "blocked": 1
  },

  "evaluator_audit": {
    "undocumented_passes": 0,
    "undocumented_fails": 0,
    "tier_3_checkpoints_count": 5,
    "tier_3_without_evidence_citation": 0
  },

  "tier_3_confidence": {
    "protocol_enabled": true,
    "runs": 3,
    "score_range": [87.0, 89.5],
    "mean": 88.25,
    "variance": 1.5,
    "status": "HIGH_CONFIDENCE"
  },

  "position_dropout_analysis": {
    "quartile_1": 100.0,
    "quartile_2": 95.0,
    "quartile_3": 75.0,
    "quartile_4": 80.0
  },

  "checkpoints": [
    {
      "id": "P3-01",
      "position_index": 12,
      "tier": 3,
      "severity": "medium",
      "result": "PARTIAL",
      "earned_weight": 0.7,
      "notes": "PR contains WHAT and WHY, but HOW section is absent. Matched partial_criteria index 0."
    },
    {
      "id": "P3-02",
      "position_index": 13,
      "tier": 1,
      "severity": "high",
      "result": "BLOCKED",
      "earned_weight": 0,
      "notes": "Agent failed to trigger CI. Flagged as BLOCKED due to ENV-01 constraint (PR is in Draft state)."
    }
  ]
}
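Because checkpoint IDs are immutable, historical diffing reduces to comparing results per ID across two report files. A sketch of regression detection over the schema above; the rank ordering of states (in particular where BLOCKED sits) is an illustrative choice, not something the framework prescribes:

```python
# Sketch: flag checkpoints whose result got worse between two reports
# that follow the JSON schema above. Rank values are illustrative.
RESULT_RANK = {"PASS": 3, "PARTIAL": 2, "BLOCKED": 1, "FAIL": 0}

def regressions(old_report, new_report):
    """Return IDs whose result rank dropped between two report dicts."""
    old = {c["id"]: c["result"] for c in old_report["checkpoints"]}
    return [
        c["id"] for c in new_report["checkpoints"]
        if c["id"] in old
        and RESULT_RANK.get(c["result"], 0) < RESULT_RANK.get(old[c["id"]], 0)
    ]
```

Wired into CI, a non-empty result from a check like this can fail the build, turning workflow compliance into a regression-tested property of the system.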