FEB 2026

AI Agent Workflow Evaluation Framework (v1.0.0)

A deterministic, CI/CD-ready framework for measuring how accurately AI agents follow complex, multi-step workflow instructions. Features a three-tier hybrid evaluation engine, progressive scoring, and dual-metric reporting to distinguish agent failures from environmental blockers.

Figure: circular gauge showing PASS, PARTIAL, and FAIL segments representing the AI workflow compliance score.


The Problem

When an AI agent executes a complex multi-phase workflow, you have no reliable way to know:

  • Which instructions were followed vs. skipped.
  • Where in the workflow compliance consistently breaks down (context decay).
  • Whether a change to the workflow text actually improved agent behavior.
  • Whether the agent failed to follow an instruction, or if the instruction was impossible to execute in the current environment.

Without measurement, you’re guessing. This framework gives you a reproducible, deterministic score.

1. Core Architecture

The framework operates on a strict separation of concerns:

  1. Specification (checkpoints/<workflow>.yaml): A declarative list of verifiable instructions, explicit environment constraints, and required evidence sources.
  2. Telemetry (scripts/collect-trace.sh): A hook that captures a JSONL trace of every tool call, including nested sub-agent spans.
  3. Evidence Collection (scripts/collect-evidence.sh): A pre-evaluation script that gathers external state (Git, Issue Trackers, CI status) into a static JSON object.
  4. Evaluation Engine (/evaluate-workflow): A hybrid engine that grades the execution against the specification.

The Three-Tier Evaluation Engine

To eliminate LLM hallucination on deterministic facts, evaluation is routed based on the complexity of the assertion:

  • Tier 1 (State Assertions): 100% code-based. Checks binary states (e.g., “Branch exists”, “PR is draft”, “Label is present”) using the static API/Git evidence. Zero LLM involvement.
  • Tier 2 (Sequence Assertions): Code-based rule engine. Validates temporal ordering of tool calls and trace events (e.g., “Read tool called on skill.md before Bash tool was executed”).
  • Tier 3 (Semantic Judgment): LLM-based evaluation. Used exclusively for qualitative assessments (e.g., “PR description clearly explains the ‘Why’”).
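The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual internals: the function name, the evidence dictionary shape, and the `order` field on Tier 2 checkpoints are all assumptions made for this example.

```python
# Illustrative sketch: deterministic tiers (1-2) run as pure functions over
# static evidence; only Tier 3 would ever reach an LLM judge.
def route_checkpoint(checkpoint, evidence):
    """Dispatch a checkpoint dict to a verifier based on its tier."""
    tier = checkpoint["tier"]
    if tier == 1:
        # State assertion: look up a boolean fact in the static evidence
        # (assumed shape: {source_name: {checkpoint_id: bool}}).
        source = checkpoint["verification"]["evidence_sources"][0]
        return bool(evidence.get(source, {}).get(checkpoint["id"], False))
    if tier == 2:
        # Sequence assertion: check temporal ordering of tool calls in the
        # trace. The ("before", "after") pair is a hypothetical field.
        trace = evidence.get("trace_file", [])
        before, after = checkpoint["verification"]["order"]
        i = next((k for k, e in enumerate(trace) if e["tool_name"] == before), None)
        j = next((k for k, e in enumerate(trace) if e["tool_name"] == after), None)
        return i is not None and j is not None and i < j
    # Tier 3 requires the LLM evaluator; deliberately not implemented here.
    raise NotImplementedError("Tier 3 requires the LLM evaluator")
```

The key design point is that Tiers 1 and 2 never touch a model: they are ordinary code paths that can be unit-tested and will return the same answer on every run.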

2. The Checkpoint Specification

The YAML file is the source of truth for a workflow’s evaluation.

Authoring Rules

  • ID Stability: Checkpoint IDs (e.g., P1-01) are immutable. Never reuse or renumber them. Old reports reference these IDs.
  • Append-Only: New checkpoints get the next available integer (e.g., P1-04).
  • Implicit Ordering: Do not number steps. The YAML document order dictates the position_index, which the engine assigns automatically (1 to N) to track middle-of-workflow dropout.
  • Declarative Routing: Every checkpoint must explicitly declare its evidence_sources.

Specification Schema

checkpoints/feature-workflow.yaml

workflow: "feature-implementation"
spec_version: "1.0.0"

# --- Explicit Environment Constraints ---
# Used to detect when instructions are impossible to satisfy
environment_constraints:
  - id: ENV-01
    constraint: "CI checks do not run on draft PRs"
    affects: [P3-02]
  - id: ENV-02
    constraint: "Protected branches require linear history (no merge commits)"
    affects: [P4-01]

# --- Checkpoints ---
checkpoints:
  - id: P1-01
    tier: 1
    instruction: "Update task status to 'In Progress' before starting"
    severity: critical
    verification:
      method: api_state
      evidence_sources: [issue_tracker_state]
      details: "Issue tracker shows 'In Progress' state"

  - id: P2-01
    tier: 2
    instruction: "Read the phase skill file before writing code"
    severity: high
    verification:
      method: tool_sequence
      evidence_sources: [trace_file]
      details: "Read tool called on skill file prior to Bash/Write tools"

  - id: P3-01
    tier: 3
    instruction: "PR description must include WHAT, WHY, and HOW sections"
    severity: medium
    verification:
      method: content_analysis
      evidence_sources: [pr_body]
    # Progressive Partial Scoring (Mode A - Strict)
    partial_criteria:
      - condition: "Has WHAT and WHY, but missing HOW"
        weight: 0.7
      - condition: "Has WHAT only"
        weight: 0.3

3. The Evaluation Pipeline

A. Pre-flight Linting

Before evaluation begins, the framework statically analyzes the YAML to catch broken workflows early:

  1. Validates schema and ensures evidence_sources are present.
  2. Checks for orphaned IDs in environment_constraints.
  3. Flags obvious logical paradoxes or duplicated checkpoint IDs.
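The first two lint checks can be sketched as straightforward dictionary walks over the parsed YAML. This is an illustrative reimplementation, not the framework's own linter; the function name and message strings are invented for the example.

```python
# Sketch of the pre-flight linter: duplicate IDs, missing evidence_sources,
# and orphaned IDs referenced by environment_constraints.
def lint_spec(spec):
    """Return a list of human-readable problems found in a parsed spec dict."""
    problems, seen = [], set()
    for cp in spec.get("checkpoints", []):
        if cp["id"] in seen:
            problems.append(f"duplicate checkpoint id: {cp['id']}")
        seen.add(cp["id"])
        if not cp.get("verification", {}).get("evidence_sources"):
            problems.append(f"{cp['id']}: missing evidence_sources")
    for env in spec.get("environment_constraints", []):
        for ref in env.get("affects", []):
            if ref not in seen:
                problems.append(f"{env['id']}: affects orphaned id {ref}")
    return problems
```

Running this before any agent execution means a broken spec fails fast in CI instead of producing a misleading score.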

B. Execution & Telemetry (Nested Traces)

A PostToolUse hook captures tool usage. To handle modern sub-agent delegation, it supports a lightweight OpenTelemetry-inspired JSONL structure:

{"timestamp":"2026-02-19T10:00:00Z","tool_name":"Task","span_type":"parent","trace_id":"abc123"}
{"timestamp":"2026-02-19T10:00:05Z","tool_name":"Read","span_type":"child","parent_trace_id":"abc123","args":"..."}

C. Evaluation & State Resolution

The engine evaluates each checkpoint and assigns one of the following states:

  • PASS: Full point value.
  • PARTIAL:
    • Mode A (Strict): If partial_criteria is defined, the evaluator MUST match one condition and use its weight.
    • Mode B (Default): If no criteria are defined, defaults to 0.5 weight, but the evaluator MUST populate the notes field explaining the deduction.
  • FAIL: 0 points. The agent failed to comply.
  • NOT_APPLICABLE (N/A): Excluded from scoring. Trigger conditions not met (e.g., traces unavailable).
  • BLOCKED_BY_ENVIRONMENT: Excluded from Compliance Score. Assigned if a checkpoint in the affects list fails AND its corresponding environment_constraint is active.
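The BLOCKED_BY_ENVIRONMENT rule is the one place where two inputs interact, so it is worth pinning down. A sketch of the resolution step, where the `active` flag on a constraint is an assumption of this example (the post does not specify how constraint activity is detected):

```python
# Sketch: a raw FAIL is reclassified as BLOCKED_BY_ENVIRONMENT when an
# active environment constraint lists the checkpoint in its affects array.
def resolve_state(checkpoint_id, raw_result, constraints):
    """Map a raw PASS/PARTIAL/FAIL result to its final reported state."""
    if raw_result == "FAIL":
        for env in constraints:
            # "active" is a hypothetical field meaning the constraint held
            # during this run (e.g., the PR really was a draft).
            if env.get("active") and checkpoint_id in env.get("affects", []):
                return "BLOCKED_BY_ENVIRONMENT"
    return raw_result
```

Note that only failures are ever reclassified: a PASS on an affected checkpoint still counts as a PASS.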

D. Quality Assurance Protocols

  • Optional Tier 3 Confidence Intervals: For A/B testing prompt changes, the user can run the LLM evaluator multiple times (e.g., --eval-runs=3). The framework reports the score range and variance to prove the score delta isn’t just LLM temperature noise.
  • Golden Set Calibration: A user manually annotates a “Golden Set” of 5 historical traces. The framework evaluates them. If the Tier 3 LLM evaluator’s agreement with human ground-truth drops below 90%, it flags a Calibration Warning.
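The confidence-interval aggregation is just descriptive statistics over repeated runs. A sketch using the standard library, where the variance threshold for the status label is an invented value, not one the framework documents:

```python
from statistics import mean, pvariance

# Sketch of --eval-runs aggregation: report range, mean, and variance so a
# score delta can be separated from LLM sampling noise.
def summarize_runs(scores, variance_threshold=2.0):
    """Summarize repeated Tier 3 evaluation scores."""
    var = pvariance(scores)
    return {
        "runs": len(scores),
        "score_range": [min(scores), max(scores)],
        "mean": mean(scores),
        "variance": var,
        # Threshold is illustrative only.
        "status": "HIGH_CONFIDENCE" if var <= variance_threshold else "NOISY",
    }
```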

4. Scoring Formulas (Dual-Metric Reporting)

Agents should not be penalized for bad Standard Operating Procedures (SOPs). The framework calculates two distinct scores:

1. Compliance Score (Agent Health): How well the agent followed achievable instructions.

Compliance_Score = (Sum of Earned Weights / Sum of Possible Weights) * 100

(Note: Possible Weights strictly exclude N/A and BLOCKED checkpoints)

2. Specification Health (Workflow Health): What percentage of instructions were actually possible given the environment.

Spec_Health = ((Total Checkpoints - BLOCKED Checkpoints) / Total Checkpoints) * 100
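Both formulas can be computed from the per-checkpoint results alone. A sketch, assuming each checkpoint's maximum weight is 1.0 unless a (hypothetical) `max_weight` field says otherwise:

```python
# Sketch of the dual-metric formulas: compliance excludes N/A and BLOCKED
# checkpoints from the denominator; spec health counts BLOCKED against the spec.
def score(checkpoints):
    """checkpoints: list of dicts with 'result' and 'earned_weight'."""
    excluded = ("NOT_APPLICABLE", "BLOCKED_BY_ENVIRONMENT")
    scorable = [c for c in checkpoints if c["result"] not in excluded]
    possible = sum(c.get("max_weight", 1.0) for c in scorable)
    earned = sum(c["earned_weight"] for c in scorable)
    blocked = sum(1 for c in checkpoints if c["result"] == "BLOCKED_BY_ENVIRONMENT")
    total = len(checkpoints)
    return {
        "compliance_percent": round(earned / possible * 100, 1) if possible else 0.0,
        "spec_health_percent": round((total - blocked) / total * 100, 1) if total else 0.0,
    }
```

The split matters in practice: an agent that passes every achievable checkpoint under a broken SOP scores 100% compliance but a low spec health, pointing the fix at the workflow text rather than the agent.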

5. JSON Output Schema

The evaluator saves a markdown report and a machine-readable JSON file designed for historical diffing and regression detection.

{
  "schema_version": "1.0.0",
  "workflow": "feature-implementation",
  "spec_version": "1.0.0",
  "evaluated_at": "2026-02-19T21:00:00Z",

  "scores": {
    "compliance_percent": 88.5,
    "spec_health_percent": 95.0,
    "rating": "Good"
  },

  "summary": {
    "total_checkpoints": 20,
    "passed": 16,
    "partial": 2,
    "failed": 1,
    "na": 0,
    "blocked": 1
  },

  "evaluator_audit": {
    "undocumented_passes": 0,
    "undocumented_fails": 0,
    "tier_3_checkpoints_count": 5,
    "tier_3_without_evidence_citation": 0
  },

  "tier_3_confidence": {
    "protocol_enabled": true,
    "runs": 3,
    "score_range": [87.0, 89.5],
    "mean": 88.25,
    "variance": 1.5,
    "status": "HIGH_CONFIDENCE"
  },

  "position_dropout_analysis": {
    "quartile_1": 100.0,
    "quartile_2": 95.0,
    "quartile_3": 75.0,
    "quartile_4": 80.0
  },

  "checkpoints": [
    {
      "id": "P3-01",
      "position_index": 12,
      "tier": 3,
      "severity": "medium",
      "result": "PARTIAL",
      "earned_weight": 0.7,
      "notes": "PR contains WHAT and WHY, but HOW section is absent. Matched partial_criteria index 0."
    },
    {
      "id": "P3-02",
      "position_index": 13,
      "tier": 1,
      "severity": "high",
      "result": "BLOCKED",
      "earned_weight": 0,
      "notes": "Agent failed to trigger CI. Flagged as BLOCKED due to ENV-01 constraint (PR is in Draft state)."
    }
  ]
}
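Because checkpoint IDs are immutable, historical diffing reduces to comparing results per ID across two report files. A sketch of regression detection over the schema above; the rank ordering of states (in particular where BLOCKED sits) is an illustrative choice, not something the framework prescribes:

```python
# Sketch: flag checkpoints whose result got worse between two reports
# that follow the JSON schema above. Rank values are illustrative.
RESULT_RANK = {"PASS": 3, "PARTIAL": 2, "BLOCKED": 1, "FAIL": 0}

def regressions(old_report, new_report):
    """Return IDs whose result rank dropped between two report dicts."""
    old = {c["id"]: c["result"] for c in old_report["checkpoints"]}
    return [
        c["id"] for c in new_report["checkpoints"]
        if c["id"] in old
        and RESULT_RANK.get(c["result"], 0) < RESULT_RANK.get(old[c["id"]], 0)
    ]
```

Wired into CI, a non-empty result from a check like this can fail the build, turning workflow compliance into a regression-tested property of the system.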