
A deterministic, CI/CD-ready framework for measuring how accurately AI agents follow complex, multi-step workflow instructions. It features a three-tier hybrid evaluation engine, progressive scoring, and dual-metric reporting to distinguish between agent failures and environmental blockers.

## The Problem

When an AI agent executes a complex multi-phase workflow, you have no reliable way to know:

* Which instructions were followed vs. skipped.
* Where in the workflow compliance consistently breaks down (context decay).
* Whether a change to the workflow text actually improved agent behavior.
* Whether the agent failed to follow an instruction, or if the instruction was impossible to execute in the current environment.

Without measurement, you're guessing. This framework gives you a reproducible, deterministic score.

## 1. Core Architecture

The framework operates on a strict separation of concerns:

1. **Specification (`checkpoints/<workflow>.yaml`)**: A declarative list of verifiable instructions, explicit environment constraints, and required evidence sources.
2. **Telemetry (`scripts/collect-trace.sh`)**: A hook that captures a JSONL trace of every tool call, including nested sub-agent spans.
3. **Evidence Collection (`scripts/collect-evidence.sh`)**: A pre-evaluation script that gathers external state (Git, Issue Trackers, CI status) into a static JSON object.
4. **Evaluation Engine (`/evaluate-workflow`)**: A hybrid engine that grades the execution against the specification.

### The Three-Tier Evaluation Engine

To eliminate LLM hallucination on deterministic facts, evaluation is routed based on the complexity of the assertion:

* **Tier 1 (State Assertions):** 100% code-based. Checks binary states (e.g., **"Branch exists"**, **"PR is draft"**, **"Label is present"**) using the static API/Git evidence. Zero LLM involvement.
* **Tier 2 (Sequence Assertions):** Code-based rule engine. Validates temporal ordering of tool calls and trace events (e.g., **"Read tool called on skill.md before Bash tool was executed"**).
* **Tier 3 (Semantic Judgment):** LLM-based evaluation. Used exclusively for qualitative assessments (e.g., **"PR description clearly explains the *Why*"**).
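The routing above can be sketched as a small dispatcher. This is an illustrative sketch, not the framework's actual code: the helper names, the `expected_state` field, and the `before_tool`/`after_tool` fields are assumptions made for the example.

```python
from typing import Any, Callable

def check_state_assertion(cp: dict, evidence: dict) -> dict:
    # Tier 1: binary state lookup in static API/Git evidence, zero LLM involvement.
    # `expected_state` is a hypothetical field for this sketch.
    key = cp["verification"]["evidence_sources"][0]
    ok = evidence.get(key, {}).get("state") == cp.get("expected_state")
    return {"id": cp["id"], "result": "PASS" if ok else "FAIL"}

def check_tool_sequence(cp: dict, trace: list[dict]) -> dict:
    # Tier 2: verify `before_tool` appears in the trace before `after_tool`
    # (both hypothetical fields for this sketch).
    names = [e["tool_name"] for e in trace]
    try:
        ok = names.index(cp["before_tool"]) < names.index(cp["after_tool"])
    except ValueError:
        ok = False
    return {"id": cp["id"], "result": "PASS" if ok else "FAIL"}

def route_checkpoint(cp: dict, evidence: dict[str, Any],
                     llm_judge: Callable[[dict, dict], dict]) -> dict:
    # Tiers 1 and 2 are fully deterministic; only Tier 3 reaches the LLM.
    if cp["tier"] == 1:
        return check_state_assertion(cp, evidence)
    if cp["tier"] == 2:
        return check_tool_sequence(cp, evidence["trace_file"])
    return llm_judge(cp, evidence)
```

The key design point is that the LLM judge is injected as a callable, so Tier 1/2 results are reproducible byte-for-byte across runs regardless of model temperature.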

## 2. The Checkpoint Specification

The YAML file is the source of truth for a workflow's evaluation.

### Authoring Rules

* **ID Stability:** Checkpoint IDs (e.g., `P1-01`) are immutable. Never reuse or renumber them. Old reports reference these IDs.
* **Append-Only:** New checkpoints get the next available integer (e.g., `P1-04`).
* **Implicit Ordering:** Do not number steps. The YAML document order dictates the `position_index`, which the engine assigns automatically (1 to N) to track middle-of-workflow dropout.
* **Declarative Routing:** Every checkpoint must explicitly declare its `evidence_sources`.
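The implicit-ordering rule reduces to a trivial pass over the parsed document, plus a quartile bucketing used later for dropout analysis. A minimal sketch, assuming checkpoints arrive as a list of dicts in document order (the function names are illustrative):

```python
def assign_position_indices(checkpoints: list[dict]) -> list[dict]:
    # Document order dictates position_index (1..N); authors never number steps.
    for i, cp in enumerate(checkpoints, start=1):
        cp["position_index"] = i
    return checkpoints

def quartile_of(position_index: int, total: int) -> int:
    # Bucket positions 1..N into quartiles 1..4 for middle-of-workflow
    # dropout tracking.
    return min(4, 1 + (position_index - 1) * 4 // total)
```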

### Specification Schema

**`checkpoints/feature-workflow.yaml`**

```yaml
workflow: "feature-implementation"
spec_version: "1.0.0"

# --- Explicit Environment Constraints ---
# Used to detect when instructions are impossible to satisfy
environment_constraints:
  - id: ENV-01
    constraint: "CI checks do not run on draft PRs"
    affects: [P3-02]
  - id: ENV-02
    constraint: "Protected branches require linear history (no merge commits)"
    affects: [P4-01]

# --- Checkpoints ---
checkpoints:
  - id: P1-01
    tier: 1
    instruction: "Update task status to 'In Progress' before starting"
    severity: critical
    verification:
      method: api_state
      evidence_sources: [issue_tracker_state]
      details: "Issue tracker shows 'In Progress' state"

  - id: P2-01
    tier: 2
    instruction: "Read the phase skill file before writing code"
    severity: high
    verification:
      method: tool_sequence
      evidence_sources: [trace_file]
      details: "Read tool called on skill file prior to Bash/Write tools"

  - id: P3-01
    tier: 3
    instruction: "PR description must include WHAT, WHY, and HOW sections"
    severity: medium
    verification:
      method: content_analysis
      evidence_sources: [pr_body]
    # Progressive Partial Scoring (Mode A - Strict)
    partial_criteria:
      - condition: "Has WHAT and WHY, but missing HOW"
        weight: 0.7
      - condition: "Has WHAT only"
        weight: 0.3
```

## 3. The Evaluation Pipeline

### A. Pre-flight Linting

Before evaluation begins, the framework statically analyzes the YAML to catch broken workflows early:

1. Validates schema and ensures `evidence_sources` are present.
2. Checks that every ID listed in an `environment_constraints.affects` array references an existing checkpoint (no orphaned IDs).
3. Flags obvious logical paradoxes or duplicated checkpoint IDs.
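The three lint passes above can be sketched as follows. This is a hedged illustration, not the shipped linter; `lint_spec` and its error-message format are assumptions for the example.

```python
def lint_spec(spec: dict) -> list[str]:
    """Pre-flight lint: fail fast on a broken spec before any evaluation runs."""
    errors = []
    ids = [cp["id"] for cp in spec.get("checkpoints", [])]
    # 1. Schema basics: every checkpoint must declare evidence_sources.
    for cp in spec.get("checkpoints", []):
        if not cp.get("verification", {}).get("evidence_sources"):
            errors.append(f"{cp['id']}: missing evidence_sources")
    # 2. Orphaned references: every `affects` entry must name a real checkpoint.
    for env in spec.get("environment_constraints", []):
        for target in env.get("affects", []):
            if target not in ids:
                errors.append(f"{env['id']}: affects unknown checkpoint {target}")
    # 3. Duplicate checkpoint IDs break the ID-stability guarantee.
    seen = set()
    for cid in ids:
        if cid in seen:
            errors.append(f"duplicate checkpoint id: {cid}")
        seen.add(cid)
    return errors
```

An empty return list means the spec is safe to evaluate; in CI the job would fail on any non-empty result.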

### B. Execution & Telemetry (Nested Traces)

A `PostToolUse` hook captures tool usage. To handle modern sub-agent delegation, it supports a lightweight OpenTelemetry-inspired JSONL structure:

```json
{"timestamp":"2026-02-19T10:00:00Z","tool_name":"Task","span_type":"parent","trace_id":"abc123"}
{"timestamp":"2026-02-19T10:00:05Z","tool_name":"Read","span_type":"child","parent_trace_id":"abc123","args":"..."}
```
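Consuming this format is straightforward: one JSON object per line, with child spans linked to their parent via `parent_trace_id`. A minimal reader and a sanity check for orphaned spans might look like this (the function names are illustrative, not part of the framework):

```python
import json

def load_trace(jsonl_text: str) -> list[dict]:
    # JSONL: one JSON object per line, in emission order. Blank lines are skipped.
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def orphan_child_spans(events: list[dict]) -> list[dict]:
    # A child span is orphaned if its parent_trace_id matches no parent span,
    # which usually indicates a truncated or corrupted trace.
    parents = {e["trace_id"] for e in events if e.get("span_type") == "parent"}
    return [e for e in events if e.get("span_type") == "child"
            and e.get("parent_trace_id") not in parents]
```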

### C. Evaluation & State Resolution

The engine evaluates each checkpoint and assigns one of the following states:

* **PASS:** Full point value.
* **PARTIAL:**
  * *Mode A (Strict):* If `partial_criteria` is defined, the evaluator MUST match one condition and use its weight.
  * *Mode B (Default):* If no criteria are defined, defaults to 0.5 weight, but the evaluator MUST populate the `notes` field explaining the deduction.
* **FAIL:** 0 points. The agent failed to comply.
* **NOT\_APPLICABLE (N/A):** Excluded from scoring. Trigger conditions not met (e.g., traces unavailable).
* **BLOCKED\_BY\_ENVIRONMENT:** Excluded from Compliance Score. Assigned if a checkpoint in the `affects` list fails AND its corresponding `environment_constraint` is active.
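The BLOCKED reclassification is the only place where a FAIL is overridden, and it can be sketched in a few lines. This is an illustrative sketch; the `active_ids` parameter (the set of constraints detected as active in this run) is an assumption for the example.

```python
def resolve_state(cp_id: str, raw_result: str,
                  constraints: list[dict], active_ids: set[str]) -> str:
    # A FAIL is reclassified as BLOCKED only when an *active* environment
    # constraint lists this checkpoint in its `affects` array. PASS, PARTIAL,
    # and N/A are never overridden.
    if raw_result == "FAIL":
        for env in constraints:
            if env["id"] in active_ids and cp_id in env.get("affects", []):
                return "BLOCKED"
    return raw_result
```

Requiring the constraint to be *active* (not merely declared) prevents a blanket excuse: a checkpoint fails open to FAIL unless the environment evidence actually confirms the blocker.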

### D. Quality Assurance Protocols

* **Optional Tier 3 Confidence Intervals:** For A/B testing prompt changes, the user can run the LLM evaluator multiple times (e.g., `--eval-runs=3`). The framework reports the score range and variance to show that a score delta reflects a real behavioral change rather than LLM sampling noise.
* **Golden Set Calibration:** The user manually annotates a **"Golden Set"** of 5 historical traces with ground-truth verdicts. The framework re-evaluates them and, if the Tier 3 LLM evaluator's agreement with the human ground truth drops below **90%**, raises a Calibration Warning.
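The confidence-interval report reduces to basic descriptive statistics over the per-run scores. A minimal sketch; the `max_variance` threshold and the function name are assumptions for the example, not documented framework values:

```python
from statistics import mean, pvariance

def tier3_confidence(run_scores: list[float], max_variance: float = 2.0) -> dict:
    # Summarize repeated evaluator runs so a score delta between two prompt
    # versions can be distinguished from sampling noise. The HIGH/LOW cutoff
    # (max_variance) is an assumed tunable.
    v = pvariance(run_scores)
    return {
        "runs": len(run_scores),
        "score_range": [min(run_scores), max(run_scores)],
        "mean": mean(run_scores),
        "variance": v,
        "status": "HIGH_CONFIDENCE" if v <= max_variance else "LOW_CONFIDENCE",
    }
```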

## 4. Scoring Formulas (Dual-Metric Reporting)

Agents should not be penalized for bad Standard Operating Procedures (SOPs). The framework calculates two distinct scores:

**1. Compliance Score (Agent Health):** How well the agent followed *achievable* instructions.

```
Compliance_Score = (Sum of Earned Weights / Sum of Possible Weights) * 100
```

*(Note: Possible Weights strictly exclude N/A and BLOCKED checkpoints)*

**2. Specification Health (Workflow Health):** What percentage of instructions were actually possible given the environment.

```
Spec_Health = ((Total Checkpoints - BLOCKED Checkpoints) / Total Checkpoints) * 100
```
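Both formulas can be computed in one pass over the evaluated checkpoints. A sketch under stated assumptions: each checkpoint carries a `weight` (defaulting to 1.0) and an `earned_fraction` (1.0 for PASS, the matched partial weight for PARTIAL, 0.0 for FAIL); both field names are hypothetical for this example.

```python
def dual_scores(checkpoints: list[dict]) -> dict:
    # Compliance: earned / possible weight over *scorable* checkpoints only.
    # N/A and BLOCKED are excluded from the denominator, so the agent is not
    # penalized for instructions it could never satisfy.
    scorable = [c for c in checkpoints if c["result"] not in ("N/A", "BLOCKED")]
    possible = sum(c.get("weight", 1.0) for c in scorable)
    earned = sum(c.get("weight", 1.0) * c.get("earned_fraction", 0.0)
                 for c in scorable)
    compliance = (earned / possible * 100) if possible else 0.0
    # Spec health: share of all checkpoints that were achievable at all.
    total = len(checkpoints)
    blocked = sum(1 for c in checkpoints if c["result"] == "BLOCKED")
    spec_health = ((total - blocked) / total * 100) if total else 0.0
    return {"compliance_percent": compliance, "spec_health_percent": spec_health}
```

A low Compliance Score with high Spec Health points at the agent; high Compliance with low Spec Health points at the workflow text or the environment.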

## 5. JSON Output Schema

The evaluator saves a markdown report and a machine-readable JSON file designed for historical diffing and regression detection.

```json
{
  "schema_version": "1.0.0",
  "workflow": "feature-implementation",
  "spec_version": "1.0.0",
  "evaluated_at": "2026-02-19T21:00:00Z",

  "scores": {
    "compliance_percent": 88.5,
    "spec_health_percent": 95.0,
    "rating": "Good"
  },

  "summary": {
    "total_checkpoints": 20,
    "passed": 16,
    "partial": 2,
    "failed": 1,
    "na": 0,
    "blocked": 1
  },

  "evaluator_audit": {
    "undocumented_passes": 0,
    "undocumented_fails": 0,
    "tier_3_checkpoints_count": 5,
    "tier_3_without_evidence_citation": 0
  },

  "tier_3_confidence": {
    "protocol_enabled": true,
    "runs": 3,
    "score_range": [87.0, 89.5],
    "mean": 88.25,
    "variance": 1.5,
    "status": "HIGH_CONFIDENCE"
  },

  "position_dropout_analysis": {
    "quartile_1": 100.0,
    "quartile_2": 95.0,
    "quartile_3": 75.0,
    "quartile_4": 80.0
  },

  "checkpoints": [
    {
      "id": "P3-01",
      "position_index": 12,
      "tier": 3,
      "severity": "medium",
      "result": "PARTIAL",
      "earned_weight": 0.7,
      "notes": "PR contains WHAT and WHY, but HOW section is absent. Matched partial_criteria index 0."
    },
    {
      "id": "P3-02",
      "position_index": 13,
      "tier": 1,
      "severity": "high",
      "result": "BLOCKED",
      "earned_weight": 0,
      "notes": "Agent failed to trigger CI. Flagged as BLOCKED due to ENV-01 constraint (PR is in Draft state)."
    }
  ]
}
```
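Because checkpoint IDs are immutable, two reports can be diffed by ID to gate CI on regressions. A minimal sketch of that comparison; `diff_reports` and the result ranking are assumptions for the example (the reports themselves would typically be loaded with `json.load`):

```python
def diff_reports(old: dict, new: dict) -> list[str]:
    # Join the two reports on stable checkpoint IDs and flag any checkpoint
    # whose result got strictly worse. The ordinal ranking below is an
    # assumed convention: PASS > PARTIAL > BLOCKED/N-A > FAIL.
    rank = {"PASS": 3, "PARTIAL": 2, "BLOCKED": 1, "N/A": 1, "FAIL": 0}
    old_by_id = {c["id"]: c for c in old["checkpoints"]}
    regressions = []
    for cp in new["checkpoints"]:
        prev = old_by_id.get(cp["id"])
        if prev and rank.get(cp["result"], 0) < rank.get(prev["result"], 0):
            regressions.append(f"{cp['id']}: {prev['result']} -> {cp['result']}")
    return regressions
```

In a CI pipeline, a non-empty regression list would fail the job, turning the JSON report into an enforceable contract rather than a passive log.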
