Outcome Measurement.
Quantify the value delivered by multi-agent orchestration. Move beyond cost accounting to answer the question that matters: "Did this actually help?" The system passively observes task completion rates, time-to-resolution, rework frequency, and quality trends across every conductor workflow.
Problem Statement
Agent Economics (PRD 10) answers "how much did this cost?" but not "how much value did this deliver?" Operators can see token spend, model routing decisions, and per-agent cost breakdowns, but have no way to correlate that spend with actual outcomes. The system tracks inputs (tokens consumed, models invoked) without tracking outputs (tasks completed, quality achieved, time saved).
Without outcome measurement, optimization becomes guesswork. An operator might see that Opus costs 10x more than Haiku for a particular task type, but has no data on whether the Opus run produced a first-pass success while the Haiku run required three rework cycles. The cost comparison is meaningless without outcome context.
No Task Completion Rates
Agents report success/failure, but there is no independent measurement of whether the task actually achieved its intended outcome after verification.
No Time-to-Resolution Tracking
No measurement of elapsed time from task dispatch to verified completion, making it impossible to identify bottlenecks or estimate future delivery timelines.
No Rework Frequency Data
Retry counts exist in conductor state, but there is no aggregation across workflows to identify which agent types, task categories, or tier levels produce the most rework.
No Quality Trending
Gemini validation verdicts (PASS/FAIL/PARTIAL) are recorded per-agent but never analyzed over time. Is quality improving, degrading, or plateauing? Nobody knows.
No Cost-per-Outcome
Agent Economics tracks cost-per-token and cost-per-agent. The metric that matters is cost-per-successful-outcome, and it does not exist today.
No Value Attribution
When the system saves an operator 4 hours on a MAJOR workflow, there is no framework to quantify the time savings, quality improvement, or risk reduction as measurable value.
Architecture
Outcome Measurement operates as a passive observation layer. It never modifies conductor state, never blocks agent dispatch, and never introduces latency into the orchestration pipeline. It reads events, computes metrics, and stores results in a dedicated Qdrant collection, completely decoupled from the orchestration path.
Passive Observation Model
The Outcome Collector reads conductor-state.json changes, Gemini validation results, and governance audit events without modifying any source data. Zero performance impact on orchestration.
Compute-then-Store Pipeline
Raw observations are transformed into 8 standardized outcome metrics before storage. No raw event data is persisted in the outcome store; only computed metrics and attribution records.
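A minimal sketch of the compute-then-store shape, with hypothetical Observation and MetricRecord types and a duck-typed store; the point it illustrates is that only computed records are persisted:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """Raw event read from a source stream (illustrative shape, not the real schema)."""
    task_id: str
    event: str          # e.g. "dispatched", "retry", "verified_complete"
    timestamp_ms: int

@dataclass(frozen=True)
class MetricRecord:
    """Computed metric; the only record type that reaches the outcome store."""
    task_id: str
    metric: str
    value: float

def compute_metrics(batch: list[Observation]) -> list[MetricRecord]:
    """Transform raw observations into standardized metrics, in memory."""
    retries = Counter(o.task_id for o in batch if o.event == "retry")
    return [MetricRecord(task_id=t, metric="rework_frequency", value=float(n))
            for t, n in retries.items()]

def compute_and_store(batch: list[Observation], store) -> None:
    store.upsert(compute_metrics(batch))  # raw events are discarded, never persisted
```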
Dual Retention Strategy
Granular per-task metrics retained for 90 days. Aggregated metrics (per-agent-type, per-tier, per-project) retained indefinitely for long-term trend analysis.
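The compaction pass implied by this strategy could look like the sketch below; the granular and aggregates stores are duck-typed assumptions, not a real API:

```python
import time

NINETY_DAYS_MS = 90 * 24 * 3600 * 1000

def compact_expired(granular, aggregates, now_ms=None):
    """Fold granular records older than 90 days into aggregates, then delete them."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    cutoff = now_ms - NINETY_DAYS_MS
    expired = [r for r in granular.scan() if r["timestamp_ms"] < cutoff]
    for rec in expired:
        # Aggregation dimensions per this PRD: agent type, tier, project.
        key = (rec["agent_type"], rec["tier"], rec["project"], rec["metric"])
        aggregates.fold(key, rec["value"])        # running aggregate, kept indefinitely
    granular.delete([r["id"] for r in expired])   # granular rows dropped only after folding
```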
Key Components
8 Outcome Metrics
Every metric is computed from observable data already present in conductor state, Gemini validations, and governance events. No additional instrumentation required from agents.
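As a registry, the eight metrics (enumerated in full in the build prompt at the end of this PRD) might be declared once; the identifier spellings here are assumptions:

```python
from enum import Enum

class OutcomeMetric(str, Enum):
    TASK_COMPLETION_RATE = "task_completion_rate"
    TIME_TO_RESOLUTION = "time_to_resolution"
    FIRST_PASS_SUCCESS_RATE = "first_pass_success_rate"
    REWORK_FREQUENCY = "rework_frequency"
    QUALITY_SCORE_TREND = "quality_score_trend"
    RECOVERY_RATE = "recovery_rate"
    CONTEXT_EFFICIENCY = "context_efficiency"
    COST_PER_SUCCESSFUL_OUTCOME = "cost_per_successful_outcome"
```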
Core Components
Outcome Collector
Passive observer that monitors four event streams: conductor state transitions, Gemini validation verdicts, governance audit events, and agent completion signals. Emits structured observation records without modifying source data.
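A read-only poller for the first of those streams might look like this sketch; the emit callback and file layout are assumptions, and the defining property is that sources are only ever opened for reading:

```python
import json
from pathlib import Path

class OutcomeCollector:
    """Passive observer: reads source streams, emits observation records, writes nothing back."""

    def __init__(self, state_path: Path, emit):
        self.state_path = state_path   # conductor-state.json, opened read-only
        self.emit = emit               # callback receiving observation dicts
        self._last_mtime = 0.0

    def poll_conductor_state(self) -> None:
        """Detect state changes by mtime; Gemini verdicts, governance events, and
        agent completion signals would get analogous read-only tailers."""
        mtime = self.state_path.stat().st_mtime
        if mtime <= self._last_mtime:
            return
        self._last_mtime = mtime
        state = json.loads(self.state_path.read_text())   # read-only access
        for task in state.get("tasks", []):
            self.emit({"source": "conductor",
                       "task_id": task.get("id"),
                       "status": task.get("status")})
```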
Value Attribution Framework
Maps outcomes to five value categories: time savings (estimated hours saved), quality improvement (defect prevention), throughput gain (tasks per unit time), risk avoidance (security/compliance catches), and knowledge preservation (institutional memory captured).
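One plausible shape for an attribution record, with field names as assumptions; each successful outcome carries quantified estimates for the categories it touches:

```python
from dataclasses import dataclass, field

VALUE_CATEGORIES = {"time_savings", "quality_improvement", "throughput_gain",
                    "risk_avoidance", "knowledge_preservation"}

@dataclass
class ValueAttribution:
    task_id: str
    # Category -> quantified estimate (hours saved, defects prevented, and so on).
    estimates: dict[str, float] = field(default_factory=dict)

    def attribute(self, category: str, estimate: float) -> None:
        if category not in VALUE_CATEGORIES:
            raise ValueError(f"unknown value category: {category}")
        self.estimates[category] = estimate
```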
Outcome Store
Dedicated Qdrant collection (outcome_metrics) with vector embeddings for semantic similarity search across outcomes. Enables queries like "show me outcomes similar to this failing pattern" for root cause analysis.
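With the Python qdrant-client, the collection and a similarity query could be set up as below; the collection name comes from this PRD, while the vector size, distance metric, and local URL are assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")   # assumed local deployment

# Dedicated collection for outcome metrics; 384-dim cosine vectors are an
# assumption (e.g. a small sentence-embedding model over outcome descriptions).
client.recreate_collection(
    collection_name="outcome_metrics",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def similar_outcomes(query_vector: list[float], limit: int = 10):
    """'Show me outcomes similar to this failing pattern' as a vector search."""
    return client.search(
        collection_name="outcome_metrics",
        query_vector=query_vector,
        limit=limit,
    )
```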
Outcome Dashboard
Extends the Memory Dashboard (PRD 6) with a dedicated outcome analytics tab. Real-time metric visualization, trend charts, agent comparison views, and cost-effectiveness rankings.
Outcome Reports
Two report types: weekly digest (automated summary of outcome trends, top-performing agents, emerging issues) and project retrospective (complete outcome analysis for a completed workflow, from planning through delivery).
Baseline Manager
Operator-defined performance baselines per metric. Outcomes are compared against baselines to generate improvement percentages. Baselines auto-adjust quarterly based on rolling averages, with operator override capability.
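The comparison and quarterly auto-adjust reduce to a few lines; this sketch assumes higher-is-better metrics and an explicit operator lock for overrides:

```python
from statistics import mean

class Baseline:
    def __init__(self, metric: str, value: float, operator_locked: bool = False):
        self.metric = metric
        self.value = value
        self.operator_locked = operator_locked   # operator override pins the baseline

    def improvement_pct(self, observed: float) -> float:
        """Positive means better than baseline, for higher-is-better metrics."""
        return (observed - self.value) / self.value * 100.0

    def quarterly_adjust(self, trailing_values: list[float]) -> None:
        """Auto-adjust to the rolling average unless the operator has pinned it."""
        if not self.operator_locked and trailing_values:
            self.value = mean(trailing_values)
```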
Requirements
| ID | Requirement | Priority |
|---|---|---|
| REQ-OM-001 | System SHALL compute task completion rate as the ratio of verified-complete tasks to total dispatched tasks, segmented by tier (TRIVIAL/MINOR/STANDARD/MAJOR), agent type, and project. | MUST |
| REQ-OM-002 | System SHALL track time-to-resolution from agent dispatch timestamp to verified-completion timestamp with millisecond precision, excluding operator-blocked wait time. | MUST |
| REQ-OM-003 | System SHALL compute first-pass success rate from Gemini validation verdicts, counting only tasks that receive PASS on the initial validation without any prior FAIL or PARTIAL verdict. | MUST |
| REQ-OM-004 | System SHALL track rework frequency as the count of retry cycles per task, aggregated by agent type, task category, and tier level, with alert thresholds configurable by operator. | MUST |
| REQ-OM-005 | System SHALL compute quality score trend as a 7-day rolling average of Gemini validation scores, with separate trend lines per agent type and a combined system-wide trend. | MUST |
| REQ-OM-006 | System SHALL compute recovery rate as the percentage of initially-failed tasks (a FAIL verdict, or a PARTIAL verdict scoring below 70%) that subsequently reach verified-complete status through retry or escalation. | SHOULD |
| REQ-OM-007 | System SHALL compute context efficiency as the ratio of output tokens in verified-complete deliverables to total input+output tokens consumed during task execution. | SHOULD |
| REQ-OM-008 | System SHALL compute cost-per-successful-outcome by dividing the total Agent Economics cost for a task cohort (including retries and failed attempts) by the count of verified-complete outcomes in that cohort, so that spend on failures is amortized across successes. | MUST |
| REQ-OM-009 | Outcome Collector SHALL operate in read-only mode, never modifying conductor-state.json, Gemini validation records, or governance audit events. Observation must not affect the observed system. | MUST |
| REQ-OM-010 | Value Attribution Framework SHALL map each successful outcome to at least one of five value categories (time savings, quality improvement, throughput gain, risk avoidance, knowledge preservation) with quantified estimates. | SHOULD |
| REQ-OM-011 | Outcome Store SHALL retain granular per-task metrics for 90 days and aggregated metrics indefinitely. Granular records older than 90 days SHALL be compacted into aggregate summaries before deletion. | MUST |
| REQ-OM-012 | Outcome Dashboard SHALL extend the Memory Dashboard (PRD 6) with real-time metric visualization, including trend charts, agent comparison views, and cost-effectiveness rankings, updating within 30 seconds of metric computation. | SHOULD |
| REQ-OM-013 | System SHALL generate automated weekly digest reports summarizing outcome trends, top-performing agents, degradation alerts, and cost-efficiency changes relative to prior week. | COULD |
| REQ-OM-014 | Operator SHALL be able to define custom performance baselines per metric, per agent type. System SHALL compare outcomes against baselines and compute improvement percentages in all dashboard views and reports. | SHOULD |
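To make a few of the MUST-priority formulas concrete, here is a sketch over hypothetical per-task records (field names are assumptions, not the conductor schema):

```python
def completion_rate(tasks: list[dict]) -> float:
    """REQ-OM-001: verified-complete over total dispatched."""
    done = sum(1 for t in tasks if t["status"] == "verified_complete")
    return done / len(tasks) if tasks else 0.0

def time_to_resolution_ms(task: dict) -> int:
    """REQ-OM-002: dispatch to verified-complete, minus operator-blocked wait time."""
    return (task["verified_at_ms"] - task["dispatched_at_ms"]
            - task.get("blocked_ms", 0))

def first_pass_success_rate(tasks: list[dict]) -> float:
    """REQ-OM-003: PASS on the initial Gemini verdict, no prior FAIL or PARTIAL."""
    firsts = [t["verdicts"][0] for t in tasks if t.get("verdicts")]
    return firsts.count("PASS") / len(firsts) if firsts else 0.0

def cost_per_successful_outcome(tasks: list[dict]) -> float:
    """REQ-OM-008: total cost, including retries and failures, amortized over successes."""
    total_cost = sum(t["cost_usd"] for t in tasks)
    successes = sum(1 for t in tasks if t["status"] == "verified_complete")
    return total_cost / successes if successes else float("inf")
```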
Design Decisions
Key architectural and design choices, with rationale for each decision.
- Passive observation, never active instrumentation. The Outcome Collector reads existing data streams (conductor state, Gemini verdicts, governance events) rather than requiring agents to emit outcome telemetry. This eliminates the risk of observation affecting agent behavior and requires zero changes to existing agents.
- Operator-defined baselines, not system-imposed targets. Performance baselines vary dramatically by project type, team size, and domain complexity. The system provides baseline suggestions from historical data but requires operator confirmation. Auto-adjustment happens quarterly with operator override.
- Separate Qdrant collection for outcome data. Outcome metrics are stored in a dedicated outcome_metrics collection rather than co-located with memory or governance data. This enables independent retention policies, dedicated vector indexes for outcome similarity search, and clean separation of concerns.
- Extend existing dashboard rather than build standalone. The Memory Dashboard (PRD 6) already provides the infrastructure for real-time data visualization. Adding an outcome analytics tab leverages existing authentication, layout, and deployment rather than creating a parallel dashboard that operators must monitor separately.
- Exactly 8 metrics, no more. Every proposed metric was evaluated against three criteria: is it observable from existing data, is it actionable by an operator, and does it measure outcomes (not inputs)? Metrics like "tokens consumed" and "model selection accuracy" were excluded because they measure process, not outcome.
- 90-day granular retention with indefinite aggregates. Per-task granular data is essential for debugging and root cause analysis but grows linearly with workload. The 90-day window provides sufficient history for trend analysis. Aggregated data (per-agent-type, per-tier, per-project) is compact enough to retain indefinitely for long-term trending.
Integration Map
Outcome Measurement connects to 8 other PRDs in the ecosystem. Each integration is directional: Outcome Measurement reads from source systems and writes to presentation layers.
| Integration | Direction | Data Exchanged |
|---|---|---|
| Conductor (PRD 2) | Reads from | State transitions, phase completions, agent dispatch records, tier classifications, and task verification results from conductor-state.json. |
| Agent Economics (PRD 10) | Reads from | Per-task cost data, model routing decisions, and token consumption records. Used to compute cost-per-successful-outcome and cost-effectiveness rankings. |
| Governance (PRD 5) | Reads from | Governance audit events including approval gate decisions, policy violations, and compliance checks. Feeds into risk avoidance value attribution. |
| Code Assurance (PRD 8) | Reads from | Quality gate results, test pass rates, and security scan outcomes. Provides independent quality signal for first-pass success rate calculation. |
| Self-Healing (PRD 12) | Reads from | Recovery events and auto-remediation records. Feeds recovery rate metric and provides data on autonomous system resilience. |
| Memory Dashboard (PRD 6) | Writes to | Outcome analytics tab with real-time metric visualization, trend charts, and agent comparison views. Extends existing dashboard infrastructure. |
| Memory System (PRD 4) | Reads/Writes | Reads historical outcome patterns for baseline computation. Writes outcome summaries for cross-session context and institutional memory. |
| Event-Driven (PRD 13) | Subscribes | Subscribes to conductor events via event bus for real-time observation. Publishes outcome metric events for downstream consumers (alerts, reports). |
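The subscribe/publish pattern implied by the last row might look like the following sketch; the bus interface is purely illustrative, since PRD 13 defines the real one:

```python
from typing import Optional

class OutcomeEventBridge:
    """Subscribes to conductor events, republishes computed metric events (bus API assumed)."""

    def __init__(self, bus):
        self.bus = bus
        # Real-time observation: conductor events in, outcome metric events out.
        bus.subscribe("conductor.task.*", self.on_task_event)

    def on_task_event(self, event: dict) -> None:
        metric = self.compute(event)
        if metric is not None:
            self.bus.publish("outcome.metric.computed", metric)

    def compute(self, event: dict) -> Optional[dict]:
        # Illustrative rule: only verified completions yield an immediate metric event.
        if event.get("type") == "verified_complete":
            return {"task_id": event["task_id"], "metric": "completion", "value": 1.0}
        return None
```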
Prompt to Build It
Copy and paste this prompt into Claude Code to begin implementing Outcome Measurement. It references the architecture, metrics, and integration points defined in this PRD.
Build the Outcome Measurement system (PRD 15) for the conductor plugin ecosystem.

## Context

Agent Economics (PRD 10) tracks cost but not value. We need a passive observation layer that computes outcome metrics from existing conductor data without modifying any source systems.

## Architecture

- Outcome Collector: passive observer reading conductor-state.json transitions, Gemini validation verdicts, governance audit events, and agent completion signals
- Metric Compute: transforms observations into 8 standardized metrics
- Outcome Store: dedicated Qdrant collection (outcome_metrics) with 90-day granular retention and indefinite aggregate retention
- Dashboard: extends Memory Dashboard (PRD 6) with outcome analytics tab
- Reports: automated weekly digest and project retrospective

## 8 Outcome Metrics to Implement

1. Task Completion Rate: verified-complete / total dispatched, by tier + agent
2. Time-to-Resolution: dispatch to verified-complete, excluding blocked time
3. First-Pass Success Rate: PASS on initial Gemini validation, no retries
4. Rework Frequency: retry cycles per task, by agent type + category
5. Quality Score Trend: 7-day rolling average of Gemini validation scores
6. Recovery Rate: initially-failed tasks that reach verified-complete
7. Context Efficiency: output tokens in deliverables / total tokens consumed
8. Cost per Successful Outcome: total Agent Economics cost (including retries and failures) / count of verified-complete outcomes

## Value Attribution Framework

Map each outcome to: time savings, quality improvement, throughput gain, risk avoidance, knowledge preservation. Quantify estimates per category.

## Key Constraints

- Read-only observation: NEVER modify conductor-state.json or source data
- Operator-defined baselines with quarterly auto-adjustment
- Separate Qdrant collection, not co-located with memory data
- 90-day granular + indefinite aggregate retention
- Dashboard updates within 30 seconds of metric computation

## Integration Points

- Reads: Conductor (PRD 2), Agent Economics (PRD 10), Governance (PRD 5), Code Assurance (PRD 8), Self-Healing (PRD 12)
- Writes: Memory Dashboard (PRD 6), Memory System (PRD 4)
- Subscribes: Event-Driven (PRD 13) event bus

## Requirements

14 requirements (REQ-OM-001 through REQ-OM-014). MUST-priority requirements are non-negotiable: task completion rate, TTR, first-pass success, rework frequency, quality trend, cost-per-outcome, read-only observation, and retention policy.