Predictive Scaling
Stop reacting to workload spikes and cost overruns. Predictive Scaling uses statistical forecasting to anticipate demand, pre-warm caches, optimize model routing, and provide cost forecasts with confidence intervals — before the first token is consumed.
Problem Statement
Agent Economics (PRD 10) is entirely reactive. It tracks what was spent after the spend occurs, routes models based on the current request in isolation, and has no awareness of historical patterns. The system makes the same cold-start mistakes every Monday morning, every sprint boundary, and every project kickoff.
The consequences are concrete: Opus is invoked for trivial classification tasks because the router lacks pattern memory. Qdrant collections are cold on first access because nobody predicted they would be needed. Budget alerts fire after the overspend occurs rather than before. Concurrency limits are static when workload is dynamic.
No Demand Anticipation
Every session starts cold. The system cannot predict that Monday mornings produce 3x the workload of Friday afternoons, or that sprint boundaries trigger MAJOR-tier workflows.
Wasteful Model Routing
Without historical pattern analysis, the model router treats every request as novel. Trivial tasks receive Opus-level processing. When budget runs low, complex tasks are downgraded to Haiku-level treatment, which degrades quality. A forecast would have allowed capacity to be reserved for those tasks in advance.
Cold Cache Penalty
Qdrant collections, prompt caches, and context pre-loads are populated on demand. First access is always slow. Predictable access patterns (daily standup context, sprint planning data) could be pre-warmed but are not.
Reactive Budget Alerts
Budget warnings fire after thresholds are exceeded. A cost forecast with confidence intervals would enable proactive budget management and prevent mid-project cost surprises.
Real-World Scenarios
Developer logs in at 09:00 after a weekend
Without prediction: cold caches, Opus routed for a TRIVIAL task (backlog grooming), context window fills with stale data. Cost: $2.40 for a $0.15 task. With prediction: caches pre-warmed at 06:00, Haiku pre-selected for morning classification tasks, relevant Qdrant collections loaded. Cost: $0.12.
End-of-sprint retrospective triggers 8 concurrent agent dispatches
Without prediction: the concurrency limit is hit, tasks queue, and TTR doubles. The budget alert fires at 80% only after 6 of 8 tasks have completed. With prediction: the concurrency limit is temporarily raised to 12, the budget forecast had already projected 95% consumption two days earlier, and the operator pre-approved the overage.
Architecture
The Prediction Engine operates as a sidecar to Agent Economics, consuming the same data streams but producing forward-looking advisories rather than backward-looking reports. All predictions are advisory; no automated system changes occur without operator confirmation or pre-configured policies.
Statistical Forecasting, Not ML
The Prediction Engine uses time-series decomposition (trend + seasonal + residual), exponential smoothing, and simple regression. No neural networks, no training pipelines, no GPU requirements. Runs on the same hardware as the conductor.
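As a rough illustration of how lightweight these methods are, the sketch below implements simple exponential smoothing and a per-position seasonal average using only the Python standard library. The function names and the default `alpha` are assumptions made for this example, not part of the PRD.

```python
from statistics import mean


def exponential_smoothing(series, alpha=0.3):
    """Return the smoothed series; its last value serves as a one-step-ahead forecast.

    alpha (hypothetical default) weights the newest observation against history.
    """
    if not series:
        raise ValueError("series must not be empty")
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def seasonal_means(series, period):
    """Average each position within the cycle, e.g. period=7 for day-of-week."""
    if period <= 0 or len(series) < period:
        raise ValueError("need at least one full period of data")
    return [mean(series[i::period]) for i in range(period)]


# Example: daily task counts over two weeks (illustrative numbers only).
daily_tasks = [30, 22, 20, 18, 10, 2, 1, 33, 24, 19, 17, 11, 3, 1]
print(exponential_smoothing(daily_tasks)[-1])
print(seasonal_means(daily_tasks, period=7))
```

Both functions run in linear time over the history, which is consistent with the claim that no GPU or training pipeline is needed.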
Advisory, Not Mandatory
Every prediction produces an advisory with a confidence interval. Model routing suggestions, concurrency limit changes, and budget forecasts are presented to the operator or to Agent Economics as recommendations, never as forced overrides.
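One way to picture the advisory contract is a small record plus a gate that decides whether a recommendation may be applied. The sketch below is hypothetical: the `Advisory` fields, the `may_apply` helper, and the policy representation (a set of auto-accepted advisory kinds) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    kind: str            # e.g. "model_routing", "concurrency", "budget" (assumed labels)
    recommendation: str  # human-readable suggestion shown to the operator
    low: float           # lower bound of the forecast interval
    high: float          # upper bound of the forecast interval


def may_apply(advisory, auto_accept_kinds, operator_approved=False):
    """An advisory is applied only with explicit approval or a pre-configured policy."""
    return operator_approved or advisory.kind in auto_accept_kinds


advisory = Advisory("concurrency", "raise limit to 12", low=10, high=14)
print(may_apply(advisory, auto_accept_kinds=set()))            # no policy, no approval
print(may_apply(advisory, auto_accept_kinds={"concurrency"}))  # operator policy exists
```

Keeping the gate separate from the forecast logic means a bad prediction can at worst produce a bad suggestion, never an unapproved change.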
Three Output Channels
Predictions flow to three consumers: the Adaptive Model Router (pre-selects optimal model), the Concurrency Advisor (recommends limit adjustments), and the Budget Alert system (provides forward-looking cost projections).
Key Components
Workload Predictor
Analyzes three layers of pattern data to forecast upcoming workload:
- Temporal patterns: Hour-of-day, day-of-week, and day-of-month demand curves built from 30+ days of historical data. Detects recurring spikes (Monday mornings, end-of-month, sprint boundaries).
- Session trajectory analysis: Once a session begins, predicts remaining workload based on the current trajectory. A MAJOR-tier workflow at phase 2 of 8 predicts 6 more agent dispatches.
- Weekly profiles: Stored in Qdrant as vector embeddings, enabling similarity search. "This week looks like sprint-end week #14" enables pattern matching against similar historical periods.
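The three layers above can be sketched in a few lines each. This is a simplified illustration: the function names are invented for the example, and the in-memory cosine similarity stands in for the Qdrant similarity search the PRD describes, whose client calls are deliberately not shown.

```python
from collections import defaultdict
from datetime import datetime
from math import sqrt


def demand_curve(timestamps):
    """Temporal patterns: count tasks per (weekday, hour) bucket; Monday is weekday 0."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[(ts.weekday(), ts.hour)] += 1
    return dict(counts)


def remaining_dispatches(current_phase, total_phases):
    """Session trajectory: phases left in the workflow, assuming one dispatch per phase."""
    return max(total_phases - current_phase, 0)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_week(current, history):
    """Weekly profiles: return the label of the historical week closest to the current one."""
    return max(history, key=lambda label: cosine_similarity(current, history[label]))


curve = demand_curve([
    datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 9), datetime(2024, 1, 5, 15),
])
print(curve)
print(remaining_dispatches(2, 8))
print(most_similar_week([3, 1, 1], {"sprint-end": [6, 2, 2], "quiet": [1, 3, 1]}))
```

The one-dispatch-per-phase assumption mirrors the "phase 2 of 8 predicts 6 more agent dispatches" example; a real workflow might dispatch several agents per phase.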
Adaptive Model Router
Extends Agent Economics model routing with predictive pre-selection. Instead of routing each request in isolation, the router considers:
- Predicted workload volume: If 20 tasks are forecast in the next hour, reserve Opus capacity for the 3 likely-complex tasks and pre-select Haiku for the 17 likely-trivial ones.
- Budget trajectory: If the cost forecast shows 90% budget consumption by Wednesday, begin model downgrade recommendations on Monday.
- Historical success rates: For this task type + agent combination, which model historically produces the highest first-pass success rate at the lowest cost?
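The historical-success-rate question can be read as "pick the cheapest model that clears a quality floor". The sketch below illustrates that reading under stated assumptions: `ModelStats`, `preselect_model`, the 0.9 floor, and every number in the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    model: str
    success_rate: float   # historical first-pass success rate, 0..1
    cost_per_task: float  # average cost in USD for this task type


def preselect_model(stats, min_success_rate=0.9):
    """Return the cheapest model meeting the success floor.

    If no model meets the floor, fall back to the most reliable one.
    """
    if not stats:
        raise ValueError("no historical statistics available")
    eligible = [s for s in stats if s.success_rate >= min_success_rate]
    if eligible:
        return min(eligible, key=lambda s: s.cost_per_task).model
    return max(stats, key=lambda s: s.success_rate).model


# Illustrative history for two task types (made-up figures).
classification = [ModelStats("haiku", 0.95, 0.12), ModelStats("opus", 0.99, 2.40)]
refactoring = [ModelStats("haiku", 0.60, 0.12), ModelStats("opus", 0.97, 2.40)]
print(preselect_model(classification))
print(preselect_model(refactoring))
```

The output is still an advisory: the router or operator decides whether to follow it.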
Cache Pre-Warming Engine
Runs at 06:00 daily (configurable). Analyzes predicted workload and pre-loads:
- Qdrant collections likely to be accessed
- Prompt cache entries for predicted task types
- Context pre-loads for active projects
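A minimal way to decide what to pre-load is to warm only the collections that were accessed on a large enough share of recent days. The helper below is a sketch under that assumption; the function name, the input shape, and the 0.5 hit-rate cutoff are all hypothetical, and the actual loading step is left out.

```python
def collections_to_prewarm(days_accessed, days_observed, min_hit_rate=0.5):
    """Select collections accessed on at least min_hit_rate of the observed days.

    days_accessed maps a collection name to the number of days it was used.
    """
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    return sorted(
        name
        for name, days in days_accessed.items()
        if days / days_observed >= min_hit_rate
    )


# Illustrative 30-day access history (made-up collection names and counts).
history = {"standup-context": 25, "sprint-planning": 10, "archive": 2}
print(collections_to_prewarm(history, days_observed=30))
print(collections_to_prewarm(history, days_observed=30, min_hit_rate=0.3))
```

Raising the cutoff trades a higher chance of a cold first access for less wasted warm-up work.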
Cost Forecaster
Produces cost projections with confidence intervals (p25/p50/p75/p95) for the next 24 hours, 7 days, and 30 days. Factors in predicted workload, model routing changes, and historical cost-per-task data.
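The percentile bands can be computed directly from a set of cost samples, for example historical daily costs for comparable days. The sketch below uses `statistics.quantiles` from the standard library; the function name `cost_percentiles` and the choice of the "inclusive" method are assumptions for illustration.

```python
from statistics import quantiles


def cost_percentiles(samples):
    """Return the p25/p50/p75/p95 cost levels from a list of cost samples."""
    # n=100 yields 99 cut points; cut point p-1 is the p-th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in (25, 50, 75, 95)}


# Illustrative samples: costs of 0..100 USD in whole-dollar steps.
print(cost_percentiles(list(range(101))))
```

Note that the p25 to p75 band covers the middle 50% of samples, and p25 to p95 covers 70%, so the stated probability of any quoted range follows from which two percentiles bound it.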
Concurrency Advisor
Recommends concurrency limit adjustments based on predicted demand. Proposes temporary increases before anticipated spikes and decreases during predicted quiet periods. All changes require operator policy or explicit approval.
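A recommendation of this kind might be derived by scaling the predicted peak by a headroom factor and clamping the result. In the sketch below, `recommend_concurrency`, the 1.5 headroom, and the floor and ceiling values are hypothetical; the headroom was chosen so that a predicted peak of 8 dispatches yields the limit of 12 from the retrospective scenario.

```python
from math import ceil


def recommend_concurrency(predicted_peak, current_limit, headroom=1.5, floor=1, ceiling=16):
    """Build an advisory for the concurrency limit; it never applies the change itself."""
    target = ceil(predicted_peak * headroom)
    recommended = max(floor, min(target, ceiling))
    return {
        "current_limit": current_limit,
        "recommended_limit": recommended,
        "direction": (
            "increase" if recommended > current_limit
            else "decrease" if recommended < current_limit
            else "hold"
        ),
        "requires_approval": True,  # operator policy or explicit approval is still needed
    }


print(recommend_concurrency(predicted_peak=8, current_limit=8))
print(recommend_concurrency(predicted_peak=2, current_limit=8))
```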
Prediction Quality Monitor
Tracks prediction accuracy over time using MAPE (Mean Absolute Percentage Error). Automatically degrades to conservative defaults when prediction quality drops below configured thresholds. Reports accuracy trends to the dashboard.
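MAPE is the mean of |actual - forecast| / |actual|, expressed as a percentage. The sketch below computes it and applies a degradation rule; the function names are assumptions, the handling of zero actuals (skipped, since the ratio is undefined) is a choice made for this example, and the default threshold and window are taken from the 40% over 7 consecutive days stated in REQ-PS-009.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent, skipping zero actuals."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined without non-zero actual values")
    return 100 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


def should_degrade(daily_mape, threshold=40.0, window=7):
    """True when the last `window` daily MAPE values all exceed the threshold."""
    recent = daily_mape[-window:]
    return len(recent) == window and all(m > threshold for m in recent)


print(mape([100, 200], [110, 180]))
print(should_degrade([45.0] * 7))
print(should_degrade([45.0] * 6 + [30.0]))
```

Requiring a full window of consecutive bad days avoids tripping the fallback on a single outlier day.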
Prediction Trigger Rules
| Trigger | Condition | Action |
|---|---|---|
| Session Start | New conductor session detected | Generate session-scope workload forecast and model routing advisory |
| 3-Task Threshold | 3 tasks completed in current session | Refine session trajectory prediction with actual data, update model routing |
| Daily Profile | 06:00 local time (configurable) | Generate 24-hour workload forecast, execute cache pre-warming, update budget forecast |
| Weekly Profile | Sunday 22:00 (configurable) | Generate 7-day workload forecast, identify similar historical weeks, update monthly budget projection |
| Budget Alert | Cost forecast exceeds 70% of budget at any confidence level | Emit proactive budget warning with forecast details and recommended actions |
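The Budget Alert row can be read as: warn if any reported percentile of the cost forecast is above 70% of the budget. The sketch below implements that reading; `budget_alert` and the forecast dictionary shape are assumptions for illustration, and the sample figures are made up.

```python
def budget_alert(forecast, budget, threshold=0.70):
    """Return the percentiles whose projected cost exceeds threshold * budget.

    An empty result means no proactive warning is needed.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    limit = threshold * budget
    return {label: cost for label, cost in forecast.items() if cost > limit}


forecast = {"p25": 40.0, "p50": 55.0, "p75": 68.0, "p95": 82.0}
print(budget_alert(forecast, budget=100.0))
```

Because percentiles are ordered, p95 is always the first to cross the line, so in practice "any confidence level" makes the alert as sensitive as its most pessimistic percentile.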
Requirements
| ID | Requirement | Priority |
|---|---|---|
| REQ-PS-001 | Workload Predictor SHALL analyze at least 30 days of historical data to produce temporal demand curves (hour-of-day, day-of-week, day-of-month) with confidence intervals. | MUST |
| REQ-PS-002 | Session trajectory analysis SHALL activate after 3 completed tasks in a session and produce updated workload forecasts for the remaining session within 2 seconds of trigger. | MUST |
| REQ-PS-003 | Adaptive Model Router SHALL produce model pre-selection advisories that consider predicted workload volume, budget trajectory, and historical success rates per task-type/agent/model combination. | MUST |
| REQ-PS-004 | Cache Pre-Warming Engine SHALL execute daily at a configurable time (default 06:00) and pre-load Qdrant collections, prompt cache entries, and context data predicted for the next 12 hours. | MUST |
| REQ-PS-005 | Cost Forecaster SHALL produce cost projections with p25/p50/p75/p95 confidence intervals for 24-hour, 7-day, and 30-day horizons, updated at each prediction trigger event. | MUST |
| REQ-PS-006 | All predictions SHALL be advisory. No automated system changes (model routing overrides, concurrency limit adjustments, budget modifications) SHALL occur without operator policy or explicit approval. | MUST |
| REQ-PS-007 | Concurrency Advisor SHALL recommend limit adjustments (increase before predicted spikes, decrease during predicted quiet periods) with supporting data from the workload forecast. | SHOULD |
| REQ-PS-008 | Weekly profiles SHALL be stored as vector embeddings in Qdrant, enabling similarity search to match current-week patterns against historical weeks for improved forecasting accuracy. | SHOULD |
| REQ-PS-009 | Prediction Quality Monitor SHALL track MAPE (Mean Absolute Percentage Error) for all forecast types and automatically degrade to conservative defaults when MAPE exceeds 40% for 7 consecutive days. | MUST |
| REQ-PS-010 | System SHALL use statistical methods only (time-series decomposition, exponential smoothing, regression). No machine learning models, no training pipelines, no GPU requirements. | MUST |
| REQ-PS-011 | Budget Alert system SHALL emit proactive warnings when cost forecast exceeds 70% of the configured budget at any confidence level (p25/p50/p75/p95), including recommended mitigation actions. | SHOULD |
| REQ-PS-012 | System SHALL expose prediction data to the Memory Dashboard (PRD 6) including forecast visualizations, prediction accuracy trends, and cache warming effectiveness metrics. | COULD |
Design Decisions
Key architectural and design choices, with rationale for each decision.
- Statistical methods, not machine learning. ML models require training data volumes, GPU compute, and maintenance overhead that are disproportionate to the prediction task. Time-series decomposition and exponential smoothing produce adequate forecasts for workload patterns that are fundamentally cyclical (hour/day/week). When prediction quality degrades, the system falls back to conservative defaults rather than producing confidently wrong forecasts.
- Advisory, not mandatory. Predictions are probabilistic by nature. Forcing automated actions based on forecasts introduces a new failure mode: the system actively doing the wrong thing based on a bad prediction. Advisory mode preserves operator agency while providing the data needed for informed decisions. Operators can create policies to auto-accept specific advisory types once they trust the predictions.
- 3-task prediction trigger. Session trajectory analysis requires a minimum sample size before predictions become meaningful. With fewer than 3 completed tasks, the signal-to-noise ratio is too low. The 3-task threshold was chosen empirically: it provides enough data to distinguish TRIVIAL from MAJOR workflow patterns while activating early enough to be useful.
- Daily profile updates at 06:00. Cache pre-warming must complete before the typical start of the workday. 06:00 provides a 2-3 hour buffer for pre-warming execution. The time is configurable because team schedules vary. Pre-warming earlier wastes resources on cache entries that expire; pre-warming later risks cold caches at session start.
- Confidence intervals on all forecasts. A point estimate ("you will spend $47 tomorrow") is misleading. An interval ("you will spend between $31 and $68, the p25 to p95 range, with 70% probability") communicates the inherent uncertainty in prediction and enables operators to make risk-appropriate decisions. The p25/p50/p75/p95 format is familiar to engineers.
- Automatic degradation on poor prediction quality. When MAPE exceeds 40% for 7 consecutive days, the system stops producing predictions and falls back to conservative defaults (no pre-warming, no routing advisories, static concurrency limits). This prevents a poorly-calibrated prediction engine from causing more harm than no predictions at all.
Integration Map
Predictive Scaling integrates with 8 systems across the ecosystem, serving as a forward-looking complement to Agent Economics' backward-looking cost analysis.
| Integration | Direction | Data Exchanged |
|---|---|---|
| Agent Economics (PRD 10) | Reads / Advises | Reads historical cost data, model routing decisions, token consumption. Provides model routing advisories and cost forecasts back to Agent Economics for enhanced routing decisions. |
| Conductor (PRD 2) | Reads from | Session start events, tier classifications, phase sequences, and agent dispatch records. Primary data source for workload pattern analysis and session trajectory prediction. |
| Memory System (PRD 4) | Reads / Writes | Reads historical patterns and weekly profiles from Qdrant. Writes prediction metadata and weekly profile embeddings for cross-session continuity. |
| Memory Dashboard (PRD 6) | Writes to | Forecast visualizations, prediction accuracy trends, cache warming effectiveness metrics, and cost forecast charts with confidence intervals. |
| Event-Driven (PRD 13) | Subscribes / Publishes | Subscribes to conductor events for real-time session tracking. Publishes prediction events (forecast generated, advisory issued, cache warming completed) for downstream consumers. |
| Outcome Measurement (PRD 15) | Reads from | Historical success rates, TTR data, and quality scores to improve model routing accuracy and workload estimation. |
| Context Guard (PRD 3) | Advises | Provides context pre-load recommendations based on predicted task types, enabling Context Guard to proactively manage context window allocation. |
| n8n Workflows | Triggers | Prediction events can trigger n8n workflows for external notifications (Slack alerts for budget forecasts, email digests for weekly predictions). |
Prompt to Build It
Copy and paste this prompt into Claude Code to begin implementing Predictive Scaling. It references the architecture, components, and integration points defined in this PRD.
Build the Predictive Scaling system (PRD 16) for the conductor plugin ecosystem.

## Context
Agent Economics (PRD 10) is reactive: it tracks cost after spend occurs and routes models without historical pattern awareness. We need a forward-looking prediction layer that anticipates demand, pre-warms caches, and optimizes model routing using statistical forecasting.

## Architecture
Prediction Engine (sidecar to Agent Economics) with three sub-components:
- Workload Predictor: temporal patterns (hour/day/week), session trajectory analysis (activates after 3 tasks), weekly profiles stored as Qdrant vectors
- Cost Forecaster: p25/p50/p75/p95 confidence intervals for 24h/7d/30d horizons
- Cache Pre-Warming Engine: runs daily at 06:00, pre-loads Qdrant collections, prompt caches, and context data based on 12-hour workload forecast

## Output Channels
1. Adaptive Model Router: predictive pre-selection based on forecast + budget + historical success rates per task-type/agent/model combination
2. Concurrency Advisor: temporary limit adjustments before predicted spikes
3. Budget Alert system: proactive warnings at 70% forecast threshold

## Key Constraints
- Statistical methods ONLY: time-series decomposition, exponential smoothing, regression. No ML, no training pipelines, no GPU requirements
- Advisory ONLY: no automated changes without operator policy or approval
- 3-task minimum before session trajectory prediction activates
- Automatic degradation when MAPE exceeds 40% for 7 consecutive days
- All forecasts include confidence intervals (p25/p50/p75/p95)

## Prediction Triggers
- Session Start: generate session-scope workload forecast
- 3-Task Threshold: refine trajectory with actual session data
- Daily (06:00): 24-hour forecast + cache pre-warming execution
- Weekly (Sunday 22:00): 7-day forecast + similar-week matching
- Budget Alert: proactive warning when forecast exceeds 70% budget

## Integration Points
- Reads: Agent Economics (PRD 10), Conductor (PRD 2), Memory (PRD 4), Outcome Measurement (PRD 15)
- Writes: Memory Dashboard (PRD 6), Memory System (PRD 4)
- Subscribes/Publishes: Event-Driven (PRD 13)
- Advises: Context Guard (PRD 3), n8n workflows

## Requirements
12 requirements (REQ-PS-001 through REQ-PS-012). All MUST-priority items are non-negotiable: temporal demand curves, session trajectory analysis, model routing advisories, daily cache pre-warming, cost confidence intervals, advisory-only mode, statistical-only methods, and automatic degradation.