What Good Looks Like
This roadmap describes one year of deliberate organizational change, not twelve months of model building. The goal is durability—systems that work, are trusted, and remain governable after leadership attention moves elsewhere.
Annual View
A good year ends with:
| Outcome | Why It Matters |
| --- | --- |
| Fewer arguments | Reality is shared |
| Fewer heroics | Risk is designed out |
| Fewer surprises | Incentives and ownership are explicit |
| Continuity | System functions when original team leaves |
At year end, the organization has: shared understanding, explicit ownership, managed risk, institutional memory, and the ability to say "no" as confidently as "yes."
Quarterly View
Each quarter solves a human problem before it becomes a technical or financial one.
| Quarter | Name | Human Aim | Gate | Primary Outputs |
| --- | --- | --- | --- | --- |
| Q1 | Diagnostics | Align people on reality before building anything expensive | Problem & success definition locked; baseline approved | Ontology, KPI targets, dataset inventory, baseline + error analysis |
| Q2 | Architect | Reduce ambiguity so teams stop arguing and start shipping | Architecture review passed; security/compliance accepted | System design, IaC plan, schema/versioning, baseline pipeline |
| Q3 | Engineer | Build with guardrails so operators don't carry risk | Validation suite green; risk controls implemented | Eval harness, red-team results, drift/bias checks, rollout plan |
| Q4 | Enable | Make the system survivable after handoff | Production readiness met; monitoring live; owner assigned | Runbooks, dashboards, change mgmt, ROI review |
Quarterly Roadmap
Gate Types
| Badge | Name | Meaning |
| --- | --- | --- |
| HJG | Human Judgment Gate | Requires explicit human decision-making. Not automatable. |
| $ | Economic Gate | Requires ROI validation before proceeding. Kill criteria apply. |
| ⚠ | Irreversibility Flag | Decisions costly to unwind. Extra scrutiny required. |
| CT | Cost Telemetry Contract | Metrics with named owners, refresh cadence, and kill bindings. |
HJG Procedural Requirements
Human Judgment Gates require procedural enforcement, not just cultural compliance:
- Convener: Named person responsible for scheduling the gate (typically Product Owner or Tech Lead)
- Quorum: Minimum 2 reviewers with authority to approve or reject
- Evidence: Required artifacts must be submitted 48 hours before the gate
- Dissent: Dissenting views must be documented even if overruled
- Escalation: If gate is missed or delayed >5 business days, automatic escalation to Exec Sponsor
- Record: Decision, rationale, attendees, and dissent logged in Decision Memory Ledger
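As a concrete illustration (not a prescribed schema), here is a minimal Python sketch of what a gate record and its procedural checks could look like; every field and function name is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class HJGRecord:
    """One Human Judgment Gate decision, as logged to the Decision Memory Ledger."""
    gate_name: str                       # e.g. "Phase 4 ROI Gate"
    convener: str                        # named person who scheduled the gate
    reviewers: list[str]                 # quorum requires at least 2
    evidence_submitted_on: date          # must be at least 48 hours before the gate
    gate_date: date
    decision: str                        # illustrative values: "approve" | "reject" | "defer"
    rationale: str
    dissent: list[str] = field(default_factory=list)  # documented even if overruled

    def procedural_violations(self) -> list[str]:
        """Return violations of the HJG requirements above; empty list means well-formed."""
        problems = []
        if len(self.reviewers) < 2:
            problems.append("quorum not met: need at least 2 reviewers")
        if (self.gate_date - self.evidence_submitted_on).days < 2:
            problems.append("evidence submitted less than 48 hours before the gate")
        if self.decision != "approve" and not self.rationale:
            problems.append("non-approval requires a documented rationale")
        return problems
```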
How to Use This Playbook
What "Done" Means
"Done" is not a model that runs. "Done" is a capability that can be measured, audited, rolled back, and re-learned by a new team without tribal memory.
What Breaks Teams
Most programs fail from missing evidence: unclear intent, no acceptance criteria, no telemetry contract, no rollback plan, and no operating owner. This playbook forces those decisions earlier.
Each month maps to a phase. Organizations may compress or extend phases based on complexity, but the sequence should not be reordered. Skipping phases creates debt that surfaces later—usually at the worst possible time.
Phase Evidence Packs
Each phase exit requires a formal Evidence Pack. This makes gatekeeping less subjective without bureaucratizing it.
| Phase | Evidence Pack ID | Required Artifacts | Reviewer |
| --- | --- | --- | --- |
| 01 Ontology | PH1-EVID-1 | Expert map, concept glossary, relationship diagram, contested concept log | Domain Lead + Product |
| 02 Problem Space | PH2-EVID-1 | Boundary stress tests, edge case matrix, scope validation results | Tech Lead + Product |
| 03 Discovery | PH3-EVID-1 | Stakeholder interview notes, data inventory, regulatory constraint map | Product + Compliance |
| 04 Alignment | PH4-EVID-1 | Architecture ROI pack, stakeholder sign-off matrix, risk acceptance docs | Exec Sponsor + Finance |
| 05 Integration | PH5-EVID-1 | IaC validation logs, schema version registry, security scan results | Platform Lead + Security |
| 06 Build | PH6-EVID-1 | Baseline model metrics, telemetry contract, reproducibility proof | ML Lead + SRE |
| 07 Validation | PH7-EVID-1 | Test suite results, bias audit, red team report, pen test findings | QA Lead + Security |
| 08 Pre-Production | PH8-EVID-1 | Load test results, canary metrics, rollback verification, kill drill results | SRE Lead + Ops |
| 09 Hypercare | PH9-EVID-1 | Launch checklist, escalation log, rapid iteration tracking | Product + Support Lead |
| 10 Production | PH10-EVID-1 | Deployment verification, autoscaling proof, rollback test results | SRE + Platform Lead |
| 11 Reliability | PH11-EVID-1 | Observability dashboard, on-call rotation, decay detection baseline | SRE Lead + ML Lead |
| 12 Continuous | PH12-EVID-1 | Automation inventory, knowledge transfer docs, next iteration brief | Tech Lead + Product |
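A minimal sketch of how pack completeness could be checked mechanically before a review is scheduled, assuming artifacts are tracked under short identifiers; the identifiers below are illustrative, not a mandated naming scheme.

```python
# Minimal evidence-pack completeness check; artifact names mirror the table above.
REQUIRED_ARTIFACTS = {
    "PH1-EVID-1": ["expert_map", "concept_glossary", "relationship_diagram", "contested_concept_log"],
    "PH6-EVID-1": ["baseline_model_metrics", "telemetry_contract", "reproducibility_proof"],
    # ... remaining packs follow the same pattern
}

def missing_artifacts(pack_id: str, submitted: set[str]) -> list[str]:
    """Return required artifacts not present in the submitted pack."""
    return [a for a in REQUIRED_ARTIFACTS.get(pack_id, []) if a not in submitted]

if __name__ == "__main__":
    gaps = missing_artifacts("PH6-EVID-1", {"baseline_model_metrics", "telemetry_contract"})
    if gaps:
        print(f"Gate review blocked; missing: {gaps}")  # -> ['reproducibility_proof']
```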
Stop Authority Drills HJG
Stop authority is psychologically harder than rollback. Organizations must practice stopping, not just responding.
⚠ Mandatory Requirement
At least one simulated kill-decision exercise must be run before Phase 8. This forces the organization to practice stopping a project that has momentum, budget, and stakeholder investment.
| Drill Type | Timing | Participants | Success Criteria |
| --- | --- | --- | --- |
| Economic Kill Drill | Before Phase 4 ROI Gate | Finance, Product, Exec Sponsor | Team can articulate kill threshold and demonstrate willingness to invoke it |
| Technical Kill Drill | Before Phase 8 | ML Lead, SRE, Platform | Rollback executes in <15 min; all dependencies notified; audit trail complete |
| Compliance Kill Drill | Before Phase 9 | Legal, Compliance, Product | Stop authority invoked on simulated regulatory finding; communication chain verified |
| Adoption Kill Drill | Before Phase 10 | Product, UX, Support | Team can define minimum viable adoption shape and demonstrate kill criteria |
Drill Protocol
- Scenario briefing: Present a realistic kill condition (cost overrun, bias discovery, adoption failure)
- Decision simulation: Team must reach consensus on kill/continue within 30 minutes
- Execution proof: If kill, demonstrate the technical and communication steps
- Debrief: Document hesitation points, authority gaps, and process improvements
Anti-Patterns & Red Flags
Strong governance systems risk becoming performative. Watch for these signals that artifacts are being completed without genuine engagement.
| Anti-Pattern | Red Flag Signals | Root Cause | Intervention |
| --- | --- | --- | --- |
| Backfilled Model Card | Model Card completed after deployment; sections copy-pasted from templates; no evidence of reviewer engagement | Documentation treated as compliance checkbox, not design artifact | Require Model Card draft at Phase 6; reviewer must sign with specific feedback |
| Mechanical Risk Register | All risks rated "Medium"; mitigations are generic; no risks ever escalated or retired | Risk assessment is ceremonial; no one expects it to drive decisions | Require at least one risk escalation per quarter; track risk-to-decision linkage |
| Phantom RACI | RACI exists but decisions still escalate informally; "Accountable" person doesn't know they're accountable | Authority transfer is documented but not socialized | RACI owner must verbally confirm role; escalation test in Phase 4 |
| Ceremonial HJG | Human Judgment Gates passed in <5 minutes; no dissent recorded; same person approves everything | Gates are scheduled but not staffed for genuine deliberation | Require minimum 2 reviewers; document dissenting views even if overruled |
| Orphaned Telemetry | Dashboards exist but no one checks them; alerts fire but aren't investigated | Observability built for audit, not for operations | Weekly telemetry review with named owner; alert-to-action audit |
| Compliance Theater | Legal/Compliance consulted only for sign-off; concerns raised late are dismissed as "blocking" | Compliance treated as gate, not design partner | Compliance representative in Phase 3 discovery; veto power through Phase 7 |
| Tribal Knowledge Dependency | Key decisions explained verbally; documentation says "ask Sarah"; Bus Factor = 1 | Urgency prioritized over durability | Knowledge transfer test: new team member must execute runbook solo |
Audit Question
For each artifact, ask: "If I removed this document, would anyone notice? Would any decision change?" If the answer is no, the artifact is performative.
Kill Criteria & Stop Authority
Projects fail expensively when nobody has the right—or the obligation—to stop them. Define explicit kill criteria early and assign named stop authority before incentives and sunk cost take over.
Kill criteria (examples)
- Cost per Unit of AI Work exceeds threshold for 3 consecutive cycles
- Unmitigated safety or compliance breach
- Performance regression beyond agreed tolerance
- Adoption remains below target despite corrective actions
Stop authority
- Named individual (not a committee)
- Clear escalation path and decision window
- Rollback power without political permission
- Evidence required for restart
Executive Control Surface
A CIO/CTO should monitor these 6 signals monthly. When thresholds are breached, intervention is required—not optional.
Monthly Monitoring Signals
| Signal | Description | Healthy | Warning | Critical |
| --- | --- | --- | --- | --- |
| Unit Economics Health | Cost per inference relative to value delivered | <80% of value | 80-100% | >100% (value-negative) |
| Model Performance Decay | Accuracy/precision drift from baseline | <5% decay | 5-15% | >15% (trigger retraining) |
| Error Rate by Consequence | Errors weighted by business impact | <$10K/mo impact | $10-50K | >$50K (escalate) |
| Human Override Rate | How often humans reject model outputs | 5-20% | <5% or >30% | <2% or >50% |
| Time-to-Rollback | How quickly the system can be reverted | <15 min | 15-60 min | >60 min (unacceptable) |
| Compliance Drift | Gap between current state and requirements | Fully compliant | Minor gaps | Material gaps (halt) |
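A minimal sketch of how a monthly review could classify each signal against these thresholds, assuming the raw numbers are already collected; the dictionary keys and cut-offs mirror the table and are otherwise illustrative.

```python
def classify_signals(m: dict) -> dict:
    """Map each monthly signal to healthy / warning / critical per the thresholds above."""
    status = {}
    status["unit_economics"] = ("healthy" if m["cost_value_ratio"] < 0.8
                                else "warning" if m["cost_value_ratio"] <= 1.0 else "critical")
    status["model_decay"] = ("healthy" if m["decay"] < 0.05
                             else "warning" if m["decay"] <= 0.15 else "critical")
    status["error_cost"] = ("healthy" if m["weighted_error_cost"] < 10_000
                            else "warning" if m["weighted_error_cost"] <= 50_000 else "critical")
    o = m["override_rate"]
    # The table does not classify the 20-30% band; it is treated as healthy here.
    status["override_rate"] = ("critical" if o < 0.02 or o > 0.50
                               else "warning" if o < 0.05 or o > 0.30 else "healthy")
    status["time_to_rollback"] = ("healthy" if m["rollback_minutes"] < 15
                                  else "warning" if m["rollback_minutes"] <= 60 else "critical")
    status["compliance"] = {"compliant": "healthy", "minor_gaps": "warning"}.get(
        m["compliance_state"], "critical")
    return status
```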
Intervention Triggers
These conditions require immediate executive action—not delegation.
1. Kill Trigger: ROI Collapse. If cost-per-inference exceeds value-per-inference for 2 consecutive months, initiate sunset review. Do not wait for quarter-end.
2. Escalation Trigger: Consequential Error Spike. If weighted error cost exceeds $50K in any month, convene incident review within 48 hours. Model may need to be pulled from production.
3. Governance Trigger: Compliance Gap. Any material compliance gap halts new feature deployment until resolved. Non-negotiable in regulated industries.
Decision Authority Matrix
| Decision | Owner | Consulted | Informed |
| --- | --- | --- | --- |
| Model goes to production | CTO / VP Eng | Legal, Compliance, Product | Board (if high-risk) |
| Model is sunset | CTO + CFO jointly | Product, Customer Success | Affected customers |
| Emergency rollback | On-call engineer | None (act first) | CTO within 1 hour |
| Compliance exception | General Counsel | CTO, CISO | Board |
| Budget increase >20% | CFO | CTO, Product | Board |
Economic Viability Framework
Cost is not a constraint—it is a governing force. Every AI system must justify its existence economically, continuously.
$1 Unit Economics Definition Gate
Before any model is built, define the economic unit. What is the cost of one inference? What is the value of one correct output?
Economic Gate: If value-per-inference cannot be estimated within 10x accuracy, the project is not ready for development. Return to Discovery.
- E.1.1 Cost-per-Inference Model: compute, storage, network, and human review costs per prediction.
- E.1.2 Value-per-Inference Model: revenue generated, cost avoided, or risk mitigated per correct output.
- E.1.3 Break-even Analysis: volume required for positive ROI at current accuracy levels (see the sketch after this list).
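To make the unit-economics arithmetic concrete, a minimal sketch of the break-even calculation under the definitions above; the numbers in the example are illustrative only.

```python
def break_even_volume(fixed_cost_per_month: float,
                      cost_per_inference: float,
                      value_per_correct_output: float,
                      accuracy: float) -> float:
    """Monthly inference volume at which expected value covers cost (E.1.3).

    Expected value per inference = value_per_correct_output * accuracy.
    Returns float('inf') if each inference destroys value at current accuracy.
    """
    margin = value_per_correct_output * accuracy - cost_per_inference
    if margin <= 0:
        return float("inf")
    return fixed_cost_per_month / margin

# Illustrative numbers only: $12k/month platform cost, $0.03 per inference,
# $0.25 of value per correct output, 92% accuracy -> ~60k inferences/month to break even.
print(round(break_even_volume(12_000, 0.03, 0.25, 0.92)))
```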
$2 Cost-of-Error Curves
Not all errors are equal. Map the cost of different error types and their frequency.
- E.2.1 Error Taxonomy with Cost Weights: false positives, false negatives, edge cases, each with a dollar impact.
- E.2.2 Cost-of-Error vs Latency Trade-off Curves: faster inference often means more errors; quantify the trade-off.
- E.2.3 Error Budget Allocation: acceptable error rates by type, based on economic tolerance (see the sketch after this list).
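A minimal sketch of a weighted error-cost roll-up under an error taxonomy like E.2.1; the dollar weights are illustrative, not benchmarks.

```python
# Error taxonomy with illustrative dollar weights (E.2.1), feeding the error budget (E.2.3).
ERROR_COST_USD = {"false_positive": 4.0, "false_negative": 120.0, "edge_case": 35.0}

def monthly_error_cost(error_counts: dict[str, int]) -> float:
    """Sum of (count x dollar impact) across error types for one month."""
    return sum(ERROR_COST_USD.get(kind, 0.0) * n for kind, n in error_counts.items())

# Example: 2,000 FPs, 40 FNs, 15 edge cases -> $8,000 + $4,800 + $525 = $13,325
print(monthly_error_cost({"false_positive": 2000, "false_negative": 40, "edge_case": 15}))
```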
$3 Kill Thresholds HJG
Define the conditions under which the project is terminated, before you're emotionally invested.
⚠ Irreversibility Flag: Kill thresholds must be defined before Phase 4 (Alignment). Once development begins, sunk cost bias makes objective termination nearly impossible.
- E.3.1 Kill Criteria Document: specific, measurable conditions that trigger project termination.
- E.3.2 Sunset Procedure: how to wind down gracefully (data retention, customer communication, team reallocation).
- E.3.3 Pivot Criteria: conditions under which the project should change direction rather than die.
$4 ROI Gates at Phase Boundaries
Economic viability is validated at Phases 4, 8, and 12. Not annually; at milestones.
- E.4.1 Phase 4 ROI Gate (Design Complete): projected ROI based on architecture decisions. Kill if negative at projected scale.
- E.4.2 Phase 8 ROI Gate (Pre-Production): validated ROI based on staging performance. Kill if <1.5x projected.
- E.4.3 Phase 12 ROI Gate (Steady State): actual ROI vs projected. Sunset if <1.0x after 3 months in production (see the sketch after this list).
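A minimal sketch of the three ROI gate rules as a single decision function; the thresholds mirror E.4.1 through E.4.3, and everything else is illustrative.

```python
def roi_gate(phase: int, actual_roi: float, projected_roi: float) -> str:
    """Apply the phase-boundary ROI rules above; returns 'continue', 'kill', or 'sunset'."""
    if phase == 4:    # design complete: only projected ROI exists; kill if negative
        return "kill" if projected_roi < 0 else "continue"
    if phase == 8:    # pre-production: kill if validated ROI is below 1.5x of projection
        return "kill" if actual_roi < 1.5 * projected_roi else "continue"
    if phase == 12:   # steady state: sunset if actual ROI falls below 1.0x of projection
        return "sunset" if actual_roi < projected_roi else "continue"
    raise ValueError("ROI gates are defined at Phases 4, 8, and 12 only")
```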
Economic Sovereignty Principle
A model that cannot pay for itself is a liability, not an asset. Economic viability is not a constraint to work around—it is the purpose the system must serve.
Cost Telemetry Contract
Economic sovereignty is not merely conceptual; it is physically enforced. Every production system must satisfy this contract. No exceptions.
Mandatory Enforcement
Each metric below requires a named human owner (not "team"), a defined refresh cadence, a review forum, and a binding to a specific kill threshold. Systems without complete telemetry contracts do not ship.
Required Telemetry Metrics
| Metric | Owner | Refresh | Reviewed By | Kill Trigger |
| --- | --- | --- | --- | --- |
| Cost per inference (fully loaded) | Engineering Manager | Daily | CTO + CFO | >1.0× value for 2 months |
| Error cost per month (weighted) | Product Manager | Weekly | Executive Review | >$50K/month |
| Human review cost per output | Operations Lead | Weekly | Ops Review | >30% of inference cost |
| Compute cost per 1K inferences | Platform Engineer | Real-time | Infra Review | >2× baseline for 1 week |
| Retraining cost per cycle | ML Engineer | Per event | ML Review | >1 month of value |
| Value delivered per inference | Business Analyst | Monthly | Exec Review | <0.8× projected for 2 months |
CT Contract Artifact: CT-1
The Cost Telemetry Contract must be completed and signed off before Phase 8 (Pre-Production).
- CT-1.1 Telemetry Implementation Checklist: each metric instrumented with a data pipeline and dashboard.
- CT-1.2 Owner Assignment Document: named individuals (not roles) with escalation paths.
- CT-1.3 Alert Configuration Spec: automated alerts for threshold breaches with escalation rules.
- CT-1.4 Review Cadence Calendar: standing meetings where each metric is reviewed with owners present.
Enforcement Mechanism
The CT-1 artifact is a gate artifact. Production deployment is blocked until all six metrics have verified telemetry, named owners, and configured alerts.
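A minimal sketch of how this gate could be enforced in CI, assuming the contract is stored as structured data alongside the deployment config; the metric keys and field names are illustrative.

```python
REQUIRED_METRICS = {
    "cost_per_inference", "error_cost_per_month", "human_review_cost_per_output",
    "compute_cost_per_1k_inferences", "retraining_cost_per_cycle", "value_per_inference",
}

def ct1_gate(contract: dict) -> list[str]:
    """Return blocking issues; deployment proceeds only if the list is empty."""
    issues = []
    for name in REQUIRED_METRICS:
        entry = contract.get(name)
        if entry is None:
            issues.append(f"{name}: missing from contract")
            continue
        owner = entry.get("owner") or ""
        if not owner or owner.strip().lower() == "team":
            issues.append(f"{name}: needs a named human owner, not a team")
        for key in ("refresh_cadence", "review_forum", "kill_threshold", "alert_configured"):
            if not entry.get(key):
                issues.append(f"{name}: missing {key}")
    return issues
```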
Implementation Templates
Production-ready templates for governance artifacts. Copy, customize, and deploy. These are starting points—adapt to your regulatory context.
Template Philosophy
Templates reduce cognitive load but create false confidence if used without adaptation. Each template includes "Customization Required" flags for organization-specific decisions.
T.1 RACI Matrix Template
Responsibility assignment for the AI/ML lifecycle. The most common failure mode is "everyone is responsible" (meaning no one is).
| Activity / Decision | ML Engineer | Product Manager | Data Engineer | Security | Legal/Compliance | Executive Sponsor |
| --- | --- | --- | --- | --- | --- | --- |
| Problem definition sign-off | C | R | C | I | C | A |
| Data availability assessment | C | I | R | C | C | I |
| Regulatory constraint mapping | I | C | C | C | R | A |
| Kill criteria definition | C | R | I | I | C | A |
| Architecture design | R | C | C | C | I | I |
| Security posture approval | C | I | C | R | C | A |
| Data pipeline implementation | C | I | R | C | I | I |
| Model training & selection | R | C | C | I | I | I |
| Bias/fairness evaluation | R | C | I | I | C | A |
| Security penetration testing | C | I | I | R | I | I |
| Production readiness sign-off | C | R | C | C | C | A |
| Rollback plan validation | R | C | C | C | I | I |
| Production deployment | R | C | C | C | I | I |
| Incident response (L1) | R | I | C | C | I | I |
| Incident escalation (L2+) | C | R | C | C | C | A |
| Model retraining decision | R | C | C | I | I | A |
| Kill/sunset decision | C | C | I | C | C | A |
- R = Responsible (does the work)
- A = Accountable (final decision authority)
- C = Consulted (input required)
- I = Informed (kept updated)
⚙ Customization Required
- Add organization-specific roles (e.g., AI Ethics Board, Model Risk Officer for financial services)
- Adjust "A" assignments based on your governance structure
- For regulated industries, Legal/Compliance may need "A" on more decisions
- Consider adding SRE/Platform team for infrastructure-heavy deployments
T.2 Telemetry Dashboard Configuration
Grafana/Datadog-compatible dashboard specification. These are the minimum viable metrics for production AI governance.
```json
{
"dashboard": {
"title": "AI/ML Production Governance",
"tags": ["ai", "ml", "production", "governance"],
"panels": [
{
"title": "Economic Health",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"targets": [{
"expr": "sum(ml_inference_cost_usd) / sum(ml_value_delivered_usd)",
"legendFormat": "Cost/Value Ratio"
}],
"thresholds": {
"steps": [
{"color": "#999", "value": null},
{"color": "#666", "value": 0.8},
{"color": "#000", "value": 1.0}
]
}
},
{
"title": "Model Performance Decay",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [{
"expr": "1 - (ml_current_accuracy / ml_baseline_accuracy)",
"legendFormat": "Decay from Baseline"
}],
"alert": {
"name": "Model Decay Alert",
"conditions": [{
"evaluator": {"type": "gt", "params": [0.15]},
"operator": {"type": "and"},
"reducer": {"type": "avg"}
}],
"notifications": [{"uid": "ml-oncall-channel"}]
}
},
{
"title": "Human Override Rate",
"type": "gauge",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"targets": [{
"expr": "sum(ml_human_overrides) / sum(ml_total_predictions) * 100",
"legendFormat": "Override %"
}],
"thresholds": {
"steps": [
{"color": "#999", "value": null},
{"color": "#666", "value": 15},
{"color": "#000", "value": 30}
]
}
},
{
"title": "Error Cost by Category",
"type": "piechart",
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 4},
"targets": [{
"expr": "sum by (error_type) (ml_error_cost_usd)",
"legendFormat": "{{error_type}}"
}]
},
{
"title": "Inference Latency P99",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
"targets": [{
"expr": "histogram_quantile(0.99, ml_inference_latency_seconds_bucket)",
"legendFormat": "P99 Latency"
}],
"alert": {
"name": "Latency SLA Breach",
"conditions": [{
"evaluator": {"type": "gt", "params": [2.0]},
"operator": {"type": "and"},
"reducer": {"type": "avg"}
}]
}
},
{
"title": "Data Drift Score",
"type": "timeseries",
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 12},
"targets": [{
"expr": "ml_feature_drift_score",
"legendFormat": "{{feature_name}}"
}],
"thresholds": {
"steps": [
{"color": "#999", "value": null},
{"color": "#666", "value": 0.1},
{"color": "#000", "value": 0.25}
]
}
},
{
"title": "Cost Telemetry Contract Status",
"type": "table",
"gridPos": {"h": 6, "w": 18, "x": 0, "y": 20},
"targets": [{
"expr": "ml_cost_metric_status",
"format": "table"
}],
"transformations": [{
"id": "organize",
"options": {
"indexByName": {},
"renameByName": {
"metric_name": "Metric",
"owner": "Owner",
"refresh_cadence": "Refresh",
"last_updated": "Last Updated",
"kill_trigger_status": "Kill Trigger Status"
}
}
}]
}
],
"refresh": "1m",
"time": {"from": "now-24h", "to": "now"}
}
}
```
T.2.1 Required Prometheus/OpenMetrics Exports
```text
# HELP ml_inference_cost_usd Total inference cost in USD
# TYPE ml_inference_cost_usd counter
ml_inference_cost_usd{model="fraud_v2",env="prod"} 1234.56
# HELP ml_value_delivered_usd Estimated value delivered by predictions
# TYPE ml_value_delivered_usd counter
ml_value_delivered_usd{model="fraud_v2",env="prod"} 5678.90
# HELP ml_human_overrides Count of human override events
# TYPE ml_human_overrides counter
ml_human_overrides{model="fraud_v2",reason="low_confidence"} 42
# HELP ml_feature_drift_score PSI or KL divergence from baseline
# TYPE ml_feature_drift_score gauge
ml_feature_drift_score{feature="transaction_amount"} 0.08
```
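For reference, the same metric families could be emitted from application code. Below is a minimal sketch using the Python prometheus_client library; the label values are placeholders, and note that the client exposes counters with a `_total` suffix (e.g. `ml_inference_cost_usd_total`).

```python
from prometheus_client import Counter, Gauge, start_http_server

# Metric names and labels follow the exposition sample above (counters gain a _total suffix).
inference_cost = Counter("ml_inference_cost_usd", "Total inference cost in USD", ["model", "env"])
value_delivered = Counter("ml_value_delivered_usd", "Estimated value delivered by predictions", ["model", "env"])
human_overrides = Counter("ml_human_overrides", "Count of human override events", ["model", "reason"])
drift_score = Gauge("ml_feature_drift_score", "PSI or KL divergence from baseline", ["feature"])

def record_prediction(cost_usd: float, value_usd: float) -> None:
    """Accumulate cost and value for one served prediction."""
    inference_cost.labels(model="fraud_v2", env="prod").inc(cost_usd)
    value_delivered.labels(model="fraud_v2", env="prod").inc(value_usd)

if __name__ == "__main__":
    start_http_server(9108)  # scrape target exposed on :9108/metrics
    drift_score.labels(feature="transaction_amount").set(0.08)
    human_overrides.labels(model="fraud_v2", reason="low_confidence").inc()
    record_prediction(cost_usd=0.0004, value_usd=0.0018)
```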
T.3 Infrastructure as Code Snippets
Terraform modules for governed AI infrastructure. These enforce security and observability by default.
```hcl
# ml-serving-infrastructure/main.tf
# Governed ML model serving with mandatory observability and rollback
terraform {
required_providers {
aws = { source = "hashicorp/aws", version = "~> 5.0" }
}
}
variable "model_name" {
type = string
description = "Name of the ML model (used for resource naming)"
}
variable "model_version" {
type = string
description = "Semantic version of the model"
}
variable "kill_threshold_cost_ratio" {
type = number
default = 1.0
description = "Cost/value ratio that triggers kill alert"
}
variable "rollback_model_version" {
type = string
description = "Previous stable version for automatic rollback"
}
# SageMaker Endpoint with mandatory monitoring
resource "aws_sagemaker_endpoint" "ml_endpoint" {
name = "${var.model_name}-${var.model_version}"
endpoint_config_name = aws_sagemaker_endpoint_configuration.ml_config.name
deployment_config {
blue_green_update_policy {
traffic_routing_configuration {
type = "CANARY"
canary_size {
type = "CAPACITY_PERCENT"
value = 10
}
wait_interval_in_seconds = 600
}
termination_wait_in_seconds = 300
maximum_execution_timeout_in_seconds = 3600
}
auto_rollback_configuration {
alarms = [
aws_cloudwatch_metric_alarm.model_error_rate.alarm_name,
aws_cloudwatch_metric_alarm.latency_breach.alarm_name,
aws_cloudwatch_metric_alarm.cost_ratio_breach.alarm_name
]
}
}
tags = {
ManagedBy = "terraform"
Model = var.model_name
Version = var.model_version
RollbackVersion = var.rollback_model_version
CostCenter = "ml-platform"
Governance = "ai-playbook-v7"
}
}
# Mandatory CloudWatch Alarms (cannot deploy without these)
resource "aws_cloudwatch_metric_alarm" "model_error_rate" {
alarm_name = "${var.model_name}-error-rate-breach"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "ModelError"
namespace = "AWS/SageMaker"
period = 300
statistic = "Average"
threshold = 0.05
alarm_description = "Model error rate exceeds 5% - triggers rollback"
dimensions = {
EndpointName = aws_sagemaker_endpoint.ml_endpoint.name
VariantName = "primary"
}
alarm_actions = [
aws_sns_topic.ml_alerts.arn,
# Auto-rollback is handled by deployment_config
]
}
resource "aws_cloudwatch_metric_alarm" "latency_breach" {
alarm_name = "${var.model_name}-latency-breach"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "ModelLatency"
namespace = "AWS/SageMaker"
period = 300
extended_statistic = "p99"
threshold = 2000 # 2 seconds
alarm_description = "P99 latency exceeds SLA - triggers rollback"
dimensions = {
EndpointName = aws_sagemaker_endpoint.ml_endpoint.name
}
alarm_actions = [aws_sns_topic.ml_alerts.arn]
}
resource "aws_cloudwatch_metric_alarm" "cost_ratio_breach" {
alarm_name = "${var.model_name}-cost-ratio-breach"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 24 # 24 hours of sustained breach
metric_name = "CostValueRatio"
namespace = "Custom/MLGovernance"
period = 3600
statistic = "Average"
threshold = var.kill_threshold_cost_ratio
alarm_description = "Cost/value ratio exceeds kill threshold"
alarm_actions = [
aws_sns_topic.ml_alerts.arn,
aws_sns_topic.executive_escalation.arn
]
}
# Governance enforcement: block deployment without audit trail
resource "aws_sagemaker_model" "ml_model" {
name = "${var.model_name}-${var.model_version}"
execution_role_arn = aws_iam_role.sagemaker_execution.arn
primary_container {
image = var.model_image_uri
model_data_url = var.model_artifact_s3_uri
environment = {
MODEL_VERSION = var.model_version
GOVERNANCE_PLAYBOOK_REF = "ai-playbook-v7"
DEPLOYMENT_TIMESTAMP = timestamp()
ROLLBACK_VERSION = var.rollback_model_version
}
}
tags = {
ApprovalTicket = var.approval_ticket_id # Required - enforced by policy
RiskAssessment = var.risk_assessment_id # Required - enforced by policy
ModelCard = var.model_card_url # Required - enforced by policy
}
}
# Output for audit trail
output "deployment_manifest" {
value = {
endpoint_name = aws_sagemaker_endpoint.ml_endpoint.name
model_version = var.model_version
rollback_version = var.rollback_model_version
kill_threshold = var.kill_threshold_cost_ratio
deployed_at = timestamp()
alarms_configured = [
aws_cloudwatch_metric_alarm.model_error_rate.alarm_name,
aws_cloudwatch_metric_alarm.latency_breach.alarm_name,
aws_cloudwatch_metric_alarm.cost_ratio_breach.alarm_name
]
}
description = "Deployment manifest for audit trail"
}
```
⚙ Customization Required
- Replace AWS SageMaker with your inference platform (GCP Vertex AI, Azure ML, self-hosted)
- Adjust thresholds based on your SLAs and risk tolerance
- Add VPC configuration for network isolation requirements
- Integrate with your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins)
- Add KMS encryption for regulated data (HIPAA, PCI-DSS)
T.4 Phase Exit Checklist Template
Standardized gate review checklist. No phase exit without explicit sign-off on all items.
Executive-Grade Observability
Operational AI fails quietly when only engineers can see the system. This layer makes trust, economics, and governance legible to executives so decisions are made on reality rather than narrative.
Trust Dashboard
- Drift / decay indicators
- Incident frequency and severity
- Override and escalation rates
- Operator confidence (measured, not assumed)
Economics Dashboard
- Cost per Unit of AI Work (UAW)
- Variance vs. forecast and budget guardrails
- Marginal cost per new capability
- ROI trendline (with confidence bounds)
Governance Dashboard
- Open risks with named owners
- Contract breaches and remediation status
- Model / prompt / policy version traceability
- Stop-authority exercises completed
Operator UX Principles
- Explain the “why” before the chart
- Surface the next action, not just metrics
- Design for incident time, not demo time
- Make rollback and safe-degradation obvious
Why Systems Fail
Technical correctness is necessary but not sufficient. These failure patterns survive model validation and destroy production systems.
1. Misaligned Incentives Override Accuracy: users or operators have incentives that conflict with model objectives. The system produces correct outputs that get ignored or gamed.
2. Automation Shifts Users from Skepticism to Compliance: over time, users stop questioning model outputs. When the model fails, no human catches it.
3. Unowned Outputs Create Silent Failure: no one is accountable for validating model decisions. Errors compound without detection.
4. Weak Rollback Paths Convert Errors into Crises: systems that can't be quickly reversed turn fixable problems into reputational events.
5. Domain Expertise Erodes: humans who could catch model errors lose their edge because they stop practicing judgment.
Key Insight
These failures are organizational and procedural, not technical. They cannot be fixed with better models—only with better governance.
The Human Failure Surface
Most production AI failures are human failures first: incentives, authority, skill asymmetry, and narrative decay. This section makes those failure modes explicit so they can be designed out.
Failure modes
- Incentive drift: KPIs reward usage, not outcomes.
- Authority ambiguity: no named stop authority.
- Skill asymmetry: operators cannot diagnose failure.
- Narrative decay: original intent is forgotten.
- Vendor gravity: defaults become architecture.
Countermeasures
- Phase Exit Contracts and named owners
- Override latency targets and rollback rehearsal
- Executive-grade observability (Trust/Econ/Gov)
- System Memory File with quarterly review
- Vendor constraints explicitly documented
System Continuity & Human Governance
Minimal addendum to prevent long-horizon failure: memory loss, meaning drift, and power resistance.
Design intent: This addendum is deliberately small. It converts the remaining human and temporal risks into enforceable artifacts and gates—without expanding the core 12-month sequence.
0.1 Decision Memory Ledger (DML)
Documentation records outcomes. The Decision Memory Ledger preserves intent and assumptions so the system survives staff turnover and time.
Artifacts
- DML-1 Decision Memory Ledger (schema: Decision ID, Summary, Context, Alternatives, Rejections, Assumptions, Assumption Expiry, Owner); a minimal schema sketch follows this block
- DML-2 Ledger Access Policy (read/write permissions, audit logging)
- DML-3 Ledger Query Requirement (mandatory consultation before scope, schema, objective, or boundary changes)
Gate: Hard Gate. Required before Phase 4 / Phase 8 / Phase 11 changes that touch model objectives, retrieval scope, labeling, or decision boundaries.
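A minimal sketch of the DML-1 record as structured data, mirroring the schema fields listed above; the Python representation itself is illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DecisionRecord:
    """One entry in the Decision Memory Ledger (DML-1 schema)."""
    decision_id: str
    summary: str
    context: str
    alternatives: list[str]
    rejections: list[str]               # options considered and rejected, with reasons
    assumptions: list[str]
    assumption_expiry: Optional[date]   # when the assumptions should be re-validated
    owner: str

def expired_assumptions(ledger: list[DecisionRecord], today: date) -> list[str]:
    """IDs of decisions whose assumptions are past expiry and due for review (supports DML-3)."""
    return [r.decision_id for r in ledger
            if r.assumption_expiry is not None and r.assumption_expiry < today]
```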
0.2 Power Impact Assessment (PIA) Senior-Only HJG
Working systems redistribute authority. Resistance is usually rational: loss of discretion, shifted accountability, and threatened expertise. Make it visible early.
Artifacts
- PIA-1 Power Impact Assessment (who loses discretion, who gains authority, who becomes accountable, who can silently resist)
- PIA-2 Incentive Misalignment Register (misaligned KPIs, conflicting owners, perverse incentives)
- PIA-3 Adoption Risk Mitigation Plan (training, incentives, workflow design, escalation paths)
Gate: HJG. Reviewed by Product + Exec Sponsor before Phase 3.
0.3 Declared System Role & Meaning Boundary
Humans use systems as stories. Declare what the system is allowed to mean, so “advisory” does not silently become “oracle.”
Artifacts
- DSR-1 Declared System Role Statement (Advisory / Assistive / Gatekeeping)
- DSR-2 Prohibited Uses & Boundary Conditions (domains, decisions, and contexts where use is disallowed)
- DSR-3 Human Confirmation Points (required approvals, override rules, escalation)
Gate: Hard Gate. Required before Phase 3; UI language, training, and audit checks must align with DSR-1.
0.4 Long-Horizon Risk Register (LHR) Senior-Only
Some harm compounds invisibly and will not trigger short-term metrics or kill switches. Track it, review it annually, and intervene with humans, not models.
Artifacts
- LHR-1 Long-Horizon Risk Register (skill atrophy, decision monoculture, over-dependence, vendor cognitive lock-in)
- LHR-2 Annual Review Record (evidence, outcomes, mitigations)
- LHR-3 Mitigation Action Plan (training, rotation, policy, workflow redesign)
Rule: Long-horizon risks do not trigger system termination. They trigger human intervention and governance action.
0.5 Planned Obsolescence & Doctrine Review Senior-Only
Every system needs a retirement plan, and every playbook needs a way to be revised without becoming dogma.
Artifacts
- PO-1 Planned Obsolescence Plan (expected lifespan, replacement conditions, knowledge transfer, archive & shutdown)
- DG-1 Doctrine Review Record (annual review, at least one external reviewer, logged exceptions & outcomes)
- DG-2 Exception Log (what rule was broken, why, outcome, preventive fix)
Gate: Hard Gate. PO-1 required before Phase 8 (Launch/Production); DG-1 reviewed annually.
Glossary
Definitions for terms used throughout this playbook. Consistent terminology prevents miscommunication. If a term is used differently in your organization, document the mapping.
A
Agentic AI
AI systems that can autonomously take actions, use tools, or make multi-step decisions without human intervention at each step. Includes tool-using LLMs, autonomous agents, and multi-agent systems. See: Appendix AG
AI Act (EU)
European Union regulation establishing a legal framework for AI systems based on risk classification. High-risk AI systems must meet requirements for transparency, human oversight, accuracy, and robustness. Effective 2024-2026. Reference: Regulatory Matrix
Artifact
A documented deliverable produced during a phase of the playbook. Examples: Model Card, Risk Assessment Matrix, Runbook. Artifacts have named owners and version control.
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
A metric measuring a classification model's ability to distinguish between classes across all decision thresholds. A value of 0.5 indicates random performance; 1.0 indicates perfect discrimination. Useful for comparing models but does not reflect the real-world operating point.
B
Baseline
Reference performance metrics established before deployment or after initial production stabilization. Used to detect degradation and drift. Must be documented with measurement methodology.
Bias (Algorithmic)
Systematic errors in model outputs that disadvantage particular groups. Can arise from training data (historical bias), feature selection (proxy discrimination), or evaluation methodology. See: Phase 7.2, ISO/IEC TR 24027
Bus Factor
The minimum number of team members who would need to leave before a project becomes inoperable due to knowledge loss. A bus factor of 1 is a critical risk. This playbook requires bus factor ≥ 2 for production systems.
C
Canary Deployment
Deployment strategy where new model version receives a small percentage of traffic (typically 5-10%) while being monitored. Traffic increases gradually if metrics remain healthy. Enables early detection of issues without full exposure.
Concept Drift
Change in the relationship between input features and target variable over time. Unlike data drift, concept drift means the underlying patterns have changed, not just the input distribution. Example: Customer behavior changing during pandemic. See: Phase 11.4
Cost Telemetry Contract (CT)
A mandatory agreement specifying which economic metrics must be tracked, who owns each metric, refresh cadence, and kill thresholds. Systems cannot ship without a complete CT. See: Cost Telemetry section
CRISP-DM
Cross-Industry Standard Process for Data Mining. A methodology defining six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. This playbook extends CRISP-DM with governance, operational, and economic controls.
D
Data Drift
Change in the statistical distribution of input features over time compared to training data. Measured using metrics like PSI (Population Stability Index) or KL Divergence. Does not necessarily indicate performance degradation but warrants investigation.
Datasheet (for Datasets)
Standardized documentation for datasets describing motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Based on Gebru et al. (2021). See: Model Cards & Datasheets section
Disparate Impact
A measure of fairness comparing the selection rate for a protected group to the selection rate for a reference group. Under the "four-fifths rule", disparate impact is indicated when the ratio falls below 0.8. Used in regulatory compliance.
E
Embedding
A dense vector representation of data (text, images, etc.) learned by a neural network. Embeddings capture semantic relationships and are used in similarity search, RAG systems, and transfer learning.
Error Budget
The acceptable amount of error (downtime, incorrect predictions, etc.) over a defined period, derived from SLO targets. When error budget is exhausted, new deployments should pause until reliability improves.
Explainability
The degree to which a model's predictions can be understood by humans. Includes feature importance, decision paths, and counterfactual explanations. Required for high-risk AI systems under EU AI Act. See: ISO/IEC 24028
F
Feature Store
A centralized repository for storing, managing, and serving ML features. Ensures consistency between training and inference, enables feature reuse, and provides lineage tracking.
Fine-tuning
Adapting a pre-trained model to a specific task or domain by training on task-specific data. Common with LLMs and transfer learning. Introduces risks around training data quality and catastrophic forgetting.
Foundation Model
Large models trained on broad data that can be adapted to many downstream tasks. Examples: GPT, BERT, CLIP. Introduce supply chain risks as organizations depend on external model providers.
G
GDPR (General Data Protection Regulation)
EU regulation on data protection and privacy. Relevant to AI: Article 22 (automated decision-making), Article 17 (right to erasure), and requirements for lawful basis and transparency. See: Regulatory Matrix
GMLP (Good Machine Learning Practice)
FDA guidance for developing medical device software using AI/ML. Emphasizes multi-disciplinary expertise, good software engineering practices, representative data, independence of training and test sets, and reference standards.
Governance OS
The operating system of controls, processes, and accountability structures that ensure AI systems remain safe, compliant, and valuable over time. This playbook is a Governance OS. See: Governance OS section
Ground Truth
The correct label or outcome used to evaluate model predictions. Quality of ground truth directly bounds model quality. Sources include human annotation, authoritative records, and observed outcomes.
H
Hallucination
When a generative AI model produces confident but factually incorrect or fabricated information. Particularly dangerous in high-stakes domains. Cannot be eliminated, only mitigated through verification and grounding. See: LLM Risks L5
HIPAA (Health Insurance Portability and Accountability Act)
US law establishing requirements for protecting health information (PHI). AI systems processing PHI must comply with access controls, audit logging, and data retention requirements. See: Regulatory Matrix
Human Judgment Gate (HJG)
A step in this playbook that requires explicit human decision-making and cannot be automated. Indicated by the HJG badge. Examples: kill criteria definition, risk acceptance, bias evaluation approval.
Hypercare
A period of intensified monitoring and support immediately following production deployment. Typically 2-4 weeks. Characterized by lower thresholds for alerts, faster response times, and elevated staffing. See: Phase 9
I
Inference
The process of applying a trained model to new data to produce predictions. Distinguished from training. Inference cost, latency, and reliability are key production concerns.
Irreversibility Flag
A marker in this playbook indicating decisions that are costly or impossible to unwind once made. Requires extra scrutiny and explicit approval. Examples: data schema choices, model architecture selection.
ISO/IEC 42001
International standard for AI Management Systems. Specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system. Auditable certification available.
K
Kill Criteria
Pre-defined, measurable conditions under which a project or deployed system should be terminated. Must be established before significant investment. Requires named authority to execute. See: Economic Viability $3, Kill Criteria section
KL Divergence (Kullback-Leibler Divergence)
A measure of how one probability distribution differs from another. Used to detect data drift by comparing current input distribution to training distribution. Not symmetric: KL(P||Q) ≠ KL(Q||P).
L
Latency
Time between receiving an inference request and returning a prediction. Measured at various percentiles (P50, P95, P99). Critical for user experience and often traded off against accuracy or cost.
LLM (Large Language Model)
Neural network models trained on large text corpora to generate human-like text. Examples: GPT-4, Claude, Llama. Introduce unique risks including hallucination, prompt injection, and context window limitations. See: LLM-Specific Risks appendix
M
MLOps
Practices for deploying and maintaining ML models in production reliably and efficiently. Encompasses CI/CD for ML, monitoring, versioning, and automation. This playbook provides governance layer on top of MLOps.
Model Card
Standardized documentation for a trained model describing intended use, performance characteristics, limitations, and ethical considerations. Based on Mitchell et al. (2019). Required artifact before production. See: Model Cards section
Model Registry
A centralized repository for storing, versioning, and managing trained models. Enables model lineage, rollback, and audit. Essential infrastructure for governed ML.
N
NIST AI RMF (Risk Management Framework)
Framework from US National Institute of Standards and Technology for managing AI risks. Organized around Map, Measure, Manage, Govern functions. Voluntary but increasingly referenced in procurement and regulation. See: References
O
Ontology
A formal representation of concepts in a domain and the relationships between them. In this playbook, establishing ontology is Phase 1 — ensuring shared vocabulary before building. See: Phase 1
Override Rate
Percentage of model predictions that are overruled by human operators. High override rates may indicate low trust, poor model fit, or changing conditions. Tracked as executive-level signal. See: Executive Control Surface
P
Phase Exit Contract
A checklist of conditions that must be satisfied before proceeding to the next phase. Includes Truth, Economic, Risk, and Ownership contracts. Prevents premature advancement. See: Each phase section
Precision
The proportion of positive predictions that are correct: TP / (TP + FP). High precision means few false positives. Important when false positive cost is high.
Prompt Injection
An attack where adversarial text in input causes an LLM to ignore instructions or behave unexpectedly. Can occur directly (user input) or indirectly (retrieved documents). Requires input sanitization and privilege separation. See: LLM Risks L3
PSI (Population Stability Index)
A metric for measuring distribution shift between two datasets. PSI < 0.1 indicates no significant change; 0.1-0.25 indicates moderate shift; > 0.25 indicates major shift requiring investigation.
R
RACI Matrix
A responsibility assignment chart defining who is Responsible (does work), Accountable (final authority), Consulted (input required), and Informed (kept updated) for each activity. See: Template T.1
RAG (Retrieval-Augmented Generation)
An architecture where an LLM's responses are grounded by retrieving relevant documents from an external knowledge base. Reduces hallucination but introduces retrieval quality as a failure mode. See: Agentic AI section
Recall
The proportion of actual positives that are correctly identified: TP / (TP + FN). High recall means few false negatives. Important when missing positive cases is costly (e.g., fraud detection, medical diagnosis).
Red Team
A group that tests systems by simulating adversarial attacks. For AI, includes prompt injection, jailbreaking, bias elicitation, and edge case discovery. Required in Phase 7. See: Phase 7.3
Rollback
Reverting to a previous known-good version of a model or system. Must be testable, fast, and available without requiring the engineer who deployed the current version. A key incident response capability.
Runbook
A documented set of procedures for operating a system, including common tasks, troubleshooting steps, and incident response. Must be usable by someone who did not write it. See: Phase 10.3
S
SaMD (Software as a Medical Device)
Software intended to be used for medical purposes without being part of a hardware medical device. AI/ML in healthcare often qualifies. Subject to FDA regulation in US, MDR in EU. See: Regulatory Matrix
Shadow Deployment
Running a new model version in parallel with production, receiving real traffic but not affecting user-facing decisions. Enables comparison without risk. Precedes canary deployment. See: Phase 8
SLA (Service Level Agreement)
A commitment defining the expected level of service (uptime, latency, accuracy). Contractual between provider and consumer. Breaches may have financial or contractual consequences.
SLO (Service Level Objective)
An internal target for service quality, typically more stringent than SLA. Provides buffer before SLA breach. Used to guide engineering priorities and error budget allocation.
Stop Authority
A named individual with the power and obligation to halt a project or system when kill criteria are met. Must be able to act without political permission. See: Kill Criteria section
T
Technical Debt
The implied cost of rework caused by choosing quick solutions over better approaches. In ML, includes hardcoded thresholds, undocumented preprocessing, and missing tests. Accumulates interest. See: Phase 12.3
Telemetry
The automated collection and transmission of measurements from a system. For AI, includes inference metrics, resource usage, and business outcomes. Foundation for monitoring and governance.
Threshold
A decision boundary that converts model scores into actions (approve/reject/review). Selection involves trade-offs between precision and recall. Must be documented with rationale.
U
UAW (Unit of AI Work)
A standardized measure of AI system output for cost accounting. Defined specifically for each use case. Examples: one prediction, one document processed, one conversation turn. Basis for economic viability calculations. See: Economic Viability section
V
Validation
Testing a model's performance on held-out data to estimate real-world performance. Distinguished from verification (does the system meet specifications) and testing (does the code work). See: Phase 7
Version Control
Systematic tracking of changes to code, data, models, and configuration. Essential for reproducibility, rollback, and audit. All artifacts in this playbook must be version-controlled.
Standards Quick Reference
| Standard | Full Name | Domain | Key Focus |
| --- | --- | --- | --- |
| ISO/IEC 42001 | AI Management Systems | All industries | AI governance framework, certifiable |
| ISO/IEC 23894 | AI Risk Management | All industries | Risk identification and treatment |
| NIST AI RMF | AI Risk Management Framework | All industries | Map, Measure, Manage, Govern |
| EU AI Act | Artificial Intelligence Act | All industries (EU) | Risk-based regulation, prohibited uses |
| FDA GMLP | Good Machine Learning Practice | Healthcare | Medical device AI development |
| Basel AI Guidance | Model Risk Management for AI | Financial services | Banking AI risk management |
| IEEE 2857 | Privacy Engineering for AI/ML | All industries | Privacy-preserving AI design |
| SOC 2 Type II | Service Organization Controls | Technology services | Security, availability, confidentiality |