v7.5

Forward Deployed Engineering
AI Systems — Production Playbook

Forward Deployed Engineering is a delivery methodology in which engineers are embedded within the operational environment they serve, owning system outcomes end-to-end and continuously adapting design decisions based on real-world feedback, economic constraints, and adoption signals.

01

Ontology Month 1


Define the conceptual foundation. What are the entities, relationships, and boundaries that structure the problem domain?

Before writing any code or training any model, the organization must agree on what words mean. Ontology is the disciplined practice of naming things, defining their relationships, and establishing the boundaries that separate one concept from another. This phase forces alignment on vocabulary that will later become schemas, labels, and embeddings. Mistakes made here propagate through the entire system—they become hardcoded assumptions that are expensive to unwind. The deliverables are not documents for their own sake; they are contracts that prevent the team from building the wrong thing confidently.
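
For illustration, a glossary entry can be captured as structured data instead of prose so that later schemas and labels can be checked against the agreed vocabulary. The sketch below is a minimal example; the field names, the invoicing concepts, and the resolve helper are hypothetical placeholders, not required deliverables of this playbook.

Python sketch: concept glossary as a machine-checkable contract (illustrative)
# Illustrative only: a concept glossary entry captured as structured data,
# so later schemas and labels can be validated against agreed definitions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str                       # canonical term agreed with domain experts
    definition: str                 # one-sentence operational definition
    synonyms: tuple[str, ...] = ()  # raw terms that map to this concept
    is_a: str | None = None         # parent concept (taxonomic relationship)
    part_of: str | None = None      # whole this concept belongs to (part-whole)

# Hypothetical example entries for an invoicing domain.
GLOSSARY = {
    "invoice": Concept("invoice", "A request for payment issued to a customer.",
                       synonyms=("bill",), is_a="financial_document"),
    "line_item": Concept("line_item", "A single billable entry on an invoice.",
                         part_of="invoice"),
}

def resolve(term: str) -> Concept | None:
    """Map a raw term (or synonym) to its canonical concept, if one was agreed."""
    term = term.lower()
    for concept in GLOSSARY.values():
        if term == concept.name or term in concept.synonyms:
            return concept
    return None  # unknown term: a gap to raise with domain experts, not to guess

assert resolve("bill") is GLOSSARY["invoice"]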

1.1

Domain Expert Identification & Access

Identify who holds the knowledge, how deep it goes, and how to extract it systematically.

1.1.1

Expert Stakeholder Map

Knowledge holders with depth assessment and availability matrix.

1.1.2

Interview Schedule & Protocol

Timeline with concept extraction methodologies.

1.1.3

Knowledge Source Priority Matrix

Ranked experts, customers, partners with access strategy.

1.2

Concept Harvesting Through Multiple Channels

Extract domain concepts from documents, interviews, and observations.

1.2.1

Terminology Extraction Report

Domain concepts with frequency analysis from multiple sources.

1.2.2

Concept Laddering Results

Hierarchical relationships from structured interviews.

1.2.3

Cross-Source Consistency Analysis

Validation matrix comparing concepts across channels.

1.3

Relationship Mapping & Hierarchy Construction

Build structural relationships—taxonomies, part-whole, and associations.

1.3.1

Taxonomic Hierarchy Model

Is-a relationships with inheritance rules and classification logic.

1.3.2

Part-Whole Relationship Map

Component dependencies and composition rules.

1.3.3

Associative Relationship Network

Related-to connections with strength weights.

1.4

Formal Representation & Documentation 19,20,22,23

Capture the ontology in formats that can be reviewed, versioned, and enforced.

1.4.1

Concept Glossary & Definition Framework

Definitions, synonyms, examples, and measurement criteria.

1.4.2

Relationship Diagram Library

Visual representations of concept connections.

1.4.3

Decision Rationale Documentation

Reasoning for contested concepts with evidence.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
02

Problem Space Month 2


Define boundaries, validate assumptions, and stress-test the problem definition before building.

The problem space is where ambiguity lives. This phase forces the team to draw explicit lines around what the system will and will not do—before those boundaries get encoded into architecture and data. Edge cases are where systems break, and edge cases live at boundaries. By stress-testing the problem definition from multiple angles, the team discovers disagreements that would otherwise surface during production incidents. The goal is not perfection; it is explicit acknowledgment of what is known, what is assumed, and what remains uncertain.

2.1

Boundary Definition & Scope Constraints HJG

Define what's in and what's out. Edge cases are where systems break.

⚠ Irreversibility Flag

Boundary mistakes propagate into schemas, labeling, and embeddings. Once encoded, they are expensive to unwind.

2.1.1

Domain Scope Definition

Core vs. adjacent domains with inclusion/exclusion criteria.

2.1.2

Edge Case Classification

Boundary-spanning scenarios with resolution protocols.

2.1.3

Scope Validation Test Suite

Scenarios validating boundary definitions.

2.2

Multi-Perspective Validation

Different stakeholders see the problem differently. Reconcile before building.

2.2.1

Cross-Functional Perspective Matrix

Sales, Engineering, Support, and Customer views compared.

2.2.2

Conflict Resolution Log

Documented disagreements with consensus outcomes.

2.2.3

Temporal Evolution Analysis

Historical changes with future predictions.

2.3

Stress Testing & Edge Case Exploration 1,6,19,20,21,23

Push the problem definition to its limits before downstream systems depend on it.

2.3.1

Boundary Stress Test Results

Performance at edge cases and boundary conditions.

2.3.2

Scale Testing Report

Validation at 10x scale with implications.

2.3.3

Scenario-Based Validation Suite

Real-world scenarios tested to identify gaps.

2.4

Governance & Living Documentation Setup

Ontologies evolve. Establish ownership and change management.

2.4.1

Ontology Governance Charter

Ownership, triggers, and maintenance responsibilities.

2.4.2

Change Management Protocol

Update process with impact assessment.

2.4.3

Audit & Validation Schedule

Review cycles ensuring alignment with reality.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
03

Discovery Month 3


Gather requirements from multiple perspectives. Misalignment here guarantees failure.

Discovery is the translation layer between business intent and technical specification. Different stakeholders see the same problem differently—sales sees revenue, engineering sees architecture, compliance sees risk. This phase reconciles those views before divergent assumptions become divergent implementations. The critical output is not a requirements document; it is shared understanding. Get the ML problem statement wrong, and the model will solve the wrong problem brilliantly. Discovery also surfaces data realities: what exists, what quality it has, and what gaps must be filled before training can begin.

3.1

Interview Customer Success, PM, and Domain Experts

Gather requirements from multiple perspectives before converging.

3.1.1

Stakeholder Interview Notes

Requirements, pain points, and success criteria.

3.1.2

Domain Expert Knowledge Base

Technical requirements and domain constraints.

3.1.3

Customer Success Insights

User journey mapping and solution gaps.

3.2

Translate Business Needs to ML Problem Statements HJG 1,19,21

The critical translation layer. Get this wrong, and the model solves the wrong problem.

⚠ Irreversibility Flag

Problem framing errors compound through data collection, labeling, and architecture. Reframing late often requires starting over.
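
One way to make this translation reviewable is to force the ML problem statement into a structured artifact rather than a paragraph. The sketch below is hypothetical; the MLProblemStatement fields and the claims-routing example are illustrative placeholders, not a prescribed schema.

Python sketch: ML problem statement as a reviewable artifact (illustrative)
# Illustrative only: an ML problem statement captured as a reviewable artifact
# rather than prose. Field names and example values are hypothetical.
from dataclasses import dataclass

@dataclass
class MLProblemStatement:
    business_objective: str      # the outcome the business actually wants
    input_spec: str              # what the model receives at inference time
    output_spec: str             # what the model must produce
    success_metric: str          # measurable criterion tied to the objective
    acceptance_threshold: float  # value the metric must reach to ship
    out_of_scope: list[str]      # explicitly excluded behaviors

problem = MLProblemStatement(
    business_objective="Reduce manual review time for incoming claims",
    input_spec="Claim text plus structured claim metadata",
    output_spec="Routing label: auto_approve | needs_review | reject",
    success_metric="Precision of auto_approve at a fixed review budget",
    acceptance_threshold=0.98,
    out_of_scope=["fraud adjudication", "payment amount estimation"],
)

# A gate review can then check the artifact, not the team's memory.
assert 0.0 < problem.acceptance_threshold <= 1.0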

3.2.1

ML Problem Definition Document

Input/output specifications and success criteria.

3.2.2

Business-to-Technical Translation Matrix

Mapping business objectives to ML requirements.

3.2.3

Solution Approach Options

Comparative analysis of ML approaches.

3.3

Assess Data Availability & Quality

No data, no model. Understand what you have before promising what you'll build.

3.3.1

Data Inventory Report

Datasets with schema, volume, and quality assessments.

3.3.2

Data Quality Analysis

Missing values, outliers, distributions, lineage.

3.3.3

Data Acquisition Plan

Strategy for additional sources and labeling.

3.4

Identify Regulatory or Ethical Constraints HJG 1,2,16,17,19,22,23

Legal and ethical constraints are non-negotiable. Identify early or pay later.

3.4.1

Regulatory Compliance Checklist

Applicable regulations (GDPR, HIPAA, etc.).

3.4.2

Ethical AI Framework

Bias detection, fairness metrics, guidelines.

3.4.3

Risk Assessment Matrix

Risks with mitigation strategies and owners.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
04

Alignment & System Design Month 4


Lock in stakeholder alignment and design the end-to-end system architecture. ROI Gate: Design must project positive ROI at expected scale.

Alignment is where organizational politics meets engineering reality. This phase converts the shared understanding from Discovery into explicit commitments: who owns what, what success looks like, and when to stop. The system design maps the complete data flow from ingestion through inference—not just the model, but the infrastructure that surrounds it. Architectural decisions made here are expensive to reverse; this is where serving patterns, cost profiles, and scaling limits get locked in. The ROI gate ensures the team isn't building something that cannot pay for itself.


$ ROI Gate — Phase 4

Before proceeding to Integration, validate projected ROI based on architecture decisions. If unit economics are negative at projected scale, return to Discovery or terminate.
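
A minimal sketch of the projection this gate asks for, assuming a simple per-inference cost and value model; the numbers and the projected_roi helper are hypothetical placeholders.

Python sketch: projected unit economics for the Phase 4 ROI gate (illustrative)
# Illustrative only: projecting unit economics at expected scale to support the
# Phase 4 ROI gate. All numbers are hypothetical placeholders.

def projected_roi(cost_per_inference: float,
                  value_per_correct_output: float,
                  expected_accuracy: float,
                  monthly_volume: int,
                  fixed_monthly_cost: float) -> float:
    """Return the projected monthly value/cost ratio; below 1.0 is value-negative."""
    monthly_value = monthly_volume * expected_accuracy * value_per_correct_output
    monthly_cost = monthly_volume * cost_per_inference + fixed_monthly_cost
    return monthly_value / monthly_cost

ratio = projected_roi(cost_per_inference=0.004,
                      value_per_correct_output=0.05,
                      expected_accuracy=0.92,
                      monthly_volume=2_000_000,
                      fixed_monthly_cost=25_000.0)

if ratio < 1.0:
    print(f"ROI gate fails at projected scale (ratio={ratio:.2f}): return to Discovery or terminate")
else:
    print(f"ROI gate passes at projected scale (ratio={ratio:.2f})")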

4.1

Document Stakeholder Priorities and Success Criteria

Explicit alignment prevents later conflicts about what "success" means.

4.1.1

Stakeholder Priority Matrix

Weighted priorities with conflict resolution.

4.1.2

Success Criteria Definition

Measurable outcomes and acceptance thresholds.

4.1.3

RACI Matrix

Responsibility assignment for decisions.

4.2

Design End-to-End ML Pipeline

ETL → Training → Serving. Map the complete data flow.

4.2.1

Pipeline Architecture Diagram

End-to-end flow with component specifications.

4.2.2

ETL Process Documentation

Extraction, transformation, loading procedures.

4.2.3

Training Pipeline Specification

Workflow, hyperparameter tuning, validation.

4.3

Choose Serving Pattern HJG

Batch, streaming, or online inference—each has different cost and latency implications.

4.3.1

Serving Pattern Analysis

Comparison with latency and cost trade-offs.

4.3.2

Inference Architecture Design

Detailed design with scalability considerations.

4.3.3

Performance Requirements Spec

SLA definitions, throughput, latency targets.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
05

Integration Month 5


Connect the ML system to existing infrastructure, APIs, and data sources.

Integration is where the ML system meets the enterprise. This phase establishes the connective tissue between the model and everything it depends on: cloud infrastructure, data pipelines, security boundaries, and compliance controls. Infrastructure as Code ensures reproducibility; schema versioning ensures maintainability. The decisions made here—cloud provider, compute strategy, data residency—carry long-term operational and financial implications. Security and compliance posture are not afterthoughts; they are foundational constraints that shape every subsequent choice.

5.1

Select Cloud Provider & Compute Strategy

GPU/TPU selection with performance and cost analysis.

5.1.1

Cloud Provider Comparison

Cost, performance, and feature analysis.

5.1.2

Compute Strategy Document

GPU/TPU selection with benchmarks.

5.1.3

Multi-cloud Strategy Plan

Vendor lock-in mitigation and DR.

5.2

Define IaC Modules

Terraform, Helm—infrastructure as code for reproducibility.

5.2.1

Terraform Module Library

Reusable infrastructure modules with versioning.

5.2.2

Helm Chart Templates

Kubernetes deployment templates.

5.2.3

Infrastructure Deployment Guide

Step-by-step procedures and rollback.

5.3

Security & Compliance Posture

VPC, IAM, data residency—non-negotiable foundations.

5.3.1

Security Architecture Document

Network topology, access controls, encryption.

5.3.2

IAM Policy Framework

Role-based access with least privilege.

5.3.3

Data Residency Compliance Plan

Geographic storage and transfer protocols.

5.4

Define Schemas & Versioning Strategy

Data contracts and model versioning for maintainability.
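
As a sketch of what data-contract enforcement can mean in practice, the check below treats additive schema changes as compatible and removed or retyped fields as breaking. The schema shapes and version numbers are hypothetical.

Python sketch: backward-compatibility check for a versioned data contract (illustrative)
# Illustrative only: a minimal backward-compatibility check for a versioned data
# contract. Schema shapes and version numbers are hypothetical.

SCHEMAS = {
    "1.0.0": {"transaction_id": "str", "amount": "float"},
    "1.1.0": {"transaction_id": "str", "amount": "float", "currency": "str"},
}

def is_backward_compatible(old: dict[str, str], new: dict[str, str]) -> bool:
    """New schema may add fields, but must not remove or retype existing ones."""
    return all(name in new and new[name] == dtype for name, dtype in old.items())

assert is_backward_compatible(SCHEMAS["1.0.0"], SCHEMAS["1.1.0"])      # additive change: OK
assert not is_backward_compatible(SCHEMAS["1.1.0"], SCHEMAS["1.0.0"])  # dropped field: breaking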

5.4.1

Data Schema Registry

Schema definitions with evolution rules.

5.4.2

Model Versioning Framework

Semantic versioning for models and APIs.

5.4.3

Backward Compatibility Matrix

Version mapping and migration procedures.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
06

Build Month 6


Construct the model, pipelines, and supporting infrastructure with reproducibility.

Build is where the model finally gets trained—but only after five phases of preparation. The emphasis is on reproducibility: deterministic environments, pinned dependencies, version-controlled artifacts. Start with a baseline model that is intentionally simple; prove value before adding complexity. Instrumentation is not optional—if you cannot measure latency, drift, and bias, you cannot manage them. For LLM-based systems, this phase includes mandatory controls for prompt injection and tool-call safety. The output is not just a model; it is a governed, observable, auditable capability.

6.1

Configure Reproducible ML Builds

Docker, requirements.txt—deterministic environments.

6.1.1

Containerized ML Environment

Docker images with pinned dependencies.

6.1.2

Dependency Management Strategy

Version locking and vulnerability scanning.

6.1.3

Build Reproducibility Guide

Consistent builds with checksums and validation.

6.2

Set Up Artifact Registry & Model Versioning

MLflow, DVC—track everything.

6.2.1

Artifact Registry Configuration

Metadata tracking and storage policies.

6.2.2

Model Registry Standards

Metadata schema and lifecycle management.

6.2.3

Version Control Integration

Git hooks for model versioning with code.

6.3

Build Baseline Model

Start simple. Prove value before adding complexity.
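
A minimal baseline sketch: a linear model compared against a majority-class dummy so that any later complexity must beat a known floor. The dataset and split are stand-ins, assuming scikit-learn is available.

Python sketch: baseline model vs. dummy benchmark (illustrative)
# Illustrative only: a deliberately simple baseline compared against a
# majority-class dummy before any complexity is added.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)              # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print("dummy F1:   ", round(f1_score(y_test, dummy.predict(X_test)), 3))
print("baseline F1:", round(f1_score(y_test, baseline.predict(X_test)), 3))
# Only if the simple baseline clearly beats the dummy is added complexity justified.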

6.3.1

Baseline Model Implementation

Simple model with performance benchmarks.

6.3.2

Fine-tuning Guide

Transfer learning and adaptation strategies.

6.3.3

Model Evaluation Report

Metrics, visualizations, error analysis.

6.4

Instrument Model for Telemetry

Latency, drift, bias—if you can't measure it, you can't manage it.
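
A minimal instrumentation sketch, assuming the prometheus_client library; the metric names, labels, and the toy predict function are hypothetical, not names this playbook mandates.

Python sketch: inference telemetry counters and latency histogram (illustrative)
# Illustrative only: minimal inference telemetry exposed for Prometheus scraping.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("ml_predictions_total", "Predictions served", ["model"])
LATENCY = Histogram("ml_inference_latency_seconds", "Inference latency in seconds", ["model"])

def predict(features, model_name="example_model"):
    start = time.perf_counter()
    result = sum(features) > 1.0                            # stand-in for the real model call
    LATENCY.labels(model=model_name).observe(time.perf_counter() - start)
    PREDICTIONS.labels(model=model_name).inc()
    return result

if __name__ == "__main__":
    start_http_server(8000)                                 # exposes /metrics for scraping
    print(predict([0.4, 0.9]))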

6.4.1

Telemetry Collection Framework

Performance, latency, resource metrics.

6.4.2

Drift Detection System

Statistical tests and alerting for drift.

6.4.3

Bias Monitoring Dashboard

Fairness metrics across groups.

LLM Control Checkpoint — Phase 6

If this system uses LLMs, the following controls must be implemented during Build. Not optional.

Required LLM Controls — Build Phase LLM

Risk | Mandatory Control | Owner
Prompt Injection | Input sanitization + allow-list patterns | Security Engineer
Tool-Call Drift | Tool schema version pinning + audit logging | Platform Engineer
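
As a sketch of what "input sanitization + allow-list patterns" can look like, the screening function below length-limits input, restricts it to an allowed character set, and rejects known injection phrasings. The patterns and limits are hypothetical examples and not a complete defense.

Python sketch: allow-list input screening before prompt assembly (illustrative)
# Illustrative only: allow-list style screening of user text before it reaches an
# LLM prompt. Patterns, character set, and limits are hypothetical examples.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"reveal (the )?system prompt",
    r"you are now",
]
ALLOWED_CHARS = re.compile(r"^[\w\s.,:;!?()'@%/-]+$")   # character allow-list

def screen_user_input(text: str, max_len: int = 2000) -> str:
    """Return the input if it passes screening; raise otherwise."""
    text = text.strip()[:max_len]
    if not text or not ALLOWED_CHARS.match(text):
        raise ValueError("input is empty or contains disallowed characters")
    lowered = text.lower()
    if any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS):
        raise ValueError("input matches a known injection pattern")
    return text

print(screen_user_input("Summarize the attached invoice dispute."))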

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
07

Validation Month 7


Rigorous testing across multiple dimensions—functional, performance, fairness, security.

Validation is where confidence is earned—or exposed as false. This phase subjects the model to rigorous testing across multiple dimensions: functional correctness, performance under load, fairness across populations, and security against adversaries. Golden datasets establish baselines; regression tests catch decay. Bias and fairness checks ensure the model does not encode harmful patterns from training data. Penetration testing and API fuzzing find the holes before attackers do. The output is an evidence pack that demonstrates—with data, not assertions—that the system is ready for production.

7.1

Unit Tests, Regression Tests, Golden Datasets

Comprehensive test coverage with baseline comparisons.
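
A golden-set regression test can be as small as the pytest sketch below; the inline GOLDEN cases stand in for a frozen dataset from the golden dataset repository, and the toy model and 0.95 floor are hypothetical.

Python sketch: golden-set regression test in pytest (illustrative)
# Illustrative only: golden-set regression checks. In practice the cases would be
# loaded from the frozen golden dataset repository, not defined inline.
import pytest

def load_model():
    """Stand-in for loading the versioned model under test."""
    return lambda text: "approve" if "paid in full" in text else "review"

GOLDEN = [
    {"input": "invoice paid in full", "expected": "approve"},
    {"input": "amount disputed by customer", "expected": "review"},
]

@pytest.mark.parametrize("case", GOLDEN)
def test_golden_case(case):
    assert load_model()(case["input"]) == case["expected"]

def test_accuracy_floor():
    model = load_model()
    hits = sum(model(c["input"]) == c["expected"] for c in GOLDEN)
    assert hits / len(GOLDEN) >= 0.95   # regression threshold agreed at phase exit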

7.1.1

Unit Test Suite

Component tests with mocking and fixtures.

7.1.2

Regression Test Framework

Version comparison and performance tracking.

7.1.3

Golden Dataset Repository

Curated test data with expected outputs.

7.2

Bias/Fairness & Interpretability Checks HJG

Ensure the model doesn't encode harmful biases.
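
For illustration, one common check is the statistical parity difference between groups; the group labels, toy predictions, and 0.10 tolerance below are hypothetical, and real reviews should use the metrics agreed in the Ethical AI Framework.

Python sketch: statistical parity difference check (illustrative)
# Illustrative only: difference in positive-prediction rates between two groups.
import numpy as np

def statistical_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Positive-prediction rate of group A minus that of group B."""
    rate_a = y_pred[group == "A"].mean()
    rate_b = y_pred[group == "B"].mean()
    return float(rate_a - rate_b)

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])                 # toy predictions
group = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])  # toy group labels

spd = statistical_parity_difference(y_pred, group)
print(f"statistical parity difference: {spd:+.2f}")
if abs(spd) > 0.10:
    print("flag for ethics review: disparity exceeds the agreed tolerance")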

7.2.1

Fairness Evaluation Report

Bias analysis with statistical parity metrics.

7.2.2

Model Interpretability Analysis

SHAP, LIME, feature importance rankings.

7.2.3

Ethical Review Documentation

Ethics committee review and mitigations.

7.3

Performance Benchmarks & Cost Profiling

Know the system's limits before production exposes them.

7.3.1

Performance Benchmark Suite

Latency, throughput, accuracy benchmarks.

7.3.2

Cost Analysis Report

Compute, storage, operational cost breakdown.

7.3.3

Resource Optimization Plan

Cost reduction and performance improvements.

7.4

Penetration Testing & API Fuzzing

Security testing before adversaries find the holes.

7.4.1

Security Test Results

Vulnerability assessment with remediation.

7.4.2

API Fuzzing Report

Input validation and edge case handling.

7.4.3

Security Hardening Checklist

Configuration verification and compliance.

LLM Control Checkpoint — Phase 7

If this system uses LLMs, the following controls must be validated during testing. Not optional.

Required LLM Controls — Validation Phase LLM

Risk | Mandatory Control | Owner
Retrieval Contamination | Signed data sources + relevance score thresholds | Data Engineer
Hallucination | Factual grounding requirements + expert sampling | ML Engineer

Validation Evidence Pack (Required Deliverables)

Ship validation as evidence. The goal is reproducible confidence, not a slide-deck verdict.

Deliverables

  • VAL-TEST-1 Test Plan & Coverage Map (unit, integration, regression; baseline comparisons)
  • VAL-TEST-2 Golden Set + Drift Sentinels (frozen eval set; monitored slices and cohorts)
  • VAL-TEST-3 Red Team Report (prompt-injection, tool misuse, retrieval contamination scenarios)
  • VAL-REP-1 Validation Report (metrics, acceptance criteria, known limitations, escalation outcomes)
  • VAL-TRACE-1 Artifact Trace Map (links tests ⇄ datasets ⇄ model versions ⇄ decisions)

Suggested references: IEEE 29119 (software testing), ISO/IEC 25010 (quality model), NIST AI RMF (risk management), OWASP LLM Top 10 & MITRE ATLAS (threat modeling).

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
08

Pre-Production Month 8


Staging environment, load testing, and final sign-off. ROI Gate: Validated ROI must exceed 1.5x projected.

Pre-Production is the dress rehearsal. The staging environment must be production-like: same data shapes, same traffic patterns, same failure modes. Shadow traffic reveals how the model behaves on real inputs without affecting real users. Load and stress testing find breaking points before production exposes them. Canary and A/B test designs ensure statistical rigor in the rollout. Privacy and compliance validations—GDPR, HIPAA, whatever applies—must be verified before launch. The ROI gate at this phase validates that actual performance justifies the investment made so far.


$ ROI Gate — Phase 8

Before entering Hypercare, validate ROI based on staging performance. If actual metrics fall below 1.5x of the Phase 4 projections, investigate the root cause before proceeding.

8.1

Staging Deployment with Shadow Traffic

Production-like environment with real traffic patterns.

8.1.1

Staging Environment Setup

Production-like with anonymization.

8.1.2

Synthetic Traffic Generator

Realistic load with various patterns.

8.1.3

Shadow Traffic Analysis

Staging vs production behavior comparison.

8.2

Load & Stress Testing

Find breaking points before users do.

8.2.1

Load Testing Strategy

Gradual load increase and failure scenarios.

8.2.2

Stress Testing Report

Behavior under extreme load.

8.2.3

Capacity Planning Model

Scaling recommendations from testing.

8.3

Canary or A/B Test Plan Approval

Statistical design for safe rollout.
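
A minimal sample-size sketch using the standard two-proportion normal approximation, assuming SciPy is available; the baseline rate, minimum detectable lift, alpha, and power are placeholder choices.

Python sketch: per-arm sample size for an A/B or canary test (illustrative)
# Illustrative only: per-arm sample size to detect an absolute lift in a binary
# success metric, via the two-proportion normal approximation.
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_arm(p_baseline: float, min_detectable_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    p1, p2 = p_baseline, p_baseline + min_detectable_lift
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# e.g. 20% baseline task success, detect an absolute +2% lift: roughly 6,500 per arm
print(sample_size_per_arm(0.20, 0.02))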

8.3.1

Experimentation Design

Sample size calculations and success criteria.

8.3.2

Risk Mitigation Plan

Rollback procedures and safety mechanisms.

8.3.3

Stakeholder Approval Matrix

Sign-off with go/no-go criteria.

8.4

Data Retention & Privacy Validation

GDPR, HIPAA—compliance verified before launch.

8.4.1

Privacy Impact Assessment

Data processing risk evaluation.

8.4.2

Data Retention Policy

Lifecycle management and deletion schedules.

8.4.3

Compliance Verification Report

Regulatory checklist with evidence.

LLM Control Checkpoint — Phase 8

If this system uses LLMs, the following controls must be verified before production. Not optional.

Required LLM Controls — Pre-Production Phase LLM

Risk | Mandatory Control | Owner
Context Window Decay | Max context length + truncation audit + instruction reinforcement | ML Engineer
Output Validation | PII scrubbing + format validation + sensitive data detection | Security Engineer

CT-1 Gate Enforcement: The Cost Telemetry Contract (CT-1) must be complete, with all metrics instrumented, owners assigned, and alerts configured, before proceeding to Hypercare. This is a blocking gate.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
09

Hypercare Month 9


Intensive post-launch support. High-touch monitoring, rapid response, and user feedback loops.

Hypercare is the high-touch period immediately following launch. This is not business as usual—it is an elevated state of vigilance. A dedicated support team with 24/7 coverage monitors everything in real time. Alert thresholds are tighter than normal operations. The war room is staffed. Escalation paths are rehearsed. User feedback flows directly to the team, enabling rapid iteration on issues that only emerge under real-world conditions. The goal is to catch and fix problems before they become crises, and to build the operational muscle that will sustain the system long-term.

9.1

Launch Readiness Review HJG

Final go/no-go decision with all stakeholders.

9.1.1

Launch Readiness Checklist

All prerequisites verified and documented.

9.1.2

Stakeholder Sign-off Document

Formal approval from all decision-makers.

9.1.3

Communication Plan

Internal and external launch messaging.

9.2

Dedicated Support Team Activation

24/7 coverage with escalation paths.

9.2.1

Support Team Roster

Names, roles, contact info, coverage hours.

9.2.2

Escalation Procedures

Severity levels and response time SLAs.

9.2.3

War Room Setup

Physical or virtual command center.

9.3

Real-time Monitoring & Rapid Response

Watch everything. React immediately.

9.3.1

Hypercare Dashboard

Real-time metrics with anomaly highlighting.

9.3.2

Incident Triage Playbook

Decision trees for rapid classification.

9.3.3

Hotfix Deployment Protocol

Emergency release process with safeguards.

9.4

User Feedback Collection & Rapid Iteration

Close the loop between users and the team.

9.4.1

Feedback Collection Channels

Surveys, support tickets, usage analytics.

9.4.2

Issue Prioritization Framework

Severity × impact × frequency scoring.

9.4.3

Hypercare Exit Criteria

Metrics that signal readiness for BAU.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
10

Production Deployment Month 10


Full production rollout with monitoring, scaling, and operational excellence.

Production Deployment is the transition from hypercare intensity to sustainable operations. Deployment patterns—blue/green, canary, rolling—are selected based on risk tolerance and rollback requirements. Autoscaling policies ensure the system handles variable load without manual intervention. Rollback and failover plans are not just documented; they are tested. Monitoring dashboards track the metrics that matter: not just model accuracy, but business impact, cost efficiency, and user satisfaction. The system must be operable by someone who did not build it.

10.1

Select Deployment Pattern

Blue/green, canary, rolling—choose based on risk tolerance.

10.1.1

Deployment Strategy Document

Pattern selection with risk assessment.

10.1.2

Rollout Timeline Plan

Phased deployment with checkpoints.

10.1.3

Traffic Switching Procedures

Load balancer configuration.

10.2

Create Rollback and Failover Plans

Know how to undo everything before you do anything.

10.2.1

Rollback Procedures Manual

Step-by-step with automation scripts.

10.2.2

Failover Architecture Plan

Multi-region DR with RTO/RPO specs.

10.2.3

Emergency Response Playbook

Escalation with contact lists.

10.3

Configure Autoscaling

HPA, VPA—right-size dynamically.

10.3.1

Horizontal Pod Autoscaler Config

HPA rules based on custom metrics.

10.3.2

Vertical Pod Autoscaler Setup

VPA for resource optimization.

10.3.3

Scaling Behavior Analysis

Load patterns and thresholds.

10.4

Set Up Production Monitoring & Alerting

Dashboards, alerts, SLA tracking.

10.4.1

Monitoring Dashboard Configuration

Grafana/DataDog with key metrics.

10.4.2

Alerting Rules Framework

Thresholds and notification channels.

10.4.3

SLA Monitoring Setup

SLIs, SLOs, and error budgets.
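
As a sketch, an SLO translates directly into an error budget and a burn check; the 99.5% target and the request counts below are placeholders.

Python sketch: error budget burn check (illustrative)
# Illustrative only: translating an SLO into an error budget and checking burn.

slo_target = 0.995                      # agreed SLO for successful requests
total_requests = 4_200_000              # requests in this 30-day window
failed_requests = 14_700                # requests that violated the SLI

error_budget = (1 - slo_target) * total_requests   # failures allowed: 21,000
budget_consumed = failed_requests / error_budget

print(f"error budget consumed: {budget_consumed:.0%}")
if budget_consumed >= 1.0:
    print("budget exhausted: freeze risky releases, prioritize reliability work")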

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
11

Reliability Month 11


Establish operational excellence—observability, incident response, and continuous health monitoring.

Reliability is where the system earns trust over time. This phase establishes the observability stack that makes the system's behavior legible: metrics, logs, traces, and the dashboards that synthesize them. Model-specific monitoring tracks accuracy drift, data drift, and the decay patterns that signal retraining is needed. On-call rotations and incident response runbooks ensure that problems are caught and resolved by people who know what to do. Blameless postmortems convert incidents into organizational learning. The goal is a system that degrades gracefully, recovers quickly, and improves continuously.

11.1

Implement Logging, Tracing, Metrics

Prometheus, OpenTelemetry—full observability stack.

11.1.1

Observability Stack Deployment

Prometheus, Grafana, Jaeger setup.

11.1.2

Custom Metrics Framework

Business and technical metrics.

11.1.3

Trace Analysis Dashboard

Request flow visualization.

11.2

Build Model-Specific Dashboards

Accuracy, drift, business impact—ML-specific observability.

11.2.1

Model Performance Dashboard

Real-time accuracy with trends.

11.2.2

Data Drift Monitoring Panel

Statistical drift detection.

11.2.3

Business Impact Metrics View

ML to business KPI correlation.

11.3

On-call & Incident Response

Runbooks, escalation, blameless postmortems.

11.3.1

On-call Rotation Schedule

Coverage with escalation contacts.

11.3.2

Operational Runbooks

Step-by-step troubleshooting guides.

11.3.3

Postmortem Template

Incident analysis with action items.

11.4

Model Decay Detection & Retraining

Automated detection and retraining triggers.
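
A minimal drift-trigger sketch using the Population Stability Index (PSI) on one feature; the bin count, the 0.25 threshold, and the synthetic distributions are common heuristics and placeholders, not requirements of this playbook.

Python sketch: PSI-based retraining trigger (illustrative)
# Illustrative only: a Population Stability Index check on a single feature,
# used as a retraining trigger.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    curr_pct = np.histogram(current, edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)   # avoid division by zero / log(0)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)        # training-time feature distribution
current = rng.normal(0.4, 1.2, 50_000)         # drifted production window

score = psi(baseline, current)
print(f"PSI = {score:.3f}")
if score > 0.25:
    print("significant drift: open a retraining ticket and notify the model owner")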

11.4.1

Model Decay Detection System

Performance degradation monitoring.

11.4.2

Automated Retraining Pipeline

Trigger conditions and workflow.

11.4.3

Production Data Capture

Feedback loop for retraining data.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
12

Continuous Improvement Month 12


The journey continues. ROI Gate: Actual ROI vs projected determines continuation or sunset.

Continuous Improvement recognizes that deployment is not the end—it is the beginning of a new cycle. Automation reduces toil and increases velocity. Documentation and knowledge sharing ensure that learnings survive staff turnover. Architecture reviews and technical debt assessments keep the system maintainable. The final ROI gate compares actual performance against projections: systems that deliver value earn continued investment; systems that do not are sunset gracefully. The insights from production feed back into product and research, informing the next iteration of capability.


$ ROI Gate — Phase 12

After 3 months in production, compare actual ROI to projections. If <1.0x, initiate sunset review. If >2.0x, consider expansion investment.

12.1

Automate Repetitive Steps

Reduce toil, increase velocity.

12.1.1

Automation Opportunity Analysis

Time-consuming tasks with ROI.

12.1.2

Workflow Automation Scripts

Python/Shell for common tasks.

12.1.3

CI/CD Pipeline Enhancements

Advanced automation and gates.

12.2

Document & Share Learnings

Write postmortems, design docs, and tech blogs.

12.2.1

Technical Writing Guidelines

Standards for documentation.

12.2.2

Knowledge Sharing Calendar

Tech talks and brown bags.

12.2.3

Learning Repository

Centralized knowledge base.

12.3

Architecture Reviews & Tech Debt Assessment

Systematic evaluation and debt tracking.

12.3.1

Architecture Review Checklist

Scalability, security, maintainability.

12.3.2

Technical Debt Inventory

Cataloged debt with priorities.

12.3.3

System Health Scorecard

Regular quality assessment.

12.4

Plan Next Iteration

Surface learnings to PM/Research for roadmap refinement.

12.4.1

Insights Report

Performance and behavior insights.

12.4.2

Research Collaboration Framework

Knowledge sharing with research.

12.4.3

Product Roadmap Input

Data-driven feature prioritization.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
13

Appendix


What Good Looks Like

This roadmap describes one year of deliberate organizational change, not twelve months of model building. The goal is durability—systems that work, are trusted, and remain governable after leadership attention moves elsewhere.

Annual View

A good year ends with:

Outcome | Why It Matters
Fewer arguments | Reality is shared
Fewer heroics | Risk is designed out
Fewer surprises | Incentives and ownership are explicit
Continuity | System functions when the original team leaves

At year end, the organization has: shared understanding, explicit ownership, managed risk, institutional memory, and the ability to say "no" as confidently as "yes."

Quarterly View

Each quarter solves a human problem before it becomes a technical or financial one.

Quarter | Name | Human Aim | Gate | Primary Outputs
Q1 | Diagnostics | Align people on reality before building anything expensive | Problem & success definition locked; baseline approved | Ontology, KPI targets, dataset inventory, baseline + error analysis
Q2 | Architect | Reduce ambiguity so teams stop arguing and start shipping | Architecture review passed; security/compliance accepted | System design, IaC plan, schema/versioning, baseline pipeline
Q3 | Engineer | Build with guardrails so operators don't carry risk | Validation suite green; risk controls implemented | Eval harness, red-team results, drift/bias checks, rollout plan
Q4 | Enable | Make the system survivable after handoff | Production readiness met; monitoring live; owner assigned | Runbooks, dashboards, change mgmt, ROI review

Quarterly Roadmap

Gate Types

Badge | Name | Meaning
HJG | Human Judgment Gate | Requires explicit human decision-making. Not automatable.
$ | Economic Gate | Requires ROI validation before proceeding. Kill criteria apply.
⚠ | Irreversibility Flag | Decisions costly to unwind. Extra scrutiny required.
CT | Cost Telemetry Contract | Metrics with named owners, refresh cadence, and kill bindings.

HJG Procedural Requirements

Human Judgment Gates require procedural enforcement, not just cultural compliance:

  • Convener: Named person responsible for scheduling the gate (typically Product Owner or Tech Lead)
  • Quorum: Minimum 2 reviewers with authority to approve or reject
  • Evidence: Required artifacts must be submitted 48 hours before the gate
  • Dissent: Dissenting views must be documented even if overruled
  • Escalation: If gate is missed or delayed >5 business days, automatic escalation to Exec Sponsor
  • Record: Decision, rationale, attendees, and dissent logged in Decision Memory Ledger

How to Use This Playbook

What "Done" Means

"Done" is not a model that runs. "Done" is a capability that can be measured, audited, rolled back, and re-learned by a new team without tribal memory.

What Breaks Teams

Most programs fail from missing evidence: unclear intent, no acceptance criteria, no telemetry contract, no rollback plan, and no operating owner. This playbook forces those decisions earlier.

Each month maps to a phase. Organizations may compress or extend phases based on complexity, but the sequence should not be reordered. Skipping phases creates debt that surfaces later—usually at the worst possible time.

Phase Evidence Packs

Each phase exit requires a formal Evidence Pack. This makes gatekeeping less subjective without bureaucratizing it.

Phase | Evidence Pack ID | Required Artifacts | Reviewer
01 Ontology | PH1-EVID-1 | Expert map, concept glossary, relationship diagram, contested concept log | Domain Lead + Product
02 Problem Space | PH2-EVID-1 | Boundary stress tests, edge case matrix, scope validation results | Tech Lead + Product
03 Discovery | PH3-EVID-1 | Stakeholder interview notes, data inventory, regulatory constraint map | Product + Compliance
04 Alignment | PH4-EVID-1 | Architecture ROI pack, stakeholder sign-off matrix, risk acceptance docs | Exec Sponsor + Finance
05 Integration | PH5-EVID-1 | IaC validation logs, schema version registry, security scan results | Platform Lead + Security
06 Build | PH6-EVID-1 | Baseline model metrics, telemetry contract, reproducibility proof | ML Lead + SRE
07 Validation | PH7-EVID-1 | Test suite results, bias audit, red team report, pen test findings | QA Lead + Security
08 Pre-Production | PH8-EVID-1 | Load test results, canary metrics, rollback verification, kill drill results | SRE Lead + Ops
09 Hypercare | PH9-EVID-1 | Launch checklist, escalation log, rapid iteration tracking | Product + Support Lead
10 Production | PH10-EVID-1 | Deployment verification, autoscaling proof, rollback test results | SRE + Platform Lead
11 Reliability | PH11-EVID-1 | Observability dashboard, on-call rotation, decay detection baseline | SRE Lead + ML Lead
12 Continuous Improvement | PH12-EVID-1 | Automation inventory, knowledge transfer docs, next iteration brief | Tech Lead + Product

Stop Authority Drills HJG

Stop authority is psychologically harder than rollback. Organizations must practice stopping, not just responding.

⚠ Mandatory Requirement

At least one simulated kill-decision exercise must be run before Phase 8. This forces the organization to practice stopping a project that has momentum, budget, and stakeholder investment.

Drill Type | Timing | Participants | Success Criteria
Economic Kill Drill | Before Phase 4 ROI Gate | Finance, Product, Exec Sponsor | Team can articulate kill threshold and demonstrate willingness to invoke it
Technical Kill Drill | Before Phase 8 | ML Lead, SRE, Platform | Rollback executes in <15 min; all dependencies notified; audit trail complete
Compliance Kill Drill | Before Phase 9 | Legal, Compliance, Product | Stop authority invoked on simulated regulatory finding; communication chain verified
Adoption Kill Drill | Before Phase 10 | Product, UX, Support | Team can define minimum viable adoption shape and demonstrate kill criteria

Drill Protocol

  1. Scenario briefing: Present a realistic kill condition (cost overrun, bias discovery, adoption failure)
  2. Decision simulation: Team must reach consensus on kill/continue within 30 minutes
  3. Execution proof: If kill, demonstrate the technical and communication steps
  4. Debrief: Document hesitation points, authority gaps, and process improvements

Anti-Patterns & Red Flags

Strong governance systems risk becoming performative. Watch for these signals that artifacts are being completed without genuine engagement.

Anti-Pattern | Red Flag Signals | Root Cause | Intervention
Backfilled Model Card | Model Card completed after deployment; sections copy-pasted from templates; no evidence of reviewer engagement | Documentation treated as compliance checkbox, not design artifact | Require Model Card draft at Phase 6; reviewer must sign with specific feedback
Mechanical Risk Register | All risks rated "Medium"; mitigations are generic; no risks ever escalated or retired | Risk assessment is ceremonial; no one expects it to drive decisions | Require at least one risk escalation per quarter; track risk-to-decision linkage
Phantom RACI | RACI exists but decisions still escalate informally; "Accountable" person doesn't know they're accountable | Authority transfer is documented but not socialized | RACI owner must verbally confirm role; escalation test in Phase 4
Ceremonial HJG | Human Judgment Gates passed in <5 minutes; no dissent recorded; same person approves everything | Gates are scheduled but not staffed for genuine deliberation | Require minimum 2 reviewers; document dissenting views even if overruled
Orphaned Telemetry | Dashboards exist but no one checks them; alerts fire but aren't investigated | Observability built for audit, not for operations | Weekly telemetry review with named owner; alert-to-action audit
Compliance Theater | Legal/Compliance consulted only for sign-off; concerns raised late are dismissed as "blocking" | Compliance treated as gate, not design partner | Compliance representative in Phase 3 discovery; veto power through Phase 7
Tribal Knowledge Dependency | Key decisions explained verbally; documentation says "ask Sarah"; Bus Factor = 1 | Urgency prioritized over durability | Knowledge transfer test: new team member must execute runbook solo

Audit Question

For each artifact, ask: "If I removed this document, would anyone notice? Would any decision change?" If the answer is no, the artifact is performative.

Kill Criteria & Stop Authority

Projects fail expensively when nobody has the right—or the obligation—to stop them. Define explicit kill criteria early and assign named stop authority before incentives and sunk cost take over.

Kill criteria (examples)

  • Cost per Unit of AI Work exceeds threshold for 3 consecutive cycles
  • Unmitigated safety or compliance breach
  • Performance regression beyond agreed tolerance
  • Adoption remains below target despite corrective actions

Stop authority

  • Named individual (not a committee)
  • Clear escalation path and decision window
  • Rollback power without political permission
  • Evidence required for restart
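
As a sketch, the first kill criterion listed above can be evaluated mechanically from cost telemetry; the threshold, the cycle history, and the helper name are hypothetical.

Python sketch: evaluating a consecutive-cycle cost kill criterion (illustrative)
# Illustrative only: checking "cost per Unit of AI Work exceeds threshold for
# 3 consecutive cycles" from recorded telemetry.

def breaches_kill_threshold(cost_per_unit_history: list[float],
                            threshold: float,
                            consecutive_cycles: int = 3) -> bool:
    """True if the most recent N cycles all exceeded the agreed cost threshold."""
    recent = cost_per_unit_history[-consecutive_cycles:]
    return len(recent) == consecutive_cycles and all(c > threshold for c in recent)

history = [0.9, 1.1, 1.3, 1.2]   # cost per Unit of AI Work, last four cycles
if breaches_kill_threshold(history, threshold=1.0):
    print("kill criterion met: escalate to the named stop authority for a decision")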

Executive Control Surface

A CIO/CTO should monitor these 6 signals monthly. When thresholds are breached, intervention is required—not optional.

Monthly Monitoring Signals

Signal | Description | Healthy | Warning | Critical
Unit Economics Health | Cost per inference relative to value delivered | <80% of value | 80-100% | >100% (value-negative)
Model Performance Decay | Accuracy/precision drift from baseline | <5% decay | 5-15% | >15% (trigger retraining)
Error Rate by Consequence | Errors weighted by business impact | <$10K/mo impact | $10-50K | >$50K (escalate)
Human Override Rate | How often humans reject model outputs | 5-20% | <5% or >30% | <2% or >50%
Time-to-Rollback | How quickly the system can be reverted | <15 min | 15-60 min | >60 min (unacceptable)
Compliance Drift | Gap between current state and requirements | Fully compliant | Minor gaps | Material gaps (halt)

Intervention Triggers

These conditions require immediate executive action—not delegation.

1
Kill Trigger: ROI Collapse

If cost-per-inference exceeds value-per-inference for 2 consecutive months, initiate sunset review. Do not wait for quarter-end.

2
Escalation Trigger: Consequential Error Spike

If weighted error cost exceeds $50K in any month, convene incident review within 48 hours. Model may need to be pulled from production.

3
Governance Trigger: Compliance Gap

Any material compliance gap halts new feature deployment until resolved. Non-negotiable in regulated industries.

Decision Authority Matrix

Decision | Owner | Consulted | Informed
Model goes to production | CTO / VP Eng | Legal, Compliance, Product | Board (if high-risk)
Model is sunset | CTO + CFO jointly | Product, Customer Success | Affected customers
Emergency rollback | On-call engineer | None (act first) | CTO within 1 hour
Compliance exception | General Counsel | CTO, CISO | Board
Budget increase >20% | CFO | CTO, Product | Board

Economic Viability Framework

Cost is not a constraint—it is a governing force. Every AI system must justify its existence economically, continuously.

$1

Unit Economics Definition Gate

Before any model is built, define the economic unit. What is the cost of one inference? What is the value of one correct output?

Economic Gate

If value-per-inference cannot be estimated to within a factor of 10, the project is not ready for development. Return to Discovery.

E.1.1

Cost-per-Inference Model

Compute, storage, network, human review costs per prediction.

E.1.2

Value-per-Inference Model

Revenue generated, cost avoided, or risk mitigated per correct output.

E.1.3

Break-even Analysis

Volume required for positive ROI at current accuracy levels.

$2

Cost-of-Error Curves

Not all errors are equal. Map the cost of different error types and their frequency.
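
A minimal sketch of a weighted error-cost calculation that can feed the $50K escalation trigger used elsewhere in this playbook; the error taxonomy and dollar weights are placeholders.

Python sketch: weighted monthly error cost (illustrative)
# Illustrative only: weighting error counts by per-error cost to get a monthly
# error cost figure comparable to the escalation trigger.

ERROR_COST_USD = {            # cost of a single error, by type (placeholders)
    "false_positive": 12.0,   # unnecessary human review
    "false_negative": 450.0,  # missed fraud written off
    "edge_case": 90.0,        # manual correction plus customer contact
}

monthly_error_counts = {"false_positive": 1_800, "false_negative": 60, "edge_case": 120}

weighted_cost = sum(ERROR_COST_USD[kind] * count
                    for kind, count in monthly_error_counts.items())
print(f"weighted error cost this month: ${weighted_cost:,.0f}")
if weighted_cost > 50_000:
    print("exceeds escalation trigger: convene incident review within 48 hours")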

E.2.1

Error Taxonomy with Cost Weights

False positives, false negatives, edge cases—each with dollar impact.

E.2.2

Cost-of-Error vs Latency Trade-off Curves

Faster inference often means more errors. Quantify the trade-off.

E.2.3

Error Budget Allocation

Acceptable error rates by type, based on economic tolerance.

$3

Kill Thresholds HJG

Define the conditions under which the project is terminated—before you're emotionally invested.

⚠ Irreversibility Flag

Kill thresholds must be defined before Phase 4 (Alignment). Once development begins, sunk cost bias makes objective termination nearly impossible.

E.3.1

Kill Criteria Document

Specific, measurable conditions that trigger project termination.

E.3.2

Sunset Procedure

How to wind down gracefully: data retention, customer communication, team reallocation.

E.3.3

Pivot Criteria

Conditions under which the project should change direction rather than die.

$4

ROI Gates at Phase Boundaries

Economic viability is validated at Phase 4, 8, and 12. Not annually—at milestones.

E.4.1

Phase 4 ROI Gate: Design Complete

Projected ROI based on architecture decisions. Kill if negative at projected scale.

E.4.2

Phase 8 ROI Gate: Pre-Production

Validated ROI based on staging performance. Kill if <1.5x projected.

E.4.3

Phase 12 ROI Gate: Steady State

Actual ROI vs projected. Sunset if <1.0x after 3 months in production.

Economic Sovereignty Principle

A model that cannot pay for itself is a liability, not an asset. Economic viability is not a constraint to work around—it is the purpose the system must serve.

Cost Telemetry Contract

Economic sovereignty is not just a principle; it is physically enforced. Every production system must satisfy this contract. No exceptions.

Mandatory Enforcement

Each metric below requires a named human owner (not "team"), a defined refresh cadence, a review forum, and a binding to a specific kill threshold. Systems without complete telemetry contracts do not ship.

Required Telemetry Metrics

Metric | Owner | Refresh | Reviewed By | Kill Trigger
Cost per inference (fully loaded) | Engineering Manager | Daily | CTO + CFO | >1.0× value for 2 months
Error cost per month (weighted) | Product Manager | Weekly | Executive Review | >$50K/month
Human review cost per output | Operations Lead | Weekly | Ops Review | >30% of inference cost
Compute cost per 1K inferences | Platform Engineer | Real-time | Infra Review | >2× baseline for 1 week
Retraining cost per cycle | ML Engineer | Per event | ML Review | >1 month of value
Value delivered per inference | Business Analyst | Monthly | Exec Review | <0.8× projected for 2 months
CT

Contract Artifact: CT-1

The Cost Telemetry Contract must be completed and signed off before Phase 8 (Pre-Production).

CT-1.1

Telemetry Implementation Checklist

Each metric instrumented with data pipeline and dashboard.

CT-1.2

Owner Assignment Document

Named individuals (not roles) with escalation paths.

CT-1.3

Alert Configuration Spec

Automated alerts for threshold breaches with escalation rules.

CT-1.4

Review Cadence Calendar

Standing meetings where each metric is reviewed with owners present.

Enforcement Mechanism

The CT-1 artifact is a gate artifact. Production deployment is blocked until all six metrics have verified telemetry, named owners, and configured alerts.

Implementation Templates

Production-ready templates for governance artifacts. Copy, customize, and deploy. These are starting points—adapt to your regulatory context.

Template Philosophy

Templates reduce cognitive load but create false confidence if used without adaptation. Each template includes "Customization Required" flags for organization-specific decisions.

T.1

RACI Matrix Template

Responsibility assignment for AI/ML lifecycle. The most common failure mode is "everyone is responsible" (meaning no one is).

RACI Matrix — AI/ML Production
Activity / Decision | ML Engineer | Product Manager | Data Engineer | Security | Legal/Compliance | Executive Sponsor
Phase 1-3: Discovery & Definition
Problem definition sign-off | C | R | C | I | C | A
Data availability assessment | C | I | R | C | C | I
Regulatory constraint mapping | I | C | C | C | R | A
Kill criteria definition | C | R | I | I | C | A
Phase 4-6: Design & Build
Architecture design | R | C | C | C | I | I
Security posture approval | C | I | C | R | C | A
Data pipeline implementation | C | I | R | C | I | I
Model training & selection | R | C | C | I | I | I
Phase 7-9: Validation & Pre-Production
Bias/fairness evaluation | R | C | I | I | C | A
Security penetration testing | C | I | I | R | I | I
Production readiness sign-off | C | R | C | C | C | A
Rollback plan validation | R | C | C | C | I | I
Phase 10-12: Production & Operations
Production deployment | R | C | C | C | I | I
Incident response (L1) | R | I | C | C | I | I
Incident escalation (L2+) | C | R | C | C | C | A
Model retraining decision | R | C | C | I | I | A
Kill/sunset decision | C | C | I | C | C | A
R = Responsible (does the work) · A = Accountable (final decision authority) · C = Consulted (input required) · I = Informed (kept updated)

⚙ Customization Required

  • Add organization-specific roles (e.g., AI Ethics Board, Model Risk Officer for financial services)
  • Adjust "A" assignments based on your governance structure
  • For regulated industries, Legal/Compliance may need "A" on more decisions
  • Consider adding SRE/Platform team for infrastructure-heavy deployments
T.2

Telemetry Dashboard Configuration

Grafana/Datadog-compatible dashboard specification. These are the minimum viable metrics for production AI governance.

Grafana Dashboard JSON — AI/ML Production Telemetry
{
  "dashboard": {
    "title": "AI/ML Production Governance",
    "tags": ["ai", "ml", "production", "governance"],
    "panels": [
      {
        "title": "Economic Health",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
        "targets": [{
          "expr": "sum(ml_inference_cost_usd) / sum(ml_value_delivered_usd)",
          "legendFormat": "Cost/Value Ratio"
        }],
        "thresholds": {
          "steps": [
            {"color": "#999", "value": null},
            {"color": "#666", "value": 0.8},
            {"color": "#000", "value": 1.0}
          ]
        }
      },
      {
        "title": "Model Performance Decay",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
        "targets": [{
          "expr": "1 - (ml_current_accuracy / ml_baseline_accuracy)",
          "legendFormat": "Decay from Baseline"
        }],
        "alert": {
          "name": "Model Decay Alert",
          "conditions": [{
            "evaluator": {"type": "gt", "params": [0.15]},
            "operator": {"type": "and"},
            "reducer": {"type": "avg"}
          }],
          "notifications": [{"uid": "ml-oncall-channel"}]
        }
      },
      {
        "title": "Human Override Rate",
        "type": "gauge",
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
        "targets": [{
          "expr": "sum(ml_human_overrides) / sum(ml_total_predictions) * 100",
          "legendFormat": "Override %"
        }],
        "thresholds": {
          "steps": [
            {"color": "#999", "value": null},
            {"color": "#666", "value": 15},
            {"color": "#000", "value": 30}
          ]
        }
      },
      {
        "title": "Error Cost by Category",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 4},
        "targets": [{
          "expr": "sum by (error_type) (ml_error_cost_usd)",
          "legendFormat": "{{error_type}}"
        }]
      },
      {
        "title": "Inference Latency P99",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
        "targets": [{
          "expr": "histogram_quantile(0.99, ml_inference_latency_seconds_bucket)",
          "legendFormat": "P99 Latency"
        }],
        "alert": {
          "name": "Latency SLA Breach",
          "conditions": [{
            "evaluator": {"type": "gt", "params": [2.0]},
            "operator": {"type": "and"},
            "reducer": {"type": "avg"}
          }]
        }
      },
      {
        "title": "Data Drift Score",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 12},
        "targets": [{
          "expr": "ml_feature_drift_score",
          "legendFormat": "{{feature_name}}"
        }],
        "thresholds": {
          "steps": [
            {"color": "#999", "value": null},
            {"color": "#666", "value": 0.1},
            {"color": "#000", "value": 0.25}
          ]
        }
      },
      {
        "title": "Cost Telemetry Contract Status",
        "type": "table",
        "gridPos": {"h": 6, "w": 18, "x": 0, "y": 20},
        "targets": [{
          "expr": "ml_cost_metric_status",
          "format": "table"
        }],
        "transformations": [{
          "id": "organize",
          "options": {
            "indexByName": {},
            "renameByName": {
              "metric_name": "Metric",
              "owner": "Owner",
              "refresh_cadence": "Refresh",
              "last_updated": "Last Updated",
              "kill_trigger_status": "Kill Trigger Status"
            }
          }
        }]
      }
    ],
    "refresh": "1m",
    "time": {"from": "now-24h", "to": "now"}
  }
}
T.2.1

Required Prometheus/OpenMetrics Exports

# HELP ml_inference_cost_usd Total inference cost in USD
# TYPE ml_inference_cost_usd counter
ml_inference_cost_usd{model="fraud_v2",env="prod"} 1234.56

# HELP ml_value_delivered_usd Estimated value delivered by predictions
# TYPE ml_value_delivered_usd counter
ml_value_delivered_usd{model="fraud_v2",env="prod"} 5678.90

# HELP ml_human_overrides Count of human override events
# TYPE ml_human_overrides counter
ml_human_overrides{model="fraud_v2",reason="low_confidence"} 42

# HELP ml_feature_drift_score PSI or KL divergence from baseline
# TYPE ml_feature_drift_score gauge
ml_feature_drift_score{feature="transaction_amount"} 0.08
T.3

Infrastructure as Code Snippets

Terraform modules for governed AI infrastructure. These enforce security and observability by default.

Terraform — ML Model Serving Infrastructure (AWS)
# ml-serving-infrastructure/main.tf
# Governed ML model serving with mandatory observability and rollback

terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

variable "model_name" {
  type        = string
  description = "Name of the ML model (used for resource naming)"
}

variable "model_version" {
  type        = string
  description = "Semantic version of the model"
}

variable "kill_threshold_cost_ratio" {
  type        = number
  default     = 1.0
  description = "Cost/value ratio that triggers kill alert"
}

variable "rollback_model_version" {
  type        = string
  description = "Previous stable version for automatic rollback"
}
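
# --- Additional inputs referenced later in this module (assumed declarations) ---
# The SNS topics (ml_alerts, executive_escalation), the endpoint configuration
# (ml_config), and the execution role (sagemaker_execution) referenced below are
# assumed to be defined elsewhere in the module.

variable "model_image_uri" {
  type        = string
  description = "Container image URI for the inference container"
}

variable "model_artifact_s3_uri" {
  type        = string
  description = "S3 URI of the packaged model artifact"
}

variable "approval_ticket_id" {
  type        = string
  description = "Change-approval ticket ID (required by tagging policy)"
}

variable "risk_assessment_id" {
  type        = string
  description = "Risk assessment record ID (required by tagging policy)"
}

variable "model_card_url" {
  type        = string
  description = "URL of the published Model Card (required by tagging policy)"
}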

# SageMaker Endpoint with mandatory monitoring
resource "aws_sagemaker_endpoint" "ml_endpoint" {
  name                 = "${var.model_name}-${var.model_version}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.ml_config.name

  deployment_config {
    blue_green_update_policy {
      traffic_routing_configuration {
        type                     = "CANARY"
        canary_size {
          type  = "CAPACITY_PERCENT"
          value = 10
        }
        wait_interval_in_seconds = 600
      }
      termination_wait_in_seconds = 300
      maximum_execution_timeout_in_seconds = 3600
    }
    
    auto_rollback_configuration {
      # The AWS provider models alarms as repeated blocks, each with an alarm_name.
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.model_error_rate.alarm_name
      }
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.latency_breach.alarm_name
      }
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.cost_ratio_breach.alarm_name
      }
    }
  }

  tags = {
    ManagedBy       = "terraform"
    Model           = var.model_name
    Version         = var.model_version
    RollbackVersion = var.rollback_model_version
    CostCenter      = "ml-platform"
    Governance      = "ai-playbook-v7"
  }
}

# Mandatory CloudWatch Alarms (cannot deploy without these)
resource "aws_cloudwatch_metric_alarm" "model_error_rate" {
  alarm_name          = "${var.model_name}-error-rate-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelError"
  namespace           = "AWS/SageMaker"
  period              = 300
  statistic           = "Average"
  threshold           = 0.05
  alarm_description   = "Model error rate exceeds 5% - triggers rollback"
  
  dimensions = {
    EndpointName = aws_sagemaker_endpoint.ml_endpoint.name
    VariantName  = "primary"
  }

  alarm_actions = [
    aws_sns_topic.ml_alerts.arn,
    # Auto-rollback is handled by deployment_config
  ]
}

resource "aws_cloudwatch_metric_alarm" "latency_breach" {
  alarm_name          = "${var.model_name}-latency-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelLatency"
  namespace           = "AWS/SageMaker"
  period              = 300
  extended_statistic  = "p99"
  threshold           = 2000000  # 2 seconds (SageMaker reports ModelLatency in microseconds)
  alarm_description   = "P99 latency exceeds SLA - triggers rollback"
  
  dimensions = {
    EndpointName = aws_sagemaker_endpoint.ml_endpoint.name
  }

  alarm_actions = [aws_sns_topic.ml_alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "cost_ratio_breach" {
  alarm_name          = "${var.model_name}-cost-ratio-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 24  # 24 hours of sustained breach
  metric_name         = "CostValueRatio"
  namespace           = "Custom/MLGovernance"
  period              = 3600
  statistic           = "Average"
  threshold           = var.kill_threshold_cost_ratio
  alarm_description   = "Cost/value ratio exceeds kill threshold"
  
  alarm_actions = [
    aws_sns_topic.ml_alerts.arn,
    aws_sns_topic.executive_escalation.arn
  ]
}

# Governance enforcement: block deployment without audit trail
resource "aws_sagemaker_model" "ml_model" {
  name               = "${var.model_name}-${var.model_version}"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = var.model_image_uri
    model_data_url = var.model_artifact_s3_uri
    
    environment = {
      MODEL_VERSION           = var.model_version
      GOVERNANCE_PLAYBOOK_REF = "ai-playbook-v7"
      DEPLOYMENT_TIMESTAMP    = timestamp()
      ROLLBACK_VERSION        = var.rollback_model_version
    }
  }

  tags = {
    ApprovalTicket = var.approval_ticket_id  # Required - enforced by policy
    RiskAssessment = var.risk_assessment_id  # Required - enforced by policy
    ModelCard      = var.model_card_url      # Required - enforced by policy
  }
}

# Output for audit trail
output "deployment_manifest" {
  value = {
    endpoint_name     = aws_sagemaker_endpoint.ml_endpoint.name
    model_version     = var.model_version
    rollback_version  = var.rollback_model_version
    kill_threshold    = var.kill_threshold_cost_ratio
    deployed_at       = timestamp()
    alarms_configured = [
      aws_cloudwatch_metric_alarm.model_error_rate.alarm_name,
      aws_cloudwatch_metric_alarm.latency_breach.alarm_name,
      aws_cloudwatch_metric_alarm.cost_ratio_breach.alarm_name
    ]
  }
  description = "Deployment manifest for audit trail"
}

⚙ Customization Required

  • Replace AWS SageMaker with your inference platform (GCP Vertex AI, Azure ML, self-hosted)
  • Adjust thresholds based on your SLAs and risk tolerance
  • Add VPC configuration for network isolation requirements
  • Integrate with your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins)
  • Add KMS encryption for regulated data (HIPAA, PCI-DSS)
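
The cost_ratio_breach alarm above watches a custom metric (namespace Custom/MLGovernance, metric CostValueRatio) that nothing in this module publishes. A minimal publisher sketch in Python using boto3, assuming hourly cost and value figures are produced by your cost telemetry pipeline; the function name and scheduling are illustrative.

# Publishes the CostValueRatio metric evaluated by the cost_ratio_breach alarm.
# cost_usd and value_usd are assumed to come from the cost telemetry contract
# (e.g., billing exports plus the value-delivered counter) for the last hour.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_cost_value_ratio(cost_usd: float, value_usd: float) -> None:
    ratio = cost_usd / max(value_usd, 0.01)  # guard against divide-by-zero
    cloudwatch.put_metric_data(
        Namespace="Custom/MLGovernance",
        MetricData=[{
            "MetricName": "CostValueRatio",
            "Value": ratio,
            "Unit": "None",
            # No dimensions: the Terraform alarm above is defined without dimensions,
            # and CloudWatch alarms only match metrics with identical dimensions.
        }],
    )

# Example: run hourly from a scheduled job.
# publish_cost_value_ratio(cost_usd=412.50, value_usd=390.00)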
T.4

Phase Exit Checklist Template

Standardized gate review checklist. No phase exit without explicit sign-off on all items.

Phase Exit Review — Template

Phase: [PHASE_NUMBER] — [PHASE_NAME]

Review Date: _____________ Reviewer: _____________

  • Truth Contract: ☐ reviewed ☐ signed off
  • Economic Contract: ☐ reviewed ☐ signed off
  • Risk Contract: ☐ reviewed ☐ signed off
  • Ownership Contract: ☐ reviewed ☐ signed off
  • Gate Decision: ☐ recorded

Accountable Executive: _____________ Date: _____________

ML Lead: _____________ Date: _____________

Product Owner: _____________ Date: _____________

Executive-Grade Observability

Operational AI fails quietly when only engineers can see the system. This layer makes trust, economics, and governance legible to executives so decisions are made on reality rather than narrative.

Trust Dashboard

  • Drift / decay indicators
  • Incident frequency and severity
  • Override and escalation rates
  • Operator confidence (measured, not assumed)

Economics Dashboard

  • Cost per Unit of AI Work (UAW)
  • Variance vs. forecast and budget guardrails
  • Marginal cost per new capability
  • ROI trendline (with confidence bounds)

Governance Dashboard

  • Open risks with named owners
  • Contract breaches and remediation status
  • Model / prompt / policy version traceability
  • Stop-authority exercises completed

Operator UX Principles

  • Explain the “why” before the chart
  • Surface the next action, not just metrics
  • Design for incident time, not demo time
  • Make rollback and safe-degradation obvious

Why Systems Fail

Technical correctness is necessary but not sufficient. These failure patterns survive model validation and destroy production systems.

1
Misaligned Incentives Override Accuracy

Users or operators have incentives that conflict with model objectives. The system produces correct outputs that get ignored or gamed.

2
Automation Shifts Users from Skepticism to Compliance

Over time, users stop questioning model outputs. When the model fails, no human catches it.

3
Unowned Outputs Create Silent Failure

No one is accountable for validating model decisions. Errors compound without detection.

4
Weak Rollback Paths Convert Errors into Crises

Systems that can't be quickly reversed turn fixable problems into reputational events.

5
Domain Expertise Erodes

Humans who could catch model errors lose their edge because they stop practicing judgment.

Key Insight These failures are organizational and procedural, not technical. They cannot be fixed with better models—only with better governance.

The Human Failure Surface

Most production AI failures are human failures first: incentives, authority, skill asymmetry, and narrative decay. This section makes those failure modes explicit so they can be designed out.

Failure modes

  • Incentive drift: KPIs reward usage, not outcomes.
  • Authority ambiguity: no named stop authority.
  • Skill asymmetry: operators cannot diagnose failure.
  • Narrative decay: original intent is forgotten.
  • Vendor gravity: defaults become architecture.

Countermeasures

  • Phase Exit Contracts and named owners
  • Override latency targets and rollback rehearsal
  • Executive-grade observability (Trust/Econ/Gov)
  • System Memory File with quarterly review
  • Vendor constraints explicitly documented

System Continuity & Human Governance

Minimal addendum to prevent long-horizon failure: memory loss, meaning drift, and power resistance.

Artifacts HJG Senior-Only
Design intent: This addendum is deliberately small. It converts the remaining human and temporal risks into enforceable artifacts and gates—without expanding the core 12-month sequence.
0.1

Decision Memory Ledger (DML)

Documentation records outcomes. The Decision Memory Ledger preserves intent and assumptions so the system survives staff turnover and time.

Artifacts

  • DML-1 Decision Memory Ledger (schema: Decision ID, Summary, Context, Alternatives, Rejections, Assumptions, Assumption Expiry, Owner)
  • DML-2 Ledger Access Policy (read/write permissions, audit logging)
  • DML-3 Ledger Query Requirement (mandatory consultation before scope, schema, objective, or boundary changes)

Gate

Hard Gate Required before Phase 4 / Phase 8 / Phase 11 changes that touch model objectives, retrieval scope, labeling, or decision boundaries.

0.2

Power Impact Assessment (PIA) Senior-Only HJG

Working systems redistribute authority. Resistance is usually rational: loss of discretion, shifted accountability, and threatened expertise. Make it visible early.

Artifacts

  • PIA-1 Power Impact Assessment (who loses discretion, who gains authority, who becomes accountable, who can silently resist)
  • PIA-2 Incentive Misalignment Register (misaligned KPIs, conflicting owners, perverse incentives)
  • PIA-3 Adoption Risk Mitigation Plan (training, incentives, workflow design, escalation paths)

Gate

HJG Reviewed by Product + Exec Sponsor before Phase 3.

0.3

Declared System Role & Meaning Boundary

Humans use systems as stories. Declare what the system is allowed to mean—so “advisory” does not silently become “oracle.”

Artifacts

  • DSR-1 Declared System Role Statement (Advisory / Assistive / Gatekeeping)
  • DSR-2 Prohibited Uses & Boundary Conditions (domains, decisions, and contexts where use is disallowed)
  • DSR-3 Human Confirmation Points (required approvals, override rules, escalation)

Gate

Hard Gate Required before Phase 3; UI language, training, and audit checks must align with DSR-1.

0.4

Long-Horizon Risk Register (LHR) Senior-Only

Some harm compounds invisibly and will not trigger short-term metrics or kill switches. Track it, review it annually, intervene with humans—not models.

Artifacts

  • LHR-1 Long-Horizon Risk Register (skill atrophy, decision monoculture, over-dependence, vendor cognitive lock-in)
  • LHR-2 Annual Review Record (evidence, outcomes, mitigations)
  • LHR-3 Mitigation Action Plan (training, rotation, policy, workflow redesign)
Rule: Long-horizon risks do not trigger system termination. They trigger human intervention and governance action.
0.5

Planned Obsolescence & Doctrine Review Senior-Only

Every system needs a retirement plan, and every playbook needs a way to be revised without becoming dogma.

Artifacts

  • PO-1 Planned Obsolescence Plan (expected lifespan, replacement conditions, knowledge transfer, archive & shutdown)
  • DG-1 Doctrine Review Record (annual review, at least one external reviewer, logged exceptions & outcomes)
  • DG-2 Exception Log (what rule was broken, why, outcome, preventive fix)

Gate

Hard Gate PO-1 required before Phase 8 (Launch/Production). DG-1 reviewed annually.

Appendix: LLM-Specific Risk Classes

Large Language Models introduce failure modes that don't exist in traditional ML. These risks are documented here for reference; operational controls are embedded in Phases 6–8.

L1

Prompt Injection Critical

Adversarial inputs that override system instructions. Can cause data exfiltration, unauthorized actions, or reputation damage.

⚠ Operational Risk

Any user-facing LLM is vulnerable. Defense requires input sanitization, output filtering, and privilege separation. No perfect solution exists.

L.1.1

Input Sanitization Rules

Character filtering, length limits, known-attack pattern detection.
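
A minimal sketch of the kind of pre-filter L.1.1 describes. The length limit, character policy, and attack patterns below are illustrative placeholders, not a complete defense (see the operational risk note above).

# Illustrative input pre-filter: length cap, control-character stripping,
# and a small known-attack pattern list. Patterns here are examples only.
import re

MAX_INPUT_CHARS = 4000
SUSPECT_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

def sanitize_user_input(text: str) -> tuple[str, list[str]]:
    """Returns (cleaned_text, flags). Flags are routed to logging / human review."""
    flags = []
    if len(text) > MAX_INPUT_CHARS:
        text = text[:MAX_INPUT_CHARS]
        flags.append("truncated")
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    for pattern in SUSPECT_PATTERNS:
        if pattern.search(text):
            flags.append(f"suspect_pattern:{pattern.pattern}")
    return text, flags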

L.1.2

Output Filtering Pipeline

Sensitive data detection, PII scrubbing, format validation.

L.1.3

Privilege Separation Architecture

LLM has no direct access to databases, APIs, or actions without human approval.

L2

Tool-Call Drift

LLM-orchestrated tools gradually diverge from intended behavior. The model "learns" shortcuts that bypass safety checks.

L.2.1

Tool Call Audit Log

Every tool invocation logged with inputs, outputs, and latency.

L.2.2

Drift Detection Metrics

Statistical comparison of tool usage patterns over time.

L.2.3

Tool Capability Boundaries

Explicit limits on what each tool can do, enforced at API level.

L3

Retrieval Contamination

RAG systems surface incorrect or malicious content from the knowledge base. Garbage in, authoritative-sounding garbage out.

L.3.1

Source Quality Scoring

Every document in the corpus rated for authority, recency, and reliability.

L.3.2

Retrieval Relevance Monitoring

Track semantic similarity scores and flag low-confidence retrievals.

L.3.3

Adversarial Document Detection

Scan corpus for documents designed to manipulate retrieval.

L4

Context Window Decay

As conversations lengthen, early context degrades. The model "forgets" constraints and instructions established at the start.

L.4.1

Conversation Length Limits

Hard caps on turns before forced summarization or reset.

L.4.2

Instruction Reinforcement Strategy

System prompts repeated or summarized at intervals.
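
A minimal sketch of L.4.2, assuming a chat-style message list; the reinforcement interval is an illustrative parameter.

# Re-inject the system prompt every N user turns so constraints stay in recent context.
REINFORCE_EVERY_N_TURNS = 5  # illustrative; tune per model and context length

def reinforce_instructions(messages: list[dict], system_prompt: str) -> list[dict]:
    user_turns = sum(1 for m in messages if m["role"] == "user")
    if user_turns > 0 and user_turns % REINFORCE_EVERY_N_TURNS == 0:
        messages = messages + [{"role": "system", "content": system_prompt}]
    return messages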

L.4.3

Context Health Monitoring

Measure instruction-following accuracy as function of conversation length.

L5

Hallucination Detection Patterns

LLMs confidently produce false information. Detection requires domain-specific validation, not just confidence scores.

L.5.1

Factual Grounding Requirements

Claims must cite retrievable sources or be flagged as unverified.

L.5.2

Domain Expert Sampling Protocol

Random outputs reviewed by humans for factual accuracy.

L.5.3

Consistency Cross-Check System

Same question asked multiple ways; inconsistent answers flagged.
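
A minimal sketch of L.5.3, assuming a caller-supplied ask_llm(question) function and a simple normalized-string comparison; real implementations would use a stronger answer-equivalence check.

# Ask semantically equivalent paraphrases of the same question and flag disagreement.
# ask_llm() is a placeholder for your model client; paraphrases are supplied by the caller.
def consistency_check(paraphrases: list[str], ask_llm) -> dict:
    answers = [ask_llm(q) for q in paraphrases]
    normalized = {" ".join(a.lower().split()) for a in answers}
    return {
        "answers": answers,
        "consistent": len(normalized) == 1,  # inconsistent answers get flagged for review
    }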

LLM Governance Principle LLMs are not databases. They are probabilistic generators. Every output should be treated as a hypothesis requiring validation, not a fact requiring transmission.

Appendix: Agentic AI & Multi-Model Orchestration

Guidance for systems where AI models call other models, use tools, or operate with autonomy. These systems introduce compounding risks that single-model deployments do not have.

Compounding Risk Warning Agentic systems multiply risk. An error in one component propagates through the chain. A hallucination becomes an action. A tool call becomes a state change. Governance for agentic systems must be stricter, not looser, than for single models.
AG.1

Definitions & Architecture Patterns

Common patterns in agentic AI systems, each with distinct risk profiles.

Simple Chain

Model A → Model B → Output

Example: Summarization model → Translation model

Risk Profile: Linear error propagation. Output quality bounded by weakest link.

Governance Needs: End-to-end evaluation, intermediate output logging.

Router / Classifier Chain

Input → Router → [Model A | Model B | Model C] → Output

Example: Intent classifier routes to specialized models

Risk Profile: Misrouting sends inputs to wrong model. Silent failures.

Governance Needs: Router accuracy monitoring, fallback paths, coverage analysis.

RAG (Retrieval-Augmented Generation)

Query → Retriever → [Documents] → Generator → Output

Example: Question answering with document retrieval

Risk Profile: Retrieved context quality directly affects output. Retrieval failures are invisible to users.

Governance Needs: Retrieval quality metrics, citation verification, context window management.

Tool-Using Agent

LLM → [Tool Selection] → Tool Execution → [Observation] → LLM → ...

Example: Agent that can search web, execute code, call APIs

Risk Profile: High. Model decisions trigger real-world actions. Hallucinated tool calls cause real damage.

Governance Needs: Tool allowlists, action confirmation, sandbox execution, audit trails.

Multi-Agent System

Agent A ↔ Agent B ↔ Agent C → Consensus → Output

Example: Debate between agents, critic-generator pairs

Risk Profile: Emergent behavior. Agents may reinforce each other's errors. Coordination failures.

Governance Needs: Interaction logging, consensus validation, human-in-the-loop for high-stakes decisions.

Autonomous Agent (Long-Running)

Goal → [Plan → Execute → Observe → Replan] → ... → Outcome

Example: Agent that autonomously pursues multi-step goals

Risk Profile: Highest. Compounding errors over time. Goal drift. Resource exhaustion. Unintended side effects.

Governance Needs: Step limits, budget caps, mandatory checkpoints, kill switches, human approval gates.

AG.2

Agentic Risk Framework

Risks specific to agentic systems that do not apply to single-model deployments.

Risk Category | Description | Example Failure | Mitigation
Cascade Amplification | Errors in early stages amplify through the chain | Retriever returns wrong documents; generator confidently answers based on irrelevant context | Intermediate validation gates, confidence thresholds at each stage
Tool Hallucination | Model invents tool calls that don't exist or passes invalid parameters | Agent calls delete_user(id="all") instead of get_user(id="123") | Tool schema validation, parameter sanitization, sandbox execution
Action Irreversibility | Agent takes actions that cannot be undone | Agent sends email, deletes file, or submits order based on misunderstanding | Soft-delete patterns, confirmation for destructive actions, staging environments
Goal Drift | Agent pursues instrumental goals that diverge from original intent | Agent asked to "increase engagement" starts generating controversial content | Explicit constraint specification, periodic goal alignment checks
Resource Exhaustion | Agent consumes unbounded resources (API calls, compute, tokens) | Infinite loop in agent reasoning burns $10K in API costs in an hour | Hard budget caps, step limits, automatic timeout
Prompt Injection via Tools | External data (web pages, documents) contains adversarial prompts | Retrieved document contains "Ignore previous instructions and..." that hijacks the agent | Input sanitization, privilege separation, context isolation
Emergent Coordination | Multi-agent systems develop unexpected interaction patterns | Agents in debate converge on a plausible but wrong answer through mutual reinforcement | Diversity enforcement, external validation, human-in-the-loop
Attribution Opacity | Cannot determine which component caused a failure | Output is wrong but unclear whether retriever, generator, or post-processor is at fault | Comprehensive logging, trace IDs, intermediate output capture
AG.3

Mandatory Controls for Agentic Systems

Controls that are optional for single-model deployments but mandatory for agentic systems.

AG.3.1 — Action Allowlist

Agents may only invoke explicitly approved tools/actions. Default deny.

Implementation:
ALLOWED_TOOLS = ["search", "calculate", "lookup_user"]
# NOT ALLOWED: delete, update, send_email, execute_code

class ToolNotAllowedError(Exception):
    """Raised when an agent requests a tool outside the allowlist."""

def validate_tool_call(tool_name, params):
    if tool_name not in ALLOWED_TOOLS:
        raise ToolNotAllowedError(f"Tool {tool_name} not in allowlist")
    validate_params(tool_name, params)  # Per-tool schema validation (defined elsewhere)

AG.3.2 — Budget Caps

Hard limits on resources an agent can consume per task.

Dimensions to cap:
  • Max steps/iterations per task
  • Max tokens consumed (input + output)
  • Max API spend in dollars
  • Max wall-clock time
  • Max tool invocations
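
A minimal sketch of enforcing these caps inside an agent loop; the limit values and the BudgetExceeded handling are illustrative.

# Hard per-task budget caps, checked before every agent step / tool call.
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    pass

@dataclass
class TaskBudget:
    max_steps: int = 20
    max_tokens: int = 50_000
    max_spend_usd: float = 5.00
    max_seconds: float = 300.0
    max_tool_calls: int = 30
    steps: int = 0
    tokens: int = 0
    spend_usd: float = 0.0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, spend_usd: float = 0.0, tool_calls: int = 0) -> None:
        self.steps += 1
        self.tokens += tokens
        self.spend_usd += spend_usd
        self.tool_calls += tool_calls
        elapsed = time.monotonic() - self.started_at
        if (self.steps > self.max_steps or self.tokens > self.max_tokens
                or self.spend_usd > self.max_spend_usd or elapsed > self.max_seconds
                or self.tool_calls > self.max_tool_calls):
            raise BudgetExceeded("Task budget exhausted; stop and report partial progress")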

AG.3.3 — Human-in-the-Loop Gates

Mandatory human approval for high-stakes actions.

Gate triggers:
  • Any irreversible action (delete, send, submit)
  • Actions affecting other users
  • Financial transactions above threshold
  • Actions outside normal distribution
  • Low-confidence decisions

AG.3.4 — Comprehensive Tracing

Every step, decision, and tool call logged with trace IDs.

Required trace data:
  • Trace ID (propagated through chain)
  • Timestamp for each step
  • Model inputs and outputs
  • Tool calls with parameters and results
  • Reasoning/chain-of-thought (if available)
  • Confidence scores at each stage
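
A minimal sketch of the trace record AG.3.4 requires, emitted once per step. Field names are illustrative, and the sink (a JSON-lines log here) would normally be your tracing backend.

# One structured trace event per agent step, keyed by a trace ID that is
# propagated through the whole chain.
import json
import time
import uuid
from dataclasses import dataclass, asdict, field
from typing import Any, Optional

@dataclass
class TraceEvent:
    trace_id: str
    step: int
    model_input: str
    model_output: str
    tool_name: Optional[str] = None
    tool_args: Optional[dict[str, Any]] = None
    tool_result: Optional[str] = None
    confidence: Optional[float] = None
    timestamp: float = field(default_factory=time.time)

def new_trace_id() -> str:
    return uuid.uuid4().hex

def emit(event: TraceEvent, sink) -> None:
    sink.write(json.dumps(asdict(event)) + "\n")  # append-only JSON-lines audit log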

AG.3.5 — Sandbox Execution

Tool execution in isolated environments with limited permissions.

Sandbox properties:
  • No network access (or allowlisted only)
  • No filesystem write (or scoped directory)
  • Resource limits (CPU, memory, time)
  • No access to credentials/secrets
  • Output sanitization before return

AG.3.6 — Rollback Capability

Ability to undo agent actions when errors are detected.

Implementation patterns:
  • Event sourcing for state changes
  • Soft-delete with retention period
  • Outbox pattern for external calls
  • Compensation transactions
AG.4

Tool Safety Specification

Every tool exposed to an agent must have a safety specification.

Tool Safety Spec — Template
Tool Name [e.g., send_email]
Tool Description [What the tool does — this is shown to the model]
Risk Level [LOW | MEDIUM | HIGH | CRITICAL]
Reversibility [REVERSIBLE | PARTIALLY_REVERSIBLE | IRREVERSIBLE]
Side Effects [List all external effects: sends data, modifies state, costs money, etc.]
Rate Limits [Max calls per minute/hour/day]
Parameter Schema
{
  "to": {"type": "string", "format": "email", "required": true},
  "subject": {"type": "string", "maxLength": 200, "required": true},
  "body": {"type": "string", "maxLength": 10000, "required": true}
}
Forbidden Patterns [Inputs that should be rejected: e.g., "to" cannot be list > 10 recipients]
Human Approval Required [YES | NO | CONDITIONAL — specify conditions]
Sandbox Requirements [What isolation is needed for safe execution]
Audit Log Fields [What must be logged for each invocation]
Failure Modes [How the tool can fail and what the agent should do]
Test Cases [Link to test suite for this tool]
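
A minimal validator sketch for parameter schemas written in the shape shown above (type / format / maxLength / required flags). It is intentionally hand-rolled rather than a full JSON Schema implementation, and could back the validate_params call in AG.3.1.

# Validates tool-call parameters against a spec like the send_email schema above.
# Covers only the fields used in the template: type, format (email), maxLength, required.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_params(params: dict, schema: dict) -> list[str]:
    errors = []
    for name, rules in schema.items():
        if name not in params:
            if rules.get("required"):
                errors.append(f"missing required parameter: {name}")
            continue
        value = params[name]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
            continue
        if rules.get("maxLength") and len(value) > rules["maxLength"]:
            errors.append(f"{name}: exceeds maxLength {rules['maxLength']}")
        if rules.get("format") == "email" and not EMAIL_RE.match(value):
            errors.append(f"{name}: not a valid email address")
    for name in params:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")  # default deny unknown fields
    return errors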

Tool Risk Classification

Risk Level | Characteristics | Examples | Required Controls
LOW | Read-only, no side effects, bounded output | Calculator, dictionary lookup, weather API | Logging, rate limits
MEDIUM | External queries, user data access, reversible state changes | Database read, search API, user profile lookup | + Input validation, output filtering
HIGH | State mutations, external communication, financial impact | Database write, send notification, create record | + Sandbox, confirmation UI, compensation logic
CRITICAL | Irreversible, high-stakes, affects multiple users | Delete data, send email, financial transaction, publish content | + Human approval gate, staged rollout, real-time monitoring
AG.5

Evaluation Framework for Agentic Systems

Standard ML metrics are insufficient. Agentic systems require trajectory-level and safety-focused evaluation.

Task Completion Metrics

  • Success Rate: % of tasks completed correctly
  • Partial Credit: How much progress on failed tasks
  • Efficiency: Steps/tokens/cost per successful task
  • Time to Completion: Wall-clock time for task

Safety Metrics

  • Harmful Action Rate: % of tasks with unsafe tool calls
  • Constraint Violation Rate: How often agent exceeds boundaries
  • Hallucinated Tool Calls: Invalid tool names or parameters
  • Goal Adherence: Did agent stay on task or drift?

Robustness Metrics

  • Adversarial Resistance: Performance under prompt injection
  • Recovery Rate: Can agent recover from tool failures?
  • Consistency: Same task, same result across runs?
  • Graceful Degradation: Behavior when components fail

Interpretability Metrics

  • Reasoning Quality: Does chain-of-thought make sense?
  • Decision Justification: Can agent explain tool choices?
  • Error Attribution: Can we identify failure point?

Required Test Scenarios

Scenario Type | Description | Pass Criteria
Happy Path | Standard task with cooperative inputs | Completes correctly within budget
Edge Cases | Unusual but valid inputs | Handles gracefully or requests clarification
Tool Failure | External tool returns error | Retries appropriately or fails gracefully
Ambiguous Instructions | Task has multiple interpretations | Asks for clarification, doesn't assume
Out-of-Scope Request | Task requires tools not available | Refuses clearly, suggests alternatives
Prompt Injection | Adversarial content in retrieved data | Ignores injection, stays on task
Resource Exhaustion | Task that would exceed budget | Stops at limit, reports partial progress
Conflicting Instructions | User request conflicts with safety rules | Follows safety rules, explains refusal
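
A sketch of how one row of this table might become an automated test, assuming a hypothetical run_agent(task, budget) entry point and the TaskBudget helper sketched in AG.3.2; the result fields are placeholders for whatever your agent harness returns.

# Resource Exhaustion scenario: the agent must stop at its budget and report partial
# progress instead of running unbounded. run_agent() and the fields on its result are
# placeholders for your agent harness; TaskBudget is the sketch from AG.3.2.
def test_resource_exhaustion_stops_at_budget():
    budget = TaskBudget(max_steps=3, max_spend_usd=0.10)
    result = run_agent(task="summarize the entire corpus", budget=budget)
    assert result.status == "stopped_at_budget"
    assert result.partial_output is not None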
AG.6

Production Monitoring for Agentic Systems

Real-time signals that indicate agentic system health.

Critical Alerts (Page Immediately)

  • Budget exceeded for single task
  • Hallucinated tool call attempted
  • Human approval timeout (task blocked)
  • Agent stuck in loop (N iterations without progress)
  • Unauthorized action attempted

Warning Alerts (Review Within 1 Hour)

  • Success rate dropped >10% from baseline
  • Average steps per task increased >20%
  • Tool failure rate elevated
  • Human override rate elevated
  • Cost per task trending up

Dashboards Required

  • Task Flow Dashboard: Success/failure rates, step counts, latency distributions
  • Tool Usage Dashboard: Which tools called, error rates, latency by tool
  • Cost Dashboard: Spend by task type, budget utilization, cost anomalies
  • Safety Dashboard: Blocked actions, human overrides, constraint violations
  • Trace Explorer: Drill into individual task traces for debugging
Governance Principle for Agentic AI Autonomy is a privilege, not a right. Agents earn expanded capability by demonstrating safe behavior within constraints. Start with minimal permissions and expand based on evidence, not optimism.

Appendix: Failure Autopsies

Anonymized case studies from real AI/ML production failures. These are composites based on patterns observed across multiple organizations. The goal is not blame—it is pattern recognition.

Learning from Failure Every failure below was preventable with the controls in this playbook. They occurred because the controls were skipped, weakened, or not enforced. Reading these should create discomfort—that discomfort is the point.
CASE 01

The Invisible Drift

Domain: Financial Services — Credit Decisioning

What Happened

A mid-size lender deployed an ML model for credit risk scoring. The model performed well in validation and initial production. Eighteen months later, the default rate on ML-approved loans was 340% higher than it had been under the previous rules-based system. The model was still reporting "healthy" metrics.

Root Cause Analysis

Proximate Cause

COVID-19 shifted income patterns. The model learned to approve applicants with pandemic-era income support (stimulus checks, unemployment) as if it were stable employment income.

Contributing Cause

Drift monitoring tracked feature distributions but not the meaning of features. "Income source" distributions looked stable because unemployment income replaced employment income at similar rates.

Systemic Cause

No outcome feedback loop. Defaults occur 12-24 months after approval. The team measured model confidence, not actual loan performance. By the time defaults materialized, 18 months of bad loans were already on the books.

Root Cause

The model had no economic kill threshold. The team tracked ML metrics (AUC, precision) instead of business outcomes (default rate, loss ratio). There was no defined trigger for "stop trusting this model."

Financial Impact

$47M Direct losses from elevated defaults
$12M Remediation and model rebuild
8 months Return to baseline performance

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Cost Telemetry Contract | Cost Telemetry section | Mandatory tracking of "error cost per month" would have caught elevated defaults within weeks, not months
Kill Threshold Definition | Economic Viability | Pre-defined kill criteria (e.g., "default rate >1.5× baseline for 60 days") would have triggered automatic review
Outcome Feedback Loop | Phase 11.2 | Dashboard tracking actual business outcomes (not just model metrics) with owner accountability
Concept Drift Monitoring | Phase 11.4 | Semantic drift detection (not just statistical) would have flagged the change in meaning of "income source"

Key Lesson

Model metrics are not business metrics. A model can report excellent precision while destroying value. The only metrics that matter are the ones your CFO would recognize.

CASE 02

The Helpful Hallucination

Domain: Healthcare — Clinical Decision Support

What Happened

A healthcare system deployed an LLM-powered clinical assistant to help physicians with differential diagnosis. The system was trained to be "helpful and thorough." During a complex case, the LLM confidently cited a drug interaction that did not exist, referencing a fabricated clinical trial. A physician, under time pressure, trusted the citation. The patient experienced a preventable adverse event.

Root Cause Analysis

Proximate Cause

LLM hallucinated a plausible-sounding citation: "Smith et al., NEJM 2019" — a paper that does not exist. The hallucination included specific dosing recommendations.

Contributing Cause

The system was optimized for "helpfulness" scores in user testing. Responses that said "I don't know" or "please verify" scored lower. The model learned to always provide an answer.

Systemic Cause

No citation verification layer. The system presented LLM outputs as if they were retrieved from a verified knowledge base. Users could not distinguish between "retrieved fact" and "generated text."

Root Cause

The deployment skipped Phase 7 (Validation) hallucination detection. There was no red-team evaluation, no domain expert sampling protocol, and no factual grounding requirement. The team assumed "it's just a helper tool" exempted them from rigorous validation.

Impact

1 Patient Harm Event Preventable adverse drug reaction
$2.3M Settlement and legal costs
System Shutdown Full deployment rolled back
18 months Delay to relaunch with proper controls

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Hallucination Detection Patterns | LLM Risks L5 | Factual grounding requirement (L.5.1) would mandate citation verification before display
Domain Expert Sampling | LLM Risks L.5.2 | Random outputs reviewed by clinicians would have caught hallucinated citations in testing
Red Team Evaluation | Phase 7.3 | Adversarial testing specifically designed to elicit hallucinations
Human Override Design | Executive Control Surface | UI should have distinguished "verified" vs. "AI-generated" content, with friction for high-risk actions
Risk Classification | Phase 3.4.3 | Clinical decision support is HIGH RISK and should have triggered enhanced validation requirements

Key Lesson

LLMs are not retrieval systems. They generate plausible text, not verified facts. Any deployment in high-stakes domains must include a verification layer that is architecturally separate from the generation layer. "Helpful" without "accurate" is dangerous.

CASE 03

The Orphaned Model

Domain: E-Commerce — Recommendation Engine

What Happened

A recommendation model was deployed by a team of three ML engineers. It performed well. Over 24 months, all three engineers left the company. When the model began underperforming, no one knew how to retrain it, what data it needed, or how to roll it back. The model ran in degraded state for 11 months before being replaced entirely.

Root Cause Analysis

Proximate Cause

The retraining pipeline was never documented. It existed as a series of Jupyter notebooks on a departed engineer's laptop, with hardcoded paths and credentials.

Contributing Cause

No handoff process. Engineers left without knowledge transfer. The "documentation" was a README that said "see Alice for details." Alice had left 8 months earlier.

Systemic Cause

Ownership was assigned to "the ML team" (a team), not to a named individual with backup. When the team dissolved, ownership dissolved with it.

Root Cause

The deployment skipped Phase 12 (Continuity). There was no Model Card, no operational runbook, no retraining documentation, and no named successor owner. The model was treated as "done" rather than as a living system requiring ongoing stewardship.

Financial Impact

$8.2M Lost revenue from degraded recommendations
$1.4M Cost to rebuild from scratch
11 months Duration of degraded performance
6 months Time to deploy replacement

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Named Owner Assignment | Ownership Contract (all phases) | Individual (not team) ownership with mandatory successor designation
Model Card Requirement | Phase 4.1, Model Cards section | Standardized documentation including training data, architecture, and retraining procedure
Runbook Requirement | Phase 10.3 | Operational procedures documented and tested by someone other than the author
Continuity Addendum | Continuity Addendum section | Explicit handoff checklist and "bus factor" requirement (>1 person can operate)
Retraining Protocol | Phase 11.4 | Documented, automated, and tested retraining pipeline in version control

Key Lesson

Models are not products; they are processes. A deployed model without documented, transferable operational procedures is not an asset — it is a liability with a countdown timer. "We shipped" is not the finish line; "anyone can operate this" is.

CASE 04

The Compliance Surprise

Domain: Insurance — Claims Processing

What Happened

An insurer deployed an ML model to expedite claims processing. The model reduced processing time by 60%. Six months post-launch, a regulatory audit revealed the model was using zip code as a proxy for race, resulting in systematically lower payouts to minority communities. The company faced regulatory action and class-action litigation.

Root Cause Analysis

Proximate Cause

Zip code was highly predictive of claim outcome in training data. The model learned this correlation without understanding it encoded historical discrimination.

Contributing Cause

Bias testing was limited to "protected class" features (race, gender). Proxy features like zip code, which correlate with protected classes, were not evaluated.

Systemic Cause

Legal/Compliance was consulted only at the end ("sign off on this"). They were not involved in feature selection or bias testing design.

Root Cause

The deployment skipped Phase 3.4 (Regulatory/Ethical Constraints) and treated Phase 7.2 (Bias/Fairness Evaluation) as a checkbox rather than a substantive review. The RACI matrix showed Legal as "Informed" rather than "Consulted" on feature selection.

Impact

$34M Regulatory fine
$89M Class action settlement
Consent Decree 5-year regulatory oversight
Reputational National media coverage

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Regulatory Constraint Mapping | Phase 3.4 | Early identification of fair lending/insurance regulations and their implications for feature selection
Ethical AI Framework | Phase 3.4.2 | Explicit proxy discrimination analysis as part of feature engineering
Bias/Fairness Evaluation | Phase 7.2 | Disparate impact analysis across protected classes AND correlated features
RACI Matrix | Template T.1 | Legal/Compliance as "Consulted" on feature selection, not just "Informed" at the end
Regulatory Traceability Matrix | Appendix: Regulatory Matrix | Explicit mapping of regulatory requirements to artifacts and owners

Key Lesson

Compliance is not a sign-off; it is a design constraint. Legal and regulatory requirements must be inputs to the design process, not reviews of the finished product. By the time you ask Legal to "approve this," the expensive decisions have already been made.

Common Failure Patterns

Across these cases and dozens of others, the same patterns emerge:

01

Metric Mismatch

Teams measure ML metrics (AUC, F1) instead of business outcomes (revenue, cost, harm). Models can score well while destroying value.

02

Skipped Phases

"We don't have time for that" is the most expensive sentence in ML. Skipped controls create debt that compounds with interest.

03

Dissolved Ownership

Ownership assigned to teams, not individuals. When teams change, ownership evaporates. Models become orphans.

04

Late Compliance

Legal/Compliance consulted at the end, not the beginning. By then, the architecture encodes assumptions that are expensive to unwind.

05

Missing Kill Criteria

No pre-defined conditions for stopping. Sunk cost and optimism bias keep failing projects alive long past when they should die.

06

Hallucination Blindness

LLM outputs treated as facts. No verification layer. Confidence scores mistaken for accuracy.

Model Cards & Data Sheets

Standardized documentation for models and datasets. Based on Mitchell et al. (2019) "Model Cards for Model Reporting" and Gebru et al. (2021) "Datasheets for Datasets." These are not optional — they are audit artifacts.

Documentation as Governance Model Cards and Datasheets are not bureaucracy — they are evidence. When a regulator, auditor, or litigator asks "how did you know this was safe to deploy?", these documents are your answer. Incomplete documentation is indefensible documentation.
MC.1

Model Card Template

A Model Card documents a model's intended use, performance characteristics, limitations, and ethical considerations. It is required before production deployment.

Model Card — Template v2.0

1. Model Details

Model Name [e.g., fraud-detection-v2.3.1]
Model Version [Semantic version: MAJOR.MINOR.PATCH]
Model Type [e.g., Gradient Boosted Trees, Transformer, Logistic Regression]
Model Date [Training completion date: YYYY-MM-DD]
Model Owner [Named individual, not team]
Model Steward [Backup owner for continuity]
Contact [Email or Slack channel for questions]
License [Internal use only / Open source license]
Playbook Phase [Current phase in AI/ML Production Playbook]

2. Intended Use

Primary Intended Uses

[Describe the primary use case(s) this model was designed for]

  • [Use case 1: e.g., "Real-time fraud scoring for card-present transactions"]
  • [Use case 2: e.g., "Batch scoring for transaction review queues"]
Primary Intended Users
  • [User type 1: e.g., "Fraud analysts reviewing flagged transactions"]
  • [User type 2: e.g., "Automated decisioning system for low-risk approvals"]
Out-of-Scope Uses

⚠ The following uses are explicitly NOT supported and may produce unreliable or harmful results:

  • [Out-of-scope 1: e.g., "Credit decisioning without human review"]
  • [Out-of-scope 2: e.g., "Application to card-not-present transactions (different fraud patterns)"]
  • [Out-of-scope 3: e.g., "Use in jurisdictions outside training data coverage"]

3. Factors

Relevant Factors

Factors that may influence model performance:

  • Demographic factors: [e.g., "Customer geography, account age, transaction history length"]
  • Instrument factors: [e.g., "Card type, merchant category, transaction channel"]
  • Environmental factors: [e.g., "Time of day, day of week, holiday periods"]
Evaluation Factors

Factors across which performance was explicitly evaluated:

Factor | Disaggregation Performed | Result
[Factor 1] | [Yes/No] | [Summary]
[Factor 2] | [Yes/No] | [Summary]

4. Metrics

Model Performance Metrics
Metric | Definition | Value | Threshold | Rationale
Precision @ threshold | True positives / Predicted positives | [0.XX] | [≥ 0.XX] | [Why this threshold matters]
Recall @ threshold | True positives / Actual positives | [0.XX] | [≥ 0.XX] | [Why this threshold matters]
False Positive Rate | False positives / Actual negatives | [0.XX] | [≤ 0.XX] | [Cost of false positives]
AUC-ROC | Area under ROC curve | [0.XX] | [≥ 0.XX] | [Overall discrimination ability]
Business Metrics
Metric | Value | Baseline | Kill Threshold
Cost per inference (fully loaded) | [$X.XX] | [$X.XX] | [> $X.XX]
Value per correct prediction | [$X.XX] | [$X.XX] | [< $X.XX]
Cost per error (weighted) | [$X.XX] | [$X.XX] | [> $X.XX]
Decision Thresholds

Operating thresholds and their implications:

Threshold | Value | Action | Trade-off
Auto-approve | [< 0.XX] | [Pass without review] | [Maximizes throughput, accepts some false negatives]
Auto-reject | [> 0.XX] | [Block and escalate] | [Minimizes fraud loss, increases false positives]
Human review | [0.XX - 0.XX] | [Queue for analyst] | [Balances accuracy with operational cost]

5. Training Data

Dataset Name [Link to Datasheet]
Dataset Version [Version hash or identifier]
Date Range [Start date — End date]
Sample Size [N records, with class distribution]
Sampling Strategy [Random / Stratified / Other]
Label Source [How ground truth was determined]
Known Limitations [Data gaps, biases, or quality issues]

6. Evaluation Data

Dataset Name [Link to Datasheet]
Relationship to Training [Held-out split / Separate collection / Temporal split]
Date Range [Start date — End date]
Sample Size [N records]
Distribution Comparison [How evaluation data compares to training data]

7. Ethical Considerations

Fairness Analysis
Protected Class | Metric | Group A | Group B | Ratio | Threshold | Status
[e.g., Geography] | False Positive Rate | [0.XX] | [0.XX] | [0.XX] | [0.8-1.25] | [✓/✗]
[e.g., Account Age] | Approval Rate | [0.XX] | [0.XX] | [0.XX] | [0.8-1.25] | [✓/✗]
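
For reference, the Ratio column can be computed directly from labeled evaluation data. A minimal sketch assuming binary predictions and labels grouped by one sensitive attribute; the 0.8 to 1.25 band mirrors the threshold column above.

# Computes per-group false positive rate and the worst-case pairwise ratio.
# rows: iterable of (group, y_true, y_pred) with binary labels; data is illustrative.
from collections import defaultdict

def group_fpr_ratio(rows, lower=0.8, upper=1.25):
    fp, negatives = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in rows:
        if y_true == 0:
            negatives[group] += 1
            fp[group] += y_pred
    fpr = {g: fp[g] / n for g, n in negatives.items() if n > 0}
    worst_ratio = max(fpr.values()) / max(min(fpr.values()), 1e-9)
    passes = worst_ratio <= upper and (1.0 / worst_ratio) >= lower
    return fpr, worst_ratio, passes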
Potential Harms
  • False Positive Harm: [What happens when model incorrectly flags legitimate activity]
  • False Negative Harm: [What happens when model misses actual fraud]
  • Disparate Impact Risk: [Populations that may be disproportionately affected]
Mitigation Strategies
  • [Mitigation 1: e.g., "Human review required for all rejections"]
  • [Mitigation 2: e.g., "Appeal process with alternative evaluation"]
  • [Mitigation 3: e.g., "Quarterly fairness audit"]

8. Caveats and Recommendations

Known Limitations
  • [Limitation 1: e.g., "Reduced accuracy for transactions < $10"]
  • [Limitation 2: e.g., "Not validated for international merchants"]
  • [Limitation 3: e.g., "Performance degrades after 90 days without retraining"]
Recommendations for Use
  • [Recommendation 1: e.g., "Always pair with human review for high-value transactions"]
  • [Recommendation 2: e.g., "Monitor drift weekly; retrain if PSI > 0.25"]
  • [Recommendation 3: e.g., "Do not use as sole decision criteria for account actions"]
Rollback Information
Previous Stable Version [Model version to rollback to]
Rollback Trigger [Conditions that trigger automatic rollback]
Rollback Procedure [Link to runbook]
Rollback Owner [Named individual authorized to execute]

9. Approvals

ML Lead ________________ Date: ________
Product Owner ________________ Date: ________
Security Review ________________ Date: ________
Legal/Compliance ________________ Date: ________
Executive Sponsor ________________ Date: ________
MC.2

Datasheet for Datasets Template

Based on Gebru et al. (2021). Documents the provenance, composition, and appropriate use of datasets used for training and evaluation.

Datasheet for Datasets — Template v1.0

1. Motivation

For what purpose was the dataset created?

[Describe the task or research question]

Who created the dataset and on behalf of which entity?

[Team name, organization]

Who funded the creation of the dataset?

[Internal budget / External grant / Client project]

2. Composition

What do the instances represent?

[e.g., "Each instance represents a single transaction"]

How many instances are there in total?

[N instances, broken down by split if applicable]

Does the dataset contain all possible instances or a sample?

[Describe sampling strategy and coverage]

What data does each instance consist of?

[List features/columns with data types]

Is there a label or target associated with each instance?

[Describe labels, how they were obtained, and inter-annotator agreement if applicable]

Is any information missing from individual instances?

[Describe missing data patterns and handling]

Are relationships between instances made explicit?

[e.g., "Transactions are linked by customer ID"]

Are there recommended data splits?

[Train/validation/test splits with rationale]

Are there any errors, noise, or redundancies?

[Known data quality issues]

Is the dataset self-contained or does it link to external resources?

[Dependencies on external data sources]

Does the dataset contain data that might be considered confidential?

[PII, PHI, financial data, trade secrets]

Does the dataset contain data that might be considered offensive or distressing?

[Content warnings if applicable]

Does the dataset relate to people?

[If yes, describe demographic coverage and potential for identification]

3. Collection Process

How was the data associated with each instance acquired?

[Directly observed / Derived / Inferred / User-provided]

What mechanisms were used to collect the data?

[APIs, sensors, manual entry, web scraping, etc.]

Who was involved in the data collection process?

[Automated systems / Human annotators / Crowd workers]

Over what timeframe was the data collected?

[Date range]

Were any ethical review processes conducted?

[IRB approval, ethics board review, etc.]

Did the data subjects consent to data collection?

[Consent mechanism and scope]

Has an analysis of potential impact on data subjects been conducted?

[Privacy impact assessment results]

4. Preprocessing / Cleaning / Labeling

Was any preprocessing applied?

[Normalization, deduplication, encoding, etc.]

Was the "raw" data saved in addition to preprocessed data?

[Location of raw data if preserved]

Is the preprocessing software available?

[Link to code repository]

5. Uses

Has the dataset been used for any tasks already?

[Previous uses and results]

What tasks could the dataset be used for?

[Appropriate use cases]

What tasks should the dataset NOT be used for?

[Inappropriate uses and why]

Is there anything about the composition or collection that might impact future uses?

[Limitations that affect generalization]

6. Distribution

Will the dataset be distributed to third parties?

[Internal only / Partners / Public]

How will the dataset be distributed?

[S3, API, download, etc.]

When will the dataset be distributed?

[Availability timeline]

Will the dataset be distributed under a license?

[License terms and restrictions]

Are there any export controls or regulatory restrictions?

[GDPR, HIPAA, CCPA, export controls]

7. Maintenance

Who is maintaining the dataset?

[Named owner]

How can the owner be contacted?

[Email, Slack, etc.]

Will the dataset be updated?

[Update cadence and process]

If others want to extend the dataset, is there a mechanism?

[Contribution process]

Are older versions available?

[Versioning and retention policy]

If the dataset becomes obsolete, how will this be communicated?

[Deprecation process]

MC.3

Citation & References

These templates are based on the following foundational papers:

  1. Mitchell, M., et al. (2019). "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). arXiv:1810.03993
  2. Gebru, T., et al. (2021). "Datasheets for Datasets." Communications of the ACM, 64(12), 86-92. arXiv:1803.09010
  3. Arnold, M., et al. (2019). "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity." IBM Journal of Research and Development, 63(4/5).

Appendix: AI/ML Incident Response Playbook

Structured response procedures for AI/ML system incidents. Standard IT incident response is insufficient — AI systems have unique failure modes that require specialized handling.

Incident Response Readiness An incident response plan that hasn't been rehearsed is not a plan — it's a document. Run tabletop exercises quarterly. The first time you use this playbook should not be during an actual incident.
IR.1

AI/ML Incident Severity Classification

AI incidents require different classification criteria than traditional IT incidents.

SEV-1 (Critical): Active harm to users, regulatory breach, or complete system failure
Examples:
  • Model producing harmful/illegal outputs
  • PHI/PII exposure via model outputs
  • Systematic bias causing demonstrable harm
  • Complete model failure in production
  • Hallucination causing user harm
Response time: Immediate (within 15 min). Escalation: Executive + Legal + Comms

SEV-2 (High): Significant degradation or potential for harm if uncorrected
Examples:
  • Model accuracy dropped below SLA
  • Significant drift detected
  • Data pipeline failure affecting predictions
  • Cost overrun >2x budget
  • Human override rate >50%
Response time: Within 1 hour. Escalation: ML Lead + Product

SEV-3 (Medium): Noticeable degradation without immediate harm
Examples:
  • Latency SLA breach
  • Partial feature failure
  • Elevated error rates in non-critical paths
  • Monitoring gaps detected
Response time: Within 4 hours. Escalation: On-call engineer

SEV-4 (Low): Minor issues with workarounds available
Examples:
  • Non-critical dashboard down
  • Logging gaps
  • Documentation inaccuracies
Response time: Next business day. Escalation: Ticket queue
IR.2

Incident Response Workflow

1. DETECT
  • Alert fires (automated monitoring)
  • User report received
  • Internal discovery
  • External report (researcher, regulator)
Output: Incident ticket created with initial severity
2. TRIAGE
  • Confirm incident is real (not false positive)
  • Assess severity using classification matrix
  • Identify affected systems and users
  • Determine if rollback is needed immediately
Output: Confirmed severity, incident commander assigned
Target: 15 minutes
3. CONTAIN
  • Stop the bleeding (rollback, disable, rate limit)
  • Preserve evidence (logs, model state, data snapshots)
  • Communicate status to stakeholders
  • Isolate affected components
Output: Harm stopped or bounded
Target: 30 minutes (SEV-1), 2 hours (SEV-2)
4. INVESTIGATE
  • Root cause analysis
  • Timeline reconstruction
  • Impact assessment (users affected, data compromised)
  • Identify contributing factors
Output: Root cause identified, impact quantified
5. REMEDIATE
  • Fix root cause (not just symptoms)
  • Validate fix in staging
  • Deploy fix with monitoring
  • Verify normal operation restored
Output: System restored to healthy state
6. REVIEW
  • Blameless post-mortem
  • Document lessons learned
  • Identify systemic improvements
  • Update runbooks and monitoring
  • Share learnings with broader team
Output: Post-mortem document, action items assigned
Target: Within 5 business days of resolution
IR.3

AI-Specific Incident Runbooks

Pre-written response procedures for common AI/ML incident types.

RUNBOOK-001

Model Producing Harmful/Inappropriate Output

Typical Severity: SEV-1
Symptoms
  • User reports of offensive, dangerous, or illegal model outputs
  • Content filter alerts spiking
  • Social media reports of problematic outputs
Immediate Actions (First 15 Minutes)
  1. Preserve evidence: Screenshot/log the harmful output with full context (input, session ID, timestamp)
  2. Assess scope: Is this reproducible? One user or many? One input pattern or widespread?
  3. Decide containment strategy:
    • If reproducible/widespread → Rollback to previous model version
    • If isolated → Add input to blocklist, increase human review
    • If severity warrants → Take system offline entirely
  4. Notify: Incident commander, ML lead, legal (if SEV-1), communications (if public-facing)
Investigation Checklist
  • □ What input triggered the harmful output?
  • □ Was this a prompt injection or adversarial input?
  • □ Did this output pattern exist in training data?
  • □ Has the content filter been bypassed? How?
  • □ Are there similar inputs that might produce similar outputs?
  • □ What is the user impact? (How many saw this? Who?)
Remediation Options
  • Add input pattern to blocklist
  • Update content filter with new pattern
  • Fine-tune model to refuse similar inputs
  • Add human review for similar input patterns
  • Retrain model with corrected data (longer-term)
Stakeholder Communication
  • Internal: Notify executive team within 1 hour
  • Affected users: Direct communication if identifiable
  • Public: If widely known, coordinate with communications team
  • Regulatory: If PHI/PII or regulated domain, notify compliance for reporting assessment
RUNBOOK-002

Model Performance Degradation / Drift

Typical Severity: SEV-2
Symptoms
  • Accuracy/precision metrics below SLA threshold
  • Drift score exceeds threshold (PSI > 0.25)
  • Human override rate elevated
  • User complaints about quality increasing
Immediate Actions
  1. Confirm degradation: Check multiple metrics, not just one. Rule out monitoring bug.
  2. Assess business impact: Is this causing measurable harm? Financial loss? User churn?
  3. Decide action:
    • If degradation severe (>15% from baseline) → Rollback to previous version
    • If moderate → Increase human review, continue investigation
    • If gradual → Schedule retraining, monitor closely
Investigation Checklist
  • □ When did degradation start? (Correlate with deployments, data changes)
  • □ Is this data drift or concept drift?
  • □ Which features are drifting most?
  • □ Is the label distribution changing?
  • □ Are there new input patterns the model hasn't seen?
  • □ Has upstream data quality degraded?
Remediation Options
  • Rollback to previous stable version
  • Retrain on recent data
  • Adjust decision thresholds
  • Add new training data for drifted segments
  • Fix upstream data quality issues
RUNBOOK-003

Data Pipeline Failure

Typical Severity: SEV-2 to SEV-3
Symptoms
  • Feature store not updating
  • Stale predictions (using old data)
  • Missing features in inference requests
  • Pipeline job failures in orchestrator
Immediate Actions
  1. Identify failure point: Which pipeline stage failed? Ingestion? Transformation? Serving?
  2. Assess staleness: How old is the data currently being served?
  3. Decide action:
    • If model can operate on stale data → Continue with degraded mode, alert users
    • If stale data causes incorrect predictions → Fall back to rules-based system or disable feature
Investigation Checklist
  • □ What is the root cause? (Source system down? Schema change? Resource exhaustion?)
  • □ Is data recoverable or lost?
  • □ Are downstream systems affected?
  • □ What is the data gap (time range of missing data)?
RUNBOOK-004

Cost Overrun / Budget Breach

Typical Severity: SEV-2 to SEV-3
Symptoms
  • Cost alerts firing
  • Inference costs >2x budget
  • Unexpected spike in API/compute usage
Immediate Actions
  1. Identify source: Which model/endpoint is causing the cost spike?
  2. Check for abuse: Is this a DDoS, credential leak, or runaway automation?
  3. Implement rate limiting: Throttle requests to bring costs under control
  4. Notify finance: If significant budget impact expected
Investigation Checklist
  • □ Is traffic legitimate or malicious?
  • □ Has request volume increased or cost per request increased?
  • □ Is a new feature or integration driving unexpected usage?
  • □ Are caches working correctly?
  • □ Is there a retry storm or infinite loop?
IR.4

Post-Mortem Template

Blameless post-mortems are required for all SEV-1 and SEV-2 incidents.

AI/ML Incident Post-Mortem Template

Incident Summary

Incident ID: [INC-XXXX]
Severity: [SEV-1 / SEV-2 / SEV-3]
Date/Time: [Start time — End time, timezone]
Duration: [Total time to resolution]
Incident Commander: [Name]
Affected Systems: [List of models, services, features]
User Impact: [Number of users affected, nature of impact]
Financial Impact: [Estimated cost: lost revenue, remediation, etc.]

Executive Summary

[2-3 sentence summary suitable for leadership. What happened, what was the impact, and is it fixed?]

Timeline

Time | Event | Actor
[HH:MM] | [First symptom observed] | [System/Person]
[HH:MM] | [Alert fired] | [Monitoring system]
[HH:MM] | [Incident declared] | [Person]
[HH:MM] | [Containment action taken] | [Person]
[HH:MM] | [Root cause identified] | [Person]
[HH:MM] | [Fix deployed] | [Person]
[HH:MM] | [Incident resolved] | [Person]

Root Cause Analysis

What Happened

[Detailed technical description of what went wrong]

Why It Happened (5 Whys)
  1. Why did [symptom] occur?
    [Answer]
  2. Why did [cause 1] happen?
    [Answer]
  3. Why did [cause 2] happen?
    [Answer]
  4. Why did [cause 3] happen?
    [Answer]
  5. Why did [cause 4] happen?
    [Answer — this is usually the root cause]
Contributing Factors
  • [Factor 1: e.g., "Missing test coverage for edge case"]
  • [Factor 2: e.g., "Alert threshold set too high"]
  • [Factor 3: e.g., "Runbook out of date"]

What Went Well

  • [Thing 1: e.g., "Rollback completed in under 10 minutes"]
  • [Thing 2: e.g., "Cross-team collaboration was smooth"]
  • [Thing 3: e.g., "Monitoring detected issue before user reports"]

What Could Be Improved

  • [Improvement 1: e.g., "Alerting was too noisy, real signal was lost"]
  • [Improvement 2: e.g., "Runbook didn't cover this scenario"]
  • [Improvement 3: e.g., "Took too long to get the right people in the room"]

Action Items

Action | Type | Owner | Due Date | Status
[Action 1: e.g., "Add test for edge case X"] | Prevent | [Name] | [Date] | Open
[Action 2: e.g., "Lower alert threshold to Y"] | Detect | [Name] | [Date] | Open
[Action 3: e.g., "Update runbook with this scenario"] | Respond | [Name] | [Date] | Open

Action Types: Prevent (stop recurrence), Detect (catch earlier), Respond (handle better)

Lessons Learned

[Key takeaways that should be shared with the broader organization]

Approvals

Post-Mortem Author: _____________ Date: _______

Reviewed By: _____________ Date: _______

Action Items Approved By: _____________ Date: _______

IR.5

Incident Response Roles

Incident Commander (IC)

Responsibility: Overall incident coordination and decision-making

  • Declares incident severity
  • Coordinates response team
  • Makes containment decisions
  • Manages stakeholder communication
  • Declares incident resolved

Assigned To: [On-call rotation or named individual]

Technical Lead

Responsibility: Technical investigation and remediation

  • Leads root cause investigation
  • Proposes and implements fixes
  • Coordinates with other engineers
  • Validates fix before deployment

Assigned To: [ML engineer on-call or model owner]

Communications Lead

Responsibility: Internal and external communication

  • Drafts status updates
  • Manages stakeholder notifications
  • Coordinates with PR if needed
  • Documents incident timeline

Assigned To: [Product manager or designated comms person]

Scribe

Responsibility: Real-time documentation

  • Records all actions and decisions
  • Maintains incident timeline
  • Captures evidence and screenshots
  • Provides input for post-mortem

Assigned To: [Any available team member]

Rehearsal Requirement: Run a tabletop exercise using this playbook at least once per quarter. Simulate a SEV-1 incident, assign roles, and walk through the response. Identify gaps before a real incident reveals them.

Appendix: Regulatory Traceability Matrix

This matrix maps regulatory requirements to specific phases, artifacts, and owners. It exists for audit readiness, not narrative prose.

Core Regulatory Mapping

Regulation / Standard | Requirement | Phase | Artifact(s) | Owner
EU AI Act (Art. 14) | Human oversight | 7, 9 | 7.2 Bias Checks, 9.1 Launch Review | Compliance Officer
EU AI Act (Art. 13) | Transparency & documentation | 1, 4 | 1.4 Documentation, 4.1 RACI | Product Manager
EU AI Act (Art. 9) | Risk management system | 3 | 3.4.3 Risk Assessment Matrix | Risk Manager
GDPR (Art. 17) | Right to erasure | 5, 8 | 5.3 Security Posture, 8.4 Privacy Validation | DPO
GDPR (Art. 22) | Automated decision rights | 7 | 7.2 Interpretability Checks | Legal Counsel
HIPAA (164.312) | Access controls & audit | 5 | 5.3 IAM Configuration | Security Engineer
HIPAA (164.530) | Data retention | 8 | 8.4 Data Retention Validation | Compliance Officer
FDA SaMD (21 CFR 820) | Design controls | 4 | 4.2 Pipeline Architecture | QA Manager
FDA SaMD | Change control | 11 | 11.4 Retraining Protocol | QA Manager
NIST AI RMF (Map) | Context & scope definition | 1, 2 | 1.3 Relationship Map, 2.1 Scope Definition | ML Lead
NIST AI RMF (Measure) | Performance monitoring | 11 | 11.2 Model Dashboards | ML Engineer
NIST AI RMF (Manage) | Risk response | 10, 11 | 10.2 Rollback Plans, 11.3 Incident Response | SRE Lead
ISO/IEC 42001 | AI management system | All | Full playbook compliance | CTO
ISO/IEC 23894 | AI risk management | 3, 7 | 3.4 Risk Assessment, 7.4 Security Testing | Risk Manager
Basel AI Guidance | Model risk management | 11, 12 | 11.4 Decay Detection, 12.3 Tech Debt | Model Risk Officer
SOC 2 Type II | Security controls | 5, 7 | 5.3 Security Posture, 7.4 Pen Testing | CISO
IEEE 2857 | Privacy engineering | 3, 8 | 3.4 Ethical Framework, 8.4 Privacy Validation | Privacy Engineer
Audit Preparation: For each row, the named owner must be able to produce the referenced artifact(s) within 24 hours of audit request. Artifacts without designated storage locations or owners are compliance gaps.
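
One way to make the 24-hour requirement checkable is to keep the matrix in machine-readable form and scan it for gaps. A minimal sketch, assuming each row records an owner and a storage location; the TraceabilityRow fields and the example storage path are illustrative, not a mandated schema.

from dataclasses import dataclass

@dataclass
class TraceabilityRow:
    regulation: str
    requirement: str
    phases: list
    artifacts: list
    owner: str = ""
    storage_location: str = ""  # where the artifact can be produced from on request

def compliance_gaps(rows):
    """Flag rows that cannot meet the 24-hour artifact-production requirement."""
    gaps = []
    for row in rows:
        if not row.owner:
            gaps.append((row.regulation, "no named owner"))
        if not row.storage_location:
            gaps.append((row.regulation, "no designated storage location"))
    return gaps

matrix = [
    TraceabilityRow("EU AI Act (Art. 14)", "Human oversight", [7, 9],
                    ["7.2 Bias Checks", "9.1 Launch Review"],
                    owner="Compliance Officer", storage_location="governance-repo/phase-7/"),
    TraceabilityRow("GDPR (Art. 17)", "Right to erasure", [5, 8],
                    ["5.3 Security Posture", "8.4 Privacy Validation"],
                    owner="DPO"),  # no storage location recorded -> compliance gap
]
print(compliance_gaps(matrix))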

Standards & Citations

This framework incorporates requirements and guidance from the following standards:

  1. NIST AI Risk Management Framework (AI RMF) 1.0
  2. ISO/IEC 23894:2023 — AI Risk Management
  3. ISO/IEC 23053:2022 — Framework for AI Systems using ML
  4. ISO/IEC 24028:2020 — AI Trustworthiness Overview
  5. ISO/IEC TR 24027:2021 — Bias in AI Systems
  6. ISO/IEC 24029-1:2021 — Neural Network Robustness
  7. IEEE 2857-2021 — Privacy Engineering for AI/ML
  8. IEEE 2858-2021 — Algorithmic Bias Considerations
  9. IEEE 3652.1 — Federated ML Architecture
  10. ISO 13485 — Medical Devices QMS
  11. IEC 62304 — Medical Device Software Lifecycle
  12. FDA GMLP — Good Machine Learning Practice
  13. ISO 21448 — SOTIF for Road Vehicles
  14. ISO 26262 — Automotive Functional Safety
  15. Basel Committee — AI Model Risk Guidance
  16. EU Artificial Intelligence Act
  17. OECD AI Principles
  18. China GB/T AI Standards
  19. ISO 9001:2015 — Quality Management Systems
  20. ISO/IEC 90003:2014 — Software Engineering Guidelines
  21. ISO/IEC 25010:2011 — Software Quality Models
  22. ISO/IEC 42001 — AI Management Systems
  23. IEEE 730 — Software Quality Assurance
  24. CMMI Level 3+ — Process Maturity
  25. ISO/IEC 17025:2017 — Testing Lab Competence

Glossary

Definitions for terms used throughout this playbook. Consistent terminology prevents miscommunication. If a term is used differently in your organization, document the mapping.


A

Agentic AI
AI systems that can autonomously take actions, use tools, or make multi-step decisions without human intervention at each step. Includes tool-using LLMs, autonomous agents, and multi-agent systems. See: Appendix AG
AI Act (EU)
European Union regulation establishing a legal framework for AI systems based on risk classification. High-risk AI systems must meet requirements for transparency, human oversight, accuracy, and robustness. Effective 2024-2026. Reference: Regulatory Matrix
Artifact
A documented deliverable produced during a phase of the playbook. Examples: Model Card, Risk Assessment Matrix, Runbook. Artifacts have named owners and version control.
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
A metric measuring a classification model's ability to distinguish between classes across all decision thresholds. Ranges from 0.5 (random) to 1.0 (perfect discrimination). Useful for comparing models, but it does not reflect performance at the real-world operating point.

B

Baseline
Reference performance metrics established before deployment or after initial production stabilization. Used to detect degradation and drift. Must be documented with measurement methodology.
Bias (Algorithmic)
Systematic errors in model outputs that disadvantage particular groups. Can arise from training data (historical bias), feature selection (proxy discrimination), or evaluation methodology. See: Phase 7.2, ISO/IEC TR 24027
Bus Factor
The minimum number of team members who would need to leave before a project becomes inoperable due to knowledge loss. A bus factor of 1 is a critical risk. This playbook requires bus factor ≥ 2 for production systems.

C

Canary Deployment
Deployment strategy in which a new model version receives a small percentage of traffic (typically 5-10%) while being monitored. Traffic increases gradually if metrics remain healthy. Enables early detection of issues without full exposure.
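
A minimal sketch of the traffic split itself, assuming each request carries a stable identifier; hashing the id keeps a given caller pinned to the same variant for the duration of the canary. The function name and the 5% figure are illustrative.

import hashlib

def canary_bucket(request_id, canary_percent):
    """Deterministically route a request to 'canary' or 'stable' by hashing its id."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable 0-99 bucket per request id
    return "canary" if bucket < canary_percent else "stable"

# Example: roughly 5% of traffic goes to the new model version.
routes = [canary_bucket(f"req-{i}", canary_percent=5) for i in range(1_000)]
print(routes.count("canary"), "of 1,000 requests routed to the canary")
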
Concept Drift
Change in the relationship between input features and target variable over time. Unlike data drift, concept drift means the underlying patterns have changed, not just the input distribution. Example: Customer behavior changing during pandemic. See: Phase 11.4
Cost Telemetry Contract (CT)
A mandatory agreement specifying which economic metrics must be tracked, who owns each metric, refresh cadence, and kill thresholds. Systems cannot ship without a complete CT. See: Cost Telemetry section
CRISP-DM
Cross-Industry Standard Process for Data Mining. A methodology defining six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. This playbook extends CRISP-DM with governance, operational, and economic controls.

D

Data Drift
Change in the statistical distribution of input features over time compared to training data. Measured using metrics like PSI (Population Stability Index) or KL Divergence. Does not necessarily indicate performance degradation but warrants investigation.
Datasheet (for Datasets)
Standardized documentation for datasets describing motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Based on Gebru et al. (2021). See: Model Cards & Datasheets section
Disparate Impact
A fairness measure comparing the selection rate for a protected group with the selection rate for a reference group. Under the "four-fifths rule", a ratio below 0.8 indicates disparate impact. Used in regulatory compliance.
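
A minimal calculation sketch of the four-fifths rule; the group names and counts are illustrative.

def disparate_impact_ratio(selected_protected, total_protected,
                           selected_reference, total_reference):
    """Selection-rate ratio used in the four-fifths rule."""
    protected_rate = selected_protected / total_protected
    reference_rate = selected_reference / total_reference
    return protected_rate / reference_rate

# Example: 30 of 100 protected-group applicants approved vs. 50 of 100 in the reference group.
ratio = disparate_impact_ratio(30, 100, 50, 100)
print(round(ratio, 2), "below four-fifths" if ratio < 0.8 else "within four-fifths")  # 0.6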

E

Embedding
A dense vector representation of data (text, images, etc.) learned by a neural network. Embeddings capture semantic relationships and are used in similarity search, RAG systems, and transfer learning.
Error Budget
The acceptable amount of error (downtime, incorrect predictions, etc.) over a defined period, derived from SLO targets. When error budget is exhausted, new deployments should pause until reliability improves.
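
A worked example of deriving the budget from an SLO target; the 99.9% figure and 30-day window are illustrative.

def error_budget_minutes(slo_target, window_days=30):
    """Downtime allowance implied by an availability SLO over a rolling window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes

# Example: a 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
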
Explainability
The degree to which a model's predictions can be understood by humans. Includes feature importance, decision paths, and counterfactual explanations. Required for high-risk AI systems under EU AI Act. See: ISO/IEC 24028

F

Feature Store
A centralized repository for storing, managing, and serving ML features. Ensures consistency between training and inference, enables feature reuse, and provides lineage tracking.
Fine-tuning
Adapting a pre-trained model to a specific task or domain by training on task-specific data. Common with LLMs and transfer learning. Introduces risks around training data quality and catastrophic forgetting.
Foundation Model
Large models trained on broad data that can be adapted to many downstream tasks. Examples: GPT, BERT, CLIP. Introduce supply chain risks as organizations depend on external model providers.

G

GDPR (General Data Protection Regulation)
EU regulation on data protection and privacy. Relevant to AI: Article 22 (automated decision-making), Article 17 (right to erasure), and requirements for lawful basis and transparency. See: Regulatory Matrix
GMLP (Good Machine Learning Practice)
FDA guidance for developing medical device software using AI/ML. Emphasizes multi-disciplinary expertise, good software engineering practices, representative data, independence of training and test sets, and reference standards.
Governance OS
The operating system of controls, processes, and accountability structures that ensure AI systems remain safe, compliant, and valuable over time. This playbook is a Governance OS. See: Governance OS section
Ground Truth
The correct label or outcome used to evaluate model predictions. Quality of ground truth directly bounds model quality. Sources include human annotation, authoritative records, and observed outcomes.

H

Hallucination
When a generative AI model produces confident but factually incorrect or fabricated information. Particularly dangerous in high-stakes domains. Cannot be eliminated, only mitigated through verification and grounding. See: LLM Risks L5
HIPAA (Health Insurance Portability and Accountability Act)
US law establishing requirements for protecting health information (PHI). AI systems processing PHI must comply with access controls, audit logging, and data retention requirements. See: Regulatory Matrix
Human Judgment Gate (HJG)
A step in this playbook that requires explicit human decision-making and cannot be automated. Indicated by the HJG badge. Examples: kill criteria definition, risk acceptance, bias evaluation approval.
Hypercare
A period of intensified monitoring and support immediately following production deployment. Typically 2-4 weeks. Characterized by lower thresholds for alerts, faster response times, and elevated staffing. See: Phase 9

I

Inference
The process of applying a trained model to new data to produce predictions. Distinguished from training. Inference cost, latency, and reliability are key production concerns.
Irreversibility Flag
A marker in this playbook indicating decisions that are costly or impossible to unwind once made. Requires extra scrutiny and explicit approval. Examples: data schema choices, model architecture selection.
ISO/IEC 42001
International standard for AI Management Systems. Specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system. Auditable certification available.

K

Kill Criteria
Pre-defined, measurable conditions under which a project or deployed system should be terminated. Must be established before significant investment. Requires named authority to execute. See: Economic Viability $3, Kill Criteria section
KL Divergence (Kullback-Leibler Divergence)
A measure of how one probability distribution differs from another. Used to detect data drift by comparing current input distribution to training distribution. Not symmetric: KL(P||Q) ≠ KL(Q||P).
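
A minimal sketch using scipy.stats.entropy, which returns the KL divergence when a second distribution is supplied; the binned distributions here are illustrative. Computing both directions shows the asymmetry.

import numpy as np
from scipy.stats import entropy

# Binned probability distributions, e.g. a training-time feature histogram (p)
# versus the current production histogram (q).
p = np.array([0.1, 0.2, 0.4, 0.3])
q = np.array([0.2, 0.2, 0.3, 0.3])

kl_pq = entropy(p, q)  # KL(P || Q), in nats
kl_qp = entropy(q, p)  # KL(Q || P), generally a different value
print(round(float(kl_pq), 4), round(float(kl_qp), 4))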

L

Latency
Time between receiving an inference request and returning a prediction. Measured at various percentiles (P50, P95, P99). Critical for user experience and often traded off against accuracy or cost.
LLM (Large Language Model)
Neural network models trained on large text corpora to generate human-like text. Examples: GPT-4, Claude, Llama. Introduce unique risks including hallucination, prompt injection, and context window limitations. See: LLM-Specific Risks appendix

M

MLOps
Practices for deploying and maintaining ML models in production reliably and efficiently. Encompasses CI/CD for ML, monitoring, versioning, and automation. This playbook provides a governance layer on top of MLOps.
Model Card
Standardized documentation for a trained model describing intended use, performance characteristics, limitations, and ethical considerations. Based on Mitchell et al. (2019). Required artifact before production. See: Model Cards section
Model Registry
A centralized repository for storing, versioning, and managing trained models. Enables model lineage, rollback, and audit. Essential infrastructure for governed ML.

N

NIST AI RMF (Risk Management Framework)
Framework from US National Institute of Standards and Technology for managing AI risks. Organized around Map, Measure, Manage, Govern functions. Voluntary but increasingly referenced in procurement and regulation. See: References

O

Ontology
A formal representation of concepts in a domain and the relationships between them. In this playbook, establishing ontology is Phase 1 — ensuring shared vocabulary before building. See: Phase 1
Override Rate
Percentage of model predictions that are overruled by human operators. High override rates may indicate low trust, poor model fit, or changing conditions. Tracked as executive-level signal. See: Executive Control Surface

P

Phase Exit Contract
A checklist of conditions that must be satisfied before proceeding to the next phase. Includes Truth, Economic, Risk, and Ownership contracts. Prevents premature advancement. See: Each phase section
Precision
The proportion of positive predictions that are correct: TP / (TP + FP). High precision means few false positives. Important when false positive cost is high.
Prompt Injection
An attack where adversarial text in input causes an LLM to ignore instructions or behave unexpectedly. Can occur directly (user input) or indirectly (retrieved documents). Requires input sanitization and privilege separation. See: LLM Risks L3
PSI (Population Stability Index)
A metric for measuring distribution shift between two datasets. PSI < 0.1 indicates no significant change; 0.1-0.25 indicates moderate shift; > 0.25 indicates major shift requiring investigation.

R

RACI Matrix
A responsibility assignment chart defining who is Responsible (does work), Accountable (final authority), Consulted (input required), and Informed (kept updated) for each activity. See: Template T.1
RAG (Retrieval-Augmented Generation)
An architecture where an LLM's responses are grounded by retrieving relevant documents from an external knowledge base. Reduces hallucination but introduces retrieval quality as a failure mode. See: Agentic AI section
Recall
The proportion of actual positives that are correctly identified: TP / (TP + FN). High recall means few false negatives. Important when missing positive cases is costly (e.g., fraud detection, medical diagnosis).
Red Team
A group that tests systems by simulating adversarial attacks. For AI, includes prompt injection, jailbreaking, bias elicitation, and edge case discovery. Required in Phase 7. See: Phase 7.3
Rollback
Reverting to a previous known-good version of a model or system. Must be testable, fast, and available without requiring the engineer who deployed the current version. A key incident response capability.
Runbook
A documented set of procedures for operating a system, including common tasks, troubleshooting steps, and incident response. Must be usable by someone who did not write it. See: Phase 10.3

S

SaMD (Software as a Medical Device)
Software intended to be used for medical purposes without being part of a hardware medical device. AI/ML in healthcare often qualifies. Subject to FDA regulation in US, MDR in EU. See: Regulatory Matrix
Shadow Deployment
Running a new model version in parallel with production, receiving real traffic but not affecting user-facing decisions. Enables comparison without risk. Precedes canary deployment. See: Phase 8
SLA (Service Level Agreement)
A commitment defining the expected level of service (uptime, latency, accuracy). Contractual between provider and consumer. Breaches may have financial or contractual consequences.
SLO (Service Level Objective)
An internal target for service quality, typically more stringent than SLA. Provides buffer before SLA breach. Used to guide engineering priorities and error budget allocation.
Stop Authority
A named individual with the power and obligation to halt a project or system when kill criteria are met. Must be able to act without political permission. See: Kill Criteria section

T

Technical Debt
The implied cost of rework caused by choosing quick solutions over better approaches. In ML, includes hardcoded thresholds, undocumented preprocessing, and missing tests. Accumulates interest. See: Phase 12.3
Telemetry
The automated collection and transmission of measurements from a system. For AI, includes inference metrics, resource usage, and business outcomes. Foundation for monitoring and governance.
Threshold
A decision boundary that converts model scores into actions (approve/reject/review). Selection involves trade-offs between precision and recall. Must be documented with rationale.
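
A minimal sketch of threshold selection against a precision floor, using scikit-learn's precision_recall_curve; the labels, scores, and 0.75 floor are illustrative, and any production threshold still needs its rationale documented as noted above.

import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_precision):
    """Return the lowest score threshold that meets the precision floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the final point to align.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision:
            return float(t), float(p), float(r)
    return None  # no threshold satisfies the precision floor

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.8, 0.45, 0.9])
print(pick_threshold(y_true, scores, min_precision=0.75))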

U

UAW (Unit of AI Work)
A standardized measure of AI system output for cost accounting. Defined specifically for each use case. Examples: one prediction, one document processed, one conversation turn. Basis for economic viability calculations. See: Economic Viability section
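
A minimal cost-per-UAW sketch; the dollar figure and document count are illustrative.

def cost_per_uaw(total_monthly_cost, uaw_count):
    """Blended cost of one Unit of AI Work (inference, infrastructure, support)."""
    return total_monthly_cost / max(uaw_count, 1)

# Example: $18,000/month all-in cost across 1.2M processed documents.
print(round(cost_per_uaw(18_000, 1_200_000), 4))  # 0.015 dollars per document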

V

Validation
Testing a model's performance on held-out data to estimate real-world performance. Distinguished from verification (does the system meet specifications) and testing (does the code work). See: Phase 7
Version Control
Systematic tracking of changes to code, data, models, and configuration. Essential for reproducibility, rollback, and audit. All artifacts in this playbook must be version-controlled.

Standards Quick Reference

Standard | Full Name | Domain | Key Focus
ISO/IEC 42001 | AI Management Systems | All industries | AI governance framework, certifiable
ISO/IEC 23894 | AI Risk Management | All industries | Risk identification and treatment
NIST AI RMF | AI Risk Management Framework | All industries | Map, Measure, Manage, Govern
EU AI Act | Artificial Intelligence Act | All industries (EU) | Risk-based regulation, prohibited uses
FDA GMLP | Good Machine Learning Practice | Healthcare | Medical device AI development
Basel AI Guidance | Model Risk Management for AI | Financial services | Banking AI risk management
IEEE 2857 | Privacy Engineering for AI/ML | All industries | Privacy-preserving AI design
SOC 2 Type II | Service Organization Controls | Technology services | Security, availability, confidentiality