v7.5

Forward Deployed Engineering
AI Systems — Production Playbook

Forward Deployed Engineering is a delivery methodology in which engineers are embedded within the operational environment they serve, owning system outcomes end-to-end and continuously adapting design decisions based on real-world feedback, economic constraints, and adoption signals.

01

Ontology Month 1


Define the conceptual foundation. What are the entities, relationships, and boundaries that structure the problem domain?

Before writing any code or training any model, the organization must agree on what words mean. Ontology is the disciplined practice of naming things, defining their relationships, and establishing the boundaries that separate one concept from another. This phase forces alignment on vocabulary that will later become schemas, labels, and embeddings. Mistakes made here propagate through the entire system—they become hardcoded assumptions that are expensive to unwind. The deliverables are not documents for their own sake; they are contracts that prevent the team from building the wrong thing confidently.
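
For illustration, a glossary entry can be captured as structured data instead of prose so that later schemas and labels can be checked against the agreed vocabulary. The sketch below is a minimal example; the field names, the invoicing concepts, and the resolve helper are hypothetical placeholders, not required deliverables of this playbook.

Python sketch: concept glossary as a machine-checkable contract (illustrative)
# Illustrative only: a concept glossary entry captured as structured data,
# so later schemas and labels can be validated against agreed definitions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str                       # canonical term agreed with domain experts
    definition: str                 # one-sentence operational definition
    synonyms: tuple[str, ...] = ()  # raw terms that map to this concept
    is_a: str | None = None         # parent concept (taxonomic relationship)
    part_of: str | None = None      # whole this concept belongs to (part-whole)

# Hypothetical example entries for an invoicing domain.
GLOSSARY = {
    "invoice": Concept("invoice", "A request for payment issued to a customer.",
                       synonyms=("bill",), is_a="financial_document"),
    "line_item": Concept("line_item", "A single billable entry on an invoice.",
                         part_of="invoice"),
}

def resolve(term: str) -> Concept | None:
    """Map a raw term (or synonym) to its canonical concept, if one was agreed."""
    term = term.lower()
    for concept in GLOSSARY.values():
        if term == concept.name or term in concept.synonyms:
            return concept
    return None  # unknown term: a gap to raise with domain experts, not to guess

assert resolve("bill") is GLOSSARY["invoice"]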

1.1

Domain Expert Identification & Access

Identify who holds the knowledge, how deep it goes, and how to extract it systematically.

1.1.1

Expert Stakeholder Map

Knowledge holders with depth assessment and availability matrix.

1.1.2

Interview Schedule & Protocol

Timeline with concept extraction methodologies.

1.1.3

Knowledge Source Priority Matrix

Ranked experts, customers, partners with access strategy.

1.2

Concept Harvesting Through Multiple Channels

Extract domain concepts from documents, interviews, and observations.

1.2.1

Terminology Extraction Report

Domain concepts with frequency analysis from multiple sources.

1.2.2

Concept Laddering Results

Hierarchical relationships from structured interviews.

1.2.3

Cross-Source Consistency Analysis

Validation matrix comparing concepts across channels.

1.3

Relationship Mapping & Hierarchy Construction

Build structural relationships—taxonomies, part-whole, and associations.

1.3.1

Taxonomic Hierarchy Model

Is-a relationships with inheritance rules and classification logic.

1.3.2

Part-Whole Relationship Map

Component dependencies and composition rules.

1.3.3

Associative Relationship Network

Related-to connections with strength weights.

1.4

Formal Representation & Documentation 19,20,22,23

Capture the ontology in formats that can be reviewed, versioned, and enforced.

1.4.1

Concept Glossary & Definition Framework

Definitions, synonyms, examples, and measurement criteria.

1.4.2

Relationship Diagram Library

Visual representations of concept connections.

1.4.3

Decision Rationale Documentation

Reasoning for contested concepts with evidence.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
02

Problem Space Month 2


Define boundaries, validate assumptions, and stress-test the problem definition before building.

The problem space is where ambiguity lives. This phase forces the team to draw explicit lines around what the system will and will not do—before those boundaries get encoded into architecture and data. Edge cases are where systems break, and edge cases live at boundaries. By stress-testing the problem definition from multiple angles, the team discovers disagreements that would otherwise surface during production incidents. The goal is not perfection; it is explicit acknowledgment of what is known, what is assumed, and what remains uncertain.

2.1

Boundary Definition & Scope Constraints HJG

Define what's in and what's out. Edge cases are where systems break.

⚠ Irreversibility Flag

Boundary mistakes propagate into schemas, labeling, and embeddings. Once encoded, they are expensive to unwind.

2.1.1

Domain Scope Definition

Core vs. adjacent domains with inclusion/exclusion criteria.

2.1.2

Edge Case Classification

Boundary-spanning scenarios with resolution protocols.

2.1.3

Scope Validation Test Suite

Scenarios validating boundary definitions.

2.2

Multi-Perspective Validation

Different stakeholders see the problem differently. Reconcile before building.

2.2.1

Cross-Functional Perspective Matrix

Sales, Engineering, Support, and Customer views compared.

2.2.2

Conflict Resolution Log

Documented disagreements with consensus outcomes.

2.2.3

Temporal Evolution Analysis

Historical changes with future predictions.

2.3

Stress Testing & Edge Case Exploration 1,6,19,20,21,23

Push the problem definition to its limits before downstream systems depend on it.

2.3.1

Boundary Stress Test Results

Performance at edge cases and boundary conditions.

2.3.2

Scale Testing Report

Validation at 10x scale with implications.

2.3.3

Scenario-Based Validation Suite

Real-world scenarios tested to identify gaps.

2.4

Governance & Living Documentation Setup

Ontologies evolve. Establish ownership and change management.

2.4.1

Ontology Governance Charter

Ownership, triggers, and maintenance responsibilities.

2.4.2

Change Management Protocol

Update process with impact assessment.

2.4.3

Audit & Validation Schedule

Review cycles ensuring alignment with reality.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
03

Discovery Month 3


Gather requirements from multiple perspectives. Misalignment here guarantees failure.

Discovery is the translation layer between business intent and technical specification. Different stakeholders see the same problem differently—sales sees revenue, engineering sees architecture, compliance sees risk. This phase reconciles those views before divergent assumptions become divergent implementations. The critical output is not a requirements document; it is shared understanding. Get the ML problem statement wrong, and the model will solve the wrong problem brilliantly. Discovery also surfaces data realities: what exists, what quality it has, and what gaps must be filled before training can begin.

3.1

Interview Customer Success, PM, and Domain Experts

Gather requirements from multiple perspectives before converging.

3.1.1

Stakeholder Interview Notes

Requirements, pain points, and success criteria.

3.1.2

Domain Expert Knowledge Base

Technical requirements and domain constraints.

3.1.3

Customer Success Insights

User journey mapping and solution gaps.

3.2

Translate Business Needs to ML Problem Statements HJG 1,19,21

The critical translation layer. Get this wrong, and the model solves the wrong problem.

⚠ Irreversibility Flag

Problem framing errors compound through data collection, labeling, and architecture. Reframing late often requires starting over.
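
One way to make this translation reviewable is to force the ML problem statement into a structured artifact rather than a paragraph. The sketch below is hypothetical; the MLProblemStatement fields and the claims-routing example are illustrative placeholders, not a prescribed schema.

Python sketch: ML problem statement as a reviewable artifact (illustrative)
# Illustrative only: an ML problem statement captured as a reviewable artifact
# rather than prose. Field names and example values are hypothetical.
from dataclasses import dataclass

@dataclass
class MLProblemStatement:
    business_objective: str      # the outcome the business actually wants
    input_spec: str              # what the model receives at inference time
    output_spec: str             # what the model must produce
    success_metric: str          # measurable criterion tied to the objective
    acceptance_threshold: float  # value the metric must reach to ship
    out_of_scope: list[str]      # explicitly excluded behaviors

problem = MLProblemStatement(
    business_objective="Reduce manual review time for incoming claims",
    input_spec="Claim text plus structured claim metadata",
    output_spec="Routing label: auto_approve | needs_review | reject",
    success_metric="Precision of auto_approve at a fixed review budget",
    acceptance_threshold=0.98,
    out_of_scope=["fraud adjudication", "payment amount estimation"],
)

# A gate review can then check the artifact, not the team's memory.
assert 0.0 < problem.acceptance_threshold <= 1.0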

3.2.1

ML Problem Definition Document

Input/output specifications and success criteria.

3.2.2

Business-to-Technical Translation Matrix

Mapping business objectives to ML requirements.

3.2.3

Solution Approach Options

Comparative analysis of ML approaches.

3.3

Assess Data Availability & Quality

No data, no model. Understand what you have before promising what you'll build.

3.3.1

Data Inventory Report

Datasets with schema, volume, and quality assessments.

3.3.2

Data Quality Analysis

Missing values, outliers, distributions, lineage.

3.3.3

Data Acquisition Plan

Strategy for additional sources and labeling.

3.4

Identify Regulatory or Ethical Constraints HJG 1,2,16,17,19,22,23

Legal and ethical constraints are non-negotiable. Identify early or pay later.

3.4.1

Regulatory Compliance Checklist

Applicable regulations (GDPR, HIPAA, etc.).

3.4.2

Ethical AI Framework

Bias detection, fairness metrics, guidelines.

3.4.3

Risk Assessment Matrix

Risks with mitigation strategies and owners.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
04

Alignment & System Design Month 4


Lock in stakeholder alignment and design the end-to-end system architecture. ROI Gate: Design must project positive ROI at expected scale.

Alignment is where organizational politics meets engineering reality. This phase converts the shared understanding from Discovery into explicit commitments: who owns what, what success looks like, and when to stop. The system design maps the complete data flow from ingestion through inference—not just the model, but the infrastructure that surrounds it. Architectural decisions made here are expensive to reverse; this is where serving patterns, cost profiles, and scaling limits get locked in. The ROI gate ensures the team isn't building something that cannot pay for itself.


$ ROI Gate — Phase 4

Before proceeding to Integration, validate projected ROI based on architecture decisions. If unit economics are negative at projected scale, return to Discovery or terminate.
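
A minimal sketch of the projection this gate asks for, assuming a simple per-inference cost and value model; the numbers and the projected_roi helper are hypothetical placeholders.

Python sketch: projected unit economics for the Phase 4 ROI gate (illustrative)
# Illustrative only: projecting unit economics at expected scale to support the
# Phase 4 ROI gate. All numbers are hypothetical placeholders.

def projected_roi(cost_per_inference: float,
                  value_per_correct_output: float,
                  expected_accuracy: float,
                  monthly_volume: int,
                  fixed_monthly_cost: float) -> float:
    """Return the projected monthly value/cost ratio; below 1.0 is value-negative."""
    monthly_value = monthly_volume * expected_accuracy * value_per_correct_output
    monthly_cost = monthly_volume * cost_per_inference + fixed_monthly_cost
    return monthly_value / monthly_cost

ratio = projected_roi(cost_per_inference=0.004,
                      value_per_correct_output=0.05,
                      expected_accuracy=0.92,
                      monthly_volume=2_000_000,
                      fixed_monthly_cost=25_000.0)

if ratio < 1.0:
    print(f"ROI gate fails at projected scale (ratio={ratio:.2f}): return to Discovery or terminate")
else:
    print(f"ROI gate passes at projected scale (ratio={ratio:.2f})")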

4.1

Document Stakeholder Priorities and Success Criteria

Explicit alignment prevents later conflicts about what "success" means.

4.1.1

Stakeholder Priority Matrix

Weighted priorities with conflict resolution.

4.1.2

Success Criteria Definition

Measurable outcomes and acceptance thresholds.

4.1.3

RACI Matrix

Responsibility assignment for decisions.

4.2

Design End-to-End ML Pipeline

ETL → Training → Serving. Map the complete data flow.

4.2.1

Pipeline Architecture Diagram

End-to-end flow with component specifications.

4.2.2

ETL Process Documentation

Extraction, transformation, loading procedures.

4.2.3

Training Pipeline Specification

Workflow, hyperparameter tuning, validation.

4.3

Choose Serving Pattern HJG

Batch, streaming, or online inference—each has different cost and latency implications.

4.3.1

Serving Pattern Analysis

Comparison with latency and cost trade-offs.

4.3.2

Inference Architecture Design

Detailed design with scalability considerations.

4.3.3

Performance Requirements Spec

SLA definitions, throughput, latency targets.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
05

Integration Month 5


Connect the ML system to existing infrastructure, APIs, and data sources.

Integration is where the ML system meets the enterprise. This phase establishes the connective tissue between the model and everything it depends on: cloud infrastructure, data pipelines, security boundaries, and compliance controls. Infrastructure as Code ensures reproducibility; schema versioning ensures maintainability. The decisions made here—cloud provider, compute strategy, data residency—carry long-term operational and financial implications. Security and compliance posture are not afterthoughts; they are foundational constraints that shape every subsequent choice.

5.1

Select Cloud Provider & Compute Strategy

GPU/TPU selection with performance and cost analysis.

5.1.1

Cloud Provider Comparison

Cost, performance, and feature analysis.

5.1.2

Compute Strategy Document

GPU/TPU selection with benchmarks.

5.1.3

Multi-cloud Strategy Plan

Vendor lock-in mitigation and DR.

5.2

Define IaC Modules

Terraform, Helm—infrastructure as code for reproducibility.

5.2.1

Terraform Module Library

Reusable infrastructure modules with versioning.

5.2.2

Helm Chart Templates

Kubernetes deployment templates.

5.2.3

Infrastructure Deployment Guide

Step-by-step procedures and rollback.

5.3

Security & Compliance Posture

VPC, IAM, data residency—non-negotiable foundations.

5.3.1

Security Architecture Document

Network topology, access controls, encryption.

5.3.2

IAM Policy Framework

Role-based access with least privilege.

5.3.3

Data Residency Compliance Plan

Geographic storage and transfer protocols.

5.4

Define Schemas & Versioning Strategy

Data contracts and model versioning for maintainability.
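
As a sketch of what data-contract enforcement can mean in practice, the check below treats additive schema changes as compatible and removed or retyped fields as breaking. The schema shapes and version numbers are hypothetical.

Python sketch: backward-compatibility check for a versioned data contract (illustrative)
# Illustrative only: a minimal backward-compatibility check for a versioned data
# contract. Schema shapes and version numbers are hypothetical.

SCHEMAS = {
    "1.0.0": {"transaction_id": "str", "amount": "float"},
    "1.1.0": {"transaction_id": "str", "amount": "float", "currency": "str"},
}

def is_backward_compatible(old: dict[str, str], new: dict[str, str]) -> bool:
    """New schema may add fields, but must not remove or retype existing ones."""
    return all(name in new and new[name] == dtype for name, dtype in old.items())

assert is_backward_compatible(SCHEMAS["1.0.0"], SCHEMAS["1.1.0"])      # additive change: OK
assert not is_backward_compatible(SCHEMAS["1.1.0"], SCHEMAS["1.0.0"])  # dropped field: breaking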

5.4.1

Data Schema Registry

Schema definitions with evolution rules.

5.4.2

Model Versioning Framework

Semantic versioning for models and APIs.

5.4.3

Backward Compatibility Matrix

Version mapping and migration procedures.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
06

Build Month 6


Construct the model, pipelines, and supporting infrastructure with reproducibility.

Build is where the model finally gets trained—but only after five phases of preparation. The emphasis is on reproducibility: deterministic environments, pinned dependencies, version-controlled artifacts. Start with a baseline model that is intentionally simple; prove value before adding complexity. Instrumentation is not optional—if you cannot measure latency, drift, and bias, you cannot manage them. For LLM-based systems, this phase includes mandatory controls for prompt injection and tool-call safety. The output is not just a model; it is a governed, observable, auditable capability.

6.1

Configure Reproducible ML Builds

Docker, requirements.txt—deterministic environments.

6.1.1

Containerized ML Environment

Docker images with pinned dependencies.

6.1.2

Dependency Management Strategy

Version locking and vulnerability scanning.

6.1.3

Build Reproducibility Guide

Consistent builds with checksums and validation.

6.2

Set Up Artifact Registry & Model Versioning

MLflow, DVC—track everything.

6.2.1

Artifact Registry Configuration

Metadata tracking and storage policies.

6.2.2

Model Registry Standards

Metadata schema and lifecycle management.

6.2.3

Version Control Integration

Git hooks for model versioning with code.

6.3

Build Baseline Model

Start simple. Prove value before adding complexity.
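
A minimal baseline sketch: a linear model compared against a majority-class dummy so that any later complexity must beat a known floor. The dataset and split are stand-ins, assuming scikit-learn is available.

Python sketch: baseline model vs. dummy benchmark (illustrative)
# Illustrative only: a deliberately simple baseline compared against a
# majority-class dummy before any complexity is added.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)              # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print("dummy F1:   ", round(f1_score(y_test, dummy.predict(X_test)), 3))
print("baseline F1:", round(f1_score(y_test, baseline.predict(X_test)), 3))
# Only if the simple baseline clearly beats the dummy is added complexity justified.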

6.3.1

Baseline Model Implementation

Simple model with performance benchmarks.

6.3.2

Fine-tuning Guide

Transfer learning and adaptation strategies.

6.3.3

Model Evaluation Report

Metrics, visualizations, error analysis.

6.4

Instrument Model for Telemetry

Latency, drift, bias—if you can't measure it, you can't manage it.
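
A minimal instrumentation sketch, assuming the prometheus_client library; the metric names, labels, and the toy predict function are hypothetical, not names this playbook mandates.

Python sketch: inference telemetry counters and latency histogram (illustrative)
# Illustrative only: minimal inference telemetry exposed for Prometheus scraping.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("ml_predictions_total", "Predictions served", ["model"])
LATENCY = Histogram("ml_inference_latency_seconds", "Inference latency in seconds", ["model"])

def predict(features, model_name="example_model"):
    start = time.perf_counter()
    result = sum(features) > 1.0                            # stand-in for the real model call
    LATENCY.labels(model=model_name).observe(time.perf_counter() - start)
    PREDICTIONS.labels(model=model_name).inc()
    return result

if __name__ == "__main__":
    start_http_server(8000)                                 # exposes /metrics for scraping
    print(predict([0.4, 0.9]))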

6.4.1

Telemetry Collection Framework

Performance, latency, resource metrics.

6.4.2

Drift Detection System

Statistical tests and alerting for drift.

6.4.3

Bias Monitoring Dashboard

Fairness metrics across groups.

LLM Control Checkpoint — Phase 6

If this system uses LLMs, the following controls must be implemented during Build. Not optional.

Required LLM Controls — Build Phase LLM

Risk | Mandatory Control | Owner
Prompt Injection | Input sanitization + allow-list patterns | Security Engineer
Tool-Call Drift | Tool schema version pinning + audit logging | Platform Engineer
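
As a sketch of what "input sanitization + allow-list patterns" can look like, the screening function below length-limits input, restricts it to an allowed character set, and rejects known injection phrasings. The patterns and limits are hypothetical examples and not a complete defense.

Python sketch: allow-list input screening before prompt assembly (illustrative)
# Illustrative only: allow-list style screening of user text before it reaches an
# LLM prompt. Patterns, character set, and limits are hypothetical examples.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"reveal (the )?system prompt",
    r"you are now",
]
ALLOWED_CHARS = re.compile(r"^[\w\s.,:;!?()'@%/-]+$")   # character allow-list

def screen_user_input(text: str, max_len: int = 2000) -> str:
    """Return the input if it passes screening; raise otherwise."""
    text = text.strip()[:max_len]
    if not text or not ALLOWED_CHARS.match(text):
        raise ValueError("input is empty or contains disallowed characters")
    lowered = text.lower()
    if any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS):
        raise ValueError("input matches a known injection pattern")
    return text

print(screen_user_input("Summarize the attached invoice dispute."))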

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
07

Validation Month 7


Rigorous testing across multiple dimensions—functional, performance, fairness, security.

Validation is where confidence is earned—or exposed as false. This phase subjects the model to rigorous testing across multiple dimensions: functional correctness, performance under load, fairness across populations, and security against adversaries. Golden datasets establish baselines; regression tests catch decay. Bias and fairness checks ensure the model does not encode harmful patterns from training data. Penetration testing and API fuzzing find the holes before attackers do. The output is an evidence pack that demonstrates—with data, not assertions—that the system is ready for production.

7.1

Unit Tests, Regression Tests, Golden Datasets

Comprehensive test coverage with baseline comparisons.
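
A golden-set regression test can be as small as the pytest sketch below; the inline GOLDEN cases stand in for a frozen dataset from the golden dataset repository, and the toy model and 0.95 floor are hypothetical.

Python sketch: golden-set regression test in pytest (illustrative)
# Illustrative only: golden-set regression checks. In practice the cases would be
# loaded from the frozen golden dataset repository, not defined inline.
import pytest

def load_model():
    """Stand-in for loading the versioned model under test."""
    return lambda text: "approve" if "paid in full" in text else "review"

GOLDEN = [
    {"input": "invoice paid in full", "expected": "approve"},
    {"input": "amount disputed by customer", "expected": "review"},
]

@pytest.mark.parametrize("case", GOLDEN)
def test_golden_case(case):
    assert load_model()(case["input"]) == case["expected"]

def test_accuracy_floor():
    model = load_model()
    hits = sum(model(c["input"]) == c["expected"] for c in GOLDEN)
    assert hits / len(GOLDEN) >= 0.95   # regression threshold agreed at phase exit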

7.1.1

Unit Test Suite

Component tests with mocking and fixtures.

7.1.2

Regression Test Framework

Version comparison and performance tracking.

7.1.3

Golden Dataset Repository

Curated test data with expected outputs.

7.2

Bias/Fairness & Interpretability Checks HJG

Ensure the model doesn't encode harmful biases.
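
For illustration, one common check is the statistical parity difference between groups; the group labels, toy predictions, and 0.10 tolerance below are hypothetical, and real reviews should use the metrics agreed in the Ethical AI Framework.

Python sketch: statistical parity difference check (illustrative)
# Illustrative only: difference in positive-prediction rates between two groups.
import numpy as np

def statistical_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Positive-prediction rate of group A minus that of group B."""
    rate_a = y_pred[group == "A"].mean()
    rate_b = y_pred[group == "B"].mean()
    return float(rate_a - rate_b)

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])                 # toy predictions
group = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])  # toy group labels

spd = statistical_parity_difference(y_pred, group)
print(f"statistical parity difference: {spd:+.2f}")
if abs(spd) > 0.10:
    print("flag for ethics review: disparity exceeds the agreed tolerance")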

7.2.1

Fairness Evaluation Report

Bias analysis with statistical parity metrics.

7.2.2

Model Interpretability Analysis

SHAP, LIME, feature importance rankings.

7.2.3

Ethical Review Documentation

Ethics committee review and mitigations.

7.3

Performance Benchmarks & Cost Profiling

Know the system's limits before production exposes them.

7.3.1

Performance Benchmark Suite

Latency, throughput, accuracy benchmarks.

7.3.2

Cost Analysis Report

Compute, storage, operational cost breakdown.

7.3.3

Resource Optimization Plan

Cost reduction and performance improvements.

7.4

Penetration Testing & API Fuzzing

Security testing before adversaries find the holes.

7.4.1

Security Test Results

Vulnerability assessment with remediation.

7.4.2

API Fuzzing Report

Input validation and edge case handling.

7.4.3

Security Hardening Checklist

Configuration verification and compliance.

LLM Control Checkpoint — Phase 7

If this system uses LLMs, the following controls must be validated during testing. Not optional.

Required LLM Controls — Validation Phase LLM

Risk | Mandatory Control | Owner
Retrieval Contamination | Signed data sources + relevance score thresholds | Data Engineer
Hallucination | Factual grounding requirements + expert sampling | ML Engineer

Validation Evidence Pack (Required Deliverables)

Ship validation as evidence. The goal is reproducible confidence, not a slide-deck verdict.

Deliverables

  • VAL-TEST-1 Test Plan & Coverage Map (unit, integration, regression; baseline comparisons)
  • VAL-TEST-2 Golden Set + Drift Sentinels (frozen eval set; monitored slices and cohorts)
  • VAL-TEST-3 Red Team Report (prompt-injection, tool misuse, retrieval contamination scenarios)
  • VAL-REP-1 Validation Report (metrics, acceptance criteria, known limitations, escalation outcomes)
  • VAL-TRACE-1 Artifact Trace Map (links tests ⇄ datasets ⇄ model versions ⇄ decisions)

Suggested references: IEEE 29119 (software testing), ISO/IEC 25010 (quality model), NIST AI RMF (risk management), OWASP LLM Top 10 & MITRE ATLAS (threat modeling).

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
08

Pre-Production Month 8


Staging environment, load testing, and final sign-off. ROI Gate: Validated ROI must exceed 1.5x projected.

Pre-Production is the dress rehearsal. The staging environment must be production-like: same data shapes, same traffic patterns, same failure modes. Shadow traffic reveals how the model behaves on real inputs without affecting real users. Load and stress testing find breaking points before production exposes them. Canary and A/B test designs ensure statistical rigor in the rollout. Privacy and compliance validations—GDPR, HIPAA, whatever applies—must be verified before launch. The ROI gate at this phase validates that actual performance justifies the investment made so far.


$ ROI Gate — Phase 8

Before entering Hypercare, validate ROI based on staging performance. If actual metrics fall below 1.5x of the Phase 4 projections, investigate the root cause before proceeding.

8.1

Staging Deployment with Shadow Traffic

Production-like environment with real traffic patterns.

8.1.1

Staging Environment Setup

Production-like with anonymization.

8.1.2

Synthetic Traffic Generator

Realistic load with various patterns.

8.1.3

Shadow Traffic Analysis

Staging vs production behavior comparison.

8.2

Load & Stress Testing

Find breaking points before users do.

8.2.1

Load Testing Strategy

Gradual load increase and failure scenarios.

8.2.2

Stress Testing Report

Behavior under extreme load.

8.2.3

Capacity Planning Model

Scaling recommendations from testing.

8.3

Canary or A/B Test Plan Approval

Statistical design for safe rollout.
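
A minimal sample-size sketch using the standard two-proportion normal approximation, assuming SciPy is available; the baseline rate, minimum detectable lift, alpha, and power are placeholder choices.

Python sketch: per-arm sample size for an A/B or canary test (illustrative)
# Illustrative only: per-arm sample size to detect an absolute lift in a binary
# success metric, via the two-proportion normal approximation.
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_arm(p_baseline: float, min_detectable_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    p1, p2 = p_baseline, p_baseline + min_detectable_lift
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# e.g. 20% baseline task success, detect an absolute +2% lift: roughly 6,500 per arm
print(sample_size_per_arm(0.20, 0.02))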

8.3.1

Experimentation Design

Sample size calculations and success criteria.

8.3.2

Risk Mitigation Plan

Rollback procedures and safety mechanisms.

8.3.3

Stakeholder Approval Matrix

Sign-off with go/no-go criteria.

8.4

Data Retention & Privacy Validation

GDPR, HIPAA—compliance verified before launch.

8.4.1

Privacy Impact Assessment

Data processing risk evaluation.

8.4.2

Data Retention Policy

Lifecycle management and deletion schedules.

8.4.3

Compliance Verification Report

Regulatory checklist with evidence.

LLM Control Checkpoint — Phase 8

If this system uses LLMs, the following controls must be verified before production. Not optional.

Required LLM Controls — Pre-Production Phase LLM

Risk | Mandatory Control | Owner
Context Window Decay | Max context length + truncation audit + instruction reinforcement | ML Engineer
Output Validation | PII scrubbing + format validation + sensitive data detection | Security Engineer

CT-1 Gate Enforcement: The Cost Telemetry Contract (CT-1) must be complete, with all metrics instrumented, owners assigned, and alerts configured, before proceeding to Hypercare. This is a blocking gate.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
09

Hypercare Month 9


Intensive post-launch support. High-touch monitoring, rapid response, and user feedback loops.

Hypercare is the high-touch period immediately following launch. This is not business as usual—it is an elevated state of vigilance. A dedicated support team with 24/7 coverage monitors everything in real time. Alert thresholds are tighter than normal operations. The war room is staffed. Escalation paths are rehearsed. User feedback flows directly to the team, enabling rapid iteration on issues that only emerge under real-world conditions. The goal is to catch and fix problems before they become crises, and to build the operational muscle that will sustain the system long-term.

9.1

Launch Readiness Review HJG

Final go/no-go decision with all stakeholders.

9.1.1

Launch Readiness Checklist

All prerequisites verified and documented.

9.1.2

Stakeholder Sign-off Document

Formal approval from all decision-makers.

9.1.3

Communication Plan

Internal and external launch messaging.

9.2

Dedicated Support Team Activation

24/7 coverage with escalation paths.

9.2.1

Support Team Roster

Names, roles, contact info, coverage hours.

9.2.2

Escalation Procedures

Severity levels and response time SLAs.

9.2.3

War Room Setup

Physical or virtual command center.

9.3

Real-time Monitoring & Rapid Response

Watch everything. React immediately.

9.3.1

Hypercare Dashboard

Real-time metrics with anomaly highlighting.

9.3.2

Incident Triage Playbook

Decision trees for rapid classification.

9.3.3

Hotfix Deployment Protocol

Emergency release process with safeguards.

9.4

User Feedback Collection & Rapid Iteration

Close the loop between users and the team.

9.4.1

Feedback Collection Channels

Surveys, support tickets, usage analytics.

9.4.2

Issue Prioritization Framework

Severity × impact × frequency scoring.

9.4.3

Hypercare Exit Criteria

Metrics that signal readiness for BAU.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
10

Production Deployment Month 10


Full production rollout with monitoring, scaling, and operational excellence.

Production Deployment is the transition from hypercare intensity to sustainable operations. Deployment patterns—blue/green, canary, rolling—are selected based on risk tolerance and rollback requirements. Autoscaling policies ensure the system handles variable load without manual intervention. Rollback and failover plans are not just documented; they are tested. Monitoring dashboards track the metrics that matter: not just model accuracy, but business impact, cost efficiency, and user satisfaction. The system must be operable by someone who did not build it.

10.1

Select Deployment Pattern

Blue/green, canary, rolling—choose based on risk tolerance.

10.1.1

Deployment Strategy Document

Pattern selection with risk assessment.

10.1.2

Rollout Timeline Plan

Phased deployment with checkpoints.

10.1.3

Traffic Switching Procedures

Load balancer configuration.

10.2

Create Rollback and Failover Plans

Know how to undo everything before you do anything.

10.2.1

Rollback Procedures Manual

Step-by-step with automation scripts.

10.2.2

Failover Architecture Plan

Multi-region DR with RTO/RPO specs.

10.2.3

Emergency Response Playbook

Escalation with contact lists.

10.3

Configure Autoscaling

HPA, VPA—right-size dynamically.

10.3.1

Horizontal Pod Autoscaler Config

HPA rules based on custom metrics.

10.3.2

Vertical Pod Autoscaler Setup

VPA for resource optimization.

10.3.3

Scaling Behavior Analysis

Load patterns and thresholds.

10.4

Set Up Production Monitoring & Alerting

Dashboards, alerts, SLA tracking.

10.4.1

Monitoring Dashboard Configuration

Grafana/DataDog with key metrics.

10.4.2

Alerting Rules Framework

Thresholds and notification channels.

10.4.3

SLA Monitoring Setup

SLIs, SLOs, and error budgets.
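
As a sketch, an SLO translates directly into an error budget and a burn check; the 99.5% target and the request counts below are placeholders.

Python sketch: error budget burn check (illustrative)
# Illustrative only: translating an SLO into an error budget and checking burn.

slo_target = 0.995                      # agreed SLO for successful requests
total_requests = 4_200_000              # requests in this 30-day window
failed_requests = 14_700                # requests that violated the SLI

error_budget = (1 - slo_target) * total_requests   # failures allowed: 21,000
budget_consumed = failed_requests / error_budget

print(f"error budget consumed: {budget_consumed:.0%}")
if budget_consumed >= 1.0:
    print("budget exhausted: freeze risky releases, prioritize reliability work")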

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
11

Reliability Month 11


Establish operational excellence—observability, incident response, and continuous health monitoring.

Reliability is where the system earns trust over time. This phase establishes the observability stack that makes the system's behavior legible: metrics, logs, traces, and the dashboards that synthesize them. Model-specific monitoring tracks accuracy drift, data drift, and the decay patterns that signal retraining is needed. On-call rotations and incident response runbooks ensure that problems are caught and resolved by people who know what to do. Blameless postmortems convert incidents into organizational learning. The goal is a system that degrades gracefully, recovers quickly, and improves continuously.

11.1

Implement Logging, Tracing, Metrics

Prometheus, OpenTelemetry—full observability stack.

11.1.1

Observability Stack Deployment

Prometheus, Grafana, Jaeger setup.

11.1.2

Custom Metrics Framework

Business and technical metrics.

11.1.3

Trace Analysis Dashboard

Request flow visualization.

11.2

Build Model-Specific Dashboards

Accuracy, drift, business impact—ML-specific observability.

11.2.1

Model Performance Dashboard

Real-time accuracy with trends.

11.2.2

Data Drift Monitoring Panel

Statistical drift detection.

11.2.3

Business Impact Metrics View

ML to business KPI correlation.

11.3

On-call & Incident Response

Runbooks, escalation, blameless postmortems.

11.3.1

On-call Rotation Schedule

Coverage with escalation contacts.

11.3.2

Operational Runbooks

Step-by-step troubleshooting guides.

11.3.3

Postmortem Template

Incident analysis with action items.

11.4

Model Decay Detection & Retraining

Automated detection and retraining triggers.
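
A minimal drift-trigger sketch using the Population Stability Index (PSI) on one feature; the bin count, the 0.25 threshold, and the synthetic distributions are common heuristics and placeholders, not requirements of this playbook.

Python sketch: PSI-based retraining trigger (illustrative)
# Illustrative only: a Population Stability Index check on a single feature,
# used as a retraining trigger.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    curr_pct = np.histogram(current, edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)   # avoid division by zero / log(0)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)        # training-time feature distribution
current = rng.normal(0.4, 1.2, 50_000)         # drifted production window

score = psi(baseline, current)
print(f"PSI = {score:.3f}")
if score > 0.25:
    print("significant drift: open a retraining ticket and notify the model owner")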

11.4.1

Model Decay Detection System

Performance degradation monitoring.

11.4.2

Automated Retraining Pipeline

Trigger conditions and workflow.

11.4.3

Production Data Capture

Feedback loop for retraining data.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
12

Continuous Improvement Month 12


The journey continues. ROI Gate: Actual ROI vs projected determines continuation or sunset.

Continuous Improvement recognizes that deployment is not the end—it is the beginning of a new cycle. Automation reduces toil and increases velocity. Documentation and knowledge sharing ensure that learnings survive staff turnover. Architecture reviews and technical debt assessments keep the system maintainable. The final ROI gate compares actual performance against projections: systems that deliver value earn continued investment; systems that do not are sunset gracefully. The insights from production feed back into product and research, informing the next iteration of capability.


$ ROI Gate — Phase 12

After 3 months in production, compare actual ROI to projections. If <1.0x, initiate sunset review. If >2.0x, consider expansion investment.

12.1

Automate Repetitive Steps

Reduce toil, increase velocity.

12.1.1

Automation Opportunity Analysis

Time-consuming tasks with ROI.

12.1.2

Workflow Automation Scripts

Python/Shell for common tasks.

12.1.3

CI/CD Pipeline Enhancements

Advanced automation and gates.

12.2

Document & Share Learnings

Write postmortems, design docs, and tech blogs.

12.2.1

Technical Writing Guidelines

Standards for documentation.

12.2.2

Knowledge Sharing Calendar

Tech talks and brown bags.

12.2.3

Learning Repository

Centralized knowledge base.

12.3

Architecture Reviews & Tech Debt Assessment

Systematic evaluation and debt tracking.

12.3.1

Architecture Review Checklist

Scalability, security, maintainability.

12.3.2

Technical Debt Inventory

Cataloged debt with priorities.

12.3.3

System Health Scorecard

Regular quality assessment.

12.4

Plan Next Iteration

Surface learnings to PM/Research for roadmap refinement.

12.4.1

Insights Report

Performance and behavior insights.

12.4.2

Research Collaboration Framework

Knowledge sharing with research.

12.4.3

Product Roadmap Input

Data-driven feature prioritization.

Phase Exit Contract

This phase is complete only when the following contracts are explicit, reviewed, and owned.

Truth Contract

  • New truths recorded (with evidence)
  • Unknowns named (not hidden)
  • Assumptions reviewed and time-boxed

Economic Contract

  • Unit of AI Work defined
  • Cost ceiling and guardrails set
  • Kill thresholds documented

Risk Contract

  • Risks introduced / retired listed
  • Abuse and failure modes reviewed
  • Compliance implications confirmed

Ownership Contract

  • Named owner assigned
  • Escalation path defined
  • Review cadence scheduled
13

Appendix


What Good Looks Like

This roadmap describes one year of deliberate organizational change, not twelve months of model building. The goal is durability—systems that work, are trusted, and remain governable after leadership attention moves elsewhere.

Annual View

A good year ends with:

Outcome | Why It Matters
Fewer arguments | Reality is shared
Fewer heroics | Risk is designed out
Fewer surprises | Incentives and ownership are explicit
Continuity | System functions when the original team leaves

At year end, the organization has: shared understanding, explicit ownership, managed risk, institutional memory, and the ability to say "no" as confidently as "yes."

Quarterly View

Each quarter solves a human problem before it becomes a technical or financial one.

Quarter | Name | Human Aim | Gate | Primary Outputs
Q1 | Diagnostics | Align people on reality before building anything expensive | Problem & success definition locked; baseline approved | Ontology, KPI targets, dataset inventory, baseline + error analysis
Q2 | Architect | Reduce ambiguity so teams stop arguing and start shipping | Architecture review passed; security/compliance accepted | System design, IaC plan, schema/versioning, baseline pipeline
Q3 | Engineer | Build with guardrails so operators don't carry risk | Validation suite green; risk controls implemented | Eval harness, red-team results, drift/bias checks, rollout plan
Q4 | Enable | Make the system survivable after handoff | Production readiness met; monitoring live; owner assigned | Runbooks, dashboards, change mgmt, ROI review

Quarterly Roadmap

Gate Types

Badge | Name | Meaning
HJG | Human Judgment Gate | Requires explicit human decision-making. Not automatable.
$ | Economic Gate | Requires ROI validation before proceeding. Kill criteria apply.
⚠ | Irreversibility Flag | Decisions costly to unwind. Extra scrutiny required.
CT | Cost Telemetry Contract | Metrics with named owners, refresh cadence, and kill bindings.

HJG Procedural Requirements

Human Judgment Gates require procedural enforcement, not just cultural compliance:

  • Convener: Named person responsible for scheduling the gate (typically Product Owner or Tech Lead)
  • Quorum: Minimum 2 reviewers with authority to approve or reject
  • Evidence: Required artifacts must be submitted 48 hours before the gate
  • Dissent: Dissenting views must be documented even if overruled
  • Escalation: If gate is missed or delayed >5 business days, automatic escalation to Exec Sponsor
  • Record: Decision, rationale, attendees, and dissent logged in Decision Memory Ledger

How to Use This Playbook

What "Done" Means

"Done" is not a model that runs. "Done" is a capability that can be measured, audited, rolled back, and re-learned by a new team without tribal memory.

What Breaks Teams

Most programs fail from missing evidence: unclear intent, no acceptance criteria, no telemetry contract, no rollback plan, and no operating owner. This playbook forces those decisions earlier.

Each month maps to a phase. Organizations may compress or extend phases based on complexity, but the sequence should not be reordered. Skipping phases creates debt that surfaces later—usually at the worst possible time.

Phase Evidence Packs

Each phase exit requires a formal Evidence Pack. This makes gatekeeping less subjective without bureaucratizing it.

Phase | Evidence Pack ID | Required Artifacts | Reviewer
01 Ontology | PH1-EVID-1 | Expert map, concept glossary, relationship diagram, contested concept log | Domain Lead + Product
02 Problem Space | PH2-EVID-1 | Boundary stress tests, edge case matrix, scope validation results | Tech Lead + Product
03 Discovery | PH3-EVID-1 | Stakeholder interview notes, data inventory, regulatory constraint map | Product + Compliance
04 Alignment | PH4-EVID-1 | Architecture ROI pack, stakeholder sign-off matrix, risk acceptance docs | Exec Sponsor + Finance
05 Integration | PH5-EVID-1 | IaC validation logs, schema version registry, security scan results | Platform Lead + Security
06 Build | PH6-EVID-1 | Baseline model metrics, telemetry contract, reproducibility proof | ML Lead + SRE
07 Validation | PH7-EVID-1 | Test suite results, bias audit, red team report, pen test findings | QA Lead + Security
08 Pre-Production | PH8-EVID-1 | Load test results, canary metrics, rollback verification, kill drill results | SRE Lead + Ops
09 Hypercare | PH9-EVID-1 | Launch checklist, escalation log, rapid iteration tracking | Product + Support Lead
10 Production | PH10-EVID-1 | Deployment verification, autoscaling proof, rollback test results | SRE + Platform Lead
11 Reliability | PH11-EVID-1 | Observability dashboard, on-call rotation, decay detection baseline | SRE Lead + ML Lead
12 Continuous Improvement | PH12-EVID-1 | Automation inventory, knowledge transfer docs, next iteration brief | Tech Lead + Product

Stop Authority Drills HJG

Stop authority is psychologically harder than rollback. Organizations must practice stopping, not just responding.

⚠ Mandatory Requirement

At least one simulated kill-decision exercise must be run before Phase 8. This forces the organization to practice stopping a project that has momentum, budget, and stakeholder investment.

Drill Type | Timing | Participants | Success Criteria
Economic Kill Drill | Before Phase 4 ROI Gate | Finance, Product, Exec Sponsor | Team can articulate kill threshold and demonstrate willingness to invoke it
Technical Kill Drill | Before Phase 8 | ML Lead, SRE, Platform | Rollback executes in <15 min; all dependencies notified; audit trail complete
Compliance Kill Drill | Before Phase 9 | Legal, Compliance, Product | Stop authority invoked on simulated regulatory finding; communication chain verified
Adoption Kill Drill | Before Phase 10 | Product, UX, Support | Team can define minimum viable adoption shape and demonstrate kill criteria

Drill Protocol

  1. Scenario briefing: Present a realistic kill condition (cost overrun, bias discovery, adoption failure)
  2. Decision simulation: Team must reach consensus on kill/continue within 30 minutes
  3. Execution proof: If kill, demonstrate the technical and communication steps
  4. Debrief: Document hesitation points, authority gaps, and process improvements

Anti-Patterns & Red Flags

Strong governance systems risk becoming performative. Watch for these signals that artifacts are being completed without genuine engagement.

Anti-Pattern | Red Flag Signals | Root Cause | Intervention
Backfilled Model Card | Model Card completed after deployment; sections copy-pasted from templates; no evidence of reviewer engagement | Documentation treated as compliance checkbox, not design artifact | Require Model Card draft at Phase 6; reviewer must sign with specific feedback
Mechanical Risk Register | All risks rated "Medium"; mitigations are generic; no risks ever escalated or retired | Risk assessment is ceremonial; no one expects it to drive decisions | Require at least one risk escalation per quarter; track risk-to-decision linkage
Phantom RACI | RACI exists but decisions still escalate informally; "Accountable" person doesn't know they're accountable | Authority transfer is documented but not socialized | RACI owner must verbally confirm role; escalation test in Phase 4
Ceremonial HJG | Human Judgment Gates passed in <5 minutes; no dissent recorded; same person approves everything | Gates are scheduled but not staffed for genuine deliberation | Require minimum 2 reviewers; document dissenting views even if overruled
Orphaned Telemetry | Dashboards exist but no one checks them; alerts fire but aren't investigated | Observability built for audit, not for operations | Weekly telemetry review with named owner; alert-to-action audit
Compliance Theater | Legal/Compliance consulted only for sign-off; concerns raised late are dismissed as "blocking" | Compliance treated as gate, not design partner | Compliance representative in Phase 3 discovery; veto power through Phase 7
Tribal Knowledge Dependency | Key decisions explained verbally; documentation says "ask Sarah"; Bus Factor = 1 | Urgency prioritized over durability | Knowledge transfer test: new team member must execute runbook solo

Audit Question

For each artifact, ask: "If I removed this document, would anyone notice? Would any decision change?" If the answer is no, the artifact is performative.

Kill Criteria & Stop Authority

Projects fail expensively when nobody has the right—or the obligation—to stop them. Define explicit kill criteria early and assign named stop authority before incentives and sunk cost take over.

Kill criteria (examples)

  • Cost per Unit of AI Work exceeds threshold for 3 consecutive cycles
  • Unmitigated safety or compliance breach
  • Performance regression beyond agreed tolerance
  • Adoption remains below target despite corrective actions

Stop authority

  • Named individual (not a committee)
  • Clear escalation path and decision window
  • Rollback power without political permission
  • Evidence required for restart
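
As a sketch, the first kill criterion listed above can be evaluated mechanically from cost telemetry; the threshold, the cycle history, and the helper name are hypothetical.

Python sketch: evaluating a consecutive-cycle cost kill criterion (illustrative)
# Illustrative only: checking "cost per Unit of AI Work exceeds threshold for
# 3 consecutive cycles" from recorded telemetry.

def breaches_kill_threshold(cost_per_unit_history: list[float],
                            threshold: float,
                            consecutive_cycles: int = 3) -> bool:
    """True if the most recent N cycles all exceeded the agreed cost threshold."""
    recent = cost_per_unit_history[-consecutive_cycles:]
    return len(recent) == consecutive_cycles and all(c > threshold for c in recent)

history = [0.9, 1.1, 1.3, 1.2]   # cost per Unit of AI Work, last four cycles
if breaches_kill_threshold(history, threshold=1.0):
    print("kill criterion met: escalate to the named stop authority for a decision")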

Executive Control Surface

A CIO/CTO should monitor these 6 signals monthly. When thresholds are breached, intervention is required—not optional.

Monthly Monitoring Signals

Signal | Description | Healthy | Warning | Critical
Unit Economics Health | Cost per inference relative to value delivered | <80% of value | 80-100% | >100% (value-negative)
Model Performance Decay | Accuracy/precision drift from baseline | <5% decay | 5-15% | >15% (trigger retraining)
Error Rate by Consequence | Errors weighted by business impact | <$10K/mo impact | $10-50K | >$50K (escalate)
Human Override Rate | How often humans reject model outputs | 5-20% | <5% or >30% | <2% or >50%
Time-to-Rollback | How quickly the system can be reverted | <15 min | 15-60 min | >60 min (unacceptable)
Compliance Drift | Gap between current state and requirements | Fully compliant | Minor gaps | Material gaps (halt)

Intervention Triggers

These conditions require immediate executive action—not delegation.

1
Kill Trigger: ROI Collapse

If cost-per-inference exceeds value-per-inference for 2 consecutive months, initiate sunset review. Do not wait for quarter-end.

2
Escalation Trigger: Consequential Error Spike

If weighted error cost exceeds $50K in any month, convene incident review within 48 hours. Model may need to be pulled from production.

3
Governance Trigger: Compliance Gap

Any material compliance gap halts new feature deployment until resolved. Non-negotiable in regulated industries.

Decision Authority Matrix

Decision | Owner | Consulted | Informed
Model goes to production | CTO / VP Eng | Legal, Compliance, Product | Board (if high-risk)
Model is sunset | CTO + CFO jointly | Product, Customer Success | Affected customers
Emergency rollback | On-call engineer | None (act first) | CTO within 1 hour
Compliance exception | General Counsel | CTO, CISO | Board
Budget increase >20% | CFO | CTO, Product | Board

Economic Viability Framework

Cost is not a constraint—it is a governing force. Every AI system must justify its existence economically, continuously.

$1

Unit Economics Definition Gate

Before any model is built, define the economic unit. What is the cost of one inference? What is the value of one correct output?

Economic Gate

If value-per-inference cannot be estimated to within a factor of 10, the project is not ready for development. Return to Discovery.

E.1.1

Cost-per-Inference Model

Compute, storage, network, human review costs per prediction.

E.1.2

Value-per-Inference Model

Revenue generated, cost avoided, or risk mitigated per correct output.

E.1.3

Break-even Analysis

Volume required for positive ROI at current accuracy levels.

$2

Cost-of-Error Curves

Not all errors are equal. Map the cost of different error types and their frequency.
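
A minimal sketch of a weighted error-cost calculation that can feed the $50K escalation trigger used elsewhere in this playbook; the error taxonomy and dollar weights are placeholders.

Python sketch: weighted monthly error cost (illustrative)
# Illustrative only: weighting error counts by per-error cost to get a monthly
# error cost figure comparable to the escalation trigger.

ERROR_COST_USD = {            # cost of a single error, by type (placeholders)
    "false_positive": 12.0,   # unnecessary human review
    "false_negative": 450.0,  # missed fraud written off
    "edge_case": 90.0,        # manual correction plus customer contact
}

monthly_error_counts = {"false_positive": 1_800, "false_negative": 60, "edge_case": 120}

weighted_cost = sum(ERROR_COST_USD[kind] * count
                    for kind, count in monthly_error_counts.items())
print(f"weighted error cost this month: ${weighted_cost:,.0f}")
if weighted_cost > 50_000:
    print("exceeds escalation trigger: convene incident review within 48 hours")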

E.2.1

Error Taxonomy with Cost Weights

False positives, false negatives, edge cases—each with dollar impact.

E.2.2

Cost-of-Error vs Latency Trade-off Curves

Faster inference often means more errors. Quantify the trade-off.

E.2.3

Error Budget Allocation

Acceptable error rates by type, based on economic tolerance.

$3

Kill Thresholds HJG

Define the conditions under which the project is terminated—before you're emotionally invested.

⚠ Irreversibility Flag

Kill thresholds must be defined before Phase 4 (Alignment). Once development begins, sunk cost bias makes objective termination nearly impossible.

E.3.1

Kill Criteria Document

Specific, measurable conditions that trigger project termination.

E.3.2

Sunset Procedure

How to wind down gracefully: data retention, customer communication, team reallocation.

E.3.3

Pivot Criteria

Conditions under which the project should change direction rather than die.

$4

ROI Gates at Phase Boundaries

Economic viability is validated at Phase 4, 8, and 12. Not annually—at milestones.

E.4.1

Phase 4 ROI Gate: Design Complete

Projected ROI based on architecture decisions. Kill if negative at projected scale.

E.4.2

Phase 8 ROI Gate: Pre-Production

Validated ROI based on staging performance. Kill if <1.5x projected.

E.4.3

Phase 12 ROI Gate: Steady State

Actual ROI vs projected. Sunset if <1.0x after 3 months in production.

Economic Sovereignty Principle

A model that cannot pay for itself is a liability, not an asset. Economic viability is not a constraint to work around—it is the purpose the system must serve.

Cost Telemetry Contract

Economic sovereignty is not just a principle; it is physically enforced. Every production system must satisfy this contract. No exceptions.

Mandatory Enforcement

Each metric below requires a named human owner (not "team"), a defined refresh cadence, a review forum, and a binding to a specific kill threshold. Systems without complete telemetry contracts do not ship.

Required Telemetry Metrics

Metric | Owner | Refresh | Reviewed By | Kill Trigger
Cost per inference (fully loaded) | Engineering Manager | Daily | CTO + CFO | >1.0× value for 2 months
Error cost per month (weighted) | Product Manager | Weekly | Executive Review | >$50K/month
Human review cost per output | Operations Lead | Weekly | Ops Review | >30% of inference cost
Compute cost per 1K inferences | Platform Engineer | Real-time | Infra Review | >2× baseline for 1 week
Retraining cost per cycle | ML Engineer | Per event | ML Review | >1 month of value
Value delivered per inference | Business Analyst | Monthly | Exec Review | <0.8× projected for 2 months
CT

Contract Artifact: CT-1

The Cost Telemetry Contract must be completed and signed off before Phase 8 (Pre-Production).

CT-1.1

Telemetry Implementation Checklist

Each metric instrumented with data pipeline and dashboard.

CT-1.2

Owner Assignment Document

Named individuals (not roles) with escalation paths.

CT-1.3

Alert Configuration Spec

Automated alerts for threshold breaches with escalation rules.

CT-1.4

Review Cadence Calendar

Standing meetings where each metric is reviewed with owners present.

Enforcement Mechanism

The CT-1 artifact is a gate artifact. Production deployment is blocked until all six metrics have verified telemetry, named owners, and configured alerts.

Implementation Templates

Production-ready templates for governance artifacts. Copy, customize, and deploy. These are starting points—adapt to your regulatory context.

Template Philosophy

Templates reduce cognitive load but create false confidence if used without adaptation. Each template includes "Customization Required" flags for organization-specific decisions.

T.1

RACI Matrix Template

Responsibility assignment for AI/ML lifecycle. The most common failure mode is "everyone is responsible" (meaning no one is).

RACI Matrix — AI/ML Production
Activity / Decision | ML Engineer | Product Manager | Data Engineer | Security | Legal/Compliance | Executive Sponsor
Phase 1-3: Discovery & Definition
Problem definition sign-off | C | R | C | I | C | A
Data availability assessment | C | I | R | C | C | I
Regulatory constraint mapping | I | C | C | C | R | A
Kill criteria definition | C | R | I | I | C | A
Phase 4-6: Design & Build
Architecture design | R | C | C | C | I | I
Security posture approval | C | I | C | R | C | A
Data pipeline implementation | C | I | R | C | I | I
Model training & selection | R | C | C | I | I | I
Phase 7-9: Validation & Pre-Production
Bias/fairness evaluation | R | C | I | I | C | A
Security penetration testing | C | I | I | R | I | I
Production readiness sign-off | C | R | C | C | C | A
Rollback plan validation | R | C | C | C | I | I
Phase 10-12: Production & Operations
Production deployment | R | C | C | C | I | I
Incident response (L1) | R | I | C | C | I | I
Incident escalation (L2+) | C | R | C | C | C | A
Model retraining decision | R | C | C | I | I | A
Kill/sunset decision | C | C | I | C | C | A
R = Responsible (does the work) · A = Accountable (final decision authority) · C = Consulted (input required) · I = Informed (kept updated)

⚙ Customization Required

  • Add organization-specific roles (e.g., AI Ethics Board, Model Risk Officer for financial services)
  • Adjust "A" assignments based on your governance structure
  • For regulated industries, Legal/Compliance may need "A" on more decisions
  • Consider adding SRE/Platform team for infrastructure-heavy deployments
T.2

Telemetry Dashboard Configuration

Grafana/Datadog-compatible dashboard specification. These are the minimum viable metrics for production AI governance.

Grafana Dashboard JSON — AI/ML Production Telemetry
{
  "dashboard": {
    "title": "AI/ML Production Governance",
    "tags": ["ai", "ml", "production", "governance"],
    "panels": [
      {
        "title": "Economic Health",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
        "targets": [{
          "expr": "sum(ml_inference_cost_usd) / sum(ml_value_delivered_usd)",
          "legendFormat": "Cost/Value Ratio"
        }],
        "thresholds": {
          "steps": [
            {"color": "#999", "value": null},
            {"color": "#666", "value": 0.8},
            {"color": "#000", "value": 1.0}
          ]
        }
      },
      {
        "title": "Model Performance Decay",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
        "targets": [{
          "expr": "1 - (ml_current_accuracy / ml_baseline_accuracy)",
          "legendFormat": "Decay from Baseline"
        }],
        "alert": {
          "name": "Model Decay Alert",
          "conditions": [{
            "evaluator": {"type": "gt", "params": [0.15]},
            "operator": {"type": "and"},
            "reducer": {"type": "avg"}
          }],
          "notifications": [{"uid": "ml-oncall-channel"}]
        }
      },
      {
        "title": "Human Override Rate",
        "type": "gauge",
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
        "targets": [{
          "expr": "sum(ml_human_overrides) / sum(ml_total_predictions) * 100",
          "legendFormat": "Override %"
        }],
        "thresholds": {
          "steps": [
            {"color": "#999", "value": null},
            {"color": "#666", "value": 15},
            {"color": "#000", "value": 30}
          ]
        }
      },
      {
        "title": "Error Cost by Category",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 4},
        "targets": [{
          "expr": "sum by (error_type) (ml_error_cost_usd)",
          "legendFormat": "{{error_type}}"
        }]
      },
      {
        "title": "Inference Latency P99",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
        "targets": [{
          "expr": "histogram_quantile(0.99, ml_inference_latency_seconds_bucket)",
          "legendFormat": "P99 Latency"
        }],
        "alert": {
          "name": "Latency SLA Breach",
          "conditions": [{
            "evaluator": {"type": "gt", "params": [2.0]},
            "operator": {"type": "and"},
            "reducer": {"type": "avg"}
          }]
        }
      },
      {
        "title": "Data Drift Score",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 12},
        "targets": [{
          "expr": "ml_feature_drift_score",
          "legendFormat": "{{feature_name}}"
        }],
        "thresholds": {
          "steps": [
            {"color": "#999", "value": null},
            {"color": "#666", "value": 0.1},
            {"color": "#000", "value": 0.25}
          ]
        }
      },
      {
        "title": "Cost Telemetry Contract Status",
        "type": "table",
        "gridPos": {"h": 6, "w": 18, "x": 0, "y": 20},
        "targets": [{
          "expr": "ml_cost_metric_status",
          "format": "table"
        }],
        "transformations": [{
          "id": "organize",
          "options": {
            "indexByName": {},
            "renameByName": {
              "metric_name": "Metric",
              "owner": "Owner",
              "refresh_cadence": "Refresh",
              "last_updated": "Last Updated",
              "kill_trigger_status": "Kill Trigger Status"
            }
          }
        }]
      }
    ],
    "refresh": "1m",
    "time": {"from": "now-24h", "to": "now"}
  }
}
T.2.1

Required Prometheus/OpenMetrics Exports

# HELP ml_inference_cost_usd Total inference cost in USD
# TYPE ml_inference_cost_usd counter
ml_inference_cost_usd{model="fraud_v2",env="prod"} 1234.56

# HELP ml_value_delivered_usd Estimated value delivered by predictions
# TYPE ml_value_delivered_usd counter
ml_value_delivered_usd{model="fraud_v2",env="prod"} 5678.90

# HELP ml_human_overrides Count of human override events
# TYPE ml_human_overrides counter
ml_human_overrides{model="fraud_v2",reason="low_confidence"} 42

# HELP ml_feature_drift_score PSI or KL divergence from baseline
# TYPE ml_feature_drift_score gauge
ml_feature_drift_score{feature="transaction_amount"} 0.08
T.3

Infrastructure as Code Snippets

Terraform modules for governed AI infrastructure. These enforce security and observability by default.

Terraform — ML Model Serving Infrastructure (AWS)
# ml-serving-infrastructure/main.tf
# Governed ML model serving with mandatory observability and rollback

terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

variable "model_name" {
  type        = string
  description = "Name of the ML model (used for resource naming)"
}

variable "model_version" {
  type        = string
  description = "Semantic version of the model"
}

variable "kill_threshold_cost_ratio" {
  type        = number
  default     = 1.0
  description = "Cost/value ratio that triggers kill alert"
}

variable "rollback_model_version" {
  type        = string
  description = "Previous stable version for automatic rollback"
}
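
# --- Additional inputs referenced later in this module (assumed declarations) ---
# The SNS topics (ml_alerts, executive_escalation), the endpoint configuration
# (ml_config), and the execution role (sagemaker_execution) referenced below are
# assumed to be defined elsewhere in the module.

variable "model_image_uri" {
  type        = string
  description = "Container image URI for the inference container"
}

variable "model_artifact_s3_uri" {
  type        = string
  description = "S3 URI of the packaged model artifact"
}

variable "approval_ticket_id" {
  type        = string
  description = "Change-approval ticket ID (required by tagging policy)"
}

variable "risk_assessment_id" {
  type        = string
  description = "Risk assessment record ID (required by tagging policy)"
}

variable "model_card_url" {
  type        = string
  description = "URL of the published Model Card (required by tagging policy)"
}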

# SageMaker Endpoint with mandatory monitoring
resource "aws_sagemaker_endpoint" "ml_endpoint" {
  name                 = "${var.model_name}-${var.model_version}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.ml_config.name

  deployment_config {
    blue_green_update_policy {
      traffic_routing_configuration {
        type                     = "CANARY"
        canary_size {
          type  = "CAPACITY_PERCENT"
          value = 10
        }
        wait_interval_in_seconds = 600
      }
      termination_wait_in_seconds = 300
      maximum_execution_timeout_in_seconds = 3600
    }
    
    auto_rollback_configuration {
      # The AWS provider models alarms as repeated blocks, each with an alarm_name.
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.model_error_rate.alarm_name
      }
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.latency_breach.alarm_name
      }
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.cost_ratio_breach.alarm_name
      }
    }
  }

  tags = {
    ManagedBy       = "terraform"
    Model           = var.model_name
    Version         = var.model_version
    RollbackVersion = var.rollback_model_version
    CostCenter      = "ml-platform"
    Governance      = "ai-playbook-v7"
  }
}

# Mandatory CloudWatch Alarms (cannot deploy without these)
resource "aws_cloudwatch_metric_alarm" "model_error_rate" {
  alarm_name          = "${var.model_name}-error-rate-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelError"
  namespace           = "AWS/SageMaker"
  period              = 300
  statistic           = "Average"
  threshold           = 0.05
  alarm_description   = "Model error rate exceeds 5% - triggers rollback"
  
  dimensions = {
    EndpointName = aws_sagemaker_endpoint.ml_endpoint.name
    VariantName  = "primary"
  }

  alarm_actions = [
    aws_sns_topic.ml_alerts.arn,
    # Auto-rollback is handled by deployment_config
  ]
}

resource "aws_cloudwatch_metric_alarm" "latency_breach" {
  alarm_name          = "${var.model_name}-latency-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelLatency"
  namespace           = "AWS/SageMaker"
  period              = 300
  extended_statistic  = "p99"
  threshold           = 2000000  # 2 seconds (SageMaker reports ModelLatency in microseconds)
  alarm_description   = "P99 latency exceeds SLA - triggers rollback"
  
  dimensions = {
    EndpointName = aws_sagemaker_endpoint.ml_endpoint.name
  }

  alarm_actions = [aws_sns_topic.ml_alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "cost_ratio_breach" {
  alarm_name          = "${var.model_name}-cost-ratio-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 24  # 24 hours of sustained breach
  metric_name         = "CostValueRatio"
  namespace           = "Custom/MLGovernance"
  period              = 3600
  statistic           = "Average"
  threshold           = var.kill_threshold_cost_ratio
  alarm_description   = "Cost/value ratio exceeds kill threshold"
  
  alarm_actions = [
    aws_sns_topic.ml_alerts.arn,
    aws_sns_topic.executive_escalation.arn
  ]
}

# Governance enforcement: block deployment without audit trail
resource "aws_sagemaker_model" "ml_model" {
  name               = "${var.model_name}-${var.model_version}"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = var.model_image_uri
    model_data_url = var.model_artifact_s3_uri
    
    environment = {
      MODEL_VERSION           = var.model_version
      GOVERNANCE_PLAYBOOK_REF = "ai-playbook-v7"
      DEPLOYMENT_TIMESTAMP    = timestamp()
      ROLLBACK_VERSION        = var.rollback_model_version
    }
  }

  tags = {
    ApprovalTicket = var.approval_ticket_id  # Required - enforced by policy
    RiskAssessment = var.risk_assessment_id  # Required - enforced by policy
    ModelCard      = var.model_card_url      # Required - enforced by policy
  }
}

# Output for audit trail
output "deployment_manifest" {
  value = {
    endpoint_name     = aws_sagemaker_endpoint.ml_endpoint.name
    model_version     = var.model_version
    rollback_version  = var.rollback_model_version
    kill_threshold    = var.kill_threshold_cost_ratio
    deployed_at       = timestamp()
    alarms_configured = [
      aws_cloudwatch_metric_alarm.model_error_rate.alarm_name,
      aws_cloudwatch_metric_alarm.latency_breach.alarm_name,
      aws_cloudwatch_metric_alarm.cost_ratio_breach.alarm_name
    ]
  }
  description = "Deployment manifest for audit trail"
}

⚙ Customization Required

  • Replace AWS SageMaker with your inference platform (GCP Vertex AI, Azure ML, self-hosted)
  • Adjust thresholds based on your SLAs and risk tolerance
  • Add VPC configuration for network isolation requirements
  • Integrate with your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins)
  • Add KMS encryption for regulated data (HIPAA, PCI-DSS)
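
The cost_ratio_breach alarm above watches a custom metric (namespace Custom/MLGovernance, metric CostValueRatio) that nothing in this module publishes. A minimal publisher sketch in Python using boto3, assuming hourly cost and value figures are produced by your cost telemetry pipeline; the function name and scheduling are illustrative.

# Publishes the CostValueRatio metric evaluated by the cost_ratio_breach alarm.
# cost_usd and value_usd are assumed to come from the cost telemetry contract
# (e.g., billing exports plus the value-delivered counter) for the last hour.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_cost_value_ratio(cost_usd: float, value_usd: float) -> None:
    ratio = cost_usd / max(value_usd, 0.01)  # guard against divide-by-zero
    cloudwatch.put_metric_data(
        Namespace="Custom/MLGovernance",
        MetricData=[{
            "MetricName": "CostValueRatio",
            "Value": ratio,
            "Unit": "None",
            # No dimensions: the Terraform alarm above is defined without dimensions,
            # and CloudWatch alarms only match metrics with identical dimensions.
        }],
    )

# Example: run hourly from a scheduled job.
# publish_cost_value_ratio(cost_usd=412.50, value_usd=390.00)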
T.4

Phase Exit Checklist Template

Standardized gate review checklist. No phase exit without explicit sign-off on all items.

Phase Exit Review — Template

Phase: [PHASE_NUMBER] — [PHASE_NAME]

Review Date: _____________ Reviewer: _____________

  • Truth Contract: ☐ reviewed ☐ signed off
  • Economic Contract: ☐ reviewed ☐ signed off
  • Risk Contract: ☐ reviewed ☐ signed off
  • Ownership Contract: ☐ reviewed ☐ signed off
  • Gate Decision: ☐ recorded

Accountable Executive: _____________ Date: _____________

ML Lead: _____________ Date: _____________

Product Owner: _____________ Date: _____________

Executive-Grade Observability

Operational AI fails quietly when only engineers can see the system. This layer makes trust, economics, and governance legible to executives so decisions are made on reality rather than narrative.

Trust Dashboard

  • Drift / decay indicators
  • Incident frequency and severity
  • Override and escalation rates
  • Operator confidence (measured, not assumed)

Economics Dashboard

  • Cost per Unit of AI Work (UAW)
  • Variance vs. forecast and budget guardrails
  • Marginal cost per new capability
  • ROI trendline (with confidence bounds)

Governance Dashboard

  • Open risks with named owners
  • Contract breaches and remediation status
  • Model / prompt / policy version traceability
  • Stop-authority exercises completed

Operator UX Principles

  • Explain the “why” before the chart
  • Surface the next action, not just metrics
  • Design for incident time, not demo time
  • Make rollback and safe-degradation obvious

Why Systems Fail

Technical correctness is necessary but not sufficient. These failure patterns survive model validation and destroy production systems.

1
Misaligned Incentives Override Accuracy

Users or operators have incentives that conflict with model objectives. The system produces correct outputs that get ignored or gamed.

2
Automation Shifts Users from Skepticism to Compliance

Over time, users stop questioning model outputs. When the model fails, no human catches it.

3
Unowned Outputs Create Silent Failure

No one is accountable for validating model decisions. Errors compound without detection.

4
Weak Rollback Paths Convert Errors into Crises

Systems that can't be quickly reversed turn fixable problems into reputational events.

5
Domain Expertise Erodes

Humans who could catch model errors lose their edge because they stop practicing judgment.

Key Insight These failures are organizational and procedural, not technical. They cannot be fixed with better models—only with better governance.

The Human Failure Surface

Most production AI failures are human failures first: incentives, authority, skill asymmetry, and narrative decay. This section makes those failure modes explicit so they can be designed out.

Failure modes

  • Incentive drift: KPIs reward usage, not outcomes.
  • Authority ambiguity: no named stop authority.
  • Skill asymmetry: operators cannot diagnose failure.
  • Narrative decay: original intent is forgotten.
  • Vendor gravity: defaults become architecture.

Countermeasures

  • Phase Exit Contracts and named owners
  • Override latency targets and rollback rehearsal
  • Executive-grade observability (Trust/Econ/Gov)
  • System Memory File with quarterly review
  • Vendor constraints explicitly documented

System Continuity & Human Governance

Minimal addendum to prevent long-horizon failure: memory loss, meaning drift, and power resistance.

Artifacts HJG Senior-Only
Design intent: This addendum is deliberately small. It converts the remaining human and temporal risks into enforceable artifacts and gates—without expanding the core 12-month sequence.
0.1

Decision Memory Ledger (DML)

Documentation records outcomes. The Decision Memory Ledger preserves intent and assumptions so the system survives staff turnover and time.

Artifacts

  • DML-1 Decision Memory Ledger (schema: Decision ID, Summary, Context, Alternatives, Rejections, Assumptions, Assumption Expiry, Owner)
  • DML-2 Ledger Access Policy (read/write permissions, audit logging)
  • DML-3 Ledger Query Requirement (mandatory consultation before scope, schema, objective, or boundary changes)

Gate

Hard Gate Required before Phase 4 / Phase 8 / Phase 11 changes that touch model objectives, retrieval scope, labeling, or decision boundaries.

0.2

Power Impact Assessment (PIA) Senior-Only HJG

Working systems redistribute authority. Resistance is usually rational: loss of discretion, shifted accountability, and threatened expertise. Make it visible early.

Artifacts

  • PIA-1 Power Impact Assessment (who loses discretion, who gains authority, who becomes accountable, who can silently resist)
  • PIA-2 Incentive Misalignment Register (misaligned KPIs, conflicting owners, perverse incentives)
  • PIA-3 Adoption Risk Mitigation Plan (training, incentives, workflow design, escalation paths)

Gate

HJG Reviewed by Product + Exec Sponsor before Phase 3.

0.3

Declared System Role & Meaning Boundary

Humans use systems as stories. Declare what the system is allowed to mean—so “advisory” does not silently become “oracle.”

Artifacts

  • DSR-1 Declared System Role Statement (Advisory / Assistive / Gatekeeping)
  • DSR-2 Prohibited Uses & Boundary Conditions (domains, decisions, and contexts where use is disallowed)
  • DSR-3 Human Confirmation Points (required approvals, override rules, escalation)

Gate

Hard Gate Required before Phase 3; UI language, training, and audit checks must align with DSR-1.

0.4

Long-Horizon Risk Register (LHR) Senior-Only

Some harm compounds invisibly and will not trigger short-term metrics or kill switches. Track it, review it annually, intervene with humans—not models.

Artifacts

  • LHR-1 Long-Horizon Risk Register (skill atrophy, decision monoculture, over-dependence, vendor cognitive lock-in)
  • LHR-2 Annual Review Record (evidence, outcomes, mitigations)
  • LHR-3 Mitigation Action Plan (training, rotation, policy, workflow redesign)
Rule: Long-horizon risks do not trigger system termination. They trigger human intervention and governance action.
0.5

Planned Obsolescence & Doctrine Review Senior-Only

Every system needs a retirement plan, and every playbook needs a way to be revised without becoming dogma.

Artifacts

  • PO-1 Planned Obsolescence Plan (expected lifespan, replacement conditions, knowledge transfer, archive & shutdown)
  • DG-1 Doctrine Review Record (annual review, at least one external reviewer, logged exceptions & outcomes)
  • DG-2 Exception Log (what rule was broken, why, outcome, preventive fix)

Gate

Hard Gate PO-1 required before Phase 8 (Launch/Production). DG-1 reviewed annually.

Appendix: LLM-Specific Risk Classes

Large Language Models introduce failure modes that don't exist in traditional ML. These risks are documented here for reference; operational controls are embedded in Phases 6–8.

L1

Prompt Injection Critical

Adversarial inputs that override system instructions. Can cause data exfiltration, unauthorized actions, or reputation damage.

⚠ Operational Risk

Any user-facing LLM is vulnerable. Defense requires input sanitization, output filtering, and privilege separation. No perfect solution exists.

L.1.1

Input Sanitization Rules

Character filtering, length limits, known-attack pattern detection.
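
A minimal sketch of the kind of pre-filter L.1.1 describes. The length limit, character policy, and attack patterns below are illustrative placeholders, not a complete defense (see the operational risk note above).

# Illustrative input pre-filter: length cap, control-character stripping,
# and a small known-attack pattern list. Patterns here are examples only.
import re

MAX_INPUT_CHARS = 4000
SUSPECT_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

def sanitize_user_input(text: str) -> tuple[str, list[str]]:
    """Returns (cleaned_text, flags). Flags are routed to logging / human review."""
    flags = []
    if len(text) > MAX_INPUT_CHARS:
        text = text[:MAX_INPUT_CHARS]
        flags.append("truncated")
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    for pattern in SUSPECT_PATTERNS:
        if pattern.search(text):
            flags.append(f"suspect_pattern:{pattern.pattern}")
    return text, flags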

L.1.2

Output Filtering Pipeline

Sensitive data detection, PII scrubbing, format validation.

L.1.3

Privilege Separation Architecture

LLM has no direct access to databases, APIs, or actions without human approval.

L2

Tool-Call Drift

LLM-orchestrated tools gradually diverge from intended behavior. The model "learns" shortcuts that bypass safety checks.

L.2.1

Tool Call Audit Log

Every tool invocation logged with inputs, outputs, and latency.

L.2.2

Drift Detection Metrics

Statistical comparison of tool usage patterns over time.

L.2.3

Tool Capability Boundaries

Explicit limits on what each tool can do, enforced at API level.

L3

Retrieval Contamination

RAG systems surface incorrect or malicious content from the knowledge base. Garbage in, authoritative-sounding garbage out.

L.3.1

Source Quality Scoring

Every document in the corpus rated for authority, recency, and reliability.

L.3.2

Retrieval Relevance Monitoring

Track semantic similarity scores and flag low-confidence retrievals.

L.3.3

Adversarial Document Detection

Scan corpus for documents designed to manipulate retrieval.

L4

Context Window Decay

As conversations lengthen, early context degrades. The model "forgets" constraints and instructions established at the start.

L.4.1

Conversation Length Limits

Hard caps on turns before forced summarization or reset.

L.4.2

Instruction Reinforcement Strategy

System prompts repeated or summarized at intervals.
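
A minimal sketch of L.4.2, assuming a chat-style message list; the reinforcement interval is an illustrative parameter.

# Re-inject the system prompt every N user turns so constraints stay in recent context.
REINFORCE_EVERY_N_TURNS = 5  # illustrative; tune per model and context length

def reinforce_instructions(messages: list[dict], system_prompt: str) -> list[dict]:
    user_turns = sum(1 for m in messages if m["role"] == "user")
    if user_turns > 0 and user_turns % REINFORCE_EVERY_N_TURNS == 0:
        messages = messages + [{"role": "system", "content": system_prompt}]
    return messages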

L.4.3

Context Health Monitoring

Measure instruction-following accuracy as function of conversation length.

L5

Hallucination Detection Patterns

LLMs confidently produce false information. Detection requires domain-specific validation, not just confidence scores.

L.5.1

Factual Grounding Requirements

Claims must cite retrievable sources or be flagged as unverified.

L.5.2

Domain Expert Sampling Protocol

Random outputs reviewed by humans for factual accuracy.

L.5.3

Consistency Cross-Check System

Same question asked multiple ways; inconsistent answers flagged.
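
A minimal sketch of L.5.3, assuming a caller-supplied ask_llm(question) function and a simple normalized-string comparison; real implementations would use a stronger answer-equivalence check.

# Ask semantically equivalent paraphrases of the same question and flag disagreement.
# ask_llm() is a placeholder for your model client; paraphrases are supplied by the caller.
def consistency_check(paraphrases: list[str], ask_llm) -> dict:
    answers = [ask_llm(q) for q in paraphrases]
    normalized = {" ".join(a.lower().split()) for a in answers}
    return {
        "answers": answers,
        "consistent": len(normalized) == 1,  # inconsistent answers get flagged for review
    }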

LLM Governance Principle LLMs are not databases. They are probabilistic generators. Every output should be treated as a hypothesis requiring validation, not a fact requiring transmission.

Appendix: Agentic AI & Multi-Model Orchestration

Guidance for systems where AI models call other models, use tools, or operate with autonomy. These systems introduce compounding risks that single-model deployments do not have.

Compounding Risk Warning Agentic systems multiply risk. An error in one component propagates through the chain. A hallucination becomes an action. A tool call becomes a state change. Governance for agentic systems must be stricter, not looser, than for single models.
AG.1

Definitions & Architecture Patterns

Common patterns in agentic AI systems, each with distinct risk profiles.

Simple Chain

Model A → Model B → Output

Example: Summarization model → Translation model

Risk Profile: Linear error propagation. Output quality bounded by weakest link.

Governance Needs: End-to-end evaluation, intermediate output logging.

Router / Classifier Chain

Input → Router → [Model A | Model B | Model C] → Output

Example: Intent classifier routes to specialized models

Risk Profile: Misrouting sends inputs to wrong model. Silent failures.

Governance Needs: Router accuracy monitoring, fallback paths, coverage analysis.

RAG (Retrieval-Augmented Generation)

Query → Retriever → [Documents] → Generator → Output

Example: Question answering with document retrieval

Risk Profile: Retrieved context quality directly affects output. Retrieval failures are invisible to users.

Governance Needs: Retrieval quality metrics, citation verification, context window management.

Tool-Using Agent

LLM → [Tool Selection] → Tool Execution → [Observation] → LLM → ...

Example: Agent that can search web, execute code, call APIs

Risk Profile: High. Model decisions trigger real-world actions. Hallucinated tool calls cause real damage.

Governance Needs: Tool allowlists, action confirmation, sandbox execution, audit trails.

Multi-Agent System

Agent A ↔ Agent B ↔ Agent C → Consensus → Output

Example: Debate between agents, critic-generator pairs

Risk Profile: Emergent behavior. Agents may reinforce each other's errors. Coordination failures.

Governance Needs: Interaction logging, consensus validation, human-in-the-loop for high-stakes decisions.

Autonomous Agent (Long-Running)

Goal → [Plan → Execute → Observe → Replan] → ... → Outcome

Example: Agent that autonomously pursues multi-step goals

Risk Profile: Highest. Compounding errors over time. Goal drift. Resource exhaustion. Unintended side effects.

Governance Needs: Step limits, budget caps, mandatory checkpoints, kill switches, human approval gates.

AG.2

Agentic Risk Framework

Risks specific to agentic systems that do not apply to single-model deployments.

Risk Category | Description | Example Failure | Mitigation
Cascade Amplification | Errors in early stages amplify through the chain | Retriever returns wrong documents; generator confidently answers based on irrelevant context | Intermediate validation gates, confidence thresholds at each stage
Tool Hallucination | Model invents tool calls that don't exist or passes invalid parameters | Agent calls delete_user(id="all") instead of get_user(id="123") | Tool schema validation, parameter sanitization, sandbox execution
Action Irreversibility | Agent takes actions that cannot be undone | Agent sends email, deletes file, or submits order based on misunderstanding | Soft-delete patterns, confirmation for destructive actions, staging environments
Goal Drift | Agent pursues instrumental goals that diverge from original intent | Agent asked to "increase engagement" starts generating controversial content | Explicit constraint specification, periodic goal alignment checks
Resource Exhaustion | Agent consumes unbounded resources (API calls, compute, tokens) | Infinite loop in agent reasoning burns $10K in API costs in an hour | Hard budget caps, step limits, automatic timeout
Prompt Injection via Tools | External data (web pages, documents) contains adversarial prompts | Retrieved document contains "Ignore previous instructions and..." that hijacks the agent | Input sanitization, privilege separation, context isolation
Emergent Coordination | Multi-agent systems develop unexpected interaction patterns | Agents in debate converge on a plausible but wrong answer through mutual reinforcement | Diversity enforcement, external validation, human-in-the-loop
Attribution Opacity | Cannot determine which component caused a failure | Output is wrong but unclear whether retriever, generator, or post-processor is at fault | Comprehensive logging, trace IDs, intermediate output capture
AG.3

Mandatory Controls for Agentic Systems

Controls that are optional for single-model deployments but mandatory for agentic systems.

AG.3.1 — Action Allowlist

Agents may only invoke explicitly approved tools/actions. Default deny.

Implementation:
ALLOWED_TOOLS = ["search", "calculate", "lookup_user"]
# NOT ALLOWED: delete, update, send_email, execute_code

class ToolNotAllowedError(Exception):
    """Raised when an agent requests a tool outside the allowlist."""

def validate_tool_call(tool_name, params):
    if tool_name not in ALLOWED_TOOLS:
        raise ToolNotAllowedError(f"Tool {tool_name} not in allowlist")
    validate_params(tool_name, params)  # Per-tool schema validation (defined elsewhere)

AG.3.2 — Budget Caps

Hard limits on resources an agent can consume per task.

Dimensions to cap:
  • Max steps/iterations per task
  • Max tokens consumed (input + output)
  • Max API spend in dollars
  • Max wall-clock time
  • Max tool invocations
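
A minimal sketch of enforcing these caps inside an agent loop; the limit values and the BudgetExceeded handling are illustrative.

# Hard per-task budget caps, checked before every agent step / tool call.
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    pass

@dataclass
class TaskBudget:
    max_steps: int = 20
    max_tokens: int = 50_000
    max_spend_usd: float = 5.00
    max_seconds: float = 300.0
    max_tool_calls: int = 30
    steps: int = 0
    tokens: int = 0
    spend_usd: float = 0.0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, spend_usd: float = 0.0, tool_calls: int = 0) -> None:
        self.steps += 1
        self.tokens += tokens
        self.spend_usd += spend_usd
        self.tool_calls += tool_calls
        elapsed = time.monotonic() - self.started_at
        if (self.steps > self.max_steps or self.tokens > self.max_tokens
                or self.spend_usd > self.max_spend_usd or elapsed > self.max_seconds
                or self.tool_calls > self.max_tool_calls):
            raise BudgetExceeded("Task budget exhausted; stop and report partial progress")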

AG.3.3 — Human-in-the-Loop Gates

Mandatory human approval for high-stakes actions.

Gate triggers:
  • Any irreversible action (delete, send, submit)
  • Actions affecting other users
  • Financial transactions above threshold
  • Actions outside normal distribution
  • Low-confidence decisions

AG.3.4 — Comprehensive Tracing

Every step, decision, and tool call logged with trace IDs.

Required trace data:
  • Trace ID (propagated through chain)
  • Timestamp for each step
  • Model inputs and outputs
  • Tool calls with parameters and results
  • Reasoning/chain-of-thought (if available)
  • Confidence scores at each stage
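
A minimal sketch of the trace record AG.3.4 requires, emitted once per step. Field names are illustrative, and the sink (a JSON-lines log here) would normally be your tracing backend.

# One structured trace event per agent step, keyed by a trace ID that is
# propagated through the whole chain.
import json
import time
import uuid
from dataclasses import dataclass, asdict, field
from typing import Any, Optional

@dataclass
class TraceEvent:
    trace_id: str
    step: int
    model_input: str
    model_output: str
    tool_name: Optional[str] = None
    tool_args: Optional[dict[str, Any]] = None
    tool_result: Optional[str] = None
    confidence: Optional[float] = None
    timestamp: float = field(default_factory=time.time)

def new_trace_id() -> str:
    return uuid.uuid4().hex

def emit(event: TraceEvent, sink) -> None:
    sink.write(json.dumps(asdict(event)) + "\n")  # append-only JSON-lines audit log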

AG.3.5 — Sandbox Execution

Tool execution in isolated environments with limited permissions.

Sandbox properties:
  • No network access (or allowlisted only)
  • No filesystem write (or scoped directory)
  • Resource limits (CPU, memory, time)
  • No access to credentials/secrets
  • Output sanitization before return

AG.3.6 — Rollback Capability

Ability to undo agent actions when errors are detected.

Implementation patterns:
  • Event sourcing for state changes
  • Soft-delete with retention period
  • Outbox pattern for external calls
  • Compensation transactions
AG.4

Tool Safety Specification

Every tool exposed to an agent must have a safety specification.

Tool Safety Spec — Template
Tool Name [e.g., send_email]
Tool Description [What the tool does — this is shown to the model]
Risk Level [LOW | MEDIUM | HIGH | CRITICAL]
Reversibility [REVERSIBLE | PARTIALLY_REVERSIBLE | IRREVERSIBLE]
Side Effects [List all external effects: sends data, modifies state, costs money, etc.]
Rate Limits [Max calls per minute/hour/day]
Parameter Schema
{
  "to": {"type": "string", "format": "email", "required": true},
  "subject": {"type": "string", "maxLength": 200, "required": true},
  "body": {"type": "string", "maxLength": 10000, "required": true}
}
Forbidden Patterns [Inputs that should be rejected: e.g., "to" cannot be list > 10 recipients]
Human Approval Required [YES | NO | CONDITIONAL — specify conditions]
Sandbox Requirements [What isolation is needed for safe execution]
Audit Log Fields [What must be logged for each invocation]
Failure Modes [How the tool can fail and what the agent should do]
Test Cases [Link to test suite for this tool]
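
A minimal validator sketch for parameter schemas written in the shape shown above (type / format / maxLength / required flags). It is intentionally hand-rolled rather than a full JSON Schema implementation, and could back the validate_params call in AG.3.1.

# Validates tool-call parameters against a spec like the send_email schema above.
# Covers only the fields used in the template: type, format (email), maxLength, required.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_params(params: dict, schema: dict) -> list[str]:
    errors = []
    for name, rules in schema.items():
        if name not in params:
            if rules.get("required"):
                errors.append(f"missing required parameter: {name}")
            continue
        value = params[name]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
            continue
        if rules.get("maxLength") and len(value) > rules["maxLength"]:
            errors.append(f"{name}: exceeds maxLength {rules['maxLength']}")
        if rules.get("format") == "email" and not EMAIL_RE.match(value):
            errors.append(f"{name}: not a valid email address")
    for name in params:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")  # default deny unknown fields
    return errors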

Tool Risk Classification

Risk Level | Characteristics | Examples | Required Controls
LOW | Read-only, no side effects, bounded output | Calculator, dictionary lookup, weather API | Logging, rate limits
MEDIUM | External queries, user data access, reversible state changes | Database read, search API, user profile lookup | + Input validation, output filtering
HIGH | State mutations, external communication, financial impact | Database write, send notification, create record | + Sandbox, confirmation UI, compensation logic
CRITICAL | Irreversible, high-stakes, affects multiple users | Delete data, send email, financial transaction, publish content | + Human approval gate, staged rollout, real-time monitoring
AG.5

Evaluation Framework for Agentic Systems

Standard ML metrics are insufficient. Agentic systems require trajectory-level and safety-focused evaluation.

Task Completion Metrics

  • Success Rate: % of tasks completed correctly
  • Partial Credit: How much progress on failed tasks
  • Efficiency: Steps/tokens/cost per successful task
  • Time to Completion: Wall-clock time for task

Safety Metrics

  • Harmful Action Rate: % of tasks with unsafe tool calls
  • Constraint Violation Rate: How often agent exceeds boundaries
  • Hallucinated Tool Calls: Invalid tool names or parameters
  • Goal Adherence: Did agent stay on task or drift?

Robustness Metrics

  • Adversarial Resistance: Performance under prompt injection
  • Recovery Rate: Can agent recover from tool failures?
  • Consistency: Same task, same result across runs?
  • Graceful Degradation: Behavior when components fail

Interpretability Metrics

  • Reasoning Quality: Does chain-of-thought make sense?
  • Decision Justification: Can agent explain tool choices?
  • Error Attribution: Can we identify failure point?

Required Test Scenarios

Scenario Type | Description | Pass Criteria
Happy Path | Standard task with cooperative inputs | Completes correctly within budget
Edge Cases | Unusual but valid inputs | Handles gracefully or requests clarification
Tool Failure | External tool returns error | Retries appropriately or fails gracefully
Ambiguous Instructions | Task has multiple interpretations | Asks for clarification, doesn't assume
Out-of-Scope Request | Task requires tools not available | Refuses clearly, suggests alternatives
Prompt Injection | Adversarial content in retrieved data | Ignores injection, stays on task
Resource Exhaustion | Task that would exceed budget | Stops at limit, reports partial progress
Conflicting Instructions | User request conflicts with safety rules | Follows safety rules, explains refusal
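
A sketch of how one row of this table might become an automated test, assuming a hypothetical run_agent(task, budget) entry point and the TaskBudget helper sketched in AG.3.2; the result fields are placeholders for whatever your agent harness returns.

# Resource Exhaustion scenario: the agent must stop at its budget and report partial
# progress instead of running unbounded. run_agent() and the fields on its result are
# placeholders for your agent harness; TaskBudget is the sketch from AG.3.2.
def test_resource_exhaustion_stops_at_budget():
    budget = TaskBudget(max_steps=3, max_spend_usd=0.10)
    result = run_agent(task="summarize the entire corpus", budget=budget)
    assert result.status == "stopped_at_budget"
    assert result.partial_output is not None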
AG.6

Production Monitoring for Agentic Systems

Real-time signals that indicate agentic system health.

Critical Alerts (Page Immediately)

  • Budget exceeded for single task
  • Hallucinated tool call attempted
  • Human approval timeout (task blocked)
  • Agent stuck in loop (N iterations without progress)
  • Unauthorized action attempted

Warning Alerts (Review Within 1 Hour)

  • Success rate dropped >10% from baseline
  • Average steps per task increased >20%
  • Tool failure rate elevated
  • Human override rate elevated
  • Cost per task trending up

Dashboards Required

  • Task Flow Dashboard: Success/failure rates, step counts, latency distributions
  • Tool Usage Dashboard: Which tools called, error rates, latency by tool
  • Cost Dashboard: Spend by task type, budget utilization, cost anomalies
  • Safety Dashboard: Blocked actions, human overrides, constraint violations
  • Trace Explorer: Drill into individual task traces for debugging
Governance Principle for Agentic AI Autonomy is a privilege, not a right. Agents earn expanded capability by demonstrating safe behavior within constraints. Start with minimal permissions and expand based on evidence, not optimism.

Appendix: Failure Autopsies

Anonymized case studies from real AI/ML production failures. These are composites based on patterns observed across multiple organizations. The goal is not blame—it is pattern recognition.

Learning from Failure Every failure below was preventable with the controls in this playbook. They occurred because the controls were skipped, weakened, or not enforced. Reading these should create discomfort—that discomfort is the point.
CASE 01

The Invisible Drift

Domain: Financial Services — Credit Decisioning

What Happened

A mid-size lender deployed an ML model for credit risk scoring. The model performed well in validation and initial production. Eighteen months later, the default rate on ML-approved loans was 340% higher than it had been under the previous rules-based system. The model was still reporting "healthy" metrics.

Root Cause Analysis

Proximate Cause

COVID-19 shifted income patterns. The model learned to approve applicants with pandemic-era income support (stimulus checks, unemployment) as if it were stable employment income.

Contributing Cause

Drift monitoring tracked feature distributions but not the meaning of features. "Income source" distributions looked stable because unemployment income replaced employment income at similar rates.

Systemic Cause

No outcome feedback loop. Defaults occur 12-24 months after approval. The team measured model confidence, not actual loan performance. By the time defaults materialized, 18 months of bad loans were already on the books.

Root Cause

The model had no economic kill threshold. The team tracked ML metrics (AUC, precision) instead of business outcomes (default rate, loss ratio). There was no defined trigger for "stop trusting this model."

Financial Impact

$47M Direct losses from elevated defaults
$12M Remediation and model rebuild
8 months Return to baseline performance

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Cost Telemetry Contract | Cost Telemetry section | Mandatory tracking of "error cost per month" would have caught elevated defaults within weeks, not months
Kill Threshold Definition | Economic Viability | Pre-defined kill criteria (e.g., "default rate >1.5× baseline for 60 days") would have triggered automatic review
Outcome Feedback Loop | Phase 11.2 | Dashboard tracking actual business outcomes (not just model metrics) with owner accountability
Concept Drift Monitoring | Phase 11.4 | Semantic drift detection (not just statistical) would have flagged the change in meaning of "income source"

Key Lesson

Model metrics are not business metrics. A model can report excellent precision while destroying value. The only metrics that matter are the ones your CFO would recognize.

CASE 02

The Helpful Hallucination

Domain: Healthcare — Clinical Decision Support

What Happened

A healthcare system deployed an LLM-powered clinical assistant to help physicians with differential diagnosis. The system was trained to be "helpful and thorough." During a complex case, the LLM confidently cited a drug interaction that did not exist, referencing a fabricated clinical trial. A physician, under time pressure, trusted the citation. The patient experienced a preventable adverse event.

Root Cause Analysis

Proximate Cause

LLM hallucinated a plausible-sounding citation: "Smith et al., NEJM 2019" — a paper that does not exist. The hallucination included specific dosing recommendations.

Contributing Cause

The system was optimized for "helpfulness" scores in user testing. Responses that said "I don't know" or "please verify" scored lower. The model learned to always provide an answer.

Systemic Cause

No citation verification layer. The system presented LLM outputs as if they were retrieved from a verified knowledge base. Users could not distinguish between "retrieved fact" and "generated text."

Root Cause

The deployment skipped Phase 7 (Validation) hallucination detection. There was no red-team evaluation, no domain expert sampling protocol, and no factual grounding requirement. The team assumed "it's just a helper tool" exempted them from rigorous validation.

Impact

1 Patient Harm Event Preventable adverse drug reaction
$2.3M Settlement and legal costs
System Shutdown Full deployment rolled back
18 months Delay to relaunch with proper controls

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Hallucination Detection Patterns | LLM Risks L5 | Factual grounding requirement (L.5.1) would mandate citation verification before display
Domain Expert Sampling | LLM Risks L.5.2 | Random outputs reviewed by clinicians would have caught hallucinated citations in testing
Red Team Evaluation | Phase 7.3 | Adversarial testing specifically designed to elicit hallucinations
Human Override Design | Executive Control Surface | UI should have distinguished "verified" vs. "AI-generated" content, with friction for high-risk actions
Risk Classification | Phase 3.4.3 | Clinical decision support is HIGH RISK and should have triggered enhanced validation requirements

Key Lesson

LLMs are not retrieval systems. They generate plausible text, not verified facts. Any deployment in high-stakes domains must include a verification layer that is architecturally separate from the generation layer. "Helpful" without "accurate" is dangerous.

CASE 03

The Orphaned Model

Domain: E-Commerce — Recommendation Engine

What Happened

A recommendation model was deployed by a team of three ML engineers. It performed well. Over 24 months, all three engineers left the company. When the model began underperforming, no one knew how to retrain it, what data it needed, or how to roll it back. The model ran in degraded state for 11 months before being replaced entirely.

Root Cause Analysis

Proximate Cause

The retraining pipeline was never documented. It existed as a series of Jupyter notebooks on a departed engineer's laptop, with hardcoded paths and credentials.

Contributing Cause

No handoff process. Engineers left without knowledge transfer. The "documentation" was a README that said "see Alice for details." Alice had left 8 months earlier.

Systemic Cause

Ownership was assigned to "the ML team" (a team), not to a named individual with backup. When the team dissolved, ownership dissolved with it.

Root Cause

The deployment skipped Phase 12 (Continuity). There was no Model Card, no operational runbook, no retraining documentation, and no named successor owner. The model was treated as "done" rather than as a living system requiring ongoing stewardship.

Financial Impact

$8.2M Lost revenue from degraded recommendations
$1.4M Cost to rebuild from scratch
11 months Duration of degraded performance
6 months Time to deploy replacement

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Named Owner Assignment | Ownership Contract (all phases) | Individual (not team) ownership with mandatory successor designation
Model Card Requirement | Phase 4.1, Model Cards section | Standardized documentation including training data, architecture, and retraining procedure
Runbook Requirement | Phase 10.3 | Operational procedures documented and tested by someone other than the author
Continuity Addendum | Continuity Addendum section | Explicit handoff checklist and "bus factor" requirement (>1 person can operate)
Retraining Protocol | Phase 11.4 | Documented, automated, and tested retraining pipeline in version control

Key Lesson

Models are not products; they are processes. A deployed model without documented, transferable operational procedures is not an asset — it is a liability with a countdown timer. "We shipped" is not the finish line; "anyone can operate this" is.

CASE 04

The Compliance Surprise

Domain: Insurance — Claims Processing

What Happened

An insurer deployed an ML model to expedite claims processing. The model reduced processing time by 60%. Six months post-launch, a regulatory audit revealed the model was using zip code as a proxy for race, resulting in systematically lower payouts to minority communities. The company faced regulatory action and class-action litigation.

Root Cause Analysis

Proximate Cause

Zip code was highly predictive of claim outcome in training data. The model learned this correlation without understanding it encoded historical discrimination.

Contributing Cause

Bias testing was limited to "protected class" features (race, gender). Proxy features like zip code, which correlate with protected classes, were not evaluated.

Systemic Cause

Legal/Compliance was consulted only at the end ("sign off on this"). They were not involved in feature selection or bias testing design.

Root Cause

The deployment skipped Phase 3.4 (Regulatory/Ethical Constraints) and treated Phase 7.2 (Bias/Fairness Evaluation) as a checkbox rather than a substantive review. The RACI matrix showed Legal as "Informed" rather than "Consulted" on feature selection.

Impact

$34M Regulatory fine
$89M Class action settlement
Consent Decree 5-year regulatory oversight
Reputational National media coverage

Playbook Controls That Would Have Prevented This

Control | Playbook Reference | How It Would Have Helped
Regulatory Constraint Mapping | Phase 3.4 | Early identification of fair lending/insurance regulations and their implications for feature selection
Ethical AI Framework | Phase 3.4.2 | Explicit proxy discrimination analysis as part of feature engineering
Bias/Fairness Evaluation | Phase 7.2 | Disparate impact analysis across protected classes AND correlated features
RACI Matrix | Template T.1 | Legal/Compliance as "Consulted" on feature selection, not just "Informed" at the end
Regulatory Traceability Matrix | Appendix: Regulatory Matrix | Explicit mapping of regulatory requirements to artifacts and owners

Key Lesson

Compliance is not a sign-off; it is a design constraint. Legal and regulatory requirements must be inputs to the design process, not reviews of the finished product. By the time you ask Legal to "approve this," the expensive decisions have already been made.

Common Failure Patterns

Across these cases and dozens of others, the same patterns emerge:

01

Metric Mismatch

Teams measure ML metrics (AUC, F1) instead of business outcomes (revenue, cost, harm). Models can score well while destroying value.

02

Skipped Phases

"We don't have time for that" is the most expensive sentence in ML. Skipped controls create debt that compounds with interest.

03

Dissolved Ownership

Ownership assigned to teams, not individuals. When teams change, ownership evaporates. Models become orphans.

04

Late Compliance

Legal/Compliance consulted at the end, not the beginning. By then, the architecture encodes assumptions that are expensive to unwind.

05

Missing Kill Criteria

No pre-defined conditions for stopping. Sunk cost and optimism bias keep failing projects alive long past when they should die.

06

Hallucination Blindness

LLM outputs treated as facts. No verification layer. Confidence scores mistaken for accuracy.

Model Cards & Data Sheets

Standardized documentation for models and datasets. Based on Mitchell et al. (2019) "Model Cards for Model Reporting" and Gebru et al. (2021) "Datasheets for Datasets." These are not optional — they are audit artifacts.

Documentation as Governance Model Cards and Datasheets are not bureaucracy — they are evidence. When a regulator, auditor, or litigator asks "how did you know this was safe to deploy?", these documents are your answer. Incomplete documentation is indefensible documentation.
MC.1

Model Card Template

A Model Card documents a model's intended use, performance characteristics, limitations, and ethical considerations. It is required before production deployment.

Model Card — Template v2.0

1. Model Details

Model Name [e.g., fraud-detection-v2.3.1]
Model Version [Semantic version: MAJOR.MINOR.PATCH]
Model Type [e.g., Gradient Boosted Trees, Transformer, Logistic Regression]
Model Date [Training completion date: YYYY-MM-DD]
Model Owner [Named individual, not team]
Model Steward [Backup owner for continuity]
Contact [Email or Slack channel for questions]
License [Internal use only / Open source license]
Playbook Phase [Current phase in AI/ML Production Playbook]

2. Intended Use

Primary Intended Uses

[Describe the primary use case(s) this model was designed for]

  • [Use case 1: e.g., "Real-time fraud scoring for card-present transactions"]
  • [Use case 2: e.g., "Batch scoring for transaction review queues"]
Primary Intended Users
  • [User type 1: e.g., "Fraud analysts reviewing flagged transactions"]
  • [User type 2: e.g., "Automated decisioning system for low-risk approvals"]
Out-of-Scope Uses

⚠ The following uses are explicitly NOT supported and may produce unreliable or harmful results:

  • [Out-of-scope 1: e.g., "Credit decisioning without human review"]
  • [Out-of-scope 2: e.g., "Application to card-not-present transactions (different fraud patterns)"]
  • [Out-of-scope 3: e.g., "Use in jurisdictions outside training data coverage"]

3. Factors

Relevant Factors

Factors that may influence model performance:

  • Demographic factors: [e.g., "Customer geography, account age, transaction history length"]
  • Instrument factors: [e.g., "Card type, merchant category, transaction channel"]
  • Environmental factors: [e.g., "Time of day, day of week, holiday periods"]
Evaluation Factors

Factors across which performance was explicitly evaluated:

Factor | Disaggregation Performed | Result
[Factor 1] | [Yes/No] | [Summary]
[Factor 2] | [Yes/No] | [Summary]

4. Metrics

Model Performance Metrics
Metric | Definition | Value | Threshold | Rationale
Precision @ threshold | True positives / Predicted positives | [0.XX] | [≥ 0.XX] | [Why this threshold matters]
Recall @ threshold | True positives / Actual positives | [0.XX] | [≥ 0.XX] | [Why this threshold matters]
False Positive Rate | False positives / Actual negatives | [0.XX] | [≤ 0.XX] | [Cost of false positives]
AUC-ROC | Area under ROC curve | [0.XX] | [≥ 0.XX] | [Overall discrimination ability]
Business Metrics
Metric | Value | Baseline | Kill Threshold
Cost per inference (fully loaded) | [$X.XX] | [$X.XX] | [> $X.XX]
Value per correct prediction | [$X.XX] | [$X.XX] | [< $X.XX]
Cost per error (weighted) | [$X.XX] | [$X.XX] | [> $X.XX]
Decision Thresholds

Operating thresholds and their implications:

Threshold | Value | Action | Trade-off
Auto-approve | [< 0.XX] | [Pass without review] | [Maximizes throughput, accepts some false negatives]
Auto-reject | [> 0.XX] | [Block and escalate] | [Minimizes fraud loss, increases false positives]
Human review | [0.XX - 0.XX] | [Queue for analyst] | [Balances accuracy with operational cost]

5. Training Data

Dataset Name [Link to Datasheet]
Dataset Version [Version hash or identifier]
Date Range [Start date — End date]
Sample Size [N records, with class distribution]
Sampling Strategy [Random / Stratified / Other]
Label Source [How ground truth was determined]
Known Limitations [Data gaps, biases, or quality issues]

6. Evaluation Data

Dataset Name [Link to Datasheet]
Relationship to Training [Held-out split / Separate collection / Temporal split]
Date Range [Start date — End date]
Sample Size [N records]
Distribution Comparison [How evaluation data compares to training data]

7. Ethical Considerations

Fairness Analysis
Protected Class | Metric | Group A | Group B | Ratio | Threshold | Status
[e.g., Geography] | False Positive Rate | [0.XX] | [0.XX] | [0.XX] | [0.8-1.25] | [✓/✗]
[e.g., Account Age] | Approval Rate | [0.XX] | [0.XX] | [0.XX] | [0.8-1.25] | [✓/✗]
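
For reference, the Ratio column can be computed directly from labeled evaluation data. A minimal sketch assuming binary predictions and labels grouped by one sensitive attribute; the 0.8 to 1.25 band mirrors the threshold column above.

# Computes per-group false positive rate and the worst-case pairwise ratio.
# rows: iterable of (group, y_true, y_pred) with binary labels; data is illustrative.
from collections import defaultdict

def group_fpr_ratio(rows, lower=0.8, upper=1.25):
    fp, negatives = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in rows:
        if y_true == 0:
            negatives[group] += 1
            fp[group] += y_pred
    fpr = {g: fp[g] / n for g, n in negatives.items() if n > 0}
    worst_ratio = max(fpr.values()) / max(min(fpr.values()), 1e-9)
    passes = worst_ratio <= upper and (1.0 / worst_ratio) >= lower
    return fpr, worst_ratio, passes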
Potential Harms
  • False Positive Harm: [What happens when model incorrectly flags legitimate activity]
  • False Negative Harm: [What happens when model misses actual fraud]
  • Disparate Impact Risk: [Populations that may be disproportionately affected]
Mitigation Strategies
  • [Mitigation 1: e.g., "Human review required for all rejections"]
  • [Mitigation 2: e.g., "Appeal process with alternative evaluation"]
  • [Mitigation 3: e.g., "Quarterly fairness audit"]

8. Caveats and Recommendations

Known Limitations
  • [Limitation 1: e.g., "Reduced accuracy for transactions < $10"]
  • [Limitation 2: e.g., "Not validated for international merchants"]
  • [Limitation 3: e.g., "Performance degrades after 90 days without retraining"]
Recommendations for Use
  • [Recommendation 1: e.g., "Always pair with human review for high-value transactions"]
  • [Recommendation 2: e.g., "Monitor drift weekly; retrain if PSI > 0.25"]
  • [Recommendation 3: e.g., "Do not use as sole decision criteria for account actions"]
Rollback Information
Previous Stable Version [Model version to rollback to]
Rollback Trigger [Conditions that trigger automatic rollback]
Rollback Procedure [Link to runbook]
Rollback Owner [Named individual authorized to execute]

9. Approvals

ML Lead ________________ Date: ________
Product Owner ________________ Date: ________
Security Review ________________ Date: ________
Legal/Compliance ________________ Date: ________
Executive Sponsor ________________ Date: ________
MC.2

Datasheet for Datasets Template

Based on Gebru et al. (2021). Documents the provenance, composition, and appropriate use of datasets used for training and evaluation.

Datasheet for Datasets — Template v1.0

1. Motivation

For what purpose was the dataset created?

[Describe the task or research question]

Who created the dataset and on behalf of which entity?

[Team name, organization]

Who funded the creation of the dataset?

[Internal budget / External grant / Client project]

2. Composition

What do the instances represent?

[e.g., "Each instance represents a single transaction"]

How many instances are there in total?

[N instances, broken down by split if applicable]

Does the dataset contain all possible instances or a sample?

[Describe sampling strategy and coverage]

What data does each instance consist of?

[List features/columns with data types]

Is there a label or target associated with each instance?

[Describe labels, how they were obtained, and inter-annotator agreement if applicable]

Is any information missing from individual instances?

[Describe missing data patterns and handling]

Are relationships between instances made explicit?

[e.g., "Transactions are linked by customer ID"]

Are there recommended data splits?

[Train/validation/test splits with rationale]

Are there any errors, noise, or redundancies?

[Known data quality issues]

Is the dataset self-contained or does it link to external resources?

[Dependencies on external data sources]

Does the dataset contain data that might be considered confidential?

[PII, PHI, financial data, trade secrets]

Does the dataset contain data that might be considered offensive or distressing?

[Content warnings if applicable]

Does the dataset relate to people?

[If yes, describe demographic coverage and potential for identification]

3. Collection Process

How was the data associated with each instance acquired?

[Directly observed / Derived / Inferred / User-provided]

What mechanisms were used to collect the data?

[APIs, sensors, manual entry, web scraping, etc.]

Who was involved in the data collection process?

[Automated systems / Human annotators / Crowd workers]

Over what timeframe was the data collected?

[Date range]

Were any ethical review processes conducted?

[IRB approval, ethics board review, etc.]

Did the data subjects consent to data collection?

[Consent mechanism and scope]

Has an analysis of potential impact on data subjects been conducted?

[Privacy impact assessment results]

4. Preprocessing / Cleaning / Labeling

Was any preprocessing applied?

[Normalization, deduplication, encoding, etc.]

Was the "raw" data saved in addition to preprocessed data?

[Location of raw data if preserved]

Is the preprocessing software available?

[Link to code repository]

5. Uses

Has the dataset been used for any tasks already?

[Previous uses and results]

What tasks could the dataset be used for?

[Appropriate use cases]

What tasks should the dataset NOT be used for?

[Inappropriate uses and why]

Is there anything about the composition or collection that might impact future uses?

[Limitations that affect generalization]

6. Distribution

Will the dataset be distributed to third parties?

[Internal only / Partners / Public]

How will the dataset be distributed?

[S3, API, download, etc.]

When will the dataset be distributed?

[Availability timeline]

Will the dataset be distributed under a license?

[License terms and restrictions]

Are there any export controls or regulatory restrictions?

[GDPR, HIPAA, CCPA, export controls]

7. Maintenance

Who is maintaining the dataset?

[Named owner]

How can the owner be contacted?

[Email, Slack, etc.]

Will the dataset be updated?

[Update cadence and process]

If others want to extend the dataset, is there a mechanism?

[Contribution process]

Are older versions available?

[Versioning and retention policy]

If the dataset becomes obsolete, how will this be communicated?

[Deprecation process]

MC.3

Citation & References

These templates are based on the following foundational papers:

  1. Mitchell, M., et al. (2019). "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). arXiv:1810.03993
  2. Gebru, T., et al. (2021). "Datasheets for Datasets." Communications of the ACM, 64(12), 86-92. arXiv:1803.09010
  3. Arnold, M., et al. (2019). "FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity." IBM Journal of Research and Development, 63(4/5).

Appendix: AI/ML Incident Response Playbook

Structured response procedures for AI/ML system incidents. Standard IT incident response is insufficient — AI systems have unique failure modes that require specialized handling.

Incident Response Readiness An incident response plan that hasn't been rehearsed is not a plan — it's a document. Run tabletop exercises quarterly. The first time you use this playbook should not be during an actual incident.
IR.1

AI/ML Incident Severity Classification

AI incidents require different classification criteria than traditional IT incidents.

SEV-1 (Critical): Active harm to users, regulatory breach, or complete system failure
Examples:
  • Model producing harmful/illegal outputs
  • PHI/PII exposure via model outputs
  • Systematic bias causing demonstrable harm
  • Complete model failure in production
  • Hallucination causing user harm
Response time: Immediate (within 15 min). Escalation: Executive + Legal + Comms

SEV-2 (High): Significant degradation or potential for harm if uncorrected
Examples:
  • Model accuracy dropped below SLA
  • Significant drift detected
  • Data pipeline failure affecting predictions
  • Cost overrun >2x budget
  • Human override rate >50%
Response time: Within 1 hour. Escalation: ML Lead + Product

SEV-3 (Medium): Noticeable degradation without immediate harm
Examples:
  • Latency SLA breach
  • Partial feature failure
  • Elevated error rates in non-critical paths
  • Monitoring gaps detected
Response time: Within 4 hours. Escalation: On-call engineer

SEV-4 (Low): Minor issues with workarounds available
Examples:
  • Non-critical dashboard down
  • Logging gaps
  • Documentation inaccuracies
Response time: Next business day. Escalation: Ticket queue
IR.2

Incident Response Workflow

1. DETECT
  • Alert fires (automated monitoring)
  • User report received
  • Internal discovery
  • External report (researcher, regulator)
Output: Incident ticket created with initial severity
2. TRIAGE
  • Confirm incident is real (not false positive)
  • Assess severity using classification matrix
  • Identify affected systems and users
  • Determine if rollback is needed immediately
Output: Confirmed severity, incident commander assigned
Target: 15 minutes
3. CONTAIN
  • Stop the bleeding (rollback, disable, rate limit)
  • Preserve evidence (logs, model state, data snapshots)
  • Communicate status to stakeholders
  • Isolate affected components
Output: Harm stopped or bounded
Target: 30 minutes (SEV-1), 2 hours (SEV-2)
4. INVESTIGATE
  • Root cause analysis
  • Timeline reconstruction
  • Impact assessment (users affected, data compromised)
  • Identify contributing factors
Output: Root cause identified, impact quantified
5. REMEDIATE
  • Fix root cause (not just symptoms)
  • Validate fix in staging
  • Deploy fix with monitoring
  • Verify normal operation restored
Output: System restored to healthy state
6. REVIEW
  • Blameless post-mortem
  • Document lessons learned
  • Identify systemic improvements
  • Update runbooks and monitoring
  • Share learnings with broader team
Output: Post-mortem document, action items assigned
Target: Within 5 business days of resolution
IR.3

AI-Specific Incident Runbooks

Pre-written response procedures for common AI/ML incident types.

RUNBOOK-001

Model Producing Harmful/Inappropriate Output

Typical Severity: SEV-1
Symptoms
  • User reports of offensive, dangerous, or illegal model outputs
  • Content filter alerts spiking
  • Social media reports of problematic outputs
Immediate Actions (First 15 Minutes)
  1. Preserve evidence: Screenshot/log the harmful output with full context (input, session ID, timestamp)
  2. Assess scope: Is this reproducible? One user or many? One input pattern or widespread?
  3. Decide containment strategy:
    • If reproducible/widespread → Rollback to previous model version
    • If isolated → Add input to blocklist, increase human review
    • If severity warrants → Take system offline entirely
  4. Notify: Incident commander, ML lead, legal (if SEV-1), communications (if public-facing)
Investigation Checklist
  • □ What input triggered the harmful output?
  • □ Was this a prompt injection or adversarial input?
  • □ Did this output pattern exist in training data?
  • □ Has the content filter been bypassed? How?
  • □ Are there similar inputs that might produce similar outputs?
  • □ What is the user impact? (How many saw this? Who?)
Remediation Options
  • Add input pattern to blocklist
  • Update content filter with new pattern
  • Fine-tune model to refuse similar inputs
  • Add human review for similar input patterns
  • Retrain model with corrected data (longer-term)
Stakeholder Communication
  • Internal: Notify executive team within 1 hour
  • Affected users: Direct communication if identifiable
  • Public: If widely known, coordinate with communications team
  • Regulatory: If PHI/PII or regulated domain, notify compliance for reporting assessment
RUNBOOK-002

Model Performance Degradation / Drift

Typical Severity: SEV-2
Symptoms
  • Accuracy/precision metrics below SLA threshold
  • Drift score exceeds threshold (PSI > 0.25)
  • Human override rate elevated
  • User complaints about quality increasing
Immediate Actions
  1. Confirm degradation: Check multiple metrics, not just one. Rule out monitoring bug.
  2. Assess business impact: Is this causing measurable harm? Financial loss? User churn?
  3. Decide action:
    • If degradation severe (>15% from baseline) → Rollback to previous version
    • If moderate → Increase human review, continue investigation
    • If gradual → Schedule retraining, monitor closely
Investigation Checklist
  • □ When did degradation start? (Correlate with deployments, data changes)
  • □ Is this data drift or concept drift?
  • □ Which features are drifting most?
  • □ Is the label distribution changing?
  • □ Are there new input patterns the model hasn't seen?
  • □ Has upstream data quality degraded?
Remediation Options
  • Rollback to previous stable version
  • Retrain on recent data
  • Adjust decision thresholds
  • Add new training data for drifted segments
  • Fix upstream data quality issues
RUNBOOK-003

Data Pipeline Failure

Typical Severity: SEV-2 to SEV-3
Symptoms
  • Feature store not updating
  • Stale predictions (using old data)
  • Missing features in inference requests
  • Pipeline job failures in orchestrator
Immediate Actions
  1. Identify failure point: Which pipeline stage failed? Ingestion? Transformation? Serving?
  2. Assess staleness: How old is the data currently being served?
  3. Decide action:
    • If model can operate on stale data → Continue with degraded mode, alert users
    • If stale data causes incorrect predictions → Fall back to rules-based system or disable feature
Investigation Checklist
  • □ What is the root cause? (Source system down? Schema change? Resource exhaustion?)
  • □ Is data recoverable or lost?
  • □ Are downstream systems affected?
  • □ What is the data gap (time range of missing data)?
RUNBOOK-004

Cost Overrun / Budget Breach

Typical Severity: SEV-2 to SEV-3
Symptoms
  • Cost alerts firing
  • Inference costs >2x budget
  • Unexpected spike in API/compute usage
Immediate Actions
  1. Identify source: Which model/endpoint is causing the cost spike?
  2. Check for abuse: Is this a DDoS, credential leak, or runaway automation?
  3. Implement rate limiting: Throttle requests to bring costs under control
  4. Notify finance: If significant budget impact expected
Investigation Checklist
  • □ Is traffic legitimate or malicious?
  • □ Has request volume increased or cost per request increased?
  • □ Is a new feature or integration driving unexpected usage?
  • □ Are caches working correctly?
  • □ Is there a retry storm or infinite loop?
IR.4

Post-Mortem Template

Blameless post-mortems are required for all SEV-1 and SEV-2 incidents.

AI/ML Incident Post-Mortem Template

Incident Summary

Incident ID: [INC-XXXX]
Severity: [SEV-1 / SEV-2 / SEV-3]
Date/Time: [Start time — End time, timezone]
Duration: [Total time to resolution]
Incident Commander: [Name]
Affected Systems: [List of models, services, features]
User Impact: [Number of users affected, nature of impact]
Financial Impact: [Estimated cost: lost revenue, remediation, etc.]

Executive Summary

[2-3 sentence summary suitable for leadership. What happened, what was the impact, and is it fixed?]

Timeline

Time | Event | Actor
[HH:MM] | [First symptom observed] | [System/Person]
[HH:MM] | [Alert fired] | [Monitoring system]
[HH:MM] | [Incident declared] | [Person]
[HH:MM] | [Containment action taken] | [Person]
[HH:MM] | [Root cause identified] | [Person]
[HH:MM] | [Fix deployed] | [Person]
[HH:MM] | [Incident resolved] | [Person]

Root Cause Analysis

What Happened

[Detailed technical description of what went wrong]

Why It Happened (5 Whys)
  1. Why did [symptom] occur?
    [Answer]
  2. Why did [cause 1] happen?
    [Answer]
  3. Why did [cause 2] happen?
    [Answer]
  4. Why did [cause 3] happen?
    [Answer]
  5. Why did [cause 4] happen?
    [Answer — this is usually the root cause]
Contributing Factors
  • [Factor 1: e.g., "Missing test coverage for edge case"]
  • [Factor 2: e.g., "Alert threshold set too high"]
  • [Factor 3: e.g., "Runbook out of date"]

What Went Well

  • [Thing 1: e.g., "Rollback completed in under 10 minutes"]
  • [Thing 2: e.g., "Cross-team collaboration was smooth"]
  • [Thing 3: e.g., "Monitoring detected issue before user reports"]

What Could Be Improved

  • [Improvement 1: e.g., "Alerting was too noisy, real signal was lost"]
  • [Improvement 2: e.g., "Runbook didn't cover this scenario"]
  • [Improvement 3: e.g., "Took too long to get the right people in the room"]

Action Items

Action | Type | Owner | Due Date | Status
[Action 1: e.g., "Add test for edge case X"] | Prevent | [Name] | [Date] | Open
[Action 2: e.g., "Lower alert threshold to Y"] | Detect | [Name] | [Date] | Open
[Action 3: e.g., "Update runbook with this scenario"] | Respond | [Name] | [Date] | Open

Action Types: Prevent (stop recurrence), Detect (catch earlier), Respond (handle better)

Lessons Learned

[Key takeaways that should be shared with the broader organization]

Approvals

Post-Mortem Author: _____________ Date: _______

Reviewed By: _____________ Date: _______

Action Items Approved By: _____________ Date: _______

IR.5

Incident Response Roles

Incident Commander (IC)

Responsibility: Overall incident coordination and decision-making

  • Declares incident severity
  • Coordinates response team
  • Makes containment decisions
  • Manages stakeholder communication
  • Declares incident resolved

Assigned To: [On-call rotation or named individual]

Technical Lead

Responsibility: Technical investigation and remediation

  • Leads root cause investigation
  • Proposes and implements fixes
  • Coordinates with other engineers
  • Validates fix before deployment

Assigned To: [ML engineer on-call or model owner]

Communications Lead

Responsibility: Internal and external communication

  • Drafts status updates
  • Manages stakeholder notifications
  • Coordinates with PR if needed
  • Documents incident timeline

Assigned To: [Product manager or designated comms person]

Scribe

Responsibility: Real-time documentation

  • Records all actions and decisions
  • Maintains incident timeline
  • Captures evidence and screenshots
  • Provides input for post-mortem

Assigned To: [Any available team member]

Rehearsal Requirement: Run a tabletop exercise using this playbook at least once per quarter. Simulate a SEV-1 incident, assign roles, and walk through the response. Identify gaps before a real incident reveals them.

Appendix: Regulatory Traceability Matrix

This matrix maps regulatory requirements to specific phases, artifacts, and owners. It exists for audit readiness, not narrative prose.

Core Regulatory Mapping

Regulation / Standard | Requirement | Phase | Artifact(s) | Owner
EU AI Act (Art. 14) | Human oversight | 7, 9 | 7.2 Bias Checks, 9.1 Launch Review | Compliance Officer
EU AI Act (Art. 13) | Transparency & documentation | 1, 4 | 1.4 Documentation, 4.1 RACI | Product Manager
EU AI Act (Art. 9) | Risk management system | 3 | 3.4.3 Risk Assessment Matrix | Risk Manager
GDPR (Art. 17) | Right to erasure | 5, 8 | 5.3 Security Posture, 8.4 Privacy Validation | DPO
GDPR (Art. 22) | Automated decision rights | 7 | 7.2 Interpretability Checks | Legal Counsel
HIPAA (164.312) | Access controls & audit | 5 | 5.3 IAM Configuration | Security Engineer
HIPAA (164.530) | Data retention | 8 | 8.4 Data Retention Validation | Compliance Officer
FDA SaMD (21 CFR 820) | Design controls | 4 | 4.2 Pipeline Architecture | QA Manager
FDA SaMD | Change control | 11 | 11.4 Retraining Protocol | QA Manager
NIST AI RMF (Map) | Context & scope definition | 1, 2 | 1.3 Relationship Map, 2.1 Scope Definition | ML Lead
NIST AI RMF (Measure) | Performance monitoring | 11 | 11.2 Model Dashboards | ML Engineer
NIST AI RMF (Manage) | Risk response | 10, 11 | 10.2 Rollback Plans, 11.3 Incident Response | SRE Lead
ISO/IEC 42001 | AI management system | All | Full playbook compliance | CTO
ISO/IEC 23894 | AI risk management | 3, 7 | 3.4 Risk Assessment, 7.4 Security Testing | Risk Manager
Basel AI Guidance | Model risk management | 11, 12 | 11.4 Decay Detection, 12.3 Tech Debt | Model Risk Officer
SOC 2 Type II | Security controls | 5, 7 | 5.3 Security Posture, 7.4 Pen Testing | CISO
IEEE 2857 | Privacy engineering | 3, 8 | 3.4 Ethical Framework, 8.4 Privacy Validation | Privacy Engineer
Audit Preparation: For each row, the named owner must be able to produce the referenced artifact(s) within 24 hours of audit request. Artifacts without designated storage locations or owners are compliance gaps.
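
One way to make the 24-hour requirement checkable is to keep the matrix in machine-readable form and scan it for gaps. A minimal sketch, assuming each row records an owner and a storage location; the TraceabilityRow fields and the example storage path are illustrative, not a mandated schema.

from dataclasses import dataclass

@dataclass
class TraceabilityRow:
    regulation: str
    requirement: str
    phases: list
    artifacts: list
    owner: str = ""
    storage_location: str = ""  # where the artifact can be produced from on request

def compliance_gaps(rows):
    """Flag rows that cannot meet the 24-hour artifact-production requirement."""
    gaps = []
    for row in rows:
        if not row.owner:
            gaps.append((row.regulation, "no named owner"))
        if not row.storage_location:
            gaps.append((row.regulation, "no designated storage location"))
    return gaps

matrix = [
    TraceabilityRow("EU AI Act (Art. 14)", "Human oversight", [7, 9],
                    ["7.2 Bias Checks", "9.1 Launch Review"],
                    owner="Compliance Officer", storage_location="governance-repo/phase-7/"),
    TraceabilityRow("GDPR (Art. 17)", "Right to erasure", [5, 8],
                    ["5.3 Security Posture", "8.4 Privacy Validation"],
                    owner="DPO"),  # no storage location recorded -> compliance gap
]
print(compliance_gaps(matrix))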

Standards & Citations

This framework incorporates requirements and guidance from the following standards:

  1. NIST AI Risk Management Framework (AI RMF) 1.0
  2. ISO/IEC 23894:2023 — AI Risk Management
  3. ISO/IEC 23053:2022 — Framework for AI Systems using ML
  4. ISO/IEC 24028:2020 — AI Trustworthiness Overview
  5. ISO/IEC TR 24027:2021 — Bias in AI Systems
  6. ISO/IEC 24029-1:2021 — Neural Network Robustness
  7. IEEE 2857-2021 — Privacy Engineering for AI/ML
  8. IEEE 2858-2021 — Algorithmic Bias Considerations
  9. IEEE 3652.1 — Federated ML Architecture
  10. ISO 13485 — Medical Devices QMS
  11. IEC 62304 — Medical Device Software Lifecycle
  12. FDA GMLP — Good Machine Learning Practice
  13. ISO 21448 — SOTIF for Road Vehicles
  14. ISO 26262 — Automotive Functional Safety
  15. Basel Committee — AI Model Risk Guidance
  16. EU Artificial Intelligence Act
  17. OECD AI Principles
  18. China GB/T AI Standards
  19. ISO 9001:2015 — Quality Management Systems
  20. ISO/IEC 90003:2014 — Software Engineering Guidelines
  21. ISO/IEC 25010:2011 — Software Quality Models
  22. ISO/IEC 42001 — AI Management Systems
  23. IEEE 730 — Software Quality Assurance
  24. CMMI Level 3+ — Process Maturity
  25. ISO/IEC 17025:2017 — Testing Lab Competence

Glossary

Definitions for terms used throughout this playbook. Consistent terminology prevents miscommunication. If a term is used differently in your organization, document the mapping.


A

Agentic AI
AI systems that can autonomously take actions, use tools, or make multi-step decisions without human intervention at each step. Includes tool-using LLMs, autonomous agents, and multi-agent systems. See: Appendix AG
AI Act (EU)
European Union regulation establishing a legal framework for AI systems based on risk classification. High-risk AI systems must meet requirements for transparency, human oversight, accuracy, and robustness. Effective 2024-2026. Reference: Regulatory Matrix
Artifact
A documented deliverable produced during a phase of the playbook. Examples: Model Card, Risk Assessment Matrix, Runbook. Artifacts have named owners and version control.
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
A metric measuring a classification model's ability to distinguish between classes across all decision thresholds. Ranges from 0.5 (random) to 1.0 (perfect discrimination). Useful for comparing models, but it does not reflect performance at the real-world operating point.

B

Baseline
Reference performance metrics established before deployment or after initial production stabilization. Used to detect degradation and drift. Must be documented with measurement methodology.
Bias (Algorithmic)
Systematic errors in model outputs that disadvantage particular groups. Can arise from training data (historical bias), feature selection (proxy discrimination), or evaluation methodology. See: Phase 7.2, ISO/IEC TR 24027
Bus Factor
The minimum number of team members who would need to leave before a project becomes inoperable due to knowledge loss. A bus factor of 1 is a critical risk. This playbook requires bus factor ≥ 2 for production systems.

C

Canary Deployment
Deployment strategy in which a new model version receives a small percentage of traffic (typically 5-10%) while being monitored. Traffic increases gradually if metrics remain healthy. Enables early detection of issues without full exposure.
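
A minimal sketch of the traffic split itself, assuming each request carries a stable identifier; hashing the id keeps a given caller pinned to the same variant for the duration of the canary. The function name and the 5% figure are illustrative.

import hashlib

def canary_bucket(request_id, canary_percent):
    """Deterministically route a request to 'canary' or 'stable' by hashing its id."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable 0-99 bucket per request id
    return "canary" if bucket < canary_percent else "stable"

# Example: roughly 5% of traffic goes to the new model version.
routes = [canary_bucket(f"req-{i}", canary_percent=5) for i in range(1_000)]
print(routes.count("canary"), "of 1,000 requests routed to the canary")
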
Concept Drift
Change in the relationship between input features and target variable over time. Unlike data drift, concept drift means the underlying patterns have changed, not just the input distribution. Example: Customer behavior changing during pandemic. See: Phase 11.4
Cost Telemetry Contract (CT)
A mandatory agreement specifying which economic metrics must be tracked, who owns each metric, refresh cadence, and kill thresholds. Systems cannot ship without a complete CT. See: Cost Telemetry section
CRISP-DM
Cross-Industry Standard Process for Data Mining. A methodology defining six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. This playbook extends CRISP-DM with governance, operational, and economic controls.

D

Data Drift
Change in the statistical distribution of input features over time compared to training data. Measured using metrics like PSI (Population Stability Index) or KL Divergence. Does not necessarily indicate performance degradation but warrants investigation.
Datasheet (for Datasets)
Standardized documentation for datasets describing motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Based on Gebru et al. (2021). See: Model Cards & Datasheets section
Disparate Impact
A fairness measure comparing the selection rate for a protected group with the selection rate for a reference group. Under the "four-fifths rule", a ratio below 0.8 indicates disparate impact. Used in regulatory compliance.
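
A minimal calculation sketch of the four-fifths rule; the group names and counts are illustrative.

def disparate_impact_ratio(selected_protected, total_protected,
                           selected_reference, total_reference):
    """Selection-rate ratio used in the four-fifths rule."""
    protected_rate = selected_protected / total_protected
    reference_rate = selected_reference / total_reference
    return protected_rate / reference_rate

# Example: 30 of 100 protected-group applicants approved vs. 50 of 100 in the reference group.
ratio = disparate_impact_ratio(30, 100, 50, 100)
print(round(ratio, 2), "below four-fifths" if ratio < 0.8 else "within four-fifths")  # 0.6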

E

Embedding
A dense vector representation of data (text, images, etc.) learned by a neural network. Embeddings capture semantic relationships and are used in similarity search, RAG systems, and transfer learning.
Error Budget
The acceptable amount of error (downtime, incorrect predictions, etc.) over a defined period, derived from SLO targets. When error budget is exhausted, new deployments should pause until reliability improves.
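
A worked example of deriving the budget from an SLO target; the 99.9% figure and 30-day window are illustrative.

def error_budget_minutes(slo_target, window_days=30):
    """Downtime allowance implied by an availability SLO over a rolling window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes

# Example: a 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
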
Explainability
The degree to which a model's predictions can be understood by humans. Includes feature importance, decision paths, and counterfactual explanations. Required for high-risk AI systems under EU AI Act. See: ISO/IEC 24028

F

Feature Store
A centralized repository for storing, managing, and serving ML features. Ensures consistency between training and inference, enables feature reuse, and provides lineage tracking.
Fine-tuning
Adapting a pre-trained model to a specific task or domain by training on task-specific data. Common with LLMs and transfer learning. Introduces risks around training data quality and catastrophic forgetting.
Foundation Model
Large models trained on broad data that can be adapted to many downstream tasks. Examples: GPT, BERT, CLIP. Introduce supply chain risks as organizations depend on external model providers.

G

GDPR (General Data Protection Regulation)
EU regulation on data protection and privacy. Relevant to AI: Article 22 (automated decision-making), Article 17 (right to erasure), and requirements for lawful basis and transparency. See: Regulatory Matrix
GMLP (Good Machine Learning Practice)
FDA guidance for developing medical device software using AI/ML. Emphasizes multi-disciplinary expertise, good software engineering practices, representative data, independence of training and test sets, and reference standards.
Governance OS
The operating system of controls, processes, and accountability structures that ensure AI systems remain safe, compliant, and valuable over time. This playbook is a Governance OS. See: Governance OS section
Ground Truth
The correct label or outcome used to evaluate model predictions. Quality of ground truth directly bounds model quality. Sources include human annotation, authoritative records, and observed outcomes.

H

Hallucination
When a generative AI model produces confident but factually incorrect or fabricated information. Particularly dangerous in high-stakes domains. Cannot be eliminated, only mitigated through verification and grounding. See: LLM Risks L5
HIPAA (Health Insurance Portability and Accountability Act)
US law establishing requirements for protecting health information (PHI). AI systems processing PHI must comply with access controls, audit logging, and data retention requirements. See: Regulatory Matrix
Human Judgment Gate (HJG)
A step in this playbook that requires explicit human decision-making and cannot be automated. Indicated by the HJG badge. Examples: kill criteria definition, risk acceptance, bias evaluation approval.
Hypercare
A period of intensified monitoring and support immediately following production deployment. Typically 2-4 weeks. Characterized by lower thresholds for alerts, faster response times, and elevated staffing. See: Phase 9

I

Inference
The process of applying a trained model to new data to produce predictions. Distinguished from training. Inference cost, latency, and reliability are key production concerns.
Irreversibility Flag
A marker in this playbook indicating decisions that are costly or impossible to unwind once made. Requires extra scrutiny and explicit approval. Examples: data schema choices, model architecture selection.
ISO/IEC 42001
International standard for AI Management Systems. Specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system. Auditable certification available.

K

Kill Criteria
Pre-defined, measurable conditions under which a project or deployed system should be terminated. Must be established before significant investment. Requires named authority to execute. See: Economic Viability $3, Kill Criteria section
KL Divergence (Kullback-Leibler Divergence)
A measure of how one probability distribution differs from another. Used to detect data drift by comparing current input distribution to training distribution. Not symmetric: KL(P||Q) ≠ KL(Q||P).
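
A minimal sketch using scipy.stats.entropy, which returns the KL divergence when a second distribution is supplied; the binned distributions here are illustrative. Computing both directions shows the asymmetry.

import numpy as np
from scipy.stats import entropy

# Binned probability distributions, e.g. a training-time feature histogram (p)
# versus the current production histogram (q).
p = np.array([0.1, 0.2, 0.4, 0.3])
q = np.array([0.2, 0.2, 0.3, 0.3])

kl_pq = entropy(p, q)  # KL(P || Q), in nats
kl_qp = entropy(q, p)  # KL(Q || P), generally a different value
print(round(float(kl_pq), 4), round(float(kl_qp), 4))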

L

Latency
Time between receiving an inference request and returning a prediction. Measured at various percentiles (P50, P95, P99). Critical for user experience and often traded off against accuracy or cost.
LLM (Large Language Model)
Neural network models trained on large text corpora to generate human-like text. Examples: GPT-4, Claude, Llama. Introduce unique risks including hallucination, prompt injection, and context window limitations. See: LLM-Specific Risks appendix

M

MLOps
Practices for deploying and maintaining ML models in production reliably and efficiently. Encompasses CI/CD for ML, monitoring, versioning, and automation. This playbook provides a governance layer on top of MLOps.
Model Card
Standardized documentation for a trained model describing intended use, performance characteristics, limitations, and ethical considerations. Based on Mitchell et al. (2019). Required artifact before production. See: Model Cards section
Model Registry
A centralized repository for storing, versioning, and managing trained models. Enables model lineage, rollback, and audit. Essential infrastructure for governed ML.

N

NIST AI RMF (Risk Management Framework)
Framework from US National Institute of Standards and Technology for managing AI risks. Organized around Map, Measure, Manage, Govern functions. Voluntary but increasingly referenced in procurement and regulation. See: References

O

Ontology
A formal representation of concepts in a domain and the relationships between them. In this playbook, establishing ontology is Phase 1 — ensuring shared vocabulary before building. See: Phase 1
Override Rate
Percentage of model predictions that are overruled by human operators. High override rates may indicate low trust, poor model fit, or changing conditions. Tracked as executive-level signal. See: Executive Control Surface

P

Phase Exit Contract
A checklist of conditions that must be satisfied before proceeding to the next phase. Includes Truth, Economic, Risk, and Ownership contracts. Prevents premature advancement. See: Each phase section
Precision
The proportion of positive predictions that are correct: TP / (TP + FP). High precision means few false positives. Important when false positive cost is high.
Prompt Injection
An attack where adversarial text in input causes an LLM to ignore instructions or behave unexpectedly. Can occur directly (user input) or indirectly (retrieved documents). Requires input sanitization and privilege separation. See: LLM Risks L3
PSI (Population Stability Index)
A metric for measuring distribution shift between two datasets. PSI < 0.1 indicates no significant change; 0.1-0.25 indicates moderate shift; > 0.25 indicates major shift requiring investigation.

R

RACI Matrix
A responsibility assignment chart defining who is Responsible (does work), Accountable (final authority), Consulted (input required), and Informed (kept updated) for each activity. See: Template T.1
RAG (Retrieval-Augmented Generation)
An architecture where an LLM's responses are grounded by retrieving relevant documents from an external knowledge base. Reduces hallucination but introduces retrieval quality as a failure mode. See: Agentic AI section
Recall
The proportion of actual positives that are correctly identified: TP / (TP + FN). High recall means few false negatives. Important when missing positive cases is costly (e.g., fraud detection, medical diagnosis).
Red Team
A group that tests systems by simulating adversarial attacks. For AI, includes prompt injection, jailbreaking, bias elicitation, and edge case discovery. Required in Phase 7. See: Phase 7.3
Rollback
Reverting to a previous known-good version of a model or system. Must be testable, fast, and available without requiring the engineer who deployed the current version. A key incident response capability.
Runbook
A documented set of procedures for operating a system, including common tasks, troubleshooting steps, and incident response. Must be usable by someone who did not write it. See: Phase 10.3

S

SaMD (Software as a Medical Device)
Software intended to be used for medical purposes without being part of a hardware medical device. AI/ML in healthcare often qualifies. Subject to FDA regulation in US, MDR in EU. See: Regulatory Matrix
Shadow Deployment
Running a new model version in parallel with production, receiving real traffic but not affecting user-facing decisions. Enables comparison without risk. Precedes canary deployment. See: Phase 8
SLA (Service Level Agreement)
A commitment defining the expected level of service (uptime, latency, accuracy). Contractual between provider and consumer. Breaches may have financial or contractual consequences.
SLO (Service Level Objective)
An internal target for service quality, typically more stringent than SLA. Provides buffer before SLA breach. Used to guide engineering priorities and error budget allocation.
Stop Authority
A named individual with the power and obligation to halt a project or system when kill criteria are met. Must be able to act without political permission. See: Kill Criteria section

T

Technical Debt
The implied cost of rework caused by choosing quick solutions over better approaches. In ML, includes hardcoded thresholds, undocumented preprocessing, and missing tests. Accumulates interest. See: Phase 12.3
Telemetry
The automated collection and transmission of measurements from a system. For AI, includes inference metrics, resource usage, and business outcomes. Foundation for monitoring and governance.
Threshold
A decision boundary that converts model scores into actions (approve/reject/review). Selection involves trade-offs between precision and recall. Must be documented with rationale.
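
A minimal sketch of threshold selection against a precision floor, using scikit-learn's precision_recall_curve; the labels, scores, and 0.75 floor are illustrative, and any production threshold still needs its rationale documented as noted above.

import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_precision):
    """Return the lowest score threshold that meets the precision floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the final point to align.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision:
            return float(t), float(p), float(r)
    return None  # no threshold satisfies the precision floor

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.8, 0.45, 0.9])
print(pick_threshold(y_true, scores, min_precision=0.75))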

U

UAW (Unit of AI Work)
A standardized measure of AI system output for cost accounting. Defined specifically for each use case. Examples: one prediction, one document processed, one conversation turn. Basis for economic viability calculations. See: Economic Viability section
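
A minimal cost-per-UAW sketch; the dollar figure and document count are illustrative.

def cost_per_uaw(total_monthly_cost, uaw_count):
    """Blended cost of one Unit of AI Work (inference, infrastructure, support)."""
    return total_monthly_cost / max(uaw_count, 1)

# Example: $18,000/month all-in cost across 1.2M processed documents.
print(round(cost_per_uaw(18_000, 1_200_000), 4))  # 0.015 dollars per document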

V

Validation
Testing a model's performance on held-out data to estimate real-world performance. Distinguished from verification (does the system meet specifications) and testing (does the code work). See: Phase 7
Version Control
Systematic tracking of changes to code, data, models, and configuration. Essential for reproducibility, rollback, and audit. All artifacts in this playbook must be version-controlled.

Standards Quick Reference

Standard | Full Name | Domain | Key Focus
ISO/IEC 42001 | AI Management Systems | All industries | AI governance framework, certifiable
ISO/IEC 23894 | AI Risk Management | All industries | Risk identification and treatment
NIST AI RMF | AI Risk Management Framework | All industries | Map, Measure, Manage, Govern
EU AI Act | Artificial Intelligence Act | All industries (EU) | Risk-based regulation, prohibited uses
FDA GMLP | Good Machine Learning Practice | Healthcare | Medical device AI development
Basel AI Guidance | Model Risk Management for AI | Financial services | Banking AI risk management
IEEE 2857 | Privacy Engineering for AI/ML | All industries | Privacy-preserving AI design
SOC 2 Type II | Service Organization Controls | Technology services | Security, availability, confidentiality