From Migration Factory to Migration Control Plane

Published On: May 13, 2026

Executive Summary

You do not have a migration theory problem. You have a migration capacity and authority problem.

Most enterprises already know the broad mechanics of moving applications from private data centers to public cloud platforms. The harder problem is execution at scale. There are not enough developers to rewrite every application manually. There are not enough infrastructure and platform engineers with deep development knowledge to inspect every workload. Systems integrators face the same constraint when they inherit incomplete inventories, uneven test coverage, unavailable application SMEs, and inconsistent runbooks.

LLMs can help, but only if they are used inside the right operating model.

This is not an AI migration factory. It is a migration control plane that uses AI only where developer-like adaptation is required.

In this model, the playbook drives execution, deterministic code performs known transformations, validators define completion, humans handle authority boundaries and exceptions, and LLMs are called only for bounded tasks that require interpretation, code adaptation, or remediation.

The core loop is simple:

LLM proposes. Playbooks constrain. Deterministic code enforces. Landing zone validates. Traceability records authority.

This model is based on three foundations. The first is the Decision Authority Placement Model (DAPM): authority must be placed deliberately across the migration process. The second is practical migration experience: 28 years of data center, infrastructure, and cloud migration work show that migrations fail as much from operating model mismatch as from code or infrastructure defects. The third is Google’s TensorFlow-to-JAX migration pattern, which demonstrates that large-scale AI-assisted migration becomes credible when probabilistic agents operate inside deterministic planning, playbook, validation, and evidence boundaries.

As enterprises experiment with agent factories and code-migrating LLMs, this paper provides the control-plane and authority model required to let those agents safely touch production-adjacent migration work.

The practical question is not whether an enterprise can use LLMs in migration. The practical question is:

What authority can we safely give the system today?

If the organization does not have enough rigor to manage playbooks as governed artifacts, adding an LLM will not create that rigor. It will only automate the absence of it.

Who This Paper Is For

This paper is for enterprise cloud, platform, and migration leaders — and the systems integrators who support them — who already understand cloud migration but need a safer way to scale repeatable migration work with LLM assistance.

The primary reader is responsible for moving a large application estate without unlimited developers, perfect application knowledge, or uniform test coverage. That includes cloud platform leaders, enterprise architects, infrastructure leaders, migration factory owners, application modernization leads, and senior cloud transformation teams.

The secondary reader is the systems integrator or cloud professional services team helping clients move from migration labor to migration operating model transformation. For these teams, the opportunity is not to provide more migration hands. It is to help enterprises build the migration control plane that makes scarce hands scale.

This paper is also for governance, security, and risk leaders who need to understand where authority lives, what gets validated, and how failures are traced back into the system.

This paper is not for teams trying to learn basic cloud migration. It is for organizations asking a harder question: how do we safely automate repeatable migration patterns across a large estate without giving agents uncontrolled authority?

1. From AI Migration Factory to Migration Control Plane

The goal is not to remove humans from cloud migration. The goal is to reserve scarce human judgment for the decisions that actually require it.

A traditional migration model often assumes that experienced humans will inspect the application, understand the code, identify platform gaps, rewrite what is needed, validate behavior, interpret failures, and decide when the application is ready for the target environment. That model breaks when the application estate is large, the development knowledge is uneven, and the migration timeline is compressed.

The migration engineer often knows what needs to change operationally but may not have the development depth to safely inspect every code modification by hand. That is already true with human developers. The migration engineer identifies the task, delegates the implementation, and validates the result using runbooks, tests, deployment checks, and operational evidence.

An LLM-assisted process should follow the same concept. The migration engineer does not need to personally prove every line of code is correct by inspection. The engineer needs a controlled process that can verify whether the change satisfies the migration requirement.

The migration process should not be described as LLM-driven. It is playbook-driven.

The playbook is the controlling artifact. It defines the sequence of steps, deterministic actions, validation gates, evidence requirements, and escalation rules. The LLM is called only when the playbook reaches a task that requires interpretation, code adaptation, ambiguity handling, or developer-like remediation.

The LLM should be treated the way a senior migration engineer would treat a developer: a callable capability for specific work, not the owner of the migration. The engineer may ask for help refactoring a configuration loader, interpreting a failed test, adapting framework-specific startup behavior, or proposing a candidate patch. The engineer does not ask the developer to own the entire migration process.

The same boundary applies to the LLM. The playbook owns the process. The scheduler owns routing. Deterministic validators own completion. Humans own exceptions and authority changes. The LLM owns only the bounded task it was called to perform.

Agentic loops become useful when the exit condition is outside the agent.

The Control Plane

A cloud migration loop becomes trustworthy when the loop has machine-enforceable playbooks and deterministic exit conditions. The agent is not trusted because it is smart. The agent is useful because it can search the solution space. The system is trustworthy because external validators determine whether the proposed migration is acceptable.

The migration control plane includes:

  • Scheduler / intake inspection
  • Playbook catalog
  • Validator catalog
  • Landing-zone profile
  • Evidence store
  • Exception process
  • Human review boundaries
  • RCA feedback loop

These components decide what the loop is allowed to do, when the loop is allowed to stop, and how the system learns when the loop was wrong.

Authority is distributed across the system:

  • Scheduler: classify and route
  • Playbook: define the approved execution path
  • Deterministic code: execute known transformations
  • Validator: accept or reject completion
  • Landing zone: accept or reject deployment fit
  • Human reviewer: approve exceptions and unresolved ambiguity
  • RCA loop: improve the system after failure


A simple control-plane flow looks like this:

  Application Intake
  → Scheduler
  → Playbook Catalog
  → Deterministic Actions
  → LLM-as-Needed Tasks
  → Validators
  → Evidence Store
  → Human Exceptions
  → RCA Feedback Loop
  → Updated Scheduler / Playbooks / Validators
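As a concrete sketch, this flow can be expressed as a small orchestration loop. Every name and structure below is an illustrative assumption, not a prescribed API; the point is that routing, step execution, validation, and evidence capture all live outside the LLM.

```python
def run_control_plane(app, classify, playbook, validators, evidence):
    """Route one application: classify, execute steps, validate, record."""
    decision = classify(app)                  # deterministic routing
    evidence.append(("scheduler", decision))
    if decision != "known_pattern":
        return "human_review"                 # no automation authority granted

    for name, step in playbook:               # the playbook owns the sequence
        ok = step(app)                        # deterministic or LLM-backed step
        evidence.append((name, ok))
        if not ok:
            return "exception"                # the exit condition is external

    done = all(check(app) for check in validators)
    evidence.append(("validators", done))
    return "complete" if done else "exception"
```

Note that the loop never asks an agent whether it is done; completion is decided by the validators that were passed in.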


This paper is not a primer on how to perform cloud migrations. It is a primer on how to automate cloud migrations at scale once the organization already has enough migration discipline to describe, govern, and validate repeatable work.

2. Readiness Model: What Authority Can the System Safely Hold?

Playbook-driven LLM-assisted migration assumes the organization can already describe how certain classes of workloads should move, what the target landing zone requires, and how success will be validated.

Readiness should not be treated as a binary yes/no. Most organizations will be ready for some parts of the model before they are ready for governed automation. The practical question is not “Can we use LLMs in migration?” The better question is: what authority can we safely give the system today?

Level 1: Not Ready for Automation

At this level, the organization lacks the operating foundation for automated migration. Common indicators include unreliable application inventory, unclear application ownership, incomplete or inconsistent CMDB data, no stable landing-zone standard, limited test coverage, few repeatable migration runbooks, informally known policy requirements, relationship-based exception handling, no consistent evidence retention, and no RCA feedback process.

LLMs can help with application documentation cleanup, interview summarization, migration assessment support, dependency discovery assistance, drafting candidate runbooks for human review, and creating first-pass application summaries. They should not be used for automated code transformation, scheduler-driven playbook assignment, autonomous remediation, or production-impacting migration decisions.

At this level, adding an LLM to execution will accelerate ambiguity. The priority is building migration operating discipline.

Level 2: Ready for Assisted Assessment

At this level, the organization has some migration discipline, but not enough governed playbook maturity to grant transformation authority. Application inventory exists but needs cleanup. Landing-zone standards exist but are not fully machine-enforceable. Some migration runbooks exist, and some application patterns repeat.

LLMs can help with pattern discovery, application assessment reports, dependency and configuration summaries, draft playbook generation, gap analysis against landing-zone standards, assessment-only scheduler mode, and human-supervised pilot support. Broad automated transformation, unsupervised playbook execution, automatic exception remediation, and production promotion based on LLM-assisted changes alone are not yet appropriate.

This is where many enterprises should begin. The system can gather evidence, classify likely patterns, and help humans mature repeatable playbooks. It should not yet own transformation authority.

Level 3: Ready for Constrained Automation

At this level, the organization has enough governed artifacts to automate narrow migration paths under defined authority boundaries. Application inventory is reliable enough for scheduler input. Landing-zone capabilities are defined and inspectable. Approved playbooks exist for specific workload classes. Deterministic detection and validation rules exist. Evidence capture is standard. Human review boundaries and exception workflows are explicit.

LLMs can perform scoped developer-like remediation inside approved playbooks, test failure interpretation, candidate patch generation, and configuration refactoring when deterministic code cannot safely complete the step. Human review remains required for exceptions, low-confidence cases, and production-impacting changes.

This is the first level where playbook-driven LLM-assisted migration becomes operationally useful beyond assessment.

Level 4: Ready for Governed Automation at Scale

At this level, the organization can run a migration factory with playbook-driven automation across known workload classes. The approved playbook catalog exists. The validator catalog exists with owners and versions. Scheduler decisions are traceable and replayable. The landing-zone profile is machine-readable and current. The evidence store is integrated into migration governance. The playbook lifecycle is enforced. Human approvals and overrides are traceable. The RCA feedback loop updates the control plane.

At this level, the organization is not trusting the LLM to migrate applications. It is trusting the migration control plane to decide when, where, and how the LLM may be used.

Adoption Path

The practical adoption path should be incremental:

Assess applications

→ identify repeatable patterns

→ draft playbooks

→ pilot deterministic validators

→ run supervised LLM-assisted remediation

→ approve narrow playbooks

→ expand automation authority only with evidence

→ feed failures back through RCA


The maturity question for each organization is:

  • What can we deterministically inspect?
  • What can we deterministically transform?
  • What can we deterministically validate?
  • Where do we need LLM-assisted developer-like work?
  • Where must human authority remain in the loop?


Automation amplifies the operating model you already have. It does not create one for you.

How Authority Flows Through the Control Plane

The authority model has four related layers.

  • Organization readiness: What is the maximum authority envelope the organization can safely grant the system?
  • Scheduler confidence: What authority should this specific workload receive inside that envelope?
  • Playbook lifecycle: Is this migration pattern mature enough to be executed, piloted, or only used as guidance?
  • Validators: What evidence is sufficient to declare this step or migration complete?

These layers prevent authority from silently accumulating in the LLM. A mature organization may be ready for governed automation, but a low-confidence workload still drops to assessment-only mode. A scheduler may identify a known pattern, but a draft playbook cannot drive automated execution. A playbook may execute successfully, but validators determine whether the work is done.

3. Scheduler / Intake Inspection

The first inspection at the scheduler layer is: does the candidate application conform to a known migration pattern?

This is the first routing decision. Before the agent attempts migration, the scheduler must determine whether the application matches a known pattern, partially matches a known pattern, or falls outside the current playbook library.

This inspection is not merely descriptive. It determines which playbooks are eligible, which validators must run, whether the migration can proceed automatically, and whether human review is required before any transformation begins.

Scheduler Responsibilities

The scheduler should inspect the application and classify it before assigning work to an agent loop. Initial scheduler responsibilities include:

  • Identifying application type and runtime pattern
  • Detecting framework, language, build system, and packaging model
  • Identifying deployment model, such as VM, container, Kubernetes, batch job, or serverless candidate
  • Detecting stateful dependencies
  • Detecting identity and secrets patterns
  • Detecting network assumptions
  • Detecting storage and persistence model
  • Detecting observability and logging patterns
  • Detecting external service integrations
  • Determining whether the app conforms to a known migration archetype
  • Selecting the appropriate playbook or playbook chain
  • Escalating unknown or ambiguous patterns for human classification

Scheduler Determinism

The scheduler should be deterministic wherever it is making an authority-bearing routing decision.

That does not mean every input must be perfect or every classification must be certain. It means the scheduler’s decision process must be repeatable, inspectable, and based on declared rules rather than hidden model judgment.

A scheduler may use an LLM to summarize an application, extract candidate facts, or explain findings. But the scheduler should not rely on the LLM as the final classifier for playbook assignment. The routing decision should be made by deterministic rules over inspectable evidence.

  • LLM assists inspection.
  • Deterministic scheduler classifies.
  • Human reviews ambiguity.
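A minimal sketch of that division of labor follows. The evidence fields and rule names are hypothetical assumptions: an LLM may have helped produce the evidence dictionary, but the routing decision itself is a plain, replayable function.

```python
# Example disqualifying rule; the runtime name is a placeholder assumption.
PROHIBITED_RUNTIMES = {"legacy-mainframe"}

def classify(evidence: dict) -> str:
    """Deterministic routing from declared rules over inspectable evidence."""
    if evidence.get("runtime") in PROHIBITED_RUNTIMES:
        return "prohibited"
    required = ("runtime", "deployment_model", "state_model")
    if any(evidence.get(key) is None for key in required):
        return "human_review"            # missing evidence grants no authority
    if evidence["deployment_model"] == "vm" and evidence["state_model"] == "stateless":
        return "known_pattern"
    return "unknown_pattern"
```

Because the rules are declared data and code, the same evidence always yields the same routing decision, and the decision can be replayed during audit or RCA.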


The scheduler’s job is not to understand the application like a human architect. Its job is to determine whether enough evidence exists to route the application into a known playbook, a constrained pilot path, assessment-only mode, or human review.

Scheduler Inputs

The scheduler should classify an application using concrete inspection artifacts:

  • Repository metadata
  • Dependency manifests
  • Build files and runtime version files
  • Dockerfiles or container manifests
  • Kubernetes manifests
  • Terraform or other IaC
  • CI/CD pipeline definitions
  • Application configuration files
  • Static analysis and secrets scanning results
  • Network dependency scan
  • Runtime inventory or CMDB data
  • Observability and logging configuration
  • Existing test coverage
  • Application owner questionnaire, if needed
  • Landing-zone capability inventory

Pattern Classification

The scheduler’s first useful output is a pattern classification.

Known Pattern. The application conforms cleanly to an existing migration pattern. For example, a stateless Java Spring Boot application running on a VM with an external database, standard HTTP ingress, and environment-based configuration may be a candidate for containerization and deployment into an approved runtime. The scheduler assigns the known migration playbook, runs deterministic preflight checks, and allows the migration loop to begin transformation within approved scope.

Known Pattern with Exceptions. The application mostly conforms to a known pattern but has specific exceptions. For example, a stateless web app may store secrets in local files, use local-only logging, or depend on a hardcoded internal hostname. The scheduler assigns the primary migration playbook, attaches exception-specific remediation playbooks, requires additional validation gates, and may require human review before deployment.

Unknown Pattern. The application does not conform to any known pattern in the migration library. Examples include custom runtimes, undocumented binary dependencies, shared filesystem assumptions, unclear state models, and undocumented integration paths. The scheduler does not begin automated transformation. It generates an inspection report, routes to a human architect or migration engineer, and may create a candidate pattern for future playbook development if the pattern recurs.

Prohibited Pattern. The application matches a known pattern that is not approved for automated migration. Examples include regulated workloads requiring manual security review, unsupported runtimes, legacy apps with direct hardware dependencies, applications with unclear data classification, and workloads requiring unavailable landing-zone capabilities. The scheduler stops the loop, produces an evidence report, and escalates to governance or architecture review.

Degraded-Confidence Scheduler Mode

The scheduler should not be written as if classification always resolves cleanly. In practice, this is one of the first places the model will break. Application evidence will be incomplete. Repository structure will be inconsistent. CMDB data may be stale. Owners may misunderstand their own dependencies. Static analysis may detect symptoms without understanding architectural intent.

A deterministic scheduler still needs a degraded-confidence operating mode. Degraded confidence does not mean the scheduler failed. It means the scheduler produced enough evidence to continue the assessment, but not enough evidence to grant full automation authority.

The scheduler should support several confidence states:

  • High confidence: Known pattern, full required evidence present, no disqualifying rules triggered.
  • Medium confidence: Known pattern with exceptions, evidence mostly complete, warnings require remediation playbooks or human confirmation.
  • Low confidence: Possible pattern match, but required evidence is missing or contradictory.
  • No confidence: Unknown pattern or insufficient evidence to classify.
  • Prohibited: Known disqualifying rule triggered.


The important point is that confidence changes the authority granted to the downstream process.

High Confidence / Automated Candidate. The application is eligible for approved playbook execution. The migration loop may run within approved scope. Human review is not required until an exception, failed validator, or production release gate appears.

Medium Confidence / Constrained Automation. The application is eligible for supervised or constrained playbook execution. The migration loop may run detection, remediation, and test-environment transformation only. Human review is required before production-impacting changes or deployment promotion.

Low Confidence / Assessment-Only Mode. The system should not transform yet. Agents or deterministic tools may inspect, summarize, and gather additional evidence. Human review is required before any code or infrastructure changes.

No Confidence / Human Classification Required. The automated migration path stops. No transformation authority is granted. Human review is required to classify the pattern or create a candidate draft playbook.

Prohibited / Governance Stop. The process stops. No automation authority is granted. Human review is required only for exception or policy review.
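These grants can be sketched as a fail-closed lookup. The state and authority names mirror the descriptions above; the lookup structure itself is an illustrative assumption.

```python
# Fail-closed mapping from scheduler confidence to granted authority.
AUTHORITY_BY_CONFIDENCE = {
    "high":       "automated",            # approved playbook execution
    "medium":     "constrained",          # test scope only; human gate before promotion
    "low":        "assessment_only",      # inspect and summarize; no changes
    "none":       "human_classification", # stop; a human classifies the pattern
    "prohibited": "governance_stop",      # stop; governance or policy review
}

def granted_authority(confidence: str) -> str:
    # Fail closed: anything unrecognized receives no automation authority.
    return AUTHORITY_BY_CONFIDENCE.get(confidence, "governance_stop")
```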

A deterministic scheduler does not have to pretend it knows. It has to be explicit about what it knows, what it does not know, and what authority that uncertainty permits.

Scheduler Traceability and Inspectability

Every scheduler decision must generate a trace record. At minimum, that record should include the scheduler run ID, application ID, migration request ID, scheduler version, rule set version, playbook catalog version, landing-zone profile version, input artifacts inspected, evidence extracted, rules evaluated, classification result, confidence state, playbooks considered, playbooks selected or rejected, automation authority granted, additional evidence requested, escalation reason, human reviewer if applicable, and timestamp.

The trace record should make it possible to replay or inspect the decision later. The scheduler is not just plumbing. It determines what kind of automation is allowed to touch the application.
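A minimal sketch of such a record, covering a subset of the fields above, might look like the following. The schema and field names are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SchedulerTrace:
    """A subset of the scheduler trace fields; the structure is illustrative."""
    run_id: str
    application_id: str
    scheduler_version: str
    rule_set_version: str
    classification: str
    confidence: str
    authority_granted: str
    playbooks_selected: list = field(default_factory=list)
    timestamp: str = ""

def to_evidence_record(trace: SchedulerTrace) -> str:
    # Serialized deterministically so the decision can be replayed later.
    return json.dumps(asdict(trace), sort_keys=True)
```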

4. Playbooks as Governed Automation Artifacts

There are multiple playbooks in a cloud migration. They should not be treated as one giant migration prompt. Each playbook should encode a specific pattern, transformation, or validation concern.

Candidate playbook categories include:

  • Application pattern playbooks: stateless web application migration, stateful application migration, batch job migration, API service migration, event-driven application migration, legacy monolith modernization, containerization candidates, serverless candidates, and Kubernetes candidates.
  • Platform conformance playbooks: identity and access management, secrets management, logging and observability, network ingress and egress, DNS and certificates, storage and persistence, backup and recovery, tagging and cost allocation, encryption and key management, deployment pipeline integration, runtime configuration, and policy-as-code conformance.
  • Remediation playbooks: hardcoded secrets remediation, local file dependency remediation, hardcoded hostname remediation, local logging remediation, database connection modernization, runtime version upgrades, dependency replacement, configuration externalization, and IAM or service account migration.
  • Validation playbooks: unit test validation, integration test validation, behavioral equivalence validation, infrastructure-as-code validation, security policy validation, runtime smoke testing, observability validation, performance baseline validation, and rollback validation.

A playbook is stronger than a markdown file when it is machine-enforceable through deterministic code. A markdown instruction can describe the desired state. A governed playbook defines detection logic, required transformation, validation logic, exit condition, evidence produced, and escalation rules.

Example structure:

Playbook ID: secrets-management-gcp-001

Policy Source:

  • All cloud workloads must use approved managed secrets storage.

Detect:

  • .env files committed to repo
  • Hardcoded credential-like strings
  • Kubernetes Secret manifests in app repo
  • Local config files containing secret-like values

Transform:

  • Move secrets to approved secrets manager
  • Update app bootstrap code
  • Update IAM/service account permissions
  • Update deployment config

Validate:

  • No secret-like values remain in repo
  • Secret manager resource exists in IaC
  • Workload identity can access required secret
  • App starts successfully without local secret file
  • Policy scanner passes

Exit Condition:

  • All validation checks pass.

Evidence:

  • Rule ID
  • Files inspected
  • Findings
  • Changes applied
  • Test results
  • Policy result
  • Reviewer or exception owner
  • Timestamp
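The same playbook can be sketched as a machine-readable structure whose exit condition is computable rather than described. The schema below is an assumption for illustration, not a required format.

```python
# The secrets playbook as data; keys mirror the sections of the example above.
SECRETS_PLAYBOOK = {
    "id": "secrets-management-gcp-001",
    "policy_source": "All cloud workloads must use approved managed secrets storage.",
    "detect": [
        "env_file_in_repo",
        "hardcoded_credential_strings",
        "k8s_secret_manifest_in_repo",
        "secret_like_config_values",
    ],
    "validate": [
        "no_secret_like_values_in_repo",
        "secret_manager_resource_in_iac",
        "workload_identity_can_access_secret",
        "app_starts_without_local_secret_file",
        "policy_scanner_passes",
    ],
    "escalation": "human_review_on_repeated_failure",
}

def is_complete(validation_results: dict) -> bool:
    """Exit condition: every declared validation must explicitly pass."""
    return all(validation_results.get(v) is True for v in SECRETS_PLAYBOOK["validate"])
```

Because the exit condition is a function over declared validations, "done" is something the control plane computes, not something an agent asserts.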


5. Playbook Execution: Deterministic First, LLM as Needed

Each playbook step should first ask: can this action be performed deterministically?

If the answer is yes, the playbook should use deterministic code rather than the LLM. Deterministic code is faster, cheaper, more repeatable, easier to test, and easier to audit.

Examples include detecting committed .env files, identifying secret-like values using scanners, creating approved secret manager resources, updating environment variable references in deployment manifests, rewriting known configuration keys, updating Terraform modules, adding required tags or labels, validating IAM bindings, checking whether logs reach the approved sink, and running unit tests, smoke tests, policy checks, and deployment checks.
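As a sketch, the first two detections can be implemented deterministically in a few lines. The regular expression here is a deliberately simplified assumption; a real factory would use a dedicated secrets scanner.

```python
import re

# Credential-like assignment keys; this pattern is a simplified assumption.
SECRET_KEY_PATTERN = re.compile(r"(password|secret|api[_-]?key|token)\s*[=:]", re.I)

def detect_secret_findings(files: dict) -> list:
    """files maps repo-relative path -> content; returns (rule, path) findings."""
    findings = []
    for path, content in sorted(files.items()):
        if path.endswith(".env"):
            findings.append(("env-file-committed", path))
        if any(SECRET_KEY_PATTERN.search(line) for line in content.splitlines()):
            findings.append(("credential-like-value", path))
    return findings
```

The same detector runs identically on every repository, which is exactly the property an authority-bearing check needs.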

LLM-as-Needed Execution

The LLM should be called when the playbook reaches a task that cannot be safely handled by deterministic code alone. Examples include interpreting unfamiliar application logic, refactoring custom configuration loading code, adapting framework-specific code to a new runtime contract, explaining why a test failed after migration, proposing a code change when deterministic remediation cannot infer the right transformation, mapping undocumented behavior to a known migration pattern, and generating candidate patches for human or deterministic review.

The LLM call should be scoped to a specific task, bounded by the playbook, and followed by deterministic validation.

  • Step: Refactor configuration bootstrap
  • Deterministic pre-check: local config file dependency detected
  • LLM task: propose code patch to read non-secret config from environment variables and secrets from approved secret accessor
  • Constraints: do not change business logic; do not introduce new dependency outside approved list; preserve existing config key names where possible
  • Validation: unit tests pass; app starts without local config file; secret retrieval integration test passes
  • Exit condition: deterministic validation passes


The LLM may perform developer-like tasks, but the acceptance model remains the same: the change must pass the defined validation gates.
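That acceptance model can be sketched in a few lines. `propose_patch` stands in for any bounded LLM call; the gates and retry limit are illustrative assumptions.

```python
def accept_change(app_state, propose_patch, gates, max_attempts=3):
    """Accept an LLM-proposed patch only when every deterministic gate passes."""
    for attempt in range(max_attempts):
        candidate = propose_patch(app_state, attempt)   # bounded LLM task
        if all(gate(candidate) for gate in gates):      # exit condition is external
            return candidate
    return None   # no convergence: escalate to human review
```

The retry limit matters: repeated non-converging remediation is itself a signal that the task belongs at the human review boundary.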

Validation Without Code Expertise

The migration engineer may not be able to inspect the code deeply enough to know whether the implementation is elegant, idiomatic, or maintainable. But they can validate whether the migration requirement has been satisfied.

Requirement: Application no longer reads secrets from local files.

Validation:

  • No local secret file reads detected
  • Secret manager access path exists
  • Service account has required permission
  • App starts successfully without local secret file
  • Integration test retrieves secret at runtime
  • Policy scanner passes


The migration engineer is not validating the code by taste. They are validating the operational outcome.
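A sketch of that outcome-based acceptance follows. The check names mirror the validation list above; the evidence dictionary is assumed to be produced by deterministic tooling, not by the LLM that made the change.

```python
# Outcome checks for the requirement; names are illustrative assumptions.
OUTCOME_CHECKS = [
    "no_local_secret_file_reads",
    "secret_manager_access_path_exists",
    "service_account_has_permission",
    "app_starts_without_local_secret_file",
    "runtime_secret_retrieval_passes",
    "policy_scanner_passes",
]

def requirement_satisfied(evidence: dict):
    """Return (done, failed_checks) from deterministic operational evidence."""
    failed = [check for check in OUTCOME_CHECKS if not evidence.get(check)]
    return (len(failed) == 0, failed)
```

Returning the failed checks, not just a boolean, gives the exception process something concrete to route to a human.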

Review Boundary

Some changes still require developer or architect review. Examples include business logic changes, data model changes, concurrency changes, authentication or authorization logic changes, error-handling behavior changes, performance-sensitive code paths, code changes that pass tests but alter user-visible behavior, and repeated LLM remediation attempts that do not converge.

The playbook should distinguish between changes that can be accepted through deterministic validation and changes that require expert review.

Playbook Execution Trace

Because the playbook is the driver, the execution trace should record each step as a playbook action, not as a free-form agent conversation. Minimum playbook execution fields include playbook execution ID, application ID, playbook ID and version, step ID, step type, execution mode, inputs used, artifacts changed, deterministic tool or script version, LLM model and prompt if used, human reviewer if used, validation result, evidence produced, retry count, exit condition status, and timestamp.

This makes the migration inspectable as a controlled process rather than a transcript of agent behavior.

6. Case Study Walkthrough: Stateless Java Application with Secrets and Logging Exceptions

Scenario

A migration factory is moving a VM-hosted Java Spring Boot application from a private data center to an approved public cloud landing zone.

The application has:

  • External database dependency
  • HTTP ingress
  • Local configuration files
  • Local file-based logging
  • Credentials stored in application.properties

The target landing-zone pattern requires:

  • Containerized deployment to an approved runtime
  • Managed identity
  • Secrets stored in the approved secret manager
  • Structured logs emitted to the approved logging path
  • Required tags and ownership metadata
  • Deployment through an approved CI/CD pipeline
  • Policy-as-code checks before promotion

Walkthrough

1. Scheduler inspection. The scheduler inspects repo metadata, pom.xml, application config, deployment scripts, CMDB/runtime inventory, static analysis, secrets scans, and the landing-zone profile. Result: the app is classified as Known Pattern with Exceptions.

2. Playbook assignment. The scheduler assigns stateless-java-webapp-cloudrun-001 plus secrets-remediation-gcp-001 and local-logging-remediation-gcp-001. Result: automation authority is constrained to test scope; human review is required before staging or production promotion.

3. Deterministic transformation. The secrets playbook detects credential-like values, classifies secret versus non-secret config, creates secret manager resources, updates IaC and service account permissions, and removes committed secret-like values. The logging playbook updates deployment configuration for structured stdout logging and validates log delivery. Result: known transformations are performed without calling the LLM.

4. LLM-as-needed remediation. The playbook detects custom configuration bootstrap code that cannot be safely updated with deterministic replacement rules. Result: the LLM is called only for a scoped candidate patch.

5. Validation. Validators run old-world behavior tests and new-world landing-zone conformance tests. Result: build, unit, API, database, secrets, logging, metadata, policy, image, and deployment checks pass.

6. Human review boundary. The medium-confidence classification requires human review before promotion. Result: the human approves the evidence and authority boundary, not every line of code by inspection.

7. RCA after missed failure. Later, the app fails under a secret manager outage because the refactored startup sequence blocks indefinitely. Result: RCA classifies the issue as a missing validator and a playbook assumption gap.

The LLM task in this walkthrough is deliberately narrow:

  • Task: Refactor configuration bootstrap
  • Context: The app currently reads secrets and environment configuration from local application.properties.
  • Constraint: Do not change business logic.
  • Constraint: Preserve existing config key names where possible.
  • Constraint: Use approved secret accessor and managed identity.
  • Constraint: Do not introduce unapproved dependencies.
  • Output: Candidate patch only.
  • Validation: deterministic tests and policy gates decide acceptance.
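The task envelope above can be represented as a small immutable record so that the constraints are data, not prose buried in a prompt. This is a sketch under stated assumptions: the class, field names, and rendering function are hypothetical, and only the task content comes from the list above.

```python
# Hypothetical representation of the bounded LLM task envelope above.
# The envelope, not the model, defines scope: constraints are explicit data.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMTask:
    task: str
    context: str
    constraints: tuple
    output: str = "candidate-patch-only"
    acceptance: str = "deterministic-tests-and-policy-gates"

bootstrap_task = LLMTask(
    task="refactor-configuration-bootstrap",
    context="App reads secrets and env config from local application.properties",
    constraints=(
        "do not change business logic",
        "preserve existing config key names where possible",
        "use approved secret accessor and managed identity",
        "do not introduce unapproved dependencies",
    ),
)

def render_prompt(t: LLMTask) -> str:
    """Render the envelope for the model; acceptance stays deterministic."""
    lines = [f"Task: {t.task}", f"Context: {t.context}"]
    lines += [f"Constraint: {c}" for c in t.constraints]
    lines += [f"Output: {t.output}", f"Validation: {t.acceptance}"]
    return "\n".join(lines)
```

Because the envelope is frozen, the same constraints that were sent to the model can later be attached, unchanged, to the evidence chain.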

 

The validators check both old-world behavior and new-world landing-zone conformance.

Old tests prove the app still works: the build succeeds, unit tests pass, existing API smoke tests pass, and database integration tests pass.

New tests prove the app belongs in the landing zone: no secret-like values remain in the repo, secret manager resources exist, runtime identity can retrieve secrets, the app starts without a local secret file, logs appear in the approved logging sink, required metadata and tags exist, policy-as-code checks pass, the container image builds, and the test deployment succeeds.

Old tests prove the app still works. New tests prove the app belongs here.
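The two validator families can be sketched as two named check sets run over the same workload evidence. Every predicate and field name here is an illustrative placeholder standing in for a real build, test, or policy gate; only the check categories come from the lists above.

```python
# Sketch of the two validator families described above. Each check is a
# placeholder predicate over a workload evidence dict, not a real gate.

OLD_WORLD_CHECKS = {  # prove the app still works
    "build_succeeds":       lambda w: w["build"] == "ok",
    "unit_tests_pass":      lambda w: w["unit_failures"] == 0,
    "api_smoke_tests_pass": lambda w: w["api_smoke"] == "pass",
    "db_integration_pass":  lambda w: w["db_integration"] == "pass",
}

NEW_WORLD_CHECKS = {  # prove the app belongs in the landing zone
    "no_secrets_in_repo":    lambda w: not w["secret_like_values"],
    "secret_manager_exists": lambda w: w["secret_resources_created"],
    "identity_reads_secret": lambda w: w["runtime_secret_read"] == "ok",
    "starts_without_local_secret_file": lambda w: w["starts_clean"],
    "logs_in_approved_sink": lambda w: w["log_sink"] == "approved",
    "policy_as_code_pass":   lambda w: w["policy"] == "pass",
}

def validate(workload):
    """Run both families; any failure blocks promotion."""
    failures = [name for checks in (OLD_WORLD_CHECKS, NEW_WORLD_CHECKS)
                for name, check in checks.items() if not check(workload)]
    return {"passed": not failures, "failures": failures}

sample = {  # a workload that satisfies every check in this sketch
    "build": "ok", "unit_failures": 0, "api_smoke": "pass",
    "db_integration": "pass", "secret_like_values": [],
    "secret_resources_created": True, "runtime_secret_read": "ok",
    "starts_clean": True, "log_sink": "approved", "policy": "pass",
}
```

The named failures matter as much as the boolean: a red pipeline should say which validator failed, because that name is what the RCA process later traces.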

Why the RCA Matters

The missed secret-manager outage failure is not an edge case to be waved away. It is the reason the migration control plane needs closed-loop learning. The app passed the validators that existed at the time. The RCA process determines whether the miss was workload-specific, organization-specific, or general enough to update the reusable migration system. Section 8 expands that feedback loop into an operating model.

7. What Breaks: Failure Modes and Validator Gaps

The model is strong, but only if playbooks are treated as governed artifacts, not clever prompts.

Bad Pattern Classification. The scheduler may classify an application as a known pattern when it only superficially resembles one. The application may look stateless while depending on local disk persistence, shared filesystem locks, or implicit network trust. Mitigation requires confidence thresholds, human review for low-confidence matches, and negative detection rules.

Playbook Drift. A playbook may reflect yesterday’s landing-zone pattern, not today’s approved platform standard. The loop then enforces outdated policy deterministically. Mitigation requires versioned playbooks, reviewed changes, execution records, and fail-closed deprecated playbooks.

Over-Automation of Exceptions. The system may treat recurring exceptions as normal and automatically remediate them without architectural review. The migration factory hides real design problems by repeatedly patching symptoms. Mitigation requires exception frequency tracking, playbook review triggers, and restrictions on auto-remediation for certain exception classes.

Deterministic but Wrong. A deterministic check may pass while enforcing the wrong requirement. The organization gains false confidence because the pipeline is green. Mitigation requires rules that trace to approved policy, rule owners, adversarial and edge-case test examples, and evidence that shows why each rule exists.

Validator Coverage Gap. A more subtle failure occurs when the LLM proposes a change that passes every deterministic validator but is architecturally wrong in a way the validators were not designed to catch. This is not an LLM-only problem. Human developers can also make changes that pass tests while violating architectural intent. The problem is that the validator set becomes the effective definition of correctness. If the validators are incomplete, the loop can confidently accept the wrong outcome.

A green pipeline proves the workload satisfied the validators. It does not prove the validators captured every architectural concern.

Landing-Zone Assumption Gap. The playbook may assume the target platform has capabilities that are not actually available or enabled. The agent generates a valid target design that cannot be deployed in the actual enterprise landing zone. Mitigation requires scheduler inspection of landing-zone capabilities before playbook assignment, declared platform capability requirements in playbooks, and hard stops for missing capabilities.
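The landing-zone assumption gap lends itself to a simple precheck: playbooks declare the platform capabilities they require, and the scheduler compares that declaration against the actual landing zone before assignment. The capability names and requirement table below are hypothetical illustrations, not a real platform inventory.

```python
# Hypothetical sketch of the landing-zone capability hard stop: playbooks
# declare required capabilities; missing capabilities block assignment
# at scheduling time rather than failing at deployment time.

PLAYBOOK_REQUIREMENTS = {
    "secrets-remediation-gcp-001": {"secret-manager", "workload-identity"},
    "local-logging-remediation-gcp-001": {"structured-logging-sink"},
}

def precheck(playbook_id, landing_zone_capabilities):
    """Return (ok, missing). Any missing capability is a hard stop."""
    required = PLAYBOOK_REQUIREMENTS.get(playbook_id, set())
    missing = sorted(required - set(landing_zone_capabilities))
    return (not missing, missing)

# workload-identity is not enabled in this landing zone, so the secrets
# playbook must not be assigned until the capability gap is closed
ok, missing = precheck("secrets-remediation-gcp-001",
                       {"secret-manager", "structured-logging-sink"})
```

A real implementation would also fail closed for playbooks with no declared requirements; this sketch treats an empty declaration as passing only to keep the example short.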

8. Validator Governance and Closed-Loop Learning

The secret-manager outage example in Section 6 is the practical reason validator governance matters. The system did not fail because it used an LLM. It failed because the validator set did not capture a real architectural risk.

Because validators define the exit condition, they are authority-bearing control plane components. The system needs a process for governing the validator set itself.

If the validator defines done, the validator is part of the authority model.

The validator governance process should answer:

  • Who owns each validator?
  • What policy, runbook, or architectural requirement does it enforce?
  • What failure mode is it intended to catch?
  • What failure modes is it not intended to catch?
  • How was the validator tested?
  • What known gaps exist?
  • What evidence does the validator produce?
  • What happens when the validator passes but downstream review finds an architectural problem?
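The governance questions above can be answered per validator as a structured record rather than tribal knowledge. The schema and the sample record below are illustrative assumptions: the field names, the validator ID, and the declared gaps are invented to show the shape, not taken from a real catalog.

```python
# Sketch of a governance record answering the questions above for one
# validator. Field names and sample values are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class ValidatorRecord:
    validator_id: str
    owner: str              # who owns the validator
    enforces: str           # policy, runbook, or architectural requirement
    catches: list           # failure modes it is intended to catch
    does_not_catch: list    # declared non-goals, reviewed explicitly
    tested_by: str          # how the validator itself was tested
    known_gaps: list
    evidence_produced: str

    def declared_blind_spot(self, failure_mode):
        """Lets an RCA distinguish a declared gap from an unknown one."""
        return failure_mode in self.does_not_catch or failure_mode in self.known_gaps

secrets_scan = ValidatorRecord(
    validator_id="no-secrets-in-repo-v3",
    owner="platform-security",
    enforces="policy: secrets live only in the approved secret manager",
    catches=["committed credential-like values"],
    does_not_catch=["startup behavior under secret-manager outage"],
    tested_by="adversarial fixture repos with seeded secrets",
    known_gaps=["encrypted blobs with embedded keys"],
    evidence_produced="scan report attached to the migration request",
)
```

Note how the Section 6 outage would show up here: the startup-behavior gap is a declared non-goal of this validator, which tells the RCA that the fix is a missing validator, not a weak one.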

RCA Feedback Loop

Validator gaps are inevitable. This is true for human-written software, human-led migrations, and LLM-assisted migrations. A workload can pass every known validator and still fail because the validators did not capture the relevant failure mode.

The answer is not to pretend validation can become perfect. The answer is to make failures traceable, perform root-cause analysis, and feed the learning back into the scheduler, playbooks, validators, landing-zone profiles, and organizational knowledge base.

The RCA process should begin whenever a migrated workload fails after passing the expected validation gates. The purpose of RCA is not only to fix the workload. The purpose is to improve the migration control plane.

Every post-validation failure should create an RCA trace that links the failure back to the migration execution history:

  • application ID and migration request ID
  • failure event, failure description, impact, and detection source
  • playbook version and scheduler run ID
  • validator set and versions used
  • LLM calls involved, if any
  • human approvals or overrides involved, if any
  • evidence available at migration time and evidence missing at migration time
  • root cause category, corrective action, and preventive action
  • feedback destination, owner, due date, and closure evidence

RCA Root Cause Categories

RCA should classify failures into categories that lead to different fixes.

  • Missing validator: the failure mode was not covered by any validator.
  • Weak validator: a validator existed but did not test the condition deeply enough.
  • Bad scheduler classification: the application was routed to the wrong playbook or given too much automation authority.
  • Playbook assumption gap: the playbook assumed a migration pattern that was not true for this workload.
  • Landing-zone profile gap: the target environment did not provide the capability or operating behavior the playbook assumed.
  • Organization-specific control gap: the migration passed general cloud validation but violated a local standard, operational convention, or compliance requirement.
  • Human decision error: a human approved an exception, override, or classification that later proved wrong.
  • LLM remediation error: the LLM produced a change that passed available validators but introduced a hidden defect.
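Because each category leads to a different fix, the category can be routed mechanically to the control-plane component that owns the corrective action. The routing table below is an illustrative assumption consistent with the categories above; the destination names are invented labels, not real system components.

```python
# Sketch mapping the RCA root cause categories above to the control-plane
# component that receives the corrective action. Destinations are illustrative.

FEEDBACK_DESTINATION = {
    "missing_validator":            "validator-catalog",
    "weak_validator":               "validator-catalog",
    "bad_scheduler_classification": "scheduler-rules",
    "playbook_assumption_gap":      "playbook-catalog",
    "landing_zone_profile_gap":     "landing-zone-profile",
    "org_specific_control_gap":     "organization-policy-map",
    "human_decision_error":         "review-workflow",
    "llm_remediation_error":        "task-constraints-and-validator-catalog",
}

def route_rca(category):
    """Unknown categories escalate instead of silently defaulting."""
    if category not in FEEDBACK_DESTINATION:
        raise ValueError(f"unclassified RCA category: {category}")
    return FEEDBACK_DESTINATION[category]
```

Raising on an unknown category is deliberate: an RCA the taxonomy cannot place is itself a signal that the taxonomy needs review.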

General vs. Organization-Specific vs. Workload-Specific Learning

Not every RCA should update the global playbook. Some findings are generally reusable. Others are specific to the workload, application family, business unit, or organization.

General learning applies across many migrations and should improve the reusable control plane: validator catalog, playbook catalog, scheduler rules, landing-zone profile, pilot test suite, and migration engineering runbooks.

Organization-specific learning reflects local standards, platform design, compliance rules, or operating practices. It should update the organization policy map, landing-zone profile, local validators, local evidence requirements, and local exception workflows.

Workload-specific learning applies to one application or tightly related application family. It should update the application migration record, application-specific runbook, CMDB/application portfolio metadata, future scheduler evidence for related workloads, and human review notes.

The feedback process should not pollute global playbooks with one-off exceptions, but it should also avoid burying reusable failure modes inside application-specific notes.

Closed-Loop Improvement

The migration system should not simply record RCAs. It should require closure before similar migrations continue at the same automation level.

  • Minor workload-specific issue: update application runbook.
  • Recurring pattern issue: update playbook and validator set.
  • High-severity control failure: suspend playbook automation and return to pilot.
  • Policy violation: stop affected migration class until governance review completes.
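The closure rules above amount to a severity ladder that can be expressed as a single decision function. The rule order and the RCA field names below are assumptions made for the sketch; an organization would tune both.

```python
# Sketch of the closure rules above: the RCA outcome determines what happens
# to automation authority before similar migrations continue. Illustrative only.

def closure_action(rca):
    """Return the required action before the affected migration class resumes."""
    if rca["type"] == "policy_violation":
        return "stop-migration-class-until-governance-review"
    if rca["severity"] == "high":
        return "suspend-playbook-automation-return-to-pilot"
    if rca["recurring"]:
        return "update-playbook-and-validator-set"
    return "update-application-runbook"  # minor, workload-specific issue
```

The ordering matters: a policy violation outranks severity, and a high-severity control failure outranks recurrence, so automation authority is always reduced by the strictest applicable rule.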

 

Passing validation is not the end of governance. It is the beginning of evidence-based learning when reality disagrees.

The question is not whether failures will happen. They will. The question is whether the system can tell what failed, why it failed, and which part of the migration control plane must improve.

9. Human-in-the-Loop as Accountability, Not Correctness

Human review does not eliminate error. A human can misclassify an application, approve a bad exception, misunderstand the landing-zone requirement, or rubber-stamp an agent’s recommendation. The purpose of human involvement is not to make the process infallible. The purpose is to place accountability, judgment, and exception authority where the organization can inspect it.

Human review provides accountability. It does not automatically provide correctness.

Human decisions need traceability for the same reason LLM decisions need traceability. The system must show what evidence was available, what decision was made, who made it, what authority they had, and what downstream automation that decision enabled.

An LLM can produce a decision-like output without accountable ownership. It can summarize evidence, infer patterns, and recommend a playbook, but it does not carry organizational authority. Its output must be treated as evidence or recommendation unless bounded by deterministic rules.

A human can carry organizational authority, but that does not mean the human is correct. The human can approve the wrong thing. The difference is that a human decision can be assigned to a role, reviewed against policy, challenged later, and used to improve the process.

Human review should be triggered when the system reaches an authority boundary: unknown patterns, low-confidence classifications, disqualifying rule override requests, production-impacting exceptions, missing landing-zone capabilities, policy conflicts, repeated failed loop attempts, and migration paths with material cost, risk, or architecture implications.

The human should not merely approve the agent’s output. The human should approve a specific decision with evidence.

Weak review:

  • Looks good. Proceed.

 

Strong review:

  • Approved as Known Pattern with Exceptions.
  • Reviewed scheduler evidence showing Java Spring Boot app, external database, no local persistent state, and two warnings for local secrets and file logging.
  • Approved remediation playbooks secrets-remediation-gcp-001 and local-logging-remediation-gcp-001.
  • No production deployment allowed until validation gates pass.

 

Human-in-the-loop is not a correctness guarantee. It is an accountability boundary.

10. Playbook Lifecycle: From Human Practice to Governed Automation

This section is not intended to teach the basics of migration planning. It assumes the organization already has migration practitioners, landing-zone standards, and enough operating discipline to recognize repeatable patterns.

The purpose of playbook maturation is to convert known migration practice into governed automation. If the organization cannot already describe how a class of workloads should move, it is not ready to automate that class of migration.

A playbook should begin as a human-governed migration pattern, not as an AI-generated artifact.

Observed migration pattern → human architect classification → target-state decision → policy and landing-zone mapping → deterministic detection rules → deterministic validation rules → pilot migrations → evidence review → approved reusable playbook

A playbook should move through controlled lifecycle stages: Draft → Pilot → Approved → Deprecated → Retired.

Each stage requires explicit promotion criteria. A playbook should not move from draft to pilot or pilot to approved because the team feels confident. It should move because it has produced enough evidence to justify the next level of authority.
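The lifecycle can be sketched as a small state machine in which each promotion demands evidence, not confidence. The legal transitions follow the stages described in this section; the specific evidence fields and the four-run pilot threshold are assumptions invented for the sketch, loosely mirroring the four pilot cases described below.

```python
# Sketch of the playbook lifecycle: legal transitions plus evidence-gated
# promotion. Evidence field names and thresholds are hypothetical.

TRANSITIONS = {
    "draft":      {"pilot"},
    "pilot":      {"approved", "draft"},    # a failed pilot demotes to draft
    "approved":   {"deprecated", "pilot"},  # high-severity RCA returns to pilot
    "deprecated": {"retired"},
    "retired":    set(),
}

def promote(stage, target, evidence):
    """Advance a playbook only on a legal transition with sufficient evidence."""
    if target not in TRANSITIONS[stage]:
        raise ValueError(f"illegal transition {stage} -> {target}")
    if stage == "draft" and target == "pilot":
        if not evidence.get("human_reviewed_draft"):
            raise ValueError("draft must be human-reviewed before pilot")
    if stage == "pilot" and target == "approved":
        # assumed threshold: clean match, partial match, false-positive
        # risk, and a disqualification case must all have run
        if evidence.get("pilot_runs_passed", 0) < 4:
            raise ValueError("pilot evidence insufficient for approval")
    return target
```

Demotions (approved back to pilot, pilot back to draft) intentionally require no evidence gate in this sketch: reducing authority should always be easier than expanding it.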

Draft. A draft playbook is a human-authored description of a recurring migration pattern and the desired target-state pattern. It may be assisted by AI, but it is not yet allowed to control an automated migration loop.

Pilot. A pilot playbook is allowed to run against selected applications under human supervision. It can guide a migration loop, but it cannot make unsupervised production-impacting decisions. The pilot should include a clean match, a partial match with expected remediation, a false-positive risk, and a disqualification or escalation case.

Approved. An approved playbook is eligible for scheduler assignment and migration loops within its approved scope. It is still not allowed to expand its own scope, approve its own exceptions, ignore failed validators, change policy source mappings without review, or update landing-zone assumptions without recertification.

Deprecated and Retired. Deprecated playbooks remain traceable for old migrations but are not assigned to new ones. Retired playbooks are no longer valid, though historical execution evidence remains available for audit.

Concrete Lifecycle Example: Local Secrets Migration Playbook

Migration engineers repeatedly find applications reading credentials from .env files, application.properties, or local configuration files on VMs. The target landing-zone standard requires all secrets to be stored in the approved cloud secrets manager and accessed through managed workload identity.

In draft, the playbook defines eligible apps, disqualified apps, target state, detection logic, and validation rules. It is advisory only.

In pilot, the team selects four applications: one clean Spring Boot app using application.properties, one Node.js app with a .env file and straightforward environment variable mapping, one app with a suspicious local config pattern that might be a false positive, and one app with a custom credential broker that should be rejected.

The pilot shows that the deterministic scanner correctly identifies common secret patterns and rejects the custom credential broker case. It also reveals a gap: some non-secret environment values were being classified as secrets. The classification logic is updated and the pilot is rerun.

After evidence review, the playbook is approved for applications within a narrow scope: common framework-based local secret patterns, test and staging transformation, and production promotion through normal release controls. Custom credential brokers remain excluded, and regulated workloads require separate governance review.

Six months later, the landing-zone team introduces a new secrets access library and deprecates the old accessor pattern. The previous playbook remains traceable for past migrations but is no longer assigned to new applications. After all remaining applications have moved to the new pattern, the old playbook is retired. Historical execution evidence remains available for audit.

11. Traceability and Evidence Chain

Once playbooks are machine-enforceable, they become part of the audit surface. Each playbook must be traceable from business requirement to enforcement result.

Business requirement → migration policy → playbook rule → deterministic enforcement code → test result → deployment decision → exception record, if any

The goal is to prevent the agent from hiding policy decisions inside code changes.

Every playbook execution should produce an evidence chain. Minimum evidence fields include:

  • application ID and migration request ID
  • pattern classification
  • playbook ID and version
  • policy source
  • detection result and transformation applied
  • validation result and artifacts changed
  • deterministic checks executed
  • human reviewer, if required, and exception owner, if applicable
  • timestamp and final disposition
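A minimal enforcement of that evidence chain is a record builder that refuses to emit an incomplete entry. This is a sketch under stated assumptions: the required-field set is a subset of the fields listed above chosen for brevity, and every name and sample value is illustrative.

```python
# Sketch of a fail-closed evidence record for one playbook execution.
# The required-field set and sample values are hypothetical placeholders.
import json

REQUIRED_FIELDS = {
    "application_id", "migration_request_id", "pattern_classification",
    "playbook_id", "playbook_version", "policy_source", "detection_result",
    "transformation_applied", "validation_result", "final_disposition",
    "timestamp",
}

def evidence_record(**fields):
    """Serialize one execution's evidence; incomplete chains are rejected."""
    missing = sorted(REQUIRED_FIELDS - fields.keys())
    if missing:
        raise ValueError(f"incomplete evidence chain, missing: {missing}")
    return json.dumps(fields, sort_keys=True)  # append-only audit log entry
```

Failing closed here is the point: a playbook execution that cannot produce its full evidence chain should not be allowed to count as complete.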

The agent should not be allowed to hide policy decisions inside code changes.

12. The SI Role: More Work, Higher-Value Work

For a systems integrator, this model does not reduce the work. It changes the work.

The traditional migration story often treats scale as a staffing problem: assign more people, run more assessment waves, execute more runbooks, and push more applications through the factory. That model still runs into the same constraints the client faces. The SI does not magically inherit perfect application knowledge. The SI often faces the same blockers: incomplete inventories, inconsistent runbooks, uneven test coverage, unclear ownership, landing-zone drift, unavailable application SMEs, and limited developer capacity.

The migration control plane does not eliminate those blockers. It exposes them, structures them, and creates a path to reduce them over time.

The SI opportunity is therefore not smaller. It is larger and more strategic. The work moves upstream from migration labor into migration operating model transformation.

Instead of only moving applications, the SI helps build the system that makes application movement repeatable, inspectable, governable, and improvable.

The SI does not win by pretending the mess is gone. The SI wins by turning the mess into a control plane.

SI Work Across the Maturity Curve

The readiness model creates a natural SI engagement path. This section uses the maturity curve from the SI delivery lens: what work the SI performs at each stage. Section 2 uses the same curve from the enterprise buyer lens: what authority the organization can safely grant the system at each stage.

The overlap is intentional. The SI view describes the engagement model. The readiness view describes the operating authority model.

Level 1: Build Migration Discipline. For clients not ready for automation, the SI helps create the operating foundation: application inventory cleanup, ownership mapping, dependency discovery, landing-zone readiness assessment, current-state documentation, migration wave planning, runbook normalization, test coverage assessment, and policy and exception process discovery.

Level 2: Assisted Assessment and Playbook Discovery. For clients ready for assisted assessment, the SI helps discover and document repeatable patterns: application pattern discovery, candidate playbook drafting, assessment-only scheduler design, landing-zone gap analysis, human review workflow design, policy-to-control mapping, initial evidence model design, and pilot workload selection.

Level 3: Constrained Automation. For clients ready for narrow automation, the SI builds and operates the first governed playbook loops: deterministic detector development, validator catalog development, playbook implementation, CI/CD and policy-as-code integration, test-environment migration loops, LLM task constraint design, evidence capture implementation, human approval workflow integration, and pilot RCA and playbook refinement.

Level 4: Governed Automation at Scale. For mature clients, the SI can help operate the migration control plane across the application estate: migration control plane operations, playbook catalog lifecycle management, validator catalog maintenance, scheduler rule governance, landing-zone profile updates, exception workflow management, RCA facilitation and closure, organization-specific validator development, workload-specific knowledge capture, reporting and evidence management, and continuous improvement of automation authority.

Why This Is Higher-Value Work

The lower-value migration work is the repetitive labor that can eventually be encoded, validated, and repeated: finding common secrets patterns, updating known configuration references, rewriting boilerplate deployment manifests, applying standard tags, checking known policy requirements, and producing repetitive migration documentation.

The higher-value work is defining and operating the system that makes those tasks safe to automate: deciding which patterns are known enough to automate, building playbooks that encode approved target states, creating deterministic validators, defining human review boundaries, managing exceptions, interpreting failures, feeding RCA back into the control plane, and separating global learning from organization-specific and workload-specific learning.

The work does not disappear. It moves upstream into the design and operation of the migration control plane.

SI Business Implication

This model can be uncomfortable for SIs that depend primarily on migration labor volume. But it is attractive for SIs that want to sell higher-value transformation.

The SI can package the work as a progression:

Migration readiness assessment → application pattern discovery → playbook factory design → validator catalog development → constrained automation pilot → migration control plane operations → closed-loop optimization

The result is not fewer SI services. It is a different mix of services: more architecture, platform engineering, governance, automation, testing, evidence management, and operational transformation.

The SI opportunity is not to provide more migration hands. It is to help the enterprise build the migration control plane that makes scarce hands scale.

13. Open Questions

The following questions remain open and are intended to guide future work, implementation planning, and organizational adoption decisions.

  • What is the minimum application inspection data required before assigning a playbook?
  • Should the scheduler classify by application architecture, deployment target, operational risk, or all three?
  • What qualifies as enough confidence to allow the loop to begin?
  • Which failures should retry automatically versus escalate immediately?
  • How should new unknown patterns be promoted into reusable playbooks?
  • Who owns playbook approval: platform team, security, architecture, app owner, or shared governance board?
  • How much of the playbook should be represented as deterministic code versus human-readable documentation?
  • How often should approved playbooks be recertified against the current landing zone? For example, many organizations would align this with landing-zone release cycles or perform at least quarterly review.
  • What is the threshold for converting recurring exceptions into new remediation playbooks?
  • What validator gaps require suspension of an approved playbook versus a minor version update? For example, gaps that can cause data integrity loss, security policy violations, or production availability risk should usually trigger suspension until reviewed.
  • How should organization-specific learning be separated from global playbook improvements?
  • Where should workload-specific knowledge live so future scheduler runs can use it without polluting global playbooks? For example, application portfolio metadata, CMDB records, or application-specific migration runbooks may be better destinations than the global playbook catalog.

 
