May 26, 2026 · 12 min read
Infrastructure as Code Solved Repeatability. It Didn't Solve Coordination.
Terraform fixed the snowflake-server problem. But once you have dozens of stacks, multiple teams, and dependencies between environments, a new class of failure shows up: drift, ordering, and compliance gaps. This is a deep dive into why provisioning isn't orchestration, and the control-plane pattern the best teams build on top of their IaC.

TL;DR
- Infrastructure as Code (IaC) solved repeatability. Tools like Terraform replaced hand-configured "snowflake servers" with versioned, reviewable definitions.
- Provisioning is not orchestration. Terraform reconciles one desired state against one actual state. It does not sequence deployments across stacks, pass outputs between them, or model who-deploys-what-in-what-order.
- At scale, three failure modes appear: drift (manual changes that never get committed), ordering failures (downstream breaks because a dependency wasn't modeled), and compliance gaps (approvals living in Slack threads, not the system).
- The fix is a control plane — an orchestration layer on top of your IaC that makes dependencies explicit, moves governance into the workflow, and runs continuous drift detection.
- The category is real and competitive: HCP Terraform, Spacelift, env0, Scalr, and Atlantis all attack this problem from different angles.
Why infrastructure management breaks at scale
When you deploy a web app to the cloud, a lot happens behind the scenes. Your code runs on a server that needs an OS, specific packages, the right ports open, the right environment variables. If you have a database, it needs to be created and configured. Same for networking, storage, and security rules. All of that is your infrastructure.
In the early days, setup was simple. You SSH into a server, run some commands, install what you need, configure what needs configuring. One server, one engineer, done. Completely manageable.
Now let the app grow. You add a second server for load balancing, a third for background jobs, a separate database server, a caching layer — and a staging environment that's supposed to look exactly like production. Multiply that across a team of ten engineers each making changes, and the questions start:
- How does anyone know what's installed where?
- What happens when two engineers configure the same server differently?
- What happens when the person who set everything up leaves?
This is the snowflake server problem — a term Martin Fowler popularized back in 2012. Like snowflakes, no two servers end up exactly alike — each configured by hand, by different people, at different times, with nobody holding a complete picture of what's actually running. As Fowler put it, the real fragility shows up when you need to change them: upgrades cause unpredictable knock-on effects. Source: Martin Fowler — SnowflakeServer.
How we got here: the evolution of IaC
The industry didn't solve this in one jump. It was a sequence of tools, each fixing the previous one's weakness.
Chef and Puppet came first as AWS got popular. You could script provisioning and configuration — but it was fragile. Scripts assumed things were already set up a certain way, manual changes pushed state out of sync, and there was no easy way to preview what a run would change.
CloudFormation was AWS's native answer: declare what you want in JSON or YAML — a VPC, a load balancer, an RDS instance — and AWS provisions it. A real step forward, because now infrastructure had a definition you could version and repeat. The downside: it was tightly coupled to AWS, the syntax was verbose, and anything complex got hard to manage fast.
Ansible solved a related but different problem. It isn't really about provisioning cloud resources — it's about configuring what's already running: installing software, managing users, pushing config files across a fleet. It's procedural (write the explicit steps you want to happen) rather than declarative (describe the end state and let the engine work out the steps). Source: Red Hat — Ansible vs. Terraform. Many teams paired the two: CloudFormation to create the infrastructure, Ansible to configure it.
Terraform changed the picture. It took CloudFormation's declarative approach and made it cloud-agnostic — the same workflow whether you're on AWS, GCP, Azure, or all three. Because the language is declarative (you describe the desired end state in HCL), Terraform itself figures out the dependency order and what to create, modify, or destroy to get there. Source: Red Hat — Ansible vs. Terraform. And it introduced a clean split between planning and applying: run terraform plan to see exactly what's going to change, then terraform apply only when you're ready. That review step alone removed a whole category of anxiety from infrastructure changes.
Here's what that looks like in practice — an S3 bucket:
```hcl
# main.tf
resource "aws_s3_bucket" "app_assets" {
  bucket = "thakurcoder-${var.environment}-assets"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```

```shell
terraform plan   # shows exactly what will be created; touches nothing
terraform apply  # provisions the bucket
```

No console clicking, no manual steps, no hoping someone remembered the right tag. Anyone on the team can reproduce the same bucket in a different environment by changing one variable. And if something changes unexpectedly, Git history tells you when and who.
💡 Note on OpenTofu: In August 2023, HashiCorp relicensed Terraform from the Mozilla Public License (MPL 2.0) to the Business Source License (BSL). The community forked the last MPL version, and it became OpenTofu under the Linux Foundation, keeping CLI and HCL compatibility so most Terraform code runs unchanged. Source: OpenTofu Manifesto, Platform Engineering. HashiCorp itself was acquired by IBM in a deal worth roughly $6.4 billion, announced in April 2024 and completed in February 2025, so both Terraform and Ansible now sit under the same corporate umbrella. Source: env0 — Ansible vs Terraform 2026.
For teams in the early stages, this is usually enough. The problem shows up later.
What Terraform does — and what it doesn't
Terraform is a provisioning tool, and an exceptionally good one. Its job is to take a desired state, compare it to the actual state, and reconcile the difference. It does that job well.
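To make "reconcile the difference" concrete, here is a minimal Python sketch of the idea. It is not how Terraform is implemented: the `Plan` class, the `compute_plan` function, and the resource addresses are all hypothetical, and real state handling covers much more (providers, dependency ordering inside a configuration, unknown values).

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Actions needed to move the actual state to the desired state."""
    create: list[str] = field(default_factory=list)
    update: list[str] = field(default_factory=list)
    delete: list[str] = field(default_factory=list)


def compute_plan(desired: dict[str, dict], actual: dict[str, dict]) -> Plan:
    """Compare two {resource address: attributes} maps and list the changes."""
    result = Plan()
    for address, attrs in desired.items():
        if address not in actual:
            result.create.append(address)      # declared but missing
        elif actual[address] != attrs:
            result.update.append(address)      # exists but differs
    for address in actual:
        if address not in desired:
            result.delete.append(address)      # exists but no longer declared
    return result


# Hypothetical states for illustration only.
desired = {
    "aws_s3_bucket.app_assets": {"tags": {"ManagedBy": "terraform"}},
    "aws_s3_bucket.logs": {"tags": {}},
}
actual = {
    "aws_s3_bucket.app_assets": {"tags": {}},
    "aws_s3_bucket.old_uploads": {"tags": {}},
}

p = compute_plan(desired, actual)
print(p.create)  # ['aws_s3_bucket.logs']
print(p.update)  # ['aws_s3_bucket.app_assets']
print(p.delete)  # ['aws_s3_bucket.old_uploads']
```

Notice what the function takes as input: one desired state and one actual state. Nothing in it knows about other stacks, other teams, or the order in which several such plans should run.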
What it does not do is orchestrate — and orchestration is a completely different class of problem.
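Picture a typical layered setup: a networking stack at the bottom, a Kubernetes cluster built on that network, and application services running on the cluster.
In real organizations these are three separate workspaces, with different lifecycles, different owners, and different teams deploying at different cadences.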
Now suppose the networking team needs to change the CIDR block (the address plan for the network — how many IPs it has and how they're divided). Simple in isolation. But the change ripples outward:
- The Kubernetes cluster was configured assuming a specific IP range.
- The application services have firewall rules tied to those addresses.
So who runs what, in what order? If the network change succeeds but the Kubernetes update fails halfway, what's the recovery path? Who approves what before any of this starts?
None of these questions are answered by Terraform. They get answered by your team — through Slack messages, a Confluence doc that's six months out of date, and whoever happens to remember the order things need to go in.
That informal coordination works when the team is small. As the org scales, it becomes a liability.
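The CIDR ripple described above is easy to show with Python's standard `ipaddress` module. This is a sketch under assumptions: the stack names and address ranges are made up, and `stale_ranges` is a hypothetical helper, not part of any tool.

```python
import ipaddress


def stale_ranges(new_vpc_cidr: str, assumed_ranges: dict[str, str]) -> list[str]:
    """Return the consumers whose assumed range no longer fits inside the VPC."""
    vpc = ipaddress.ip_network(new_vpc_cidr)
    stale = []
    for consumer, cidr in assumed_ranges.items():
        if not ipaddress.ip_network(cidr).subnet_of(vpc):
            stale.append(consumer)
    return stale


# Hypothetical ranges that the downstream stacks were configured with.
assumed = {
    "kubernetes-cluster": "10.0.16.0/20",
    "application-services": "10.0.128.0/24",
}

print(stale_ranges("10.0.0.0/16", assumed))  # []
print(stale_ranges("10.0.0.0/18", assumed))  # ['application-services']
```

The networking plan for the second case can apply cleanly on its own. The breakage only becomes visible when something that knows about the downstream assumptions runs a check like this one.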
The three failure modes at scale
1. Drift. Your actual infrastructure quietly diverges from your code because someone made a manual change during an incident at 2 a.m. and never committed it back to the repository. What Terraform thinks is true and what's actually running move apart, and over time those differences accumulate.
2. Ordering failures. A deployment succeeds in isolation but breaks something downstream because the dependency was never modeled explicitly. You find out when the alert fires — not when the plan runs.
3. Compliance gaps. Your approval process lives in people's heads or in Slack threads, not in the system. When an auditor asks who approved this change and when, you're scrolling back through message history.
This is the point where infrastructure management started to evolve. The realization: the hard part was never writing infrastructure as code. The hard part is managing changes safely, consistently, and with visibility at scale.
The orchestration layer: a control plane for infrastructure
The pattern you see consistently at companies operating at scale: they build an orchestration layer on top of their IaC tools — a control plane that treats infrastructure deployments the way a good CI/CD system treats application deployments.
This layer does three specific things Terraform itself doesn't.
Making dependencies explicit
Your networking stack, Kubernetes cluster, and application layer stop being separate workspaces that happen to be related. They become connected nodes in a graph. When the network changes, the system knows which stacks depend on it, passes the outputs automatically, and triggers downstream runs in the right order.
Conceptually, you go from manually copying values:
```hcl
# networking stack outputs the subnet IDs
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```

…to declaring the relationship once and letting the platform wire it up:

```yaml
# Illustrative dependency declaration (syntax varies by platform)
stack: application-services
depends_on:
  - stack: networking
    outputs:
      private_subnet_ids: app_subnet_ids # auto-passed downstream
```

The tribal knowledge ("run networking first, grab the subnet IDs, then deploy the app") gets encoded in the platform instead of living in someone's head.
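Once the relationships are data, "what runs in what order" becomes a graph problem. The sketch below uses Python's standard `graphlib` module to derive a run order and the set of stacks affected by a change. The stack names and both helper functions are hypothetical; real platforms expose this through their own configuration and UI.

```python
from graphlib import TopologicalSorter

# Each stack maps to the stacks it depends on (hypothetical names).
dependencies = {
    "networking": set(),
    "kubernetes-cluster": {"networking"},
    "application-services": {"networking", "kubernetes-cluster"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every stack runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def affected_by(changed: str, deps: dict[str, set[str]]) -> list[str]:
    """Return every stack downstream of `changed`, in a safe run order."""
    downstream: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for stack, needs in deps.items():
            if current in needs and stack not in downstream:
                downstream.add(stack)
                frontier.append(stack)
    return [stack for stack in run_order(deps) if stack in downstream]


print(run_order(dependencies))
# ['networking', 'kubernetes-cluster', 'application-services']
print(affected_by("networking", dependencies))
# ['kubernetes-cluster', 'application-services']
```

A useful side effect of modeling it this way: `TopologicalSorter` raises `graphlib.CycleError` on a circular dependency, so an impossible deploy order is rejected up front instead of discovered mid-rollout.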
Moving governance into the workflow
Think about how code review works in a good engineering org: you catch problems before a PR merges, not after it deploys. Infrastructure governance should work the same way.
Instead of a single policy check at the end of a run that either blocks everything or gets rubber-stamped, you want policy evaluation at multiple points — when a PR opens, when a plan is generated, when an approval is requested. Most platforms use OPA (Open Policy Agent) and its Rego language for this. OPA is a general-purpose policy engine originally created at Styra; it's a graduated CNCF project (graduated January 2021) and decouples policy decisions from enforcement by evaluating structured input against declarative rules. Source: Open Policy Agent docs, CNCF — OPA project.
A guardrail looks like this:
```rego
# Illustrative: require senior approval for production changes
package infra.guardrails

# Enables the `contains` and `if` keywords on OPA versions before 1.0
import rego.v1

deny contains msg if {
    input.stack.labels[_] == "env:production"
    input.run.changes_count > 0
    not input.run.approved
    msg := "Production changes require senior engineer approval"
}
```

You can also write policies that block a run if resources aren't tagged correctly, or if a change would open a security group to the public internet. These checks happen before apply, not after something breaks.
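To show what "policy evaluation at multiple points" might look like mechanically, here is the same approval condition written as a plain Python function and attached to named stages. This is only a sketch of the idea and not how OPA evaluates Rego. The stage names, the `POLICIES` registry, and the `public_ingress_rules` input field are assumptions made for illustration.

```python
from typing import Callable

Policy = Callable[[dict], list[str]]


def production_needs_approval(run_input: dict) -> list[str]:
    """Same condition as the illustrative Rego rule, written in Python."""
    is_production = "env:production" in run_input["stack"]["labels"]
    has_changes = run_input["run"]["changes_count"] > 0
    approved = run_input["run"].get("approved", False)
    if is_production and has_changes and not approved:
        return ["Production changes require senior engineer approval"]
    return []


def no_public_ingress(run_input: dict) -> list[str]:
    """Deny when the run reports public ingress rules (hypothetical field)."""
    rules = run_input["run"].get("public_ingress_rules", [])
    return [f"Rule {rule} opens ingress to the public internet" for rule in rules]


# Hypothetical enforcement points; real platforms name and wire these differently.
POLICIES: dict[str, list[Policy]] = {
    "plan_generated": [no_public_ingress],
    "before_apply": [no_public_ingress, production_needs_approval],
}


def evaluate(stage: str, run_input: dict) -> list[str]:
    """Run every policy registered for `stage` and collect deny messages."""
    messages: list[str] = []
    for policy in POLICIES.get(stage, []):
        messages.extend(policy(run_input))
    return messages


run_input = {
    "stack": {"labels": ["env:production", "team:platform"]},
    "run": {"changes_count": 3, "approved": False},
}
print(evaluate("plan_generated", run_input))  # []
print(evaluate("before_apply", run_input))    # one deny message
```

The point of the registry is that cheap checks can fail early, at plan time, while the approval gate only applies at the last step before anything changes.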
Continuous drift detection
Not a weekly cron job that generates a report nobody reads — actual detection that can trigger remediation. If someone makes a manual change during an incident, the system notices, and once things stabilize it can bring infrastructure back to the desired state. Source: env0 — IaC tools 2026.
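Stripped down, drift detection is a comparison followed by a decision. The sketch below assumes you already have the declared and live attributes of a resource as dictionaries (in practice they would come from the platform or from something like a refresh-only plan). The `Action` enum, both functions, and the sample values are hypothetical, and the critical versus non-critical split is one possible remediation policy.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    AUTO_REMEDIATE = "auto_remediate"
    NEEDS_REVIEW = "needs_review"


def drifted_attributes(declared: dict, live: dict) -> dict[str, tuple]:
    """Return {attribute: (declared value, live value)} for every mismatch."""
    keys = declared.keys() | live.keys()
    return {
        key: (declared.get(key), live.get(key))
        for key in sorted(keys)
        if declared.get(key) != live.get(key)
    }


def decide(drift: dict, critical: bool) -> Action:
    """Auto-remediate non-critical stacks; send critical ones to a human."""
    if not drift:
        return Action.NONE
    return Action.NEEDS_REVIEW if critical else Action.AUTO_REMEDIATE


# Hypothetical security group that was opened up by hand during an incident.
declared = {"ingress_cidr": "10.0.0.0/16", "port": 443}
live = {"ingress_cidr": "0.0.0.0/0", "port": 443}

drift = drifted_attributes(declared, live)
print(drift)                         # {'ingress_cidr': ('10.0.0.0/16', '0.0.0.0/0')}
print(decide(drift, critical=True))  # Action.NEEDS_REVIEW
```

What makes this "continuous" is running it on a schedule or on events and feeding the decision into the same run pipeline as every other change.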
The orchestration platform landscape
This is a real, competitive category — not a single product. The tools differ in scope, IaC support, and how opinionated their workflows are.
| Capability | HCP Terraform | Spacelift | env0 | Scalr | Atlantis |
|---|---|---|---|---|---|
| Primary scope | Terraform-centric | Multi-IaC orchestration | Collaborative IaC + FinOps | Terraform/OpenTofu at scale | PR automation |
| IaC tools | Terraform | TF, OpenTofu, Terragrunt, CloudFormation, Pulumi, Ansible, K8s | TF, OpenTofu, others | Terraform, OpenTofu | Terraform |
| Stack dependencies | Limited | ✅ Chain + pass outputs | ✅ | ✅ | ❌ |
| Policy as code | Sentinel + OPA | ✅ OPA, multi-point | ✅ OPA | ✅ OPA pre/post-plan | Via custom hooks |
| Drift detection | ✅ (higher tiers) | ✅ Continuous + remediation | ✅ Scheduled + auto-remediate | ✅ | Partial (via endpoints) |
| Self-hosted / air-gapped | Enterprise | ✅ | ✅ | ✅ | ✅ (self-host only) |
| Cost focus | Cost estimation | General | Strong FinOps | Cost control | None |
Sources: Spacelift vs. Atlantis / Terraform Cloud, env0 vs. Spacelift, Scalr overview, Spacelift vs env0 vs StackGen.
A few patterns from the landscape:
- Atlantis is the lightest option: a self-hosted PR-automation product that runs `plan`/`apply` from pull requests. Great if you already manage state, RBAC, secrets, and compliance elsewhere. Source: Spacelift.
- HCP Terraform is tightly integrated with the HashiCorp ecosystem but narrower in scope.
- Spacelift leans into multi-tool orchestration, stack dependencies, and granular OPA policies across many enforcement points, with flexible (including air-gapped) deployment. Source: env0 — IaC tools 2026.
- env0 emphasizes cost governance and FinOps-style continuous monitoring.
⚠️ Don't pick on features alone. Most of these tools provision the same resources. The real question is fit: how each aligns with your team topology, governance requirements, and tolerance for vendor lock-in. Source: Platform Engineering.
When you need an orchestration layer (and when you don't)
You probably don't need one yet if:
- You have one or two stacks with a single owner.
- Plain `terraform plan`/`apply` plus a Git workflow covers your reviews.
- Your "dependency graph" fits in one engineer's head and they're not leaving.
You almost certainly need one when:
- You have multiple workspaces with different owners changing at different cadences.
- Cross-stack dependencies mean a change in one place silently breaks another.
- Your approval and compliance story lives in Slack and tribal memory.
- You're running a mixed toolchain (Terraform + Ansible + Kubernetes) and want one control plane over all of it.
Production checklist
- Model dependencies explicitly. Encode the deploy order as a graph in your platform, not in a runbook. Pass outputs automatically between stacks.
- Shift policy left. Evaluate OPA/Rego policies on PR open, after plan, and before apply — not just at the end.
- Turn on continuous drift detection. Pair it with auto-remediation for non-critical stacks; require human review for critical ones.
- Separate stacks by lifecycle and ownership, not by convenience. Networking, cluster, and app layers deploy at different cadences.
- Make approvals auditable. Every production change should have a recorded approver and timestamp inside the system.
- Keep IaC engine choice open. With the BSL change, evaluate OpenTofu as a fallback; CLI/HCL compatibility keeps migration risk low for most configurations, though features added to either tool since the fork can differ.
- Tag everything, enforce it with policy. Block untagged resources at plan time so cost allocation and ownership stay accurate.
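The tagging item in the checklist can be checked before apply by reading the JSON form of a plan (the output of `terraform show -json` on a saved plan file). The sketch below relies on my understanding of that format: a top-level `resource_changes` list whose entries carry `address`, `change.actions`, and `change.after`. The required tag names, the function, and the sample data are assumptions for illustration.

```python
import json

REQUIRED_TAGS = {"Environment", "ManagedBy"}


def untagged_resources(plan_json: str, required: set[str] = REQUIRED_TAGS) -> dict[str, list[str]]:
    """Map each created or updated resource to the required tags it lacks."""
    plan = json.loads(plan_json)
    missing: dict[str, list[str]] = {}
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if not {"create", "update"} & set(change["actions"]):
            continue                      # deletes and no-ops are not checked
        after = change.get("after") or {}
        if "tags" not in after:
            continue                      # resource type has no tags attribute
        tags = after["tags"] or {}
        absent = sorted(required - set(tags))
        if absent:
            missing[rc["address"]] = absent
    return missing


# Hand-written sample in the assumed shape, not real plan output.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.app_assets",
         "change": {"actions": ["create"],
                    "after": {"tags": {"Environment": "prod", "ManagedBy": "terraform"}}}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["create"], "after": {"tags": {"Environment": "prod"}}}},
        {"address": "aws_s3_bucket.scratch",
         "change": {"actions": ["create"], "after": {"tags": None}}},
        {"address": "aws_s3_bucket.old",
         "change": {"actions": ["delete"], "after": None}},
    ]
})

print(untagged_resources(sample))
# {'aws_s3_bucket.logs': ['ManagedBy'], 'aws_s3_bucket.scratch': ['Environment', 'ManagedBy']}
```

A non-empty result is the signal to fail the run. In a platform that uses OPA you would more likely express the same rule in Rego against the same plan JSON.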
Conclusion
I've watched the same arc play out in every infrastructure estate I've worked on. In the beginning, getting Terraform in place feels like the win, and it is. The snowflake servers go away, changes become reviewable, and you stop deploying by tribal memory. Repeatability solved.
But repeatability was only ever the first problem. The harder one shows up quietly, months later, as coordination: between teams, between environments, between services that depend on each other. That's the gap the orchestration layer fills, and it's why "control plane for infrastructure" has become its own product category rather than a feature bolted onto a provisioning tool.
If you're a small team, don't over-engineer it — plan, apply, and a clean Git history will take you a long way. But the moment you feel yourself answering "who runs what, in what order?" in a Slack thread, that's the signal. Start by modeling your dependencies explicitly and moving one approval policy into code. The rest of the control plane follows from there.
FAQ
Is Terraform an orchestration tool?
No. Terraform is a provisioning tool — it reconciles a desired state with the actual state for a single configuration. It does not coordinate the order of deployments across multiple stacks, pass outputs between them, or model cross-team dependencies. That's what an orchestration layer adds on top.
What's the difference between provisioning and orchestration?
Provisioning creates and updates resources to match a definition. Orchestration coordinates many provisioning runs — sequencing them, passing data between them, enforcing approvals, and detecting drift across the whole estate. One stack vs. the graph of all your stacks.
What is configuration drift?
Drift is when your live infrastructure quietly diverges from what's declared in code — usually because someone made a manual change during an incident and never committed it. Over time these differences accumulate, and what your IaC thinks is true stops matching reality.
Do I need an orchestration platform for a small team?
Usually not. For early-stage teams with one or two stacks, plain Terraform plus a Git workflow is enough. The orchestration layer earns its keep once you have multiple workspaces with different owners, cross-stack dependencies, and an approval process that lives in people's heads.
What is OPA used for in IaC?
Open Policy Agent (OPA) lets you write policy-as-code in Rego. Orchestration platforms evaluate these policies at multiple points in a run — on PR open, after a plan, before apply — so guardrails like 'production changes need approval' or 'no public security groups' are enforced before changes land, not after something breaks.
Why did teams move from Terraform to OpenTofu?
In August 2023 HashiCorp relicensed Terraform from MPL 2.0 to the Business Source License (BSL). The community forked the last MPL version as OpenTofu, now governed by the Linux Foundation. It keeps CLI and HCL compatibility, so most Terraform code runs unchanged.