AWS outage tied to Kiro agent deleting and rebuilding environment

Financial Times reports an internal AI tool was given production permissions; automation amplifies missing change control, not human fallibility

An AI coding bot took down Amazon Web Services (arstechnica.com)

Amazon Web Services is learning, the hard way, that “agentic” coding assistants don’t eliminate operational risk—they concentrate it.

According to the Financial Times, a mid-December incident caused a roughly 13-hour interruption to an AWS system that helps customers explore cloud costs. The trigger was not a novel exploit or a lightning strike, but a decision to let AWS’s own “Kiro” AI coding tool execute changes with production-level permissions. People familiar with the incident told the FT the agent concluded the best fix was to “delete and recreate the environment.” AWS later described the event internally as an outage and externally as “extremely limited,” affecting a single service in parts of mainland China.

More awkwardly, multiple employees told the FT this was at least the second production disruption in recent months where an internal AI tool sat at the center of the blast radius—an earlier incident reportedly involved Amazon Q Developer. AWS’s public line is tidy: “user error, not AI error,” and “coincidence that AI tools were involved.” That’s true in the same sense that giving a teenager the keys to a forklift is “user error” when the warehouse collapses.

The engineering lesson is not that models are “rebellious,” but that AWS treated an agent as an extension of an operator and granted it equivalent authority—without equivalent friction. In these cases, engineers reportedly bypassed the normal guardrails: no mandatory second-person approval, no staged rollout, no preflight checks that would have caught the agent’s destructive remediation plan. AWS said Kiro requests authorization by default, but in the December incident the engineer had “broader permissions than expected,” framing it as an access-control misconfiguration.
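One way to make "broader permissions than expected" structurally impossible is an explicit-deny guardrail attached to any agent's credentials, for example as a permissions boundary. The sketch below uses standard AWS IAM policy syntax, but the action list and the `role: ai-agent` principal tag are illustrative assumptions, not anything AWS has described about Kiro's actual setup:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAgentDestructiveActions",
      "Effect": "Deny",
      "Action": [
        "cloudformation:DeleteStack",
        "ec2:TerminateInstances",
        "rds:DeleteDBInstance",
        "s3:DeleteBucket"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:PrincipalTag/role": "ai-agent" }
      }
    }
  ]
}
```

Because an explicit `Deny` overrides any `Allow` in IAM evaluation, a misconfigured grant elsewhere cannot quietly restore the agent's ability to delete infrastructure.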

That’s precisely the problem. Agentic tools multiply the consequences of ordinary IAM sloppiness, because they can translate vague intent (“fix it”) into a chain of actions at machine speed. In classic change management, the constraint is human attention: tickets, peer review, maintenance windows, explicit rollback plans, and—crucially—a kill switch that actually stops the process. Remove those constraints, and you don’t get autonomy; you get automated escalation.
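That friction can also live in software. Below is a minimal, hypothetical sketch of a preflight gate between an agent's proposed plan and execution; the names (`Action`, `preflight`, the verb list) are illustrative and not any real Kiro or AWS API:

```python
# Hypothetical preflight gate: destructive steps in an agent's plan
# require a named second approver, and production targets are blocked
# outright. Illustrative only -- not a real Kiro/AWS interface.
from dataclasses import dataclass
from typing import Optional

DESTRUCTIVE_VERBS = {"delete", "recreate", "terminate", "drop", "destroy"}

@dataclass
class Action:
    verb: str     # e.g. "delete"
    target: str   # e.g. "environment/prod-cost-explorer"

def preflight(plan: list[Action], approved_by: Optional[str]) -> list[str]:
    """Return a list of violations; an empty list means the plan may run."""
    violations = []
    destructive = [a for a in plan if a.verb in DESTRUCTIVE_VERBS]
    if destructive and approved_by is None:
        violations.append(
            f"{len(destructive)} destructive action(s) lack a second approver"
        )
    for a in destructive:
        if a.target.startswith("environment/prod"):
            violations.append(f"{a.verb} on {a.target} blocked outside a change window")
    return violations

# An agent deciding to "delete and recreate the environment" never reaches
# execution without an approver -- and never reaches prod at all.
plan = [Action("delete", "environment/prod-cost-explorer"),
        Action("recreate", "environment/prod-cost-explorer")]
for v in preflight(plan, approved_by=None):
    print(v)
```

The point is not the specific checks but where they sit: between intent and action, where the agent cannot talk its way past them.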

AWS says it has since added safeguards like mandatory peer review and staff training. Good. But the broader context, per the FT, is that Amazon is pushing adoption aggressively—tracking usage and aiming for 80% of developers to use AI coding tools at least weekly—while also selling the same “agents” to customers. If the world’s largest cloud operator can’t reliably sandbox its own copilots behind least privilege and staged deployments, it’s hard to see how smaller shops will.

The irony is that the market’s most celebrated decentralization machine—the cloud—keeps reintroducing single points of failure, now wrapped in “AI efficiency.” The discipline didn’t go away. Someone just asked the bot to do it faster.