It started with a few "Can't connect" messages in Slack. Then the dashboards turned red. If you were on call during the AWS outage of June 12, 2025, you probably remember that sinking feeling in your stomach as the status page remained green while your entire stack was screaming.
Cloud reliability is a bit of a lie we all tell ourselves. We talk about "five nines" and "multi-region failover" as if they are magic spells that prevent downtime. But they aren't. They’re just mitigations. On that Thursday morning, those mitigations were tested in a way that exposed some pretty uncomfortable truths about how much of the modern internet relies on a few specific lines of code in Northern Virginia.
What really happened during the AWS outage of June 12, 2025
The problem wasn't a meteor hitting a data center. It wasn't even a massive power failure. Instead, it was a subtle, creeping issue within the Amazon Elastic Block Store (EBS) control plane in the US-EAST-1 region. Specifically, a localized API throttling event snowballed into a full-blown "deadlock" scenario.
When the control plane started lagging, it didn't just stop. It started rejecting requests to attach or detach volumes. This meant that any auto-scaling group trying to replace a "sick" instance couldn't get a disk assigned to it. Basically, the system was trying to heal itself but couldn't grab the tools it needed to do the job.
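To make that failure mode concrete, here is a minimal sketch of what the attach path looks like from the caller's side. The client setup, volume ID, instance ID, and device name are all placeholders, and the point is narrow: cap the SDK's retries so a struggling control plane gets a quick failure instead of an ever-growing backlog of re-attempts.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

# Cap the SDK's built-in retries so a degraded control plane isn't
# hammered with an ever-growing queue of automatic re-attempts.
ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 3, "mode": "standard"}),
)

def attach_replacement_volume(volume_id: str, instance_id: str) -> bool:
    """Try to attach an EBS volume, tolerating a sick control plane."""
    try:
        ec2.attach_volume(
            VolumeId=volume_id,        # placeholder IDs for illustration
            InstanceId=instance_id,
            Device="/dev/xvdf",
        )
        return True
    except (ClientError, EndpointConnectionError) as exc:
        # During a control-plane event this is exactly where requests pile
        # up; surface the failure instead of looping forever.
        print(f"attach_volume failed, deferring to a human: {exc}")
        return False
```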
Most people think "outage" means the servers are off. Not this time. The servers were humming along just fine. The problem was that the management of those servers—the brains of the operation—was stuck in a loop. If your app was already running and didn't need to scale, you were fine. But if you had a deployment scheduled or a sudden spike in traffic? You were toast.
The ripple effect across the stack
We saw huge names like Netflix, Disney+, and even some internal Amazon logistics tools start to stutter. It’s kinda wild when you think about it. You’ve got these billion-dollar infrastructures, and they’re all vulnerable to the same bottleneck.
- API Timeouts: Third-party integrations that relied on AWS Lambda functions in US-EAST-1 began timing out, which triggered a "retry storm": thousands of clients retrying on the same schedule and hammering the already-degraded API (the backoff sketch after this list is the standard antidote).
- Database Latency: RDS instances in the affected availability zones struggled with I/O credits, leading to "ghost" connections that hung for minutes.
- The Status Page Myth: For the first 45 minutes, the AWS Service Health Dashboard showed "Operating Normally." This is the part that drives DevOps teams crazy.
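That retry storm deserves a closer look, because it is easy to create by accident: every client retries on the same fixed schedule, so each round of failures hits the API harder than the last. Here is a minimal, assumption-laden sketch of capped exponential backoff with jitter; the wrapped call is just a placeholder for whatever AWS API your integration depends on.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying with capped exponential backoff and full jitter.

    Jitter spreads retries out in time so thousands of clients don't
    hammer an already-degraded API in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # narrow this to your client's real error types
            if attempt == max_attempts:
                raise  # give up and let the caller degrade gracefully
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))  # full jitter

# Hypothetical usage: wrap a Lambda invocation or any other flaky call.
# result = call_with_backoff(lambda: lambda_client.invoke(FunctionName="my-fn"))
```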
Honesty is important here: AWS is incredibly reliable. But US-EAST-1 is the oldest, most crowded, and arguably most complex region in their entire global footprint. It’s where most new features launch, and it's where the most legacy "technical debt" lives in the underlying physical hardware.
Why "Multi-AZ" wasn't enough this time
We're always told to spread our workloads across multiple Availability Zones (AZs). It’s the first thing you learn in the Solutions Architect exam. But during the AWS outage of June 12, 2025, the failure was at the regional control plane level.
If the brain of the region is having a seizure, it doesn't matter if you have servers in three different buildings. They all talk to the same brain.
I talked to a few lead engineers who were scrambling that day. One of them told me that their "failover" actually made things worse. Their system detected a failure in AZ-A and tried to move everything to AZ-B. But since the EBS control plane was the bottleneck, the requests to create new volumes in AZ-B just added more load to the already failing API. It was a self-inflicted Distributed Denial of Service (DDoS) attack.
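One way to blunt that kind of self-inflicted amplification is to put a hard budget on automated recovery actions, so a failover issues a bounded number of control-plane calls per minute and then stops to let a human look. The numbers below are made up for illustration; this is a sketch of the idea, not a drop-in remediation system.

```python
import threading
import time

class RecoveryBudget:
    """Allow at most `limit` recovery actions per rolling `window` seconds.

    The point: when the control plane is already sick, automated healing
    should slow down instead of piling on more create/attach requests.
    """

    def __init__(self, limit=10, window=60.0):
        self.limit, self.window = limit, window
        self._stamps = []
        self._lock = threading.Lock()

    def allow(self) -> bool:
        now = time.monotonic()
        with self._lock:
            # Drop timestamps that have aged out of the rolling window.
            self._stamps = [t for t in self._stamps if now - t < self.window]
            if len(self._stamps) >= self.limit:
                return False  # over budget: skip this cycle and alert someone
            self._stamps.append(now)
            return True

budget = RecoveryBudget(limit=10, window=60.0)
# if budget.allow():
#     replace_unhealthy_instance()  # hypothetical recovery routine
```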
The cost of complexity
Modern cloud architecture is incredibly complex. You’re not just renting a computer; you’re renting a massive, interconnected web of microservices. When you use an EC2 instance, you’re also using VPC networking, IAM for permissions, EBS for storage, and CloudWatch for monitoring.
If any one of those "hidden" services has a hiccup, the whole thing can come crashing down.
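One way to make those hidden dependencies visible is to log every AWS API call your process actually makes. The sketch below leans on boto3's client event hooks and just prints what it sees; treat it as an illustration of the idea rather than a finished auditing tool, and assume you would ship the output to your logging pipeline instead.

```python
import boto3

def log_aws_call(**kwargs):
    # 'before-call' events carry the operation model, which tells us which
    # service and operation the app is about to depend on.
    model = kwargs.get("model")
    if model is not None:
        print(f"AWS dependency: {model.service_model.service_name}.{model.name}")

ec2 = boto3.client("ec2", region_name="us-east-1")
# Event names are hierarchical, so registering on the 'before-call' prefix
# catches every operation this client makes.
ec2.meta.events.register("before-call", log_aws_call)

# Any call now leaves a trail of the dependency it exercises, e.g.:
# ec2.describe_instances()
```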
Lessons for the next big one
It’s going to happen again. That’s not being pessimistic; it’s just how systems work. Entropy is real. So, what do we actually do with the lessons from the AWS outage of June 12, 2025?
First, stop treating US-EAST-1 as your default home. It’s the "default" for a reason: it’s the oldest. If you can, move your critical production workloads to US-WEST-2 (Oregon) or even a newer region like US-EAST-2 (Ohio). They tend to be a bit more stable during these weird API-level events.
Second, you've got to test your "static stability." This is a fancy way of saying: "Will my app keep running if the AWS API is down?" If your app needs to talk to the AWS API just to function—to fetch a secret from Secrets Manager or to mount a disk—you are vulnerable.
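Here is a minimal sketch of that idea for a Secrets Manager lookup: fetch on a best-effort basis, but keep a local copy on disk so an unreachable control plane doesn't take your process down with it. The secret name and cache path are placeholders, and real deployments would add encryption and rotation handling.

```python
import json
from pathlib import Path

import boto3
from botocore.exceptions import BotoCoreError, ClientError

CACHE_PATH = Path("/var/cache/myapp/db-credentials.json")  # placeholder path
SECRET_ID = "prod/db-credentials"                          # placeholder name

def load_secret() -> dict:
    """Prefer a fresh value, but survive an unreachable Secrets Manager."""
    try:
        client = boto3.client("secretsmanager", region_name="us-east-1")
        value = json.loads(client.get_secret_value(SecretId=SECRET_ID)["SecretString"])
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(value))  # refresh the local copy
        return value
    except (BotoCoreError, ClientError):
        if CACHE_PATH.exists():
            # Static stability: keep running on the last known-good value.
            return json.loads(CACHE_PATH.read_text())
        raise  # no cache yet, so fail loudly rather than limp along blind
```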
Hardening your infrastructure
- Cache your configuration: Don't call the AWS API every time a process starts. Cache those values locally or in a distributed cache like Redis that lives outside the main control plane.
- Graceful Degradation: If a non-essential service (like your logging or analytics) goes down, your main app should keep working. Use circuit breakers; a bare-bones sketch follows this list.
- Cross-Region Failover: This is expensive and hard. But for a business losing $100k an hour, it's cheaper than sitting in the dark.
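For the circuit-breaker point above, a bare-bones version looks like the sketch below. A real implementation (or a library you already trust) adds half-open probing and metrics; this one only shows the shape of the idea: after enough consecutive failures, stop calling the flaky dependency for a while and return a fallback instead of blocking the main request path.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback  # circuit open: skip the flaky dependency
            self.opened_at = None  # cooldown elapsed, give it another shot
        try:
            result = fn()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback

analytics_breaker = CircuitBreaker()
# Hypothetical usage: analytics goes down, the main request path shrugs.
# analytics_breaker.call(lambda: publish_event(payload), fallback=None)
```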
The June 12 event wasn't a failure of the cloud concept. It was a reminder that the cloud is just "someone else's computer," and sometimes, that person has a really bad day.
Actionable Next Steps
Instead of just worrying about the next red icon on a status page, take these concrete steps this week:
- Audit your dependencies: Map out every time your application calls an AWS service during its startup or runtime. If that service disappeared for four hours, what would happen?
- Run a "Game Day": Manually simulate a control plane failure in a staging environment. Block all outbound calls to the AWS API and see if your app can still handle traffic.
- Review your RPO and RTO: Make sure your "Recovery Point Objective" and "Recovery Time Objective" are actually realistic. If you tell your boss you can be back up in five minutes but it takes twenty minutes just to spin up a new database, you’re lying to yourself.
- Decouple from US-EAST-1: If you don't need to be there, start planning a migration to a more modern, less congested region. It won't be a fun weekend project, but you'll sleep better the next time Slack starts blowing up.
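For the Game Day item above, one low-tech way to simulate a control-plane failure is to point a client at an unreachable endpoint with aggressive timeouts, then watch whether your fallback paths (like the cached secret earlier) keep serving traffic. The endpoint, secret name, and client below are placeholders, and this belongs in staging, never production.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

# Point the client at a black-hole endpoint and fail fast: from the app's
# point of view, the AWS control plane has simply vanished.
broken_secrets = boto3.client(
    "secretsmanager",
    region_name="us-east-1",
    endpoint_url="https://192.0.2.1",  # TEST-NET address, never routable
    config=Config(connect_timeout=1, read_timeout=1, retries={"max_attempts": 1}),
)

def game_day_check() -> None:
    try:
        broken_secrets.get_secret_value(SecretId="prod/db-credentials")  # placeholder
        print("Unexpected success: is the endpoint override actually applied?")
    except (ClientError, ConnectTimeoutError, EndpointConnectionError):
        # This is the interesting branch: does the app's cached/fallback
        # path keep handling traffic while the API is dark?
        print("Control plane unreachable, exercising the fallback path...")

game_day_check()
```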
Cloud architecture is about managing risk, not eliminating it. The people who survived the June 12 outage with the least amount of stress weren't the ones with the most expensive support plans; they were the ones who designed their systems to be "cynical" about the environment they lived in.