What Happened
We built systems that page humans to investigate failures. Then complexity grew 10x while investigation stayed manual.
That's it. That's the incident.
Detection got faster. Dashboards got prettier. Investigation? Still a human opening six tools at 3 AM, correlating data by hand, relying on memory to figure out what broke.
Twenty years. The dashboards changed. The process didn't.
What 3 AM Actually Looks Like
This isn't a horror story. It's a pretty average incident - clear symptoms, one root cause, resolved in under an hour. Most teams would call this a good night.
02:47 AM - Alert fires. "High latency on checkout-service." You wake up.
02:52 AM - Open your metrics dashboard. Latency spike confirmed. /api/checkout/complete. P99 went from 200ms to 4.2 seconds.
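A quick aside on the shorthand: p99 is the latency 99% of requests come in under. A toy sketch, assuming raw samples - real metrics backends approximate it from histogram buckets:
// Toy p99: sort samples, take the value 99% of them fall at or under.
function p99(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return sorted[Math.max(0, Math.ceil(sorted.length * 0.99) - 1)];
}
// A handful of multi-second outliers drags p99 from ~200ms into
// seconds while the median barely moves.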
02:58 AM - Is it the service or downstream? Traces show 3.8 seconds in a call to payment-gateway. But is the gateway slow, or is this service timing out waiting for it?
03:06 AM - Switch to payment-gateway's dashboard. Latency looks normal. Logs show nothing. If the gateway is healthy, that 3.8 seconds is being spent on the caller's side. Maybe it's not the gateway.
03:14 AM - Back to checkout-service. There was a deploy at 02:30. Pull up the diff. 47 files changed. You scan for database queries... there's a new one. Did someone add an N+1?
// Before: a single query
const orders = await db.query('SELECT * FROM orders WHERE user_id = ?', [userId]);
// After: N+1 - one query per order, inside a loop
const orders = await db.query('SELECT * FROM orders WHERE user_id = ?', [userId]);
for (const order of orders) {
  order.items = await db.query('SELECT * FROM order_items WHERE order_id = ?', [order.id]);
}
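For reference, the standard fix is to batch that lookup into one query - a minimal sketch, assuming a mysql2-style client that expands arrays passed to IN (?):
// Fix: fetch all items in one query, then group them in memory.
const orders = await db.query('SELECT * FROM orders WHERE user_id = ?', [userId]);
const ids = orders.map((o) => o.id);
const items = ids.length
  ? await db.query('SELECT * FROM order_items WHERE order_id IN (?)', [ids])
  : [];
const byOrder = new Map();
for (const item of items) {
  if (!byOrder.has(item.order_id)) byOrder.set(item.order_id, []);
  byOrder.get(item.order_id).push(item);
}
for (const order of orders) {
  order.items = byOrder.get(order.id) ?? [];
}
Two round trips instead of N+1, and the connection pool stays bounded.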
03:23 AM - Connect to the database. SHOW PROCESSLIST. 200+ connections from checkout-service. Normally there are 30. Found it.
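For anyone following along, the same check as a query - assuming MySQL (which SHOW PROCESSLIST suggests) and the same db client as above:
// Hypothetical diagnostic: count open connections per user.
// Each service typically connects with its own DB user, so the
// runaway service shows up immediately.
const conns = await db.query(
  `SELECT user, COUNT(*) AS connections
     FROM information_schema.processlist
    GROUP BY user
    ORDER BY connections DESC`
);
console.table(conns);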
03:31 AM - Rollback. Wait for pods to cycle. Latency recovers.
03:38 AM - Write up the incident. Go back to sleep. Wake up exhausted.
Total investigation time: 44 minutes. Detection took 30 seconds. A human still had to query six tools, hold context in their head, and make the correlation manually.
This is normal. It shouldn't be.
Root Cause
The industry automated detection but stopped there.
Detection - knowing something is wrong - is solved. Anomaly recognition, threshold alerting, pattern matching. Done.
Investigation - understanding why something is wrong - is still manual. A human must decide where to look, query across tools, correlate events, trace dependencies, recall past incidents from memory, and determine the fix.
Every step requires judgment, memory, and time.
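A toy illustration of just one of those steps - matching a deploy to an alert, the correlation the engineer above did by hand at 03:14 (the event shapes are hypothetical; in practice this data lives in separate tools):
// Toy correlation: which deploys to the alerting service landed
// shortly before the alert fired? (Hypothetical event shapes.)
const alert = { service: 'checkout-service', at: Date.parse('2024-01-01T02:47:00Z') };
const deploys = [
  { service: 'checkout-service', at: Date.parse('2024-01-01T02:30:00Z'), sha: 'abc123' },
  { service: 'search-service', at: Date.parse('2024-01-01T01:10:00Z'), sha: 'def456' },
];
const HOUR = 60 * 60 * 1000;
const suspects = deploys.filter(
  (d) => d.service === alert.service && alert.at - d.at > 0 && alert.at - d.at < HOUR
);
console.log(suspects); // the 02:30 deploy - a machine can find this, too
Ten lines, given the data in one place. The hard part isn't the logic. It's that the data never is.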
The automation boundary stopped at the alert. It hasn't moved in twenty years.
The Cost
Resolution time is unchanged. You can detect a failure in 30 seconds, but if root cause analysis takes 44 minutes, your MTTR doesn't move. In the incident above, detection was 30 seconds of a roughly 50-minute resolution - about 1%. Halving detection saves 15 seconds; halving investigation saves 22 minutes.
Engineering capacity is bleeding out. Industry benchmarks target toil below 50%. Most teams report 30-50%. Some hit 90%. Meanwhile, engineer tenure averages 2-3 years. You're spending half your best engineers' time on reactive work. Then they leave. Then you start over.
Knowledge is a single point of failure. How systems work isn't documented. It's remembered. By individuals. Who leave. Every departure takes operational context with them. Every new hire burns months rebuilding it.
Burnout is structural, not personal. On-call pain isn't an unavoidable disruption - it's a design flaw we decided to live with.
Why It's Still Broken
Tool fragmentation. The incident above required six tools - alerting, metrics, traces, logs, deploy history, and the database. Each knows part of the story. None knows the whole thing. The correlation happens in a human's head at 3 AM.
Tribal knowledge. "Ask Sarah, she knows how that service works." Sarah leaves in 8 months. Now nobody knows.
Misapplied AI. The industry poured money into AI-powered query interfaces. Better search. Smarter autocomplete. But the query was never the bottleneck. Knowing what to query was.
Cultural normalization. "On-call is just part of the job." We built systems that require heroics and called it engineering culture.
The Point
Every layer of infrastructure eventually got automated. Cloud automated hardware. Containers automated OS config. CI/CD automated deployment.
One layer is still manual: the layer where a human opens six tools at 3 AM, correlates data by hand, and relies on memory to find root cause.
That's not a permanent condition. It's just where the automation boundary sits right now.
Investigation will follow the same path. We're not waiting.
