When a routine server update breaks traffic through the load balancer

A cross-domain failure pattern: the servers are healthy, the application is fine, but traffic still fails because a platform update changed health-check behavior.

A platform update changed web server behavior in a subtle but important way. The load balancer health checks still used an older probe pattern, so healthy systems started being marked down even though the application itself had not failed.

From the application side, things looked normal. From the server side, the update looked routine. From the load balancer side, backends looked unhealthy.

No single team saw the full problem.

What happened

The web tier had been updated. One side effect was a change in how requests without a Host header were handled.

That did not matter for normal user traffic. It mattered for the load balancer probes, which still relied on a legacy request pattern.
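A minimal sketch of the mismatch, assuming the updated backend now rejects requests that lack a Host header. The endpoint path and hostname are illustrative, not details from the incident:

```python
# Simulate a post-update backend that refuses requests without a Host
# header, then send it both the legacy and the updated probe pattern.
import http.server
import socket
import threading

class Backend(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # Post-update behavior: refuse any request without a Host header.
        status = 200 if "Host" in self.headers else 400
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Backend)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def probe(raw_request: bytes) -> str:
    """Send a raw HTTP request and return the response status line."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(raw_request)
        return s.recv(1024).split(b"\r\n", 1)[0].decode()

# Legacy probe pattern: no Host header -- now marked unhealthy.
legacy = probe(b"GET /healthz HTTP/1.0\r\n\r\n")
# Updated probe: includes a Host header -- passes again.
updated = probe(b"GET /healthz HTTP/1.1\r\nHost: backend.internal\r\n"
                b"Connection: close\r\n\r\n")
print(legacy, "/", updated)
server.shutdown()
```

The same application code answers both probes; only the request shape differs. That is exactly why each team's local view looked correct.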

The result was simple and expensive:

  - the legacy probes started failing against the updated servers
  - the load balancer marked healthy backends as down
  - traffic to those backends stopped, even though the application had never failed

Why this kind of issue is easy to miss

This is a classic cross-domain failure.

Each team sees a technically correct fragment:

  - the application team sees an application that works
  - the server team sees a routine, successful update
  - the load balancer team sees backends failing health checks

The issue is not “in” one component. It emerges from the interaction between them.

What fixed it

The resolution was straightforward once the real dependency was understood:

  1. stop the wider rollout
  2. update the health-check request to match the new behavior
  3. validate traffic recovery
  4. resume deployment in the right order
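Steps 2 and 3 above can be sketched as a small gate in the rollout tooling: repeat the updated probe against every backend until all pass, and only then resume. The function signature and backend names are illustrative assumptions, not details from the incident:

```python
# Gate the rollout: poll every backend with the updated health-check
# probe until all pass, or give up after a timeout.
import time
from typing import Callable, Iterable

def wait_until_healthy(backends: Iterable[str],
                       probe: Callable[[str], bool],
                       timeout: float = 60.0,
                       interval: float = 1.0) -> bool:
    """Return True once probe() passes for every backend, False on timeout."""
    backends = list(backends)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(probe(b) for b in backends):
            return True
        time.sleep(interval)
    return False
```

Here probe() would send the same request the load balancer now uses, so the rollout only resumes once the real dependency, not a proxy for it, is satisfied.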

The difficult part was not the fix. It was understanding where the real fault line was.

Why this matters

Many infrastructure failures are not caused by obviously broken systems.

They happen because:

  - a routine change shifts behavior that another layer silently depends on
  - health checks and probes drift out of sync with what servers actually do
  - ownership is split across teams, so no one sees the full interaction

This is exactly why mature environments still suffer “surprising” outages.

What to take away

If your environment depends on load balancers, proxies, health checks and layered ownership, do not assume that a routine patch is operationally routine.

Review:

  - what your health checks actually send, and whether that still matches real server behavior
  - who owns each probe, and who is notified when server behavior changes
  - your rollout order, so checks are updated before the platforms they monitor

Where I help

This is the kind of issue I help clients find before it becomes an outage: failures that sit between Linux, cloud, networking, patching and security work.

If that sounds familiar, start with a short discovery call.

Book a consultation

Need help turning infrastructure risk into a practical plan?

I help teams prioritize remediation, harden platforms, and reduce risk without adding operational chaos.

Book a discovery call