When a routine server update breaks traffic through the load balancer
A platform update changed web server behavior in a subtle but important way. The load balancer health checks still used an older probe pattern, so healthy systems started being marked down even though the application itself had not failed.
From the application side, things looked normal. From the server side, the update looked routine. From the load balancer side, backends looked unhealthy.
No single team saw the full problem.
What happened
The web tier had been updated. One side effect was a change in how requests without a Host header were handled.
That did not matter for normal user traffic, because real clients always send a Host header. It did matter for the load balancer probes, which still sent a legacy request without one.
The result was simple and expensive:
- the servers were up
- the application was reachable directly
- the load balancer marked healthy nodes as down
- traffic failed anyway
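The failure mode above can be reproduced in a few lines. The incident details are not public, so this is a minimal Python sketch under one assumption: the updated web tier started rejecting requests that arrive without a Host header (here with a 400), while normal client requests kept working. The handler, the `/health` path, and the probe function are all hypothetical stand-ins.

```python
import http.server
import socket
import threading

class StrictHostHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for the updated web tier: requests
    without a Host header are now rejected."""
    def do_GET(self):
        if self.headers.get("Host") is None:
            self.send_response(400)   # new behavior: bare requests fail
        else:
            self.send_response(200)   # normal clients are unaffected
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

def probe(port: int, send_host: bool) -> int:
    """Hand-built probe over a raw socket, so we control exactly which
    headers go out -- the way a legacy health check might."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        req = "GET /health HTTP/1.1\r\n"
        if send_host:
            req += f"Host: 127.0.0.1:{port}\r\n"
        req += "Connection: close\r\n\r\n"
        s.sendall(req.encode())
        status_line = s.recv(1024).split(b"\r\n", 1)[0]
        return int(status_line.split()[1])

server = http.server.HTTPServer(("127.0.0.1", 0), StrictHostHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

legacy_probe = probe(port, send_host=False)   # node gets marked "down"
updated_probe = probe(port, send_host=True)   # node stays "healthy"
server.shutdown()
print(legacy_probe, updated_probe)            # prints "400 200"
```

The server, the application, and both probes are all "working as designed" here; only the combination fails.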
Why this kind of issue is easy to miss
This is a classic cross-domain failure.
Each team sees a technically correct fragment:
- the server team sees a valid update
- the load balancer team sees failed probes
- the application team sees broken service
- monitoring may show partial health instead of customer impact
The issue is not “in” one component. It emerges from the interaction between them.
What fixed it
The resolution was straightforward once the real dependency was understood:
- stop the wider rollout
- update the health-check request to match the new behavior
- validate traffic recovery
- resume deployment in the right order
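The second step above, updating the health-check request, is a small configuration change once you know it is needed. The load balancer in this incident is not named, so as one illustration, in HAProxy 2.2+ syntax the fix amounts to making the probe send the same kind of request a real client sends (backend name, path, and host value below are hypothetical):

```
backend web_tier
    # Legacy probes sent a bare request with no Host header.
    # The updated probe carries one, matching the new server behavior.
    option httpchk
    http-check send meth GET uri /health ver HTTP/1.1 hdr Host app.example.com
    http-check expect status 200
    server web1 10.0.0.11:80 check
```

The equivalent change exists in most load balancers; the principle is the same everywhere: the probe must exercise the request shape the server actually requires.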
The difficult part was not the fix. It was understanding where the real fault line was.
Why this matters
Many infrastructure failures are not caused by obviously broken systems.
They happen because:
- assumptions survive longer than architectures do
- one team changes behavior another team depends on
- health checks validate the wrong thing
- monitoring reflects component status, not business reality
This is exactly why mature environments still suffer “surprising” outages.
What to take away
If your environment depends on load balancers, proxies, health checks and layered ownership, do not assume that a routine patch is operationally routine.
Review:
- what your health checks actually validate
- whether they still match current server behavior
- whether monitoring reflects customer impact
- which cross-team assumptions are undocumented but critical
Where I help
This is the kind of issue I help clients find before it becomes an outage: failures that sit between Linux, cloud, networking, patching and security work.
If that sounds familiar, start with a short discovery call.