Everything looked broken — but nothing was
In complex environments, some of the most disruptive failures happen between systems.
Not within a component.
Not due to a clear error.
But at the boundary where two systems interpret behavior differently.
---
The situation
In one environment, multiple applications suddenly appeared to be down.
Load balancers marked backend servers as unavailable.
Traffic was no longer routed.
From the outside, services were effectively offline.
---
What made this confusing
At the same time:
- servers were reachable
- applications were running
- direct access to the servers worked
- no recent changes were reported by application teams
From each team’s perspective, everything seemed fine.
---
The missing connection
The issue appeared after a routine update of the web server software.
A subtle change had been introduced:
> the web server now required a valid Host header in HTTP requests
At the same time, the load balancer was performing health checks using a minimal request:
```
GET /
```
Without a Host header.
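The mismatch can be sketched in a few lines. Per RFC 7230, section 5.4, a server must answer an HTTP/1.1 request that lacks a Host header with 400 Bad Request; the function below imitates that strict behavior. The hostname `app.example.com` is a placeholder, and this is a simplified model of the probe, not the actual vendor software.

```python
def validate_request(raw_request: bytes) -> int:
    """Return the status code a strict HTTP/1.1 server would send.

    RFC 7230 section 5.4: an HTTP/1.1 request without a Host header
    must be rejected with 400 (Bad Request).
    """
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("ascii")
    request_line, *header_lines = head.split("\r\n")
    header_names = {
        line.split(":", 1)[0].strip().lower()
        for line in header_lines
        if ":" in line
    }
    if request_line.endswith("HTTP/1.1") and "host" not in header_names:
        return 400  # rejected: missing Host header
    return 200  # accepted

# The load balancer's minimal probe, with no Host header:
probe = b"GET / HTTP/1.1\r\n\r\n"
print(validate_request(probe))   # 400 -> the health check fails

# The same probe with a Host header added:
fixed = b"GET / HTTP/1.1\r\nHost: app.example.com\r\n\r\n"
print(validate_request(fixed))   # 200 -> the health check passes
```

Before the web server update, the minimal probe was tolerated; after it, the same probe started failing, with nothing else having changed.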
---
What actually happened
Because the request did not include a Host header:
- the web server rejected the request (for a strict HTTP/1.1 server, typically with a 400 Bad Request)
- the load balancer interpreted the response as a failure
- backend servers were marked as down
- traffic was no longer routed
The application itself was still working.
But the system that decided whether it was reachable concluded that it was not.
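The load balancer's side of that decision can be sketched the same way. This is a simplified model, not vendor logic: real products make the failure threshold configurable, and the value of three consecutive failures below is an assumption.

```python
def mark_backend(probe_status_codes, unhealthy_threshold=3):
    """Simplified health-check logic: a backend is marked DOWN after
    `unhealthy_threshold` consecutive non-2xx probe responses."""
    consecutive_failures = 0
    for code in probe_status_codes:
        if 200 <= code < 300:
            consecutive_failures = 0  # a healthy probe resets the counter
        else:
            consecutive_failures += 1
        if consecutive_failures >= unhealthy_threshold:
            return "DOWN"
    return "UP"

# After the update, every probe is rejected with 400, so the backend
# is marked DOWN even though the application behind it is healthy:
print(mark_backend([400, 400, 400]))  # DOWN
```

Nothing in this loop inspects the application itself; it only sees the probe responses, which is exactly why a probe-format mismatch looks like an outage.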
---
Why this was difficult to diagnose
Each team saw only part of the system:
- load balancer team saw servers marked as down
- application teams saw working applications
- platform teams saw no obvious failures
No single team owned the interaction between:
load balancer ↔ web server ↔ application behavior
The problem existed in that interaction.
---
The fix
The resolution required two coordinated changes:
- Update the load balancer health check to include a valid Host header
- Ensure that the web server configuration aligned with expected request patterns
In addition:
- updates were temporarily paused to prevent further disruption
- changes were rolled out in a controlled way
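As one illustration of the first change (assuming HAProxy 2.2 or later; backend, host, and address names are placeholders), the health-check probe can be told to send a full request line and a Host header:

```
backend app_servers
    option httpchk
    # Send an explicit HTTP/1.1 probe including a Host header
    http-check send meth GET uri / ver HTTP/1.1 hdr Host app.example.com
    server app1 10.0.0.11:8080 check
```

Other load balancers offer equivalent settings; the essential point is that the probe must be a request the web server considers valid.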
---
The result
Once the health check behavior matched the web server's expectations:
- backend servers were correctly marked as healthy
- traffic routing resumed
- applications became reachable again
The underlying systems had not been broken.
They had simply disagreed.
---
The lesson
Many infrastructure failures are not caused by:
- broken systems
- missing resources
- obvious misconfigurations
They are caused by:
- mismatched assumptions
- protocol-level differences
- lack of shared understanding across teams
These issues are often hardest to diagnose because they exist between domains.
---
Closing thought
If different parts of your system are maintained by different teams, the most critical failures may occur at the boundaries.
Understanding how systems interact is often more important than understanding each system in isolation.
This is where many of the highest-impact issues hide.