Monitoring said everything was fine
In many environments, monitoring is treated as the primary source of truth.
If all checks are green, the system is assumed to be healthy.
In practice, this assumption can be dangerously wrong.
---
The situation
In one environment, a digital content platform experienced a complete loss of revenue for an entire day.
No sales were recorded.
At the same time, all monitoring systems reported that everything was functioning normally.
- servers were up
- web services were running
- databases were operational
- system resources were healthy
From an infrastructure perspective, everything appeared to be in order.
From a business perspective, the system was effectively down.
---
What was actually happening
Anyone visiting the site saw blank pages.
The cause was a coding error in a shared PHP include file.
Because that file was included across most of the application, every page rendered as empty output.
Technically:
- Apache was running
- PHP was executing
- databases were responding
But the application produced no usable output.
---
Why monitoring failed
The monitoring setup was comprehensive, but it focused on system health rather than business outcomes.
It checked:
- server availability
- service uptime
- database connectivity
- resource usage
All of these checks were correct.
What it did not check was:
> Does the system actually produce usable output?
This is a common gap.
Monitoring systems often validate that components are running, but not that the system is delivering value.
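The gap can be illustrated with a minimal sketch. These are hypothetical checks, not the platform's actual monitoring code: an infrastructure check that only verifies the server responded, next to an output check that verifies the response actually contains content.

```python
def infrastructure_check(status_code: int) -> bool:
    """Classic health check: the web server answered at all."""
    return status_code == 200

def output_check(body: str) -> bool:
    """Business-level check: the response contains real content,
    not an empty or whitespace-only page."""
    return len(body.strip()) > 0

# A blank page served with HTTP 200 passes the first check
# but fails the second -- exactly the failure described above.
status, body = 200, ""               # what the broken site returned
print(infrastructure_check(status))  # True  -> monitoring stays green
print(output_check(body))            # False -> the signal that was missing
```

The point of the second check is that it asserts something about the output itself, not about the components that produced it.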
---
The impact
The impact extended beyond lost sales.
Affiliate partners continued sending traffic and expected compensation based on normal conversion rates.
This created both financial loss and reputational risk.
---
The fix
The solution required changes in two areas.
1. Reduce the blast radius
The problematic code was refactored into a separate, isolated function.
This ensured that a failure in that component could no longer take down the entire application.
In the worst case, a non-critical feature would fail while core functionality remained available.
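The blast-radius idea can be sketched as follows (in Python rather than the platform's PHP, with hypothetical function names): the non-critical component is wrapped so that its failure degrades the page instead of blanking it.

```python
def render_core() -> str:
    # Core functionality: product listing, checkout, etc.
    return "<main>product catalog</main>"

def render_recommendations() -> str:
    # Hypothetical non-critical feature, standing in for the
    # shared include that originally broke every page.
    raise RuntimeError("bug in shared code")

def render_page() -> str:
    parts = [render_core()]
    try:
        parts.append(render_recommendations())
    except Exception:
        # Degrade gracefully: the feature disappears,
        # the page does not.
        parts.append("<!-- recommendations unavailable -->")
    return "\n".join(parts)

print(render_page())  # core content still renders despite the failure
```

The design choice here is that only the core path is allowed to fail the whole request; everything else fails closed into an empty placeholder.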
---
2. Monitor business outcomes
A new monitoring approach was introduced.
In addition to system health, the platform began tracking:
- conversions per product
- conversions per mobile provider
- conversions per country
- time-based conversion patterns
These values were compared against expected ranges.
If current conversions deviated significantly from normal behavior, an alert was triggered.
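The comparison against expected ranges can be sketched like this (illustrative numbers and thresholds, not the platform's actual values): the current conversion count for a dimension is checked against a historical baseline, and a large deviation raises an alert.

```python
def should_alert(current: float, baseline_mean: float,
                 baseline_std: float, threshold: float = 3.0) -> bool:
    """Alert when the current conversion count deviates from the
    historical baseline by more than `threshold` standard deviations."""
    if baseline_std <= 0:
        # No variance in the baseline: any difference is anomalous.
        return current != baseline_mean
    return abs(current - baseline_mean) / baseline_std > threshold

# Example: a product that normally converts ~50 times per hour (std 8).
print(should_alert(47, 50, 8))  # False: normal fluctuation
print(should_alert(0, 50, 8))   # True: the "blank site" scenario
```

Run per product, per provider, per country, and per time window, the same check catches both total outages (conversions drop to zero everywhere) and partial ones (a single country or provider goes quiet).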
---
The result
This approach proved highly effective.
In some cases, the system detected issues in external systems before those systems identified them internally.
The original failure scenario, in which the entire site produced empty output without triggering an alert, did not recur.
---
The lesson
Infrastructure monitoring should answer two questions:
- Are systems running?
- Is the business functioning?
If the second question is not covered, critical failures can remain invisible.
The most expensive outages are often not caused by systems going down, but by systems continuing to run without delivering value.
---
Closing thought
If your monitoring focuses primarily on system health, it may be missing the signals that matter most.
A structured infrastructure assessment can help identify these gaps and define monitoring approaches that reflect real operational and business risk.