When Resilience Becomes a Lie
If your system survives everything…
it might be because it never forces you to fix anything.
That is not resilience. That is absorption.
1. The Comfortable Story
Most organisations celebrate resilience. They point at uptime, incident recovery, and teams that keep delivery moving under pressure.
It reads like strength. It often hides something else.
A system that continues to operate despite recurring issues does not automatically qualify as robust. It may simply tolerate dysfunction.
2. The Critical Distinction
Failure will occur: incidents, broken dependencies, degraded quality, delayed delivery.
The defining factor sits in the response.
- Healthy systems expose failure, quantify it, and iterate on it. They accept that failure exists, keep it under control, and progressively reduce it through learning cycles.
- Dysfunctional systems absorb failure, compensate locally, and continue unchanged. They keep it hidden, so nothing is improved.
Progress in a healthy system is measurable. It does not aim at zero failure, but at controlled reduction.
This is where SLO-based alerting becomes critical: breaches signal where learning must happen, not just where recovery succeeded.
Same events. Opposite trajectories.
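To make "measurable" concrete, here is a minimal sketch of an error-budget calculation. The `SloWindow` shape, its field names, and the sample numbers are assumptions for illustration, not a reference to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one service over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative means breached)."""
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget: any failure is a breach.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


# Illustrative numbers only: 1,500 failures against roughly 1,000 allowed.
window = SloWindow(total_requests=1_000_000, failed_requests=1_500, slo_target=0.999)
remaining = error_budget_remaining(window)
print(f"error budget remaining: {remaining:.0%}")
```

A negative result means the budget is overspent. In a healthy system that number opens a learning cycle; in an absorbing one it gets explained away.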
3. How Absorption Happens (Behaviours, Not Theory)
Failure rarely disappears by itself.
People make it disappear.
- Hero culture: a few individuals close gaps at any cost, often outside working hours.
- Silent overtime: effort increases while metrics remain unchanged.
- Workarounds over fixes: temporary patches become permanent operating modes.
- Managerial avoidance: difficult structural decisions get postponed in favour of short-term continuity.
- Narrative smoothing: incidents get reframed as minor or isolated to protect the perception of control.
Each behaviour feels pragmatic in isolation. Together, they create a system that absorbs pressure instead of learning from it.
4. Why It Persists
Absorption produces a dangerous outcome: continuity without consequence.
- Delivery continues, so urgency disappears.
- Metrics remain acceptable, so escalation never triggers.
- Customers receive something, so impact looks contained.
This is a structural watermelon effect: green on the outside, red on the inside.
The organisation confuses absence of collapse with presence of control.
5. A Concrete Pattern (Platform as Example)
Consider a platform team under pressure:
- Upstream dependencies remain unstable.
- Integration paths lack clear ownership.
- Incidents recur around the same interfaces.
What happens in many organisations:
- Platform engineers build defensive layers and ad-hoc fixes.
- Product teams route around constraints instead of addressing them.
- Incidents close quickly while root causes linger.
Throughput holds.
The system “works”.
This is precisely the trap: success gets measured through throughput, a vanity KPI focused on output rather than outcome. That measure rests on an incomplete view of the system. Throughput alone says nothing about stability, recurrence, or systemic health.
A stable system requires multiple variables: recurrence rate, recovery quality, error budgets, and the cost of compensation. Without them, green dashboards hide red reality.
If throughput is your only signal, you are not measuring performance. You are measuring activity.
Under the surface:
- Coupling increases.
- Recovery time improves superficially while the recurrence rate stays flat.
- Cognitive load rises across teams.
The platform turns into a buffer, not a leverage engine.
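The gap between recovery time and recurrence can be sketched in a few lines. The `Incident` record, the interface names, and the sample figures below are hypothetical; the point is only that the two numbers must be read together.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    interface: str             # hypothetical label for the failing interface
    minutes_to_recover: float


def recurrence_rate(incidents: list[Incident]) -> float:
    """Share of incidents that hit an interface which already failed in the period."""
    if not incidents:
        return 0.0
    counts = Counter(incident.interface for incident in incidents)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(incidents)


def mean_recovery_minutes(incidents: list[Incident]) -> float:
    """Average time to recover, in minutes (0.0 for an empty period)."""
    if not incidents:
        return 0.0
    return mean(incident.minutes_to_recover for incident in incidents)


# Made-up data: recovery gets faster, the same interfaces keep failing.
last_quarter = [Incident("billing-api", 90), Incident("billing-api", 80),
                Incident("auth-gateway", 70), Incident("billing-api", 60)]
this_quarter = [Incident("billing-api", 30), Incident("billing-api", 25),
                Incident("auth-gateway", 20), Incident("billing-api", 25)]

for label, period in (("last quarter", last_quarter), ("this quarter", this_quarter)):
    print(label,
          f"mean recovery: {mean_recovery_minutes(period):.0f} min,",
          f"recurrence: {recurrence_rate(period):.0%}")
```

In this illustrative data, a dashboard showing only recovery time would report improvement, while the recurrence rate shows that nothing structural has changed.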
6. The Cost That Does Not Show Up
Absorbed failure does not vanish.
It converts into hidden cost:
- Complexity debt: every workaround adds another path to maintain.
- Signal loss: repeated issues no longer trigger attention.
- Decision drift: leadership decisions rely on incomplete or sanitised data.
- Talent erosion: strong engineers either burn out or disengage.
None of these appear clearly on dashboards.
All of them accumulate.
7. Engineering Reality: Feedback or Noise
Engineering discipline does not aim to remove failure.
It ensures that failure produces information.
- Measurable signals (SLIs) reflect real behaviour.
- Commitments (SLOs) create thresholds that force trade-offs.
- Breaches trigger action, not explanation.
When failures get absorbed, the feedback loop breaks. Without feedback, the system does not learn. Movement continues, but direction disappears.
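One way to make breaches trigger action rather than explanation is to agree on the response before the breach happens. The sketch below uses the common burn-rate definition (observed error rate divided by the error rate the SLO allows); the thresholds and the action strings are assumptions, not a recommended policy.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def required_action(rate: float) -> str:
    """Map a burn rate to a pre-agreed action (illustrative thresholds)."""
    if rate >= 10.0:
        return "page on-call and pause feature rollouts"
    if rate >= 1.0:
        return "open remediation ticket with an owner and a deadline"
    return "no action"


# Illustrative input: 0.4% of requests failing against a 99.9% target.
rate = burn_rate(error_rate=0.004, slo_target=0.999)
print(f"burn rate {rate:.1f}: {required_action(rate)}")
```

The design choice that matters is that `required_action` takes no free-text justification as input: the signal decides, not the narrative around it.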
8. Pressure Systems Analogy
A well-designed pressure system relies on feedback loops to stay within safe bounds.
- A simple valve releases pressure and provides a clear signal that limits are being reached.
- A regulator maintains a target range by continuously adjusting flow based on measured pressure.
- More sophisticated systems use closed-loop control (feedback + correction) to stabilise behaviour under changing conditions.
Operators understand limits because the system makes them visible.
A poorly designed system removes or weakens these feedback loops. Pressure still builds. It gets dissipated through leaks, friction, or manual intervention, but without clear signals. That dissipation prevents immediate rupture while removing any indication that limits have been exceeded.
The first design protects the system by forcing awareness and correction.
The second delays failure and magnifies its impact.
Many organisations operate in this second mode, relying on people as the leak instead of engineering proper feedback.
9. What Changes the Trajectory
Restoring a learning system requires constraints, not speeches:
- Make failure visible: instrument recurrence, not only resolution.
- Make failure costly: tie SLO breaches to delivery decisions.
- Remove absorption layers: allow no ad-hoc workaround without a linked root-cause ticket.
- Enforce ownership: every interface has a clear owner and contract.
- Shorten loops: analyse within days, not quarters.
- Time-box remediation: do not let fixes drag indefinitely; set clear deadlines and escalate when they slip.
The goal is not perfection. It is forced learning. If nothing forces remediation, nothing will be fixed.
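Two of these constraints, root-cause tickets for workarounds and time-boxed remediation, can be expressed as a simple review rule. This is a sketch under stated assumptions: the `Workaround` record, the ticket ID format, and the 30-day default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Workaround:
    name: str
    root_cause_ticket: Optional[str]  # hypothetical ticket ID, None if missing
    applied_on: date


def review(workaround: Workaround, today: date, time_box_days: int = 30) -> str:
    """Decide what happens to a workaround; the 30-day time box is an assumption."""
    if workaround.root_cause_ticket is None:
        return "reject: no root-cause ticket"
    deadline = workaround.applied_on + timedelta(days=time_box_days)
    if today > deadline:
        return "escalate: remediation deadline passed"
    return "accept: remediation in progress"


patch = Workaround("retry wrapper on upstream calls", "RC-101", date(2024, 1, 10))
print(review(patch, today=date(2024, 1, 20)))
print(review(patch, today=date(2024, 3, 1)))
```

The rule never returns "accept indefinitely". Every workaround either carries a ticket inside its time box or surfaces as a decision someone has to make.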
10. Close
A system that absorbs failure without learning from it does not qualify as resilient.
It drifts. And drift accumulates quietly until correction becomes unavoidable and expensive.
The question is not whether your system survives failure. It is whether failure still has the power to change your system.
#SystemsThinking #EngineeringLeadership #PlatformEngineering #SLO #TechLeadership