The Controls That Fail When You Need Them Most
I was prompted to write this article by a LinkedIn post that came up in my feed recently. Picture this: your incident response plan exists. Your backup system runs every night. Your business continuity documentation sits in a shared drive, reviewed and approved. On paper, you're covered. Then, something happens. And you discover that your most important controls are the ones you've never actually used.
The Paradox of Rarely-Used Controls
There's a counterintuitive truth in security: the controls you exercise daily stay sharp, while the controls you rarely use quietly rot. Your team logs into the VPN every morning, so that process works. Your deployment pipeline runs dozens of times a week, so everyone knows the drill. But your disaster recovery procedure? Your incident response playbook? Those get written once, filed away, and forgotten until the moment they matter most.
This isn't negligence. It's human nature. When something works and you don't need to think about it, you stop thinking about it. The problem is that "working" and "ready to work when needed" aren't the same thing.
I recently worked with a client who had a documented incident response process. Detailed, approved, sitting in the right folder. When we reviewed their production incidents from the past year, we found something uncomfortable: not a single one had been documented according to that process. The incidents were fixed, the systems were restored, but the process was ignored.
When I asked why, the answer was honest: "We just forgot. We don't see that many incidents."
That's the trap. The controls you rarely use are exactly the ones most likely to drift, and exactly the ones you'll need most when things go wrong.
GitLab's $300 Million Lesson
In January 2017, GitLab.com went down. An engineer accidentally deleted the primary database while attempting to fix a replication issue. Unfortunate, but recoverable, right? That's what backups are for.
Except GitLab had five separate backup and replication mechanisms in place. None of them worked.
The automated pg_dump backups hadn't run in weeks because the backup system was configured for PostgreSQL 9.2 while GitLab was running PostgreSQL 9.6. The backup failure notifications were being sent via email, but those emails were silently rejected because of a DMARC misconfiguration. The S3 backup bucket was empty. The Azure disk snapshots weren't enabled because the team assumed the other backup methods were sufficient.
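Both of GitLab's silent failure modes, stale dumps and a pg_dump/server version mismatch, are cheap to detect automatically. Here's a minimal sketch in Python; the backup directory and threshold are placeholder assumptions, not GitLab's actual setup:

```python
import time
from pathlib import Path

# Hypothetical backup location; adjust to your own layout.
BACKUP_DIR = Path("/var/backups/postgres")
MAX_AGE_HOURS = 26  # nightly job plus a little slack


def latest_backup_age_hours(backup_dir: Path) -> float:
    """Age in hours of the newest dump file; infinity if none exist."""
    dumps = list(backup_dir.glob("*.sql.gz"))
    if not dumps:
        return float("inf")
    newest = max(d.stat().st_mtime for d in dumps)
    return (time.time() - newest) / 3600


def pg_major(version: str) -> tuple:
    """PostgreSQL's 'major' version is X.Y before 10 and X from 10 onwards."""
    parts = [int(p) for p in version.split(".")[:2]]
    return tuple(parts[:1]) if parts[0] >= 10 else tuple(parts)


def versions_match(client: str, server: str) -> bool:
    """A pg_dump/server major-version mismatch is exactly what
    silently broke GitLab's automated backups."""
    return pg_major(client) == pg_major(server)


def problems(backup_dir: Path, client_version: str, server_version: str) -> list:
    """Return a list of detected problems; an empty list means healthy."""
    issues = []
    if latest_backup_age_hours(backup_dir) > MAX_AGE_HOURS:
        issues.append("no fresh backup found")
    if not versions_match(client_version, server_version):
        issues.append("pg_dump major version does not match the server")
    return issues
```

Crucially, route the result to a channel someone actually watches, such as a pager or chat alert, not an email address that can silently bounce, which is how GitLab's failure notifications disappeared.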
The result: 18 hours of downtime and permanent loss of 6 hours of production data. GitLab lost projects, issues, comments, and user accounts. The company live-streamed its recovery efforts on YouTube, watched by thousands. In their postmortem, they wrote what should be framed on the wall of every engineering team:
"Out of 5 backup/replication techniques deployed, none are working reliably or set up in the first place."
Why did this happen? The same reason my client's incident response process was ignored: nobody was regularly testing these systems. The backups ran in the background. The team assumed they worked. There was no ownership, no regular validation, no muscle memory.
GitLab's backup systems weren't broken by a single catastrophic event. They drifted into failure, one small change at a time, while nobody was watching.
The Controls Most Likely to Drift
At scale-ups, certain controls are almost guaranteed to drift because they're rarely exercised. Here are some examples:
Backup restoration. Your backups run every night. When was the last time you actually restored from one? Not checked that the files exist: actually restored a system and confirmed it works. Industry surveys suggest that 60% of backups are incomplete and 50% of restore attempts fail. The backup isn't the control. The restoration is.
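A restore drill can be automated end to end: restore the latest dump into a throwaway database, then sanity-check the result against expected row counts. A sketch using the standard PostgreSQL command-line tools; the dump path and scratch database name are placeholders:

```python
import subprocess

DUMP_FILE = "/var/backups/postgres/latest.dump"  # placeholder path
SCRATCH_DB = "restore_drill"                     # throwaway database


def restore_into_scratch() -> None:
    """Restore the newest dump into a scratch database on a non-production host.
    check=True makes a failed restore fail loudly instead of silently."""
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, DUMP_FILE], check=True)


def row_count_plausible(production: int, restored: int,
                        tolerance: float = 0.01) -> bool:
    """The restored table should be within ~1% of production's row count;
    the tolerance absorbs writes made since the dump was taken."""
    if production == 0:
        return restored == 0
    return abs(production - restored) / production <= tolerance
```

Scheduling this weekly and alerting on failure turns "we have backups" into "we have restores", which is the control that actually matters.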
Incident response. You have a documented process for security incidents, but if incidents are rare, that process never becomes muscle memory. When a real incident happens, people revert to instinct: fix the problem first, figure out the paperwork later. Industry surveys have found that 77% of organisations don't have a formal incident response plan that's consistently applied.
Business continuity. Your BCP exists. It was probably written to satisfy an audit requirement or close an enterprise deal. Research indicates that 41% of companies have never tested their disaster recovery systems. When was your last tabletop exercise?
Access revocation. You have an offboarding checklist. But when someone leaves suddenly, or moves to a different team, or takes extended leave, does every system get updated? Or do orphaned accounts accumulate until someone notices during an audit?
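Orphan detection doesn't need an identity platform to get started: diff each system's account list against the HR roster on a schedule. A minimal sketch, with illustrative data standing in for what you'd pull from each system's API and your HRIS:

```python
def orphaned_accounts(system_accounts: set,
                      active_employees: set,
                      known_service_accounts: set) -> set:
    """Accounts that exist in a system but map to no active employee
    and aren't recognised service accounts: candidates for revocation."""
    return system_accounts - active_employees - known_service_accounts


# Illustrative data only; in practice, fetch these lists automatically.
github_accounts = {"alice", "bob", "carol", "deploy-bot"}
current_staff = {"alice", "carol"}
service_accounts = {"deploy-bot"}

# 'bob' left months ago; his access only surfaces when someone looks.
stale = orphaned_accounts(github_accounts, current_staff, service_accounts)
```

Run this per system, per week, and the audit-time surprise becomes a routine cleanup ticket.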
Vendor security reviews. You assessed your critical vendors when you onboarded them. That was three years ago. Their security posture has changed. So has yours. So has the threat landscape.
These aren't exotic edge cases. They're the standard controls that every compliance framework requires. And they're failing at company after company, not because they were never implemented, but because they were never maintained.
Why This Happens
Large enterprises have entire teams dedicated to testing controls, running tabletop exercises, and validating business continuity plans. They have the resources to build resilience as a discipline.
Early-stage startups don't need sophisticated controls because the blast radius is small. Everyone knows everyone, systems are simple, and you can rebuild from scratch if you have to.
Scale-ups are caught in between. You're big enough that a control failure could be catastrophic, but still operating with startup-era assumptions about agility and informality. You have security tools but not security processes. You have documentation but not muscle memory.
The pattern I see repeatedly: companies pass their SOC 2 audit by demonstrating that controls exist, then never actually use those controls until an auditor or an incident forces them to. The gap between "we have this control" and "this control would work if we needed it tomorrow" is where compliance breaks.
The Fix: Tabletop Exercises for Rarely-Used Controls
The solution isn't more documentation. It's practice.
A tabletop exercise is a structured walkthrough of a hypothetical scenario. You gather the relevant people, present a situation, and talk through how you'd respond. No systems get touched. No alarms go off. You're just building muscle memory and finding gaps before they matter.
The key insight is to focus your tabletop exercises specifically on rarely-used controls. Don't simulate a DDoS attack if your team deals with performance issues weekly. Simulate a data breach that requires your incident response plan. Simulate a database failure that requires backup restoration. Simulate a key employee becoming unavailable during a crisis.
Here's how to run an effective tabletop for a scale-up:
Keep it focused. Pick one scenario. A ransomware attack. A production database deletion. A critical vendor going offline. Don't try to test everything at once.
Include the right people. This isn't just for engineers. Bring in whoever would actually be involved: your CTO, your head of customer success, someone from legal if you have one. The goal is to test coordination, not just technical response.
Make it realistic. Use actual system names, real team members' roles, genuine third-party dependencies. The more concrete the scenario, the more useful the exercise.
Introduce complications. Halfway through, reveal that the person who usually handles this is on vacation. Or that the backup you thought you had is corrupted. Or that a customer is demanding answers. Real incidents never follow the happy path.
Document everything. Not to create more paperwork, but to capture the gaps you discover. Who didn't know who to call? Which process was unclear? What access did someone need but not have?
Follow up. The exercise is worthless if you don't act on what you learned. Assign owners to fix the gaps. Schedule the next exercise.
You don't need a full day or an external consultant. A two-hour session, run quarterly, focused on your highest-risk, rarely used controls, will do more for your actual security posture than another policy document.
The Real Test
Here's a question for every CTO reading this: if your primary database were deleted right now, could you restore it? Not "do you have backups," could you actually restore it? Do you know how long it would take? Does more than one person know how to do it? Have you tested it in the last six months?
If you're not certain, that's the control that needs a tabletop exercise next week.
Controls don't fail loudly. They drift quietly until someone measures them. The space between how a control was designed and how it actually operates is where compliance breaks, and where incidents become disasters.
The controls you use daily will keep working. The ones you don't will need active investment to stay functional. Test the things you hope you'll never need, because hope isn't a strategy.
Paolo Carner is the founder of BARE, a cybersecurity consultancy helping scale-ups build security capabilities that actually work. He's spent 15+ years watching controls drift and helping companies catch them before auditors do.
