What is it?
In recent years there has been a great deal of focus in the security industry on increasing resilience. But what does that actually mean, why is it important, and how do you actually do it?
Resilience (noun)
The capacity to recover quickly from difficulties; toughness
The ability of a substance or object to spring back into shape; elasticity
So, it is about recovering quickly, springing back into shape. In the cyber world this means returning to normal operations after an incident.
Any organisation with a DR solution should be able to return to normal operation. How long that takes and how much that will cost is another story.
I have specifically chosen the word “should” in this statement; I will expand later on the idea that DR does not automatically equal good resilience.
So why is speed important? Simply put – for private industry, time means money. For public entities… it will upset your minister and affect some random productivity number[1].
This is why “quickly” is important. Any organisation that relies on technology as part of its business will lose money if there is an outage. Money is burnt while staff are unable to perform their duties because they are waiting for a system to be restored. Money is burnt because your IT staff are busy running around rebuilding servers and restoring backups. Etc.
Some years ago, I was with an organisation testing their DR plans. A disaster was declared – an industrial accident had resulted in a section of the city being shut down and a portion of the people in the area being hospitalised. Of course, the explosion and the gas leak that followed were next to their datacentre, which was in an industrial estate.
Everyone convenes at the remote war room. At this stage the Head of IT walks in, followed by 8 or 9 of his staff. I didn’t recognise many of them, so I asked who they were. He replied they were the server team based at the datacentre. Feeling smug, I walked over, pointed to half of them at random and asked them to leave. The Head of IT quickly countered that they were part of the team and were required for the DR. I pointed out that as a result of the test scenario those people were either dead, in hospital or unable to get to the war room…
The takeaway from the test scenario was that people cannot be in the critical path for recovery. In that organisation, a significant amount of knowledge was inside people’s heads and not in the DR plans.
The million-dollar question is – how do we build resilience?
Good resilience starts with the right planning.
- Begin by having an accurate Business Continuity Plan (BCP)
- Then make sure the DR can meet that BCP – think Maximum Acceptable Outage (MAO), Recovery Time Objective (RTO) and Recovery Point Objective (RPO). A sketch of this check follows the list.
- Next make sure you have a BCP for your DR.
- Lastly, Test it!
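To make the second point concrete, here is a minimal sketch of comparing what the BCP demands with what the DR plan can actually deliver. The application names, figures and data structures are all invented for illustration; anything the check flags is a resilience gap.

```python
# A minimal sketch: does the DR plan meet the BCP? All names and figures are
# illustrative assumptions, not taken from any real plan.
from dataclasses import dataclass


@dataclass
class BcpRequirement:
    app: str
    mao_hours: float   # Maximum Acceptable Outage the business can tolerate
    rto_hours: float   # Recovery Time Objective the business requires
    rpo_hours: float   # Recovery Point Objective (tolerable data-loss window)


@dataclass
class DrCapability:
    app: str
    achievable_rto_hours: float  # what the DR plan can actually deliver
    achievable_rpo_hours: float  # how much data could actually be lost


def find_gaps(requirements, capabilities):
    """Return (application, problem) pairs where DR cannot meet the BCP."""
    caps = {c.app: c for c in capabilities}
    gaps = []
    for req in requirements:
        cap = caps.get(req.app)
        if cap is None:
            gaps.append((req.app, "no DR capability recorded"))
            continue
        if cap.achievable_rto_hours > req.mao_hours:
            gaps.append((req.app, f"recovery would exceed the MAO of {req.mao_hours}h"))
        if cap.achievable_rto_hours > req.rto_hours:
            gaps.append((req.app, f"RTO gap: business needs {req.rto_hours}h, DR delivers {cap.achievable_rto_hours}h"))
        if cap.achievable_rpo_hours > req.rpo_hours:
            gaps.append((req.app, f"RPO gap: business needs {req.rpo_hours}h, DR delivers {cap.achievable_rpo_hours}h"))
    return gaps


if __name__ == "__main__":
    bcp = [BcpRequirement("payroll", mao_hours=24, rto_hours=8, rpo_hours=4),
           BcpRequirement("website", mao_hours=4, rto_hours=2, rpo_hours=1)]
    dr = [DrCapability("payroll", achievable_rto_hours=12, achievable_rpo_hours=4)]
    for app, problem in find_gaps(bcp, dr):
        print(f"{app}: {problem}")
```

Even a simple table like this, kept current, forces the conversation about whether the DR plan actually supports the business rather than just existing on paper.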
BCP, DR and Resilience are interconnected.
Some questions to ask yourself:
- Is the DR you have in place what the business requires?
- Have you documented all the steps? What are the dependencies?
- What is the priority of all the organisation’s applications? What is the recovery order, and does it change at different times of the year, e.g. end of financial year? (A dependency-ordered example follows this list.)
- Are the expectations realistic? Remember that four nines of uptime (99.99%) only allows you just under an hour of outage a year – about 53 minutes. (The arithmetic is in the sketch after this list.)
- Have you tested everything? Recovery tests should come as close as possible to being realistic. Proving you can restore a single application in an hour is fine – but remember that you may need to recover other items first – AD server? Routers? Firewalls? What would it take to recover if you lost your main datacentre/cloud zone?
- Do you test frequently and accurately? Annual testing is the standard. Testing should be as real as possible, not play acting.
- Lastly, have you made sure that individual staff members are not part of the critical path?
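Here is a small sketch of two of those questions – the uptime arithmetic and the recovery-order problem. The components and the dependency graph below are purely illustrative, not a real environment.

```python
# Two quick checks: how much outage an availability target really allows,
# and a valid recovery order derived from component dependencies.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# 1. How much outage does an availability target actually allow per year?
MINUTES_PER_YEAR = 365.25 * 24 * 60

for target in (0.99, 0.999, 0.9999):
    allowed = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.2%} uptime -> about {allowed:.0f} minutes of outage a year")
# 99.99% uptime -> about 53 minutes of outage a year

# 2. What order do things have to come back in? Each entry lists what an
#    item depends on; a topological sort yields a workable recovery order.
#    The graph is a made-up example.
dependencies = {
    "firewalls": set(),
    "routers": {"firewalls"},
    "active_directory": {"routers"},
    "database": {"active_directory"},
    "payroll_app": {"database", "active_directory"},
}

print("Recovery order:", list(TopologicalSorter(dependencies).static_order()))
# e.g. ['firewalls', 'routers', 'active_directory', 'database', 'payroll_app']
```

Writing the dependencies down like this is half the value: it takes the recovery order out of people’s heads and puts it somewhere the DR plan (and the test) can use it.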
Third-party providers?
With the increasing utilisation of cloud technologies and SaaS products we are handing over control to other people. You should make sure that their resilience/HA capabilities are solid. Remember that offloading your DR and replacing it with SLAs doesn’t make your resilience requirements disappear.
You now need to think about what you will do if that provider/service has a problem. What if they are not available? Can you still get to your data?
This happened recently when an international payroll software company was hacked. They shut all services down for several days while assessing the situation. That meant reliant businesses could not pay employees – they could not even extract their data to make manual payments.
Obviously, this shouldn’t happen – after all, this is what you are paying them to take care of. But do you need to keep a copy of the data somewhere else? Is it in a format that you can port to a competitor or use internally yourself?
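One practical answer is to schedule your own exports. The sketch below is only illustrative – the endpoint, token and CSV format are assumptions, and whether your provider even offers a usable export API is part of the question – but the principle is to hold a periodic, portable copy of the data outside the provider’s control.

```python
# A minimal sketch of keeping your own copy of data held by a SaaS provider.
# The endpoint, token and export format are hypothetical; a real provider's
# export API (if it has one) will differ.
import datetime
import pathlib

import requests  # third-party: pip install requests

EXPORT_URL = "https://api.example-payroll.com/v1/export"  # hypothetical endpoint
API_TOKEN = "REPLACE_ME"                                   # hypothetical credential
BACKUP_DIR = pathlib.Path("/srv/saas-exports/payroll")


def export_snapshot():
    """Pull a full export and store it locally in a portable format (CSV)."""
    response = requests.get(
        EXPORT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}", "Accept": "text/csv"},
        timeout=60,
    )
    response.raise_for_status()

    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = BACKUP_DIR / f"payroll-export-{stamp}.csv"
    target.write_bytes(response.content)
    return target


if __name__ == "__main__":
    # Run this on a schedule (cron, a CI job, etc.) so a copy exists even if
    # the provider becomes unavailable.
    print(f"Exported to {export_snapshot()}")
```

The point is not the specific script – it is that if the provider goes dark tomorrow, you already hold a copy of your data in a format you can use for manual processing or hand to a competitor.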
References:
https://en.oxforddictionaries.com/definition/resilience
[1] This is a little tongue-in-cheek, as any outage, even for government, has cost implications.