Why bother at all with robust corporate governance and supposedly expensive business continuity management?
Paper is patient – and the ISO 22301 standard sits warm and dry on the shelf. In this case study, we walk you through the course of an emergency at two differently prepared companies.
Our case study takes you to two medium-sized companies operating in the same industry: House of Cards-Money Tomb GmbH and robusta-Willow Tree KG. As “hidden champions,” both produce top-quality products and services in a niche area. Both companies have an extensive product portfolio, production at several locations, a business field of medical services, and a large web presence with B2B and B2C contact. Key customers come from the aviation industry, the automotive industry, and a range of other sectors.
Our protagonist is Bert van Jenssen, IT manager at both companies. The system architecture is split into production IT and office IT. Some components, as well as the office IT, are outsourced to an external service provider, and external programming contractors are also heavily involved at lower levels.
At robusta-Willow Tree, intensive advance planning for emergency preparedness and for emergency and crisis management has been carried out and practiced regularly. An audit according to ISO 22301 has also been performed and passed. Annual costs are estimated at around €200,000; this covers purchases for IT security, extensive service level agreements with service providers, and personnel and training costs. A business impact analysis has shown that the risk of malware infection is high, which is why an intensive backup culture and expensive analysis software are maintained. The company has an emergency structure consisting of the CEO, CFO, CIO, COO, Head of Legal, and Head of Safety & Security. Bert van Jenssen, as IT manager, is part of a predefined IT emergency response team (CSIRT) that can be activated via the hotline. Both the hotline staff and the members of the IT emergency team are trained in emergency management, and standard operating procedures for IT emergencies have been drawn up and implemented.
The company attaches great importance to long-term business success, which is reflected in particular in employee satisfaction and a sustainable workload. Short-term peaks in capacity utilization and workload can be absorbed.
At House of Cards-Money Tomb, no emergency precautions are in place: the costs are considered too high, and the probability of an emergency is estimated to be low. Short- and medium-term business results take high priority. With the same motivation, some services, such as the hotline, have been outsourced at low cost. IT has also been hit by rationalization, and the remaining employees are working at 112% capacity. No resources are available for documentation, let alone emergency plans. The last emergency plan dates from 2004.
Both company managements meet on the sidelines of a trade fair on Friday, December 01. While the managing directors are exchanging ideas over a beer in the evening, a malware incident occurs at both companies at the same time. First, a company database that processes customer data behind the web portal is affected. The malware then spreads to the control computers in production.
At House of Cards-Money Tomb, nothing happens: without up-to-date analysis tools, there is no way to detect the infection proactively.
At robusta-Willow Tree, an analysis tool automatically opens a ticket with the hotline. The employee on duty tries to fix the fault and notices anomalies that may indicate malware. Further analysis is initiated.
Friday, December 01, around 22:45: Employees at both companies notice that the quality of the manufactured products deviates significantly. The central IT production control system is identified as the cause. Production has to be stopped.
At House of Cards-Money Tomb, a senior employee desperately tries to reach the hotline, but according to the recorded message it will not be staffed again until 8:00 on Saturday. People frantically phone around the company, trying to pull at least a few IT staff out of their time off. Bert van Jenssen is finally reached at 23:21 and heads for the data center.
At robusta-Willow Tree, the employee on duty presses the “Activate IT emergency team” button at 22:46. The team is notified by text message, and management is likewise informed that the emergency team has been activated. Six employees, including Bert van Jenssen, are on site by around 23:20 and start their work.
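How might such a one-button activation work behind the scenes? The case study does not say; as a purely illustrative sketch in Python, with all names, roles, and the send_sms placeholder assumed rather than taken from the story, it could look like this:

```python
# Illustrative sketch of an "Activate IT emergency team" action: one trigger
# fans out a text message to every predefined team member and to management.
# send_sms is a placeholder; a real setup would call an SMS gateway or an
# alerting service.

CSIRT_MEMBERS = ["Bert van Jenssen", "planning role", "documentation role", "communication role"]
MANAGEMENT = ["CEO", "CIO"]

def send_sms(recipient: str, message: str) -> None:
    # Placeholder: print instead of talking to a real SMS gateway.
    print(f"SMS to {recipient}: {message}")

def activate_emergency_team(incident_summary: str) -> None:
    """Notify the whole predefined team plus management in a single action."""
    for member in CSIRT_MEMBERS:
        send_sms(member, f"IT emergency declared: {incident_summary}. Report to the data center.")
    for manager in MANAGEMENT:
        send_sms(manager, f"IT emergency team activated: {incident_summary}")

if __name__ == "__main__":
    activate_emergency_team("suspected malware in production IT")
```

The value of wiring this up in advance is speed: at robusta-Willow Tree, six people are on site roughly half an hour after the button is pressed.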
Around 23:30, Bert van Jenssen tries to organize his three employees and identify tasks. The contingency plan turns out to be completely out of date. But where to start? Isolate? Start an analysis?
Bert van Jenssen takes charge of the IT emergency response team (CSIRT) at 23:35. His employees take on predefined and practiced roles, including planning, documentation, and communication. Meanwhile, the results from the analysis tool are in. The malware is novel but behaves similarly to known malware, for which an immediate action plan is ready. It specifies a sequence: isolate first, contain the spread, secure evidence, then import controlled backups. The communications officer informs the company’s management and staff.
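The article does not reproduce the immediate action plan itself. As a sketch of the idea, such a plan can be thought of as an ordered, logged runbook; the step names below simply mirror the sequence described above, and everything else is hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical runbook encoding of the immediate action plan described above:
# isolate first, contain the spread, secure evidence, then import backups.
RUNBOOK = [
    ("isolate", "Disconnect affected segments from the network"),
    ("contain", "Block lateral movement and disable shared credentials"),
    ("secure_evidence", "Copy logs and disk images to write-once storage"),
    ("import_backups", "Restore the last backup tested and declared clean"),
]

def execute_runbook(log_path: str = "incident_log.txt") -> None:
    """Walk the runbook in order, documenting every step with a timestamp.

    This log is what later lets the follow-up team see which measures and
    workarounds need to be reversed.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        for step, description in RUNBOOK:
            log.write(f"{datetime.now(timezone.utc).isoformat()}  START {step}: {description}\n")
            # ... the actual containment or restore actions would run here ...
            log.write(f"{datetime.now(timezone.utc).isoformat()}  DONE  {step}\n")

if __name__ == "__main__":
    execute_runbook()
```

The fixed order matters: securing evidence before importing backups is exactly the step that, as we will see, gets skipped at House of Cards-Money Tomb.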
Around midnight: Bert van Jenssen decides on his own authority to take the web portal and the central IT production control system offline. A ticket is opened with the external network service provider. According to the SLA, four hours are allowed for processing.
Bert van Jenssen acts on the basis of the decision criteria defined in the contingency plan and has the external network service provider take the web portal and the central IT production control system offline. The SLA for the priority 1 ticket specifies a maximum of 30 minutes.
Around 02:00: The analysis of the affected system components yields no results. Bert van Jenssen realizes that the quality and timeliness of the analysis software leave something to be desired. Cursing and frustrated, he decides to import backups onto the supposedly affected computer infrastructure.
Around 02:00: After the affected network was disconnected at 00:45, the immediate action plan is followed. Incoming customer queries are answered and collected via a prepared web page. In the background, the logs are backed up and preparations are made to import the backup from November 28, which has been tested and declared clean. From exercises it is known that this will take about six to eight hours with the contaminated network and the computers separated from it. The night shift is therefore sent home in consultation with the senior staff member, and the start of the morning shift is set for 08:00. Bert van Jenssen orders food for his IT employees from the pizza parlor and has a trainee from production keep them supplied with drinks.
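What “tested and declared clean” can mean in practice is not spelled out here. One small building block of such a check is verifying, before the import, that the backup image still matches the checksum recorded when it was tested. A sketch, with all file names and hash values as placeholders:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large backup images fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(image: Path, known_good: dict[str, str]) -> bool:
    """Check the image's current hash against the hash recorded at test time."""
    expected = known_good.get(image.name)
    return expected is not None and sha256_of(image) == expected

if __name__ == "__main__":
    # Hash recorded when the November 28 backup was tested (placeholder value).
    known_good = {"backup-2018-11-28.img": "<recorded sha256 hex digest>"}
    image = Path("backup-2018-11-28.img")
    if image.exists() and verify_backup(image, known_good):
        print("Backup matches the recorded checksum: safe to import.")
    else:
        print("Checksum mismatch or image missing: do not import.")
```

A checksum alone does not prove the image is malware-free, of course; it only proves it is the same image that was tested and declared clean.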
Around 04:00: Bert van Jenssen realizes that the most recent backup is dated November 1. At the same time, calls are piling up that customer inquiries are going unanswered. Within the IT staff there are disagreements about procedure, leadership, and responsibilities. There is no structure. The night shift employees stand at the idle machines on full pay. When the head of the night shift asks how much longer this will take, Bert van Jenssen can only shrug.
Around 06:00: The import of the backups takes longer than expected. To compensate, the IT emergency team is reinforced with two rested employees. The CEO is proactively briefed on the night’s milestones and has customers informed via the key account managers and corporate communications.
06:21: The managing director of House of Cards-Money Tomb is thrown out of bed by a call from a top customer with just-in-time logistics, who asks, discreetly but unmistakably annoyed, whether House of Cards-Money Tomb has lost its marbles. The managing director is completely taken aback: “Our production is at a standstill?!”
06:27: The phone rings at Bert van Jenssen’s…
08:00: Production is still at a standstill. However, the early shift is able to bring certain maintenance work forward and carry it out.
08:01: The ticket with the external hotline can finally be opened. The external service provider promises to take care of it within the next six to eight hours. An IT employee leaves the plant premises to get something to eat and cannot be reached for over an hour.
09:18: The web portal can go back online. With the data from the backup plus the database from the emergency website, the data loss is limited to 47 hours. The trainee from production has organized sandwiches.
09:20: Bert van Jenssen falls asleep from exhaustion. He has achieved very little in the last nine hours: production and the web portal are still down, and a recovery time still cannot be reliably predicted.
10:45: Parts of production (approx. 30%) can be restarted.
11:03: The first backups are imported – the data loss of almost four weeks cannot be repaired.
11:58: A large part of production (approx. 80%) is running again.
12:10: Parts of the backup are found to be corrupted, or there are incompatibilities between software versions. These have to be patched manually. Earliest possible restart: the Sunday morning shift.
13:06: Bert van Jenssen declares the IT emergency over after more than 14 hours. The two colleagues who only joined the team at 06:00 take over the follow-up work. From the documentation they can see which measures and “quick and dirty” workarounds need to be reversed.
13:37: Bert van Jenssen and his staff are working hard to at least get the web portal back online. In the background, a still-fit employee has hacked together a temporary solution so that customer inquiries no longer go unanswered but can be tracked in an improvised database.
13:50: Bert van Jenssen is resting at home. In the meantime, his employees review the logs and determine that the malware was introduced via an external programming contractor. At the same time, forensic evidence is secured and documented.
14:18: A partial success for Bert van Jenssen: the web portal and database could be partially restored. But the data of the last two weeks, covering the high-turnover Christmas business, is irretrievably lost.
16:05: The CEO arrives at the plant in person and takes stock of the situation. Bert van Jenssen has to personally justify the emergency and how it was handled, including the decision to take the web portal offline. There is no documentation to back him up. The CEO decides to cancel the production night shift.
19:58: An IT employee can be called in from vacation.
20:00: While the backups are being imported with the network connected, it is noticed that recontamination with the malware is taking place via a still-connected, undocumented server operated as shadow IT by the development department. It turns out to be an unpatched server that the development department connected to the production network for convenience, and on which every user has administrator rights for software reasons. Neither documentation nor the physical location of this server can be found, leaving only the brute-force blocking of its network port. The entire backup import now has to be repeated.
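Here the shadow IT server is found the hard way. A standing countermeasure, again only sketched and with placeholder addresses, is to regularly compare the hosts actually observed on the production network against the documented asset inventory:

```python
# Placeholder inventory: documented production hosts and their roles.
DOCUMENTED_ASSETS = {
    "10.0.10.5": "production control server",
    "10.0.10.6": "quality assurance workstation",
    "10.0.10.7": "backup target",
}

def find_undocumented_hosts(observed: set[str]) -> set[str]:
    """Return addresses seen on the wire that have no inventory entry."""
    return {host for host in observed if host not in DOCUMENTED_ASSETS}

if __name__ == "__main__":
    # In practice this set would come from ARP tables, switch port data,
    # or a network scan; it is hard-coded here for illustration.
    observed_hosts = {"10.0.10.5", "10.0.10.6", "10.0.10.7", "10.0.10.66"}
    for host in sorted(find_undocumented_hosts(observed_hosts)):
        print(f"Undocumented host on production network: {host}")
```

A check like this costs little and might have flagged the development department’s server long before it could recontaminate the freshly imported backups.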
Sunday, December 03, 07:00: The backups have been imported and the web portal is working again. Production is expected to restart around 08:00.
Sunday, December 03, 12:00: The emergency can be considered over – but there is no communication to that effect. Throughout Sunday, there is considerable uncertainty among the workforce as to whether this or that software can be used again.
Sunday, December 03, 13:05: Production is largely running normally again. Only in the customized production for one special customer are there still difficulties with the data supply. Bert van Jenssen suspects the forcibly disconnected development department server, which still cannot be physically located.
Monday, 11:00: The evaluation of the emergency has been completed. The costs of the emergency amount to approx. €2 million, due to loss of production, additional personnel costs, and so on. On the whole, however, the company management is satisfied with the emergency management: customers reacted angrily but felt well informed. Appropriate measures kept the downtime and business impact to a minimum, and the immediate action plans proved well suited. Bringing maintenance work forward also paid off: staff time was put to good use, and that time is now freed up elsewhere. Management is considering whether the damage can be charged to the service provider that caused it. During the re-evaluation of the SLAs, it was also found that the service provider reacted too late in some cases, so financial compensation can be claimed there as well.
The company management thanks the whole staff for their good cooperation and especially highlights the professional work of the IT emergency team – all the more evident in comparison with House of Cards-Money Tomb.
Tuesday, December 05, 13:05: The Chief Financial Officer has calculated the presumed damage from the emergency: it could amount to around €10 to 15 million. On top of that come the loss of image and the loss of customers caused by the significant data loss in the high-turnover end-of-year business. The open-ended downtime led to considerable labor costs, as the workforce had to stand next to the idle machines on Friday night and throughout Saturday. The cause of the malware infection can no longer be determined, because copying backups over the affected systems destroyed the evidence. During the re-evaluation of the SLAs with the external service providers, no compensation could be obtained, not even as a goodwill gesture: the contract only provides for 6 x 8 hour availability and correspondingly long response times.
Within the company, the reputation of IT has also suffered greatly – consequences will certainly follow.
Both company managements meet again on the fringes of a lecture on IT security at the beginning of January. Over a beer in the evening, the two CEOs exchange views on the topic of business continuity management and their lessons learned:
Business Impact Analysis
Service Level Agreements based on a BIA
Up-to-date emergency preparedness and emergency manual
Well-trained IT emergency response team with roles assigned and supplies secured
Coordinated immediate action plans
Business continuity plans and disaster recovery plans
Preservation of evidence and documentation
An article written by Jens von den Berken, published on 19 December 2018
Translated by Charlotte Ley