Disaster Recovery Lessons Learned from the Toronto Ice Storm, December 2013

by Jeff Wiener on January 2, 2014

Disaster planning has appeared on Digitcom’s “To-Do” list for many years now, and invariably every year we dust off the previous year’s Disaster Recovery (DR) plan, make some minor modifications, and resume business as usual, always hoping for the best. Then, without fail, at least once or twice a year I kick myself for not spending more time thinking about effective disaster recovery and business continuity. December 23rd, 2013 was one of those days.

I watched the Toronto weather with interest the Friday prior. I recognized, of course, that a winter storm was on its way, but I assumed that this one, like most others, would simply pass.

My family and I had left for a vacation in Ottawa the week before, so on Saturday morning, when I read the news about the storm and the early havoc it had created in the city, I had trouble believing it: the weather in Ottawa, while cold and slightly snowy, was a paradise compared to what had hit Toronto. That day I tried logging into our office systems, and it appeared that our battery back-up had kicked in, so I knew we had about 2 hours in total before our systems would shut down.

On Monday morning I woke early to find that the power hadn’t returned, and I was able to reach a few staff members early enough to tell them to work from home. Although we made it through the business day, we did a few things right and most things wrong. In fact, I put together a scorecard of how we did, and while I don’t think we fared well, I now fortunately have a more comprehensive understanding of how we need to prepare for next time.

Lesson #1: Have an effective internal communication plan (We did not do this very well.)

All DR plans need to cover how employees will communicate, where they will go, and how they will keep doing their jobs. For some businesses, issues such as supply chain logistics are most crucial and are the focus of the plan. For others, information technology may play a more pivotal role, and the business continuity/disaster recovery plan may focus more on systems recovery. For example, the plan at one global manufacturing company would restore critical mainframes with vital data at a backup site within four to six days of a disruptive event, obtain a back-up phone solution with phones, recover the LAN, and set up a temporary call center at a nearby training facility.

In the context of the DR plan, it’s important to define the disaster. A fire that destroys the office or warehouse is very different from a power outage that lasts one, maybe two days. In any event, it’s important that all employees know where they will work, how they will access their data, and how they will communicate with one another. It’s also important to determine offsite crisis meeting places and how crisis communication will be handled.

Lesson #2: Have an effective external communication plan (We did not do this very well.)

The premise regarding communication during a crisis is still the same: it is important for companies to be proactive and transparent with their communications. Likewise, customers’ and stakeholders’ expectations remain the same: they want to know that companies are taking ownership and accountability, and that there is a resolution plan to get services stabilized and restored. Manage the messaging in a timely manner. Advise customers and suppliers, and keep them engaged and aware. Toronto Hydro did a great job of communicating with their customers during the ice storm using their Twitter feed: https://twitter.com/TorontoHydro

Lesson #3: The cloud is great for business continuity planning. (We did some of this well.)

Digitcom moved from hosting our corporate mail on an Exchange server to Google’s corporate Gmail product. Having our email in the cloud, and not dependent on a server sitting inside our office, certainly facilitated communication, since our email was up throughout the entire day. I did speak with a few customers that day who told me they were unable to communicate via email because their email servers had lost power. If you are not going to host your email in the cloud, then putting your mail server in a data center (DC) is a good alternative. But even then, you don’t want to depend on having everything in one location, because it’s possible the DC could also have a problem.

Infrastructure diversification, not just of email but of all systems, is one key to enhancing business resiliency, and even then, it’s important that the infrastructure is capable of being managed as a single entity. In Digitcom’s case, although we did have our mail in the cloud, our primary CRM software was down, and unfortunately we did not have a back-up in our DC. That will get corrected in the next month.
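One way to spot this kind of gap before the next outage is to list where each system can run from, and then ask what you would lose if a single location went dark. Here is a minimal sketch in Python. The inventory is an assumption based on the situation described in this post, and the name systems_lost is made up for illustration; it is not a Digitcom tool.

```python
# Hypothetical inventory: the locations each system can run from.
# The entries mirror the situation in this post and are illustrative only.
inventory = {
    "email": {"cloud"},             # hosted mail
    "crm": {"office"},              # no back-up copy anywhere else
    "phones": {"office", "cloud"},  # office phone system plus hosted PBX
}


def systems_lost(inventory, failed_site):
    """Return, sorted by name, the systems that run only at failed_site."""
    return sorted(
        name
        for name, sites in inventory.items()
        if sites <= {failed_site}  # every location this system has is down
    )


print(systems_lost(inventory, "office"))  # ['crm']
```

Running the same check against "cloud" flags email instead, which is the point about not depending on any one location.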

Lesson #4: The cloud is great for voice redundancy (We did this REALLY well.)

Although we run a traditional Avaya IP Office phone system in our office using a PRI circuit, we do have appropriate back-up plans in place, and they worked quite well. Our PRI fails to our SIP trunks using our SureConnect product, and SureConnect fails to our hosted PBX solution. Some of our staff have hosted PBX extensions running in their homes, and we were able to quickly overflow our main line to a number of hosted PBX extensions. Amazingly, I don’t think we missed a single call that day. I heard from a few clients who use our service and were also down on Monday and Tuesday, and their DR plans also failed over to hosted PBX extensions in their homes. So if your phone system, PRI, or SIP trunks fail, whether because of a power outage, a phone system fault, or a line failure, you can, in an instant, have your main line ring elsewhere. The hosted extension can be a softphone, a physical IP phone, or even a cell phone. One client called and asked that we fail their SIP trunk to another office, and we were able to do that in under 5 minutes! If you are interested in this service, please get in touch with someone in sales at Digitcom, 416-783-7890 and press “2” for sales, or visit Digitcom’s web site.
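The failover order described above (PRI first, then SIP trunks, then hosted PBX) boils down to "ring the first path that is still up." Here is a minimal sketch of that idea in Python. It is only an illustration of the logic, not how SureConnect or the Avaya IP Office is actually configured; the Route class, the pick_route function, and the availability flags are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    name: str        # label for the call path
    available: bool  # whether the path can take calls right now


def pick_route(routes: List[Route]) -> Optional[Route]:
    """Return the first available route in priority order, or None."""
    for route in routes:
        if route.available:
            return route
    return None


# Priority order taken from the post. The availability flags are invented
# to mimic an office that has lost power.
chain = [
    Route("PRI circuit (office)", available=False),
    Route("SIP trunks (SureConnect)", available=False),
    Route("Hosted PBX extensions", available=True),
]

chosen = pick_route(chain)
print(chosen.name if chosen else "no route available")
```

With the office paths marked as down, the sketch lands on the hosted PBX extensions, which is what happened to our main line that Monday.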

Lesson #5: Test your DR plan

I’m not necessarily advocating for a “pull the plug” scenario, but it’s important to test the DR plan and ensure you have an integrated process in place between all business units, IT, and internal and external communication. You can run your business off the back-up site for a short period of time and make sure the processes function appropriately. It’s better to pinpoint flaws before a disaster than after.

Despite the power outage, we were able to answer all calls during the day, handle tech support issues, and communicate with our field techs as normal. Many of our internal processes failed, however, and we are now in the process of determining what worked and, more importantly, what didn’t. Luckily, our power was restored by the end of the day on Monday, so it was a short outage, but long enough for us to test our processes and take corrective action in preparation for the next one.
