Major Incident Report

This page details the findings of our investigations into major service outages (MSOs). Thankfully, MSOs are extremely rare. When one does occur, we carry out a detailed investigation and planning session to ensure that it cannot happen again.

If you would like this report e-mailed to you, please contact us at feedback@tagadab.com

 

Major Incident Report - 18th August, 2012

Tagadab suffered a major issue in one of its datacentres, starting at around 16:00 on 18/08/2012. The issue was not fully resolved until 02:30 on 19/08/2012, although most services were back up by 22:00 on 18/08/2012.

The issue was caused by a complete failure of the cooling systems in the datacentre.

The datacentre has multiple redundant cooling systems: four chillers in total, more than twice the number actually required. In these circumstances it is very difficult to see how the entire cooling system could fail.

The answer, which we learned in a meeting with the datacentre operating company during the course of our investigation, is surprisingly simple: although there are four chillers, they all operate on a single water circuit. A burst pipe caused the water to drain out of that circuit, rendering all four chillers useless.

Needless to say, this is a very serious error which has caused considerable harm to us and our customers. The datacentre company has agreed to the following measures to prevent a recurrence:

 - Mobile chiller units have been installed in the datacentre as of this afternoon. If the main chillers fail, we now have a fully independent backup cooling solution.

 - The entire cooling system at the datacentre will be replaced with a fully redundant system, which Tagadab will be able to verify beforehand. The mobile chillers will remain in place until this work is completed. We will have a deadline for this work by the end of this week.

This is, by some margin, the most serious technical issue in Tagadab's history, and we have absolutely no intention of allowing it to happen again.

Please accept our sincere apologies for the outage.