Having been in the eye of the
storm of the BA IT systems failure last weekend, and only getting away on
holiday two days after we should have, I think there are lots of things to learn.
What most struck me about
the outage was the sheer size of it.
Upon arriving at Heathrow Terminal 5 on Saturday morning with the
extended family, all excited about a week's holiday in Greece, we were met with
huge queues outside T5, and at that stage it looked like a baggage or check-in
problem. But then, over the course of the next hour, it quickly became clear how
severe the outage was. Not only were
check-in systems not working, but the departures information boards had been
stuck since 9.30 am. Even when we got to
the gate, which turned out to be the wrong one, there were planes on stand
waiting to push back, more aircraft waiting for a gate, and flight crew equally
confused. When we did finally get on board
an aircraft, the pilot informed us that the flight planning systems weren't
working: he couldn't create a flight plan, and therefore couldn't work
out the correct amount of fuel to put on board, and without that he was
unwilling to push back off the stand.
Even when we got the news (first via the BBC) that all flights were
cancelled, the pilot told us even the system to cancel flights wasn’t
working. This meant that getting
buses to take us back to the terminal took a long time, followed by the
ignominy of having to go back through passport control having not left the
airport, let alone the country.
From an IT perspective there are a
few interesting aspects. Firstly, BA have
claimed this to be a power-related incident,
which is an interesting cause. As
far as I'm aware no other companies were impacted by this outage, which
strongly suggests that this was not in a shared (co-located) data centre, as
otherwise we'd have seen other outages. It
also implies that BA aren't running in the cloud, as we saw no cloud outages
over the weekend. Secondly, assuming this
was a dedicated BA data centre, there has been a major failure of
resiliency. I would normally expect
any decent-quality data centre to have battery backup to provide
power in the immediate aftermath of a power failure, and as soon as a power failure
is detected, diesel generators should kick in to provide longer-term power. Normally batteries would sit in-line with the
external power to smooth the supply and provide instant protection if the external
power fails. At this level of
criticality it would be normal to have two diverse and ideally physically separate power
suppliers. The diesel generators are
some of the most loved engines in the world: they are often housed in
permanently warmed enclosures to keep them at the correct operating
temperature, and quite often the diesel they
consume is pre-warmed as well. The fuel
is also often stored in two different locations, to ensure that if one supply gets
contaminated there is a secondary supply that can still be used. These engines
often cost over a million pounds each, and at some sites I've seen they have N+N
redundancy (if four generators are needed there are eight on site) to deal with 100%
failure. Clearly as a customer you pay
more for this level of redundancy, but as we saw over the weekend, an
incident like this is something you never want to suffer.
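To make the battery-to-generator handover and the N+N idea concrete, here is a minimal sketch in Python. It is purely illustrative: the runtimes, start times and generator counts are invented and have nothing to do with BA's actual setup.

    # Illustrative only: a toy model of the UPS-to-generator handover described
    # above. All figures are hypothetical and not taken from any real data centre.

    UPS_RUNTIME_MIN = 10        # minutes the batteries can bridge after grid loss
    GENERATOR_START_MIN = 2     # minutes for a warmed diesel generator to take load
    GENERATORS_REQUIRED = 4     # N generators needed to carry the full IT load
    GENERATORS_INSTALLED = 8    # N+N: a complete second set on site

    def site_stays_up(failed_generators: int) -> bool:
        """Does the site survive a grid failure with this many dead generators?"""
        # The batteries must outlast the generator start-up window...
        bridge_ok = UPS_RUNTIME_MIN > GENERATOR_START_MIN
        # ...and enough generators must remain to carry the load.
        capacity_ok = (GENERATORS_INSTALLED - failed_generators) >= GENERATORS_REQUIRED
        return bridge_ok and capacity_ok

    if __name__ == "__main__":
        for failed in range(GENERATORS_INSTALLED + 1):
            print(f"{failed} generator(s) failed -> site up: {site_stays_up(failed)}")

With a full spare set installed, the toy model only reports an outage once more than half the generators are lost, which is the whole point of paying for N+N.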
In addition to having all this
redundancy built into a data centre, it's vital that all these components are
regularly tested. It's normal for data
centres to test battery back-up and run up the generators at least once a month
to ensure all the hardware and processes work as they should in an emergency.
Once you're inside the data
centre, all the racks (where servers are housed) are typically dual-powered
from different backup batteries and power supplies, and then each server is
dual-powered to further protect against individual failures. In total there are six layers of redundancy
between the power coming into the data centre and the actual server (redundant
power suppliers, redundant battery back-up,
redundant power generators, dual power feeds to the rack, dual power feeds
to the server, and redundant power supplies in the server itself).
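As a rough, back-of-the-envelope illustration of why that many duplicated layers make a total loss so unlikely on paper, here is a small Python sketch. The per-path failure probabilities are invented, and the layers are assumed to fail independently, which real incidents (fire, flooding, human error) rarely respect.

    # Back-of-the-envelope arithmetic only: rough odds of a server losing power
    # when every layer in the chain is duplicated. The per-path failure
    # probabilities below are invented, and layers are assumed to fail
    # independently, which real incidents rarely respect.

    LAYERS = {
        "external power feed": 0.01,
        "battery back-up (UPS)": 0.005,
        "diesel generator set": 0.01,
        "power feed to the rack": 0.002,
        "power feed to the server": 0.002,
        "power supply unit in the server": 0.01,
    }

    def p_layer_dark(p_single_path: float) -> float:
        """A layer only goes dark if both of its redundant paths fail."""
        return p_single_path ** 2

    def p_server_dark(layers: dict) -> float:
        """The server loses power if any one layer loses both of its paths."""
        p_every_layer_ok = 1.0
        for p in layers.values():
            p_every_layer_ok *= 1 - p_layer_dark(p)
        return 1 - p_every_layer_ok

    if __name__ == "__main__":
        print(f"Chance of total power loss at the server: {p_server_dark(LAYERS):.6%}")

On those made-up numbers the chance of the server going completely dark is a few hundredths of a percent, which is why a site-wide power outage points to something more systemic than a single component failing.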
As you can see, in theory it's
pretty difficult to have a serious power failure. While it's possible to have a serious failure
in parts of a power supply system, it would be highly unusual for this to be
service-impacting.
However, as we saw
at the weekend, something catastrophic must have happened to produce such a
widespread outage, and one that seems to have affected BA globally.
Even outside of pure power
redundancy, most large corporations will have redundancy built into individual
systems, be that within the same data centre or in a secondary site (ideally
both). For the more sophisticated sites,
these are often what's known as active-active, i.e. the service is running in
both sites at the same time, so if there's a failure in one server or site the
service keeps running with degraded capacity (the application may appear
slower to users) but is still available.
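Here is a minimal Python sketch of that active-active behaviour, assuming two hypothetical sites of equal capacity; the site names, capacities and the point at which users would notice degradation are all invented for illustration.

    # A minimal sketch of the active-active idea: two sites serve traffic at the
    # same time, and if one is lost the survivor carries everything with reduced
    # headroom. Site names, capacities, and the 80% "degraded" threshold are all
    # invented for illustration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Site:
        name: str
        healthy: bool
        capacity_rps: int   # requests per second the site can absorb

    def route(load_rps: int, sites: List[Site]) -> str:
        """Spread the load across healthy sites and report the resulting state."""
        live = [s for s in sites if s.healthy]
        if not live:
            return "Total outage: no healthy site left to serve traffic."
        utilisation = load_rps / sum(s.capacity_rps for s in live)
        status = "degraded (users notice slowness)" if utilisation > 0.8 else "normal"
        return f"{len(live)} site(s) live, utilisation {utilisation:.0%}, service {status}"

    if __name__ == "__main__":
        sites = [Site("DC-A", True, 600), Site("DC-B", True, 600)]
        print(route(900, sites))    # both sites up: comfortable headroom
        sites[0].healthy = False
        print(route(900, sites))    # one site lost: still available, but degraded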
Most companies will spend at
least seven-figure sums annually running with this level of redundancy and will
test it regularly (most regulators insist on this at least every two
years). For this level
of outage, and given the number of systems that failed, it would appear that either there wasn't the
appropriate level of redundancy or it hadn't been tested regularly enough.
It's worth pointing out that all
the measures mentioned above are expensive, painful to test, and do little to add
to the bottom line of the company. This is just the sort of 'insurance' that
you hope never to rely on, but having thorough and well-tested plans makes all
the difference when this sort of event happens.
There have been lots of reports in the UK press,
and comments from unions, saying this event is reflective of BA outsourcing its
IT services to a third party. I'm not
sure whether outsourcing had any impact on the outage, but the mere fact that BA
outsource their IT is an indication that they do not perceive IT to be a
core function, as they've asked someone else to run it on their behalf.
You may have read many IT
articles about Uber being the biggest taxi company yet owning no taxis, and
Airbnb being the biggest hotel chain yet owning no hotels. It's clear that both of these examples are
technology companies, not traditional taxi or hotel vendors, and with
such a reliance on technology they would be expected to have highly
resilient systems that are regularly tested.
BA, however, doesn't fit that
model. Their biggest expenses wouldn't be IT; they probably spend significantly
more on aircraft, fuel, staff and so on.
However, when I thought about it, their main systemic risk probably is
IT. Because they use a range of planes in their fleet, if any one model of aircraft
were grounded for any reason it would
be impactful, but not catastrophic.
Similarly, a strike by one of the unions that some of their staff belong to
(as we've seen in the past) is annoying but not critical. The same could probably be said of their
food or fuel vendors, who probably vary around the world, so if any one of
them fails, BA can most likely work around an individual failure.
Not so with IT: it appears that
one power failure in one data centre had the ability to completely cripple one of the
biggest airlines in the world. I cannot
believe that BA would have actively known this risk and chosen to run with it.
In the increasingly digital
world we live in, every company is slowly turning into a technology
company. Maybe not front-facing, but
even in a traditional industry such as aviation, where aircraft hardware will
always be key, this weekend proved you can have all the planes in the world,
but if the tech isn't there to support them, you've got no business.
Thanks for the article, I think you hit the nail on the head: 'main systemic risk probably is IT.' Yes, yes it is; for so many companies it is taken for granted. And outsourcing is the problem when it is done to pass the buck: it takes serious resources to manage contracted work. That is something we used to know, but some seem to imagine tech is different. It isn't.