5 things about cloud downtime

Downtime in the cloud can be more daunting than downtime on premises. You have little control over the infrastructure and, unfortunately, in most cases no recourse to damages. The following article explains why planning for downtime and signing a carefully negotiated contract both matter. It also describes some notorious instances of downtime.


5 Things They Never Told You About Downtime

By Ohad Flinkr

(link to original post at the end of this article. Posted here through WordPress share )

Cloud providers generously publish aggressive Service Level Agreements (SLAs), along with promises of elasticity and scalability and success stories from Fortune 500 companies. Unfortunately, it’s very easy to fall into a false sense of security when it comes to service availability in the cloud. Based on our recent Disaster Recovery survey, IT professionals define downtime as the time during which the service to your customers is unavailable or interrupted, even if your cloud infrastructure is up and running. Therefore, your cloud provider’s downtime is only part of the whole downtime picture.

Five Things Most IT Managers Miss When Planning for Unplanned Downtime

We all know that the cloud allows organizations to focus on the core business without worrying about managing IT infrastructure. However, because the cloud integrates multiple services running in geo-redundant data centers with complex architectures, managing downtime can be challenging.

The promise of the cloud is so alluring that it’s sometimes easy to forget we’re only at the beginning of a decade-long adoption curve. Here are five things most IT managers never consider about cloud downtime:

  1. SLAs are not always met

It seemed like any other Sunday on August 25, 2013. At Instagram, Facebook’s billion-dollar (and hyper-popular) mobile picture-sharing app, the offices were all but deserted. Suddenly, at 1:00 pm, a laid-back Sunday afternoon turned into an IT nightmare. For 59 whole minutes, Instagram was down. The app was hosted on Amazon Web Services (AWS), and the outage was triggered by a failure at the provider’s US East data center in northern Virginia that affected other companies as well, including Vine, Airbnb and Flipboard.

Cloud uptime is typically stated in the SLA with the cloud service provider. AWS, for example, guarantees 99.95% availability, which allows roughly 21.6 minutes of downtime in a 30-day month. But do SLAs really mean that your app is going to be up 99.95% of the time?
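The arithmetic behind an availability percentage is easy to check yourself. The following is a minimal sketch, not a provider tool: the function name `downtime_budget_minutes` and the 30-day period are assumptions for illustration, and your provider's SLA may define the measurement period differently.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Return the minutes of downtime allowed by an availability target.

    Assumes a fixed-length period (30 days by default); check how your
    provider's SLA actually defines its measurement period.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Compare a few common availability targets over a 30-day period.
    for target in (99.95, 99.9, 99.5):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes per 30 days")
```

Note how quickly the budget grows as the target loosens: dropping from 99.95% to 99.5% multiplies the allowed downtime by ten.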

Amazon’s 99.95% SLA Does NOT Correspond to 99.95% Availability of Your Service

Not necessarily. For example, Instagram was down for 59 minutes, and on Christmas 2012 AWS had an outage of 20 hours: tough luck for those late holiday shoppers. The SLA provided by your cloud service provider is therefore not set in stone. If your service is down for longer than the SLA allows, all you can do is apply for a miserly service credit of up to 30% while counting the mounting damage to your business.

  2. Downtime increases with multiple services

While an SLA of 99.95% is aggressive, it applies per service or per region. It may seem that relying on more services reduces your risk, but in reality it increases the likelihood of downtime or degraded quality of service. If your application needs every one of those services to be up, their availabilities multiply, so your expected downtime grows with every additional service that you use.

For example:

  • One service at 99.95% SLA: 99.95%
  • Three services at 99.95% SLA: 99.85%

Just like that, your expected downtime has tripled, from 0.05% to 0.15%. Therefore, when managing downtime, it is important to account for all services that may interrupt or degrade your service. If you require three nines (99.9% uptime), Amazon’s promise of 99.95% uptime does not do all of the work for you.
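The figures above can be reproduced with a few lines of code. This is a minimal sketch that assumes the services fail independently and that your application needs all of them at once; the function name `composite_availability` is hypothetical.

```python
from math import prod


def composite_availability(availabilities_pct):
    """Availability (in percent) of a system that needs every dependency up.

    Assumes failures are independent, so the individual availabilities
    multiply. Correlated failures would change the result.
    """
    return 100.0 * prod(a / 100.0 for a in availabilities_pct)


if __name__ == "__main__":
    # Stack one, three, and five services that each promise 99.95%.
    for count in (1, 3, 5):
        combined = composite_availability([99.95] * count)
        print(f"{count} service(s): {combined:.2f}% available, "
              f"{100.0 - combined:.2f}% downtime")
```

With three services at 99.95% the combined figure comes out at about 99.85%, which already falls short of a 99.9% target.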

  3. Downtime of third-party services

Many apps and websites rely on more than one cloud-based service: hosting, code libraries, checkout and transaction services, analytics and many others. As the Web becomes more integrated, outages may affect Web apps indirectly.

Whether a Web app is hosted on premises or in the public cloud, it typically uses other APIs, services and components that are hosted in the cloud. A cloud outage may therefore limit functionality or degrade the user experience even for Web apps that do not use the affected provider directly.

Third-Party Service Disruptions Must Be Factored into Unplanned Downtime

Your SLA with your cloud service provider does not protect you against third-party downtime. In ecommerce, for example, if your third-party shopping cart service is down but your website is available, that would traditionally be counted as uptime. But is it? You may suffer significant losses, as well as frustrated users calling your customer service center.

Furthermore, third-party monitoring services like Pingdom or New Relic may not give you the full visibility you need to make sure that your service is fully functional. Therefore, when estimating your expected downtime, take all third-party services into account as well.
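One way to take third-party services into account is to count every minute in which any dependency was down, without double-counting overlapping outages. The sketch below assumes you already have outage windows from your own monitoring; the function name `merged_downtime_minutes` and all of the sample windows are hypothetical.

```python
def merged_downtime_minutes(outages):
    """Total minutes during which at least one dependency was down.

    `outages` is an iterable of (start_minute, end_minute) pairs collected
    across all services; overlapping windows are counted only once.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(outages):
        if current_end is None or start > current_end:
            # This window does not touch the merged one: close that one out.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # This window overlaps the merged one: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


if __name__ == "__main__":
    # Hypothetical outage windows, in minutes since the start of the month.
    hosting = [(100, 159)]
    shopping_cart = [(140, 200), (5000, 5030)]
    analytics = [(9000, 9010)]
    print(merged_downtime_minutes(hosting + shopping_cart + analytics))
```

Whether an analytics outage should count as downtime for your users is a business decision; the point is to make that decision explicitly instead of counting only the hosting provider's outages.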

  4. Cloud is up, service is down

One small thing that may have slipped your mind: your cloud service provider guarantees uptime for their service, not for yours.  This means that even if the cloud is up, your app may be down.

Human Error Is One of the Most Common Root Causes for Service

The SLA typically covers only the availability of your virtual machine (VM). Your VM may be running while users cannot reach the application. This would typically not count against the provider’s uptime SLA, even though your application was not available. There are plenty of reasons why an app may be unavailable, and some of them may be entirely the fault of the development and deployment teams.
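Because a running VM says little about the application, it can help to probe the application itself. The following is a minimal sketch using Python's standard `urllib`; the function names, the URL, and the expected page text are placeholders, and a real check would likely exercise a login or checkout flow as well.

```python
import urllib.error
import urllib.request


def _http_get(url, timeout):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


def app_is_up(url, expected_text, timeout=5.0, fetch=_http_get):
    """Return True only if the page loads and contains the expected text.

    A reachable server that returns an error page, or a connection that
    fails or times out, is reported as down.
    """
    try:
        status, body = fetch(url, timeout)
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False
    return status == 200 and expected_text in body


if __name__ == "__main__":
    # Placeholder URL and marker text; replace with your own application's.
    print(app_is_up("https://example.com/", "Example Domain"))
```

The `fetch` parameter exists only so the check can be exercised without a network, for instance by passing a function that returns a canned response.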

  5. Bad quality of service

While your cloud service provider promises availability, many boilerplate SLAs do not guarantee quality of service (QoS). If your QoS degrades unexpectedly but your application is still available, the degradation does not count against your SLA. QoS issues are also hard to monitor and may not be reported by the cloud service provider, because the service is technically available even though the user experience is degraded.
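Since the provider will not report degraded QoS for you, one option is to track it yourself from response-time samples. This is a minimal sketch under stated assumptions: the 1000 ms threshold, the 95th percentile, the nearest-rank percentile method, and the function names are all illustrative choices, not values from any SLA.

```python
from math import ceil


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def degraded_minutes(samples_by_minute, threshold_ms=1000.0, pct=95):
    """Count the minutes whose latency percentile exceeds the threshold.

    `samples_by_minute` holds one list of response times (in ms) per
    minute. Minutes with no samples are skipped, not counted as degraded.
    """
    return sum(
        1
        for samples in samples_by_minute
        if samples and percentile(samples, pct) > threshold_ms
    )


if __name__ == "__main__":
    # Hypothetical response times for three consecutive minutes.
    observed = [
        [120, 130, 110, 150],    # healthy
        [200, 2500, 1800, 220],  # available, but slow for some users
        [],                      # no traffic recorded
    ]
    print(degraded_minutes(observed))
```

Adding these degraded minutes to your outage minutes gives a figure closer to what your users actually experienced.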

Degraded Quality of Service Effectively Means

What is your true downtime?

Managing downtime in the cloud is far more complicated than just reviewing your cloud provider’s SLA. In fact, if you count downtime as unavailable or interrupted service for your users, there are other factors to consider besides your cloud service.  The five considerations outlined here should help you manage and mitigate your downtime and optimize your business availability.

