Supporting SLAs on the Cloud

What does it take to make a Cloud Computing infrastructure enterprise-ready? Well, as always, this probably depends on the use case, but support for real-time scaling and SLAs must figure highly.

Software that purports to scale applications on the cloud is not new; have a look at our prior blog post on this topic and you will see some of the usual suspects, such as RightScale and Scalr. A new offering in this space is Tibco Silver. Tibco Silver is trying to solve the problem not of whether cloud services can scale but of whether the applications themselves can scale with them. Silver addresses this through ‘self-aware elasticity’. Hmmm… sounds good, but what exactly does that mean? It means the system can automatically provision new cloud capacity (be that storage or compute) depending on fluctuations in application usage.
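As a rough illustration of what usage-driven provisioning means in practice, here is a minimal sketch of an elastic scaling decision. Every name in it (the metrics, thresholds and node counts) is hypothetical; it is not Tibco Silver's API, just the general shape of the idea.

```python
# Hypothetical sketch of usage-driven ("elastic") scaling: none of these
# names come from Tibco Silver; they only illustrate the decision logic.

def autoscale(nodes, requests_per_sec, capacity_per_node,
              min_nodes=2, max_nodes=20):
    """Return the new node count for the observed load."""
    utilisation = requests_per_sec / (capacity_per_node * nodes)
    if utilisation > 0.80 and nodes < max_nodes:
        return nodes + 1   # provision an extra instance
    if utilisation < 0.30 and nodes > min_nodes:
        return nodes - 1   # release an idle instance
    return nodes           # load within band: no change
```

Run in a loop against live metrics, this is the core of what any elasticity engine does; the hard part is making the application itself tolerate the changing topology.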

According to Tibco, unlike services in a service-oriented architecture, cloud services are not aware of the SLAs to which they are required to adhere, and Tibco Silver is aimed at providing this missing functionality. Tibco claim that “self-aware elasticity” is something no other vendor has developed. I would dispute this. GigaSpaces XAP, with its ability to deploy to the cloud as well as on-premise using the same technology, has very fine-grained application-level SLA control that, when breached, allows the application to react accordingly, whether this be to increase the number of threads, provision new instances or distribute workloads in a different way. The GigaSpaces Service Grid technology enables this real-time elasticity; the Service Grid originated from Sun’s RIO Project. (Interestingly, it seems GigaSpaces are doing some work on enabling their cloud tools to deploy to and manage VMware images on private clouds, as they do with AMIs on Amazon’s public cloud.)
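The application-level SLA idea can be sketched in a few lines: a rule pairs a measured metric with a threshold and a corrective action that fires on breach. The names below are illustrative only, not the GigaSpaces Service Grid API.

```python
# Illustrative sketch of an application-level SLA rule; not GigaSpaces' API.

class SlaRule:
    def __init__(self, metric, threshold, action):
        self.metric = metric        # e.g. 99th-percentile latency
        self.threshold = threshold  # breach level
        self.action = action        # corrective reaction

    def check(self, measurements):
        """Fire the corrective action if the metric breaches its threshold."""
        value = measurements[self.metric]
        if value > self.threshold:
            return self.action(value)
        return None  # SLA honoured: nothing to do

def add_worker_threads(latency_ms):
    # A real grid might instead provision new instances or
    # redistribute workload, as described above.
    return f"breach at {latency_ms}ms: adding worker threads"

rule = SlaRule("p99_latency_ms", 250, add_worker_threads)
```

The point of putting the rule at the application level is that the reaction can be application-specific (threads, instances, workload distribution) rather than a blunt infrastructure restart.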

Without a doubt, the ability to react in real time to application-level SLAs, rather than just to breaches of an SLA at the infrastructure level, is something that will find a welcome home in both private and public clouds.

Why, right now, is Amazon the only game in town?

Amazon is currently the big bear of Cloud Computing platforms. Its web services division has proved disruptive and consistently shown innovation and breadth of services within its platform, and it is growing at a rapid rate. Forty per cent of Amazon’s gross revenues come from its third-party merchants, and Amazon Web Services is an extension of this. The core Amazon site uses its own web services to build the Amazon pages on the fly, dynamically; rendering a page results in approximately 200–300 Amazon Web Service calls. In short, Amazon eats its own dog food.

Why are Amazon good at this?

1. They have a deep level of technical expertise that has come from running one of the largest global online consumer marketplaces.

2. This has led to a culture of scale and operational excellence.

3. They have an appetite for low-margin, high-volume business and, more importantly, they understand it fully.

Let’s look at the competition. Microsoft can certainly satisfy the first point from the list above, but will probably have to buy the second, and have certainly not demonstrated in their history that they have the third. For this reason we cannot expect Azure to be an instant Amazon competitor. What about RackSpace? Well, they can satisfy 1 and, to a lesser extent, 2, but again it is not clear that they have yet fully assimilated point 3. IBM have both 1 and 2 but again fall down on point 3. Currently Amazon are unique in the combination of what they provide, how they provide it, and how they price it and make money from it.

The core ethos of the Amazon CTO, Werner Vogels, is that “everything breaks all the time“, and it is with this approach that they build their infrastructure. Amazon currently have three worldwide data centres: one on the US east coast, one on the west coast, and one in Ireland, with the intent to have at least one more in Asia-Pacific. Each data centre is on a different flood plain and a different power grid, and has a different bandwidth provider, to ensure redundancy. If S3 is used to store data then six copies of the data are stored. In short, the infrastructure is built to be resilient.

This does not mean there will not be outages; we know these have occurred, not just for Amazon but for other prominent online companies as well. Amazon’s SLA guarantees 99.95% uptime for EC2 and 99.9% for S3. What does this mean in terms of downtime? For EC2, approximately 4 hours and 23 minutes per year. Not good enough? Well, reducing downtime costs money, and I know many, many enterprise organisations who could only dream of having downtime as low as this. Chasing five nines of availability is in many ways chasing a dream: achieving it is often more costly than the outages it is meant to protect against. Amazon already provides a service health dashboard for all its services, something Google also seems set to do, and is set to provide additional monitoring services later in the year (along with auto-scaling and load-balancing services) that will make the core services even better.
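The downtime figures above are simple arithmetic on the SLA percentage:

```python
# Yearly downtime implied by an uptime percentage (365-day year).

def downtime_hours_per_year(uptime_percent):
    return 365 * 24 * (100 - uptime_percent) / 100

# 99.95% (EC2) allows about 4.4 hours per year, i.e. roughly
# 4 hours and 23 minutes; 99.9%, measured over a year, would
# allow about 8.8 hours.
```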

Amazon has proved that as soon as you take away the friction of hardware, you breed innovation. The Animoto use case is a good example of this, as is their case study on the Washington Post. There are more Amazon case studies here.

Right now, for my money, Amazon is on its own in what it is providing. Sure, other companies provide hosting and storage, and for many users they will be good enough, but for the sheer innovation and breadth of integrated services, coupled with the low-cost utility compute model, Amazon stands alone.

Could Amazon really pull the plug?

One of the interesting things about the Amazon success story is that the EC2 virtual server technology is often assumed to have originally been an overspill from Amazon’s own network: the story goes that Amazon sold the extra capacity it had available that was not being used at peak times. I guess we’ll never know the real answer to this, but an interesting post from Benjamin Black, one of the original guys to work with Chris Pinkham on what was to become Amazon EC2, seems to dispel this as an urban myth.

Why is this important? Because I still meet people who seem to believe this, and who think that Amazon could “take back” capacity if they needed it and therefore leave people using EC2 (i.e. them) high and dry. So could this ever happen? Well, not for this reason, but I think a better question to ask is “could Amazon pull the plug?”

“Of course not,” I hear you say, “Amazon have just announced an SLA.” Well, that is true: since October 23rd 2008 Amazon have had an SLA that guarantees they will make every reasonable effort to provide 99.9% monthly uptime. If they breach this then there is a series of financial credits, which may not make up for the money you lose through lost trade if your site is down. To be fair, though, if any SLA is breached you have the same problem wherever the site / service / application is hosted (and remember, some providers don’t offer an SLA at all, preferring to build trust in their service instead).
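To put the monthly guarantee in concrete terms, the same downtime arithmetic applied to a 30-day month:

```python
# Downtime a 99.9% monthly uptime guarantee permits (30-day month assumed).

minutes_in_month = 30 * 24 * 60                          # 43,200 minutes
allowed_downtime = minutes_in_month * (100 - 99.9) / 100
# roughly 43 minutes of downtime in a month before the SLA is breached
```

Whether roughly three-quarters of an hour of monthly downtime is acceptable, and whether the service credits offset the trade lost during it, is exactly the judgement the post asks you to make.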

One of the things that you also sign up to when you use one of Amazon’s services is the click-through license agreement. Delving into this provides more detail on the answer to our core question, “could Amazon really pull the plug?”. In section 3.3.2 of this agreement Amazon state the following:

3.3.2. Paid Services (other than Amazon FPS and Amazon DevPay). We may suspend your right and license to use any or all Paid Services (and any associated Amazon Properties) other than Amazon FPS and Amazon DevPay, or terminate this Agreement in its entirety (and, accordingly, cease providing all Services to you), for any reason or for no reason, at our discretion at any time by providing you sixty (60) days’ advance notice in accordance with the notice provisions set forth in Section 15 below.

In essence, what this says is that if Amazon want to pull the plug, other than because of a fundamental breach of contract on your part (the grounds for which are laid out in section 3.4), they will give you 60 days’ notice. Great, so you get 60 days, right? Well, not quite. Another section in the terms of service, “Modifications to this Agreement”, allows Amazon to modify the terms of the whole agreement, and once posted the new terms become applicable 15 days after posting. Of course, this change could include the section that says Amazon has to give 60 days’ notice of termination. OK, so now we get to it: Amazon have to give you 15 days before they pull the plug? Well, not quite. If they redefine their acceptable usage policy, and the new usage policy prohibits your service or application, then in effect you get no notice before the plug is pulled.

Extreme? Of course, but the reality is that a service like Amazon’s (or Salesforce’s) is built on trust, and if people don’t trust the service they won’t use it. Amazon and Salesforce both know this and work hard on creating services that have very little downtime and that are flexible and easy to use. This is why their usage ramp is going through the roof.

[Figure: Alexa Traffic History]