The recent Cloud Views conference presentations can be viewed below:
Some of the key things to think about when putting your application on the cloud are discussed below. Cloud computing is relatively new, and best practice is still being established. However, we can learn from earlier technologies and concepts such as utility compute, SaaS, outsourcing and even internal enterprise data centre management, as well as from experience with vendors such as Amazon and FlexiScale.
Licensing: If you are using the cloud for spikes or overspill, make sure that the products you want to use in the cloud can be used in this way. Certain products have licence terms that restrict their use in the cloud. This is especially true of commercial Grid, HPC or DataGrid vendors.
Data transfer costs: When using a provider like Amazon with a detailed cost model, make sure that any data transfers are internal to the provider network rather than external. In the case of Amazon, internal traffic is free but you will be charged for any traffic over the external IP addresses.
Latency: If you have low latency requirements then the Cloud may not be the best environment to achieve this. If you are trying to run an ERP or some such system in the cloud then the latency may be good enough but if you are trying to run a binary or FX Exchange then of course the latency requirements are very different and more stringent. It is essential to make sure you understand the performance requirements of your application and have a clear understanding of what is deemed business critical.
One Grid vendor that has focused on attacking low latency in the cloud is GigaSpaces, so if you require low latency in the cloud it is one of the companies you should evaluate. Also, for processing distributed data loads there is the map reduce pattern and Hadoop. These types of architecture eliminate the boundaries created by scale-out, database-based approaches.
State: Check whether your cloud infrastructure provider offers persistence. When an application is brought down and then back up, all local changes will be wiped and you start with a blank slate. This obviously has ramifications for instances that need to store user or application state. To combat this on their platform, Amazon delivered EC2 persistent storage, in which data can remain linked to a specific computing instance. You should ensure you understand the state limitations of any Cloud Computing platform that you work with.
Data Regulations: If you are storing data in the cloud you may be breaching data laws depending where your data is stored i.e. which country or continent. To combat this Amazon S3 now supports location constraints, which allow you to specify where in the world to store data for a bucket and provides a new API to retrieve the location constraint for an existing bucket. However if you are using another cloud provider you should check where your data is stored.
Dependencies: Be aware of the dependencies of service providers. If service ‘y’ is dependent on ‘x’, then if you subscribe to service ‘y’ and service ‘x’ goes down, you lose your service. Always check any dependencies when you are using a cloud service.
Standardisation: A major issue with current cloud computing platforms is that there is no standardisation of the APIs and platform technologies that underpin the services provided. Although this reflects a lack of maturity, you need to consider how locked in you will be when choosing a Cloud platform, because migrating between cloud computing platforms will be very difficult if not impossible. This may not be an issue if your supplier is IBM and always likely to be IBM, but it will be an issue if you are just dipping your toe in the water and discover that other platforms are better suited to your needs.
Security: Lack of security, or apparent lack of security, is one of the perceived major drawbacks of working with Cloud platforms and Cloud technology. When moving sensitive data about or storing it in a public cloud, it should be encrypted. It is also important to consider a secure ID mechanism for authentication and authorisation for services. As with normal enterprise infrastructures, only open the ports needed and consider installing a host-based intrusion detection system such as OSSEC. The advantage of working with an enterprise Cloud provider, such as IBM or Sun, is that many of these security optimisations are already taken care of. See our prior blog entry for securing n-tier and distributed applications on the cloud.
Compliance: Regulatory controls mean that certain applications may not be able to be deployed in the Cloud. For example, the US Patriot Act could have very serious consequences for non-US firms considering US-hosted cloud providers. Be aware that cloud computing platforms are often made up of components from a variety of vendors who may themselves provide computing in a variety of legal jurisdictions. Be very aware of the dependencies and ensure you factor this into any operational risk management assessment. See also our prior blog entry on this topic.
Quality of service: You will need to ensure that the behaviour and effectiveness of the cloud application that you implement can be measured and tracked to meet both existing and new Service Level Agreements. We have discussed previously some of the tools that come with this option built in (GigaSpaces) and other tools that provide functionality that enables you to use this with your Cloud Architecture (RightScale, Scalr etc). Achieving Quality of Service will encompass scaling, reliability, service fluidity, monitoring, management and system performance.
System hardening: Like all enterprise application infrastructures you need to harden the system so that it is secure, robust, and achieves the necessary functional requirements that you need. See our prior blog entry on system hardening for Amazon EC2.
Compliance and regulatory concerns are often voiced when it comes to Cloud Computing, and many of the interesting types of applications organisations would like to deploy to the cloud are those governed by some form of regulatory standard. Let's look in more detail at one of these.
PCI DSS is a set of comprehensive requirements for enhancing payment account data security and was developed by the founding payment brands of the PCI Security Standards Council, including American Express, Discover Financial Services, JCB International, MasterCard Worldwide and Visa Inc. International, to help facilitate the broad adoption of consistent data security measures on a global basis.
The PCI DSS is a multifaceted security standard that includes requirements for security management, policies, procedures, network architecture, software design and other critical protective measures. This comprehensive standard is intended to help organizations proactively protect customer account data.
So, is it possible to create a PCI DSS compliant application that can be deployed to EC2?
For an application or system to become PCI DSS compliant, it requires an end-to-end system design (or a review if pre-existing) and implementation. In the case of AWS customers attaining PCI compliance (certification), they would have to ensure they met all of the prescribed requirements through the use of encryption etc., very much like other customers have done with HIPAA applications. The AWS design allows for customers with varying security and compliance requirements to build to those standards in a customized way.
There are different levels of PCI compliance and the secondary level is quite a straightforward configuration, but requires additional things such as 3rd party external scanning (annually). You can find an example here of the PCI Scan report that is done on a quarterly basis for the Amazon platform. This isn’t meant to be a replacement for the annual scan requirement. Customers undergoing PCI certification should have a dedicated scan that includes their complete solution, therefore certifying the entire capability, not just the Amazon infrastructure.
The principles and accompanying requirements around which the specific elements of the DSS are organized are:
Build and Maintain a Secure Network
Requirement 1: Install and maintain a firewall configuration to protect cardholder data
Requirement 2: Do not use vendor-supplied defaults for system passwords and other security parameters
Protect Cardholder Data
Requirement 3: Protect stored cardholder data
Requirement 4: Encrypt transmission of cardholder data across open, public networks
Maintain a Vulnerability Management Program
Requirement 5: Use and regularly update anti-virus software
Requirement 6: Develop and maintain secure systems and applications
Implement Strong Access Control Measures
Requirement 7: Restrict access to cardholder data by business need-to-know
Requirement 8: Assign a unique ID to each person with computer access
Requirement 9: Restrict physical access to cardholder data
Regularly Monitor and Test Networks
Requirement 10: Track and monitor all access to network resources and cardholder data
Requirement 11: Regularly test security systems and processes
Maintain an Information Security Policy
Requirement 12: Maintain a policy that addresses information security
Many of these requirements can’t be met strictly by a datacenter provider, but in Amazon’s case, they will be able to provide an SAS70 Type 2 Audit Statement in July that will provide much of the infrastructure information needed to meet PCI DSS certification. The Control Objectives that the Amazon Audit will address are:
Control Objective 1: Security Organization: Management sets a clear information security policy. The policy is communicated throughout the organization to users
Control Objective 2: Amazon Employee Lifecycle: Controls provide reasonable assurance that procedures have been established so that Amazon employee accounts are added, modified and deleted in a timely manner and reviewed on a periodic basis to reduce the risk of unauthorized / inappropriate access
Control Objective 3: Logical Security: Controls provide reasonable assurance that unauthorized internal and external access to data is appropriately restricted
Control Objective 4: Access to Customer Data: Controls provide reasonable assurance that access to customer data is managed by the customer and appropriately segregated from other customers
Control Objective 5: Secure Data Handling: Controls provide reasonable assurance that data handling between customer point of initiation to Amazon storage location is secured and mapped accurately
Control Objective 6: Physical Security: Controls provide reasonable assurance that physical access to Amazon’s operations building and the data centers is restricted to authorized personnel
Control Objective 7: Environmental Safeguards: Controls provide reasonable assurance that procedures exist to minimize the effect of a malfunction or physical disaster to the computer and data center facilities
Control Objective 8: Change Management: Controls provide reasonable assurance that changes (including emergency / non-routine and configuration) to existing IT resources are logged, authorized, tested, approved and documented.
Control Objective 9: Data Integrity, Availability and Redundancy: Controls provide reasonable assurance that data integrity is maintained through all phases including transmission, storage and processing and the Data Lifecycle is managed by customers
Control Objective 10: Incident Handling: Controls provide reasonable assurance that system problems are properly recorded, analyzed, and resolved in a timely manner.
Many thanks to Carl from Amazon for his help with this information.
Update: Since this post was published Amazon updated their PCI DSS FAQ. You can find that here.
There has been an interesting discussion on The Cloud Computing forum hosted on Google Groups (and if you are at all interested in Cloud I recommend you join it, as it really does have some excellent discussions). What has been interesting about it from my viewpoint is that there is a general consensus that the average CPU utilisation in organisational data centres runs between 10 and 15%. Some snippets of the discussion are below:
Initial Statement on the Group discussion
The Wallstreet Journal article “Internet Industry is on a Cloud” does not do Cloud computing any justice at all.
First: Value proposition of Cloud computing is crystal clear. Averaged over 24 hours, and 7 days a week , 52 weeks in a year most servers have a CPU utilization of 1% or less. The same is also true of network bandwidth. The storage capacity on harddisks that can be accessed only from a specific servers is also underutilized. For example, harddisk capacity of hard disks attached to a database server, is used only when certain queries that require intermediate results to be stored to the harddisk. At all other times the hard disk capacity is not used at all.
First response to the statement above on the group
Utilization of *** 1 % or less *** ???
Who fed them this? I have seen actual collected data from 1000s of customers showing server utilization, and it’s consistently 10-15%. (Except mainframes.) (But including big proprietary UNIX systems.)
A poll over at LinkedIn is asking the question “What areas of Cloud Computing most concerns your or your organisation”. The current state of play for the poll is as below:
I am surprised that security is coming in so low and that performance is perceived as the number one concern. It will be interesting to monitor how this poll changes as more votes are cast. You can choose to vote here.
Simone Brunozzi, technology evangelist for AWS in Europe, posted some further success stories / use cases for Amazon Web Services in Europe and Asia on the Amazon blog – I’ve reposted the article below as it always makes interesting reading to see how companies are embracing cloud computing, and particularly what the details are of the use case.
Industria, Iceland
Industria adopted Amazon Web Services for their ZignalCloud service, as well as for the Zignal digital entertainment delivery platform. ZignalCloud lowers the total cost of ownership for service providers and provides predictability of costs, reduces technology risks and decreases time to market.
In their blog, they state:
“An intended consequence of this approach is that we can do it all with no upfront cost for our customers, because we’re effectively using a true cost-sharing model that offers us almost a 100% economy of scale.”
Of course, when you use Amazon Web Services, you’re charged only for what you use, with no upfront investment. You can read more details on AWS’s offerings on our product page.
If you’re interested in ZignalCloud, you can contact Industria in Iceland, Ireland, Bulgaria, UK, Sweden or China.
When they started imageloop.com’s transition to Amazon Web Services, they needed to convert all their old pictures, generating new thumbnails and output formats.
Normally that would have taken months, but since they had virtually unlimited access to CPU power with EC2, they just launched sixty c1.xlarge instances that fed off conversion jobs from SQS and were done in a day and a half.
Then, about a week later, when they were going live, they scheduled one night of downtime for maintenance and converted the images that had accumulated during the week, about 110,000 pictures, using ten EC2 instances for two hours.
Overall, imageloop.com is very pleased with the level of flexibility that Amazon provides.
In the words of Antonio: “the delivery speed of the Slideshows is way better than before and we liked the flexibility and ease with which we were able to build up the platform. Congratulations to a great product!”
And this is Stefan Riehl, imageloop.com’s CEO: “When we began evaluating alternatives to traditional hosting vendors, it became apparent that AWS’s offering is the most mature in the market.”
SnappyFingers is a Question and Answer search engine. SnappyFingers crawls and indexes Frequently Asked Questions on the Internet, and provides search results in an easy to view Question/Answer format.
Chirayu Patel was kind enough to share with us some details on how they use Amazon Web Services (AWS) along with some rationale behind their choices.
The three main motivations behind their choices are (in their own words):
– we are extremely reluctant to learn or do anything outside of SnappyFingers domain. We would rather outsource.
– We are very cost conscious.
– We do write buggy code, but we do not want our systems to die because it.
During the design of SnappyFingers, they considered multiple options, but at the end they picked Amazon Web Services.
Preliminary cost analysis showed that the basic cost of the AWS alternatives would be lower in the long run. Also, there was an added advantage of not being tied to a single vendor. However, once they added the cost of managing the systems, the financial advantage of using AWS became evident.
This, coupled with the fact that they didn’t want to be distracted with operational burdens unrelated to their core business, meant that AWS became the obvious choice for scaling CPU/storage resources.
SnappyFingers is comprised of two systems – a Website, and Information Retrieval System (IRS). The Website corresponds to the system that serves user requests, and the IRS is the system that does all the behind the scenes work to gather Q&A.
SnappyFingers is mostly coded in Python and Java, and uses multiple third-party packages, notably the Django framework, Python's multiprocessing package, and Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java.
The Website runs on at least three EC2 nodes, and uses the following components.
1. nginx – An extremely fast web server, used to serve static/cached content. It is also used to reverse proxy traffic to multiple Apache servers.
2. Apache server with mod_python to execute the Python code along with the Django framework.
3. Searchers to perform the actual searches on the Q&A index.
4. Spell checkers.
5. PostgreSQL, for system management: recording bugs, registering new services, and such.
Caching is built into the system using a combination of memcached and file system caching. Static content is served using Amazon CloudFront. Amazon Mechanical Turk is used to test the relevancy of search results.
The Information Retrieval System (IRS) is responsible for creating Q&A indexes that will eventually be used by the searcher. It uses multiple services to do the job:
1. Crawlers to crawl the internet.
2. Parsers to extract Questions and Answers from each page, detect spam, and eliminate duplicate content.
3. Scorers to score the Q&A’s based on a number of factors. The scoring algorithms are the most dynamic pieces of code, and are under continuous evolution.
4. Indexers to index the Q&A.
These services interact with multiple storage devices: Amazon S3, Amazon SimpleDB and PostgreSQL. Not all data is stored in all locations. Based on the data size and retrieval requirements, we store the data in different locations. All data access is done through a Python-based custom ORM (Object Relational Mapping) to simplify programming.
Another aspect of these services is that they can be run on any node. At times they have used a certain number of EC2 servers, while at others they have reduced their infrastructure depending on the load and their monthly AWS budget.
At present the IRS has consumed roughly 500 GBytes for a data set of 11 million Q&A.
Intra-service communication uses the concept of pipelines, each with its own set of pipes. Each pipe (an Amazon SQS queue) is owned by a service, which is responsible for processing messages within it. Once processing is complete, messages are sent to the next pipe in the pipeline.
This architecture has not only allowed SnappyFingers to maintain the modular nature of the system, but also to develop and deploy services in isolation from the rest of the system.
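The pipeline idea can be sketched in a few lines of Python. This is only an illustration, not SnappyFingers' code: it uses the standard library's queue.Queue as a local stand-in for an Amazon SQS queue, and the Pipe class, run_pipeline function and the three stage handlers are all hypothetical names invented for the example.

```python
import queue


class Pipe:
    """One stage of a pipeline: a queue plus the service that owns it."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler          # function: message -> message
        self.messages = queue.Queue()   # local stand-in for an SQS queue


def run_pipeline(pipes, inputs):
    """Push inputs through each pipe in order and return the final outputs."""
    for message in inputs:
        pipes[0].messages.put(message)
    outputs = []
    for index, pipe in enumerate(pipes):
        # The owning service drains its own pipe, then hands results onward.
        while not pipe.messages.empty():
            result = pipe.handler(pipe.messages.get())
            if index + 1 < len(pipes):
                pipes[index + 1].messages.put(result)
            else:
                outputs.append(result)
    return outputs


# Hypothetical stages, loosely modelled on crawl -> parse -> score.
pipes = [
    Pipe("crawl", lambda url: {"url": url, "html": "<p>Q: a? A: b.</p>"}),
    Pipe("parse", lambda page: {**page, "qa": ("a?", "b.")}),
    Pipe("score", lambda page: {**page, "score": len(page["qa"][1])}),
]
results = run_pipeline(pipes, ["http://example.com/faq"])
```

Because each stage only sees its own pipe, a stage can be replaced or tested on its own by feeding messages into its queue directly.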
The Error Handling strategy is simple: on an error, a service will log the error and store the corresponding message in Amazon SimpleDB, and continue processing the next message. The service stops only when the error rate exceeds configured thresholds.
Once the errors have been corrected, the corresponding messages are pushed back to Amazon SQS for completion of processing.
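A rough sketch of this error-handling strategy is below. It is an assumption-laden illustration rather than the real implementation: a plain list stands in for the Amazon SimpleDB error store, and the function names, the max_error_rate threshold and the min_seen warm-up count are hypothetical.

```python
def process_with_error_budget(messages, handler, max_error_rate=0.5, min_seen=4):
    """Process messages, park failures, and stop if errors exceed the threshold."""
    processed = []
    parked = []   # stand-in for the SimpleDB error store
    seen = 0
    for message in messages:
        seen += 1
        try:
            processed.append(handler(message))
        except Exception as exc:
            # Record the failure and carry on with the next message.
            parked.append({"message": message, "error": str(exc)})
            # Only stop once enough messages have been seen to judge the rate.
            if seen >= min_seen and len(parked) / seen > max_error_rate:
                break
    return processed, parked


def replay(parked, handler):
    """Re-run parked messages once the underlying bug has been fixed."""
    return [handler(entry["message"]) for entry in parked]


# One bad message is parked; the rest are processed normally.
processed, parked = process_with_error_budget(["1", "x", "3"], int)

# After a (hypothetical) fix, the parked message is pushed back through.
replayed = replay(parked, lambda m: int(m) if m.isdigit() else -1)

# A run of failures trips the threshold and stops the service early.
halted_ok, halted_parked = process_with_error_budget(["a", "b", "c", "d", "5"], int)
```

In the real system the replay step would push the stored messages back onto the SQS queue rather than call the handler directly.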
CPU utilization and scaling
All the IRS services are designed to keep CPU occupancy at 100% (or at a configured value), using Python’s multiprocessing package to spawn or kill processes as needed to maintain that occupancy.
The services are independent of the node on which they are running, and if there is a huge backlog of messages in Amazon SQS, more EC2 nodes can be spawned to handle the extra load.
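As a minimal sketch of the scaling decision, the function below estimates how many nodes would be needed to drain a queue backlog within a target time. Everything here is an assumption made for illustration: the function name, the per-node throughput figure, the drain target and the max_nodes cap (which loosely models a monthly budget limit) do not come from SnappyFingers.

```python
import math


def nodes_needed(backlog, msgs_per_node_per_min, drain_minutes, max_nodes):
    """Estimate the node count needed to clear a backlog within drain_minutes."""
    if backlog <= 0:
        return 1  # keep one node running even when the queue is empty
    capacity_per_node = msgs_per_node_per_min * drain_minutes
    required = math.ceil(backlog / capacity_per_node)
    # Never drop below one node, never exceed the budget cap.
    return max(1, min(required, max_nodes))


# 12,000 queued messages, 100 messages/minute per node, 30-minute target.
wanted = nodes_needed(12_000, 100, 30, max_nodes=10)
```

A real deployment would read the backlog from the queue's approximate message count and measure per-node throughput rather than hard-code it.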
We see these questions time and time again: “How do I design for Peak load?” and “How do I scale out on the cloud?”. First let's figure out how to give some definition for Peak load:
We will make a stab at defining peak load as: “A percentage of activity on a day/week/month/year that comes within a window of a few hours and is deemed as extreme and occurs because of either seasonality or because of unpredictable spikes.”
H = peak hits per second
h = # hits received over a one month period
a = % of activity that comes during peak time
t = peak time in hours
H = h * a / (days * t * minutes * seconds)
H = h * a / (108,000 * t)
Determine the peak Virtual Users: Peak hits/second + page view times
U = peak virtual users
H = peak hits per second
p = average number of hits / page
v = average time a user views a page
U = (H / p) * v
h = 150,000,000 hits per month
a = 10% of traffic occurs during peak time
t = peak time is 2 hours
p = a page consists of 6 hits
v = the average view time is 30 seconds
H = (h * a) / (108,000 * t)
H = (150,000,000 * 0.1) / (108,000 * 2)
H = 15,000,000 / 216,000
H ≈ 69
U = (H / p) * v
U = (69.4 / 6) * 30
U ≈ 11.6 * 30
U ≈ 347
Desired Metric – 69 Hits / Sec or 347 Virtual Users
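For readers who want to plug in their own numbers, here is a small Python sketch of the two formulas. The function and parameter names are made up for this example, and the 30-day month is an assumption taken from the 108,000 constant (30 days * 60 minutes * 60 seconds).

```python
def peak_hits_per_second(hits_per_month, peak_share, peak_hours, days=30):
    """H = h * a / (days * t * minutes * seconds)."""
    return hits_per_month * peak_share / (days * peak_hours * 60 * 60)


def peak_virtual_users(hits_per_second, hits_per_page, view_seconds):
    """U = (H / p) * v."""
    return hits_per_second / hits_per_page * view_seconds


# Inputs from the example: h, a and t, then p and v.
H = peak_hits_per_second(150_000_000, 0.10, 2)
U = peak_virtual_users(H, 6, 30)
print(f"{H:.1f} hits/sec, {U:.0f} virtual users")
```

Keeping the number of days as a parameter makes it easy to rerun the calculation for a weekly or yearly window.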
In the example from Thomas Consulting above, peak load is 15,000,000 hits in two hours, whereas the normal average for two hours is roughly 411,000 hits [(((h*12)/365)/24)*2]. That is more than 36 times the average, a huge difference, and this example is not even extreme. Online web consumer companies can do 70% of their yearly business in December alone.
Depending on what else occurs during the processing of each hit, this could be the difference between having 1 EC2 instance and having 10, or a cost difference of $6,912 versus $82,944 over the course of a year (based on a large Amazon EC2 instance). And of course building for what you think is peak can still lead to problems. A famous quote by Scott Gulbransen from Intuit is:
“Every year, we take the busiest minute of the busiest hour of the busiest day and build capacity on that, We built our systems to (handle that load) and we went above and beyond that.” Despite this the systems still could not handle the load.
What we really want to be able to do is to have our site built for our average load, excluding peak, and have scale on demand built into the Architecture. As EC2 is the most mature cloud platform, we will look at tools that can achieve this on EC2:
GigaSpaces XAP: From version 6.6 of the GigaSpaces XAP Platform, Cloud tooling is built in. GigaSpaces is a next generation virtualised middleware platform that hosts logic, data, and messaging in-memory, and has fewer moving parts so that scaling out can be achieved linearly, unlike traditional middleware platforms. GigaSpaces is underpinned by a service grid which enables application-level Service Level Agreements to be set, and these are monitored and acted on in real time. This means that if load increases, GigaSpaces can scale threads or the number of virtualised middleware instances to ensure that the SLA is met, which in our example would be the ability to process the number of requests. GigaSpaces also partner with RightScale. GigaSpaces lets you try their Cloud offering for free before following the traditional utility compute pricing model.
Scalr: Scalr is a series of Amazon Machine Images (AMIs) for basic website needs, i.e. an app server, a load balancer, and a database server. The AMIs are pre-built with a management suite that monitors the load and operating status of the various servers on the cloud. Scalr purports to increase / decrease capacity as demand fluctuates, as well as detecting and rebuilding improperly functioning instances. Scalr has open source and commercial versions and is a relatively new infrastructure service / application. We liked the ‘Synchronize to All’ feature of Scalr. This auto-bundles an AMI and then re-deploys it on a new instance. It does this without interrupting the core running of your site. This saves time going through the EC2 image/AMI creation process. To find out more about Scalr you should check out the Scalr Google Groups forum.
RightScale: RightScale has an automated Cloud Management platform. RightScale services include auto-scaling of servers according to usage load, and pre-built installation templates for common software stacks. RightScale support Amazon EC2, Eucalyptus, FlexiScale, and GoGrid. They are quoted as saying that RackSpace support will also happen at some point. RightScale has a great case study overview on their blog about Animoto and also explains how they have launched, configured and managed over 200,000 instances to date. RightScale are VC backed and in December 2008 did a $13 million series B funding round. RightScale have free and commercial offerings.
FreedomOSS: Freedom OSS has created custom templates, called jPaaS (JBoss Platform as a Service), for scaling resources such as JBoss Application Server, JBoss Messaging, JBoss Rules, jBPM, Hibernate and JBoss Seam. jPaaS monitors the instances for load and scales them as necessary. jPaaS takes care of updating the vhosts file and other relevant configuration files to ensure that all instances of Apache respond to this hostname. The newly deployed app that runs either on Tomcat or JBoss becomes part of the new app server image.
A new book co-authored by Jim Liddle, one of the contributors to this blog, has been released, entitled “TheSavvyGuideTo HPC, Grid, DataGrid, Virtualisation and Cloud Computing”. The aim of TheSavvyGuideTo book range is to get people up to speed as quickly as possible on the subject matter of the books.
Although the book covers a plethora of technologies it is a good read for anyone who wants a good concise overview of the space be they a manager, consultant or developer.
Without a doubt the credit crunch is affecting everyone big and small. Firstly, jobs are being shed, even from previously impervious organisations. Microsoft announced that it was cutting 5,000 jobs worldwide, Intel announced it was cutting between 5,000 and 6,000 of its 83,000 employees, and SAP announced it was to reduce its 51,500 headcount by 3,000 before the end of 2009. Add to this Citrix cutting 400-500 jobs, Lenovo cutting 2,500 employees… well, you get the picture. Times are tough and companies are doing whatever they need to so they can maintain profit margins.
Now if we look at the organisations that these companies supply, we find a different problem. Banks are reluctant to lend and capital markets have dried up. How do companies find the capital expenditure to fund the new initiatives or projects that are the lifeblood of all companies who sell hardware or software in the IT space? Well, one way is for the same organisations who supply the software / kit to also provide the credit. To that end HP, SAP and even Microsoft have announced initiatives in which they will be offering 0% finance on new purchases.
Some of the companies bucking the trend are Amazon and VMware. VMware increased Q4 year-on-year revenue by 25% to $515 million and its profits were up by 34% to $102 million. Amazon’s net profit rose 9 percent, to $225 million, in the 4th quarter, up from $207 million in the same quarter a year earlier. It is difficult to figure out how much of this was related to Amazon’s web services business as they do not break out these figures, but the web services products feature strongly in the Amazon investor relations release, leading one to speculate that they continue to perform strongly.
I see two ways in which the current credit crunch can help Cloud Computing move much closer to fully crossing the chasm. Firstly, because of the economics of the utility compute model (for more details see this article), Cloud Computing allows organisations to continue to drive innovation and projects without the upfront capital costs (CAPEX) associated with such projects in the past, enabling them to utilise an OPEX model. Even if this only relates to using the Cloud for the testing cycle of a project, this can be a huge win for cloud computing. I know of one telecoms company whose testing needs for their projects, in terms of software, hardware, data centre space etc., are vast.
Secondly, as companies utilise more Cloud Computing services, this will encourage more companies to build products, have cloud strategies, and compete with other established cloud vendors.
All in all, 2009 could truly be the year that Cloud crossed over.
Jean-Lou Dupont has created a Mindmap of all the Cloud Computing players. It’s public and you can clone it or export it as an image. It’s a useful reference of who is doing what in the space.