From the Blogosphere
Lessons Learned from the Amazon Web Services Outage
The only surprising thing about this AWS outage was that anyone was surprised by it
By: Sven Hammar
Oct. 26, 2012 11:00 AM
On Monday, Amazon Web Services — the leading provider of cloud services — suffered an outage, and as a result, a long list of well-known and popular websites went dark. According to Amazon’s Service Health Dashboard, the outage started out as degraded performance of a small number of Elastic Bloc Store (EBS) storage units in the US-EAST-1 Region, then evolved to include problems with the Relational Database Service and Elastic Beanstalk as well.
The only surprising thing about this AWS outage was that anyone was surprised by it. It wasn’t the first time AWS had a major outage or problems with this data center. If you remember, back in June a line of powerful thunderstorms knocked the power out at a major Amazon hosting center. The backup generator failed, then the software failed, and, well, you know the drill. A corollary of Murphy’s Law is that if multiple things can go wrong, they will all go wrong at once.
In both of these instances (and in all Amazon Web Services outages, in fact) some customers were knocked “off the air” while others continued running without a hiccup. You would think that eventually companies will learn to anticipate the inevitable AWS outages and take active steps to prepare for them. There are best practices and solutions on how to reduce vulnerability to an outage, but they’re rarely implemented. That’s because people don’t think that anything could happen to Amazon — obviously, things happen.
Instances like this are a learning opportunity if we take the time to think about why they happened and what could have been done to prevent them. Here are six lessons that I think we can learn from the Amazon Web Services outages.
Lesson 1 — Clouds are made of components that can fail. When people think of the cloud, they think that there is some amorphous and untouchable blog up in the sky. And while that’s a nice bit of marketing, it is not a useful model for operational planning. Be mindful of your cloud provider’s architecture and how it is built to manage failure of a component or a zone blackout. Then anticipate that failures can happen at any point in the cloud infrastructure.
Lesson 2 — The stress of failure will trigger a cascade of other failures. After reading a description of the outage, you get the sense that it was just one thing after another. What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and outage that disrupted much of the Internet for several hours. Remember Murphy and his law?
Lesson 3 – -Spikes matter. When a cloud fails, hundreds of customers are impacted. As they try to recover, they will be stressing the cloud provider’s infrastructure with a peak load that is guaranteed to cause even more problems. If you get these transition spikes, they get worse and worse. Every time you reboot, it takes longer and longer. If you have ten servers doing that, that’s bad. If you spike a thousand servers, that’s really bad. Something that would have taken five minutes to fix will now take five hours when you get into that transition type of syndrome.
Lesson 4 — Cloud providers provide the tools to manage failure, but it is up to you to put your own failover plans in place. AWS, for example, is broken into zones. If a component in the Virginia zone goes down and the whole matrix is dead, then (in theory) you should be able to move all your data to another zone. That other zone might be hosted, unaffected, in Ireland and then you are up and running again. This is one of the big differences between the cloud and more traditional approaches to IT. It is up to the application (and by extension, the application’s designer) to manage its interaction with the cloud environment, up to and including failover. Most cloud providers offer tools and frameworks to support failover, but you are responsible for implementing that best practice into your system operation and into the applications.
Lesson 5 — You need to put your failover plans through a full-blown load test. It’s not enough to have a strategy in place for failover. You have to test it under real-world conditions. Even the best laid failover plans, once implemented and designed, might have hiccups when a real outage occurs. A full-blown cloud load test can help you see how long the failover process will take to kick in and what other dependencies might need to be sorted out. Obviously this isn’t easy. If it was, Reddit, Foursquare, Airbnb and others wouldn’t have been impacted by the AWS outage.
Lesson 6 — Conduct fire drills. While a load test will confirm that your failover plan works as you expect, it will also give your team some real experience in executing the plan. Remember the fire drills you used to do in school? Fire drills help train students, teachers, and others to know exactly what they’re supposed to do and where they’re supposed to go in the event of an emergency. All the bugs in the process are worked out during the fire drill, and the more everybody does the drills, the more comfortable there are with what they need to do. And if a real emergency happens, everybody knows how to leave the building calmly. You want to do the same thing with your failover plan, and load testing can help you get there. Fire drills save lives and load tests save cloud apps.
Is your failure worth more than $28?
Amazon offers reimbursement to its customers based on the amount of downtime the customer experiences. The last time our Amazon Web Services went down, we got a $28 reimbursement. So my final lesson learned (I guess this makes for seven lessons) is this: The cost of downtime for your organization — in lost revenue, poor customer experience, etc. — is far, far greater than just what you are paying your cloud provider. $28 is not going to save your day. You have to make sure that you have a failover solution that’s ready and working. Don’t wait for Amazon to solve this problem for you, because it’s only a $28 problem for it.
The biggest lesson learned from these AWS outages is that you need to configure properly and you need to train your people. These types of events will always happen, and when they do, you need to be trained ahead of time. Load testing itself is a good way to validate and train. That way when a real emergency occurs, your team can react in a calm, collected manner to a situation they’ve experienced dozens of times before.
Best Recent Articles on Cloud Computing & Big Data Topics
As we enter a new year, it is time to look back over the past year and resolve to improve upon it. In 2014, we will see more service providers resolve to add more personalization in enterprise technology. Below are seven predictions about what will drive this trend toward personalization.
IT organizations face a growing demand for faster innovation and new applications to support emerging opportunities in social, mobile, growth markets, Big Data analytics, mergers and acquisitions, strategic partnerships, and more. This is great news because it shows that IT continues to be a key stakeholder in delivering business service innovation. However, it also means that IT must deliver new innovation despite flat budgets, while maintaining existing services that grow more complex every day.
Cloud computing is transforming the way businesses think about and leverage technology. As a result, the general understanding of cloud computing has come a long way in a short time. However, there are still many misconceptions about what cloud computing is and what it can do for businesses that adopt this game-changing computing model. In this exclusive Q&A with Cloud Expo Conference Chair Jeremy Geelan, Rex Wang, Vice President of Product Marketing at Oracle, discusses and dispels some of the common myths about cloud computing that still exist today.
Despite the economy, cloud computing is doing well. Gartner estimates the cloud market will double by 2016 to $206 billion. The time for dabbling in the cloud is over! The 14th International Cloud Expo, co-located with 5th International Big Data Expo and 3rd International SDN Expo, to be held June 10-12, 2014, at the Javits Center in New York City, N.Y. announces that its Call for Papers is now open. Topics include all aspects of providing or using massively scalable IT-related capabilities as a service using Internet technologies (see suggested topics below). Cloud computing helps IT cut infrastructure costs while adding new features and services to grow core businesses. Clouds can help grow margins as costs are cut back but service offerings are expanded. Help plant your flag in the fast-expanding business opportunity that is The Cloud, Big Data and Software-Defined Networking: submit your speaking proposal today!
What do you get when you combine Big Data technologies….like Pig and Hive? A flying pig? No, you get a “Logical Data Warehouse.” In 2012, Infochimps (now CSC) leveraged its early use of stream processing, NoSQLs, and Hadoop to create a design pattern which combined real-time, ad-hoc, and batch analytics. This concept of combining the best-in-breed Big Data technologies will continue to advance across the industry until the entire legacy (and proprietary) data infrastructure stack will be replaced with a new (and open) one.
While unprecedented technological advances have been made in healthcare in areas such as genomics, digital imaging and Health Information Systems, access to this information has been not been easy for both the healthcare provider and the patient themselves. Regulatory compliance and controls, information lock-in in proprietary Electronic Health Record systems and security concerns have made it difficult to share data across health care providers.
Cloud Expo, Inc. has announced today that Vanessa Alvarez has been named conference chair of Cloud Expo® 2014. 14th International Cloud Expo will take place on June 10-12, 2014, at the Javits Center in New York City, New York, and 15th International Cloud Expo® will take place on November 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
12th International Cloud Expo, held on June 10–13, 2013 at the Javits Center in New York City, featured four content-packed days with a rich array of sessions about the business and technical value of cloud computing led by exceptional speakers from every sector of the cloud computing ecosystem. The Cloud Expo series is the fastest-growing Enterprise IT event in the past 10 years, devoted to every aspect of delivering massively scalable enterprise IT as a service.
Ulitzer.com announced "the World's 30 most influential Cloud bloggers," who collectively generated more than 24 million Ulitzer page views. Ulitzer's annual "most influential Cloud bloggers" list was announced at Cloud Expo, which drew more delegates than all other Cloud-related events put together worldwide. "The world's 50 most influential Cloud bloggers 2010" list will be announced at the Cloud Expo 2010 East, which will take place April 19-21, 2010, at the Jacob Javitz Convention Center, in New York City, with more than 5,000 expected to attend.
It's a simple fact that the better sales reps understand their prospects' intentions, preferences and pain points during calls, the more business they'll close. Each day, as your prospects interact with websites and social media platforms, their behavioral data profile is expanding. It's now possible to gain unprecedented insight into prospects' content preferences, product needs and budget. We hear a lot about how valuable Big Data is to sales and marketing teams. But data itself is only valuable when it's part of a bigger story, made visible in the right context.
Cloud Expo, Inc. has announced today that Larry Carvalho has been named Tech Chair of Cloud Expo® 2014. 14th International Cloud Expo will take place on June 10-12, 2014, at the Javits Center in New York City, New York, and 15th International Cloud Expo® will take place on November 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Everyone talks about a cloud-first or mobile-first strategy. It's the trend du jour, and for good reason as these innovative technologies have revolutionized an industry and made savvy companies a lot of money. But consider for a minute what's emerging with the Age of Context and the Internet of Things. Devices, interfaces, everyday objects are becoming endowed with computing smarts. This is creating an unprecedented focus on the Application Programming Interface (API) as developers seek to connect these devices and interfaces to create new supporting services and hybrids. I call this trend the move toward an API-first business model and strategy.
We live in a world that requires us to compete on our differential use of time and information, yet only a fraction of information workers today have access to the analytical capabilities they need to make better decisions. Now, with the advent of a new generation of embedded business intelligence (BI) platforms, cloud developers are disrupting the world of analytics. They are using these new BI platforms to inject more intelligence into the applications business people use every day. As a result, data-driven decision-making is finally on track to become the rule, not the exception.
Digital Transformation Blogs