Monday, November 26, 2012

Lots of Netflix talks at AWS Re:Invent



[Update: here are videos of these talks, along with slides: http://techblog.netflix.com/2012/12/videos-of-netflix-talks-at-aws-reinvent.html]

There is a Netflix booth in the expo center, where we will be talking about our open source tools from http://netflix.github.com and collecting resumes from anyone interested in joining us.

Date/Time | Presenter | Topic
Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
Thu 3:00-3:50 | Ruslan M. / Gregg U. | Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army

Friday, November 16, 2012

Cloud Outage Reports

The detailed outage summaries published by cloud vendors are comprehensive, and the response to each one highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to build reliable services on top of AWS. I've included some Google and Azure outages here because they illustrate different failure modes that should be taken into account. Recent AWS and Azure outage reports have far more detail than Google outage reports.

I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a zone in the region. In some cases there are secondary impacts with a wider scope but shorter duration, such as regional control planes becoming unavailable for a short time during a zone outage.
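For anyone who wants to keep a similar list, the naming convention can be sketched in a few lines of Python. This is only an illustration: the function name outage_title and its parameters are my own assumptions for this sketch, not part of any tool, and the date prefix simply mirrors the headings used in this post.

```python
def outage_title(date: str, vendor: str, scope: str, cause: str) -> str:
    """Build a heading of the form '{date} - {vendor} {primary scope} {cause}'.

    Hypothetical helper for illustration only. The scope is free text such as
    'Global', 'US-East' (a region) or 'EU-West Zone' (a zone in a region).
    """
    # Drop empty parts so a missing field does not leave a double space.
    parts = [p.strip() for p in (vendor, scope, cause) if p.strip()]
    return f"{date} - " + " ".join(parts)


# Two headings from the list, rebuilt from their parts.
print(outage_title("August 17th, 2011", "AWS", "EU-West Zone", "Power Outage"))
print(outage_title("February 29th, 2012", "Microsoft Azure", "Global", "Leap-Year Outage"))
```

Keeping vendor, scope and cause as separate fields makes it easy to sort or filter the list later, for example to pull out every zone-level incident.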

This post was written while researching my AWS Re:Invent talk.
Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha
Video: http://www.youtube.com/watch?v=dekV3Oq7pH8


November 18th, 2014 - Azure Global Storage Outage

Microsoft Reports


January 10th, 2014 - Dropbox Global Outage

Dropbox Report


April 20th, 2013 - Google Global API Outage

Google Report


February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report


December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

http://aws.amazon.com/message/680587/

Netflix Techblog Report

http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html


October 26th, 2012 - Google AppEngine Network Router Overload

Google Outage Report


October 22nd, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

Netflix Techblog Report


June 29th, 2012 - AWS US-East Zone Power Outage During Storm

AWS Outage Report

Netflix Techblog Report


June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report


February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report


August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report


April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

Netflix Techblog Report

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html


February 24th, 2010 - Google App Engine Power Outage

Google Forum Report


July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report