Sunday, April 07, 2013

Tutorials and Training on Cloud Architecture and NetflixOSS

There are two places that I'm giving in-depth training on cloud architecture and NetflixOSS this year.

On May 22nd-23rd at Gluecon 2013 in Broomfield, Colorado, I'm giving one of the opening keynote talks, which introduces the concepts of a Cloud Native Architecture. I'm then spending all afternoon on a tutorial that goes into the details of how to get there and how to use NetflixOSS tools as an on-ramp to accelerate the process.

For people in Europe, I'm teaching at a summer school on "Software for the Cloud and Big Data" September 8th-14th in Italy. It's organized by ETH Zurich. I'll be giving six one-hour talks over a week, and there are seven other speakers.


When I get back from LASER, it will be time to start evaluating nominations for the NetflixOSS Cloud Prize, and these are both opportunities to figure out how to build a strong entry.

Tuesday, March 26, 2013

Comment on How Netflix Is Ruining Cloud Computing


I wrote a long comment in response to "How Netflix Is Ruining Cloud Computing" on Information Week, but they don't seem to be in a hurry to post it. Luckily I saved a copy, so here it is:



There should be a post on http://techblog.netflix.com in the next day or so that will give more context to the Cloud Prize and clarify most of the points above. However, I will address some of the specific issues here.

Cloud 1.0 vs. 2.0?
I would argue that the way most people are doing cloud today is to forklift part of their existing architecture into a cloud and run a hybrid setup. That's what I would call Cloud 1.0. What Netflix has done is show how to build much more agile green field native cloud applications, which might justify being called Cloud 2.0. The specific IaaS provider used underneath, and whether you do this with public or private clouds is irrelevant to the architectural constructs we've explained.

Outages
The outages that have been mentioned were regional; they didn't apply to Netflix operations in Europe, for example. Our current work is to build tooling for multi-regional support on AWS (East coast/West coast), including the DNS management that was mentioned. This removes the failure mode with the least effort and disruption to our existing operations.

Portability
Other cloud vendors have a feature set and scale comparable to AWS in 2008-2009. We're still waiting for them to catch up. There are many promises but nothing usable for Netflix itself. However there is demand to use NetflixOSS for other smaller and simpler applications, in both public and private clouds, and Eucalyptus have demonstrated Asgard, Edda and Chaos Monkey running, and will ship soon in Eucalyptus 3.3. There are signs of interest from people to add the missing features to OpenStack, CloudStack and Google Compute so that NetflixOSS can also run on them.

Edda
You've completely missed the point of Edda. It does three important things. 1) If you run at large scale, your automation will overload the cloud API endpoint; Edda buffers this information and provides a query capability for efficient lookups. 2) Edda stores a history of your config; it's a CMDB that can be queried for what changed. 3) Edda cross-integrates multiple data sources (the cloud API, our own service registry Eureka, and AppDynamics call flow information) and can be extended to include other data sources.
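To make the first two points concrete, here is a minimal Python sketch of the idea, not Edda's actual implementation or API. The class `ResourceCache`, its methods, and the sample instance data are all hypothetical names invented for illustration. It polls an expensive "cloud API" once per tick, serves queries from the cached snapshot, and keeps every snapshot so you can ask what changed.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Snapshot = Dict[str, dict]  # resource id -> attributes


@dataclass
class ResourceCache:
    """Hypothetical Edda-like cache: poll once, answer many queries, keep history."""
    fetch: Callable[[], Snapshot]  # stand-in for an expensive cloud API call
    history: List[Tuple[int, Snapshot]] = field(default_factory=list)

    def poll(self, tick: int) -> None:
        """Call the cloud API once and record the result as a new revision."""
        self.history.append((tick, copy.deepcopy(self.fetch())))

    def current(self) -> Snapshot:
        """Serve lookups from the latest snapshot without touching the cloud API."""
        return self.history[-1][1] if self.history else {}

    def find(self, **attrs) -> List[str]:
        """Return ids of resources whose attributes match every given key/value."""
        return sorted(rid for rid, res in self.current().items()
                      if all(res.get(k) == v for k, v in attrs.items()))

    def changes(self, old_tick: int,
                new_tick: int) -> Dict[str, Tuple[Optional[dict], Optional[dict]]]:
        """Diff two recorded revisions: id -> (old attributes, new attributes)."""
        revisions = dict(self.history)
        old, new = revisions[old_tick], revisions[new_tick]
        return {rid: (old.get(rid), new.get(rid))
                for rid in sorted(set(old) | set(new))
                if old.get(rid) != new.get(rid)}


# Made-up sample data standing in for what a cloud API might return.
state = {"i-1": {"type": "m1.large", "asg": "api"}}
cache = ResourceCache(fetch=lambda: state)
cache.poll(tick=1)
state["i-1"]["type"] = "m2.xlarge"
state["i-2"] = {"type": "m1.large", "asg": "api"}
cache.poll(tick=2)
diff = cache.changes(1, 2)
```

Automation would call `find` and `changes` as often as it likes, while only `poll` ever reaches the rate-limited endpoint.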

AMInator
If you want to spin up 500 identical instances, having them each run Chef or Puppet after they boot creates a failure mode dependency on the Chef/Puppet service and wastes startup time, and if anything goes wrong with the install you end up with an inconsistent set of instances. By using AMInator to run Chef once at build time, there is less to go wrong at run time. It also makes red/black pushes and roll-backs trivial and reliable.

Cloud Prize
The prize includes a portability category. It's a broad category and might be won by someone who adds new language support to NetflixOSS (Erlang, Go, Ruby?) or someone who makes parts of NetflixOSS run on a broader range of IaaS options. The reality is that AWS is dominating cloud deployments today, so contributions that run on AWS will have the greatest utility for the largest number of people. The alternatives to AWS are being hyped by everyone else, and are showing some promise, but have some way to go.

We hope that NetflixOSS provides a useful driver for higher baseline functionality that more IaaS APIs can converge on, moving from 2008-era EC2 functionality to 2010-era EC2 functionality across more vendors. Meanwhile Netflix itself will be enjoying the benefits of 2013 AWS functionality like Redshift.

Wednesday, January 16, 2013

The IT swamp draining manual for anyone who is neck deep in alligators

I've spent the last year or so reviewing Gene Kim's new book, The Phoenix Project, and encouraging him to get it finished. It came out this week, is the top business book on Amazon as I write this, and I got a nice back-cover quote, shown below with Gene's actual finger in the photo.


Many years ago someone gave me a copy of The Goal, which is the inspiration for The Phoenix Project. In both cases the book is a novel about a company that is dysfunctional and on the verge of going out of business. The lead character is dropped into the job of figuring out how to dig their way out of the problem, and in the case of The Phoenix Project, the company fumbles its way from legacy enterprise where IT isn't regarded as central to their success, into the modern world of agile development practices and DevOps deployments where IT becomes a competitive advantage.

Don't just read it, give copies of it to your friends in management. It should be on every CxO's bookshelf.

Tuesday, January 01, 2013

Looking back at 2012, with pointers to 2013

A collection of things that seem to have pivoted in 2012.

Mobile Bandwidth Greater than Fixed Bandwidth

I've been talking about LTE and the growth in mobile since 2008, but I started 2012 with a Verizon iPhone 4, which maxed out at 2Mbit/s over 3G; at home in the mountains I would get less than 1Mbit/s. I ended 2012 with a Verizon iPhone 5, which is about ten times faster at home: I regularly see 8-9Mbit/s, and the best speed I have seen anywhere so far was in downtown Los Gatos at over 50Mbit/s. My home fixed wire Internet is a 3Mbit/s DSL that has neighborhood congestion at peak times. I now find it works better to have WiFi turned off on my iPhone at home.

This is one of those pivotal changes, similar to the change from having predominantly fixed wire telephone service at home, to having many people use mobile phones exclusively. It costs more, but if you already have a high bandwidth connection to your phone with a high data cap because you use it a lot, why pay to also have a low bandwidth connection to your house? Bandwidth caps and data usage plans will slow the switchover, but the writing is on the wall.

Cutting The Cable/Satellite TV Feed

In 2012 we finally turned off our TiVo and shut down our DirecTV account. We weren't using it enough to make it worthwhile. For some of the sports events (Laurel follows the Stanford Cardinals), we go to a sports bar to watch, which is more fun anyway. Everything else that we have time to watch, we can watch online, and we get all our news updates from Twitter, RSS feeds and Facebook posts. By the time it's on TV or in a newspaper, it's already old news.

The TV has an AppleTV connected to it, which gets almost all the usage. We watch a few things on laptops, and sometimes I connect a laptop to the TV. I also stream music from my iPhone to the AppleTV because I can't get Pandora or Spotify on it. Come on Apple, where's the AppleTV App Store? Maybe that's a 2013 thing.

The Netflix Open Source Cloud Platform Got Traction

We started the year with a handful of disconnected projects, and ended it with a large chunk of the platform on Github, and some high profile users. Most people are still picking it up piecemeal, but in 2013 we plan to get the whole thing put together as an installable bundle. This is the Alan Kay approach: "The best way to predict the future is to invent it". Netflix has been out in front of the industry in terms of cloud adoption, inventing the future. Next we make it easier for others to join us in that future, and we have some ideas for how to drive adoption to new heights.

Netflix Cloud Architecture Presentations

I was going to list all the talks I gave, but there are too many, so go see the slides I posted at http://www.slideshare.net/adrianco. Highlights were QConSF, QConLondon, Gluecon in Colorado, GOTO in Aarhus and of course AWS Re:Invent in Las Vegas. The impact of these talks grew through the year, reaching a peak at Re:Invent, where we had lots of speakers and attention to the way the Netflix cloud and open source story was bringing value to the company and reaching out into the technical community. A big thanks to everyone who came to my talks, and all the other Netflix speakers who have been out there broadening the story. It's almost impossible to write an article or do a presentation about cloud without mentioning Netflix. In 2013 there will be even more talks. I'll focus on local and US based events that are strongly developer oriented, like QCon, Gluecon, and GOTO. We will definitely be back at AWS Re:Invent next November.

The Concept of Anti-Fragility Took Off

Nassim Taleb's latest book crystallizes the way I tend to approach things and gives it a name. The Netflix cloud architecture is anti-fragile: we run "Chaos Monkeys" continuously to try and break it, and that makes it stronger. The Netflix culture is anti-fragile: it's decentralized, with as little process and as few rules as possible and a lot of local autonomy. Netflix management is not afraid of change or of being first to do something and tends to navigate disruptive transitions well. From the outside this can look chaotic or confusing, but it works, and recovers well from missteps, which are always going to happen. If you're not failing occasionally you aren't trying hard enough, and you are missing opportunities. Getting stronger through failure is the basis of anti-fragility. Avoiding failure at all costs (as many people try to do) makes you brittle and vulnerable to unexpected Black Swan events that will have a much bigger impact.
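For anyone who hasn't seen the idea in code, here is a toy Python sketch of deliberately killing instances at random. It is not the NetflixOSS Chaos Monkey implementation: the function `unleash_monkey`, the group names, and the `terminate` callback are hypothetical stand-ins, and a real tool would call the cloud provider's API and honor schedules and opt-outs.

```python
import random
from typing import Callable, Dict, List


def unleash_monkey(groups: Dict[str, List[str]],
                   terminate: Callable[[str], None],
                   rng: random.Random,
                   probability: float = 1.0) -> List[str]:
    """For each non-empty group, with the given probability, kill one random instance."""
    victims = []
    for _name, instances in sorted(groups.items()):
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            terminate(victim)  # hypothetical hook; a real one would call the cloud API
            victims.append(victim)
    return victims


# Made-up groups of instance ids; "batch" is empty and is skipped.
groups = {"api": ["i-a1", "i-a2", "i-a3"], "batch": [], "web": ["i-w1", "i-w2"]}
terminated: List[str] = []
victims = unleash_monkey(groups, terminated.append, random.Random(42))
```

The point of running something like this continuously is that every service has to be built to survive losing an instance, so the loss is routine rather than an emergency.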

Cloud, Open Source, SaaS and the End of Enterprise Computing

Taleb makes the point that big companies become increasingly fragile as they lose agility and the ability to move with the markets, and we are seeing that play out in the Enterprise Computing space. There is still money to be made from the late adopter customers, but the trend is clearly towards development using exclusively open source tools, with applications and infrastructure delivered as a service. There is zero revenue for traditional Enterprise Computing vendors in this model. The current interest in building out private cloud infrastructure is real and will continue to support traditional vendors into 2013, but it's a short term investment. At best you end up with a much better automated datacenter, but it isn't elastic and it has far fewer features than AWS, so it's going to be marginalized over time. At worst, you discover just how hard it is to run a reliable private cloud based on immature software, with incompatible upgrade paths, and it turns out to be much more expensive to run.

The Enterprise Computing vendors haven't been able to build a public cloud  that competes with AWS on scale, price or features, and AWS is now focused on building everything their customers need to take the next generation of application investment out of the datacenter, so the high margin revenue is going to gradually go away for the traditional vendors.

The most interesting development in 2012 was the re-launch of Google as a public cloud infrastructure vendor, and the mini-price-war between AWS and Google over instance and storage costs makes it clear where the real action is. During 2013 we will see if Google manages to invest heavily and execute well enough to build up a big user base.  In mobile, as I predicted years ago, we are now in an iPhone vs. Android battle that is wiping out everyone else. I personally think in 2014 we will likely see a similar effect as the scale, features and price point of AWS and Google clouds make everyone else irrelevant. The only question in my mind is whether AWS runs away with this on their own, or Google manages to get some traction as the alternative.

Note to sales reps (who won't listen), I'm not interested in anything to do with datacenters, private cloud, or other public clouds in 2013. I'm only interested in SaaS apps, things that run on AWS, and interesting open source projects.

Solar Powered Electric Cars Are For Real Now

We drive our Nissan Leaf all the time; it's fun to drive and the first car we pick for most trips, adding up to almost 1000 miles a month. The marginal cost of running the car is near zero. New tires and a cabin air filter at 15K miles is all the maintenance it needs. We have an excess of solar power generation that added up to $500 in unused electricity over the year. At 10c/kWh and 3.5 miles/kWh that's plenty for us to run a second electric car before we start paying for the power, and there are a lot more choices coming in 2013. There are many charging stations around the Bay Area, lots of other people running Leafs, and the Tesla Model S got car of the year awards. It takes a test drive to realize what fun it is to have instant torque and no gear shifts. This is a case of the future being unevenly distributed. If you don't live in California, it's a bit further out, but it's coming.
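The arithmetic behind the second-car claim can be checked with a few lines of Python. This is a rough sketch that assumes the efficiency figure is 3.5 miles per kWh and treats "almost 1000 miles a month" as 1000; the variable names are just for illustration.

```python
surplus_cents = 500 * 100      # $500 of unused solar electricity over the year
price_cents_per_kwh = 10       # 10c per kWh
miles_per_kwh = 3.5            # assumed efficiency in miles per kWh
leaf_miles_per_month = 1000    # "almost 1000 miles a month", rounded up

surplus_kwh = surplus_cents / price_cents_per_kwh   # energy left over each year
surplus_miles = surplus_kwh * miles_per_kwh         # driving that energy would cover
leaf_miles_per_year = leaf_miles_per_month * 12     # what the Leaf already drives

print(f"Surplus energy: {surplus_kwh:.0f} kWh")
print(f"Surplus range:  {surplus_miles:.0f} miles")
print(f"Leaf usage:     {leaf_miles_per_year} miles/year")
```

Under those assumptions the surplus is about 5000 kWh, or roughly 17,500 miles of driving, which is more than the 12,000 miles a year the Leaf already covers.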

A friend recently got a quote for Solar Power installation which was about half what we paid two years ago, and we got a good deal then. Prices have dropped fast and are much lower than most people think. If you don't already have solar panels on your roof, you should get them. If you don't use enough electricity to justify solar panels, get an electric car as well, and save at the gas pump.

Global Warming Arrived in the USA in 2012

The well funded Merchants of Doubt (read the book) managed to confuse and suppress public discussion of global warming in the USA for the last few years, but the effects just became too obvious this year and it broke through, creating the scenarios that James Hansen warned of in Storms of My Grandchildren. The arctic ice cap melt continues to accelerate, seas are warming and rising, drought and record heat hit most of the USA, and everything wrapped up with Hurricane Sandy, pushing the topic onto the front page. The dice are all loaded now, and 2013 is already rolling those dice as the drought continues and the Mississippi river is empty. I've been saying for the last few years that if you own property at sea level, you should find someone who doesn't believe in global warming to sell it to, because it's going to become increasingly uninsurable and end up as worthless as the houses along the New Jersey shoreline that were swept away.

The Republican party is still in denial, a combination of funding from big oil companies and an inability to accept or admit that their demonized Al Gore could have been right all along. In 2013 it will be interesting to see how they deal with losing the election, and perhaps there will be a split into a group of Republicans that see the path to re-election in 2014 as needing to accept reality by voting for some Global Warming related legislation, versus the hard core that are trying to pray their way out. The current battle is over stopping the Keystone XL pipeline that would move the dirtiest kind of tar oil from Alberta Canada to Texas. It may be symbolic, but if KXL is stopped, the tide will have turned. Carbon needs to be left in the ground. For 2013, I'm going to try and re-balance my 401K retirement accounts to divest from oil companies. Many students are now pressuring their colleges to divest from oil as well.

Twitter and Snapchat

Personally, 2012 was an excellent year for me. I've made lots of new friends and learned a lot by being active on Twitter, ending the year with about 6500 followers. I joked on Twitter that I posted my New Year's resolutions for 2013 to Snapchat, but you missed them. If you don't know what Snapchat is for, ask a teenager. You'll probably hear a lot more about it in 2013; then, when their parents figure it out and join too, the teens will be onto the next thing....


Monday, November 26, 2012

Lots of Netflix talks at AWS Re:Invent



[Update: here are videos of these talks, along with slides: http://techblog.netflix.com/2012/12/videos-of-netflix-talks-at-aws-reinvent.html]

There is a Netflix booth in the expo center, where we will be talking about our open source tools from http://netflix.github.com and collecting resumes from anyone interested in joining us.

Date/Time | Presenter | Topic
Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
Thu 3:00-3:50 | Ruslan M./Gregg U. | Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army

Friday, November 16, 2012

Cloud Outage Reports

The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. I've included some Google and Azure outages here because they illustrate different failure modes that should be taken into account. In general AWS and Azure outage reports have far more detail than Google outage reports.

I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a zone in the region. In some cases there are secondary impacts with a wider scope but shorter duration such as regional control planes becoming unavailable for a short time during a zone outage.
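As a small illustration of that naming convention, here is a Python sketch that builds an entry heading from its parts. The `OutageEntry` class and its field names are hypothetical, made up for this example; the sample values come from one of the entries in this list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageEntry:
    """One outage in the list, split into the parts of the naming convention."""
    date: str    # when the outage happened
    vendor: str  # e.g. AWS, Google, Azure
    scope: str   # global, a specific region, or a zone in a region
    cause: str   # short description of what went wrong

    def title(self) -> str:
        """Format the heading as: date - {vendor} {primary scope} {cause}."""
        return f"{self.date} - {self.vendor} {self.scope} {self.cause}"


entry = OutageEntry("August 17th, 2011", "AWS", "EU-West Zone", "Power Outage")
print(entry.title())
```

Keeping the scope as its own field makes it easy to group entries by how widely they hit, which is the distinction that matters most when deciding how to survive them.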

This post was written while researching my AWS Re:Invent talk.
Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha
Video: http://www.youtube.com/watch?v=dekV3Oq7pH8

April 20th, 2013 - Google Global API Outage

Google Report


February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report


December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

http://aws.amazon.com/message/680587/

Netflix Techblog Report

http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html

October 26th, 2012 - Google AppEngine Network Router Overload

Google Outage Report



October 22nd, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

Netflix Techblog Report



June 29th, 2012 - AWS US-East Zone Power Outage During Storm

AWS Outage Report

Netflix Techblog Report



June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report



February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report



August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report



April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

Netflix Techblog Report

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html


July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report

http://status.aws.amazon.com/s3-20080720.html

Friday, October 26, 2012

What's a Distinguished Engineer?

A recent post by John Allspaw on what it means to be a senior engineer reminded me of something I put together years ago while I was a Distinguished Engineer at Sun. One question from senior engineers looking at their career path was what did it take to become a Distinguished Engineer?

Although Sun is no more, there are engineers across the industry who are "distinguished", and the title is used in a few places. At Sun, there were between 50 and 100 people in the role, who were mostly director level individual contributors, although there were also Sun Fellows who were VP level, and some were also line managers.

I boiled it down into a few questions.

First I made a list of the names of all the Sun Distinguished Engineers and Fellows, and the first question was "how many of these names do you recognize, and do you know what they did?". The intent is to get a baseline level of understanding of what might be expected. The list included people who invented software languages and frameworks that lots of people use, microprocessor architects, and fundamental researchers in security and networking. There were also CTOs of companies that Sun had acquired, and a few like me who mostly got in through writing books that everyone else had read.

The next question was "how many of these people know who you are?". If you think you did do something special, we would expect that the existing Distinguished Engineers would have heard of it. Since at Sun the way to become a DE involved having the existing DEs and Fellows vote for you, this was critical.

The final question was "how many DEs and Fellows are hanging around your cube on a regular basis waiting to talk to you?". This shows that you are the go-to person for something that matters.

Translating this into a broader context, more current questions for being distinguished might be "Do the top conferences invite you to speak?", "How many of the other invited speakers and conference organizers do you know?" and "how many know you?". The other dimension of what you did to deserve it is nowadays a mixture of open source projects that lots of people use, or key ideas shared through books or blogs.

Here's the original slide from 2002. How many of these names do you know, what did they do then, and where are they now?