Sunday, April 15, 2012

Put this on your Netflix queue: Release of The Simian Army

About a year ago, in April of 2011, a whole lot of internet services were failing because a whole lot of internet services run on Amazon’s EC2, and EC2 was failing. Pretty much any new internet service suffered, because by April 2011 most new internet services were using EC2 for at least some part of the business (e.g. Quora, Foursquare, Reddit, Hootsuite, among very many others including two sites I was working on).

Not a problem

One site that famously did NOT go down (at least not in April) was Netflix, because of one really bad employee.

The new Employee who Solved the Problems (spoiler alert: it’s a monkey)

Netflix survived because previously they’d hired a crazy, chaotic employee--a monkey--whose job description (from 5 Lessons We’ve Learned Using AWS) was:
…to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
The Netflix Simian Army explains why they created the job position:
…comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice.
Chaos Monkey did such a good job (at being bad) that Netflix has since hired a whole team of monkeys, who each morning chant their motto:
“The best way to avoid failure is to fail constantly.”
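To make the monkey's job description concrete, here is a toy simulation of the idea: kill a random instance, check that the service is still up, and let an automatic recovery step replace what was lost. This is only a sketch under loud assumptions. It is not Netflix's code and it calls no AWS API; `Fleet`, `heal`, `chaos_monkey`, and `run_business_day` are hypothetical names invented for illustration.

```python
import random


class Fleet:
    """Toy stand-in for a group of service instances (hypothetical, not Netflix's code)."""

    def __init__(self, desired_size):
        self.desired_size = desired_size
        self.next_id = 0
        self.instances = set()
        self.heal()

    def heal(self):
        """Automatic recovery: launch replacements until the fleet is back at full size."""
        while len(self.instances) < self.desired_size:
            self.instances.add(f"i-{self.next_id}")
            self.next_id += 1

    def is_serving(self):
        """Assumption: the service stays up as long as at least one instance is alive."""
        return len(self.instances) > 0


def chaos_monkey(fleet, rng):
    """Kill one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim


def run_business_day(fleet, rounds, seed=None):
    """Repeatedly kill an instance, check the service survived, then let recovery run."""
    rng = random.Random(seed)
    killed = []
    for _ in range(rounds):
        killed.append(chaos_monkey(fleet, rng))
        if not fleet.is_serving():
            raise RuntimeError("outage: no instances left")
        fleet.heal()
    return killed
```

Calling `run_business_day(Fleet(3), rounds=10)` kills ten instances and finishes with a full fleet, while `Fleet(1)` fails on the first kill. That is the lesson in miniature: a single point of failure is better discovered by your own monkey during business hours than by a real outage.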
Hey, hey, here come The Monkeys

According to Wired Enterprise, Netflix will be releasing The Netflix Simian Army this year in the form of source code. Whether you use that source code directly, or simply learn from it, Netflix's monkeys are some of the best examples of Problems Solving Problems.

Let's keep an eye out for the monkeys.

Update, July 30, 2012:

Today Netflix announced that the monkey is out. If you try the monkey, let us know how it goes.
Today’s Takeaway: The best way to avoid failure is to fail constantly.
