Sunday, May 27, 2012

Make Your Own Dumb Luck

Three times in the past month, on three separate projects, I delivered code that I thought was well tested and therefore correct. In two cases I was writing libraries to sort data according to certain rules, and in the other case it was a small program to pull a best-match set of data from a database as fast as possible. In all cases the code, which I thought was excellent and well-tested, reached the real world and soon failed. And in all cases the failures were not easy to evaluate.

It took me all month to realize that in each case I’d made the same stupid testing mistake: not enough dumb luck.

I know what you’re thinking, punk. You’re thinking “I’ve tested every possible permutation.” You’ve gotta ask yourself a question: “Do I feel lucky?” Well, do ya, punk?

The Problem

Being quality minded, before considering my work “finished” I had of course written unit tests for a bunch of different cases that I thought would best stress the code. I wrote a few standard cases, edge cases, cases around data limits, cases around bad input, and cases that I thought really put my algorithms to the limit. With all of my test cases passing, I confidently handed off the code for real-world use.

Somehow the real world managed to create data states and logic paths that my tests had not exercised, even though I thought I’d been very clever about getting all possible cases into my test suite.

The real world is like that sometimes. Most of the time.

The Problem Behind The Problem

It so happens that in writing these three bits of code I was being about as smart as I know how to be; i.e., writing the code required 100% of my IQ.

But figuring out code tests that mimic all the possible real-world situations and permutations and complications requires more smarts than it takes to write the code. It requires, I dunno, maybe 27.3% more smarts to test the code than to write it. No matter how hard I try I’m just not 27.3% smarter than myself.

Solving the Problem (by adding a lot more Problems)

Working alone, I have no one to lean on who is smarter than me. But I do have a pretty good random number generator, and I know how to use it. So here’s what I did in each of the three cases.

First, for each of my projects I wrote test suites that ran the tests against tons of generated data and queries. Whereas the original tests may have had something like a dozen well-designed cases, the additional random test might have 2 million cases. This was the easy part.

Second (the hard part): because the data was now random, I wouldn’t know ahead of time what the correct results would be, so correctness had to be determined at runtime. This required writing more code than was in the original libraries, simply to evaluate the results. Fortunately this new result-validation code didn’t need to be fast or clever, just accurate, and so it could be written so simply that even someone like me would have a hard time getting it wrong. (As a bonus, I stored the random input in a temporary file on each run, so if a run did fail I could see just what input had caused the failure.)
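Here is a minimal sketch of those two steps in Python, using sorting as the example. It is not the code from my projects: fast_sort, is_correct_sort, and run_random_tests are hypothetical names, fast_sort is only a stand-in for the library under test, and the failing input is returned to the caller rather than written to a temporary file.

```python
import random


def fast_sort(items):
    """Stand-in for the clever library code under test (hypothetical)."""
    return sorted(items)


def is_correct_sort(original, result):
    """Slow, simple-minded validator: accuracy matters here, speed does not."""
    # Every adjacent pair must be in order.
    for left, right in zip(result, result[1:]):
        if left > right:
            return False
    # The result must hold exactly the same items as the input.
    remaining = list(original)
    for item in result:
        if item not in remaining:
            return False
        remaining.remove(item)
    return not remaining


def run_random_tests(sort_fn, cases, seed):
    """Return the first failing input, or None if every random case passes."""
    rng = random.Random(seed)
    for _ in range(cases):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not is_correct_sort(data, sort_fn(list(data))):
            return data  # keep the input so the failure can be replayed
    return None


if __name__ == "__main__":
    print(run_random_tests(fast_sort, cases=10_000, seed=2012))
```

The validator is deliberately quadratic and boring. Passing a fixed seed is one way to make a "random" run repeatable, on top of keeping the failing input.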

With millions of random cases running, every error that had been reported (and more) very quickly popped up and I could very quickly squash them. I released my code again and this time no error reports came back.

I know a lot of people dislike random testing because it is, almost by definition, not reproducible. Those people must be smart enough to figure out enough non-random tests to exercise every part of their code--I’m not.

So in the end I didn’t have to be smarter than myself to test my own code. I just needed to make lots of my own dumb luck!

Today’s Takeaway: When you’re done writing all your best test cases, add a few million random cases for good measure. If you’re anything like me, you’re not smart enough to think of a test for everything.

Thursday, May 3, 2012

We reserve the right to refuse this web service to anyone.

Do you use a web API as part of your product? Perhaps you let users login and find friends using a Social API, show users information using a Geolocation API tied to a Mapping API, save user images/video/audio using a Storage API translated with a Translation API, etc…

If you clicked on any of the Web API links in the previous paragraph, then I apologize because none of those APIs work any more. They’re all dead.

If your product did rely on any of those APIs, how would your product fare?

Assertion: Every Web API will someday fail you


Relying on a third party API provides a lot of power very quickly. But keep in mind that the Terms-Of-Service for every 3rd-party Web API goes something like this.

EVERY Web API Terms of Service (TOS)

We reserve the right to refuse this web service to anyone…
…at any time…
…for any length of time…
…for any reason…
…or for no reason at all…
…and to change these or any other terms of service…
…or to change our APIs…
…or to change our pricing…
…or to add limits to your usage…
…or to give away whatever information you send us…
…or to just screw up and stop working…
…or to go belly up…
…or to get bored and stop paying attention...
…at any time.

I can promise you that any Web API you use, or any Web API you create, will, at some point in the future, be different than it is today (i.e. however you use it now, it will someday “break”). I just took a brief tour of the ProgrammableWeb API directory, and out of 10 mashups I tried at random, only 3 were currently working. Out of 5814 APIs, 602 are listed in the deadpool (about 10%), but of 10 that I picked at random from those not in the deadpool, only 5 were still alive. Some of these broken APIs are from tiny little companies I’ve never heard of, and others are from the biggest internet companies of all.

In just the past year I have been personally involved in products that failed in full or in part because a web API they relied on was shut down or changed. In the past year I have personally shut down a service that provided a Web API, causing the apps that relied on that service to become useless.

Web APIs are unreliable.

If Web APIs are so unreliable, should I use them?

If a Web API enables you to build a better product, and to build it faster, then, yes, use the Web API. But use that API in full knowledge that it won’t always work as it does today. Someday it almost certainly will not work at all.

What’s a reliable strategy for using unreliable Web APIs?

The most important step for dependency on 3rd-party APIs is to admit that you are powerless over them, and to accept that they are unreliable. Once you accept that fact, the other steps follow.
  • For every Web API you use, determine whether it is core or supplementary to your product (i.e. is your product totally useless without that service). In the following issues you’ll probably have different answers for core vs. supplementary services.
  • For every Web API you use, create three levels of fallback plans for what should happen when those services fail:
    1. If the service is unavailable temporarily, for just a few seconds, decide where you stand on these questions: Will your fallback simply retry a few times? Will the user still be able to use parts of your product? Will the user have any idea what’s going on?
    2. If the service is unavailable for many seconds/minutes/hours, define what should happen: How will the user be informed? Will your product be totally useless? Note that it is OK to decide that your product will be useless for a while, so long as you have made that decision thoughtfully. If this happens, you must not keep your users in the dark; they must be provided with information, even if it’s just something like “sorry, we’re suffering right now, see http://oursite.com/suffering for more information”.
    3. Assuming you’ve taken care of 1 and 2, you’re in good shape to plan what should happen if the web service goes away forever, or if its pricing changes radically.
  • Having determined these fallback plans, provide easy ways to disable Web APIs frequently during development and testing, and to verify that your product behaves according to your fallback plans (for brief, temporary, and long-term failures). These service-outage simulations should be part of a regular regression test procedure.
  • Bonus assignment for the best students: Add kill-switches for every service you rely on, so that if they are behaving extremely poorly you can step in quickly to prevent your users from suffering consequences of a bad service that is beyond your control.
  • Check if the Web API provider has a clear deprecation policy (they should) and regularly check up on this so you have a long time to prepare. You cannot guarantee that they’ll follow the policy, but this will help you be prepared.
  • If the Web API means that your important data will be in their systems, verify how to periodically retrieve that data so that you have it for when their service fails and you need to switch.
You can and probably should use 3rd-party web APIs, but rely on them being unreliable.
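As a rough illustration of the first two fallback levels plus the kill switch, here is a small Python sketch. Everything in it is an assumption made for the example: call_with_fallback, its parameters, and the idea of passing the API call and the fallback in as functions are hypothetical, not any particular library's interface.

```python
import time


def call_with_fallback(api_call, fallback, retries=3, delay=0.0, enabled=True):
    """Try api_call a few times; on persistent failure return fallback().

    enabled=False acts as the kill switch, and also lets tests simulate
    an outage without touching the network.
    """
    if not enabled:
        return fallback()
    for attempt in range(retries):
        try:
            return api_call()
        except Exception:  # real code should catch narrower error types
            if attempt < retries - 1:
                time.sleep(delay)  # brief outage: wait, then retry
    return fallback()  # longer outage: degrade and tell the user


if __name__ == "__main__":
    def dead_api():
        raise ConnectionError("service is gone")

    print(call_with_fallback(dead_api, lambda: "sorry, we're suffering right now"))
```

Because the switch is just a parameter, a regression test can flip it for each service in turn and check that the product does what the fallback plan says it should.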

Today’s Takeaway: Every Web API will sometimes fail--someday it will fail forever. Plan for those failures. Practice your plan.

Monday, April 23, 2012

Don’t hold it that way. Apple’s “secret” problem.


I recently upgraded to an iPhone 4 (because I’m too cheap for a 4S) and so became one of the last people on the planet to experience “Antennagate”. I’ve learned that to keep calls from dropping in my home I must not hold the iPhone 4 in certain perfectly natural ways.

The Problem

If your bare hand is touching the bare phone in a particular spot, the antenna weakens, as seen in this video. Problem is, touching that spot is a perfectly normal thing to be doing. It takes effort NOT to touch that spot.

Unveiling The Problem Behind The Problem

Everything about Apple’s new products is veiled in secrecy.

Within the company, nobody is allowed to see a new product unless they absolutely need to know about it. Even if you are allowed to see a device, perhaps because you’re writing new software to be released on it, you might not actually be able to touch the device—it may be delivered to you veiled in a locked case with only a single component, such as the screen, accessible.

A few people are allowed to take a new device off-campus, but only if it’s veiled in a case that disguises what it actually looks like (and, btw, prevents you from touching it in a certain spot).

I find it poetic that the first person known to have experienced the antenna problem was the very-secretive Steve Jobs, on stage during the product’s public unveiling, and quite literally because that was the first time the product had been used in public without a veil.

Public Unveiling - Oops

The Fallout

This is old news; you all know what happened. Apple denied the problem as long as it could, which wasn’t very long. It was one of the rare times when Apple just looked ridiculous. Since July 2010 Apple has offered a free bumper (poetically: “a veil”) to all iPhone 4 users, at an estimated cost of $175 million. As of a March 2012 Class Action Settlement, iPhone 4 users can receive a $15 cash settlement.

Today’s Takeaway: The people who use your prototypes have to be able to touch the things. Duh!

Wednesday, April 18, 2012

The Effective CEO: Clumsy Eccentric Oaf

After writing about Netflix’s success amidst the Amazon Outage of April 2011, I remembered another company that survived that Amazon disaster.

The story, as I remember it:

The way I remember it, the story goes something like this, as told by their CEO:
I like to walk around the office and just unplug stuff at random. For instance, I’ll see an electric plug in a wall outlet, wonder “what does this do?” and just unplug it. Sometimes I’ll turn off people’s computers at random, or dump coffee in their keyboards, or just pick up a piece of equipment with shiny lights on it (I have no idea what it does) and slam it against the wall. I guess you could say I really like to screw around with my employees. If I find scissors I’ll look for a few cables and cut them (but that doesn’t happen much anymore because someone hid all the office scissors). One time I found a cartridge labeled “full backup” and pulled all the tape out of it to use as a streamer for a party—that was a helluva party. 
Not everyone has my sense of humor about this stuff. They’d probably fire me if I weren’t the CEO. 
One day I was in the CFO’s office and saw all these bills from Amazon, which he said were for “cloud services”. That sounded ridiculous to me. Who would pay for a cloud? So I called up Amazon and said “cut our cloud prices in half—unplug whatever you need to unplug to make that happen”.  I make the same call to Amazon every few weeks to turn stuff off, but for some reason our engineers keep telling Amazon to turn things back on. 
Whatever I’m doing must be working, because when all the other companies were having trouble with Amazon, we and our users hardly even noticed.
The story, as the CEO remembers it:

Thinking that it’s possible I might not be remembering that story 100% accurately, I’ve found the original article, and the CEO’s own words, here: How SmugMug survived the Amazonpocalypse. It’s a nice read—but I like my version better. ☺

Today’s Takeaways from a wise CEO: Build for failure. Test your components.

Sunday, April 15, 2012

Put this on your Netflix queue: Release of The Simian Army

About a year ago, in April of 2011, a whole lot of internet services were failing because a whole lot of internet services run on Amazon’s EC2, and EC2 was failing. Pretty much any new internet service suffered, because by April 2011 most new internet services were using EC2 for at least some part of the business (e.g. Quora, Foursquare, Reddit, Hootsuite, among very many others including two sites I was working on).

Not a problem

One site that famously did NOT go down (at least not in April) was Netflix, because of one really bad employee.

The new Employee who Solved the Problems (spoiler alert: it’s a monkey)

Netflix survived because previously they’d hired a crazy, chaotic employee--a monkey--whose job description (from 5 Lessons We’ve Learned Using AWS) was:
…to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
The Netflix Simian Army describes why they created this job position:
…comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice.
Chaos Monkey did such a good job (at being bad) that Netflix has since hired a whole team of monkeys, who each morning chant their motto:
“The best way to avoid failure is to fail constantly.”
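To make the idea concrete, here is a toy simulation in Python. It is emphatically not Netflix's code: Pool, chaos_round, and survives_chaos are hypothetical names, and the "instances" are just booleans in a list. The only point is the shape of the test: kill something at random, check that requests still succeed, recover, repeat.

```python
import random


class Pool:
    """Toy pool of redundant instances (purely illustrative)."""

    def __init__(self, size):
        self.alive = [True] * size

    def handle_request(self):
        """A request succeeds as long as any instance is still alive."""
        return any(self.alive)

    def recover(self):
        """Automatic recovery: restart every dead instance."""
        self.alive = [True] * len(self.alive)


def chaos_round(pool, rng):
    """Kill one randomly chosen instance, as the monkey would."""
    victim = rng.randrange(len(pool.alive))
    pool.alive[victim] = False
    return victim


def survives_chaos(size, rounds, seed):
    """Check that requests keep succeeding while instances are killed."""
    rng = random.Random(seed)
    pool = Pool(size)
    for _ in range(rounds):
        chaos_round(pool, rng)
        if not pool.handle_request():
            return False
        pool.recover()
    return True


if __name__ == "__main__":
    print(survives_chaos(size=3, rounds=100, seed=1))
```

In this toy, a pool of one instance fails on the first round, while a pool of two or more survives every round because recovery runs before the next kill.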
Hey, hey, here come The Monkeys

According to Wired Enterprise, Netflix will be releasing The Netflix Simian Army this year in the form of source code. Whether you use that source code directly, or simply learn from it, Netflix's monkeys are some of the best examples of Problems Solving Problems.

Let's keep an eye out for the monkeys.

Update: July 30, 2012:

Today Netflix announced that the monkey is out. If you try the monkey, let us know how it goes.

Today’s Takeaway: The best way to avoid failure is to fail constantly.

Thursday, April 12, 2012

Sometimes you have to shed a little dark on the problem.

I bought this fancy DVD/CD/HD-Radio player to watch movies in the bedroom.


This is a closer view of its remote control. Can you see the major flaw?


“Can you see the major flaw?” was a trick question (psych!), because here’s what the remote control looks like when the lights are off.


The Problem

In the dark (e.g., in my bedroom when I'm watching a DVD), this remote is just a hard-to-find small rectangle with a bunch of small buttons. Because it’s symmetrical, there is no way by sense-of-touch to tell up from down. Because it’s so complicated, and all the buttons are identical, to use this remote in the dark I have to memorize the location of the buttons.

The most common button I use is play/pause (for potty breaks).  So here’s what I’ve learned to do:
  1. Feel around for the tiny remote (I can't tell you but I know it's mine).
  2. Count the buttons down five from the top, three from the left.
  3. Push that button. If the movie plays/pauses I’ve got the right one.
  4. If that didn’t work, turn the remote around 180 degrees and go back to step 2.
I’ve memorized similar steps for volume up/down.

Occasionally I’ll hit the wrong button and get the system in a bad state. Then I have to turn on the lights to figure out how to restore order.

The Problem Behind The Problem

I’m sure the team at Polk Audio thoroughly tested their remote control to make sure everything worked, and that they didn’t miss any functionality. I can picture them now in white lab coats (I’m not sure why the white lab coats, but that’s what I’m picturing) running through a very thorough script of remote-control scenarios, validating that the buttons always worked, that the images on the buttons did not fade, and that the battery lasted a long time.

What I don’t picture is anyone on the team taking their new product home, trading their white lab coat for pajamas, turning off the lights, and using this remote control to watch a movie. Had they taken this simple step of putting themselves in their customer's place, they would have realized immediately that their product had some serious design bugs.

Today’s Takeaway: If you’re creating a product that will be used primarily in a dark room, try using it in a dark room.

Tuesday, April 10, 2012

So what? It’s not as if someone’s life depends on a rare bug in a stupid game.

In “Draw Something? Impossible” I wrote how a Draw Something bug made it impossible for my friend Smudgy to guess my word. Even as I wrote that, I could hear you thinking: “So what? It’s just a stupid bug in a stupid game. Who cares?”

If it could save a life…

To paraphrase an old Steve Jobs tale on how he encouraged an engineer to make the Macintosh boot just ten seconds faster: “If it could save a life, would you care about this bug?”

Math, Yea!

Pulling some estimates from my nethers:
  • If today there are 50 million people playing Draw Something, making an average of 3 moves against 5 friends, then today there were (50million*3*5=) 750 million games played.
  • If this rare rare bug happens in only 1 out of 1000 games, then this bug occurred (750million/1000=) 750 thousand times today.
  • If, as in my friend Smudgy’s case, this bug results in someone spending 2 minutes trying the impossible, then showing it to their boyfriend who spends an additional two minutes trying the impossible, then (750,000*(2+2)=) 3 million minutes were lost to this bug today. That’s (3million/60/24/365=) 5.7 years lost to this bug today.
  • So (5.7*30=) 171 years of life will be lost to this bug this month.

In Summary:
In April, over two lives will be lost due to this one bug in Draw Something.
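For anyone who wants to plug in their own guesses, the estimate above can be written as a few lines of Python. The inputs are the same made-up numbers as in the list; the function and variable names are hypothetical, and the number of days in the month is left as a parameter.

```python
def minutes_lost_per_day(players, moves, friends, one_in, minutes_per_bug):
    """Minutes wasted per day, using the back-of-envelope model above."""
    games = players * moves * friends
    buggy_games = games // one_in  # the bug hits 1 game out of `one_in`
    return buggy_games * minutes_per_bug


def minutes_to_years(minutes):
    """Convert minutes to years of 365 days."""
    return minutes / 60 / 24 / 365


daily_minutes = minutes_lost_per_day(
    players=50_000_000, moves=3, friends=5,
    one_in=1000, minutes_per_bug=2 + 2,
)
daily_years = minutes_to_years(daily_minutes)


def years_lost(days):
    """Years lost over a span of `days` at the daily rate above."""
    return daily_years * days


if __name__ == "__main__":
    print(daily_minutes, round(daily_years, 1))
```

Running it prints 3000000 and 5.7, matching the per-day figures in the list. Change any input and the "lives lost" scale accordingly.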



    A stitch in time saves 2 lives

    Maybe it will take a developer a few hours or even a few days to fix this bug. Is it worth spending a few developer-days on a rare bug like this?

    Yes, it is worth a few days of developer time if it can save many years of user time.

    But don’t worry, developers, it is our goal at ProblemsSolvingProblems to develop the techniques and tools to shorten the time required to create bug-free software. Stay tuned.

    Today's Takeaway: Show some respect for your users and their time, even if it is “just a game”. Your customers' time is a precious thing to lose.

    Sunday, April 8, 2012

    Draw Something? Impossible

    How to draw “MULTIPLY”? Hmmm. My first thought was to show two copulating Easter Bunnies surrounded by a field of colorful baby bunny marshmallow Peeps, but I lack the artistic skills to handle anything more complicated than a few dots and lines, and so I drew a series of screens similar to this one:


    I drew 2 dots and 3 dots, then 6 dots. MULTIPLY.  Get it?

    As I watched my friend Smudgy try to solve the puzzle it became painfully clear that there was no way she would ever guess MULTIPLY. It was impossible. Draw Something had given her the wrong letters.

    The Problem

    In turn #18, the previous turn, the answer actually was “EAGLE”. Smudgy’s iPhone hadn’t received the new letter choices for turn #19 and so was stuck with the letters it had in turn #18. Can you draw “BUG”?

    The Problem Behind The Problem

    I don’t know what’s behind this particular bug. For the sake of argument, let’s assume that this EAGLE/MULTIPLY problem is a result of explosive growth—a growing pain.

    Draw Something is the fastest growing game ever, going from zero to more than fifty million downloads in just two months. Growth this explosive is bound to result in problems. (“The kind of problems you want to have,” I hear the people at OMGPOP saying, “like what to do with $210 million from Zynga.”)

    Anybody would have a hard time scaling up their servers, bandwidth, code, QA, and bug-patching fast enough to handle this unprecedented growth. Maybe in Smudgy’s case something in the server infrastructure was just plain overloaded and so never got out the message that “the word isn’t EAGLE anymore, it’s MULTIPLY”. Such rapid-scaling problems are inevitable, aren’t they?

    How to prepare for 50 million users (the super short answer)

    If you think you need 50 million testers to make sure you’re ready for 50 million users, then you’re looking at the problem wrong. That one-for-one approach works for 5 users, but not 50 million. When viewed correctly, preparing for 50 million users is fundamentally no different than preparing for 5 users, or 5 thousand, or 5 billion.

    In future blog entries we can explore more deeply how to prepare for scaling without problems. For now, here’s an extreme over-simplification:
    1. create a MAXUSERS adjustable variable
    2. start with a small, manageable size for MAXUSERS (e.g. 5, 50, or 500)
    3. test the hell out of your system with maximally active MAXUSERS, MAXUSERS+1, MAXUSERS*2, MAXUSERS*10, etc…
    4. work out all the kinks at this level of MAXUSERS testing, making sure
      • there is no compromise in data integrity, EVER
      • when maximum limits are approached, appropriate “we’re sorry” actions are taken (e.g. “fail whale”) – this is the absolute worst that should happen (and it’s not very bad)
      • the user never experiences a “what the hell is going on” moment
    One more thing to do early on: Add a giant kill switch that will shut down everything and give your users some version of  “sorry, we’re down for maintenance”. You never want to use this kill switch, but you want to test it often.
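A toy version of the steps above, in Python, might look like the following. It is only a sketch under heavy assumptions: Server, login, and stress are hypothetical names, a "user" is just a string, and the whole system lives in one process. What it does show is the adjustable limit, the polite refusal at the limit, the integrity check, and the kill switch.

```python
class Server:
    """Toy server with an adjustable user limit (hypothetical names)."""

    def __init__(self, max_users):
        self.max_users = max_users
        self.users = set()
        self.killed = False  # the giant kill switch

    def login(self, user):
        """Return "ok" or a polite refusal; never fail silently."""
        if self.killed:
            return "sorry, we're down for maintenance"
        if user in self.users:
            return "ok"
        if len(self.users) >= self.max_users:
            return "sorry, we're over capacity"  # the "fail whale"
        self.users.add(user)
        return "ok"


def stress(max_users, attempted):
    """Throw `attempted` logins at a server and count the outcomes."""
    server = Server(max_users)
    results = [server.login(f"user{i}") for i in range(attempted)]
    accepted = results.count("ok")
    refused = len(results) - accepted
    # Data integrity check: the server never holds more than its limit.
    assert len(server.users) <= max_users
    return accepted, refused


if __name__ == "__main__":
    MAXUSERS = 5
    for attempted in (MAXUSERS, MAXUSERS + 1, MAXUSERS * 2, MAXUSERS * 10):
        print(attempted, stress(MAXUSERS, attempted))
```

With MAXUSERS set to 5, every attempt beyond the fifth is refused with a message rather than a mystery, which is the behavior step 4 asks for.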

    The above steps don’t mean you’re ready for 50 million users yet, not by a long shot (future Problems Solving Problems entries should make it very clear just how much remains to be done), but it does mean you have at least thought about the problems brought on by too many users, and you’ve thought about those problems when it’s still early enough and easy enough to do something about them.

    Apologies to OMGPOP

    BTW, OMGPOP appears to have done a really amazing job of preparing for explosive growth, and I have done no research to show that this particular EAGLE/MULTIPLY bug is due to growing pains on their part. I only pick on OMGPOP because, geez, if you can’t pick on someone who is currently flying so high, who can you pick on?

    Today's Takeaway: To prepare for too much demand on resources, you must simulate too much demand on resources. That often involves lying to yourself about what is "too much".

    Saturday, April 7, 2012

    Got a problem? Step 1: Start a Blog.

    This blog companion to ProblemsSolvingProblems.com is a place to discuss why there are so many problems, why there are so many problems solving those problems, and why that’s such a problem.

    Mostly this will be about software problems, which are generally known as “bugs”.

    Here at ProblemsSolvingProblems we’re not fond of bugs. We don’t like that there are so many bugs. And we really hate it that the people who make those bugs aren’t bugged about it. We want to change that. We want to bug the people who let those bugs slide, and then learn how to keep it from happening again.

    Today's Takeaway: Bugs are bad.