Sunday, May 27, 2012

Make Your Own Dumb Luck

Three times in the past month, on three separate projects, I delivered code that I thought was well tested and therefore correct. In two cases I was writing libraries to sort data according to certain rules, and in the other case it was a small program to pull a best-match set of data from a database as fast as possible. In all cases the code, which I thought was excellent and well-tested, reached the real world and soon failed. And in all cases the failures were not easy to evaluate.

It took me all month to realize that in each case I’d made the same stupid testing mistake: not enough dumb luck.

I know what you’re thinking, punk. You’re thinking “I’ve tested every possible permutation.” You’ve gotta ask yourself a question: “Do I feel lucky?” Well, do ya, punk?

The Problem

Being quality minded, before considering my work “finished” I had of course written unit tests for a bunch of different cases that I thought would best stress the code. I wrote a few standard cases, edge cases, cases around data limits, cases around bad input, and cases that I thought really put my algorithms to the limit. With all of my test cases passing, I confidently handed off the code for real-world use.

Somehow the real world managed to create data states and logic paths that my tests had not exercised, even though I thought I’d been very clever about getting all possible cases into my test suite.

The real world is like that sometimes. Most of the time.

The Problem Behind The Problem

It so happens that in writing these three bits of code I was being about as smart as I know how to be. That is, writing the code required 100% of my IQ.

But figuring out code tests that mimic all the possible real-world situations and permutations and complications requires more smarts than it takes to write the code. It requires, I dunno, maybe 27.3% more smarts to test the code than to write it. No matter how hard I try I’m just not 27.3% smarter than myself.

Solving the Problem (by adding a lot more Problems)

Working alone, I have no one to lean on who is smarter than me. But I do have a pretty good random number generator, and I know how to use it. So here’s what I did in each of the three cases.

First, for each of my projects I wrote test suites that ran the tests against tons of generated data and queries. Whereas the original tests may have had something like a dozen well-designed cases, the additional random test might have 2 million cases. This was the easy part.

Second (the hard part), because the data was now random, I wouldn’t know ahead of time what the correct results would be, so correctness had to be determined at runtime. This required writing more code than was in the original libraries, simply to evaluate the results. Fortunately this new result-validation code didn’t need to be fast or clever, just accurate, and so it could be written so simplistically that even someone like me would have a hard time getting it wrong. (As a bonus, I stored the random input in a temporary file on each run, so if it did fail I could see just what input had caused the failure.)
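
Here is a minimal sketch of that approach in Python, assuming the code under test is a sort routine. Everything named here (`run_random_tests`, `is_correctly_sorted`, the value ranges, the case count) is made up for illustration and is not the original test code. The fixed seed is also my own addition, so that a failing run can be replayed. Unlike the description above, this sketch writes the input to a temporary file only when a case fails.

```python
import random
import tempfile
from collections import Counter


def is_correctly_sorted(original, result):
    """Slow, simple checker: same elements, each one no larger than the next."""
    same_elements = Counter(original) == Counter(result)
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return same_elements and in_order


def run_random_tests(sort_fn, cases=10_000, seed=0):
    """Run sort_fn on random lists.

    Returns None if every case passes. Otherwise returns a tuple
    (case number, failing input, path of the file holding that input).
    """
    rng = random.Random(seed)  # assumption: seeded so a failure can be replayed
    for case in range(cases):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        result = sort_fn(list(data))  # pass a copy so the input is preserved
        if not is_correctly_sorted(data, result):
            # Keep the failing input on disk so it can be inspected later.
            with tempfile.NamedTemporaryFile(
                "w", suffix=".txt", delete=False
            ) as f:
                f.write(f"seed={seed} case={case} input={data!r}\n")
            return case, data, f.name
    return None
```

With this sketch, `run_random_tests(sorted)` returns `None`, while a sort that quietly drops duplicates, such as `lambda xs: sorted(set(xs))`, should be caught within the first few random cases. The checker is deliberately dumber than the code it checks, which is the whole point.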

With millions of random cases running, every error that had been reported (and more) popped up very quickly, and I could squash each one just as quickly. I released my code again and this time no error reports came back.

I know a lot of people dislike random testing because it is, almost by definition, not reproducible. Those people must be smart enough to figure out enough non-random tests to exercise every part of their code--I’m not.

So in the end I didn’t have to be smarter than myself to test my own code. I just needed to make lots of my own dumb luck!

Today’s Takeaway: When you’re done writing all your best test cases, add a few million random cases for good measure. If you’re anything like me, you’re not smart enough to think of a test for everything.

Thursday, May 3, 2012

We reserve the right to refuse this web service to anyone.

Do you use a web API as part of your product? Perhaps you let users log in and find friends using a Social API, show users information using a Geolocation API tied to a Mapping API, save user images/video/audio using a Storage API translated with a Translation API, etc…

If you clicked on any of the Web API links in the previous paragraph, then I apologize because none of those APIs work any more. They’re all dead.

If your product did rely on any of those APIs, how would your product fare?

Assertion: Every Web API will someday fail you

Relying on a third party API provides a lot of power very quickly. But keep in mind that the Terms-Of-Service for every 3rd-party Web API goes something like this.

EVERY Web API Terms of Service (TOS)

We reserve the right to refuse this web service to anyone…
…at any time…
…for any length of time…
…for any reason…
…or for no reason at all…
…and to change these or any other terms of service…
…or to change our APIs…
…or to change our pricing…
…or to add limits to your usage…
…or to give away whatever information you send us…
…or to just screw up and stop working…
…or to go belly up…
…or to get bored and stop paying attention...
…at any time.

I can promise you that any Web API you use, or any Web API you create, will at some point in the future be different than it is today (i.e., however you use it now, it will someday “break”). I just took a brief tour of the ProgrammableWeb API directory, and out of 10 mashups I tried at random, only 3 were currently working. Out of 5814 APIs, 602 are listed in the deadpool (about 10%), but when I randomly picked 10 of those that were not in the deadpool, only 5 were still alive. Some of these broken APIs are from tiny little companies I’ve never heard of, and others are from the biggest internet companies of all.

In just the past year I have been personally involved in products that failed in full or in part because a web API they relied on was shut down or changed. In the past year I have personally shut down a service that provided a Web API, causing the apps that relied on that service to become useless.

Web APIs are unreliable.

If Web APIs are so unreliable, should I use them?

If a Web API enables you to build a better product, and to build it faster, then, yes, use the Web API. But use that API in full knowledge that it won’t always work as it does today. Someday it almost certainly will not work at all.

What’s a reliable strategy for using unreliable Web APIs?

The most important step for dependency on 3rd-party APIs is to admit that you are powerless over them, and to accept that they are unreliable. Once you accept that fact, the other steps follow.
  • For every Web API you use, determine whether it is core or supplementary to your product (i.e., is your product totally useless without that service?). For the questions that follow, you’ll probably have different answers for core vs. supplementary services.
  • For every Web API you use, create three levels of fallback plans for what should happen when those services fail:
    1. If the service is unavailable temporarily, for just a few seconds, decide where you stand on these questions: Will your fallback simply retry a few times? Will the user still be able to use parts of your product? Will the user have any idea what’s going on?
    2. If the service is unavailable for many seconds/minutes/hours, define what should happen: How will the user be informed? Will your product be totally useless? Note that it is OK to choose for your product to be useless for a while, so long as you have thoughtfully made that decision. If this happens, you must not keep your users in the dark; they must be provided with information, even if it’s just something like “sorry, we’re suffering right now, see for more information”.
    3. Assuming you’ve taken care of 1 and 2, you’re in good shape to plan what should happen if the web service goes away forever, or if its pricing changes radically.
  • Having determined these fallback plans, provide easy ways to disable Web APIs during development and testing, use them frequently, and verify that your product behaves according to your fallback plans (for brief, temporary, and long-term failures). These service-outage simulations should be part of a regular regression test procedure.
  • Bonus assignment for the best students: Add kill-switches for every service you rely on, so that if they are behaving extremely poorly you can step in quickly to prevent your users from suffering consequences of a bad service that is beyond your control.
  • Check if the Web API provider has a clear deprecation policy (they should) and regularly check up on this so you have a long time to prepare. You cannot guarantee that they’ll follow the policy, but this will help you be prepared.
  • If the Web API means that your important data will be in their systems, verify how to periodically retrieve that data so that you have it for when their service fails and you need to switch.
You can and probably should use 3rd-party web APIs, but rely on them being unreliable.
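
To make the list above a bit more concrete, here is a rough Python sketch of a wrapper that retries brief failures, falls back when an outage lasts, and carries a kill switch. All of the names (`GuardedService`, `ServiceUnavailable`, `flaky`) and the retry and delay values are hypothetical, chosen only for illustration. A real product would make its own choices at each step.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the (hypothetical) API client when the remote call fails."""


class GuardedService:
    """Wraps one third-party call with retries, a fallback, and a kill switch."""

    def __init__(self, call, fallback, retries=3, delay=0.5, sleep=time.sleep):
        self.call = call          # function that talks to the Web API
        self.fallback = fallback  # function that produces the degraded result
        self.retries = retries
        self.delay = delay
        self.sleep = sleep        # injectable so tests need not really wait
        self.enabled = True       # kill switch: set False to bypass the API

    def fetch(self, *args):
        if not self.enabled:
            return self.fallback(*args)
        for attempt in range(self.retries):
            try:
                return self.call(*args)
            except ServiceUnavailable:
                if attempt < self.retries - 1:
                    # Brief failure: wait a little longer each time, then retry.
                    self.sleep(self.delay * (2 ** attempt))
        # Still failing after every retry: treat it as a longer outage.
        return self.fallback(*args)


def flaky(failures):
    """Build a fake API call that fails `failures` times, then succeeds."""
    remaining = [failures]

    def call(city):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ServiceUnavailable(city)
        return f"map of {city}"

    return call
```

The `flaky` helper is the outage simulator: `GuardedService(flaky(2), fallback)` exercises the brief-failure path, `flaky(10)` exercises the long-outage path, and setting `enabled = False` exercises the kill switch, all without touching a real service.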

Today’s Takeaway: Every Web API will sometimes fail--someday it will fail forever. Plan for those failures. Practice your plan.