Monday, April 23, 2012

Don’t hold it that way. Apple’s “secret” problem.

I recently upgraded to an iPhone 4 (because I’m too cheap for a 4S) and so became one of the last people on the planet to experience “Antennagate”. I’ve learned that to keep calls from dropping in my home I must not hold the iPhone 4 in certain perfectly natural ways.

The Problem

If your bare hand touches the bare phone in a particular spot, the signal weakens, as seen in this video. The problem is that touching that spot is a perfectly normal thing to do. It takes effort NOT to touch that spot.

Unveiling The Problem Behind The Problem

Everything about Apple’s new products is veiled in secrecy.

Within the company, nobody is allowed to see a new product unless they absolutely need to know about it. Even if you are allowed to see a device, perhaps because you’re writing new software to be released on it, you might not actually be able to touch the device—it may be delivered to you veiled in a locked case with only a single component, such as the screen, accessible.

A few people are allowed to take a new device off-campus, but only if it’s veiled in a case that disguises what it actually looks like (and, btw, prevents you from touching it in a certain spot).

I find it poetic that the first person known to have experienced the antenna problem was the very-secretive Steve Jobs, on stage during the product’s public unveiling, and quite literally because that was the first time the product had been used in public without a veil.

Public Unveiling - Oops

The Fallout

This is old news; you all know what happened. Apple denied the problem as long as it could, which wasn’t very long. It was one of the rare times when Apple just looked ridiculous. Since July 2010 Apple has offered a free bumper (poetically: “a veil”) to all iPhone 4 users, at an estimated cost of $175 million. As of a March 2012 Class Action Settlement, iPhone 4 users can receive a $15 cash settlement.

Today’s Takeaway: The people who use your prototypes have to be able to touch the things. Duh!

Wednesday, April 18, 2012

The Effective CEO: Clumsy Eccentric Oaf

After writing about Netflix’s success amidst the Amazon Outage of April 2011, I remembered another company that survived that Amazon disaster.

The story, as I remember it:

It goes something like this, as told by their CEO:
I like to walk around the office and just unplug stuff at random. For instance, I’ll see an electric plug in a wall outlet, wonder “what does this do?” and just unplug it. Sometimes I’ll turn off people’s computers at random, or dump coffee in their keyboards, or just pick up a piece of equipment with shiny lights on it (I have no idea what it does) and slam it against the wall. I guess you could say I really like to screw around with my employees. If I find scissors I’ll look for a few cables and cut them (but that doesn’t happen much anymore because someone hid all the office scissors). One time I found a cartridge labeled “full backup” and pulled all the tape out of it to use as a streamer for a party—that was a helluva party. 
Not everyone has my sense of humor about this stuff. They’d probably fire me if I weren’t the CEO. 
One day I was in the CFO’s office and saw all these bills from Amazon, which he said were for “cloud services”. That sounded ridiculous to me. Who would pay for a cloud? So I called up Amazon and said “cut our cloud prices in half—unplug whatever you need to unplug to make that happen”.  I make the same call to Amazon every few weeks to turn stuff off, but for some reason our engineers keep telling Amazon to turn things back on. 
Whatever I’m doing must be working, because when all the other companies were having trouble with Amazon, we and our users hardly even noticed.
The story, as the CEO remembers it:

Since I might not be remembering that story 100% accurately, I tracked down the original article, with the CEO’s own words, here: How SmugMug survived the Amazonpocalypse. It’s a nice read, but I like my version better. ☺

Today’s Takeaways from a wise CEO: Build for failure. Test your components.

Sunday, April 15, 2012

Put this on your Netflix queue: Release of The Simian Army

About a year ago, in April of 2011, a whole lot of internet services were failing because a whole lot of internet services run on Amazon’s EC2, and EC2 was failing. Pretty much any new internet service suffered, because by April 2011 most new internet services were using EC2 for at least some part of the business (e.g. Quora, Foursquare, Reddit, Hootsuite, among very many others including two sites I was working on).

Not a problem

One site that famously did NOT go down (at least not in April) was Netflix, because of one really bad employee.

The new Employee who Solved the Problems (spoiler alert: it’s a monkey)

Netflix survived because previously they’d hired a crazy, chaotic employee--a monkey--whose job description (from 5 Lessons We’ve Learned Using AWS) was:
…to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
The Netflix Simian Army describes why they created this job position:
…comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice.
Chaos Monkey did such a good job (at being bad) that Netflix has since hired a whole team of monkeys, who each morning chant their motto:
“The best way to avoid failure is to fail constantly.”
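For anyone who wants to feel the idea before the real source code arrives, here is a toy sketch in Python. It is not Netflix's code: the Instance and Service classes, the chaos_monkey function, and the run_drill helper are all hypothetical names made up for illustration. The only thing it borrows from the quotes above is the technique, which is to kill a random instance on purpose, check that requests still get served, and let an automatic recovery step replace what died.

```python
import random


class Instance:
    """One hypothetical server instance that can be alive or dead."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class Service:
    """A pool of interchangeable instances; any live one can serve a request."""

    def __init__(self, names, min_healthy):
        self.instances = [Instance(n) for n in names]
        self.min_healthy = min_healthy
        self._spawned = 0

    def healthy(self):
        return [i for i in self.instances if i.alive]

    def handle_request(self):
        live = self.healthy()
        if not live:
            raise RuntimeError("outage: no healthy instances")
        return random.choice(live).name

    def recover(self):
        """Automatic recovery: drop dead instances and start replacements."""
        self.instances = self.healthy()
        while len(self.instances) < self.min_healthy:
            self._spawned += 1
            self.instances.append(Instance(f"replacement-{self._spawned}"))


def chaos_monkey(service, rng):
    """Kill one randomly chosen live instance and return its name."""
    victim = rng.choice(service.healthy())
    victim.alive = False
    return victim.name


def run_drill(rounds=100, seed=0):
    """Kill an instance every round and count the requests still served."""
    rng = random.Random(seed)
    service = Service(["a", "b", "c"], min_healthy=3)
    served = 0
    for _ in range(rounds):
        chaos_monkey(service, rng)
        service.handle_request()  # raises if the kill caused an outage
        served += 1
        service.recover()
    return served


if __name__ == "__main__":
    print(run_drill())
```

If run_drill finishes without raising, every round survived the loss of an instance. Shrink the pool to a single instance and the very first kill produces an outage, which is exactly the kind of weakness you would rather find in the middle of a business day.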
Hey, hey, here come The Monkeys

According to Wired Enterprise, Netflix will be releasing The Netflix Simian Army this year in the form of source code. Whether you use that source code directly, or simply learn from it, Netflix's monkeys are some of the best examples of Problems Solving Problems.

Let's keep an eye out for the monkeys.

Related Links
Update: July 30, 2012:

Today Netflix announced that the monkey is out. If you try the monkey, let us know how it goes.
Today’s Takeaway: The best way to avoid failure is to fail constantly.

Thursday, April 12, 2012

Sometimes you have to shed a little dark on the problem.

I bought this fancy DVD/CD/HD-Radio player to watch movies in the bedroom.

This is a closer view of its remote control. Can you see the major flaw?

“Can you see the major flaw?” was a trick question (psych!), because here’s what the remote control looks like when the lights are off.

The Problem

In the dark (e.g., in my bedroom when I'm watching a DVD), this remote is just a hard-to-find small rectangle with a bunch of small buttons. Because it’s symmetrical, there is no way by sense-of-touch to tell up from down. Because it’s so complicated, and all the buttons are identical, to use this remote in the dark I have to memorize the location of the buttons.

The most common button I use is play/pause (for potty breaks).  So here’s what I’ve learned to do:
  1. Feel around for the tiny remote (I can't tell you but I know it's mine).
  2. Count the buttons down five from the top, three from the left.
  3. Push that button. If the movie plays/pauses I’ve got the right one.
  4. If that didn’t work, turn the remote around 180 degrees and go back to step 2.
I’ve memorized similar steps for volume up/down.

Occasionally I’ll hit the wrong button and get the system in a bad state. Then I have to turn on the lights to figure out how to restore order.

The Problem Behind The Problem

I’m sure the team at Polk Audio thoroughly tested their remote control to make sure everything worked, and that they didn’t miss any functionality. I can picture them now in white lab coats (I’m not sure why the white lab coats, but that’s what I’m picturing) running through a very thorough script of remote-control scenarios, validating that the buttons always worked, that the images on the buttons did not fade, and that the battery lasted a long time.

What I don’t picture is anyone on the team taking their new product home, trading their white lab coat for pajamas, turning off the lights, and using this remote control to watch a movie. Had they taken this simple step of putting themselves in their customer's place, they would have realized immediately that their product had some serious design bugs.

Today’s Takeaway: If you’re creating a product that will be used primarily in a dark room, try using it in a dark room.

Tuesday, April 10, 2012

So what? It’s not as if someone’s life depends on a rare bug in a stupid game.

In “Draw Something? Impossible” I wrote how a Draw Something bug made it impossible for my friend Smudgy to guess my word. Even as I wrote that, I could hear you thinking: “So what? It’s just a stupid bug in a stupid game. Who cares?”

If it could save a life…

To paraphrase an old Steve Jobs tale on how he encouraged an engineer to make the Macintosh boot just ten seconds faster: “If it could save a life, would you care about this bug?”

Math, Yea!

Pulling some estimates from my nethers:
  • If today there are 50 million people playing Draw Something, making an average of 3 moves against 5 friends, then today there were (50million*3*5=) 750 million games played.
  • If this rare rare bug happens in only 1 out of 1000 games, then this bug occurred (750million/1000=) 750 thousand times today.
  • If, as in my friend Smudgy’s case, this bug results in someone spending 2 minutes trying the impossible, then showing it to their boyfriend who spends an additional two minutes trying the impossible, then (750,000*(2+2)=) 3 million minutes were lost to this bug today. That’s (3million/60/24/365=) 5.7 years lost to this bug today.
  • So (5.7*30=) 171 years of life will be lost to this bug this month.

In Summary:
In April, over two lives will be lost due to this one bug in Draw Something.
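If you want to swap in your own guesses, the arithmetic above can be sketched in a few lines of Python. Every input is one of the made-up estimates from the bullets, and the constant and function names are mine, chosen only for illustration.

```python
# Back-of-envelope estimate; every default below is a guess from the post.
PLAYERS = 50_000_000        # people playing today
MOVES_PER_FRIEND = 3        # average moves against each friend
FRIENDS = 5                 # friends per player
GAMES_PER_BUG = 1000        # the bug hits 1 game in this many
MINUTES_LOST_PER_BUG = 2 + 2  # the player, then the boyfriend


def years_lost_per_day(players=PLAYERS, moves=MOVES_PER_FRIEND,
                       friends=FRIENDS, games_per_bug=GAMES_PER_BUG,
                       minutes_lost=MINUTES_LOST_PER_BUG):
    """Years of user time lost to the bug in a single day."""
    games = players * moves * friends      # 750 million
    bug_hits = games // games_per_bug      # 750 thousand
    minutes = bug_hits * minutes_lost      # 3 million
    return minutes / 60 / 24 / 365


def years_lost_per_month(days, **estimates):
    """Years lost over a month, using the daily figure rounded to 5.7-style precision."""
    return round(years_lost_per_day(**estimates), 1) * days


if __name__ == "__main__":
    print(round(years_lost_per_day(), 1))
    print(round(years_lost_per_month(days=30)))
```

Passing days=30 (April's length) gives about 171 years, and days=31 gives about 177. Either way it is more than two lifetimes, and halving any one of the guesses still leaves a number worth a few developer-days.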

    A stitch in time saves 2 lives

    Maybe it will take a developer a few hours or even a few days to fix this bug. Is it worth spending a few developer-days on a rare bug like this?

    Yes, it is worth a few days of developer time if it can save many years of user time.

    But don’t worry, developers, it is our goal at ProblemsSolvingProblems to develop the techniques and tools to shorten the time required to create bug-free software. Stay tuned.

    Today's Takeaway: Show some respect for your users and their time, even if it is “just a game”. Your customers' time is a precious thing to lose.

    Sunday, April 8, 2012

    Draw Something? Impossible

    How to draw “MULTIPLY”? Hmmm. My first thought was to show two copulating Easter Bunnies surrounded by a field of colorful baby bunny marshmallow Peeps, but I lack the artistic skills to handle anything more complicated than a few dots and lines, and so I drew a series of screens similar to this one:

    I drew 2 dots and 3 dots, then 6 dots. MULTIPLY.  Get it?

    As I watched my friend Smudgy try to solve the puzzle it became painfully clear that there was no way she would ever guess MULTIPLY. It was impossible. Draw Something had given her the wrong letters.

    The Problem

    In turn #18, the previous turn, the answer actually was “EAGLE”. Smudgy’s iPhone hadn’t received the new letter choices for turn #19 and so was stuck with the letters it had in turn #18. Can you draw “BUG”?

    The Problem Behind The Problem

    I don’t know what’s behind this particular bug. For the sake of argument, let’s assume that this EAGLE/MULTIPLY problem is a result of explosive growth—a growing pain.

    Draw Something is the fastest growing game ever, going from zero to more than fifty million downloads in just two months. Growth this explosive is bound to result in problems. (“The kind of problems you want to have,” I hear the people at OMGPOP saying, “like what to do with $210 million from Zynga.”)

    Anybody would have a hard time scaling up their servers, bandwidth, code, QA, and bug-patching fast enough to handle this unprecedented growth. Maybe in Smudgy’s case something in the server infrastructure was just plain overloaded and so never got out the message that “the word isn’t EAGLE anymore, it’s MULTIPLY”. Such rapid-scaling problems are inevitable, aren’t they?

    How to prepare for 50 million users (the super short answer)

    If you think you need 50 million testers to make sure you’re ready for 50 million users, then you’re looking at the problem wrong. That one-for-one approach works for 5 users, but not 50 million. When viewed correctly, preparing for 50 million users is fundamentally no different than preparing for 5 users, or 5 thousand, or 5 billion.

    In future blog entries we can explore more deeply how to prepare for scaling without problems. For now, here’s an extreme over-simplification:
    1. create a MAXUSERS adjustable variable
    2. start with a small, manageable size for MAXUSERS (e.g. 5, 50, or 500)
    3. test the hell out of your system with maximally active MAXUSERS, MAXUSERS+1, MAXUSERS*2, MAXUSERS*10, etc…
    4. work out all the kinks at this level of MAXUSERS testing, making sure
      • there is no compromise in data integrity, EVER
      • when maximum limits are approached, appropriate “we’re sorry” actions are taken (e.g. “fail whale”) – this is the absolute worst that should happen (and it’s not very bad)
      • the user never experiences a “what the hell is going on” moment
    One more thing to do early on: Add a giant kill switch that will shut down everything and give your users some version of  “sorry, we’re down for maintenance”. You never want to use this kill switch, but you want to test it often.
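The steps above, kill switch included, can be sketched in Python roughly like this. It is a minimal sketch under loud assumptions: GameServer, ServerBusy, DownForMaintenance, and overload_test are hypothetical names invented for this example, and a real system would be tracking far more than a set of user ids.

```python
class ServerBusy(Exception):
    """The polite "we're sorry" response when the limit is reached."""


class DownForMaintenance(Exception):
    """What every user sees once the kill switch is thrown."""


class GameServer:
    def __init__(self, max_users):
        self.max_users = max_users   # the adjustable MAXUSERS from step 1
        self.kill_switch = False     # the giant kill switch
        self.active = set()

    def join(self, user_id):
        if self.kill_switch:
            raise DownForMaintenance("Sorry, we're down for maintenance.")
        already_in = user_id in self.active
        if not already_in and len(self.active) >= self.max_users:
            raise ServerBusy("We're sorry, too many players right now.")
        self.active.add(user_id)


def overload_test(max_users, multiplier):
    """Throw max_users * multiplier users at a server sized for max_users."""
    server = GameServer(max_users)
    accepted = rejected = 0
    for user_id in range(max_users * multiplier):
        try:
            server.join(user_id)
            accepted += 1
        except ServerBusy:
            rejected += 1
    # Data integrity: the server never holds more users than it admitted,
    # and never admits more than the limit.
    assert len(server.active) == accepted <= max_users
    return accepted, rejected


if __name__ == "__main__":
    for multiplier in (1, 2, 10):
        print(multiplier, overload_test(max_users=5, multiplier=multiplier))
```

Starting with max_users=5 keeps the overload cheap to produce. The same test then runs unchanged as the number grows to 50 or 500, and every user beyond the limit gets the polite "we're sorry" instead of a "what the hell is going on" moment.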

    The above steps don’t mean you’re ready for 50 million users yet, not by a long shot (future Problems Solving Problems entries should make it very clear just how much remains to be done), but it does mean you have at least thought about the problems brought on by too many users, and you’ve thought about those problems when it’s still early enough and easy enough to do something about them.

    Apologies to OMGPOP

    BTW, OMGPOP appears to have done a really amazing job of preparing for explosive growth, and I have done no research to show that this particular EAGLE/MULTIPLY bug is due to growing pains on their part. I only pick on OMGPOP because, geez, if you can’t pick on someone who is currently flying so high, who can you pick on?

    Today's Takeaway: To prepare for too much demand on resources, you must simulate too much demand on resources. That often involves lying to yourself about what is "too much".

    Saturday, April 7, 2012

    Got a problem? Step 1: Start a Blog.

    This blog companion to is a place to discuss why there are so many problems, why there are so many problems solving those problems, and why that’s such a problem.

    Mostly this will be about software problems, which are generally known as “bugs”.

    Here at ProblemsSolvingProblems we’re not fond of bugs. We don’t like that there are so many bugs. And we really hate it that the people who make those bugs aren’t bugged about it. We want to change that. We want to bug the people who let those bugs slide, and then learn how to keep it from happening again.

    Today's Takeaway: Bugs are bad.