I'm excited to do this post, because I'm particularly fond of failure. It's just so helpful. (I also keep a Laney #fail page in a gdoc that I've found pretty useful. Highly recommend.)
TSG / TSRomania #Fail
The Story
Our partner, TSRomania, is running a technology-based challenge on the new Net2 beta platform. When they launched and made the challenge widely known to the public (through hotnews.ro, a popular Romanian site), the site couldn't keep up with the sudden traffic spike and the poor thing crashed.
We got it back up later (as soon as I woke up, more on that in a sec) and spent about a day doing root-cause analysis. Although the coordination and ownership of the root-cause analysis was a bit bumpy, after a day we were confident that the site was stable and ready to re-launch. But TSRomania didn't make the site public again until another week had gone by and they had become just as confident about site stability.
(More on this one in a formal post-mortem TK separately.)
The #Fail
The site crash is not the fail. There are two parts to the failure:
- Public launch of a quiet site. I (still) think of the Net2 beta platform as, well, a beta. I expect things to break and the site to be down occasionally. But TSRomania had announced the challenge on a site that gets tons of traffic and attention in Romania. To them and their users, it's a production site.
- Unnecessary downtime. It happened early morning in Romania, around midnight here. I was the first POC for TSRomania and I didn't know about it till I checked my email in the morning. I got the site back online, but by then it had been down for nearly 7 hours.
Key Learnings
My key learnings from this experience:
- Off-hours coverage. The FTS ops team is helping us out now, ensuring first-tier coverage during their business hours. I should've worked that out with them much earlier. (A related learning: FTS is now covering net2.org in off-hours as well. Same reason.)
- Manage expectations inside and out. I knew of TSRomania's announcement plans, but didn't make the connection that they would expect pretty good uptime and support, given the public nature of the launch.
Amazon #Fail
This one I wasn't involved in personally, really. But it was a spectacular, public failure, and my team (Amazon search) was definitely impacted. Plus, in retrospect, it's kind of a fun story.
The Story
Back in April '09, some Amazon.com users and bloggers noticed and widely reported that Amazon had removed BGLT-topic books. Or classified them as "adult". Or burned them, or something. I was on-call for the search team that week and got paged in the wee hours after one of those blog posts to figure out what had gone wrong.
We learned later, after much scrambling and debugging, that an engineer in the amazon.fr office had, in fact, classified a whole huge chunk of books as "adult", a classification that has a few implications, including a far lower ranking in search results. Now, from his perspective, he was casting a very wide net for "adult". He didn't target BGLT books, but amazon.fr offers fewer books than amazon.com, and they both pull from the same catalog. The overlap between the two happened to be BGLT-slanted for whatever reason. So on amazon.fr, it looked like everything that was even remotely sexual was deranked, but on amazon.com it looked like only BGLT-themed books were.
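If the overlap effect is hard to picture, here is a toy sketch of it. Everything in it is made up for illustration: the titles, the theme tags, which storefront lists what, and the `flag_adult` rule itself (I'm assuming the flag hit every remotely sexual book visible from amazon.fr, which is my reading of the story, not the actual change).

```python
# Toy illustration only: every title, theme tag, and storefront assignment
# here is invented for the sketch; none of it is real catalog data.
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Book(NamedTuple):
    title: str
    themes: FrozenSet[str]
    stores: FrozenSet[str]  # storefronts that list this shared-catalog entry


CATALOG: List[Book] = [
    Book("A", frozenset({"sexual", "bglt"}), frozenset({"fr", "com"})),
    Book("B", frozenset({"sexual", "bglt"}), frozenset({"fr", "com"})),
    Book("C", frozenset({"sexual"}), frozenset({"fr"})),
    Book("D", frozenset({"sexual"}), frozenset({"com"})),
    Book("E", frozenset({"sexual"}), frozenset({"com"})),
    Book("F", frozenset({"cooking"}), frozenset({"fr", "com"})),
]


def flag_adult(catalog: List[Book], working_store: str) -> Set[str]:
    """Assumed rule: flag every 'sexual' book listed on the working storefront."""
    return {
        b.title
        for b in catalog
        if working_store in b.stores and "sexual" in b.themes
    }


def storefront_view(
    catalog: List[Book], flagged: Set[str], store: str
) -> Tuple[List[str], List[str]]:
    """Return (flagged, untouched) titles among one storefront's 'sexual' books."""
    sexual = {
        b.title for b in catalog if store in b.stores and "sexual" in b.themes
    }
    return sorted(sexual & flagged), sorted(sexual - flagged)


# The flag is applied from the fr side but lands in the shared catalog.
flagged = flag_adult(CATALOG, "fr")
print("fr :", storefront_view(CATALOG, flagged, "fr"))
print("com:", storefront_view(CATALOG, flagged, "com"))
```

In this toy catalog, every sexual-themed book on the fr storefront ends up flagged, so the net looks wide. On the com storefront the only flagged books are A and B, both BGLT-tagged, while D and E sit there untouched, so the very same change looks targeted.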
The #Fail
Engineers at Amazon, as at most tech companies, have an awful lot of access to all the tech systems: code bases, servers, etc. If I had felt like it, I could've taken the whole site down on a whim (never did, though, promise). In general, I like that approach a lot (it's like Toyota's anyone-can-pull-the-cord philosophy), but in this case clearly the engineer didn't understand the implications of what he was doing. It was a small code change, and he thought it would have minor implications.
The primary failing, though, was PR-related. Amazon miscommunicated and under-communicated with the press and on social media. Rumors were flying around like crazy, and plenty of people were calling Amazon homophobic well after we corrected the problem. We even instituted a code freeze for a couple of weeks, in case we introduced another bug that could be similarly misinterpreted while all eyes were on us.
Key Learnings
I think Amazon tried to learn from its PR approach here, and they did a bit better more recently, during an EC2 outage.
On the engineering side, we were all a lot more cautious for a little while. Really, the lesson for us was: code review every time. No new formal process was implemented, but several of us definitely stopped thinking, "Eh, this is such a minor change, I don't need a review." Every time, no matter how small.