The Bug that Took Down a Release and How I Wrote It

We were disposing of some last minute bugs today and I was reminded of the time, a few projects ago, when I introduced a bug that screwed up a whole release.

Let's say that I was writing a pizza ordering web site and, after many successful releases, the Big Wigs at PizzaCo wanted to introduce a new feature that would let customers customize their pizzas even more. In production the user could select style of pizza (thin, pan, stuffed), size, and toppings. There was even some cool logic in place to make sure the toppings were available before being offered. Now, in this brave new world of pizza, the customer would be given the chance to customize each topping. If you chose peppers, you would be given the choice of green, yellow, or red (through the magic of javascript). A choice of onions would prompt the question of red or yellow. Pepperoni? -- regular or spicy?

So we wrote the feature, which was difficult, as we had to totally change how we interacted with the Topping Provider Service (TPS) in order to reserve these new toppings and check for availability. But then the Big Bosses had a thought: "What if this new extra bit of choice scares away the customer?" And their solution was to have "switch" to turn the extra customization on and off. As developers we pointed out that this would be costly as making the new system look like the old, but still work with the new under the covers (by selecting defaults for each topping), was not easy. And it would introduce extra complexity by having two different pages that do almost the same thing. But they wanted it, so we did it.

A few weeks later we found a bug with the topping selection page: If you take the last bit of pepperoni, but later come back to the toppings page to change your selections, the TPS service thinks all the pepperoni is gone (because you just reserved it) so you don't see that option on the page. We fixed it by looking at the customer's saved pizza and putting any toppings missing from the page back on the page. Except that we forgot to make the changes in the page that mimics the old behavior. Even worse, the way we fixed the new toppings page caused the pseudo-old page to blow up if you ever try to go back to it. And we found out that this happens right before a release.

And this was all my fault.

I worked on the feature to add the new customization, I also worked on the feature to hide the new feature, and (believe it or not) I worked on the bug fix. There was no one else on the team who was in a better position to remember that there were two pages and each need a change. Aside from my obvious point that I'm a terrible developer and you should never hire me, I think there's an interesting lesson here about unused functionality.

We wrote the new page and then we used it. It did all sorts of sexy Ajax and was cool. And after a few weeks we forgot about the legacy page and things that might break it. There are many times when it will be tempting to bifurcate your code, but every time you do so you create a place for a bug. This is especially true if one path is rarely used. The new code will continue to evolve and the old will sit around and collect bugs. We did a full pass through regression testing, in addition to our normal tests and QA, but we didn't find this bug until hours before the release. Which is better that finding it after, but still, if the release had gone out it would have had the new feature turned off (for "safety") and the monster bug on. We were mere hours away from disaster.

I know it's hard not to want options, but every option has its price.


Stephan said…
That sounds a lot like a lesson learned. In fact it immediately rings the DRY(oo) bell (for 'Don't repeat yourself - or others').

Anyway, it is a little bit surprising that the regression tests didn't find the bug. In my opinion that indicates that the test coverage wasn't as good as it should have been. Given the story it was an important part of the system to show/hide options. So, to me at least, the remaining question is, why there weren't (executed) test cases for this scenario.
Jake Scruggs said…
Yep. I totally agree that the regression test suite should have caught the bug. The company I was working for at the time did not have an automated test suite. So after they found the first bug, the testing they did of the fix was probably not as rigorous as their first pass.

If you're saying, however, that there should have been a testcase for this in the developer build I disagree. I've been down the path where developers try to maintain an extensive click through the app build (using selenium or something similar) and it leads to long builds (30 minutes or more). I'm now of the opinion that one of the developer builds (we had 5 that kick off on every check-in) should do 5-6 minutes of clicking through the app as a smoke test. More than that and you're into QA territory -- and they, of course, should use automated suites.
stephan said…
Well, as long as there actually are (automated) tests, the situation is already so much better, that it's more an optimisation to move the test cases to the best place. Of course, the meaning of 'best' needs to be agreed upon in different ways in different projects: Should be early in the development cycle (in order to find bugs as early as possible), should not (as you said) slow down the build cycle, should match the scope of the test (no point in unit testing issues that can only show up in integration testing), and may be a few others, depending on the project at hand.
And of course I completely agree to using automated test environments.
Taryn said…
I agree with stephan here. This is sounds like a job for "rake test".
You shouldn't need something as heavy as selenium to check if your app works on both the AJAX and non-AJAX pages.
You can do a lot with just your functional tests and passing in the appropriate parameters+content-type.

That said I've made exactly the same mistake myself - ie not providing test coverage for the legacy version (or the non-AJAX version) of a page :P
Lack of complete test coverage always comes back to bite you in the end!

Popular posts from this blog

What's a Good Flog Score?

SICP Wasn’t Written for You

Point Inside a Polygon in Ruby