Defining Done for Your Deployment Process

A Tale of Debugging

The other day, someone came to me and told me that a component of one of the web applications that my team maintains seemed to have a problem. I sat with her, and she showed me what was going on. Sure enough, I saw the issue too. It was the kind of integration thing for which we were able to muster up some historical record and see approximately when the problem had started. Apparently, the problem first occurred around the last time we pushed a version of the website into production. Ruh roh.


Given that this was a pretty critical issue, I got right down to debugging. Pretty quickly, I found the place in the code where the integration callout was happening, and I stepped through it in the debugger. (As an aside, I realize I’ve often made the case against much debugger use. But when legacy code has no unit, component, or integration tests, you really don’t have a lot of options.) No exceptions were thrown, and no obvious problems were occurring. The code hit a third party API, came back from the call uneventfully, and quietly failed.

At this point, it was time for some tinkering and reverse engineering. When I looked at the method call that seemed to be the meat of the issue, I noticed that it returned an integer that the code I was debugging just ignored. Hovering over it, the XML doc comment engine told me that it was returning an error code. I would have preferred an exception, but whatever — this was progress. Unfortunately, that was all the help I got, and there was no indication what any returned error code meant. I ran the code and saw that I was getting a “2,” so presumably there was an error occurring.

Maddeningly, there was no online documentation of this API. The only way I was able to proceed was to install a trial version of this utility locally, on my desktop, and read the CHM file. (They offered one through their website, but it was broken.) After digging, I found that this error code meant “call failure or license file missing.” Hmm… license file, eh? I started to get that tiny adrenaline rush that you get when a solution seems like it might be just around the corner. I had just replaced the previous deployment process of “copy over the files that are different to the server” with a slightly less icky “wipe the server’s directory and put our stuff there.” It had taken some time to iron out failures and bring all of the dependencies under source control, but I viewed this as antiseptic on a festering sore. And, apparently, I had missed one. Upon diving further into the documentation, I saw that it required some weirdly-named license file with some kind of key in it to be in the web application root’s “bin” folder on the server, or it would just quietly fail. Awesome.

This was confirmed by going back to a historical archive of the site, finding that weird file, putting it into production and observing that the problem was resolved. So time to call it a day, right?

Fixing the Deeper Issue

Well, if you call it a day now, there’s a good chance this will happen again later. After all, the only thing that will prevent this after the next deployment is someone remembering, “oh, yeah, we have to copy over that weird license file thing into that directory from the previous deploy.” I don’t know about you, but I don’t really want important system functionality hinging on “oh, yeah, that thing!”

What about a big, fat comment in the code? Something like “this method call will fail if license file xyz isn’t in abc directory”? Well, in a year, when everyone has forgotten this and there’s a new architect in town, that’ll at least save a headache the next time this issue occurs. But this is reactive. It has the advantage of not being purely tribal knowledge, but it doesn’t preemptively solve the problem. Another idea might be to trap error codes and throw an exception with a descriptive message, but this is just another step in shortening the gap between failure and resolution. I think we should try to avoid failing at all, though comments and better error trapping are certainly good ideas in addition to whatever better solution comes next.
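To make the error-trapping idea concrete, here’s a minimal sketch in Python. The vendor call, its error codes, and the function names are stand-ins invented for illustration — the real API in the story is a different, undocumented library:

```python
# Hypothetical sketch: render_report() stands in for the vendor call in the
# story; the real API and its error codes differ.

ERROR_MESSAGES = {
    0: "success",
    2: "call failure or license file missing",
}

def render_report() -> int:
    """Pretend third-party call that returns an error code instead of raising."""
    return 2  # the code observed in the story

def render_report_or_raise() -> None:
    """Wrap the silent error code in a loud, descriptive exception."""
    code = render_report()
    if code != 0:
        meaning = ERROR_MESSAGES.get(code, "unknown error")
        raise RuntimeError(f"Report call failed with code {code}: {meaning}")

try:
    render_report_or_raise()
except RuntimeError as e:
    print(e)
```

The point of the wrapper is simply that the next person to hit this failure sees a descriptive message immediately instead of spending an afternoon in the debugger.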

What about checking the license file into source control and designing the build to copy it to that directory? Win, right? Well, right — it solves the problem. With the next deploy, the license file will be on the server, and that means this particular issue won’t occur in the future. So now it must be time to call it a day, right?
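A deploy step like that can be tiny. Here’s a hedged sketch; the file name `vendor.lic` and the directory layout are invented placeholders, not the actual paths from the story:

```python
# Hypothetical sketch of a deploy step that copies a checked-in license
# file into the server's bin directory. Paths and names are placeholders.
import os
import shutil

def deploy_license(repo_root: str, server_root: str) -> str:
    """Copy the license file from source control into the server's bin folder.

    Returns the path of the deployed file.
    """
    src = os.path.join(repo_root, "deploy", "vendor.lic")
    dest_dir = os.path.join(server_root, "bin")
    os.makedirs(dest_dir, exist_ok=True)  # bin may not exist after a full wipe
    return shutil.copy(src, dest_dir)
```

Once a step like this runs as part of every build, “wipe the directory and redeploy” can no longer silently drop the file.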

Still no, I’d argue. There’s work to be done, and it’s not easy or quick work. Because what needs to happen now is a move from a “delete the contents of the server directory and unzip the new deliverable” deployment to an automated build and deployment. What also needs to happen is a series of automated acceptance tests in a staging environment and possibly regression tests in a production environment. In this situation, not only are developers reading the code aware of the dependency on that license file, not only do failures happen quickly and obviously, and not only is the license file deployed to where it needs to go, but if anything ever goes wrong, automated notifications will occur immediately and allow the situation to be corrected.
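One small piece of that maturity — the post-deploy sanity check — can be sketched briefly. Everything here is assumed for illustration (the paths, the `vendor.lic` name, the check list); a real version would also hit a health-check URL, run read-only queries against each database, and notify someone on failure:

```python
# Hypothetical sketch of a post-deploy smoke check; paths and file names
# are invented placeholders, not the real application's layout.
import os

def check_license_file(bin_dir: str, license_name: str) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    if not os.path.isfile(os.path.join(bin_dir, license_name)):
        failures.append(f"missing license file {license_name} in {bin_dir}")
    return failures

def run_smoke_checks(bin_dir: str) -> list[str]:
    # A real suite would aggregate many checks: connectivity, credentials,
    # mapped drives, health-check endpoints, and so on.
    return check_license_file(bin_dir, "vendor.lic")

for failure in run_smoke_checks("/var/www/app/bin"):
    print("SMOKE TEST FAILURE:", failure)
```

The design choice worth noting is that checks return messages rather than raising, so one missing dependency doesn’t hide the others in the report.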

It may seem pretty intense to set all that up, but it’s the responsible thing to do. Along the spectrum of some “deployment maturity model,” tweaking things on the server and forgetting about it is whatever score is “least mature.” What I’m talking about is “most mature.” Will it take some doing to get there and probably some time? Absolutely. Does that mean that anything less can be good enough? Nope. Not in my opinion, anyway.

  • http://genehughson.wordpress.com/ Gene Hughson

    Very well put. IMO, “the deeper issue” is where we make our money. Too much focus on “make the error go away” without addressing the underlying issues leads to a constant cycle of breaks and fixes that leeches time away from productive work and contributes to building a big ball of mud.

  • Jordan Whiteley

    I’ve taken to creating ‘smoke tests’ on most of my services to address this issue. Basically, as soon as the service is up on its feet, I fire up a new thread that runs a quick query on all the used databases, picks up and plays with all the services, and verifies the existence of all flat files and directories that will be in use.

    Better the program crashes 10 seconds after it gets on its feet because the database rejects your credentials than letting it crash while I’m at home eating dinner and someone needs their reports.

  • http://www.daedtech.com/blog Erik Dietrich

    I’m dealing with the aftermath of that pretty heavily right now. Death by a thousand cuts of “it’ll be quicker just to add this in manually for the time being.”

  • http://www.daedtech.com/blog Erik Dietrich

    I like this concept a lot, the more I think about it. My brain is starting to wrap around a “production ‘unit’ test concept” where I turn some non-invasive set of sanity checks loose on the production site first thing after deployment.

  • Jordan Whiteley

    I cannot express to you the number of times that this has saved me from, “User only has read rights to the error log”, “One of the connection strings is pointing at a staging machine”, “Network drive not mapped”, “Decryption requests are not allowed from this machine”.

    Also you get many brownie points from your ops guys. It really saves their skin when they switch out hardware and there’s a message that pops up and right away it says, “X isn’t working, it threw this exception{0}, maybe go check Y.” They would much rather see that than to have everything blow up right after flipping addresses on a machine.