Why Production Monitoring Can Come Too Late
Editorial Note: I originally wrote this post for the Stackify blog. You can check out the original here, at their site. While you’re there, have a look around at how their offering can help you hunt down issues from development to production.
I’ve spent a number of years, now, writing software. At the risk of dating myself, I worked on software in the early 2000s. Back then, you couldn’t take quite as much for granted. For example, while organizations considered source control a good practice, forgoing it wouldn’t have constituted lunacy the way it does today.
As a result of the difference in standards, my life shipping software looked different back then. Only avant-garde organizations adopted agile methodologies, so software releases happened on the order of months or years. We thus reasoned about the life of software in discrete phases. But I’m not talking about the regimented phases of the so-called “waterfall” methodology. Rather, I generalize it to these phases: build, prep, run.
During build, you mainly solved the problem of cranking through the requirements as quickly as possible. Next up, during prep, you took this gigantic sprawl of code that only worked on dev machines, and started to package it into some kind of deployable product. This might have meant early web servers or even CDs at the time. And, finally, came run. During the run phase, you’d maintain vigilance, waiting for customer issues to come streaming in.
Bear in mind that we would, of course, work to minimize bugs and issues during all of these phases. But at that time with most organizations, having issues during the “run phase” constituted a good problem to have. After all, it meant you had reached the run phase. A shocking amount of software never made it that far.
Monitoring and Software Maturity
We’ve come a long way. As I alluded to earlier, you’d get some pretty incredulous looks these days for not using source control. And you would likewise receive incredulous looks for a release cycle spanning years, divided into completely disjoint phases. Relatively few shops view their applications’ production behavior as a hypothetical problem for a far-off date anymore.
We’ve arrived at this point via some gradual, hard-won victories over the years. These have addressed the phases I mentioned and merged them together. Organizations have increasingly tightened the feedback loop with the adoption of agile methodologies. Alongside that, vastly improved build and deployment tooling has transformed “the build” from “that thing we do for weeks at the end” to “that thing that happens with every commit.” And, of course, we’ve gotten much, much better at supporting software in production.
Back in the days of shrink-wrap software and shipping CDs, users reported problems via phone call. For a solution, they developed workarounds and waited for a patch CD in the mail. These days, always-connected devices allow for patches with arbitrary quickness. And we have software that gets out in front of production issues, often finding them even before users do.
Specifically, we now have sophisticated production monitoring software. In some cases, this means simply watching for outages and supplying alerts. But we also have sophisticated application performance monitoring (APM) capabilities. As I said, we’ve come a long way.
The Remaining Blind Spot
But does a long way mean we’ve come all the way? I would argue that it most certainly does not. In the industry, we still have a prominent blind spot, even with our dramatically improved approach.
Many shops set up continuous integration and automated deployment to internal environments. This practice neatly fuses the build and prep phases. In these internal environments, they run automated regression tests, manual tests, and even smoke or load tests. They diligently exercise the software. But they omit monitoring it. And, in doing so, they preserve the vestige of the phased approach. Now, instead of the three-legged “build, prep, run,” you have the two-legged “build-and-prep, run.”
By not monitoring the software in production-like conditions, shops set themselves up for a category of issues likely only to be seen in production. Often, they’ll view this sort of monitoring in their lower environments as superfluous, since their testing strategy should surface any problems. They’ll save the money and effort of monitoring for the production deployment. But this monitoring can come too late.
APM tools do a lot more than give you a jump on outages or performance problems. They also furnish a lot of valuable information. They reveal performance bottlenecks, show you the source of underlying errors, and help you quickly get to the bottom of issues.
But if you find yourself doing all of this in production environments alone, you’ve missed an opportunity to see all of this before you ever ship. For example, imagine a scenario where your application severely under-performs because of a correctable mistake, such as a massively inefficient query. But it doesn’t under-perform enough to run afoul of your load and smoke testing efforts. And furthermore, it displays the correct behavior from a functional standpoint. So, in spite of this mistake, the code makes it through all rounds of testing.
In production, you don’t trigger any alerts. At least, not at first. But as things scale up, your users begin to report poor experience, and your APM tool starts logging more and more red flags. You notice this, get to the bottom of it, and discover the underlying problem. From there, you fix the issue and make everyone happy.
But why did it need to get that far? You’re reacting to something that you could have caught much earlier. This needlessly wastes a good bit of time, money, and customer goodwill.
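To make the inefficient-query scenario a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `customers` and `orders` tables, the function names, and the `QueryCounter` class, which is only a hypothetical stand-in for the kind of per-request query metric an APM tool might surface. Both functions return identical results, so a functional test passes either way. Only the monitoring number shows the difference.

```python
import sqlite3


class QueryCounter:
    """Hypothetical stand-in for an APM agent: counts statements sent to the DB."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def build_db(customers=50):
    """Create an in-memory database with one order per customer (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    ids = range(1, customers + 1)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(i, f"customer-{i}") for i in ids]
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i, 10.0 * i) for i in ids],
    )
    conn.commit()
    return conn


def totals_one_query_per_customer(db):
    """Functionally correct, but issues one extra query for every customer."""
    result = {}
    for customer_id, name in db.query("SELECT id, name FROM customers"):
        rows = db.query(
            "SELECT SUM(total) FROM orders WHERE customer_id = ?", (customer_id,)
        )
        result[name] = rows[0][0]
    return result


def totals_single_join(db):
    """Same answer, computed with a single joined query."""
    rows = db.query(
        "SELECT c.name, SUM(o.total) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name"
    )
    return dict(rows)


if __name__ == "__main__":
    slow, fast = QueryCounter(build_db()), QueryCounter(build_db())
    same = totals_one_query_per_customer(slow) == totals_single_join(fast)
    print("same results:", same)
    print("queries issued:", slow.count, "vs", fast.count)
```

With a few dozen rows in a test environment, both versions feel instant, which is exactly why load and smoke tests can miss the problem. A query count per request is visible long before latency is.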
Backed into a Design Corner
But it can get even worse than that. Notice that in the hypothetical scenario I just outlined, a relatively easy fix presented itself. Team members tracked down the issue, resolved it, and presumably issued a quick patch. But what happens when it’s not quite that simple?
I’ve seen plenty of instances where an unnoticed mistake becomes intractable. People build on such mistakes with the assumption of a good foundation. Then they build some more on top of what they built on top of the assumption. Before you know it, discovering this mistake might mean a choice between just working around it and significant rework. I think anyone who has spent significant time working with legacy code can relate.
You can thus pay a terrible price for not catching a mistake like this early, one that extends far beyond catching it in production instead of just before shipping. You can pay a price for not catching the mistake before sealing it into the foundation of your codebase.
We write all sorts of automated tests for this reason, and practitioners of TDD cite this as one of the core benefits. Catch mistakes as close to making them as possible. By failing to deploy monitoring capabilities in lower environments, you run the same kind of risk that you run by skipping these types of automated tests.
Getting Rid of Phases Once and For All
When software delivery really hums along, we don’t have any real, distinct phases the way that we used to. Instead of segmenting, we get into tight feedback loops of “build-prep-run.” Organizations like Facebook have this automated the whole way, and wind up with a fused use of their tooling.
In general, I would recommend doing everything in your power to make lower environments look as much like production as possible. You might not get all the way there. Take Netflix, for example. It unleashes its “Chaos Monkey” because it can’t possibly replicate the mammoth scale it requires in any environment but production. But it uses this tool to do the best it can in terms of reducing risk.
You probably don’t deal with their kinds of scale and testing limitations, but you can take their example nonetheless. They fuse build-prep-run and leverage their available tools to address all of those concerns simultaneously. You should do the same. If you’re monitoring your software in production, start monitoring it in your other environments as well to catch mistakes before they have a chance to fester.
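As a rough illustration of what monitoring in every environment can look like at its simplest, the sketch below wraps a function in a timing check that records any call exceeding a time budget, tagged with the environment it ran in. The `monitored` decorator, the `SLOW_CALLS` list, and the environment names are all hypothetical. A real APM tool does far more and would ship this data to its own backend. The point is only that the same instrumentation runs everywhere, not just in production.

```python
import functools
import time

# Hypothetical in-memory sink; a real monitoring tool would report elsewhere.
SLOW_CALLS = []


def monitored(budget_seconds, environment, clock=time.perf_counter):
    """Record calls that take longer than budget_seconds in the given environment."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = clock() - start
                if elapsed > budget_seconds:
                    SLOW_CALLS.append((environment, func.__name__, elapsed))

        return wrapper

    return decorator


# Same decorator, same budget, whether the code runs in "staging" or "production".
@monitored(budget_seconds=0.2, environment="staging")
def build_report():
    time.sleep(0.3)  # stands in for slow work, such as an inefficient query
    return "report"


if __name__ == "__main__":
    build_report()
    for environment, name, elapsed in SLOW_CALLS:
        print(f"[{environment}] {name} took {elapsed:.2f}s, over budget")
```

The `clock` parameter exists only so the timing source can be swapped out in tests. In normal use the default is fine.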