Application Instrumentation Patterns

There are many different ways to instrument your application – and many tools available to help with the effort. I’m going to show how you can get a lot of “free” instrumentation out of any size system with just a little bit of code and a little bit of discipline. Let me start by saying that none of this is news to many people – the patterns and practices I’ll discuss have been around for a long time. But it amazes me how, in nearly every system I have met (or even heard of), instrumentation and logging are essentially random, thanks to their organic evolution. It’s understandable given the time constraints and short-sightedness that typically come with real projects, but it doesn’t have to be that way.

This post is based on a talk I have given at a few venues entitled Creating Highly Instrumented Applications with Minimal Effort.  I hoped to appeal to both the lazy AND the curious developer with this.  The lazy part is obvious, but you have to be curious (or care?) to want to know this level of detail about how your system works.

Let’s start with a scenario so that you can get an idea of the punchline here: a customer calls in and says that they have been getting errors every time they try to update their account in your application. With one element of context (the error ID on the screen they see, OR their email address, OR some other identifying element) I can answer the following questions in very little time:

  1. The last time they had that error
  2. The first time they had that error
  3. The component that caused the error
  4. (Potentially) The root cause of the error
  5. The context in which they were working
  6. What they were doing before that error
  7. How many times they tried
  8. How many other customers had that error today
  9. How many other customers have ever had that error
  10. The first time any customer had that error

This all sounds pretty useful – root cause analysis, the ability to reproduce the scenario, customer impact over time, and even tying it back to a release that may have introduced a bug in the system.

So how do you get this level of insight from your application?  You instrument it.

Types of System Output

There are a lot of different types of output from both the system (operating system and infrastructure services) and your application.  This is how I distinguish between them.

System Instrumentation

This is a class of very low-level data – threads, memory, disk I/O – that relates to the health and performance of the OS and infrastructure.

  • JMX
  • WMI
  • SNMP

System Logging

This is a class of data that doesn’t plug into operational monitoring tooling as well, but gives the system and infrastructure some place to dump stuff that we typically ignore unless something goes wrong.

  • system.log (*nix)
  • Event.log (Windows)
  • apache.log
  • nginx.log
  • iis.log

Application Instrumentation

Application Instrumentation is the star of the show for us here – it’s useful when things go wrong, but it’s just as useful when things are going just fine. We’ll see why a bit later, but here are some of the characteristics of Application Instrumentation:

  • Cross-cutting (free) and non-intrusive
  • Provides app activity metrics
  • Can passively trigger alerts (based on rules)
  • Source of performance data

Application Logging

Application Logging is

  • Intentional, in other words a developer has to go to the trouble to do it
  • Typically relates to Business Transactions and how well they worked out
  • Overtly triggers alerts (the database is down!)
  • Source of business metrics ($20 subscription processed)
  • Aids in troubleshooting failures, bugs

Quality Data

Let’s look at some examples of log output that I’ll randomly sample from my own laptop. You can follow along if you’re on a Mac by opening the Console application and browsing around. On Windows you can open up the Event Viewer and do the same. The goal here is to determine how useful the log output is, and grade it accordingly. We’ll discover what quality means as we intuitively see the utility in certain characteristics of the data provided.

2012-09-28 11:14:26:783|SyncServer|38272|0x10ad16f50|Logging|Info| Logging initialized, engine version 673.6 : log level 3
2012-09-28 11:14:26:927|Mail|38269|0x104e23700|ISyncManager|Info| register client from /System/Library/Frameworks/Message.framework/Versions/B/Resources/Syncer.syncschema/Contents/Resources/MailNotesClient.plist (bundleId = (null), bundleRelativePath = (null))
2012-09-28 11:14:26:961|Mail|38269|0x7fee75613020|Stats|Stats| slow sync,
2012-09-28 11:14:26:964|Mail|38269|0x7fee75613020|ISyncSession|Info| prepare to slow sync,
2012-09-28 11:16:26:993|SyncServer|38272|0x10ad16f50|Server|Info| Cancelling all sync plans.
2012-09-28 11:16:27:029|SyncServer|38272|0x10ad16f50|Server|Info| Goodnight, Gracie.

Here’s what I find useful:

  1. Timestamp
  2. Who created the log  (SyncServer)
  3. What part of the SyncServer made the log statement (eg: Logging, Server)
  4. Some kind of session or process id to tie things together
  5. The Log Level (Info)
  6. Pipe (|) separated values

What I don’t find useful

  1. Non-semantic logging (commas, colons, periods)
  2. It said stats but I don’t see any numbers

Let’s do another.

2013-10-07 08:18:41.600 GoogleSoftwareUpdateAgent[27318/0xb0207000] [lvl=1] -[KSCheckAction performAction] KSCheckAction found installed product for ticket: <KSTicket:12345
 xc=<KSPathExistenceChecker:0x1d259a0 path=/Applications/Google>
 creationDate=2012-09-14 22:20:29

Here’s what I find useful:

  1. Timestamp
  2. Who created the log  (GoogleSoftwareUpdateAgent)
  3. The Action that was being performed
  4. Some kind of session or process id to tie things together
  5. The Log Level (1)
  6. Some semantic (name = value) values

What I don’t find useful

  1. Some Non-semantic logging (colons, < >, LF delimiters)

One more, this time Tomcat starting up

Oct 09, 2013 9:35:14 PM <init>
INFO: tc Runtime property decoder using memory-based key
Oct 09, 2013 9:35:14 PM <init>
INFO: tcServer Runtime property decoder has been initialized in 225 ms
Oct 09, 2013 9:35:15 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-8080"]
Oct 09, 2013 9:35:15 PM com.springsource.tcserver.serviceability.rmi.JmxSocketListener init
INFO: Started up JMX registry on in 123 ms
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 884 ms
Oct 09, 2013 9:35:15 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
Oct 09, 2013 9:35:15 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: VMware vFabric tc Runtime 2.7.2.RELEASE/7.0.30.A.RELEASE
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /deploy/webapps/manager
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /deploy/webapps/ROOT
Oct 09, 2013 9:35:15 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-8080"]
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 470 ms

Here’s what I find useful:

  1. Timestamp (on alternate lines?)
  2. Who created the log (org.apache.catalina…)
  3. The Action that was being performed (init?)
  4. The Log Level (INFO)
  5. Some metrics (startup times)

What I don’t find useful

  1. Mostly non-semantic logging (no delimiters or name=value pairs)
  2. Split lines

From this we can see that what we find useful is basically the classic who, what, when, where, and how, presented in a structured and consistent manner – data that is both human readable AND machine readable. And guess what: when you have all that, you have some potentially “quality” data.
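To make the machine-readable point concrete, here is a small sketch (in Java, not tied to any logging product) that recovers the fields from a pipe-delimited line like the SyncServer sample above. The field names in the `names` array are my own assumption about the layout, purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a few lines of code are enough to turn a consistent,
// pipe-delimited log line back into named fields for analysis.
public class LogParser {
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        String[] parts = line.split("\\|");
        // Assumed positional layout, modeled on the SyncServer sample above.
        String[] names = {"timestamp", "app", "pid", "thread", "category", "level", "message"};
        for (int i = 0; i < parts.length && i < names.length; i++) {
            fields.put(names[i], parts[i].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "2012-09-28 11:14:26:783|SyncServer|38272|0x10ad16f50|Logging|Info| Logging initialized";
        Map<String, String> f = parse(line);
        System.out.println(f.get("app") + " / " + f.get("level"));
    }
}
```

The same split-and-name approach is what tools like grep, awk, or a log indexer do at scale – which is exactly why the consistent delimiters matter.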

Theory and Best Practices

There are two principles that I follow that have enabled this type of scenario to be a reality, rather than a fantasy.

  1. Create consistent, semantic data that can be analyzed with a variety of tools
  2. Get in the middle of everything in your system, time it, log it, and maintain contextual information along the way

Good Data

  • Timestamp everything – be sure to use global timestamps (UTC – 2013-08-21 22:43:31,990)
  • Use key=value pairs (name=value) and delimit your data properly (use pipes, semicolons, whatever works). When you go to analyze this data, trust me, it will be easier. Microsoft has gone so far as to create an “Application Block” called the Semantic Logging Application Block that formalizes this type of approach.
  • Set and log the context data when you have it (who/what/where/when/how) because it’s interesting (as we have seen) and very useful for correlation downstream.
  • Leverage log categories to identify what tier, and better yet what component, emitted the log.
  • Time everything. It’s easy, it’s free, and you will thank yourself later when you have that aha moment when something related to the timings yields some insight that makes your system better.
  • Be conscious of security: never log sensitive data that could put you or your customers’ privacy and data at risk.
  • Be consistent in naming (action=purchase; sale=oct13; productId=123123) vs (action=buy; promo=oct13; sku=123123).
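As a sketch of what the list above adds up to, here is a minimal Java formatter that emits one timestamped, pipe-delimited line of key=value pairs per event. The `LogLine` and `kv` names are illustrative, not from any real logging library:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a semantic log line: a UTC timestamp followed by
// pipe-delimited key=value pairs, in a stable insertion order.
public class LogLine {
    private final Map<String, String> fields = new LinkedHashMap<>();

    public LogLine kv(String key, String value) {
        fields.put(key, value);
        return this;
    }

    @Override
    public String toString() {
        // Instant.now() is always UTC, satisfying the "global timestamps" rule.
        StringBuilder sb = new StringBuilder(Instant.now().toString());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append('|').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(new LogLine()
                .kv("level", "INFO")
                .kv("component", "Billing")
                .kv("action", "purchase")
                .kv("productId", "123123"));
    }
}
```

In a real system you would hand this string to your logging framework rather than System.out; the point is only the consistent shape of the data.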

Get In the Middle Of Everything

Getting in the middle of everything in your system is the source of all your good data. Without doing this, you can follow all of the above best practices against very little or random data. There are many ways to meddle in your own business, and most of them are incredibly simple, cheap and non-invasive. The secret is to leverage existing patterns and frameworks to do the dirty work. In the context of web applications, life is good because there aren’t really any viable frameworks that don’t have a pipeline of some sort. Whether you are in Java, .Net, Node.js or Ruby, you have some way to intercept incoming calls and outbound responses (and in some cases MANY ways). This is commonly used for audit, security, logging and performance tracking. I have combined everything but security into one interceptor, because all of that information is more useful together: it forms the basis of the context that you will want to propagate downstream, so that subsequent logging statements are enriched with it.
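The pipeline idea can be sketched in plain Java without any web framework: a wrapper that times every call and emits one semantic line before handing back the response. The shapes and names here are illustrative only; in a real servlet container this logic would live in a Filter around the request/response flow:

```java
import java.util.function.Function;

// Sketch of filter-style interception: wrap a handler so that every
// request is timed and logged without the handler knowing about it.
public class TimingFilter {
    public static Function<String, String> wrap(String component, Function<String, String> next) {
        return request -> {
            long start = System.nanoTime();
            String response = next.apply(request);            // delegate to the real handler
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // one consistent, semantic line per request
            System.out.println("component=" + component
                    + "|request=" + request
                    + "|elapsedMs=" + elapsedMs);
            return response;
        };
    }

    public static void main(String[] args) {
        Function<String, String> handler = req -> "OK:" + req;   // stand-in for real work
        Function<String, String> timed = wrap("AccountService", handler);
        System.out.println(timed.apply("updateAccount"));
    }
}
```

Because the wrapper composes, you can stack audit, context capture and timing the same way a servlet container stacks filters.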

Another way to get in the middle of everything that is not “web” related is to leverage Aspects. Just as filters or interceptors in the web framework let you get into the middle of the request/response flow, various aspect-enabling technologies and patterns exist to let you get into the middle of the object-to-object request and response and apply the same audit, security, logging and performance tracking. In Java there is AspectJ, which is heavily used by the Spring Framework itself; Spring also provides a nice abstraction to the developer to simplify common use cases. In .Net you can get similar functionality by leveraging a dependency injection framework like Unity that supports dynamic proxies with internal interceptor pipelines. In Node.js you can leverage event handlers, which can be chained together easily with frameworks like Express. I have created examples of all three implementations in GitHub and will continue to tend to them based on interest and feedback.
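For the aspect/proxy flavor, the JDK’s built-in dynamic proxies (no AspectJ or Spring required) are enough to show the shape of the idea: every call through an interface gets timed and logged without the target class knowing. `AccountService` and `withLogging` are illustrative names, not a real framework API:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Sketch of aspect-style interception using java.lang.reflect.Proxy:
// a generic wrapper adds timing/logging around any interface method.
public class LoggingProxy {
    interface AccountService {
        String update(String accountId);
    }

    @SuppressWarnings("unchecked")
    static <T> T withLogging(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            Object result = method.invoke(target, args);       // delegate to the real object
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("class=" + iface.getSimpleName()
                    + "|method=" + method.getName()
                    + "|elapsedMs=" + elapsedMs);
            return result;
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[]{iface}, handler);
    }

    public static void main(String[] args) {
        AccountService real = accountId -> "updated " + accountId;
        AccountService logged = withLogging(real, AccountService.class);
        System.out.println(logged.update("42"));
    }
}
```

AspectJ, Spring AOP and Unity interception all generalize this same pattern with better ergonomics and pointcut expressiveness.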

Once you are in the middle of everything, creating consistent and contextual data, you then start to create a very rich dataset which can be parsed and indexed by various products or even just grep.

I’ve seen this leveraged for operations dashboards, business dashboards, SLA measurement, QA, customer support, performance testing, and much more. For a developer, it really provides a level of transparency about what the system is doing that no debug session can replicate.

The slides from the presentation are here:

Log On!


I Am On Fire with Testability – Part 1

I’ve been taking some time off from building out features to focus on reducing the unit testing debt in our codebase, and, along the way, building out some core testing infrastructure to make it easier to sustain building tests in concert with features (imagine that). Some great lessons have been learned, notably about making code testable (see previous post).

Along this journey, the ASP.Net MVC framework that we use has turned out to be one of the more elegant bodies of software I have used in a while, but I didn’t really appreciate that until I tried testing code in its context. The main principle at play here is that in MVC you are always using an abstraction/wrapper over the core ASP.Net classes. You don’t have to use them (apparently if you are a sucker for pain), since you can always reference HttpContext.Current and do whatever you want to.


Long story short: whenever we bypassed the framework (and its abstractions), we paid dearly. It turns out the MVC framework was designed with testability in mind, so if you go with the flow (man) you get to inherit that for free. It took a fair amount of refactoring to make sure nobody was touching ASP.Net constructs directly. Let’s just say that I had to check out 100+ files to remedy the situation.

Moral of the story: never reference HttpContext.Current, ever ever ever. While this ASP.Net convenience property is handy and reliable, it makes your code inherently untestable. Why? Because it can’t be mocked, and if it can’t be mocked, it cannot be unit tested.

So what is a coder to do?  Every Controller and View has a Context (ControllerContext and ViewContext respectively).  These provide accessors to an HttpContext that derives from HttpContextBase, which is an abstract and thus mockable class.  So, if you want to work with the Request, use ControllerContext.HttpContext.Request (or, even more conveniently, the base Controller has a shortcut: HttpContext.Request).  Similarly, in Views you can reference the Session like so: ViewContext.HttpContext.Session.  Neato, peachy keen, and a boon to testability!

So what do you need to do to fully mock the MVC application? We found all sorts of snippets of code around the intertubes that did this and that, and even some attempts at a full-blown mocking framework for MVC. Your mileage may vary with any of these, but at the core there are a few things you need to know. First off, you need to be able to take control of the following:

  • HttpSession
  • HttpRequest
  • HttpResponse

Everything pretty much pivots around these main classes. You have some choices to make about the “how” of mocking as well. You can use Mock Objects or Stubs. Your choice depends on how much control you want over the objects themselves. We use Moq, which is pretty powerful in that it can mock almost anything, and you can add dynamic behaviors to objects with relative ease. That said, I like to mix in Stub (or pretend) objects that mimic real behavior. For example, I want to mock the HttpSession, which is a pretty dumb object (not a lot of logic) but does have a storage mechanism. By simply extending the abstract base class, I can mimic real Session-level storage.

public class MockHttpSession : HttpSessionStateBase
{
    readonly Dictionary<string, object> _sessionStorage = new Dictionary<string, object>();

    public override object this[string name]
    {
        get
        {
            if (_sessionStorage.ContainsKey(name))
                return _sessionStorage[name];
            return null;
        }
        set { _sessionStorage[name] = value; }
    }

    public override void Remove(string name)
    {
        if (_sessionStorage.ContainsKey(name))
            _sessionStorage.Remove(name);
    }
}
Then I can use a real Mock and have it leverage this Stub:

var Session = new MockHttpSession();
var Context = new Mock<HttpContextBase>();
Context.Setup(ctx => ctx.Session).Returns(Session);

At test setup time, I can set variables in the session storage, and the executing code is none the wiser.

For simple property set/get you can just leverage Moq’s SetupAllProperties functionality. This will mimic the basic get/set on the object so that you can get/set properties without having to create a stub or define the dynamic functionality at setup. For example:

var Cache = new Mock<HttpCachePolicyBase>(); // generic argument lost in the original formatting; HttpCachePolicyBase shown for illustration
Cache.SetupAllProperties();

So what does it look like to mock everything at once?  More on that in Part 2.
