Application Instrumentation Patterns

There are many different ways to instrument your application – and many tools available to help with the effort. I’m going to show how you can get a lot of “free” instrumentation out of any size system with just a little bit of code and a little bit of discipline. Let me start by saying that none of this is news to many people – the patterns and practices I’ll discuss have been around for a long time, but it amazes me how, in nearly every system I have met (or even heard of), instrumentation and logging is essentially random, thanks to its organic evolution. It’s understandable given the realities of time constraints and the short-sightedness they tend to produce, but it doesn’t have to be that way.

This post is based on a talk I have given at a few venues entitled Creating Highly Instrumented Applications with Minimal Effort.  I hoped to appeal to both the lazy AND the curious developer with this.  The lazy part is obvious, but you have to be curious (or care?) to want to know this level of detail about how your system works.

Let’s start with a scenario so that you can get an idea about the punchline here: A customer calls in and says that they have been getting errors every time they try to update their account in your application. With one element of context (the error ID on the screen they see, OR their email address, OR some other identifying element) I can answer the following questions in very little time:

  1. The last time they had that error
  2. The first time they had that error
  3. The component that caused the error
  4. (Potentially) The root cause of the error
  5. The context in which they were working
  6. What they were doing before that error
  7. How many times they tried
  8. How many other customers had that error today
  9. How many other customers have ever had that error
  10. The first time any customer had that error

This all sounds pretty useful – root cause analysis, ability to reproduce the scenario, customer impact over time, and even tying it back to a release which may have introduced a bug in the system.

So how do you get this level of insight from your application?  You instrument it.

Types of System Output

There are a lot of different types of output from both the system (operating system and infrastructure services) and your application.  This is how I distinguish between them.

System Instrumentation

This is a class of data that is very low level – threads, memory, disk I/O – relating to the health and performance of the OS and infrastructure.

  • JMX
  • WMI
  • SNMP

System Logging

This is a class of data that doesn’t plug into operational monitoring tooling as well, but gives the system and infrastructure some place to dump stuff that we typically ignore unless something goes wrong.

  • system.log (*nix)
  • Event.log (Windows)
  • apache.log
  • nginx.log
  • iis.log

Application Instrumentation

Application Instrumentation is the star of the show for us here – it’s useful when things go wrong, but it’s just as useful when things are going just fine. We’ll see why a bit later, but here are some of the characteristics of Application Instrumentation:

  • Cross-cutting (free) and non-intrusive
  • Provides app activity metrics
  • Can passively trigger alerts (based on rules)
  • Source of performance data

Application Logging

Application Logging is:

  • Intentional, in other words a developer has to go to the trouble to do it
  • Typically relates to Business Transactions and how well they worked out
  • Overtly triggers alerts (the database is down!)
  • Source of business metrics ($20 subscription processed)
  • Aids in troubleshooting failures, bugs

Quality Data

Let’s look at some examples of log output, randomly sampled from my own laptop. You can follow along if you’re on a Mac by opening the Console application and browsing around; on Windows you can open up the Event Viewer and do the same. The goal here is to determine how useful the log output is, and grade it accordingly. We’ll figure out what quality means as we intuitively see the utility in certain characteristics of the data provided.

2012-09-28 11:14:26:783|SyncServer|38272|0x10ad16f50|Logging|Info| Logging initialized, engine version 673.6 : log level 3
2012-09-28 11:14:26:927|Mail|38269|0x104e23700|ISyncManager|Info| register client from /System/Library/Frameworks/Message.framework/Versions/B/Resources/Syncer.syncschema/Contents/Resources/MailNotesClient.plist (bundleId = (null), bundleRelativePath = (null))
2012-09-28 11:14:26:961|Mail|38269|0x7fee75613020|Stats|Stats| slow sync,
2012-09-28 11:14:26:964|Mail|38269|0x7fee75613020|ISyncSession|Info| prepare to slow sync,
2012-09-28 11:16:26:993|SyncServer|38272|0x10ad16f50|Server|Info| Cancelling all sync plans.
2012-09-28 11:16:27:029|SyncServer|38272|0x10ad16f50|Server|Info| Goodnight, Gracie.

Here’s what I find useful:

  1. Timestamp
  2. Who created the log  (SyncServer)
  3. What part of the SyncServer made the log statement (eg: Logging, Server)
  4. Some kind of session or process id to tie things together
  5. The Log Level (Info)
  6. Pipe (|) separated values

What I don’t find useful

  1. Non-semantic logging (commas, colons, periods)
  2. It said stats but I don’t see any numbers

Let’s do another.

2013-10-07 08:18:41.600 GoogleSoftwareUpdateAgent[27318/0xb0207000] [lvl=1] -[KSCheckAction performAction] KSCheckAction found installed product for ticket: <KSTicket:12345
 xc=<KSPathExistenceChecker:0x1d259a0 path=/Applications/Google>
 creationDate=2012-09-14 22:20:29

Here’s what I find useful:

  1. Timestamp
  2. Who created the log  (GoogleSoftwareUpdateAgent)
  3. The Action that was being performed
  4. Some kind of session or process id to tie things together
  5. The Log Level (1)
  6. Some semantic (name = value) values

What I don’t find useful

  1. Some Non-semantic logging (colons, < >, LF delimiters)

One more, this time Tomcat starting up

Oct 09, 2013 9:35:14 PM <init>
INFO: tc Runtime property decoder using memory-based key
Oct 09, 2013 9:35:14 PM <init>
INFO: tcServer Runtime property decoder has been initialized in 225 ms
Oct 09, 2013 9:35:15 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-8080"]
Oct 09, 2013 9:35:15 PM com.springsource.tcserver.serviceability.rmi.JmxSocketListener init
INFO: Started up JMX registry on in 123 ms
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 884 ms
Oct 09, 2013 9:35:15 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
Oct 09, 2013 9:35:15 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: VMware vFabric tc Runtime 2.7.2.RELEASE/7.0.30.A.RELEASE
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /deploy/webapps/manager
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory /deploy/webapps/ROOT
Oct 09, 2013 9:35:15 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-8080"]
Oct 09, 2013 9:35:15 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 470 ms

Here’s what I find useful:

  1. Timestamp (on alternate lines?)
  2. Who created the log (org.apache.catalina…)
  3. The Action that was being performed (init?)
  4. The Log Level (INFO)
  5. Some metrics (startup times)

What I don’t find useful

  1. Mostly Non-semantic logging (no delimiters or name=value pairs)
  2. Split lines

From this we can see that what we find useful is basically the classic who, what, when, where and how, captured in some structured and consistent manner – and that it’s both human readable AND machine readable. And guess what: when you have all that, you have some potentially “quality” data.

Theory and Best Practices

There are two principles I follow that have made this type of scenario a reality, rather than a fantasy.

  1. Create consistent, semantic data that can be analyzed with a variety of tools
  2. Get in the middle of everything in your system, time it, log it, and maintain contextual information along the way

Good Data

  • Timestamp everything – be sure to use global timestamps (UTC – 2013-08-21 22:43:31,990)
  • Use key value pairs (name=value) and delimit your data properly (use pipes, semicolons, whatever works). When you go to analyze this data, trust me, it will be easier. Microsoft has gone so far as to create an “Application Block” called the Semantic Logging Application Block that formalizes this type of approach. (A sketch of this style of logging follows this list.)
  • Set and log the context data when you have it (who/what/where/when/how) because it’s interesting (as we have seen) and very useful for correlation downstream.
  • Leverage log categories to identify what tier, and better yet what component, emitted the log
  • Time everything. It’s easy, it’s free, and you will thank yourself later when you have that aha moment where something related to the timings yields an insight that makes your system better.
  • Be conscious of security – never log sensitive data that could put you or your customers’ privacy and data at risk.
  • Be consistent in naming (action=purchase; sale=oct13; productId=123123) vs (action=buy; promo=oct13; sku=123123).
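To make that concrete, here is a minimal sketch in Java using SLF4J’s MDC to carry context; the SemanticLogger class, its methods, and the event and key names are all invented for this example, and the timestamp/level/context prefix would come from your log framework’s pattern layout rather than from the code itself.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Illustrative helper: the class name, event names, and keys are invented for this example.
public class SemanticLogger {

    private static final Logger log = LoggerFactory.getLogger(SemanticLogger.class);

    // Put contextual data (the who/what/where) into the MDC so every subsequent
    // log statement on this thread can carry it automatically.
    public static void setContext(String userId, String sessionId) {
        MDC.put("userId", userId);
        MDC.put("sessionId", sessionId);
    }

    // Emit a single, consistently formatted, pipe-delimited event.
    public static void event(String action, String... keyValues) {
        StringBuilder sb = new StringBuilder("action=").append(action);
        for (String kv : keyValues) {
            sb.append("|").append(kv);
        }
        log.info(sb.toString());
    }
}

// Usage:
//   SemanticLogger.setContext("cust-42", "sess-abc");
//   SemanticLogger.event("purchase", "promo=oct13", "productId=123123", "elapsedMs=84");

Pair this with a Logback pattern along the lines of %d{yyyy-MM-dd HH:mm:ss.SSS, UTC}|%X{userId}|%X{sessionId}|%level|%msg%n and every entry comes out timestamped in UTC, pipe-delimited, and enriched with whatever context is on the thread.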

Get In the Middle Of Everything

Getting in the middle of everything in your system is the source of all your good data. Without doing this, you can follow all of the above best practices against very little or very random data. There are many ways to meddle in your own business, and most of them are incredibly simple, cheap and non-invasive. The secret is to leverage existing patterns and frameworks to do the dirty work. In the context of web applications life is good, because there aren’t really any viable frameworks that don’t have a pipeline of some sort. Whether you are in Java, .Net, Node.js or Ruby, you have some way to intercept incoming calls and outbound responses (and in some cases MANY ways). This is commonly used for audit, security, logging and performance tracking – and I have combined everything but security into one, since all that information is more useful together: together it forms the basis of the context that you will want to propagate downstream, so that subsequent logging statements are enriched with it.
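As a rough sketch of what that interception can look like in Java, here is a plain javax.servlet Filter that stamps each request with a correlation id, seeds the logging context, and times the round trip; the class name and the requestId/path keys are my own invention, not a prescribed standard.

import java.io.IOException;
import java.util.UUID;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Hypothetical filter that wraps every request, times it, and seeds the logging
// context so downstream log statements are enriched with it.
public class InstrumentationFilter implements Filter {

    private static final Logger log = LoggerFactory.getLogger(InstrumentationFilter.class);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) request;
        long start = System.currentTimeMillis();
        // Correlation id that ties together every log line emitted while this request is in flight.
        MDC.put("requestId", UUID.randomUUID().toString());
        MDC.put("path", http.getRequestURI());
        try {
            chain.doFilter(request, response);
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            log.info("event=request.complete|path={}|elapsedMs={}", http.getRequestURI(), elapsed);
            MDC.clear();   // never leak context across pooled threads
        }
    }

    @Override
    public void init(FilterConfig config) { }

    @Override
    public void destroy() { }
}

Because the MDC is cleared in the finally block, context never leaks between requests on a pooled thread, and every log statement made while the request was in flight can be tied back to its requestId.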

Another way to get in the middle of everything that is not “web” related is to leverage Aspects. Just as filters or interceptors in the web framework let you get into the middle of the request/response flow, various aspect-enabling technologies and patterns exist to let you get into the middle of the object-to-object request and response and apply the same audit, security, logging and performance tracking. In Java there is AspectJ – it is heavily used by the Spring Framework itself, and Spring provides a nice abstraction to the developer that simplifies common use cases. In .Net you can get similar functionality by leveraging a dependency injection framework like Unity, which can support dynamic proxies with internal interceptor pipelines. In Node.js you can leverage event handlers, which can be chained together easily with frameworks like Express. I have created examples of all three implementations on GitHub and will continue to tend to them based on interest and feedback.
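For the Java/Spring case, the timing part can be a very small @Around aspect; this is a sketch only, and the com.example.myapp.service package in the pointcut is a placeholder you would swap for your own layout.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

// Illustrative aspect: the pointcut targets a made-up service package.
@Aspect
@Component
public class TimingAspect {

    private static final Logger log = LoggerFactory.getLogger(TimingAspect.class);

    @Around("execution(* com.example.myapp.service..*(..))")
    public Object timeMethod(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return pjp.proceed();   // invoke the intercepted method
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            log.info("event=method.timing|method={}|elapsedMs={}",
                     pjp.getSignature().toShortString(), elapsed);
        }
    }
}

The same advice is a natural home for audit and alerting hooks, since it sees every call that crosses the layer and how long it took.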

Once you are in the middle of everything, creating consistent and contextual data, you then start to create a very rich dataset which can be parsed and indexed by various products or even just grep.

I’ve seen this leveraged for operations dashboards, business dashboards, SLA measurement, QA, customer support, performance testing, and much more. For a developer, it really provides a level of transparency about what the system is doing that no debug session can replicate.

The slides from the presentation are here:

Log On!


Test-Enabled Development

I’m going to coin a phrase that for me captures a more realistic and achievable form of development than Test-Driven Development (TDD). TDD is nice in theory, and I’m happy for those who are capable of truly executing on it. The reality for me has been that even in “green field” development projects where you are coding fun new things, the feature development quickly moves past the tests for said features. For the purists out there, yes, this is an affront to Scrum (which I profess to follow), TDD, and perhaps even good engineering discipline. However, many developers don’t code for aesthetic reasons; they code to make products that serve customers, which hopefully make money for them.


So if we accept reality and the imperfections in process that tend to occur, how can we as quality-conscious developers still “do the right thing”? My contention is that we can follow the practice of Test-Enabled Development (TED).

The principle of TED is that regardless of when you test your code, your code must be testable. More on what that entails shortly. The key point is that your code may be tested prior to implementation (if you are a saint), during implementation (if you are pretty good) or after the fact (if you are lazy/under pressure to deliver). In all cases, your code should be amenable to testing.

So what characterizes code that is testable? In the world of strongly-typed languages like Java and C# we can do a few things right off the bat:

  • Interface-based programming – ensure all system components are defined with interfaces. For one thing it encourages design by responsibility, secondly it provides looser coupling, and most importantly, it allows for mocks, stubs and other chicanery on the testing front.
  • Don’t use static references if possible, and avoid stateful behavior – let the state be brought to you. In other words, mind your own business and let either the caller or some persistent store tell you what state the world is in. That also encourages looser coupling and lets your tests set up the state/context that they need.
  • Factor code into discrete functions – huge methods that do 100s of things are bad for many reasons, but from a testability standpoint they are a higher form of sin.

Bonus Points: use a Dependency Injection framework like Spring, Ninject, Unity, or whatever – then the test code can take control of what is real and what is mocked (a sketch follows below).
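Here is a small, contrived Java sketch of what these properties buy you – the PaymentGateway interface, SubscriptionService, and test below are all invented for illustration, but they show how interface-based design plus constructor injection lets a test substitute a stub without any framework at all.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Everything here is invented for illustration: an interface, a service that
// receives it via constructor injection, and a test that swaps in a stub.
interface PaymentGateway {
    boolean charge(String customerId, int cents);
}

class SubscriptionService {
    private final PaymentGateway gateway;   // injected, never constructed internally

    SubscriptionService(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    String subscribe(String customerId) {
        return gateway.charge(customerId, 2000) ? "active" : "payment-failed";
    }
}

public class SubscriptionServiceTest {

    @Test
    public void subscriptionFailsWhenChargeIsDeclined() {
        // The stub stands in for the real gateway: no network, no shared state.
        PaymentGateway declining = (customerId, cents) -> false;
        SubscriptionService service = new SubscriptionService(declining);
        assertEquals("payment-failed", service.subscribe("cust-42"));
    }
}

Swap the hand-rolled stub for a mocking library, or for a DI container’s test configuration, and the pattern stays exactly the same.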

Sound like a lot to do? It is, but once you get the patterns in place, the goodness is self-replicating and yields dividends down the road. Further, in the worst case you have a very well architected but untested system, and if there’s a rainy day or a sudden influx of engineering discipline, you can actually test the software effectively.
