Back to newsletter 128 contents
Why APM is Broken
Last week we listed an article stating "APM is Broken". Here's why, in my experience:
Application performance management starts with system-level monitoring. Sensibly, every organisation recognises that rather than taking a piecemeal approach, it should deploy a single system-level monitoring solution across the enterprise. This makes perfect sense: it's very cost-effective, and it provides a one-stop core group for system-level monitoring across the organisation. Naturally this group will identify an excellent product, very flexible and capable of monitoring all the different types of servers in use, including both production and non-production systems.
With this product monitoring hundreds, thousands, or tens of thousands of systems, the flexible monitoring capability is then delivered as a client tool to application teams, with some reasonable alerting defaults in place. And this is where APM starts to break down. "Reasonable" alerting defaults are always unsuitable for any particular application. This means that when delivered to the application team, the system-level monitoring is immediately useless: either it raises far too many alerts, so operations considers it a pain to use, or it fails to alert where operations thinks it obviously should. Typically both.
So it starts with a bad reputation, which creates a hurdle to adoption. But let's not underestimate operations and development: they will recognise the value obtainable from the system monitoring solution and work to configure it for their systems. And there is the next hurdle - a system capable of monitoring so many system-level resources, and of integrating application-level monitoring flows, necessarily has a huge amount of configuration capability, which makes it complex to adapt. Things that operations and development could easily do with a script or a bit of code (hours of work, including testing) become multi-day efforts to get working in the system monitoring product.
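For a sense of scale, the kind of check that takes a developer "a bit of code" might look like the following minimal sketch in Java, using the standard OperatingSystemMXBean. The threshold value is an assumption for illustration; a real check would feed its alert into whatever channel the team already uses.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    // The threshold is a hypothetical value for illustration; tune per host.
    static final double LOAD_THRESHOLD = 4.0;

    // True when the 1-minute load average exceeds the threshold.
    // getSystemLoadAverage() returns -1.0 on platforms that don't support it,
    // so negative readings are treated as "unknown", not overloaded.
    static boolean overloaded(double loadAverage) {
        return loadAverage >= 0 && loadAverage > LOAD_THRESHOLD;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        if (overloaded(load)) {
            System.out.println("ALERT: load average " + load
                    + " exceeds " + LOAD_THRESHOLD);
        }
    }
}
```

The whole thing is an afternoon's work including testing; replicating the same check inside an enterprise monitoring product's configuration layer is where the multi-day efforts come from.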
The next hurdle is ownership. The system monitoring product is flexible, has plugins, and is built to expect data of any sort pumped into it - after all, if you've spent a chunk of money on this, it should quite reasonably be capable of producing both system- and application-level alerts. So the expected flow is to pump monitoring data out of the application and into the system monitoring tool as custom data, which is then processed through the system monitoring workflow to raise alerts. But the alerts don't go back to the application infrastructure to be farmed out (after all, the application knows where they should go!); instead the system monitoring tool has its own mechanism to propagate alerts. Naturally this is hugely flexible, supporting many channels, so (with a chunk of work) it is possible to push the alerts back into some kind of application-level alert-handling workflow. But not with structured data: the alert "comments" have to include everything that is needed for manual or automated handling, which means parsing free-flow text and then deciding what to do ...
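To make the free-flow parsing problem concrete, here is a hedged sketch. The "key=value;" comment format below is entirely hypothetical - in practice each monitoring product embeds details in its own free-flow text, and you have to reverse-engineer a pattern for each alert type before any automated handling is possible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlertCommentParser {
    // Hypothetical format: "host=app01; metric=paging; value=250".
    // Real products emit their own free-flow text, so a pattern like this
    // has to be discovered and maintained per alert type.
    private static final Pattern KV = Pattern.compile("(\\w+)=([^;]+)");

    // Extracts whatever key=value pairs can be found in the comment text.
    static Map<String, String> parse(String comment) {
        Map<String, String> fields = new HashMap<>();
        Matcher m = KV.matcher(comment);
        while (m.find()) {
            fields.put(m.group(1), m.group(2).trim());
        }
        return fields;
    }
}
```

This is the fragile glue the article is complaining about: any change to the monitoring tool's comment wording silently breaks the downstream handling.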
You don't see it from any single segment here, but the whole setup is topsy-turvy. Responsibilities are in the wrong places. Capabilities are in the wrong places. Ownership is all wrong. What's needed is for the application to be able to request whatever system information it wants from the system monitoring tool - then it can integrate that with what it knows about itself and decide appropriately what to do. But this impinges on everything: the openness (or lack of it) of the system monitoring data; enterprise-wide support of APIs rather than access only through the system monitoring team; duplicated storage costs for historical data; etc. And the thing is, developers are clever people. They'll figure out an alternative (less optimal) way to get the data they need, and the system-level monitoring will get bypassed except for some basic functionality. The result is that no one gets to use the full capabilities of the tools. This same set of issues tends to get multiplied up the stack and across the enterprise, leaving APM broken.
Here's a test of whether you have it right. Could one of your application services make the following decision using your enterprise tools? Is paging above normal levels (where the application service decides what is normal)? If so:
- Is it because a system backup is occurring? If this is during business hours for this particular application service, alert the operations staff with a high-priority alert; if the backup is outside the expected backup hours but also outside application business hours, send a medium-priority email to the systems team requesting they look into it, copying operations.
- Or is it one of a set of known local reasons for higher than normal paging (COB reporting, expected to last 5 minutes, produces heavier logging and is normal at this time)? If so, nothing needs to be done.
- Or is it a previously unknown cause? In that case check the application latency to see if there is a significant decrease compared to normal latencies, and if there is, alert operations.
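As a sketch (not any particular product's API), that decision reduces to code like the following. Every parameter name here is hypothetical; the comments mark which data would have to come from the system monitoring tool versus what the application already knows about itself - which is exactly the ownership split the article argues is missing.

```java
public class PagingDecision {

    enum Action { NONE, HIGH_ALERT_OPERATIONS, MEDIUM_EMAIL_SYSTEMS_CC_OPS, ALERT_OPERATIONS }

    // Parameters marked "system" would have to be queried from the system
    // monitoring tool; those marked "app" are things the application service
    // can decide for itself. All names are illustrative assumptions.
    static Action decide(double pagingRate,             // system: current paging level
                         double normalPaging,           // app: what "normal" means here
                         boolean backupInProgress,      // system
                         boolean inExpectedBackupHours, // system
                         boolean inBusinessHours,       // app
                         boolean knownLocalCause,       // app: e.g. the COB reporting window
                         boolean latencyOffNormal) {    // app: latency significantly off normal
        if (pagingRate <= normalPaging) {
            return Action.NONE;                         // paging is at normal levels
        }
        if (backupInProgress) {
            if (inBusinessHours) {
                return Action.HIGH_ALERT_OPERATIONS;    // backup during business hours
            }
            if (!inExpectedBackupHours) {
                return Action.MEDIUM_EMAIL_SYSTEMS_CC_OPS; // backup at an unexpected time
            }
            return Action.NONE;                         // backup at the expected time
        }
        if (knownLocalCause) {
            return Action.NONE;                         // e.g. COB reporting, normal now
        }
        // Previously unknown cause: escalate only if latency confirms impact.
        return latencyOffNormal ? Action.ALERT_OPERATIONS : Action.NONE;
    }
}
```

The logic itself is trivial; the problem is that half the inputs live behind the system monitoring tool, which is the point of the test.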
I'm betting you couldn't implement that decision process without writing your own local paging monitor. Which is a shame when the system monitoring tool already has the information available - and that's why APM is broken. The responsibilities are wrong.
Now on with this month's newsletter and all our usual Java performance tools, news, and article links. We also have a new cartoon over at fasterj.com on how more abstraction gives better performance ... eventually, and, as usual, all the extracted tips from all of this month's referenced articles.
Java performance tuning related news.
Java performance tuning related tools.