|
|
|
My fat Java clients started crashing from out-of-memory issues recently: first the odd one or two, then two or three in a day, building to a crescendo of eight crashes in one day. The severity was such that the business head of the division was now involved. All the while, I was desperately trying to identify what the problem was. "Easy", you say. You get a memory dump, analyse it, find what is taking up the space or growing, and presto, the job is done.
Well, the first little problem is that my crashes were all on the native memory side - no Java dumps, and sadly no native dumps either. But of course I have basic heap monitoring in place in prod, so I could see all the trends, and it was clear that the native memory was growing faster than the Java heap. That screams "native memory leak", but how do you confirm it without native memory dumps? We did extensive testing but couldn't identify a leak. And there was always the possibility that we had a small Java leak hanging on to larger native objects, a scenario I have encountered before.
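As an aside, the sort of trend monitoring I mean is easy to sketch. The following is a minimal illustration (not our actual monitoring), assuming a Linux host so that the process resident set size can be read from /proc/self/status; it simply reports the gap between what the JVM accounts for and what the operating system says the process occupies, which is a rough proxy for the native side. The 60-second interval is an arbitrary choice for the example.

// Minimal sketch: periodically compare JVM-managed memory to process RSS so
// that native growth shows up as a widening gap. Assumes Linux (/proc).
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NativeVsHeapMonitor {
    public static void main(String[] args) throws Exception {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            long heapCommitted = memory.getHeapMemoryUsage().getCommitted();
            long nonHeapCommitted = memory.getNonHeapMemoryUsage().getCommitted();
            long rssBytes = readRssBytes();
            // Anything in RSS not accounted for by the JVM-managed areas is a
            // rough proxy for native allocations (buffers, JNI, thread stacks).
            long unaccounted = rssBytes - heapCommitted - nonHeapCommitted;
            System.out.printf("heap=%dMB nonheap=%dMB rss=%dMB native~=%dMB%n",
                    heapCommitted >> 20, nonHeapCommitted >> 20,
                    rssBytes >> 20, unaccounted >> 20);
            Thread.sleep(60_000);
        }
    }

    // Parse VmRSS from /proc/self/status (reported in kB on Linux).
    static long readRssBytes() throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                return Long.parseLong(line.replaceAll("\\D+", "")) * 1024;
            }
        }
        return -1; // not found - not running on Linux /proc
    }
}

Logged over days, the "native~=" column growing while the heap columns stay flat is exactly the trend I described above.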
So we obtained Java histograms. But amidst the millions and millions of objects varying over the day, a small Java leak could easily go unnoticed. In fact I can pretty much guarantee that we have dozens of small Java leaks; in a large, complex application you can't even notice them, no matter how detailed your analysis. So what can you do? Memory leak analysis works for large objects, or for many objects, but this scenario of comparatively few small Java objects holding on to large native objects is not easily analysed.
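For what it's worth, comparing the histograms is the easy part - the noise is the problem. Here is a minimal sketch of the kind of comparison involved, assuming two snapshot files captured with jmap -histo at different times, and assuming the usual column layout of rank, instance count, bytes and class name (check your JVM's output format before relying on it):

// Minimal sketch: diff two "jmap -histo <pid>" snapshots and print the
// classes whose retained byte counts grew the most between them.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class HistogramDiff {
    public static void main(String[] args) throws IOException {
        Map<String, Long> before = parse(args[0]);
        Map<String, Long> after = parse(args[1]);
        after.entrySet().stream()
             .map(e -> Map.entry(e.getKey(),
                                 e.getValue() - before.getOrDefault(e.getKey(), 0L)))
             .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
             .limit(25)
             .forEach(e -> System.out.printf("%+12d bytes  %s%n",
                                             e.getValue(), e.getKey()));
    }

    // Map of class name -> total bytes from one histogram file.
    // Data lines look like "   1:    123456   7890123  java.lang.String".
    static Map<String, Long> parse(String file) throws IOException {
        Map<String, Long> bytesByClass = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(file))) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 4 && cols[0].endsWith(":")) {
                bytesByClass.put(cols[3], Long.parseLong(cols[2]));
            }
        }
        return bytesByClass;
    }
}

The trouble, as I said, is that in a big client the top of that diff is dominated by perfectly legitimate daily variation, and a leak of a few hundred small objects never makes the list.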
So, how did I solve it? I got lucky. Eventually I found a couple of my fat clients kicking off a thread that was doing more work than my baselines showed it should. The stack traces for the thread indicated that a particular class was doing something slightly unexpected. Analysing the class showed that it had a small memory leak under certain restricted conditions - and that was specifically the problem here.
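To give a flavour of the thread-level monitoring that made this possible (an illustrative sketch only, not our actual monitoring code; the baseline map, the 10% threshold and the sampling interval are all invented for the example), the standard ThreadMXBean gives you per-thread CPU times and stack traces to compare against a baseline:

// Minimal sketch: sample per-thread CPU time, flag any thread consuming
// noticeably more than its recorded baseline, and capture its stack trace.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class ThreadBaselineMonitor {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    private final Map<Long, Long> lastCpuNanos = new HashMap<>();
    private final Map<String, Long> baselineNanosPerInterval; // from earlier runs

    ThreadBaselineMonitor(Map<String, Long> baselineNanosPerInterval) {
        this.baselineNanosPerInterval = baselineNanosPerInterval;
    }

    void sample() {
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id, 20); // top 20 frames
            if (info == null) continue;                      // thread has died
            long cpu = threads.getThreadCpuTime(id);
            if (cpu < 0) continue;                           // CPU timing unavailable
            long delta = cpu - lastCpuNanos.getOrDefault(id, cpu);
            lastCpuNanos.put(id, cpu);
            long baseline = baselineNanosPerInterval
                    .getOrDefault(info.getThreadName(), Long.MAX_VALUE);
            if (delta > baseline * 1.1) {   // 10% over its historical norm
                System.out.println("Thread over baseline: " + info.getThreadName());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // With an empty baseline map nothing gets flagged; in real use the map
        // would be loaded from previously recorded per-thread measurements.
        ThreadBaselineMonitor monitor = new ThreadBaselineMonitor(new HashMap<>());
        while (true) {
            monitor.sample();
            Thread.sleep(60_000);
        }
    }
}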
Now look at that sequence. We had process monitoring down to the thread level. We had baselines. We had sufficient detail to identify one thread out of several hundred as being outside its normal range of execution. We had historical stack traces from that thread with sufficient granularity to determine what specifically was unusual compared to the baseline. We had enough knowledge of the system to determine at the class level what looked unusual. And we had the expertise at the code level to identify a non-trivial memory leak (callback objects not being released).
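For readers who haven't hit that last pattern before, here is a minimal illustration of it, with invented class names: a tiny callback object is registered with a long-lived service and never deregistered, and it happens to pin a much larger chunk of native memory - a direct ByteBuffer stands in here for whatever the real callback held on to.

// Minimal sketch of a callback/listener leak that costs little Java heap
// but a lot of native memory. Class names are invented for the example.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class CallbackLeakExample {
    interface PriceListener { void onPrice(double price); }

    // A long-lived service: listeners added here live as long as the service does.
    static class PriceFeed {
        private final List<PriceListener> listeners = new ArrayList<>();
        void addListener(PriceListener l)    { listeners.add(l); }
        void removeListener(PriceListener l) { listeners.remove(l); }
    }

    static class Quote implements PriceListener {
        // A tiny Java object, but it holds 1MB of native (off-heap) memory.
        private final ByteBuffer nativeBuffer = ByteBuffer.allocateDirect(1 << 20);
        public void onPrice(double price) { nativeBuffer.putDouble(0, price); }
    }

    public static void main(String[] args) {
        PriceFeed feed = new PriceFeed();
        for (int i = 0; i < 200; i++) {
            Quote quote = new Quote();
            feed.addListener(quote);
            // The quote goes out of use here, but the listener is never removed,
            // so each iteration leaks a small Java object plus 1MB of native
            // memory. Run this for long enough and the process fails on the
            // native side, not in the Java heap.
        }
    }
}

A histogram would show a modest count of Quote objects - easy to miss among millions of others - while the native side quietly balloons. The fix is the matching removeListener call (or a weak reference to the listener) once the object is done with.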
And here is my challenge to tool vendors. Give me a tool that finds that needle in the haystack much quicker - and without me having to get lucky.
Now on with the newsletter. We have our usual lists of Java performance tools, news, and articles - and worth picking out as special is Brent Boyer's comprehensive article on how to avoid benchmarking pitfalls, probably the best I've seen on this subject. At fasterj, Kirk reviews Emily Halili's "Apache JMeter" book, and Javva The Hutt finds out about outsourcing strategy. And finally, as usual, we have extracted tips from all of this month's referenced articles.
Java performance tuning related news.
Java performance tuning related tools.