Back to newsletter 095 contents | All Javva's articles
Anomalous CPU spikes during the release, occurring intermittently. Can I find the problem before I have to pull the release? Ugh, I work feverishly, looking for the issue, but progress is so slow. But let me step back and explain how I got here.
As the performance and troubleshooting team, my team are involved in supporting the release process. We cover each release, and need to identify any significant changes in performance, as well as troubleshoot low-level problems that crop up. Of course, we aim to have everything performance tested prior to deployment, but things slip through. So we keep an eye on each release, looking for unexpected changes.
So far, so simple. Nothing unusual in all that. This was my first release here, and I decided that I'd see the release through myself, a baptism of fire. Not that I was really expecting anything to happen; I'd been told the last couple of releases had gone very smoothly. Twenty minutes into the release, the release manager called me over. "We're getting an odd spiking behavior on the server2 CPU," he said, and showed me a graph of the server2 CPU. Sure enough, every few minutes it spiked up to 100% for a bit, then cut back to the underlying level of about 25%.
So I went off to do some analysis. It was an irregular spike. And of course while I was watching the system, nothing happened. Isn't that always the way? But when I went off for a coffee, bang! And a little later, after going over to ask the support team if they had seen this before, bang, again while I wasn't watching!
Of course, every performance guy (and gal) has a box of tricks ready. Systems are sneaky; anyone who tells you they are dumb boxes that just do what they have been instructed to do is a year short on experience. So you need your box of tricks to be able to sneak up on them and catch them doing the naughty. And I can be just as sneaky as the box. I opened my box of tricks and set it to record any elusive spikes, and went off for a sandwich.
And when I came back, there had been two spikes. And they had lasted easily long enough for me to see that there was a process causing them: "java xxx.yyy.release.ReleaseSupport5" was the process.
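For anyone without a box of tricks to hand, here is a minimal sketch of the kind of recorder that could sit and watch while you eat your sandwich. It is not the actual tool used above. It samples overall CPU load at a fixed interval and reports each run of samples at or above a threshold. The class name `SpikeRecorder`, the 0.9 threshold and the one-second interval are assumptions chosen for illustration, and the cast to `com.sun.management.OperatingSystemMXBean` assumes a HotSpot-style JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: records runs of high CPU load so nobody has to watch.
class SpikeRecorder {

    /** A run of consecutive samples at or above the threshold. */
    static final class Spike {
        final int startIndex;   // index of the first sample in the run
        final int length;       // number of consecutive samples in the run
        final double peak;      // highest load seen during the run

        Spike(int startIndex, int length, double peak) {
            this.startIndex = startIndex;
            this.length = length;
            this.peak = peak;
        }

        @Override
        public String toString() {
            return "spike at sample " + startIndex + ", " + length
                    + " samples long, peak load " + peak;
        }
    }

    /** Groups consecutive samples >= threshold into spikes. */
    static List<Spike> findSpikes(double[] samples, double threshold) {
        List<Spike> spikes = new ArrayList<>();
        int start = -1;
        double peak = 0.0;
        for (int i = 0; i < samples.length; i++) {
            if (samples[i] >= threshold) {
                if (start < 0) {
                    start = i;
                    peak = samples[i];
                } else {
                    peak = Math.max(peak, samples[i]);
                }
            } else if (start >= 0) {
                spikes.add(new Spike(start, i - start, peak));
                start = -1;
            }
        }
        if (start >= 0) {
            // The last spike was still running when sampling stopped.
            spikes.add(new Spike(start, samples.length - start, peak));
        }
        return spikes;
    }

    /** Samples system-wide CPU load (0.0 to 1.0, negative if unavailable). */
    static double[] sample(int count, long intervalMillis) throws InterruptedException {
        // Assumption: the JVM exposes the com.sun.management extension bean.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        double[] samples = new double[count];
        for (int i = 0; i < count; i++) {
            samples[i] = os.getSystemCpuLoad();
            Thread.sleep(intervalMillis);
        }
        return samples;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed settings: one sample per second for ten minutes.
        double[] samples = sample(600, 1000L);
        for (Spike spike : findSpikes(samples, 0.9)) {
            System.out.println(spike);
        }
    }
}
```

A recorder like this only says when the spikes happened. Naming the culprit process still needs a per-process view taken at those moments, for example from the operating system's process listing.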
"Phew", thought I, "just some release mechanism that has been altered, nothing that will need the release stopped or rolled back". With less pressure, I figured I might as well have a look at the class and see what had changed before I went back to the release manager to tell him what the issue was.
There was just one change in that class since the last release. And it was only a comment. It said "//welcome to the shop, Javva. Let's see how good you are". Apparently our releases are not quite interesting enough for our release manager. He has a little streak of the joker in him. Yes, all ReleaseSupport5 did was some nasty little calculations in parallel that would consume CPU - all the CPUs - for a little while and then die.
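The real ReleaseSupport5 source is not reproduced here, but a prank of this shape takes very little code. The following is a rough sketch under my own assumptions: the class name `CpuBurner`, the square-root busywork and the ten-second duration are all invented for illustration. It starts one busy thread per CPU, spins until a deadline, and then exits.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a class that eats every CPU briefly and then dies.
class CpuBurner {

    /** Spins on the given number of threads for roughly millis milliseconds. */
    static long burn(int threads, long millis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        LongAdder iterations = new LongAdder();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                double x = 1.0;
                long n = 0;
                // Pointless arithmetic in a tight loop until the deadline passes.
                while (System.nanoTime() < deadline) {
                    x = Math.sqrt(x * x + 1.0);
                    n++;
                }
                iterations.add(n);
            });
        }
        pool.shutdown();
        // Wait for the workers to finish, with a generous safety margin.
        pool.awaitTermination(millis + 5000L, TimeUnit.MILLISECONDS);
        return iterations.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        long total = burn(cpus, 10_000L); // assumed spike length: ten seconds
        System.out.println("Burned " + total + " loop iterations on " + cpus + " CPUs");
    }
}
```

Run from a scheduler at irregular intervals, something like this would plausibly produce the graph the release manager showed me: a flat baseline with the occasional jump to 100%.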
And so passed my first release. The support team were unsurprised, being in on the joke and aware - and appreciative - of the Release Manager's propensity for practical jokes. And now I was too.
BCNU - Javva The Hutt.