Back to newsletter 122 contents | All Javva's articles
Somehow, because we are the performance tuning team, obscure concurrency bugs get thrown our way. I'm not sure exactly where the connection is — we aren't the "obscure concurrency bug" team. But why fight it? I prefer to go with the flow. So, each time, we examine them.
The latest one is representative of what we tend to see. Across several hundred servers in all the different environments (prod, pre-prod, QA, various dev boxes), every couple of months we get a report of the same issue cropping up. Each time a different server is involved, usually (but not always) running on a different box, and usually (but not always) a different piece of code with a different stack trace. An intensive investigation will establish that the problem could occur in only one of two ways: a simple, obvious way which, if it were the cause, would be happening all the time across multiple servers, so we'd see it constantly; or a highly obscure way that requires an exception no one has ever seen to be thrown at exactly the moment several JVM threads have died and a system resource has been temporarily exhausted. Note that none of the individual strands of that combination has ever been seen on its own (though none is being monitored for, so that's perhaps unsurprising); and yet the combination has to occur about once a month across the full set of servers to explain what we see.
A coincidence too far? Each time I think about this, I'm reminded of Gabriel Weinberg's article One in a million happens a lot when your site is big. Although I'm not talking about a big website for my bug here, the statistics are the same: if you are doing something a lot - and several hundred servers running almost continuously in an enterprise project would be spot on in any list of things that do stuff a LOT - then very unlikely incidents stop being very unlikely and instead become things that happen on an infrequent but regular basis.
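To make that concrete, here's a back-of-envelope calculation. The numbers are invented for illustration (300 servers, 1,000 operations a second each, a one-in-a-trillion chance per operation - none of these figures come from the actual systems described above):

```java
// Back-of-envelope estimate of how often a "one in a trillion" event
// shows up across a fleet of busy servers. All numbers are hypothetical.
public class RareEventEstimate {
    public static void main(String[] args) {
        long servers = 300;                    // hypothetical fleet size
        long opsPerSecondPerServer = 1_000;    // hypothetical load
        double perOpProbability = 1e-12;       // one in a trillion, per operation
        long secondsPerMonth = 60L * 60 * 24 * 30;

        double opsPerMonth = (double) servers * opsPerSecondPerServer * secondsPerMonth;
        // 300 * 1,000 * 2,592,000 = 7.776e11 operations per month, fleet-wide
        double expectedEventsPerMonth = opsPerMonth * perOpProbability;
        // = 0.7776, i.e. roughly once a month across the full set of servers

        System.out.printf("Fleet-wide ops/month: %.3g%n", opsPerMonth);
        System.out.printf("Expected events/month: %.2f%n", expectedEventsPerMonth);
    }
}
```

With those made-up but plausible numbers, a one-in-a-trillion per-operation event is expected roughly once a month fleet-wide - exactly the "impossible combination keeps happening" pattern described above.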
Given that, you need to have a strategy for when obscure things go wrong. The usual strategy in IT for this is redundancy: assume something will fail at some point, and have something else that can quickly take its place when that happens. This is now common in large scale systems - possibly the most famous example is Google, where the systems are designed to expect failures of any component and they "design the system with multiple backups and ways to route around problems". So that's what I tell the teams: things will fail obscurely and unexpectedly - you need failover capability. And now my analysis of obscure concurrency bugs has become much faster. Since I know the answer before I start, my analysis mostly consists of asking "is this something that is unlikely to occur normally, but is likely to occur occasionally when you scale up to do it a lot?" If the answer is yes - and for the bugs sent our way it typically is - then the solution is "there's no obvious fix - you need a failover capability".
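The failover idea can be sketched in a few lines. This is a minimal illustration with a made-up API (`callWithFailover` and the in-line backends are hypothetical, not from any real system): try each backend in turn, and move on to the next only when the current one fails unexpectedly.

```java
import java.util.List;
import java.util.function.Supplier;

// Minimal failover sketch (hypothetical API): route around an obscure
// failure by falling back to the next available backend.
public class Failover {
    public static <T> T callWithFailover(List<Supplier<T>> backends) {
        RuntimeException last = null;
        for (Supplier<T> backend : backends) {
            try {
                return backend.get();   // first backend that succeeds wins
            } catch (RuntimeException e) {
                last = e;               // record the failure, try the next one
            }
        }
        throw new IllegalStateException("all backends failed", last);
    }

    public static void main(String[] args) {
        String result = callWithFailover(List.of(
            () -> { throw new RuntimeException("primary: obscure failure"); },
            () -> "secondary OK"
        ));
        System.out.println(result); // prints "secondary OK"
    }
}
```

Real failover involves much more (health checks, timeouts, avoiding retry storms), but the core design choice is the same: don't try to make the obscure failure impossible; make the system able to route around it.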
BCNU - Javva The Hutt. (LinkedIn profile http://www.linkedin.com/in/javathehutt - feel free to connect with me; my email is javva@ this Java performance domain.)