Java Performance Tuning

Javva The Hutt January 2011

Back to newsletter 122 contents | All Javva's articles

Somehow, because we are the performance tuning team, obscure concurrency bugs get thrown our way. I'm not sure exactly what the connection is - we aren't the "obscure concurrency bug" team - but why fight it? I prefer to go with the flow, so each time one arrives, we examine it.

The latest one is representative of what we tend to see. Across several hundred servers in all the different environments (prod, pre-prod, QA, various dev boxes), every couple of months we get a report of the same issue cropping up. Each time it's a different server, usually (but not always) running on a different box, and usually (but not always) a different piece of code with a different stack trace. An intensive investigation identifies only two ways the problem could occur: either a simple, obvious way which, if it were the cause, would be happening all the time on multiple servers, so we'd be seeing it constantly; or a highly obscure way that requires an exception no one has ever seen being thrown while, at the same time, several JVM threads have died and a system resource has been temporarily exhausted. Note that none of the individual strands of that combination has ever been observed (though none is being monitored for, so that's perhaps unsurprising) - and yet the combination has to occur about once a month across the full set of servers to explain what we see.

A coincidence too far? Each time I think about this, I'm reminded of Gabriel Weinberg's article One in a million happens a lot when your site is big. Although I'm not talking about a big website for my bug here, the statistics are the same: if you are doing something a lot - and several hundred servers running almost continuously in an enterprise project would be right at the top of the list of things that do stuff a LOT - then very unlikely incidents stop being very unlikely and instead become things that happen on an infrequent but regular basis.
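The arithmetic behind that intuition is easy to check. Here's a sketch in Java - every number in it (the per-request probability, the load, the fleet size) is assumed for illustration, not measured from any real system:

```java
// All numbers here are assumed for illustration only.
public class RareEvents {
    public static void main(String[] args) {
        double p = 1e-10;                          // assumed chance of the freak combination per request
        long requestsPerServerPerDay = 1_000_000L; // assumed load per server
        int servers = 300;                         // "several hundred servers"
        int days = 60;                             // "every couple of months"

        long trials = requestsPerServerPerDay * servers * days;  // total operations in the window
        double expected = p * trials;                            // expected number of occurrences
        double pAtLeastOnce = 1 - Math.pow(1 - p, trials);       // P(seen at least once)

        System.out.printf("Expected occurrences over %d days: %.1f%n", days, expected);
        System.out.printf("P(at least one occurrence): %.2f%n", pAtLeastOnce);
    }
}
```

With these made-up numbers, an event with a one-in-ten-billion chance per request is still expected to show up roughly twice every couple of months across the fleet - which is exactly the "infrequent but regular" pattern described above.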

Given that, you need a strategy for when obscure things go wrong. The usual strategy in IT is redundancy: assume something will fail at some point, and have something else that can quickly take its place when it does. This is now common in large-scale systems - possibly the most famous example is Google, where the systems are designed to expect failures of any component and they "design the system with multiple backups and ways to route around problems". So that's what I tell the teams: things will fail obscurely and unexpectedly - you need failover capability. And my analysis of obscure concurrency bugs has become much faster: since I know what the answer will be before I start, the analysis mostly consists of asking "is this something that is unlikely to occur normally, but likely to occur occasionally when you scale up and do it a lot?" If the answer is yes - and for the bugs sent our way it typically is - then the solution is "there's no obvious fix; you need failover capability".
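The column doesn't prescribe what "failover capability" looks like in code, so here is one minimal sketch of the idea - the names and structure are mine, purely illustrative: try each redundant backend in turn, and let any failure, however obscure, simply route the request to the next one.

```java
import java.util.List;
import java.util.function.Supplier;

// Minimal failover sketch; names and structure are illustrative, not from the article.
public class Failover {

    // Try each redundant backend in order; a failure in one routes to the next.
    static <T> T callWithFailover(List<Supplier<T>> backends) {
        RuntimeException last = null;
        for (Supplier<T> backend : backends) {
            try {
                return backend.get();
            } catch (RuntimeException e) {
                last = e;  // remember the failure, but keep going
            }
        }
        throw new IllegalStateException("all backends failed", last);
    }

    public static void main(String[] args) {
        List<Supplier<String>> backends = List.of(
            () -> { throw new RuntimeException("once-a-month freak combination"); },
            () -> "served by the standby"
        );
        System.out.println(callWithFailover(backends));  // prints "served by the standby"
    }
}
```

The point of the pattern is that it doesn't need to know *why* the primary failed - which is exactly what you want when the root cause is a combination of events too obscure to ever reproduce.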

BCNU - Javva The Hutt. (LinkedIn profile: http://www.linkedin.com/in/javathehutt - feel free to connect with me; my email is javva@ this java performance domain.)




Last Updated: 2017-03-29
Copyright © 2000-2017 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
URL: http://www.JavaPerformanceTuning.com/news/javvathehutt122.shtml
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss