Back to newsletter 119 contents
This month a particular article caught my eye as a must-read. Todd Hoff, of highscalability fame, wrote an absolutely fascinating account of how an application living on the edge can turn gradual degradation into sudden catastrophe (see the article listings section below). The sequence, which caused downtime at FourSquare, was as follows:
- First off, over time, the load distribution across their servers slowly became slightly skewed in memory usage (a 50:67 ratio) - not anything that would normally be a problem. It is fairly common for load distribution across servers to be at least a little skewed, especially on the memory side, since load balancing usually targets processor availability rather than memory (and many load distributors don't even load-balance, they just round-robin or equivalent).
- In this case, that meant that the system with the larger memory load slightly exceeded its RAM, and memory started paging. Systems can recover, or limp along for a while, after paging kicks in - especially if the overall system has spare capacity and the load distributor can really balance, as more requests will be shunted over to the servers with capacity. But paging is definitely something that needs to be monitored for, because it is one of the biggest causes of dramatic performance slowdowns: a paging system can easily run one or even two orders of magnitude slower than the equivalent unpaged system!
- Here, the paging caused enough performance degradation for a backlog to build up. This in turn made the situation worse: the larger the backlog, the more stress put on the system, and even if the system just queues the incoming requests, that backlog will grow. The worst case is where processing actually slows down as the load gets ever larger, making the system snowball into a death spiral (I'm not clear which of those happened with FourSquare, but in any case it was unsustainable).
- They hoped that adding a new server would relieve the system and allow it to recover, but in this case it failed to relieve the memory strain on the overloaded server: memory was so fragmented that it carried on paging. Again, in my experience this is quite common - untried failover backups are rarely capable of relieving the stress on a system, because the configuration is often not quite right.
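The paging risk described above is worth watching for from inside the JVM as well as at the OS level. Here is a minimal sketch, assuming a Sun/Oracle JVM where the com.sun.management extension of OperatingSystemMXBean is available; the 10% free-memory threshold is an illustrative assumption, not a figure from the article:

```java
import java.lang.management.ManagementFactory;

public class PagingWatch {
    // Illustrative threshold: below 10% free physical memory we assume
    // the box is at risk of paging (tune this for your own systems).
    static final double LOW_MEMORY_FRACTION = 0.10;

    // Pure check, separated out so it is easy to test.
    static boolean isLow(long freeBytes, long totalBytes) {
        return (double) freeBytes / totalBytes < LOW_MEMORY_FRACTION;
    }

    public static boolean lowOnPhysicalMemory() {
        // com.sun.management.OperatingSystemMXBean extends the standard
        // bean with physical memory and swap statistics.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        return isLow(os.getFreePhysicalMemorySize(),
                     os.getTotalPhysicalMemorySize());
    }

    public static void main(String[] args) {
        System.out.println("Low on physical memory: " + lowOnPhysicalMemory());
    }
}
```

A check like this, polled periodically and fed into your alerting, gives you warning before paging turns a slightly overloaded server into a catastrophically slow one.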
Take FourSquare's experience and turn it to your advantage. Review your system regularly, looking for where it is on the edge of its capacity. Think about what happens when you exceed that capacity, how the system will degrade, and whether that degradation will be handled gracefully or snowball into downtime. Do you have the monitoring in place to identify where your system is, and when it exceeds those edges? Then test your contingencies with actual representative loads, because anything less and you are just guessing.
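One concrete way to make degradation graceful rather than a snowballing backlog is to bound the queue and push back on producers. A minimal sketch using the standard java.util.concurrent classes - the thread count and queue size here are illustrative, not recommendations:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedBacklog {
    public static ThreadPoolExecutor newBoundedExecutor(int threads, int maxQueue) {
        // A bounded queue plus CallerRunsPolicy applies backpressure:
        // once the backlog reaches maxQueue, submitting threads run the
        // task themselves, which slows the producers down instead of
        // letting the queue (and memory use) grow without limit.
        return new ThreadPoolExecutor(
            threads, threads,
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(maxQueue),
            new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = newBoundedExecutor(2, 10);
        for (int i = 0; i < 100; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    try { Thread.sleep(5); } catch (InterruptedException e) { }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}
```

The design choice here is that rejecting or slowing new work at the edge is almost always cheaper than letting an unbounded backlog drive the whole system into paging.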
Now on with this month's newsletter. We have all our usual Java performance tools, news, and article links. Javva The Hutt tells us about the NotFastEnoughException; over at fasterj we have a new cartoon about how multi-cores were invented because of garbage collection; and, as usual, we have extracted tips from all of this month's referenced articles.
Java performance tuning related news.
Java performance tuning related tools.