Tips May 2010

Deadlocks occur from "oversynchronized" or badly ordered code, data races from "undersynchronized" code. Both are multithreaded race conditions.
Multithreaded race conditions are intermittent and can easily fail to show up in QA. Running QA tests on different JVMs alters the thread scheduling, which makes race conditions more likely to surface.
Native pre-compilation of Java code can provide better performance compared to traditional JIT-based JVMs, especially for application startup.
To increase the chance of seeing race conditions in QA, run systemwide tests for several hours and simulate highly varied volumes of request traffic.
If a Thread/Runnable.run() method is declared synchronized, the Thread/Runnable object's monitor stays locked for as long as run() is executing. This is probably a bug.
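The synchronized run() tip can be sketched as follows. This is a minimal illustration, not taken from any of the referenced material: the class SynchronizedRunWorker and its methods are hypothetical names. It shows that while run() is active, a second thread calling any other synchronized method on the same object cannot proceed.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical worker, for illustration only.
final class SynchronizedRunWorker implements Runnable {
    private final CountDownLatch started = new CountDownLatch(1);
    private final CountDownLatch finish = new CountDownLatch(1);

    // Because run() is synchronized, this object's monitor is held
    // from the moment run() is entered until it returns.
    @Override
    public synchronized void run() {
        started.countDown();
        try {
            finish.await(); // stands in for long-running work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Any other synchronized method on this object waits for run() to return.
    public synchronized String status() {
        return "ok";
    }

    // Returns true if status() could not complete within timeoutMillis
    // while run() was still executing on another thread.
    static boolean statusBlockedWhileRunning(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        SynchronizedRunWorker worker = new SynchronizedRunWorker();
        Thread runner = new Thread(worker, "runner");
        Thread caller = new Thread(worker::status, "caller");
        try {
            runner.start();
            worker.started.await();       // run() now holds the monitor
            caller.start();
            caller.join(timeoutMillis);   // give status() a chance to finish
            boolean blocked = caller.isAlive();
            worker.finish.countDown();    // let run() return and release the monitor
            runner.join();
            caller.join();
            return blocked;
        } catch (InterruptedException e) {
            worker.finish.countDown();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

Calling SynchronizedRunWorker.statusBlockedWhileRunning(200) reports that the caller thread was still blocked after 200 ms. Removing synchronized from run() and guarding only the shared state it touches would let status() return immediately.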

http://www.slideshare.net/timmorrow/shopzilla-performance-by-design-2433735
Performance By Design - A look at Shopzilla's path to high performance (Page last updated November 2009, Added 2010-05-26, Author Tim Morrow, Publisher Shopzilla). Tips:

High performance highly scaled site design principles: Simplify layers; Decompose architecture; Define SLAs; Continuous performance testing; Utilize caching; Apply best-practice UI performance techniques.
High performance highly scaled architecture: browser -> distributed load-balanced webservers -> (1) application servers -> distributed cache -> database; (2) search engine grid -> dedicated search storage.
High performance highly scaled site useful features: Connection pooling; Stale connection checking; Hardware load balancers; Connection and Socket Timeouts; O(1) invocations on a single page; JAXB XML->Java unmarshaling; distributed caching; automatic data partitioning; distributed data grid for calculations.
High performance highly scaled site performance SLAs: session & context lookup - 50ms; business service response time 600ms; full page request to response full rendered time - 1.5 seconds; continuous performance testing targeting all critical services and identifying synchronization bottlenecks in particular.
Use Yahoo's UI performance techniques (34 best practices currently listed) - http://developer.yahoo.com/performance/
Minimize HTTP Requests: Combined files; CSS Sprites http://spriteme.org/
Use a CDN: Move your content closer to end users; Reduce latency; Every resource except for dynamic HTML; Offloads 100s of gigabytes per day.
Expiry, Compression and Minification: Expiry headers instruct the Browser to use a cached copy; > 2 days considered "Far Future"; Use versioning techniques to allow forced upgrades; Compressing reduces page weight; Minifying may still reduce size by 5% even with compression.
Reduce DNS lookups: Yahoo recommends 3 - 4 DNS lookups per page, e.g. Base page: www.bizrate.com; Javascript & CSS: file01.bizrate-images.com; Static images: img01.bizrate-images.com; Dynamic images: image01.bizrate-images.com; 3rd party ads are a different story
Avoid Redirects: Redirects delay your ability to serve content. We strive for zero redirects. Exceptions: Redirect after POST; Handling legacy-styled URLs; Links off-site for tracking purposes.
Use a Cookie-free domain. Don't send cookies when requesting static resources. Use a separate domain name (e.g. bizrate-images.com) - this saves many KB of upload bandwidth; revenue increased by 0.8%!
Do Not Scale Images in HTML. Don't request larger images only to shrink them. Utilize a dynamic image scaling server. CDN caches and delivers exact image size
Make favicon.ico small and cacheable.
Flush the Buffer Early (every 8KB or even less).
Continuously monitor full page load performance (including rendering).
Web page load performance corresponds directly to page abandonment: when page response time dropped from 8 seconds to 1.5 seconds, abandonment fell from 6% to 4% for Shopzilla and from 9% to 4% in a second case, with a 25% increase in page views over the same period.
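The expiry and versioning tip above can be sketched as below. This is only an illustration under stated assumptions: the class FarFutureCaching, the one-year lifetime used in the usage note, and the ".v<version>" file naming scheme are hypothetical choices, not Shopzilla's actual setup. The sketch builds far-future Cache-Control and Expires header values and a versioned resource path, so that changing the version forces browsers to fetch the new file.

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class FarFutureCaching {
    // HTTP dates use a fixed-width English format and are always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH);

    // Builds header values telling the browser to reuse its cached copy for 'lifetime'.
    static Map<String, String> cacheHeaders(ZonedDateTime now, Duration lifetime) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Cache-Control", "public, max-age=" + lifetime.getSeconds());
        ZonedDateTime expires = now.plus(lifetime).withZoneSameInstant(ZoneOffset.UTC);
        headers.put("Expires", HTTP_DATE.format(expires));
        return headers;
    }

    // Embeds a version in the file name (assumed naming scheme), so a new
    // release gets a new URL and bypasses copies cached under the old one.
    static String versionedPath(String path, String version) {
        int dot = path.lastIndexOf('.');
        if (dot <= path.lastIndexOf('/')) {
            return path + ".v" + version; // no file extension found
        }
        return path.substring(0, dot) + ".v" + version + path.substring(dot);
    }
}
```

For example, FarFutureCaching.versionedPath("/css/site.css", "42") yields /css/site.v42.css, and cacheHeaders(now, Duration.ofDays(365)) produces a max-age of 31536000 seconds.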

http://www.infoq.com/presentations/Diagnosing-Memory-Leaks
Diagnosing Web Application OutOfMemoryErrors (Page last updated April 2010, Added 2010-05-26, Author Mark Thomas, Publisher InfoQ). Tips:

Common causes of perm gen memory leaks in a webserver application are registries (from logging, JDBC drivers, GWT) that hold classes loaded multiple times, which keeps references to the web application class loaders alive.
Process heap consists of: Perm Gen, Thread Stacks, native code, compiler, GC, heap (young and old gens).
Class objects are loaded into PermGen.
Common OutOfMemoryErrors that are not memory leaks: too many classes (increase perm gen); too many objects (increase heap or decrease objects/object sizes); stack overflow (reduce/alter recursion or increase stack size).
Memory leaks are indicated by steady increases in memory and more frequent GC; however, both can also be normal behaviour for the system, so they are only indicators.
Apart from gross heap sizes, different garbage collector algorithms need to be tuned differently. The default is probably a good starting point.
In Tomcat, putting a JDBC driver into the WEB-INF/lib directory can cause a memory leak in which web application class loaders are pinned in memory (use common/lib and there is no leak); it is reloading the application that triggers the actual leak. Look for instances of the web application class loader: there should be one per application, and any extras are leaks (the leaked instances will have a "started" field set to false). Find the roots and see what is keeping each one alive.
Finding the reference that holds memory across reloads in Tomcat web applications is straightforward, but it doesn't tell you how that reference was populated. For that you need allocation stack traces, which are horrendously expensive to collect, so this can only be done in debug mode.
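The distinction drawn above between the heap generations and the other memory areas can be inspected from inside a running JVM. The following is a minimal sketch using the standard java.lang.management API; the class name MemoryPoolReport is hypothetical. Pool names vary by JVM version and garbage collector, and on Java 8 and later the Perm Gen pool no longer exists (class metadata lives in Metaspace), so treat the output as JVM-specific.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class MemoryPoolReport {
    // Returns one line per JVM memory pool, tagged as heap or non-heap.
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String kind = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            long max = usage.getMax(); // -1 means the maximum is undefined
            String maxText = max < 0 ? "undefined" : (max / 1024) + "K";
            lines.add(String.format("%-9s %-35s used=%dK max=%s",
                    kind, pool.getName(), usage.getUsed() / 1024, maxText));
        }
        return lines;
    }
}
```

Printing MemoryPoolReport.describePools() at intervals gives a rough view of which area is growing, which helps separate "too many classes" from "too many objects" before reaching for a heap dump.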

http://skillsmatter.com/podcast/java-jee/making-every-millisecond-count-jvm-performance-tuning-in-the-real-world
Making every millisecond count! JVM performance tuning in the real-world (Page last updated December 2009, Added 2010-05-26, Author Ben Evans, Publisher skillsmatter). Tips:

Systems are too complicated to infer what they do from the code - you need a profiler, and you need to use the profiler output to understand what the system is doing. You cannot performance tune by staring at source code.
You need to know what is good and what is bad performance BEFORE tuning, so that you know when to stop optimising.
Ignore marginal potential improvements. If you need a 20% improvement, targeting a pathway that takes 0.1% of the time is not going to help much.
The only basic bottlenecks are CPU, memory, IO, and synchronization. All fixes should address the use of one of these.
You need to profile on a prod-mirrored environment, or the system effects can be so different that your tuning is useless (or worse, makes things worse).
Use garbage collection logging to monitor the garbage collection.
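The arithmetic behind ignoring marginal improvements can be sketched as below. This is an illustrative calculation, not something from the talk: the class TuningPayoff and its method are hypothetical names. It computes the fraction of total run time saved when a pathway that accounts for a given share of the time is made a given number of times faster.

```java
// Hypothetical helper, for illustration only.
final class TuningPayoff {
    // fractionOfTime: share of total run time spent in the pathway (0..1).
    // speedup: how many times faster the pathway becomes (>= 1).
    // Returns the fraction of total run time that is saved.
    static double overallTimeSaved(double fractionOfTime, double speedup) {
        if (fractionOfTime < 0.0 || fractionOfTime > 1.0) {
            throw new IllegalArgumentException("fractionOfTime must be between 0 and 1");
        }
        if (speedup < 1.0) {
            throw new IllegalArgumentException("speedup must be at least 1");
        }
        // The pathway's new cost is fractionOfTime / speedup; the rest is unchanged.
        return fractionOfTime * (1.0 - 1.0 / speedup);
    }
}
```

Doubling the speed of a pathway that takes 40% of the time saves 20% overall, whereas a pathway that takes 0.1% of the time can never save more than 0.1%, even if it is made infinitely fast.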

http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
Splittable LZO Compression (Page last updated November 2009, Added 2010-05-26, Author Matt Massie, Publisher Cloudera). Tips:

LZO compression in Hadoop allows for reduced data size and shorter disk read times
LZO's block-based structure allows compressed data to be split into chunks for parallel processing
Storing compressed data makes better use of hardware
Some compression formats (e.g. gzip) cannot be split for parallel processing (LZO can)
Some compression formats (e.g. bzip2) are slow enough at decompression that jobs become CPU-bound, eliminating your gains on IO (LZO is fast enough that there is an overall performance benefit)
MapReduce jobs are nearly always IO-bound; storing compressed data of the right type (parallel compressible and not too slow) means there is less overall IO to do, so jobs run faster. LZO compression is proposed as a good balance that is fast and parallelizable and benefits IO-bound processing.
You could split files prior to compression to allow parallel compression of blocks of the file.
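The idea of splitting data before compression can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the JDK's gzip streams as a stand-in because LZO is not part of the JDK, it works on in-memory byte arrays rather than Hadoop files, and the class ChunkedCompression is a hypothetical name. Each chunk is compressed independently, so chunks can be compressed and decompressed in parallel.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class ChunkedCompression {
    // Splits data into fixed-size chunks and compresses each one independently.
    static List<byte[]> compressInChunks(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int start = 0;
        while (start < data.length) {
            int end = start + Math.min(chunkSize, data.length - start);
            chunks.add(Arrays.copyOfRange(data, start, end));
            start = end;
        }
        // Chunks share no compression state, so they can be processed in parallel;
        // collecting from an ordered list keeps them in their original order.
        return chunks.parallelStream().map(ChunkedCompression::gzip).collect(Collectors.toList());
    }

    // Decompresses the chunks in parallel and joins them in their original order.
    static byte[] decompressChunks(List<byte[]> compressed) {
        List<byte[]> parts = compressed.parallelStream()
                .map(ChunkedCompression::gunzip).collect(Collectors.toList());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    private static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    private static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The trade-off is that smaller chunks give more parallelism but a worse compression ratio, since each chunk starts with an empty dictionary.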

http://www.informit.com/articles/article.aspx?p=1390173
How Not To Optimize (Page last updated August 2009, Added 2010-05-26, Author David Chisnall, Publisher Informit). Tips:

There are two types of optimizations: algorithmic improvements and language/implementation specific optimizations. Algorithmic optimizations are generally valid across versions of the system, but language/implementation specific optimizations can become invalid and even make the system worse as the underlying system changes.
You can rewrite any multiplication as a sequence of bit shifts and additions - this used to be a valid optimization but could now slow down multiplications on superscalar pipelined CPUs.
Just because something used to be slow on old CPUs doesn't mean that it's still slow. Time your optimizations and make sure that they really offer an improvement in speed.
Global variables used to offer some advantages in memory access, but these have disappeared on modern systems, where references can be passed in registers and local caches. Global variables can also prevent code from being reentrant, which means the system cannot automatically split the code across threads even though current systems would otherwise be capable of doing so.
Changing recursive code to run iteratively can often require a stack in the iterative code - but the code-level stack is likely to be slower than the system/hardware code calling stack, so the iterative code could be slower.
A lot of compilers will now do tail-recursion optimization, in these cases there is no benefit at all in converting to iterative code.
Inlining and template use in C++ can actually make the code slower by causing code to overflow the L1 cache. Java code can be faster if the compiler takes this into account (as it can do).
When running benchmarks, it's common to avoid running anything else. Unfortunately, that's not how most code will end up being used; it will be run concurrently with a lot of other programs, all wanting a slice of the system and caches.
Optimizing for special cases can add some overhead to every use of a function, providing a speed improvement only in special cases. Unless a special case is particularly common, or orders of magnitude slower than every other case, then it's typically not worth the bother.
In modern JVMs, the optimizer ignores "final" entirely because the VM already knows which classes aren't subclassed and which methods aren't overridden, and will apply these optimizations to all of them, not just the ones marked as final.
Don't optimize until you're sure that your code will be too slow without optimization.
When you do need to optimize, make sure that your improvements actually help.
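The recursion tip above can be sketched as follows. This is an illustrative example with hypothetical names (TreeSum, Node), not code from the article. It sums a binary tree once recursively and once iteratively, where the iterative version has to maintain its own stack to remember the nodes still to visit. Whether either version is faster depends on the JVM and the data, so time both before preferring one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example, for illustration only.
final class TreeSum {
    static final class Node {
        final int value;
        final Node left;
        final Node right;

        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    // Recursive version: the call stack remembers where to resume.
    static long sumRecursive(Node node) {
        if (node == null) {
            return 0;
        }
        return node.value + sumRecursive(node.left) + sumRecursive(node.right);
    }

    // Iterative version: an explicit, heap-allocated stack replaces the call stack.
    static long sumIterative(Node root) {
        long total = 0;
        Deque<Node> pending = new ArrayDeque<>();
        if (root != null) {
            pending.push(root);
        }
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            total += node.value;
            if (node.left != null) {
                pending.push(node.left);
            }
            if (node.right != null) {
                pending.push(node.right);
            }
        }
        return total;
    }
}
```

Both methods return the same total for any tree. The iterative form still earns its place on very deep trees, where the recursive form could throw a StackOverflowError.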

Last Updated: 2026-03-30
Copyright © 2000-2026 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
URL: http://www.JavaPerformanceTuning.com/news/newtips114.shtml