Java Performance Tuning
Java(TM) - see bottom of page
Tips August 2006
Back to newsletter 069 contents
Understanding The User Experience (Page last updated June 2006, Added 2006-08-30, Author Kirk Pepperdine, Publisher TheServerSide). Tips:
- As important as performance is, it is an aspect of functionality that is often not specified as part of the requirements and often not fully considered in the QA process.
- The only way to truly know performance is to measure it.
- If performance is an important requirement of your application then it should be specified along with every other requirement.
- Quite often users will not know how to set a performance requirement. In this case it may be best to start with what would be unacceptable and then work from there.
- One can and should performance test each of the components that will be used to build the final product.
- There are classes of performance problems that won't appear until the entire application can be tested.
- Competition between components may introduce contention that would not be present when either of the components was tested in isolation. The resource in contention may be something as simple as CPU. If this is the case, the solution could be as simple as adding more CPU.
- Micro-performance benchmarks are very difficult to get right and hardly ever provide more than a marginal gain in performance. However, they may prove quite useful in helping you resolve micro-level performance problems that have already been identified.
- The danger in performance testing a system whose external data sources and services are mocked out is that the mock object will perform differently than the real service.
- A bad mock can create false bottlenecks, which may cause developers to be chasing problems that don't really exist.
- If you are doing a production level test on a system then the database must be configured and populated to a production level, otherwise optimizations in the database technology will distort the results.
- Whatever disk configuration is in production, it should be the same in your test if you are dealing with a system that is bounded by I/O.
- If the database contains too little data then after an initial warm-up, the database will never have to read from disk again, which may not be enough to simulate problems related to disk activity.
- With any deviation from the hardware in your production system, you run the risk of either creating an artificial bottleneck or just moving an existing one somewhere else. You are in danger of chasing phantom performance issues while the real ones remain hidden.
- Add an extra CPU to the production system and you could see longer response times than what you see in QA. This may seem like a paradox but is in fact a normal reaction for applications that are network bound. More CPU capacity allows the application to put more pressure on the network. Once that pressure crosses a certain threshold the whole network will start thrashing and in the process kill response times. In this instance less CPU is actually better.
- Having Gigabit network capacity in production and Megabit in QA can easily create an artificial bottleneck if your application's network utilization is more than what the Megabit network can handle. What's worse is that this artificial bottleneck will work to hide all others.
- The simple act of taking a measurement will impact performance and hence impact your measurement.
- Try to minimize the effects of measuring during tests.
- Use tools that impose minimal overhead.
- Process measurements outside of the application and any system that affects the application testing.
- Measure only what you need to and no more.
- If you need to measure many things measure them one at a time.
- Don't use tools that compete for the resources your application is dependent upon.
- The Unix command line utility vmstat is a lightweight tool that can be used to monitor the health of the Unix kernel.
- Writing logging information to a console window can harm your application's performance.
- Make your measurements impose minimal overhead.
- Taking too many measurements will result in a cumulative drag on system performance.
- Minimize drag due to monitoring by taking measurements at the outer edges of the application.
- A good test harness will: Emulate hundreds if not thousands of users; Throttle request rates; Randomize request rates; Perform client side tasks; Randomize request parameters; Measure and report request response times; Monitor other aspects of the system.
- A misconfigured test harness can be the bottleneck in the system.
- With a bad test harness, bad results will almost always be attributed to the application and not the harness itself.
- If too many other things are happening during a benchmark, then your response times will become too corrupted (or noisy) to provide useful meaning.
- In benchmarking generate the average, variance, median, minimum, maximum and the 90th percentile statistics.
- To measure throughput find the point where the throughput starts falling off by adjusting the number of requests per second that you throw at the server.
- Typically you need to sacrifice user response time for throughput and throughput for user response time.
- Benchmarks almost always produce surprising results and it is very difficult to create a proper schedule around surprises.
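The statistics recommended above for benchmarking (average, variance, median, minimum, maximum and 90th percentile) can be computed from the recorded response times with a few lines of Java. The following is a minimal sketch, not taken from the article: the class name ResponseTimeStats is hypothetical, the variance is the sample variance, and the percentile uses the nearest-rank definition, which is only one of several common definitions.

```java
import java.util.Arrays;

/** Summary statistics for a set of response times (illustrative sketch). */
final class ResponseTimeStats {
    final double mean, variance, median, min, max, p90;

    ResponseTimeStats(double[] samples) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;

        double sum = 0;
        for (double s : sorted) sum += s;
        mean = sum / n;

        double squares = 0;
        for (double s : sorted) squares += (s - mean) * (s - mean);
        variance = n > 1 ? squares / (n - 1) : 0.0;   // sample variance (assumption)

        median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        min = sorted[0];
        max = sorted[n - 1];
        p90 = percentile(sorted, 90);
    }

    /** Nearest-rank percentile: smallest value with at least p% of samples at or below it. */
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Reporting the median and 90th percentile alongside the average is useful because a few very slow requests can move the average a long way while leaving the typical response time unchanged.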
To Thread or not to Thread (Page last updated July 2006, Added 2006-08-30, Author Jean-Francois Arcand, Publisher java.net). Tips:
- Best performance (max throughput to fast clients) was seen when the same thread as the selector was used to handle the accept calls; while a separate thread was used to handle the reads.
- Best performance (max throughput to fast clients) was seen when the same thread that handled the reads, also handled the writes.
- If you have many slow clients then having a separate thread per client to handle reads and writes is probably not optimal.
Performance Bloopers (Page last updated June 2006, Added 2006-08-30, Author eclipse.org, Publisher eclipse.org). Tips:
- String.substring() can leave many small Strings holding on to large char arrays, so that more memory is retained than the Strings appear to need (a substring shares the char array of the String it was taken from, just holding offsets into it).
- InputStream and OutputStream are unbuffered, and should be wrapped in a BufferedInputStream or BufferedOutputStream.
- Strings are one of the least space efficient data forms in Java, e.g. 25 characters requires 90 bytes of storage.
- Strings as identifiers are easy to read but not efficient.
- String.intern() often degrades performance, dramatically in some cases depending on usage and JVM. Interning strings eagerly and early fills the intern table with more and more collisions - the intern table is often a static size.
- Use a private table rather than the String.intern table.
- Avoid Strings as objects where they are not really necessary, i.e. where a textual representation is not needed.
- For infrequently used tables, store data on disk if heap memory is tight - this trades speed for heap memory.
- Scanning the disk for objects is very slow compared to in-memory access.
- Long keys are not particularly useful and waste space. String keys are useful for developers - but tend to have a performance cost.
- Mechanisms which are going to be used throughout the system should be well thought out and used consistently across the system.
- You can store each message in a Java field instead of in properties or resource bundles, yielding big memory improvements.
- Initialisation code may not always be needed, so either load on demand or refactor into appropriate different initialisations.
- There is a tradeoff between initialising everything, so that it is all defined as early as possible, and the cost of doing that in terms of startup delays. You need to consider that tradeoff and design and implement accordingly.
- Avoid resource leaks. Cache images and make sure they are disposed of when they are no longer needed.
- Handlers processing change events can generate yet more change events, causing a storm of change events. These should all be combined into fewer batches of changes.
- Separate out the updates to a central manager component, possibly run in its own dedicated thread.
- Using the appropriate data structure for the data and the way that data is used provides optimisation options (for example hash lookups instead of linear searches for keys).
- Beware of side effects, especially those that can lead to resource leaks.
- URLClassLoader will verify any JAR that it loads, and there is no way to avoid that verification cost.
- JarFile(String) and JarFile(File) methods will verify any signed JARs, even if you weren't planning on loading or running any code in the JAR you are opening.
- As soon as you instantiate Manifest, the entire manifest file is loaded and parsed. This can be very expensive in large signed JARs, which can sometimes have 500KB of signatures in them.
- JarFile has constructors with a boolean flag which can disable verification, allowing you to avoid those verification costs.
- When reading a manifest file, consider doing a light-weight parse to find the main attribute you are looking for, to avoid parsing the entire file.
- When using a bounded cache, be aware of the thrashing that can occur when operating on data sets that are too large to fit in the cache.
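The tip above about using a private table rather than the String.intern table can be sketched as follows. This is an illustrative assumption about what such a table might look like, not the implementation described in the article: the class name StringPool and its methods are hypothetical, and a plain synchronized HashMap is used for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

/** A private canonicalizing table used instead of String.intern() (illustrative sketch). */
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the canonical instance equal to s, registering s if none exists yet. */
    synchronized String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held. */
    synchronized int size() {
        return pool.size();
    }

    /** Drops all entries so the pooled strings can be garbage collected. */
    synchronized void clear() {
        pool.clear();
    }
}
```

Unlike the JVM-wide intern table, a table like this can be sized for its workload, scoped to one subsystem, and cleared or discarded when the strings are no longer needed.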
High Performance GUIs (Page last updated May 2006, Added 2006-08-30, Author Christopher Butler, Publisher ClientJava). Tips:
- The Graphics API has a number of different layers: your application - Swing - AWT - Java2D - the Sun Graphics layer - the native layer. You should bear these layers in mind as they have different costs and optimisation possibilities.
- Within Swing, optimizations include object creation and setting rendering hints.
- In AWT, we focus on streamlining event handling, coalescing events, and making sure not to clog the EDT with unnecessary work.
- At the Java2D layer you have direct control over object rendering.
- GlyphVector provides faster antialiased text rendering than using Strings except for Apple platforms. Apple's Graphics implementation actually renders Strings faster.
- Reduce the burden on the graphics pipeline: generally the simplest option will probably be the best. To draw a rectangle you could simply call drawRect(), draw a polygon, or draw four lines; the simplest course of action is to draw a rectangle - and that is likely to be the best performing too. (Using java.awt.geom.Rectangle2D may offer an even better option, depending on your needs, since its internal representation allows you to avoid floating-point roundoff errors, which may cause hell in chained affine transforms.)
- Try to stick to weak references or MRU (Most Recently Used) caches for rendered images so your application may reuse them where necessary without running low on memory.
- Don't paint outside of the current clipping region.
- Favor speed-rendering over quality-rendering algorithms during animations.
- Antialiasing is not that important during an animation, as users typically don't notice that sort of detail within a moving picture.
- Use lower cost, lower resolution rendering during animations, and when the animation is complete revert to your quality algorithm to render your static image.
- It's faster to render an image than it is to render a collection of objects.
- Render your scene once, dump it to an image, and whenever you need to rerender your scene, simply pull up the cached image.
- Spatial Decomposition involves determining which components in the container hierarchy happen to intersect with the clipping region using an R*-Tree traversal. The idea here is that you grab your top-level node (component/glyph/object) and check its bounds to see if it intersects the clipping region. If so, check to see if it's a leaf node. If so, paint it. Otherwise, recurse through the painting algorithm with its children, so on and so forth until all branches of the tree within the clipping region are rendered. This allows you to skip any nodes in the graph that don't intersect the clipping region, potentially speeding up rendering time considerably.
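The recursive clipping traversal described in the last tip can be sketched in a few lines. This is a simplified illustration under stated assumptions: SceneNode is a hypothetical class, the hierarchy is a plain tree in which each parent's bounds enclose its children (not a balanced R*-Tree), and "painting" a leaf just records its name.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

/** A node in a scene hierarchy; each node's bounds enclose all of its children. */
final class SceneNode {
    final String name;
    final Rectangle bounds;
    final List<SceneNode> children = new ArrayList<>();

    SceneNode(String name, Rectangle bounds) {
        this.name = name;
        this.bounds = bounds;
    }

    /** Adds a child and returns this node so calls can be chained. */
    SceneNode add(SceneNode child) {
        children.add(child);
        return this;
    }

    /**
     * Collects the leaves that intersect the clip, skipping whole subtrees
     * whose bounds fall outside it. A real painter would render each leaf
     * instead of adding its name to the list.
     */
    void paint(Rectangle clip, List<String> painted) {
        if (!bounds.intersects(clip)) {
            return;                      // prune: nothing below can be visible
        }
        if (children.isEmpty()) {
            painted.add(name);           // leaf node: "paint" it
            return;
        }
        for (SceneNode child : children) {
            child.paint(clip, painted);  // recurse into the visible branch
        }
    }
}
```

The saving comes from the early return: one failed bounds check on a container skips every descendant of that container, so repaint cost depends on what is inside the clipping region rather than on the size of the whole scene.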
Tuning Memory Allocation (Page last updated June 2006, Added 2006-08-30, Author Henrik Ståhl, Publisher BEA). Tips:
- The default size of a TLA in the JRockit JVM is 2 kB, which may be too small for a Java program that allocates lots of memory, in particular when the JRockit process spans a large number of CPUs.
- A 2 kB TLA size is too small for an application that allocates large numbers of large arrays, not uncommon when XML is involved.
- Tuning the TLA size can result in a moderate performance boost (5-10% is not rare).
- To determine if TLA size is a bottleneck in a JRockit application, create a JRA recording and load it in the JRA tool. The recording will open on the General tab, which contains a section on Allocation.
- [Article shows an example of memory tuning a JRockit application].
- JRA recordings done using recent versions of JRockit include lock profiling with data on Native locks, i.e. JVM internal locks. One of these locks is the GC: Heap lock, and JRA displays how many times this lock was contended during the recording. A high number indicates that the heap lock was a bottleneck, and that tlasize and largeobjectlimit tuning may be called for.
- High values for TLA sizes increase the need for GC, since areas of free memory that are smaller than these numbers will not be usable without compaction (i.e. defragmentation of the heap during GC).
The Shared Classes feature (Page last updated May 2006, Added 2006-08-30, Author Ben Corrie, Publisher IBM). Tips:
- The IBM implementation of the 5.0 JVM allows all system and application classes to be stored in a persistent dynamic class cache in shared memory.
- The Shared Classes feature reduces the JVM virtual memory footprint and improves JVM startup time.
- Class sharing is enabled by adding -Xshareclasses[:name=<cachename>] to an existing Java command line.
- Specify the cache size of a shared class archive using the parameter -Xscmx<size>[k|m|g]; this parameter only applies if a new cache is created by the JVM.
- A shared class archive cache is deleted either when it is explicitly destroyed using a JVM utility or when the operating system restarts.
- A shared class archive cache cannot grow in size and, when it becomes full, a JVM can still load classes from it but cannot add any classes to it.
- A JVM running with Class Sharing uses the following classloading order: 1. Classloader cache; 2. Parent; 3. Shared cache; 4. Filesystem.
- java.net.URLClassLoader gets class sharing support for free.
- [Article shows an example of using the Shared Classes feature with the IBM JVM].
- If a JVM is running with a JVMTI agent that has registered to modify class bytes and the modified suboption is not used, class sharing with other vanilla JVMs or with JVMs using other agents is still managed safely, albeit with a small performance cost because of extra checking. Thus, it is always more efficient to use the modified suboption.
Last Updated: 2017-10-01
Copyright © 2000-2017 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss