Java Performance Tuning
Java(TM) - see bottom of page
Tips August 2013
Back to newsletter 153 contents
Understanding the Java HotSpot VM Code Cache (Page last updated July 2013, Added 2013-08-26, Author Ben Evans, Publisher Oracle). Tips:
- For faster loading performance, the JDK classes (in rt.jar) are not security checked the way other classes are.
- Once a method has been run for "compile threshold" times in interpreted mode (-XX:CompileThreshold=n, default is n=10000 for -server, n=1500 for -client), the method gets JIT compiled.
- JIT compiled methods are held in the CodeCache. -XX:+PrintCompilation shows what the JIT compiler is doing, including deoptimizations (where a speculative optimization has turned out to be invalid, so the method must be deoptimized before it is JIT compiled again), which are printed with the messages "made not entrant" and "made zombie".
- VisualVM shows the loaded and unloaded classes in its "Classes" section.
- If the CodeCache fills up (its size can be specified with -XX:ReservedCodeCacheSize), JIT compilation stops because there is no more space to store compiled methods. Zombie methods (deoptimized "not entrant" methods that have been around for a while) can be flushed to make room, if any are present.
- Since Java 7 update 4, speculative flushing ages JIT compiled code when it is not being called (e.g. one implementation halves the method's invocation count every 30 seconds while it is not called), and if the method is not called again soon enough, its code becomes eligible for ejection from the code cache. This has significant implications for applications that "warm up" code and expect later, potentially infrequent, calls to still use JIT compiled methods. -XX:+UseCodeCacheFlushing controls this, and is on by default from update 4 (so use -XX:-UseCodeCacheFlushing to disable it if needed).
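One way to watch how full the code cache is from inside the application is the standard java.lang.management API. The following is a minimal sketch, not something taken from the article: the class name CodeCacheUsage is hypothetical, and the pool names it matches ("Code Cache", or "CodeHeap ..." when the cache is segmented) are assumptions about HotSpot's naming that other JVMs may not share.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class CodeCacheUsage {

    // Assumption: HotSpot exposes the code cache as a non-heap pool named
    // "Code Cache" (single segment) or "CodeHeap '...'" (segmented).
    static boolean isCodeCachePool(MemoryPoolMXBean pool) {
        String name = pool.getName();
        return pool.getType() == MemoryType.NON_HEAP
                && (name.contains("Code Cache") || name.contains("CodeHeap"));
    }

    // Sum of bytes currently used across all matching pools.
    static long usedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null && isCodeCachePool(pool)) {
                used += usage.getUsed();
            }
        }
        return used;
    }

    // Sum of the maximum sizes of all matching pools; pools that report
    // an undefined maximum (-1) are skipped.
    static long maxBytes() {
        long max = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null && isCodeCachePool(pool) && usage.getMax() > 0) {
                max += usage.getMax();
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("code cache used=" + usedBytes()
                + " bytes, max=" + maxBytes() + " bytes");
    }
}
```

Logging these two numbers periodically gives an early warning that the cache is approaching the limit set by -XX:ReservedCodeCacheSize, before JIT compilation stops.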
Using Java in Low Latency Environments (Page last updated August 2013, Added 2013-08-26, Author Charles Humble, Peter Lawrey, Martin Thompson, Todd L. Montgomery, Andy Piper, Publisher ). Tips:
- Low latency is sub-100 milliseconds. Real-time is not the same as low latency: it means deterministic latency (though typically the latency desired is quite low). The difference is in outliers - low latency may be able to accept the odd outlier, real-time cannot.
- Low latency systems in Java tend to avoid 3rd party and some standard libraries because most of these have not been written with low latency in mind, so can impose GC, latency and locking overheads that are a problem in low latency applications. However the excellent tooling allows you to easily identify and replace the offending libraries in the critical portions of code, so you can get the best of both worlds - fast time to market and fast code.
- The latency difference between a contended and uncontended lock is typically 3 orders of magnitude.
- Java specific low latency techniques include: JVM warmup; avoiding reflection; using data structures in DirectByteBuffers; using Unsafe; using the excellent array of Java profiling tools to identify exactly what needs optimizing.
- One solution to GC issues is to produce so little garbage that the GC never impacts the application. This has the additional side-effect that you are not filling CPU caches with garbage, which in turn makes the application faster even excluding the lack of GCs.
- Object reuse is a common technique to reduce garbage collections, used in many low-latency libraries.
- Forcing GCs at specific times reduces the GC pressure at other times, if GCs are infrequent enough.
- Common causes for latency outliers include: GC pauses; waiting for I/O; cache misses; cache flushes; context switches; lock contention; network loss and retransmissions.
- Sockets Direct Protocol (SDP) over InfiniBand creates quite a bit of garbage.
- A good way to deal with contention is architecturally using the "single writer principle"; another is asynchronous processing.
- Use a lock-free queue (wait-free on the enqueue) in front of a single writer on a contended resource and use a thread to do all the writing. The thread does nothing but pull items off the queue and write them.
- If you must have contention on a given data resource then atomic instructions tend to be better solutions than locks because they operate in user space without ever involving the kernel.
- Lock tuning techniques include: don't synchronize code that doesn't need it; remove locks that are not protecting anything; reduce the scope of locks that are needed; reduce the time that locks are held for; don't mix the responsibilities of locks; queue updates using lock-free enqueueing and use a single writer to avoid locks altogether.
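The single-writer tips above can be sketched as follows. This is an illustration under stated assumptions rather than the panel's own code: SingleWriterCounter and its methods are hypothetical names, and java.util.concurrent.ConcurrentLinkedQueue is used as the queue, which is lock-free but not wait-free on the enqueue. It also boxes each update into a Long, which creates garbage, so a real low-latency system would more likely use a primitive ring buffer. Thread.onSpinWait() requires Java 9 or later.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

final class SingleWriterCounter {
    private final ConcurrentLinkedQueue<Long> pending = new ConcurrentLinkedQueue<>();
    private final Thread writer = new Thread(this::drainLoop, "single-writer");
    private volatile boolean stopRequested;
    private long total; // only ever written by the writer thread, so no lock is needed

    void start() {
        writer.setDaemon(true);
        writer.start();
    }

    // Called from any thread: enqueue the update instead of contending on the state.
    void add(long delta) {
        pending.offer(delta);
    }

    private void drainLoop() {
        while (true) {
            // Read the flag BEFORE polling: if a stop was already requested and
            // the queue is then found empty, no update can have been missed.
            boolean stopping = stopRequested;
            Long delta = pending.poll();
            if (delta != null) {
                total += delta;
            } else if (stopping) {
                return;
            } else {
                Thread.onSpinWait(); // busy-wait hint (Java 9+)
            }
        }
    }

    // Call only after all producers have finished enqueueing.
    long stopAndGetTotal() throws InterruptedException {
        stopRequested = true;
        writer.join(); // join() makes the writer's final total visible here
        return total;
    }

    // Runs several producer threads against one writer and returns the final total.
    static long demo(int producers, int updatesPerProducer) {
        SingleWriterCounter counter = new SingleWriterCounter();
        counter.start();
        Thread[] threads = new Thread[producers];
        for (int i = 0; i < producers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < updatesPerProducer; j++) {
                    counter.add(1L);
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
            return counter.stopAndGetTotal();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("total = " + demo(4, 1000));
    }
}
```

The producers never touch the shared total, so the only point of contention is the queue's enqueue, and the state itself needs no synchronization.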
Garbage Collection - Let the VM do it (Page last updated June 2013, Added 2013-08-26, Author Jon Masamitsu, Publisher Oracle). Tips:
- UseParallelGC with -XX:+UseAdaptiveSizePolicy is the ergonomic throughput collector (it tries to grow or shrink the heap to meet the specified maximum pause time and/or throughput goal). By default there is no pause time target, and the throughput goal is 98% of the time doing application work and 2% of the time doing GC work.
- To find out what the throughput collector (UseParallelGC with -XX:+UseAdaptiveSizePolicy) ergonomics are doing, set -XX:AdaptiveSizePolicyOutputInterval=1 which will print the ergonomics details every GC.
- GC ergonomics (UseParallelGC with -XX:+UseAdaptiveSizePolicy) tries to meet, in order, the: pause time goal; throughput goal; minimum footprint.
- A lower tenuring threshold moves objects more eagerly to the tenured generation; a higher tenuring threshold keeps copying objects between survivor spaces longer.
- The UseParallelGC collector varies the tenuring threshold dynamically, unlike other collectors.
- With the default throughput goal of 98%, the throughput collector (UseParallelGC with -XX:+UseAdaptiveSizePolicy) usually makes the heap grow to its maximum value. If footprint is important, change the throughput goal, e.g. -XX:GCTimeRatio=4 is a throughput goal of 20% of time spent in GC.
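The arithmetic behind the -XX:GCTimeRatio example above: a value of N asks for a ratio of GC time to application time of 1:N, so the targeted share of total time spent in GC is 1/(1+N). A small sketch of that conversion follows; the class and method names are hypothetical.

```java
final class GcTimeRatio {

    // Fraction of total time the collector is allowed to spend in GC
    // for a given -XX:GCTimeRatio=N, i.e. 1 / (1 + N).
    static double gcFraction(int ratio) {
        if (ratio < 0) {
            throw new IllegalArgumentException("ratio must be >= 0");
        }
        return 1.0 / (1 + ratio);
    }

    // Inverse: the N to pass for a desired GC fraction, N = 1/f - 1.
    static int ratioForGcFraction(double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        return (int) Math.round(1.0 / fraction - 1.0);
    }

    public static void main(String[] args) {
        System.out.println("GCTimeRatio=4 targets a GC share of " + gcFraction(4));
        System.out.println("a 2% GC share corresponds to GCTimeRatio=" + ratioForGcFraction(0.02));
    }
}
```

So -XX:GCTimeRatio=4 gives 1/(1+4) = 20% of time in GC, matching the tip, and a goal of 2% of time in GC corresponds to N = 49.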
C++ like Java for low latency (Page last updated July 2013, Added 2013-08-26, Author Peter Lawrey, Publisher vanillajava). Tips:
- Large heap sizes and low latency requirements don't work well together, so it takes effort to make them work. A quick low effort test would be to try with Zing which may give you sub 10ms pauses with no change to your application (assuming sufficient hardware).
- Using Unsafe can be significantly faster on HotSpot but significantly slower on Android.
- To replay months of data quickly you need high degrees of parallelism and memory mapped files which have pre-canned queries of the data needed, ideally in binary format - these can be accessed repeatedly across threads very fast.
- One approach to completely avoiding GCs is to make Eden large enough (e.g. 10s of GB), and object creation slow enough, so that Eden doesn't fill until you have a quiet period (e.g. overnight) when you can force a full GC.
- Eliminating garbage completely means the L1 and L2 caches always have relevant data and so the application runs much faster as main memory fetches are eliminated or reduced.
- Using shared memory for IPC is much faster than any other IPC technique - 100 nanosecond latency is possible.
- If you bind to an isolated core the worst case latencies can drop from 2 milliseconds to 10 microseconds, but pinning to a non-isolated core doesn't appear to help at all.
- Busy waiting on a pinned and isolated core avoids context switching (at the cost of losing that core to anything else).
- You can overclock and disable hyperthreading on specific cores that are isolated and have specific threads pinned to them, to optimize the CPU for those threads.
- You can write 90% of the code in "natural" Java and use third party libraries, while still getting very low latency if you write the critical 10% in low level Java, thus gaining the best of both Java productivity and tools while still attaining the latencies required.
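The memory mapped file and shared memory tips above rely on mapping the same file region into more than one buffer (or process). The sketch below shows only the basic mechanics with the standard FileChannel.map call, using two mappings inside one JVM for brevity. The class and method names are hypothetical. It assumes the operating system keeps shared mappings of the same file coherent, which the Java specification leaves OS-dependent, and it makes no attempt at the memory-ordering guarantees or the 100 nanosecond latencies a real IPC library would need.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedFileExchange {

    // Maps the first `size` bytes of the file read-write. The mapping stays
    // valid after the channel is closed.
    static MappedByteBuffer map(Path file, int size) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }

    // Writes a value through one mapping and reads it back through a second,
    // independent mapping of the same file (standing in for a second process).
    static long roundTrip(long value) {
        try {
            Path file = Files.createTempFile("mapped-exchange", ".dat");
            file.toFile().deleteOnExit();
            MappedByteBuffer writer = map(file, Long.BYTES);
            MappedByteBuffer reader = map(file, Long.BYTES);
            writer.putLong(0, value); // binary format, no serialization step
            return reader.getLong(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read back: " + roundTrip(42L));
    }
}
```

Because the data lives outside the Java heap in a fixed binary layout, reading it creates no garbage, which fits the garbage-free approach described in the other tips.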
Tuning the Size of Your Thread Pool (Page last updated May 2013, Added 2013-08-26, Author Kirk Pepperdine, Publisher infoQ). Tips:
- If you have too many active threads, your system will spend a significant proportion of time context switching instead of doing application work.
- The number of requests in a system is the rate at which they arrive multiplied by the average amount of time it takes to service each request (Little's law).
- Use Little's law to tune your thread pool: Multiply the rate at which requests arrive by the average amount of time to service them. If that is smaller than your thread pool, you have sufficient capacity and can even reduce your pool size. If not, you need to increase the size of the thread pool - but this is only effective if you have sufficient system capacity to handle an increased thread pool size (if not, you'll need to either increase system capacity or improve service times of the application or discard requests).
- The kernel is a lot more efficient at managing threads than a thread pool, so try running beyond capacity (i.e. stress test) to see how it affects end user response times and throughputs.
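The Little's law check described above can be written down directly: the average number of requests in the system is the arrival rate multiplied by the average service time. A minimal sketch with hypothetical names follows; the inputs (arrival rate per second, average service time in milliseconds) are assumed to come from your own measurements.

```java
final class LittlesLaw {

    // Average number of requests in the system: arrival rate x service time.
    static double concurrentRequests(double arrivalsPerSecond, double avgServiceMillis) {
        return arrivalsPerSecond * avgServiceMillis / 1000.0;
    }

    // Smallest whole number of threads that covers that average concurrency.
    static int requiredThreads(double arrivalsPerSecond, double avgServiceMillis) {
        return (int) Math.ceil(concurrentRequests(arrivalsPerSecond, avgServiceMillis));
    }

    // True if the pool is at least as large as the average concurrency.
    static boolean hasCapacity(int poolSize, double arrivalsPerSecond, double avgServiceMillis) {
        return poolSize >= concurrentRequests(arrivalsPerSecond, avgServiceMillis);
    }

    public static void main(String[] args) {
        // Example figures only: 200 requests/second at 50 ms each.
        System.out.println("threads needed: " + requiredThreads(200, 50));
        System.out.println("pool of 8 sufficient: " + hasCapacity(8, 200, 50));
    }
}
```

With 200 requests per second and a 50 ms average service time, 200 x 0.05 = 10 requests are in flight on average, so a pool of 8 is too small and a pool of 16 has headroom. This is an average, so bursts can still exceed it.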
Java Garbage Collection Distilled (Page last updated July 2013, Added 2013-08-26, Author Martin Thompson, Publisher mechanical-sympathy). Tips:
- GC tuning is about trading between: Throughput (how much overall time the application spends in GC); Latency (individual pause times); and Memory (how much heap is available).
- Generic GC tuning tradeoffs are: Overall GC overhead can be reduced by increasing heap; Worst-case pause times are reduced by minimising the number of live objects and keeping the heap small; The duration between pauses is minimised by controlling allocation rates and optimising heap generation sizes; The frequency of large pauses can be reduced by using concurrent GC.
- The rate at which young generation/minor/Eden collections occur is directly proportional to the rate of object allocation.
- The -XX:+PrintGCApplicationStoppedTime flag is useful to identify all pauses in the application.
- The cost of a minor GC collection is related to the cost of copying objects to the survivor and tenured spaces - work done is directly proportional to the number of live objects found, and not to the size of the new generation.
- The total time (across the application lifetime) spent doing minor collections can almost be halved each time the Eden size is doubled, though this will result in increases in per-collection times (larger individual pauses).
- Reducing promotion failure is fiddly: You can alter the threshold when old gen GCs start (-XX:CMSInitiatingOccupancyFraction=N -XX:+UseCMSInitiatingOccupancyOnly) though more concurrent GCs means more fragmentation; you can alter when objects are promoted (-XX:MaxTenuringThreshold=N); you can tune the buffer size that is present specifically for promotion failures (-XX:PromotedPadding=N).
- Garbage collectors are: Serial collector (-XX:+UseSerialGC); Parallel collector (-XX:+UseParallelGC); Parallel Old collector (-XX:+UseParallelOldGC, may benefit from -XX:+UseNUMA); CMS (-XX:+UseConcMarkSweepGC - but a stop-the-world full GC will get called eventually, and it has a larger footprint); G1 (-XX:+UseG1GC -XX:MaxGCPauseMillis=N but N should be > 100 for now).
- Both CMS and G1 can have significant and regularly occurring stop-the-world events and worst-case scenarios that can make them unsuitable for strict low-latency applications.
- JVMs that should be tested for achieving low latency collections include: Oracle HotSpot, Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing.
- All concurrent collectors targeting latency make you give up some throughput and gain footprint and usually have higher (multi-core) CPU costs.
- Recommended logging flags include: -verbose:gc, -Xloggc:, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationConcurrentTime, -XX:+PrintGCApplicationStoppedTime. jHiccup can also be used to capture pause times.
- Recommended GC viewing tools include: Chewiebug GCViewer, VisualVM with the VisualGC plugin.
- To effectively tune GC you need repeatable representative tests with end-user latency timings.
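Alongside the logging flags above, the throughput side of the trade-off (the share of time spent in GC) can be estimated from inside the JVM with the standard GarbageCollectorMXBean API. The sketch below uses hypothetical class and method names. Note one assumption: getCollectionTime() reports approximate accumulated collection time per collector, which for concurrent collectors is not the same thing as stop-the-world pause time, so this measures overhead, not latency.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverhead {

    // Approximate accumulated collection time across all collectors, in
    // milliseconds. Collectors reporting an undefined time (-1) are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // GC time as a fraction of elapsed time; 0.02 means 2% of time in GC.
    static double overheadFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcTime = totalGcMillis();
        System.out.println("GC time " + gcTime + " ms of " + uptime
                + " ms uptime, overhead " + overheadFraction(gcTime, uptime));
    }
}
```

Sampling this at the start and end of a repeatable test run, and comparing the difference between runs, gives a quick overhead figure to set against the pause times captured by the GC logs or jHiccup.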
Last Updated: 2021-03-29
Copyright © 2000-2021 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss