Java Performance Tuning
Tips June 2013
Advanced JVM Tuning (Page last updated January 2013, Added 2013-06-25, Author David Keenan, Monica Beckwith, Publisher Oracle). Tips:
- Change one thing at a time.
- Top down analysis and tuning: Monitor OS and the JVM with the application under load; Profile the application; Tune the OS and JVM; Change the code.
- Bottom up analysis and tuning (when you can't change the code): Gather metrics from the lowest level - CPU statistics like cache misses, CPI, path length. Target improving CPU efficiency. The sequence is: Monitor the CPU; profile the application using hardware counters; identify problematic instruction sequences; tune the OS, JVM and native libraries, and use code generation to adapt the bytecode.
- The largest performance improvements come from changing the code, but tuning the OS and JVM is the easiest change to make.
- -XX:+TieredCompilation (available since J7u4, not on by default up to J7u21) uses both client and server JIT compilers for improved optimisation levels (instruments compiled code). For most applications this should give a slight speedup, but the code cache (-XX:ReservedCodeCacheSize) likely needs increasing in size (compilation will stop if the code cache gets full). Use the PrintCompilation flag and count the compilations to determine how much has been used; heap dumps also dump the code cache.
- Fault in pages up front rather than dynamically with -XX:+AlwaysPreTouch.
- For large heaps, use -XX:+UseLargePages -XX:LargePageSizeInBytes=256m - you also need to enable large pages in the OS.
- If you have more than 1009 unique class names the system dictionary performance will degrade (you'll see it in profiles). Use -verbose:class to determine what classes are loaded; use -XX:+UnlockExperimentalVMOptions -XX:PredictedLoadedClassCount=# to tune.
- If you have more than 1009 interned strings the interned string table performance will degrade. Use -XX:+PrintStringTableStatistics to find how many interned strings are loaded; use -XX:StringTableSize=# to tune.
- Use -XX:PermSize and -XX:MaxPermSize to set the perm gen size - perm gen is only collected on full GCs.
- Use the -XX:+UseCompressedOops flag when running heaps up to 32GB.
- On Linux you can tune the OS scheduler with policies and groups with CFS
- If you have multiple JVMs on a box, the default GC thread count is probably too high, target the total number of (live?) GC threads to be the same as the total number of hardware threads using -XX:ParallelGCThreads=# (default per JVM is 5/16 of hardware threads)
- Set the Java tmp directory to the OS tmp: e.g. -Djava.io.tmpdir=/tmp
- Collect short-lived objects in the young gen by sizing the young gen big enough but not too large (wastes space) - age objects in the young gen and try to collect them there; size the old gen to hold steady state objects; keep the total memory size out of swap!
- Choose the collector to use depending on your overall goals: throughput; pause times; footprint.
- Tuning for throughput: use the parallel collector with -XX:+UseAdaptiveSizePolicy on (alters various parameters dynamically to reduce overheads). The GCTimeRatio (default 99) sets the target ratio of application time to GC time, so the default aims for 99% of time in the application and 1% in GC; the TargetSurvivorRatio (default 50) reduces the impact of spikes in surviving objects.
- Use GC log output from PrintGCDetails and PrintGCTimeStamps
- NewRatio sets the young to old generation ratio; or you can set NewSize and MaxNewSize.
- Target the GC overhead to be less than 5% of the total runtime.
- Only disable the adaptive size policy (-XX:-UseAdaptiveSizePolicy) if ergonomics are failing to reduce overheads, e.g. objects are overflowing (promoted) into the old generation or the tenuring distribution is wrong. In this case set heap sizes (eden, survivor ratio, new size) and tenuring thresholds explicitly.
- For G1, setting Xmn or NewSize explicitly will interfere with G1. G1 is targeted at acceptable pause times (200ms), not throughput (GCTimeRatio is 90%). Too aggressive pause time targets will impact throughput.
- G1 advanced tuning options: InitiatingHeapOccupancyPercent (default 45, sets how full the heap is for when the collector mark phase will kick in); G1MixedGCCountTarget (default 4, maximum number of mixed GCs to be performed after a mark); G1OldCSetRegionLiveThresholdPercent (default 90, the amount of live data in the region that will be considered to be included in a collection).
- Low latency collector - CMS. Know the average, maximum and frequency of your pauses. Resize your young generation: if collections are too frequent, make it larger; if pauses are too long, make it smaller.
- Promotions in CMS are expensive and cause fragmentation. CMS old gen collections should be occasional, not frequent, or fragmentation will cause stop-the-world full GCs.
- CMS initiation tuning: -XX:CMSInitiatingOccupancyFraction=percent (how full the old gen is to initiate a CMS) -XX:+UseCMSInitiatingOccupancyOnly (ignore ergonomics and use the explicitly set previous flag) -XX:CMSInitiatingPermOccupancy=percent (how full the perm gen is to initiate a CMS, class unloading should be enabled).
- G1 pause time targets - set the time and frequencies -XX:MaxGCPauseMillis=#
- For G1, setting InitiatingHeapOccupancyPercent too low will increase the frequency of GCs; setting it too high will lead to an evacuation failure and fallback to a stop-the-world full GC. It should be set to a higher value than your steady state live size.
- Tuning for footprint: Know the live data set (heap size after GC), promotion rate (number of objects overflowing into the old generation per second) and allocation rate (number of objects created per second). Then adjust the heap size to handle those. If this is too large, get a heap profile and reduce the heap with more space-efficient structures; work on reducing retained objects and object allocations.
- Larger heaps (footprint) mean fewer GCs (throughput), but longer to collect (pause times)
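The 5% GC overhead target above can be checked from inside a running JVM. The following is a minimal sketch, assuming the standard java.lang.management beans are an acceptable approximation (for concurrent collectors the reported collection time may not equal pause time); the class and method names are hypothetical, chosen for illustration only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverheadCheck {
    // Percentage of elapsed time spent in GC; returns 0 if no time has elapsed.
    static double overheadPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    // Sums the accumulated collection time reported by every collector in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double percent = overheadPercent(totalGcMillis(), uptime);
        System.out.printf("GC overhead: %.2f%% (target: below 5%%)%n", percent);
    }
}
```

Running this periodically under load gives a rough trend; the GC logs from PrintGCDetails remain the more precise source.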
7 Life Saving Scalability Defenses Against Load Monster Attacks (Page last updated March 2013, Added 2013-06-25, Author Todd Hoff, Publisher highscalability). Tips:
- Specify the resources that have limits; measure the use of the resource as a proportion of that limit; prevent resource exhaustion by staying below the limit.
- Merge operations on a particular resource to reduce the per-operation overhead.
- Delete operations that are no longer necessary, e.g. if queued and a later operation would make an earlier one pointless, then eliminate the earlier operation.
- Where only some changes will actually be used, don't send all changes, send instead the information that something has changed and let the changed data be accessed as needed.
- Delay events becoming active until a cancellation period has passed, to prevent temporary situations causing cascade storms.
- Reject or redirect work if the resources needed to process the work do not have the capacity to process the work item.
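Several of these defenses (staying below a limit, dropping superseded operations, merging work into batches, rejecting work when there is no capacity) can be combined in one small structure. This is a sketch only, not taken from the article: CoalescingQueue, its methods and the key-per-resource model are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CoalescingQueue<K, V> {
    private final Map<K, V> pending = new LinkedHashMap<>();
    private final int limit;

    CoalescingQueue(int limit) {
        this.limit = limit;
    }

    // Queues an update for a resource. Returns false (work rejected) when
    // accepting a new key would push the queue past its limit.
    synchronized boolean offer(K key, V value) {
        if (!pending.containsKey(key) && pending.size() >= limit) {
            return false;
        }
        pending.put(key, value); // a later update replaces the earlier, now pointless, one
        return true;
    }

    // Hands back everything pending as one batch, so the per-operation
    // overhead is paid once per batch rather than once per update.
    synchronized Map<K, V> drain() {
        Map<K, V> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

A caller that receives false from offer() can redirect the work elsewhere or report overload, rather than letting the queue grow without bound.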
Scalable System Design (Page last updated April 2011, Added 2013-06-25, Author Ricky Ho, Publisher JavaLobby). Tips:
- Scalability is not the same as performance - you may need to reduce performance in order to achieve scalability, e.g. by distributing an application across multiple servers where it would be faster on just one server because all the components are local to each other, but the one server cannot provide sufficient capacity.
- Key metrics to consider and measure include: Number of users, Transaction volume, Data volume, Response time, Throughput.
- Prioritise requests so that higher priority requests get adequate resources provided, at the expense of lower priority requests.
- Modular code is essential - being able to easily swap out old code with new code allows you to quickly experiment with different optimizations.
- Use end-to-end improvement measurements to identify bottlenecks.
- Capacity planning is important. Collect usage statistics, predict the growth rate.
- Stateless applications can scale horizontally indefinitely - use a front-end loadbalancer and distribute to available servers.
- Use data partitioning to spread the data load. Data that is accessed together should be kept together on the same server.
- A sophisticated data partitioning approach is to migrate data continuously according to data access patterns.
- Try to make any critical algorithms parallel. Use parallel frameworks like Map/Reduce (Hadoop).
- A content delivery network (copies of static data geographically distributed for local distribution) is an efficient way to deliver static data.
- Caching is essential for scaling efficiently.
- Reuse sessions, threads and connections and any other expensive to construct resources.
- Determine where inaccuracy is acceptable and the level of inaccuracy acceptable, and if that is faster to calculate, provide partially accurate (interim) results.
- Asynchronous processing allows you to use resources the most efficiently, at the cost of response times (queue requests, process when resources are available, use callbacks to return results asynchronously).
- Concurrent access scenarios need careful synchronization - make sure the locking is fine-grained or coarse-grained enough. Make sure you detect deadlocks and prevent or break out of them. Lock-free data structures are often more scalable.
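As an illustration of the data partitioning tip, the sketch below routes every record that shares a partition key to the same server. It is a simple modulo-hash scheme under assumed names (KeyPartitioner, serverFor, the server labels), not a method prescribed by the article; note that with plain modulo hashing most keys move when the number of servers changes.

```java
import java.util.ArrayList;
import java.util.List;

final class KeyPartitioner {
    private final List<String> servers;

    KeyPartitioner(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.servers = new ArrayList<>(servers); // defensive copy
    }

    // Records with the same partition key always map to the same server,
    // so data that is accessed together stays together.
    String serverFor(String partitionKey) {
        int index = Math.floorMod(partitionKey.hashCode(), servers.size());
        return servers.get(index);
    }
}
```

Choosing the partition key (for example a customer id rather than an order id) decides which data ends up co-located.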
Hashmap Internal Implementation Analysis in Java (Page last updated December 2012, Added 2013-06-25, Author Niranjan, Publisher tekmarathon). Tips:
- Whenever the number of entries in a HashMap reaches 75% of its current capacity (the default load factor), it doubles the capacity and reinserts all elements into the new table. So for higher efficiency, explicitly give the expected size of the HashMap when creating it.
- HashMap is not thread safe - this is specified in the documentation. If used from multiple threads it can get corrupted; one particular corruption occurs on resizing, and sends the HashMap.put() into an infinite loop.
- The more collisions (elements mapped to the same table entry) in the HashMap, the less efficient the HashMap is.
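A minimal sketch of presizing, assuming the default load factor of 0.75: the initial capacity is chosen so that the expected number of entries never crosses the resize threshold. PresizedMaps and its method names are hypothetical helpers, not part of the JDK.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedMaps {
    // Smallest initial capacity whose 75% threshold is not exceeded by
    // the expected number of entries.
    static int initialCapacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    // HashMap rounds the requested capacity up to a power of two internally.
    static <K, V> Map<K, V> newHashMap(int expectedEntries) {
        return new HashMap<>(initialCapacityFor(expectedEntries));
    }
}
```

For example, PresizedMaps.newHashMap(750) requests a capacity of 1000, so inserting 750 entries should not trigger a resize.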
Virtualizing and Tuning Large Scale Java Applications (Page last updated March 2013, Added 2013-06-25, Author Emad Benjamin, Publisher infoQ). Tips:
- Establish how many JVMs will be on the platform, and what the mix is (throughput/latency requirements). Run load tests with that mix so that you can identify resource utilisation requirements of the expected mix.
- Iterate load tests, looking for the bottleneck layer (network, storage, application configuration, platform OS); remove the bottleneck and scale the load test for the next iteration until the required scale is achieved.
- The JVM process memory has multiple spaces: heap, perm gen, stack, guest os memory (for VMs), and other (sockets, jit info, direct buffers, etc).
- The stack space used by each thread can be significant for many threads.
- For virtualised JVM processes, the OS guest memory can be quite large e.g. 0.5GB for a 4GB heap - and don't forget the ~0.5GB needed for the JVM off heap memory (say assuming 256MB for Perm).
- The heap size should typically be 3x-4x the size of the active data size (1x active in old gen, 1x arriving from young gen, 1x space to operate).
- Pinning the JVM to a NUMA node -XX:+UseNUMA gives a significant improvement in memory throughput - but the JVM needs to be sized smaller than the NUMA node memory (the memory per socket) for this to be fully effective (on ESX that would be 47GB max size for the JVM process - including off heap memory).
- Disable NIC interrupt coalescing for latency-sensitive virtual machines (but this has a higher CPU cost).
- A large thread stack size can be beneficial for objects that are very short-lived and thread specific - such objects might not even escape to Eden with a large thread stack size.
- You can't have both reduced response time and increased throughput without compromise - it's best to separate the requirements into different JVM instances and tune separately.
- For CMS tuning, use three-step tuning - first young gen: Determine minor GC frequency and duration, then tune the number of parallel threads and the young gen size. Then old gen - specify the size of the heap. Then adjust the survivor space sizes. The young gen should be less than a third of the total heap.
- Other CMS flags typically used: -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=51 to 75 -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:+UseBiasedLocking (if not a high level of contention) -XX:MaxTenuringThreshold=4 to 15 -XX:+UseCompressedOops (now the default) -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache
- The main IBM garbage collection policies are: -Xgcpolicy:optthruput (default, throughput collector), -Xgcpolicy:optavgpause (targets pause time where pauses don't have much variability), -Xgcpolicy:gencon (targets pause time, useful for memory bound applications).
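The sizing rules above (heap at 3x-4x the active data, plus perm gen, thread stacks, other off-heap memory and guest OS memory) amount to simple arithmetic. The sketch below works it through; the thread count, per-thread stack size and the "other" figure are illustrative assumptions, and VmMemoryEstimate is a hypothetical helper name.

```java
final class VmMemoryEstimate {
    // Heap sized as a multiple (typically 3 or 4) of the active data set.
    static long heapMb(long activeDataMb, int multiplier) {
        return activeDataMb * multiplier;
    }

    // JVM process memory = heap + perm gen + thread stacks + other off-heap
    // memory (sockets, JIT info, direct buffers and so on).
    static long jvmProcessMb(long heapMb, long permMb, int threads,
                             long stackKbPerThread, long otherMb) {
        long stacksMb = threads * stackKbPerThread / 1024;
        return heapMb + permMb + stacksMb + otherMb;
    }

    // The virtual machine also needs memory for the guest OS itself.
    static long vmMb(long jvmProcessMb, long guestOsMb) {
        return jvmProcessMb + guestOsMb;
    }

    public static void main(String[] args) {
        long heap = heapMb(1024, 4);                        // 1GB active data -> 4GB heap
        long jvm = jvmProcessMb(heap, 256, 100, 1024, 156); // assumed: 100 threads, 1MB stacks
        long vm = vmMb(jvm, 512);                           // 0.5GB guest OS memory
        System.out.println("heap=" + heap + "MB jvm=" + jvm + "MB vm=" + vm + "MB");
    }
}
```

With these assumed inputs the off-heap portion comes to 512MB, in line with the ~0.5GB figure quoted above for a 4GB heap.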
Five Ways of Synchronising Multithreaded Integration Tests (Page last updated April 2013, Added 2013-06-25, Author Roger Hughes, Robert Saulnier, Publisher JavaLobby). Tips:
- To synchronize across threads you can: use a random delay, then check that the threads are done (and repeat if not); use a CountDownLatch; use Thread.join(); acquire a Semaphore; use Futures with ExecutorServices; poll a volatile; use wait-notify; use Locks.
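Of the options listed, a CountDownLatch is often the most direct for tests. The following is a minimal sketch, assuming the work under test can be submitted as tasks to an executor; LatchExample, runWorkers and the 5 second timeout are illustrative choices, not taken from the article.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class LatchExample {
    // Starts the given number of workers (must be at least 1), waits until
    // all of them have finished, and returns how many completed.
    static int runWorkers(int workers) {
        CountDownLatch done = new CountDownLatch(workers);
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int i = 0; i < workers; i++) {
                pool.execute(() -> {
                    try {
                        completed.incrementAndGet(); // stands in for the real work
                    } finally {
                        done.countDown(); // always signal, even if the work throws
                    }
                });
            }
            // Bounded wait so a hung worker fails the test instead of blocking it.
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
            return completed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Compared with a random delay, the latch releases the test thread as soon as the last worker finishes, so the test is neither flaky nor slower than it needs to be.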
Last Updated: 2018-10-29
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss