Java Performance Tuning
Java(TM) - see bottom of page
Tips April 2014
Back to newsletter 161 contents
Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications (Page last updated April 2014, Added 2014-04-30, Author Swapnil Ghike, Publisher LinkedIn). Tips:
- Tune the GC on a codebase that is near completion and includes performance optimizations.
- If you cannot tune on a real workload, you need to tune on synthetic workloads representative of production environments.
- GC characteristics you should optimize include: Stop-the-world pause time duration and frequency; CPU contention with the application; Heap fragmentation; GC memory overhead compared to application memory requirements.
- You need a clear understanding of GC logs and commonly used JVM parameters to tune GC behavior.
- A large heap is required if you need to maintain an object cache of long-lived objects.
- GC logging should use -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime.
- Use GC logging to analyze GC performance.
- GC collection frequency can be decreased by reducing the object allocation/promotion rate and/or increasing the size of the generation.
- The duration of young generation GC pause depends on the number of objects that survive a collection.
- Increasing the young generation size may produce longer young GC pauses if more data survives and gets copied in survivor spaces, or if more data gets promoted to the old generation; but could have the same pause time and decrease in frequency if the count of surviving objects doesn't increase much.
- Applications that mostly create short-lived objects only need to tune the young generation, after making the old generation big enough to hold the initially generated long-lived objects.
- Applications that produce long-lived objects need to tune the application so that the promoted objects fill the old generation at a rate that produces an acceptable frequency of old-generation GCs.
- If the threshold at which old generation GC is triggered is too low, the application can get stuck in incessant GC cycles. For example, the flags -XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly start the old generation GC only when the old generation is 92% full.
- Try to minimize the heap fragmentation and the associated full GC pauses for CMS GC with -XX:CMSInitiatingOccupancyFraction.
- Tune -XX:MaxTenuringThreshold to reduce the amount of time spent in data copying in the young generation collection while avoiding promoting too many objects, by noting tenuring ages in the GC logs.
- Young collection pause duration can increase as the old generation fills up, because object promotion takes longer under backpressure from the old generation (the old gen needs to free space before it can accept promoted objects).
- Setting -XX:ParGCCardsPerStrideChunk (a diagnostic option, so it also needs -XX:+UnlockDiagnosticVMOptions) controls the granularity of tasks given to GC worker threads and can help reduce pause times (in this tuning exercise the value 32768 reduced pause times).
- The -XX:+BindGCTaskThreadsToCPUs option binds GC threads to individual CPU cores (if implemented in the JVM and if the OS permits).
- -XX:+UseGCTaskAffinity allocates tasks to GC worker threads using an affinity parameter (if implemented).
- GC pauses with low user time, high system time and high real (wall-clock) time imply that the JVM pages are being stolen by Linux. You can use -XX:+AlwaysPreTouch and set vm.swappiness to minimize this. Alternatively, use mlock, but this would crash the process if RAM is exhausted.
- GC pauses with low user time, low system time and high real time imply that the GC threads were recruited by Linux for disk flushes and were stuck in the kernel waiting for I/O. You can use -XX:+AlwaysPreTouch and set vm.swappiness to minimize this. Alternatively, use mlock, but this would crash the process if RAM is exhausted.
- The following combination of options enabled an I/O intensive application with a large cache and mostly short-lived objects after initialization to achieve 60ms pause times for the 99.9th percentile latency: -server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m -XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow.
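The tips on collection frequency (generation size versus allocation or promotion rate) can be made concrete with some back-of-the-envelope arithmetic. The following is a minimal sketch, not taken from the article: the class and method names are hypothetical, and the allocation and promotion rates used in main are made-up illustration values that you would replace with figures read from your own GC logs. It assumes the usual meaning of -XX:SurvivorRatio (eden is SurvivorRatio times the size of one of the two survivor spaces).

```java
// Hypothetical helper for rough GC frequency estimates; all names are illustrative.
class GcIntervalSketch {

    // Eden size for a young generation split as eden : survivor : survivor = ratio : 1 : 1.
    static double edenBytes(double youngGenBytes, int survivorRatio) {
        return youngGenBytes * survivorRatio / (survivorRatio + 2.0);
    }

    // A young collection is triggered roughly each time eden fills up.
    static double secondsBetweenYoungGcs(double edenBytes, double allocationBytesPerSecond) {
        return edenBytes / allocationBytesPerSecond;
    }

    // Time until the old generation reaches the CMS initiating occupancy,
    // given its current occupancy and a steady promotion rate.
    static double secondsUntilOldGcTrigger(double oldGenBytes, double initiatingFraction,
                                           double occupiedBytes, double promotionBytesPerSecond) {
        double headroom = oldGenBytes * initiatingFraction - occupiedBytes;
        return Math.max(0.0, headroom) / promotionBytesPerSecond;
    }

    public static void main(String[] args) {
        double gb = 1024.0 * 1024 * 1024;
        double mb = 1024.0 * 1024;

        // Sizes from the flags above: -XX:NewSize=6g -XX:SurvivorRatio=8, -Xmx40g.
        double eden = edenBytes(6 * gb, 8);
        double oldGen = (40 - 6) * gb;

        // ASSUMED rates, for illustration only.
        double allocationRate = 400 * mb; // bytes allocated per second
        double promotionRate = 5 * mb;    // bytes promoted per second
        double cacheSize = 20 * gb;       // long-lived data already in the old generation

        System.out.printf("eden = %.1f GB%n", eden / gb);
        System.out.printf("young GC roughly every %.1f s%n",
                secondsBetweenYoungGcs(eden, allocationRate));
        System.out.printf("old GC trigger (80%% occupancy) after roughly %.0f s%n",
                secondsUntilOldGcTrigger(oldGen, 0.80, cacheSize, promotionRate));
    }
}
```

Halving the allocation rate or doubling eden doubles the estimated interval between young collections, which is the trade-off the tips describe; whether pause duration changes depends on how much data survives each collection.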
Understanding Throughput and Latency Using Little's Law (Page last updated March 2014, Added 2014-04-30, Author Ali Hussain, Publisher Future Chips). Tips:
- Little's Law: occupancy = latency x throughput. Occupancy is the number of requestors in the system, also often referred to as capacity used.
- For a given system, Little's Law says the capacity of the system is fixed, inversely relating the maximum latency and maximum throughput of the system. E.g. if you want to increase the maximum throughput of the system by some factor, you need to decrease the latency by that same factor.
- When system occupancy (capacity used) has reached its peak, additional requests can't add to the throughput; instead they increase the average latency through queuing delay.
- Target latency in preference to throughput. Improving latency improves throughput at the same time - if you do things faster you get more things done. However, it's often easier to increase throughput.
- Throughput at one level determines latency at the next level up.
- When using parallelism to increase throughput, ask: What are the data dependencies and how much data will need to be shared?
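Little's Law as stated above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical names and made-up numbers (a system that can hold 64 requests at once), not code from the article.

```java
// Hypothetical illustration of Little's Law: occupancy = latency x throughput.
class LittlesLawSketch {

    // Number of requests in the system for a given latency and throughput.
    static double occupancy(double latencySeconds, double throughputPerSecond) {
        return latencySeconds * throughputPerSecond;
    }

    // With a fixed capacity, the maximum throughput is capacity / latency.
    static double maxThroughput(double capacity, double latencySeconds) {
        return capacity / latencySeconds;
    }

    // Once throughput is pinned at its maximum, extra requests in the system
    // only show up as higher average latency (queuing delay).
    static double averageLatency(double requestsInSystem, double throughputPerSecond) {
        return requestsInSystem / throughputPerSecond;
    }

    public static void main(String[] args) {
        double capacity = 64;   // ASSUMED: requests the system can process concurrently
        double latency = 0.25;  // ASSUMED: seconds per request

        double peak = maxThroughput(capacity, latency);
        System.out.println("peak throughput (req/s): " + peak);

        // Halving the latency doubles the achievable throughput at the same capacity.
        System.out.println("after halving latency:   " + maxThroughput(capacity, latency / 2));

        // 100 requests in a system that can only complete `peak` per second.
        System.out.println("latency when overloaded: " + averageLatency(100, peak));
    }
}
```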
How the AOL.com Architecture Evolved to 99.999% Availability, 8 Million Visitors Per Day, and 200,000 Requests Per Second (Page last updated February 2014, Added 2014-04-30, Author Dave Hagler, Publisher highscalability). Tips:
- Architect for redundancy in everything, including while one element is offline, i.e. at least triple redundancy. A requirement for five 9's availability (99.999%) corresponds to about 5 minutes of downtime per year.
- Do not depend on any shared infrastructure - if any infrastructure system goes down, the rest of the system should stay up.
- Use caching to optimize performance but not as a requirement for the system to operate at scale. Caches should be a bonus but not a necessity.
- Use front-end load balancing to direct traffic to the nearest unloaded system capable of serving the request.
- Monitor the system and trigger alarms when thresholds for availability and response times are not met. Monitor hosts, CPUs, interfaces, network devices, file systems, web traffic, response codes, response times, and application metrics.
- Test every feature with a subset of users to see how it performs before rolling out to a wider audience.
- Use replicated data closer to the servers if data is distant (e.g. replicated across datacentres so that servers in one datacentre don't need to access data from another datacentre).
- Use a frontend request balancer to the data servers to easily enable scaling by adding new dataservers.
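The "five 9's is about 5 minutes per year" figure follows directly from the availability percentage. The sketch below (hypothetical names, not from the article) reproduces that arithmetic. It also shows why redundancy helps, under the simplifying assumption, made here only for illustration, that replicas fail independently of one another.

```java
// Hypothetical availability arithmetic; names and the independence assumption are illustrative.
class AvailabilitySketch {

    static final double MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600, ignoring leap years

    // Allowed downtime per year for a given availability (e.g. 0.99999).
    static double downtimeMinutesPerYear(double availability) {
        return (1.0 - availability) * MINUTES_PER_YEAR;
    }

    // Availability of `replicas` redundant elements where any one can serve
    // the request, ASSUMING their failures are independent.
    static double combinedAvailability(double singleAvailability, int replicas) {
        return 1.0 - Math.pow(1.0 - singleAvailability, replicas);
    }

    public static void main(String[] args) {
        System.out.printf("five 9's allows %.2f minutes of downtime per year%n",
                downtimeMinutesPerYear(0.99999));
        System.out.printf("three independent replicas at 99%% each give %.4f%% availability%n",
                100 * combinedAvailability(0.99, 3));
    }
}
```

Real failures are often correlated (shared power, network or software bugs), which is one reason the tips above stress not depending on any shared infrastructure.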
Thread Confinement (Page last updated February 2014, Added 2014-04-30, Author Dr. Heinz M. Kabutz, Publisher The Java Specialists' Newsletter). Tips:
- A useful trick for ensuring thread safety is "stack confinement" - any object that is created and doesn't escape a method (i.e. doesn't get referenced from any other object fields) is guaranteed threadsafe, because it only lives in the one thread (the thread owning the stack the method is executing on).
- A useful trick for ensuring thread safety is "thread confinement" - any object which is only ever used in the same thread (typically by being held by a ThreadLocal and never being referenced from any other object fields) is guaranteed threadsafe, because it only lives in the one thread. But beware that ThreadLocal held objects can be the cause of a memory leak if they are not managed carefully.
- Don't leak a ThreadLocalRandom instance, either by storing it in a field or passing it to a method or having it leak accidentally to an anonymous class. Access it using ThreadLocalRandom.current() (you can call that every time or store that to a method local variable and use it that way).
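The three confinement tips above might look like this in code. This is a minimal sketch with hypothetical class and method names, not code from the newsletter; SimpleDateFormat is used only as a familiar example of a class that is not threadsafe.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical examples of stack confinement and thread confinement.
class ConfinementSketch {

    // Thread confinement: each thread lazily gets its own formatter,
    // and the formatter is never stored anywhere else.
    private static final ThreadLocal<SimpleDateFormat> DAY_FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    static String formatDay(Date date) {
        return DAY_FORMAT.get().format(date);
    }

    // Drop the calling thread's formatter, e.g. before a pooled thread is
    // handed back, so the ThreadLocal does not keep it alive indefinitely.
    static void releaseFormatter() {
        DAY_FORMAT.remove();
    }

    // Stack confinement: the list is created here, is only referenced from a
    // local variable, and never escapes, so no other thread can reach it.
    static int sumOfSquares(int n) {
        List<Integer> squares = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            squares.add(i * i);
        }
        int sum = 0;
        for (int square : squares) {
            sum += square;
        }
        return sum;
    }

    // ThreadLocalRandom: obtain it via current() and keep it in a local
    // variable only; never store it in a field or hand it to another thread.
    static int rollDie() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        return random.nextInt(1, 7); // upper bound is exclusive
    }

    public static void main(String[] args) {
        System.out.println(formatDay(new Date()));
        System.out.println(sumOfSquares(3));
        System.out.println(rollDie());
        releaseFormatter();
    }
}
```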
The Facts and Fiction of In-Memory Computing (Page last updated April 2014, Added 2014-04-30, Author Nikita Ivanov, Publisher JDJ). Tips:
- In-memory computing speeds up data processing by roughly 5,000 times compared to using disks.
- In-memory computing requires less hardware meaning decreased capital, operational and infrastructure overhead; existing hardware lifetimes can also be extended.
- In-memory computing uses a tiered approach to data storage: RAM, local disk and remotely accessible disk.
- In-memory computing only puts the operational data into memory for processing - offline, backup and historical data should all remain out of memory.
Parallel Array Operations in Java 8 (Page last updated March 2014, Added 2014-04-30, Author Eric Bruno, Publisher DrDobbs). Tips:
- Arrays.parallelSort() implements a parallel sort-merge algorithm that recursively breaks an array into pieces, sorts them, then recombines them concurrently and transparently, resulting in gains in performance and efficiency when sorting large arrays (compared to using the serial Arrays.sort). For large arrays, this can improve sorting time by up to a factor corresponding to the number of cores available.
- Arrays.parallelPrefix() allows a mathematical operation to be performed automatically in parallel on an array. For large arrays, this can improve calculation time by up to a factor corresponding to the number of cores available.
- Arrays.parallelSetAll() sets each element of an array in parallel using any specified function.
- Spliterators split and traverse arrays, collections and channels in parallel.
Spliterator spliterator = ...; // e.g. Arrays.spliterator(anArray)
spliterator.forEachRemaining( n -> action(n) );
- Stream processing in Java 8 allows for parallel processing on a dataset derived from an array or collection or channel. A Stream allows you to perform aggregate functions such as pulling out only distinct values (ignoring duplicates) from a set, data conversions, finding min and max values, map-reduce functions, and other mathematical operations.
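The Java 8 array operations listed above can be tried out with a short program. The following is a minimal sketch; the class and method names are hypothetical, and the tiny arrays are only for readability (any speed-up from parallelism would only show on large arrays).

```java
import java.util.Arrays;
import java.util.Spliterator;
import java.util.function.IntConsumer;

// Hypothetical wrappers around the Java 8 parallel array operations.
class ParallelArraySketch {

    // parallelSetAll: each element is computed from its index.
    static int[] squares(int length) {
        int[] result = new int[length];
        Arrays.parallelSetAll(result, i -> i * i);
        return result;
    }

    // parallelPrefix: element i becomes the cumulative result of the
    // operation over elements 0..i (here, a running total).
    static int[] runningTotals(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.parallelPrefix(copy, Integer::sum);
        return copy;
    }

    // parallelSort: sorts in place, so work on a copy to leave the input alone.
    static int[] sortedCopy(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.parallelSort(copy);
        return copy;
    }

    // Parallel stream: keep only distinct values, in ascending order.
    static int[] distinctSorted(int[] input) {
        return Arrays.stream(input).parallel().distinct().sorted().toArray();
    }

    // Spliterator: traverse the remaining elements with a single action.
    static long sumViaSpliterator(int[] input) {
        Spliterator.OfInt spliterator = Arrays.spliterator(input);
        long[] total = {0};
        IntConsumer add = n -> total[0] += n;
        spliterator.forEachRemaining(add);
        return total[0];
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 3, 2, 1};
        System.out.println(Arrays.toString(squares(4)));
        System.out.println(Arrays.toString(runningTotals(data)));
        System.out.println(Arrays.toString(sortedCopy(data)));
        System.out.println(Arrays.toString(distinctSorted(data)));
        System.out.println(sumViaSpliterator(data));
    }
}
```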
Last Updated: 2019-12-31
Copyright © 2000-2019 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss