Java Performance Tuning
Tips October 2016
Back to newsletter 191 contents
Understanding Retry Pattern with Exponential back-off and Circuit Breaker Pattern (Page last updated October 2016, Added 2016-10-30, Author Rahul Rajat Singh, Publisher rahulrajatsingh). Tips:
- Transient faults are inevitable in any distributed system. Transient failures can often be circumvented simply by calling the service again after a delay. But you need to identify that the fault is transient, by classifying errors and eliminating non-transient ones.
- Because a fault could be due to overload, retrying the request can make the situation worse. For this reason, retries should be separated by increasingly long delays (the article recommends exponentially increasing delays).
- A typical transient fault retry should: 1. Identify if the fault is a transient fault; 2. Define the maximum retry count; 3. Retry with increasing delays between retries until success or maximum retry count is reached.
- Long-lasting transient faults incur multiple wasteful retries with the retry pattern; the circuit breaker pattern reduces these by immediately returning failure, without making any request, once too many calls have failed even with retries. After a timeout period, requests are allowed to go back to the normal retry pattern, with failures again triggering the circuit breaker.
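The retry steps and the circuit breaker described above can be sketched together as follows. This is a minimal, single-threaded illustration rather than the article's implementation: the class name `RetryingCaller`, its constructor parameters, the doubling of the delay, and the use of `IllegalStateException` for fail-fast are all assumptions made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical sketch: retry with exponential back-off behind a simple circuit breaker. Not thread-safe. */
final class RetryingCaller {
    private final int maxRetries;                   // maximum retry count
    private final long baseDelayMillis;             // first delay; doubled after each failed attempt
    private final Predicate<Exception> isTransient; // classifies faults as transient or not
    private final int failureThreshold;             // exhausted calls before the breaker opens
    private final long openNanos;                   // how long the breaker stays open

    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtNanos = 0;

    RetryingCaller(int maxRetries, long baseDelayMillis, Predicate<Exception> isTransient,
                   int failureThreshold, long openMillis) {
        this.maxRetries = maxRetries;
        this.baseDelayMillis = baseDelayMillis;
        this.isTransient = isTransient;
        this.failureThreshold = failureThreshold;
        this.openNanos = TimeUnit.MILLISECONDS.toNanos(openMillis);
    }

    <T> T call(Callable<T> service) throws Exception {
        if (open) {
            if (System.nanoTime() - openedAtNanos < openNanos) {
                // Breaker open: fail immediately without calling the service.
                throw new IllegalStateException("circuit open: failing fast");
            }
            // Timeout elapsed: go back to normal retries. The failure count is kept,
            // so one more exhausted call re-opens the breaker straight away.
            open = false;
        }
        long delay = baseDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                T result = service.call();
                consecutiveFailures = 0; // success resets the breaker's count
                return result;
            } catch (Exception e) {
                if (!isTransient.test(e)) {
                    throw e; // non-transient fault: retrying will not help
                }
                if (attempt >= maxRetries) {
                    recordExhaustedCall();
                    throw e; // retries used up
                }
                Thread.sleep(delay);
                delay *= 2; // exponentially increasing delay
            }
        }
    }

    private void recordExhaustedCall() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtNanos = System.nanoTime();
        }
    }
}
```

A caller might construct it as `new RetryingCaller(3, 100, e -> e instanceof java.io.IOException, 5, 30_000)` and wrap each remote request in `call(...)`; the values shown are placeholders, not recommendations.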
High Bandwidth Data & Low-latency Java (Page last updated September 2016, Added 2016-10-30, Author Ryan Pratt, Justin Flude & Michael Vander Pluym, Publisher GOTO). Tips:
- Many reliable transport solutions work by storing the data that's being transmitted, and retransmitting when clients indicate that the data was not received. This means that when the route is congested, you get not just packet loss but also requests for even more packets, making the congestion worse. Where data can be acceptably dropped (eg you're only interested in the most recent item rather than all the items) you can improve on a general purpose reliable transport mechanism by asking for the "latest" packet (saying what has been received) rather than sending a simplistic retransmission request (and that request can be ignored if a newer version has already been sent).
- Reliable transport mechanisms (which work by temporarily storing data that's being transmitted for retransmitting if clients don't get the first attempt) are only reliable up to the size of storage available. There will always be a slow enough client that will lose data (unless all data can be stored permanently).
- A memory mapped file (backed by disk or shared memory) with one writer and multiple readers is an efficient way to share data across multiple processes, which can be different implementations (even different languages).
- Unsafe lets you manage off-heap memory if you want to avoid heap objects for garbage collection; you can also avoid creating objects in the heap by reusing objects.
- You can avoid locking by using a version which is invalidated at the start of a write, then made valid at the end; readers only read when the version is valid and the same value at the beginning and end of the read.
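The version-based technique in the last tip can be sketched as below, assuming a single writer thread. The class name `VersionedPair` and its two `long` fields are hypothetical. All fields are declared `volatile` so that the version checks and the data reads are not reordered under the Java memory model; this is a conservative choice made for the sketch, not something the talk specifies.

```java
/** Hypothetical sketch of a version-validated read: one writer, many readers, no locks. */
final class VersionedPair {
    // Odd version = write in progress (invalid); even version = valid.
    private volatile long version = 0;
    private volatile long first;
    private volatile long second;

    /** Must only ever be called from a single writer thread. */
    void write(long newFirst, long newSecond) {
        long v = version;
        version = v + 1;   // invalidate at the start of the write (now odd)
        first = newFirst;
        second = newSecond;
        version = v + 2;   // make valid again at the end of the write (now even)
    }

    /** Returns a consistent {first, second} snapshot, retrying if a write interfered. */
    long[] read() {
        while (true) {
            long before = version;
            if ((before & 1L) != 0) {
                Thread.yield(); // a write is in progress; try again
                continue;
            }
            long f = first;
            long s = second;
            if (version == before) {
                // Version was valid and unchanged across the read, so f and s belong together.
                return new long[] {f, s};
            }
        }
    }
}
```

Readers never block the writer; the cost is that a reader may have to repeat its read when it overlaps with a write.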
Combining Collections and Concurrency (Page last updated October 2016, Added 2016-10-30, Author Chris Hegarty, Mike Duigou, Publisher JavaOne). Tips:
- The synchronized keyword gets a lock at the beginning of the method or block, and releases it at the end, enforcing serial access with respect to that lock.
- Using volatile on a field which will be updated within a synchronized block can make performance worse. Don't use volatile unless you know what you need it for and what its impact will be.
- synchronized adds little overhead to uncontended code because of JVM optimizations. For contended code it serializes execution of the code blocks across threads, which makes the code work correctly but is often not optimal for performance. synchronized is the right thing in many (most) cases.
- If you have multiple conditions on which you're blocking, java.util.concurrent.locks.Lock allows for multiple conditions to be signalled for one lock, whereas synchronized only supports one condition per lock.
- Bounded vs unbounded data structures is something you need to think about at the design stage. Unbounded structures are convenient but have the consequence of potentially filling memory.
- Locks are heavyweight for many operations; compare-and-swap operations (used in the Atomic* classes) are often a valid lower-cost alternative.
- The streams library (Java 8+) gives you a relatively straightforward way to provide safely concurrent processing.
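The tip about multiple conditions on one lock can be illustrated with a bounded buffer, which also shows a bounded structure applying back-pressure instead of growing without limit. The sketch follows the general pattern in the `java.util.concurrent.locks.Condition` documentation; the class name `BoundedBuffer` and its fields are chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch: one lock with two separate conditions, "not full" and "not empty". */
final class BoundedBuffer<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();   // producers wait here
    private final Condition notEmpty = lock.newCondition();  // consumers wait here
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    void put(T item) throws InterruptedException {
        lock.lock();
        try {
            while (items.size() == capacity) {
                notFull.await();      // blocks producers only while the buffer is full
            }
            items.addLast(item);
            notEmpty.signal();        // wakes one waiting consumer, never a producer
        } finally {
            lock.unlock();
        }
    }

    T take() throws InterruptedException {
        lock.lock();
        try {
            while (items.isEmpty()) {
                notEmpty.await();     // blocks consumers only while the buffer is empty
            }
            T item = items.removeFirst();
            notFull.signal();         // wakes one waiting producer, never a consumer
            return item;
        } finally {
            lock.unlock();
        }
    }
}
```

With `synchronized` and `wait`/`notifyAll`, producers and consumers would share a single wait set, so each notification could wake threads that cannot make progress.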
Java Performance Companion extract (Ch. 1, Garbage First Overview) (Page last updated April 2016, Added 2016-10-30, Author Charlie Hunt, Monica Beckwith, Poonam Parhar, Bengt Rutisson, Publisher Addison-Wesley Professional). Tips:
- Serial GC, Parallel GC and Concurrent Mark Sweep GC correspond roughly to "minimize memory footprint", "maximize application throughput" and "minimize GC-related pause time". G1 GC aims to keep pause times reasonable for very large heaps.
- Parallel GC is parallel and stop-the-world in both young and old generations; old generation collections also perform compaction; on average pause times tend to increase according to the size of the heap. Compaction is a function of the size of the Java heap and the number and size of live objects in the old generation and may take a considerable amount of time. Pause times are likely to stay reasonable for heaps below 4GB.
- Concurrent Mark Sweep (CMS) GC is a parallel stop-the-world collector in the young generation but a mostly concurrent one in the old generation - the old generation collections attempt to avoid long pauses in application threads by doing most of their work concurrently with application thread execution. CMS tends to need extra heap (eg 20%) to manage the collection. The major tuning challenge with CMS is to avoid fragmentation, because the compaction needed to handle it uses Serial GC and so can lead to very long pauses. CMS is difficult to tune.
- G1 GC is partially concurrent, parallel and stop-the-world. G1 divides the heap into a target of around 2000 regions (but can handle many more or many fewer), all the same size, one of 1MB/2MB/4MB/8MB/16MB/32MB. By collecting only some regions at a time rather than the whole heap (for both young and old generations), pause times are kept low. The number of regions collected is based on how much space will be freed together with the pause time target. G1 GC does compaction as it goes, but in the worst case, where too many regions are fragmented, it will fall back to a full, long-pause old generation collection and compaction.
- G1 GC old generation collection is triggered when a Java heap occupancy threshold (-XX:InitiatingHeapOccupancyPercent, default 45%, which compares to the entire Java heap, not purely against the old gen - CMS old gen threshold is purely an old gen occupancy threshold) is exceeded.
- If the marking phase does not finish prior to running out of available regions, G1 will fall back to a serial full GC to free up memory.
- G1 calculates the time spent performing GC compared to the time spent executing the Java application. If too much time is spent in GC according to the command-line setting -XX:GCTimeRatio (default 9, the larger the value the more aggressive the size increase), the Java heap size is increased (up to Xmx) so that GCs are triggered less frequently.
- GC related flags include: -XX:+ClassUnloadingWithConcurrentMark, -XX:+CMSParallelInitialMarkEnabled, -XX:+CMSParallelRemarkEnabled, -XX:ConcGCThreads, -XX:G1ConcRefinementGreenZone, -XX:G1ConcRefinementRedZone, -XX:G1ConcRefinementThreads, -XX:G1ConcRefinementYellowZone, -XX:G1HeapWastePercent, -XX:G1MaxNewSizePercent, -XX:G1MixedGCCountTarget, -XX:G1MixedGCLiveThresholdPercent, -XX:G1NewSizePercent, -XX:+G1PrintRegionLivenessInfo, -XX:G1ReservePercent, -XX:+G1SummarizeRSetStats, -XX:G1SummarizeRSetStatsPeriod, -XX:+G1TraceConcRefinement, -XX:+G1UseAdaptiveConcRefinement, -XX:+G1UseAdaptiveIHOP, -XX:GCTimeRatio, -XX:+HeapDumpAfterFullGC, -XX:+HeapDumpBeforeFullGC, -XX:InitiatingHeapOccupancyPercent, -XX:MaxGCPauseMillis, -XX:MaxHeapFreeRatio, -XX:MaxTenuringThreshold, -XX:MinHeapFreeRatio, -XX:ParallelGCThreads, -XX:+ParallelRefProcEnabled, -XX:+PrintAdaptiveSizePolicy, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps, -XX:+PrintReferenceGC, -XX:+PrintStringDeduplicationStatistics, -XX:+ResizePLAB, -XX:+ResizeTLAB, -XX:SoftRefLRUPolicyMSPerMB, -XX:StringDeduplicationAgeThreshold, -XX:TargetSurvivorRatio, -XX:+UnlockCommercialFeatures, -XX:+UnlockDiagnosticVMOptions, -XX:+UnlockExperimentalVMOptions, -XX:+UseConcurrentMarkSweepGC, -XX:+UseG1GC, -XX:+UseParallelGC, -XX:+UseParallelOldGC, -XX:+UseSerialGC, -XX:+UseStringDeduplication
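The arithmetic behind three of the G1 tips above (region size, the heap occupancy threshold, and GCTimeRatio) can be made concrete with a rough sketch. This is a simplified illustration, not HotSpot's actual sizing code. It assumes a target of 2048 regions for the "around 2000" mentioned above, rounds the region size down to a power of two, and reads GCTimeRatio=N as a GC-to-application time ratio of 1:N, so that GC may take at most 1/(1+N) of total time. The class and method names are hypothetical.

```java
/** Hypothetical back-of-envelope helpers for the G1 figures quoted in the tips. */
final class G1Arithmetic {
    static final long MB = 1024L * 1024L;

    /** Heap divided by the target region count, rounded down to a power of two, kept within 1MB..32MB. */
    static long regionSizeBytes(long heapBytes, int targetRegions) {
        long raw = Math.max(1L, heapBytes / targetRegions);
        long powerOfTwo = Long.highestOneBit(raw);
        return Math.min(32 * MB, Math.max(MB, powerOfTwo));
    }

    /** Occupancy of the entire heap at which old generation marking is triggered. */
    static long ihopThresholdBytes(long heapBytes, int ihopPercent) {
        return heapBytes * ihopPercent / 100;
    }

    /** Largest fraction of total time that GC may take before the heap is grown. */
    static double maxGcTimeFraction(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }
}
```

Under these assumptions, an 8GB heap gives `regionSizeBytes(8192 * MB, 2048)` = 4MB regions, and the default GCTimeRatio of 9 gives a GC time budget of 10%.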
10 Ways to Reduce Lock Contention in Threaded Programs (Page last updated July 2007, Added 2016-10-30, Author Michael Suess, Publisher thinkingparallel). Tips:
- [Note, although nearly 10 years old, this blog entry is excellent advice that still applies well].
- Make sure locks are actually the problem before you try to tune to reduce lock contention; don't guess, measure the cost.
- Only the data needs lock protection, so restrict locking to data access and updates. If appropriate, reorder code and use temporary variables to keep the locked section as short as possible.
- Lock striping is a technique of using multiple locks to protect subsections of an array to reduce contention.
- Avoid locks where an atomic operation is available to achieve the same functionality.
- Use lock-free data structures (classes, libraries) if they are available.
- Where a lot of threads read a memory location that is rarely changed, use a reader-writer lock - the lock is only exclusive when a writer locks. Note that ReadWriteLocks have significant overhead compared to synchronized, so usage needs to be measured.
- Read-only data is thread-safe so needs no locks (eg use final fields and classes).
- Object pools tend not to work so well in multi-threaded usage.
- Local variables and thread-local storage are thread-safe, so need no locks.
- If there is a shared field that will be repeatedly updated, look for ways to eliminate or provide alternative functionality that avoids the repeated updates.
- Keep locks private - having them publicly accessible increases the number of potential creators of contention (eg synchronize on an internal private object rather than the "this" instance itself).
- A shared-nothing design needs no locks.
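Two of the tips above, lock striping and keeping locks private, can be sketched together. In this minimal example each lock guards only the array slots whose index maps to it, and the lock objects are never exposed to callers. The class name `StripedCounters` and the modulo mapping from index to stripe are assumptions for the illustration.

```java
/** Hypothetical sketch of lock striping over an array of counters, using private lock objects. */
final class StripedCounters {
    private final long[] counts;
    private final Object[] locks; // private: outside code cannot synchronize on these

    StripedCounters(int size, int stripes) {
        if (size <= 0 || stripes <= 0) throw new IllegalArgumentException("sizes must be positive");
        counts = new long[size];
        locks = new Object[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new Object();
        }
    }

    // Each index always maps to the same stripe, so a slot is always guarded by the same lock.
    private Object lockFor(int index) {
        return locks[index % locks.length];
    }

    void increment(int index) {
        synchronized (lockFor(index)) {
            counts[index]++;
        }
    }

    long get(int index) {
        synchronized (lockFor(index)) {
            return counts[index];
        }
    }
}
```

Threads updating slots in different stripes do not contend with each other; whether the extra locks pay off should still be measured, as the first tips in this list advise.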
Unorthodox Paths to High Performance (Page last updated August 2016, Added 2016-10-30, Author Alex Rasmussen, Publisher QCon). Tips:
- Before starting design, estimate the CPU/Memory/Disk IO/Network IO of the problem to be solved, and estimate where the bottleneck will lie. The data throughput and algorithmic complexity should guide your estimates of theoretical capacity. This should either specify hardware capacity requirements or limitations of the hardware you will have available.
- Use the theoretical bottleneck of the intended system to guide your design - the bottleneck cannot be exceeded, so the rest of the system needs to be built around that limitation.
- Fault tolerance can be built at various levels. Factor in the bottlenecks in the system when deciding where to build fault tolerance, so that you minimize the cost of those bottlenecks.
- There may be a better algorithm for the particular data flow of your problem rather than the standard best solution.
- Consider partitioning your logging to devices that are separate from the business data flow.
- Organize your log information, system configuration, analysis and results into time-based groups so that you can access and compare them easily.
- Buffer pools are fast and simple, but inflexible because they are fixed size - if you need variable sized pools it becomes a severe management problem. Proprietary memory management may be a valid solution.
- If you need to get more complex, stand back and ask if you're solving the right problem - maybe there is an alternative view that keeps the solution simple.
- Give resource priority to the task that will minimize the bottleneck.
- Accept that the bottlenecks will shape your architecture, and consider them early on.
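The up-front estimate recommended in the first tips of this list can be as simple as comparing the sustainable throughput of each resource and taking the lowest. The sketch below is one hypothetical way to write that down; the class name, the resource names and any figures passed in are illustrative placeholders, not measurements from the talk.

```java
import java.util.Map;

/** Hypothetical back-of-envelope estimate: the slowest resource bounds the whole system. */
final class BottleneckEstimate {

    /** Returns the name of the resource with the lowest estimated throughput (MB/s). */
    static String bottleneck(Map<String, Double> throughputMBps) {
        if (throughputMBps.isEmpty()) throw new IllegalArgumentException("no resources given");
        String slowest = null;
        double min = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : throughputMBps.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    /** Lower bound on run time: the data volume cannot move faster than the bottleneck allows. */
    static double minSeconds(double dataMB, double bottleneckMBps) {
        return dataMB / bottleneckMBps;
    }
}
```

For instance, with assumed figures of 200 MB/s for disk, 1000 MB/s for network and 500 MB/s for CPU processing, disk is the bottleneck and 1,000,000 MB of data needs at least 5000 seconds, whatever the rest of the design looks like.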
Last Updated: 2018-03-27
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.