Java Performance Tuning
Java(TM) - see bottom of page
Tips August 2016
Back to newsletter 189 contents
Java Collections: The Force Awakens (Page last updated June 2016, Added 2016-08-29, Author Richard Warburton, Raoul-Gabriel Urma, Publisher Devoxx). Tips:
- In Java 8 there is a new Collection.removeIf(Predicate) method.
- Using concurrent data structures is insufficient to avoid race conditions if the read-then-write updates are not fully atomic.
- In Java 8 there are new Map.replace() methods; the Map.replace(K key, V oldValue, V newValue) lets you use compare-and-swap on a ConcurrentMap.
- In Java 8 there are new Map.computeIfPresent() and Map.computeIfAbsent() methods which let you use compare-and-swap on a ConcurrentMap to update entries (and don't forget putIfAbsent).
- From Java 8, ArrayLists and HashMaps that never have an element added will not create the underlying table at all, so are more memory efficient (around 1% for a typical application).
- One way to categorise collections is: Unsynchronized Mutable (eg HashMap); Concurrent Mutable (eg ConcurrentHashMap); Unmodifiable View (eg Collections.unmodifiableMap); Immutable Persistent ; Immutable Non-Persistent (eg Collections.emptySet)
- If you have an unmodifiable/immutable object (including collections) you can optimize its memory structure and it can be shared concurrently with no issues.
- A Bitmapped Vector Trie is an efficient structure for mutating an immutable collection by copying as little as possible.
- If using primitive data types, avoid wrapper objects (boxing). Use data structures, streams and methods specialised to handle the primitive data types directly without Object wrappers.
- HashMap implementations include Probing (open address mapping) and Chaining (closed address mapping) - with Chaining using linked lists or trees. In Java 8, HashMap added a tree implementation that is used if the linked list structure gets too long.
- IdentityHashMap is a Probing (open address mapping) based map, unlike most of the other maps in the JDK, so for the right use-case, it can be faster.
- NavigableMap, NavigableSet and Deque have all superseded their older super-interfaces (SortedMap, SortedSet, Queue).
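The Java 8 collection methods mentioned in the tips above can be sketched as follows. This is a minimal illustration rather than code from the talk: the class name, method names and the "hits" key are assumptions made for the example. It shows Collection.removeIf, a compare-and-swap retry loop built on putIfAbsent and Map.replace(key, oldValue, newValue), and computeIfAbsent used to create a per-key counter atomically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; none of these names come from the talk.
final class Java8CollectionTips {

    // Collection.removeIf: returns a copy of the input without the even numbers.
    static List<Integer> withoutEvens(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        copy.removeIf(n -> n % 2 == 0);
        return copy;
    }

    // Compare-and-swap retry loop using putIfAbsent and replace(key, old, new).
    static void incrementWithReplace(ConcurrentMap<String, Integer> counts, String key) {
        while (true) {
            Integer old = counts.putIfAbsent(key, 1);
            if (old == null) {
                return; // no previous value: our 1 was stored atomically
            }
            if (counts.replace(key, old, old + 1)) {
                return; // the value was still 'old', so the swap succeeded
            }
            // another thread updated the entry first: re-read and retry
        }
    }

    // computeIfAbsent creates the per-key counter atomically;
    // the AtomicInteger then handles the increment itself.
    static void incrementWithComputeIfAbsent(ConcurrentMap<String, AtomicInteger> counts, String key) {
        counts.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
    }

    // Runs the task 'repeats' times on each of 'threads' threads and waits for all of them.
    static void runOnThreads(int threads, int repeats, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < repeats; j++) {
                    task.run();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(withoutEvens(Arrays.asList(1, 2, 3, 4, 5)));

        ConcurrentMap<String, Integer> viaReplace = new ConcurrentHashMap<>();
        runOnThreads(4, 10_000, () -> incrementWithReplace(viaReplace, "hits"));

        ConcurrentMap<String, AtomicInteger> viaCompute = new ConcurrentHashMap<>();
        runOnThreads(4, 10_000, () -> incrementWithComputeIfAbsent(viaCompute, "hits"));

        System.out.println(viaReplace.get("hits") + " " + viaCompute.get("hits"));
    }
}
```

A plain get-then-put on the same ConcurrentHashMap would lose updates under contention, which is the read-then-write race the tips warn about; both variants here avoid it because each update is a single atomic operation or a retried compare-and-swap.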
Immutable Collections (Page last updated August 2016, Added 2016-08-29, Author Michael Steindorfer, Publisher JVM Language Summit). Tips:
- Immutable data is much easier to reason about - and provides concurrency safe data.
- The only disadvantage of immutable classes (including collections) is potential performance problems in some circumstances.
- Immutable collections are more efficiently implemented as structures where only part of the collection needs to be copied when a copy of the collection is made with a single change.
- A memory efficient Hash Array Mapped Trie can be formed by careful consideration of the memory use per node [presented in the talk, with a link to the associated OOPSLA paper].
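The idea of copying only part of an immutable collection can be illustrated with a much simpler structure than a Hash Array Mapped Trie. The sketch below is an assumption made for illustration, not the structure presented in the talk: an unbalanced persistent binary search tree where an insert copies only the nodes on the path from the root to the new key, and shares every other subtree with the previous version.

```java
// Hypothetical minimal persistent (immutable) binary search tree.
// A null reference represents the empty tree.
final class PersistentTree {
    final int key;
    final PersistentTree left;
    final PersistentTree right;

    private PersistentTree(int key, PersistentTree left, PersistentTree right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    // Returns a new tree containing 'key'. Only nodes on the search path are
    // copied; untouched subtrees are shared between the old and new versions.
    static PersistentTree insert(PersistentTree node, int key) {
        if (node == null) {
            return new PersistentTree(key, null, null);
        }
        if (key < node.key) {
            return new PersistentTree(node.key, insert(node.left, key), node.right);
        }
        if (key > node.key) {
            return new PersistentTree(node.key, node.left, insert(node.right, key));
        }
        return node; // key already present: the old version can be reused as is
    }

    static boolean contains(PersistentTree node, int key) {
        while (node != null) {
            if (key == node.key) {
                return true;
            }
            node = key < node.key ? node.left : node.right;
        }
        return false;
    }

    public static void main(String[] args) {
        PersistentTree v1 = insert(insert(insert(null, 5), 3), 8);
        PersistentTree v2 = insert(v1, 9);
        System.out.println("v1 contains 9: " + contains(v1, 9));
        System.out.println("v2 contains 9: " + contains(v2, 9));
        System.out.println("left subtree shared: " + (v1.left == v2.left));
    }
}
```

Because no node is ever modified after construction, both versions can be read from any thread without synchronization. A trie-based structure applies the same path-copying idea with wider nodes, so the copied path is much shorter.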
The Art of Performance Monitoring (Page last updated April 2016, Added 2016-08-29, Author Brian Smith, Publisher Usenix). Tips:
- High cardinality alarms (lots of alarms from one event) - cause you to misunderstand the cause because you're busy looking through so many alarms; Reactive alarms (built to detect a specific issue that was fixed) - become useless and distracting over time; Tool fatigue - too many tools, not integrated, not particularly well implemented, manually fixed. All seem to be the right thing at the time, but are the wrong long term solution.
- Key components of good alarms: Signal - no false positive; Actionability - can I do something about it right now; Relevancy - Is this the only alarm relevant to this event (if not, you need to delete those extra alarms). Eliminate all alarms that do not satisfy these.
- Notifications that do not require human attention degrade alarm response capability.
- Humans are very good at understanding density, so graphs of many metrics which show outliers using density (eg darker marks) are easily understood. Cubism.js is a tool which shows density.
- Always ask "Is there a better way I could be doing/measuring this". Don't stop changing until you can answer "No".
- The answer to "should I track this" is always yes EXCEPT when tracking will negatively impact performance.
- After fixing a non-alarmed problem, look for the best signal that would have predicted it, and create an alarm based on that.
Everything I Ever Learned About JVM Performance Tuning at Twitter (Page last updated July 2016, Added 2016-08-29, Author Attila Szegedi, Publisher JeeConf). Tips:
- Latency contributors: garbage collection (easily the biggest); locking; thread scheduling; IO; algorithmic inefficiencies.
- There are four generic shared resources, and hence tuning targets: Memory, CPU, IO, lock contention.
- Memory tuning targets: footprint; allocation rate; garbage collection.
- The fastest GC is the one that doesn't happen.
- Footprint tuning is about making in-memory data structures memory efficient. Using LRU caches and soft references allows you to hold subsets of recomputable data. Object overheads mean that if you are minimizing memory, you need to get creative, minimizing object numbers by flattening hierarchical structures and encoding object data into primitive arrays.
- Many performance problems can be fixed by throwing more memory at them.
- Compressed oops are used automatically for heaps below 30GB; there is no point in having heaps between 32GB and 48GB, as compressed oops are lost and you actually lose usable space in that range.
- Avoid using boxing (primitive data object wrappers), especially if memory is an issue.
- ThreadLocals stick around: be aware that if you are using them in thread pools they need resetting, or you have a type of memory leak. Often you are better off just creating objects as you need them rather than holding a ThreadLocal instance.
- Compactness (inverse of memory size) x Responsiveness (inverse of latency) x Throughput == some constant for a particular system. Tuning means you can increase one of these at the expense of one or two of the others. Optimization means you can change the system configuration or algorithms to increase the constant.
- The biggest threat to responsiveness (request latency) in the JVM is garbage collection pauses.
- Dead objects are free to collect in the young generation, so the young generation should be big enough to hold more than one set of all concurrently generated request-response cycle objects (ie they'll all be dead at the end of the cycle, so efficiently collected). Each survivor space should be big enough to hold all active request objects + tenuring ones.
- The tenuring threshold should be set so that tenuring objects tenure fast.
- The adaptive throughput collector can adjust to targets specified by MaxGCPauseMillis and GCTimeRatio.
- Always start by tuning the young gen. Use PrintGCDetails, PrintHeapAtGC, PrintTenuringDistribution. Keep an eye on survivor sizes. Try to make sure the survivor spaces are never 100% full.
- CMS is good if it can stay ahead of object allocation; otherwise you need to tune it. You also have to keep fragmentation low.
- CMS typically needs a heap about a third larger than other collectors, for extra space while it cleans up concurrently with the running application.
- The CMS stop-the-world pause compacts at the same time and this is a very long pause (minutes!). You need to avoid this.
- The CMSInitiatingOccupancyFraction for CMS can be lowered to make it more responsive, even down to 0 if you have spare CPU; it would then run continuously.
- If the NewSize is large and you have many live objects, you could get a long pause. You have to reduce young gen and tenuring threshold to reduce the pause time.
- If you have too many threads, this takes more time during GC because each thread is a GC root so needs to be scanned during the GC.
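The footprint tips above (avoid boxing, flatten structures, encode object data into primitive arrays) can be sketched as below. This is an illustrative example under stated assumptions, not code from the talk: the class name and its methods are hypothetical. Instead of a list of point objects, each with its own object header and reference, both coordinates of every point live in one int[].

```java
import java.util.Arrays;

// Hypothetical flattened replacement for a List of (x, y) point objects.
final class FlatPointList {
    // Coordinates are stored back to back: x of point i at 2*i, y at 2*i + 1.
    private int[] coords = new int[16];
    private int size;

    void add(int x, int y) {
        if (2 * size == coords.length) {
            coords = Arrays.copyOf(coords, coords.length * 2); // grow by doubling
        }
        coords[2 * size] = x;
        coords[2 * size + 1] = y;
        size++;
    }

    int size() {
        return size;
    }

    int x(int index) {
        return coords[2 * checked(index)];
    }

    int y(int index) {
        return coords[2 * checked(index) + 1];
    }

    // Iterates over primitives only: no boxing and no per-point object to follow.
    long sumOfX() {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += coords[2 * i];
        }
        return sum;
    }

    private int checked(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return index;
    }

    public static void main(String[] args) {
        FlatPointList points = new FlatPointList();
        for (int i = 0; i < 1_000; i++) {
            points.add(i, -i);
        }
        System.out.println(points.size() + " points, sum of x = " + points.sumOfX());
    }
}
```

The whole collection is a single array object for the collector to trace, however many points it holds. The trade-off is a less convenient API than a list of objects, so this kind of encoding is usually reserved for the largest data structures in the heap.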
Smashing Atomics: Concurrency in Java (Page last updated June 2016, Added 2016-08-29, Author John Cairns, Publisher ChicagoJUG). Tips:
- In order of increasing complexity for concurrency options: stateless; immutable; synchronized; locks; read-copy-update; java.util.concurrent.atomic; queueing.
- Stateless applications are easily made concurrent.
- Immutable objects are automatically fully concurrent.
- synchronized blocks scale well with IO; but should not be used in tight loops.
- Using synchronized on methods means you are using the instance itself as the lock object. Any code that has access to that same instance can also synchronize on the instance, which is a potential denial of critical section to all other users of that instance. To avoid this, you can use an internal lock object in the instance class to explicitly synchronize the method body, instead of using synchronized on the method.
- ReentrantLock is more sophisticated than synchronized because it is interruptible. It should be used in a try-finally block. Because it is a library lock, the JVM cannot optimize it in the same way as it can optimize synchronized.
- ReentrantReadWriteLock lets you separate out read locking and write locking, typically for one writer and many readers.
- Semaphore, CountDownLatch, Phaser and DelayQueue are all based on AbstractQueuedSynchronizer, so are similar under the covers to ReentrantLock.
- The read-copy-update technique uses a separate immutable data object to hold compound data, with the mutable field holding the immutable data object being updated atomically. It generates some extra garbage.
- volatile on its own is normally only adequate to solve the simplest concurrency problems, eg a boolean that needs to be read across threads. It doesn't even provide atomic increments.
- The downsides of locks: they prevent parallelism; can cause deadlocks, livelocks, thread starvation; they give no guarantee that the thread can progress; they can be used unstructured (eg forget to unlock).
- AtomicReference.compareAndSet is used to speculatively update a value based on the old value not having changed. Eg
AtomicReference ref = ...; old = ref.get(); updated = doStuff(); if (ref.compareAndSet(old, updated)) //nothing changed ref while we did stuff
- Sequence locking uses an atomically updated long as a sequence lock: you can write compound values to other shared fields only if the sequence is even (ie after a speculative increment succeeds, so the sequence value would be odd and no other thread could now write). This works for not too many concurrent threads, but not for thousands of threads. Sequence locking is available as StampedLock in Java 8+.
- Padding fields to keep data in separate cache lines reduces cache-contention across cores.
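The read-copy-update and compareAndSet tips above can be combined in one short sketch. This is a minimal example under assumptions, not code from the talk: the RcuBounds class, its Snapshot type and the widenTo method are hypothetical names. A pair of related values (lower, upper) is held in an immutable snapshot object, and the single mutable field is an AtomicReference that is swapped atomically.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical read-copy-update holder for a compound (lower, upper) value.
final class RcuBounds {

    // Immutable snapshot: both fields are always seen together and consistent.
    static final class Snapshot {
        final int lower;
        final int upper;

        Snapshot(int lower, int upper) {
            if (lower > upper) {
                throw new IllegalArgumentException("lower > upper");
            }
            this.lower = lower;
            this.upper = upper;
        }
    }

    private final AtomicReference<Snapshot> current;

    RcuBounds(int lower, int upper) {
        current = new AtomicReference<>(new Snapshot(lower, upper));
    }

    // Readers take no lock; they simply read the current snapshot.
    Snapshot get() {
        return current.get();
    }

    // Widens the bounds to include 'value': read, copy, then compare-and-set.
    void widenTo(int value) {
        while (true) {
            Snapshot old = current.get();
            if (value >= old.lower && value <= old.upper) {
                return; // already covered, nothing to publish
            }
            Snapshot updated = new Snapshot(Math.min(old.lower, value), Math.max(old.upper, value));
            if (current.compareAndSet(old, updated)) {
                return; // nobody replaced the snapshot while we built the copy
            }
            // lost the race: the discarded copy becomes garbage, then retry
        }
    }

    public static void main(String[] args) {
        RcuBounds bounds = new RcuBounds(0, 0);
        bounds.widenTo(5);
        bounds.widenTo(-3);
        Snapshot s = bounds.get();
        System.out.println(s.lower + " .. " + s.upper);
    }
}
```

A reader can never observe a lower bound from one update paired with an upper bound from another, which two separate volatile fields could not guarantee. The price is the extra garbage the tips mention: one new Snapshot per successful update, plus one per failed attempt.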
Avoiding Big Data Antipatterns (Page last updated December 2015, Added 2016-08-29, Author Alex Holmes, Publisher Oracle). Tips:
- A single commodity host can have 6TB RAM (in 2016) - it's not big data if it fits in memory of one host. Just process it.
- Current recommendations: Low-latency lookups -> Cassandra, memcached; Near real-time processing -> Storm; Interactive analytics -> Vertica, Teradata; Full scans, record data -> HDFS, MapReduce, Hive, Pig; Data movement and integration -> Kafka.
- IO is slow - partition the data according to how it will be accessed efficiently in parallel.
- Design deletion patterns to take account of tombstoning.
- HyperLogLog gives a good, efficient approximate count if exact counts of large amounts don't matter.
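To show why HyperLogLog is cheap, here is a simplified sketch of the estimator. It is an illustration under assumptions, not production code and not taken from the talk: the class name, the choice of the SplitMix64 mixing function as the hash, and the precision of 12 index bits in the usage example are all choices made for this example. It keeps one small register per bucket and never stores the items themselves.

```java
// Simplified, hypothetical HyperLogLog estimator for long values.
final class HyperLogLogSketch {
    private final int p;            // number of hash bits used to pick a register
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // each holds the largest rank seen for its bucket

    HyperLogLogSketch(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // SplitMix64 mixing function, used here as a cheap 64-bit hash (an assumption).
    private static long mix64(long z) {
        z += 0x9e3779b97f4a7c15L;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long hash = mix64(value);
        int index = (int) (hash >>> (64 - p)); // top p bits select the register
        long rest = hash << p;                 // remaining bits, left-aligned
        // rank = position of the first 1 bit in the remaining bits, capped at their count + 1
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros); // small-range (linear counting) correction
        }
        return raw;
    }

    public static void main(String[] args) {
        HyperLogLogSketch sketch = new HyperLogLogSketch(12); // 4096 one-byte registers
        for (long i = 0; i < 1_000_000; i++) {
            sketch.add(i);
            sketch.add(i); // duplicates do not change the estimate
        }
        System.out.println("estimated distinct values: " + Math.round(sketch.estimate()));
    }
}
```

With 4096 registers the sketch uses about 4KB regardless of how many items are added, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. An exact count of the same data would need a set holding every distinct value.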
Last Updated: 2018-07-29
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss