Java Performance Tuning
Java(TM) - see bottom of page
Tips August 2015
Back to newsletter 177 contents
Gentle Introduction to Lockless Concurrency (Page last updated July 2015, Added 2015-08-31, Author Alex Petrov, Publisher coffeenco). Tips:
- An atomic set of operations behaves as if it were a single step. No other thread can observe the set of operations in a half-complete state.
- Locking typically implies blocking: an exclusive ownership over a locked resource, allowing only one thread to progress at a time.
- Locks introduce contention when there are many parties interested in the lockable resource.
- Read/write locks allow multiple readers to access a resource simultaneously while providing an exclusive resource ownership for a writer. This can reduce blocking contention where there are many readers and few writers.
- Releasing a lock too early can lead to data corruption; holding it for too long can cause extra contention.
- Lock-free algorithms allow threads to try and progress forward as long as it is possible without blocking: readers can always proceed without any locking at all; writers only have to make sure that the changes that they apply to the object do not conflict with other writes.
- Optimistic Concurrency atomically verifies that the resource is in the expected state before committing the new state (otherwise no commit occurs). Optimistic Concurrency is efficient when modifications won't be performed simultaneously most of the time, so retries will be infrequent, eg with lots of reads and few writes.
- Compare-and-swap operations can have false-positive matches (ABA problems); you can work around this with a counter alongside the write operation, eg as you have with AtomicStampedReference.
- You should minimize cross core writes and try to localize writes and avoid write contention.
- Keeping a log of incoming events that you can replay can make your software even more robust by allowing you to independently validate states.
- Immutable data structures are thread safe (eg classes with all final fields).
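The optimistic-concurrency and ABA tips above can be sketched roughly as follows. This is a minimal illustration, not code from the article: the class name OptimisticUpdates and its methods are hypothetical, and only the java.util.concurrent.atomic calls are standard JDK API.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicStampedReference;

// Hypothetical demo class; names are illustrative, not taken from the article.
final class OptimisticUpdates {

    // Optimistic update: read the current state, compute the new state,
    // and commit only if no other thread changed the value in between.
    static int addCapped(AtomicInteger value, int delta, int cap) {
        while (true) {
            int current = value.get();
            int next = Math.min(current + delta, cap);
            if (value.compareAndSet(current, next)) {
                return next; // commit succeeded
            }
            // Another writer got in first: retry with the fresh value.
        }
    }

    // ABA demonstration: the reference goes A -> B -> A, so a plain
    // compare-and-swap on the reference alone would match. The stamp
    // acts as a write counter and exposes the intermediate writes.
    static boolean staleWriteAccepted() {
        String a = "A";
        String b = "B";
        AtomicStampedReference<String> ref = new AtomicStampedReference<>(a, 0);

        int[] stamp = new int[1];
        String seen = ref.get(stamp);      // reader observes (A, stamp 0)

        ref.compareAndSet(a, b, 0, 1);     // another writer: A -> B
        ref.compareAndSet(b, a, 1, 2);     // and back again: B -> A

        // The reference matches, but the stale stamp makes this CAS fail.
        return ref.compareAndSet(seen, "C", stamp[0], stamp[0] + 1);
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(8);
        System.out.println(addCapped(counter, 5, 10));
        System.out.println(staleWriteAccepted());
    }
}
```

The retry loop only pays off when conflicts are rare, which matches the tip that optimistic concurrency suits workloads with many reads and few writes.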
Java performance tutorial - How fast are the Java 8 streams? (Page last updated July 2015, Added 2015-08-31, Author Angelika Langer, Publisher JaxEnter). Tips:
- Collections support operations such as add(), remove(), and contains() that work on a single element at a time; Streams have bulk operations such as forEach(), filter(), map(), and reduce() that access all elements in a sequence (potentially in parallel). Streams are designed to hide the complexity of running multiple threads.
- Benchmarking is hard and error-prone. You need to perform a proper warm-up and avoid distorting effects ranging from optimizations applied by the virtual machine's JIT compiler (dead code elimination being a notorious one) to hardware optimizations (such as increasing one core's CPU frequency if the other cores are idle).
- The overhead of boxing and unboxing primitive data values can be an order of magnitude greater than handling the primitive data directly.
- The overhead of streams for sequential access (ie single-threaded) makes them slower than simply directly iterating a collection. The benefit of streams comes from the ease with which you can parallelise processing them.
- If you call distinct() (eliminates duplicates) on a parallel stream its state must be accessed concurrently by multiple worker threads, so requiring coordination and synchronisation to the extent that parallel execution may be significantly slower than sequential execution. For this and similar issues, parallel stream operations are not always faster than sequential stream operations.
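As a rough illustration of the variants these tips compare, the sketch below computes the same sum with a plain loop, a primitive IntStream, a boxed Stream<Integer>, and a parallel stream. The class StreamSums and its method names are hypothetical. It deliberately contains no timing code: per the benchmarking tip above, meaningful numbers need a proper harness with warm-up.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical demo class; assumes inputs small enough that an int sum cannot overflow.
final class StreamSums {

    // Baseline: direct iteration over primitives.
    static int sumLoop(int[] values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Sequential primitive stream: no boxing, but some stream overhead.
    static int sumPrimitiveStream(int[] values) {
        return IntStream.of(values).sum();
    }

    // Boxed stream: every element is an Integer that must be unboxed and re-boxed.
    static int sumBoxedStream(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    // Parallel primitive stream: the work is split across the common fork/join pool.
    static int sumParallel(int[] values) {
        return IntStream.of(values).parallel().sum();
    }

    // distinct() is stateful, so on a parallel stream the workers must coordinate.
    static long countDistinctParallel(int[] values) {
        return IntStream.of(values).parallel().distinct().count();
    }

    public static void main(String[] args) {
        int[] data = IntStream.rangeClosed(1, 1000).toArray();
        List<Integer> boxed = IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
        System.out.println(sumLoop(data));
        System.out.println(sumPrimitiveStream(data));
        System.out.println(sumBoxedStream(boxed));
        System.out.println(sumParallel(data));
        System.out.println(countDistinctParallel(new int[] {1, 1, 2, 3, 3}));
    }
}
```

All four sums return the same result; only the cost differs, and which variant wins depends on data size, boxing, and whether the pipeline contains stateful operations such as distinct().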
Effective Ways to Implement and Use the Singleton Design Pattern (Page last updated July 2015, Added 2015-08-31, Author Payene Denis Kombate, Publisher Oracle). Tips:
- Lazy construction of a singleton (constructing the instance when the instance is required rather than when the class is loaded) has to be done in a thread-safe way. The simplest way to do that is to synchronize the accessor that lazily constructs the instance.
- If a synchronized method is called frequently from many threads, the threads will be effectively single-threaded making the program work slower than expected.
- Simple double-checked locking on an ordinary (non-volatile) field only works if the compiler and runtime happen not to reorder the relevant writes. In general the compiler is allowed to reorder field updates, which means that in practice another thread can see a half-constructed object (fields with default values instead of constructor-defined values) while one thread is still constructing it. Using a volatile field enforces ordering and allows double-checked locking to work.
- Accessing a volatile variable is less efficient than accessing a normal variable. For double-checked locking with a volatile field, it's possible to implement a sequence which minimizes volatile access by using a temporary variable (eg see the wikipedia article on double-checked locking).
- Because inner classes are not loaded until they are referenced, you can implement lazy loading of an object by using an inner-class (ideally private static) - the classloader ensures that only one instance is created even in a multi-threaded context.
- [Double-checked locking without a volatile can be implemented using a final wrapper class, eg see the wikipedia article on double-checked locking]
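Two of the lazy-initialization approaches above might look like the following. This is a sketch under stated assumptions: the classes Config and Registry are hypothetical placeholders for whatever object needs to be a singleton.

```java
// Hypothetical demo class grouping two lazy singleton variants.
final class LazySingletons {

    // Variant 1: holder idiom. The nested Holder class is not initialized
    // until getInstance() first touches it, and class initialization is
    // guaranteed to run once even when many threads race.
    static final class Config {
        private Config() { }

        private static final class Holder {
            static final Config INSTANCE = new Config();
        }

        static Config getInstance() {
            return Holder.INSTANCE;
        }
    }

    // Variant 2: double-checked locking with a volatile field. The local
    // variable keeps the fast path down to a single volatile read.
    static final class Registry {
        private static volatile Registry instance;

        private Registry() { }

        static Registry getInstance() {
            Registry local = instance;          // one volatile read
            if (local == null) {
                synchronized (Registry.class) {
                    local = instance;           // re-check under the lock
                    if (local == null) {
                        local = new Registry();
                        instance = local;       // publish the fully built object
                    }
                }
            }
            return local;
        }
    }

    public static void main(String[] args) {
        System.out.println(Config.getInstance() == Config.getInstance());
        System.out.println(Registry.getInstance() == Registry.getInstance());
    }
}
```

The holder idiom is usually the simpler choice for static singletons; double-checked locking remains relevant mainly when the lazily built value is an instance field rather than a static one.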
Asynchronous Processing (Page last updated June 2015, Added 2015-08-31, Author Tomasz Nurkiewicz, Publisher Oracle). Tips:
- CompletableFuture can be composed, chained and used asynchronously (the original Future was basically blocking).
- CompletableFuture can be used declaratively, in a procedural style rather than with callbacks, and fully non-blocking.
- You can take multiple CompletableFutures and declaratively proceed with whichever produces a result first.
- Parallel streams have no reliable way to inject your own thread pool (there is a hack because it uses the current thread pool, and you can change that, but that's not reliable), so there is one global pool across the JVM being used, which is not scalable - anything not CPU bound anywhere will cause severe restrictions on parallel stream performance.
- The actor model is quite good if you are writing heavily parallel applications but has issues (reliable delivery of messages, complexity of code).
- A shared distributed memory model can be a simple replacement for single-JVM shared memory (eg replacing a Map with a distributed Map) but they have very different performance characteristics - you don't get distributed sharing for free, so be careful how you use these.
- Return a CompletableFuture if your method will take time to execute, especially for any kind of remote call.
- To debug asynchronous calls more efficiently, either pass the stacks as you move from one context to the next, or more efficiently generate a UUID at the start and pass the UUID across all asynchronous contexts (and log that UUID) so that you can track a request across asynchronous processing using the UUID.
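The CompletableFuture tips above can be sketched as follows. The method fetchPrice stands in for a remote call and is purely hypothetical, as are the source names and the returned value; the CompletableFuture methods used (supplyAsync, applyToEither, thenApply, join) are standard JDK 8 API.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

// Hypothetical demo class; fetchPrice simulates a slow remote lookup.
final class AsyncLookup {

    // Returns a future instead of blocking the caller. The requestId is
    // passed along and logged so one request can be traced across threads.
    static CompletableFuture<String> fetchPrice(String source, String requestId) {
        return CompletableFuture.supplyAsync(() -> {
            System.out.println("[" + requestId + "] querying " + source);
            return source + ":42"; // made-up result for illustration
        });
    }

    // Declaratively continue with whichever source answers first,
    // then chain a further transformation without blocking.
    static CompletableFuture<String> firstPrice(String requestId) {
        CompletableFuture<String> primary = fetchPrice("primary", requestId);
        CompletableFuture<String> backup = fetchPrice("backup", requestId);
        return primary
                .applyToEither(backup, price -> price)
                .thenApply(price -> "price=" + price);
    }

    public static void main(String[] args) {
        String requestId = UUID.randomUUID().toString();
        // join() blocks only here, at the edge of the program.
        System.out.println(firstPrice(requestId).join());
    }
}
```

Note that supplyAsync without an executor argument runs on the shared common pool, the same global pool the parallel-streams tip warns about; for calls that are not CPU bound, the overload that takes an explicit Executor is probably the safer choice.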
JVM Buzzwords Java developers should understand (Page last updated July 2015, Added 2015-08-31, Author Pierre-Hugues Charbonneau, Publisher Java EE Support Patterns). Tips:
- JVM Buzzwords: "Allocation Rate" is the rate of object creation (normally in the young generation); "Promotion Rate" is the rate at which objects are being promoted from the young generation to the old generation; "Live Data" are objects in the heap that are neither short-lived nor dead, these should mainly be in the old generation and tend to be long-lived; "Stop-the-world Collection" is a garbage collection that causes a temporary suspension of your application threads until completed.
- Use one of the tools listed at http://www.fasterj.com/tools/gcloganalysers.shtml to analyse the GC logs and assess your JVM pause time and memory allocation rates; delve into the raw logs where necessary for detailed analysis of certain log sections.
- Leave the GCAdaptiveSizePolicy active as part of the JVM ergonomics. Turn it off and tune by hand only if the adaptive performance doesn't achieve your targets.
- Live application data usually corresponds approximately to the old generation occupancy after a full GC. The old generation heap should be big enough to hold your live data comfortably and to limit the frequency of old generation (major) garbage collections.
- A simple starting point is to select your heap size so that the old generation occupancy is about 50% after full GCs; this allows a sufficient buffer for typical higher load scenarios (fail-over, spikes, busy business periods). The corresponding young generation starting point is 1/3 of the overall heap.
- A continual increase in live data over time indicates a memory leak.
- PermGen and Metaspace are collected during Full GCs; keep track of the Class meta data footprint and GC frequency.
- All GCs are stop-the-world except for the CMS old generation (as long as it remains concurrent and doesn't failover to a serial full GC).
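The heap-sizing starting point above reduces to simple arithmetic, sketched below. The class HeapSizing and the 2048 MB live-data figure are assumptions for illustration; in practice the live-data number would come from the old generation occupancy after a full GC in your own logs. The sketch ignores PermGen/Metaspace, and the result is only a first guess to refine by measurement.

```java
// Hypothetical helper turning the rules of thumb into numbers.
final class HeapSizing {

    // Old generation about 50% occupied after a full GC => old gen = 2 x live data.
    static long oldGenMb(long liveDataMb) {
        return liveDataMb * 2;
    }

    // Young generation is 1/3 of the heap, so the old generation is the
    // remaining 2/3 and the total heap is 1.5 x the old generation.
    static long heapMb(long liveDataMb) {
        return oldGenMb(liveDataMb) * 3 / 2;
    }

    static long youngGenMb(long liveDataMb) {
        return heapMb(liveDataMb) / 3;
    }

    public static void main(String[] args) {
        long liveDataMb = 2048; // assumed value read from GC logs
        // -Xms/-Xmx set the heap size and -Xmn sets the young generation size.
        System.out.println("-Xms" + heapMb(liveDataMb) + "m"
                + " -Xmx" + heapMb(liveDataMb) + "m"
                + " -Xmn" + youngGenMb(liveDataMb) + "m");
    }
}
```

For 2048 MB of live data this gives a 4096 MB old generation, a 6144 MB heap, and a 2048 MB young generation. Setting -Xmn by hand fixes the young generation size, so it belongs to the "tune by hand" case rather than to the default adaptive sizing.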
The Dos and Don'ts of Multithreading (Page last updated July 2015, Added 2015-08-31, Author Hubert Matthews, Publisher InfoQ). Tips:
- Avoid multithreading if you possibly can - it's harder to write, read, test; the same code unexpectedly succeeds and fails under slightly different external conditions, typically failing under load (heisenbugs); and it may not even be faster.
- Alternatives to explicit multithreading are: event-driven code; async IO; non-blocking IO; coroutines/fibres (co-operative threads); multiple processes; use threading libraries that handle all the concurrency for you.
- The only sure way to guarantee good performance of multithreaded code is to avoid sharing and use futures.
- The first step for multithreading the problem is working out how the problem can be split up, and also how the results can be recombined.
- Too fine-grained multithreading makes the threading overhead and context-switching dominate the cost of the task; too coarse-grained gives you unbalanced threads or pipeline stalls and resources not fully utilised.
- Shared mutable data is a massive problem for multithreading. Shared writes do not scale at all because of hardware cache coherency. Non-shared or immutable data works nicely.
- Use the single-writer principle to scale writes - shared writes do not scale at all because of hardware cache coherency.
- Don't under-synchronize: if you need to lock for writing you typically need to synchronize for reading.
- Don't rely on implicit ordering; this will fail at some point.
- Use locks on shared mutable data structures or use single atomics.
- Too much locking will make the code effectively serial (but slower than single-threaded because you have additional overhead) - don't keep adding locks, have a clear plan.
- Amdahl's Law: serial code limits scaling regardless of the number of threads available (the speedup possible is inversely related to the amount of serial code).
- Locking order is important to avoid deadlocks (release order is not important).
- Queue based systems can be performance limited by context-switches and shared writes. Be careful with queues if performance is important.
- Hold locks for as short a time as possible, but they must be long enough to protect the data correctly.
- Combined atomic APIs (putIfAbsent, popIfNotEmpty) allow you to efficiently handle concurrent updates.
- A spin loop is useful when it costs less than a context switch (1000s of cycles). Spin loops with sleeps are more efficient but have higher latency.
- Target zero-copying solutions.
- Don't rely on thread priorities.
- Deleting data in a concurrent environment is hard - other threads can still be referencing the data. Use garbage collection.
- Atomics work fine for individual field updates, but for atomic updates to multiple fields you need to maintain transactional correctness, eg by using a struct that encodes all fields together.
- Propagate stacks across threads to make error handling more understandable.
- False sharing is effectively a shared write that you did not intend: two independent variables happen to share the same cache line. You may need to pad to avoid false sharing.
- For short lock times and many concurrent writes, compare-and-swap or spin locks can work (for long lock times you are better off locking).
- Avoid locks on your "fast path", or if you have to have some, make an effort to avoid contention. Push work to the slow paths (queue, background threads, r/w locks).
- Per-core locks are always uncontended on reads (writes cause everything to contend).
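Two of the tips above lend themselves to a short sketch: Amdahl's Law as a formula, and combined atomic APIs on a concurrent map. The class ThreadingSketches and its method names are hypothetical; putIfAbsent and merge are standard ConcurrentHashMap operations.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical demo class; names are illustrative only.
final class ThreadingSketches {

    // Amdahl's Law: speedup = 1 / (serial + (1 - serial) / threads).
    // As threads grows, the speedup approaches 1 / serial and no further.
    static double amdahlSpeedup(double serialFraction, int threads) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
    }

    // Combined atomic API: "check, then insert" happens as one step,
    // so two threads cannot both believe they claimed the resource.
    static String register(ConcurrentMap<String, String> owners, String resource, String owner) {
        String existing = owners.putIfAbsent(resource, owner);
        return existing != null ? existing : owner;
    }

    // Several threads increment one counter through the atomic merge();
    // no explicit lock is written, and no update is lost.
    static int countConcurrently(int threads, int incrementsPerThread) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counts.merge("hits", 1, Integer::sum);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counts.getOrDefault("hits", 0);
    }

    public static void main(String[] args) {
        // With 10% serial code, 8 threads give a speedup of roughly 4.7, not 8.
        System.out.println(amdahlSpeedup(0.1, 8));
        System.out.println(countConcurrently(4, 1000));
    }
}
```

Every thread in countConcurrently writes to the same key, so this is exactly the kind of shared write the tips warn does not scale; it shows correctness of the atomic API, not a performance pattern to copy onto a fast path.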
Last Updated: 2018-04-29
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss