Java Performance Tuning
Java(TM) - see bottom of page
Tips January 2016
Building and Tuning High Performance Java Platforms (Page last updated November 2015, Added 2016-01-29, Author Emad Benjamin, Publisher InfoQ). Tips:
- 44% of new applications miss their SLAs, and 90% use more than twice the capacity they need.
- Set the initial heap size equal to the maximum heap size (-Xms = -Xmx).
- The JVM process memory has multiple spaces: heap, perm gen, stack, guest OS memory (for VMs), and other (sockets, JIT info, direct buffers, etc).
- The JVM off-heap memory requirement is significant and has to be factored in.
- The stack space used by each thread can be significant for many threads.
- JVM Memory = Max heap + perm heap (if present, removed from Java 8 in HotSpot) + NumberOfConcurrentThreads * -Xss + other memory (nio direct memory, JNI memory, JIT code cache, classloaders, socket buffers, additional GC info).
- For virtualised JVM processes, the OS guest memory can be quite large e.g. 0.5GB for a 4GB heap - and don't forget the ~0.5GB needed for the JVM off heap memory (say assuming 256MB for Perm).
- VM memory is JVM process memory * the number of JVMs + Guest OS memory (~0.5GB).
- Frameworks tend to come with reflection code which has a higher stack requirement.
- In general you should size a JVM system based on heap allocation, not CPU allocation.
- Load balance against each tier rather than only at the system entry point.
- Monitor your DB connection pool size compared to your thread pool size - the most common ratio is one-to-one (it should be able to operate on 10%!).
- For big JVMs, deploy one JVM per NUMA node (and size the VM to match the NUMA node as well).
- One 4GB JVM is much more efficient than four 1GB JVMs (because of the reduced GC cycles).
- Try increasing the heap size before splitting into more than one JVM.
- After 3 copies of a JVM, you're wasting money. Three copies give you high availability; after that, just increase the heap size of those 3.
- NUMA: Memory on the system is split between the CPU sockets. Getting memory from another socket causes a 30% decrease in performance, so try to size the JVM to fit in the memory for one socket.
- Increasing the stack size is an optimization if escape analysis benefits your application (in-memory databases benefit from this).
- If you care about performance, the practical limit for the heap is the per NUMA RAM (including non-heap memory needs).
- High level GC tuning recipe: measure young gen pause and frequency. Adjust young gen size, number of threads, and survivor space ratio until target is hit. Then measure old gen pause and frequency, adjusting heap size until target is hit.
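The memory formulas in the tips above can be turned into a quick back-of-envelope calculation. The sketch below is illustrative only: the class and method names are hypothetical, and the thread count (100), stack size (256KB) and 231MB of "other" memory are assumed values, chosen so that the off-heap total comes to roughly the 0.5GB mentioned above.

```java
final class JvmMemoryEstimate {

    /** JVM memory = max heap + perm + threads * -Xss + other, all in MB. */
    static long jvmProcessMb(long maxHeapMb, long permMb,
                             int threads, long stackKb, long otherMb) {
        long stacksMb = (threads * stackKb) / 1024; // thread stacks, KB -> MB
        return maxHeapMb + permMb + stacksMb + otherMb;
    }

    /** VM memory = JVM process memory * number of JVMs + guest OS memory. */
    static long vmMb(long jvmProcessMb, int jvmCount, long guestOsMb) {
        return jvmProcessMb * jvmCount + guestOsMb;
    }

    public static void main(String[] args) {
        // Assumed inputs: 4GB heap, 256MB perm, 100 threads at -Xss256k,
        // 231MB other off-heap memory, 512MB for the guest OS.
        long jvm = jvmProcessMb(4096, 256, 100, 256, 231);
        long vm = vmMb(jvm, 1, 512);
        System.out.println("JVM process MB: " + jvm);
        System.out.println("VM MB: " + vm);
    }
}
```

With these assumed inputs a 4GB heap needs a VM of about 5GB, which is why sizing on heap alone tends to under-provision.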
Low Latency in Java 8 (Page last updated October 2015, Added 2016-01-29, Author Peter Lawrey, Publisher JavaZone). Tips:
- You rarely need "as fast as possible" - you usually have a "fast enough" target. You should decide what that is.
- Acceptable latency on any user interaction request is "not humanly detectable" latency.
- Consistency matters - people remember the worst service, not the average.
- For low latency you have to eliminate major collections and reduce minor ones to absolute minimum with short pauses.
- 64-bit pointers can cause the JVM to use 30% more memory than 32-bit pointers (as applications typically have many small objects). Heaps of up to 32GB can be addressed with 32-bit pointers in Java 7, and up to 64GB in Java 8 (because references only ever point at objects, which are aligned, so the lower bits of the address don't need to be stored).
- With an Eden size of 48GB, you can generate objects at 500KB/s (2GB/hour) and you would have one GC over the day (of a couple of seconds).
- With ultra-low garbage, you're not filling your CPU caches with garbage, so you get a double plus on performance. At 300MB/s of garbage (reasonably tuned web server) your L2 cache fills with garbage every millisecond; your L1 cache 10 times a millisecond.
- GC pauses are not the only pauses. Others include IO delays (network delays, waiting for databases, disk reads/writes), context switches, OS interrupts (5-10ms pauses for no apparent reason, 50ms on a virtualised system!), and lock contention. OS interrupts are not measured by the JVM, unless you are logging times and measuring skips.
- Lambdas which capture no values (this, other variables) can be automatically reused by the JVM. (Multiple instances of non-capturing lambdas are equal, non-capturing lambdas can be serialized).
- Serialized lambdas let you port functionality to another JVM.
- Escape analysis doesn't unpack arrays (in Java 8) even if they don't escape the method. Tune escape analysis with the flags -XX:MaxBCEAEstimateSize and -XX:FreqInlineSize (suggested values of 450 and 425 respectively for Chronicle tests).
- Throughput is the wrong measurement if latency matters. Instead look at latency distributions.
- A system without flow control is easy to performance tune (and test and debug) as you can decouple the producers and consumers.
- Once you're using all the CPUs, increasing the parallelism won't improve performance.
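To illustrate why the tips above recommend latency distributions over averages, here is a minimal sketch that compares the mean of a set of latency samples with its percentiles. The class name, the nearest-rank percentile method and the sample values are all assumptions made up for this example; a real system would record far more samples, typically with a dedicated histogram library.

```java
import java.util.Arrays;

final class LatencyPercentiles {

    /** Nearest-rank percentile of an ascending-sorted array; p in (0, 100]. */
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        int index = Math.min(Math.max(rank, 1), sorted.length) - 1;
        return sorted[index];
    }

    public static void main(String[] args) {
        // Hypothetical request latencies in microseconds, with one outlier.
        long[] samples = {120, 95, 110, 105, 4000, 100, 98, 102, 130, 99};
        long[] sorted = samples.clone();
        Arrays.sort(sorted);

        double mean = Arrays.stream(sorted).average().orElse(0);
        System.out.println("mean = " + mean);
        System.out.println("p50  = " + percentile(sorted, 50));
        System.out.println("p90  = " + percentile(sorted, 90));
        System.out.println("p99  = " + percentile(sorted, 99));
    }
}
```

The single slow request drags the mean far above the typical latency, while the percentiles show both the typical case and the worst case that users remember.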
Flavors of Concurrency in Java (Page last updated October 2015, Added 2016-01-29, Author Oleg Šelajev, Publisher JavaZone). Tips:
- The problems of concurrency come from: shared resources, multiple consumers and producers, out of order events, and locks.
- The purpose of locks is to limit parallel operations so that it's clear what is happening.
- Popular options for concurrency include: pre-emptive threads, thread pools, the fork-join framework, CompletableFutures, actors, fibres, and software transactional memory.
- Bare threads are easy to operate, but are difficult to communicate between and difficult to coordinate and to avoid race conditions (and so concurrent data corruption).
- Executor thread pools are quite easy to use, and allow control on how many threads run, and directly support inter-thread communication of results. You need to consider the queueing of tasks and how to manage task progress.
- The Fork-Join framework is a thread pool that allows tasks to be progressed recursively in parallel, so fully utilising the available processing capability. All fork-join tasks use the same globally shared pool, so blocking a thread in one task impacts every fork-join task, as nothing else can use that thread. Blocking is easy to do by accident, so take care: don't block in a fork-join task.
- CompletableFuture lets you control asynchronous tasks in a fine-grained way.
- The Actor model lets you create an inherently asynchronous and parallel application, but it can be difficult to understand large Actor based codebases.
- Fibres (lightweight threads) allow massive parallelism in the code to be mapped onto the more restricted parallel capability of OS threads.
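As a small illustration of two of the options listed above, the sketch below combines an Executor thread pool with CompletableFuture. The class and method names are hypothetical. Passing an explicit pool to supplyAsync() is a design choice for this example: without that argument the tasks would run on the shared ForkJoinPool.commonPool(), which is where blocking tasks cause the global impact described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ConcurrencyFlavors {

    /** Computes a*a and b*b as two asynchronous tasks, then combines them. */
    static int sumOfSquares(int a, int b, ExecutorService pool) {
        CompletableFuture<Integer> left =
                CompletableFuture.supplyAsync(() -> a * a, pool);
        CompletableFuture<Integer> right =
                CompletableFuture.supplyAsync(() -> b * b, pool);
        // thenCombine runs once both results are available.
        return left.thenCombine(right, Integer::sum).join();
    }

    /** Creates a small dedicated pool, runs the example, and shuts down. */
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return sumOfSquares(3, 4, pool);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```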
Java 8 JVM Memory and Thread Management (Page last updated October 2015, Added 2016-01-29, Author Ken Sipe, Publisher JavaZone). Tips:
- There are 3 axes of tuning: footprint, latency, throughput. Use -XX:+UseCompressedOops to reduce footprint with almost no overhead (it should be on by default).
- JVM Memory types: stack (-Xss to set the stack size, default is 512k per thread), which needs to be deep enough to handle recursive and reflective calls; heap (which is garbage collected, set heap size with -Xmx), typically split into multiple generations because most objects are short-lived and it is more efficient to collect these with a different collector from the generic collection used for objects of any age; perm space (pre Java 8) or metaspace (Java 8+) which includes interned strings, class metadata and code caches.
- Garbage collection occurs when the memory space is full or nearly full. The young generation is collected when Eden is full, and objects are copied to survivor space until no space is available or the object has lived across max-tenuring-threshold collections, when they get "promoted" to the old generation.
- From Java 8, there is no perm space heap. Instead metaspace is native, can be segmented, and can grow indefinitely (to fill the OS). Metaspace GC occurs when the metaspace hits MetaspaceSize; various Metaspace flags let you configure how metaspace grows and garbage collects.
- Minor GC frequency depends on your object allocation rate and sizes, and the size of Eden. The promotion rate depends on the frequency of minor GCs and the size of the survivor space.
- Minor GC duration depends on the number of live objects in the young generation.
- -XX:+UseParallelOldGC usually shows the best throughput. It offers good minor GC times, and typically 1-5 second per GB old generation pauses (so if you can avoid old gen pauses completely, it's a good GC option).
- The CMS collector minor GC times are slower than other GCs, but old generation pauses are kept low, until heap fragmentation causes a very bad pause - so avoiding fragmentation is key.
- An application that has performance heavily bound to the L1 cache can use cpuset to bind to specific cores. A JVM sees availableProcessors as the number of cores set using cpuset.
- If you assign cores using CPU shares, a JVM sees availableProcessors as all cores, even though the JVM can only use a subset.
- On a (virtualized) server, if your JVM gets assigned just one core and the JVM thinks that it only has one core, it will start in client mode - when you might want server-mode.
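To check what a particular JVM actually sees for the settings discussed above, you can ask it directly. The following is a minimal sketch using only standard JDK calls; the class and method names are hypothetical, and the pool names printed depend on the JVM version and the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class JvmView {

    /** The number of cores this JVM believes it can use. */
    static int cores() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** The maximum heap the JVM will attempt to use, in MB. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors = " + cores());
        System.out.println("max heap MB = " + maxHeapMb());

        // Lists heap and non-heap pools (e.g. Eden, survivor, old gen,
        // and perm gen or metaspace) with their current usage.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if pool is invalid
            long usedKb = (usage == null) ? -1 : usage.getUsed() / 1024;
            System.out.println(pool.getName() + " (" + pool.getType()
                    + ") used KB = " + usedKb);
        }
    }
}
```

Running this inside a container or VM and comparing the output with what you configured is a quick way to spot the cpuset versus CPU shares difference.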
JCache - Say Goodbye to proprietary Caching APIs (Page last updated October 2015, Added 2016-01-29, Author Christoph Engelbert, Jaromir Hamala, Publisher JavaZone). Tips:
- Why not just use a ConcurrentHashMap as a cache? That handles concurrency, but what about expiry, auto-cleanup, resource minimization, huge caches, etc.
- Distributed caching lets you scale out, and in a cluster the cache latency is tiny, under 100 microseconds.
- The JCache interface is similar to Map, but JCache avoids inefficiencies in the API. For example Map.put() returns the old value; for a distributed cache it's very inefficient to return a value if that is not used, and typically the Map.put() return value is not used, so the JCache put() has a void return.
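The difference between the two put() styles can be sketched without any caching library. The TinyCache interface below is a hypothetical, cut-down stand-in written for this example; it is not the javax.cache API, and the ConcurrentHashMap-backed implementation is local only, so it shows the shape of the API rather than the network saving.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class VoidPutSketch {

    /** Hypothetical cache interface; NOT the real javax.cache.Cache. */
    interface TinyCache<K, V> {
        void put(K key, V value);    // common case: old value not needed
        V getAndPut(K key, V value); // explicit opt-in to get the old value
        V get(K key);
    }

    /** Local in-memory implementation, for illustration only. */
    static final class LocalCache<K, V> implements TinyCache<K, V> {
        private final Map<K, V> store = new ConcurrentHashMap<>();

        @Override public void put(K key, V value) {
            store.put(key, value); // a remote cache could skip sending the old value back
        }

        @Override public V getAndPut(K key, V value) {
            return store.put(key, value); // Map.put() always returns the old value
        }

        @Override public V get(K key) {
            return store.get(key);
        }
    }

    public static void main(String[] args) {
        TinyCache<String, Integer> cache = new LocalCache<>();
        cache.put("a", 1);
        Integer old = cache.getAndPut("a", 2);
        System.out.println("old = " + old + ", current = " + cache.get("a"));
    }
}
```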
Understanding Thread Interruption in Java (Page last updated December 2015, Added 2016-01-29, Author Praveer Gupta, Publisher Praveer's Musings). Tips:
- One thread cannot stop another thread; you can only request the other thread to stop by calling Thread.interrupt(), which sets the interrupt status to true on the receiving thread instance. Calling Thread.interrupt() does not throw an InterruptedException in the thread - instead it is up to the (blocking) call currently being executed in the thread to check if the "interrupted" status has been set on the thread, and respond accordingly (e.g. by throwing InterruptedException).
- Most blocking methods in core Java can throw an InterruptedException, and normally will do so if they are blocking and the "interrupted" status has been set on the thread (normally by calling Thread.interrupt() on that thread). Some calls block in the OS kernel, and won't see that the thread has been interrupted, so continue to block. Any user implemented call can similarly incorrectly fail to handle being "interrupted".
- If you create a long running task, it's good practice to intermittently (whenever the task is logically interruptible) check if the "interrupted" status has been set on the thread as the task runs, and if so clean up and then throw InterruptedException.
- The Executor framework is preferred over directly managing Threads as it provides a separation of task execution from the thread management.
- The ExecutorService.shutdownNow() method interrupts the currently running tasks, and ExecutorService.awaitTermination() waits for the service to shut down.
- If you have a Future task, you can interrupt it by calling Future.cancel(true).
- When a blocking method throws InterruptedException, it clears the "interrupted" status. If you are catching the exception in order to interrupt and exit your task, you should either then rethrow the InterruptedException, or reset the "interrupt" status (Thread.currentThread().interrupt()) so that the method that called your interrupted method knows that an interrupt has occurred and has the opportunity to handle that in turn.
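The interrupt handling described above can be sketched as follows. The class and method names are hypothetical; the example shows a blocking call (Thread.sleep()) responding to an interrupt request, and the catch block restoring the "interrupted" status so that calling code can still see it.

```java
final class InterruptDemo {

    /** Blocks until the current thread is interrupted, then returns true. */
    static boolean sleepUntilInterrupted() {
        try {
            while (true) {
                Thread.sleep(10); // blocking call that checks the interrupt status
            }
        } catch (InterruptedException e) {
            // sleep() cleared the status when it threw; restore it so the
            // caller knows an interrupt occurred and can handle it in turn.
            Thread.currentThread().interrupt();
            return true;
        }
    }

    /** Interrupts a worker and reports whether its status was restored. */
    static boolean interruptWorkerAndCheckStatus() {
        boolean[] statusRestored = new boolean[1];
        Thread worker = new Thread(() -> {
            sleepUntilInterrupted();
            statusRestored[0] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        worker.interrupt(); // a request to stop, not a forced stop
        try {
            worker.join();  // join() makes the worker's write visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return statusRestored[0];
    }

    public static void main(String[] args) {
        System.out.println("status restored: " + interruptWorkerAndCheckStatus());
    }
}
```

If the catch block swallowed the exception without calling interrupt() again, the worker's status check would see false and any enclosing loop or caller would carry on as if nothing had happened.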
Last Updated: 2017-11-28
Copyright © 2000-2017 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.