Java Performance Tuning
Java(TM) - see bottom of page
Tips July 2015
A Brief History of Scaling LinkedIn (Page last updated July 2015, Added 2015-07-28, Author Josh Clemm, Publisher LinkedIn). Tips:
- Subsystems that need to scale independently of the system should be separate processes.
- Vertical scaling (faster CPU, more cores and more memory on a single box) can buy you time, but if the load is growing, eventually that can't scale further.
- Use database replicas to split read and write operations to different servers, where it is valid for reads to be slightly out of date compared to writes.
- Splitting a large monolithic service into microservices gives more flexibility and better scaling characteristics.
- Stateless services can increase scale simply by adding a new instance to the load balancer targets.
- Caching and precomputing results help systems scale tremendously. Keep the cache as close to the data store as possible.
- Dynamic discovery of services gives you automated client-based load balancing, discovery, and scalability of each service API.
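The point about stateless services can be sketched in a few lines. This is a minimal illustration, not LinkedIn's implementation: the class name RoundRobinBalancer and the instance addresses are hypothetical, and a real balancer would also handle health checks and instance removal.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin balancer over interchangeable (stateless) instances. */
class RoundRobinBalancer {
    // Copy-on-write list: instances can be added while requests are being routed.
    private final List<String> instances = new CopyOnWriteArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    /** Scaling out is just registering one more target. */
    void addInstance(String address) {
        instances.add(address);
    }

    /** Returns the next target in rotation. Any instance will do because none holds state. */
    String pick() {
        int size = instances.size();
        if (size == 0) {
            throw new IllegalStateException("no instances registered");
        }
        // floorMod keeps the index valid even after the counter overflows.
        return instances.get(Math.floorMod(next.getAndIncrement(), size));
    }

    public static void main(String[] args) {
        RoundRobinBalancer balancer = new RoundRobinBalancer();
        balancer.addInstance("10.0.0.1:8080"); // made-up addresses
        balancer.addInstance("10.0.0.2:8080");
        for (int i = 0; i < 4; i++) {
            System.out.println(balancer.pick());
        }
    }
}
```

Because no instance keeps per-client state, the balancer never needs to route a client back to the instance it used before.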
Practical Java performance tuning: Techniques and Optimizations Across 20 Years of Development (Page last updated July 2015, Added 2015-07-28, Author Ben Evans, James Gough, Publisher O'Reilly). Tips:
- [The first 80 minutes is a detailed description of how the JVM works - skip the first 80 minutes if you are stuck for time and are looking just for the practical advice].
- For the highest performance in CPU-bound code, your data needs to be in the CPU cache; otherwise you are accessing main memory, which is an order of magnitude slower.
- Microbenchmarking Java is hard and usually pointless.
- Check that your GC and memory settings are sensible.
- If the code cache fills, JIT compilation stops.
- Make sure everything is getting JIT compiled. A quick check is to run with the -XX:+PrintCompilation flag, then rerun with -XX:ReservedCodeCacheSize doubled; if more methods are compiled the second time, the code cache was too small.
- JITWatch parses the very large logfile produced by the -XX:+LogCompilation, -XX:+TraceClassLoading and -XX:+PrintAssembly flags, showing in great detail what is happening in JIT compilation. LogCompilation and PrintAssembly are diagnostic options, so -XX:+UnlockDiagnosticVMOptions is also needed.
- Use a thread profiler to identify waiting threads and why. Common reasons for waiting are: slow external (db, REST, SOAP, authentication, authorisation) queries; slow network; too much RMI.
- Techniques to reduce external waits include spreading requests across more external servers; caching; batching.
- OS problems (use vmstat to monitor) include: disk IO (too much, conflicting, slow hardware); context switching; threads contending for hardware resources. Use SSDs, look for spikes and what causes them, and move contending processes to different systems.
- JVM performance issues are usually from garbage collection (GC). Note that GC time is in user space, not sys space. Tuning GC is complex. Use GC log files, produced with -Xloggc:... -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
- It's unusual for application code to be slow. You can use a profiler like VisualVM as a first step if it is slow.
- If your system doesn't have enough capacity to handle the request load, scale by adding instances.
- Measure your application's performance, JVMs, OSes, network traffic and external communications to find the actual problems.
- After any change, measure the effect of the change to show it has improved.
- Normal distribution (bell curve) statistics are often incorrect for measuring performance; Gamma distributions (long tail) are usually more useful. Use long-tail percentiles: 99%, 99.9%, 99.99% etc.
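The advice on long-tail percentiles can be illustrated with a small helper. This is a sketch using the nearest-rank definition of a percentile; the class name LatencyPercentiles and the sample latencies are made up for illustration and do not come from the talk.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over recorded latencies (illustrative helper). */
class LatencyPercentiles {
    /** Smallest sample such that at least p percent of all samples are less than or equal to it. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // Made-up data: 980 fast requests, 15 slow ones, 5 very slow ones (milliseconds).
        long[] latencies = new long[1000];
        Arrays.fill(latencies, 0, 980, 10);
        Arrays.fill(latencies, 980, 995, 200);
        Arrays.fill(latencies, 995, 1000, 5000);

        System.out.println("mean  = " + mean(latencies));
        System.out.println("p50   = " + percentile(latencies, 50));
        System.out.println("p99   = " + percentile(latencies, 99));
        System.out.println("p99.9 = " + percentile(latencies, 99.9));
    }
}
```

With this data the mean is 37.8 ms and the median is 10 ms, yet the 99th percentile is 200 ms and the 99.9th is 5000 ms. The tail that users actually notice is invisible in the average.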
Developing applications with a micro-service architecture (Page last updated May 2015, Added 2015-07-28, Author Chris Richardson, Publisher JaxEnter). Tips:
- JVMs running large codebases tend to take longer to start; micro-service JVMs tend to start quickly.
- Horizontal duplication scaling is running multiple copies of a service behind a load balancer; data partitioning scaling scales by routing requests to the appropriate service dependent on the data; functional decomposition scaling is scaling by sending the requests to the appropriate functional service (microservice). These three scaling techniques can be applied independently of each other.
- Too much microservice decomposition can cause too many network hops, excessive runtime overheads, and a complex, interlinked system that is difficult to understand. Note that having multiple parallel microservices is a separate issue: it increases throughput without adding network hops, so it is a valid scaling technique.
- An Amazon product page can hit over 100 microservices to render.
- A microservice architecture improves fault isolation and allows technology to be upgraded piecemeal rather than across the entire system and encourages A/B testing.
- A microservice architecture has to handle failures of services, communication failures, eventual consistency, and the difficulty of testing a distributed system.
- Use an API gateway to provide client-specific requests and protocols for optimal client handling; the gateway translates the request into standardised internal requests.
- For performance-critical microservices, you might need to use non-blocking IO, asynchrony and concurrent request handling.
- Microservices should handle partial failures, eg some dependent microservices being unavailable while others are still available. Netflix Hystrix is a library for dealing with partial failure.
- For communication protocols, asynchronous is usually more efficient than synchronous, and binary formats are usually more efficient than text formats, although they are more difficult to understand.
- Messaging decouples clients and servers and can be very efficient and fault tolerant, but adds complexity and latency.
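The partial-failure tip can be sketched with plain JDK classes. This is not the Hystrix API: it only shows the underlying idea of bounding a dependent call with a timeout and returning a fallback when the call is slow or fails. The class name FallbackCaller and the "cached" fallback value are hypothetical, and a library such as Hystrix adds circuit breaking and metrics on top.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: run a dependent call with a time limit and a fallback value. */
class FallbackCaller {
    // Daemon threads so a hung dependency cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    /** Returns the call's result, or the fallback if it fails or exceeds the timeout. */
    <T> T callWithFallback(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // stop waiting on the unavailable dependency
            return fallback;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        FallbackCaller caller = new FallbackCaller();
        // Simulated dependency that is too slow: the caller degrades to a cached value.
        String result = caller.callWithFallback(() -> {
            Thread.sleep(2000);
            return "live recommendations";
        }, 100, "cached recommendations");
        System.out.println(result);
        caller.shutdown();
    }
}
```

The caller keeps serving a degraded response while one dependency is down, instead of failing the whole request.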
Introduction to JMH Profilers (Page last updated June 2014, Added 2015-07-28, Author Mikhail Vorontsov, Publisher java-performance.info). Tips:
- The JMH framework includes several profilers: MBean Classloader profiling (CL); MBean JIT compiler profiling (COMP); MBean GC profiling (GC); HotSpot classloader profiling (HS_CL); HotSpot JIT compiler profiling (HS_COMP); HotSpot memory manager (GC) profiling (HS_GC); HotSpot runtime profiling (HS_RT); HotSpot threading subsystem profiling (HS_THR); and a simple stack profiler (STACK).
- The JMH Stack profiler samples thread stacks at regular intervals to find the methods most often seen executing. It is configured with the properties jmh.stack.lines (number of stack lines to keep for each sample, default 1), jmh.stack.period (sampling interval in ms, default 10), jmh.stack.top (number of traces to show in the output, default 10) and jmh.stack.detailLine (distinguish different execution lines in the same method when jmh.stack.lines=1, default false). Taking thread dumps too often will slow down performance noticeably.
- The JMH Simple JIT compiler profiler (COMP) reports the time spent by the JIT compiler. It's useful for determining how many iterations are needed to warm up the code until JIT compilation is no longer relevant.
- The JMH Simple garbage collection profiler (GC) reports the time spent in garbage collections and how many occurred, but is not as useful as logging GC output.
- The JMH Hotspot profilers use internal JVM counters to measure what they are profiling.
- The JMH Hotspot compilation profiler (HS_COMP) gives you statistics on the number of compiled blocks of code, on-stack replacements, invalidated code and bailouts.
- The JMH Hotspot threading profiler (HS_THR) gives you deltas on started/stopped/active thread counts.
- The JMH Hotspot runtime profiler (HS_RT) gives you information about Java locks (monitors) and safepoints.
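To make the stack-sampling idea concrete, here is a minimal sketch of how such a profiler can work. It is not JMH's code: the class StackSampler is hypothetical, it keeps only the top frame of each sample (comparable to one stack line per sample), and it samples a single thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sampling profiler: counts which method is on top of a thread's stack. */
class StackSampler {
    /** Takes up to 'samples' stack samples of 'target', one every 'periodMillis'. */
    static Map<String, Integer> sample(Thread target, int samples, long periodMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples && target.isAlive(); i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                StackTraceElement topFrame = stack[0];
                hits.merge(topFrame.getClassName() + "." + topFrame.getMethodName(), 1, Integer::sum);
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    /** Formats the n most frequently seen frames, most frequent first. */
    static List<String> top(Map<String, Integer> hits, int n) {
        return hits.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(e -> e.getValue() + " samples: " + e.getKey())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A worker that burns CPU so the sampler has something to observe.
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        });
        worker.setDaemon(true);
        worker.start();
        top(sample(worker, 50, 10), 10).forEach(System.out::println);
        worker.interrupt();
    }
}
```

Each sample forces a stack walk of the target thread, which is why a very short sampling period distorts the measurement.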
Implementing Filter and Bakery Locks in Java (Page last updated April 2015, Added 2015-07-28, Author Furkan KAMACI, Publisher furkankamaci). Tips:
- Java does not guarantee linearizability, or even sequential consistency when reading or writing fields of shared objects (so you should use appropriate techniques such as synchronization, volatile fields, or Atomic data structures).
- The Filter lock handles up to N threads by having N-1 waiting slots; a thread has to traverse all the slots to acquire the lock; if more than one thread attempts to enter a slot, only one will succeed and the others will wait (spinning), resulting in only one thread at a time being able to traverse all slots and so acquire the lock. As threads waiting to acquire a lock spin, you should be aware of the CPU cost of using a Filter lock for anything other than short critical sections.
- A Bakery lock has every thread that wants a lock take a new higher numbered ticket (or potentially the same numbered ticket as another thread trying to acquire a lock, the lock doesn't mind) and then has the thread wait spinning until there is no lower numbered ticket (or equal value ticket with a lower ID thread) before it gains the lock. As threads waiting to acquire a lock spin, you should be aware of the CPU cost of using a Bakery lock for anything other than short critical sections.
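The Filter lock described above can be sketched as follows. This follows the well-known textbook formulation rather than the article's exact code. It assumes each thread is given a fixed id from 0 to N-1 by the caller, and it uses AtomicIntegerArray so that reads and writes of the shared arrays have volatile semantics. The demo method is a hypothetical harness added for illustration.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Filter lock for n threads with ids 0..n-1 (textbook formulation, spin-waiting). */
class FilterLock {
    private final int n;
    private final AtomicIntegerArray level;  // level[i]: highest slot thread i is trying to enter
    private final AtomicIntegerArray victim; // victim[l]: last thread to arrive at slot l

    FilterLock(int n) {
        this.n = n;
        this.level = new AtomicIntegerArray(n);
        this.victim = new AtomicIntegerArray(n);
    }

    void lock(int me) {
        for (int l = 1; l < n; l++) {
            level.set(me, l);
            victim.set(l, me);
            // Wait while another thread is at this slot or higher and we were last to arrive.
            while (otherThreadAtOrAbove(me, l) && victim.get(l) == me) {
                Thread.yield(); // spinning: this burns CPU while waiting
            }
        }
    }

    void unlock(int me) {
        level.set(me, 0);
    }

    private boolean otherThreadAtOrAbove(int me, int l) {
        for (int k = 0; k < n; k++) {
            if (k != me && level.get(k) >= l) {
                return true;
            }
        }
        return false;
    }

    /** Hypothetical harness: each thread increments a plain counter under the lock. */
    static int demo(int threads, int iterations) {
        FilterLock lock = new FilterLock(threads);
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    lock.lock(id);
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock(id);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000));
    }
}
```

If mutual exclusion holds, no increment is lost and demo returns threads * iterations. The arrays must not be plain int[] fields, because without volatile semantics the Java memory model does not guarantee that one thread sees another's writes.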
Five ways to maximize Java NIO and NIO.2 (Page last updated October 2012, Added 2015-07-28, Author Cameron Laird, Publisher JavaWorld). Tips:
- NIO file-system change notifications are faster than polling and usually less resource-intensive. Polling for file-system changes involves checking the file-system or other object, comparing it to its last-known state, and, if there's no change, checking back again after an interval. NIO notification involves setting up a WatchService and waiting for a notification.
- It's normally more efficient to look for the event that signifies the end of a file modification, rather than any file modification event.
- NIO non-blocking IO is event based; a Selector waits for an IO event, and then wakes up. A Selector allows you to process multiple IO events from a single thread, often referred to as multiplexing.
- NIO can perform worse than basic Java I/O, but it is generally much more responsive.
- For simple sequential reads and writes of small files, a straightforward Stream implementation might be two or three times faster than the corresponding NIO event-oriented channel-based coding.
- Non-multiplexed channels (channels handled in separate threads) can be much slower than multiplexed channels (those registered with a selector in a single thread).
- The most consistently dramatic performance improvements from using NIO generally involve memory mapping.
- Memory mapping lets IO happen at the speed of memory access rather than file access. The former is often two orders of magnitude faster than the latter.
- Memory mapping allows several different readers and writers (even from different processes) to attach simultaneously to the same file image and all operate concurrently. Of course you need to explicitly guard against corruption if doing that.
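The memory-mapping tips can be sketched with FileChannel.map. This is a minimal illustration: the class MappedFileDemo, its method names and the temporary file are hypothetical, and it maps the whole file, which is only reasonable for files well under 2 GB.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo: read and update a file through a memory mapping. */
class MappedFileDemo {
    /** Overwrites bytes starting at 'offset' by writing into the mapped region. */
    static void writeAt(Path file, int offset, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            map.position(offset);
            map.put(data);  // an ordinary memory write, no explicit write() call
            map.force();    // ask the OS to flush the changed pages to the file
        }
    }

    /** Reads the whole file through a read-only mapping. */
    static byte[] readAll(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] out = new byte[map.remaining()];
            map.get(out);
            return out;
        }
    }

    /** Creates a temporary file, edits it through one mapping, reads it back through another. */
    static String demo() {
        try {
            Path file = Files.createTempFile("mapped", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, "hello world".getBytes(StandardCharsets.US_ASCII));
            writeAt(file, 0, "HELLO".getBytes(StandardCharsets.US_ASCII));
            return new String(readAll(file), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The mapping itself provides no locking. If several threads or processes map the same file, they need their own coordination, for example FileChannel.lock or an application-level protocol.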
Last Updated: 2018-10-29
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss