Java Performance Tuning
Tips July 2018
Back to newsletter 212 contents
Production Profiling: What, Why and How (Page last updated June 2018, Added 2018-07-29, Author Richard Warburton, Publisher Voxxed days). Tips:
- Small differences between a test harness and the production system can lead to dramatically different performance.
- A performance load test with too little data, too few users, or too little variety can let the data fit into the caches, or cause very different JVM optimizations to be applied, giving performance that is very unrepresentative of production.
- The more instrumentation you add, or the more fine-grained it is, the more overhead you add. A fine-grained instrumenting profiler can more than double the time that your application takes.
- With too little monitoring you can't understand problems; with too much, the useful information is swamped by all the other metrics, which also makes problems very difficult to understand.
- Avoid unactionable metrics: a metric is unactionable if it doesn't help identify problems or trigger a need for remediation.
- Alerts that happen too often will get ignored.
- The overhead of a sampling profiler depends on the sampling mechanism. GetAllStackTraces (which calls the underlying JVMTI function) is a blocking call that uses JVM safepoints to stop the threads and walk their stacks. This is potentially a large overhead, and waiting for all threads to reach a safepoint adds further blocking time. In addition, each stack trace is always recorded at a safepoint rather than at the point where the trace call was initiated, and because the JVM optimizes out some safepoints, the samples can be skewed.
- The JVM has an efficient AsyncGetCallTrace function that walks the stack without safepointing, with much lower overhead than the blocking stack trace walk. AsyncProfiler, HonestProfiler and Java Mission Control all use this function to get stack traces. At the Linux level, perf (use perf-map-agent to make its output understandable) and eBPF work similarly. These low-overhead profilers can be run in production at a low enough sampling rate to add less than 1% overhead to the system.
- Sampling when the TLAB is copied is a low-overhead form of memory profiling, but it provides restricted information.
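The safepoint-based sampling described above can be sketched in plain Java. This is a minimal illustration only, assuming HotSpot behaviour for Thread.getAllStackTraces(); the class name NaiveSampler and its parameters are hypothetical and not taken from the talk. It is the high-overhead style of sampler that the tips warn about, not an AsyncGetCallTrace-based one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a safepoint-based sampling profiler.
final class NaiveSampler {

    /**
     * Takes up to `samples` snapshots, `intervalMillis` apart, and counts how
     * often each method is the top frame of a RUNNABLE thread.
     */
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // Blocking call: on HotSpot the threads are stopped at a safepoint
            // while their stacks are walked, so every sample lands on a safepoint.
            Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
                Thread t = e.getKey();
                StackTraceElement[] stack = e.getValue();
                if (t == Thread.currentThread() || stack.length == 0) {
                    continue; // skip the sampler itself and threads with no Java frames
                }
                if (t.getState() != Thread.State.RUNNABLE) {
                    continue; // only count threads that could be using CPU
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample this JVM 50 times at 20 ms intervals and print the hot top frames.
        sample(50, 20).forEach((frame, count) -> System.out.println(count + "  " + frame));
    }
}
```

Raising the sampling interval lowers the overhead but also the resolution; the safepoint skew remains at any interval.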
Applied Performance Theory (Page last updated June 2018, Added 2018-07-29, Author Kavya Joshi, Publisher QCon). Tips:
- Yolo method: try it and see if it works
- Load simulation: Stress to find the load bottlenecks
- Performance modeling: create a model of your system and analyze that to improve performance
- Model servers as queueing systems. Response time = queueing delay + service time
- Utilization = arrival rate * service time. Queueing delay (and so response time) grows in proportion to U/(1-U), which gives a hockey stick graph. The maximum throughput of a server is at the bend of the hockey stick.
- Prevent the queue from getting too long to avoid overloading the server. Set a maximum queue length, and back off from making requests when too many are queued (that is, when responses are getting slower).
- Controlled delay: queues are typically empty, so if a queue hasn't been emptied recently, set shorter timeouts for rejecting queued requests.
- Use LIFO queueing, because older requests are likely to have timed out already.
- Improve mean response times by decreasing service time, decreasing service time variability, and batching requests.
- Little's law for closed systems (synchronized requests and a limited number of clients): requests = throughput * (sleep time + response time). At the high-throughput end, response time grows linearly with the number of clients.
- Business systems are typically mostly open systems, but load simulators are typically closed systems. The two have very different models, so load testing may give you inaccurate information, typically lower response times and better tolerance of variability than production will show.
- Capacity planning for clusters asks: how many servers do you need to support the target throughput at the given response time targets?
- Systems don't scale linearly because of two penalties: the contention penalty (contention for shared resources), a linear overhead described by Amdahl's law; and the coordination penalty (typically from synchronizing mutable state), a quadratic overhead added by the Universal Scalability Law.
- The Universal Scalability Law says that to improve scalability you should minimize contention and synchronization, for example with better data partitioning, smaller partitions, smarter aggregation, better load balancing, finer-grained locking, or multiversion concurrency control.
- Run experiments to get the data for your Universal Scalability Law graph and use that to infer the scalability.
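The formulas in the tips above can be tried out numerically. The sketch below is illustrative only: the class and method names are hypothetical, the response-time formula assumes a single-server M/M/1 queue (an assumption not stated in the tips), and the Universal Scalability Law is written in its usual form C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)), where alpha is the contention coefficient and beta the coordination coefficient.

```java
// Hypothetical helper collecting the queueing and scalability formulas.
final class ScalingModel {

    /** Utilization U = arrival rate * service time. */
    static double utilization(double arrivalsPerSec, double serviceTimeSec) {
        return arrivalsPerSec * serviceTimeSec;
    }

    /**
     * Mean response time = service time + queueing delay, where the queueing
     * delay is serviceTime * U/(1-U) under the M/M/1 assumption.
     */
    static double responseTime(double arrivalsPerSec, double serviceTimeSec) {
        double u = utilization(arrivalsPerSec, serviceTimeSec);
        if (u >= 1.0) {
            return Double.POSITIVE_INFINITY; // the queue grows without bound
        }
        return serviceTimeSec + serviceTimeSec * u / (1.0 - u);
    }

    /** Little's law for a closed system: throughput = clients / (sleep time + response time). */
    static double closedThroughput(int clients, double sleepSec, double responseSec) {
        return clients / (sleepSec + responseSec);
    }

    /** Universal Scalability Law: relative capacity of n nodes versus one node. */
    static double uslCapacity(int n, double alpha, double beta) {
        return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1));
    }

    public static void main(String[] args) {
        double serviceTime = 0.010; // assumed 10 ms service time
        for (double rate : new double[] {10, 50, 80, 90, 95, 99}) {
            System.out.printf("rate=%.0f/s U=%.2f response=%.1f ms%n",
                    rate, utilization(rate, serviceTime), 1000 * responseTime(rate, serviceTime));
        }
        for (int n : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            // alpha and beta here are made-up values; fit them from your own measurements.
            System.out.printf("nodes=%d capacity=%.2f%n", n, uslCapacity(n, 0.05, 0.001));
        }
    }
}
```

Printing the response time for rising arrival rates shows the hockey stick; the coefficients alpha and beta would in practice be fitted to measured throughput at several node counts.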
Architecting for performance. A top-down approach (Page last updated June 2018, Added 2018-07-29, Author Ionut Balosin, Publisher JPrime). Tips:
- Response time targets and what they require: seconds are easily achievable (small methods, minimize branching, use cohesion, abstract cleanly); hundreds of milliseconds need general performance tactics (optimized data structures, minimized algorithm complexity, batching and caching); tens of milliseconds (low latency) need specific performance tactics (memory access patterns optimized for CPU caches, lock-free algorithms, asynchronous processing, statelessness, RamFS/TmpFS, GC and object lifecycle tuning); under 1 millisecond (ultra-low latency) needs very specific techniques (thread affinity, NUMA, large pages, avoiding false sharing, data-oriented design, disabling C-states, ensuring CPU-cache-friendly operations).
- Cohesion vs decoupling: classes shipped together should have high cohesion (gives better locality, more CPU cache friendly); elements not related should be decoupled.
- Cyclomatic complexity (the number of different paths your application can take) increases the number of branches, which are less predictable and so less efficient on CPUs. It also makes code more difficult to read.
- Big-O complexity alone is insufficient to analyse algorithm performance on modern systems: data transfer dominates costs, so working with the system's data transfer efficiencies can matter more than algorithmic efficiency.
- Reduce code footprint as much as possible. Use small methods, fewer branches, minimize indirections, and use primitives instead of objects.
- Caching is a top tip to improve performance, but you must cache and evict in line with the read/write patterns, or caching is just an overhead. Eviction algorithms include LRU, LIFO and FIFO; fetching strategies include pre-fetch, on-demand and predictive; topologies include local, distributed, replicated and partitioned.
- Batching minimizes the number of round trips, and batch size should be dynamically tuned to the handling rate (both receivers). The BBR congestion control algorithm is a good option that can be applied to batches.
- Asynchronous and stateless processing makes for a very scalable application. CompletableFuture is useful.
- Memory access patterns that are helpful: striding (predictable patterns); spatial (nearby memory is needed soon); temporal (memory recently accessed will be needed again soon).
- Lock-free algorithms avoid deadlocks.
- Cache-oblivious algorithms are more CPU cache friendly independent of the CPU cache size.
- Data-oriented design creates data structures according to how data is read and written to optimize performance.
- Thread affinity binds a thread to a range of or a single CPU, which optimizes how the CPU cache is used, as long as other threads are blocked from those CPUs.
- NUMA affects how data is transferred. Try to size the application so that its data fits onto one socket. There is -XX:+UseNUMA for the parallel collector (G1 will support this at some point in the future).
- Large pages (-XX:+UseLargePages) reduce TLB misses for memory-intensive applications with large contiguous memory accesses. When profiling, look at TLB misses and TLB page walks to determine whether large pages might help.
- RamFS/TmpFS are useful for IO intensive applications.
- False sharing (different threads on different cores accessing, at the same time, different fields that happen to fall onto the same cache line) can reduce performance significantly. The @Contended annotation allows the JVM to put the fields on different cache lines. Note there is no problem if none of the falsely shared fields are being written to.
- SSDs can be tuned. Trim can be on (asynchronous cleaning) or off, and the IO scheduler can be changed between noop, deadline and cfq; noop seems to be faster.
- C-states are a CPU power management feature, by default configured to save power (the core is clocked down if it has not been used recently). Disable C-states if you need optimal performance.
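As one concrete instance of the caching tip above, LRU eviction can be sketched with the JDK's LinkedHashMap in access order. This is a minimal, single-threaded sketch; the class name LruCache and the capacity parameter are illustrative, and a production cache would also need thread safety and a policy chosen to match the actual read/write pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical minimal LRU cache; not thread-safe.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int capacity;

    LruCache(int capacity) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently used entry when over capacity.
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // touch "a" so that "b" becomes the eviction candidate
        cache.put("c", 3); // exceeds capacity, so the least recently used entry is dropped
        System.out.println(cache.keySet());
    }
}
```

If reads rarely repeat keys, a cache like this only adds overhead, which is the point the caching tip makes about matching eviction to access patterns.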
Transactions and Concurrency Control Patterns (Page last updated June 2018, Added 2018-07-29, Author Vlad Mihalcea, Publisher JPrime). Tips:
- Serial execution testing is not the same as parallel execution testing.
- You can avoid concurrency issues by avoiding concurrency, using one controlling thread as Node.js does.
- Concurrency control shared amongst multiple processes/threads can be coordinated with locks, and can be reasoned about well.
- Multi-version concurrency control (MVCC) uses a version stamp to determine which readers and writers have the latest data. MVCC is an optimistic approach, in contrast to two-phase locking (2PL), which is pessimistic.
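The version-stamp idea can be illustrated in memory with a compare-and-set on an immutable snapshot. This is only an analogy for optimistic concurrency control, not how any particular database implements MVCC; the class VersionedAccount and its methods are hypothetical names chosen for the sketch.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory sketch of optimistic, version-stamped updates.
final class VersionedAccount {

    /** Immutable snapshot: a balance together with the version at which it was written. */
    static final class Snapshot {
        final long version;
        final long balance;

        Snapshot(long version, long balance) {
            this.version = version;
            this.balance = balance;
        }
    }

    private final AtomicReference<Snapshot> current;

    VersionedAccount(long initialBalance) {
        this.current = new AtomicReference<>(new Snapshot(0, initialBalance));
    }

    /** Readers never block: they simply see the latest committed snapshot. */
    Snapshot read() {
        return current.get();
    }

    /**
     * Commits only if no other writer has committed since `seen` was read.
     * Returns false on a conflict, in which case the caller re-reads and retries.
     */
    boolean tryWrite(Snapshot seen, long newBalance) {
        return current.compareAndSet(seen, new Snapshot(seen.version + 1, newBalance));
    }

    public static void main(String[] args) {
        VersionedAccount account = new VersionedAccount(100);
        Snapshot first = account.read();
        Snapshot second = account.read();
        System.out.println("first writer committed: " + account.tryWrite(first, 150));
        System.out.println("stale writer committed: " + account.tryWrite(second, 50));
        System.out.println("balance=" + account.read().balance + " version=" + account.read().version);
    }
}
```

No lock is held between read and write; a writer holding a stale snapshot simply fails and must retry, which is the optimistic trade-off compared with taking locks up front.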
Last Updated: 2020-12-28
Copyright © 2000-2020 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.