Java Performance Tuning
Tips December 2018
https://www.youtube.com/watch?v=BD9cRbxWQx8
How low can you go? Ultra low latency Java in the real world (Page last updated December 2018, Added 2018-12-26, Author Daniel Shaya, Publisher LJC). Tips:
- Ultra low latency means under 100 microseconds, and you care about outliers (eg p99.99). Java can get down to the 10 microsecond range.
- Hardware - overclocked boxes, fast networks, fast switches, network hardware acceleration cards, colo, specially laid lines, FPGA, ASIC
- With parallel pipelines on a multi-core processor you can get 100+ cycles in a nanosecond (light travels 30 cm in a nanosecond)
- You can't afford any GC at all for ultra low latency, and you need to warm up the application so that all code paths are JIT-compiled and no recompilation happens later.
- Optimizing memory layout is an important optimization for ultra low latency
- Techniques to avoid GCs: the API shouldn't force allocation (eg reuse existing data structures rather than copy); use CharSequence instead of Strings (Strings tend to be the biggest cause of allocation).
- Operations on the CPU are almost free compared to memory access - O(N) analysis is insufficient because you need to take into account memory transfer overheads which can dominate.
- To reduce outliers you need to figure out how to fix something you only see once in 10 000 requests.
- Measuring times in code is insufficient if you don't also measure the time taken to reach the time-measuring code (missing that time is called coordinated omission), ie you have to measure right from the edges.
- Specify latency requirements in detail, eg Activity A running in System S for a throughput of X over a duration of T on hardware H with OS configuration O should not exceed latency L1 at percentile P1 (and L2 at P2 etc), timed using methodology M measured from the edges (eg by packet capture).
- The test harness should not be blocked by the application. Requests being sent should be independent of times taken to get the results.
- Use a null implementation of a component and measure the latency for that - you can't get latency better than that, this defines your minimum achievable latency.
- For ultra-low latency: you can't have any GCs; use shared memory; apply single-threaded processing logic (no synchronization) with the thread pinned to the core and all other threads excluded from that core; use simple object pooling (single-threaded); scale by partitioning data across non-shared threads/processes/microservices; spin when waiting for data to keep hold of the CPU and keep it hot; record everything so that you can replay in test to analyse outliers; don't cross NUMA regions, each process/microservice should run on one core; use wait-free data structures (no waits and guaranteed that the thread can always proceed and will finish within a set number of cycles); run replicated hot-hot for high availability.
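The tips above on coordinated omission and on keeping the test harness independent of the application can be illustrated with a small simulation. This is only a sketch, not code from the talk: the class name, method names, and the single-server model are assumptions made for illustration. It compares latencies as a blocking harness would record them (from actual send time) with latencies measured from the time each request was intended to be sent.

```java
import java.util.Arrays;

// Hypothetical sketch: all names here are illustrative, not from the talk.
final class CoordinatedOmissionSketch {

    /** What a blocking harness records: it only starts timing when it actually sends. */
    static long[] naiveLatencies(long[] serviceNanos) {
        return serviceNanos.clone();
    }

    /**
     * Latency measured from the intended send time of each request.
     * Assumes one request is scheduled every intervalNanos and a single
     * server handles requests one at a time, in order.
     */
    static long[] correctedLatencies(long[] serviceNanos, long intervalNanos) {
        long[] result = new long[serviceNanos.length];
        long serverFreeAt = 0; // simulated time at which the server is next idle
        for (int i = 0; i < serviceNanos.length; i++) {
            long intendedSend = i * intervalNanos;
            long start = Math.max(intendedSend, serverFreeAt);
            long done = start + serviceNanos[i];
            result[i] = done - intendedSend; // includes time spent waiting behind a slow request
            serverFreeAt = done;
        }
        return result;
    }

    /** Nearest-rank percentile, p in (0, 100]. */
    static long percentile(long[] values, double p) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }
}
```

With one slow request in the middle of a run, the naive figures show a single outlier, while the corrected figures also show the delay suffered by the requests queued behind it, which is the effect that measuring from the edges is meant to capture.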
https://www.youtube.com/watch?v=rnHY7YJq1ps
Allocation - Mechanics, Profiling & Optimization (Page last updated November 2018, Added 2018-12-26, Author Nitsan Wakart, Publisher GeeCon). Tips:
- Objects in the old generation that are dead could still be keeping young generation objects alive because the old gen objects haven't been garbage collected yet. For data structures like linked lists or trees, this can cause extra garbage to be pulled in to the old gen. For these, it's best to null out references when they are no longer needed, so that the young gen collection knows the objects are no longer referenced from the old gen.
- Varargs and foreach both can allocate objects without you realizing
- Thread local allocation buffers give each thread its own chunk of memory. Objects are created there without contention, and the thread only needs to ask for more memory from the centrally allocated memory (which IS contended) when its thread local allocation buffer is used up. The same pattern (hand each thread a private chunk of a shared resource) is applicable to similar multi-threaded operations.
- You can fit a JVM in an L3 cache (45MB)! Tiny apps can be very fast.
- Techniques to reduce allocations (after finding the allocation hotspots): replacing iterators with counted loops; avoid varargs; avoid strings; use primitives (unboxed); pass a function to a list iterating method instead of passing the list to be iterated over; reuse special values (like empty lists); reduce copies; allocate less often; use ThreadLocal to reuse thread unsafe objects; use object pooling; lazy allocate; go off heap.
- If the young gen is too small, you get premature promotion in to the old gen, which tends to be inefficient.
- Typical causes of middle-aged objects (not long- nor short-lived) are caches, buffers, connections, sessions. Use heap dumps to compare snapshots over time to find these. The JVM is not well optimized for middle-aged objects, so these can impact the efficiency of your app.
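A few of the allocation-reducing techniques listed above can be sketched in Java as follows. This is a minimal illustration rather than code from the talk: the class and method names are hypothetical, and whether each change actually helps should be checked with an allocation profiler, since the JIT may already eliminate some of these allocations.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names are illustrative, not from the talk.
final class AllocationSketch {

    // Reuse a thread-unsafe object (StringBuilder) via ThreadLocal instead of allocating per call.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    /** Counted loop instead of for-each: no Iterator is created (suits RandomAccess lists such as ArrayList). */
    static int totalLength(List<String> items) {
        int total = 0;
        for (int i = 0, n = items.size(); i < n; i++) {
            total += items.get(i).length();
        }
        return total;
    }

    /**
     * Fixed-arity method (no varargs array) returning a CharSequence rather than a new String.
     * The returned buffer is only valid until the next call on the same thread.
     */
    static CharSequence formatKeyValue(String key, int value) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reset the buffer for reuse; keeps its capacity
        sb.append(key).append('=').append(value);
        return sb;
    }

    /** Reuse a special value: the shared empty list instead of a fresh empty ArrayList. */
    static <T> List<T> orEmpty(List<T> list) {
        return list == null ? Collections.<T>emptyList() : list;
    }
}
```

The trade-off in `formatKeyValue` is that callers must consume the result before calling it again on the same thread, which is the usual price of reusing buffers.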
https://www.youtube.com/watch?v=BTIcja5xcK0
Java Garbage Collectors' Current Performance Impact (Page last updated October 2018, Added 2018-12-26, Author Sergey Kuksenko, Publisher Oracle Code One). Tips:
- Java 11 has 7 garbage collectors available: CMS, G1, Parallel, Serial, ZGC, Shenandoah, Epsilon (not actually a collector).
- Garbage collector "sweet spots": CMS - mostly concurrent low pause; Epsilon - tests and no-gc; G1 - balanced throughput vs pause; Parallel - high throughput; Serial - low footprint; ZGC - mostly concurrent low pause in large heaps.
- Native memory tracking can be turned on with -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail, and -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics
- The best GC for: throughput - Parallel; latency (pauses) - ZGC then G1; memory - Serial then CMS; easy tuning - Serial then Parallel then ZGC (CMS is far and away the worst with over 70 tuning options).
- Parallel and G1 GCs intentionally co-locate objects in JVM memory during the GC copying phase when those objects were referenced together. This can actually increase throughput since the objects could be pulled in to the CPU cache together.
- ZGC uses significantly more additional CPU than other collectors; ZGC doesn't work well on small heaps (eg below 16G).
- If logging pauses, you should log safepoint pauses as well as GC pauses, because stopping threads can take significant time.
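When comparing collectors as described above, it can help to confirm from inside the JVM which collectors are actually active and how much time they have accumulated. The sketch below uses the standard `java.lang.management` API; the class and method names are illustrative only. Note that these counters report collection activity and do not include the time taken to bring threads to a safepoint, so they do not replace safepoint logging (on JDK 9 and later, unified logging such as `-Xlog:gc*,safepoint` is one way to get both).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative.
final class GcReportSketch {

    /** One line per collector bean registered in this JVM. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount/getCollectionTime return -1 if the collector does not define them.
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", accumulatedMillis=" + gc.getCollectionTime());
        }
        return lines;
    }
}
```

Printing the result at startup and at the end of a benchmark run gives a quick check that the intended collector flag took effect; the bean names differ between collectors and JDK versions.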
https://medium.com/airbnb-engineering/building-services-at-airbnb-part-3-ac6d4972fc2d
Building Services at Airbnb, Part 3 (Page last updated December 2018, Added 2018-12-26, Author Weibo He, Liang Guo, Publisher Airbnb). Tips:
- Some common reliability issues from load are: request spikes, system overload, server resource exhaustion, aggressive retry, cascading failures. These are well handled by applying request timeouts, retries with exponential backoff, and circuit breakers.
- Async processing smooths loads from request spikes and supports higher throughput, at the cost of higher latency.
- Every server has a limit to the number of requests that it can handle within its defined service level objectives for latency. This limit is a function of the limited resources available to it. Most services are network I/O-bound, not compute-bound, and therefore are more likely to be limited by memory.
- A Controlled Delay Queue uses FIFO with a normal request timeout value. If the queue is not being emptied within the timeout, the timeout switches to a more aggressive value and the queue switches to LIFO. This allows requests that are most likely to timeout to be discarded while still processing the maximum number of requests that the resources will support. This is ideally combined with back-pressure to the clients reducing the load on the server when it is under pressure.
- Aggressive client retries are the most common cause of cascading failures because the retry storm does not give an overloaded service any room to recover.
- An API timeout should be propagated with the request to downstream systems so that the request is fast failed at those downstream systems when the timeout threshold is reached.
- Clients should be rate-limited on a per-client basis so that badly executing clients cannot swamp a server and deny service to other well behaved clients.
- Use error rates to detect badly performing instances and replace those in the cluster automatically.
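The Controlled Delay Queue behaviour described above can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions, not Airbnb's implementation: the class name, the "time since the queue was last empty" overload test, and the caller-supplied clock (used here to keep the example deterministic) are all choices made for this sketch.

```java
import java.util.ArrayDeque;

// Hypothetical sketch: not thread-safe, names and overload test are assumptions.
final class ControlledDelayQueueSketch<T> {

    private static final class Entry<E> {
        final E item;
        final long enqueuedAtMillis;
        Entry(E item, long enqueuedAtMillis) {
            this.item = item;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final ArrayDeque<Entry<T>> queue = new ArrayDeque<>();
    private final long normalTimeoutMillis;
    private final long aggressiveTimeoutMillis;
    private long lastEmptyAtMillis; // last time the queue was seen empty
    private int dropped;

    ControlledDelayQueueSketch(long normalTimeoutMillis, long aggressiveTimeoutMillis, long nowMillis) {
        this.normalTimeoutMillis = normalTimeoutMillis;
        this.aggressiveTimeoutMillis = aggressiveTimeoutMillis;
        this.lastEmptyAtMillis = nowMillis;
    }

    void offer(T item, long nowMillis) {
        if (queue.isEmpty()) {
            lastEmptyAtMillis = nowMillis;
        }
        queue.addLast(new Entry<>(item, nowMillis));
    }

    /** Returns the next request to process, or null if none is eligible. */
    T poll(long nowMillis) {
        // Overloaded: the queue has not been emptied within the normal timeout.
        boolean overloaded = !queue.isEmpty() && nowMillis - lastEmptyAtMillis > normalTimeoutMillis;
        long timeout = overloaded ? aggressiveTimeoutMillis : normalTimeoutMillis;
        // Oldest entries sit at the head; discard those that have already waited too long.
        while (!queue.isEmpty() && nowMillis - queue.peekFirst().enqueuedAtMillis > timeout) {
            queue.pollFirst();
            dropped++;
        }
        if (queue.isEmpty()) {
            lastEmptyAtMillis = nowMillis;
            return null;
        }
        // FIFO normally, LIFO under overload so the freshest requests are served first.
        Entry<T> next = overloaded ? queue.pollLast() : queue.pollFirst();
        if (queue.isEmpty()) {
            lastEmptyAtMillis = nowMillis;
        }
        return next.item;
    }

    int dropped() {
        return dropped;
    }
}
```

A production version would need thread safety, a bound on queue size, and a way to signal dropped requests back to clients, which is where the back-pressure mentioned above would fit.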
Jack Shirazi
Last Updated: 2022-06-29
Copyright © 2000-2022 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
URL: http://www.JavaPerformanceTuning.com/news/newtips217.shtml