Java Performance Tuning
Tips October 2020
Back to newsletter 239 contents
Practical Garbage Collection Tuning For Anyone (Page last updated September 2020, Added 2020-10-28, Author Jack Shirazi, Publisher Java2days). Tips:
- The collector doesn't just collect garbage. It also: decides on the memory spaces and layout; the size and layout of objects; how allocation happens; tracks objects over their life; adds barrier code into JIT-compiled code; re-lays out space and memory as needed to optimize (eg compaction). So choosing the garbage collector affects everything!
- Any collector that has 2 (or more) generations has one tuning consequence: try to have objects collected in the young generation.
- There are 12 current garbage collectors in the OpenJDK (HotSpot collectors use -XX: flags, OpenJ9 collectors use -Xgcpolicy: flags). No pause: Epsilon (-XX:+UseEpsilonGC / -Xgcpolicy:nogc) - terminates rather than GCs. Serial targeted: Serial (-XX:+UseSerialGC), targeted at 1 vCPU. Throughput targeted: Parallel (-XX:+UseParallelGC); Throughput (-Xgcpolicy:optthruput). Pause time targeted: CMS (-XX:+UseConcMarkSweepGC) (removed from JDK14+); G1 (-XX:+UseG1GC); ZGC (-XX:+UseZGC); Shenandoah (-XX:+UseShenandoahGC); Balanced (-Xgcpolicy:balanced); Generational Concurrent (-Xgcpolicy:gencon); Metronome (-Xgcpolicy:metronome); Pause optimized (-Xgcpolicy:optavgpause).
- Garbage collection tuning has 4 available tuning steps: 1. Adjust heap size; 2. Choose an appropriate collector; 3. Reduce the rate of object allocation and promotion (adjust young generation heap size, adjust tenuring threshold, change code); 4. Fine tune the GC algorithm
- Turn on GC logging; the overhead is negligible. Eg -Xlog:gc*=info,safepoint:file=<path>/logs/gc_%t.log:tags,time,uptime,level:filecount=10,filesize=50M (or on Java 8: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<path>/logs/gc_$(date +%Y_%m_%d-%H_%M).log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=50M -XX:+PrintGCApplicationStoppedTime)
- Have business SLOs (targets), not GC ones. Every millisecond tighter you specify will cost you in compute resources, development time and devops time.
- If you don't know what -Xmx to start with, start with -Xmx set at 2x your live set (the live set is the stable size of the heap after GCs). Then adjust as needed in response to SLOs: lower for a smaller footprint or more frequent but shorter GCs; higher for better pause times when using concurrent algorithms.
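A rough, in-process way to approximate the live set is to request a full GC and read the used heap afterwards (the "heap-used-after" figures in GC logs are more reliable); a sketch, assuming a standard JVM where System.gc() actually triggers a collection:

```java
public class LiveSetEstimate {
    // Rough in-process approximation of the live set: request a full GC,
    // then read used heap. The JVM may ignore the GC request, so treat
    // this as an estimate only.
    static long estimateLiveSetBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long live = estimateLiveSetBytes();
        long suggestedXmx = 2 * live; // the tip's 2x-live-set starting point
        System.out.println("live set ~" + (live >> 20) + " MB, try -Xmx"
                + (suggestedXmx >> 20) + "m");
    }
}
```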
- Eliminate memory leaks before tuning garbage collection. There is not much point in tuning the GC with a memory leak: no matter what you do, it will eventually get ugly and then die.
- Eliminate OS paging before tuning garbage collection, otherwise paging dominates GC costs by one or two orders of magnitude.
- If your SLO fails from CPU utilization: with 1 vCPU, use the Serial GC; otherwise, scale down the number of GC threads to the number of vCPUs minus 1, using -Xgcthreads (OpenJ9) or -XX:ParallelGCThreads and -XX:ConcGCThreads (HotSpot).
- If your SLO fails from startup time, try a different collector and set Xms equal to Xmx. Startup time is usually improved using class data sharing, partial ahead-of-time compilation (AOT) restoring previous compilations, and tiered compilation (and making sure your app CAN start quickly at the application level).
- If your SLO fails from footprint, lower Xmx until it's as low as possible while still achieving your SLOs (and use compressed oops if available).
- If your SLO fails from throughput, the Parallel (-XX:+UseParallelGC) or Throughput (-Xgcpolicy:optthruput) collectors will optimize your throughput.
- If your SLO fails from latency (pause time), use the latency flow chart at https://miro.medium.com/max/875/1*Xu5qc7r6J1IxJ0zwpoVnAw.jpeg to target which GC algorithm works best for your application and targets. Choose the target pause times you need, check that your application and system have the corresponding capabilities (to the right of the flow: simple application / lots of spare resources / no large object graphs reloaded) for that target, then choose the listed GC algorithm if all is in line. If not, you need to move to code tuning or fine tuning the GC algorithm (but the latter is hard and fragile; start by trying any obvious options, eg resizing the young gen or starting GC collections earlier).
- Pauses can be from other than the GC, so it's useful to log safepoint info too -Xlog:safepoint*... (Java 8: -XX:+PrintGCApplicationStoppedTime)
- Pause times in GC logs (Java 9+) are straightforward: they have "Pause" in the line and contain a "heap-used-before->heap-used-after(heap-size) time" format.
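As an illustration, a minimal parser for that "used-before->used-after(heap-size) time" tail might look like this (the sample log line is hypothetical, in the Java 9+ unified logging style):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PauseLineParser {
    // Matches the "used-before->used-after(heap-size) time" tail of a
    // Java 9+ "Pause" log line, e.g. "24M->4M(256M) 3.456ms".
    private static final Pattern TAIL =
        Pattern.compile("(\\d+)M->(\\d+)M\\((\\d+)M\\) ([0-9.]+)ms");

    static double pauseMillis(String logLine) {
        Matcher m = TAIL.matcher(logLine);
        if (!m.find()) throw new IllegalArgumentException("no pause tail: " + logLine);
        return Double.parseDouble(m.group(4));
    }

    public static void main(String[] args) {
        // Hypothetical sample line for illustration.
        String line = "[2.345s][info][gc] GC(7) Pause Young (Normal) "
                + "(G1 Evacuation Pause) 24M->4M(256M) 3.456ms";
        System.out.println(pauseMillis(line)); // 3.456
    }
}
```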
- Metaspace stores the Java class metadata, the internal representation of Java classes. It is allocated in non-heap native memory and can grow (up to what the OS provides), limited by the JVM parameter -XX:MaxMetaspaceSize. It can cause a Full GC when it is full and needs expanding, so size it to avoid such GCs.
- If you've tried the GC tuning flow and still don't achieve your SLOs, first lookup the best GC you found and try any obvious options (eg resizing the young gen, or starting GC collections earlier). But ultimately, the code may need tuning. Start with looking for Finalizers and Reference processing in the GC logs, and if these take significant time eliminate Finalizers (this is a best practice anyway) and reduce Reference object usage.
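One common replacement for finalizers (assuming Java 9+) is java.lang.ref.Cleaner, which takes cleanup off the finalizer mechanism and pairs it with an explicit close(); a minimal sketch:

```java
import java.lang.ref.Cleaner;

public class CleanerExample {
    private static final Cleaner CLEANER = Cleaner.create();

    static class Resource implements AutoCloseable {
        // The cleanup action's state must NOT reference the Resource itself,
        // or the object would never become unreachable.
        private static class State implements Runnable {
            volatile boolean released;
            @Override public void run() { released = true; } // eg free a native handle
        }

        private final State state = new State();
        private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

        // Deterministic cleanup path; the Cleaner is only a safety net
        // if close() is never called.
        @Override public void close() { cleanable.clean(); }
        boolean isReleased() { return state.released; }
    }

    public static void main(String[] args) {
        Resource r = new Resource();
        r.close();
        System.out.println("released=" + r.isReleased());
    }
}
```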
- Allocation rates limit how effective the GC can be. If you are allocating too fast, the GC can't keep up or, in the case of the newer GCs, will slow down the allocations to let the GC keep up. Either way, this impacts your app, so reduce the allocation rate by profiling allocations and targeting the top allocators. A rule of thumb: up to 300MB/sec allocation should be okay for any of the GCs; 1GB/sec is too high.
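A rough way to measure allocation from inside the JVM is HotSpot's per-thread allocation counter on com.sun.management.ThreadMXBean (HotSpot-specific, not a standard API, so treat this as an assumption about your JVM); a sketch:

```java
import java.lang.management.ManagementFactory;

public class AllocationRate {
    // HotSpot-specific: com.sun.management.ThreadMXBean exposes per-thread
    // allocation counters; not available on all JVMs.
    static long allocatedBytes(Runnable work) {
        com.sun.management.ThreadMXBean mx =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        long before = mx.getThreadAllocatedBytes(tid);
        work.run();
        return mx.getThreadAllocatedBytes(tid) - before;
    }

    public static void main(String[] args) {
        long bytes = allocatedBytes(() -> {
            byte[][] junk = new byte[1000][];
            for (int i = 0; i < junk.length; i++) junk[i] = new byte[1024];
        });
        // Divide by elapsed time to get an allocation rate for comparison
        // against the ~300MB/sec rule of thumb above.
        System.out.println("allocated ~" + bytes + " bytes");
    }
}
```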
- Big objects (typically large arrays) are problematic for garbage collectors, because they are expensive to move around in memory. So you want to avoid making them garbage. Try to pre-size collections to the maximum they will reach; be aware of this cost for large collections that are temporary, they may be worth targeting if all else has failed; some GCs will produce "humongous" object processing statistics to identify if these are a problem; process streams directly, avoiding intermediate copies of the stream data, or at worst use a small reusable buffer to process the data.
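The pre-sizing and reusable-buffer advice can be sketched like this:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PreSizing {
    // Pre-size to the expected maximum so the backing array is never
    // reallocated (the discarded, possibly humongous, old arrays would
    // otherwise become garbage the GC has to handle).
    static List<Integer> collect(int expected) {
        List<Integer> out = new ArrayList<>(expected);
        for (int i = 0; i < expected; i++) out.add(i);
        return out;
    }

    // Process a stream with one small reusable buffer instead of reading
    // the whole payload into a single large byte[].
    static long checksum(InputStream in) {
        byte[] buf = new byte[8192];
        long sum = 0;
        int n;
        try {
            while ((n = in.read(buf)) != -1)
                for (int i = 0; i < n; i++) sum += buf[i] & 0xFF;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(collect(10).size());
        System.out.println(checksum(new ByteArrayInputStream(new byte[]{1, 2, 3})));
    }
}
```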
Cache is the Root of All Evil: Avoiding Pitfalls of Common Caching Techniques (Page last updated September 2020, Added 2020-10-28, Author Vova Galchenko, Publisher Box). Tips:
- Caching is a trade-off between lowering latencies and load, and the risk of introducing nasty correctness problems of out-of-date or inconsistent cache entries. If you can live with the performance and load without caching, don't add it.
- Concurrency-related cache consistency problems are difficult to identify and eliminate. For example, updating a value in the datastore and then invalidating the cache entry is a race condition that will often only show up at high load, and can easily result in a stale value being retrieved from the cache.
- Keeping consistency while writing values and invalidating cache entries can be done using locks (leases, effectively a non-blocking lock) and optimistic updates (atomically updating the cache entry when it is absent or backing off and trying again).
- Any cache read that's waiting on a lease (lock) can safely use the value retrieved by the reader that holds the lease. This is one way to deal with reducing latency during write-intensive bursts of activity.
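A minimal sketch of the lease idea, under the assumption that a CompletableFuture placed in a ConcurrentHashMap serves as the lease: the first thread to miss installs it and loads the value, and concurrent readers wait on the same future instead of hitting the datastore (class and method names here are hypothetical):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class LeaseCache<K, V> {
    // The CompletableFuture acts as the "lease": whoever installs it owns
    // the load; everyone else blocks on join() for the loaded value.
    private final ConcurrentHashMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();

    public V get(K key, Function<K, V> loader) {
        CompletableFuture<V> fresh = new CompletableFuture<>();
        CompletableFuture<V> existing = cache.putIfAbsent(key, fresh);
        if (existing != null) return existing.join(); // another reader holds the lease
        try {
            V value = loader.apply(key);
            fresh.complete(value);
            return value;
        } catch (RuntimeException e) {
            cache.remove(key, fresh); // release the lease so a retry can load
            fresh.completeExceptionally(e);
            throw e;
        }
    }

    public void invalidate(K key) { cache.remove(key); }

    public static void main(String[] args) {
        LeaseCache<String, String> c = new LeaseCache<>();
        System.out.println(c.get("k", k -> "v:" + k)); // loads
        System.out.println(c.get("k", k -> "other"));  // served from cache
    }
}
```

A production version would also need expiry and a timeout on the lease so a crashed lease holder doesn't block readers forever.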
Building Netflix's Distributed Tracing Infrastructure (Page last updated October 2020, Added 2020-10-28, Author Maulik Pandey, Publisher Netflix). Tips:
- Distributed tracing with unique session request IDs and propagated context is essential for productively tracing issues
- Netflix distributed tracing architecture: Open-Zipkin applies the headers to requests, which are streamed to a stream processing layer (Mantis), pushed to storage (Cassandra) in a queryable format (Elasticsearch), with a big data interface for ML (Hive).
- Recording all distributed traces is very resource intensive, but sampling means using the traces for issue analysis is very hit and miss. Netflix uses a hybrid sampling approach that records 100% of traces for mission critical requests while continuing to randomly sample all other traffic.
- For efficiency trace spans are buffered for a period so all spans for a trace will be streamed together.
- To make storage cost growth sublinear to traffic growth, Netflix used 3 strategies: cheaper EBS volumes instead of SSD; better compression; and only stored relevant and interesting traces by using simple rules-based filters (tail-based sampling).
- Cassandra on EBS performance was optimized using EBS Elastic volumes with optimized Time Window Compaction Strategy (TWCS) parameters, which reduced the disk write and merge operations of Cassandra SSTable files, thereby reducing the EBS I/O rate. This reduced the data replication network traffic amongst cluster nodes because SSTable files were created less often than in the previous configuration. Additionally, Zstd block compression was enabled on Cassandra data files, reducing the size of trace data files by half
Last Updated: 2021-02-25
Copyright © 2000-2021 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.