Java Performance Tuning
Tips November 2024
Back to newsletter 288 contents
https://www.youtube.com/watch?v=3BFcYTpHwHw
The next phase of Project Loom and Virtual Threads (Page last updated October 2024, Added 2024-11-29, Author Alan Bateman, Publisher Devoxx). Tips:
- Virtual threads troubleshooting additions for JDK21: platform (carrier) thread stack traces show the ID of the virtual thread they are running, together with that virtual thread's stack trace.
- Virtual threads troubleshooting additions for JDK24: jcmd PID Thread.dump_to_file -format=json FILEPATH.
- Virtual threads troubleshooting additions for JDK21: -Djdk.tracePinnedThreads=full.
- Virtual threads troubleshooting additions for JDK24: jfr print --events jdk.VirtualThreadPinned --stack-depth 100 JFR.RECORDING.
- Virtual threads troubleshooting additions for JDK24: more types of JFR pinned events to distinguish them.
- Virtual threads troubleshooting additions for JDK24: new jdk.management.VirtualThreadSchedulerMXBean which JMC understands.
- Virtual threads troubleshooting additions for JDK23: heap dumps include virtual threads.
- Virtual threads troubleshooting additions for JDK24: Lock information added to virtual thread dumps.
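The troubleshooting tools above are aimed at diagnosing pinning. A minimal sketch of the situation they catch (class name and timings are illustrative, and JDK 21+ is assumed): a virtual thread that blocks while holding a monitor is "pinned" to its carrier platform thread, a limitation addressed in JDK 24.

```java
// Sketch (assumes JDK 21+): a virtual thread blocking inside a
// synchronized block is pinned to its carrier thread (pre-JDK 24).
class PinningDemo {

    private static final Object LOCK = new Object();

    static Thread runPinned() throws InterruptedException {
        Thread vt = Thread.ofVirtual().name("pinned-demo").start(() -> {
            synchronized (LOCK) {      // hold a monitor...
                try {
                    Thread.sleep(50);  // ...while blocking: pins the carrier before JDK 24
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        vt.join();
        return vt;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("virtual=" + runPinned().isVirtual());
    }
}
```

Running this with -Djdk.tracePinnedThreads=full on JDK 21-23 prints the pinned stack; on JDK 24 the jdk.VirtualThreadPinned JFR events shown above capture the same information.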
https://www.p99conf.io/session/java-heap-memory-optimization-improve-p99-query-latency-linkedin-scale/
Java Heap Memory Optimization to Improve P99 Query Latency at Linkedin Scale (Page last updated October 2024, Added 2024-11-29, Author Vivek Iyer Vaidyanathan, Publisher P99 Conf). Tips:
- Memory-mapped off-heap memory can be fast, but is still slower than on-heap memory. So segment your data according to its latency requirements and keep only the data with the lowest-latency requirements on the heap.
- -XX:+StringDeduplication (only works with G1 GC) deduplicates Strings during spare CPU cycles (sets the char[] of duplicate Strings to point at the same char[]). This is good to decrease memory usage but has some CPU overhead and can impact low latency responses (due to the competing concurrent CPU overhead).
- Guava interners support isolated caches and are a widely used technique for deduplicating objects.
- FALF (fixed-size array, lock-free) interner for deduplicating objects - code here https://www.slideshare.net/slideshow/java-heap-memory-optimization-to-improve-p99-query-latency-at-linkedin-scale-byvivek-iyer-vaidyanathan/272606268?embed_session_id=e0d98f89-8187-4656-b342-f095b14e1512#20.
- Smaller heaps tend to improve latencies by lowering GC costs.
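The FALF idea can be sketched in a few lines; this is an illustrative reconstruction of the pattern (class name and sizing are assumptions, not the LinkedIn code linked above). The key trade-off: collisions simply overwrite the slot, so deduplication is best-effort, which is what keeps the structure lock-free and memory-bounded.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of a FALF-style (fixed-size array, lock-free) interner.
// Best-effort: hash collisions evict the previous entry rather than
// chaining, so there is no locking and no unbounded growth.
final class FalfInterner<T> {
    private final AtomicReferenceArray<T> slots;
    private final int mask;

    FalfInterner(int sizePowerOfTwo) {
        slots = new AtomicReferenceArray<>(sizePowerOfTwo);
        mask = sizePowerOfTwo - 1;
    }

    // Returns the cached canonical instance when an equal one is present;
    // otherwise installs this instance (possibly evicting a collider).
    T intern(T value) {
        int idx = (value.hashCode() & 0x7fffffff) & mask;
        T cached = slots.get(idx);
        if (value.equals(cached)) {
            return cached;      // duplicate found: reuse the cached instance
        }
        slots.set(idx, value);  // best-effort install, no CAS loop needed
        return value;
    }
}
```

Unlike a Guava interner, entries are never strongly retained beyond the fixed array, so a missed deduplication costs only the memory the duplicate would have used anyway.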
https://www.p99conf.io/session/patterns-of-low-latency/
Patterns of Low Latency (Page last updated October 2024, Added 2024-11-29, Author Pekka Enberg, Publisher P99 Conf). Tips:
- Common code easily produces bad tail latency. Latency compounds: if an end-user request fans out into multiple internal app requests, many users will hit the tail latency of at least one intermediate request.
- Maximum latency is hard to optimize, so focus on the near maximum latencies for optimizations.
- Visualize latency with a logarithmic x-axis, or with an eCDF (empirical cumulative distribution function) plot.
- Reduce latency by: avoiding moving data, minimizing operations, and avoiding waits.
- Avoid data movement by: avoiding network calls, or minimizing the network distance a call travels; colocating data; replicating data to where it is needed; and caching data.
- Reduce latency by doing less: simpler algorithms; use memory structures that minimize operational overheads (linked lists and graphs are usually bad choices for low latency); optimize code; avoid CPU-intensive calculations; avoid memory allocation in the fast/critical code path; avoid demand paging (make sure memory being used is already paged in).
- Optimizing code for low latency means: reduce CPU cycles, reduce CPU-cache misses, etc. Split long-running tasks into multiple short ones. Use profilers to find the inefficient code. Be aware that you are often swapping performance for something else (memory, or overall latency).
- Avoid waiting by: eliminating synchronization (eg partitioning the data per core and processing just on that core); using wait-free algorithms; keeping the code in user-space (avoiding kernel calls, or bypassing the kernel where possible); avoiding context switches (pin dedicated threads to cores); using async/non-blocking IO; using busy polling; making shared data structures read-only; using single-producer/single-consumer queues to transfer data between cores; using TCP_NODELAY; and not processing requests inline as they arrive from the network - take them off the queue and process them separately, so that longer-latency requests don't block the queue.
- Hide latency by: parallelizing request processing; send requests to multiple servers and take the fastest response; use light-weight threads.
- Tune the system for low latency: set the CPU frequency governor to performance (constant frequency); pin specific threads to isolated CPUs; disable swap; and configure network-stack interrupt affinity.
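The single-producer/single-consumer queue recommended above for core-to-core handoff can be sketched as a bounded ring buffer. This is an illustrative minimal version (class name is an assumption); a production implementation, such as those in the JCTools library, would also pad the head/tail counters against false sharing.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a bounded single-producer/single-consumer ring buffer for
// transferring work between two dedicated threads without locks.
final class SpscQueue<T> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read (consumer only)
    private final AtomicLong tail = new AtomicLong(); // next slot to write (producer only)

    SpscQueue(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Called by the single producer thread only.
    boolean offer(T value) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // queue full
        buffer[(int) (t & mask)] = value;
        tail.lazySet(t + 1); // ordered store suffices with a single producer
        return true;
    }

    // Called by the single consumer thread only (busy-poll until non-null).
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) return null; // queue empty
        T value = (T) buffer[(int) (h & mask)];
        buffer[(int) (h & mask)] = null;  // allow GC of the consumed element
        head.lazySet(h + 1);
        return value;
    }
}
```

The consumer typically busy-polls poll() in a loop on its own pinned core, combining several of the tips above: no locks, no kernel calls, and no context switches on the fast path.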
Jack Shirazi
Last Updated: 2024-11-29
Copyright © 2000-2024 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
URL: http://www.JavaPerformanceTuning.com/news/newtips288.shtml
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss