Java Performance Tuning
Tips March 2026
Back to newsletter 304 contents
https://www.youtube.com/watch?v=IQrKTDNBd1s
A Java Developer's Quest for IO Performance (Page last updated March 2026, Added 2026-03-30, Author David Vlijmincx, Publisher P99 CONF). Tips:
- To prevent native method calls from pinning virtual threads, offload those calls to a dedicated background platform thread running an event loop. Have virtual threads communicate with it via CompletableFuture: the virtual thread calls get() and unmounts cleanly while waiting, and the event loop completes the future when the result arrives.
- Use io_uring from Java via the Foreign Function & Memory API to batch multiple file read/write submissions into a ring buffer and submit them in a single syscall, then poll a completion queue asynchronously. This decouples I/O submission from completion and avoids per-operation syscall overhead.
- Never benchmark Java code with System.currentTimeMillis() or System.nanoTime() in a simple before/after loop. Use JMH (Java Microbenchmark Harness) instead - it handles JIT warmup, dead-code elimination, loop optimisation traps, and provides statistically rigorous results with minimal setup (one annotation per benchmark method).
- When allocating off-heap memory with the Foreign Function & Memory API, the standard Arena (ofConfined/ofShared) zeros out every allocated segment. If you do not need zeroed memory, calling C malloc directly via a downcall handle and reinterpreting the returned pointer as a MemorySegment is measurably faster for high-frequency allocations.
- Implement the Arena interface with a custom allocate() method that delegates to native malloc. This gives you Arena lifecycle semantics (try-with-resources cleanup) while bypassing the zeroing overhead of the built-in arenas for allocation-intensive code paths.
- Recycle MemorySegments instead of allocating new ones for each I/O operation. Pre-allocated segment pools eliminate repeated off-heap allocation and deallocation syscalls, reduce memory pressure, and avoid GC-visible allocation entirely - at the cost of managing the pool and clearing stale data yourself.
- When creating a downcall handle for a native function that only needs to read a Java byte array (e.g. a file path string), pass the Linker.Option.critical() option. This lets the native call access heap memory directly, eliminating the need to allocate an off-heap MemorySegment and copy data into it - a significant speedup for frequent small calls like file open.
- With the critical linker option, convert a Java String to a heap-backed MemorySegment using getBytes() and pass it straight to the native call. Without the critical option the same code throws an exception, so this is specifically an optimisation for calls that can safely access the Java heap.
- Store native method downcall handles in functional interfaces wrapped inside a Java record rather than in static MethodHandle fields. Use MethodHandleProxies.asInterfaceInstance() to bind the handle to a functional interface; this gave a measurable throughput boost over plain static MethodHandle invocation in JMH benchmarks.
- Use JMH's built-in async-profiler integration to generate flame graphs of your benchmark runs. This identifies which Java methods consume the most CPU directly in the context of the workload you are measuring, replacing guesswork with data when deciding what to optimise.
- The async-profiler only samples threads executing in user space. If your application spawns kernel-level worker threads (e.g. io_uring worker threads), they will not appear in async-profiler flame graphs. Use the Linux perf tool alongside async-profiler to capture kernel-thread activity and get a complete picture of where time is spent.
- When optimising I/O bindings, the three levers you control are: which native C function to call (some are faster or batch-friendlier), how and when you allocate off-heap memory, and how you bind native methods to Java (the downcall handle configuration). Systematically benchmark each lever independently using JMH to find the combination that yields the highest throughput.
- Misusing a library can be the dominant bottleneck: no amount of Java tuning will compensate for incorrect library usage.
- Network I/O does not pin virtual threads, but file I/O through FileChannel does pin them because the underlying pwrite/pread are native methods. If your virtual-thread workload is file-heavy, this pinning silently blocks carrier threads and kills scalability.
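The event-loop offload pattern from the first tip above can be sketched as follows. This is a minimal illustration, not the speaker's code; nativeCall() is a hypothetical stand-in for a real pinning native method:

```java
import java.util.concurrent.*;

public class NativeCallOffload {
    // Work items handed to the single platform "event loop" thread.
    record Job(String input, CompletableFuture<String> result) {}

    static final BlockingQueue<Job> QUEUE = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        // Dedicated platform thread runs the event loop and is the only
        // thread that ever touches the (pinning) native call.
        Thread.ofPlatform().daemon().start(() -> {
            while (true) {
                try {
                    Job job = QUEUE.take();
                    job.result().complete(nativeCall(job.input()));
                } catch (InterruptedException e) { return; }
            }
        });

        // Virtual threads submit work and block on get(); get() is ordinary
        // Java blocking, not a native call, so they unmount cleanly.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> f = executor.submit(() -> {
                CompletableFuture<String> result = new CompletableFuture<>();
                QUEUE.put(new Job("hello", result));
                return result.get(); // virtual thread unmounts here
            });
            System.out.println(f.get()); // prints HELLO
        }
    }

    // Placeholder for the real pinning native call.
    static String nativeCall(String input) { return input.toUpperCase(); }
}
```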
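A malloc-backed Arena, as described in the tips above, might be sketched like this (Java 22+; assumes a 64-bit size_t mapped to JAVA_LONG, and ignores the alignment argument since malloc already returns maximally aligned memory — an illustration, not the speaker's exact code):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Arena whose allocate() delegates to C malloc, skipping the zeroing
// done by the built-in confined/shared arenas.
public class MallocArena implements Arena {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final MethodHandle MALLOC = LINKER.downcallHandle(
            LINKER.defaultLookup().find("malloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
    private static final MethodHandle FREE = LINKER.downcallHandle(
            LINKER.defaultLookup().find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    // Confined arena used only to tie segment lifetimes to this one.
    private final Arena scopeArena = Arena.ofConfined();

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        try {
            MemorySegment raw = (MemorySegment) MALLOC.invokeExact(byteSize);
            if (raw.address() == 0) throw new OutOfMemoryError("malloc failed");
            // Size the zero-length pointer; free it when the arena closes.
            return raw.reinterpret(byteSize, scopeArena, seg -> {
                try { FREE.invokeExact(seg); }
                catch (Throwable t) { throw new RuntimeException(t); }
            });
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    @Override public MemorySegment.Scope scope() { return scopeArena.scope(); }
    @Override public void close() { scopeArena.close(); }

    public static void main(String[] args) {
        try (MallocArena arena = new MallocArena()) {
            MemorySegment seg = arena.allocate(1024); // NOT zeroed
            seg.set(ValueLayout.JAVA_INT, 0, 42);
            System.out.println(seg.get(ValueLayout.JAVA_INT, 0)); // prints 42
        } // segments freed here via the cleanup actions
    }
}
```

Note that reinterpret() is a restricted method, so recent JDKs emit a native-access warning unless --enable-native-access is set.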
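The heap-segment technique from the critical-linker tips can be sketched with C strlen as a stand-in native function (Java 22+; assumes 64-bit size_t as JAVA_LONG):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;

public class CriticalCallDemo {
    private static final Linker LINKER = Linker.nativeLinker();
    // critical(true) permits heap-backed segments as arguments, so no
    // off-heap allocation or copy is needed for short-lived read-only data.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS),
            Linker.Option.critical(true));

    public static long nativeStrlen(String s) {
        // Heap-backed segment wrapping the String's bytes, NUL-terminated
        // for C. Without critical(true) this argument would be rejected.
        MemorySegment str = MemorySegment.ofArray(
                (s + "\0").getBytes(StandardCharsets.US_ASCII));
        try {
            return (long) STRLEN.invokeExact(str);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeStrlen("/tmp/data.bin")); // prints 13
    }
}
```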
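The record-plus-functional-interface binding from the MethodHandleProxies tip can be sketched like this; a plain Java MethodHandle stands in here for the real FFM downcall handle:

```java
import java.lang.invoke.*;

public class HandleBindingDemo {
    // Functional interface matching the handle's shape, stored in a record.
    public interface StrLen { long apply(String s); }
    public record NativeBindings(StrLen strlen) {}

    public static NativeBindings bind() {
        try {
            // Stand-in MethodHandle; in the talk this would be a
            // Linker downcall handle for a native function.
            MethodHandle mh = MethodHandles.lookup().findStatic(
                    HandleBindingDemo.class, "length",
                    MethodType.methodType(long.class, String.class));
            return new NativeBindings(
                    MethodHandleProxies.asInterfaceInstance(StrLen.class, mh));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    static long length(String s) { return s.length(); }

    public static void main(String[] args) {
        // Call through the interface instead of MethodHandle.invokeExact().
        System.out.println(bind().strlen().apply("hello")); // prints 5
    }
}
```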
https://www.youtube.com/watch?v=lnSn2rxSlKo
Get The Most Out of Virtual Threads in Java 25 (Page last updated February 2026, Added 2026-03-30, Author Christian Worz, Publisher Devoxx). Tips:
- Prior to Java 24, a synchronized block pinned a virtual thread to its carrier, preventing detachment. Java 24+ fixes this: virtual threads now unmount during synchronized blocks. Pre-Java 24 you may need to replace synchronized blocks in hot paths with ReentrantLock to avoid carrier pinning. Apache Tomcat made exactly this change to support virtual threads before the JDK fix landed.
- Virtual threads use a small pool of carrier (platform/worker) threads. When a virtual thread blocks on I/O (apart from file I/O), it detaches from its carrier and moves to the heap, freeing the carrier for other virtual threads - so a few carriers can service millions of concurrent blocking tasks.
- Use StructuredTaskScope.open() to fork independent blocking calls into separate virtual threads without rewriting anything to async/reactive. Wrap each blocking call in scope.fork(() -> blockingCall()) and they run concurrently.
- Always call scope.join() before calling subtask.get(). Unlike Future.get(), subtask.get() is not blocking - calling it without join() will likely find the subtask in UNAVAILABLE state, not the completed result.
- The default StructuredTaskScope in Java 25 fails fast: if any subtask throws an exception, scope.join() throws FailedException. Use Joiner.awaitAll() to wait for all subtasks regardless of individual failures.
- Implement a custom Joiner to encapsulate all result-handling logic (aggregation, error collection, selection of best result). The Joiner's onComplete() is called per finished subtask and result() is called from scope.join(), keeping controllers free of concurrency plumbing and making the logic independently unit-testable.
- In a custom Joiner's onComplete(), return true to cancel the remaining scope (first successful) or false to let all subtasks finish (collect all).
- Store subtask results in ConcurrentLinkedQueue or similar thread-safe collection inside a custom Joiner because onComplete() is called from different virtual threads concurrently.
- Thread locals do not propagate into virtual threads created by StructuredTaskScope.fork(). A value set via ThreadLocal.set() in the parent request thread is null inside forked child threads. Switch to ScopedValue for data that must be visible across structured forks.
- ScopedValue binds a value for a specific code scope using ScopedValue.where(key, value).run(() -> ...). Everything called within that lambda - including forked virtual threads - inherits the value. Unlike ThreadLocal, ScopedValue is immutable within its scope, eliminating race conditions from concurrent set() calls.
- Enable -Djdk.traceVirtualThreadLocals to get warnings whenever a virtual thread accesses a ThreadLocal. Use this during migration to find and replace ThreadLocal usage that silently returns null in virtual-thread contexts.
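For the pre-Java-24 pinning workaround described above, a synchronized counter rewritten with ReentrantLock might look like:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class Counter {
    private final ReentrantLock lock = new ReentrantLock();
    private long count;

    // Pre-Java-24 virtual-thread-friendly replacement for:
    //   public synchronized void increment() { count++; }
    public void increment() {
        lock.lock();        // a virtual thread blocked here can unmount
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }

    public long get() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        // close() waits for all submitted tasks to finish.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) executor.submit(c::increment);
        }
        System.out.println(c.get()); // prints 10000
    }
}
```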
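A minimal sketch of the fork/join/get ordering with StructuredTaskScope (a preview API in Java 25, so this needs --enable-preview; fetchUser/fetchOrders are hypothetical blocking calls):

```java
import java.util.concurrent.StructuredTaskScope;

public class ScopeDemo {
    public static void main(String[] args) throws Exception {
        try (var scope = StructuredTaskScope.open()) {
            // Each fork runs in its own virtual thread, concurrently.
            var user   = scope.fork(() -> fetchUser());
            var orders = scope.fork(() -> fetchOrders());

            scope.join(); // must precede Subtask.get(), which does not block

            System.out.println(user.get() + " / " + orders.get());
        }
    }

    // Stand-ins for real blocking calls (HTTP, JDBC, ...).
    static String fetchUser()   { return "alice"; }
    static String fetchOrders() { return "3 orders"; }
}
```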
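The ScopedValue binding pattern from the tips above, in a minimal form (ScopedValue is final in Java 25, preview in earlier releases):

```java
public class ScopedValueDemo {
    static final ScopedValue<String> REQUEST_ID = ScopedValue.newInstance();

    public static void main(String[] args) {
        // Bind the value for the duration of the lambda only.
        ScopedValue.where(REQUEST_ID, "req-42").run(ScopedValueDemo::handle);
    }

    static void handle() {
        // Any method called inside the scope (including StructuredTaskScope
        // forks made within it) sees the bound value; unlike ThreadLocal,
        // it cannot be reassigned here.
        System.out.println("handling " + REQUEST_ID.get()); // prints handling req-42
    }
}
```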
https://blog.gceasy.io/garbage-collection-tuning-guide/
The Ultimate Guide to Java Garbage Collection Tuning (Page last updated February 2026, Added 2026-03-30, Author Annya Arun, Publisher GcEasy). Tips:
- Set -Xms equal to -Xmx in production to eliminate heap resizing overhead.
- Size -Xmx based on measured live set plus allocation rate headroom.
- Enable GC logging with -Xlog:gc* in every environment including production; the overhead is negligible.
- Monitor five key GC metrics before making any tuning change: pause time, GC frequency, allocation rate, promotion rate, and heap occupancy after GC.
- When frequent Minor GCs appear, increase the Young Generation size using -XX:NewSize / -XX:MaxNewSize or lower -XX:NewRatio.
- High promotion rates into Old Generation indicate premature promotion. Increase survivor space (-XX:SurvivorRatio) or raise the tenuring threshold (-XX:MaxTenuringThreshold) so short-lived objects die in Young Gen instead of filling Old Gen and triggering costly Full GCs.
- For latency-sensitive applications, switch from Parallel GC to G1 (-XX:+UseG1GC), ZGC (-XX:+UseZGC), or Shenandoah (-XX:+UseShenandoahGC). G1 provides predictable sub-200ms pauses via region-based evacuation; ZGC and Shenandoah achieve sub-millisecond pauses through concurrent compaction.
- Use -XX:MaxGCPauseMillis to set a pause-time target with G1 GC.
- Reduce the object allocation rate at the application level (pool frequently created objects, use primitives instead of boxed types, avoid temporary String concatenation in loops) to lower GC pressure before reaching for JVM flags.
- Watch for humongous object allocations in G1 GC. These cause fragmentation and can trigger premature Full GCs. Increase -XX:G1HeapRegionSize or refactor code to avoid allocating oversized objects.
- Systematic GC tuning: enable GC logging, establish baselines, identify pause hotspots, right-size the heap, optimize the Young Generation, select the appropriate GC algorithm, then load-test every change under production-like traffic before deploying.
- Frequent Full GC cycles point to Old Generation pressure. Analyze heap occupancy after GC with heap dumps. If post-GC heap keeps growing, it could be a memory leak.
- Use GC log analysis tools - https://fasterj.com/tools/gcloganalysers.shtml - to visualize pause-time distributions and allocation-rate charts.
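An illustrative flag set combining several of the tips above for a latency-sensitive service on G1. The values are examples only, not recommendations; derive yours from your own measured live set and allocation rate:

```shell
# Fixed heap (no resizing), G1 with a pause target, GC logging always on.
java -Xms8g -Xmx8g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=100 \
     -Xlog:gc*:file=gc.log:time,uptime,level,tags \
     -jar app.jar
```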
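The allocation-reduction tip about temporary String concatenation in loops, sketched as a before/after pair:

```java
import java.util.List;

public class AllocationPressure {
    // Allocation-heavy: each += builds a brand-new String (and backing
    // array), generating O(n^2) temporary garbage for the GC.
    static String slowJoin(List<String> parts) {
        String s = "";
        for (String p : parts) s += p;
        return s;
    }

    // Allocation-light: one StringBuilder, one final String.
    static String fastJoin(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        var parts = List.of("a", "b", "c");
        System.out.println(slowJoin(parts).equals(fastJoin(parts))); // prints true
    }
}
```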
Jack Shirazi
Last Updated: 2026-03-30
Copyright © 2000-2026 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
URL: http://www.JavaPerformanceTuning.com/news/newtips304.shtml