Java Performance Tuning
Tips November 2023
https://medium.com/@haasitapinnepu/the-a-z-guide-to-distributed-caching-d0c6fec9592a
The A-Z guide to Distributed Caching (Page last updated February 2023, Added 2023-11-28, Author Haasita Pinnepu, Publisher Medium). Tips:
- Caching helps systems work more efficiently by storing frequently used data in a fast-access location. As well as making frequently used data faster to access, this reduces load on the back-end datastore that holds the master data.
- Distributed caching stores frequently accessed data across multiple memory nodes in a network.
- For caching a small number of (preferably immutable) objects that have to be read multiple times, in-process caches are a good solution.
- Distributed caches typically consist of multiple nodes that store a portion of the cached data, communicating with each other to keep the data consistent and to handle data replication and distribution.
- The generic distributed caching procedure is: a client sends a request for data to the cache; the cache node receiving the request serves the data from its local cache if available; if not available, the node retrieves the data from the data store, caches it, and serves it; the node then replicates the newly acquired data to the other cache nodes; the cache periodically checks for stale or outdated data and removes it.
- Cache invalidation in distributed systems is difficult: consistency must be maintained between multiple nodes and the data store; invalidating data across nodes takes time; concurrent access from clients can lead to inconsistency; and scaling up the number of nodes increases invalidation costs exponentially if not designed for.
- Common cache invalidation algorithms: Time-based - invalidate after a certain period of time; Version-based - invalidate whenever the version changes; Delta-based - compare the current version with the latest version and invalidate if the delta between the two exceeds a certain threshold; Least Recently Used (LRU) - the item not accessed for the longest time is evicted from the cache; Least Frequently Used (LFU) - the item accessed least often is evicted from the cache. Of course caches can combine multiple algorithms (see the sketch below, which combines two of them).
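For illustration, here is a minimal in-process cache sketch in Java combining two of the invalidation algorithms above - LRU eviction (via LinkedHashMap's access order) and time-based expiry checked on each read - following the cache-aside read procedure described above. The class and method names are illustrative, not from the article:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative in-process cache: LRU eviction plus per-entry TTL.
public class TtlLruCache<K, V> {
    private record Entry<V>(V value, long expiresAtMillis) {}

    private final long ttlMillis;
    private final Map<K, Entry<V>> map;

    public TtlLruCache(int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true iterates least-recently-used first,
        // so removeEldestEntry implements LRU eviction
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Cache-aside read: serve from cache if present and fresh,
    // otherwise load from the backing data store and cache the result.
    public synchronized V get(K key, Function<K, V> loader) {
        Entry<V> e = map.get(key);
        long now = System.currentTimeMillis();
        if (e != null && e.expiresAtMillis() > now) {
            return e.value(); // cache hit
        }
        V value = loader.apply(key); // cache miss: go to the data store
        map.put(key, new Entry<>(value, now + ttlMillis));
        return value;
    }
}

A distributed cache would add the replication and cross-node invalidation steps on top of this single-node logic.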
https://www.youtube.com/watch?v=Z3mTpqIp1Q0
Decoding the secrets of the JVM internals (Page last updated August 2023, Added 2023-11-28, Author Lennart ten Wolde, Publisher J-Spring). Tips:
- Data access latency differs by orders of magnitude between registers, CPU cache, RAM, and SSD/HDD. This affects program performance: smaller working sets can be processed more efficiently because they fit into the CPU cache without too many higher-cost memory accesses. Value objects help keep related data together, so they can improve cache utilization (see the sketch after this list).
- Stack processing of data (local primitives) is more efficient than heap processing (objects), but is harder to reconcile with good-practice programming.
- Objects live in the young generation (for many GCs) for 7 garbage collections (by default). If they die while in the young generation, they can be collected much more efficiently. If not, they get promoted to the old generation, where garbage collection is much less efficient and has higher overhead. So keeping your objects very short-lived is more efficient than having medium-lived objects. As of JDK 21, all the GCs either are generational or are adding generational support.
- Try different GCs if you need better performance. When optimizing for throughput (eg batch processing), the parallel collector is likely best; for very short pause times, ZGC or Shenandoah is probably best. Enable GC logging with -Xlog:gc*.
- Leave enough heap headroom for your allocation rate; set -Xmx to handle that.
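To make the data-locality point concrete, here is a naive Java sketch (not from the talk, and not a rigorous benchmark - use JMH for real measurements): summing a contiguous primitive array is cache-friendly, while summing through an array of object references adds a pointer dereference per element and scatters the values across the heap:

// Run with enough heap, eg -Xmx1g
public class LocalityDemo {
    static final int N = 10_000_000;

    record Boxed(long value) {}

    public static void main(String[] args) {
        long[] primitives = new long[N];
        Boxed[] objects = new Boxed[N];
        for (int i = 0; i < N; i++) {
            primitives[i] = i;
            objects[i] = new Boxed(i);
        }

        long t0 = System.nanoTime();
        long sum1 = 0;
        for (long v : primitives) sum1 += v; // contiguous, cache-friendly
        long t1 = System.nanoTime();
        long sum2 = 0;
        for (Boxed b : objects) sum2 += b.value(); // one dereference per element
        long t2 = System.nanoTime();

        System.out.printf("primitive sum %d in %d ms%n", sum1, (t1 - t0) / 1_000_000);
        System.out.printf("object sum    %d in %d ms%n", sum2, (t2 - t1) / 1_000_000);
    }
}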
https://www.youtube.com/watch?app=desktop&v=TDpbt4thECc
A Simple Approach to the Advanced JVM Profiling (Page last updated March 2021, Added 2023-11-28, Author Andrei Pangin, Publisher IntelliJ IDEA). Tips:
- Start by finding which shared resource is preventing better performance: CPU, memory, network, heap, IO, handles.
- Underutilized CPU suggests the problem is elsewhere; maxed-out CPU suggests the CPU load itself is the problem.
- Profiler types: bytecode instrumentation that records method entry to exit times - this has high overhead and affects the processing time; sampling profilers that sample the stack - low overhead that's controllable by changing sampling period, but can miss data.
- Many sampling profilers - including JFR - are safepoint biased - they wait for a safepoint before taking a sample. This can give misleading data.
- Busy and idle IO code often looks the same to a sampling profiler, which is misleading.
- Profilers can use hardware performance counters to add information, but not many do. The Linux perf command together with perf-map-agent and -XX:+PreserveFramePointer lets you access the data the counters make available.
- Being able to see native and kernel stack frames (without safepoint bias) lets you see things like page faults, inefficient choices of clock source, and system locks.
- Lock profiling is a different type of profiling from execution profiling, and is very useful to understand what the bottleneck is when no other main shared resources are loaded.
- With async-profiler, if you attach at startup it doesn't need the -XX:+DebugNonSafepoints option, but if you attach later it needs that option for full visibility of JIT-compiled code.
- CPU profiling tells you what is consuming CPU; it doesn't tell you about other threads that are idle for whatever reason - for that you need wall-clock profiling. Wall-clock profiling is useful for identifying threads waiting for IO and why (eg DNS lookups).
- Allocation profiling is useful to identify where allocation dominates and causes GC. Allocation profilers tend to sample allocations because otherwise the overhead is very high. JFR and async-profiler sample at TLAB allocation and outside-TLAB allocation (the slow path). From JDK 11, JVMTI also provides the SampledObjectAlloc event and the SetHeapSamplingInterval() function, which likewise sample allocations including slow-path allocations (see the sketch after this list).
- async-profiler lets you profile cache misses to identify inefficient use of the CPU cache (typically from data size and layout), profile native memory usage (eg malloc and mmap) to find native memory leaks, and find where humongous objects are allocated.
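As a concrete starting point for allocation profiling, here is a minimal sketch using the JDK Flight Recorder API (jdk.jfr, standard since JDK 11); the churn() workload is made up for illustration. The built-in "profile" configuration includes the in-TLAB and outside-TLAB allocation sample events mentioned above, and the dumped .jfr file can be examined in JDK Mission Control:

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class JfrAllocationDemo {
    public static void main(String[] args) throws Exception {
        // "profile" is one of the JDK's built-in JFR configurations
        Configuration profile = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(profile)) {
            recording.start();
            churn();                              // allocation-heavy workload
            recording.dump(Path.of("alloc.jfr")); // write events for analysis
        }
    }

    // Hypothetical workload that allocates heavily to generate samples
    static void churn() {
        List<byte[]> survivors = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            byte[] b = new byte[128]; // mostly sampled via the TLAB events
            if (i % 10_000 == 0) survivors.add(b);
        }
        System.out.println("kept " + survivors.size() + " buffers");
    }
}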
Jack Shirazi