Java Performance Tuning
Tips May 2013
Back to newsletter 150 contents
JVM Mechanics - A Peek Under the Hood (Page last updated May 2013, Added 2013-05-29, Author Gil Tene, Publisher infoQ). Tips:
- Write readable code. Don't try to second-guess the compiler or hand-optimise code that the compiler can optimise itself. The compiler can: reorder independent code; eliminate code that has no effect (dead code); simplify expressions; precompute constants; remove loops if the result can be deduced; delay checks until needed (if ever); speculatively inline and branch (if the speculative assumptions get violated, the optimised code is thrown away and execution reverts to the original code, which then gets optimised again); and more.
- Field access via an accessor is easily inlined and cached by the compiler, so there is no need to cache it explicitly in code; keep the code readable and let the compiler do the optimisations. The compiler uses class hierarchy analysis to inline methods that aren't declared final but are effectively final (so don't use final as a performance optimisation; use it only for actual design necessity, when you want to prevent overriding).
- If you need to see a change to a variable made by another thread, the variable needs to be volatile (or the access must cross some other memory barrier); otherwise the compiler is free to ignore the change.
- Stepping through code in a debugger runs non-optimised code, so what executes at the CPU level while debugging is often quite different from what executes in a normal (non-debug) run.
- The compiler's optimisations assume good code. If you write your code in an obscure style, for example expecting lots of null values and catching NPEs to control flow, the code will run slower because, for example, explicit null checks are removed and seg-faults are used to detect the nulls instead.
- Adaptive compilation means that you cannot guarantee you will measure the same overheads for the same code at different times. Warmup techniques can easily fail: the warmup takes one path and the JIT optimises for it, but when the real code runs it uses paths that were eliminated, so the code gets de-optimised at that point (which means that the first 'real' call actually takes longer than if you had done no warmup at all).
- If you need warmup code, the system should not be able to distinguish between real and fake requests (e.g. a downstream system could filter out the fakes). Avoid any conditional or class-specific code based on fake requests. Path dependencies should depend on the data in the request, not on code (though a smart enough compiler could even optimise that).
- Reflection is not particularly slow in modern JVMs.
- Java GC is much more efficient than direct malloc management, but it requires approximately 1 second pause per live GB. You can mostly only delay when that happens, rather than eliminate it. Every doubling of your empty memory doubles the efficiency of your garbage collector. The amount of empty memory on the heap is the dominant factor controlling the amount of GC. For GCs (both copy and mark/compact), the amount of work is linearly proportional to the live set. Empty memory controls how often the GC happens, but not pause time.
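The visibility tip above (a change made by another thread must cross a memory barrier to be seen) can be sketched as follows. This is a minimal illustration, not code from the talk: the class name StoppableWorker and its methods are hypothetical, and the behaviour without volatile depends on the JIT and hardware.

```java
// Hypothetical example: StoppableWorker, stop() and stopsWithin() are illustrative names.
final class StoppableWorker implements Runnable {
    // volatile makes the write in stop() visible to the loop in run();
    // without it the JIT is allowed to hoist the read out of the loop.
    private volatile boolean running = true;
    private long iterations; // touched only by the worker thread

    @Override
    public void run() {
        while (running) {
            iterations++;
        }
    }

    void stop() {
        running = false;
    }

    // Starts a worker, asks it to stop, and reports whether it terminated in time.
    static boolean stopsWithin(long timeoutMillis) {
        StoppableWorker worker = new StoppableWorker();
        Thread thread = new Thread(worker);
        thread.setDaemon(true); // do not keep the JVM alive if the worker never stops
        thread.start();
        worker.stop();
        try {
            thread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !thread.isAlive();
    }
}
```

Calling StoppableWorker.stopsWithin(2000) should return true because the flag is volatile; removing the modifier makes the outcome depend on what the compiler chooses to do.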
Low Level Scalability Solutions - The Conditioning Collection (Page last updated March 2013, Added 2013-05-29, Author Todd Hoff, Publisher highscalability). Tips:
- Determine the fixed limits of resources. Measure usage as a proportion of the fixed limit, so you can immediately tell how close to exhaustion they are.
- Ensure that your system avoids exhausting resources, if necessary by limiting the work done (rejecting work) to keep the resources within their limits.
- Don't move data until it is needed.
- Avoid replication if possible; share objects.
- If a resource is reaching its limit, push back upstream to limit the flow (so upstream systems can initiate their own throttling mechanisms where available) and reduce retries, which could otherwise make the situation even worse.
- The worst case scenario when a resource reaches its limit and drops work is that upstream systems repeatedly retry, causing even more work to be pushed onto the system and yet worse resource limitations - effectively a self-inflicted denial-of-service attack.
- Batching is an effective mechanism to increase throughput. UI requests can connect to a proxy to aggregate operations and forward them as a batch to the server.
- Throttling the flow of requests is a better strategy than rejecting requests because rejected requests cause retries much faster than slowly progressing requests.
- Remove choke points where everything has to operate serially through a segment of the system, e.g. by using: Data Parallel Algorithms; Load Balancing; Hash Based Node Selection.
- High priority control messages should never block behind data or lower priority control traffic.
- Reduce chatter; create separate networks for control and data so control messages always get through; use intelligent retry policies that avoid useless work piling up in queues; delete obsolete messages from queues.
- Idempotency (something which has the same effect if used multiple times as it does if used only once) helps handle unreliable communication channels. A request to write 5 bytes at offset 165 in a file is idempotent; a request to write 5 bytes at the current end-of-file is not.
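The first tips in this list (measure usage against a fixed limit, and refuse work rather than exhaust the resource) can be sketched with a bounded queue. This is a minimal illustration under stated assumptions: the class BoundedIntake and its method names are hypothetical, and a real system would also signal upstream callers to slow down.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical example: BoundedIntake, utilisation() and tryAdmit() are illustrative names.
final class BoundedIntake {
    private final int capacity;                 // the fixed limit of the resource
    private final BlockingQueue<Runnable> queue;

    BoundedIntake(int capacity) {
        this.capacity = capacity;
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Usage as a proportion of the fixed limit: 0.0 is empty, 1.0 is exhausted.
    double utilisation() {
        return (double) queue.size() / capacity;
    }

    // Non-blocking admission: returns false when the limit is reached, so the
    // caller can push back upstream instead of piling more work onto the system.
    boolean tryAdmit(Runnable work) {
        return queue.offer(work);
    }

    // Takes the next unit of work, or null if none is queued.
    Runnable next() {
        return queue.poll();
    }
}
```

A caller that sees tryAdmit return false would ideally slow its own sending rate rather than retry immediately, in line with the throttling tip above.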
Single Producer/Consumer lock free Queue step by step (Page last updated March 2013, Added 2013-05-29, Author Nitsan Wakart, Publisher Psychosomatic, Lobotomy, Saw). Tips:
- OneToOneConcurrentArrayQueue (currently OneToOneConcurrentArrayQueue3) is faster than ArrayBlockingQueue at the expense of limiting your scope to single producer/consumer (from multi producer/consumer which ArrayBlockingQueue supports).
- Atomic*.lazySet() is a cheap volatile write: it provides happens-before guarantees for single writers without forcing a drain of the store buffer. This can impose a lower overhead to the thread writing, as well as reducing cache coherency noise as writes are not forced through.
- If you use the modulo '%' operator frequently, you can replace it with a faster bitwise operation (for non-negative values, index & (size - 1)) as long as you can force the size to be a power of 2, e.g. by rounding up with 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
- Cache line padding can ensure that writes by different threads don't cause CPU overheads (though this is dependent on the details of the hardware architecture).
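The lazySet() and power-of-2 masking tips above can be combined in a small single producer/single consumer queue. This is a simplified sketch in the spirit of the article, not the author's OneToOneConcurrentArrayQueue3: the class name SpscQueueSketch is hypothetical, cache line padding is omitted, and it assumes exactly one producer thread, exactly one consumer thread, and a requested capacity between 1 and 2^30.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: a simplified single-producer/single-consumer ring buffer.
final class SpscQueueSketch<E> {
    private final E[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read, written only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next slot to write, written only by the producer

    @SuppressWarnings("unchecked")
    SpscQueueSketch(int requestedCapacity) {
        int capacity = nextPowerOfTwo(requestedCapacity);
        buffer = (E[]) new Object[capacity];
        mask = capacity - 1; // valid as a modulo replacement only because capacity is a power of 2
    }

    // Rounds up to the next power of 2 (assumes 1 <= value <= 2^30).
    static int nextPowerOfTwo(int value) {
        return 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
    }

    // Producer side: returns false if the queue is full.
    boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length) {
            return false;
        }
        buffer[(int) (t & mask)] = e; // bitwise AND instead of t % capacity
        tail.lazySet(t + 1);          // ordered write publishes the element to the consumer
        return true;
    }

    // Consumer side: returns null if the queue is empty.
    E poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        E e = buffer[index];
        buffer[index] = null;         // release the reference before freeing the slot
        head.lazySet(h + 1);          // ordered write hands the slot back to the producer
        return e;
    }
}
```

Because each counter has a single writer, lazySet() is sufficient here; with multiple producers or consumers the counters would need compare-and-swap instead.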
Common Pitfalls in Writing Lock-Free Algorithms (Page last updated March 2013, Added 2013-05-29, Author Pieguy, Publisher memsql). Tips:
- A lock-free algorithm guarantees forward progress in some finite number of operations - deadlock is impossible.
- For highly concurrent applications, locking can be a serious bottleneck.
- Lock-free algorithms rely on atomic primitives such as the classic "compare-and-swap". Writing correct lock-free code is extremely difficult.
- A lock-free algorithm must consider that any update or access to a data field could have been interleaved with another thread's access or update - including setting the field to null.
- "compare-and-swap" makes no guarantee that a value has not changed, only that the current value is the same as the expected old value. There could have been multiple changes in the meantime, with the value having been reset before the "compare-and-swap" succeeds.
- A lock free algorithm must be lock free at all steps, not just at some steps.
- Different threads can see changes to memory occur in different orders.
- The ABA problem (multiple intermediate changes that end with the same value, which the compare-and-swap sees as no change) can be avoided by always making sure any change is unique, e.g. by always incrementing.
- Allocating objects during a lock-free algorithm implies it is not lock-free, as the GC can kick in and halt it. Using pools and reusing objects can avoid this.
- A lock-free algorithm guarantees progress, but it does not guarantee efficiency. It could increase contention from compare-and-swap repeatedly failing. If you want to use lock-free algorithms in your applications, make sure that it's going to be worth it, both in actual performance and in added complexity.
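The ABA problem described above can be reproduced deterministically on one thread by replaying the interleaving by hand. This is a minimal sketch, not code from the article: the class name AbaSketch is hypothetical, and java.util.concurrent.atomic.AtomicStampedReference is used as one way of making every change unique by incrementing a stamp.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicStampedReference;

// Hypothetical example: the interleaving of two threads is replayed sequentially.
final class AbaSketch {
    // Returns true: a plain CAS cannot tell that the value went A -> B -> A in between.
    static boolean plainCasMissesAba() {
        String a = "A";
        String b = "B";
        AtomicReference<String> ref = new AtomicReference<>(a);
        String observed = ref.get();             // "thread 1" reads A
        ref.compareAndSet(a, b);                 // "thread 2" changes A -> B
        ref.compareAndSet(b, a);                 // "thread 2" changes B -> A
        return ref.compareAndSet(observed, "C"); // "thread 1" succeeds despite the changes
    }

    // Returns true: the stamp is incremented on every change, so the stale CAS is rejected.
    static boolean stampedCasDetectsAba() {
        String a = "A";
        String b = "B";
        AtomicStampedReference<String> ref = new AtomicStampedReference<>(a, 0);
        int[] stampHolder = new int[1];
        String observed = ref.get(stampHolder);  // "thread 1" reads A with stamp 0
        int observedStamp = stampHolder[0];
        ref.compareAndSet(a, b, 0, 1);           // "thread 2" changes A -> B, stamp 0 -> 1
        ref.compareAndSet(b, a, 1, 2);           // "thread 2" changes B -> A, stamp 1 -> 2
        boolean swapped = ref.compareAndSet(observed, "C", observedStamp, observedStamp + 1);
        return !swapped;                         // stale stamp 0 no longer matches stamp 2
    }
}
```

Note that AtomicStampedReference allocates internally on each update, so whether it suits a given algorithm depends on the allocation concerns raised in the tips above.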
Why Not One Application Per Server / Domain? (Page last updated March 2013, Added 2013-05-29, Author Adam Bien, Publisher adam-bien). Tips:
- An application per JVM rather than multiple applications per JVM has one disadvantage (wasted overheads per JVM) but multiple advantages: Classloader interference is eliminated; upgrades don't interfere; heap can be customised per application; cores are better utilised (multiple JVMs scale better than just one); application monitoring is more clearly defined; throttling per application is possible; crashes are separated so only affect one application (at a time).
OpenJPA: Memory Leak Case Study (Page last updated March 2013, Added 2013-05-29, Author Pierre-Hugues Charbonneau, Publisher javaeesupportpatterns). Tips:
- The simplest indication of a memory leak is seeing a series of ever higher lows on the graph of heap used.
- You can dynamically obtain a heap dump from a running JVM using jmap (but it can take time and freezes the JVM).
- The heap dump analysis performed was: load the heap dump into MAT (Eclipse Memory Analyzer); click "leak suspects", which showed that one object held 600MB of the 1.5GB heap; use "find object by address" to find the root object (the classloader holding the object retaining 600MB) in the object view; sort on retained heap; drill down into the object from Classloader->classes(Map)->element->PCRegistry->listeners(linkedlist) to identify that the "listeners" instance variable of the PCRegistry is holding all the memory; drill down further to see that the leaking objects were actually the JDBC & SQL mapping definitions metadata. At this point you know which object is leaking but not why, so you need to look at the source to identify why the leak occurred. The source shows that if close() is not called, then listeners is never released. In this case an EntityManagerFactory was being used to create multiple objects from the application but close() was not being called; the solution is either to call close() or, as in this case, to use a singleton.
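The shape of the leak found above (a long-lived listener list that is only cleaned up by close()) can be sketched in plain Java. This is a hypothetical stand-in, not OpenJPA code: ListenerLeakSketch, LISTENERS and Factory are illustrative names, the byte array stands in for the mapping metadata, and the sketch is not thread-safe.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical example: mimics a registry whose listeners are released only by close().
final class ListenerLeakSketch {
    // Stand-in for a static registry that stays reachable for the life of the classloader.
    static final List<Object> LISTENERS = new LinkedList<>();

    // Stand-in for a factory that registers itself on creation.
    static final class Factory implements AutoCloseable {
        private final byte[] metadata = new byte[1024]; // stands in for retained mapping metadata

        Factory() {
            LISTENERS.add(this);
        }

        @Override
        public void close() {
            LISTENERS.remove(this); // the only place the registration is released
        }
    }

    // Creates one factory per simulated request and reports how many remain registered.
    static int retainedAfter(int requests, boolean callClose) {
        LISTENERS.clear();
        for (int i = 0; i < requests; i++) {
            Factory factory = new Factory();
            if (callClose) {
                factory.close();
            }
        }
        return LISTENERS.size();
    }
}
```

With callClose set to false every factory stays reachable through the list, which is the "ever higher lows" pattern in miniature; creating a single shared factory avoids the growth in the same way as the singleton fix described above.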
Last Updated: 2019-12-31
Copyright © 2000-2019 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.