Java Performance Tuning
Java(TM) - see bottom of page
Tips March 2015
Back to newsletter 172 contents
Using wait(), notify() and notifyAll() in Java: common problems and mistakes (Page last updated August 2012, Added 2015-03-28, Author Neil Coffey, Publisher javamex). Tips:
- wait() should be called in a loop, with a condition test in each loop iteration that evaluates to false if the notifying thread has notified (this catches notifications that trigger before entering the loop even once).
- wait() can exit without a notification having been triggered (a spurious wakeup), so it should be called in a loop.
- Do not wait() forever; a timeout of 0 means no timeout at all (long forever = 0; object.wait(forever)).
- Choosing between notify() and notifyAll() is to some extent a tuning choice. notifyAll() works in all situations but reduces throughput by causing unnecessary context switches. notify() is generally more efficient, unless several threads need to be woken to proceed after a condition changes, in which case notify() can stall (threads that need to be woken may never be woken).
- object1.wait() only releases the lock of object1, not other locks.
- There is a non-deterministic delay between notify()ing and the notified thread executing (the lock is released, the waiting thread is set to be runnable, and then run when the OS next schedules it).
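The wait() tips above can be sketched as a guarded wait with a deadline. This is a minimal illustration, not code from the referenced article: the class ReadyFlag and its method names are hypothetical. The condition is re-tested on every loop iteration, so an early notification or a spurious wakeup is handled, and the timeout is always positive, so the thread never waits forever.

```java
/** Hypothetical one-shot flag illustrating the guarded-wait idiom. */
final class ReadyFlag {
    private final Object lock = new Object();
    private boolean ready; // the condition, only read or written while holding lock

    /** Waits up to timeoutMillis for the flag; returns true if it was set in time. */
    boolean awaitReady(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            // Loop: the condition is checked before the first wait() and after every wakeup.
            while (!ready) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return false; // timed out; never call wait(0), which waits indefinitely
                }
                lock.wait(remainingMillis); // releases only this lock while waiting
            }
            return true;
        }
    }

    /** Sets the flag and wakes every waiter (all of them can proceed once it is set). */
    void markReady() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();
        }
    }
}
```

notifyAll() is used here because every waiting thread can proceed once the flag is set; with a single waiter, notify() would be enough.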
Why Non-Blocking? (Page last updated March 2015, Added 2015-03-28, Author Bozhidar Bozhanov, Publisher bozho). Tips:
- Non-blocking applications are written so that threads never block. Instead of blocking, threads get notified when new data is available.
- Implementations of the reactor pattern typically have one thread serving all requests by multiplexing between tasks and never blocking; when something is ready to be processed, it is immediately processed or handed off to a thread.
- There is a trade-off in using non-blocking implementations - you have higher complexity for potentially higher scalability, but latency can suffer and a blocking implementation is easier to understand and test.
- For thread-safety you just have to follow one simple rule - no mutable state in the code. No instance variables and you are safe, regardless of how many threads execute the same piece of code.
- Use only immutable and concurrent data structures to ensure thread-safety.
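As a rough illustration of the last two tips, the sketch below keeps its only state in a concurrent data structure held in a final field, so any number of threads can call it without extra locking. The class name HitCounter and its methods are hypothetical and not taken from the referenced article.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical request counter whose only state is a concurrent map. */
final class HitCounter {
    // final reference to a thread-safe structure; no other mutable fields
    private final ConcurrentMap<String, Long> hits = new ConcurrentHashMap<>();

    /** Atomically increments the count for a path. */
    void record(String path) {
        hits.merge(path, 1L, Long::sum); // merge is atomic on ConcurrentHashMap
    }

    /** Returns the current count for a path, or 0 if it was never recorded. */
    long count(String path) {
        return hits.getOrDefault(path, 0L);
    }
}
```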
Understanding the JVM and Low Latency Applications (Page last updated March 2013, Added 2015-03-28, Author Simon Ritter, Publisher Oracle). Tips:
- If garbage collection kicks in, there is a pause of variable length meaning you get non-deterministic performance.
- The frequency of minor GC is dependent on the rate of object allocation (how quickly you fill Eden) and the size of Eden.
- The frequency of object promotion to tenured space is dependent on how quickly objects age (in minor GC counts), the size of the survivor spaces, and how many times objects are copied across survivor spaces.
- Object retention (live objects) impacts latency more than object allocation - minor GC time is a function of how many live objects there are (and the complexity of how they are connected).
- Very short-lived objects (never copied out of Eden) are efficient to use, but if you create too many you cycle through Eden faster, causing more frequent pauses. Object allocation in Eden is very fast, about 10 cycles compared with about 30 cycles for the fastest malloc, and such objects are very cheap to reclaim - they are just ignored and their space gets overwritten later!
- The ideal application only experiences small minor GCs and no old generation GCs, so negligible promotion.
- Start with parallel GC (-XX:+UseParallelOldGC and/or -XX:+UseParallelGC) as this provides the fastest minor GCs; move to CMS GC if old generation collection pauses are too long, but this will make minor GCs longer due to promotion into free lists.
- Avoid creating large objects as much as possible (they may not fit into Eden, they must be zeroed, they can cause fragmentation) or try to keep them initialized during application initialization.
- Avoid resizing collections - try to size them from the start to be as large as they'll need.
- Don't implement finalize() methods. Explicitly free resources, or use a Reference if you absolutely have to clear up during GC.
- SoftReference clearup is up to the garbage collector and is nondeterministic (though you can work it out for particular versions).
- Inner classes have an implicit reference to the outer instance, which increases object connection complexity, which in turn can make GCs longer. (From Java 8, some anonymous inner classes can be replaced with lambda expressions, which do not capture the outer instance unless they use it.)
- CMS has a pause target, but that can only be targeted by changing internal heap sizes so is very limited.
- The G1 collector is targeted at replacing CMS - it includes compacting (CMS doesn't) and is more predictable than CMS.
- Code in catch blocks is not normally JIT compiled (the JIT compiler assumes it won't be reached, so sees no need to optimize it).
- JIT deoptimisation causes non-deterministic behaviour.
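Two of the tips above (size collections up front, and free resources explicitly rather than in finalize()) can be sketched as follows. The class names NativeBuffer and GcFriendly are hypothetical, and the resource is a stand-in: a real one would release a file handle, socket or native memory in close().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resource that is released explicitly instead of via finalize(). */
final class NativeBuffer implements AutoCloseable {
    private boolean open = true;

    boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        open = false; // a real implementation would release the underlying resource here
    }
}

final class GcFriendly {
    /** Builds a list whose backing array is sized once, so it is never regrown and copied. */
    static List<Integer> squares(int n) {
        List<Integer> result = new ArrayList<>(n); // capacity known up front
        for (int i = 0; i < n; i++) {
            result.add(i * i);
        }
        return result;
    }
}
```

A caller would typically wrap the resource in try-with-resources, for example `try (NativeBuffer buf = new NativeBuffer()) { ... }`, so that close() runs deterministically when the block exits rather than at some later GC.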
Tuning Large Scale Java Platforms (Page last updated November 2014, Added 2015-03-28, Author Emad Benjamin, Jamie O'Meara, Publisher InfoQ). Tips:
- Establish your load profile - know or estimate: concurrent requests; requests per second; peak & average response times. Specify the response time SLA.
- To size your system, create benchmark tests based on your known or anticipated load profile, and tune and scale the test systems until you achieve your SLAs - this establishes your production configuration.
- JVM Memory = Max heap + perm heap (if present, removed from Java 8 in HotSpot) + NumberOfConcurrentThreads * -Xss + other memory (nio direct memory, JNI memory, JIT code cache, classloaders, socket buffers, additional GC info)
- The JVM should be sized below the memory available to a socket, eg on a 2-socket machine with 96G, each socket has 48G available to it (or half that on AMD) so the JVM heap size needs to be sized so that the full JVM process size is less than 48G. Otherwise NUMA interleaving happens and performance can drop by 30%. (-XX:+UseNUMA makes some GCs NUMA aware, but some don't work with it, eg parallelold works, CMS doesn't).
- GC tuning is essentially balancing between latency (pause times) and throughput (after you have eliminated the major inefficiencies).
- The most successful general GC algorithm option is ParNew in the young gen and CMS in the old gen.
- GC tuning: 1. Measure minor GCs, and adjust young gen size and/or parallel thread count to minimize either individual pauses or overall stop time, depending on whether latency or throughput is your target; 2. Adjust the total heap size in the same way; 3. Adjust survivor space size, again for the same targets, but be aware that smaller survivor spaces can cause more promotion, which can then cause more frequent old gen GCs.
- An increase in young gen heap size will decrease the frequency of young GCs, but this can make individual pause times suffer (depending on the number of live objects after a GC).
- Example best tuned config for a 50G JVM: -Xms50g -Xmx50g -Xmn16g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=6 -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache
- For IBM JVMs, gencon GC algorithm tends to give the best GC performance.
- A microservice architecture should not result in most requests needing to traverse most of your estate - that would imply the microservices are over-fragmented and should be coalesced into not quite so "micro" services.
- More JVMs typically means you have more overall GC for the same total heap size (eg 4 x 1G JVMs uses more CPU because of more GC than 2 x 2G JVMs).
- 64-bit JVMs compress oops so that they use a similar amount of memory to a 32-bit JVM for heaps up to 4G. From 4G to 32G compressed oops still save memory, but above that there is no benefit.
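The JVM memory formula and the per-socket sizing rule above can be worked through with a small calculation. Every figure used in the usage note below (40G heap, 256M perm, 500 threads, 1M stacks, 1G other memory) is an illustrative assumption, not a number from the talk, and the class JvmFootprint is hypothetical.

```java
/** Hypothetical helper applying: max heap + perm + threads * -Xss + other memory. */
final class JvmFootprint {
    static final long MB = 1024L * 1024L;

    /** Estimated JVM process size in bytes; all arguments are in bytes except threads. */
    static long estimateBytes(long maxHeap, long permGen, int threads, long stackSize, long other) {
        return maxHeap + permGen + (long) threads * stackSize + other;
    }

    /** True if the whole process stays below the memory local to one socket. */
    static boolean fitsWithin(long processBytes, long socketBytes) {
        return processBytes < socketBytes;
    }
}
```

With the assumed figures, `JvmFootprint.estimateBytes(40960 * MB, 256 * MB, 500, 1 * MB, 1024 * MB)` gives 40960 + 256 + 500 + 1024 = 42740 MB, which is below the 49152 MB (48G) local to one socket in the 2-socket, 96G example.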
You Won't Believe How the Biggest Sites Build Scalable and Resilient Systems! (Page last updated January 2015, Added 2015-03-28, Author Jeremy Edberg, Philip Fisher-Ogden, Publisher InfoQ). Tips:
- Build for at least 3 instances - that ensures you have architected correctly for horizontal scaling.
- Automate as much as possible - config, deployment, monitoring, alerts, etc, together with self-service interfaces for any of these.
- Monitoring should be built-in as part of the development.
- Any system that you haven't broken parts of is not a resilient system.
- Disable non-critical features rather than the whole system when parts of the system fail.
- Data best practices: Never have a single copy of data - have multiple copies of data; keep the copies in multiple datacentres (or availability zones); avoid keeping state on a single instance; don't keep secret keys on an instance (that's hugely vulnerable).
- Queues help scaling because they buffer work throughout the system. By monitoring the queues you can see if things get backed up, and by how much.
- Provide cached content rather than immediate content if the cached content is sufficiently recent or acceptable.
- Sharding works well but you have to be careful about how the data is sharded.
- Lambda/kappa architecture duplicates stream processing into a fast, less accurate stream and a slower, more accurate stream; you get quick results initially, then accurate ones later.
- Stateless scales much much much more easily than stateful.
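The tip about serving cached content when it is sufficiently recent could look something like the sketch below. It is only an in-process illustration under stated assumptions: FreshnessCache, its maxAgeMillis parameter and the explicit nowMillis argument are all hypothetical (the time is passed in so the behaviour is easy to test), and a production system would more likely use a shared cache than per-instance state.

```java
import java.util.function.Supplier;

/** Hypothetical single-value cache that reloads only when the value is too old. */
final class FreshnessCache<T> {
    private final long maxAgeMillis;
    private T value;
    private long loadedAtMillis;
    private boolean loaded;

    FreshnessCache(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns the cached value if it is recent enough, otherwise reloads it first. */
    synchronized T get(Supplier<T> loader, long nowMillis) {
        if (!loaded || nowMillis - loadedAtMillis > maxAgeMillis) {
            value = loader.get();       // the slow, "immediate" path
            loadedAtMillis = nowMillis;
            loaded = true;
        }
        return value;                   // the fast, cached path
    }
}
```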
ExecutorService vs ExecutorCompletionService in Java (Page last updated March 2015, Added 2015-03-28, Author Akhil Mittal, Publisher DZone). Tips:
- ExecutorService executes tasks, but doesn't tell you when individual tasks have completed. ExecutorCompletionService lets you retrieve task results as each task completes, by asking ExecutorCompletionService.take() for the next completed task (blocking until one becomes available).
- With ExecutorCompletionService you can kick off a set of parallel tasks and, when the first completes, cancel the others (i.e. if you just want the fastest to complete).
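The "first result wins" pattern described above can be sketched as follows. The class FirstResult and the method firstCompleted are hypothetical names; the java.util.concurrent calls are the standard ones. Note that "first completed" includes a task that finishes by throwing, in which case this sketch propagates the failure as an ExecutionException rather than waiting for another task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class FirstResult {
    /** Submits all tasks, returns the result of the first to complete, cancels the rest. */
    static <T> T firstCompleted(ExecutorService pool, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        CompletionService<T> completion = new ExecutorCompletionService<>(pool);
        List<Future<T>> futures = new ArrayList<>(tasks.size());
        for (Callable<T> task : tasks) {
            futures.add(completion.submit(task));
        }
        try {
            // take() blocks until some task has completed, then get() returns its result
            return completion.take().get();
        } finally {
            // cancelling an already completed future is a no-op
            for (Future<T> future : futures) {
                future.cancel(true);
            }
        }
    }
}
```

The caller still owns the ExecutorService and is responsible for shutting it down.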
Last Updated: 2021-03-29
Copyright © 2000-2021 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.