Java Performance Tuning
Tips November 2016
https://www.youtube.com/watch?v=dpNRiXLoiIM&list=PLPIzp-E1msrYicmovyeuOABO4HxVPlhEA&index=72
Troubleshooting the Java HotSpot VM (Page last updated September 2016, Added 2016-11-30, Author Poonam Parhar, Publisher JavaOne). Tips:
- OutOfMemoryErrors relate to different spaces, each of which has a different fix. First make sure you know which memory space is exhausted.
- A heap space OutOfMemoryError could simply be because the heap is too small - a simple fix is to set -Xmx to a larger value. The GC logs will show you the stable footprint of the application - look at the used space after GC; your heap needs to be sufficiently bigger than this to give the application headroom to work in.
- If the heap after GC is always increasing, this indicates a possible memory leak. Java Flight Recorder with Heap Statistics enabled can also show you this, eg the live set is always increasing. Generate a heap dump to identify the memory leak: jcmd processid GC.heap_dump mydump.hprof; jmap -dump:format=b,file=mydump processid; JConsole using the HotSpotDiagnostic MBean (a programmatic sketch of this route follows this list); or -XX:+HeapDumpOnOutOfMemoryError. Or use histograms: -XX:+PrintClassHistogram together with Ctrl-Break; jcmd processid GC.class_histogram; jmap -histo processid. Also worth monitoring objects pending finalization: jmap -finalizerinfo processid, or via JConsole.
- Excessive use of finalizers can lead to OutOfMemoryErrors, because finalizers delay the collection of objects.
- Heap dumps can be analyzed with jhat, JVisualVM, Eclipse MAT, or the JOverflow JMC plugin.
- A perm gen OutOfMemoryError could simply be because the permgen is too small - a simple fix is to set -XX:MaxPermSize to a larger value. Make sure that classes are being unloaded: -XX:+TraceClassLoading -XX:+TraceClassUnloading -XX:+CMSClassUnloadingEnabled. Check that you DON'T have -Xnoclassgc set. jmap -permstat lets you monitor the permgen; alternatively java -cp $JAVA/lib/sa-jdi.jar sun.jvm.hotspot.tools.PermStat processid.
- A metaspace OutOfMemoryError may need a higher MaxMetaspaceSize (by default it's unlimited, so you must be setting it somewhere if this is the case). Monitor with jmap -clstats, jcmd processid GC.class_stats, -XX:+PrintGCDetails, JConsole, JVisualVM.
- A compressed class space OutOfMemoryError may need a higher CompressedClassSpaceSize (default 1GB).
- A native heap OutOfMemoryError is typically only seen on 32-bit JVMs, or 64-bit JVMs with compressed oops. One possible tuning option is to set -XX:HeapBaseMinAddress to reset the base address within the restricted memory space. Another is to turn off compressed oops.
- There is support for native memory tracking: -XX:NativeMemoryTracking=[off|summary|detail]. Use -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics, or jcmd processid VM.native_memory.
- Analyze latencies with measurements from: Java Flight Recorder (sort events by duration); GC logs; stack traces (with native frames); -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1.
- Log GCs with -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime (-XX:PrintFLSStatistics=2 with CMS). Use GCHisto or VisualGC or gclogviewer or other GC log analysis tools.
- Disable explicit GCs if these are unintended, eg with -Dsun.rmi.dgc.server.gcInterval=somethingbig -Dsun.rmi.dgc.client.gcInterval=somethingbig -XX:+DisableExplicitGC.
- GC pauses with high "sys" time compared to "user" time can occur when underlying OS IO pauses the JVM. Similarly, GC pauses with low "user" and low "sys" but high "real" time can be caused by the JVM getting stuck waiting for OS IO to complete (including when GC log file rotation triggers filesystem flushing).
- With hung processes, check for deadlocks. Trigger a stack dump (eg kill -3), jstack -m/-l/-F; check what the VM thread is doing (eg SafepointSynchronize::begin); an apparently stuck thread could be doing JVM level work like looking for space in the CodeCache.
- Use OS level tools to find which threads use lots of CPU (eg prstat -L, ps, top); the OS thread id maps to the (hexadecimal) nid in jstack thread dumps.
- [Talk also has some nice info on what to be monitoring when you get crashes].
- -XX:GCTimeLimit and -XX:GCHeapFreeLimit are useful options to tell the parallel collector to give up earlier than it otherwise would (otherwise it can just continuously full gc for a long time).
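As a sketch of the programmatic heap dump route mentioned in the tips above (JConsole's HotSpotDiagnostic MBean), assuming a HotSpot JVM where the com.sun.management API is available - the filename here is illustrative:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            // Obtain the HotSpot-specific diagnostic MBean - the same one JConsole exposes
            HotSpotDiagnosticMXBean diag =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // true = dump only live objects (forces a full GC first),
            // like jmap -dump:live,format=b,file=...
            diag.dumpHeap("mydump.hprof", true);
        }
    }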
https://www.youtube.com/watch?v=nVPyR3mq8O4
Concurrency options on the JVM (Page last updated November 2016, Added 2016-11-30, Author Christin Gorman, Publisher JDD Conference). Tips:
- If you want to maximize concurrent performance, you just cannot afford to block.
- CPU bound tasks are slower if you execute them in parallel on one core, because all you have added is scheduling overhead. But IO bound tasks can be faster.
- There is a limit to how many threads an operating system can run. A framework like Akka (using the actor model) multiplexes tasks across a limited number of threads.
- There is a limit to how many threads an operating system can run. A callback model lets you decompose tasks, add them to a list, and have a limited number of threads poll each task until it is done, then call back the caller - vertx.io uses this model.
- If you multiplex tasks on a single thread, you don't have to worry about locks because you only have one thread. Obviously, you can't make blocking calls in this mode. This can lead to code that is difficult to understand and frustrating to work with.
- If the scale you need can be handled by a single-threaded task, this leads to very simple code that is easy to understand.
- There is a limit to how many threads an operating system can run. If you use continuations to multiplex tasks, as Quasar fibers (lightweight threads) do, you can write sequential-looking code while still multiplexing tasks on a limited number of threads.
- CountDownLatches are very useful for testing concurrent code (a sketch follows this list).
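A minimal sketch of the CountDownLatch testing tip - the unsynchronized counter here is a made-up stand-in for whatever code is under test. One latch releases all the worker threads at once to maximize interleaving; a second waits for them all to finish before checking the result:

    import java.util.concurrent.CountDownLatch;

    public class RaceTest {
        static int counter = 0; // deliberately unsynchronized "code under test"

        public static void main(String[] args) throws InterruptedException {
            final int threads = 50;
            final CountDownLatch start = new CountDownLatch(1);
            final CountDownLatch done = new CountDownLatch(threads);
            for (int i = 0; i < threads; i++) {
                new Thread(() -> {
                    try {
                        start.await();      // every thread blocks here...
                        counter++;          // ...then they all hit the shared state together
                    } catch (InterruptedException ignored) {
                    } finally {
                        done.countDown();
                    }
                }).start();
            }
            start.countDown();  // release all threads at once
            done.await();       // wait until every thread has finished
            System.out.println("expected 50, got " + counter); // may be less: lost updates
        }
    }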
https://www.youtube.com/watch?v=03GsLxVdVzU
Designing for Performance (Page last updated November 2016, Added 2016-11-30, Author Martin Thompson, Publisher Devoxx). Tips:
- Scalability means: if you add hardware you can increase the load while keeping the same latency.
- Service time (the time taken actually doing the work rather than waiting) is important; the total response time is wait time + service time. You can decrease wait time by adding more servers.
- Response time degrades rapidly once you go over about 70% utilization of a resource. Ensure you have sufficient spare capacity on any system - the same limitation applies to non-compute resources too, eg teams of people.
- Amdahl's law says that the speedup possible is limited by the part of the system that cannot be parallelized: speedup = 1/((1-p) + p/N) for parallelizable fraction p on N cores. A task that is 95% parallelizable (5% is always sequential) can speed up by at most 1/(1-0.95) = 20x however many cores you add - going from 20 cores to 1000 cores gets you closer to 20x, but never beyond it.
- The Universal Scalability Law says that you can't even achieve Amdahl's speedup limit, because of communication overheads between parallelized subtasks: C(N) = N / (1 + a(N-1) + b*N*(N-1)), where C is the capacity or throughput, N is the number of processors, a is the contention penalty and b is the coherence penalty (a worked sketch follows this list).
- Most logging frameworks scale extremely badly - as you add more threads, the logging framework adds more items to a queue that has only one server to take items off the queue.
- Only consider abstraction when you see at least 3 things that ARE the same. All abstractions have a cost - does the abstraction pay for itself? Copy-paste is a reasonable initial approach; abstract later when you see many ACTUAL copies in the finished product. Note too that megamorphic call sites (3+ different implementations) are handled very badly by modern processors.
- Memory systems keep items close together in time and space, because often enough things next to each other in memory are accessed close together in time (spatial locality), and things that were recently accessed are accessed again soon after (temporal locality).
- Try to keep together fields that need each other's data for their operations.
- Data structures are more useful than frameworks - the right data structure for the problem is one of the best optimizations.
- Batching is your friend at all scale levels, operate on data in batches wherever possible.
- Write code with fewer branches - it should be easy to understand.
- Write code, leave it for a day at least, then go back to it - we come back to it with a new perspective and will improve it.
- Keep your loops small in terms of the full inlined stack. Do one thing at a time, keep it clean and simple.
- Each statement/method/class/module should be doing just one thing (really well). Focus on the API.
- Measure, but use histograms rather than averages - outliers are important.
- Queues of requests build up while you're busy (or pausing), but you won't notice the resulting higher latencies if you only look at averages - another reason to measure with histograms.
- Build measurements into your production system from the beginning, measurements which you can view in real-time rather than via logs.
- Expose counters in histograms of: queue lengths; concurrent users; exceptions; transactions.
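A minimal sketch of the two scaling formulas above; the contention and coherence penalties (0.05 and 0.0001) are illustrative values, not figures from the talk:

    public class Scaling {
        // Amdahl: speedup on n cores when fraction p of the work is parallelizable
        static double amdahl(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        // Universal Scalability Law: relative capacity on n processors,
        // a = contention penalty, b = coherence penalty
        static double usl(int n, double a, double b) {
            return n / (1.0 + a * (n - 1) + b * n * (n - 1));
        }

        public static void main(String[] args) {
            for (int n : new int[] {1, 20, 100, 1000}) {
                System.out.printf("N=%4d  Amdahl=%6.2fx  USL=%6.2fx%n",
                        n, amdahl(0.95, n), usl(n, 0.05, 0.0001));
            }
        }
    }

Note how the USL curve can go backwards: with these penalties, 1000 processors yield less capacity than 100 - adding hardware makes things worse once coherence costs dominate.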
https://www.youtube.com/watch?v=OFgxAFdxYAQ
A Crash Course in Modern Hardware (Page last updated November 2016, Added 2016-11-30, Author Cliff Click, Publisher Devoxx). Tips:
- Because memory hasn't got much faster while CPUs have (and are gaining many more cores), programs running on modern hardware are totally cache-miss dominated. Every level of indirection typically costs an additional cache miss and extra space.
- Get rid of wrappers: they are indirection that causes a cache miss and extra space (see the sketch after this list).
- Every conversion empties the cache, ensuring another cache miss. Don't convert unless you will use the conversion multiple times.
- Shared data is okay; mutable data (on a single core) is okay; but shared mutable data will cause cache contention and requires synchronization.
- Immutable data is great.
- CPU utilization is a misleading metric - it doesn't tell you whether you're working or cache-missing (eg the working set could be too big). There are few good tools available to show this (VTune, and Solaris Studio - which also works on Linux - have some support).
- Avoid touching data wherever possible.
- Some indication of cache misses can be inferred from profilers - look at hot code in loops and see what data it touches; any indirections are suspects.
- Getters and setters don't always get inlined - get rid of them in hot loops (though note this is bad coding style).
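To illustrate the cost of wrappers and indirection, a naive sketch comparing a primitive array with boxed values - treat the timings as indicative only (use a harness like JMH for real measurements). The boxed version adds a pointer chase, and hence a likely cache miss, per element:

    import java.util.ArrayList;
    import java.util.List;

    public class BoxingCost {
        public static void main(String[] args) {
            final int size = 10_000_000; // may need a larger heap, eg -Xmx1g
            long[] primitives = new long[size];       // one contiguous block of memory
            List<Long> boxed = new ArrayList<>(size); // an array of pointers to Long objects
            for (int i = 0; i < size; i++) {
                primitives[i] = i;
                boxed.add((long) i);
            }

            long t0 = System.nanoTime();
            long sum1 = 0;
            for (int i = 0; i < size; i++) sum1 += primitives[i]; // sequential, cache-friendly
            long t1 = System.nanoTime();
            long sum2 = 0;
            for (int i = 0; i < size; i++) sum2 += boxed.get(i);  // one indirection per element
            long t2 = System.nanoTime();

            System.out.printf("primitive: %dms, boxed: %dms (sums %d/%d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum1, sum2);
        }
    }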
https://www.youtube.com/watch?v=sSAHvuA0B40
Trash Talk! How to Reduce Downtime by Tuning Garbage Collection (Page last updated October 2016, Added 2016-11-30, Author Denise Unterwurzacher, Publisher Atlassian). Tips:
- The default garbage collection configuration aims to be the best for most applications, so it is a good starting point.
- Increasing memory is not going to decrease pause times.
- Using so much memory that the machine starts paging will cause very big pauses.
- G1GC tends to need more memory than the parallel collector. G1GC should avoid huge pauses on large heaps.
- Young generation collections are always stop-the-world pauses. Old generation collections have larger pauses (except with concurrent collectors), but all collectors tend to have very big pauses when allocation fails in the normal path, eg when G1 runs out of reserved regions or a concurrently-collected heap is too fragmented.
- A bug in G1GC causes big pauses when it runs out of reserved space - you see a to-space allocation failure in the GC logs - bump up the reserved region with -XX:G1ReservePercent=20 if the bug applies.
- Use the GC logging to identify exactly what the GC is doing. GCViewer is a nice tool to visualize GC logs.
- Some GC settings prevent the GC being adaptive, which can make the overall GC less efficient - when it's adaptive the GC resizes heaps to be more efficient.
- OOMEs are often because the heap is too small rather than because there is a leak.
- Too large a heap can cause very long old generation pauses.
- Benchmark, plan your change, measure that it improved, repeat (a measurement sketch follows this list).
- Your GC tuning goals are some tradeoff between latency (maximum pause), throughput (total time paused across all GCs compared to elapsed time); and footprint. For low pause times and high throughput, try G1GC and a large heap. For low pause times and low footprint, try G1GC and a medium sized heap. For low footprint and high throughput, try the parallel collector and a small heap.
- Common GC tuning errors: making the heap too large; using parameters that stop the GC auto-tuning (eg instead of MaxNewSize, use NewRatio; instead of GCTimeLimit use GCTimeRatio; instead of SurvivorRatio use G1ReservePercent).
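One way to measure the throughput side of these tradeoffs from inside the application, using the standard java.lang.management API (printing to stdout is just for illustration - feed the numbers into whatever monitoring you use):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStats {
        public static void main(String[] args) {
            // One MXBean per active collector, eg "G1 Young Generation"/"G1 Old Generation"
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %dms total collection time%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
            // throughput = 1 - (total GC time / uptime)
            System.out.println("JVM uptime: " + uptime + "ms");
        }
    }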
https://www.youtube.com/watch?v=kb-m2fasdDY
What I Wish I Had Known Before Scaling Uber to 1000 Services (Page last updated September 2016, Added 2016-11-30, Author Matt Ranney, Publisher GOTO). Tips:
- The time when things are most likely to break is when you change them.
- A distributed system is much harder to reason about than a single (monolithic) system.
- Multiple languages have a cost - it's hard to share code and move people across teams, and it fragments the culture.
- RPC using HTTP/REST has a cost - the protocol has many status codes and semantics that are meant for browsers and become complicated costs for independent server-to-server comms. Treat the RPC like a function call, not a browser call.
- Flame graphs are a profile format that is common across languages.
- Performance doesn't matter - until it suddenly does, and if you don't already have the infrastructure to deal with optimizing, you're in trouble. SLAs and a measurement infrastructure should be a standard part of every service.
- You have to wait for the slowest thing in your call chain - even if each call has a very low chance, say 1%, of being slow, those probabilities compound across the chain of calls, making the probability of the overall chain being slow much higher (a worked example follows this list).
- Make sure you can trace your requests across your system so you can identify where any slowness occurs in the distributed call chain.
- The overhead from tracing can be significant, you may want to sample traces rather than log them all.
- You should use a consistent structured logging layer that is common across all services.
- Drop log messages if you can't log fast enough so that the logging doesn't become a bottleneck.
- Limit how much logging is allowed per transaction, otherwise you'll generate too much for the logs to remain usable.
- When there's no way to create a test environment as big as production or generate realistic loads, you can load test in production with test traffic (in off peak periods). That means you need your system to identify test requests as non-real-load traffic.
- Make failure testing happen randomly whether teams like it or not, as teams will most likely not opt in.
- Keep all your services on recent code releases or you will be stuck when you need to migrate an old service.
- Only build things that are business specific. Everything else will be built cheaper by some other org soon.
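A quick worked version of the slow-call-chain arithmetic above, using the 1% example figure from the tip:

    public class ChainSlowness {
        public static void main(String[] args) {
            double pSlow = 0.01; // 1% chance that any single call is slow
            for (int calls : new int[] {1, 10, 50, 100}) {
                // P(at least one slow call) = 1 - P(every call is fast)
                double pChainSlow = 1.0 - Math.pow(1.0 - pSlow, calls);
                System.out.printf("%3d calls -> %2.0f%% chance the request is slow%n",
                        calls, pChainSlow * 100);
            }
        }
    }

With 100 downstream calls, a 1% per-call chance becomes a roughly 63% chance that the request as a whole is slow.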
Jack Shirazi