Java Performance Tuning
Java(TM) - see bottom of page
Tips February 2012
Back to newsletter 135 contents
Everything I Ever Learned about JVM Performance Tuning @twitter (Page last updated December 2011, Added 2012-02-28, Author Attila Szegedi, Publisher QCon). Tips:
- The enemies of latencies are: garbage collection pause (the biggest by far); inter-thread communication; thread context scheduling; synchronization and locks; I/O; algorithmic inefficiencies.
- Tuning areas to reduce latencies are: garbage collection pause tuning; memory tuning; lock contention tuning; CPU usage tuning; I/O tuning.
- Memory tuning breaks down into: memory footprint tuning; allocation rate tuning; garbage collection tuning.
- Memory size is tuned by: reducing the data held in memory; reducing the overhead of holding data in memory (so that the structures take up less space); eliminating memory leaks.
- Analyse gross memory usage with -verbose:gc and equivalent flags, noting heap sizes before->after and the times taken. Gross tuning is to resize the heap. Application object tuning is to use caches or weakly held objects.
- Objects have an overhead which may include padding and depends on the JVM implementation; it can be 32 bytes per object. Below 32GB of heap you can use compressed object pointers (done automatically in recent JVMs) to give a 30% saving in memory usage. This means there is a jump in memory cost above 32GB: a heap just above 32GB typically fits fewer objects than one just below it, and this stays true until the heap exceeds about 48GB.
- Primitive wrappers (whether implicit or explicit) are a large memory overhead on data.
- Only use queueing when tight latency constraints don't matter.
- Thrift generated communications have lower latency and bandwidth costs but higher maintenance overheads. They are not efficient to use internally as domain objects.
- ThreadLocals stay around if you use pooled threads - this can easily become a difficult-to-detect memory leak.
- Tuning tradeoffs tend to be between memory footprint, latency and throughput. Improving one tends to make the others worse - unless you have spare CPU available, when sometimes you can trade the spare CPU for improvement in one of these without affecting the others.
- The ideal young generation size is: big enough to hold one set of all concurrent request handling objects; with survivor space big enough to hold all active objects and any tenuring ones; with a tenuring threshold that tenures long-lived objects as fast as possible while not tenuring any short-lived objects.
- Garbage collectors: -XX:+UseSerialGC (?, throughput); -XX:+UseParallelGC (young, throughput); -XX:+UseParallelOldGC (old, throughput); -XX:+UseConcMarkSweepGC (old, low-pause); -XX:+UseG1GC (?, low-pause).
- Throughput collectors can tune themselves, after you give them hints: -XX:+UseAdaptiveSizePolicy -XX:MaxGCPauseMillis=NNN -XX:GCTimeRatio=MMM (the last is a ratio rather than a direct percentage: the goal is to spend no more than 1/(1+MMM) of total time in garbage collection, so 19 means 5%).
- Garbage Collector recommended starting points: for bulk services, throughput collector with no adaptive sizing; for all others throughput collector with adaptive sizing, or if that fails, concurrent-mark-and-sweep.
- Tune the young generation size first (after providing the old generation with enough size to retain the maximum working heap plus space overhead for GCs), focusing on the tenuring to size the survivor space and making Eden as large as possible up to pause time constraints. Tenuring should be strongly declining across ages.
- Tuning for CMS: find the minimum and maximum working sizes (the stop-the-world throughput collector will tell you that), then overprovision memory above that by 25%-35% - CMS needs a cushion while it works and that's what this provides.
- When a full GC happens, every thread-stack must be walked so more threads means longer GCs.
- Thread coordination optimization: 1) avoid coordinating where possible; 2) use java.util.concurrent Atomic operations; 3) use volatile variables; 4) synchronize as a last resort.
- To use java.util.concurrent Atomic operations for combined data values, use AtomicReference with a composite immutable object to set the new value.
- Soft references are cleared on any full GC, but take 2 GC cycles to be removed. They also make CMS more unpredictable. Size-constrained LRU caches provide more predictable performance.
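
The AtomicReference tip above can be sketched as follows. This is a minimal illustration, not code from the talk: the `Range` and `RangeHolder` classes and the `include` method are hypothetical names chosen for the example. Two values that must always change together are packed into one immutable object, and the reference to that object is swapped with a compare-and-set loop instead of a lock.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable pair of values that must always change together.
final class Range {
    final int low;
    final int high;

    Range(int low, int high) {
        if (low > high) {
            throw new IllegalArgumentException("low > high");
        }
        this.low = low;
        this.high = high;
    }
}

// Hypothetical holder: both bounds are replaced in one atomic step, with no lock.
final class RangeHolder {
    private final AtomicReference<Range> current =
            new AtomicReference<>(new Range(0, 0));

    Range get() {
        return current.get();
    }

    // Widen the range to include value; retry if another thread won the race.
    void include(int value) {
        while (true) {
            Range old = current.get();
            Range updated = new Range(Math.min(old.low, value),
                                      Math.max(old.high, value));
            if (current.compareAndSet(old, updated)) {
                return;
            }
        }
    }
}
```

Readers always see a consistent pair because a `Range` never changes after construction. The cost is one short-lived allocation per update, which fits the tips above about young generation allocation being cheap.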
The Perils of Asynchrony (Page last updated January 2012, Added 2012-02-28, Author Frank Kelly, Publisher softarc). Tips:
- The default solutions in Java for asynchronous processing are: Threads and java.util.concurrent classes; JMS (Java Message Service)
- The downside to Threads / java.util.concurrent for asynchronous processing is that there is no built-in persistence, nor load balancing across JVM instances.
- JMS has two models of asynchronous execution: Queues - point-to-point communication; and Topics - "broadcast" communication system.
- With JMS, the key problem with Queues is ensuring that there is only one reader that actually successfully executes a job (as opposed to none). The key problem with Topics is ensuring that only one reader of the many actually successfully executes a job (as opposed to many readers executing it).
- Persistence typically reduces throughput by an order of magnitude or more.
- One solution for failover is to maintain a set of JVMs with one active and the rest on hot standby, using heartbeats to detect if the active dies and negotiate a failover to becoming active.
- Problems of asynchronous processing include: persistence; high availability; messages handled more than once; messages handled out of order; slow requests ahead in the queue delaying fast requests.
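
One of the problems listed above, messages handled more than once, can be guarded against inside a single JVM with an atomic "claim" on the job id. The sketch below is an assumption for illustration only and does not come from the article: `OnceOnlyExecutor` and `runOnce` are hypothetical names, and the job id is assumed to arrive with each message.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory guard so that a job id is executed at most once per JVM.
final class OnceOnlyExecutor {
    private final ConcurrentMap<String, Boolean> claimed = new ConcurrentHashMap<>();

    // Runs the job only on the first delivery of jobId; returns true if it ran.
    boolean runOnce(String jobId, Runnable job) {
        // putIfAbsent is atomic: exactly one caller sees null for a given key.
        if (claimed.putIfAbsent(jobId, Boolean.TRUE) != null) {
            return false; // duplicate delivery, skip it
        }
        job.run();
        return true;
    }
}
```

This only covers readers in the same JVM. Across JVM instances the claim would have to live in a shared store, and a production version would also need to expire old ids and decide what happens to the claim when a job fails.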
Building Memory-efficient Java Applications: Practices and Challenges (Page last updated October 2009, Added 2012-02-28, Author Nick Mitchell, Gary Sevitsky, Publisher IBM). Tips:
- It is easy to add memory overhead - especially the more you abstract the code/frameworks. Good coding practices actually tend to encourage additional memory overhead.
- A sample of a range of applications shows memory overheads of 50%-90% were usual, i.e. the actual data needed for processing took only 10%-50% of the memory used by the application to hold the data.
- Distinguishing the actual data memory requirements from the additional overhead used by collections, data types, delegations, etc, allows you to determine whether the overhead cost is worth the functionality added and whether there is significant scope for improvement if memory needs reducing.
- JVM and hardware impose significant memory overheads for small objects (headers and alignments).
- HashSet is not a good collection choice for small collections in terms of memory overheads - though is often used exactly that way.
- Boxing primitive datatypes adds large memory overheads to holding primitive data values.
- Default sizes for collections tend to add significant memory overheads - the actual number of elements held rarely matches the collection capacity. Empty collections are particularly bad for memory overheads. Concurrent collections also have high memory overheads.
- Gnu Trove collections include many space-efficient collections.
- Caching unnecessarily has a high memory overhead - for example caching the result of toString() is often unnecessary for performance but is frequently encountered; immutable data is often duplicated.
- Simple short-lived objects are mostly free, but some objects are designed to live longer and are inefficient to create for a single use, e.g. SimpleDateFormat creation costs are designed to be amortized over many uses (should be reused in a thread-local), as are formatters, converters, factory objects, schemas, connections, etc.
- You should be careful with the lifetime of objects. Three typical reasons for long-lived data are: In-memory design; Cached/pooled/thread-local objects (space used to reduce time costs); Long transaction/request support objects.
- Caches and pools should always be bounded. Large caches are not necessarily better - they may just use more resources for no benefit, and may even cause problems by using too much memory.
- Soft references can be useful for simple caches/pools, but shouldn't be relied on as the sole ejection policy; also they may not leave enough headroom for temporary objects, causing the GC to run more often.
- Weak references are ideal for objects tied to the lifetime of other objects, e.g. listeners, shared pools, annotations. Failure to unregister listeners is a common cause of leaks.
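
The advice above that caches should always be bounded can be illustrated with a small least-recently-used cache. This is a sketch under stated assumptions rather than anything from the IBM presentation: the class name `LruCache` and the `maxEntries` parameter are hypothetical, and the implementation relies on the standard `LinkedHashMap` access-order mode and its `removeEldestEntry` hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-bounded LRU cache built on LinkedHashMap's access order.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = order entries by access, oldest first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}
```

The sketch is not thread-safe; shared use would need external synchronization, for example wrapping it with `Collections.synchronizedMap`. Eviction here is by entry count only, so the memory bound is only as good as the estimate of the size of each entry.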
JVM Performance Tuning (notes) (Page last updated January 2012, Added 2012-02-28, Author Andrew Wang, Publisher umbrant). Tips:
- Beware autoboxing where compact data types are automatically converted into fatter Objects. If memory is really tight, a plain old array of primitive types can be the best choice.
- Beneath 32GB of heap, the JVM will only use 4B per pointer instead of 8B. However, this means if you want a heap bigger than 32GB, you need to jump up a lot; Attila says 48GB.
- Java optimizes for the common case of short-lived trash that can get quickly collected in the young generation - young generation allocation and garbage collection is really cheap: allocation is just a pointer shift and zeroing (if needed) the space between the last pointer and the new one; garbage collection only copies live objects to older space, then resets the pointer to the start of the young generation (no explicit deallocation needed).
- Young generation GC time is proportional to the number of live young objects, which is usually small compared to the amount of trash.
- The more memory you can give the young generation, the better, since allocation and deallocation is so cheap (though very big young generations could cause too large pauses).
- You want a young generation big enough to hold active and tenuring objects; for long-lived objects to quickly tenure and reach the old generation; but you don't want survivors that could be collected in the young generation to get forced to the old generation early by memory pressure on the young generation.
- Garbage collection algorithm options are: SerialGC, ParallelGC, ParallelOldGC, ConcMarkSweepGC, G1GC.
- Try to reduce the application's memory consumption - less memory pressure means less garbage collection.
- Try a throughput collector with adaptive sizing turned on.
- Use -XX:+PrintHeapAtGC to get details of what is being collected.
- -verbose:gc and -XX:+PrintGCDetails are useful for monitoring garbage collection.
- If data is shared between multiple objects, point them all to the same instance instead of each having their own copy.
- Sharing connections in thread pools can result in m * n cached connection objects if you're using thread locals.
- Fewer threads with asynchronous I/O lowers resource costs (at the expense of CPU).
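
The autoboxing tip above suggests a plain array of primitives when memory is tight. A minimal sketch of that idea follows; `IntList` is a hypothetical class written for this example, not a library type. It stores values directly in an `int[]`, so no `Integer` wrapper object is created per element.

```java
import java.util.Arrays;

// Hypothetical growable list of ints backed by a plain int[]: no Integer objects.
final class IntList {
    private int[] values = new int[8];
    private int size;

    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // grow geometrically
        }
        values[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }

    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += values[i];
        }
        return total;
    }
}
```

Each element costs 4 bytes in the backing array, whereas a `List<Integer>` holds a pointer per element plus a separate boxed object with its own header. The exact saving depends on the JVM and on whether compressed pointers are in use.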
The Definitive Set of HotSpot Performance Command-line Options (Page last updated October 2011, Added 2012-02-28, Author Dustin Marx, Publisher marxsoftware). Tips:
- Use -XX:+PrintGCDetails for monitoring garbage collection, and -XX:+PrintGCTimeStamps or -XX:+PrintGCDateStamps for longer running applications.
- Use -XX:+PrintReferenceGC when you need garbage collection information when using reference objects such as WeakReference and SoftReference.
- Set heap space sizes with -Xmx, -Xms, -XX:NewSize, -XX:MaxNewSize, -XX:NewRatio, -XX:PermSize and -XX:MaxPermSize.
- Use -Xshare:on to enable class data sharing and -client for faster startup or -server for overall better speed. -XX:+TieredCompilation enables a "JIT compilation policy" similar to that used for -client for rapid startup time.
- Start with ParallelOld/Parallel (-XX:+UseParallelOldGC/-XX:+UseParallelGC) GC first with -XX:+UseAdaptiveSizePolicy and -XX:+PrintAdaptiveSizePolicy, and then move to CMS (-XX:+UseConcMarkSweepGC which uses -XX:+UseParNewGC automatically) or G1 (-XX:+UseG1GC with -XX:MaxGCPauseMillis=.. to set the target time) if latency requirements are not met. -XX:ParallelGCThreads can be used to specify the number of parallel garbage collection threads to use and -XX:ParallelCMSThreads specifies the number of parallel CMS threads.
- When -XX:+PrintReferenceGC output shows a high Reference reclamation time, enable -XX:+ParallelRefProcEnabled.
- Specify the survival time of a soft reference after the last strong reference to the object has been collected using -XX:SoftRefLRUPolicyMSPerMB - smaller values mean more aggressive collection.
- -XX:+ScavengeBeforeFullGC should be on (it is by default, but some people disable it)
- -XX:+DisableExplicitGC, -XX:+ExplicitGCInvokesConcurrent and -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses can be used when explicit garbage collection System.gc() is used.
- For fine tuning the young generation: -XX:+PrintTenuringDistribution is very useful for monitoring; -XX:SurvivorRatio sets the ratio of eden space size to the size of each survivor space; -XX:TargetSurvivorRatio sets the target survivor space occupancy (as a percentage) after a minor garbage collection; and set -XX:MaxTenuringThreshold too high rather than too low to avoid a full GC.
- -XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCApplicationConcurrentTime are useful for tracking down latency induced into the application as a result of JVM safepoint operations.
- Other flags that affect performance include -XX:+UseCompressedOops, -XX:+UseLargePages, -XX:LargePageSizeInBytes, -XX:+UseNUMA, -XX:+AggressiveOpts, -XX:+AggressiveHeap, -XX:+UseBiasedLocking, -XX:+DoEscapeAnalysis, -XX:+AlwaysPreTouch.
- Useful for monitoring are -XX:+PrintCommandLineFlags and -XX:+PrintFlagsFinal.
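
The monitoring flags above report from the command line. As a complement that the source article does not cover, the startup flags and per-collector totals can also be read in-process through the standard `java.lang.management` API. The sketch below is illustrative only: the `JvmReport` class and its method names are hypothetical, while the MXBean calls are part of the JDK.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports JVM startup flags and GC activity.
final class JvmReport {
    // Arguments the JVM was started with, such as -Xmx or -XX: options.
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // One line per collector: name, number of collections, accumulated time in ms.
    static List<String> collectorSummaries() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }
}
```

The collector names returned depend on which collector was selected at startup, so they give a quick way to confirm that a flag such as -XX:+UseConcMarkSweepGC actually took effect.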
5 Things That Are Toxic to Scalability (Page last updated August 2011, Added 2012-02-28, Author Sean Hull, Publisher iheavy). Tips:
- Avoid using Object-Relational Mappers (ORM); ORM-generated SQL queries are often complex queries that the database cannot optimize well, and they are hard to tweak, which slows down the tuning process.
- Locks are like stop signs; non-locking solutions are usually faster and more scalable. Row level locking is better than table level locking. Use asynchronous replication and "eventual consistency" for clusters.
- A single database is a bottleneck. Use parallel databases and let a driver select between them.
- Monitor your system and collect appropriate metrics. Include low level system CPU, memory, disk and network usage as well as database level activity like buffer pool, transaction log, locking, sorting, temp table and queries per second activity.
- Build to be able to turn features off via a flag so when a spike hits features can be turned off to reduce load.
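
The last tip, turning features off via a flag when a spike hits, can be sketched as a small in-process switchboard. This is an assumption for illustration, not the article's design: `FeatureFlags`, `set`, `isEnabled` and the feature name used in the usage note are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical runtime switchboard: features default to on and can be turned off under load.
final class FeatureFlags {
    private final ConcurrentMap<String, Boolean> flags = new ConcurrentHashMap<>();

    // Turn a feature on or off at runtime; safe to call from any thread.
    void set(String feature, boolean enabled) {
        flags.put(feature, enabled);
    }

    // Features that were never configured are treated as enabled.
    boolean isEnabled(String feature) {
        Boolean enabled = flags.get(feature);
        return enabled == null || enabled;
    }
}
```

A request handler would check something like `flags.isEnabled("recommendations")` before doing the optional work, and an operator or a load monitor would call `set("recommendations", false)` during a spike. How the switch is exposed (admin endpoint, config reload, JMX) is left open here.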
Last Updated: 2017-10-01
Copyright © 2000-2017 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.