Java Performance Tuning
Tips September 2012
Back to newsletter 142 contents
Statistics for Performance Analysis (Page last updated September 2012, Added 2012-09-28, Author Ben Evans, Peter Lawrey, Publisher Oracle). Tips:
- Microbenchmarking is full of pitfalls. Focus your effort on benchmarking the application tiers rather than on microbenchmarks.
- System.nanoTime() resolution is limited by the combination of operating system and hardware; although it reports in nanoseconds, two measurements taken within the resolution limit produce the same value.
- Times measured in a JVM tend to follow a gamma distribution, which is a bit like a skewed normal distribution with a fat tail. The skew comes from the "noise" (e.g. garbage collection, context switches, etc.) added by JVM activities that slow down some measurements. Deviations in a gamma distribution do not behave as they do in a normal distribution, so stick to using 90th centiles, 99th centiles, etc. rather than standard deviations.
- Focus on measuring overall throughput or end-to-end response times, rather than benchmarking small sections of code.
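As a rough illustration of reporting centiles instead of a mean and standard deviation, the sketch below computes nearest-rank percentiles over a set of timings. The class name `LatencyPercentiles`, the `percentile` helper and the timed loop body are all assumptions made for this example; the loop is a stand-in workload, not a meaningful benchmark.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over timing samples (hypothetical helper, not from the article). */
class LatencyPercentiles {

    /** Returns the smallest sample that is at or above `percent` percent of all samples. */
    static long percentile(long[] samples, int percent) {
        if (samples.length == 0 || percent < 1 || percent > 100) {
            throw new IllegalArgumentException("need samples and 1 <= percent <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percent * n / 100 gives the 1-based nearest rank.
        int rank = (int) (((long) percent * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] nanos = new long[10_000];
        long sink = 0;
        for (int i = 0; i < nanos.length; i++) {
            long start = System.nanoTime();
            sink += Integer.toString(i).hashCode(); // placeholder work
            nanos[i] = System.nanoTime() - start;
        }
        System.out.println("50th centile: " + percentile(nanos, 50) + " ns");
        System.out.println("90th centile: " + percentile(nanos, 90) + " ns");
        System.out.println("99th centile: " + percentile(nanos, 99) + " ns");
        System.out.println("(sink " + sink + ")"); // uses the result so the work is not discarded
    }
}
```

Because individual timings this short may fall below the System.nanoTime() resolution, many samples can come out identical or zero, which is one more reason to measure larger units of work.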
Eleven Tips to Becoming a Better Performance Engineer (Page last updated September 2012, Added 2012-09-28, Author Rebecca Clinard, Publisher JDJ). Tips:
- Performance engineering involves data analysis of resource usage patterns, modeling, capacity planning, and tuning in order to detect, isolate, and alleviate saturation points within a deployment.
- Performance testing creates concurrency conflicts to expose resource competition at a server level - over-utilized resources become bottlenecks.
- Be methodical. Set up the most realistic test scenarios, then do not change even the most minute detail of your test case across a sequence of tests in which you compare other variables. Any deviation within the test case scenario will result in different throughputs, which affect resource usage patterns.
- An architectural diagram of the entire deployment allows you to focus on transaction flows and the resources utilized across the whole system. This is an essential guide to understanding system-wide bottlenecks and where to monitor.
- Tuning is a balancing act: you tune the software servers to take full advantage of hardware resources without overloading them.
- Tune a layer of the environment only when the results are reproducible - otherwise you cannot reliably determine the effects of your changes.
- Tune the layer which showed contention earliest in time - other bottlenecks are often symptoms of that first bottleneck.
- Look for low free resources based on percentages (free threads, free cache, free file descriptors, etc.); this allows you to spot a bottleneck more quickly. When a resource's "free percentage" gets low, keep it on your radar as a possible cause of performance degradation.
- Tuning is a process you repeat until the workload reaches target capacity with acceptable response times.
- Validate that your tuning change actually had the desired effect.
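The "free percentage" idea from the tips above can be sketched in a few lines. Everything here is an assumption for illustration: the class `FreeResourceCheck`, its helpers and the 10% threshold are hypothetical, and a thread pool and the JVM heap are just two resources that happen to be easy to query from standard Java APIs.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Expresses resource headroom as a percentage (hypothetical helper). */
class FreeResourceCheck {

    /** Percentage of `capacity` that is not currently in use. */
    static double freePercent(long inUse, long capacity) {
        if (capacity <= 0 || inUse < 0) {
            throw new IllegalArgumentException("capacity must be > 0 and inUse >= 0");
        }
        long free = Math.max(0, capacity - inUse);
        return 100.0 * free / capacity;
    }

    /** True when the free percentage has dropped below the chosen threshold. */
    static boolean isLow(double freePercent, double thresholdPercent) {
        return freePercent < thresholdPercent;
    }

    public static void main(String[] args) {
        // Fixed-size pool of 8 threads; idle here, so all threads count as free.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8, 8, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        double freeThreads = freePercent(pool.getActiveCount(), pool.getMaximumPoolSize());

        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        double freeHeap = freePercent(usedHeap, rt.maxMemory());

        double threshold = 10.0; // assumed alert level, tune per resource
        System.out.printf("free threads: %.1f%% (low: %b)%n", freeThreads, isLow(freeThreads, threshold));
        System.out.printf("free heap:    %.1f%% (low: %b)%n", freeHeap, isLow(freeHeap, threshold));
        pool.shutdown();
    }
}
```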
Java Memory Profiling Simplified (Page last updated September 2012, Added 2012-09-28, Author Manu PK, Publisher The Object Oriented Life). Tips:
- The JVM memory is divided into 3 parts: heap memory; non-heap memory (per-class structures, constant pools, field and method data, the code for methods and constructors, interned Strings); and other memory.
- Objects are created on the heap, and only references are passed around on the stack. Stack values only exist within the scope of the function they are created in.
- Objects are created in Eden. When the garbage collector runs, an object that is still live is moved to S1 (Survivor Space 1) or S2. Objects that survive in S1/S2 long enough are moved into the Old Gen space.
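One way to see the heap and non-heap areas described above from inside a running JVM is the standard `java.lang.management` API. This is only a sketch: the class `MemoryAreas` and its helper methods are made up for this example, and the pool names printed (Eden, Survivor, Old Gen and so on) depend on the JVM version and the garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

/** Prints heap / non-heap totals and the individual memory pools (hypothetical helper). */
class MemoryAreas {

    static String describe(String label, MemoryUsage usage) {
        return String.format("%-45s used=%,d KB committed=%,d KB",
                label, usage.getUsed() / 1024, usage.getCommitted() / 1024);
    }

    /** Bytes currently used in the heap, as reported by the JVM. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** One line per memory pool, labelled with its type (heap or non-heap) and name. */
    static List<String> poolSummaries() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null) { // getUsage() returns null for a pool that is no longer valid
                lines.add(describe(pool.getType() + " / " + pool.getName(), usage));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println(describe("heap (total)", memory.getHeapMemoryUsage()));
        System.out.println(describe("non-heap (total)", memory.getNonHeapMemoryUsage()));
        for (String line : poolSummaries()) {
            System.out.println(line);
        }
    }
}
```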
Speed as a Feature, Part 2: Frontend Performance (Page last updated August 2012, Added 2012-09-28, Author Chris Kelly, Publisher newrelic). Tips:
- Sample page display times from a large eCommerce website averaged 3.3 seconds, broken down into 0.1s spent in server-side processing, and approximately one second each in a) network transfer time, b) browser-side DOM processing, and c) browser-side rendering. In the author's experience most consumer web applications spend 60-80% of the time within the frontend layer.
- If you use external services loaded into the frontend, ensure that you use the asynchronous implementation, as any synchronous implementation will pause page-loading until it is done. The company Keynote analysed performance data during a Facebook outage and found that several major sites had closely coupled their pages to Facebook synchronous dependencies, such that page-response times became dominated by Facebook response times whenever Facebook response times surged.
- Studies show that users are increasingly likely to abandon a page if it hasn't rendered within three seconds.
- Most browsers will block rendering until all screen CSS has been downloaded - and some block until all stylesheets are downloaded. Send your stylesheets in the smallest form possible, customised for the device (e.g. even smaller and no print stylesheet for mobile clients).
- Use gzip compression and minified text.
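To get a feel for how much gzip can save on text assets such as stylesheets, the following sketch compresses a generated CSS string with the JDK's `java.util.zip` classes and compares sizes. The class `GzipSize` and the synthetic stylesheet are assumptions for illustration; real savings depend on the actual content, and in practice compression is normally configured in the web server or CDN rather than hand-coded.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Compares raw and gzip-compressed sizes of a text payload (hypothetical helper). */
class GzipSize {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // stream is closed (and flushed) by this point
    }

    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Synthetic, repetitive stylesheet standing in for a real one. */
    static String sampleCss() {
        StringBuilder css = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            css.append(".item-").append(i)
               .append(" { margin: 0; padding: 4px 8px; color: #333333; }\n");
        }
        return css.toString();
    }

    public static void main(String[] args) {
        byte[] raw = sampleCss().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```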
Analyzing performance data (Page last updated August 2012, Added 2012-09-28, Author Philip Tellis, Publisher speedawarenessmonth). Tips:
- Page load times often have a log-normal distribution or, when there are two different major types of responses, a bimodal distribution.
- Most people use the term average to refer to the arithmetic mean, but the average could refer to any single number that summarises a dataset, and there are many alternatives to pick from, e.g. the arithmetic mean (sum total/count; skewed by outliers), the median (middle of sorted data points; difficult to calculate for large datasets), the mode (most popular measurement; near-mode values help identify multimodal distributions), the geometric mean (Nth root of the product of all N numbers; useful for log-normal distributions), and the harmonic mean.
- Real user performance data can have a very long tail, and you'll determine how far to the right that tail goes by looking at percentile values, e.g. the 95th, 98th or 99th (e.g. 99th is the value that is at or above 99% of values when sorted).
- The spread/deviation in the data is important to identify, as this is often an important metric to target for tuning - high variability in response times usually indicates a problem and also means a portion of users - or possibly all users at some times - are experiencing unacceptable performance.
- Split the data into two ranges, the "acceptable" range indicating data that is both valid and within acceptable range parameters based on "real-world" expectations (e.g. no page download will be quicker than 50ms, or slower than a couple of minutes); and data that consists of outliers (which may include error data). Analyse these datasets separately as they need different considerations applied.
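The different "averages" listed above, and the split into an acceptable range and outliers, can be sketched as follows. The class `Averages`, the sample page-load times and the 50ms to 120s bounds are assumptions for this example (the upper bound is one reading of "a couple of minutes"). The geometric and harmonic means as written here require strictly positive values.

```java
import java.util.Arrays;

/** Several single-number summaries of a dataset (hypothetical helper). */
class Averages {

    private static void requireNonEmpty(double[] xs) {
        if (xs.length == 0) throw new IllegalArgumentException("empty dataset");
    }

    static double arithmeticMean(double[] xs) {
        requireNonEmpty(xs);
        return Arrays.stream(xs).sum() / xs.length;
    }

    static double median(double[] xs) {
        requireNonEmpty(xs);
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Nth root of the product, computed via logarithms to avoid overflow. */
    static double geometricMean(double[] xs) {
        requireNonEmpty(xs);
        return Math.exp(Arrays.stream(xs).map(Math::log).sum() / xs.length);
    }

    /** N divided by the sum of reciprocals. */
    static double harmonicMean(double[] xs) {
        requireNonEmpty(xs);
        return xs.length / Arrays.stream(xs).map(x -> 1.0 / x).sum();
    }

    /** Keeps only values inside [low, high]; everything else is treated as an outlier. */
    static double[] acceptable(double[] xs, double low, double high) {
        return Arrays.stream(xs).filter(x -> x >= low && x <= high).toArray();
    }

    public static void main(String[] args) {
        // Made-up page load times in milliseconds, including two implausible values.
        double[] loadMs = {12, 850, 900, 1100, 1300, 2400, 95_000, 600_000};
        double[] ok = acceptable(loadMs, 50, 120_000);
        System.out.println("kept " + ok.length + " of " + loadMs.length + " samples");
        System.out.println("arithmetic mean: " + arithmeticMean(ok));
        System.out.println("median:          " + median(ok));
        System.out.println("geometric mean:  " + geometricMean(ok));
        System.out.println("harmonic mean:   " + harmonicMean(ok));
    }
}
```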
Cache Craftiness for Fast Multicore Key-Value Storage (Page last updated April 2012, Added 2012-09-28, Author Yandong Mao, Eddie Kohler, Robert Morris, Publisher ACM). Tips:
- A system with a single storage server often has that storage server as the performance bottleneck, so it needs to be as fast as possible.
- Use different specialized storage systems for different workloads.
- Aim for lookups to use no locks and no writes to global memory, and for updates to lock only locally the region being updated rather than locking more extensively or globally.
- The more local and specific the lock, the more concurrent updates the structure can support.
- The data structure is the key to performance efficiency. Unusually for performance work, where simpler is usually better, complex multi-layered data structures can provide better performance if the layers each target a specific performance bottleneck.
- Optimised in-memory data processing will become dominated by RAM fetch times. To reduce these you need to carefully organise the data structure to work with operating system and hardware memory access patterns, e.g. storing data that are likely to be accessed closely together (in time) in contiguous memory, and pre-fetching such data where they are not contiguously stored.
- The standard high efficiency mechanism to make in-memory data persistent is journaling (writing out changes sequentially, usually in batches, to a log file of changes, ideally per-core logging to a core-specific log file) and checkpointing (asynchronously flushing the in-memory data structure to persistent storage, and removing those journalled logs that are no longer needed).
- One lockless access mechanism: updates change the data item's 'version number'. Access checks the version number before and after the access - if the version number changes during the access, it is retried. If the data is complex, you can split the version number into parts and look up each part separately. Atomic updates do not need a change in version number (since the read is guaranteed to be reading consistent data).
- By maintaining a consistent lock ordering, you can prevent deadlocks.
- By minimizing the overheads of lock management and ensuring that locks are only held for a very short time, it is possible for a spin-lock based highly concurrent-update application to outperform a lock-free (compare-and-swap) implementation.
- To convert complex non-atomic updates into effectively atomic updates, you can arrange the implementation so that the intermediate updates are not visible to any readers until the final step is complete (you must handle concurrent updates in some way).
- The Masstree journalling strategy uses a per-core in-memory log buffer, which a per-core background thread writes in batches or within 200ms to a per-core log file. Log entries are timestamped and contain the information needed to restore storage from the last checkpoint. When restoring from logs, log entries across all per-core logs are sorted by timestamp and applied from the last checkpoint time. Checkpointing works similarly, allowing logs to be reclaimed.
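The version-number read described in the tips above (check a version before and after the access, retry if it changed) has a close analogue in the JDK: the optimistic read mode of `java.util.concurrent.locks.StampedLock`. The sketch below uses that class rather than the paper's own scheme, so it illustrates the idea only. The class `VersionedPoint`, its two fields and the fallback to a blocking read lock after three failed attempts are all choices made for this example.

```java
import java.util.concurrent.locks.StampedLock;

/** Two fields that must always be read as a consistent pair (hypothetical example). */
class VersionedPoint {
    private final StampedLock lock = new StampedLock();
    private long x;
    private long y;

    /** Writer: holding the write lock invalidates any optimistic stamps issued earlier. */
    void set(long newX, long newY) {
        long stamp = lock.writeLock();
        try {
            x = newX;
            y = newY;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    /** Reader: copy the fields without locking, then validate that no write intervened. */
    long[] get() {
        for (int attempt = 0; attempt < 3; attempt++) {
            long stamp = lock.tryOptimisticRead(); // returns 0 while a writer holds the lock
            long curX = x;
            long curY = y;
            if (lock.validate(stamp)) {            // always false for a zero stamp
                return new long[] {curX, curY};
            }
        }
        // Assumed fallback for this sketch: block briefly rather than retry forever.
        long stamp = lock.readLock();
        try {
            return new long[] {x, y};
        } finally {
            lock.unlockRead(stamp);
        }
    }

    /** Runs one writer against one reader and counts reads where x and y disagree. */
    static long countTornReads(int writes) {
        VersionedPoint point = new VersionedPoint();
        Thread writer = new Thread(() -> {
            for (int i = 1; i <= writes; i++) {
                point.set(i, i);
            }
        });
        writer.start();
        long torn = 0;
        while (writer.isAlive()) {
            long[] v = point.get();
            if (v[0] != v[1]) {
                torn++;
            }
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("torn reads: " + countTornReads(1_000_000));
    }
}
```

Readers write nothing to shared state on the optimistic path, which matches the aim of lookups that use no locks and no writes to global memory; only writers pay for exclusive access.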
Last Updated: 2018-10-29
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.