Java Performance Tuning
Tips October 2011
Back to newsletter 131 contents
Effective Java Profiling With Open Source Tools (Page last updated September 2011, Added 2011-10-27, Author Joachim Haagen Skeie, Publisher infoQ). Tips:
- Types of Java issues that need analysing include slow service, JVM crashes, hangs, deadlocks, frequent JVM pauses, sudden or persistent high CPU usage, OutOfMemoryError.
- jmap prints object memory maps or heap memory details of a given process or core file or remote debug server, allowing you to analyse memory usage in a JVM.
- jps lists process IDs of JVMs running on a machine.
- Comparing jmap histograms taken at different times lets you identify which objects are growing in JVM memory.
- VisualVM is a visual tool integrating several command line JDK tools and lightweight profiling capabilities.
- A continuous increase in after-GC memory heap size is usually associated with a memory leak.
- VisualVM has two measuring modes, sampling and profiling. The sampling mode measures by taking samples at regular intervals; the profiling mode uses bytecode instrumentation to measure continuously. Both modes let you profile the application: sampling has lower overhead but less accuracy, while profiling has higher overhead but higher accuracy. Since the profiling mode can impose orders of magnitude more overhead, you are usually better off using the sampling mode.
- VisualVM is not a full-featured profiler, so it is not suitable for running constantly against your production JVM. It doesn't persist data, nor can it specify thresholds and send alerts when those thresholds are breached.
- Useful statistics to gather include: memory usage (heap, non-heap, perm gen, new gen, tenured gen, survivor space); thread counts (and occasional thread stack dumps); CPU load; system load; selected method invocation counts, execution times and wall clock times; SQL calls; disk and network operations.
- BTrace can gather many useful statistics. A BTrace script is a normal Java class containing special annotations that specify just where and how BTrace will instrument your application. BTrace scripts are compiled into standard .class files by the BTrace compiler, btracec.
- EurekaJ is a tool that parses BTrace output and forwards it to a statistics manager that provides views of those statistics.
- You need to know the overheads of different BTrace measurements: Sample-based measurements have a low overhead (assuming the sample frequency is low, e.g. every 5 seconds); Measurements of long-running tasks have little cost, e.g. SQL-queries, network traffic, disk access or the processing being done in a Servlet; never add measurements for methods that are executed within a loop.
- The Eclipse Memory Analyzer is a good tool for performing a post-mortem analysis of a heap dump.
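Several of the statistics listed above (heap and non-heap usage, thread counts, system load) can also be sampled in-process via the standard java.lang.management MXBeans; a minimal sketch (class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

public class StatsSnapshot {
    public static void main(String[] args) {
        // Heap and non-heap memory usage
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        System.out.println("Heap used: " + heap.getUsed() + " / max " + heap.getMax());
        System.out.println("Non-heap used: " + mem.getNonHeapMemoryUsage().getUsed());

        // Live thread count
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("Live threads: " + threads.getThreadCount());

        // System load average (may report -1 on platforms that don't support it)
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.println("System load average: " + os.getSystemLoadAverage());
    }
}
```

Calling these beans periodically from a monitoring thread gives a low-overhead, in-process alternative to attaching a tool for the simplest of these metrics.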
Charlie Hunt on Java Performance Monitoring and Tuning (Page last updated October 2011, Added 2011-10-27, Author Steven Haines, Publisher informIT). Tips:
- The most common Java performance issues are: poor choice of algorithms or data structures; unnecessary object allocation; unnecessary object retention; using unbuffered I/O; poorly tuned JVMs; high lock contention; data structure resizing.
- Monitoring should not make performance worse; it is targeted at finding whether there are bottlenecks. Profiling can make performance worse, as it is targeted at pinpointing a bottleneck that definitely exists.
- Scalability is the application's ability to service additional load while maintaining the same throughput and/or latency.
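The unbuffered I/O issue in the list above is easy to demonstrate; this sketch times byte-by-byte reads of a temporary file with and without a BufferedInputStream (file size and names are arbitrary, and the timings are indicative only, not a rigorous benchmark):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedIoDemo {
    // Reads the stream one byte at a time and returns the elapsed nanoseconds.
    static long timeRead(InputStream in) throws IOException {
        long start = System.nanoTime();
        while (in.read() != -1) { /* one read() call per byte */ }
        in.close();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("io-demo", ".bin");
        Files.write(tmp, new byte[1_000_000]);   // 1 MB of zeros

        // Unbuffered: each read() can reach down to the OS
        long unbuffered = timeRead(new FileInputStream(tmp.toFile()));
        // Buffered: most read() calls are served from an in-memory buffer
        long buffered = timeRead(new BufferedInputStream(new FileInputStream(tmp.toFile())));

        System.out.printf("unbuffered: %d ms, buffered: %d ms%n",
                unbuffered / 1_000_000, buffered / 1_000_000);
        Files.delete(tmp);
    }
}
```

On most systems the buffered variant is dramatically faster, which is why unbuffered I/O makes the common-issues list.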
Measure Java Performance - Sampling or Instrumentation? (Page last updated October 2011, Added 2011-10-27, Author Fabian Lange, Publisher codecentric). Tips:
- Instrumentation profiling is equivalent to adding code to your application and measuring elapsed times for code segments to execute.
- Sample profiling is equivalent to having an external observer examining the running code at regular intervals to see what is running.
- Techniques for instrumenting runtime execution include: Adding timing to the code (e.g. System.currentTimeMillis() or System.nanoTime()) and adding logging to the code (System.out.println or equivalent); Using javax.management beans to record times and querying externally or later; using AOP libraries to inject timing and logging code; using the JVMTI agent to inject timing and logging code.
- Techniques for sampling runtime execution include: Adding a monitoring thread to the application that samples other thread activities at intervals; sampling via JVMTI; sampling by getting regular stack dumps.
- When the CPU executes measuring code instead of real code, this is overhead. This produces errors in measurements, consumes CPU (and sometimes memory) that could otherwise be used by the application, and delays application code from executing.
- With sampling, code that can be safely halted at JVM safepoints (e.g. getters, setters) tends to show up more often than it should, as the JVM needs to reach a safepoint to get the thread state (unless you use underlying JVM APIs that can bypass that safepointing).
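The monitoring-thread sampling technique mentioned above can be sketched with a plain thread taking regular stack snapshots of a worker; the worker's busy loop merely stands in for application code, and all names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class MiniSampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();

        // Worker doing "real" work that we want to sample
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7;   // busy loop standing in for application code
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();

        // Sampler: record the worker's top stack frame at a regular interval
        for (int i = 0; i < 50; i++) {
            StackTraceElement[] stack = worker.getStackTrace();
            if (stack.length > 0) {
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(frame, 1, Integer::sum);
            }
            Thread.sleep(10);                 // the sampling interval
        }

        // Frames that appear most often are where the worker spends its time
        hits.forEach((frame, count) -> System.out.println(frame + ": " + count));
    }
}
```

Note that Thread.getStackTrace() itself forces the target thread to a safepoint, which is exactly the safepoint bias described in the tip above.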
Rapid Bottleneck Identification - A Better Way to do Load Testing (Page last updated February 2010, Added 2011-10-27, Author Oracle, Publisher Oracle). Tips:
- Every application has some bottleneck. Oracle's experience is that the bottleneck is 40% of the time in the application server component; 30% in the network communications; 20% in the database server; 10% of the time in the webserver.
- Performance testing scenarios typically simulate multiple scenarios from different types of users doing different things, all simultaneously. The more complex the activity and the scenarios, the larger the number of bottlenecks the testing generates, which makes bottleneck identification difficult.
- Throughput is the amount of data flowing through a system, measured in hits per second, pages per second, transactions per second, and megabits of data per second.
- Concurrency is the number of users simultaneously using an application (simultaneous connections may also affect concurrency, though idle connections have lower overheads and can be ignored if they consume no significant resources).
- In Oracle's experience, most application performance issues result from limitations in throughput.
- Throughput testing involves hitting key pages and user transactions (with limited delay between requests) to find the page-per-second capacity limit of the various functional components. Pages not achieving the targeted performance need tuning.
- Concurrency testing ramps up the number of users (using realistic page-delay times) at a ramp-up speed slow enough to gather useful data.
- Comparing a 100-user test with 1-second think times against a 1,000-user test with 10-second think times produces two tests with the same throughput (about 100 requests per second) but different concurrency; the 1,000-user test has 10 times the concurrency of the 100-user test.
- Performance testing should focus on throughput first, not concurrency, as this identifies bottlenecks faster.
- Rapid Bottleneck Identification first uses simple test scenarios (probably unrepresentative of full user complexity) focused on throughput to identify the more common bottlenecks first; followed by concurrency testing that more accurately models user behavior to find less common bottlenecks. This is proposed as the most cost-efficient way to eliminate bottlenecks.
- A concurrency test ramp up should proceed slowly, e.g. one user every five seconds.
- Before load testing, basic system-level tests should be run to validate bandwidth, hit rate, and connections supportable.
- Response times are a key metric of overall performance, and should be used to identify if a bottleneck is encountered (degraded response times) and for error detection.
- Test the home page first, then add pages and business functions until the complete real-world activity is simulated. With this gradual testing ramp up, degradations in response times or page throughput will be caused by a newly added step, making it easier to identify which code needs to be investigated.
- Full scenario load testing should accurately reflect what real users do on the site - browsing, searching, registering, logging in, purchasing, etc.; and the steps in those transactions must be performed at the same pace as real-world visitors, with appropriate think times between each step.
Single Writer Principle (Page last updated September 2011, Added 2011-10-27, Author Martin Thompson, Publisher Mechanical Sympathy). Tips:
- The single biggest limitation on scalability is having multiple writers contending.
- The two main approaches for handling multiple writers are: provide mutual exclusion (probably using locks) to the contended resource while the mutation takes place; or take an optimistic strategy and repeatedly swap in the changes until successful (typically using compare-and-swap).
- Locking strategies require an arbitrator to decide access ordering, which can be a very expensive process.
- Threads waiting to enter a critical section must queue, and this queuing effect causes latency to become unpredictable and ultimately restricts throughput.
- Managing cache misses is the single largest limitation to scaling the performance of our current generation of CPUs.
- For highly contended data it is easy to get into a situation whereby the system spends more time managing contention than doing real work.
- To avoid contention, design systems so that any item of data or resource is only mutated by a single writer/thread. Multiple readers are fine to work with that.
- A system with no shared state, made of components that maintain their own state and communicate via message passing, eliminates contention.
- The Single Writer Principle is that for any item of data or resource, that item should be owned by a single execution context for all mutations.
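A minimal sketch of the Single Writer Principle under these assumptions: other threads never mutate the state directly, they send messages over a queue, and a single writer thread applies all mutations, so the writes themselves need no lock or CAS (class and field names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleWriterCounter {
    // Only the writer thread ever mutates 'count'; volatile makes each
    // write visible to any number of reader threads.
    private volatile long count = 0;

    public static void main(String[] args) throws InterruptedException {
        SingleWriterCounter c = new SingleWriterCounter();
        BlockingQueue<Long> inbox = new ArrayBlockingQueue<>(1024);

        // The single writer: drains messages and applies all mutations itself.
        Thread writer = new Thread(() -> {
            try {
                for (;;) {
                    long delta = inbox.take();
                    if (delta == -1) break;   // poison pill to stop
                    c.count += delta;         // uncontended: no lock, no CAS needed
                }
            } catch (InterruptedException ignored) { }
        });
        writer.start();

        // Other threads never write directly; they send messages instead.
        for (int i = 0; i < 100; i++) inbox.put(1L);
        inbox.put(-1L);
        writer.join();

        System.out.println("count = " + c.count);  // readers may read c.count freely
    }
}
```

The queue still involves coordination, but the mutation of the owned state itself is contention-free, which is the point of the principle.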
A Pair of (somebody else's) Concurrency Bugs (Page last updated September 2011, Added 2011-10-27, Author Cliff Click, Publisher Azul). Tips:
- Spinning is usually bad if you don't have enough CPUs. Spin-waiting works when you have more real CPUs than spinning threads, because threads needing to do real work still get on a CPU.
- Using Thread.yield() is not an effective strategy as you are increasing OS-level calls and making the OS scheduler do more work; in the worst case this can mean a system gets swamped doing context switching.
- Multiple spinning threads can dominate a CPU and take huge amounts of cpu-time away from the threads that are actually working.
- There are JVM options to control how yield works, e.g. -XX:+DontYieldALot makes yields get ignored.
- Do not make your concurrency algorithm depend on Thread.yield().
- You cannot make arrays of volatile primitive elements in Java. E.g. 'volatile boolean[] flags' does not define volatile elements; only the array reference itself is volatile, the elements are not (an assignment to an element will remain local until a memory barrier is passed).
- Disruptor runs hard up against the limits of modern hardware, and its performance becomes extremely sensitive to the CPU placement of the working threads; depending on the vagaries of OS scheduling, performance varies by a factor of 3 (depending on whether producer/consumer threads share the same L3 cache).
- Disruptor is intended to work with each thread having a dedicated CPU - and typically on a machine which has spare CPU cores.
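The volatile-array pitfall above can be contrasted with the java.util.concurrent.atomic array classes, which do give per-element volatile semantics (there is no AtomicBooleanArray, so an int array is used here as the boolean stand-in; names are illustrative):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class VolatileArrayPitfall {
    // 'volatile' here applies only to the array REFERENCE, not its elements:
    static volatile boolean[] flags = new boolean[10];

    // Per-element volatile semantics require an atomic array class instead:
    static AtomicIntegerArray atomicFlags = new AtomicIntegerArray(10);

    public static void main(String[] args) {
        flags[3] = true;        // plain write: no memory barrier for the element
        atomicFlags.set(3, 1);  // volatile write: immediately visible to other threads

        System.out.println(atomicFlags.get(3));   // volatile read of the element
    }
}
```

Reading and writing flags[3] from different threads without further synchronization is a data race; the AtomicIntegerArray version is not.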
Last Updated: 2017-10-01
Copyright © 2000-2017 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.