Java Performance Tuning
Java(TM) - see bottom of page
Tips October 2009
Back to newsletter 107 contents
Meet Alex Miller, Author of the Core Java Concurrency Refcard (Page last updated July 2009, Added 2009-10-29, Author James Sugrue, Publisher DZone). Tips:
- The most common concurrency bugs relate to issues of visibility. If fields are changed outside synchronization, those changes are simply not guaranteed to be seen in any other thread, ever.
- Concurrency bugs are tricky as programs may appear to work fine on some machines or most of the time, but will fail in potentially confusing ways when run under load on server-class machines.
- Shared mutable instances of classes like SimpleDateFormat or GregorianCalendar are very tempting to construct, save in a static, and reuse. However, they contain mutable state that is modified while performing calculations - if multiple threads use the same instance without locking, incorrect answers will inevitably result.
- Concurrency is really all about managing access (with locks) to shared, mutable state. Start from the data - decide which data is shared and how access to that data will be protected.
- If at all possible encapsulate locks with your data; for example use the excellent thread-safe collections in java.util.concurrent.
- When dealing with shared, mutable state, consider ways to NOT share or NOT mutate.
- You can avoid sharing by using data that is thread-confined or using ThreadLocal.
- You can avoid mutating by making objects immutable and following more functional programming techniques.
- Test on real hardware with real loads. Use profilers, thread dumps, Visual VM, and other tools to verify your assumptions and find your bottlenecks.
- Use Locks, the various thread coordination classes like CyclicBarrier and CountDownLatch, and Executors. Any time you are explicitly creating Threads, or using Thread.join() or wait()/notify(), there are probably better alternatives in the concurrency library.
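The tips on SimpleDateFormat, ThreadLocal, and the coordination classes can be sketched together as follows. This is a minimal illustration rather than code from the interview: the class name `PerThreadFormatter`, its methods, the pool size, and the date pattern are all assumptions made for the example, and `ThreadLocal.withInitial` needs Java 8 or later.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical example class: each thread gets its own SimpleDateFormat
// instead of sharing one mutable instance held in a static field.
class PerThreadFormatter {
    // SimpleDateFormat is not thread-safe, so it is confined to one thread each.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    static String format(Date date) {
        return FORMAT.get().format(date);
    }

    // Formats the same date on pooled threads, coordinating with an Executor
    // and a CountDownLatch rather than explicit Threads and Thread.join().
    static List<String> formatConcurrently(Date date, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CountDownLatch done = new CountDownLatch(tasks);
        List<String> results = new CopyOnWriteArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        results.add(format(date));
                    } finally {
                        done.countDown(); // always signal, even on failure
                    }
                });
            }
            done.await(); // wait until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

Calling `PerThreadFormatter.formatConcurrently(new Date(0L), 50)` should return 50 identical strings, because no formatter instance is ever touched by two threads at once.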
Performance Improvement: Understanding (Page last updated June 2009, Added 2009-10-29, Author Robert Bogue, Publisher Developer.com). Tips:
- Load balancing improves reliability and scalability; clustering improves fault tolerance.
- In sticky sessions mode, the load balancer ensures that a user who starts on one web server stays on that web server for their entire browsing experience, except when that web server fails. This makes caching data on a per-user basis simpler and more efficient overall, as the caches do not need to be kept synchronized across servers.
- There are four primary bottlenecks: CPU, Memory, Disk, and Network (or Communications). When building a system, consider the impact on each of these resources and test your system while monitoring these resources.
- Watch out for single threading - any time you max out a single CPU in the system, you've got a problem.
- Monitor overall CPU utilization - 100% utilization for one second isn't a problem; 100% utilization for fifteen minutes definitely is.
- One way to monitor memory is to monitor paging - if this goes above background levels for any significant period, memory is under stress.
- Minimizing (or eliminating) the paging file is a valid performance improvement if you have sufficient real memory for everything.
- Monitor the disk queueing to identify whether there is a disk I/O problem.
- Resolve memory issues before considering disk performance. In low memory situations the disks are used as virtual memory, so disk measurements will be difficult to interpret or misleading.
- If there's a problem with the network interface card you'll likely see it with the Output Queue Length counter. This counter should be less than 10. However statistics from network interface cards are notoriously bad so you may not be able to trust the numbers that you're getting back from this counter.
- Load tests create artificial load on the system to measure the performance, the scalability, or determine the most likely points where the system will break.
- Load tests are supposed to simulate user behavior. If you get the workloads (behavior patterns) wrong, or the balance between workloads wrong, you end up with a test that isn't valid.
- Stress testing aims at breaking the system. Stress tests are designed to make the system work as hard as possible to see how it breaks.
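The distinction between a brief spike and sustained saturation (one second at 100% versus fifteen minutes) can be expressed as a sliding-window check. The sketch below is only an illustration: the class `SustainedLoadDetector`, the threshold, and the window size are assumptions, and the samples would have to come from whatever monitoring tool you already use.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: flags a resource as saturated only when every sample
// in the most recent window is at or above the threshold.
class SustainedLoadDetector {
    private final double threshold;
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    SustainedLoadDetector(double threshold, int windowSize) {
        this.threshold = threshold;
        this.windowSize = windowSize;
    }

    // Records one utilization sample (0.0 to 1.0) and returns true when the
    // window is full and no sample in it falls below the threshold.
    boolean record(double utilization) {
        window.addLast(utilization);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent samples
        }
        if (window.size() < windowSize) {
            return false; // not enough history yet
        }
        for (double u : window) {
            if (u < threshold) {
                return false; // a dip inside the window means a spike, not saturation
            }
        }
        return true;
    }
}
```

With one sample per minute and a window of 15, a single busy minute never triggers the detector, while fifteen consecutive busy minutes do.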
Mixing long and short lived objects (Page last updated September 2009, Added 2009-10-29, Author Kirk Pepperdine, Publisher pepperdine). Tips:
- Session states tend to keep objects alive - make sure you minimize the amount of space these longer lived objects take as the young generation garbage collector cannot eliminate them.
- You can tell if your system is retaining a lot of objects by turning on -XX:+PrintTenuringDistribution and scanning the logs to see how much space the objects of each age occupy.
- You can reduce the heap space taken by long lived objects by storing the data externally to the JVM, such as in a database.
The perils of negative scalability (Page last updated September 2009, Added 2009-10-29, Author David Dice, Publisher Sun). Tips:
- If communication overheads start to dominate, the scalability of the application tends to fall.
- If you tune performance at only one set of concurrency levels (such as high concurrency), you can make performance worse at other levels (such as low concurrency). Ideally, such tuning changes would adapt to the level of concurrency, so that performance is not made worse at any level.
- Be careful to measure the performance of any proposed change over a wide range of concurrency values, and not just at the extremes.
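One way to follow the advice about measuring across a range of concurrency values is a small sweep harness like the one below. It is a rough sketch under stated assumptions: the class `ConcurrencySweep`, its method names, and the fixed-duration approach are invented for illustration, and it does no JIT warm-up, so the numbers are only indicative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical harness: runs the same work at several thread counts and
// reports how many operations completed at each level.
class ConcurrencySweep {
    // Runs `work` on `threads` threads for roughly `millis` milliseconds and
    // returns the total number of completed operations.
    static long measure(Runnable work, int threads, long millis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        LongAdder completed = new LongAdder();
        CountDownLatch done = new CountDownLatch(threads);
        long deadline = System.nanoTime() + millis * 1_000_000L;
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    do { // each thread performs at least one operation
                        work.run();
                        completed.increment();
                    } while (System.nanoTime() - deadline < 0);
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return completed.sum();
    }

    // Measures every concurrency level in order, not just the extremes.
    static Map<Integer, Long> sweep(Runnable work, int[] levels, long millisPerLevel) {
        Map<Integer, Long> opsByLevel = new LinkedHashMap<>();
        for (int threads : levels) {
            opsByLevel.put(threads, measure(work, threads, millisPerLevel));
        }
        return opsByLevel;
    }
}
```

Running `sweep` before and after a proposed change, with levels such as 1, 2, 4, 8, and 16, makes it visible when a change helps at one end of the range and hurts at the other. For results you intend to rely on, a dedicated benchmarking harness is a better choice than a hand-rolled loop like this.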
Performance tuning considerations in your application server environment (Page last updated January 2009, Added 2009-10-29, Author Sean Walberg, Publisher IBM). Tips:
- For a simple webserver serving static files, potential bottlenecks are the network bandwidth, webserver cache efficiency, and disk speed.
- For a server serving disk-based data with a request service time of 10ms (largely constrained by having to seek disk heads), throughput would reach about 100 requests per second (1 second / 10ms per request) before the disks become saturated.
- With a multi-tier system, the throughput bottleneck is the tier that has the longest service time per client request.
- Putting more steps between the user and the final page tends to make things slower and decreases system capacity.
- Horizontal scaling can introduce shared state overheads - so two servers typically will not handle double the throughput of one server.
- If the rate of requests entering a particular queue exceeds the rate at which the queue can process requests, then the requests back up. When requests back up, the service time is unpredictable, and your users will be seeing stalled browser sessions.
- Tuning the queues is a balancing act. Too small, and you'll drop requests when you still have capacity; too big, and you'll try to serve more requests than your system can handle, causing poor performance for everyone.
- The recommended approach is to queue requests at the front-end of a system, or at very least prior to the bottleneck component.
- Your application should have some way of providing measurements back to a collection system (even if the collection system is just a log file). This helps you understand when things are running slowly, and which parts of your code are taking the most time.
- Only use sessions where you need them - load the session when the session is needed.
- One approach to session management is to encrypt the session data and send it back to the client in cookies, eliminating the need to store the session on the server. RFC 2109 says browsers should support at least 20 cookies per domain name, each of at least 4K bytes.
- When considering caching, ask yourself "Does this information have to be fresh?" If not, it might be a candidate for caching.
- Be careful of what happens when the cached entry expires or is removed. Ensure that only the first request regenerates the cache entry, and other concurrent requests wait or use a stale item until the new one is available.
- Consider processing requests asynchronously where possible - often even long transactional requests can return control immediately to the user with the transaction ID, with the actual success or failure state of the transaction sent to the user later.
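The tip about regenerating an expired cache entry only once can be sketched with `ConcurrentHashMap.computeIfAbsent`, which applies the loading function at most once per absent key while other callers for that key wait. The class `SingleLoadCache` and its methods are hypothetical names for this illustration, and the sketch covers only the "wait" variant, not serving a stale item.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache: the first request for a missing key runs the loader,
// and concurrent requests for the same key wait for that result.
class SingleLoadCache<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SingleLoadCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // computeIfAbsent is atomic per key, so the loader is not run twice for
    // the same missing entry even under concurrent access.
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    // Removes an entry (for example on expiry); the next get() reloads it once.
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

Because the loader runs while the map holds a lock covering that key, it should be reasonably quick and must not touch the same cache recursively.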
Creating Highly-Scalable Components in Java (Page last updated August 2009, Added 2009-10-29, Author Zhi Gan, Raja Das, Xiao Jun Dai, Publisher InfoQ). Tips:
- ConcurrentHashMap can replace a synchronized Hashtable to make an application more scalable.
- It is difficult to get good scalability when dealing with applications which require frequent communications between subtasks.
- Worth profiling are: contention and locks (Java Lock Monitor and IBM Java Lock Analyzer are two useful tools); execution stacks; OS-level performance.
- Some of the well-known problems introduced by locks are deadlock, livelock, priority inversion, and lock contention. Lock contention tends to reduce the scalability of components and algorithms. Lock-free and wait-free algorithms like those used in the java.util.concurrent classes usually provide superior multi-threaded performance.
- Compare-and-swap (CAS) may seem simple enough, but profiling shows that CAS operations take up a big part of the execution time in some cases, and the ConcurrentLinkedQueue enqueue operation's requirement of two successive successful CASs increases the cost. On modern multiprocessors, even a successful CAS takes an order of magnitude longer to complete than a load or a store, since it requires exclusive ownership and flushing of the processor's write buffers. Optimal implementations can reduce the CAS costs in some cases [Article refers to a more optimal queue implementation for some operations].
- If allocation is too frequent, the thread-local allocation buffer can be exhausted quickly, which can severely impact performance. Using ThreadLocal to hold temporary objects can help reduce the pressure on that buffer, but this may have its own cost, so code should make it easy to switch the technique on or off.
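To make the CAS discussion concrete, here is a minimal lock-free retry loop built on `AtomicLong.compareAndSet`. It is not the queue implementation from the article; the class `LockFreeMax` is a hypothetical example that tracks a running maximum without taking a lock.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: maintains the largest value offered by any thread
// using a compare-and-swap retry loop instead of a lock.
class LockFreeMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    void offer(long candidate) {
        while (true) {
            long current = max.get();
            if (candidate <= current) {
                return; // nothing to update, and no CAS is paid for
            }
            if (max.compareAndSet(current, candidate)) {
                return; // CAS succeeded: our value is now the maximum
            }
            // CAS failed because another thread changed the value; retry.
        }
    }

    long get() {
        return max.get();
    }
}
```

The early return before the CAS reflects the point made above: since even a successful CAS is much more expensive than a plain load, it pays to skip it whenever a read shows no update is needed.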
Last Updated: 2022-06-29
Copyright © 2000-2022 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.