Java Performance Tuning
Tips July 2011
Back to newsletter 128 contents
6 Ways Not to Scale that Will Make You Hip, Popular and Loved By VCs (Page last updated April 2011, Added 2011-07-27, Author Todd Hoff, Publisher highscalability). Tips:
- Recently "hot" and newer tools typically have not been tested for scalability, and are risky to use where you need scalability.
- You should be resource monitoring, performance testing, monitoring traffic, load testing, and doing tuning analysis using statistics and mathematical modelling.
- Design and implement your applications to use parallel programming from the ground up.
- Locks, bottlenecks that force single-threaded execution, and variables and memory with wider scope than needed are all recipes for reduced scalability.
- Avoid: frequently-updated single-row tables; single master queues that control everything; and blocking threads.
- Cache the majority of remote calls; avoid remote comms wherever possible.
- Analyse again and again for single points of failure, and endeavour to eliminate them.
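The tip about caching remote calls can be sketched as follows. This is a minimal illustration, not taken from the article: the class name CachedRemoteLookup is hypothetical, the remote call is represented by a plain Function, and expiry and invalidation are deliberately left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical wrapper: caches the results of a slow remote call by key.
// No expiry or invalidation is shown; a real cache would need both.
class CachedRemoteLookup<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> remoteCall;

    CachedRemoteLookup(Function<K, V> remoteCall) {
        this.remoteCall = remoteCall;
    }

    V get(K key) {
        // computeIfAbsent invokes the remote call at most once per key.
        return cache.computeIfAbsent(key, remoteCall);
    }
}

class CachedRemoteLookupDemo {
    public static void main(String[] args) {
        AtomicInteger remoteCalls = new AtomicInteger();
        CachedRemoteLookup<String, String> lookup = new CachedRemoteLookup<>(key -> {
            remoteCalls.incrementAndGet(); // stands in for a network round trip
            return key.toUpperCase();
        });
        lookup.get("price");
        lookup.get("price");
        System.out.println("remote calls made: " + remoteCalls.get()); // prints 1
    }
}
```

Whether caching is safe depends on how stale a value the caller can tolerate, which this sketch does not address.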
The LMAX Architecture (Page last updated July 2011, Added 2011-07-27, Author Martin Fowler, Publisher martinfowler). Tips:
- Writing concurrent code is very hard; locks and semaphores are hard to reason about and hard to test. Various concurrency models, such as Actors and Software Transactional Memory, aim to make this easier.
- Keeping all data in-memory has two important benefits: it's fast, avoiding IO and transactional behavior; and it simplifies programming - there's no object/relational mapping to do.
- Using an in-memory structure has an important consequence: what happens if everything crashes? One solution is to make the current internal state entirely derivable by processing the input events, and to keep the input event stream in a durable, replayable store. If using this approach, it's probably a good idea to take intermittent snapshots, so that the number of events that need to be replayed is not excessive.
- One approach to fault tolerance is to have multiple "hot" instances running and replicating all processing, but with only one outputting (or with only one output used and the others ignored). Failover is immediate as soon as downstream systems recognize the current primary has stopped outputting results.
- Restarting daily is advisable for intensive systems to avoid any garbage buildup - if you have hot failover capability, you can restart one instance at a time without any system downtime.
- Being able to replay your production inputs into a test system provides the ability to performance test and debug the production system. It also supports running intensive non-realtime tasks (like generating reports) against the equivalent of the production system, with no impact on the production system itself.
- Converting your code to use in-memory data provides fast code. Concentrating on the simple elements of good code (well-factored, with small methods, which work best with HotSpot) can make that code ten times faster. Moving to cache-friendly, minimally garbage-generating code and implementing custom collections optimized for your application gets another order of magnitude improvement.
- Performance tests are important. Even good programmers are very good at constructing performance arguments that end up being wrong, so the best programmers prefer profilers and test cases to speculation.
- External service calls are slow - don't make calls to external services from within the business logic.
- An LMAX Disruptor is a type of queue: a ring buffer with multiple producers and consumers. Each producer and consumer has a sequence counter indicating which slot in the buffer it is currently working on, and all of them can read all counters. A consumer can only process slots that all the consumers it depends on have already finished working on, but it can work concurrently or batchwise on all slots up to there. Each producer/consumer can only write to its own unique data portion (field) within the buffer slot, so there is no race condition; a consumer that depends on a field being written waits until the writer of that field has advanced its counter beyond the slot before accessing the field.
- Using a buffer size that's a power of two lets the compiler do efficient modulus operations to map from sequence counter number to the buffer slot number.
- An initial attempt using an actor model found that the processors spent more time managing queues than doing the real logic of the application. Queue access was a bottleneck.
- These days going to main memory is a very slow operation in CPU terms. CPUs have multiple levels of cache, each of which is significantly faster than main memory. So to increase speed you want to get your code and data into those caches.
- To deal with write contention a queue often uses locks, but if a lock is used, that can cause a context switch to the kernel. When this happens the processor involved is likely to lose the data in its caches. This reduces efficiency.
- To get the fastest speed from hardware caches, you need a design that has only one core writing to any memory location. Multiple readers are fine; processors often use special high-speed links between their caches. Queues fail the one-writer principle.
- When working on a single thread, ensure that you have one thread running on one core. Caches warm up, and as much memory access as possible goes to the caches rather than to main memory. Both the code and the working set of data should be as consistently accessed as possible. Also keeping small objects with code and data together allows them to be swapped between the caches as a unit, simplifying the cache management and again improving performance.
- Creating meaningful performance tests is often harder than developing the production code.
- Mechanical disks are slow for random access, but very fast for streaming, so they are an appropriate target for journaling.
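The sequence-counter and power-of-two tips above can be illustrated with a much simpler structure than the Disruptor itself. The following is a sketch of a single-producer, single-consumer ring buffer, assuming exactly one producer thread and one consumer thread; it is not the LMAX Disruptor API, and the class and field names are hypothetical. Each counter has exactly one writer, and the slot index is derived from the sequence number with a bit mask instead of a modulus.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative single-producer, single-consumer ring buffer.
// This is NOT the LMAX Disruptor API; all names here are hypothetical.
class SpscRing<T> {
    private final Object[] slots;
    private final int mask;                                 // size - 1, valid because size is a power of two
    private final AtomicLong produced = new AtomicLong(0);  // written only by the producer
    private final AtomicLong consumed = new AtomicLong(0);  // written only by the consumer

    SpscRing(int size) {
        if (size <= 0 || Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        slots = new Object[size];
        mask = size - 1;
    }

    // Producer side: returns false when the ring is full.
    boolean offer(T value) {
        long seq = produced.get();
        if (seq - consumed.get() == slots.length) {
            return false;
        }
        slots[(int) (seq & mask)] = value;  // cheap replacement for seq % size
        produced.set(seq + 1);              // publish the slot to the consumer
        return true;
    }

    // Consumer side: returns null when nothing is available.
    @SuppressWarnings("unchecked")
    T poll() {
        long seq = consumed.get();
        if (seq == produced.get()) {
            return null;
        }
        int index = (int) (seq & mask);
        T value = (T) slots[index];
        slots[index] = null;
        consumed.set(seq + 1);              // release the slot back to the producer
        return value;
    }
}
```

Because the producer only ever writes `produced` and the consumer only ever writes `consumed`, neither counter needs a lock or a compare-and-swap. The sketch ignores cache-line padding and batching, both of which matter for real throughput.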
Why Performance Management is easier in public than on-premise clouds (Page last updated May 2011, Added 2011-07-27, Author Michael Kopp, Publisher dynatrace). Tips:
- Measurement inside an OS running on a virtualized server is typically not accurate - it can lag behind real time and speed up to catch up (OSes can also behave like this).
- To drive utilization higher, virtualization and cloud environments overcommit resources - like hotels and airlines overbooking, the virtualized environment assumes not all requested resources will actually be used. The result is the same for hotels, airlines, and virtualized environments - at some point, something will get bumped off the resource. For virtualized environments, this means your app could suddenly be descheduled, with any or all of CPU, memory, disk and network becoming unavailable, because other applications (which you cannot be aware of or monitor) spike.
- Any OS resource used by your application in a virtualized environment can suddenly slow down at random times, due to other virtualized environments temporarily having above-normal resource requirements.
- Even if you get accurate OS level statistics from virtualized environments for intervals, this can't be used to determine the proportion of resources available to you - as you'll never be allocated 100% of any resource.
- For virtualized environments, use application response time and/or throughput rather than any resource utilization to determine performance. This measures what really matters and avoids false resource utilization readings from the virtualized environment. What you then need is the ability to identify within your app why response time and/or throughput has missed targets.
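One way to act on the last tip is to judge performance from recorded response times against a target. The sketch below is illustrative only: the class name ResponseTimeCheck, the nearest-rank percentile method, and the choice of milliseconds are assumptions, not something the article prescribes.

```java
import java.util.Arrays;

// Hypothetical helper: judges performance by measured response times,
// not by OS-level utilization figures.
class ResponseTimeCheck {
    // Returns the given percentile (0 < p <= 100) using the nearest-rank method.
    static long percentile(long[] latenciesMillis, double p) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no measurements recorded");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // True when the chosen percentile is within the response-time target.
    static boolean meetsTarget(long[] latenciesMillis, double p, long targetMillis) {
        return percentile(latenciesMillis, p) <= targetMillis;
    }
}
```

The latencies themselves would be collected by the application, for example by taking the difference of two System.nanoTime() readings around each request.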
Multithreading -- Fear, Uncertainty and Doubt (Page last updated June 2011, Added 2011-07-27, Author S. Lott, Publisher slott-softwarearchitect). Tips:
- Multi-threaded applications are error-prone due to lack of thought about race conditions on shared mutable variables.
- The best kind of lock seems to be a message queue, not a mutex nor a semaphore (and definitely not an RDBMS!).
- Using a message queue to dequeue data, process in a thread with only local data, then enqueue results is elegant, scales, and avoids race conditions.
- Using message queues with atomic gets means that there's no race condition when getting data to start doing useful work; each thread gets a thread-local, thread-safe object; there's no race condition when passing a result on to the next step in a pipeline if that's another queue.
- I/O bound programs rarely benefit from increasing threads.
- Break your problem down into independent parallel tasks and feed them from message queues. Collect the results in message queues.
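The dequeue, process, enqueue pattern described above can be sketched with the JDK's BlockingQueue. This is a minimal illustration, assuming a trivial task (squaring integers) and a hypothetical end-of-stream marker value; inputs must not contain Integer.MIN_VALUE. The class and method names are not from the article.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueuePipeline {
    private static final int STOP = Integer.MIN_VALUE; // hypothetical end-of-stream marker

    // Squares every input using workers that share no mutable state except the queues.
    static long sumOfSquares(int[] inputs, int workers) {
        BlockingQueue<Integer> in = new LinkedBlockingQueue<>();
        BlockingQueue<Long> out = new LinkedBlockingQueue<>();

        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    while (true) {
                        int n = in.take();          // atomic get: no two workers see the same item
                        if (n == STOP) break;
                        long result = (long) n * n; // only local data is touched here
                        out.put(result);            // hand the result to the next stage
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }

        try {
            for (int n : inputs) in.put(n);
            for (int i = 0; i < workers; i++) in.put(STOP); // one marker per worker

            long total = 0;
            for (int i = 0; i < inputs.length; i++) total += out.take();
            for (Thread t : pool) t.join();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting on a queue", e);
        }
    }
}
```

For example, `QueuePipeline.sumOfSquares(new int[] {1, 2, 3, 4}, 2)` returns 30. The workers never touch a shared variable directly, so there is nothing to guard with a mutex.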
A Method for Reducing Contention and Overhead in Worker Queues for Multithreaded Java Applications (Page last updated June 2011, Added 2011-07-27, Author Sathiskumar Palaniappan, Kavitha Varadarajan, and Jayashree Viswanathan, Publisher java.net). Tips:
- Most server applications use a common worker queue and thread pool; a shared worker queue holds short tasks that arrive from remote sources; a pool of threads retrieves tasks from the queue and processes the tasks; threads are blocked on the queue if there is no task to process.
- A feeder queue shared amongst threads is an access bottleneck (from contention) when the number of tasks is high and the task time is very short. The bottleneck gets worse the more cores that are used.
- Solutions available for overcoming contention in accessing a shared queue include: Using lock-free data structures; Using concurrent data structures with multiple locks; Maintaining multiple queues to isolate the contention.
- A queue-per-thread approach eliminates queue access contention, but is not optimal when a queue is emptied while there is unprocessed queued data in other queues. To improve this, idle threads should have the ability to steal work from other queues. To keep contention to a minimum, the 'steal' should be done from the tail of the other queue (whereas normal dequeuing from the thread's own queue is done from the head of the queue).
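The queue-per-thread approach with stealing can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the class name StealingQueues is hypothetical, workers are identified by index, and the victim queue is chosen by a simple round-robin scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical per-worker queues with work stealing.
class StealingQueues<T> {
    private final List<ConcurrentLinkedDeque<T>> queues = new ArrayList<>();

    StealingQueues(int workers) {
        for (int i = 0; i < workers; i++) {
            queues.add(new ConcurrentLinkedDeque<>());
        }
    }

    // Adds a task to the tail of the given worker's own queue.
    void submit(int worker, T task) {
        queues.get(worker).addLast(task);
    }

    // Returns the next task for this worker, or null if every queue is empty.
    T next(int worker) {
        T task = queues.get(worker).pollFirst();    // own queue: take from the head
        if (task != null) {
            return task;
        }
        for (int i = 1; i < queues.size(); i++) {
            int victim = (worker + i) % queues.size();
            task = queues.get(victim).pollLast();   // other queues: steal from the tail
            if (task != null) {
                return task;
            }
        }
        return null;
    }
}
```

Taking from opposite ends means the owner and a thief only collide when a queue is nearly empty. The JDK's ForkJoinPool is a ready-made work-stealing pool and is usually a better starting point than a hand-rolled one.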
Scalable System Design (Page last updated April 2011, Added 2011-07-27, Author Ricky Ho, Publisher DZone). Tips:
- Scalability is about maintaining target performance when load increases.
- To understand the workload, measure: Number of users, Transaction volume, Data volume, Response time, Throughput.
- Architect to scale your system horizontally for easier scalability. Code modularity helps for this.
- Use performance tests to identify bottlenecks; target improvements at the code that is most frequently used.
- A stateless application can scale well across a server farm; use a front end load balancer and dynamically enabled server instances.
- For stateful applications partition data to enable load to be spread across servers.
- Use parallel algorithms, caches, pooled resources, asynchronous processing, approximate results, filtered datasets, optimal data structures, minimal locking, and lock-free structures.
- Geographically distribute content servers to provide localized access.
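The data-partitioning tip for stateful applications can be illustrated with simple hash partitioning. This is a minimal sketch, assuming string keys and a fixed list of server names; the class name HashPartitioner and the server names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical partitioner: maps each key to one server so that
// all state for a given key lives in one place.
class HashPartitioner {
    private final List<String> servers;

    HashPartitioner(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        this.servers = new ArrayList<>(servers);
    }

    String serverFor(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(index);
    }
}
```

With servers `["s0", "s1", "s2"]`, the key `"a"` (hash code 97) maps to `s1`. Plain modulo partitioning remaps most keys when the number of servers changes; consistent hashing is a common alternative when servers are added and removed dynamically.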
Last Updated: 2018-08-27
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss