Java Performance Tuning
Tips March 2016
https://www.nexmo.com/blog/2016/03/25/build-rest-implementation-scales-better-smpp/
How-To Build a REST Implementation that Scales Better than SMPP (Page last updated March 2016, Added 2016-03-29, Author Paul Cook, Publisher Nexmo). Tips:
- Use connection keep-alive (HTTP/1.1 rather than HTTP/1.0) to avoid the overhead of establishing a new socket for each request.
- One of the biggest scaling pitfalls of making REST requests is running your dispatching in a single thread and keeping everything serial. Use an executor/worker-thread design pattern to ensure that multiple requests are in flight at once (a sketch combining this with keep-alive and throttling follows this list).
- In order to go fast, it is necessary to throttle: if you transmit data as fast as you can, you will either exceed the capacity of the other end to respond and keep up, or exhaust the capacity of the pipe in between.
- Detach the receipt and acknowledgement of a request from its execution, so that you are not blocking new requests.
- Queue and process asynchronously procedures that take a long time and/or share resources so that they don't block high priority threads.
- Scale across multiple backends using a load balancer frontend to distribute requests.
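Several of these tips compose naturally: keep-alive, worker-thread dispatch, and throttling. Below is a minimal Java sketch of that combination; the endpoint URL, worker count and in-flight limit are illustrative assumptions, not values from the article.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class ThrottledRestDispatcher {
        private static final int WORKERS = 8;         // parallel dispatch threads
        private static final int MAX_IN_FLIGHT = 32;  // throttle: cap on concurrent requests

        private final ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        private final Semaphore throttle = new Semaphore(MAX_IN_FLIGHT);

        public void submit(String url) {
            pool.execute(() -> {
                try {
                    throttle.acquire();   // don't exceed the far end's capacity
                    try {
                        HttpURLConnection conn =
                                (HttpURLConnection) new URL(url).openConnection();
                        // HTTP/1.1 keep-alive is the default; fully draining and
                        // closing the stream lets the JVM reuse the socket.
                        try (InputStream in = conn.getInputStream()) {
                            while (in.read() != -1) { /* drain */ }
                        }
                        System.out.println(url + " -> " + conn.getResponseCode());
                    } finally {
                        throttle.release();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        public static void main(String[] args) {
            ThrottledRestDispatcher d = new ThrottledRestDispatcher();
            for (int i = 0; i < 100; i++) {
                d.submit("http://example.com/api/ping");  // hypothetical endpoint
            }
            d.pool.shutdown();
        }
    }

The Semaphore is the throttle: submissions queue up in the executor, but no more than MAX_IN_FLIGHT requests are ever on the wire at once.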
http://speedier.net/cspeed/papers/hotnets14.pdf
The Internet at the Speed of Light (Page last updated October 2014, Added 2016-03-29, Author Ankit Singla, Balakrishnan Chandrasekaran, P. Brighten Godfrey, Bruce Maggs, Publisher ACM). Tips:
- For Amazon, a 100 millisecond latency penalty implies a 1% sales loss; for Bing, 500 milliseconds of latency decreases revenue per user by 1.2%.
- For most page downloads, the bandwidth available to the average developed-world consumer is no longer the limiting factor; the other resources involved are.
- Human perception limits imply events separated by less than 30 milliseconds are perceived as the same instant; online game play is noticeably compromised from 50 millisecond latencies.
- If you need extremely low latencies, you have to take account of the mode and route of physical transmission of signals as well as the transport layer. Signals in fibre travel at about 2/3 of the speed of light in a vacuum, and most fibre routes tested were 1.5-2 times longer than the shortest physical path, so a typical good fibre route is already ~3x slower than the theoretical minimum. DNS lookups and TCP handshakes add further large overheads.
- Maintaining a pool of persistent connections over which you send requests eliminates connection setup cost from request latency (sketched in Java below).
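As a minimal sketch of that last tip in Java: HttpURLConnection keeps idle HTTP/1.1 connections cached per destination by default, so fully draining and closing each response lets subsequent requests skip the DNS lookup and TCP handshake. The URL and pool size are placeholders.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class KeepAliveDemo {
        public static void main(String[] args) throws Exception {
            // Size of the JVM's internal keep-alive pool per destination.
            System.setProperty("http.maxConnections", "10");

            URL url = new URL("http://example.com/");   // placeholder endpoint
            for (int i = 0; i < 5; i++) {
                long t0 = System.nanoTime();
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                try (InputStream in = conn.getInputStream()) {
                    while (in.read() != -1) { /* drain so the socket returns to the pool */ }
                }
                // After the first request the DNS lookup and TCP handshake are
                // gone, so later iterations should be visibly faster.
                System.out.printf("request %d: %.1f ms%n",
                                  i, (System.nanoTime() - t0) / 1e6);
            }
        }
    }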
https://blog.logentries.com/2016/03/a-point-of-contention-cache-coherence-on-the-jvm/
False Sharing, Cache Coherence, and the @Contended Annotation on the Java 8 VM (Page last updated March 2016, Added 2016-03-29, Author Chris Mowforth, Publisher logentries). Tips:
- x86-64 machines invalidate memory at the granularity of 64-byte cache lines, and different cores can pull unrelated pieces of data from main memory into the same cache line.
- JOL (Java Object Layout) is a tool for taking the guesswork out of JVM object sizes, internal structures, field layout and packing.
- Fields declared together tend to be laid out next to each other in memory by the JVM - this means that if those fields are being operated on by different threads, there is likely to be cache contention, causing lots of CPU cache invalidation which limits speed and scalability.
- Before Java 8, contended fields had to be separated by padding (unneeded fields declared between them) to force them onto separate cache lines and so avoid cache contention; from Java 8 the @Contended annotation can be used to keep fields on separate cache lines (example after this list).
- Common constructs like counters and queues often fall victim to cache contention in multithreaded environments.
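A minimal sketch of the @Contended tip, assuming Java 8 (where the annotation is sun.misc.Contended) and the -XX:-RestrictContended flag, without which the JVM ignores the annotation outside JDK classes; the iteration counts are arbitrary.

    import sun.misc.Contended;   // moved to jdk.internal.vm.annotation in Java 9

    public class ContendedCounters {
        // Without @Contended these two hot fields would likely share one
        // 64-byte cache line, so each write by one thread invalidates the
        // other thread's cached copy of its own counter.
        @Contended volatile long counterA;
        @Contended volatile long counterB;

        public static void main(String[] args) throws InterruptedException {
            ContendedCounters c = new ContendedCounters();
            Thread a = new Thread(() -> { for (long i = 0; i < 500_000_000L; i++) c.counterA++; });
            Thread b = new Thread(() -> { for (long i = 0; i < 500_000_000L; i++) c.counterB++; });
            long t0 = System.nanoTime();
            a.start(); b.start();
            a.join();  b.join();
            System.out.printf("%.0f ms%n", (System.nanoTime() - t0) / 1e6);
        }
    }

Timing the same run with the annotations removed (and checking both layouts with JOL) shows the cost of the two hot fields sharing a cache line.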
https://vimeo.com/160219051
Using Java Reflection to Debug Performance Issues (Page last updated March 2016, Added 2016-03-29, Author Dr Heinz M. Kabutz, Publisher Vimeo). Tips:
- Optimization methodology: load test; hypothesize what the bottleneck is; test the hypothesis (don't trust the profile); make a change to reduce the bottleneck; test that the change improves the situation; repeat from the initial load test stage until the targets are met.
- Measure the bottleneck; target the layer that is causing the actual bottleneck - optimizing the parts that don't matter has little effect.
- Typical causes of bottlenecks: hardware - CPU, Memory, Disk, Network; JVM - garbage collection, number of threads; Application - lock contention; People - Usage patterns, rates.
- -XX:GCTimeRatio=N specifies a target fraction of N/(N+1) for the execution time of the application threads, relative to total program execution time. The default value for -XX:GCTimeRatio is 99, which targets 99/100 of the time being spent outside GC, ie GC should take up less than 1% of application thread time.
- Fixing algorithms and architectures tends to give much better improvements than fixing code.
- A big difference between "live" object count and "allocated" object count (ie created objects) indicates that you are creating a large number of temporary objects.
- Hashing distribution can be more important than the efficiency of generating a hash code - because collisions are very expensive.
- Java 8 HashMaps handle bad collisions better than previous implementations by converting a bucket's linked nodes into tree nodes when there are lots of elements in the bucket; but only for keys that implement Comparable (otherwise they can't be ordered into a tree).
- Java 8 HashMaps have better worst-case performance than previous implementations, but worse average-case performance for keys with explicitly implemented hash codes (other than Strings).
- Inspect the bucket distribution of your HashMaps; you want to minimize collisions, which you can achieve by changing the hashing (a reflection-based inspection is sketched after this list).
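In the spirit of the talk's title, here is a reflection-based sketch for inspecting a HashMap's bucket distribution. It depends on HotSpot's internal field names ('table' and 'next'), so treat it as a debugging aid only; on Java 9+ it additionally needs --add-opens java.base/java.util=ALL-UNNAMED.

    import java.lang.reflect.Field;
    import java.util.HashMap;

    public class HashMapDistribution {

        static void printDistribution(HashMap<?, ?> map) throws Exception {
            Field tableField = HashMap.class.getDeclaredField("table");
            tableField.setAccessible(true);
            Object[] table = (Object[]) tableField.get(map);
            if (table == null) { System.out.println("empty"); return; }
            Field nextField = Class.forName("java.util.HashMap$Node")
                                   .getDeclaredField("next");
            nextField.setAccessible(true);
            int used = 0, maxChain = 0;
            for (Object head : table) {
                int chain = 0;
                // walk the bucket's chain of nodes
                for (Object n = head; n != null; n = nextField.get(n)) chain++;
                if (chain > 0) used++;
                maxChain = Math.max(maxChain, chain);
            }
            System.out.printf("buckets=%d used=%d longest chain=%d%n",
                              table.length, used, maxChain);
        }

        public static void main(String[] args) throws Exception {
            HashMap<Integer, String> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) map.put(i, "v" + i);
            printDistribution(map);
        }
    }

A longest chain much greater than 1 while most buckets sit empty is the signature of a poorly distributed hash code.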
https://www.youtube.com/watch?v=-6nrhSdu--s
When the OS gets in the way (Page last updated September 2015, Added 2016-03-29, Author Mark Price, Publisher StrangeLoop). Tips:
- To minimize pauses to under 10ms: make all garbage collection pause times fit within the allowable latency or go GC-free; move slow I/O out of the critical path, or upgrade the hardware so the I/O is no longer slow.
- Measure and validate that changes improve performance; use end-to-end tests with realistic loads; change one thing at a time.
- A higher thread priority means the OS scheduler grants longer execution slices.
- Try to fit your queue into the L3 cache.
- To minimize jitter, try to make one critical thread run per core (so only one hyperthread should run on that core), and no other JVM or OS tasks should run on that core.
- BIOS settings are typically set for power saving rather than highest performance - adjust for highest performance.
- lstopo displays hardware hyperthreads, physical cores, NUMA nodes and PCI locality.
- isolcpus (kernel boot parameter) lets you isolate CPU resources from the OS so you can explicitly allocate threads to the isolated CPUs using taskset; the OS won't schedule anything else on them.
- Use a library that calls sched_setaffinity to set your critical threads to run on specific cores (see the sketch after this list).
- cpusets let you create subsets of the hardware (groups of CPUs) and confine processes to them.
- For an OS running one JVM, you want to separate your cores into 3 groups: those that run all non-JVM threads (could use isolcpus to isolate all other threads); those that run all non-critical JVM threads (could use taskset to start the JVM process on those cores); and those that run the critical JVM application threads (would use sched_setaffinity to schedule those onto one of the hyperthreads of each of those cores, leaving the other hyperthread unused). Unfortunately this particular combination of techniques leaves all the non-critical JVM threads running on one hyperthread, as the scheduler no longer distributes across the isolated cores - so use cpusets instead. Eg 'cset --set=/system --cpu=6-9' creates a cpuset with core threads 6-9 called /system; 'cset proc --move --from-set=/ --to-set=/system -k --threads --force' moves all processes into that /system cpuset, including those spawned while being moved (--threads option) and kernel threads (-k option); 'cset --set=/app --cpu=0-5,10-13' creates a cpuset with core threads 0-5 and 10-13 called /app; 'cset proc --exec /app taskset -cp 10-13 java ...' starts the JVM on core threads 10-13; and finally use sched_setaffinity as previously outlined.
- If you have isolated your critical threads to separate core threads, you can use perf_events (a very low overhead performance event sampler) to check that nothing else is scheduled on those core threads. Eg 'perf record -e "sched:sched_switch" -C 3' would sample task switches from core thread 3.
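For the sched_setaffinity tip, a minimal Java sketch, assuming the open-source OpenHFT Java-Thread-Affinity library (net.openhft:affinity) is on the classpath - one library that makes that system call on your behalf; the spin loop is a placeholder for real critical work.

    import net.openhft.affinity.AffinityLock;

    public class PinnedCriticalThread {
        public static void main(String[] args) {
            Thread critical = new Thread(PinnedCriticalThread::run, "critical-thread");
            critical.start();
        }

        static void run() {
            // Binds the current thread to a free CPU (ideally one reserved
            // via isolcpus/cpusets so nothing else is scheduled there).
            AffinityLock lock = AffinityLock.acquireLock();
            try {
                busyWork();   // latency-critical work stays on one core thread
            } finally {
                lock.release();   // unbind so the CPU can be reused
            }
        }

        static void busyWork() {
            long events = 0;                        // placeholder workload
            while (events < 1_000_000_000L) events++;
        }
    }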
https://dzone.com/articles/thread-concurrency-vs-network-asynchronicity
How To Handle Blocking Calls: Thread Concurrency vs. Network Asynchronicity (Page last updated March 2016, Added 2016-03-29, Author Ricardo Almeida, Publisher DZone). Tips:
- In the new reactor model, a single thread may be handling thousands of connections; a single connection executing a blocking call can block all the other connections.
- Use a work queue and thread pool; tune the thread pool to fully utilise the CPUs, but not overload them; move requests that take a long time onto a separate dedicated thread pool (a sketch follows this list).
- Use an HTTP server as a multiplexing point: receive external requests, decompose them into internal requests sent to other processing nodes, then compose the returned results and reply asynchronously. Use keep-alive connections to the processing nodes to reduce overheads and allow pipelining of requests.
- An ideal distributed system based on asynchronous messages would have parallelism, decoupling, failover/redundancy, scalability/load balancing, elasticity (handling peaks), and resiliency.
- The one-thread-per-request model does not scale.
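A minimal sketch of the work-queue-plus-thread-pool tips above: one pool sized to the CPUs with a bounded queue for back-pressure, and a separate dedicated pool so slow blocking calls cannot starve it. The pool sizes, queue length and simulated workloads are illustrative assumptions.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class TwoPoolServer {
        static final int CPUS = Runtime.getRuntime().availableProcessors();

        // fast pool: bounded queue so overload applies back-pressure instead
        // of growing memory without limit
        static final ExecutorService fastPool = new ThreadPoolExecutor(
                CPUS, CPUS, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // dedicated pool for long-running/blocking requests
        static final ExecutorService slowPool = Executors.newFixedThreadPool(4);

        static void handle(Runnable task, boolean expectedSlow) {
            (expectedSlow ? slowPool : fastPool).execute(task);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 20; i++) {
                final int id = i;
                handle(() -> System.out.println("fast " + id), false);
                handle(() -> {                      // simulated blocking call
                    try { Thread.sleep(200); } catch (InterruptedException e) { }
                    System.out.println("slow " + id);
                }, true);
            }
            fastPool.shutdown();
            slowPool.shutdown();
        }
    }

CallerRunsPolicy makes an overloaded server slow the submitter down rather than drop work or grow an unbounded queue.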
Jack Shirazi