Java Performance Tuning
Tips September 2014
Building Systems That Perform : Looking For A Needle In A Haystack (Page last updated August 2014, Added 2014-09-29, Author Dr. Leonid Grinshpan, Publisher Practical Performance Analyst). Tips:
- The number of tuning options across the OS, JVM and other middleware is in the tens of thousands. Simplify your options by focusing on the nodes of the system and how they interact with each other to process requests.
- There are two high level reasons for bottlenecks: a node has insufficient capacity (making other requests queue); or a shared resource is too busy serving other nodes to serve the requesting node, which must then wait to acquire it. These can be summarised as a shortage of resource capacity and limited access to a resource.
- Monitor all resources (hardware and software) that are involved in processing requests including: CPU, memory, disk space, network load, software threads, connection pools, locks, semaphores, etc.
- To identify bottlenecks, find where requests are queued waiting for a node to process them (the node could be busy processing another request, or waiting to acquire a resource it needs in order to proceed).
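One way to see queuing at a node from inside a JVM is to sample the backlog of the executor that serves it. The following is a minimal sketch, not a monitoring tool: the class name QueueDepthProbe and the method backlogAfterSubmitting are hypothetical, and the "node" is simulated by a single worker thread whose tasks block until released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: a one-worker "node" whose request backlog we sample.
class QueueDepthProbe {

    /** Submits `requests` blocking tasks to a one-worker pool and returns the backlog observed. */
    static int backlogAfterSubmitting(int requests) {
        ThreadPoolExecutor node = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < requests; i++) {
                node.execute(() -> {
                    try {
                        release.await(); // simulate a node that is busy or waiting on a resource
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The first task is handed straight to the worker; the rest wait in the queue.
            return node.getQueue().size();
        } finally {
            release.countDown(); // let the simulated work finish
            node.shutdown();     // queued tasks still run, then the worker exits
        }
    }

    public static void main(String[] args) {
        System.out.println("queued requests: " + backlogAfterSubmitting(5));
    }
}
```

With one worker and five submissions, four requests sit in the queue. In a real system a persistently non-zero queue size on one node, sampled over time, is the kind of signal the tips above suggest looking for.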
Understanding Application Performance on the Network - Part VII: TCP Window Size (Page last updated August 2014, Added 2014-09-29, Author Gary Kaiser, Publisher Compuware). Tips:
- Turn off Nagle's algorithm (set TCP_NODELAY).
- If the bytes in flight (bandwidth in bytes per second x round-trip time) exceed the TCP window size, the TCP session cannot use all of the available bandwidth; instead, throughput is limited by the receive window.
- TCP window size can be varied but firewalls, load balancers and server configurations may limit this; consequently you need to pay attention to the TCP window size when considering the performance of applications that transfer large amounts of data.
- Large data transfers across a 4000km distance (with typical round-trip times of 85ms) on a 20Mbps link using the default TCP window size of 64k would be limited to an achievable throughput of about 6Mbps, less than a third of the link's full capacity. To use the full link capacity, the TCP window size would need to be increased.
- When data is sent continuously, throughput is limited by the available bandwidth and the TCP window size. When data is sent in many small chunks, the TCP window size is unlikely to be the limit; instead throughput is likely to be limited by the application's data handling and the link latency.
- To optimise large data transfers, consider: adjusting the TCP window size; relocating the server to reduce latency; using a more optimal network path (closer/straighter/higher bandwidth); ensuring the TCP stack is not CPU bound.
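The window arithmetic in the tips above can be checked with a few lines of code. This is a sketch of the two formulas only; the class and method names are hypothetical, and "64k" is assumed to mean the 65,535-byte maximum of an unscaled TCP window.

```java
// Worked arithmetic for window-limited throughput and the bandwidth-delay product.
class TcpWindowMath {

    /** Maximum throughput in bits/second when at most windowBytes can be in flight per round trip. */
    static double windowLimitedBitsPerSecond(long windowBytes, double rttSeconds) {
        return windowBytes * 8 / rttSeconds;
    }

    /** Bandwidth-delay product: the bytes that must be in flight to keep the link full. */
    static double bandwidthDelayProductBytes(double linkBitsPerSecond, double rttSeconds) {
        return linkBitsPerSecond / 8 * rttSeconds;
    }

    public static void main(String[] args) {
        double rttSeconds = 0.085;        // 85ms round trip, from the example above
        long windowBytes = 65_535;        // assumption: "64k" is the unscaled maximum window
        double linkBitsPerSecond = 20e6;  // 20Mbps link
        System.out.printf("window-limited throughput: %.2f Mbps%n",
                windowLimitedBitsPerSecond(windowBytes, rttSeconds) / 1e6);
        System.out.printf("window needed to fill the link: %.0f bytes%n",
                bandwidthDelayProductBytes(linkBitsPerSecond, rttSeconds));
    }
}
```

Under those assumptions the window caps throughput at roughly 6.17Mbps, and filling the 20Mbps link needs about 212,500 bytes in flight, a little over three times the default window. In Java a larger receive buffer can be requested with Socket.setReceiveBufferSize(), which the platform treats as a hint, and Nagle's algorithm is disabled with Socket.setTcpNoDelay(true).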
How to use ConcurrentHashMap in Java (Page last updated February 2013, Added 2014-09-29, Author Javin Paul, Publisher Javarevisited). Tips:
- ConcurrentHashMap is a thread-safe alternative to HashMap - before ConcurrentHashMap was available you'd need to use Hashtable or a synchronized Map wrapper around one of the non-thread-safe maps.
- ConcurrentHashMap provides better performance than Hashtable and synchronizedMap wrappers where multi-threaded access and updates occur.
- ConcurrentHashMap only locks a portion of the Map instead of the whole Map, whereas Hashtable and synchronizedMaps lock the whole table on access and update. ConcurrentHashMap thus allows concurrent read operations without blocking threads.
- By default ConcurrentHashMap is divided into 16 segments, each governed with a different lock, allowing up to 16 threads to operate simultaneously (you can change the number of segments, but note that it's only relevant for the number of ACTIVE threads operating on the map, not the total number of simultaneously running threads - 16 is usually enough for most applications).
- Because ConcurrentHashMap is concurrent, it's possible that a read might not access the very latest state of the map; the ultimate state is correct but individual reads and writes are not guaranteed to be ordered.
- ConcurrentHashMap.putIfAbsent() allows you to atomically write to the table to avoid write races across threads.
- ConcurrentHashMap is suitable when you have multiple readers and few writers. If writers outnumber readers then the performance of ConcurrentHashMap effectively will be similar to a synchronized map or Hashtable as locking will dominate.
- ConcurrentHashMap is a good choice for caches.
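As an illustration of the putIfAbsent() and cache tips above, here is a minimal sketch of a cache in which the first value installed for a key wins. The wrapper class PutIfAbsentCache and its method names are hypothetical; only ConcurrentHashMap and putIfAbsent() come from the JDK.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical cache wrapper; class and method names are illustrative only.
class PutIfAbsentCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the cached value for key, installing candidate only if no value is present yet. */
    String getOrInstall(String key, String candidate) {
        // putIfAbsent is a single atomic check-then-write: it returns the previous
        // value, or null if this call installed the candidate.
        String existing = cache.putIfAbsent(key, candidate);
        return existing != null ? existing : candidate;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        PutIfAbsentCache c = new PutIfAbsentCache();
        System.out.println(c.getOrInstall("user:1", "first"));
        System.out.println(c.getOrInstall("user:1", "second")); // still "first"
    }
}
```

Because the check and the write happen in one atomic call, two threads racing to populate the same key both end up using the same value, without any external synchronization.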
8 Reasons Big Data Projects Fail (Page last updated August 2014, Added 2014-09-29, Author Matt Asay, Publisher Information Week). Tips:
- Start your big data project with a small targeted case, using technology specifically engineered to handle big data.
- Your big data project needs to start from the business knowledge requirements, not the computing or maths capabilities. Ask business questions, then have IT answer them.
- Ensure you do capacity planning for big data projects, covering the network as well as the servers.
- The big data projects should be core to how the company uses the data, or they will become disjoint. They should span across the company departments.
- Nearly all significant big data technology is open source. Start a project, expect it to fail and learn from that.
Parallel-lazy Performance: Java 8 vs Scala vs GS Collections (Page last updated July 2014, Added 2014-09-29, Author Craig Motlin, Publisher InfoQ). Tips:
- GSCollections 5 supports lambdas and has some stream-like operations which can filter as well as accumulate. E.g. FastList.count() takes a lambda to filter but counts the true results directly without passing elements downstream to a counter.
- Java 8 filter operations are lazy: they don't process until a downstream operation requires evaluation. GSCollections operations need a .asLazy() call to operate lazily.
- GSCollections have a .asParallel() method to execute in parallel, which requires the executor service (thread pool) and batch size as arguments (so more flexible than JDK8 Stream.parallel()).
- Java 8 uses lazy evaluation and builds a pipeline to execute operations, but doesn't actually execute the pipeline until a terminal operation requires it. This can be very efficient. But in the case of count() the JDK 8 pipeline executes three lambdas whereas the equivalent GSCollection only needs to execute one.
- Java 8 Stream.collect() does a binary merge into a single list, whereas the GSCollection equivalent just builds a composite list of the partial results. This is much faster but adds overhead to accessing elements.
- Specialised data structures for combining the results of operations without copying (ie by creating composite data structures) can be very efficient.
- Fork-join has merging costs. The merging strategy makes a significant difference in performance and scaling success. For example merging maps with lots of keys is much more expensive than merging maps with a few keys. OTOH ConcurrentHashMap is efficient when updating from many threads to many different keys, but inefficient when there are few keys.
- A good starting point for choosing a batch size is the total problem size divided by (8 x number of cores).
- CPU bound thread pools should be sized to roughly the number of cores available for processing the threads; IO bound thread pools should be sized to the number of threads capable of accessing the resource concurrently (eg the number of connections available for a database).
- Runtime.availableProcessors() doesn't return the number of cores on all machines (sometimes you get the number of sockets or hardware threads).
- For the most efficient parallel processing you want to have enough work spread to use all available cores without leaving some cores idle.
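The batch-size and pool-sizing tips above can be sketched with plain JDK executors. This is an illustration under stated assumptions, not the article's benchmark code: the class BatchedCount, its methods, and the "count even numbers" workload are all hypothetical, and the pool is simply sized to the core count passed in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example of a CPU-bound job split into batches on a core-sized pool.
class BatchedCount {

    /** Starting-point heuristic: total size / (8 x cores), never less than 1. */
    static int batchSize(int totalSize, int cores) {
        return Math.max(1, totalSize / (8 * cores));
    }

    /** Counts the even numbers in data, one task per batch. Assumes cores >= 1. */
    static long countEvens(int[] data, int cores) {
        int batch = batchSize(data.length, cores);
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += batch) {
                final int from = start;
                final int to = Math.min(data.length, start + batch);
                Callable<Long> task = () -> {
                    long n = 0;
                    for (int i = from; i < to; i++) {
                        if (data[i] % 2 == 0) n++;
                    }
                    return n;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // cheap merge: just sum the partial counts
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors(); // may not equal physical cores
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println("batch size: " + batchSize(data.length, cores));
        System.out.println("even count: " + countEvens(data, cores));
    }
}
```

Having roughly eight batches per core gives idle threads something to pick up when some batches finish early, while summing counts keeps the merge cost trivial compared with merging lists or maps.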
Cache coherency primer (Page last updated July 2014, Added 2014-09-29, Author Fabian Giesen, Publisher The ryg blog). Tips:
- [Excellent basic article on understanding CPU caches].
- Caches are organized into "lines", corresponding to aligned blocks of either 32, 64 or 128 bytes. Lines are mapped from/to main memory, so when accessing/updating a memory location, if you access/update data close to that, you gain efficiencies since the line is read/written all together (ie several application level memory read/write ops could effectively result in one system memory read/write op if the memory is all in one line).
- Caches can allow multiple writes to the same memory location in the cache to collapse into one write to the main memory if this doesn't violate any coherency.
- Because there are multiple caches (often one or more per core, per socket, ...) the system needs to maintain cache coherency across caches, ie memory written to in one cache means that if that memory exists in other caches it is no longer valid in those and needs to be refetched. This has consequences for how concurrency operates, especially for concurrent updates. Bear in mind caches operate on lines, so it's not just the data being read/written that gets coherently handled, but the data in the same line has the same overheads as it gets dragged along for the trips.
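To make the "lines" idea concrete, the mapping from addresses to cache lines is simple integer arithmetic. The sketch below is illustrative only: the class CacheLineMath is hypothetical, a 64-byte line is assumed (the tips list 32, 64 or 128), addresses are treated as plain non-negative byte offsets, and real placement depends on the CPU and on how the JVM lays out objects.

```java
// Illustrative arithmetic only; it models addresses as plain byte offsets.
class CacheLineMath {
    static final int LINE_BYTES = 64; // assumption: 64-byte cache lines

    /** Index of the aligned line that contains the given byte address. */
    static long lineIndex(long address) {
        return address / LINE_BYTES;
    }

    /** True when two addresses fall in the same line and so share coherency traffic. */
    static boolean sameLine(long a, long b) {
        return lineIndex(a) == lineIndex(b);
    }

    /** Number of lines touched by `count` contiguous elements of `elementBytes` starting at `base`. */
    static long linesTouched(long base, int elementBytes, int count) {
        if (count <= 0) return 0;
        long first = lineIndex(base);
        long last = lineIndex(base + (long) elementBytes * count - 1);
        return last - first + 1;
    }

    public static void main(String[] args) {
        System.out.println("offsets 0 and 56 share a line: " + sameLine(0, 56));
        System.out.println("lines for 16 longs from offset 0: " + linesTouched(0, 8, 16));
    }
}
```

Sixteen contiguous 8-byte values starting on a line boundary touch only two lines, which is why sequential access is cheap. Conversely, two counters 56 bytes apart sit in the same line, so threads updating them on different cores would invalidate each other's copy of that line even though they never touch the same variable.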
Last Updated: 2018-12-26
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss