Java Performance Tuning
Java(TM) - see bottom of page
Tips July 2016
Back to newsletter 188 contents
State in Scalable Architectures (Page last updated July 2016, Added 2016-07-28, Author Felipe Fernández, Publisher DZone). Tips:
- To understand scalability of data, you need to consider mutability, concurrency, isolation and scope of the data.
- The more globally accessible the data is, the less it can scale naturally.
- Increased scalability is gained from: isolation, avoiding state where possible, handling state specially, immutability, and keeping state handling procedures close to the state.
- A single store is scalable - but potentially slow. Speeding up by caching adds the complexity of needing to keep state synchronized across the system, and the decision of how to keep state synchronized determines how you can scale - basically the level of consistency you want the data to have limits your scaling options.
- A (journaled) log is an ordered set of persistent messages which can be eventually consistent and provides one option for scaling persistence.
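The journaled log in the last tip can be pictured as an append-only list in which every message gets an increasing offset and each reader tracks its own position. The following is a minimal in-memory sketch of that idea only, not a persistent implementation; the class `AppendOnlyLog` and its methods are hypothetical names that do not come from the article.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory stand-in for a journaled log: entries are only
// appended, never updated, so readers can lag behind and catch up later.
class AppendOnlyLog {
    private final List<String> entries = new ArrayList<>();

    // Appends a message and returns its offset (its position in the log).
    synchronized long append(String message) {
        entries.add(message);
        return entries.size() - 1;
    }

    // Returns an unmodifiable copy of every message at or after fromOffset.
    synchronized List<String> readFrom(long fromOffset) {
        int start = (int) Math.min(Math.max(fromOffset, 0), entries.size());
        return Collections.unmodifiableList(
                new ArrayList<>(entries.subList(start, entries.size())));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        log.append("account-created");
        log.append("deposit:100");
        long consumerOffset = 1; // this reader has already processed offset 0
        log.append("withdraw:30");
        // The reader catches up from its own offset, in append order.
        System.out.println(log.readFrom(consumerOffset));
    }
}
```

Because readers only ever see entries that were already appended, a slow reader is behind rather than wrong, which is the eventually consistent behaviour the tip describes.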
Understanding Parallel Stream Performance in Java SE 8 (Page last updated July 2016, Added 2016-07-28, Author Brian Goetz, Publisher InfoQ). Tips:
- Coarse-grained parallelism splits tasks up into naturally (eg per request) parallel tasks and processes each task on a separate thread (or core), aiming for higher throughput. Fine-grained parallelism subdivides any parallelisable subtask, to try and make that subtask run faster across the available cores, aiming for the task to be completed quicker.
- Parallelism is about execution - doing many things at the same time; concurrency is about controlling access to shared resources. Parallelism is not difficult - just add another thread. Concurrency is hard - it's hard to write thread-safe data structures apart from those that just force all access and update to be executed serially (eg synchronize all methods).
- Partitioning the data is the easiest technique for making parallel execution work concurrently.
- Prefer a sequential implementation until parallelism is proven to be effective - a parallel computation will ALWAYS do more work than the best sequential alternative.
- There are three ways to thread-safely handle state: don't share or don't mutate or coordinate access.
- The simplest way to decompose a problem for parallel execution is divide-and-conquer: recursively split the problem until you reach sub-problems that are more efficient to solve sequentially than by splitting; then combine the solutions of the sub-problems to reach the aggregate solution. This uses no shared mutable state.
- Parallelising data processing has costs which can reduce the potential speedup: splitting the data (how easily does it split? arrays split well, linked lists don't); managing the split data; dispatching to different processors; possible data copying (splitting/merging collections is expensive); and non-locality of data on the cores (computing keeps the CPU busy, waiting on cache misses doesn't).
- A rule of thumb is that if N is the number of data items and Q is the amount of work (eg operations) per item, then NQ < 10000 means there will not be a speedup.
- Operations like limit(), skip(), findFirst() are inefficient for parallelism - either avoid them, or call .unordered() if the order of encountering elements is not meaningful (eg with HashSet) or not important, and the stream will optimize those operations correspondingly.
- sum() and max() operations are really efficient to merge, but groupingBy() on a HashMap is very expensive (a lot of copying) and could overwhelm any parallelism advantage.
- If you don't have performance requirements, your implementation is fast enough so optimizing is wasting time and money.
- If you are not measuring the performance, you cannot know whether any change improves it.
- There is one common fork-join pool for all parallel streams. This is deliberate to avoid lots of competing pools.
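The divide-and-conquer tip above can be sketched with the fork-join framework that parallel streams run on. This is an illustrative sketch, not code from the talk: the class name `ParallelSum` and the `SEQUENTIAL_THRESHOLD` value are assumptions, and a real cut-off should come from measurement.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.LongStream;

// Divide-and-conquer sum: split until a chunk is small enough to be cheaper
// to sum sequentially, then combine the partial sums. No shared mutable state.
class ParallelSum extends RecursiveTask<Long> {
    // Assumed cut-off for illustration only; tune it by measuring.
    private static final int SEQUENTIAL_THRESHOLD = 10_000;

    private final long[] data;
    private final int from;
    private final int to; // exclusive

    ParallelSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= SEQUENTIAL_THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = from + (to - from) / 2;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                     // run the left half asynchronously
        long rightSum = right.compute(); // compute the right half in this thread
        return left.join() + rightSum;   // combine the sub-results
    }

    // Runs the task on the common pool, the same pool parallel streams use.
    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = LongStream.rangeClosed(1, 1_000_000).toArray();
        System.out.println(sum(data));                            // hand-written fork-join
        System.out.println(LongStream.of(data).parallel().sum()); // parallel stream equivalent
    }
}
```

An array source is used because arrays split cheaply by index. Summing is a very small amount of work per item, so whether either parallel version beats a plain loop has to be measured.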
14 High-Performance Java Persistence Tips (Page last updated June 2016, Added 2016-07-28, Author Vlad Mihalcea, Publisher vladmihalcea). Tips:
- Validate statement efficiency by logging the statements and their execution times.
- Use a connection pool for any set of database connections, as connections are expensive to create.
- Batch statements using the JDBC batching API; Hibernate supports batching through configuration options.
- Check if the JDBC driver supports statement caching, and use that if applicable.
- Use SEQUENCE identifiers for Hibernate, not IDENTITY nor TABLE.
- The more compact the column type, the more efficient it can be - choose column types carefully.
- For Hibernate relationship mapping, unidirectional associations and @ManyToMany should be avoided. For collections, bidirectional @OneToMany associations should be preferred, but try to avoid collections as they are not easily paginated.
- Object inheritance doesn't work well with relational databases; stay with flat structures.
- Restrict the number of managed entities in the Persistence Context.
- Fetch only what is necessary - tune queries to avoid fetching extra rows and columns.
- Tune the database engine so that the working set resides in memory and is not fetched from disk all the time.
- Transaction isolation level is very important for performance: to avoid lost updates, you should use optimistic locking with detached entities or an EXTENDED Persistence Context; to avoid optimistic locking false positives, you can use versionless optimistic concurrency control or split entities based on write-based property sets.
- Database replication and sharding are good ways to increase throughput.
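The optimistic locking tip above relies on a version check at write time: an update only succeeds if the row still has the version the writer originally read. The sketch below imitates that check against an in-memory map instead of a database. `VersionedStore` and `Row` are hypothetical names, and this is not how Hibernate implements it (there the check is part of the generated SQL for versioned entities).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory table illustrating version-checked updates.
class VersionedStore {
    // Immutable snapshot of a row: its value plus the version it carries.
    static final class Row {
        final String value;
        final long version;

        Row(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Row> rows = new ConcurrentHashMap<>();

    void insert(String id, String value) {
        rows.put(id, new Row(value, 0));
    }

    Row read(String id) {
        return rows.get(id);
    }

    // Succeeds only if the row is still at expectedVersion, similar in spirit
    // to "UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?".
    boolean update(String id, String newValue, long expectedVersion) {
        Row current = rows.get(id);
        if (current == null || current.version != expectedVersion) {
            return false; // someone else changed the row first
        }
        // Atomic swap: fails if another writer replaced the row in between.
        return rows.replace(id, current, new Row(newValue, expectedVersion + 1));
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.insert("42", "draft");
        long seenByAlice = store.read("42").version;
        long seenByBob = store.read("42").version;
        System.out.println(store.update("42", "alice-edit", seenByAlice)); // first writer wins
        System.out.println(store.update("42", "bob-edit", seenByBob));     // stale version is rejected
    }
}
```

The rejected writer has to re-read and retry, which is the price paid for not holding a lock across the whole read-modify-write cycle.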
What the JIT!? Anatomy of the OpenJDK HotSpot VM (Page last updated June 2016, Added 2016-07-28, Author Monica Beckwith, Publisher InfoQ). Tips:
- In Java 8, the "tiered compiler" is the default server compiler. You can select the non-tiered server compiler (called the C2 compiler) by disabling tiered compilation with -XX:-TieredCompilation.
- A particular method or loop is considered performance critical when its method entry and loop-back edge-counters cross the compilation threshold -XX:CompileThreshold (default 1500 for the client/C1 compiler, 10000 for the server/C2 compiler).
- The code cache has a fixed size, and when full, the Java VM will cease method compilation. The default code cache size for tiered compilation in Java 8 is 240MB as opposed to the non-tiered default of 48MB; the size can be set with -XX:ReservedCodeCacheSize.
- Tiered compilation has its own set of thresholds for every level e.g. -XX:Tier3MinInvocationThreshold, -XX:Tier3CompileThreshold, -XX:Tier3BackEdgeThreshold. The minimum invocation threshold at tier 3 is 100 invocations - compared to the non-tiered C1 threshold of 1,500.
- -XX:+PrintCompilation reports when the code cache becomes full and when the compilation stops.
- There are a lot of tuning options for inlining based on size and invocation thresholds, but it is already likely optimized to very near its maximum potential - -XX:+PrintInlining prints the decisions being made.
- Scoping variables to their minimum scope will aid the escape analysis in using registers instead of the heap where possible.
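To watch the flags above in action it helps to have a small program with an obviously hot method. The sketch below is such a program, written for illustration only: `HotLoop`, `Point` and the iteration count are assumptions, and whether the JIT actually removes the allocation depends on the VM version and its inlining decisions.

```java
// Hypothetical micro-example for observing JIT decisions. After compiling,
// possible invocations using the flags from the tips above are:
//   java -XX:+PrintCompilation HotLoop
//   java -XX:-TieredCompilation -XX:+PrintCompilation HotLoop
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining HotLoop
// (PrintInlining is a diagnostic option, hence the unlock flag.)
class HotLoop {
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // The Point is scoped to this method and never escapes it, which gives
    // escape analysis the chance to avoid the heap allocation (not guaranteed).
    static int distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    // Calls the small method often enough to cross the compilation thresholds.
    static long run(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += distanceSquared(i % 100, 3);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000));
    }
}
```

The exact output of the diagnostic flags varies between VM builds, so treat it as something to explore rather than a fixed format.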
Designing SSD-Friendly Applications For Better Application Performance and Higher IO Efficiency (Page last updated May 2016, Added 2016-07-28, Author Zhenyun Zhuang, Publisher LinkedIn). Tips:
- SSDs are mostly treated as faster HDDs. Applications designed to be SSD-friendly can gain more significant performance.
- SSD throughput is typically 2 orders of magnitude higher than HDD, but you can gain up to a further 4x speedup by making the application SSD friendly - using multiple concurrent I/O threads (for small reads; large reads should keep the concurrent I/O thread count low). This differs markedly from HDDs, where multiple concurrent I/O threads tend to actually decrease throughput.
- Use an SSD friendly filesystem: those supporting the TRIM feature, eg Ext4 and Btrfs; and those specially designed for SSDs with a log-structured data layout to accommodate the SSD's "copy-modify-write" property, eg NVFS, JFFS/JFFS2, and F2FS.
- I/O loading from LAN systems is often faster than HDD loading, but SSD loading is much faster; the data infrastructure tier design should take account of this. An SSD-based infrastructure is now more scalable, cheaper, and has a higher IOPS capacity than a networked data layer.
- When persisting data to an HDD, in-place updating has significantly better IOPS performance than random updating. SSD reverses this: random updating has the same I/O performance as in-place updating, but in-place updating has additional overheads because SSD pages cannot be directly overwritten, so effectively SSD random updating is faster than in-place updating and better for the SSD lifetime.
- An SSD accesses data at the page level, so proximity of data that needs to be accessed together is important for optimal performance; similarly, putting unnecessary data together with necessary data means that you will be loading that extra data. Separate "hot" data from "cold" data on SSD.
- SSD reads and writes operate at the page level (eg 4KB); a single-byte read or write by an application translates to a full-page read or write. The application should aim to use compact data structures that work at the page level rather than at an arbitrary byte level for optimal efficiency.
- Avoid long heavy writes on SSD. Long heavy writes on SSD can take disproportionately longer than multiple short writes as the SSD can run out of free blocks, needing the I/O to pause while the SSD makes more blocks free (this normally happens in the background).
- Because of the need for free block management on SSDs, when they are close to full, performance drops markedly. A rule of thumb is to aim for less than 80% full disks.
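The tips on concurrent small reads and page-level access can be combined into one access pattern: issue page-sized positional reads from several threads. The sketch below shows that pattern and nothing more. `PageReads`, the 4096-byte page size and the thread count of 4 are assumptions, and because the demo file is tiny and probably served from the OS cache, it demonstrates the structure of the code, not SSD throughput.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PageReads {
    static final int PAGE_SIZE = 4096; // assumed page size, from the "eg 4KB" above

    // Reads the whole file as page-sized chunks issued from several threads and
    // returns the number of bytes read. Positional reads do not move the
    // channel's position, so all threads can share one channel.
    static long readConcurrently(Path file, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            List<Future<Long>> pending = new ArrayList<>();
            for (long offset = 0; offset < size; offset += PAGE_SIZE) {
                final long pageStart = offset;
                Callable<Long> task = () -> readPage(channel, pageStart);
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get(); // wait for every page before closing the channel
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Fills one page-sized buffer starting at pageStart (less at end of file).
    private static long readPage(FileChannel channel, long pageStart) throws IOException {
        ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
        long read = 0;
        while (page.hasRemaining()) {
            int n = channel.read(page, pageStart + read);
            if (n < 0) {
                break; // end of file
            }
            read += n;
        }
        return read;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("ssd-demo", ".bin");
        try {
            Files.write(file, new byte[16 * PAGE_SIZE]);
            System.out.println(readConcurrently(file, 4) + " bytes read");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The right thread count depends on the device and the read size, so it is worth benchmarking on the target hardware before settling on one.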
How Twitter Handles 3,000 Images Per Second (Page last updated April 2016, Added 2016-07-28, Author Todd Hoff, Publisher highscalability). Tips:
- Decoupling large bandwidth/load transactions from attached but potentially independent small ones gives you many optimization options.
- Move handles not blobs - moving large chunks of data through your system unnecessarily eats bandwidth and causes performance problems.
- Segmented resumable uploads (the client segments the media, the server provides a mediaID, segments are uploaded individually specifying the mediaID and segment index, and the upload is complete when all segments are uploaded) significantly reduce media upload failure rates.
- Twitter found that a 20 day TTL (time to live) on image variants (thumbnails, small, large, etc) was a sweet spot. Old image variants should be deleted and recreated on the fly.
- Progressive JPEG is the winner as image format - it has great frontend and backend support and performs very well on slower networks.
- Have an upload endpoint whose only responsibility is to put the original uploaded media into your store.
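The segmented resumable upload flow described above (get a mediaID, upload segments by index, finish when all are present) can be sketched as a small server-side class. This is an in-memory illustration under assumed names: `SegmentedUploadServer`, `init`, `append`, `isComplete` and `assemble` are hypothetical and are not Twitter's API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory server side of a segmented, resumable upload.
class SegmentedUploadServer {
    private final AtomicLong nextMediaId = new AtomicLong(1);
    private final Map<Long, byte[][]> uploads = new ConcurrentHashMap<>();

    // Step 1: the client announces how many segments it will send and
    // receives a mediaId to quote on every later call.
    long init(int segmentCount) {
        long mediaId = nextMediaId.getAndIncrement();
        uploads.put(mediaId, new byte[segmentCount][]);
        return mediaId;
    }

    // Step 2: segments arrive individually, in any order. Re-sending a
    // segment after a failure simply overwrites the same slot.
    void append(long mediaId, int segmentIndex, byte[] data) {
        byte[][] segments = segmentsOf(mediaId);
        synchronized (segments) {
            segments[segmentIndex] = data.clone();
        }
    }

    // Step 3: the upload is complete once every slot has been filled.
    boolean isComplete(long mediaId) {
        byte[][] segments = segmentsOf(mediaId);
        synchronized (segments) {
            for (byte[] segment : segments) {
                if (segment == null) {
                    return false;
                }
            }
            return true;
        }
    }

    // Concatenates the segments in index order to rebuild the original media.
    byte[] assemble(long mediaId) {
        byte[][] segments = segmentsOf(mediaId);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        synchronized (segments) {
            for (byte[] segment : segments) {
                if (segment == null) {
                    throw new IllegalStateException("upload incomplete");
                }
                out.write(segment, 0, segment.length);
            }
        }
        return out.toByteArray();
    }

    private byte[][] segmentsOf(long mediaId) {
        byte[][] segments = uploads.get(mediaId);
        if (segments == null) {
            throw new IllegalArgumentException("unknown mediaId " + mediaId);
        }
        return segments;
    }

    public static void main(String[] args) {
        SegmentedUploadServer server = new SegmentedUploadServer();
        long mediaId = server.init(2);
        server.append(mediaId, 1, "world".getBytes(StandardCharsets.UTF_8));
        System.out.println(server.isComplete(mediaId)); // segment 0 still missing
        server.append(mediaId, 0, "hello ".getBytes(StandardCharsets.UTF_8));
        System.out.println(server.isComplete(mediaId)); // all segments present
    }
}
```

Only the failed segment needs to be retried after a network error, which is where the reduction in whole-upload failures comes from.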
Last Updated: 2018-07-29
Copyright © 2000-2018 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on JavaPerformanceTuning.com are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries. JavaPerformanceTuning.com is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.