Back to newsletter 080 contents | All Javva's articles
Today readers, I shall tell you about my first performance tune at PrettyBigCo. It's not particularly impressive. But it's got that certain wow factor that means I have to tell you about it. It was all about some HashMaps.
Practically every Java coder has created a HashMap - "new HashMap()" right? It's basic Java, you do it again and again every day, it's a workhorse structure. But how many of you have used a different HashMap constructor. It has a few more. Have you used "HashMap(int initialCapacity)"? Anyone gone as far as using "HashMap(int initialCapacity, float loadFactor)"? I have. And so, clearly, had at least one person on the project I'm now looking at for inefficiencies here at PrettyBigCo. Now, quickly, if you have used the constructor with the loadFactor, do your recall what values it is useful for it to take, and what is the effect of varying it? Those of you who haven't used it, have a quick look at the doc for the constructor. Now, can you guess what new HashMap(10,20) does?
That's what I found defined in a couple of classes while I was looking for the source of a slow query. At first I thought, huh? I couldn't immediately recall what the loadfactor would do if set to that. So I looked at the documentation. Of course documentation shows what it might do, comments show what is should do, but the source shows what it does. So I checked the code. Have you guessed what that 20 loadfactor does yet? Well, let's put it this way, it is suitable for a thedailyWTF (now at worsethanfailure.com). For those of you who can't be bothered to look it up, the loadFactor sets how full the hash table underlying the HashMap can get, before the table gets resized.
Hang on, maybe you were recovering from a hangover the morning when you were learning about hash tables, so let me just quickly explain - entering lecture mode. The elements in a hash map are mapped into an array (the table) using the hashcode of the key objects. So say your table has two slots and you are sticking in three elements. Then one of the table slots will need to hold at least two elements. That's called a collision. Collisions make for slower tables because instead of having one element there in that slot, you have a list of more than one element in the slot, and you have to figure out where it goes or which of several elements it could be when retrieving it. So you can see you want to avoid collisions. And the bigger the table size compared to the number of elements in the table, the more likely it is that you will avoid a collision. If our table has a 1000 slots, and we are putting in those three elements, we'd have to be pretty unlucky for a collision to occur. - ending lecture mode now, you can wake up again.
So a hashmap is often a tradeoff between size and performance. And that's where the loadfactor comes in. In HashMap it is sort of the percentage full that the table can get before the table is resized bigger - so that collisions can be kept down. The default is 0.75 which means that your typical hashmap can take 75% of the table size in entries before it gets resized. If you are getting collisions and need a slightly faster map, you reduce the load factor (yeah, yeah, you should also check the distribution of hash codes and the hashing function, shutup geeks). If space is at a premium, you can increase the loadfactor up beyond 1.0 which means more collisions but fewer empty slots. Even at 1.0 loadfactor it is likely that some slots are still empty - 1.0 just says an average of 1 element per slot, and there will usually be some distribution around the average. So what does it mean to have a loadfactor of 20, which is essentially 2000%? It means that every slot has an average of 20 elements before the table gets resized. That produces a hashmap which uses a teeny bit less memory than one that has a loadfactor of, say, 2.0, but that is much much slower, because get() and put() has to deal with multiple collisions. I did a simple test against the default 0.75 loadfactor, and the 20 loadfactor was 3 times slower (all those linear lookups within the table has a huge overhead).
My guess was that the unnamed implementor got the wrong end of the stick completely, and probably thought he (or she) was sizing the map to something useful. Considerably more worrying would be the idea that they were actually intending for there to be an average of 20 elements in each table slot. On the bright side, this is just what you move to a new project for: exposure to new techniques and new challenges. It's always to good to learn from mistakes - and so much the better if it's other people's mistakes.
Javva The Hutt.
Back to newsletter 080 contents