One of the things that is really important for us in RavenDB is the notion of Safe by Default and Zero Admin. What this means is that we want to make sure that you don’t really have to think about what you are doing for the common cases, RavenDB will understand what you mean and figure out what is the best way to do things.
One of the cases where RavenDB does that is when we need to generate new ids. There are several ways to generate new ids in RavenDB, but the most common one, and the default, is to use the hilo algorithm. It basically (ignoring concurrency handling) works like this:
var currentMax = GetMaxIdValueFor("Disks");
var limit = currentMax + 32;
SetMaxIdValueFor("Disks");
And now we can generate ids in the range of currentMax to currentMax+32, and we know that no one else can generate those ids. Perfect!
The good thing about it is that now we have a reserved range, we can create ids without going to the server. The bad thing about it is that we now reserved a range of 32. If we create just one or two documents and then restart, we would need to request a new range, and the rest of that range would be lost. That is why the default range value is 32. It is small enough that gaps aren’t that important*, but it since in most applications, you usually create entities on an infrequent basis and when you do, you usually generate just one, then it is big enough to still provide a meaningful optimization with regards to the number of times you have to go to the server.
* What does it means, “gaps aren’t important”? The gaps are never important to RavenDB, but people tend to be bothered when they see disks/1 and disks/2132 with nothing in the middle. Gaps are only important for humans.
So this is perfect for most scenarios. Except one very common scenario, bulk import.
When you need to load a lot of data into RavenDB, you will very quickly note that most of the time is actually spent just getting new ranges. More time than actually saving the new documents takes, in fact.
Now, this value is configurable, so you can set it to a higher value if you care for it, but still, that was annoying.
Hence, what we have now. Take a look at the log below:
It details the requests pattern in a typical bulk import scenario. We request an id range for disks, and then we request it again, and again, and again.
But, notice what happens as times goes by (and not that much time) before RavenDB recognizes that you need bigger ranges, and it gives you them. In fact, very quickly we can see that we only request a single range per batch, because RavenDB have optimized itself based on our own usage pattern.
Kinda neat, even if I say so myself.
No comments:
Post a Comment
Thank's!