Saturday, March 31, 2012

RavenDB Sharding – Blind sharding

RavenDB Sharding – Blind sharding:
From the get go, RavenDB was designed with sharding in mind. We had a sharding client inside RavenDB when we shipped, and it made for some really cool demos.
It also wasn’t really popular, we didn’t implement some things for sharding. We always intended to, but we had other things to do and no one was asking for it much.
That was strange. I decided that we needed to do two major things.

  • First, to make sure that the experience for writing in a sharded environment was as close as we could get to the one you get with a non sharded environment.

  • Second, we had to make it simple to use sharding.


Before our changes, in order to use sharding you had to do the following:

  • Setup multiple RavenDB server.

  • Create a list of those servers urls.

  • Implement IShardStrategy, which exposes


    • IShardAccesStrategy – determine how we call to the servers.

    • IShardSelectionStrategy – determine how we select which server a new instance will go to, and what server an existing instance belongs on.

    • IShardResolutionStrategy – determine which servers we should query when we are querying for data (allow to optimize which servers we are actually hitting for particular queries)



All in all, you would need to write a minimum of 3 classes, and have to write some sharding code that can be… somewhat tricky.
Oh, it works, and it is a great design. It is also complex, and it makes it harder to use sharding.
Instead, we now have the following scenario:
image
As you can see, here we have three different servers, each running in a different port. Let us see what we need to do to get us working with this from the client code:
image
First, we need to define the servers (and their names), then we create a shard strategy and use that to create a sharded document store. Once that is done, we are home free, and can do pretty much whatever we want:

string asianId, middleEasternId, americanId;

using (var session = documentStore.OpenSession())
{
var asian = new Company { Name = "Company 1", Region = "Asia" };
session.Store(asian);
var middleEastern = new Company { Name = "Company 2", Region = "Middle-East" };
session.Store(middleEastern);
var american = new Company { Name = "Company 3", Region = "America" };
session.Store(american);

asianId = asian.Id;
americanId = american.Id;
middleEasternId = middleEastern.Id;

session.Store(new Invoice { CompanyId = american.Id, Amount = 3 });
session.Store(new Invoice { CompanyId = asian.Id, Amount = 5 });
session.Store(new Invoice { CompanyId = middleEastern.Id, Amount = 12 });
session.SaveChanges();

}

using (var session = documentStore.OpenSession())
{
session.Query<Company>()
.Where(x => x.Region == "America")
.ToList();

session.Load<Company>(middleEasternId);

session.Query<Invoice>()
.Where(x => x.CompanyId == asianId)
.ToList();
}




What you see here is the code that saves both companies and invoices, and does this over multiple servers. Let us see the log output for this code:
image
You can see a few interesting things here:

  • The first four requests are to manage the document ids (hilos). By default, we use the first server as the one that will store all the hilo information.

  • Next (request #5) we are saving two documents, note that the shard id is now part of the document id.

  • Request 6 and 7 here are actually queries, we returned 1 results for the first query, and none for the second.


Let us look at another shard now:
image
This is much shorter, since we don’t have the hilo requests. The first request is there to store two documents, and then we see two queries, both of which return no results.
And the last shard:
image
Here we again don’t see the hilo requests (since they are all on the first server). We do see putting of the two docs, and request #2 is a query that returns no results.
Request #3 is interesting, because we did not see that anywhere else. Since we did a load by id, and since by default we store the shard id in the document id, we were able to optimize this operation and go directly to the relevant shard, bypassing the need to query anything other server.
The last request is a query, for which we have a result.
So what did we have so far?
We were able to easily configure RavenDB to use 3 ways sharding in a few lines of code. It automatically distributed writes and reads for us, and when it could, it optimized the data access so it would only access the relevant shards. Writes are distributed on a round robin basis, so it is pretty fair. And reads are optimized on whatever we can figure out a minimal number of shards to query. For example, when we do a load by id, we can figure out what the shard id is, and query that server directly, rather than all of them.
Pretty cool, if you ask me.
Now, you might have noticed that I called this post Blind Sharding. The reason this is called this name is that this is pretty much the lowest rung in the sharding ladder. It is good, it split your data and it tries to optimize things, but it isn’t the best solution. I’ll discuss a better solution in my next post.

No comments:

Post a Comment

Thank's!