Thursday, August 30, 2012

NuGet Perf, Part III–Displaying the Packages page

NuGet Perf, Part III–Displaying the Packages page:
The first thing that we will do with RavenDB and the NuGet data is to issue the same logical query as the one used to populate the packages page. As a reminder, here is how it looks:

SELECT        TOP (30) 
-- ton of fields removed for brevity
FROM (

SELECT Filtered.Id
, Filtered.PackageRegistrationKey
, Filtered.Version
, Filtered.DownloadCount
, row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
FROM (
SELECT PackageRegistrations.Id
, Packages.PackageRegistrationKey
, Packages.Version
, PackageRegistrations.DownloadCount
FROM Packages
INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
WHERE Packages.IsPrerelease <> cast(1 as bit)
) Filtered
) Paged
INNER JOIN PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE Paged.[row_number] > 30
ORDER BY PackageRegistrations.DownloadCount DESC
, Paged.Id



Despite the apparent complexity ,this is a really trivial query. What is does is say:

  • Give me the first 30 – 60 rows

  • Where IsPrerelease is false

  • Order by the download count and then the id


With Linq, the client side query looks something like this:

var results = Session.Query<Package>()
.Where(x=>x.IsPrerelease == false)
.OrderBy(x=>x.DownloadCount).ThenBy(x=>x.Id)
.Skip(30)
.Take(30)
.ToList();



Now, I assume that this is what the NuGet code is also doing, it is just that the relational database has made it so they have to go to the data in a really convoluted way.
With RavenDB, to match the same query, I could just issue the following query, but there are subtle differences between how the query works in SQL and how it works in RavenDB. in particular, the data that we have in RavenDB is the output of this query, but it isn’t the raw output. For example, we don’t have the Id column available, which is used for sorting. Now, I think that the logic is meaning to say, “sort by download count descending and then by age ascending”. So old and popular packages are more visible than new and fresh packages.
In order to match the same behavior (and because we will need it to the next post) we will define the following index in RavenDB:
image
And querying it:
image
The really nice thing about this?
image
This is the URL for this search:

/indexes/Packages/Listing?query=IsPrerelease:false&start=0&pageSize=128&aggregation=None&sort=-DownloadCount&sort=Created

This is something that RavenDB can do in its sleep, because it is a very cheap operation. Consider the query plan that would for the SQL query above. You have to join 5 times just to get to the data that you want, paging is a real mess, and the database actually have to work a lot to answer this fiddling little query.
Just to give you some idea here. We are talking about something that conceptually should be the same as:

select top 30 skip 30 * from Data where IsPrerelease = 0





But it get really complex really fast with the joins and the tables and all the rest.
In comparison, in RavenDB, we actually do have just a property match to do. Because we keep the entire object graph in a single location, we can do very efficient searches on it.
In the next post, I’ll discuss the actual way I modeled the data, and then we get to do exciting searches Smile.


DIGITAL JUICE

No comments:

Post a Comment

Thank's!