Saturday, September 8, 2012

NuGet Perf, Part VIII: Correcting a mistake and doing aggregations

NuGet Perf, Part VIII: Correcting a mistake and doing aggregations:
I hope this is the last one, because I can never recall what is the next Latin number.
At any rate, it has been pointed out to me that I made an error in importing the data. I assumed that the DownloadCount field that I got from the Nuget API is the download count for the specific package, but it appears that this is the total downloads count, across all versions of this package. The actual download number for a specific package is: VersionDownloadCount.
That changes things a bit, because the way Nuget sorts things is based on the total download count, not the download count for a specific version. The reason this complicate things is that we aren’t going to store the total download count in all the version documents. First, let us see the sort of query we need to write. In SQL, it would look like this:
select top 30 skip 30 
    Id,
    PackageId,
     Created, 
    (select sum(VersionDownloadCount) from Packages all where all.PackageId = p.PackageId) as TotalDownloadsCount
from Packages p
where IsPrerelease = 0
order by TotalDownloadsCount desc, Created

This is a much simplified version of the real query, and something that you can’t actually write this simply in SQL, most probably. But it gets the point.
Note that in order to process this query, the RDMBS would have to first aggregate all of the data (for each row, mind) then do the paging, then give you the results. Sure, you can keep a counter for all the downloads for a package, but considering the fact that downloads are highly parallel and happen all the time, waiting for writers to finish doing their update.
Instead, with RavenDB, we are going to use a map/reduce index and query on that.
image
This should be fairly simple to follow. In the map we go over all the packages, and output their package id, whatever they have been released, the specific version download count and the date it was created.
In the reduce, we group by the package id and whatever is was pre released or not ( I am assuming that we usually don’t want to show the pre-release stuff there).
Finally, we sum up all of the individual package downloads and we output the oldest created date. Using all of that, we can now move to the next step, and actually query that:
image
There  is a small bug here, since I don’t see RavenDB in the results,  but I guess I’ll have to wait until I get the updated data from Nuget.
Actually, that is not quite true, for pre-released software, we are pretty high up:
image
That explains much, RavenDB 1.2 is pretty awesome.


DIGITAL JUICE

No comments:

Post a Comment

Thank's!