There is the FLOSS Mole data set, which provides a lot of interesting information about open source projects. Since I am always interested in testing RavenDB with different data sets, I decided that this would be a great opportunity to do that, and to learn a bit more about how things work along the way.
The data is provided in a number of formats, but most of them aren't really easy to work with: SQL statements, and raw text files that I assume are tab separated, though I couldn't figure that out quickly.
I decided that this would be a great example of actually migrating content from a SQL system to a RavenDB system. The first thing to do was to install MySQL, since that seemed to be the easiest way to get the data out. (As a note, MySQL Workbench is really not what I would call nice.)
The data looks like this: these are the Google Code projects, and you can also see that a lot of the data is driven by the notion of a project.
I explored the data a bit, and I came to the conclusion that this is pretty simple stuff, overall. There are a few many-to-one associations, but all of them were capped (the max was 20 or so).
That meant, in turn, that the import process would be really simple to write. I started by creating the actual model which we will use to save to RavenDB:
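The model itself isn't reproduced here, but reconstructing it from the field names used in the import code that follows, it would look roughly like this (property names are inferred, not confirmed):

```csharp
using System.Collections.Generic;

// A sketch of the document model, inferred from the import code;
// each Project document embeds its capped many-to-one collections.
public class Project
{
    public string Name { get; set; }
    public string CodeLicense { get; set; }
    public string CodeUrl { get; set; }
    public string ContentLicense { get; set; }
    public string ContentUrl { get; set; }
    public string Description { get; set; }
    public string Summary { get; set; }
    public List<string> Labels { get; set; }
    public List<Blog> Blogs { get; set; }
    public List<Group> Groups { get; set; }
    public List<Link> Links { get; set; }
    public List<Person> People { get; set; }
}

public class Blog
{
    public string Link { get; set; }
    public string Title { get; set; }
}

public class Group
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public class Link
{
    public string Url { get; set; }
    public string Title { get; set; }
}

public class Person
{
    public string Name { get; set; }
    public string Role { get; set; }
    public string UserId { get; set; }
}
```

Since the associations are capped at around 20 items, embedding them directly in the project document is a natural fit for a document database.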
The rest was just a matter of reading from MySQL and writing to RavenDB. I chose to use PetaPoco for the SQL access, because it is the easiest. The following code sucks: it is written with the assumption that I know what the data sizes are, that the cost of making so many queries (roughly 1,500,000 of them) is acceptable, and so on.
But it does the work, and it was simple to write. Using this code, I was able to insert 299,949 projects in just under 13 minutes. Most of the time went to making those 1.5 million queries to the db, by the way.

using (var docStore = new DocumentStore { ConnectionStringName = "RavenDB" }.Initialize())
using (var db = new PetaPoco.Database("MySQL"))
using (var bulk = docStore.BulkInsert())
{
    foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList())
    {
        string name = prj.proj_name;
        bulk.Store(new Project
        {
            Name = name,
            CodeLicense = prj.code_license,
            CodeUrl = prj.code_url,
            ContentLicense = prj.content_license,
            ContentUrl = prj.content_url,
            Description = prj.project_description,
            Summary = prj.project_summary,
            Labels = db.Query<string>("select label from gc_project_labels where proj_name = @0", name)
                .ToList(),
            Blogs = db.Query<dynamic>("select * from gc_project_blogs where proj_name = @0", name)
                .Select(x => new Blog { Link = x.blog_link, Title = x.blog_title })
                .ToList(),
            Groups = db.Query<dynamic>("select * from gc_project_groups where proj_name = @0", name)
                .Select(x => new Group { Name = x.group_name, Url = x.group_url })
                .ToList(),
            Links = db.Query<dynamic>("select * from gc_project_links where proj_name = @0", name)
                .Select(x => new Link { Url = x.link, Title = x.link_title })
                .ToList(),
            People = db.Query<dynamic>("select * from gc_project_people where proj_name = @0", name)
                .Select(x => new Person { Name = x.person_name, Role = x.role, UserId = x.user_id })
                .ToList(),
        });
    }
}
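If those per-project queries were the thing to optimize, one alternative (a sketch only, not what I actually ran) would be to pull each child table once and group it in memory, trading RAM for round trips:

```csharp
// Sketch: load a whole child table once and group by project name,
// replacing ~300,000 per-project queries with a single query per table.
// Table and column names are taken from the import code above.
var labelsByProject = db.Query<dynamic>("select proj_name, label from gc_project_labels")
    .ToLookup(x => (string)x.proj_name, x => (string)x.label);

foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList())
{
    string name = prj.proj_name;
    // Lookup returns an empty sequence for projects with no labels.
    var labels = labelsByProject[name].ToList();
    // ... the same pattern would apply to blogs, groups, links and people ...
}
```

Since the associations are capped at about 20 items per project, the lookups stay small; the cost is holding each child table in memory for the duration of the import.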
Everything is cool, and it is quite nice. In the next post, I'll talk about why I wanted a new dataset. Don't worry, it is going to be cool.