
Importing data from StackExchange or How to deserialize REALLY large XML files

I’ve recently been reviewing MongoDB for work purposes and have been really impressed with its story around replication and sharding. After carefully reading all the documentation, I fired up a couple of VMs in Windows Azure and soon had a working replica set installed into an Azure availability set. This meant that I would fall in line with Azure’s 99.95% availability SLA, which was easily tested by running two console apps (one inserting data, one reading) and randomly killing MongoDB services. The console apps ran in a while loop and so were continually trying to connect to Mongo. Failover for reads was very quick, but writes took a little longer to recognise that the primary node was down – presumably while the other nodes negotiated on “who’s the boss”.
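A back off strategy here doesn’t need to be clever – retry the write, doubling the wait between attempts until the new primary is elected. A minimal, self-contained sketch in plain C# (the `flakyWrite` delegate is a made-up stand-in for the Mongo insert, not driver code):

```csharp
using System;
using System.Threading;

public static class RetryDemo
{
    // Retry a write with exponential back off: wait 100ms, 200ms, 400ms, ...
    public static bool TryWithBackoff(Func<bool> write, int maxAttempts)
    {
        var delay = TimeSpan.FromMilliseconds(100);
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try { return write(); }
            catch (InvalidOperationException)
            {
                if (attempt == maxAttempts) throw; // give up, let the caller decide
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait
            }
        }
        return false;
    }

    public static void Main()
    {
        // Stand-in for an insert that fails while the replica set elects
        // a new primary, then starts succeeding again.
        int failuresLeft = 3;
        Func<bool> flakyWrite = () =>
        {
            if (failuresLeft-- > 0) throw new InvalidOperationException("no primary");
            return true;
        };

        Console.WriteLine(TryWithBackoff(flakyWrite, 5)); // True – recovers on the 4th attempt
    }
}
```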

Once satisfied that this could be overcome with a bit of clever coding (perhaps using a back off strategy), I wanted to explore performance. So what kind of dataset might be large enough to stress Mongo, let me explore optimal data models and, crucially, be free? The StackExchange Data Dump! It just so happens that the September 2013 dump was recently made available. One overnight BitTorrent download later and I had 14GB of compressed XML on my computer.

I figured I’d start with a small set of data, so I initially settled on one of the smaller sites. The data export for each site is fairly straightforward, and the schema is presumably denormalised from what StackExchange actually use on their systems.

StackExchange XML Export

So you could go ahead and create a model for Mongo, parse the XML and then map the data accordingly, but I wanted to get the import over and done with and get on with the good stuff of performance testing. The easiest way out, then, was to use whatever’s been defined in the XML files and the standard .NET XML serializer – effectively deserializing each file in one go (I’m sure you can tell where this is headed…). You can point xsd.exe at the XML files to generate the schema and the corresponding .NET classes. Easy stuff. Once done, you can use the generated classes and the .NET deserializer to pump data into Mongo. The problem, though, is that the data export doesn’t specify any types, so everything comes out as a string. Not a great situation, to be honest. To get around that I used AutoMapper to map the generated classes onto a data model that is a bit more appropriate for my purpose.
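Sketched out, the one-shot approach boils down to something like this. The class shapes below are my assumption of roughly what xsd.exe generates for a comments file – names are illustrative, and note that every property really is a string, because the dump carries no type information:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Roughly the shape xsd.exe generates from comments.xml.
public class comments
{
    [XmlElement("row")]
    public commentsRow[] rows { get; set; }
}

public class commentsRow
{
    // Everything is an attribute, and everything is a string.
    [XmlAttribute] public string Id { get; set; }
    [XmlAttribute] public string PostId { get; set; }
    [XmlAttribute] public string Score { get; set; }
    [XmlAttribute] public string Text { get; set; }
}

public static class OneShotImport
{
    // Deserialize the whole document in one go – fine for a small site,
    // not for a 20GB Posts.xml.
    public static comments Load(TextReader reader)
    {
        var serializer = new XmlSerializer(typeof(comments));
        return (comments)serializer.Deserialize(reader);
    }

    public static void Main()
    {
        var xml = @"<comments><row Id=""1"" PostId=""7"" Score=""3"" Text=""Nice."" /></comments>";
        var data = Load(new StringReader(xml));
        Console.WriteLine(data.rows[0].Score); // "3" – a string, not an int
    }
}
```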

There’s always a but somewhere… I didn’t download 14GB’s worth of data to mess around with a small data set – I came for the daddy, and I wanted to query all of its data! Easy enough, I thought: point the code at the directory where I’d unzipped the data and let her rip. Clearly I didn’t think this through… For reference, in the September 2013 data dump the Posts.xml file for StackOverflow alone is 20GB in size. So one Out of Memory exception later I was left scratching my head on how to import this bad boy into… well, anywhere really.

StackOverflow to the rescue… well, the site really. I found a StackOverflow answer by Jon Skeet that explained exactly what I needed to do: using XmlReader and “yield” you can effectively “stream” data from the XML and transform it on the fly. I modified his answer slightly to produce the following:

private static IEnumerable<T> StreamTypeFromXml<T>(string xmlFile, string elementName, Func<XElement, T> converter) where T : class
{
    using (XmlReader reader = XmlReader.Create(xmlFile))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                var element = XElement.ReadFrom(reader) as XElement;

                if (element != null)
                    yield return converter.Invoke(element);
            }
        }
    }
}

So what we basically have here is a function that takes the name of the XML file we want to parse, the element name we want to distinguish on (in the data dump it’s just “row”) and a Func<XElement, T> which does some work on each row as it’s yield returned. Unfortunately I couldn’t (read: was too impatient to) figure out an elegant way of deserializing the yielded result automatically into a class, so all my Func<XElement, T> ended up doing was mapping each row onto the classes I’d generated previously. A little bit of extra processing is needed because the data is stored in attributes (to save space, presumably), so I ended up with code like this:

private static IEnumerable<commentsRow> GetAllXmlComments(string stackDirectory)
{
    var commentFile = Path.Combine(stackDirectory, "comments.xml");
    var allXmlComments = StreamTypeFromXml(commentFile, "row", element => new commentsRow
        {
            CreationDate = element.GetAttributeStringValue("CreationDate"),
            Id = element.GetAttributeStringValue("Id"),
            PostId = element.GetAttributeStringValue("PostId"),
            Score = element.GetAttributeStringValue("Score"),
            Text = element.GetAttributeStringValue("Text"),
            UserDisplayName = element.GetAttributeStringValue("UserDisplayName"),
            UserId = element.GetAttributeStringValue("UserId")
        });

    return allXmlComments;
}

For a proof of concept, I can live with it.
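To sanity-check the streaming approach end to end, the same pattern can be pointed at a small in-memory document (a self-contained sketch – in the real import the XmlReader sits over a 20GB file instead of a string, but memory use stays flat either way, because each row is yielded and then discarded):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingDemo
{
    // Same shape as StreamTypeFromXml above, reading from any TextReader.
    public static IEnumerable<T> StreamTypeFromXml<T>(TextReader input, string elementName, Func<XElement, T> converter) where T : class
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    var element = XElement.ReadFrom(reader) as XElement;
                    if (element != null)
                        yield return converter(element);
                }
            }
        }
    }

    public static void Main()
    {
        var xml = @"<comments>
                      <row Id=""1"" Score=""3"" />
                      <row Id=""2"" Score=""5"" />
                    </comments>";

        // Rows are materialised one at a time – nothing ever holds the whole file.
        var scores = StreamTypeFromXml(new StringReader(xml), "row",
            e => (string)e.Attribute("Score"));

        Console.WriteLine(string.Join(",", scores)); // 3,5
    }
}
```

One subtlety worth knowing: XElement.ReadFrom advances the reader past the element it reads, so if two rows sat back to back with no whitespace between them, the subsequent Read() could skip a sibling. The dump’s files have each row on its own line, so the pattern holds up there.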

So now we can get a collection of all the data as a “stream” and do the mapping for Mongo. Rather than do the mapping manually (again!) I drafted in AutoMapper. So long as the property names are similar (or, in my case, the same) AutoMapper will try to match up the properties. Where it got a little stuck was converting strings to ints. Rather than guess that that’s what you want, AutoMapper requires you to create a type converter and set the mapping convention. The same goes for string to DateTime conversions, but really it’s just a few lines of code and AutoMapper will go off and do its thing.
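The conversion logic itself is tiny – the dump stores everything as strings, so the converters are just Parse calls with an invariant culture. A sketch of the helpers you’d hang off AutoMapper (the registration lines in the comment use the 2013-era static AutoMapper API from memory, so treat them as an assumption):

```csharp
using System;
using System.Globalization;

public static class DumpConversions
{
    // String -> int, culture-independent.
    public static int ToInt(string value)
    {
        return int.Parse(value, CultureInfo.InvariantCulture);
    }

    // String -> DateTime; dates in the dump look like 2013-09-08T10:34:21.953
    public static DateTime ToDateTime(string value)
    {
        return DateTime.Parse(value, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
    }

    public static void Main()
    {
        // With AutoMapper the wiring is along the lines of (assumed API):
        //   Mapper.CreateMap<string, int>().ConvertUsing(ToInt);
        //   Mapper.CreateMap<string, DateTime>().ConvertUsing(ToDateTime);
        //   Mapper.CreateMap<commentsRow, Comment>();
        Console.WriteLine(ToInt("42"));
        Console.WriteLine(ToDateTime("2013-09-08T10:34:21.953").Year);
    }
}
```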

That’s pretty much it. This post was mainly about using an XmlReader to stream large amounts of data using yield return. The full source is up on GitHub if you want to see the whole thing.