
Importing data from StackExchange or How to deserialize REALLY large XML files

I’ve recently been reviewing MongoDB for work purposes and have been really impressed with MongoDB’s story around replication and sharding. After carefully reading all the documentation, I fired up a couple of VMs in Windows Azure and soon had a working replica set installed into an Azure Availability Set. This meant I would fall in line with Azure’s 99.95% availability SLA, which was easy to test by running two console apps (one inserting data, one reading) and randomly killing MongoDB services. The console apps ran in a while loop and so were continually trying to connect to Mongo. Failover for reads was very quick, but writes took a little longer to recognise that the primary node was down – presumably because the remaining nodes were still negotiating “who’s the boss”.

Once satisfied that this could be overcome with a bit of clever coding (perhaps using a back-off strategy), I wanted to explore performance. So what kind of dataset is large enough to stress Mongo, lets me explore optimal data models and is free? The StackExchange Data Dump! It just so happens that the September 2013 dump had recently been made available. One overnight BitTorrent download later and I had 14 GB of compressed XML on my computer.
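(As an aside, the “clever coding” for writes doesn’t need to be anything fancy. The sketch below is roughly what I had in mind – a retry that backs off between attempts. The WriteWithBackOff helper is made up for illustration and isn’t tied to any particular driver version.)

// Needs System and System.Threading. Retry a write a few times, doubling the wait
// between attempts so the replica set has time to elect a new primary; "insert"
// stands in for whatever driver call you are making.
private static void WriteWithBackOff(Action insert, int maxAttempts = 5)
{
    var delay = TimeSpan.FromMilliseconds(250);

    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            insert();
            return;
        }
        catch (Exception) // the real thing should only catch the driver's connection exceptions
        {
            if (attempt == maxAttempts) throw;
            Thread.Sleep(delay);
            delay = TimeSpan.FromMilliseconds(delay.TotalMilliseconds * 2);
        }
    }
}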

I figured I’d start with a small set of data – I think I settled on bicycles.stackexchange.com initially. The data export for each site is fairly straightforward, and the schema is presumably denormalised from what StackExchange actually use on their systems.

StackExchange XML Export

So you could go ahead and create a model for Mongo, parse the XML and then map the data accordingly, but I wanted to get the import over and done with and get on with the good stuff of performance testing. The easiest way out, then, is to use whatever’s been defined in the XML files and the standard .NET XML deserializer – effectively deserializing each file in one go (I’m sure you can tell where this is headed…). You can point xsd.exe at the XML files to generate the schema and the corresponding .NET classes. Easy stuff. Once done, you can use the generated classes and the .NET deserializer to pump data into Mongo. The problem, though, is that the data export does not specify any types, so everything comes through as a string. Not a great situation to be honest. To get around that I used AutoMapper to map the generated classes onto a data model that is a bit more appropriate for my purpose.
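For the record, the “one go” version is something like the sketch below. The posts class (and its row array) is whatever xsd.exe decides to generate – treat those names as assumptions here, following the same pattern as the commentsRow class further down – and the point is simply that XmlSerializer pulls the entire document into memory:

// Needs System.IO and System.Xml.Serialization; "posts"/"postsRow" are the
// xsd.exe-generated names (assumed), and stackDirectory is the folder the dump
// was unzipped to.
using (var stream = File.OpenRead(Path.Combine(stackDirectory, "posts.xml")))
{
    var serializer = new XmlSerializer(typeof(posts));
    var allPosts = (posts)serializer.Deserialize(stream);

    // allPosts.row now holds every post in the file - fine for a small site,
    // hopeless for StackOverflow (as I was about to find out).
    // ... map allPosts.row onto the Mongo model and insert ...
}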

There’s always a “but” somewhere… I didn’t download 14 GB of data to mess around with a small data set like bicycles.stackexchange.com. I came for the daddy – I wanted to query the data for StackOverflow.com! Easy enough, I thought: point the directory to where I’d unzipped the data and let her rip. Clearly I didn’t think this through… For reference, in the September 2013 data dump the Posts.xml file for StackOverflow is 20 GB in size. So one OutOfMemoryException later I was left scratching my head on how to import this bad boy into… well, anywhere really.

StackOverflow to the rescue… well the site really. I found a StackOverflow post by Jon Skeet that explained what I needed to do very well. Using XmlReader and “yield” you can effectively “stream” data from the XML and transform it on the fly. I modified his answer slightly to produce the following:

private static IEnumerable<T> StreamTypeFromXml<T>(string stackDirectory, string elementName, Func<XElement, T> converter) where T : class
{
    // Walk the file with XmlReader rather than loading it all at once, converting
    // and yield returning each matching element as we go.
    using (XmlReader reader = XmlReader.Create(stackDirectory))
    {
        reader.MoveToContent();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                var element = XElement.ReadFrom(reader) as XElement;

                if (element != null)
                {
                    yield return converter.Invoke(element);
                }
            }
        }
    }
}

So what we basically have here is a function that takes the path of the XML file we want to parse, the element name we want to match on (in the data dump it’s just “row”) and a Func<XElement, T> which does some work on each row as it’s yield returned. Unfortunately I couldn’t (or was too impatient to) figure out an elegant way of deserializing the yielded result automatically into a class, so all my Func<XElement, T> ends up doing is mapping each row onto the classes I’d generated previously. A little bit of extra processing is needed because the data is stored in attributes (to save space, presumably), so I ended up with code like this:

private static IEnumerable<commentsRow> GetAllXmlComments(string stackDirectory)
{
    // Each "row" element stores its values as attributes, so the converter just
    // copies them onto the xsd.exe-generated commentsRow class.
    var commentFile = Path.Combine(stackDirectory, "comments.xml");
    var allXmlComments = StreamTypeFromXml(commentFile, "row", element => new commentsRow
        {
            CreationDate = element.GetAttributeStringValue("CreationDate"),
            Id = element.GetAttributeStringValue("Id"),
            PostId = element.GetAttributeStringValue("PostId"),
            Score = element.GetAttributeStringValue("Score"),
            Text = element.GetAttributeStringValue("Text"),
            UserDisplayName = element.GetAttributeStringValue("UserDisplayName"),
            UserId = element.GetAttributeStringValue("UserId")
        });

    return allXmlComments;
}

For a proof of concept, I can live with it.

So now we can get all of the data as a “stream” and do the mapping for Mongo. Rather than do the mapping manually (again!) I drafted in AutoMapper. So long as the property names are similar (or, in my case, the same) AutoMapper will try to match up the properties. Where it got a little stuck was converting strings to ints. Rather than guess that that’s what you want, AutoMapper requires you to create a type converter and set the mapping convention. The same goes for string to DateTime conversions, but really it’s just a few lines of code and AutoMapper will go off and do its thing.
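Those few lines amount to something like the sketch below. It uses the old static Mapper API (the exact calls vary between AutoMapper versions) and the Comment class standing in for my Mongo model is hypothetical:

// Needs System, System.Globalization, System.Linq and AutoMapper. "Comment" is the
// Mongo-friendly model (hypothetical); commentsRow is the xsd.exe-generated class.
Mapper.CreateMap<string, int>()
      .ConvertUsing(s => string.IsNullOrEmpty(s) ? 0 : int.Parse(s));
Mapper.CreateMap<string, DateTime>()
      .ConvertUsing(s => DateTime.Parse(s, CultureInfo.InvariantCulture));
Mapper.CreateMap<commentsRow, Comment>();

// Stream the rows and map them as they come through.
var comments = GetAllXmlComments(stackDirectory)
    .Select(Mapper.Map<commentsRow, Comment>);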

That’s pretty much it. This post was mainly about using an XmlReader to stream large amounts of data using yield return. The full source is up on GitHub if you want to see the whole thing.

RaspberryPi: A web API for gphoto

I was recently inspired by my good friend Josh Gallagher to flex my geek credentials when he mentioned he’d acquired a Raspberry Pi.  I’d seen mention of it on Engadget but had been too busy to pay much attention to it.  I’d seen an article on SLR Lounge about someone fitting one of these cheap little devices into a camera grip to control the camera so I thought I’d have a go at that.  I’m more of a software guy though, so I thought I’d try to see whether I could control my camera from my Nexus 7.

As a former web developer, my immediate instinct was to create a thin web API over gphoto2.  I would then be able to create a UI with HTML which I could use from any device with a browser.  As well as learning to use the Raspberry Pi I would also use this opportunity to learn a bit about Python.

The technology stack I ended up using was:

  • Python
  • Bottle v0.10 – a Sinatra-like web framework
  • gphoto2
  • jQuery
  • Raspberry Pi
  • Nikon D3s

Although it would have been easy to just slap Mono on the box and whip up a .NET based solution, I thought it would be more challenging to use something entirely unfamiliar.  I spent about two days looking through Learn Python the Hard Way to get myself up to speed.  Getting to grips with the basic language syntax was fairly straightforward and it was actually quite fun learning something new.

Before writing any kind of web-based API, I wanted to prove the concept of hooking up my camera and getting the RPi talking to it. This is where I hit the first problem, the one mentioned in David Hunt’s post regarding the limitations of the RPi’s USB controller. The limitation manifests itself as having to disconnect and then reconnect the camera because of what appear to be random PTP I/O errors. David mentions the use of a small C program to reset the USB port. A lot of googling later I eventually found the program in question. More googling and we have a bash script to string together the requisite commands to control the camera:

#!/bin/bash
#
# Work out which USB bus/device the camera is on, then reset that port either side
# of the gphoto2 call to work around the random PTP I/O errors.
dev=`gphoto2 --auto-detect | grep usb | cut -b 36-42 | sed 's/,/\//'`
if [ -z "${dev}" ]
then
 echo "Error: Camera not found"
 exit 1
fi
resetusb /dev/bus/usb/${dev}
gphoto2 "$@"
resetusb /dev/bus/usb/${dev}

Deciding on a suitable web framework took a bit of time. After several failed attempts to get Django to work with Apache, I went with Bottle as a lightweight web framework and ditched both Django and Apache.

Getting Python to call out to a bash script felt a bit clunky though, and turning that part into a Python script was fairly straightforward. Python’s subprocess module allows a script to call out to other executables on the system and get data back. This results in a couple of simple Python methods that can be combined to call out to gphoto and return the appropriate result:


import subprocess

# Bus/device of the camera (e.g. "001/006"), set by detectcamera().
global_usb_port = None


def resetusb():
    # Reset the camera's USB port via the compiled usbreset utility to work around
    # the RPi's random PTP I/O errors.
    if global_usb_port is not None:
        subprocess.Popen(['sudo', '/home/pi/usbreset', '/dev/bus/usb/' + global_usb_port])
        return True
    else:
        return False


def detectcamera():
    # Parse "gphoto2 --auto-detect" output to find the usb:BBB,DDD address.
    gphoto_detect = subprocess.check_output(['sudo', 'gphoto2', '--auto-detect'])

    if not gphoto_detect:
        return False

    usb_device = gphoto_detect.split(":")

    if len(usb_device) < 2:
        return False
    else:
        usb_device = usb_device[1].strip().replace(",", "/")

    global global_usb_port
    global_usb_port = usb_device
    return True


def execute(command):
    # Reset the port either side of the actual gphoto2 call, mirroring the bash script.
    if not detectcamera():
        return "Camera not found"

    resetusb()
    gphotocommand = ['sudo', 'gphoto2'] + command
    gphoto_response = subprocess.check_output(gphotocommand)
    resetusb()

    return gphoto_response

Using Bottle to create the web API also proved to be straightforward. The one problem I encountered was getting gphoto to capture the image and display it on the web page. It turns out the downloaded image is saved as read-only, and overwriting it with Python caused a prompt. As it’s a web process, I didn’t see the prompt until I tried to replicate the issue from the command line. The solution was to copy the file to the location where Bottle was set up to serve static files and then delete the original file.

With what I’ve done so far I can (for any camera supported by gphoto2):

  • List all the configuration options that the camera presents
  • View individual configuration options
  • List the camera’s abilities
  • Capture an image and display it on the webpage

Things to do:

  • Ability to change configuration values
  • Ability to string together a series of commands – this should allow things like bracketing and time lapse photography (although the D3s already has a bracketing function)
  • Sort out the hardware side of things so I can attach the RPi to the camera and operate it in the field from my Nexus 7
  • Use a data store to store camera specific settings and preferences
  • Put the source code up on GitHub – done: the source can be found on GitHub

WCF: Unauthorized client authentication with server header Ntlm, Negotiate

This (like just about all posts on this blog) is more of a reminder to myself in case I ever see this problem again.

Whilst running some integration tests that exercise end-to-end WCF functionality, I encountered an odd problem that only seemed to manifest itself when I ran the test application against the service hosted in IIS (or so I thought). When the service was running in Visual Studio 2008’s Cassini server, all my tests passed, but when I reconfigured them to point to IIS (the dev environment sits on Windows Server 2008 R2) I got the following error:

The HTTP request is unauthorized with client authentication scheme ‘Ntlm’. The authentication header received from the server was ‘Negotiate,NTLM’.

At this point in time I’m a relative WCF noob and WCF security is (I’m led to believe) a huge topic that I just don’t have much time to learn about right now.  My debugging process is fairly simple:

  • I know the tests work when running against the dev environment so it’s not code.
  • The solution uses Web Deploy projects.  So I delete the virtual application in IIS and rebuild.  That deploys correctly, so it’s not the deployment.
  • The exact same settings running against localhost work for another Service I recently checked in, so I know that this should work.
  • I checked to ensure that the configurations for that other Service were identical (where it mattered) to the one that had problems with authentication.
  • I was using Fiddler to ensure that the client is talking to the service and that the service in turn is talking to other stubbed services.

That last point is where I tripped up and where my WCF (and general) ignorance raised its head.

I had configured the client to use Fiddler’s localhost alias (ipv4.fiddler) in order to view the messages sent across the wire. The server was naturally configured to just use localhost. Although Fiddler does intercept ipv4.fiddler and route it to localhost, WCF – or more likely Windows Authentication – sees it as a different host and points out that there’s a mismatch in authentication – hence the 401 error above.

The solution in the end was very simple – ensure that all client endpoints match all server endpoints and that security is configured the same way for both client and service; the problem then goes away. It took a bit of Google magic but I eventually found this post on StackOverflow which pointed me in the right direction.
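For future me, the shape of the fix in code terms looks something like the sketch below. It’s only an illustration – my real settings live in the .config files, and MyService/IMyService are placeholders – but it shows the rule: the client’s binding security and address must mirror the service’s.

// Needs System.ServiceModel. MyService / IMyService are placeholder types.
var binding = new BasicHttpBinding();
binding.Security.Mode = BasicHttpSecurityMode.TransportCredentialOnly;
binding.Security.Transport.ClientCredentialType = HttpClientCredentialType.Windows;

// Service side (hosted in IIS in my case, shown self-hosted here for brevity).
var host = new ServiceHost(typeof(MyService));
host.AddServiceEndpoint(typeof(IMyService), binding, "http://localhost/MyService/service.svc");

// Client side - same binding settings and, crucially, the same host name
// (localhost, not ipv4.fiddler).
var factory = new ChannelFactory<IMyService>(binding,
    new EndpointAddress("http://localhost/MyService/service.svc"));
var client = factory.CreateChannel();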

ChopShop: A MVC.NET E-Commerce project

I decided to start a little open source project recently. The idea was that it would allow me to flex some programming muscle, play around with some technologies I wouldn’t currently get to use at work and generally make my karma better by giving back (to whoever decides to pick it up) – who knows, it may be the next Magento (yeah right!). Over the course of this coming year I’m hoping to carry on working on this project and documenting some of the decisions I’ll be making around the codebase.

High Level Architecture Decisions

Although I have a features list in mind that’s about a mile long, my intention, with regards to code architecture, is to keep things as simple as possible.  The application will be split into two websites – one for the front end (the Shop) and an administrative back end.  Keeping scalability and performance in mind, the idea would be to allow the front end to be customised/expanded independently of the admin site – the only thing linking the two would be the database.

My web application of choice will of course be ASP.NET MVC 3 using C# 4.  Database persistence will be handled by NHibernate 3 and the entire thing will be glued together with Castle Windsor 2.5 and jQuery.  I’ve opted for a fairly typical (I think) n-tier logical architecture liberally using Interfaces to keep the layers separate.

The general pattern for getting data from the database to the browser is for the Controller (in the Web project) to request data for a View Model from a Service. I’ve opted for a very simple Repository pattern, letting Windsor inject dependencies for me based on the “WithFirstInterface” convention. The Service then requests data from the Repository and passes it back to the Controller, which hands it to the ViewModel to mash together into whatever the View requires. There is some implementation leakage from the Repository layer into the Service layer, but that is acceptable given the dependency on NHibernate in the first place. I felt that keeping things DRY and SOLID far outweighed any perceived need to swap out the ORM at some future stage (thus also sticking to YAGNI).
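To make that concrete, the Windsor wiring is roughly the sketch below. It uses Windsor’s fluent registration API, and the ProductService/IProductService names are purely illustrative:

// Needs Castle.Windsor and Castle.MicroKernel.Registration. Register every Service
// and Repository in the assembly against the first interface it implements.
var container = new WindsorContainer();

container.Register(
    AllTypes.FromAssembly(typeof(ProductService).Assembly)
            .Where(t => t.Name.EndsWith("Service") || t.Name.EndsWith("Repository"))
            .WithService.FirstInterface());

// A controller can now ask for an IProductService in its constructor and Windsor
// supplies the concrete ProductService (and its repository, and so on down the chain).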

So far, so very simple, but then again I’ve been mainly working on putting the framework together for the Admin site.  The architecture for the Shop could look different (but only very slightly).  One of the most fundamental architectural decisions in this project is that each component must be swappable (with the exception of the data access layer).  One of my aims is to have multiple payment providers, so being able to have a plug and play architecture will be vital.  Designing the application this way should also allow for each component to be tested without affecting any other components.
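As a flavour of what “swappable” means in practice, each payment provider would hide behind an interface along these lines – entirely hypothetical at this stage, since none of these types exist in the codebase yet:

// The kind of seam each payment component would plug into; Windsor can resolve every
// IPaymentProvider implementation it finds and the shop picks one at runtime.
public interface IPaymentProvider
{
    string Name { get; }
    PaymentResult TakePayment(decimal amount, string currencyCode, string paymentReference);
}

public class PaymentResult
{
    public bool Succeeded { get; set; }
    public string ProviderReference { get; set; }
    public string Error { get; set; }
}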

The TL;DR architectural summary then would be:

  • Take a dependency on NHibernate
  • Take a dependency on ASP.NET MVC 3
  • Separate layers/components intelligently with Interfaces
  • Glue the layers/components together with Castle Windsor
  • Create a series of unit tests around each component

Project Management

The biggest problem I’ve traditionally had with doing little projects like these in the past has been the lack of focus around what exactly I want to achieve. To combat this, I’ve taken to using agilezen.com to record all the different user stories I want around the creation of this application. To date there are around 52 stories and so far I’ve only managed to complete one(!). The downside to User Stories is that they don’t take frameworks and infrastructure into account. The User may only care about adding a Product to their Catalog, but without a supporting framework in place very little can be done. I’m still not sure how I could better approach that kind of “start-up” problem – I’m sure it’ll come to me the more projects I start.

Code Kata: Langton’s Ant in C#

I’ve been itching to give code katas a try after listening to people like Uncle Bob talk about them in podcasts and reading about them on blogs. However, it wasn’t until I (half) watched a screencast by Micah Martin on solving a very small and simple problem that I actually went and did my own properly.

The problem space in this instance is Langton’s Ant. The problem is very simple:

Squares on a plane are colored variously either black or white. We arbitrarily identify one square as the “ant”. The ant can travel in any of the four cardinal directions at each step it takes. The ant moves according to the rules below:

  • At a white square, turn 90° right, flip the color of the square, move forward one unit
  • At a black square, turn 90° left, flip the color of the square, move forward one unit

This should be a fairly straightforward thing to map out in C#, I thought. If I forgo the graphical aspect of it (as can be seen on the Wikipedia page) then the actual logic should be fairly easy to test. In the end this little exercise took me a couple of hours, which simply proves that a) I’m not as good a programmer as I thought and b) I have plenty to learn about using my IDE more efficiently (VS2008 with ReSharper).
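For reference, the core of the logic boils down to something like the sketch below – not the attached project verbatim, just the two rules above in plain C#/.NET 3.5 terms:

// Needs System.Collections.Generic. Tracks the set of black squares plus the ant's
// position and heading; every square starts white.
public enum Direction { North, East, South, West }

public class Ant
{
    private readonly HashSet<string> blackSquares = new HashSet<string>();
    private int x;
    private int y;
    private Direction heading = Direction.North;

    public void Step()
    {
        string square = x + "," + y;

        if (blackSquares.Contains(square))
        {
            heading = (Direction)(((int)heading + 3) % 4); // black square: turn 90° left
            blackSquares.Remove(square);                   // flip the colour to white
        }
        else
        {
            heading = (Direction)(((int)heading + 1) % 4); // white square: turn 90° right
            blackSquares.Add(square);                      // flip the colour to black
        }

        // move forward one unit in the new heading
        switch (heading)
        {
            case Direction.North: y++; break;
            case Direction.East: x++; break;
            case Direction.South: y--; break;
            case Direction.West: x--; break;
        }
    }
}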

I’ve attached the project to this post.

MSBuild: CSS compression and Javascript minification for ASP.NET MVC

At work we use a continuous integration system to monitor our source repository and create builds. It’s very handy: we developers can simply code all day, commit at the end of the day, and then, magically, what we’ve written gets compiled, configured and deployed to our test and integration environment. The deployment process is a simple file copy, but what’s smart is that only the standard aspx pages and DLLs get copied over – none of the source files are copied, and web.config can also be changed on the fly to allow for different environments (e.g. different database connection strings).

So being the lazy coder that I am I thought it’d be nice if I tried to replicate this to some extent at home for my personal projects.  It’s also very convenient since I set up and wrote the build scripts we use at work – at least I won’t be fumbling too much in the dark this time round 😉

I have no intention of setting up a continuous integration server at home (we use CruiseControl.NET at the office), so a simple batch file should do the trick.  What I wanted with a simple double click was:

  1. In place compilation of all my projects within a solution
  2. Copy only what’s required  to my “deployment” folder (i.e. no source files, only aspx, ascx, asax, js, css etc)
  3. Minify and compress my javascript and css
  4. Change web.config so it knows which database to connect to (production or dev) – future requirement
  5. Automate running unit tests – future requirement

Reasons for wanting the above:

  1. Since I’m doing all this on my home/development PC there’s really no need for step 1. When I run the website locally the files are compiled automatically; however, the automated build process will be set to build in “Release” mode.
  2. I don’t like my source code hanging around on a production server.  It has no business being there even if IIS won’t serve it.
  3. Minification and compression make for happy end users
  4. Remember the last time you deployed a config file to production with the connection string to your local database?  Yeah I haven’t done that in a while 😉
  5. I use ReSharper to run my unit tests, but rarely (if ever) do I run the whole suite of tests at once.  At work I’ve set the build script to do this, but this will be just a simple script for me at home.  In future I will probably include this though.

Pre-Requisites

To achieve the above you will need to download and install the following:

  • YUI Compressor for .NET (the MSBuild CompressorTask)
  • MSBuild Community Tasks
  • Microsoft SDC Tasks

(Note that apart from the YUI Compressor the other two are optional, but they make things easier. The Community Tasks probably aren’t even needed in this simple example, but I’m sure I used them for the build script at work, which I’m liberally using as my base example.)

Put those DLLs in a place you’ll remember – we’ll need them later.

To achieve this automated build process we’ll need 2 files.  One will be a batch file calling MSBuild with various options and the other will be the proj file which contains all the build instructions.

BuildForProduction.cmd

@echo off
echo Creating a Production Build
path=%path%;C:\Windows\Microsoft.NET\Framework64\v3.5
MSBuild "C:\Users\Reza\Documents\Visual Studio 2008\Projects\BillManager\BuildScript.proj" /v:n /m /tv:3.5 /p:TargetFrameworkVersion=v3.5;Configuration=Release;Platform=AnyCPU /t:BuildCode /fl

A very simple batch file. The first thing is to set the path so that we can find the MSBuild executable. Note that I’m using 64-bit Windows 7 Professional, so if you’re running a 32-bit OS just remove the “64” from the Framework path.

Next is the command that does all the good stuff. We call MSBuild with the options that we want to execute. I won’t go into details – you can run MSBuild /? to see all the available options. All I’m telling MSBuild with that line is that I want it to log with normal verbosity, use as many processors as it can find (all 8 baby!), use version 3.5 of the .NET framework and build in “Release” mode. Right at the end I also tell it which Target I want it to run (BuildCode).

The build script itself is rather large, so I’ll split it up to make better sense of it. The thing to remember is that this script is essentially just a project file – the same as when you create a new project in Visual Studio (csproj or vbproj) – so unfortunately we’re dealing with XML here (yuck). I’ll just go over the highlights; both files are included at the end of this post.

BuildScript.proj

At the very start of the file we have the following line:

<Project DefaultTargets="BuildCode" InitialTargets="GetPaths;GetProjects" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

What this basically says is that if no Target is specified, then the BuildCode target is used. The InitialTargets attribute, however, states that those two Targets must run before anything else. As can be deduced from the target names, that’s where I get my paths in order and create the collection of projects that I want to compile.

<Target Name="BuildCode">    
    <!-- Build the assemblies -->
    <MSBuild Projects="@(CodeProjects)"
             Targets="$(BuildTargets)"
             BuildInParallel="true"
             Properties="Configuration=$(Configuration);Platform=$(Platform)">
      <Output TaskParameter="TargetOutputs"
              ItemName="CodeAssemblies"/>     
    </MSBuild>

    <!-- Add the compiled code assemblies to master list of all compiled assemblies for the build -->
    <ItemGroup>
      <CompiledAssemblies Include="@(CodeAssemblies)" />
    </ItemGroup>

    <Message Text="Completed BuildCode task"
            Importance="high"/>

    <CallTarget Targets="CopyToBuildFolder" />
    
  </Target>

Above is the Target that we invoked from the batch file. What it’s effectively stating is that I want to build all the projects defined in the @(CodeProjects) collection. There’s more to it, but for this example we’re not using any of the other parameters. The last command is where we tell the BuildCode target to call another target, CopyToBuildFolder.

<Target Name="CopyToBuildFolder">
    <!-- Copies all compiled code to the correct folder - ready for deployment -->
    
    <!-- We need to delete all files in this folder first to ensure a clean build-->
    <Folder.CleanFolder Path="$(BuildsFolder)" Force="True" />

    <!-- Copy main website files - This is ASP.NET MVC specific -->
    <CreateItem Include="$(ProductName).Web\**\Views\**\*.aspx;
                         $(ProductName).Web\**\Views\**\*.ascx;
                         $(ProductName).Web\**\Views\**\*.config;
                         $(ProductName).Web\*.config;
                         $(ProductName).Web\*.asax;
                         $(ProductName).Web\default.aspx;
                         $(ProductName).Web\**\bin\**\*.dll;
                         $(ProductName).Web\**\content\**\*.css;
                         $(ProductName).Web\**\content\**\*.jpg;
                         $(ProductName).Web\**\content\**\*.gif;
                         $(ProductName).Web\**\content\**\*.png;
                         $(ProductName).Web\**\scripts\960.gridder.js;
                         $(ProductName).Web\**\scripts\billmanager.js;
                         $(ProductName).Web\**\scripts\jquery.tablesorter.min.js;
                         $(ProductName).Web\**\scripts\jquery-1.3.2.min.js;
                         $(ProductName).Web\**\scripts\jquery-ui-1.7.2.custom.min.js;
                         $(ProductName).Web\**\scripts\json2.js;">
      <Output TaskParameter="Include" ItemName="FilesForWeb" />
    </CreateItem>

    <!-- copy the files to the production build area-->
    <Copy SourceFiles="@(FilesForWeb)"
          DestinationFolder="$(BuildsFolder)\%(RecursiveDir)" />

    <!-- Change the Configs -->
    <CallTarget Targets="UpdateProductionConfig" />
    
    <!-- Compress any JS and CSS -->
    <CallTarget Targets="CompressAndMinifyJavascriptAndCSS" />
    
  </Target>

In this Target I’m specifying all the files that I want to include for deployment. The first thing to do is ensure that the destination folder is clean, so I run a task that deletes all files in the folder. This is where the SDC tasks come in handy: recursively deleting files in a folder with standard MSBuild tasks is a pain, so by including that one DLL I have saved a lot of time. The next step is to specify which files I want to copy across. My solution contains one ASP.NET MVC project and three class library projects, so all I want to deploy is what’s in the Web folder. Note the slightly odd syntax with the “**” in the paths: it tells MSBuild to select files recursively. So the first line of the CreateItem task gets all the aspx files under the Views folder, recursively. If you’re at all familiar with ASP.NET MVC and its folder structure then you’ll know why I need all these files.

So once I’ve created the collection of files I want (called @(FilesForWeb)) I call the Copy task and use the %(RecursiveDir) metadata to ensure that the folder hierarchy is maintained. Once that’s done I call the next two Targets.

<Target Name="CompressAndMinifyJavascriptAndCSS">
    <!-- Compresses javascript and CSS. combines into one file -->
    <CreateItem Include="$(BuildsFolder)\Content\Site.css">
      <Output TaskParameter="Include" ItemName="CSSFiles"/>
    </CreateItem>

    <CreateItem Include="$(BuildsFolder)\scripts\billmanager.js;">
      <Output TaskParameter="Include" ItemName="JSFiles"/>
    </CreateItem>

    <Attrib Files="%(JSFiles.FullPath)" ReadOnly="false" />

    <CompressorTask
        CssFiles="@(CSSFiles)"
        DeleteCssFiles="false"
        CssOutputFile="$(BuildsFolder)\Content\site.css"
        CssCompressionType="YuiStockCompression"
        JavaScriptFiles="@(JSFiles)"
        ObfuscateJavaScript="yes"
        PreserveAllSemicolons="false"
        DisableOptimizations="false"
        EncodingType="UTF8"
        DeleteJavaScriptFiles="false"
        LineBreakPosition="-1"
        JavaScriptOutputFile="$(BuildsFolder)\scripts\billmanager.js"
        LoggingType="ALittleBit"
        ThreadCulture="en-GB"></CompressorTask>

    <CallTarget Targets="CompressJson2" />
  </Target>

We’re almost done here… The purpose of this last Target is to compress and minify my application-specific JavaScript and CSS. We could also combine all the JavaScript into a single file to be more efficient, but for now I’m only compressing individual files (jQuery and gridder are already minified). I think the Target is fairly self-explanatory: I’m compressing my application-specific CSS and JavaScript and then saving each file over itself. What minification does is compact the JavaScript file as much as possible. It’ll run through your script, replace long-winded variable names with much shorter ones and try not to break your logic. It also removes any comments, so you can be as verbose as you like whilst developing, but when it gets pushed to a live website those funny remarks/deep insights into your code are hidden from your public audience. Note that I’m also setting the encoding to UTF8 – useful if you need to do localisation within your JavaScript.

Phew… long post.  Scripts for this can be found here.  Note that this is a sample only – use it at your own risk!