Importing data from StackExchange or How to deserialize REALLY large XML files

I’ve recently been reviewing MongoDB for work purposes and have been really impressed with MongoDB’s story around replication and sharding. After carefully reading all the documentation, I fired up a couple of VMs in Windows Azure and soon had a working replica set installed into an Azure Availability Set. This meant that I would fall in line with Azure’s 99.95% availability SLA, and this was easily tested by running two console apps (one to insert data and one to read) and randomly killing MongoDB services. The console apps ran in a while loop and so were continually trying to connect to Mongo. Failover for reads was very quick, but writes took a little longer to recognise that the primary node was down – presumably waiting for the other nodes to negotiate on “who’s the boss”.

Once satisfied that this could be overcome with a bit of clever coding (perhaps using a back-off strategy), I wanted to explore performance. So what kind of dataset might be large enough to stress Mongo, let me explore optimal data models and be free? The StackExchange Data Dump! It just so happens that the September 2013 dump was recently made available. One overnight BitTorrent download later and I had 14GB of compressed XML on my computer.

I figured I’d start with a small set of data – I think I settled on bicycles.stackexchange.com initially. The data export for each site is fairly straightforward, and the schema is presumably denormalised from what StackExchange actually use on their systems.

StackExchange XML Export

So you could go ahead and create a model for Mongo, parse the XML and then map the data accordingly, but I wanted to get the import over and done with and get on with the good stuff of performance testing. The easiest way out, then, was to use whatever’s been defined in the XML files and the standard .NET XML deserializer – effectively deserializing in one go (I’m sure you can tell where this is headed…). You can use xsd.exe, point it at the XML files and generate the schema and the corresponding .NET classes. Easy stuff. Once done you can use the generated classes and the .NET deserializer to pump data into Mongo. The problem, though, is that the data export does not specify any types, so everything comes through as a string. Not a great situation to be honest. To get around that I used AutoMapper to map the generated classes onto a data model that is a bit more appropriate for my purpose.
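For reference, the xsd.exe step looks something like this from a Visual Studio command prompt – the first call infers an .xsd schema from the XML, the second generates the corresponding classes:

xsd.exe comments.xml
xsd.exe comments.xsd /classes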

There’s always a “but” somewhere… I didn’t download 14GB of data to mess around with a small data set like bicycles.stackexchange.com. I came for the daddy – I wanted to query the data for StackOverflow.com! Easy enough, I thought: point the code at the directory where I’d unzipped the data and let her rip. Clearly I didn’t think this through… For reference, for the September 2013 data dump, the Posts.xml file for StackOverflow is 20GB in size. So one Out of Memory exception later, I was left scratching my head on how to import this bad boy into… well, anywhere really.

StackOverflow to the rescue… well, the site really. I found a StackOverflow answer by Jon Skeet that explained very well what I needed to do. Using XmlReader and “yield” you can effectively “stream” data from the XML and transform it on the fly. I modified his answer slightly to produce the following:

private static IEnumerable<T> StreamTypeFromXml<T>(string xmlFile, string elementName, Func<XElement, T> converter) where T : class
{
    // Requires System.Xml and System.Xml.Linq.
    using (XmlReader reader = XmlReader.Create(xmlFile))
    {
        reader.MoveToContent();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                var element = XElement.ReadFrom(reader) as XElement;

                if (element != null)
                {
                    yield return converter.Invoke(element);
                }
            }
        }
    }
}

So what we basically have here is a function that takes the name of the XML file that we want to parse, the element name that we want to pick out (in the data dump it’s just “row”) and a Func<XElement, T> which does some work on each row as it’s yield returned. Unfortunately I couldn’t figure out (read: was too impatient to find) an elegant way of deserializing the yielded result automatically into a class. So all my Func<XElement, T> ended up doing is mapping each row into the classes that I’d generated previously. A little bit of extra processing needs to be done because the data is stored in attributes (to save space, presumably), so I ended up with code like this:

private static IEnumerable<commentsRow> GetAllXmlComments(string stackDirectory)
{
    var commentFile = Path.Combine(stackDirectory, "comments.xml");
    var allXmlComments = StreamTypeFromXml(commentFile, "row", element => new commentsRow
    {
        CreationDate = element.GetAttributeStringValue("CreationDate"),
        Id = element.GetAttributeStringValue("Id"),
        PostId = element.GetAttributeStringValue("PostId"),
        Score = element.GetAttributeStringValue("Score"),
        Text = element.GetAttributeStringValue("Text"),
        UserDisplayName = element.GetAttributeStringValue("UserDisplayName"),
        UserId = element.GetAttributeStringValue("UserId")
    });

    return allXmlComments;
}

For a proof of concept, I can live with it.
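The GetAttributeStringValue helper isn’t shown here – it’s just a small extension method over XElement, something along these lines:

public static class XElementExtensions
{
    // Returns the attribute's value, or null if the attribute is missing.
    public static string GetAttributeStringValue(this XElement element, string attributeName)
    {
        var attribute = element.Attribute(attributeName);
        return attribute == null ? null : attribute.Value;
    }
}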

So now we can get a collection of all the data as a “stream” and do the mapping for Mongo. Rather than do the mapping manually (again!) I drafted in AutoMapper. So long as the property names are similar (or, in my case, the same) AutoMapper will try to match up the properties. Where it got a little stuck was in converting strings to ints. Rather than have AutoMapper guess that that’s what you want, you need to create a type converter and set up the mapping convention. The same goes for string to DateTime conversions, but really it’s just a few lines of code and AutoMapper will go off and do its thing.
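As a rough sketch – this assumes the AutoMapper 2.x-era static API (the ITypeConverter signature changed in later versions) and a hypothetical Comment destination type:

public class StringToIntConverter : ITypeConverter<string, int>
{
    // AutoMapper 2.x-era signature; later versions pass the source value in directly.
    public int Convert(ResolutionContext context)
    {
        int value;
        return int.TryParse((string)context.SourceValue, out value) ? value : 0;
    }
}

// At startup: teach AutoMapper the string -> int convention, then map each
// deserialized row onto the (hypothetical) Mongo-friendly Comment model.
// A string -> DateTime converter follows exactly the same pattern.
Mapper.CreateMap<string, int>().ConvertUsing<StringToIntConverter>();
Mapper.CreateMap<commentsRow, Comment>();

var comment = Mapper.Map<commentsRow, Comment>(xmlCommentRow);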

That’s pretty much it. This post was mainly about using an XmlReader to stream large amounts of data using yield return. The full source is up on GitHub if you want to see the whole thing.

RaspberryPi: A web API for gphoto

I was recently inspired by my good friend Josh Gallagher to flex my geek credentials when he mentioned he’d acquired a Raspberry Pi.  I’d seen mention of it on Engadget but had been too busy to pay much attention to it.  I’d seen an article on SLR Lounge about someone fitting one of these cheap little devices into a camera grip to control the camera so I thought I’d have a go at that.  I’m more of a software guy though, so I thought I’d try to see whether I could control my camera from my Nexus 7.

As a former web developer, my immediate instinct was to create a thin web API over gphoto2.  I would then be able to create a UI with HTML which I could use from any device with a browser.  As well as learning to use the Raspberry Pi I would also use this opportunity to learn a bit about Python.

The technology stack I ended up using was:

  • Python
  • Bottle v0.10 – a Sinatra-like web framework
  • gphoto2
  • jQuery
  • Raspberry Pi
  • Nikon D3s

Although it would have been easy to just slap Mono on the box and whip up a .NET based solution, I thought it would be more challenging to use something entirely unfamiliar.  I spent about two days looking through Learn Python the Hard Way to get myself up to speed.  Getting to grips with the basic language syntax was fairly straightforward and it was actually quite fun learning something new.

Before writing any kind of web-based API, I wanted to prove the concept of hooking up my camera and getting the RPi talking to it.  This is where I hit the first problem, one mentioned in David Hunt’s post, regarding the limitations of the RPi’s USB controller.  The limitation manifests itself as having to disconnect and then reconnect the camera because of what appear to be random PTP I/O errors.  David mentions the use of a small C program to reset the USB port.  A lot of googling later I eventually found the program in question.  More googling and we have a bash script to string together the requisite commands that will control the camera:

#!/bin/bash
#
# Find the camera's USB bus/device (e.g. "001,006") and turn it into a path segment ("001/006")
dev=`gphoto2 --auto-detect | grep usb | cut -b 36-42 | sed 's/,/\//'`
if [ -z "${dev}" ]
then
  echo "Error: Camera not found"
  exit 1
fi
resetusb /dev/bus/usb/${dev}
gphoto2 "$@"
resetusb /dev/bus/usb/${dev}

Deciding on a suitable web framework took a bit of time. After several failed attempts to get Django to work with Apache, I went with Bottle as a lightweight web framework and ditched both Django and Apache.

Getting Python to call out to a bash script felt a bit clunky though, and turning that part into a Python script was fairly straightforward.  Python has the subprocess module, which allows Python to call out to other “executables” on the system and return data back to the Python script.  This results in a couple of simple Python methods that can be combined to call out to gphoto and return the appropriate result:


import subprocess

# The USB port the camera was last detected on, e.g. "001/006"
global_usb_port = None


def resetusb():
    if global_usb_port is not None:
        subprocess.Popen(['sudo', '/home/pi/usbreset', '/dev/bus/usb/' + global_usb_port])
        return True
    else:
        return False


def detectcamera():
    gphoto_detect = subprocess.check_output(['sudo', 'gphoto2', '--auto-detect'])

    if not gphoto_detect:
        return False

    usb_device = gphoto_detect.split(":")

    if len(usb_device) < 2:
        return False
    else:
        usb_device = usb_device[1].strip().replace(",", "/")

    global global_usb_port
    global_usb_port = usb_device
    return True


def execute(command):
    if not detectcamera():
        return "Camera not found"

    resetusb()
    gphotocommand = ['sudo', 'gphoto2'] + command
    gphoto_response = subprocess.check_output(gphotocommand)
    resetusb()

    return gphoto_response

Using Bottle to create the web API also proved to be straightforward.  The one problem I encountered was getting gphoto to capture an image and display it on the web page.  It turns out that the downloaded image is saved as read-only, and overwriting it with Python caused a prompt.  As it’s a web process, I didn’t see the prompt until I tried to replicate the issue from the command line.  The solution was to copy the file to the location where Bottle was set up to serve static files and then delete the original file.
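The Bottle side is conceptually very thin. As a rough sketch of the capture route (the route name, paths and filenames here are illustrative rather than what’s in the repo; execute() is the helper above):

import os
import shutil
from bottle import route, run, static_file

STATIC_ROOT = '/home/pi/camera-web/static'  # hypothetical folder Bottle serves static files from

@route('/capture')
def capture():
    # Ask gphoto2 to capture and download an image, then copy it somewhere
    # Bottle can serve it from and remove the read-only original.
    execute(['--capture-image-and-download', '--filename', 'capture.jpg'])
    shutil.copy('capture.jpg', os.path.join(STATIC_ROOT, 'capture.jpg'))
    os.remove('capture.jpg')
    return static_file('capture.jpg', root=STATIC_ROOT)

run(host='0.0.0.0', port=8080)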

With what I’ve done so far I can (for any camera supported by gphoto2):

  • List all the configuration options that the camera presents
  • View individual configuration options
  • List the camera’s abilities
  • Capture an image and display it on the webpage

Things to do:

  • Ability to change configuration values
  • Ability to string together a series of commands – this should allow things like bracketing and time lapse photography (although the D3s already has a bracketing function)
  • Sort out the hardware side of things so I can attach the RPi to the camera and operate it in the field from my Nexus 7
  • Use a data store to store camera specific settings and preferences
  • Put the source code up on GitHub – done; the source can be found on GitHub


WCF: Unauthorized client authentication with server header Ntlm, Negotiate

This (like just about all posts on this blog) is more of a reminder to myself in case I ever see this problem again.

Whilst running some integration tests that exercise end-to-end WCF functionality, I encountered an odd problem that only seemed to manifest itself when I ran the test application against the service hosted in IIS (or so I thought).  When the service was running in Visual Studio 2008’s Cassini server, all my tests passed, but when I reconfigured to point to IIS (the dev environment sits on Windows Server 2008 R2) I got the following error:

The HTTP request is unauthorized with client authentication scheme ‘Ntlm’. The authentication header received from the server was ‘Negotiate,NTLM’.

At this point in time I’m a relative WCF noob and WCF security is (I’m led to believe) a huge topic that I just don’t have much time to learn about right now.  My debugging process is fairly simple:

  • I know the tests work when running against the dev environment so it’s not code.
  • The solution uses Web Deploy projects.  So I delete the virtual application in IIS and rebuild.  That deploys correctly, so it’s not the deployment.
  • The exact same settings running against localhost work for another Service I recently checked in, so I know that this should work.
  • I checked to ensure that the configurations for that other Service were identical (where it mattered) to the one that had problems with authentication.
  • I was using Fiddler to ensure that the client is talking to the service and that the service in turn is talking to other stubbed services.

That last point is where I tripped up and where my ignorance of WCF (and of a few other things) reared its head.

I had configured the client to use Fiddler’s localhost alias (ipv4.fiddler) in order to view the messages being sent across the wire.  The server was naturally configured to just use localhost.  Although Fiddler can and does intercept ipv4.fiddler and route it to localhost, WCF – or more likely Windows Authentication – sees it as a different host and will point out that there’s a mismatch in authentication – hence the 401 error above.

The solution in the end was very simple – ensure that all client endpoints match the server endpoints and that security is configured the same way for both client and service; the problem then goes away. It took a bit of Google magic, but I eventually found this post on Stack Overflow which pointed me in the right direction.
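For completeness, the sort of thing to check is that the client’s endpoint address and security settings mirror the service’s. A made-up illustration of the client side (binding names, addresses and contracts here are placeholders, not this project’s actual configuration):

<system.serviceModel>
  <bindings>
    <basicHttpBinding>
      <!-- Must match the security configuration on the service side -->
      <binding name="windowsAuth">
        <security mode="TransportCredentialOnly">
          <transport clientCredentialType="Ntlm" />
        </security>
      </binding>
    </basicHttpBinding>
  </bindings>
  <client>
    <!-- Use the same host the service is configured for (localhost here),
         not ipv4.fiddler, or Windows Authentication sees a different host -->
    <endpoint address="http://localhost/MyService/Service.svc"
              binding="basicHttpBinding"
              bindingConfiguration="windowsAuth"
              contract="MyService.IMyService" />
  </client>
</system.serviceModel>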

Cooking: Banana Walnut and Chocolate cake/bread

I found this Banana Walnut and Chocolate cake recipe on epicurious.com and wanted to save it here with proper measurements for future reference.

  • 187.5g flour
  • 1 tsp baking soda
  • 1 tsp baking powder
  • 1/4 tsp salt
  • 113.5g butter
  • crushed or chopped dark chocolate
  • 1 packet walnuts (about 100g)
  • 200g sugar
  • 2 large eggs
  • 2-3 bananas
  • 2 tbsp lemon juice
  • 1.5 tsp vanilla extract

As per the recipe on epicurious.com:

Preheat oven to 180°C. Butter and flour a small metal bread loaf pan. Whisk first 4 ingredients in medium bowl to blend. Combine chocolate chips and walnuts in small bowl; add 1 tablespoon flour mixture and toss to coat.

Beat butter in large bowl until fluffy. Gradually add sugar, beating until well blended. Beat in eggs 1 at a time. Beat in mashed bananas, lemon juice and vanilla extract. Beat in flour mixture. Spoon 1/3 of batter into prepared pan. Sprinkle with half of nut mixture. Spoon 1/3 of batter over. Sprinkle with remaining nut mixture. Cover with remaining batter. Run knife through batter in zigzag pattern.

Bake bread until tester inserted into center comes out clean, about 1 hour in a fan-assisted oven. Turn out onto rack and cool.

In case I don’t read the baking soda instructions again: the lemon juice helps to activate the bicarbonate and helps the cake rise. No lemon juice = flat cake.

ChopShop: An MVC.NET E-Commerce project

I decided to start a little open source project recently.  The idea was that it would allow me to flex some programming muscle, play around with some technologies I wouldn’t currently be allowed to use at work, and generally improve my karma by giving back (to whoever decides to pick it up) – who knows, it may be the next Magento (yeah right!).  Over the course of the coming year I’m hoping to carry on working on this project and documenting some of the decisions I’ll be making around the codebase.

High Level Architecture Decisions

Although I have a features list in mind that’s about a mile long, my intention, with regards to code architecture, is to keep things as simple as possible.  The application will be split into two websites – one for the front end (the Shop) and an administrative back end.  Keeping scalability and performance in mind, the idea would be to allow the front end to be customised/expanded independently of the admin site – the only thing linking the two would be the database.

My web framework of choice will of course be ASP.NET MVC 3 using C# 4.  Database persistence will be handled by NHibernate 3, and the entire thing will be glued together with Castle Windsor 2.5 and jQuery.  I’ve opted for a fairly typical (I think) n-tier logical architecture, liberally using interfaces to keep the layers separate.

The general pattern for getting data from the database to the browser is for the Controller (in the Web project) to request the data for a View Model from a Service. I’ve opted for a very simple Repository pattern, allowing Windsor to inject dependencies for me based on the “WithFirstInterface” convention. The Service then requests data from the Repository and passes it back to the Controller, which gives it to the ViewModel to mash together into whatever the View requires.  There is some implementation leakage from the Repository layer into the Service layer, but that is acceptable given the dependency on NHibernate in the first place.  I felt that keeping things DRY and SOLID far outweighed any perceived need to swap out the ORM at a future stage (thus also sticking to YAGNI).
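To give a flavour of the wiring, a minimal sketch of that convention-based registration – the ProductService/ProductRepository names are placeholders, and the exact fluent method names moved around a little between Windsor releases:

using Castle.MicroKernel.Registration;
using Castle.Windsor;

// Register every Service and Repository against the first interface it
// implements, so IProductService resolves to ProductService, IProductRepository
// to ProductRepository, and so on.
var container = new WindsorContainer();
container.Register(
    AllTypes.FromAssembly(typeof(ProductService).Assembly)
            .Where(t => t.Name.EndsWith("Service") || t.Name.EndsWith("Repository"))
            .WithService.FirstInterface());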

So far, so very simple, but then again I’ve been mainly working on putting the framework together for the Admin site.  The architecture for the Shop could look different (but only very slightly).  One of the most fundamental architectural decisions in this project is that each component must be swappable (with the exception of the data access layer).  One of my aims is to have multiple payment providers, so being able to have a plug and play architecture will be vital.  Designing the application this way should also allow for each component to be tested without affecting any other components.

The TL;DR architectural summary then would be:

  • Take a dependency on NHibernate
  • Take a dependency on ASP.NET MVC 3
  • Separate layers/components intelligently with Interfaces
  • Glue the layers/components together with Castle Windsor
  • Create a series of unit tests around each component


Project Management

The biggest problem I’ve traditionally had with doing little projects like these in the past has been the lack of focus around what exactly I want to achieve.  To combat this, I’ve taken to using agilezen.com to record all the different user stories I want for this application.  To date there are around 52 stories, and so far I’ve only managed to complete 1(!).  The downside to user stories is that they don’t take frameworks and infrastructure into account.  The user may only care about adding a Product to their Catalog, but without a supporting framework in place very little can be done.  I’m still not sure how I could better approach that kind of “start up” problem – I’m sure it’ll come to me the more projects I start.


Mac: MacBook Pro hard disk upgrade

It didn’t take long before my requirements for my laptop exceeded the capabilities of the thing.  The bottom-of-the-range specification of a 250GB hard disk with 4GB of RAM is very quickly overwhelmed when trying to run Windows 7 and Visual Studio 2010 through Parallels.  The latest refresh of the MacBook Pro didn’t take my fancy either – it just doesn’t seem worth the money and I may as well wait out another year.  So what’s a geek to do when the laptop’s hard disk is straining because I’ve carelessly thrown two operating systems at it, as well as a couple of development environments (that’s Xcode rather than TextMate) and my whole music collection?

Well, I could have upgraded the entire laptop, but where’s the fun in that?  A quick look on scan.co.uk brings up a very reasonably priced Western Digital Scorpio Black 500GB.  Under £50 for an HDD that’s twice the capacity of my current one and much faster – yes please.  Might as well upgrade the RAM at the same time (because modern OSes need more than 2GB each to run smoothly), but sadly Scan was out of stock.

So the problem becomes very simple: What steps are required to migrate hard disks without losing any data or settings when you run Windows 7 on a BootCamp partition with Parallels (although I imagine the same process applies to VMWare Fusion)?

I needed:

  1. An external hard disk of some description (I bought another 3.5″ SATA, which I was going to re-use after in my desktop, together with an external HDD enclosure that I am already using for my media center). This will preferably be bigger than your existing laptop’s HDD.
  2. A MacBook Pro (mid-2010 vintage although I don’t see why it wouldn’t work with something older).
  3. A tiny tiny Phillips screwdriver and a size 0 Torx screwdriver.
  4. TimeMachine (comes with Mac).
  5. WinClone (free download).
  6. The original OSX install CD that comes with the MacBook
  7. Lots and lots of Time…

The first thing I did after connecting the new external hdd to the laptop (via USB) was to partition it into 2.  One partition would be for the OSX backup and the other for the BootCamp image.

Next, run TimeMachine to take a full backup of your OSX installation.  Set the destination to the external disk and wait for a few hours.  TimeMachine will back absolutely everything up… EXCEPT the BootCamp partition.  Which is where WinClone comes in.  After TimeMachine has finished, run WinClone and take an image of the BootCamp partition.  I placed this image directly on the partition of the external HDD reserved for it.

Once both of these processes have finished, power down and swap the hard disks round.  The actual assembly/disassembly time was in the region of 15 minutes – less if you have a Torx screwdriver, but a small set of pliers will do the trick too.  Rather than describe the process there’s a very succinct video on YouTube showing the process.

With that out the way, power up the laptop and pop the OSX CD in.  When the bootstrapping process has completed, OSX will ask whether you want to restore from a backup.  It wasn’t entirely intuitive and in your face like most of Apple’s prompts, but you’ll need to look for a menu item at the top for the utility to restore.  Before you do that though you will need to create a partition on the new hard disk (otherwise there’s nowhere for the OS to install to) and this can be found in the Disk Utility.  You can then set TimeMachine to restore to the new partition you’ve created on your brand new hard disk.  After waiting for what seemed like a long time (overnight) your laptop will be ready to go… almost.

What’s still missing is the Windows installation that’s sitting as an image on the external hard disk.  Fire up BootCamp again and recreate a BootCamp partition.  As I now have a larger hard disk I can afford a larger Windows partition, so I created it twice as big as my last one.  Once that’s been created (don’t go through the whole process, just create the partition), I used WinClone to restore the image onto the new BootCamp partition.  However, because I increased the partition size and want to take advantage of that extra space, I needed to do one more thing.  The Tools menu on WinClone has an option to Expand NTFS Filesystem; use this on the new BootCamp partition and you’re pretty much all set.

At this point, Parallels still complains that it can’t find the Virtual Machine you’ve just lovingly restored into the new BootCamp partition.  That’s because it’s still looking for it on the old hard disk.  Power down the Virtual Machine and look in the Config menu.  Switch the hard disk in the dropdown to point to your new disk and then restart the VM.  Windows will go through a process of checking the disks but after that everything should be back to normal.

And so ends (part of) my laptop upgrade journey.  It was a lot easier than I thought it would be (this being an Apple product and all).  Next up is a RAM upgrade but that should be a lot easier now I’ve opened it up once already.

5 months later…

Time flies. Since getting my MacBook Pro I have been procrastinating. That is not to say I have done nothing with my time – I’ve brought myself up to date with ASP.NET MVC 2, learned how to implement and query with NHibernate 3.0 (using the new LINQ provider, Criteria and even some HQL), written a web application using .NET 4 (and got it working correctly on a continuous integration build server), learned how to use the new templating features in jQuery and finally grokked JavaScript’s functions-as-first-class-citizens concept. I also put my MacBook through its paces and did some tutorials on Ruby, Rails and node.js.

When I’m not at work (coding), I tend to sit at the computer at home (and code) or read (about code and how to make my code better). After doing this for close to 2 years now I think I’m about ready for a break and my dear wife has provided me with a distraction – Finance.

She’s kindly bought me a copy of Finance – The Basics by Erik Banks and, even though I’m only on the 2nd chapter, I’m already finding it an interesting and easy-to-grasp book (so far!). I opted for the Kindle version, which is marginally cheaper than the paperback, although I would recommend anyone else go for the paperback version. The Kindle is a very nice format and for reading it is as good as (if not better than) reading a proper book; however, the digital version of this book is disappointing with regards to the diagrams that come with it.

One would have thought that, being a proper e-book bought from Amazon for an Amazon device, things like diagrams would be crisp and clear and would flow nicely together with the text. Sadly this is not the case. The diagrams are all washed out and fuzzy. Enlarging them makes no difference; if anything the diagram gets worse. I’m not even talking about complex graphs (I’m only on chapter 2!). A simple box containing text (the same size as the content) looks washed out. I can appreciate that something like vector graphics would be beyond the capabilities of a humble e-reader, but at least include a decent-sized graphic with the book. This isn’t a gripe at Amazon – well, maybe it is a little, if only due to quality control – but publishers really should make a better effort than just dumping some ASCII in a file and flogging it at the same price as a physical book.

So there we go, a short update of my life, a slight diversion of interests and a mini-review/whinge of my new Kindle 3 (WIFI only version).

Mac: Macbook Pro – My first week

I went and bought a 13″ Macbook Pro last Saturday – the base model with no bells or whistles. Impulse brought me to the Apple Store on Regent Street and impulse compelled me to buy one on the spot. Common sense would have told me to exhibit some patience and buy the laptop online with a faster hard disk, but impulse won.

I’ve been wanting a laptop for a while now. I satisfied my craving for a state of the art desktop PC last Christmas and I felt it was time to finally get a laptop – the missing piece in my digital lifestyle. I’ve always been curious about the “Mac World”. A mystical place I thought, staring enviously from my “Microsoft World”. All “Mac People” ever went on about was how much better it was than the “Windows way”, so I thought I’d give it a go. With the advent of programs like VMWare, Parallels and Boot Camp I thought I could always flee to the safety of Windows – either on the laptop or back to my desktop – a risky strategy in these economically challenging times.

As it turns out my fears were unjustified. My initial use case (and justification) was that I simply needed a small device on which to browse the web, check on emails and do some lightweight programming on (perhaps learning a bit about Ruby and this thing called ‘Rails’). And in this regard the Macbook Pro simply excels. It’s fast, it’s simple and it does what I want it to. Let’s ignore for the moment that for this simple use case I could have gone and bought a £250 netbook – it’s not an economic itch to scratch by any means.

As a professional web developer I could not resist installing a few other ‘extras’ though. Firefox and Chrome quickly joined Safari, as did TextMate, QuickSilver and Ruby. Solving problems on projecteuler.net proved to be fun using Ruby and when I get stuck doing something there’s always Microsoft’s Remote Desktop Connection to my PC where I can fire up Visual Studio for some brute force calculations (8 logical cores… wooo!).

So far it’s been good, but then it’s only been a week and the novelty of actually having a laptop hasn’t worn off yet. I have noticed that I tend to use the laptop in the living room or in bed now instead of being glued to my desktop. That is not so much a statement of how much I enjoy using Mac OS X (I still have some ‘issues’ with that) but of how nice it is to be able to just fire up a TextMate instance or browser and ‘do stuff’ whilst still being ‘sociable’ with my other half.

I have resolved to create a website for my upcoming wedding and I think I will probably do some of it on this laptop. It will be interesting to see how well it will cope when used in anger.

Code Kata: Langton’s Ant in C#

I’ve been itching to give code katas a try after listening to people like Uncle Bob talk about them in podcasts and reading about them on blogs. However, it wasn’t until I (half) watched a screencast by Micah Martin on solving a very small and simple problem that I actually went and did my own properly.

The problem space in this instance is Langton’s Ant. The problem is very simple:

Squares on a plane are colored variously either black or white. We arbitrarily identify one square as the “ant”. The ant can travel in any of the four cardinal directions at each step it takes. The ant moves according to the rules below:

  • At a white square, turn 90° right, flip the color of the square, move forward one unit
  • At a black square, turn 90° left, flip the color of the square, move forward one unit

This should be a fairly straightforward thing to map out in C#, I thought. If I forego the graphical aspect of it (as can be seen on the Wikipedia page) then the actual logic should be fairly easy to test. In the end this little test took me a couple of hours to do, which simply proves that a) I’m not as good a programmer as I thought, and b) I have plenty to learn about using my IDE more efficiently (VS2008 with ReSharper).
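For what it’s worth, the core of the logic boils down to something like this – a bare-bones sketch using a fixed-size bool[,] grid, not the code in the attached project:

public enum Direction { North, East, South, West }

public class Ant
{
    public int X;
    public int Y;
    public Direction Facing;

    // blackSquares[x, y] == true means the square is black.
    // Bounds checking is left out for brevity.
    public void Step(bool[,] blackSquares)
    {
        bool onBlack = blackSquares[X, Y];

        // Turn 90° right on white, 90° left on black (a left turn is three right turns).
        Facing = (Direction)(((int)Facing + (onBlack ? 3 : 1)) % 4);

        // Flip the colour of the square we're leaving.
        blackSquares[X, Y] = !onBlack;

        // Move forward one unit in the new direction.
        switch (Facing)
        {
            case Direction.North: Y--; break;
            case Direction.South: Y++; break;
            case Direction.East:  X++; break;
            case Direction.West:  X--; break;
        }
    }
}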

I’ve attached the project to this post.

MSBuild: CSS compression and Javascript minification for ASP.NET MVC

At work we use a continuous integration system to monitor our source repository and create builds.  It’s very handy in that we developers can simply code all day, commit at the end of the day, and then magically what we’ve written gets compiled, configured and deployed to our test and integration environment.  The deployment process is a simple file copy, but what’s smart is that only your standard aspx pages and DLLs get copied over – the source files don’t get copied, and web.config can also be changed on the fly to allow for different environments (e.g. different database connection strings).

So being the lazy coder that I am I thought it’d be nice if I tried to replicate this to some extent at home for my personal projects.  It’s also very convenient since I set up and wrote the build scripts we use at work – at least I won’t be fumbling too much in the dark this time round 😉

I have no intention of setting up a continuous integration server at home (we use CruiseControl.NET at the office), so a simple batch file should do the trick.  What I wanted with a simple double click was:

  1. In place compilation of all my projects within a solution
  2. Copy only what’s required  to my “deployment” folder (i.e. no source files, only aspx, ascx, asax, js, css etc)
  3. Minify and compress my javascript and css
  4. Change web.config so it knows which database to connect to (production or dev) – future requirement
  5. Automate running unit tests – future requirement

Reasons for wanting the above:

  1. Since I’m doing all this on my home/development PC there’s really no need for step 1.  When I run the website locally the files are compiled automatically; the automated build process, however, will be set to build in “Release” mode.
  2. I don’t like my source code hanging around on a production server.  It has no business being there even if IIS won’t serve it.
  3. Minification and compression make for happy end users
  4. Remember the last time you deployed a config file to production with the connection string to your local database?  Yeah I haven’t done that in a while 😉
  5. I use ReSharper to run my unit tests, but rarely (if ever) do I run the whole suite of tests at once.  At work I’ve set the build script to do this, but this will be just a simple script for me at home.  In future I will probably include this though.

Pre-Requisites

To achieve the above you will need to download and install the following:

  • The YUI Compressor MSBuild task
  • The MSBuild Community Tasks
  • The SDC Tasks library

(Note that apart from the YUI Compressor the other two are optional, but they make life easier.  The Community Tasks probably aren’t even needed in this simple example, but I’m sure I used them in the build script at work, which I’m liberally using as my base here.)

Put those DLLs in a place you’ll remember – we’ll need them later.

To achieve this automated build process we’ll need 2 files.  One will be a batch file calling MSBuild with various options and the other will be the proj file which contains all the build instructions.

BuildForProduction.cmd

@echo off
echo Creating a Production Build
path=%path%;C:\Windows\Microsoft.NET\Framework64\v3.5
MSBuild "C:\Users\Reza\Documents\Visual Studio 2008\Projects\BillManager\BuildScript.proj" /v:n /m /tv:3.5 /p:TargetFrameworkVersion=v3.5;Configuration=Release;Platform=AnyCPU /t:BuildCode /fl

A very simple batch file.  The first thing is to set the path so that we can find the MSBuild executable.  Note that I’m using 64-bit Windows 7 Professional, so if you’re running a different OS just remove “64” from the path.

Next is the command that does all the good stuff.  We call MSBuild with the options that we want to execute.  I won’t go into details – you can run MSBuild /? to look at all the available options.  All I’m telling MSBuild with that line is that I want it to log with normal verbosity to the screen, use as many processors as it can find (all 8, baby!), use version 3.5 of the .NET framework and build in “Release” mode.  Right at the end I also tell it which Target I want it to run (BuildCode).

The build script itself is rather large, so I’ll split it up to make better sense of it. The thing to remember is that this script is essentially just a project file – the same as when you create a new project in Visual Studio (csproj or vbproj) – so unfortunately we’re dealing with XML here (yuck).  I’ll just go over the highlights; both files are included at the end of this post.

BuildScript.proj

At the very start of the file we have the following line:

<Project DefaultTargets="BuildCode" InitialTargets="GetPaths;GetProjects" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

What this basically says is that if no Target is specified, use the BuildCode target.  The InitialTargets attribute, however, states that those two Targets need to be run before anything else.  As can be deduced from the target names, what I do there is get my pathing in order and create a collection of the projects that I want to compile.
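The GetPaths and GetProjects targets aren’t shown here; GetProjects essentially just builds up the @(CodeProjects) collection with an ItemGroup along these lines (the $(SolutionFolder) property is made up for the example):

<Target Name="GetProjects">
  <ItemGroup>
    <!-- Every project under the solution folder ends up in @(CodeProjects),
         which the BuildCode target below hands to the MSBuild task -->
    <CodeProjects Include="$(SolutionFolder)\**\*.csproj" />
  </ItemGroup>
</Target>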

<Target Name="BuildCode">    
    <!-- Build the assemblies -->
    <MSBuild Projects="@(CodeProjects)"
             Targets="$(BuildTargets)"
             BuildInParallel="true"
             Properties="Configuration=$(Configuration);Platform=$(Platform)">
      <Output TaskParameter="TargetOutputs"
              ItemName="CodeAssemblies"/>     
    </MSBuild>

    <!-- Add the compiled code assemblies to master list of all compiled assemblies for the build -->
    <ItemGroup>
      <CompiledAssemblies Include="@(CodeAssemblies)" />
    </ItemGroup>

    <Message Text="Completed BuildCode task"
            Importance="high"/>

    <CallTarget Targets="CopyToBuildFolder" />
    
  </Target>

Above is the Target that we invoked from the batch file. What it’s effectively stating is that I want to build all the projects defined in the @(CodeProjects) collection.  There’s more to it, but for this example we’re not using any of the other parameters.  The last command is where we tell the BuildCode target to call another target, CopyToBuildFolder.

<Target Name="CopyToBuildFolder">
    <!-- Copies all compiled code to the correct folder - ready for deployment -->
    
    <!-- We need to delete all files in this folder first to ensure a clean build-->
    <Folder.CleanFolder Path="$(BuildsFolder)" Force="True" />

    <!-- Copy main website files - This is ASP.NET MVC specific -->
    <CreateItem Include="$(ProductName).Web\**\Views\**\*.aspx;
                         $(ProductName).Web\**\Views\**\*.ascx;
                         $(ProductName).Web\**\Views\**\*.config;
                         $(ProductName).Web\*.config;
                         $(ProductName).Web\*.asax;
                         $(ProductName).Web\default.aspx;
                         $(ProductName).Web\**\bin\**\*.dll;
                         $(ProductName).Web\**\content\**\*.css;
                         $(ProductName).Web\**\content\**\*.jpg;
                         $(ProductName).Web\**\content\**\*.gif;
                         $(ProductName).Web\**\content\**\*.png;
                         $(ProductName).Web\**\scripts\960.gridder.js;
                         $(ProductName).Web\**\scripts\billmanager.js;
                         $(ProductName).Web\**\scripts\jquery.tablesorter.min.js;
                         $(ProductName).Web\**\scripts\jquery-1.3.2.min.js;
                         $(ProductName).Web\**\scripts\jquery-ui-1.7.2.custom.min.js;
                         $(ProductName).Web\**\scripts\json2.js;">
      <Output TaskParameter="Include" ItemName="FilesForWeb" />
    </CreateItem>

    <!-- copy the files to the production build area-->
    <Copy SourceFiles="@(FilesForWeb)"
          DestinationFolder="$(BuildsFolder)\%(RecursiveDir)" />

    <!-- Change the Configs -->
    <CallTarget Targets="UpdateProductionConfig" />
    
    <!-- Compress any JS and CSS -->
    <CallTarget Targets="CompressAndMinifyJavascriptAndCSS" />
    
  </Target>

In this Target I’m specifying all the files that I want to include for deployment.  The first thing to do is ensure that the destination folder is clean, so I run a task that deletes all the files in that folder.  This is where the SDC tasks come in handy: recursively deleting files in a folder with the standard MSBuild tasks is a pain, so by including that one DLL I have saved a lot of time.  The next step is to specify which files I want to copy across.  My solution contains one ASP.NET MVC project and 3 class library projects, so all I want to deploy is what’s in the Web folder.  Note the slightly odd syntax with the “**” in the paths.  What this tells MSBuild is that I want to select files recursively.  So the first line in the CreateItem task gets all the aspx files under the Views folder recursively.  If you’re familiar at all with ASP.NET MVC and its folder structure then you’ll know why I need all these files.

Once I’ve created the collection of files I want (called @(FilesForWeb)), I call the Copy task and use the %(RecursiveDir) metadata to ensure that the folder hierarchy is maintained.  Once that’s done I then call the next two Targets.

<Target Name="CompressAndMinifyJavascriptAndCSS">
    <!-- Compresses javascript and CSS. combines into one file -->
    <CreateItem Include="$(BuildsFolder)\Content\Site.css">
      <Output TaskParameter="Include" ItemName="CSSFiles"/>
    </CreateItem>

    <CreateItem Include="$(BuildsFolder)\scripts\billmanager.js;">
      <Output TaskParameter="Include" ItemName="JSFiles"/>
    </CreateItem>
    
    <Attrib Files="%(JSFiles.FullPath)" ReadOnly="false" />

    <CompressorTask
        CssFiles="@(CSSFiles)"
        DeleteCssFiles ="false"
        CssOutputFile="$(BuildsFolder)\Content\site.css"
        CssCompressionType="YuiStockCompression"
        JavaScriptFiles="@(JSFiles)"
        ObfuscateJavaScript="yes"
        PreserveAllSemicolons="false"
        DisableOptimizations="false"
        EncodingType="UTF8"
        DeleteJavaScriptFiles="false"
        LineBreakPosition="-1"
        JavaScriptOutputFile="$(BuildsFolder)\scripts\billmanager.js"
        LoggingType="ALittleBit"
        ThreadCulture="en-GB"></CompressorTask>

    <CallTarget Targets="CompressJson2" />
  </Target>

We’re almost done here… The purpose of this last Target is to compress and minify my application-specific JavaScript and CSS.  We could also combine all the JavaScript into a single file to be more efficient, but for now I’m only compressing individual files (jQuery and gridder are already minified). I think the Target is fairly self-explanatory: I’m compressing my application-specific CSS and JavaScript and then saving each file over itself.  What minification does is compact the JavaScript file as much as possible.  It’ll run through your script, replace long-winded variable names with much shorter ones and try not to break your logic.  It also removes any comments, so you can be as verbose as you like whilst developing, but when it gets pushed to a live website those funny remarks (ahem, deep insights) into your code are hidden from your public audience.  Note that I’m also setting the encoding to UTF8.  This is useful if you need to do localisation within your JavaScript.

Phew… long post.  Scripts for this can be found here.  Note that this is a sample only – use it at your own risk!