When the humble CSV file became King

[edit: Pressed the “Update” button when I meant to press the “Preview” one. Original post was undercooked.]

My current contract started as an exercise in porting an Excel/VBScript macro across to VB.NET. When discussing the contract with my not-yet boss, I was certain there’d be a speedup in simply getting the algorithm out of Excel and into .NET. To my delight, there was a factor of 10 speedup when I finally turned to the now-boss with news of the completed cutover.

We’ve been enhancing the core algorithm since then, and recently re-looked at some profiler results together to ensure the latest round of enhancements hadn’t injected anything too wildly slow into the mix. The core algorithm was still punching it like Chewey, but the writing of the data back to the Excel Spreadsheet via OLEDB continued to dominate the profiler results. As the elapsed time was all in Microsoft OLEDB library calls, there was very little I could directly do about it.

We got to talking about how our algorithm’s output data gets consumed, and put back on the table the idea of having output files in CSV format, as the tool used to render the results (Marxan) takes CSV files natively as input.

Now, I’m pretty mixed about it. Excel has been a pain to use as a data store for large datasets. There a number of interesting OLEDB hacks that I stumbled across to eventually get it humming. Now that it is working, there’s a certain appeal to a single file holding all results for a given model run. The CSV approach breaks the results into a number of files. Still, by dropping the OLEDB write overhead, we can punch it even harder.

I initially designed the model support framework with a single file for both input and output. As the boss was talking about CSV right at the beginning, I added the potential for a little flexibility there, but didn’t push hard. The initial design in terms of class diagrams looked like this:

EFlows Support Framework - Initial

EFlows Support Framework – Initial

I realised working on it that I really didn’t want to read a bunch of disparate CSV files in as input for the model, which seemed pointless given the input data was all together in that original Excel file, and being read in very quickly. I decided I needed to revisit the design and pull apart my model save/load support, allowing me to save to a different type of file to the the type I loaded from.

This triggered memories from my previous project at the university where my boss at the time was adamant that the results and input data couldn’t mix. Seemed reasonable at the time. They evolve at different rates, and for that reason alone, scream out for separate treatment. I eventually became sold on keeping the input and output datasets disctinct, so this design change is a tip of the hat to that previous boss.

Here’s the UML class-diagram of the final change:

EFlows Support Framwork - Revised

EFlows Support Framwork – Revised

Some notes on the design :

  • The basic idea is to have the models completely unaware of how their data is saved or loaded.  That’s all delegated to data adapters.  The newer design sees data adapters dedicated to saves, and separate ones dedicated to loads.
  • The original design went a little too far in terms of flexibility. I folded a bunch of noise into a single factory class in a spot where I just couldn’t see myself ever wanting the extra flexibility.
  • The proliferation of interfaces is a result of dependency inversion, making it easier to unit test classes higher up the stack by mocking lower-level classes.
  • I haven’t gone as far as dependency injection because it just doesn’t seem that popular in .NET.  As the code all goes back to the boss at the end of the contract, I thought I’d stop pushing my luck at this point.
  • In splitting the Excel save/load support, a bunch of shared methods came out in the wash, and ended up in a stateless support class they both reference as an aggregate.

And so now, I can either write to Excel if I want a very slow run, or to CSV files if I want to go at ludicrous speed. The boss gets his wish for something fast enough to generate huge reams of data in small windows of time.


2 responses to “When the humble CSV file became King

  1. Hi Linds,
    If you want to write to Excel but do so at a pretty fast speed, check out the POI (java) and NPOI (.Net port, but not as mature) projects.

    .Net is here: http://code.google.com/p/npoi/downloads/list

    I’ve used NPOI 1.2.5 version or thereabouts for reading Excel workbooks at blisteringly fast speed (no IPC). I’ve only written to Exel using POI, not NPOI. I believe the 2.0 beta is trying to add support for .xlsx formats, which the java POI already supports.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s