Tag Archives: VB.NET

MWCIT Source Code Released Under BSD-3 License

It’s been a ‘swimming through molasses’ project all things considered, but today marks a milestone where I can finally drop down a gear. After clearing things with my employer, I’ve just placed the source code for Release 1.0 of the ‘Murrumbidgee Wetlands Condition Indicator Tool’ under a BSD-3 license, and hosted it on GitHub.

The BSD-3 license was suggested by the University when I was sniffing around for options on how the project client could retain access to the source if they need it once my gig here is up. It turns out there are excellent reasons for BSD-3 (or more accurately, for NOT a Creative Commons license, which was my first choice), so I was more than happy to settle on the suggestion.

Now, before you get all excited, Release 1.0 of the MWCIT isn’t much more than a button launcher driven by config-files:

Release 1.0 of the MWCIT, doing what it mostly does.

Still, if you’re a developer with a passing interest in a simple (but not trivial) example of a home-grown Model-View-Presenter (MVP) implementation, or you’re interested in how to coax NSubstitute into firing events out of a mock object, there might be something in it for you.

SHA-256 for Code Verification in VB.NET

I’ve got a problem. I’m staring at possible security issues with the current software I’m writing. The nature of the software means it has a higher than normal chance of being exposed to things that aren’t healthy for it. It’s a .NET utility, and there doesn’t seem to be a whole lot said out there about hardening .NET assemblies.

Basically, I’ve seen the following three high-level variants talked about with respect to .NET applications:

  1. Generate and sign strongly-named assemblies.
  2. Obfuscate the code into a tangle that even die-hard disassembly won’t easily pick apart (via tools like Dotfuscator, CryptoObfuscator etc.).
  3. Sign the shipped assemblies with a certificate verified by a trusted Certificate Authority.

Now, strongly-named assemblies should only be considered if you’re going to register your assembly with the GAC, which you should only do if a number of other assemblies will rely on calling into the same (shipped-only-once) library. I’ve got no interest in going down the GAC registration path. In fact, I’m shipping all the assemblies that the bootstrap executable needs as resources within that executable, which conveniently side-steps the entire issue of whether I’m calling the assemblies I think I’m calling. Option one is thus a non-starter in this instance.
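
If you haven’t bumped into the embedded-assembly trick before, it hangs off the AppDomain’s AssemblyResolve event. Here’s a minimal, self-contained sketch of the idea; the module, namespace and resource names are invented for illustration, not lifted from my actual bootstrapper:

Imports System
Imports System.Reflection

Module EmbeddedAssemblyBootstrap

  Sub Main()
    ' Whenever the runtime can't find an assembly, offer up our embedded copy.
    AddHandler AppDomain.CurrentDomain.AssemblyResolve, AddressOf OnAssemblyResolve
    ' ... kick off the real application here ...
  End Sub

  Private Function OnAssemblyResolve(ByVal sender As Object,
                                     ByVal args As ResolveEventArgs) As Assembly
    ' Embedded resources follow a "<RootNamespace>.<FileName>" convention.
    Dim shortName As String = New AssemblyName(args.Name).Name
    Dim resourceName As String = "MyBootstrapper." & shortName & ".dll"

    Using stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resourceName)
      If stream Is Nothing Then Return Nothing ' not one of ours

      Dim rawAssembly(CInt(stream.Length) - 1) As Byte
      stream.Read(rawAssembly, 0, rawAssembly.Length)
      Return Assembly.Load(rawAssembly)
    End Using
  End Function

End Module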

I’ll grant that obfuscation of the code would surely interfere with black-hat activities around disassembly and reassembly with malicious additions. Then again, they’ve got to navigate a disassembly that implements Inversion of Control, where all the “meat” is in those assemblies packed up as resources, along with runtime behaviour that relies on reflection to assemble the final product. Frankly, the script-kiddie end of the black-hat community is already way, way out of its league in grokking what I’ve done, even with the original class names available.

So we’re clear, option 2 is still appealing, but I’m under the gun, and none of the three obfuscators I’ve tried seem to enjoy my trick of using reflection to completely side-step needing a dependency-injection framework. Old memories of obfuscation, Java and reflection suggest that I may be trying to push a large boulder uphill whilst wearing a wet-suit primed with butter. For now, I’ll leave a deep grokking of option 2 until I have some time to explore.

Now, option 3 involves time and money, both of which this project seems to be short on. Upon some deep navel-gazing over this, it turns out that I’m not that concerned about whether it’s me doing the verification of the deployed program or some trusted third party, at least not for the first delivery.

Because I’m not concerned (and my end-users are admittedly oblivious to software security issues), I can save the client a bunch of money, and me a bunch of time, by taking a leaf from the Linux world and shipping the final product with a checksum that allows the client to verify that what I delivered is what they have installed.

MD5 and SHA-1 are no longer hot given recent security discoveries. SHA-256 hasn’t been proven vulnerable, so I’ve settled on generating an SHA-256 hash that verifies the content of the one and only assembly I ship.

As my user-base will number in the handful, and its members admit openly that they are not technophiles, I’ve kept it simple for them. On the About screen, I have my code calculate the SHA-256 hash of itself and report it to the user. At least in terms of ensuring that they have an unmodified executable, they can check that hash against the one supplied in the install instructions.

The SHA-256 hash, calculated by the entry assembly on itself upon dialog display.

I had to do a bit of digging to piece together how to do that, but it turns out to be pretty straightforward:

Imports System.IO
Imports System.Reflection
Imports System.Security.Cryptography
Imports System.Text

Public Module CodeHardeningCollection

  Public Function RetrieveEntryAssemblySHA256AsString() As String
    Return ByteArrayToHumanReadableString(
      RetrieveSHA256([Assembly].GetEntryAssembly)
    )
  End Function

  Public Function RetrieveSHA256(ByRef thisAssembly As [Assembly]) As Byte()
    Dim hashValue() As Byte

    ' Hash the assembly's file content exactly as it sits on disk.
    ' Both the algorithm and the stream are disposed of by their Using blocks.
    Using mySHA256 As SHA256 = SHA256.Create()
      Using stream = File.OpenRead(thisAssembly.Location)
        hashValue = mySHA256.ComputeHash(stream)
      End Using
    End Using

    Return hashValue
  End Function

  Public Function ByteArrayToHumanReadableString(ByVal array() As Byte) As String
    Dim bytesAsString As New StringBuilder()
    For byteIndex As Integer = 0 To array.Length - 1
      bytesAsString.Append(
        String.Format(
          "{0:X2}", array(byteIndex)
        )
      )
      If byteIndex Mod 4 = 3 Then ' splits output into 4-char blocks to aid in chunking.
        bytesAsString.Append(" ")
      End If
    Next ' byteIndex
    Return bytesAsString.ToString()
  End Function 'ByteArrayToHumanReadableString

End Module
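
Wiring that into the About dialog is then a one-liner (the label name here is invented):

' Somewhere in the About dialog's load handler:
Sha256Label.Text = CodeHardeningCollection.RetrieveEntryAssemblySHA256AsString()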

And voilà: the executing assembly will report any alteration made to it through a change in its calculated SHA-256 hash.

However, it can still be spoofed by malicious code replacing the call to that “hash calculation code” with a straight write-out of the original “shipping” hash. That scenario, though, has us in the territory of an attack deliberately targeting this particular software with a clear intent to circumvent its security features. If an attacker is that determined, I’m working on the principle that they’d probably pick first from the plethora of houses out there with their front doors wide open.

An unspoofed SHA-256 hash can still be reported via a separate third-party tool such as OpenSSL (for instance, running ‘openssl dgst -sha256’ against the shipped executable), so I’ll be recommending that in the install instructions as the more secure approach to checking they have the right software.

Deriving the assembly SHA-256 hash via OpenSSL

Why not just recommend the third-party tool check right off the bat and call it a day? Because from what I know of their operation, it’s unlikely that they’d have the technical skill to go there. At least this gives them somewhere to start if we ever need to verify whether the software has been compromised.

There we go. Basic .NET assembly security on a tight time-budget.

Lurgy Lambda Leaves Lingering Lethargy

I’ve mentioned before that I have a distaste for lambda functions haven’t I? Allow me to now indulge in a little confirmation bias.

The scene:

I need every single ounce of speed in the heart of a really complex algorithm, and that complexity is now interfering with my grokking of a nasty but subtle bug. So, I’m embracing lambda functions to inject some extra semantics at little extra runtime cost.

The story:

For the entire morning, I’ve watched in frustration as the VB.NET code compiles and runs, but never returns a ‘True’ value from a boolean lambda function. Yet, if I lift the boolean expression back out of the function and have the expression evaluate directly, it works fine.

Want to know what I did? I got the defining ‘Func’ signature wrong. I wrote this:

  Dim IsLastSeasonInRun As Func(Of Boolean, Integer, Integer) =
  Function(year As Integer, season As Integer)
    Return (year = params.YearsPerRun - 1) AndAlso (season = params.NoOfSeasons - 1)
  End Function

Instead of this (note where the ‘Boolean’ sits on the top line of each):

  Dim IsLastSeasonInRun As Func(Of Integer, Integer, Boolean) =
  Function(year As Integer, season As Integer)
    Return (year = params.YearsPerRun - 1) AndAlso (season = params.NoOfSeasons - 1)
  End Function

Am I miffed? Yes. Some at myself. Some at MSDN’s coverage of lambdas.

For myself, I learned that I could have avoided this problem by setting ‘Option Strict On’ in the source. I’d toggled it off because I’m dealing with an OLE spreadsheet where every single number turns out to be a Double when it hits my code, and the compiler was forcing an immense amount of explicit typecasting “noise” between Doubles and Integers on me (let me worry about crossing the Integer/Double boundary, please).

Then again, MSDN’s own examples don’t explicitly tackle what I want to do with lambda functions. You’ve got to go deep to find the fiddly bits not mentioned in the MSDN lambda discussion that light the blowtorch. Specifically, it all hinges on that ‘Func’ type:

The Func type is also new with Visual Basic 2008; it is essentially a delegate that has the return type specified as the last generic parameter and allows up to four arguments to be supplied as the leading generic parameters (there are actually several Func delegates, each of which accepts a specific number of parameters).

Last generic parameter… LAST… It burns… it burns bad… because this is a strongly typed language, and a compile-time check on the return type isn’t done on lambda functions unless I flag ‘Option Strict On’ to catch my goof (invisible, because I was working on the assumption that if it compiles, the signature is good).
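
To see both behaviours in one place, here’s a minimal, self-contained sketch; the year/season constants are invented for illustration:

Option Strict Off ' the setting I was running with

Imports System

Module LambdaGotchaDemo

  Sub Main()
    ' Mis-ordered signature: the delegate is (Boolean, Integer) -> Integer,
    ' while the lambda is (Integer, Integer) -> Boolean. With Option Strict
    ' Off, VB quietly inserts narrowing conversions and this compiles.
    Dim IsLastSeasonInRun As Func(Of Boolean, Integer, Integer) =
      Function(year As Integer, season As Integer)
        Return (year = 9) AndAlso (season = 3)
      End Function

    ' The arguments get narrowed Integer -> Boolean -> Integer on the way
    ' through, so the lambda never sees the real year value and the result
    ' is always 0 (False).
    If IsLastSeasonInRun(9, 3) Then
      Console.WriteLine("Last season!")    ' never reached
    Else
      Console.WriteLine("Silently wrong.") ' always reached
    End If

    ' Flip the first line to 'Option Strict On' and the Dim above refuses
    ' to compile with a narrowing-conversion error, catching the goof
    ' immediately.
  End Sub

End Module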

So… remember .NET folks, if you’ve flagged ‘Option Strict’ off, lambda functions can:

  • compile when they shouldn’t
  • fail silently for you when they run

Nice. Like being affectionately licked to death by a minion of Cthulhu.

Advice On Being a Research Software Developer

My current software development contract draws to a close. It’s the second in a row of appointments that involve writing software for different researchers, and the third such contract in my career. Allow me a moment, dear reader, to reflect on the experience, and attempt to describe what it’s like being a research software developer.

Of course, we only ever truly know an experience by slipping on the boots and walking the path, but by viewing it through the filter of my previous corporate life, perhaps the contrast with more typical environments might be enough to give you, interested reader, some idea of what to expect.

Before getting into the meat of it though: I’ve blogged about this contract a few times before, so let’s take a brief recap.

This contract saw the boss asking for a simple deployment, prompting my search for a single .NET assembly to house all others in a release. It had me making serious attempts to test with NUnit and Excel spreadsheets in a way that wouldn’t drive me mad. It had me rejig my underlying framework to better support mashups in terms of reading and writing simulation data with Excel spreadsheets and CSV files.

The project saw me cry tears of pain thanks to a 64-bit Windows upgrade badly misbehaving with my reliance on 32-bit libraries, first in terms of the Visual Studio debugger, and then NUnit. There was lots more fun that you’ll probably never learn about because I never thought to blog about it as it happened.

Looking back at them now, I confess those posts do little to draw out the experience of research programming. The issues they cover are just as likely to pop up in a corporate or government software development role as in one devoted to writing software for researchers.

All that is, except for one very small point: that the blog posts exist at all. Which turns out to be a very big point. Let’s punctuate the point with its very own heading.

Expect Greater Degrees of Freedom and Fear

You’ll be working with researchers. The wonderful (and scary) thing about research is that you don’t really know where you’re going until you get there. Sure, you’ve got some ideas on what might be out there, but often, when you start, it’s not even remotely clear how to get from here to there.

Note that the structure you’re used to in corporate and government jobs probably doesn’t exist if you’re contributing to a science project. Unless it’s engineering flavoured, a researcher won’t have a clue what you’re on about if you attempt a deep discussion on requirements analysis, development methodology, testing frameworks, or other programming jargon.

The typical constraints that attempt to ensure quality are gone. Congratulations, you’re free. Free enough to hang yourself with all that extra rope. If you don’t start adding some of the more appropriate hard-earned skills back into the mix, it’s your funeral, cowboy/girl! You’re still being hired to produce quality, only now, you get to choose what that means.

Because researchers invest their time trying to learn something new, they actively grok and support the idea of trying new things, even if it may go nowhere.

In real terms then, yes, they’ll be relying on your expertise in programming (and dare I say it… computer science). However, more importantly, they’re counting on your capacity to jump in and just try stuff without necessarily knowing where it’s going, or even what you’re doing. Let’s ground this concept with a few examples:

  • The programming language used to construct a simulation prototype was inherently inefficient. A more efficient language with similar syntax to the original was chosen, allowing the researcher to retain an understanding of what the re-implementation does. The change was made in the hope that we’d get a goodly speed-up (Excel VB Script -> VB.NET).
    • A factor-of-10 speed-up (50 seconds down to 5 for a typical run) was far, far bigger than I’d hoped for, and satisfied the boss in terms of retaining an understanding of the code-base. Did I know initially? Not with much certainty. Just an educated guess that there’d be “some” speed-up, based on what I understood of the technology space.
  • Dead ends, and changes of mind that “sound” small but would kick the wind out of all the assumptions holding my code-base together, happened pretty regularly. The worst for me was a need to let a simulation “run” stretch over more than one year: the entire code-base worked on the assumption that a run and a year were synonymous. Frequent backtracks to get a good answer to this were the order of the day.
    • Know that if you fall in love with the code above the end-goal, you’re going to have a very bad time of it.
    • Pick a source-code control technology that actively supports (or at the very least, minimises interference with) your capacity to try something new, and to blow it all away if it turns out to be a flop. For me, Git is a gift from the coding-gods in terms of how easy it is to branch experiments off, merge the winners and torch the losers.
  • Simulating how species benefit from releases of water into a river connected to a wetland rich in wildlife gets complex fast. The day it occurred to me that grokking simulated annealing is the very shallow end of the pond was a very sobering day indeed. The day my boss made it clear that I’d be expected to invent new approaches to the scarier end of the code-base was very, very sobering.
    • For my computer-science friends: simulated annealing is a stochastic way of finding a “good enough” answer to combinatorial problems like the 0/1 knapsack without the exponential effort of a brute-force search for the best answer.
    • Don’t be afraid to engage in any and every activity you can think of to understand what’s there and what you can do with it. Re-implement it. Draw funky diagrams, be they mind-maps or UML. It doesn’t matter, so long as you’re continually leaning into the complexity (see later) until the penny finally drops.
  • Things occasionally get very sticky. I write down what I did to fix them in enough detail that if it ever happens again, I can redo the fix without re-investing the effort I spent initially. So long as I steer clear of the very hard boundary of discussing results before we have a publication, I’m free to do what I want with those notes. Hence the work blog entries.

Fear and Freedom, sitting in a tree. K-I-S-S-I-N-G! Yes, they’re into threesomes, but know that YOU are the optional party in any ménage à trois that eventuates. All I can offer you is some recycled wisdom:

Failing isn’t in the falling down, it’s in the staying down.

Lean Into the Complexity; Fear Doesn’t Banish It

If it’s really research, it’s cutting edge. If it’s in a domain we’ve been looking at for a while, it’s also guaranteed to be involved. There’s complexity here that you are unlikely to find in other software roles.

For me, that complexity is where my inner-critic starts up his magic “That’s it! THIS time you are going to choke!” chorus. It helps to enjoy this kind of fear, and recognise that this is you standing at the edge of what you’ve tested yourself against. If shouting “BOOGEY-WOOGEY” at the complexity-monster doesn’t see it bat even a single eye-stalk, how do we come to understand it?

Work with what you’ve got right now, and start yesterday. Research papers, prototypes, whatever you have on-hand. A key aim here is to build a vocabulary quickly on the research domain so you can start having meaningful exchanges with the expert(s) as soon as possible.

What I like doing right off the bat is pulling out the nearest mind-mapping tool (Freeplane is my current favourite) and going to town on whatever reading material I’ve been handed, pulling out what seems most important into a web of terms I can begin hanging my new knowledge on.

A Mind Map of the 50,000ft view of the project.

I also like starting a degree of simple refactoring work on the code. It’s a good excuse to get used to the code, and if it was cooked up by someone who doesn’t consider programming their primary passion, I can guarantee you’ll have a rich field of refactoring potential.

In my case, I had nearly all of the functionality sitting in a single method, 657 lines long. In terms of “Bad Code Smells”, we had ample examples of “Long Method”, “Duplicate Code” and “Dead Code”. Lots of simple refactoring wins just sitting there, waiting to be knocked off as you work on grokking what’s been written.

Exactly what you do to lean into the complexity comes down to personal preference. Abstracting away from any particular activity, I’m really engaged in remembering by “doing” something with the material. Just reading new, highly complex material doesn’t help me retain the knowledge.

It’s Agile, And Test-Driven, And Telling Doesn’t Help

You won’t be handed a pretty design document full of informative UML diagrams that allow you to chunk your understanding, or a cross-referenced requirements specification clearly identifying atomic, testable, unambiguous requirements. Nobody’s going to list acceptance criteria, that once implemented, guarantee that you’ve done the right thing.

I’m adamant now that when it comes to picking an appropriate development methodology for these kinds of projects, you’re facing something that needs to be very agile. But please, don’t take my word for it; read a fantastic discourse on the subject of choosing a development methodology to match the nature of the work.

Also… good luck selling your researcher(s) on daily SCRUM meetings, the necessity of TDD, continuous integration, pair programming, or whatever else you’re standing on your religious pulpit about.

Instead, consider leaning into regular coffee catch-ups, where you can air issues before they become obstacles. Mention that you built some tests to make sure you didn’t break anything important with the latest experiment. Tell them you’ll be ready to hand them something that runs (not necessarily behaving though) whenever they ask for it. Ask to sit with them and walk through the code together when it looks appropriate.

EFlows Class Diagram

Get the point? You’re still doing the entire agile “thing”, but you’ve just gone meta! Pull out your super-cape emblazoned with a mega-M, because you’re now doing it in a way that doesn’t bamboozle them with your impenetrable development jargon.

I’ve mentioned source-control that supports experimentation. Another thing that I lean on heavily is my test-suite.

Let me be blatantly honest with you here. I doubt that extreme test-driven development approaches that attempt to close in on 100% code coverage are a good match for this kind of work. A large suite of test cases does indeed calcify a code-base, making it a real effort to radically change an approach if you also have to revisit all the related tests.

As a consequence, I’m not hung up on extensive unit-testing in this kind of role. Stuff that needs to always work, which is typically foundational and unlikely to change anyway, gets unit-test love. I haven’t, however, bothered climbing very high up the stack with unit-tests that rely on mocks.

I pay far more attention to integration-testing “key methods”, reusing NUnit to do so. I allow the method of interest to call down into other real methods, limiting myself to mock data driving “live” code. The tests then interrogate how well that key method is playing ball with its neighbours.

These integration tests are not the Unit Tests you are looking for!

As these algorithms attempt to simulate nature, there’s a trend to inject a degree of randomness into the simulations, which makes things more difficult to test. Practice looking for loop and method invariants, and test on them. The number of times I’ve saved myself from a very bad ending through an integration test suddenly complaining that an invariant has just been violated is now too large to count.
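
As a concrete illustration of invariant-testing against randomness, here’s a minimal NUnit sketch; the allocator is a toy stand-in invented for illustration, not my actual simulation code:

Imports System
Imports NUnit.Framework

' A toy stand-in for a randomised simulation step.
Public Class ToyRandomAllocator
  Private ReadOnly generator As New Random(42) ' a fixed seed tames the randomness
  Public Property Available As Double = 100.0
  Public Property Allocated As Double = 0.0

  Public Sub RunOneSeason()
    ' Allocate a random share of whatever water remains.
    Allocated += generator.NextDouble() * (Available - Allocated)
  End Sub
End Class

<TestFixture()>
Public Class SeasonalRunIntegrationTests

  <Test()>
  Public Sub RunOneSeason_NeverAllocatesMoreThanAvailable()
    Dim allocator As New ToyRandomAllocator()

    For season As Integer = 0 To 11
      allocator.RunOneSeason()

      ' The invariant: however random the allocation decisions are,
      ' total allocation can never exceed the available budget.
      Assert.That(
        allocator.Allocated,
        [Is].LessThanOrEqualTo(allocator.Available),
        String.Format("Invariant violated in season {0}", season)
      )
    Next
  End Sub

End Class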

Don’t be afraid to ask for budget to be allocated to tool support. For the most part, I favour freeware over commercial competitors where I can, but sometimes you only get what you need with cash (this current project sees me spending time with the commercial products Enterprise Architect and the Red Gate profiler suite).

Channel your Inner Scribe

Meticulous notes through the course of the project are an absolute necessity for me in complex domains. Sometimes, sharing them matters, because I acknowledge that I am not even remotely a domain expert for the systems being simulated. What I’m advocating here is a project log that you can easily share with your collaborators. It’s got to be relatively free-form, allowing you to attach pictures, photos, videos, etc.

I humbly submit to you that circa 2013, a private WordPress blog, used as a daily journal, and shared only with your collaborators hits the sweet-spot in documenting and sharing your learning with collaborators who aren’t necessarily tech-savvy.

Don’t be afraid to take a photo of the scribble on the whiteboard and write your own notes on what it all means. Don’t be afraid to have your domain expert read those notes and throw peanuts at them (or you). Actually, being afraid is perfectly fine. Allowing it to then stop you from producing reliable code to base research outcomes on is what you’re aiming to push past.

Finally, if you don’t like the idea that you’ll be spending a goodly amount of time writing/coding, then staring at the writing/coding, then your navel, then the writing/coding again, then back at your navel until the magic “aha” moment drops, this may not be the software career for you.

Sell What You’ve Done, Because Nobody Else Will

You probably won’t be sitting beside other software developers, and suddenly launching into a full-on nerd-fest on that clever sub-linear loop you just concocted after your 4-hours-passing-in-a-moment with the digital-fairies. Neither will you have a sales-team handy to decode your technobabble.

Rattling on about O-notation, normalisation, refactoring and the fundamental limits of concurrency thanks to Amdahl’s law is not a way to win non-software friends and influence neuro-typical mindsets.

But… but… if you don’t point out your wins in a language that your audience gets, they’ll never know you had those wins. Drop that O-notation wall of jargon and say instead: “Yes, that thing that took a minute to run now takes five seconds.” Non-programmers understand concrete time measurements, and will sing the code’s praises when it’s framed that way.

Draw your UML diagrams, but don’t get all teary when they don’t understand that the little crow’s-foot means a one-to-many relationship, navigated via matching keys. They’ll appreciate that certain blobs in your bubble-mania are named things they recognise, and sometimes they might even comment that they got something out of seeing this visualisation of the code-base.

Sell only the bits in the diagram they recognise, and only to the degree that doesn’t make their eyes glaze over. The rest of your bubble-mania is for your own navel-gazing. Do, however, expect them to re-use your bubble-mania with audiences who also have no idea what the crow’s-foot means. Don’t let this disconnect bother you.

Do find ways to help those around you visualise what you do all day. It’s an absolute crying shame when the CEO of a software company is so misguided about software development that they publicly go on record calling their software developers “glorified administration staff”. Don’t laugh. It’s happened.

Help the researchers you’re working with to understand that this is more than “just typing” by handing them artifacts they might understand… You might try a pretty animation of how you changed the software source-code over time. At the very least, they might ask for a copy for the next dance-rave they’re hosting.

So there you have it. My advice on being a research software developer, condensed into five bullet-points:

  • Expect greater degrees of freedom and fear
  • Lean into the complexity; fear doesn’t banish it
  • It’s agile, and test-driven, and telling doesn’t help
  • Channel your inner scribe
  • Sell what you’ve done because nobody else will

Good luck! May your ground-shaking “aha” moments be frequent and mind-blowing!

The End of the TRaCK?

In September 2010, I started a software development contract with the Australian Rivers Institute. That first contract, being an integration sub-project of the TRaCK program, is well and truly finished from my end, except for an outstanding journal manuscript. Part of the reviewer feedback quite rightly argued that the submission would have been much stronger with supporting publications that currently haven’t been submitted.

I’ve been navel-gazing over whether to keep pushing for publication of the manuscript or to let it slide. I realised at the end of the PhD that I’m just not that driven to step onto the ‘publish or perish’ treadmill. That our key users ended up getting value out of the software we built scratches my professional itch sufficiently. Still, the manuscript is written, and it would be nice to have something in a journal to do the research and modelling we conducted on the Daly River in the Northern Territory some justice.

The MSE Data Viewer showing the Daly River

This morning I got feedback from my old boss on the state of the other unpublished papers. It’ll be a while yet, as he’s currently still recovering from the project by sailing around the world. I’m pleased for him, and glad he’s taking all the time he needs to get over the last few months of the project, where he was devoting every waking moment to it.

So, now I’m left in a zone where I’d like to at least wrap things up and leave a Web-based paper-trail for anyone interested in my own contribution to TRaCK, and to the Management Strategy Evaluation software (MSE for short) in particular.

Rich Graphing via the MSE

Firstly, about mid-way through the project I gave an interview on my role in TRaCK, describing the MSE to other researchers within the institute. Jon, the interviewer, and I spent quite some time ironing out technical jargon from two very different fields of expertise. For any software engineers reading the interview, be warned that I eventually called it good enough, knowing that it still sounded a little odd, but that the primary audience wouldn’t appreciate any further corrections to appease my pedantry.

The Layered Architecture of the MSE

Secondly, here is a thorough final report detailing the MSE software and how it was used in the Daly River catchment in the Northern Territory. It was produced for a target audience of researchers and natural resource managers. As proof that I’m capable of more than just developing software, chapters 2, 3 and Appendix A are my contributions to the report.

Finally, the boss wanted me to cook up an ‘under the hood’ document, describing the software in sufficient detail that a software developer could read the report, pick up the software source, and begin again as quickly as possible. For those wondering what kind of document I’d write on the architecture and design of software when given free rein, this is it.

So there you have it. A few artefacts I can point to after the completion of the project to establish some street-cred with respect to my time on the TRaCK project and the MSE application.

CamelCase to Readable Text in VB.NET

I recently had a need to take Enumeration instances from within .NET and pretty them up for human consumption. The heart of the problem involved how to take CamelCase text and add spaces between each word-break denoted by a new upper-case letter.

I’m mostly following Microsoft’s internal coding guidelines for my naming conventions. Enumerations should thus be mostly PascalCase/Upper Camel Case, but I’m not above grabbing external libraries and gluing them into the utility, risking oddities that don’t match the guideline.

Given that I can’t predict ahead of time how well an enumeration sticks to a strict PascalCase naming scheme, I wanted a regular expression that would cater for a wider range of strings than ‘strict’ PascalCase. I learnt that a programmer can drive themselves crazy catering for the rich range of possible encodings, so I decided to draw the line at strict camelCase along with PascalCase, ignoring non-word characters for the time being.

Now, all languages have their little quirks in how they implement regular expressions, and .NET is no exception. Thankfully, after a little digging around, I discovered a good launch-point on StackExchange, based on somebody wanting to do a very similar thing in PHP. Very little messing around with my favourite expression’s syntax was required, which is always a pleasant thing. The final expression settled on was:

"(?<=[a-z])(?=[A-Z])"

Interpret the expression thusly:

Look for a pattern that forms the boundary between two characters of valid CamelCase. On the left-hand side, seek a lower-case character (a-z); on the right-hand side, an upper-case character (A-Z). The left side uses what’s called a zero-width positive look-behind assertion to identify the lower-case character without moving the pattern matcher along the string. The right side uses a zero-width positive look-ahead assertion to identify the spot where the new upper-case character sits without consuming it in a pattern match. The split is made so that the upper-case character starts a new string.

This blog post is essentially me saying to myself “Ok.. I can see that it works… but WHY does it work?” and deciding to scare whoever else out there likes the occasional good Regular Expression brain-twist.

A chunk of VB.NET code that makes PascalCase/CamelCase text into something more easily consumable by a human is below. The regular expression is created ahead of time outside the function for runtime efficiency.


Imports System.Text.RegularExpressions

Private CamelCaseRegex As New Regex("(?<=[a-z])(?=[A-Z])")

Public Function CamelCaseToHumanReadableString(
                  ByRef inputString As String) As String

  Return String.Join(
    " ",
    CamelCaseRegex.Split(inputString)
  )

End Function
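
For instance, feeding it an enumeration value (the enumeration here is invented for illustration):

' A hypothetical enumeration, named purely for illustration:
Public Enum WetlandCondition
  SeverelyDegraded
  ModeratelyDegraded
  NearPristine
End Enum

' Prettying a value up for the humans:
Dim label As String =
  CamelCaseToHumanReadableString(WetlandCondition.NearPristine.ToString())
' label now holds "Near Pristine"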

CamelCase for your human consumers long and prosper!

When the humble CSV file became King

[edit: Pressed the “Update” button when I meant to press the “Preview” one. Original post was undercooked.]

My current contract started as an exercise in porting an Excel/VBScript macro across to VB.NET. When discussing the contract with my not-yet boss, I was certain there’d be a speedup in simply getting the algorithm out of Excel and into .NET. To my delight, there was a factor of 10 speedup when I finally turned to the now-boss with news of the completed cutover.

We’ve been enhancing the core algorithm since then, and recently re-examined some profiler results together to ensure the latest round of enhancements hadn’t injected anything too wildly slow into the mix. The core algorithm was still punching it like Chewie, but the writing of data back to the Excel spreadsheet via OLEDB continued to dominate the profiler results. As the elapsed time was all in Microsoft OLEDB library calls, there was very little I could directly do about it.

We got to talking about how our algorithm’s output data gets consumed, and put back on the table the idea of having output files in CSV format, as the tool used to render the results (Marxan) takes CSV files natively as input.

Now, I’m pretty mixed about it. Excel has been a pain to use as a data store for large datasets. There are a number of interesting OLEDB hacks that I stumbled across to eventually get it humming, and now that it is working, there’s a certain appeal to a single file holding all results for a given model run, whereas the CSV approach breaks the results into a number of files. Still, by dropping the OLEDB write overhead, we can punch it even harder.

I initially designed the model support framework with a single file for both input and output. As the boss was talking about CSV right at the beginning, I added the potential for a little flexibility there, but didn’t push hard. The initial design in terms of class diagrams looked like this:

EFlows Support Framework – Initial

I realised while working on it that I really didn’t want to read a bunch of disparate CSV files in as input for the model, which seemed pointless given the input data was all together in that original Excel file, and being read in very quickly. I decided I needed to revisit the design and pull apart my model save/load support, allowing me to save to a different type of file to the type I loaded from.

This triggered memories of my previous project at the university, where my boss at the time was adamant that the results and input data shouldn’t mix. It seemed reasonable at the time: they evolve at different rates, and for that reason alone scream out for separate treatment. I eventually became sold on keeping the input and output datasets distinct, so this design change is a tip of the hat to that previous boss.

Here’s the UML class-diagram of the final change:

EFlows Support Framework – Revised

Some notes on the design:

  • The basic idea is to have the models completely unaware of how their data is saved or loaded. That’s all delegated to data adapters. The newer design sees data adapters dedicated to saves, and separate ones dedicated to loads (a minimal sketch of the split follows these notes).
  • The original design went a little too far in terms of flexibility. I folded a bunch of noise into a single factory class in a spot where I just couldn’t see myself ever wanting the extra flexibility.
  • The proliferation of interfaces is a result of dependency inversion, making it easier to unit test classes higher up the stack by mocking lower-level classes.
  • I haven’t gone as far as dependency injection because it just doesn’t seem that popular in .NET.  As the code all goes back to the boss at the end of the contract, I thought I’d stop pushing my luck at this point.
  • In splitting the Excel save/load support, a bunch of shared methods came out in the wash, and ended up in a stateless support class they both reference as an aggregate.
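
To make that split concrete, here’s a minimal sketch of its shape; every name below is an invented stand-in for the real framework classes:

' Stand-in for the simulation state the adapters shuttle around:
Public Class ModelData
End Class

' Loading and saving no longer share an adapter:
Public Interface IModelLoadAdapter
  Function LoadModel(ByVal sourcePath As String) As ModelData
End Interface

Public Interface IModelSaveAdapter
  Sub SaveModel(ByVal model As ModelData, ByVal destinationPath As String)
End Interface

' ...so a run can read its input from the original Excel file and write its
' results out as CSV, with the model none the wiser about either format.
Public Class CsvSaveAdapter
  Implements IModelSaveAdapter

  Public Sub SaveModel(ByVal model As ModelData,
                       ByVal destinationPath As String) _
      Implements IModelSaveAdapter.SaveModel
    ' Each result set gets written to its own CSV file here.
  End Sub
End Class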

And so now, I can either write to Excel if I want a very slow run, or to CSV files if I want to go at ludicrous speed. The boss gets his wish for something fast enough to generate huge reams of data in small windows of time.