Curious Breaches of the Rule of Least Surprise in R

R is a language for statistical computing. For the past few months, I’ve been tasked with cutting a prototype modelling tool across from R to .NET, much as I did on a previous project here. Though I’ve skirted around the edges of R on occasion with my succession of contracts, this project is the first time I’ve had to really go deep.

There was pain, and its source was primarily my assumption that R was ‘just another language’, similar enough to the set of languages I’m familiar with that I’d be fine.

It’s not, and I wasn’t.

If you’re experienced with other programming languages and are coming to R for the first time, here be dragons.  If you’re considering going to another language from R, for the love of efficient computation, leave your dragons there with R.

What follows is a number of insights I’ve gained in contrasting R against the more ‘traditional’ languages out there. They’re essentially breaches of the rule of least surprise, where ‘no surprise’ is a baseline of ‘I can expect this assumption to hold regardless of programming language’.

Profilers Reporting Line Numbers Deserve Your Love

The project was all about efficiency. Early in the process, I put the original R scripts through the language-supplied profiler to get a feel for the bottlenecks present. Here’s a nice, snappy summary of R profiling via examples.

You’ll note by default that R profiling is at the function level, not the line level.

Here, then, reader, is my first breach of the rule of least surprise. Without identifying the line of source that IS the bottleneck, I could end up wasting time fine-tuning something that isn’t. All it takes is the profiler identifying a function/command that’s called more than once in a run, and bam: ambiguity over where the bottleneck resides, followed by the potential for wasted investment in tuning the wrong thing. It’s been described as crapshooting, which seems appropriate.

I looked a little deeper and was greatly relieved to discover that R does line-number profiling. Couldn’t make it work. Felt stupid. Re-read that command summary more closely, and discovered the following relevant note,  triggering my second breach of the rule of least surprise:

The profiler interrupts R asynchronously, and it cannot allocate memory to store results as it runs. This affects line profiling, which needs to store an unknown number of file pathnames.

So… no line profiling for me, because the script is too memory-intensive and long-running. Umm… which makes the script a prime candidate for wanting detailed profile information in the first place… systemic loop of collapse… now in play.
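For concreteness, here’s a minimal sketch of the two profiling modes. The file name `prof.out` and the `slow_bit` workload are toy stand-ins for the real script, not anything from the original project:

```r
## A toy stand-in for the real workload.
slow_bit <- function(n) { x <- 0; for (i in seq_len(n)) x <- x + sqrt(i); x }

## Default mode: function-level granularity only.
Rprof("prof.out")
slow_bit(5e6)
Rprof(NULL)
head(summaryRprof("prof.out")$by.total)   # totals per function, no line numbers

## Line-level mode must be requested explicitly; your code also needs source
## references (e.g. loaded via source(file, keep.source = TRUE)) for samples
## to be attributed to lines.
Rprof("prof.out", line.profiling = TRUE)
slow_bit(5e6)
Rprof(NULL)
s <- summaryRprof("prof.out", lines = "show")
```

It’s that second mode that falls over on the memory-hungry, long-running scripts, per the documentation note quoted above.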

My ‘away’ R profiler results look like this, and I’ve never missed line-numbers on profiler results so much as I do right now:

                   total.time total.pct self.time self.pct
"unix.time"            400.20    100.00      0.00     0.00
"Optimize"             399.78     99.90      0.02     0.00
"compiledOptimise"     361.84     90.41     72.64    18.15
"which"                227.58     56.87    227.58    56.87
"apply"                 92.84     23.20     34.20     8.55
"species.count"         49.12     12.27      1.78     0.44

After re-implementation, my ‘home’ .NET profiler results look like this:

Performance results of the MarxACT model from the ANTS .NET profiler

So… profiling and sufficiently scary R scripts… I find myself furiously clicking my heels and muttering “There’s no place like home! There’s no place like home!”. Love the line numbers in your profiler results, peeps. You’ll miss them when they’re gone.

Forget Transcription, Embrace Re-Implementation

My biggest ‘wrong call’ on this project was trying to just cut the R code across to its equivalent in .NET. Now, the author of the original R scripts had done his homework, and was careful to avoid for-loops, which in R have been described as the third circle of Dante’s R hell, I kid you not. However, the author still went big on ‘apply’ calls, which, it seems, are just as… cough… efficient.
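To see why ‘apply’ is no silver bullet, here’s a rough, contrived timing comparison: `apply()` still invokes your R function once per slice, much like a loop does, whereas a genuinely vectorised primitive such as `rowSums()` does its stepping in compiled code. (The matrix here is made-up test data.)

```r
m <- matrix(runif(1e6), nrow = 1e4)       # 10,000 rows x 100 columns

system.time(a <- apply(m, 1, sum))        # apply: one R-level call per row
system.time({                             # explicit for-loop over rows
  b <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) b[i] <- sum(m[i, ])
})
system.time(r <- rowSums(m))              # vectorised primitive

stopifnot(all.equal(a, b), all.equal(a, r))   # all three agree on the answer
```

On my reading, the apply and for-loop timings land in the same ballpark, with the vectorised primitive well ahead of both.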

Now, if you trawl the Web trying to get a feel for why for-loops are so expensive in R, you might also come away disappointed. Even that R Inferno document linked above pulls out a typical line in R discussion forums that boils down to “It’s not C… write loops like C and you’re gonna have a bad time”.

A common story I’ve seen too many times is that because R is interpreted, its for-loops are necessarily slower. OK, there’s a grain of truth there, simply because interpreted languages are by their nature slower than compiled ones. But if you’ve ever profiled Perl or Python, I can guarantee it’s not an iteration construct that jumps out as your bottleneck.

But look, don’t take my word for it… I’ll wait while you try to find other interpreted languages outside R that have speed issues tied directly to their iteration loop language constructs. If you find one, I’d appreciate a heads-up. I don’t want to be caught off-guard like this again.

Slow iteration loops are my third breach of the rule of least surprise with R. Programming is, in essence, automation, and automation involves repeatability. If the core construct for repetition, one mimicked pretty much verbatim between languages, turns out to be inefficient for one of them, that’s quite a gob-dropper for me.

So, slow for-loops meant the original R script rightly went with a vectorisation approach, which you can interpret as “apply this function to each element of this vector without the effort of stepping through each cell”. As there is typically no base language support for this kind of thing outside of R (and, I guess, whatever language heritage R spawns from), I’d have to translate them back into loop equivalents.
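As a concrete (if contrived) illustration of what vectorisation buys you in R, and of exactly the kind of expression I’d have to unpick into loops on the .NET side:

```r
x <- runif(1e6)   # made-up test data

## Vectorised: one expression over the whole vector; the per-cell stepping
## happens in compiled code underneath.
v <- sqrt(x) * 2

## The loop equivalent: same result, but every iteration is interpreted R.
l <- numeric(length(x))
for (i in seq_along(x)) l[i] <- sqrt(x[i]) * 2

stopifnot(all.equal(v, l))   # identical answers, very different cost
```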

Transcription turned out to be really tough work.  As R language constructs tend to work at the level of a blob of data cells (vector, array, data frame, etc), one command can do a LOT of work compared to a more traditional language.

A surprising drawback for me then became the gap between what the code was doing versus how well it described why it was being done. With each command taking large leaps of data manipulation, the intent of the R code was often opaque to me as an unfamiliar reader.

Also, very small visual differences, like the placement of a comma, can mean the difference between splitting a 2D array into column slices of vectors or row slices. I often found myself missing vast differences in semantics because the syntax expressing those differences was nearly identical.
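A contrived example of the near-identical syntax I kept tripping over, both in matrix indexing and in `apply()`’s MARGIN argument:

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column-wise

m[1, ]   # comma after the index: row 1 as a vector     -> 1 3 5
m[, 1]   # comma before the index: column 1 as a vector -> 1 2

apply(m, 1, sum)   # MARGIN = 1: sum each row    -> 9 12
apply(m, 2, sum)   # MARGIN = 2: sum each column -> 3 7 11
```

One character’s difference, and you’re slicing along the other axis entirely.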

MarxACT R Function-Call Stack

This code-base is spaghetti! Better draw a diagram! This diagram is spaghetti!

By the end of cutting the code across, I had something that I had absolutely zero confidence was a faithful transcription of the R scripts. Debugging was taking forever, because a faithful transcription meant I was doing an immense amount of loop work thanks to translating all that vectorisation directly.

And there’s my fourth breach of the rule of least surprise. Language pressure brought to bear by how R wants things done via vectorisation results in a wildly different approach than if it had been done in a more C-like language. And by C-like, I mean most modern 3GL languages. By tempting me to work with entire vectors as a data-point, as opposed to, say, individual cells in a vector as a data-point, it wasn’t immediately obvious how to avoid excessive work when the original R code offered to do all that work “for free” syntactically.

Eventually, after spinning my wheels for way too long, the author of the R script and I sat down and we went back to basics.  I got him to describe the algorithm to me at a high level, deliberately ignoring implementation specific detail.  Off that specification, I got a working .NET version pretty quickly, and could begin making it more efficient.

Unlike previous jobs where I could cut an algorithm across from one language to another without needing to deeply ‘grok’ its fine points, R is just too different for that to be a sane approach if I didn’t originally author the script. I got further, faster, with that ‘language-independent specification’, relegating the R scripts to verification duty only: checking that results match between implementations.

As mentioned earlier, the point of the cutover exercise was to get some serious speedup. Have I mentioned that a single run was taking four hours on my machine? Sensitivity analysis on the R scripts meant that jobs were being submitted over the weekend to the University cloud environment, consuming 3000+ cumulative hours in processing those scripts.

An informal look at the asymptotic bounds of the R script showed me that the core evaluation routine was a fractional polynomial (a good deal closer to the quadratic end than the linear end) with respect to the number of sites modelled. The final result was (very) sub-linear with respect to the number of species (far fewer species than sites), achieved basically by not working at the level of a vector, but by considering how the overall value changes between loop iterations, and applying a delta update to cached summary data instead.
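The delta-update idea itself is language-agnostic. Here’s a hypothetical R sketch of it, with a running sum standing in for the real cached summary data (none of these names come from the actual model):

```r
values <- runif(1e5)   # made-up per-site values

## Naive: recompute the summary from scratch after every change - O(n).
## Delta update: adjust the cached value by what actually changed - O(1).
cached_total <- sum(values)

update_site <- function(total, old, new) total - old + new   # the delta step

old <- values[42]
values[42] <- 0.5
cached_total <- update_site(cached_total, old, 0.5)

stopifnot(all.equal(cached_total, sum(values)))   # matches a full recompute
```

Per loop iteration, that swaps a full pass over the data for constant-time bookkeeping, which is where the drop in asymptotic complexity comes from.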

Those alterations worked in the .NET code to generate the same result in  three minutes that the R script needed four hours to do. Sensitivity analysis is now a tractable problem again.

In Summary

In transporting non-trivial algorithms between R and .NET, I hit a few gotchas that I’ve not seen in moving algorithms between languages before:

  • Profiling in R is at the function level of granularity by default, not line-level as per other languages.
  • Though R offers line-level profiling, you’re unlikely to get it to work for those really big R jobs where line-level profiling would deliver most benefit.
  • Iterative loops in R are really slow, and considered poor form by the R community if looking for performance.
  • The vectorisation alternative to iterative loops means an approach to algorithm design in R isn’t easily transposed to other languages, and may just bring about poor asymptotic performance if taking the path of least resistance offered by R.

Now, don’t get me wrong here. I have a whole new appreciation for the power of R. Very few lines of code can do an impressive amount of data manipulation. When coupled with dedicated, powerful math libraries, I now get the attraction to R for its prototyping potential. I also appreciate now why it’s a good idea to step away from R once you’re sure you’ve got a working prototype and now want something to really pump out those results.

I note that for-loops were essential for ensuring that the .NET code would do in minutes what R was doing in hours. I grant that the original author was unlikely to see how this could be done if told to stay away from for-loops in favour of vectorisation. I suspect that even though for-loops are expensive in R, that drop in asymptotic complexity would still trump the heavyweight vectorisation approach originally taken. I guess that’s pretty moot though, given that the .NET equivalent would still out-blaze the same ‘loop’ approach in R.

They tell me the age of cloud computing makes CPU resource limitations a thing of the past. I find their mental masturbation amusingly naive.

