Tag Archives: R

Curious Breaches of the Rule of Least Surprise in R

R is a language for statistical computing. For the past few months, I’ve been tasked with porting a prototype modelling tool from R to a .NET framework that did something similar on a previous project here. Though I’ve skirted around the edges of R on occasion across my succession of contracts, this project is the first time I’ve had to really go deep.

There was pain, and its source was primarily my assumption that R was ‘just another language’, similar enough to the set of languages I’m familiar with that I’d be fine.

It’s not, and I wasn’t.

If you’re experienced with other programming languages and are coming to R for the first time, here be dragons.  If you’re considering going to another language from R, for the love of efficient computation, leave your dragons there with R.

What follows is a number of insights I’ve gained in contrasting R against the more ‘traditional’ languages out there. They’re essentially breaches of the rule of least surprise, where the baseline for ‘no surprise’ is ‘I can expect this assumption to hold regardless of the programming language’.



One N-dimensional array, Two Vectors

Backstory: My supervisor tends to capture the complex relationships that drive environmental models in .NET DataTable instances. With the right GUI toolkit, it’s (now) trivial to stick the DataTable into a GUI somewhere and allow a calibrator to tweak parameters until the model starts acting like the real world.

However, DataTables limit us to describing only two-dimensional relationships. My supervisor has semi-seriously threatened that we might need to store relationships of arbitrary dimensionality. We’d need some kind of “generic” structure that could store and retrieve the data without being limited to any particular number of dimensions. What he’s asked of me isn’t as unwholesome as the title of this blog post insinuates, but the man is forcing me to use grey matter, damn it!

In chewing over just how exotic this could all get, I was pointed at the R project; our discussion very much reminded my supervisor of things he’d seen done with R. So I first pored over some user documentation and reverse-engineered a class diagram of R’s data structures (don’t hold me to perfect accuracy; disambiguating natural language can be tricky at the best of times, and I wasn’t looking to supply a definitive model):

(Figure: a UML class diagram of the data structures in the R language.)

This exercise really delivered for me. Arrays of any dimension in R appear to be defined as two vectors: one containing the data, the other containing dimension information (a sequence of positive integers). With only these “data” and “dimension” vectors, there is enough information to make the “data” vector act like an array of however many dimensions are needed. What was missing in my understanding was how to map from array coordinates and dimensions (an n-tuple, where n is the number of dimensions) to a single coordinate identifying the equivalent cell in the “data” vector.
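
You can see this two-vector arrangement directly from an R session. Here’s a rough sketch (the variable names are mine, and it only illustrates the idea, not R’s internals):

a <- array(1:24, dim = c(2, 3, 4))   # a 3-dimensional array
dim(a)                               # the "dimension" vector: 2 3 4
as.vector(a)                         # the "data" vector: 1, 2, ..., 24

b <- a
dim(b) <- NULL                       # strip the dimension vector...
is.array(b)                          # FALSE: just a flat data vector now
dim(b) <- c(4, 6)                    # ...attach a compatible one...
b[2, 3]                              # ...and the same data now acts like a 4 x 6 array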

A little more formally, if I have an n-dimensional array A_n, I can also have a value-for-value equivalent vector V_n, so when I reference a value in A_n via, say, a(c_1, …, c_n), I can also reference the same value in V_n via, say, v(c_V). I essentially need a mapping function, taking the array coordinates (c_1, …, c_n) and dimension details (d_1, …, d_n) as input, and returning the vector coordinate (c_V) as output.

Getting the easy stuff out of the way first, I needed a vector the same size as an array of n dimensions. The size of each dimension matters; I’ll express those sizes as d_1, …, d_n, each drawn from the set of positive integers (each dimension must be an integer > 0). The length of a “data” vector holding the same amount of information as the array is obtained by simply multiplying the dimension sizes together (apologies to the maths nerds in advance; I took a whirlwind refresher course on maths constructs this morning after a hiatus of at least 10 years, so it may not be entirely “legit”):

A(d_1, \ldots , d_n) = V(\prod\limits_{i=1}^n d_i)
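
As a quick sanity check in R (not part of the derivation, just reassurance):

a <- array(0, dim = c(2, 3, 4))
length(as.vector(a)) == prod(dim(a))   # TRUE: 2 * 3 * 4 = 24 cells either way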

Next, I played with 2-dimensional arrays until I had a formula that would convert coordinates a(c_1, c_2) into c_V. Turns out it looks like this (so long as I index every coordinate from 0, which is fine given how programming languages typically allow you to index their abstract data types):

c_V = c_1 + c_2 \times d_1
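
Handily, R happens to flatten its arrays in the same order (first coordinate varying fastest), so I could check the formula against a real matrix. The 0-based c_1, c_2 and c_V below each pick up a + 1 because R indexes from 1; the variable names are mine:

d1 <- 3; d2 <- 4
m <- matrix(seq_len(d1 * d2), nrow = d1, ncol = d2)
v <- as.vector(m)                # the flattened "data" vector

c1 <- 2; c2 <- 3                 # 0-based array coordinates
cv <- c1 + c2 * d1               # 0-based vector coordinate from the formula
m[c1 + 1, c2 + 1] == v[cv + 1]   # TRUE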

Makes sense. I can visualise the array flattening out into a vector as a string of sausages. Initially the sausages are laid out such that each row of the array is one sausage: each sausage is d_1 cells long, and there are d_2 sausages. Extra dimensions got much harder to visualise, so I just spent time punching numbers around until I ended up with a formula that works in the general case:

c_V = c_1 + \sum\limits_{i = 2}^n  (c_i \times \prod\limits_{j = 1}^{i - 1}  d_j)
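
The same kind of check in R, this time with a 3-dimensional array (again, the maths is 0-based, so everything gains a + 1 on the R side; the names are mine):

dims <- c(2, 3, 4)
a <- array(seq_len(prod(dims)), dim = dims)
v <- as.vector(a)

coords <- c(1, 2, 3)             # 0-based coordinates (c_1, c_2, c_3)
cv <- coords[1] +
  sum(sapply(2:length(dims), function(i) coords[i] * prod(dims[1:(i - 1)])))

a[2, 3, 4] == v[cv + 1]          # TRUE: a[2, 3, 4] is coords + 1 in R's indexing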

We can re-express that formula in some pseudo-code:

' arrayCoordinate holds the 0-based coordinates (c1, ..., cn)
' dimensions holds the dimension sizes (d1, ..., dn)
vectorCoordinate = 0
dimensionOffset = 1
for index = 0 to (dimensions.Count - 1)
  if index = 0 then
    ' the first coordinate contributes itself
    vectorCoordinate = arrayCoordinate(index)
  else
    ' running product of the dimension sizes before this one
    dimensionOffset = dimensionOffset * dimensions(index - 1)
    vectorCoordinate = vectorCoordinate + arrayCoordinate(index) * dimensionOffset
  end if
end for
' vectorCoordinate now holds the 0-based vector coordinate cv
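
If you’d rather see that as something runnable, here’s a direct transcription into R (the function name is mine, and it keeps the 0-based coordinates, so it’s for illustration rather than everyday R indexing):

vector_coordinate <- function(array_coordinate, dimensions) {
  vc <- array_coordinate[1]
  offset <- 1
  for (i in seq_along(dimensions)[-1]) {
    offset <- offset * dimensions[i - 1]
    vc <- vc + array_coordinate[i] * offset
  }
  vc
}

vector_coordinate(c(1, 2, 3), c(2, 3, 4))   # 23, or position 24 once R's 1-based indexing is applied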

And done! I now get how I can make two vectors act as if they were really an n-dimensional array.  You know, I bet this is already all well documented somewhere.  I wonder if Knuth has already covered this ground?

One final thing. The formulas were made pretty via WordPress.com’s native support for LaTeX.