August 30, 2010

The one tool I couldn’t live without

Posted in Uncategorized at 8:37 pm by mariawolters

As part of the Scientiae carnival, Karina asks what tools we couldn’t live without. Although I’m an experimentalist, I don’t work with glass or microbes like most of the science blogosphere I read, but with silicon and people (a fanciful way of describing human-computer interaction).

My favourite tool is not an experimental platform – as a respondent to one of our project’s questionnaires sagely noted, “Technology breaks!” Personally, I am a great fan of good old-fashioned pen and paper, which is extremely unlikely to crash and lose the data from a whole experimental session. Rather, my favourite is a tool for analysing experimental results and modelling the resulting patterns of behaviour and preferences.

One letter: R

R is a programming language for everything statistical. It’s free, it’s open source, and it’s maintained by statisticians for statisticians. That origin also means it can be a pain to learn: it takes a while to clear a path through the data structures, including the various conventions for extracting information from the objects that store the results of painstaking statistical analyses, and I am still often baffled myself.
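To make those extraction conventions concrete, here is a small sketch using R’s built-in cars data set: a fitted model is a list-like object, and getting information out of it means knowing which accessor function (or which component of which summary object) to reach for.

```r
# Fit a linear regression on the built-in 'cars' data set
# (stopping distance as a function of speed).
fit <- lm(dist ~ speed, data = cars)

# The result is a list-like object; each piece of the analysis
# has its own extraction convention:
coef(fit)                  # named vector of regression coefficients
summary(fit)$r.squared     # R-squared lives inside the summary object
head(residuals(fit))       # the first few residuals
str(fit, max.level = 1)    # peek at the object's internal structure
```

Nothing here is printed automatically; you have to ask the object for each piece, which is exactly the learning curve described above.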

But the payoff is magnificent. Clear (modulo coding ability), open, replicable analyses. R is the ultimate in replicable research. If you give people your data set and your source code, they can repeat every single step of your reasoning. There are no paywalls, no limits of affordability, no packages that are indispensable for the analysis, but that your department hasn’t paid for.

R’s free, open source origins are especially important when we’re talking about the analysis of data that is publicly available, data that can be – and should be! – analysed and reanalysed by citizen bloggers and journalists.

An excellent example is the comment thread on a recent Posterous post by Ben Goldacre about the reanalysis of a publicly available data set. Several people discuss their statistical analyses in terms of the code they used to generate them, which makes the whole discussion far more transparent. While both the Stata and the R code are useful, only the R analyses can be quickly replicated by anyone with a computer and an internet connection (download R, read in the data, execute the code, done).
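That replication loop can be sketched in a few lines of R. Here the built-in sleep data set stands in for a downloaded file (the read.csv path is purely illustrative, not the data from Ben’s thread), so anyone who runs the script gets exactly the same numbers.

```r
# Step 1: read the data. In a real replication this would be something like
#   data <- read.csv("study-data.csv")   # path is illustrative
# Here R's built-in 'sleep' data set stands in for the download.
data <- sleep

# Step 2: execute the analysis exactly as posted.
result <- t.test(extra ~ group, data = data)

# Step 3: inspect the result; every reader obtains the identical value.
print(result$p.value)
```

Because the data and the code together determine the output, a disagreement in the comments can be settled by running the script rather than arguing about prose descriptions.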

R is even better when used in synergy with LaTeX – essentially, the resulting research papers are completely self-documenting. I don’t write any other way, except when forced to collaborate with first authors who insist on Microsoft Word. Even then, I try to document the main results of my analyses using R and LaTeX, although laziness sometimes gets the better of me and I copy percentages straight into Word.
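As a sketch of what that self-documentation looks like, here is a minimal Sweave file (the file contents are illustrative): R code lives in `<<>>=` chunks, and `\Sexpr{}` splices computed values straight into the prose, so the paper regenerates its own statistics every time it is compiled.

```latex
\documentclass{article}
\begin{document}

Mean stopping distance in the built-in \texttt{cars} data:
<<echo=FALSE>>=
m <- mean(cars$dist)
@
The mean is \Sexpr{round(m, 1)} feet; this number is recomputed
from the raw data on every compile, so it can never drift out of
sync with the analysis.

\end{document}
```

Running `Sweave("paper.Rnw")` in R produces a plain `.tex` file that pdflatex can then typeset as usual.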

Relying on free, open-source solutions like R and LaTeX is particularly important when you change institutions a lot. I have what could legitimately be called a patchwork career – after three years as a lecturer, I spent three years in industry, where there were no funds for SPSS, so I was glad I had taken the time to learn basic R at university. Afterwards, I worked for four different universities in two years, a jumble of small, part-time jobs. Using free software as much as possible made me independent of the particular budgetary constraints and preferences at each institution. Now that I have worked exclusively in the same place for five years, I make a very conscious effort to stay independent. As we all know, science budgets wax and wane, and periods of relative dearth are common. So it’s even more important to ensure that you have the tools to keep publishing when the funds run dry and you need to beef up your CV for the next round of applications.

Especially when the tools are as excellent as R.


  2. I like the concept of replicable analyses and of repeating every single step of a procedure.

    Replicable figures are very important in Finance [1]. I have been using R to test the weights of the optimal portfolios in ConpA [2]. Two implementations, same results.


  3. Ken Williams said,

    Hear, hear. The console nature of R lends itself to reproducible and shareable analysis, which just doesn’t happen when people use Excel or describe their procedures in prose.

  4. Phil said,

    Stata of course also does this with .do files (scripts). It’s not as cheap as R, but it’s not very expensive either, it’s extremely well supported, and it has a large user community. A good alternative to R.

    • mariawolters said,

      I agree that the .do scripts are great – eminently postable as Ben’s thread showed, and I’ve heard many good things about Stata. I also assume that Stata tends to have one central set of well-maintained routines, whereas R packages are numerous and can be hit-and-miss sometimes.

      But I’m puzzled by your definition of not very expensive. The academic price of Stata/IC, which IIRC is the basic version, is just shy of £400, the non-academic UK price is £800. (Source: Timberlake, the UK distributor.)

  5. alan said,

    Using Sweave really augments the strengths of LaTeX and R. Sweave unifies the writing and the stats/data processing into a single set of files — more reproducible than either one alone.
