August 30, 2010
The one tool I couldn’t live without
As part of the Scientiae carnival, Karina asks what tools we couldn’t live without. Although I’m an experimentalist, I don’t work with glass or microbes, like most of the science blogosphere I read, but with silicon and people (a fancier way of saying human-computer interaction).
My favourite tool is not an experimental platform – as a respondent to one of our project’s questionnaires noted sagely, “Technology breaks!”. Personally, I am a great fan of good old-fashioned pen and paper, which is extremely unlikely to crash and lose the data of a whole experimental session. Rather, my favourite is a tool for analysing experimental results and modelling the resulting patterns of behaviour and preferences.
One letter: R
R is a programming language for everything statistical. It’s free, it’s open source, and it’s maintained by statisticians for statisticians. That origin also means it is a pain to learn. It takes a while until one has cleared a path through the data structures, including the various conventions for extracting information from objects that store the results of painstaking statistical analyses, and I am still often baffled myself.
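To give a flavour of what I mean by those extraction conventions, here is a small sketch (not from any particular analysis of mine) using R’s built-in `cars` data set – note how the fitted model and its summary are objects you pull numbers out of in three different idioms:

```r
# Fit a simple linear model on the built-in 'cars' data set.
fit <- lm(dist ~ speed, data = cars)

# Three different conventions for getting at the results:
coef(fit)                  # accessor function -> named vector of coefficients
summary(fit)$r.squared     # list-style $-extraction from the summary object
confint(fit)["speed", ]    # matrix-style indexing into the confidence intervals
```

It is exactly this mix of accessor functions, `$` components and matrix indexing that takes a while to internalise.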
But the payoff is magnificent. Clear (modulo coding ability), open, replicable analyses. R is the ultimate in replicable research. If you give people your data set and your source code, they can repeat every single step of your reasoning. There are no paywalls, no limits of affordability, no packages that are indispensable for the analysis, but that your department hasn’t paid for.
R’s free, open source origins are especially important when we’re talking about the analysis of data that is publicly available, data that can be – and should be! – analysed and reanalysed by citizen bloggers and journalists.
An excellent example is the comment thread on a recent Posterous post by Ben Goldacre on the reanalysis of a publicly available data set. Several people discuss their statistical analysis in terms of the code they used to generate it, which makes the whole discussion far more transparent. While both the Stata and the R code are useful, only the R analyses can be quickly replicated by anyone with a computer and an internet connection (download R, read in the data, execute the code, done).
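That read-in-and-execute loop is short enough to sketch in full. The data below is a throwaway stand-in (the file name and variables are hypothetical, not from the discussion linked above); in real replication, the CSV would be the publicly shared data set:

```r
# Stand in for a shared public data set with a tiny hypothetical CSV.
csv <- tempfile(fileext = ".csv")
writeLines(c("group,score",
             "a,4", "a,5", "a,6",
             "b,7", "b,8", "b,9"), csv)

d <- read.csv(csv)                # step 1: read in the shared data
t.test(score ~ group, data = d)   # step 2: execute the shared analysis code
```

Anyone running those two steps on their own machine sees exactly the same test statistic and p-value – that is the whole point.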
R is even better when used in synergy with LaTeX – essentially, the resulting research papers are completely self-documenting. I don’t write any other way, except when forced to collaborate with first authors who insist on Microsoft Word. Even then, I try to document the main results of my analyses using R and LaTeX, although laziness sometimes gets the better of me and I copy percentages straight into Word.
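The standard glue for this is Sweave, which ships with R. A minimal sketch (a hypothetical file `report.Rnw`, not one of my actual papers) shows the idea: R chunks are evaluated when the document is built, so the numbers in the prose can never drift out of sync with the analysis.

```latex
% report.Rnw -- build with: R CMD Sweave report.Rnw, then run latex on the output
\documentclass{article}
\begin{document}

<<echo=FALSE>>=
fit <- lm(dist ~ speed, data = cars)
@

The model explains \Sexpr{round(100 * summary(fit)$r.squared)}\,\% of the
variance in stopping distance.

\end{document}
```

Change the data, rebuild, and every figure and percentage in the paper updates itself – no copy-and-paste step to get wrong.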
Relying on free, open-source solutions like R and LaTeX is particularly important when you change institutions a lot. I have what could legitimately be called a patchwork career – after three years as a lecturer, I spent three years in industry, where there were no funds for SPSS, so I was glad I had taken the time to learn basic R at university. Afterwards, I worked for four different universities in two years, a jumble of small, part-time jobs. Using free software as much as possible made me independent of the particular budgetary constraints and preferences at each institution. Now that I have been in the same place for five years, I make a very conscious effort to stay independent. As we all know, science budgets wax and wane, and periods of relative dearth are common. So it’s even more important to ensure that you have the tools to keep publishing when the funds run dry and you need to beef up your CV for the next round of applications.
Especially when the tools are as excellent as R.