October 15, 2010

Open Science – Free Software as Lingua Franca

Posted in Uncategorized at 9:40 pm by mariawolters

A column by Nick Barnes in Nature News that I saw a couple of days ago struck a chord with me. In his column, Nick argues that sharing code makes science more transparent and scientists more easily accountable. Nick says:

That the code is a little raw is one of the main reasons scientists give for not sharing it with others. Yet, software in all trades is written to be good enough for the job intended. So if your code is good enough to do the job, then it is good enough to release — and releasing it will help your research and your field.

It also helps if the software or scripts that are used are written in a sort of “Lingua Franca” that the research community can refer back to. Once that has been established, people will happily trade code and scripts back and forth on mailing lists and web sites. (Incidentally, free software is also a lot easier to access for people whose departments don’t have the money for the latest and greatest extension packages – as I discovered on my brief forays into MATLAB.) Unsurprisingly, a lot of the Linguae Francae that I have encountered over the years tend to be based on free software. I will mention two data analysis tools that are particularly relevant to my line of work, speech science.


PRAAT (Dutch for “talk”) is an all-singing, all-dancing, egg-laying, milk-and-wool producing (OK, maybe not quite) tool for phoneticians. Basically, if you need it, PRAAT can do it in some way or other. The algorithms it uses for common analysis tasks such as determining the fundamental frequency of a speech signal are pretty decent.

Best of all, PRAAT can be scripted. Unfortunately, this scripting language is one of the more cumbersome languages I have programmed in. For example, the function names correspond to the names of the relevant menu items, and the arguments for the functions follow the sequence in which they are entered in the relevant forms. Unsurprisingly, updates of PRAAT that changed several forms have been known to cause much crying and gnashing of teeth. Despite these shortcomings, PRAAT is a fantastic tool, extremely versatile, and with a lively Yahoo Group.
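To make the problem with form-ordered arguments concrete, here is a toy sketch in Python (not PRAAT's own scripting language). The function `bind_positional` and the field names are made up for illustration and do not record any real change to a PRAAT form. The sketch only shows why a script that passes arguments by position is fragile when a form is reordered or extended.

```python
def bind_positional(form_fields, args):
    """Pair each positional argument with the form field at the same position."""
    if len(args) != len(form_fields):
        raise ValueError(
            f"form has {len(form_fields)} fields, script supplied {len(args)} arguments"
        )
    return dict(zip(form_fields, args))


# Hypothetical form layouts for one menu command (illustrative names only).
FORM_OLD = ["time_step", "pitch_floor", "pitch_ceiling"]
FORM_REORDERED = ["time_step", "pitch_ceiling", "pitch_floor"]
FORM_EXTENDED = ["time_step", "pitch_floor", "max_candidates", "pitch_ceiling"]

# Arguments as an old script would pass them, in the old form order.
SCRIPT_ARGS = [0.01, 75, 600]

old = bind_positional(FORM_OLD, SCRIPT_ARGS)
reordered = bind_positional(FORM_REORDERED, SCRIPT_ARGS)

print(old["pitch_floor"])        # the value the script author intended
print(reordered["pitch_floor"])  # silently receives the ceiling value instead

try:
    bind_positional(FORM_EXTENDED, SCRIPT_ARGS)
except ValueError as err:
    print("script breaks:", err)
```

The reordered case is the nastier one: nothing fails, the analysis simply runs with the wrong settings. An extended form at least produces an error that tells you the script needs updating.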


R is a language for statistical analysis that was coded and is being maintained by a group of excellent statisticians. Although there are some issues with efficiency and with the algorithms used for some statistical analyses, R is an extremely reliable tool with many active specialised user communities. The sheer momentum behind R is described very well in this post from the Revolution Blog.

Of particular relevance to me is the R-lang linguistics mailing list, where people like T. Florian Jäger, Hugo Quené, and Roger Levy (to name but a few!) share their expertise in concise code snippets. Postings often include relevant data fragments.

Other Tools

Other tools (again, among much other cool free stuff!) that are relevant to computational linguists are the NITE XML toolkit for defining, creating, annotating, and maintaining sophisticated databases of language data, and NLTK, the Natural Language Toolkit for Python. I don’t have much hands-on experience with either of those tools because it’s been a long time since I properly worked on language data – I have been focusing on statistics and experimental design in a range of fields during the past seven years.

Maintenance and Momentum

Free software takes a lot of time and effort to produce and support. The field is littered with toolboxes and programmes that were once widely used, but are no longer maintained. A good example is the SNACK sound toolkit, which can be scripted from Tcl/Tk and Python. When I worked in industry as a development engineer for Rhetorical Ltd (now sadly no more), we used Wavesurfer, a powerful customisable tool based on SNACK, for most of our annotation and also some signal analysis work. Wavesurfer is still used at my research centre, but it’s unsupported and won’t be developed further.

It remains to be seen whether PRAAT can keep its momentum – it is certainly still actively developed. R, on the other hand, seems to be sustaining itself very nicely, thank you, and NLTK has been included in a recent Ubuntu distribution.