October 19, 2010
Sophia from I’m a Scientist, Get Me Out of Here asked me to elaborate on a tweet where I mentioned that data scraping could be useful for studying Human-Computer Interaction. Two applications spring to mind, but both will require more sophisticated analysis than counting.
1. Searching for Usability Problems
People who find software hard to use will complain about it on the Internet. But it can be difficult to find all or even most of such mentions. What’s needed here is a combination of named entity recognition and sentiment analysis. Named entity recognition is a set of techniques that look for all mentions of a certain entity, such as a software package, in a text. Sentiment analysis is a set of tools for gauging attitudes from text. There are simple methods such as keyword searches, and more sophisticated methods that involve complex computational linguistic analyses.
To give you an idea of the complexity of the task, let’s assume we want to find public complaints about usability problems with MS Word. Googling (“Microsoft Word” sucks) yields 438,000 pages, while (“Microsoft Word sucks”) yields 1,790. When we expand into synonyms, we find that (“M$ Word sucks”) gives 637 pages, and Microsoft / MS Word is useless (complete phrase) gives another 14. (Of course, Word users mainly refer to their instrument of torture as Word, which leads to a whole other layer of complexity to be disambiguated.)
The reason you want to use phrases as much as possible is that not every co-occurrence of Microsoft Word and “sucks” in the same document is a diatribe on the failures of MS Word. For example, when googling (Microsoft Word Latex sucks), one of the entries on the first page is this question on Slashdot, where a hapless physics major asks whether it’s worth learning LaTeX. (If you are reading this – YES IT IS. If you’re serious about word processing, use proper software. And there is a special place in hell for journal editors who insist on MS Word submissions. Diatribe over.) And that question actually contains the phrase “LaTeX sucks”.
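The difference between co-occurrence and phrase matching can be sketched in a few lines of Python. This is only an illustration: the name variants, the complaint words, and the two helper functions are my own assumptions, and real named entity recognition and sentiment analysis would need far more than regular expressions.

```python
import re

# Hypothetical patterns, chosen for illustration only.
PRODUCT = r"\b(?:Microsoft|MS|M\$)\s+Word\b"
COMPLAINT = r"\b(?:sucks|is\s+useless)\b"

PRODUCT_RE = re.compile(PRODUCT, re.IGNORECASE)
COMPLAINT_RE = re.compile(COMPLAINT, re.IGNORECASE)
# The complaint must directly follow the product name.
PHRASE_RE = re.compile(PRODUCT + r"\s+" + COMPLAINT, re.IGNORECASE)


def cooccurs(text):
    """True if the product and a complaint word appear anywhere in the text."""
    return bool(PRODUCT_RE.search(text) and COMPLAINT_RE.search(text))


def phrase_match(text):
    """True only if the complaint is attached to the product name."""
    return bool(PHRASE_RE.search(text))


# Invented example sentences, not real search results.
rant = "Microsoft Word sucks and has eaten my document again."
question = "Should I write my thesis in Microsoft Word? My friend says LaTeX sucks."
variant = "M$ Word is useless for long documents."

for text in (rant, question, variant):
    print(cooccurs(text), phrase_match(text), text)
```

The invented question is flagged by the co-occurrence test but not by the phrase test, which is the false positive described above. Neither test says anything about which usability problem is being complained about.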
But even if you were to find every instance on the internet where somebody complains about the general suckitude, uselessness, or space, time, and sanity wasting abilities of Microsoft Word, that would not mean that you automatically have identified the usability problems themselves – you only know which documents might contain them. For example, this page is full of phrases such as “Word sucks”, but it does not discuss its usability problems in detail. (There is a fully intended subliminal message though.)
2. Identifying Internet Users with Disabilities and Long Term Care Conditions
A lot is being written about eHealth and Telemedicine, and a prime user group is people who have long-term conditions that make it difficult for them to leave the house. There are many ways of finding out how widespread internet use is among those groups, but a particularly powerful way is simple observation. How many bloggers, forum users, Facebook users and Twitter users identify themselves as having a long-term care condition? How many more don’t talk about their conditions on their About page, but casually mention depression, arthritis, or chronic pain, amongst other things? What do they have to say about services? Probably a key outcome would be, as it is so often with eHealth and telecare, that people would like to see their meatspace services working well before any fancy monitoring, gadgetry, and telewizardry is added.
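As a very rough sketch of the counting step, the following Python snippet tallies how many posts per author mention one of a few condition terms. The term list, the function name, and the sample posts are all invented for illustration. Mentioning a condition is not the same as having it, so anything found this way would still need to be read by a human.

```python
import re

# Hypothetical term list; a real study would need a much richer vocabulary.
CONDITION_TERMS = ["depression", "arthritis", "chronic pain"]
CONDITION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in CONDITION_TERMS) + r")\b",
    re.IGNORECASE,
)


def authors_mentioning_conditions(posts):
    """Count, per author, the posts that mention at least one condition term.

    `posts` is an iterable of (author, text) pairs.
    """
    counts = {}
    for author, text in posts:
        if CONDITION_RE.search(text):
            counts[author] = counts.get(author, 0) + 1
    return counts


# Invented sample posts, not real data.
sample_posts = [
    ("anna", "My arthritis is bad today, so no gardening."),
    ("anna", "Lovely weather this weekend."),
    ("ben", "Living with chronic pain means planning every trip out."),
    ("cara", "New phone arrived!"),
]

print(authors_mentioning_conditions(sample_posts))
```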
These are just some thoughts, somewhat disorganised, and possibly more challenging than hoped for, very probably working from a different definition of data scraping – hope they help and can stimulate discussion.
October 17, 2010
A while ago, I posted a diatribe on the hierarchical closed structures of German academia. On Bloggen in der Wissenschaft, a blog written by young German academics, I have now found some numbers that illustrate the situation. In this post, I will focus on the last part of the pyramid, the availability of permanent positions.
According to the post, roughly 1/3 of new PhDs who would like a permanent post in the German system will be able to get one. They base this on a comparison between the number of PhD students graduating each year who have academic ambitions and the number of open positions. Sounds good, doesn’t it? But let’s look at what these positions are like.
300 openings are for permanent positions at full universities with a high teaching load, e.g. Dozent, Akademischer Rat.
200 openings are for Junior Professorships. These are the equivalent of the American entry level assistant professor, with the possibility of tenure after six years. However, as discussed before, quite a few Juniorprofessor positions are not meant to lead to tenure, but are designed as temporary posts.
The biggest batch of positions, 2,500, are for full professorships at both full universities and universities of applied science. The German universities of applied science are similar to the former UK polytechnics in that there is a strong emphasis on teaching and a high teaching load. Many of these institutions are not allowed to grant PhD degrees, which makes it extremely difficult to build a research group of your own. If that sounds enticing to you, remember that universities of applied science typically require their professors to have at least three years’ industry experience and offer a salary that is substantially lower than what could be earned in industry.
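To see how the three categories add up, here is a back-of-the-envelope calculation in Python. The counts are the ones quoted above; the labels are shortened, and the estimate of how many new PhDs are competing rests on my assumption that these three categories are all the openings behind the "roughly 1/3" figure.

```python
# Opening counts as quoted above; the labels are shortened paraphrases.
openings = {
    "permanent teaching posts": 300,
    "Juniorprofessur": 200,
    "full professorships": 2500,
}

total = sum(openings.values())
shares = {label: count / total for label, count in openings.items()}

# Assumption: if roughly one in three aspirants gets one of these posts,
# the number of aspirants is about three times the number of openings.
implied_aspirants = 3 * total

for label, count in openings.items():
    print(f"{label}: {count} ({shares[label]:.0%})")
print(f"total openings: {total}")
print(f"implied PhDs seeking a permanent post: about {implied_aspirants}")
```

On these numbers, full professorships account for five in six of the openings, which is why the requirements attached to them matter so much for the overall picture.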
In order to get a Professur at a full university, a Habilitation is usually required, which is a piece of work that is equivalent in size to a PhD thesis. (In the Humanities, you essentially write a second book, with the PhD as your first book.) A period as a Juniorprofessor may be viewed as equivalent to a Habilitation, but despite the introduction of this career path some ten years ago, around 2000 people still finish their Habilitation each year. One might interpret that as young academics not believing the hype.
Taken together, this means a long period in the wilderness of postdoc and temporary positions until job security is achieved or, in the case of Juniorprofessuren with a prospect of tenure, achievable.
October 15, 2010
A column by Nick Barnes in Nature News I saw a couple of days ago struck a chord with me. In his column, Nick argues that sharing code makes science more transparent and scientists more easily accountable. Nick says:
That the code is a little raw is one of the main reasons scientists give for not sharing it with others. Yet, software in all trades is written to be good enough for the job intended. So if your code is good enough to do the job, then it is good enough to release — and releasing it will help your research and your field.
It also helps if the software or scripts that are used are written in a sort of “Lingua Franca” that the research community can refer back to. Once that has been established, people will happily trade code and scripts back and forth on mailing lists and web sites. (Incidentally, free software is also a lot easier to access for people whose departments don’t have the money for the latest and greatest extension packages – as I discovered on my brief forays into MATLAB.) Unsurprisingly, a lot of the Linguae Francae that I have encountered over the years tend to be based on free software. I will mention two data analysis tools that are particularly relevant to my line of work, speech science.
PRAAT (Dutch for “to speak”) is an all-singing, all-dancing, egg-laying, milk-and-wool producing (OK, maybe not quite) tool for phoneticians. Basically, if you need it, PRAAT can do it in some way or other. The algorithms it uses for common analysis tasks such as determining the fundamental frequency of a speech signal are pretty decent.
Best of all, PRAAT can be scripted. Unfortunately, this scripting language is one of the more cumbersome languages I have programmed in. For example, the function names correspond to the names of the relevant menu items, and the arguments for the functions follow the sequence in which they are entered in the relevant forms. Unsurprisingly, updates of PRAAT that changed several forms have been known to cause much crying and gnashing of teeth. Despite these shortcomings, PRAAT is a fantastic tool, extremely versatile, and with a lively Yahoo Group.
R is a language for statistical analysis that was coded and is being maintained by a group of excellent statisticians. Although there are some issues with efficiency and the algorithms used for calculating some statistical analyses, R is an extremely reliable tool with many active specialised user communities. The sheer momentum behind R is described very well in this post from the Revolution Blog.
Of particular relevance to me is the R-lang linguistics mailing list, where people like T. Florian Jäger, Hugo Quené, and Roger Levy (to name but a few!) share their expertise in concise code snippets. Postings often include relevant data fragments.
Other tools (again, among much other cool free stuff!) that are relevant to computational linguists are the NITE XML toolkit for defining, creating, annotating, and maintaining sophisticated databases of language data and the NLTK toolkit for Python. I don’t have much hands-on experience with either of those tools because it’s been a long time since I properly worked on language data – I have been focusing on statistics and experimental design in a range of fields during the past seven years.
Maintenance and Momentum
Free software takes a lot of time and effort to produce and support. The field is littered with toolboxes and programmes that were widely used once, but are now no longer maintained. A good example is the SNACK sound tool kit for Python. When I worked in industry as a development engineer for Rhetorical Ltd (now sadly no more), we used Wavesurfer, a powerful customisable tool based on SNACK, for most of our annotation and also some signal analysis work. Wavesurfer is still used at my research centre, but it’s unsupported and won’t be developed further.
It remains to be seen whether PRAAT can keep its momentum – it is certainly still actively developed. R, on the other hand, seems to be self-sustaining very nicely, thank you, and NLTK has been included in a recent Ubuntu distribution.
October 8, 2010
Dear Mr Crockart,
My name is Dr Maria Wolters. I live in XXXX and am one of your constituents. As a Senior Research Fellow at the University of Edinburgh, I am deeply worried about the impact that the proposed budget cuts will have on UK science, and Scottish science in particular.
Scotland has always taken great pride in its intellectual heritage, and there are many initiatives to capitalise on the unique density of excellent universities and translate the world-class research they produce into products, jobs, and economic growth.
However, the innovation pipeline needs to be fed by basic research. If funding for this is cut off at the source, at a time when other nations such as Germany and the US invest heavily in research, the stream of transferable results will dry up, and the long-term economic consequences for Scotland and the UK will be dire.
Edinburgh itself will be particularly badly hit. Although its four universities attract many academics and research students and have a great track record of successful spin-off companies, institutions such as Edinburgh Napier University and Queen Margaret University are already tightening their belts and laying off vital research and teaching staff.
Even without the shadow of potentially severe cuts, the current situation is critical. It is extremely difficult to get funding from the UK Research Councils. Research only has a chance if it is uniformly rated as outstanding or excellent, and even then, funding is by no means assured.
Partnering with industry is not a cure-all. Basic research has a time to market of 5-30 years. This is far too long a time frame for even the largest companies, most of which are under great pressure to deliver consistent returns to their shareholders. Yet, without the groundwork of basic, blue-skies research, there will be no technology to transfer, no innovations to monetise. It is government that needs to provide funds for seeding the next vital breakthroughs.
The University of Edinburgh is the highest-ranked University in Scotland according to the recent Times Higher Education Ranking. I am proud to be part of the internationally acclaimed School of Informatics, the highest ranked UK department of Computer Science in the 2008 RAE. We have a thriving commercialisation arm, Informatics Ventures. You probably know only too well yourself how crucial computer science and IT are to the whole economy. All this activity and research excellence is sustained on a tight budget, with careful administration of the available resources. There is no room for further manoeuvring.
Research groups and academic departments depend on a steady stream of research income to attract the best group leaders and researchers. Groups that have taken decades to build can be destroyed and dispersed by a dry spell of a couple of years. Researchers will follow the money, which means that the UK and in particular Scotland will lose out. Permanently. So, funding cuts not only jeopardise the health of the economy for years to come, they will also waste the substantial investment in people and knowledge that has already been made. To make matters worse, world-class universities like the University of Glasgow urgently need research funding to survive. You may remember the recent massive front page headline in the Herald – the urgency of the situation was a real wake-up call. Fortunately, the University of Edinburgh is not in a similar situation – yet. With further cuts, it may well be.
The Science is Vital [http://scienceisvital.org.uk/] coalition, along with the Campaign for Science and Engineering [http://www.sciencecampaign.org.uk], are calling upon the Government to set out a supportive strategy, including public investment goals above or at least in step with economic growth. Without such investment and commitment the UK risks its international reputation, its market share of high-tech manufacturing and services, the ability to respond to urgent and long-term national scientific challenges, and the economic recovery. Science Is Vital is supported by many Fellows of the Royal Society as well as charities such as Cancer Research UK.
Please support your constituents, your city, and the Scottish and UK economy by
– signing EDM 767 – Science is Vital (http://bit.ly/edm767)
– signing the Science is Vital petition – (http://scienceisvital.org.uk/sign-the-petition – I have already signed myself)
– attending a lobby in Parliament on 12 October (15.30, Committee Room 10). I won’t be at the lobby myself, but many others will.
I look forward to hearing from you.