October 19, 2010
Data Scraping for HCI and eHealth – Some Thoughts
Sophia from I’m a Scientist, Get Me Out of Here asked me to elaborate on a tweet where I mentioned that data scraping could be useful for studying Human-Computer Interaction. Two applications spring to mind, but both will require more sophisticated analysis than counting.
1. Searching for Usability Problems
People who find software hard to use will complain about it on the Internet. But it can be difficult to find all, or even most, of those complaints. What’s needed here is a combination of named entity recognition and sentiment analysis. Named entity recognition is a set of techniques that look for all mentions of a certain entity, such as a software package, in a text. Sentiment analysis is a set of tools for gauging attitudes from text. These range from simple methods such as keyword searches to more sophisticated methods that involve complex computational linguistic analyses.
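To make the simple end of that spectrum concrete, here is a minimal Python sketch that pairs a crude stand-in for named entity recognition (matching a hand-written list of aliases) with keyword-based sentiment. The alias list, the negative word list, and the function names are all my own assumptions for illustration; a real system would need far larger lexicons and proper linguistic analysis.

```python
import re

# Hypothetical alias list for the entity of interest (an assumption, not exhaustive).
ALIASES = ["Microsoft Word", "MS Word", "M$ Word"]
# Hypothetical negative keywords; a real sentiment lexicon would be much larger.
NEGATIVE_WORDS = {"sucks", "useless", "hate", "awful"}

# One pattern that matches any alias, ignoring case.
ENTITY_RE = re.compile("|".join(re.escape(a) for a in ALIASES), re.IGNORECASE)


def mentions_entity(sentence):
    """Crude entity spotting: does the sentence contain any known alias?"""
    return ENTITY_RE.search(sentence) is not None


def negative_score(sentence):
    """Crude sentiment: count how many words are in the negative keyword list."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for w in words if w in NEGATIVE_WORDS)


def find_complaints(text):
    """Return sentences that mention the entity and contain a negative keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if mentions_entity(s) and negative_score(s) > 0]


if __name__ == "__main__":
    sample = ("I use MS Word daily. M$ Word sucks at numbering! "
              "LaTeX is fine. I hate Mondays.")
    for complaint in find_complaints(sample):
        print(complaint)
```

Working sentence by sentence means a negative word only counts when it shares a sentence with the entity, which is still a very rough proxy for the complaint actually being about the software.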
To give you an idea of the complexity of the task, let’s assume we want to find public complaints about usability problems with MS Word. Googling (“Microsoft Word” sucks) yields 438,000 pages, while (“Microsoft Word sucks”) yields 1,790. When we expand into synonyms, we find that (“M$ Word sucks”) gives 637 pages, and “Microsoft / MS Word is useless”, searched as a complete phrase, gives another 14. (Of course, Word users mainly refer to their instrument of torture as Word, which adds a whole other layer of complexity to be disambiguated.)
The reason you want to use phrases as much as possible is that not every co-occurrence of Microsoft Word and “sucks” in the same document is a diatribe on the failures of MS Word. For example, when googling (Microsoft Word Latex sucks), one of the entries on the first page is this question on Slashdot, where a hapless physics major asks whether it’s worth learning LaTeX, and the page actually contains the phrase “LaTeX sucks”. (If you are reading this – YES IT IS. If you’re serious about word processing, use proper software. And there is a special place in hell for journal editors who insist on MS Word submissions. Diatribe over.)
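The difference between co-occurrence and phrase matching can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, the tokeniser is deliberately naive, and the sample document is made up to mimic the Slashdot case rather than quoted from it.

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z$]+", text.lower())


def cooccurs(text, terms):
    """True if every word of every term appears somewhere in the text,
    roughly like an unquoted search query."""
    present = set(tokens(text))
    return all(word in present for term in terms for word in tokens(term))


def contains_phrase(text, phrase):
    """True if the words of the phrase appear consecutively and in order,
    roughly like a quoted search query."""
    doc, target = tokens(text), tokens(phrase)
    if not target:
        return False
    n = len(target)
    return any(doc[i:i + n] == target for i in range(len(doc) - n + 1))


if __name__ == "__main__":
    # Invented example text, not the real Slashdot post.
    doc = ("Is LaTeX worth learning? My professor says LaTeX sucks, "
           "so I write in Microsoft Word.")
    print(cooccurs(doc, ["Microsoft Word", "sucks"]))      # co-occurrence match
    print(contains_phrase(doc, "Microsoft Word sucks"))    # phrase match
    print(contains_phrase(doc, "LaTeX sucks"))             # what is really said
```

The invented document matches the loose co-occurrence query even though the complaint in it is about LaTeX, which is exactly the kind of false positive that phrase queries avoid, at the price of missing complaints that are worded differently.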
But even if you were to find every instance on the internet where somebody complains about the general suckitude, uselessness, or space-, time-, and sanity-wasting abilities of Microsoft Word, that would not mean that you have automatically identified the usability problems themselves – you only know which documents might contain them. For example, this page is full of phrases such as “Word sucks”, but it does not discuss its usability problems in detail. (There is a fully intended subliminal message though.)
2. Identifying Internet Users with Disabilities and Long-Term Care Conditions
A lot is being written about eHealth and Telemedicine, and a prime user group is people who have long-term conditions that make it difficult for them to leave the house. There are many ways of finding out how widespread internet use is among those groups, but a particularly powerful way is simple observation. How many bloggers, forum users, Facebook users and Twitter users identify themselves as having a long-term care condition? How many more don’t talk about their conditions on their About page, but casually mention depression, arthritis, or chronic pain, amongst other things? What do they have to say about services? Probably a key outcome would be, as it is so often with eHealth and telecare, that people would like to see their meatspace services working well before any fancy monitoring, gadgetry, and telewizardry is added.
These are just some thoughts, somewhat disorganised, possibly more challenging than hoped for, and very probably working from a different definition of data scraping. I hope they help and can stimulate discussion.