February 18, 2011

What are the Limits of Using Tweets for Research?

Posted in Uncategorized at 10:19 pm by mariawolters

I had an interesting discussion on Twitter last night with Gunther Eysenbach, Aaron Quigley, and Chris Dickie about – of all things – Twitter and privacy – namely, whether it is ethical to harvest tweets and use them for research without prior informed consent of the Twitterers who wrote them. The discussion was sparked off by Eysenbach’s strong reaction to this blogpost by Michael Zimmer.

Zimmer argues that while Twitter is a public medium, “there is a reasonable expectation that one’s tweet stream will be ‘practically obscure’ within the thousands (if not millions) of tweets similarly publicly viewable.” He continues that the intended audience for tweets is the people who invest the time and make the effort to seek the Twitterer out – but not researchers and other data analysts. Therefore, anybody seeking to use Tweets for research should first ask the Twitter users whose Tweets they are harvesting for “informed consent”.

Eysenbach argues that this is nonsense – Tweets are public, they are apparently not subject to copyright, and therefore, seeking informed consent to perform research on them is tantamount to Ethics Gone Crazy. (Not Eysenbach’s words, but he does appear to feel quite strongly about this issue.) Eysenbach, one of the key figures of medical internet research, has previously examined the ethics of qualitative internet research in depth in the British Medical Journal, so it was interesting to hear what he had to say.

For me, there are several issues here. I’m going to describe them in a somewhat vague layperson’s terminology, and I hope that people with more legal / ethical knowledge than me will be able to correct and comment on this.

The first issue would be “fair use”. As soon as Tweets are published, they become searchable, and they are periodically harvested and stored by Google. It may be your intention to communicate only with the people you are mentioning, but tweetbots, marketers, and others will be alerted to your Tweet through dedicated keyword searches as soon as it is published. Such automatic alerts or even regular hand-searches are common, and can lead to all sorts of serious legal, financial, and personal trouble for innocent tweeps who thought they were just talking to their followers. Twitter itself maintains up-to-date frequency statistics about words and two- and three-word combinations that are displayed as “Trending Topics”. In conclusion, it seems to me that detailed analyses of tweets are normal and, dare I say it, part of the fabric of Twitter – much to the regret of everyone who has ever been inundated by Bieber- and iPadBots.

So, 1-0 to Eysenbach.

Copyright was another issue that was mentioned in the debate (e.g., does retweeting violate copyright?). I am not even remotely qualified to address this point, so I’ll leave it for now.

But there’s a third issue here, and that is privacy and confidentiality. Yes, Twitter is public. But does that mean that everything that is published there can be republished without regard for the twitterer’s privacy? The Press Complaints Commission may think so, but I would like to see researchers hold themselves to stronger ethical standards than this. As we’ve recently seen in the Baskerville case, trampling roughshod over a person’s privacy can do real damage.

It may be argued that users should be aware at all times that tweets are public, and use Twitter accordingly. But that does not reflect the reality of Twitter usage. For me, and for others, it’s a kind of watercooler while we work. While it is true that Twitter is not a chatroom, it can come close, especially for people who work from home, like Ian Rankin. Some might brand this kind of behaviour foolish, but the point is that there are millions of fools just like Ian and me on Twitter. (Other things I have in common with Ian Rankin: Both of us live in Edinburgh. And, er, that’s it.)

Having said that, I try not to say anything on Twitter that I wouldn’t be happy to say to the addressee’s face. You never know, they might have searches set up for misspellings of their own name and tweet at you out of the blue, telling you that they are actually really lovely.

And the watercooler fun is not all. There are plenty of people who rage about their work (pseudonymously, but still, in my opinion, dangerously), and plenty of people who tweet while drunk or when returning home after a long and eventful night out. They may delete those tweets in the cold light of the morning, while suffering from the mother of all headaches, but your bot may still have harvested and stored them. Others take to Twitter when life gets them down, or they have had a bad day, to receive instant messages of support that can make a real difference to their mood.
Last, but not least, there are spoof accounts, set up to ridicule another twitterer. I know at least four people this has happened to, all prolific Twitter users.

When you perform your analysis, as a conscientious researcher, what do you do with all that sensitive data? Would you quote it with name and location? Would you discuss examples of spoof accounts, drunk tweets, dark moods, or watercooler jokes at length? Would you quote the tweets in your research – which would enable anybody with a search engine to find the original author? Would you reference twitterers by name, or use their location data in your publications?

For me, this is where I draw the line. Harvesting is all well and good, but

  1. for analysis purposes, identifying information should be removed as far as possible;
  2. for publication purposes, it might be a good idea to contact the twitterer whose tweets are to be cited as examples, especially if the content of the tweets is manifestly private or potentially embarrassing and damaging.

Anonymisation is actually standard practice in research – in all the consent forms signed by people who participate in my experiments, I assure them that their data will be fully anonymised, that nobody will be able to identify them from the data that has been stored about them, and that no identifying data will ever be used when the research is presented. It seems to me that Twitter research should follow similar practices.
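
To make the first point a little more concrete, here is a rough sketch of what removing identifying information at harvest time might look like. It is only an illustration, not a recommended pipeline: the record fields ("user", "text", "location"), the function names, and the secret key are all assumptions of mine, not part of any Twitter API. Note that it does nothing about the tweet text itself, which remains searchable if quoted verbatim.

```python
import hashlib
import hmac
import re

# Hypothetical record layout: {"user": ..., "text": ..., "location": ...}
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def pseudonym(username: str, secret_key: bytes) -> str:
    """Map a username to a stable code that cannot be reversed without the key."""
    digest = hmac.new(secret_key, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]


def anonymise(tweet: dict, secret_key: bytes) -> dict:
    """Keep only what the analysis needs; location and other fields are dropped."""
    text = URL.sub("<url>", tweet["text"])   # links can point back to the author
    text = MENTION.sub("@user", text)        # mentions name other twitterers
    return {"author": pseudonym(tweet["user"], secret_key), "text": text}
```

The keyed hash means the same twitterer always gets the same code, so per-author analyses still work, while the key itself can be stored separately from the data or destroyed once harvesting is finished.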

I appreciate that there are some analyses where one would want to keep, say, location information, for example for regional topic analysis or news detection. But in such analyses, information from millions of tweets (and, by extension, Twitterers) is typically condensed into a few trends and graphs. Likewise, in research, demographic data stored about participants is often extremely broad – all I would typically store is gender, age (group), maybe education (again very broad, such as highest qualification achieved) or country of birth. This by itself is not enough to identify a person.
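
As a small illustration of that kind of condensing, the sketch below reduces a pile of tweets to a count per region for one keyword, so that only the totals would ever reach a graph or a paper. The field names ("text", "region") and the function name are hypothetical, and a real regional analysis would of course be more involved.

```python
from collections import Counter


def topic_counts_by_region(tweets, keyword):
    """Count tweets mentioning `keyword` per broad region; no tweet text is kept."""
    counts = Counter()
    for tweet in tweets:
        if keyword.lower() in tweet["text"].lower():
            counts[tweet["region"]] += 1
    return dict(counts)


# Made-up example data, purely for illustration.
sample = [
    {"text": "Snow again this morning", "region": "Scotland"},
    {"text": "so much snow!", "region": "Scotland"},
    {"text": "Sunny here", "region": "Wales"},
    {"text": "snow on the hills", "region": "Wales"},
]
print(topic_counts_by_region(sample, "snow"))
```

Individual authors and their exact words disappear in the totals, which is why I am less worried about location data in this kind of analysis than in quoted examples.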

Another question is what would count as publication. That is an interesting question in itself. I haven’t asked Eysenbach for permission to link to the tweet I cited earlier, and I do not ask bloggers for permission to link to their posts. The line is extremely fluid though. My one overriding rule is to respect people’s privacy – if they would not want a tweet to be discussed in the Daily Mail, I do not mention it. (For non-UK-based readers, substitute a sensationalist tabloid whose political orientation is diametrically opposed to yours to get the general drift.) I have very specific examples in mind for most of the more touchy privacy issues I discussed above, but I would not want to name the people involved without their explicit consent.

One could argue that since these people were foolish enough to make mistakes in public, they are fair game. But this line of argument suffers from a problem that I see time and time again online – it disregards the fact that there are actual, living human beings behind the screen names, and they are affected, often deeply and sometimes permanently, by the things people write and tweet about them. As researchers, it is our duty to safeguard our research participants, to make sure they come to no harm as a consequence of our work. Therefore, it is incumbent on researchers to protect the best interests of the people whose online output they study – not vice versa.

Looking at Eysenbach’s paper with James Till in the British Medical Journal again, I think we would agree on this matter. For example, they say about newsgroup postings:

The internet holds various pitfalls for researchers, who can easily and unintentionally violate the privacy of individuals. For example, by quoting the exact words of a newsgroup participant, a researcher may breach the participant’s confidentiality even if the researcher removes any personal information. This is because powerful search engines such as Google can index newsgroups (http://groups.google.com/), so that the original message, including the email address of the sender, could be retrieved by anybody using the direct quote as a query. Participants should therefore always be approached to give their explicit consent to be quoted verbatim and should be made aware that their email address might be identifiable. Another reason why researchers should contact individuals before quoting them is that the author of the posting may not be seeking privacy but publicity, so that extensive quotes without attribution may be considered a misuse of another person’s intellectual property.

However, it is entirely possible that we differ on the implied norms of Twitter use. Eysenbach and Till write:

Thirdly, and perhaps most importantly, the perception of privacy depends on an individual group’s norms and codes, target audience, and aim, often laid down in the “frequently asked questions” or information files of an internet community.

[ Twitter does not have a privacy policy, and there are no hard and fast rules or FAQs because there are so many different ways of using the service. ]

Update: As Gunther Eysenbach has kindly and correctly pointed out below, Twitter does have an explicit policy, to which he ascribes the same status as a FAQ.

However, from my own observations of the 800+ accounts I follow, “Twitter as watercooler” or even “Twitter as safety net for people who are going through a hard time in their life” are two frequent, entirely valid, and important uses of the service that need to be handled sensitively in research. Researchers who may not have encountered those uses of Twitter in their own work or streams, or who base their assessment of privacy mainly on external characteristics such as searchability, disregard them at their peril.

In conclusion, I see no problem with a systematic analysis of tweets for research, provided that they are properly anonymised. Twitterers should not be identifiable from publications that result from a piece of research unless they have explicitly consented to be named and/or to have their tweets reproduced in the context of this particular study. Given the private nature of much of Twitter’s content, that is indeed basic research ethics.