February 18, 2011

What are the Limits of Using Tweets for Research?

Posted in Uncategorized at 10:19 pm by mariawolters

I had an interesting discussion on Twitter last night with Gunther Eysenbach, Aaron Quigley, and Chris Dickie about, of all things, Twitter and privacy: namely, whether it is ethical to harvest tweets and use them for research without the prior informed consent of the Twitterers who wrote them. The discussion was sparked off by Eysenbach’s strong reaction to this blogpost by Michael Zimmer.

Zimmer argues that while Twitter is a public medium, “there is a reasonable expectation that one’s tweet stream will be “practically obscure” within the thousands (if not millions) of tweets similarly publicly viewable.” The intended audience for tweets, he continues, is therefore people who invest the time and make the effort to seek the Twitterer out, not researchers and other data analysts. Consequently, anybody seeking to use Tweets for research should first ask the Twitter users whose Tweets they are harvesting for “informed consent”.

Eysenbach argues that this is nonsense – Tweets are public, they are apparently not subject to copyright, and therefore, seeking informed consent to perform research on them is tantamount to Ethics Gone Crazy. (Not Eysenbach’s words, but he does appear to feel quite strongly about this issue.) Eysenbach, one of the key figures of medical internet research, has previously examined the ethics of qualitative internet research in depth in the British Medical Journal, so it was interesting to hear what he had to say.

For me, there are several issues here. I’m going to describe them in a somewhat vague layperson’s terminology, and I hope that people with more legal / ethical knowledge than me will be able to correct and comment on this.

The first issue would be “fair use”. As soon as Tweets are published, they become searchable, and they are periodically harvested and stored by Google. It may be your intention to communicate only with the people you are mentioning, but tweetbots, marketers, and others will be alerted to your Tweet through dedicated keyword searches as soon as it is published. Such automatic alerts or even regular hand-searches are common, and can lead to all sorts of serious legal, financial, and personal trouble for innocent tweeps who thought they were just talking to their followers. Twitter itself maintains up-to-date frequency statistics about words and two- and three-word combinations that are displayed as “Trending Topics”. In conclusion, it seems to me that detailed analyses of tweets are normal and, dare I say it, part of the fabric of Twitter – much to the regret of everyone who has ever been inundated by Bieber- and iPadBots.

So, 1-0 to Eysenbach.

Copyright was another issue that was mentioned in the debate (e.g., does retweeting violate copyright?). I am not even remotely qualified to address this point, so I’ll leave it for now.

But there’s a third issue here, and that is privacy and confidentiality. Yes, Twitter is public. But does that mean that everything that is published there can be republished without regard for the twitterer’s privacy? The Press Complaints Commission may think so, but I would like to see researchers hold themselves to stronger ethical standards than this. As we’ve recently seen in the Baskerville case, trampling roughshod over a person’s privacy can do real damage.

It may be argued that users should be aware at all times that tweets are public, and use Twitter accordingly. But that does not reflect the reality of Twitter usage. For me, and for others, it’s a kind of watercooler while we work. While it is true that Twitter is not a chatroom, it can come close, especially for people who work from home, like Ian Rankin. Some might brand this kind of behaviour foolish, but the point is that there are millions of fools just like Ian and me on Twitter. (Other things I have in common with Ian Rankin: Both of us live in Edinburgh. And, er, that’s it.)

Having said that, I try not to say anything on Twitter that I wouldn’t be happy to say to the addressee’s face. You never know, they might have searches set up for misspellings of their own name and tweet at you out of the blue, telling you that they are actually really lovely.

And the watercooler fun is not all. There are plenty of people who rage about their work (pseudonymously, but still, in my opinion, dangerously), and plenty of people who tweet while drunk or when returning home after a long and eventful night out. They may delete those tweets in the cold light of the morning, while suffering from the mother of all headaches, but your bot may still have harvested and stored them. Others take to Twitter when life gets them down, or they have had a bad day, to receive instant messages of support that can make a real difference to their mood.
Last, but not least, there are spoof accounts, set up to ridicule another twitterer. I know at least four people this has happened to, all prolific Twitter users.

When you perform your analysis, as a conscientious researcher, what do you do with all that sensitive data? Would you quote it with name and location? Would you discuss examples of spoof accounts, drunk tweets, dark moods, or watercooler jokes at length? Would you quote the tweets in your research – which would enable anybody with a search engine to find the original author? Would you reference twitterers by name, would you use their location data in your publications?

For me, this is where I draw the line. Harvesting is all well and good, but

  1. for analysis purposes, identifying information should be removed as far as possible;
  2. for publication purposes, it might be a good idea to contact the twitterer whose tweets are to be cited as examples, especially if the content of the tweets is manifestly private or potentially embarrassing and damaging.

Anonymisation is actually standard practice in research – in all the consent forms that people who participate in my experiments sign, I assure them that their data will be fully anonymised, that nobody will be able to identify them from the data that has been stored about them, and that no identifying data will ever be used when the research is presented. It seems to me that Twitter research should follow similar practices.
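To make the first of the two points above a little more concrete, here is a minimal Python sketch of what removing identifying information from harvested tweets might look like before analysis. Everything in it is an assumption made for illustration: the record layout with `user`, `text` and `location` fields, the helper names `pseudonym` and `anonymise`, and the choice of a salted hash are mine, not a description of any real Twitter API or of an established research protocol.

```python
import hashlib
import re

# Hypothetical record layout: the field names "user", "text" and "location"
# are assumptions for this example, not the format of any real Twitter API.
MENTION = re.compile(r"@\w+")


def pseudonym(username: str, salt: str) -> str:
    """Map a username to a stable label that does not show the name itself."""
    digest = hashlib.sha256((salt + username.lower()).encode("utf-8")).hexdigest()
    return "user_" + digest[:10]


def anonymise(tweet: dict, salt: str) -> dict:
    """Return a copy of the tweet with direct identifiers replaced or dropped."""
    return {
        "user": pseudonym(tweet["user"], salt),
        # Replace each @mention with the pseudonym of the mentioned account,
        # so that conversations can still be followed in the analysis.
        "text": MENTION.sub(
            lambda m: "@" + pseudonym(m.group(0)[1:], salt), tweet["text"]
        ),
        # The "location" field is deliberately not copied.
    }


if __name__ == "__main__":
    example = {"user": "Alice", "text": "hi @Bob", "location": "Edinburgh"}
    print(anonymise(example, salt="keep-this-secret"))
```

The salt would need to be kept private, since anyone who knows it can recompute the label for a given username. Note also that this only covers storage for analysis: a tweet quoted verbatim in a publication can still be traced back to its author with a search engine, whatever has been done to the username.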

I appreciate that there are some analyses where one would want to keep, say, location information, for example for regional topic analysis or news detection. But in such analyses, information from millions of tweets (and, by extension, Twitterers) is typically condensed into a few trends and graphs. Likewise, in research, demographical data stored about participants is often extremely broad – all I would typically store is gender, age (group), maybe education (again very broad, such as highest qualification achieved) or country of birth. This by itself is not enough to identify a person.

Another question is what would count as publication. That is an interesting question in itself. I haven’t asked Eysenbach for permission to link to the tweet I cited earlier, and I do not ask bloggers for permission to link to their posts. The line is extremely fluid though. My one overriding rule is to respect people’s privacy – if they would not want a tweet to be discussed in the Daily Mail, I do not mention it. (For non-UK-based readers, substitute a sensationalist tabloid whose political orientation is diametrically opposed to yours to get the general drift.) I have very specific examples in mind for most of the more touchy privacy issues I discussed above, but I would not want to name the people involved without their explicit consent.

One could argue that since these people were foolish enough to make mistakes in public, they are fair game. But this line of argument suffers from a problem that I see time and time again online – it disregards the fact that there are actual, living human beings behind the screen names, and they are affected, often deeply and sometimes permanently, by the things people write and tweet about them. As researchers, it is our duty to safeguard our research participants, to make sure they come to no harm as the consequence of our work. Therefore, it is incumbent on researchers to protect the best interests of the people whose online output they study – not vice versa.

Looking at Eysenbach’s paper with James Till in the British Medical Journal again, I think we would agree on this matter. For example, they say about newsgroup postings:

The internet holds various pitfalls for researchers, who can easily and unintentionally violate the privacy of individuals. For example, by quoting the exact words of a newsgroup participant, a researcher may breach the participant’s confidentiality even if the researcher removes any personal information. This is because powerful search engines such as Google can index newsgroups (http://groups.google.com/), so that the original message, including the email address of the sender, could be retrieved by anybody using the direct quote as a query. Participants should therefore always be approached to give their explicit consent to be quoted verbatim and should be made aware that their email address might be identifiable. Another reason why researchers should contact individuals before quoting them is that the author of the posting may not be seeking privacy but publicity, so that extensive quotes without attribution may be considered a misuse of another person’s intellectual property.

However, it is entirely possible that we differ on the implied norms of Twitter use. Eysenbach and Till write:

Thirdly, and perhaps most importantly, the perception of privacy depends on an individual group’s norms and codes, target audience, and aim, often laid down in the “frequently asked questions” or information files of an internet community.

[ Twitter does not have a privacy policy, and there are no hard and fast rules or FAQs because there are so many different ways of using the service. ]

Update: As Gunther Eysenbach has kindly and correctly pointed out below, Twitter does have an explicit policy, to which he ascribes the same status as a FAQ.

However, from my own observations of the 800+ accounts I follow, “Twitter as watercooler” or even “Twitter as safety net for people who are going through a hard time in their life” are two frequent, entirely valid, and important uses of the service that need to be handled sensitively in research. Researchers who may not have encountered those uses of Twitter in their own work or streams, or who base their assessment of privacy mainly on external characteristics such as searchability, disregard them at their peril.

In conclusion, I see no problem with a systematic analysis of tweets for research, provided that they are properly anonymised. Twitterers should not be identifiable from publications that result from a piece of research unless they have explicitly consented to be named and/or to have their tweets reproduced in the context of this particular study. Given the private nature of much of Twitter’s content, that is indeed basic research ethics.


  1. […] This post was mentioned on Twitter by Maria Wolters, Maria Wolters. Maria Wolters said: What are the Limits of Using Tweets for Research? http://wp.me/pzc5g-20 […]

  2. I would probably agree with most of what has been said above. But even “anonymizing” is a complicated issue. In a recent JAMA study, researchers analyzed tweets where doctors apparently breached confidentiality or behaved “unprofessionally” (http://jama.ama-assn.org/content/305/6/566.2.long). They rightfully did not publish their names, but this kind of analysis would not have been possible had they removed all the identifying information from the tweets in their database. Cynics may also argue that these physicians breached patient confidentiality, so why should their names be protected… (but I am not arguing along these lines).

    Purging all usernames or real names from the database of archived tweets is also a problem because there are instances where you want to (and should) quote tweets (you mentioned copyright). We are presently doing an analysis of how public health agencies and hospitals used Twitter, and again it would not be possible without looking at usernames and tracking down who they belong to. And we may quote “positive” examples of exemplary tweets. I am not sure why we should seek the permission of the author for quoting them in this context.

    I should also correct the notion that “Twitter has no privacy policy”.

    The Twitter privacy policy (https://twitter.com/privacy) – which is part of the terms of service which every user agrees to when he signs up for an account – is VERY clear:

    “Our Services are primarily designed to help you share information with the world. Most of the information you provide to us is information you are asking us to make public. This includes not only the messages you Tweet and the metadata provided with Tweets, such as when you Tweeted, but also the lists you create, the people you follow, the Tweets you mark as favorites or Retweet and many other bits of information. Our default is almost always to make the information you provide public but we generally give you settings to make the information more private if you want. Your public information is broadly and instantly disseminated. For example, your public Tweets are searchable by many search engines and are immediately delivered via SMS and our APIs to a wide range of users and services. You should be careful about all information that will be made public by Twitter, not just your Tweets.
    Tip What you say on Twitter may be viewed all around the world instantly.”

    I don’t think one can be any more clear. I would interpret this as a clear “informed consent”, along the lines of the “frequently asked questions or information files of an internet community” you cite above.

    PUBLIC tweets are PUBLICations, and may be tracked, analyzed or quoted. Caveat scriptor!

  3. Emily Goodhand said,

    An interesting piece on tweets and copyright was published last November by @BrightSparkBlog:


    They highlight when a tweet may be subject to copyright. Copyright will only apply to a tweet which is original enough in and of itself to attract protection (e.g. a particularly poetic tweet is more likely to be copyright than say ‘I’m having a cup of tea’). Tweets are published as soon as they are online, so once you write it and click ‘tweet’ your post is published to the world.

    My take on this is that harvesting Twitter for the purposes of analysis and research is fine, but you need to then be careful as to how you use that data. There will be personal data in the name and possibly even the Twitter username, so when publishing results it would be best to anonymise, in my view. If you’re likely to be re-publishing tweets which could themselves be copyright works (see link above), you will need to ensure that you re-publish them for criticism and review, and credit the author (similar to a reference), as you can then use them under the fair dealing defence without having to seek permission first (speaking as one who is based in the UK and under UK laws).

    Great post though, and one which certainly provides food for thought!

    Emily (@copyrightgirl)

  4. Faisal said,

    “Zimmer argues that while Twitter is a public medium, ‘there is a reasonable expectation that one’s tweet stream will be “practically obscure”’”

    Aggregation and the use of #hashtags can help in stream consumption. One such tool is @storify, which many are starting to use to aggregate like-minded conversations, for example at a convention.

    Here is one such example of gathering patient insights into researching Diabetes management:

  5. very interesting article.

    Twitter and all social media are essentially public spaces. Twitter’s privacy policy is very clear: it is a public dissemination service.

    That however doesn’t stop people from feeling it is private. But the water cooler analogy is appropriate. When one is at work saying things around the water cooler one may be rudely awakened one day when the boss gets to hear what you said. That’s part of growing up at work.

    Additionally the analogy of Tweeting whilst under the influence is appropriate. It’s the same as going for a drink with colleagues to a bar. If one misbehaves there it is likely to be shared in the office more generally.

    My office is online.

    People often forget context, at their peril, but usually the consequences are no more severe than embarrassment. For professionals involved in research and in the helping, people-focussed services such as health and social services? The consequences could be de-registration with statutory/federal/state bodies, loss of status, loss of job etc.

    As a psychotherapist I have been thinking about this a little and some more of my thoughts are here in my blog, ‘why are the wolves white mr freud’ http://bit.ly/l8Gi1D

    Thanks for this blog, good to see serious conversations about the issue.

    Kind regards


  6. LoveStats said,

    Glad to see more discussions of this highly controversial topic. Agree or disagree, keep pushing for what you believe in.

