Social media posts can be an appealing source of data for researchers to analyse, for example in studies that use location-based posts to map urban activity or Twitter data to detect epidemic outbreaks. However, just because data is in the public domain doesn’t make it fair game to capture and republish. This was demonstrated last year by the debate that followed the publication, and subsequent retraction, of unanonymised OkCupid data. Researchers must work not only legally but also ethically, asking themselves what impact their research has on the participants involved.

Recently the STEP (Sensitivity, Transparency, Expectation of privacy, Platform) Framework was developed for curating and sharing social media data. It is intended to facilitate open publication of social media data and offers structured guidance around how to work ethically with such data, in order to improve practice and manage risk. Let’s look at an overview.

  • Sensitivity:
    • Is the information being studied of a sensitive nature? (For example, research may be threatening to subjects if it concerns deeply personal experiences, social control, the interests of powerful persons, or subjects sacred to participants.)
    • Are the research subjects from vulnerable populations? If so, the data should also be considered sensitive, whether the vulnerability stems from developmental problems, social status, age, or neighbourhood/environment.
  • Transparency:
    • Is there sufficient documentation to make the data reusable and collection methods transparent? It should cover the data collection methodology, anonymisation processes, and ethical considerations, and provide readme file(s) to ensure the data is understandable. It is important for researchers to be transparent about their processes, not only to facilitate data reuse and openness, but also to help foster ‘privacy literacy’ so users can make informed decisions about participating.
  • Expectation of privacy:
    • Did subjects have an expectation of privacy? While social media posts are in public domains, users (especially private citizens as opposed to politicians, celebrities, or organisations) may not expect their posts to be seen beyond their perceived online community, especially if they are, for example, @-mentions on Twitter. You might also consider the names being used: some sites allow use of pseudonyms where others require real names.
    • Was consent obtained for research and/or data sharing? There is no universally agreed rule on the level of consent required for social media research, so it is worth keeping abreast of developments in this area to inform decisions around publishing data. If consent isn’t obtained, the data may need to be considered more sensitive and access perhaps controlled.
    • Are the data (or can the data be) properly anonymised? Most data repositories require submitted data to be de-identified; however, this may be difficult, and you may be able to argue it is unnecessary if the subjects would not have had an expectation of privacy.
  • Platform:
    • Are the data in keeping with the policies of the social media platform? Some social media sites’ terms of service limit what can be published. For example, Twitter’s Developer Agreement and Policy states that API users “will only distribute or allow download of Tweet IDs and/or User IDs”; sharing only IDs also ensures that the data becomes inaccessible should the Twitter user change their privacy settings or delete a tweet. However, if tweet content is the evidence for your findings and is required for reproducibility, you may want to open a dialogue with the platform provider to enable certain data to be shared.
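
To make the anonymisation and platform points a little more concrete, here is a minimal Python sketch of two common preparation steps: reducing a collection to tweet IDs only for publication, and replacing user IDs with keyed hashes in a working copy. It is an illustration under stated assumptions, not part of the STEP framework itself: the record layout (`id`, `user_id`, `text`), the function names, and the key are all hypothetical, and a keyed hash is pseudonymisation rather than full anonymisation, since post text and metadata can still identify people.

```python
import hashlib
import hmac


def dehydrate(tweets):
    """Return only the tweet IDs, dropping text, user IDs and all other fields."""
    return [str(t["id"]) for t in tweets]


def pseudonymise_user(user_id, secret_key):
    """Map a user ID to a keyed hash (HMAC-SHA256).

    The same user always maps to the same token, so posts can still be
    grouped by author, but the mapping cannot be recomputed without the key.
    """
    return hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical collected records; the field names are assumptions for this sketch.
tweets = [
    {"id": 1001, "user_id": 42, "text": "example post one"},
    {"id": 1002, "user_id": 42, "text": "example post two"},
    {"id": 1003, "user_id": 77, "text": "example post three"},
]

# Placeholder key: in practice, generate a random key and keep it private.
secret_key = b"replace-with-a-private-random-key"

# IDs-only list intended for open publication.
shareable_ids = dehydrate(tweets)

# Working copy with pseudonymised authors and no post text.
working_copy = [
    {"id": t["id"], "user": pseudonymise_user(t["user_id"], secret_key)}
    for t in tweets
]
```

Publishing only `shareable_ids` means anyone reusing the dataset has to retrieve the content from the platform again, so deleted or protected posts drop out. The key used for `working_copy` would need to be stored separately from the data, or destroyed if re-linking is never required.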

I hope this has given you some food for thought – you can read the full practice paper “Sharing selves” on the MSU repository for more detail including two useful case studies putting the STEP considerations into action. Do you work with social media data and have any other guidance or tips to share?

