Bernie Hogan at the WebSci summer school

Bernie Hogan spoke this morning on ‘Facebook as a Research Environment’. He opened by describing Facebook as an excellent source of data, albeit (let’s face it, inevitably) one with legal, ethical and technological constraints. He described academic uses of it, including:

  • capturing user/network data and pushing that to a survey
  • comparing claims about products with people’s recommendations
  • studying relationship strength against trace data

Facebook, data and controversy

Hogan outlined a few studies that resulted in embarrassment in one way or another:

  • Lewis et al, Taste, Ties and Time: it proved possible to figure out who was whom in the dataset they made available (as reported online yesterday)
  • Warden, American Cultures: Pete Warden captured friend links from all publicly available Facebook profiles (about 40% of the Facebook population). He planned to release the data to the public, but got squashed by Facebook’s lawyers…
  • The Facebook 100: Porter et al released personal data from the first 100 schools to join Facebook in 2005… accidentally including IDs, making it possible to identify people in that dataset.

Yes, we’re back to the Problem for Web Science.

The FB100 dataset is still available in a pseudonymised format: I piped up to ask whether it’s still possible to figure out who is whom, and apparently you can figure out schools but not people. Really? I’d like to find out more about that.
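
For context, here is a minimal sketch of what pseudonymising a friendship graph can look like. It is an illustration under stated assumptions, not a description of how the FB100 data was actually processed: the helper name `pseudonymise_edges`, the edge-list format and the keyed-hash (HMAC) approach are all hypothetical choices made for the example.

```python
import hashlib
import hmac
import secrets


def pseudonymise_edges(edges, key=None):
    """Replace raw IDs in (id_a, id_b) friendship pairs with keyed-hash pseudonyms.

    Hypothetical helper for illustration; `edges` and `key` are assumed names.
    """
    # A random secret key, discarded after release, stops anyone from
    # rebuilding the mapping by hashing candidate IDs themselves.
    if key is None:
        key = secrets.token_bytes(32)

    def pseudonym(raw_id):
        mac = hmac.new(key, str(raw_id).encode("utf-8"), hashlib.sha256)
        return mac.hexdigest()[:16]  # shortened for readability

    return [(pseudonym(a), pseudonym(b)) for a, b in edges]


# Made-up IDs, purely for demonstration.
edges = [(1001, 1002), (1001, 1003), (1002, 1003), (1003, 1004)]
released = pseudonymise_edges(edges)
```

Each person keeps the same pseudonym in every pair, so the shape of the network survives intact. That is what keeps the data useful for research, and it is also why stripping IDs does not by itself settle the question of whether people can be re-identified.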

Social capital

Hogan also remarked that Facebook provides exceptional access to social capital, which he defined as ‘the ability of individuals / groups to access resources from their social network’. He pointed to a few papers (including, of course, Granovetter’s The Strength of Weak Ties), and asked questions such as:

  • Do community clusters affect perceptions of social capital? (I.e. does a diverse network make you feel you have access to broad support?)
  • Is the effect mediated by participation in the network or is it independent of network activity?

He touched on identity markers (e.g. in the context of Reddit and Wikipedia), and how they lead to differences in behaviour/topology. This reminded me of work by Michael Bernstein et al on Anonymity and Ephemerality on 4chan and /b/.

The complexity of relationships, and handling that online

The above topic came up, too. I’d say it’s a whole other blog post…


10 responses to this post.

  1. […] Copyright ‘n’ legal « Bernie Hogan at the WebSci summer school […]


  2. Nice write-up. This ‘problem for Web Science’ certainly seems not to be going away anytime soon, and I am glad Bernstein chalked it up as a problem/challenge. I get the feeling each time it is discussed that we are comfortable namechecking failed sharing attempts like the AOL study and others, but no one is prepared to say here’s how we should proceed, or how it should have been managed in the first place to avoid shares getting pulled. I think there should be best-practice docs, guidelines and, crucially, training for both new and old researchers on how to handle and share social web data. Some ethical approval initially and, crucially, a bit of legal advice at the data gathering and sharing stage would also help. Perhaps also stepped releases: release a sample of your sanitised dataset first and get the community to download and check it for remaining personally identifiable information; then, if it’s OK, share the rest, and if not, fix it or don’t release it in full.


    • Thanks Connor, I think you’re right that we need to talk about the ways forward in this area. (I need to further blog it, but the summer school has left me with no free time right now!)

      Guidance, ethics approval and stepped release all sound fabulous. Researchers should be going through ethics review processes anyway, of course, but the other things sound almost like something we’d look to the Web Science Trust to provide… but the Trust is immersed in so many things that I think we can’t just say “Please provide these things!”

      But perhaps we /should/ build a community area on this. What would the best form be — a wiki, static webpages, something you need to login to?



      • Ha, I know they should get ethics approval, but it sometimes doesn’t happen, as the default is to think it’s on the Web, therefore it’s public. I also think what Bernie Hogan says is right: asking the right questions and getting data that won’t embarrass companies like Twitter / Facebook will go a long way to gaining their trust, and will help ensure shares aren’t subjected to takedown requests.

        Since web and internet science is taking place at many different institutions, including big companies like HP, Yahoo and Google, I think building a community that wants to share the data and shows how to do it would be a cool idea. Ideally, create a nice environment for mods and experts to pass on their know-how, and a safe environment for all to learn how to capture and treat users’ data. Perhaps such a platform will make use of everything you mentioned (wiki, forums, some stickies / static webpages), but crucially I think there should be some link up / link out to some actual places that host the data. In the future there might be a Web Science observatory, but for now there are other hosted platforms like Dataverse Network Share and Stanford’s SNAP sharing platform. Yes, the WebSci Trust is very busy, but this could be linked into some of their existing plans, like the Web Science observatory, or else kickstarted by anyone really with an interest. The Web Science Trust is running some events on Thursday and Friday later this week, where they have made some time to discuss research issues; I will try and pitch this one to them if I am there. Feel free to send me an email if you or others have any points you want to bring up with the Web Science team / Trust at Soton. email cm7e09 at

        I think this is going to take a while, but we should start building it now and, crucially, while discussing studies that got it wrong when sharing, try to point to examples of existing best practice in sharing. I’d like to think that Jaime Teevan and Bernie Hogan would be able to change their answers to questions and their slides next year to point attendees of the WebSci conference / summer school to some nice Web Science research data, rather than saying it’s been tried and mostly had bad consequences. It’s going to take some work, but open gov data shows it’s possible.

    • For some reason there isn’t a reply button to your subsequent comment, so I’m replying here instead 🙂

      So, sorry for the slow response. I blame the summer school!

      Your comment about ethics approval is true, but it’s a highly unfortunate situation 😦

      I very much like your vision of the data sharing community (one that includes a “what won’t work” area that lets us build on past errors, perhaps), and thanks for reminding me about the WebSci observatory concept! I would be very happy if you could raise this issue at the events tomorrow and Friday, please feel very free to pitch them loud and clear 🙂 (And thank you for your optimism. We can do this!)


  3. Thanks for a great writeup on Bernie’s talk!

    Deanonymization is *often* possible. It’s hard to tell without looking at the dataset. 33 Bits is my favorite blog on that.


  4. Thanks for the link to 33 Bits – the blog has a lot of good opinion and analysis


  5. […] great range of speakers and topics… for instance, see my write-ups of talks from Wendy Hall, Bernie Hogan and Marc Smith, not to mention my rants about data and social […]

