Bias and beyond in digital trace data

2018-10-10T15:36:58Z (GMT) by Momin Malik
Large-scale digital trace data from sources such as social media platforms, emails, purchase records, browsing behavior, and sensors in mobile phones are increasingly used for business decision-making, scientific research, and even public policy. However, these data do not give an unbiased picture of underlying phenomena. In this thesis, I demonstrate some of the ways in which large-scale digital trace data, despite their richness, have biases in who is represented, what sorts of actions are represented, and what sorts of behaviors are captured. I present three critiques, demonstrating respectively that geotagged tweets exhibit heavy geographic and demographic biases, that social media platforms’ attempts to guide user behavior are successful and have implications for the behavior we think we observe, and that sensors built into mobile phones like Bluetooth and WiFi measure proximity and co-location but not necessarily interaction as has been claimed. In response to these biases, I suggest shifting the scope of research done with digital trace data away from attempts at large-sample statistical generalizability and towards studies that situate knowledge in the contexts in which the data were collected. Specifically, I present two studies demonstrating alternatives to complement each of the critiques. In the first, I work with public health researchers to use Twitter as a means of public outreach and intervention. In the second, I design a study using mobile phone sensors in which I use sensor data and survey data to measure proximity and sociometric choice, respectively, and model the relationship between the two.