figshare

Datamining the ‘open’ Internet: Studying digital writing with Web scraping

In this presentation, I will discuss some of the methodological challenges involved in using web scraping to study Internet writing practices. Compared to smaller-scale case studies in computers and writing or to the API-driven research of social media, web scraping relies on data that is largely unstructured and, in many cases, incomplete. While pulling a single data object from the web is relatively simple, gathering entire sites or ecosystems and organizing the resulting data is a complicated process. The highly structured nature of the web offers a good starting point for organizing scraped data, but metadata such as dates and authorship, often crucial to digital humanities research questions, is not part of web structures. In some cases it is possible to derive metadata from the scraped pages themselves; in other instances, web archives can be leveraged to approximate the necessary metadata. Preparing web data for analysis also requires acute awareness of the stylistic norms of the era in which the scraped pages were designed and, for more modern websites, the particular quirks of the content management systems that hosted them. To complicate matters further, web scraping raises not only technical challenges but also legal and ethical considerations that researchers must incorporate into their workflows. While the web is in a sense more "open" than many of the large archives that digital humanists use to study historical texts, researchers engaging in web scraping projects must take care to distinguish their own data-gathering algorithms from corporate white-hat and black-hat bots.
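The abstract's point about deriving dates from scraped pages, with a web-archive fallback, can be sketched in Python. This is a minimal illustration using only the standard library; the meta-tag names shown are common publishing conventions, and the Wayback Machine CDX endpoint is one widely used archive API — neither is specified in the presentation itself.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class MetaDateParser(HTMLParser):
    """Collect <meta> tags that commonly carry publication dates."""

    # Common date-bearing meta keys (Open Graph, Dublin Core, schema.org)
    DATE_KEYS = {"article:published_time", "date", "dc.date", "datepublished"}

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = (a.get("property") or a.get("name") or "").lower()
        if key in self.DATE_KEYS:
            self.dates.append(a.get("content"))


def extract_dates(html_text):
    """Pull candidate publication dates out of a scraped page's markup."""
    parser = MetaDateParser()
    parser.feed(html_text)
    return parser.dates


def cdx_query_url(page_url):
    """Build a Wayback Machine CDX API query for a page's capture history,
    a fallback for approximating dates when the page carries no metadata."""
    params = urlencode({"url": page_url, "output": "json",
                        "fl": "timestamp,original", "limit": "5"})
    return "http://web.archive.org/cdx/search/cdx?" + params
```

If `extract_dates` returns nothing for a page, fetching the CDX query URL yields that page's earliest archive captures, whose timestamps can stand in for a missing publication date.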
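The closing point about distinguishing a researcher's data-gathering algorithms from corporate bots can be made concrete: a research crawler should identify itself and honor a site's robots.txt before fetching. A minimal sketch with Python's standard library; the User-Agent string is hypothetical.

```python
import urllib.robotparser

# Hypothetical identifying User-Agent for a research crawler
RESEARCH_AGENT = "DigitalWritingResearchBot/1.0 (contact: researcher@example.edu)"


def allowed_to_fetch(robots_txt, page_path, agent=RESEARCH_AGENT):
    """Check a site's robots.txt rules before scraping a path, so the
    crawler behaves transparently rather than like a black-hat bot."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, page_path)
```

Pairing this check with a descriptive User-Agent and a crawl delay lets site owners recognize (and contact) the researcher, which is part of the ethical workflow the abstract calls for.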
