Datamining the ‘open’ Internet: Studying digital writing with Web scraping
Presentation

Posted on 2017-06-03, 20:43. Authored by: Symposium on Information and technology in the arts and humanities; Black, Michael L.

In this presentation, I will discuss
some of the methodological challenges involved in using web scraping to study
Internet writing practices. Compared to smaller-scale case studies of computers
and writing or the API-driven research of social media, web scraping relies on
data that is largely unstructured and in many cases also incomplete. While the
process of pulling a data object from the web is relatively simple, gathering entire
sites or ecosystems and organizing the derived data is a complicated process.
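The gap between those two tasks can be sketched concretely: fetching one page is a single library call, while gathering a whole site means maintaining a crawl frontier, deduplicating URLs, and staying within the target domain. The following minimal Python sketch illustrates this; the injectable `fetch` callable, the domain restriction, and the page cap are assumptions made for the example, not details from the presentation.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the start URL's domain.

    `fetch` is any callable mapping a URL to an HTML string; in a real
    scraper it would wrap urllib.request.urlopen plus a polite delay.
    Returns a {url: html} mapping for every page visited.
    """
    domain = urlparse(start_url).netloc
    seen, pages = {start_url}, {}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links, drop fragments, stay on-domain.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Even this toy version has to make organizing decisions (deduplication, domain boundaries, a stopping condition) that a single-page fetch never raises, which is the complexity the abstract points to.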
The highly structured nature of the web offers a good starting point for
organizing scraped data, but metadata such as dates and authorship, which are
often crucial to digital humanities research questions, are not part of those
structures. In some cases it is possible to derive metadata from the scraped
pages themselves; in other instances, web archives can be leveraged to
approximate the necessary metadata. Preparing web data for analysis also
requires an acute awareness of the design norms specific to the era in which
the scraped pages were built and, for more modern websites, of the particular
quirks of the content management systems that hosted them. To complicate
matters further, web scraping involves not only technical challenges but also
several legal and ethical considerations that researchers must incorporate into
their workflows. While the web is in a sense more "open" than many of
the large archives that digital humanists use to study historical texts,
researchers engaging in web scraping projects must take care to distinguish
their own data gathering algorithms from corporate white and black hat bots.
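One common way to make that distinction is to announce the crawler with a descriptive User-Agent string and to honor each site's robots.txt before fetching anything. A sketch using Python's standard urllib.robotparser follows; the agent string and the robots.txt rules shown are illustrative placeholders, not recommendations from the presentation.

```python
from urllib.robotparser import RobotFileParser

# A descriptive User-Agent distinguishes a research crawler from the
# anonymous bots that site operators routinely block. This string is a
# hypothetical example, not a standard format.
USER_AGENT = "example-dh-research-bot/0.1 (contact: researcher@example.edu)"

def build_policy(robots_txt_lines):
    """Parse a site's robots.txt, supplied here as a list of lines so the
    sketch stays offline; RobotFileParser can also fetch the file itself
    via set_url() followed by read()."""
    policy = RobotFileParser()
    policy.parse(robots_txt_lines)
    return policy

policy = build_policy([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])
policy.can_fetch(USER_AGENT, "http://example.com/private/data.html")  # False
policy.can_fetch(USER_AGENT, "http://example.com/posts/1.html")       # True
```

Checking `can_fetch` before every request, and respecting any declared crawl delay, keeps a research scraper's behavior legible to site operators in a way that white- and black-hat bots' behavior typically is not.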