Regions, years and months were extracted from the URLs using R.
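The extraction step could look roughly like the sketch below. It is written in Python for illustration only (the original step used R), and the URL layout `/<region>/<year>/<month>/<slug>` and the function name `parse_url` are assumptions, since the actual URL structure is not given here.

```python
from urllib.parse import urlparse


def parse_url(url):
    """Pull region, year and month out of a URL path.

    Assumes a hypothetical layout: /<region>/<year>/<month>/<slug>
    Returns None when the path does not fit that layout.
    """
    # Split the path into its non-empty segments.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3:
        return None
    region, year, month = parts[0], parts[1], parts[2]
    # Year and month must be numeric for the URL to count as a match.
    if not (year.isdigit() and month.isdigit()):
        return None
    return {"region": region, "year": int(year), "month": int(month)}
```

URLs that do not fit the assumed layout return `None`, so they can be listed and reviewed by hand instead of being silently mislabelled.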
Full text was scraped using Python and a custom function.
The museum check looked for content matching the following regex: "[M|m]use*|[G|g]aller*". The libraries check searched for a second pattern: "[L|l]ibrar.*|[B|b]ibliot.*".
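A minimal sketch of how the two checks behave, written in Python with the patterns copied verbatim. The tool used for the original check is not stated, so the function name `check_text` and the use of `re.search` are assumptions. Two properties of the patterns as written are worth keeping in mind: inside square brackets `|` is a literal character, not an alternation, and `e*` / `r*` mean "zero or more" of that letter, so the museum pattern also matches any text containing "mus" or "galle".

```python
import re

# Patterns copied verbatim from the description.
MUSEUM_PATTERN = re.compile(r"[M|m]use*|[G|g]aller*")
LIBRARY_PATTERN = re.compile(r"[L|l]ibrar.*|[B|b]ibliot.*")


def check_text(text):
    """Flag whether a text matches the museum and library patterns."""
    return {
        "museum": MUSEUM_PATTERN.search(text) is not None,
        "library": LIBRARY_PATTERN.search(text) is not None,
    }


# "gallery" triggers the museum flag, "Biblioteca" the library flag.
flags = check_text("The city gallery and the Biblioteca Nacional reopened.")

# "music" also triggers the museum flag, because "e*" allows zero "e"s.
false_positive = check_text("A music festival")
```

If stricter matching is wanted, word boundaries (`\b`) and explicit stems would narrow the hits, but that would no longer reproduce the check described here.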
Pageviews were retrieved using the Massviews tool, and the results were paired with the existing database.