ISMB 2022 Poster: Web scraping pilot study for SARS-CoV-2 variants of concern dashboards
Poster presented at: 30th Conference on Intelligent Systems for Molecular Biology (ISMB) on July 13, 2022, in Madison, Wisconsin.
Co-Authors: Wiriya Rutvisuttinunt, Liliana Brown, Steve Tsang, Jane Lockmuller
Tracking SARS-CoV-2 variants and mutations is essential to inform the development of medical countermeasures. In response, many dashboards emerged to publish aggregated variant data, each based on independent analyses with its own metrics and visualizations. To leverage knowledge across dashboards and prioritize SARS-CoV-2 variants with high public health impact, we developed a pipeline that automates the collection of data on variants of concern (VOC), variants of interest (VOI), and variants under monitoring (VUM) from relevant dashboards and generates a consensus across them. The pipeline scrapes the dashboards in Python with Selenium and Beautiful Soup, then visualizes the results in R. Additionally, we used criteria from the FAIR Data Principles to track the data openness of each dashboard. From June 1 through September 9, 2021, we monitored twelve variant-reporting websites and scraped three dashboards (25%). The lists of top variants of concern agreed across these three dashboards, which underscores the high threat level of those variants. The nine other websites (75%) had structures inaccessible to the web scraping pipeline. Challenges included limited programmatically accessible data, difficulty finding documentation, and frequent website structure changes. Overall, all dashboards provided visual variant summaries; however, expanding websites’ machine-readability and documentation would strengthen their impact by improving interoperability and reusability.
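The consensus step described in the abstract could look roughly like the sketch below. It is a minimal illustration, not the poster's actual code: it assumes the scraping stage has already produced one list of variant labels per dashboard, and the function name `consensus_variants`, the `min_fraction` parameter, the dashboard keys, and the example input are all hypothetical placeholders rather than data from the study.

```python
from collections import Counter


def consensus_variants(listings, min_fraction=1.0):
    """Return variant labels reported by at least `min_fraction` of dashboards.

    `listings` maps a dashboard name to the variant labels scraped from it.
    Labels are normalized (trimmed, lowercased) so that formatting
    differences between dashboards do not split the tally.
    """
    if not listings:
        return []
    counts = Counter()
    for variants in listings.values():
        # A set ensures each dashboard counts at most once per variant.
        counts.update({v.strip().lower() for v in variants})
    threshold = min_fraction * len(listings)
    return sorted(v for v, n in counts.items() if n >= threshold)


# Hypothetical placeholder input for illustration only; NOT study data.
example = {
    "dashboard_a": ["Alpha", "Delta", "Lambda"],
    "dashboard_b": ["alpha", "Delta "],
    "dashboard_c": ["Delta", "Alpha", "Mu"],
}

print(consensus_variants(example))
```

With the default `min_fraction=1.0`, only variants listed by every dashboard are kept; lowering it (for example to `0.5`) would keep variants listed by at least half of them.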