Exploring Research Data Repositories with geoextent
Research data repositories allow scientists to share data and software in a sustainable, usable, citable, and discoverable way (cf. FAIR data). Records can comprise all kinds of files (text, binary, large, small), and data quality and metadata vary greatly depending on curation policies. The bulk of the metadata is created by hand and is text-centric (keywords, classifications, related identifiers), and only a minority of discipline-specific repositories explicitly support search and filtering by more advanced properties. Geospatial and temporal parameters are properties with the potential to create connections between datasets. However, these parameters are not easy to record correctly by hand, especially since their existence and usefulness extend beyond the disciplines that routinely work with spatiotemporal data (geography, geosciences). Even though most research data repositories do not explicitly capture spatiotemporal metadata, the information is often available in the actual data files. To leverage this information, we present geoextent (https://o2r.info/geoextent/), a Python library for reliably extracting the geospatial and temporal extents of files, directories, and repository records. In this notebook, we use geoextent to explore the actual geospatial properties of records stored in the generic data repository Zenodo and assess the potential of geospatial metadata for research data discovery. Preliminary results indicate that a relevant share of records contains the information needed to fill gaps in geospatial metadata, so that the discovery of relevant data, related data, and indirectly related publications can be improved. In the future, geoextent could be integrated seamlessly into the data ingestion workflows of repositories to dependably extract spatiotemporal metadata during the creation of records.
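To make the idea of a geospatial extent concrete, the following is a minimal, standard-library-only sketch of how a bounding box might be derived from the coordinates inside a data file and then merged across the files of a record. It is not geoextent's implementation or API. The function names (`iter_positions`, `geojson_bbox`, `merge_bboxes`) and the sample data are hypothetical, and the sketch assumes GeoJSON input in longitude/latitude order, ignoring geometry collections, other coordinate reference systems, and extents that cross the antimeridian.

```python
def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: [lon, lat] or [lon, lat, elevation]
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def geojson_bbox(geojson):
    """Return [min_lon, min_lat, max_lon, max_lat] for a FeatureCollection, or None."""
    lons, lats = [], []
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        for lon, lat in iter_positions(geometry.get("coordinates", [])):
            lons.append(lon)
            lats.append(lat)
    if not lons:
        return None  # no usable coordinates found
    return [min(lons), min(lats), max(lons), max(lats)]


def merge_bboxes(bboxes):
    """Combine per-file bounding boxes into one extent, skipping files without one."""
    found = [b for b in bboxes if b is not None]
    if not found:
        return None
    return [
        min(b[0] for b in found),
        min(b[1] for b in found),
        max(b[2] for b in found),
        max(b[3] for b in found),
    ]


# Hypothetical sample data standing in for a parsed GeoJSON file in a record
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {},
         "geometry": {"type": "Point", "coordinates": [7.60, 51.95]}},
        {"type": "Feature", "properties": {},
         "geometry": {"type": "LineString",
                      "coordinates": [[7.62, 51.96], [7.65, 51.94]]}},
    ],
}

file_bbox = geojson_bbox(sample)
# A file without spatial data contributes None and is skipped when merging
record_bbox = merge_bboxes([file_bbox, None, [13.3, 52.4, 13.5, 52.6]])
```

A record-level extent obtained this way is the kind of value a repository could index to support filtering by area. A production tool additionally has to handle many file formats and coordinate reference systems, which this sketch leaves out.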