SDM-Genomic-Datasets.zip (5 GB)
SDM-Genomic-Datasets
These datasets were generated from the COSMIC mutation dataset in the COSMIC database (GRCh37, version 90) to evaluate available ontology-based data integration engines. They vary in the number of records (10k, 100k, 1 million, and 10 million), the number of attributes (2-15), and the amount of duplication (25-75 percent duplicated records, with each duplicated value repeated 10 or 20 times).
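To make the duplication parameters concrete, the following minimal Python sketch computes a duplicate profile for a list of records. It is an illustration only: the function name `duplicate_profile` and the toy data are hypothetical, and it assumes that "percent of duplicated records" counts every copy of a repeated value. It does not reproduce the generation procedure used for these datasets.

```python
from collections import Counter


def duplicate_profile(rows):
    """Return (share of rows whose value occurs more than once,
    sorted list of the distinct repetition counts).

    Assumption: every copy of a repeated value counts as a duplicated record.
    """
    counts = Counter(rows)
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    repetitions = sorted({c for c in counts.values() if c > 1})
    share = duplicated / total if total else 0.0
    return share, repetitions


# Toy data (hypothetical): one value repeated 10 times plus 30 unique values.
sample = [("gene_1", "mutation_1")] * 10
sample += [("gene_%d" % i, "mutation_%d" % i) for i in range(2, 32)]

share, repetitions = duplicate_profile(sample)
print(share, repetitions)
```

On a real file, the rows could be read with the standard `csv` module and passed to the same function as tuples.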
Details of how these datasets were generated can be found in the papers that use them for empirical evaluation: https://doi.org/10.1145/3340531.3412881 and https://doi.org/10.5281/zenodo.3993657
Example mapping rules for integrating these datasets are available at https://github.com/SDM-TIB/SDM-RDFizer-Experiments/tree/master/cikm2020/experiments/mappings