Dataset versions and their provenance

2014-01-05T03:25:39Z (GMT) by Stian Soiland-Reyes
<p>Example of using PAV to version datasets, showing the provenance of each individual version.</p> <p>From the blog post <em>Tracking versions with PAV</em>.</p> <p>In this example, <em>dataset-1.0.0.csv</em> has been <em>pav:importedFrom survey.xls</em>, i.e. it was probably saved from Excel (the software can be specified using <em>pav:createdWith</em>). The Excel file was imported from an SPSS survey data file, and in addition it records the survey form with <em>pav:sourceAccessedAt</em> (e.g. the creator looked up more descriptive column headers there).</p> <p>For <em>dataset-1.1.0.csv</em> we (as humans) can see that the minor version has been incremented and that it has a different provenance: this version was imported from <em>dataset.xlsx</em>, which has been <em>pav:derivedFrom</em> the earlier <em>survey.xls</em> (indicating that the spreadsheet has evolved significantly). The spreadsheet's data was imported from a different file, <em>survey2.spv</em> (which might or might not be related to <em>survey.spv</em>), but it still accessed the same <em>surveyform.docx</em>.</p> <p>For <em>dataset-2.0.0.csv</em> the provenance is quite different: this time the scientist has simply used Survey Monkey rather than SPSS to manage the survey, and has published its exported CSV. Presumably this dataset is quite different in its CSV structure and/or the questions asked, as it has gained a new major version to become 2.0.0.</p> <p><em>Figure created using Lucidchart.</em></p>
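<p>The relations described above can be sketched as plain subject/predicate/object triples. This is only an illustrative sketch, not the figure's actual RDF: the file names stand in for real resource identifiers, <em>survey.spv</em> is assumed to be the SPSS file that <em>survey.xls</em> was imported from, and the helper <em>sources()</em> is hypothetical. <em>dataset-2.0.0.csv</em> is left out because the description does not name the property linking it to Survey Monkey.</p>

```python
# Provenance statements from the description, as (subject, predicate, object)
# triples. "pav:" abbreviates the PAV namespace; file names are used as
# plain identifiers for illustration only.
TRIPLES = [
    ("dataset-1.0.0.csv", "pav:importedFrom", "survey.xls"),
    # Assumption: the SPSS file behind survey.xls is survey.spv.
    ("survey.xls", "pav:importedFrom", "survey.spv"),
    ("survey.xls", "pav:sourceAccessedAt", "surveyform.docx"),
    ("dataset-1.1.0.csv", "pav:importedFrom", "dataset.xlsx"),
    ("dataset.xlsx", "pav:derivedFrom", "survey.xls"),
    ("dataset.xlsx", "pav:importedFrom", "survey2.spv"),
    ("dataset.xlsx", "pav:sourceAccessedAt", "surveyform.docx"),
]


def sources(resource, triples=TRIPLES):
    """Return every resource reachable from `resource` by following triples.

    Hypothetical helper: it follows all predicates alike, so the result is
    the full set of direct and indirect sources.
    """
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        for subject, _predicate, obj in triples:
            if subject == current and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


if __name__ == "__main__":
    for dataset in ("dataset-1.0.0.csv", "dataset-1.1.0.csv"):
        print(dataset, "<-", sorted(sources(dataset)))
```

<p>Following the links this way shows that both versions ultimately go back to <em>survey.xls</em> and <em>surveyform.docx</em>, while only version 1.1.0 also depends on <em>dataset.xlsx</em> and <em>survey2.spv</em>.</p>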