
S1Data-simulated.tar.gz (6.98 GB)

GRDC - CCDM/CIC genomic prediction report

dataset
posted on 2022-06-15, 04:53, authored by Darcy Jones

S1Data-simulated.tar.gz - These are simulated data for evaluating prediction accuracy across different traits, numbers of markers, and numbers of samples in different testing scenarios.



Other files are example results of the SelectML methods.

These are the combined results of the optimise and predict scripts. Each compressed folder is named after the simulated dataset it corresponds to and the model used.


The trait comes first in the name (e.g. A1 means an additive trait with 1 causal marker). N1000 means 1000 samples in the training population, and M1000 means 1000 markers were sampled. "_CAUSAL" means that the causal loci were included in the sampled markers (note that this means for M1000 all markers sampled did have a genuine effect). The final section before ".tar.gz" indicates the model used (e.g. sgd, xgb, BGLR).
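As a rough illustration of the naming scheme described above, the following sketch splits an archive name into its parts. The underscore separators, the field order, and the example file names are assumptions inferred from the description, not something confirmed against the actual files, so the pattern may need adjusting.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <trait>_N<samples>_M<markers>[_CAUSAL]_<model>.tar.gz
# The separators and field order are guesses based on the description.
ARCHIVE_NAME = re.compile(
    r"^(?P<trait>[^_.]+)"
    r"_N(?P<samples>\d+)"
    r"_M(?P<markers>\d+)"
    r"(?P<causal>_CAUSAL)?"
    r"_(?P<model>[^_.]+)"
    r"\.tar\.gz$"
)


class ArchiveInfo(NamedTuple):
    trait: str             # e.g. "A1"
    n_samples: int         # training population size
    n_markers: int         # number of markers sampled
    causal_included: bool  # True when "_CAUSAL" is present
    model: str             # e.g. "sgd", "xgb", "BGLR"


def parse_archive_name(name: str) -> Optional[ArchiveInfo]:
    """Parse an archive name, returning None if it does not fit the assumed layout."""
    m = ARCHIVE_NAME.match(name)
    if m is None:
        return None
    return ArchiveInfo(
        trait=m["trait"],
        n_samples=int(m["samples"]),
        n_markers=int(m["markers"]),
        causal_included=m["causal"] is not None,
        model=m["model"],
    )


# Hypothetical file name, used only to exercise the parser.
print(parse_archive_name("A1_N1000_M1000_CAUSAL_sgd.tar.gz"))
```

Names that do not follow the assumed layout, such as S1Data-simulated.tar.gz, return None rather than raising an error.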


Inside each of these compressed folders are the following files. From the `selectml optimise` command:


regression_*_best.json - the best-performing combination of hyperparameters for these data and model type.

regression_*_results.tsv - the Optuna run logs, showing the sampled parameters and the average mean squared error (across cross-validated samples) of models from each parameter set.

regression_*_full_results.tsv - like _results.tsv, but includes other statistics relevant to the task, such as Pearson's correlation.
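To inspect one of the results tables without extra dependencies, something like the sketch below could be used. It reads a tab-separated file and returns the row with the lowest value in a chosen column. The column names "trial" and "mse" are hypothetical placeholders; check the header of the real file and pass the actual name of the error column.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, TextIO


def read_tsv(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def best_trial(rows: Iterable[Dict[str, str]], metric: str) -> Optional[Dict[str, str]]:
    """Return the row with the smallest numeric value in `metric`.

    Rows where the value is missing, blank, or not a number are skipped.
    """
    best = None
    best_value = float("inf")
    for row in rows:
        try:
            value = float(row.get(metric, ""))
        except (TypeError, ValueError):
            continue
        if value < best_value:
            best, best_value = row, value
    return best


# In-memory stand-in for a results file; the columns are made up for illustration.
example = io.StringIO("trial\tmse\n0\t0.52\n1\t0.31\n2\t\n")
rows = read_tsv(example)
print(best_trial(rows, metric="mse"))
```

For a real file, replace the io.StringIO stand-in with open(path, newline="") on the extracted .tsv.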


And from `selectml predict`:

regression_*_model.pkl - a stored version of the model trained with the best parameters from optimise, fitted on the complete training dataset.

regression_*_predictions.tsv - predicted results for all training datasets.

regression_*_stats.tsv - summary statistics (e.g. MSE, Pearson's correlation) for the model in different test populations.
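The file patterns listed above can be turned into a small lookup for sorting the contents of an extracted folder by the command that produced them. This is only a sketch: the patterns are taken from the list above, while the example file names (with "sgd" in place of the "*") are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (pattern, command, description). The full_results pattern is listed before
# the plain results pattern because "regression_*_results.tsv" would otherwise
# also match "regression_*_full_results.tsv".
FILE_ROLES: List[Tuple[str, str, str]] = [
    ("regression_*_best.json", "optimise", "best hyperparameter combination"),
    ("regression_*_full_results.tsv", "optimise", "run log with extra statistics"),
    ("regression_*_results.tsv", "optimise", "run log with sampled parameters and MSE"),
    ("regression_*_model.pkl", "predict", "stored trained model"),
    ("regression_*_predictions.tsv", "predict", "predicted results"),
    ("regression_*_stats.tsv", "predict", "summary statistics per test population"),
]


def describe(filename: str) -> Optional[Tuple[str, str]]:
    """Return (command, description) for a file name, or None if unrecognised."""
    for pattern, command, description in FILE_ROLES:
        if fnmatchcase(filename, pattern):
            return command, description
    return None


# Hypothetical file names for illustration.
for name in ["regression_sgd_best.json", "regression_sgd_full_results.tsv", "notes.txt"]:
    print(name, "->", describe(name))
```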
