Gaussian Process kernels comparison - Datasets and python code

dataset

posted on 2024-06-24, 09:23 authored by Jiabo Lu, Niels Fraehr, QJ Wang, Xiaohua Xiang, Xiaoling Wu

Overview

Data used for publication in "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions".

We investigate the impact of 13 Gaussian Process (GP) kernels, consisting of five single kernels and eight composite kernels, on the prediction accuracy and computational efficiency of the Low-fidelity, Spatial analysis, and Gaussian process learning (LSG) modelling approach.

The GP kernels are compared for three distinct case studies: Carlisle (United Kingdom), Chowilla floodplain (Australia), and Burnett River (Australia).

The high- and low-fidelity model simulation results are obtained from the data repository Fraehr, N. (2024, January 19). Surrogate flood model comparison - Datasets and python code (Version 1). The University of Melbourne. https://doi.org/10.26188/24312658.v1.

Dataset structure

The dataset is structured in five folders:

Carlisle
Chowilla
BurnettRV
Comparison_results
Python_data

The first three folders contain simulation data and analysis codes.

The "Comparison_results" folder contains plotting codes, figures and tables for comparison results.

The "Python_data" folder contains the LSG model functions and the Python environment requirements.

Carlisle, Chowilla, and BurnettRV

These folders contain high- and low-fidelity hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the LSG model with different GP kernels in each case study. There are only small differences between the folders, depending on the hydrodynamic model simulation results and EOF analysis results.

Each case study folder contains the following subfolders and files:

Geometry_data

DEM files
.npz files containing the high-fidelity model's grid (XYZ-coordinates) and cell areas (the same data is available for the low-fidelity model used in the LSG model)
.shp files indicating location of boundaries and main flow paths
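As a minimal sketch of how such an archive can be inspected, the snippet below writes a small stand-in .npz file and lists the arrays it holds. The file name and the array names (xyz, areas) are hypothetical and are not taken from the dataset; on the real archives, check the .files attribute to see the actual keys.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a geometry archive; the real key names may differ.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example_grid.npz")
np.savez(path, xyz=np.zeros((4, 3)), areas=np.ones(4))

# np.load returns a lazy NpzFile; .files lists the stored array names.
with np.load(path) as archive:
    names = sorted(archive.files)
    shapes = {name: archive[name].shape for name in names}

print(names)
print(shapes)
```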

XXX_modeldata

Folder for storing the trained model data of each XXX kernel LSG model.

For example, EXP_modeldata contains the files used to store the LSG model trained with the exponential Gaussian Process kernel.

ME3LIN means ME3 + LIN. ME3mLIN means ME3 x LIN.
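The naming convention for composite kernels can be illustrated with a small pure-Python sketch. It assumes that ME3 denotes a Matern 3/2 kernel and LIN a linear kernel, which this description does not state explicitly, and it uses scalar inputs with unit hyperparameters for brevity. The function names are illustrative only; the dataset's own scripts build their kernels with GPflow rather than with code like this.

```python
import math


def matern32(x, y, variance=1.0, lengthscale=1.0):
    """Matern 3/2 kernel for scalar inputs (assumed meaning of ME3)."""
    r = abs(x - y) / lengthscale
    return variance * (1.0 + math.sqrt(3.0) * r) * math.exp(-math.sqrt(3.0) * r)


def linear(x, y, variance=1.0):
    """Linear kernel for scalar inputs (assumed meaning of LIN)."""
    return variance * x * y


def me3_plus_lin(x, y):
    """Additive composite, matching the ME3LIN naming."""
    return matern32(x, y) + linear(x, y)


def me3_times_lin(x, y):
    """Multiplicative composite, matching the ME3mLIN naming."""
    return matern32(x, y) * linear(x, y)


k_sum = me3_plus_lin(0.5, 0.5)
k_prod = me3_times_lin(0.5, 0.5)
print(k_sum, k_prod)
```

Sums and products of valid kernels are again valid kernels, which is why both composites can be used directly as GP covariance functions.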

EXPLow means that the inducing points percentage for the Sparse GP is 5%.

EXPMid means that the inducing points percentage for the Sparse GP is 15%.

EXPHigh means that the inducing points percentage for the Sparse GP is 35%.

EXPFULL means that the inducing points percentage for the Sparse GP is 100%.
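To make the percentages concrete, the sketch below converts each variant into a number of inducing points. It assumes the percentage is taken relative to the number of training points, which this description does not spell out, and the training set size of 2000 is a made-up value for illustration.

```python
# Percentages as listed above; the variant names follow the folder prefixes.
INDUCING_FRACTION = {"EXPLow": 0.05, "EXPMid": 0.15, "EXPHigh": 0.35, "EXPFULL": 1.00}


def n_inducing_points(variant, n_train):
    """Number of inducing points, assuming a percentage of the training set size."""
    return max(1, round(INDUCING_FRACTION[variant] * n_train))


# Hypothetical training set size, for illustration only.
counts = {name: n_inducing_points(name, 2000) for name in INDUCING_FRACTION}
print(counts)
```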

HD_model_data

High-fidelity simulation results for all flood events of that case study
Low-fidelity simulation results for all flood events of that case study
All boundary input conditions

HF_EOF_analysis

Stores the data used in the EOF analysis for the LSG model.

Results_data

Stores the results of evaluating the LSG models with the different GP kernel candidates.

Train_test_split_data

The train-test-validation data split is the same for all LSG models with different GP kernel candidates. The specific split for each cross-validation fold is stored in this folder.

YYY_event_summary.csv, YYY_Extrap_event_summary.csv

Files containing an overview of all events, and of which events are connected between the low- and high-fidelity models, for each YYY case study.

EOF_analysis_HFdata_preprocessing.py, EOF_analysis_HFdata.py

Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data.
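For readers unfamiliar with EOF analysis, it is commonly computed as a singular value decomposition of mean-removed data. The sketch below shows that generic approach on synthetic data; the array shapes, variable names, and number of retained modes are assumptions, and the preprocessing in the dataset's own scripts may differ.

```python
import numpy as np

# Synthetic stand-in data: rows are snapshots, columns are grid cells (assumed layout).
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(20, 50))

# Remove the mean field so the decomposition describes anomalies.
mean_field = snapshots.mean(axis=0)
anomalies = snapshots - mean_field

# SVD: rows of vt are the spatial patterns (EOFs), u * s are their coefficients.
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
eofs = vt
coefficients = u * s
explained = s**2 / np.sum(s**2)

# Keep the leading modes and reconstruct the field from them.
n_modes = 5
approx = mean_field + coefficients[:, :n_modes] @ eofs[:n_modes]
full = mean_field + coefficients @ eofs
print(explained[:n_modes].sum())
```

Using all modes reproduces the input exactly (up to rounding), while truncating to the leading modes gives a lower-dimensional approximation.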

Evaluation.py, Evaluation_extrap.py

Scripts for evaluating the LSG model for that case study and saving the results for each cross-validation fold.

train_test_split.py

Script for splitting the flood datasets for each cross-validation fold, so all LSG models with different GP kernel candidates train on the same data.
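The idea of fixing the split once and reusing it for every kernel can be sketched as follows. This is a generic k-fold split with a fixed random seed, not the logic of train_test_split.py itself; the function name, the seed, and the event identifiers are hypothetical, and the sketch omits the separate validation subset mentioned above.

```python
import random


def make_folds(event_ids, n_folds, seed=0):
    """Shuffle once with a fixed seed and deal events into n_folds test sets."""
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    test_sets = [ids[i::n_folds] for i in range(n_folds)]
    folds = []
    for test in test_sets:
        train = [e for e in ids if e not in test]
        folds.append({"train": sorted(train), "test": sorted(test)})
    return folds


# Ten hypothetical event identifiers split into five folds.
folds = make_folds(range(10), n_folds=5)
print(folds[0])
```

Because the seed is fixed, calling the function again returns identical folds, so every model sees the same training and test events.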

XXX_training.py

Script for training each LSG model using the XXX GP kernel.

ME3LIN means ME3 + LIN. ME3mLIN means ME3 x LIN.

EXPLow means that the inducing points percentage for the Sparse GP is 5%.

EXPMid means that the inducing points percentage for the Sparse GP is 15%.

EXPHigh means that the inducing points percentage for the Sparse GP is 35%.

EXPFULL means that the inducing points percentage for the Sparse GP is 100%.

XXX_training.bat

Batch scripts for training all LSG models using different GP kernel candidates.
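On systems where .bat files cannot be run, a comparable loop could be written in Python. The sketch below only builds the command lines and prints them; it is not a translation of the provided batch scripts, the helper name is hypothetical, and the kernel list contains only a few of the labels mentioned in this description.

```python
import sys

# A few kernel labels mentioned in this description; the full set of 13 is not listed here.
KERNELS = ["EXP", "ME3LIN", "ME3mLIN"]


def training_commands(kernels, python_exe=sys.executable):
    """Build one command per kernel following the XXX_training.py naming."""
    return [[python_exe, f"{name}_training.py"] for name in kernels]


commands = training_commands(KERNELS, python_exe="python")
for cmd in commands:
    print(" ".join(cmd))
# To actually run them in sequence, pass each cmd to subprocess.run(cmd, check=True).
```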

Comparison_results

Files used for comparing LSG models that use different GP kernel candidates and for generating the figures in the paper "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". The figures are also included.

Python_data

Folder containing Python script with utility functions for setting up, training, and running the LSG models, as well as for evaluating the LSG models.

Python environment

This folder also contains two Python environment files listing all Python package versions and dependencies. You can install either the CPU or the GPU version of the environment. The GPU version can use the GPU to speed up the GPflow training process, and it also installs the CUDA and cuDNN packages.

The environment can be installed online or offline. Offline installation reduces dependency issues, but it requires the same operating system as the one used to pack the environment (Windows 10).

Online installation

LSG_CPU_environment.yml: Python environment for running LSG models on the CPU
LSG_GPU_environment.yml: Python environment for running LSG models on the GPU, mainly to speed up the GPflow training process. It requires the CUDA and cuDNN packages to be installed.

In the directory where the .yml files are located, enter one of the following commands in the console:

conda env create -f LSG_CPU_environment.yml -n myenv_name

conda env create -f LSG_GPU_environment.yml -n myenv_name

Offline installation

If you use Windows 10, you can directly unpack an environment that was packed with conda-pack.

LSG_CPU.tar.gz: Archive containing all packages in the virtual environment for CPU only
LSG_GPU.tar.gz: Archive containing all packages in the virtual environment for GPU acceleration

On Windows, create a new LSG_CPU or LSG_GPU folder in the Anaconda environment folder and extract the packaged LSG_CPU.tar.gz or LSG_GPU.tar.gz file into that folder:

tar -xzvf LSG_CPU.tar.gz -C ./LSG_CPU

tar -xzvf LSG_GPU.tar.gz -C ./LSG_GPU

Change to the environment folder

cd ./LSG_GPU

Activate the environment

.\Scripts\activate.bat

Clean up the path prefixes in the activated environment

.\Scripts\conda-unpack.exe

Exit the environment

.\Scripts\deactivate.bat

LSG_mods_and_func

Python scripts for using the LSG model.

Evaluation_metrics.py

Metrics used to evaluate the prediction accuracy and computational efficiency of the LSG models.
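The specific metrics implemented in Evaluation_metrics.py are not listed in this description. As a generic illustration of the two aspects named above, the sketch below pairs a root-mean-square error (one common accuracy metric) with a wall-clock timer (one simple efficiency measure). The function names and sample values are hypothetical and are not taken from the script.

```python
import math
import time


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def timed(func, *args):
    """Return the function result together with elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Made-up water depths, for illustration only.
error, seconds = timed(rmse, [1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
print(error, seconds)
```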

Funding

China Scholarship Council (CSC) (No. 202306710125)

The University of Melbourne via the Melbourne Research Scholarship

National Key R&D Programme of China (2023YFC3006501)
