On analysis of complex administrative data: neural networks, modelling and prediction

Abstract The eco-geographical heterogeneity of countries such as Chile, whose cities exhibit considerable territorial and demographic diversity, is relevant for epidemiological studies of the emergence and spread of SARS-CoV-2. The situation is the opposite in countries such as the Czech Republic, with small or less diverse territories, where the emergence and spread of SARS-CoV-2 can be correlated mainly with demographic and seasonal variables rather than with climate, pollution, or other physical and biological variables. It is evident that there is no simple model relating the measured active cases to the given parameters. This motivates the more general question of developing models that predict the future number of reported COVID-19 cases (confirmed by laboratory testing). We tune several neural networks to illustrate the complexity of this problem. Predictions are made for Austria, the Czech Republic, and Slovakia, with the artificial neural network as the model class, on data from February 2020 until February 2021. Two different neural network architectures are compared: the feed-forward network and the recurrent neural network. Ultimately, it is found that there are notable differences between the three countries studied, with the data for the Czech Republic being easier to predict with good accuracy than the data from the other two countries. It also turns out that the feed-forward approach delivers better results for Austria and the Czech Republic, whereas for Slovakia the recurrent approach performs better. Likewise, combining the data from all three countries does not improve accuracy compared to models using data from a single country. Both of these findings might be related to the relatively small amount of data available.


Introduction
Measurements of the presence of SARS-CoV-2 in wastewater were made at four different sites (Loma Marga, 2 Norte, Almendral, Cordillera; labels A, B, C, D) by systematic sampling, using a quick and convenient ready-to-use reagent suitable for efficient total RNA extraction or for the simultaneous isolation of RNA, DNA, and protein. The main variables were Concentration (copies/mL), ActualCases, and ActiveCases. Secondary variables are also present, for example ActiveIncidenceRate. During data preprocessing, we observed severe collinearity between CurrentIncidenceRate and ActualCases. For all four sites, we see a clear linear relationship both between ActualIncidenceRate and ActualCases, and between ActiveIncidenceRate and ActiveCases. This is deduced from the plots in the respective section and supported by the calculated correlations, which are close to 1 in all cases. Overlaying the two shows a similar relationship between Rate and Cases for both the Actual and Active categories at all four sites. When comparing the four sites with each other, sites A and B show a steeper linear relation than C and D. A basic description of the variables, their linear relationships, and a cross-correlation analysis is given in Stehlík et al. [1].
One may be interested in the complexity of the relationship between Concentration and some simple function f1(ActiveCases), for example via linear regression. Is the linear regression plausible? If not, suggest a non-linear regression. Would the quality of the linear regression increase if we replaced the explanatory variable ActiveCases by the variable ActiveIncidenceRate? Would it increase if we replaced ActiveCases by the variable ActualCases? One may also study the dependence Concentration = f1(ActiveCases, ActualCases). Is the linear regression plausible there? Would adding more secondary variables lead to a significantly improved regression? By straightforward fitting, we can conclude that adding more secondary variables yields an improvement according to MSE, even if for some sites the improvement is slight. Another important question is the search for computable normalizations of the data that improve standard prediction methods. Technically speaking, this leads us to search for functions/multifunctions of the form ψ(v1, . . ., vn), where the vi are principal variables. Such variables can be biological, chemical, and physical underlying parameters in our setup; coliforms, suspended solids, and oxygen are examples of such biological, physical, and chemical underlying variables. It is natural that normalization should be applied; one example is the fact that, in modelling the dependence of qRT-PCR amplification on time, rain volumes directly influenced the concentration. Such information can be useful for developing a standardized methodology for the quantification of SARS-CoV-2 in wastewater for community epidemiological monitoring, and for generating a predictive model of COVID-19 epidemic outbreaks.
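The simplest of the candidate models above, an ordinary least-squares fit of Concentration on ActiveCases, can be sketched as follows. The data points below are purely illustrative stand-ins, not the measured wastewater values, and the helper names are our own:

```python
# Minimal ordinary-least-squares fit of Concentration ~ ActiveCases.
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

def mse(x, y, a, b):
    """Mean squared error of the fitted line on (x, y)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

active_cases = [10, 25, 40, 80, 150]          # hypothetical ActiveCases
concentration = [2.1, 4.8, 8.2, 16.5, 30.9]   # hypothetical copies/mL

a, b = ols_fit(active_cases, concentration)
print(a, b, mse(active_cases, concentration, a, b))
```

Replacing `active_cases` by an incidence-rate column, or adding further regressors, changes only the design of the fit, which is how the MSE comparison across candidate regressions mentioned above can be carried out.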
The above-mentioned issues bring attention to the proper model of active cases. Would a neural network model be a better surrogate? In any case, we do not wish to express a general preference over other models. Typically, if the complexity of the data is higher than can be captured by a well-designed regression, a neural network, as a general approximator, can achieve a closer fit. This does not, however, automatically mean higher interpretability. Since we have seen that the complexity of the administrative data is reasonably high, we opted to use neural networks, and also applied the adaptive transfer function SPOCU on some of the layers. Neural networks are flexible in how they can be used: they are an architectural choice rather than simply a classification algorithm, and they can more easily learn rich representations. This is why, in some domains, they can outperform decision trees. One obvious advantage of artificial neural networks over support vector machines is that neural networks may have any number of outputs, while support vector machines have only one. Another specific benefit of neural networks over SVMs is that their size is fixed: they are parametric models, while SVMs are non-parametric. An SVM (at least a kernelized one) consists of a set of support vectors, selected from the training set, with a weight for each; in the worst case, the number of support vectors equals the number of training samples.
To tackle such complexity, we compare several implementations of neural networks with applications to administrative data. As the dataset, we have chosen the number of reported daily new cases of SARS-CoV-2 in Austria, the Czech Republic, and Slovakia. Reported data from Chile were studied in Stehlík et al. [2] and are not compared here due to their significant intrinsic regional heterogeneity. Such reporting is also important for proper estimates, and for estimation and prediction (see e.g., Hafeez et al. [3]). Time series of reported cases for Austria, the Czech Republic, and Slovakia are given in Figure A3, and the logarithmized time series in Figure A4.
There are numerous sources providing these data, but the source chosen for this article is the World Health Organization's (WHO) COVID-19 explorer, an open-access website dedicated to the coronavirus pandemic (https://covid19.who.int/). The data were collected by the WHO through official communications and through monitoring of official ministry of health websites and social media accounts for data until 21 March 2020. All subsequent data were collected from daily reports to the WHO regional or global headquarters by the respective countries. Counts reflect laboratory-confirmed cases based on the WHO case definition. A reported case according to this definition is: 1. a person with a positive Nucleic Acid Amplification Test (NAAT); 2. a person with a positive SARS-CoV-2 Antigen-RDT AND meeting either the probable case definition or suspect criteria A or B [of the WHO's suspect case definition]; or 3. an asymptomatic person with a positive SARS-CoV-2 Antigen-RDT who is a contact of a probable or confirmed case.
The actual dataset used for this work is available at https://covid19.who.int/WHO-COVID-19-global-data.csv, and contains the following information: reported date (starting in early January 2020 for most countries), country name and a two-letter country code, the respective WHO region, daily new cases, cumulative cases, daily new deaths, and cumulative deaths. For all subsequent analyses, as already stated, only the daily new cases from Austria, the Czech Republic, and Slovakia will be relevant. In all following sections, while not correct from a medical point of view, the names COVID-19 and SARS-CoV-2 will be used synonymously.
The data analyzed in this article can be considered administrative data, meeting the definition of administrative data by the Organisation for Economic Co-operation and Development (cf. Hand et al. [4]). Administrative data typically have the following features: 1. the agent that supplies the data to the statistical agency and the unit to which the data relate are usually different, in contrast with most statistical surveys; 2. the data were originally collected for a definite non-statistical purpose that might affect the treatment of the source unit; 3. complete coverage of the target population is the aim; 4. control of the methods by which administrative data are collected and processed rests with the administrative agency.
This also means that one should be aware of multiple challenges with this kind of data (as pointed out by Hand et al. [4]). As a consequence, one should, for example, be wary of attributing increasing numbers of reported COVID-19 cases to a higher prevalence in the overall population, since they may also be caused by changes in a country's testing regime (i.e., the conditions under which a person is tested for SARS-CoV-2).
Data collection was stopped on 10 February 2021, due to the expected minor changes in estimated parameters for augmented datasets and the diminishing returns of additional data. The observation period was chosen to run from 26 February 2020, when Austria recorded the first COVID-19 cases among the three countries studied, to the last day of data collection. The dataset comprises 350 observations for each country, equivalent to 50 weeks of data. The R software was used for statistical computing in this study.
This article is organized as follows: Section 2 introduces the datasets and the ARIMA models that serve as benchmarks. The special architecture of the recurrent neural network is developed and fitted to the data in Section 3. In Section 4, the configuration for multiple countries is studied, and Austria, the Czech Republic, and Slovakia are analyzed from this perspective. Finally, the conclusions drawn from this work are laid out in Section 5.

Description of the data and benchmark model
Austria, the Czech Republic, and Slovakia experienced similar COVID-19 infection patterns, with waves peaking in autumn and spring. Austria had a slightly delayed peak, and after the second wave, Slovakia and the Czech Republic had a third wave around the end of the year. All three countries showed a cyclic pattern of decreased newly reported cases on a particular day every week, due to fewer tests conducted over the weekend.
ARIMA is a classical time series method and is used here as the benchmark model. It is appropriate when the data show non-stationarity in the mean but not in the variance/autocovariance. An ARIMA model is defined via the ARMA process and is causal if it meets specific conditions involving only past and present values, not future values.
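The "integrated" part of ARIMA handles non-stationarity in the mean by differencing the series d times before an ARMA model is fitted. A minimal sketch of that preprocessing step (the function name and the toy series are ours, for illustration):

```python
def difference(series, d=1):
    """Apply d-th order differencing: repeatedly replace x_t by x_t - x_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out

# A series with a linear trend in the mean becomes constant after d=1.
trend = [3, 5, 7, 9, 11]
print(difference(trend, d=1))  # [2, 2, 2, 2]
```

In a full ARIMA(p, d, q) fit, an ARMA(p, q) model is then estimated on the differenced series, and forecasts are integrated back to the original scale.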

Estimation
The auto.arima() function from the forecast package is used to find the optimal ARIMA model for each country by computing the Akaike Information Criterion (AIC) for all possible models within a given range of p, d, and q values, using the training data. The model with the smallest AIC is selected as the optimal model.
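The selection logic of auto.arima() amounts to a grid search minimizing the AIC; the sketch below mimics that logic in outline. Here `fit_arima` is a placeholder, not the real maximum-likelihood estimator, and its mock AIC formula is invented purely so the example runs:

```python
import itertools

def fit_arima(train, p, d, q):
    """Placeholder for an actual ARIMA fit. A real implementation would
    maximize the likelihood and return AIC = 2k - 2*log-likelihood.
    The mock score below simply favours small models near (1, 1, 1)."""
    return 2 * (p + q + 1) + abs(p - 1) + abs(d - 1) + abs(q - 1)

def select_order(train, max_p=5, max_d=2, max_q=5):
    """Return the (p, d, q) order with the smallest (mock) AIC."""
    grid = itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1))
    return min(grid, key=lambda pdq: fit_arima(train, *pdq))

print(select_order(train=None))
```

With a real likelihood-based `fit_arima`, the same loop reproduces the exhaustive search over p, d, and q described above.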
Test data from the last two weeks of the observation period are kept aside for evaluating predictive performance. The selected models' parameters and AIC values are shown in the Supplementary file (see Table B3). Forecasts and mean squared prediction errors are given in the Supplementary file (see Figure A5A and Table B4, respectively).
The ARIMA forecasts for Austria and the Czech Republic were close to the actual values, but the model struggled for Slovakia.

Choice of activation function
The vanishing and exploding gradient problems can significantly affect the performance of neural networks, especially in deeper architectures. Avoiding these issues is crucial for effective training. We use techniques such as selecting appropriate activation functions and applying regularization methods like dropout to help mitigate these problems.
The choice of activation function in neural networks is important. Sigmoid and tanh suffer from the vanishing gradient issue; ReLU and hard tanh gained popularity for mitigating it. SPOCU is a new activation function shown to have advantages over ReLU, sigmoid, tanh, and leaky ReLU. Studies such as Kisel'ák et al. [5] have shown that SPOCU performs uniformly better than SELU and ReLU in terms of loss and accuracy, allowing for faster network convergence. Eguasa et al. [6] proposed a novel bandwidth selector with activation functions such as ReLU, leaky ReLU, SELU, and SPOCU, resulting in a remarkable improvement in mean squared error for regression models on three datasets.
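The vanishing-gradient issue for the sigmoid can be seen directly from its derivative, which never exceeds 1/4, so backpropagated gradients shrink geometrically with depth; ReLU, by contrast, passes gradients through unchanged on its active side. A quick numerical check (function names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# The sigmoid derivative peaks at x = 0 with value 0.25, so a chain of
# n sigmoid layers scales gradients by at most 0.25**n during backprop.
print(sigmoid_grad(0.0))   # 0.25
print(0.25 ** 10)          # gradient scale after 10 sigmoid layers
print(relu_grad(3.0))      # 1.0 on the active side
```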

Fitted models
As already noted in Section 2.1, the data from January 27 to February 9, 2021, are excluded from the model training process, as they will be used later for the performance comparison of the different models.
To prevent overfitting, the best model from a group of similar models is chosen based on its performance on a validation dataset, not just on the training data. The 14-day period before the test data is used as the validation dataset (Neuneier and Zimmermann [7]). To obtain more stable forecasts, the selected model is fitted multiple times and the predictions are averaged, reducing the instability and randomness stemming from the starting weights. The squared error loss, commonly used for regression tasks, serves as the loss function. Model selection is done using grid search, limited to essential hyperparameters such as the number of nodes per layer and the number of lags for the input, due to the exponential growth in the number of combinations.
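The selection-and-averaging procedure just described can be sketched as follows. The `fit` and `predict` callables stand in for an actual network fit, and the toy "model" at the bottom is invented so the sketch runs:

```python
import statistics

def grid_search(train, validate, fit, predict, grid):
    """Pick the hyperparameter setting with the smallest validation MSE."""
    def val_mse(params):
        model = fit(train, **params)
        preds = predict(model, horizon=len(validate))
        return sum((p - y) ** 2 for p, y in zip(preds, validate)) / len(validate)
    return min(grid, key=val_mse)

def averaged_forecast(train, fit, predict, params, horizon, repeats=50):
    """Fit the chosen model several times and average the forecasts to
    dampen the randomness coming from the random starting weights."""
    runs = [predict(fit(train, **params), horizon=horizon) for _ in range(repeats)]
    return [statistics.mean(day) for day in zip(*runs)]

# Toy demonstration: a trivial "model" that always predicts its bias.
fit = lambda train, bias: bias
predict = lambda model, horizon: [model] * horizon
best = grid_search([], [1, 1, 1], fit, predict,
                   [{"bias": 0}, {"bias": 1}, {"bias": 2}])
print(best)  # {'bias': 1} has zero validation error
```

With a real network, `repeats` plays the role of the number of refits whose predictions are averaged, mirroring the procedure above.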
For this article, we will consider models with one, two, and three hidden layers, respectively, for each class of networks and country.

Single hidden-layer networks
For the first attempt at fitting a neural network for COVID-19 case prediction, the nnetar function from the R package forecast is used, which is designed for time series data. Optimal hyperparameters, such as the input lags and the number of hidden layer units, are determined through a two-step grid search.
Final predictions are obtained by averaging multiple networks. The sigmoid activation function is used, with initial weights drawn randomly from [−0.7, 0.7], together with BFGS optimization. Hyperparameters are selected via grid search on a training/validation split; see Table B9 in the Supplement. The optimal number of repeats is around 50. The models for Austria and the Czech Republic use 9 lags, while Slovakia uses 20. The Czech Republic model has 8 hidden units, while Austria and Slovakia have 1 and 2, respectively. See Figure A6 in the Supplement for the predictions made by the three models on the test data; line 1 in Table 1 shows the mean squared prediction error of the models, and the size of that error as a percentage of the mean squared prediction error of the benchmark ARIMA models.
Austria and the Czech Republic had lower prediction errors (35% and 55% reductions) compared to ARIMA. Austria's predictions were better in the second half of the test period, but off in the first half. The Slovakian data had worse predictions than ARIMA, with a 10% increase in mean squared prediction error. Next, consider a network with 14 output units for forecasting up to 14 days ahead. It was built using the keras package in R, which interfaces with the Keras API in Python. The Keras API documentation can be found at https://keras.io, and the documentation of the R package at https://keras.rstudio.com/index.html. However, the increased flexibility of this package compared to the previous model also means higher computational complexity and a longer runtime.
The Glorot uniform initializer, the Adam optimizer, and a mini-batch size of 32 were used for training. See line 19 in Table B12 of the Supplement for the hyperparameters. These hyperparameters do not vary significantly between the three models used. The prediction errors and visualizations (see Table 1 and Figure A7) reveal that only Austria's predictions are better than ARIMA. The Austria and Czech Republic models generally underestimate future values but capture the trend, whereas the Slovakia model struggles to capture the trend.

Multi hidden-layer models
Since, in the previous models, the approach with a single output applied iteratively to predict 14 days ahead worked better than the approach with 14 output nodes, the former approach will again be used for all three countries; the nnetar model in the previous subsection likewise iteratively predicts one step ahead.
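The iterative one-step-ahead scheme works by feeding each new prediction back into the input window. A sketch with a placeholder one-step model (the function and the persistence model below are illustrative, not the fitted network):

```python
def iterative_forecast(history, one_step_model, horizon=14, lags=9):
    """Forecast `horizon` days ahead by repeatedly applying a one-step-ahead
    model and feeding each prediction back into the input window."""
    window = list(history[-lags:])
    predictions = []
    for _ in range(horizon):
        next_value = one_step_model(window)
        predictions.append(next_value)
        window = window[1:] + [next_value]  # slide the input window forward
    return predictions

# Toy one-step model: naive persistence (always predict the last value).
persistence = lambda window: window[-1]
print(iterative_forecast([100, 120, 130], persistence, horizon=3, lags=2))
```

Replacing `persistence` with the fitted network's one-step prediction reproduces the iterative 14-day forecasting used in this article; the 14-output alternative instead maps the window to all 14 days at once.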
In this subsection, models were built using the keras package with an additional second hidden layer, in a similar fashion to Section 3.1.1. The BFGS optimizer, not available in Keras, was replaced with the Adam optimizer. The identity activation was used for the first hidden layer, and ReLU for the second hidden layer and the output layer. See line B10 in Table B10 of the Supplement for the hyperparameters.
The predictions for the Czech Republic (see Figure A8 in the Supplement) are accurate, but not as good as with the previous models. Slovakia's model outperforms ARIMA for the first time, while Austria's predictions are even better, with half the MSE of ARIMA; see line 1 in Table 1. Despite a slight drop in accuracy for the Czech Republic, the predictions are still good. Slovakia's model struggles with the first week of test data, possibly due to the late reporting of cases on Friday.
Furthermore, for the models with three hidden layers, the MSEs and predictions are reported in Table B12 of the Supplement, line 1 of Table 1, and Figure A9 of the Supplement. One can see a major deterioration in the quality of the predictions for all three countries, suggesting that the models are now too complex. Moreover, the predicted curves are somewhat smoother, and all three models tend to underestimate.

Networks with a single RNN layer
Dropout is a technique used in the networks of this section to reduce overfitting (Srivastava et al. [8]) and to help prevent vanishing gradients. It also helps prevent co-adaptation between faster- and slower-learning neurons in the network.
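Dropout randomly zeroes a fraction of activations during training; the standard "inverted" variant rescales the survivors so the expected activation is unchanged, and the layer becomes the identity at inference time. A minimal sketch (the function name is ours):

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each activation with probability `rate` and
    scale the survivors by 1/(1-rate) so the expectation is preserved.
    At inference time the layer is the identity."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
```

Libraries such as Keras apply exactly this kind of masking inside their Dropout layers, with the mask redrawn on every training batch.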
A basic RNN architecture with a single layer is considered, and two approaches are compared: the first uses a single output to forecast the next day's value, while the second uses 14 outputs, one for each of the next fourteen days. See Table B11 in the Supplement for the hyperparameters.
Observations include longer lags in the model with scalar output, more units in the RNN layer for the models with 14 outputs, and more parameters to train in the models with fourteen-dimensional output. The prediction errors of the models in this section can be found in Table 1.
The results indicate that models with a single output unit outperformed those with fourteen output units for all three countries. However, the model with one output neuron performed better than the ARIMA model only for Slovakia, while the ARIMA model made better predictions for the other two countries. See Figures A10 and A11 in the Supplement for the forecasts. Predicting all 14 days at once did not yield significant improvements over the iterative application of a model that predicts only one day ahead; therefore, this approach will not be pursued further in the context of RNNs.

Models with multiple RNN layers
In this subsection, networks with more than one RNN layer are fitted.We want to find out whether there is enough of an improvement in the predictions to justify the increased runtime.
First, consider equal-size hidden layers. See line B11 of Table B11 in the Supplement for information on the network with two hidden layers, and line 1 of Table 1 for its performance on the test dataset. The networks with fewer parameters performed differently on the test data across the three countries: the network for Austria had a lower MSE than ARIMA, Slovakia had similar results, and the Czech Republic had a higher MSE.
Next, networks with three hidden layers are fitted; see line B11 of Table B11 in the Supplement for the hyperparameters of the resulting models, and line 1 of Table 1 for the performance on the test data compared to the ARIMA model. The models had more neurons, but also larger prediction errors, especially for the Czech Republic and Slovakia. Restricting every layer to an equal number of neurons may not be suitable for deeper networks, so the further analysis involved fitting RNNs with the number of neurons determined separately for each layer.
Then, consider hidden layers with unequal numbers of units. The hyperparameters of the model with two hidden RNN layers are given in line B10 of Table B10 in the Supplement, and the MSEs of the predictions on the test dataset are summarized in line 1 of Table 1. Increasing the number of network parameters did not improve the predictions. Varying the number of neurons in the hidden layers had little impact on the input time series, which is reassuring (see Table B13).
The prediction errors (see line 1 in Table 1) improved for some countries, but none outperformed ARIMA. The two approaches are also compared graphically; see Figures A12 and A13 in the Supplement. The graphical comparison shows an underestimation of future cases.
It seems that less complex networks performed better.

LSTM networks
This time around, the approach with 14 outputs will no longer be used since it failed to provide any improvement in previous models.
The networks with more than a single LSTM layer use both elu and tanh activation in different layers, whereas the model with a single LSTM layer uses elu as the activation for the LSTM layer.Additionally, a 50% dropout is used between the last LSTM layer and the output layer.

Networks with a single LSTM layer
The hyperparameters of the models (see line B11 of Table B11 in the Supplement) are similar for all three countries. The prediction errors for the three models are given in line 1 of Table 1.
ARIMA performed better for Austria and the Czech Republic, while the model for Slovakia outperformed ARIMA. The forecasts for Slovakia were effective, while the Czech Republic forecasts were close but fell short of ARIMA. The Austria forecasts were generally inaccurate, overestimating the trend. See Figure A14 in the Supplement for the forecasts.

Networks with multiple LSTM layers
As in the section on RNNs, networks where the number of LSTM units is the same in every hidden layer are compared to networks fitted without this constraint.
First, consider neural networks with equal-size hidden layers. The hyperparameters of the networks with two LSTM layers are summarized in line B11 of Table B11. Austria had a lag of 18, and Slovakia had the highest number of LSTM units, at 28. As line 1 of Table 1 shows, the prediction errors of all three models are larger than for the ARIMA model.
For the architecture with three hidden LSTM layers (see line B11 of Table B11 for the hyperparameters), the lags remained unchanged, but the number of units increased for Austria and the Czech Republic (and decreased for Slovakia); the additional layer increased the number of parameters. See line 1 of Table 1 for the test errors. Only Slovakia had a lower MSE than ARIMA, but not significantly so.
See Figure A15 in the Supplement for the predictions of the models with two LSTM layers. Austria and Slovakia show little difference, with overestimation for Austria; the Czech Republic models differ in both approach and performance. Now consider allowing the number of neurons in each hidden layer to differ. The hyperparameters of the networks with two LSTM layers are summarized in line B10 of Table B10. The Czech Republic and Slovakia have unchanged lags, while Austria's lag decreased. The Czech Republic has a large difference in hidden layer units, while Austria and Slovakia show little difference. See line 1 of Table 1 for the test MSEs of the networks. The MSE improved for two countries but worsened slightly for Slovakia. All models have a worse MSE than ARIMA, with only Slovakia coming close.
The LSTM network with three hidden layers (see Table B14 in the Supplement) had fewer neurons in the third layer, resulting in fewer parameters. However, the MSE increased compared to the other models and to ARIMA for all countries. The three-hidden-layer models perform worse for Austria and Slovakia (see Figure A18 in the Supplement), but one approach performs better at predicting extreme observations for the Czech Republic.

On improving poorly fitting models
In this section, we try to improve the LSTM networks, either by altering the activation function or by increasing the number of layers, and we examine the impact of those changes on the quality of the predictions.
First, two networks were fitted with the SPOCU activation function in the LSTM layers. One used SPOCU activation in both layers, while the other used SPOCU in the first layer and tanh in the second. The hyperparameters of these models, as well as their errors on the test dataset, are summarized in Table B1 of the Supplement. The SPOCU-only model showed overestimation but better trend prediction, and had higher errors than the ARIMA model, while the SPOCU-tanh model produced predictions close to the observed data for the full 14-day period and outperformed the ARIMA model. See Figure A16 in the Supplement for the predictions.
Then, consider increasing the LSTM networks to 15 layers, fitted for Austria to improve the predictions (see Table B2). The one using SPOCU activation with specific parameters outperformed the model with different activations, in terms of a lower MSE and better trend prediction; it also does not overestimate. See Figure A17 in the Supplement for the corresponding predictions.

Modelling multiple countries at once
We explore combining data from three neighboring countries to make predictions using a single model.This is based on the assumption that there may be interactions between the countries.

Recurrent neural networks
The RNNs in this section again assume an equal number of neurons in each layer for faster tuning, but require all input sequences to have the same lag. This may be restrictive, given the longer lag observed for Slovakia in the previous networks. See line B5 of Table B15 in the Supplement for a summary of the hyperparameters of the three fitted networks, and Table 1 for the test errors.
The input lag increases slightly with network depth, but the number of units increases drastically from one to two layers. As a result, the total number of parameters grows with each additional layer. The prediction errors are large for all depths and countries. The error relative to ARIMA is smallest for Slovakia and largest for Austria, and this pattern is consistent across all depths.

LSTM networks
The LSTM networks with multiple hidden layers again have the same number of neurons in each layer. The hyperparameters and errors are given in Tables B17 and B5 of the Supplement, respectively.
All networks use the same input lag, with a similar number of units in each layer, and perform better than the RNNs. This shared architecture works best for Slovakia and worst for the Czech Republic. Compared to the RNNs, the errors are worse for the Czech Republic and better for Austria, while Slovakia's networks perform similarly to ARIMA. Combining the data from all countries does not outperform ARIMA.

Feed-forward networks
We compared two approaches: networks with a fixed input lag for all countries (as for the RNNs and LSTMs) and networks with a separate input lag per country (only for the feed-forward networks). Varying the number of neurons in the hidden layers made the feed-forward networks faster to fit.
First, consider feed-forward networks with identical input lags; their hyperparameters are summarized in Table B17 of the Supplement. The input lag is similar across depths, with fewer training epochs. See line B7 in Table 1 for the test errors. The networks with one or two hidden layers perform well for Slovakia but poorly for Austria, indicating a limited benefit of the combined data for countries other than Slovakia. Now consider a different input lag for every country; the hyperparameters are given in Table B18 of the Supplement, and the prediction errors in line B8 of Table 1. Slovakia has a low error, Austria has a higher error than the Austria-only models, and the Czech Republic has an intermediate error. Combining the data improved the results for Slovakia and the Czech Republic, but worsened them for Austria, with additional layers increasing the error for Austria further. In conclusion, the all-three-countries models are only useful for Slovakia. One potential improvement would be to fit networks that use inputs from all countries but predict for only one country, aiming for better forecasts by focusing on a single country.

Conclusion
Vanishing gradients are a common issue when fitting models for administrative data. They can be eased with suitable activation functions and dropout; other techniques exist but were not discussed in this article.
Also with regard to the problem of vanishing gradients, it became apparent that recurrent neural networks are more susceptible to vanishing gradients than feed-forward networks.The LSTM network adaptation was less susceptible to this problem, although it still occurred at times.
In general, vanishing gradients were more problematic for deeper networks, which should come as no surprise.
When studying COVID-19 data from Austria, Slovakia, and the Czech Republic, it was found that using a single output node for iterative predictions of 14 days ahead resulted in smaller errors compared to using 14 output units for predicting all 14 days at once.
For recurrent and LSTM networks, it was found that restricting model selection to networks with the same number of neurons in each layer did not consistently result in worse test error, but did shorten the time required for hyperparameter tuning through grid search, especially for deeper networks.
In comparison to the ARIMA benchmark models, several neural networks demonstrated superior accuracy in COVID-19 forecasting for the three countries, despite the additional training time. However, not all neural networks achieved smaller prediction errors; for the LSTMs in Austria, performance only improved after switching to a different activation function.
A noteworthy observation was that smaller networks performed better than larger ones, because not much data was available, causing the more complex networks to overfit. This is also why networks with more output units tended to perform worse: they had more parameters to train.
Combining the data from all three countries only improved the predictions for Slovakia, not for the other countries. It seems that asking one network to predict for all three countries at once was too demanding.
The Czech Republic had the most accurate predictions, potentially because it has the least complex underlying process.Also, Austria and the Czech Republic typically had an input lag of 7-14 days, while Slovakia's was 14-21 days.
Finally, the recurrent-style networks, which would seem to be especially well-suited for the given task, only performed better than the feed-forward networks for Slovakia.This may be related to the above-mentioned fact that the input lag was clearly the largest for Slovakia, and that this allowed these recurrent networks to better use their core strength, capturing dependence between successive inputs.
On a closing note, this entire article dealt with univariate time series only; analyzing these data again with additional relevant covariates could certainly yield further interesting findings. Likewise, investigating the effect of averaging the data over several days or weeks, in order to eliminate the effect of outliers, constitutes an interesting topic for further research.

Discussion
One of the main questions about such an analysis, which was also pointed out by the referees, concerns the particular choice of countries. The selection of these countries was systematically related to our language and location, since we are accustomed to their reporting. In addition, the countries are neighbours and thus naturally connected, as many people originally from one country live in a neighbouring country or have family there. Moreover, the sizes of the countries are not too different.
It is also difficult to evaluate the disadvantages of the neural networks compared to the ARIMA model. The split into three subsets for training the NNs was only done for hyperparameter selection for a specific network architecture. Once a set of hyperparameters was found, the given model was re-trained on all but the last two weeks of data, just like the ARIMA models. Only the performance of those models trained on the same training set as the ARIMA models was then compared to that of the ARIMA models. Hence, the NNs were not put at a disadvantage by the training set-up.

Table 1. MSE and percentage of ARIMA MSE for the given models.