Detection of Freezing of Gait using Convolutional Neural Networks and Data from Lower Limb Motion Sensors

Freezing of Gait is the most disabling gait disturbance in Parkinson’s disease. For the past decade, there has been a growing interest in applying machine learning and deep learning models to wearable sensor data to detect Freezing of Gait episodes. In our study, we recruited sixty-seven Parkinson’s disease patients who have been suffering from Freezing of Gait


I. INTRODUCTION
PARKINSON'S disease (PD) is a progressive and non-reversible neurodegenerative disorder with predominantly motor impairments, such as tremor at rest, rigidity, bradykinesia, impairment of posture, and freezing of gait (FOG) [1].
Globally, PD affected about 6.1 million individuals in 2016 [2]. The estimates of the prevalence of PD range from 35.8 to 12,500 per 100,000 persons [3,4]. The prevalence increases significantly with age, and studies have shown that the prevalence of PD among people above 65 years old is between 1.3% and 3% of that age group [5,6]. Moreover, the number of patients diagnosed with PD increased by 31.6% from 2005 to 2015 [7], and the age-standardised prevalence of PD increased by 21.7% from 1990 to 2016 [2]. PD is the fastest-growing neurological disease, and has become the most challenging health issue for ageing populations.
PD's pathological characteristics include the loss of dopaminergic neurons in the substantia nigra pars compacta and the accumulation of intracellular protein (α-synuclein-containing Lewy bodies) inside nerve cells that leads to cell death [8,9]. The aetiology of PD is not well understood, but past studies have revealed a moderate correlation between PD causality and the role of environmental and genetic factors [10,11]. The abnormal degeneration of dopaminergic neurons and the death of brain cells obstruct the smooth control and coordination of voluntary movements throughout the body. When 80% of dopamine-producing cells are damaged, the cardinal motor symptoms in PD start to emerge and significantly impair the performance of simple daily tasks such as walking and static standing. These motor impairments significantly reduce a patient's quality of life, but one of the most disabling symptoms in PD is FOG, which affects half of all PD patients [12]. The clinical definition of FOG is "brief, episodic absence or marked reduction of forward progression of the feet despite the intention to walk" [13].
FOG is more prevalent among PD patients in the advanced stages, but a study has found that it can also occur in the early stage of PD [14]. It severely deteriorates PD subjects' mobility and restricts an individual's independence, often leading to falls, which are frequently associated with serious injuries. FOG is a paroxysmal and unpredictable motor anomaly, but some internal and external factors have been found to induce a freezing episode, such as walking in a confined space, turning, dual-tasking, and stressful situations (e.g. inability to reach the destination) [15,16]. FOG episodes usually continue for a few seconds, but can occasionally last for several minutes [17].
The primary symptomatic treatment for FOG is medication, and the most widely used medication is levodopa (L-dopa) therapy, which has demonstrated a positive effect on improving the dopamine-responsive type of FOG [18,19]. Other types of treatments that tackle motor symptoms, such as botulinum toxin injections and amantadine, have not shown significant evidence of improving FOG [20,21].
Surgical treatment, such as deep brain stimulation (DBS) in the subthalamic nucleus or the globus pallidus internus area, is another approach to ease the burden of FOG [22][23][24]. However, DBS is not suitable for all PD patients as it is a highly invasive treatment that carries all the risks of major brain surgery [25].
Aside from treatments aimed at reducing the onset of freezing, there are also different approaches to mitigate the consequences of freezing. Cueing is a movement strategy technique that supplements medication in improving the overall functional mobility of patients by assisting patients with PD to overcome FOG episodes and prevent falls [20,26,27]. Cueing can be delivered in the form of rhythmic auditory cues, visual assistance cues, and sensory cues. The neural mechanisms of cueing are not well understood, but studies have shown that disruptions in sensory-motor interactions might cause deficits in internal cueing for movements and movement initiation [15,28]. The role of external cueing is to bypass the dysfunctional basal ganglia network and compensate for the loss of internal rhythms that results in impaired automaticity [29].
In order to deliver on-demand cueing assistance to PD patients at the most opportune moment to overcome the gait disturbance, wearable devices/systems have been proposed to monitor gait performance continuously and detect FOG events [30][31][32]. Technological advancements in wearable devices with small form factor single-board computers have made such a system feasible in recent years. Currently, the majority of studies have chosen to use inertial measurement unit (IMU) sensors (accelerometer and/or gyroscope) as they provide relatively accurate measurements and can be worn by patients for an extended period of time without interrupting the walking pattern and normal activities of daily living. Other physiological wearable sensors, such as blood pressure and heart rate sensors and those that measure electrocardiograms and electromyograms, were also used in some studies to identify the physiological changes before the onset of FOG [33,34]. Another way of detecting FOG is to use vision-based techniques to determine gait abnormalities [35,36]. However, the results of vision-based methods have so far been worse than those of IMU-based approaches (discussed below), with 82.1% being the highest detection accuracy reported so far [36]. Concerns about the privacy and security of the videos also raise barriers to the adoption of these approaches.
In order to detect FOG events, conventional machine learning approaches have required a substantial amount of domain-related expertise and tremendous effort in pre-processing and feature engineering on the data. However, no single feature or combination of features has been shown to detect freezing episodes perfectly due to the symptom's complexity and heterogeneity. Hence, researchers have recently started to adopt deep learning (DL) models to detect FOG without generating handcrafted features. The DL models were shown to be able to learn novel and robust features of the sensor data without relying on domain experts to specify disease phenotypes [37,38]. Furthermore, the DL models worked well with large-scale real-world data, and have been proposed as a means to improve clinical decision making by providing data-driven evidence [38,39].
Convolutional neural networks (CNNs) are a type of neural network in DL, and they are the most popular model architecture for image classification. In recent years, their practical usage has extended from identifying objects from daily life, such as dogs and cats, to discovering symptoms, identifying diseases, and predicting biological structure [37,39,40]. In contrast to conventional machine learning approaches, CNNs require minimal pre-processing, and they capture complex and heterogeneous features from data without extensive domain knowledge. They have naturally become a favoured tool to study clinical data.
In 2018, Camps et al. [41] introduced an 8-layer 1D CNN to perform FOG detection. The model was trained using data from a group of patients in the REMPARK database where 21 PD patients wore a nine-channel tri-axial IMU (accelerometer, gyroscope and magnetometer) on the left side of the waist while performing several walking tests at home. The data collected was segmented into 2.56-second windows, and every window of data was transformed into the frequency domain using the short-time Fast Fourier Transform (FFT). The magnitude of each FFT window was combined with the previous window of FFT data to form a single sample. The authors further processed the data with data augmentation in order to address the data imbalance issue. The model outperformed the previous shallow ML models and achieved 91.9% sensitivity, 89.5% specificity and 90.6% geometric mean.
Later that year, Xia et al. [42] presented a simpler 5-layer CNN. The data was collected from ten subjects with accelerometers placed on three different parts of their bodies. Outlier removal and data segmentation were performed, and the raw accelerometer data with a window size of four seconds was used as the model's input. The proposed CNN model was tested with two schemes. The patient-dependent model was able to detect FOG with an accuracy of 99%. The patient-independent model was trained using leave-one-patient-out validation and achieved an accuracy of 80.7%.
In 2020, Sigcha et al. [43] proposed to use a combination of a CNN and a Long Short-Term Memory network (CNN-LSTM) on data collected from a single accelerometer that was placed on the participant's waist. The study involved 21 participants, and the data collection was conducted at the patients' homes to increase the occurrence of FOG episodes. The authors found that stacking three previous spectral windows on the current window as the input of the CNN-LSTM model provided the best result, and they achieved mean sensitivity, specificity, and geometric mean all equal to 87.1% using leave-one-subject-out validation.
More recently, Bikias et al. [44] proposed another CNN model to investigate the feasibility of using a wrist-based IMU sensor to detect FOG episodes. The study used the data from the CuPiD IMU dataset [45], which contained data from 18 patients. The IMU consisted of an accelerometer and a gyroscope, with a sampling rate of 128 Hz. A simple network with two CNN layers was used, and evaluation with 10-fold cross-validation achieved a mean specificity of 90% and sensitivity of 86%.
Although different DL models were used, the past studies were able to achieve excellent detection accuracy. However, these models were trained and tested with a limited number of subjects, and the gait patterns of PD patients with FOG differ significantly between patients and can even change significantly within a patient as the disease progresses. The performance of models might deteriorate if employed on subjects with gait characteristics different from those of the subjects the models were trained on. In addition, while past studies have used either time-domain or frequency-domain data as the input, we explored in this study the possibility of leveraging time-frequency representations as the input. The use of the two-dimensional time-frequency representation as the input also demonstrated the feasibility of using computer vision techniques and architectures to detect FOG. This approach lays the groundwork for future FOG detection research to adopt and extend innovative solutions from the computer vision literature.
Inspired by the latest research in CNN models, we investigated a novel FOG detection method using a CNN. The model was optimised using the sequential model-based Bayesian optimisation method. In order to evaluate the proposed model's performance, we first compared the proposed model with seven popular machine learning algorithms: 1) k-nearest neighbours (KNN), 2) Linear Regression (LR), 3) Decision Tree (DT), 4) Random Forest (RF), 5) Support Vector Machine (SVM) with linear kernels, 6) SVM with radial basis function (RBF) kernels, and 7) Extreme Gradient Boosting (XGBoost). Subsequently, the proposed model was compared against the state-of-the-art DL models: Xia's model, Camps's model, and Bikias's model. The preliminary design and results that were reported in a previous conference paper [46] were also reconstructed and examined.

A. Data
Sixty-seven PD subjects who had suffered from different degrees of FOG in the past agreed to participate in our study. All subjects were selected during their regular check-up and recommended by their respective neurologists from the local hospitals. The study was approved by the SingHealth Centralised Institutional Review Board of Singapore on 28th September 2016 (CIRB Ref: 2016/2743).
Each subject was instructed to perform two types of walking tests in the hospital under the observation of a physiotherapist and several researchers. The first one was the standard 7-metre Timed-Up-and-Go (7mTUG) test, and each subject conducted the test three times. As mentioned in our previous research [46], FOG subjects (freezers) often experience "white-coat syndrome" where they do not experience FOG when performing walking tests with their neurologists or physiotherapists in a hospital or a laboratory setting. On the other hand, another widely adopted FOG data collection method, home-based data collection, has a higher chance of simulating the patient's daily routine and inducing more FOG episodes. However, it also raises significant concerns about the patient's safety and privacy. Hence, in order to reduce the "white-coat syndrome" and capture more occurrences of freezing, we asked the subjects who did not experience any FOG episodes during the 7mTUG to walk freely in the clinic as the second, FOG-inducing test. The subjects wore an IMU around the lateral malleolus area of each ankle, and a third IMU near the 7th cervical (C7) vertebra during both tests. However, the data from the third sensor was not used in the FOG detection model as it was used only to analyse the patients' sitting and standing posture and stability. Each IMU used in this study was composed of an accelerometer, gyroscope, and magnetometer, and was developed in-house at the National University of Singapore [47,48]. The IMU data, sampled at 50 Hz, was transmitted wirelessly over a Bluetooth connection and saved in an iPad app.
Videos were recorded during the tests, and three experienced physiotherapists independently reviewed the videos after the tests in order to mark the FOG events. The final FOG labels were decided based on the decision of the majority. We ended up with a total of 486 FOG events in our database.
Of the sixty-seven subjects we recruited, data from four subjects who were unable to complete the tests, or encountered data loss during the tests, was excluded. Three subjects who suffered from FOG did not manifest any signs of freezing during the tests. Seven subjects demonstrated minimal or insignificant periods of FOG. However, recordings from these ten subjects were kept as examples of non-FOG gait. Four subjects faced significant challenges completing the test without walking aids, such as a walking frame or cane, so their data (which included periods of FOG) was collected with walking aids and used in our analyses. The rationale was that the use of walking aids in some environments was common in PD patients, so including this type of data would allow us to build a more robust system that could be used by PD patients to detect FOG in different environments.
The demographics of the sixty-three subjects who completed all the tests are shown in Table I.

B. Signal Pre-Processing
1) Data Filtering: Signal pre-processing, such as filtering, is usually required for classification problems using time series data. However, DL models often require minimal filtering, and introducing noise into the input data is often used in DL models to reduce generalisation error and improve model robustness [50]. Hence, in our experiments, the data was not filtered at all when training the CNN model.
When testing with the other machine learning models, the accelerometer signals were filtered with a 4th-order Butterworth band-pass filter. The cut-off frequencies were 0.2 Hz and 15 Hz. A 4th-order Butterworth low-pass filter was applied to the gyroscope signal with a cut-off frequency of 10 Hz.
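The filtering step for the conventional machine learning models can be sketched as follows. This is a minimal illustration using SciPy, not the authors' code: the zero-phase (forward-backward) application and the function names are assumptions, since the text specifies only the filter type, order, and cut-off frequencies.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 50.0  # sampling rate in Hz, as stated for the IMUs


def filter_accelerometer(acc, fs=FS):
    """Butterworth band-pass (0.2 to 15 Hz) applied along the time axis.

    Note: for band-pass designs SciPy's order argument N yields a 2N-order
    filter; N=4 is used here as one reading of "4th-order".
    """
    sos = butter(4, [0.2, 15.0], btype="bandpass", fs=fs, output="sos")
    # Assumption: zero-phase filtering; the paper does not state this.
    return sosfiltfilt(sos, acc, axis=0)


def filter_gyroscope(gyr, fs=FS):
    """4th-order Butterworth low-pass with a 10 Hz cut-off."""
    sos = butter(4, 10.0, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, gyr, axis=0)


# Synthetic check signal: offset + 2 Hz gait-like component + 20 Hz noise.
t = np.arange(3000) / FS
raw = 1.0 + np.sin(2 * np.pi * 2.0 * t) + np.sin(2 * np.pi * 20.0 * t)
acc_filtered = filter_accelerometer(raw)
gyr_filtered = filter_gyroscope(raw)
```

The second-order-sections form is used because it is numerically more stable than transfer-function coefficients when the lower cut-off (0.2 Hz) is very small relative to the sampling rate.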
2) Continuous Wavelet Transform: Previous DL FOG detection algorithms used either the time-domain raw data or frequency components obtained from the Fast Fourier Transform (FFT). Spatial and temporal domain features have been reported as predictors of FOG [51][52][53][54]. Frequency domain features, such as power in the freezing band (FOG episodes often occur between 3 and 8 Hz) and locomotor band (volitional activities have dominant frequencies that range from 0.5 to 3 Hz), have also been shown to be sensitive predictors in FOG detection, and can only be discovered in the frequency domain [51,55,56]. For example, Figure 1a shows non-FOG gait data from one of our subjects, reflected in the orderly and periodic changes in the vertical axis of both the accelerometer and gyroscope signals. Figure 1b shows that most of the frequency components were distributed below 3 Hz. Figure 2a shows gait data from another one of our PD subjects suffering from a FOG episode. The data from the vertical axis of the accelerometer and gyroscope were much more random and distorted. Figure 2b shows that most of the frequency components were distributed between 3 Hz and 8 Hz.
However, based on observations of our own data and the results of past studies [32,57], these patterns in the time or frequency domains were not always distinguishable for all patients. For the same patient, his/her FOG patterns also varied over time. This heterogeneity complicates the autonomous detection of FOG. Therefore, applying either raw time-domain data or transformed FFT data as the inputs for a CNN model can potentially lead to some critical features missing from the analysis and classification. This motivated us to make use of the wavelet transform, which would capture patterns in the time-frequency domain, to provide richer inputs to the CNN model. In Figure 1c and Figure 2c, data in the same window was transformed into the time-frequency plane using a continuous wavelet transform (CWT). The scalograms contained all the key information from the time and frequency domain analyses. Furthermore, they provided considerable additional insights into the non-stationarity of the IMU signals and the time specificity of power increases in different frequency bands.
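The freezing-band and locomotor-band powers mentioned above can be computed from a simple periodogram. The sketch below is illustrative only: the function names, the periodogram estimator, and the synthetic test signals are assumptions, not the feature definitions used in the study.

```python
import numpy as np

FS = 50.0  # sampling rate in Hz


def band_power(x, fs, f_lo, f_hi):
    """Sum of the periodogram of x over the band [f_lo, f_hi) in Hz."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()  # remove the offset so it does not leak into low bins
    spectrum = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return float(spectrum[mask].sum())


def freeze_and_locomotor_power(x, fs=FS):
    """Return (freezing-band power, locomotor-band power)."""
    locomotor = band_power(x, fs, 0.5, 3.0)  # volitional gait, 0.5 to 3 Hz
    freeze = band_power(x, fs, 3.0, 8.0)     # freezing band, 3 to 8 Hz
    return freeze, locomotor


# Hypothetical 4-second signals: regular stepping vs. trembling-like motion.
t = np.arange(200) / FS
walking = np.sin(2 * np.pi * 1.5 * t)
trembling = np.sin(2 * np.pi * 5.0 * t)
```

For the synthetic walking signal the locomotor-band power dominates, and for the trembling signal the freezing-band power dominates, mirroring the pattern described for Figures 1b and 2b.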
As the CWT can provide a finer discretised scale for analysis than the discrete wavelet transform, we used the absolute value of the coefficients obtained from the Morlet mother wavelet as the input to our CNN model. The mathematical representation of the CWT for time-series data x(t) with respect to a mother wavelet ψ is defined as:

$$W_x(s, \tau) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - \tau}{s}\right) dt \tag{1}$$

where s and τ are the scale and translation factors, respectively, used to transform the mother wavelet ψ, and * denotes the complex conjugate.
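The transform in Equation 1 can be approximated numerically by correlating the signal with scaled copies of the wavelet. The sketch below uses only NumPy; the wavelet parameter `w0 = 6`, the frequency grid, and the function names are assumptions chosen for illustration and are not taken from the study.

```python
import numpy as np


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (w0 = 6 is a common default)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)


def cwt_morlet(x, freqs, fs, w0=6.0):
    """Return |W(s, tau)|, one row per requested pseudo-frequency in Hz."""
    x = np.asarray(x, dtype=float)
    dt = 1.0 / fs
    scalogram = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        s = w0 / (2 * np.pi * f)          # scale whose centre frequency is f
        half = int(np.ceil(5 * s / dt))   # wavelet support of +/- 5 scales
        m = np.arange(-half, half + 1)
        # Correlating x with psi* equals convolving with the
        # time-reversed conjugate wavelet.
        kernel = np.conj(morlet(-m * dt / s, w0))
        full = np.convolve(x, kernel, mode="full")
        coeffs = full[half:half + len(x)] * dt / np.sqrt(s)
        scalogram[i] = np.abs(coeffs)     # absolute value, as in the text
    return scalogram


# Hypothetical 4-second window with a 5 Hz component, sampled at 50 Hz.
fs = 50.0
t = np.arange(200) / fs
signal = np.sin(2 * np.pi * 5.0 * t)
freqs = np.linspace(0.5, 20.0, 40)
scalogram = cwt_morlet(signal, freqs, fs)
```

Each channel of a window yields one such two-dimensional array, which is what allows the window to be treated as an image-like input. A library such as PyWavelets could be used instead of this direct implementation.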

C. Machine Learning Models
In order to compare the performance of DL models to machine learning models, seven popular machine learning models were trained on 67 features (F1 to F67, described below) that had been used in past FOG detection studies. The features were extracted from the data in 1-second windows.
1) Frequency Domain Features: Moore et al. [58] and Delval et al. [51] pointed out that freezing of gait often occurred in the range of 3 to 8 Hz in the frequency spectra for vertical leg acceleration, while normal gait happened in the 0.5 to 3 Hz range. Therefore, we selected five widely used groups of frequency domain features (F1 to F5) described in Table II.
2) Entropy Features: Sample Entropy (SampEn) is an improved version of approximate entropy and is often used to evaluate the complexity or regularity of time-series data [59]. Human gait is a form of dynamical system, and FOG is a sudden and episodic abnormality in the gait system [60,61]. SampEn can be an effective method to analyse the regularity or stability of human gait, where a higher SampEn value indicates a higher level of irregularity or randomness.
Sample Entropy calculation was performed for both accelerometer and gyroscope signals in the X, Y, and Z axes (F6 to F11), as well as the signals' magnitude (F12 to F13). This feature extraction process was performed for each window of data to form the feature vector.
3) Wavelet Features: The wavelet transform is another popular method to analyse time-series signals. For example, El-Attar et al. [62] demonstrated that features extracted from applying the discrete wavelet transform (DWT) on accelerometer signals yielded a robust FOG classification model.
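A compact way to compute SampEn for one window is sketched below. The embedding dimension `m = 2` and tolerance `r = 0.2` times the standard deviation are conventional defaults assumed here; the study does not state the values it used.

```python
import numpy as np


def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy of a 1-D signal (Chebyshev distance, no self-matches)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = r_factor * x.std()

    def count_matches(length):
        # The same n - m starting points are used for both template lengths.
        templates = np.array([x[i:i + length] for i in range(n - m)])
        dist = np.max(
            np.abs(templates[:, None, :] - templates[None, :, :]), axis=2
        )
        # Count each unordered pair (i < j) once; matches are within r.
        upper = np.triu_indices(n - m, k=1)
        return int(np.sum(dist[upper] <= r))

    b = count_matches(m)       # matches of length m
    a = count_matches(m + 1)   # matches of length m + 1
    if a == 0 or b == 0:
        return float("inf")    # undefined when no matches are found
    return float(-np.log(a / b))


# Hypothetical signals: a regular oscillation and white noise.
t = np.arange(200) / 50.0
regular = np.sin(2 * np.pi * 1.0 * t)
noisy = np.random.default_rng(0).standard_normal(200)
```

The regular signal gives a lower SampEn than the noise, which matches the interpretation that higher values indicate more irregular gait. The pairwise distance matrix makes this version quadratic in memory, which is acceptable for short windows such as the 1-second (50-sample) windows used for feature extraction.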

D. Data Segmentation for Deep Learning Model
Data normalisation is essential for DL models to reduce computational time and improve performance [63]. The best practice when performing data normalisation and estimating the data distribution is to use only the training data and keep the test dataset untouched to prevent potential overfitting. In our study, the training data (consisting of the nine IMU signals) was used to fit the robust scaler, and the entire dataset was transformed using the fitted scaler.
The normalised data were then segmented into smaller 4-second windows (200 samples), with a 50% overlap. Each 4-second window was composed of two parts. The non-overlapping part of the data was 2 seconds long and was defined as the current window because the window label was determined using only this part of the data. The overlapping 2 seconds of data came from the previous window, and it was combined with the current window in order to capture more features and transitory patterns. In some earlier studies, a window was labelled as a FOG window when all data in that window were FOG data [41], or more than 50% of the data in that window were FOG data [42]. However, in our data, a window that contained more than 0.2 seconds of FOG data (10% of the new information in each window, i.e. ten samples) was labelled as a FOG window. The shortest FOG episode in our dataset was 0.04 seconds, which was the duration of two frames in the videos used by the therapists to identify FOG. There were only 6 FOG episodes (1.2%) shorter than 0.2 seconds, and these episodes might not be detectable in real life. The rationale for labelling a window with 0.2 seconds of FOG data as a FOG window was to minimise the detection latency and improve the detection robustness by training the model to recognise partial FOG windows and short FOG episodes. However, the obvious drawback was that a model's performance deteriorated with this approach, which might be part of the reason why the reconstructed models in this paper exhibited reduced performance as compared to those reported in the original studies.
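The normalisation and segmentation steps can be sketched as follows. The median/interquartile-range scaling is one common definition of a robust scaler and is an assumption here, as are the function names and the handling of the first window, whose overlapping part has no predecessor.

```python
import numpy as np

FS = 50            # samples per second
WINDOW = 4 * FS    # 200 samples per window
STEP = 2 * FS      # 50% overlap, so 100 new samples per window
MIN_FOG = 10       # more than 0.2 s (ten samples) of FOG marks a FOG window


def fit_robust_scaler(train):
    """Per-channel median and interquartile range from training data only."""
    median = np.median(train, axis=0)
    q1, q3 = np.percentile(train, [25, 75], axis=0)
    iqr = np.where(q3 - q1 == 0, 1.0, q3 - q1)  # avoid division by zero
    return median, iqr


def apply_scaler(data, median, iqr):
    return (data - median) / iqr


def segment(signals, fog_mask):
    """Split (n_samples, n_channels) data into labelled overlapping windows."""
    windows, labels = [], []
    for start in range(0, len(signals) - WINDOW + 1, STEP):
        end = start + WINDOW
        windows.append(signals[start:end])
        # Only the newest 2 seconds (the "current window") decide the label.
        current = fog_mask[end - STEP:end]
        labels.append(int(current.sum() > MIN_FOG))
    return np.stack(windows), np.array(labels)


# Hypothetical 12-second recording with nine channels and one short episode.
signals = np.zeros((600, 9))
fog_mask = np.zeros(600, dtype=bool)
fog_mask[250:262] = True  # 12 FOG samples (0.24 s)
windows, labels = segment(signals, fog_mask)
```

In this example the episode falls in the current part of the second window only, so a single window is labelled as FOG even though the episode lasts less than a quarter of a second.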

E. Deep Learning Model Architecture
Extensive research on CNNs has shown that models can learn more complex features as they go deeper. However, training a deeper model is challenging as it requires a large amount of computational resources and data. Therefore, we limited our network to a maximum of 8 CNN layers. An overview of the CNN architecture used is shown in Figure 3.
The first two CNN layers in the first 2D convolutional block contained 77 and 685 filters, respectively, and the kernel size was 7x7. A max-pooling layer with a pool size of 5x5 and a stride of 2x2 was added after the CNN layers. The two CNN layers in the second and third blocks had identical configurations, except that 128 filters were used in the second block and 464 filters were used in the third block. The kernel size in these two blocks was reduced to 5x5. The last block used a smaller 3x3 kernel and 101 filters. A 2D global average pooling layer, followed by a fully connected layer with 512 neurons, was added at the end of the convolutional blocks. After the fully connected layer, a sigmoid activation function was used to determine the output. All convolutional blocks ended with a batch normalisation layer, a softsign activation layer, and a dropout layer. The dropout rate was set to 0.4.
We also used the following DL techniques in the model to improve model robustness and prevent overfitting:
1) Regularisation: Overfitting is a common modelling issue, and it often occurs with CNN models. This error happens when the model fits the training data too well but fails to generalise to the test data. A few regularisation techniques to overcome overfitting were implemented in our model.
A large weight in a CNN will typically amplify noise in the input data, causing the error to increase further while propagating through the network. Hence, it is often an indicator of overfitting. Maximum normalised weight constraints were applied to all our convolution layers to ensure that the magnitude of weights did not exceed a given threshold during training.
A dropout layer is another regularisation technique to prevent overfitting. During training, it randomly sets the output of some hidden neurons to zero, so that each neuron is retained with the given retention probability. Dropout layers typically work well with a maximum normalised weight constraint. We tested different combinations of values for the maximum normalised weight and the dropout probability of retention, and the optimal result is discussed in the next section.
Early stopping was another regularisation method we used to prevent overfitting, where training was stopped when there were no significant decreases in validation loss over 20 epochs.
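The early stopping rule can be expressed in a few lines. In this sketch the threshold `min_delta`, which defines what counts as a significant decrease, is a hypothetical value, since the text gives only the 20-epoch patience.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta  # assumed threshold for an improvement
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss    # significant improvement: reset counter
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


def epochs_run(val_losses, patience=20):
    """Number of epochs completed before training stops."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)


# Hypothetical loss curve: five improving epochs, then a plateau.
losses = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.6] * 30
```

With this curve, training runs for 25 epochs: the five improving epochs plus the 20 plateau epochs that exhaust the patience. In Keras, the built-in `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)` callback offers comparable behaviour.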
2) Global Average Pooling: Another technique used to reduce overfitting was introduced by Lin et al. [64], who proposed a global average pooling layer to replace the fully connected layer and enhance model discriminability within the receptive fields. Our model used a 2D global average pooling layer, followed by a fully connected layer, after the final convolutional block. This arrangement has been used in recent advanced CNN architectures, such as EfficientNet [65] and MobileNet [66], as it reduces the computational cost of using two fully connected layers while maintaining the performance of the model.
3) Batch Normalisation: Recent deep learning models have tended to increase their depth with multiple layers that are combined sequentially, with the inputs to each neural network layer coming from the activity of the previous layer. During the training process, the parameters in each neural layer are updated with each mini-batch of data, and this change of parameters creates a constant shift in the distribution of inputs [67,68]. When these inputs propagate through the network, this small distribution change is amplified and slows down network convergence. This phenomenon has been described as internal covariate shift [67]. We used batch normalisation in our network to normalise each layer's inputs to reduce the training time and improve the model's robustness [69].

F. Model Optimisation
With thousands of possible combinations of hyperparameters, it would have been very time-consuming to identify the optimal hyperparameters if we had used the entire dataset. As such, we used only 30% of the training dataset to search for potential candidates and obtain an estimate of the optimal model performance. Furthermore, as the network structure was relatively complex, conventional hyperparameter tuning approaches, such as exhaustive grid search or random search, could not be performed using the full scale of the network as the computational cost would have been enormous. Hence, we adopted the "Taking the Human Out of the Loop" concept [70] and chose the Bayesian Optimisation approach to select the optimal combination of hyperparameters, such as activation functions, dropout rate, kernel initialisers, weight constraints, optimisers, loss functions, and the number of filters in each layer. A hyperparameter tuning library, scikit-optimize [71], was adopted in the fine-tuning process, and the Gradient Boosted Regression Trees technique was used to minimise the negative G-mean.
The Bayesian Optimisation (BO) function tried to find a new hyperparameter sample $X_n$ bounded by the given options $\chi$ (shown in Table III). For each iteration, the BO selected the best $X_n$ by optimising the acquisition function, $\alpha$, with the surrogate function obtained from the previous iteration, $D_{n-1}$:

$$X_n = \underset{X_n \in \chi}{\arg\max}\; \alpha(X_n; D_{n-1}) \tag{2}$$

where the acquisition function $\alpha$ was defined as:

$$\alpha(X_n; D_{n-1}) = \mathbb{E}\left[\max\left(0,\; f(X_n) - f(X^{+})\right) \mid D_{n-1}\right] \tag{3}$$

Here, $f(X^{+})$ and $f(X_n)$ symbolised, respectively, the highest validation G-mean derived from the objective function so far, and the validation G-mean from the current set of hyperparameters. $D_{n-1}$ represented the surrogate function that was estimated from the last set of hyperparameters and the objective function $f(X_n)$. The surrogate function was an approximation of the objective function used to infer the posterior of the optimisation process. The BO was a sequential optimisation process, where the initial set of $X_n$ was determined randomly from the options $\chi$. The surrogate function $D_{n-1}$ was updated by the initial set of hyperparameters or the last set of values that had been evaluated. The updated $D_{n-1}$ was then fed to Equation 2 to find an improved set of hyperparameters. We set the maximum number of iterations to 256, but the validation G-mean exhibited no significant improvements after 154 evaluations in our experiment, which suggested that the sequential optimisation process had converged to the global optimum. The final set of hyperparameters is summarised in Table III.
To speed up the fine-tuning process further, the model was trained using synchronous distributed training where each GPU ran a replica model with a local batch size of 10. We trained each combination of parameters for 50 epochs with a global batch size of 80 (the local batch size × the number of GPUs). Theoretically, this training strategy would have increased the training speed by eight times.

G. Model Evaluation
Six evaluation metrics were used to evaluate the proposed model's performance and compare it to other state-of-the-art algorithms (Table IV). The Shapiro-Wilk normality test was performed on all evaluation metrics from all classification models. The one-tailed Student's t-test (α = 0.05) was used to test for statistically significant improvements if the distributions of both evaluation metrics passed the normality test, and Hedges' g was reported as the effect size estimate. Otherwise, statistical significance was assessed using the nonparametric Mann-Whitney U test (α = 0.05) and the Rank-Biserial Correlation was used to determine the effect size.
For most PD patients, FOG events constitute only a small part of their regular walking experience. The gait data collected for FOG studies will therefore always be imbalanced, with FOG incidents being the minority class. We used the geometric mean (G-mean) and F1 score (harmonic mean of the precision and recall) as they are generally better metrics for evaluating model performance on imbalanced data [72,73].
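The test-selection procedure can be sketched with SciPy as below. The independent two-sample form of the t-test, the helper names, and the example scores are assumptions for illustration; the rank-biserial formula assumes a SciPy version in which `mannwhitneyu` returns the U statistic of the first sample.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05


def hedges_g(a, b):
    """Standardised mean difference with small-sample bias correction."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(
        ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
        / (na + nb - 2)
    )
    if pooled == 0:
        return 0.0
    d = (np.mean(a) - np.mean(b)) / pooled
    return float(d * (1 - 3 / (4 * (na + nb) - 9)))


def compare(a, b, alpha=ALPHA):
    """One-tailed test of whether scores `a` exceed scores `b`."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    normal = (stats.shapiro(a).pvalue > alpha
              and stats.shapiro(b).pvalue > alpha)
    if normal:
        p = stats.ttest_ind(a, b, alternative="greater").pvalue
        return {"test": "t-test", "p": float(p), "effect": hedges_g(a, b)}
    u, p = stats.mannwhitneyu(a, b, alternative="greater")
    rank_biserial = 2 * u / (len(a) * len(b)) - 1
    return {"test": "mann-whitney", "p": float(p),
            "effect": float(rank_biserial)}


# Hypothetical per-fold G-mean scores for two models (10 folds each).
model_a = [0.901, 0.912, 0.893, 0.924, 0.905,
           0.916, 0.887, 0.938, 0.899, 0.910]
model_b = [0.801, 0.815, 0.792, 0.823, 0.808,
           0.811, 0.789, 0.830, 0.797, 0.806]
result = compare(model_a, model_b)
```

The returned dictionary records which branch was taken, so the effect size can be interpreted on the right scale (Hedges' g for the t-test, rank-biserial correlation for the Mann-Whitney U test).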
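The effect of class imbalance on these metrics can be seen in a short example. The definitions below are the standard ones (G-mean as the geometric mean of sensitivity and specificity); the example labels are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 marks FOG."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(a, b):
    return a / b if b else 0.0


def imbalance_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = safe_div(tp, tp + fn)   # recall on FOG windows
    specificity = safe_div(tn, tn + fp)   # recall on non-FOG windows
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical test set: 10 FOG windows and 90 non-FOG windows.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100
scores = imbalance_metrics(y_true, always_negative)
```

A classifier that never predicts FOG reaches 90% accuracy on this set, yet its G-mean and F1 score are both zero, which is why these two metrics are more informative for imbalanced FOG data.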

III. RESULTS AND DISCUSSION
All experiments were conducted using Python, TensorFlow, and other relevant Python libraries. The model was trained on an Nvidia Tesla V100 GPU using an Amazon Web Services Elastic Compute Cloud (EC2) cluster.
The data was split into 80% training data and 20% test data, i.e., 50 subjects in the training dataset and 13 subjects in the test dataset. During the training and validation process, 10-fold cross-validation was performed using only the training data, and the test set was held out for final evaluation.
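A subject-level split of this kind can be sketched as follows. Grouping the cross-validation folds by subject, the random seed, and the function names are assumptions; the text states only the subject counts and the number of folds.

```python
import numpy as np


def split_subjects(subject_ids, n_test=13, seed=0):
    """Shuffle unique subject IDs and hold out `n_test` of them for testing."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(subject_ids))
    return shuffled[n_test:], shuffled[:n_test]


def make_folds(train_subjects, k=10):
    """Yield (train, validation) subject arrays for k-fold cross-validation."""
    folds = np.array_split(train_subjects, k)
    for i in range(k):
        validation = folds[i]
        training = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield training, validation


# Hypothetical IDs for the 63 subjects who completed all tests.
subjects = np.arange(63)
train_subjects, test_subjects = split_subjects(subjects)
folds = list(make_folds(train_subjects))
```

Splitting by subject rather than by window keeps all windows from one person on the same side of each split, so the evaluation reflects performance on unseen subjects.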

A. ML Classification Results
Seven popular machine learning models (KNN, LR, DT, RF, SVM, SVM-RBF and XGBoost) were selected to evaluate the classification performance using conventional handcrafted features. All models were fine-tuned using a grid search to determine the best set of hyperparameters. Each model was trained with stratified 10-fold cross-validation. The mean performance of the models over the 10 folds is shown in Table V and Figure 4. The XGBoost model showed the best performance in accuracy (80.65%), G-mean (81.03%), and F1 score (77.41%). Other models showed a mean classification accuracy below 80%. The XGBoost model also exhibited statistically significant improvement over the KNN and SVM (RBF) models for four evaluation metrics, with effect sizes (Hedges' g) above 0.8.

B. Deep Learning Classification Results
1) Baseline CNN Model and Reconstructed Models: We retrained our previous model [46] with 10-fold cross-validation to evaluate the model performance.Furthermore, we reconstructed some of the state-of-the-art models mentioned in the first section as a comparison.As the dataset was different, and because of the increased heterogeneity in our data because of the much larger number of subjects, some of the reconstructed models (like Camps's model and Bikias's model) did not achieve the performance reported in the original articles.In contrast, our baseline model and the reconstructed Xia's models exhibited very comparable results to their reported subjectindependent models.The performance differences in the reconstructed models could also have been due to differences in data collection conditions (e.g.data collection in hospital versus home), sensor data (e.g.inclusion of magnetometer data), window sizes, and labeling.The results are shown in Table VI and Figure 5.
2) Proposed CNN Model: We trained the models with a 10-fold cross-validation scheme and evaluated them with the hold-out test set to demonstrate that the proposed CNN model outperformed previous models. Table VI shows that the proposed CNN model exhibited a statistically significant improvement over the ML models for the average accuracy, precision, G-mean, and F1 score. All metrics except the sensitivity were significantly improved compared to Xia's model. Furthermore, the proposed model showed a significant improvement in G-mean over all the other models, with a substantial effect size.
The best performance obtained from the proposed model was a 90.7% G-mean and a 91.5% F1 score. Furthermore, the proposed model and the reconstructed Bikias's and Camps's models displayed minor performance fluctuations throughout the 10-fold cross-validation, indicating that these models provided similar performance levels with data from new subjects. No significant improvement in sensitivity was found in the statistical analysis, mainly because the proposed model was optimised for the best G-mean performance.

IV. CONCLUSION
In the last two decades, FOG detection algorithms have slowly changed from classical feature extraction and statistical ... In this study, we proposed a novel method to analyse IMU data using time-frequency analysis and proposed a robust model structure that was trained and validated on a relatively large cohort of PD patients who suffered from FOG. Using the "Taking the Human Out of the Loop" concept, we employed Bayesian Optimisation to determine the optimal hyperparameters and the final model design. Our proposed design is also a subject-independent model, and it is immune to the fluctuation in gait patterns when PD patient mobility deteriorates over time. The empirical results also indicated that the model can provide consistent performance and excellent FOG detection accuracy. Moreover, the statistically significant improvement of G-mean in the proposed model compared against all other models demonstrated that the Bayesian optimisation approach could effectively determine the hyperparameters for the desired evaluation metrics or objective functions. The proposed model used two-dimensional time-frequency representations as inputs, demonstrating the feasibility of using computer vision techniques and architectures to detect FOG. This approach lays the groundwork for future FOG detection research to adapt and develop more innovative solutions from the computer vision domain.
However, the main limitations of this model are that it does not reduce the computational cost and that it relies heavily on GPUs to process the data during the training phase. Nevertheless, we did not find a significant increase in computational cost for the inference process compared with other models. Another limitation of this study is that the data collection was performed in the hospital, where patients pay extra attention to their movements. This setting might not sufficiently represent all the real-world environments that trigger FOG and the patients' daily difficulties at home.

TABLE V:
The ML prediction results with selected FOG features. * denotes results that were significantly poorer than the best ML model, XGBoost (p < 0.05). Statistical significance required either that the effect size measurement Hedges' g was greater than 0.8, or that the Rank-Biserial Correlation was greater than 0.5.
This article has been accepted for publication in IEEE Transactions on Biomedical Engineering. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TBME.2022.3140258
(a) FoG in Time Domain (b) FoG in Frequency Domain (c) FoG in Time-Frequency Domain

Fig. 5: Classification performance for the baseline and re-constructed DL models.

Sensitivity (Recall): $\frac{TP}{TP + FN}$. The true positive rate, which corresponded to the ratio of FOG windows that were correctly classified as FOG windows.
Specificity: $\frac{TN}{TN + FP}$. The true negative rate, which indicated the proportion of non-FOG windows that were correctly considered as non-FOG.
Precision: $\frac{TP}{TP + FP}$. The ratio of correctly classified FOG windows to the total number of windows classified as FOG.
Geometric Mean: $\sqrt{\text{Sensitivity} \times \text{Specificity}}$. The G-Mean, the square root of the product of the sensitivity and specificity, is a performance measurement that helps to balance the result among different classes.
F1-Score: $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$. The harmonic mean of precision and recall, an evaluation metric that assesses the classification performance by evenly weighting recall and precision.

TABLE I :
Demographics of the subjects. PD stands for Parkinson's Disease. Duration of Disease refers to the time interval between the date of PD diagnosis and the first assessment in this study. FOG-Q is the Freezing of Gait Questionnaire, which is currently the only validated measure to evaluate FOG subjectively [49]. 7mTUG is the 7-meter Timed-Up-and-Go test, which is a widely used functional mobility test.
cadence, step duration, velocity, stride length, FOG Criterion, and gait cycle duration (stride time, stance time and swing time), etc. have been shown to be effective in detecting FOG

TABLE II :
Short descriptions of the selected frequency domain features for FOG detection.

TABLE III :
CNN hyperparameters explored using Bayesian Optimisation and the optimal parameters obtained.

TABLE IV :
Classification Evaluation Metrics. The true positive rate (TP) indicated the proportion of FOG episodes correctly classified. The false positive rate (FP) showed the proportion of non-FOG data windows misclassified as FOG episodes. The true negative rate (TN) computed the proportion of non-FOG episodes that were classified accurately. The false negative rate (FN) was the proportion of FOG episodes misclassified as non-FOG episodes.
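The evaluation metrics can be computed from the four confusion-matrix entries as in the sketch below. It assumes that TP, FP, TN and FN are given as window counts and uses the standard definitions of sensitivity, specificity, precision, G-mean, F1 score and accuracy; the function name `fog_metrics` and the example counts are hypothetical and are not results from this study.

```python
import math


def fog_metrics(tp, fp, tn, fn):
    """Compute window-level classification metrics from confusion counts.

    Assumes every denominator is non-zero (both classes are present and
    at least one window is predicted as FOG).
    """
    sensitivity = tp / (tp + fn)   # recall on FOG windows
    specificity = tn / (tn + fp)   # recall on non-FOG windows
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Made-up counts for illustration only.
example = fog_metrics(tp=80, fp=10, tn=90, fn=20)
```

Because the G-mean multiplies the two class-wise recalls, it drops sharply when either class is poorly detected, which is why it is a useful objective for imbalanced FOG data.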

TABLE VI :
Classification performance for the baseline and re-constructed DL models. * denotes a significant improvement over XGBoost. † indicates that the proposed model exhibited a significant improvement over the corresponding model for the specific metric. Statistical significance meant that the effect size measurement Hedges' g was greater than 0.8, or that the Rank-Biserial Correlation was greater than 0.5.
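The rank-biserial criterion mentioned in the caption can be illustrated with the sketch below. It assumes the two-independent-samples form of the rank-biserial correlation (the proportion of favourable pairs minus the proportion of unfavourable pairs, which is equivalent to the Mann-Whitney-based definition); whether the study used this form or the matched-pairs form is not stated here, and the function name and example scores are hypothetical.

```python
def rank_biserial(a, b):
    """Rank-biserial correlation between two independent samples.

    Counts all (x, y) pairs with x from a and y from b; ties count for
    neither side. Returns a value in [-1, 1].
    """
    favourable = sum(1 for x in a for y in b if x > y)
    unfavourable = sum(1 for x in a for y in b if x < y)
    return (favourable - unfavourable) / (len(a) * len(b))


# Made-up per-fold G-mean scores for two hypothetical models.
proposed = [0.90, 0.91, 0.89, 0.92, 0.90]
baseline = [0.86, 0.88, 0.85, 0.90, 0.87]
large_effect = rank_biserial(proposed, baseline) > 0.5  # threshold quoted in the caption
```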