Supervised Classification of Bradykinesia for Parkinson's Disease Diagnosis from Smartphone Videos

Slowness of movement, known as bradykinesia, is an important early symptom of Parkinson's disease. This symptom is currently assessed subjectively by clinical experts. However, expert assessment has been shown to be subject to inter-rater variability. We propose a low-cost, contactless system using smartphone videos to automatically determine the presence of bradykinesia. Using 70 videos recorded in a pilot study, we predicted the presence of bradykinesia with an estimated test accuracy of 0.79 and the presence of Parkinson's disease with an estimated test accuracy of 0.63. Even on a small set of pilot data, this accuracy is comparable to that recorded by blinded human experts.


I. INTRODUCTION
Parkinson's disease is a neurodegenerative disorder that affects approximately 1 in 500 adults in the UK [1]. The diagnosis is a clinical one, based on the clinician detecting the presence of a slowness of movement termed bradykinesia, together with at least one of rigidity, rest tremor or postural instability (United Kingdom Parkinson's Disease Society Brain Bank Criteria) [2]-[4].
The most common method to detect bradykinesia involves a specially trained clinician making a visual assessment of the patient tapping finger and thumb together. In this test, a patient is asked to tap their forefinger against their thumb for 10 seconds (as wide and quick as possible). The clinician observes for impairment of speed, amplitude or rhythm, and there is often also a progressive 'decrement' seen over the duration of the test [4], [5].
However, this visual clinical judgment is inherently subjective, and there is no objective measure in routine clinical use. Given both the imprecise definition of the term, and the difficulty for human observers to quantify small differences in movement, it is little surprise that inter-rater assessment of bradykinesia is moderate at best [4], [5]. Current evidence suggests that human observers prioritize changes in movement amplitude over changes in the frequency or rhythm [4].
Given the importance of bradykinesia in diagnosing and monitoring Parkinson's, and the relatively small group of neurologists trained to assess it, an automatic and objective method of determining the level of bradykinesia has the potential to improve early diagnosis and to standardize follow-up assessment. Robust implementation of such a system might also allow home-monitoring of disease progression and richer longitudinal information to inform patient care.
Other approaches have previously been suggested for objective bradykinesia assessment [6]-[8]. However, all prior methods either require sensors that may not be readily available, or they require the patient to interact with a specific computer program or smartphone app. To our knowledge, there is only one previous report that involves standard video to objectively measure bradykinesia, but the participants all had advanced Parkinson's disease, and the method required the face to be included in the video [9]. Here we propose a solution that uses the ubiquitous smartphone video camera to capture the relevant data during standard clinical assessment.
Our primary aim is to provide proof-of-concept that one can automate the assessment of bradykinesia, negating the impact of inter-rater variability and providing easily accessible clinical decision support. We also investigate the potential for diagnosis of Parkinson's disease itself. We describe how the video signal is processed and how pertinent features may be extracted to predict both bradykinesia and the presence of a Parkinson's disease diagnosis. Finally, we present initial results from a case-control pilot study.

II. EXISTING WORK
In general, three main approaches have been used to objectively record bradykinesia on finger tapping: 1) touch plate or computer key, 2) gyroscope and/or accelerometer, and 3) optical tracking of infrared markers.
These methods are compared to expert classification on one of two clinically validated scoring systems. The Unified Parkinson's Disease Rating Scale (UPDRS) gives an overall score from 0 (no bradykinesia) to 4 (severe bradykinesia) based on the first 10 finger taps [10]. The Modified Bradykinesia Rating Scale (MBRS) gives a separate score from 0 (no impairment) to 4 (severe impairment) for each of the three aspects of bradykinesia (speed, amplitude, and rhythm) [4], based upon ten seconds of tapping.
For 33 Parkinson's disease (PD) patients, Giovannoni et al. found the number of keyboard finger taps in 60 seconds was significantly lower than in controls (107 vs. 182), and correlated with clinician UPDRS rating, r = 0.69 [11]. Papapetropoulos et al. reported a modest but significant improvement in the maximum frequency of tapping a touch recording plate among 7 patients after deep brain stimulation to treat PD (4 Hz before, 4.6 Hz after) [12]. More recently, finger tapping within two rectangles on a smartphone screen showed a correlation of r = 0.75 with clinical UPDRS rating, with some features predictive of PD diagnosis, e.g. area under the curve (AUC) of 0.92 for distance of taps [13].
Gyroscope and/or accelerometer devices attached to forefinger and thumb provide a richer signal and can record the standard clinical finger-thumb tap examination. An accelerometer study showed that patients had higher beat decay of the auto-mutual information value (a signal predictability measure) vs. controls, giving a diagnostic accuracy of around 80% [14]. In 50 PD patients, Heldman et al. showed correlations between gyroscope recording and clinical rating for each of the three distinct components of bradykinesia according to the MBRS [4]. Angular velocity correlated with speed (-0.79), the excursion angle correlated with amplitude (-0.81), and the coefficient of variation correlated with rhythm (0.65) [4]. Furthermore, among 18 patients at each of 10 different deep brain stimulation amplitudes, gyroscope measurements showed higher test-retest reliability calculated as intraclass correlation and greater sensitivity calculated as minimal detectable change [15].
Using 3D recording of infrared markers on finger and thumb, in 22 patients and 22 controls, amplitude decrement and maximum opening velocity best differentiated between patients and controls (AUC = 0.87 and 0.81) [16]. Another study reported significant correlation between infrared measures of speed and rhythm and clinician rating according to UPDRS, although these correlations were not very strong (-0.37 and 0.31) [17].
To our knowledge, only one previous study has used computer analysis of simple video to detect bradykinesia on finger tapping [9]. This method tracked index finger motion and estimated distance by using face height to quantify hand length (by universal face:hand size ratios, the 'proportions of man'). A feature of tapping rhythm, 'cross-correlation between the normalized peaks', showed a strong Guttman correlation with UPDRS (-0.8), and support vector machine classification distinguished between PD patients and controls with an accuracy of 95%. However, only 13 patients participated, and all had advanced Parkinson's (a disease stage at which diagnosis is rarely an issue). Furthermore, a requirement to video the patient's face could be considered intrusive, limiting utility in practice.

III. METHODS
A. Data Collection (Video Recording and Clinician Rating)
The study was approved by the UK Health Research Authority (IRAS no. 224848). Patients with idiopathic PD, diagnosed by a consultant neurologist at Leeds Teaching Hospitals NHS Trust, were invited to attend a research clinic appointment. Control participants were invited from the partners and companions of patients, or from hospital staff. Video recordings were made of each hand in turn tapping forefinger and thumb 'as quickly and as big as possible' for 15 seconds. This convenience sample comprised 40 patient hands and 30 control hands (20 patient participants and 15 control participants), collected in 2017 and 2018.
The recordings were made using an integrated smartphone camera (iPhone SE), set to 60 frames per second, 1920x1080 pixels, and placed on a tripod, with only ambient lighting. The participant was asked to rest their elbow on a chair arm during the finger tapping and only the hand/forearm was filmed (no identifiable patient details were filmed). The distance from camera to hand was not tightly defined; in practice, the camera was positioned at approximately 1m from the participant. The lateral (thumb) surface of the hand faced the camera.
The degree of bradykinesia in each video was independently rated by two consultant neurologists with a special interest in movement on the UPDRS scale [17]. The raters were blinded to patient/control group. Where there was disagreement in rater scores, the higher score was used.
B. Data Analysis
1) Data Processing: A schematic of the data processing framework is presented in Figure 1. Complete details of the process will be described in future work. An abridged description now follows.
Initially, the video frames were segmented to pixels corresponding to a participant's hand. The hand regions of interest were first detected using a convolutional neural network, originally proposed by Bambach et al. [18]. Our implementation was trained using manual annotation of 500 randomly selected frames from our dataset. A secondary pixel-level segmentation, the grabcut method [19], was then used to refine the regions by removing erroneous background pixels.
The segmented frames were then converted into an optical flow field [20]. In such a field, each position corresponds to the vector pixel movement of a point object between two sequential frames. The magnitude of the vector thus represents the instantaneous speed of a point (in pixels/frame). We sum the magnitude at each point in the region of interest to obtain a metric of overall hand movement.
Figure 1. Illustration of the data processing in which raw video is converted to an anonymous 1D time series. Raw video is first segmented using a convolutional neural network. The segmentation is refined using the grabcut method. Frame-by-frame movement of the hand is extracted using optical flow. The optical flow field is then reduced so that the magnitude of movement between two frames is summarized by a single value.

To compensate for differences in optical flow magnitude caused by camera distance or hand size (rather than actual movement), we scale the magnitude by the number of pixels in the hand region of interest, so that our metric M_t is:

$$M_t = \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} \sqrt{u_{ij}^2 + v_{ij}^2}}{\sum_{i=1}^{H} \sum_{j=1}^{W} b_{ij}}$$

where H and W are the height and width of the optical flow field, u and v are the horizontal and vertical components of the flow, and b is the pixel mask obtained from the image segmentation. By evaluating M_t over a sequence of video frames we produce a 1D signal over time. Examples of the signal are shown in Figure 2.
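The normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `motion_metric` and `motion_signal` are hypothetical, the flow components are assumed to be supplied as arrays (for example from any dense optical flow routine), and multiplying the magnitude by the mask is an assumption that has no effect if the flow was already computed on segmented frames.

```python
import numpy as np

def motion_metric(u, v, b):
    """Mask-normalized optical flow magnitude for one pair of frames.

    u, v : (H, W) arrays with the horizontal and vertical flow components.
    b    : (H, W) binary mask, 1 inside the hand region, 0 elsewhere.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    b = np.asarray(b, dtype=float)
    n_hand = b.sum()
    if n_hand == 0:
        # Assumed convention for this sketch: no hand pixels means no movement.
        return 0.0
    magnitude = np.sqrt(u ** 2 + v ** 2)  # per-pixel speed in pixels/frame
    # Assumption: restrict the sum to hand pixels by multiplying with the mask.
    return float((b * magnitude).sum() / n_hand)

def motion_signal(flows, masks):
    """Evaluate the metric over a sequence of (u, v) fields to get a 1D signal."""
    return np.array([motion_metric(u, v, b) for (u, v), b in zip(flows, masks)])
```

Because the sum is divided by the number of hand pixels, a hand that fills more of the frame (closer camera or larger hand) does not automatically produce a larger value for the same per-pixel movement.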
2) Feature Selection: Candidate features were derived from the 1D signal via clinical knowledge and visual inspection. In particular, we derived a set of features that described the frequency, amplitude, and tap-to-tap variability, to reflect the UPDRS assessment criteria. The features selected were as follows.
Frequency: Tapping frequency was estimated as the frequency corresponding to the maximal amplitude peak in the fast Fourier transform (FFT) spectrum. This assumes that the finger tapping motion corresponds to the greatest movement (and thus energy) between frames and that other movements, such as finger tremor, have smaller magnitude.

Amplitude: Energy spectral density was calculated as the squared integral of the FFT spectrum. In addition, we assumed that bradykinesia movement is distinctive in some frequency bands. Therefore the energy spectral density is separated into six non-overlapping frequency bands ranging from 0 Hz to 18.36 Hz with bandwidth interval 3.06 Hz.

Variability: Two variability features were derived using the peaks of the optical flow waveform. Peaks were calculated via the MATLAB function findpeaks with zero minimum peak prominence. Peaks were then classified as maxima or minima by fitting a 1D Gaussian mixture model with two clusters to the peak amplitude values. We then defined:

Jitter: We hypothesize that there are differences between the hand closing and hand opening motions. We further note that there is an observable difference in higher frequency movement between maxima and minima. For instance, the troughs in the signal of the patient in Figure 2 appear subject to jitter that is not as visible at the peaks. To quantify the jitter we include the ratio of number of maxima to number of minima over the entire time series as a predictor.

Peak-to-peak variability: This was calculated as the standard deviation of the time between maxima peaks. This feature models variation in tapping frequency across the time series and may be considered analogous to the standard deviation of RR intervals (SDRR) for ECG signals [21].
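As a rough illustration of the frequency, band-energy, and peak-to-peak features, the sketch below works on a 1D movement signal sampled at the 60 frames per second used for recording. It is not the authors' MATLAB code, and several choices are assumptions made for the example: the function names are hypothetical, the mean is removed before the FFT so that the 0 Hz bin does not dominate, band energy is taken as the sum of squared FFT magnitudes in each band, and peaks are found as simple strict local maxima rather than with findpeaks followed by a Gaussian mixture model.

```python
import numpy as np

def spectral_features(signal, fs=60.0, band_width=3.06, n_bands=6):
    """Return (dominant frequency in Hz, energy in each frequency band)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()  # assumption: remove the constant offset before the FFT
    magnitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    dominant = float(freqs[np.argmax(magnitude)])
    power = magnitude ** 2  # assumption: energy as squared FFT magnitude
    bands = []
    for k in range(n_bands):
        lo, hi = k * band_width, (k + 1) * band_width
        in_band = (freqs >= lo) & (freqs < hi)
        bands.append(power[in_band].sum())
    return dominant, np.array(bands)

def peak_interval_sd(signal, fs=60.0):
    """Standard deviation (seconds) of the time between successive maxima."""
    x = np.asarray(signal, dtype=float)
    # Strict local maxima; a simplification of the peak detection in the text.
    idx = np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])) + 1
    if idx.size < 3:
        return 0.0  # too few intervals to measure variability
    return float(np.std(np.diff(idx) / fs))
```

A perfectly regular tapping signal gives a peak-interval standard deviation of zero, and the value grows as the rhythm becomes irregular, which is the intended analogy with SDRR.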
Given the relatively small number of samples in the dataset, we reduced the feature space to two dimensions using principal component analysis. The naive Bayes (NB) model was chosen as a simple baseline classifier, providing a sensible lower bound for performance on small datasets.
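The two-dimensional reduction can be illustrated with a singular value decomposition. This is a minimal sketch under stated assumptions: the function name `pca_project` is hypothetical, and standardizing each feature before the decomposition is an assumption made here because the candidate features have very different scales; the text does not say whether this was done.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X (samples x features) onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant features unscaled
    Z = (X - mu) / sigma  # assumption: features are standardized first
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    scores = Z @ components.T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained[:n_components]
```

The returned `explained` values give the fraction of total variance carried by each retained component, which is a quick check on how much information the two-dimensional plot discards.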
Logistic regression (LR) provides a linear separation of the data points, and this simplicity may lead to lower generalization error. We incorporated ridge (L2) regularization with strength determined via a grid search to minimize 10-fold cross-validation loss.
The linear support vector machine (SVM-L) model optimizes a different cost function than the LR model and therefore gives a different linear separation of the classes. Meanwhile, the support vector machine with a radial basis function kernel (SVM-R) has the ability to model nonlinear decision boundaries. The slack and (for SVM-R) kernel scaling hyper-parameters were again estimated using a grid search to minimize 10-fold cross-validation loss.
We report the accuracy, sensitivity, specificity, and AUC score for each model. Due to the relatively small size of our pilot data, we estimate the out-of-sample test accuracy of each model by reporting the mean accuracy of leave-one-out cross-validation (LOO-CV), with the hyper-parameters determined via the procedure described above. Analyses were performed using MATLAB 2017b and the scikit-learn and TensorFlow packages for Python 3 [23], [24].
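The leave-one-out procedure and the reported metrics can be sketched as follows. The example uses a small Gaussian naive Bayes classifier written directly in NumPy so that it is self-contained; it is an illustration only, not the analysis code used in the study. All function names are hypothetical, the Gaussian class-conditional model is an assumption about the NB variant, and the metric function assumes both classes are present in the labels.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class mean, variance, and prior for Gaussian naive Bayes."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # A small constant keeps the variance strictly positive.
        params[int(c)] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict_gnb(params, X):
    """Return the class with the highest log posterior for each row of X."""
    classes = sorted(params)
    log_post = []
    for c in classes:
        mean, var, prior = params[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
        log_post.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(np.vstack(log_post), axis=0)]

def loo_cv_predictions(X, y):
    """Hold out each sample in turn, train on the rest, predict the held-out one."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    preds = np.empty_like(y)
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        preds[i] = predict_gnb(fit_gnb(X[train], y[train]), X[i:i + 1])[0]
    return preds

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for labels coded 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Calling `binary_metrics(y, loo_cv_predictions(X, y))` gives the LOO-CV estimate of out-of-sample accuracy, whereas fitting and predicting on the same data gives the training figures.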

IV. RESULTS
A total of 70 videos were collected from 35 participants (left and right hands), with 40 videos corresponding to the hands of participants with diagnosed Parkinson's disease. UPDRS scores from 0 to 4 were assigned by two expert clinicians and then categorized into our binary outcome: UPDRS ≤ 1 and UPDRS > 1. Their assessments matched in 73% of cases (κ = 0.46). The higher of the two scores was selected for training of the models. In Figure 2 we show an example of UPDRS = 0 and UPDRS = 4 for comparison.
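For readers unfamiliar with the agreement statistic, the sketch below shows how raw agreement is corrected for chance to give Cohen's kappa, and how two raters' scores can be combined into the binary outcome. The function names are hypothetical, the unweighted form of kappa is an assumption (the text does not state which variant was used or whether it was computed on the 0 to 4 scores or the binary categories), and the function assumes the raters do not both assign a single identical label to every item.

```python
import numpy as np

def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters' labels on the same items."""
    r1 = np.asarray(r1)
    r2 = np.asarray(r2)
    labels = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)
    # Agreement expected by chance if the raters labelled independently.
    p_expected = sum(np.mean(r1 == k) * np.mean(r2 == k) for k in labels)
    return float((p_observed - p_expected) / (1 - p_expected))

def binary_outcome(score_a, score_b, threshold=1):
    """Take the higher of two UPDRS scores and mark scores above the threshold as 1."""
    combined = np.maximum(np.asarray(score_a), np.asarray(score_b))
    return (combined > threshold).astype(int)
```

Kappa is lower than raw agreement because part of the observed agreement would occur even if the raters assigned labels at random with their own base rates.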
The performance of each model for the prediction of UPDRS category is shown in Table I. We see that LR and the two SVMs obtain the best accuracy and AUC scores of 0.79 on the training data, though the SVMs have better sensitivity with LR obtaining better specificity. NB is not competitive for this prediction task. The test accuracy (estimated using LOO-CV) drops to 0.76 for both SVM models, whilst LR retains its accuracy of 0.79.
In Figure 3 we show each time series plotted in feature space after the dimensionality reduction, colored according to category. We also show the decision boundaries of each method: an unbroken line for NB, dashed for SVM-R, dash-dotted for SVM-L, and dotted for LR.
Our second task is the prediction of Parkinson's disease itself based upon these features. The performance of each model for this second task is shown in Table II. SVM-L obtained the best accuracy of 0.69 and the best specificity of 0.75. However, the simple NB model obtained the best sensitivity of 0.81 and AUC score of 0.69. Both LR and SVM-R were not competitive for this task.
When estimating the test error, the simplicity of the NB model allowed it to retain an accuracy of 0.63 whilst SVM-L dropped to 0.57. It appears that, in this relatively small dataset, SVM-L is highly reliant on a few key data points.
A plot of the time series in feature-space, colored by category, and the decision boundary of each method is displayed in Figure 4.

V. DISCUSSION
In a pilot sample of 70 finger-tapping test videos, we showed reasonable performance in predicting moderate to severe bradykinesia. The estimated test accuracy of 0.79 (using LR) is promising in light of the level of agreement between expert clinical raters (0.73). We also note that disagreement between the automated method and clinical experts may arise when either i) the clinician is correct and the automated test is wrong, or ii) the clinician is incorrect and the automated test is right. Given that prior literature casts doubt on the ability of human experts to accurately evaluate subtle traits [4], [25], case ii) is plausible, so the reported accuracy may underestimate how well we truly classify bradykinesia.
The method was less successful at predicting the presence of a Parkinson's disease diagnosis: NB obtained an estimated test accuracy of 0.63. This poorer performance is to be expected, given that bradykinesia is only one symptom within a more comprehensive set of clinical diagnostic criteria.
The approach used here has the potential to provide widely available and low-cost bradykinesia detection, without the requirement for new hardware or for patients to directly interact with smartphone apps or computer programs. This is a fundamental difference from previously published methods to detect or assess bradykinesia [4], [8]. An automated method broadens access to the measurement of bradykinesia (currently the preserve of a small group of clinicians, principally neurologists). For example, allowing family doctors and medical nurse practitioners to screen for and monitor the phenomenon has potential resource benefits. Furthermore, the use of ubiquitous technology means that the approach may be suitable in a home setting to monitor progression of Parkinson's disease. In addition, it might also be useful for monitoring other conditions in which there are changes in movement over time, such as rheumatoid arthritis, in which common symptoms include decreased range of motion and joint stiffness [26], [27].
Whilst initial results appear promising, our small sample size means that classification using LR, SVMs, and NB produced conservative decision boundaries. A larger sample size would allow us to determine whether there was any true local structure in the feature space.
A larger sample size would also allow us to improve the usefulness of the system by estimating the UPDRS score directly, rather than the binary categorization undertaken here. A larger validation study is therefore necessary and has been initiated by the study team.
Furthermore, the approach taken here is likely sub-optimal in two respects. First, spatial and angular information is discarded at each frame. This has the advantage of reducing the dimensionality of the signal so that real-time processing, even on modest hardware, is practicable. Second, the hand-selection of candidate features was entirely subjective and may have missed important characteristics in the time series. Additional data would allow more sophisticated approaches to automatically learn pertinent features (cf. [28]).

VI. CONCLUSION AND FUTURE WORK
We have described and demonstrated an automated method to classify the presence of bradykinesia via smartphone video signals. In our small pilot study we have shown good agreement with expert clinicians. Further improvements may be possible via more sophisticated analyses, but this requires further training data. A larger validation study of this technology is currently under development.