Validation of an automated sleep detection algorithm using data from multiple accelerometer brands

This study evaluated the criterion validity of an automated sleep detection algorithm applied to data from three research‐grade accelerometers worn on each wrist with concurrent laboratory‐based polysomnography (PSG). A total of 30 healthy volunteers (mean [SD] age 31.5 [7.2] years, body mass index 25.5 [3.7] kg/m2) wore an Axivity, GENEActiv and ActiGraph accelerometer on each wrist during a 1‐night PSG assessment. Sleep estimates (sleep period time window [SPT‐window], sleep duration, sleep onset and waking time, sleep efficiency, and wake after sleep onset [WASO]) were generated using the automated sleep detection algorithm within the open‐source GGIR package. Agreement of sleep estimates from accelerometer data with PSG was determined using pairwise 95% equivalence tests (±10% equivalence zone), intraclass correlation coefficients (ICCs) with 95% confidence intervals and limits of agreement (LoA). Accelerometer‐derived sleep estimates except for WASO were within the 10% equivalence zone of the PSG. Reliability between data from the accelerometers worn on either wrist and PSG was moderate for SPT‐window duration (ICCs ≥ 0.65), sleep duration (ICCs ≥ 0.54), and sleep onset (ICCs ≥ 0.61), mostly good for waking time (ICCs ≥ 0.80), but poor for sleep efficiency (ICCs ≥ 0.08) and WASO (ICCs ≥ 0.08). The mean bias between all accelerometer‐derived sleep estimates worn on either wrist and PSG was low; however, wide 95% LoA were observed for all sleep estimates, apart from waking time. The automated sleep detection algorithm applied to data from Axivity, GENEActiv and ActiGraph accelerometers, worn on either wrist, provides comparable measures to PSG for SPT‐window and sleep duration, sleep onset and waking time, but a poor measure of wake during the sleep period.


| INTRODUCTION
Numerous studies have linked disturbed sleep to adverse health outcomes including increased risk of obesity and diabetes (Reutrakul & Van Cauter, 2018), cardiovascular disease (Cappuccio et al., 2011), and mental health conditions (João et al., 2018); however, the quality of evidence depends on the validity of the measurement of sleep parameters. Most epidemiological studies linking poor sleep with negative health outcomes rely on a single, retrospective self-report of sleep (Girschik et al., 2012), which is prone to recall bias (Lauderdale et al., 2008). Polysomnography (PSG), the 'gold standard' for sleep assessment, is expensive and labour-intensive (Sadeh, 2015), and hence not feasible for large-scale studies. The use of wrist actigraphy devices such as Actiwatch is cost-effective, non-intrusive, and allows continuous recording of activity over days or weeks under free-living conditions (Martin & Hakim, 2011).
Wrist-worn accelerometers that collect raw data have become increasingly used for physical activity assessment in large population-based studies (e.g., National Health and Nutrition Examination Survey [Loprinzi & Cardinal, 2011], UK Biobank [Doherty et al., 2017]). The three research-grade raw data accelerometer brands most widely used in epidemiological studies are the Axivity (Axivity Ltd., Newcastle, UK), ActiGraph (ActiGraph LLC, Pensacola, FL, USA), and GENEActiv (Activinsights Ltd., Kimbolton, Cambridgeshire, UK). Implementation of a 24-h protocol in these studies means that they could be used to assess sleep and physical activity with a single device in a free-living setting.
Availability of raw data also potentially improves comparability among different accelerometer brands and enables the development and application of novel algorithms. In the early devices, due to limited memory and battery capacity, data were pre-processed onboard the device into the form of 'counts' through proprietary algorithms, which complicates comparability of data from different devices. Some of the most widely used count-based sleep detection algorithms include those developed by Sadeh et al. (1994) and Cole et al. (1992). Recently, van Hees et al. (2015) developed a sleep detection algorithm based on angular wrist rotation measured with raw acceleration data. Sleep is defined as sustained inactivity, determined as the absence of change in wrist rotation greater than 5° for 5 min, or a user-defined duration, within a defined sleep window (starting at sleep onset and ending when waking up) (van Hees et al., 2015). The sleep window can be obtained from sleep onset and offset timings participants record in their sleep log (sleep log method) or from an automated sleep window detection (the Heuristic algorithm looking at Distribution of Change in Z-Angle, HDCZA [van Hees et al., 2018]). This algorithm is available as part of an open-source software package which combines sleep and activity data over the 24-h day, can be applied to raw accelerometer data irrespective of accelerometer brand, and is now widely used in the research community (Jones et al., 2019; Wendt et al., 2020). However, the sleep algorithm has undergone limited validation to date and parameters such as wake after sleep onset (WASO) have not been validated. Additionally, the impact of handedness on the accelerometer-derived sleep estimates compared to PSG has not been assessed.
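The sustained-inactivity rule described above can be illustrated with a minimal Python sketch. It assumes the wrist angle has already been averaged into 5-s epochs and uses a 5° threshold over at least 5 min; the function names, parameter names, and example values are hypothetical, and the sketch does not reproduce the GGIR implementation or its preprocessing.

```python
import math


def z_angle_deg(ax, ay, az):
    """Angle of the accelerometer z-axis relative to the horizontal plane, in degrees."""
    return math.degrees(math.atan2(az, math.hypot(ax, ay)))


def sustained_inactivity(epoch_angles, epoch_s=5, angle_threshold=5.0, min_minutes=5):
    """Flag epochs that belong to a bout of at least `min_minutes` in which the
    angle never changes by more than `angle_threshold` between successive epochs."""
    min_epochs = (min_minutes * 60) // epoch_s
    n = len(epoch_angles)
    flags = [False] * n
    start = 0  # first epoch of the current candidate bout
    for i in range(1, n + 1):
        # A bout ends at the end of the recording or when the angle jumps.
        moved = i == n or abs(epoch_angles[i] - epoch_angles[i - 1]) > angle_threshold
        if moved:
            if i - start >= min_epochs:
                for j in range(start, i):
                    flags[j] = True
            start = i
    return flags


# Hypothetical data: 5 min still, 50 s in a different posture, 5 min still.
angles = [10.0] * 60 + [40.0] * 10 + [12.0] * 60
flags = sustained_inactivity(angles)
```

In this example the two 5-min still periods are flagged as sustained inactivity, while the 50-s period between them is too short to qualify.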
The primary aim of the present study was to validate an automated sleep detection algorithm (HDCZA) when applied to data from the ActiGraph, Axivity, and GENEActiv accelerometers worn simultaneously on both wrists against PSG in a healthy adult population. The secondary aim was to compare the automated sleep window detection and the sleep log method.

| Procedure
Participants provided informed consent on a separate day prior to the PSG assessment. Demographic details such as age, sex, ethnicity, medical history, and habitual sleep duration were recorded. Participants' height and weight were measured and recorded to the nearest 0.5 cm and 0.5 kg, respectively. Each participant self-reported their handedness.
The PSG assessment took place on a weeknight. Participants arrived early in the evening and were fitted with three accelerometers on each wrist: the GENEActiv Original (ActivInsights Ltd, Cambridgeshire, UK), Axivity AX3 (Axivity Ltd, Newcastle, UK), and ActiGraph GT9X Link (ActiGraph LLC, Pensacola, FL, USA). For comfort reasons, two of the devices (the Axivity and ActiGraph) were taped together to reduce the number of straps on each wrist. The relative position of the three devices on a given wrist was randomised between the participants but consistent between wrists for each participant. After fitting the accelerometers, participants were prepared for the PSG assessment in accordance with the American Academy of Sleep Medicine (AASM) guidelines (Berry et al., 2012) by a trained technician.
The recording began when participants expressed willingness to go to bed and ended the following morning (usually between 6:00 and 7:00 a.m.). In the morning, participants were asked about the time they tried to sleep, fell asleep and woke up. To assess the extent to which sleep duration in the laboratory setting was representative of habitual sleep duration, all participants were fitted with a GENEActiv accelerometer on their non-dominant wrist for 8 days. Participants completed a sleep log for the days they wore the device.
The PSG data were processed using Philips Respironics software. Sleep scoring began at 'lights out' and ended at 'lights on'. Sleep parameters were scored using the AASM criteria (Berry et al., 2012) by a single specialist sleep technologist. This included use of 30-s epochs for sleep staging, assigning epochs a state of sleep or wake, and documenting and generating indices of the frequency of limb movement events.

| Accelerometers
The GENEActiv Original, Axivity AX3, and ActiGraph GT9X Link are triaxial accelerometers with a dynamic range of ±8 g, where g is equal to the Earth's gravity. All accelerometers were configured to record at a frequency of 100 Hz and initialised using the same personal computer (PC). However, the accelerometers could not be precisely time synchronised with the PSG (accelerometers initialised on a different PC). GENEActiv devices were initialised, and data were downloaded and saved in raw format as .bin files using GENEActiv PC software. ActiGraph Link GT9X devices were initialised and downloaded using ActiLife version 6.13.3, saved in raw format as .gt3x, then converted to .csv format for data processing.
All accelerometer files were processed using R-package GGIR version 2.5 (https://cran.r-project.org/web/packages/GGIR/) (Migueles et al., 2019). Signal processing in GGIR includes autocalibration using local gravity as a reference, detection of sustained abnormally high values, detection of non-wear, calculation of the average magnitude of dynamic acceleration (i.e., the vector magnitude of acceleration cor-
When a sleep log is used to guide the algorithm in GGIR, it is also possible to calculate sleep onset latency (SOL) if onset reported by participants corresponds to intention to fall asleep as opposed to the timing of sleep onset. In this study, the results reported for the sleep log condition used timings of sleep onset to guide the algorithm.
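As a small illustration of the SOL calculation mentioned above, the following Python sketch takes SOL as the time between the reported intention to fall asleep and the detected sleep onset. The function name and the timestamps are hypothetical and are not taken from the study data or from GGIR.

```python
from datetime import datetime


def sleep_onset_latency_min(intention_to_sleep, sleep_onset):
    """Minutes between the reported intention to fall asleep and detected sleep onset."""
    return (sleep_onset - intention_to_sleep).total_seconds() / 60


# Hypothetical timestamps spanning midnight.
intention = datetime(2022, 1, 10, 23, 40)
onset = datetime(2022, 1, 11, 0, 5)
sol = sleep_onset_latency_min(intention, onset)
```

Using full date-time values rather than clock times alone avoids errors when the two events fall on either side of midnight.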
In addition, accelerometer data were also processed with onset indicating intention to fall asleep. These data were used to demonstrate agreement between accelerometer SOL and PSG SOL, thus only results for SOL are presented from these outputs. To demonstrate the agreement between PSG and count-based sleep detection, the zero-crossing algorithm developed by Sadeh et al. (1994), also available in GGIR, was applied to the data.
TABLE 1 Descriptive statistics of sleep outcomes (mean [SD]
(Bland & Altman, 1986). The level of reliability was classified as 'poor' (ICC < 0.5), 'moderate' (ICC 0.5-0.75), 'good'
Paired t tests were conducted to assess whether habitual sleep duration measured using 7-day accelerometry with and without use of a sleep log was different from sleep duration defined by 1-night PSG.
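The mean bias and 95% limits of agreement (Bland & Altman, 1986) used to compare accelerometer and PSG estimates can be sketched in Python as follows. This is a minimal illustration that assumes one paired value per participant; the function name and the example durations are hypothetical and are not study data.

```python
from statistics import mean, stdev


def bland_altman(device, reference):
    """Return (mean bias, lower 95% LoA, upper 95% LoA) for paired measurements.

    Differences are taken as device minus reference, so a positive bias means
    the device overestimates relative to the reference method."""
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical sleep durations in minutes for four participants.
accelerometer_min = [400, 420, 380, 410]
psg_min = [390, 430, 385, 400]
bias, loa_lower, loa_upper = bland_altman(accelerometer_min, psg_min)
```

A low mean bias with wide limits of agreement indicates that the methods agree on average but can differ substantially for an individual night.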
Additionally, sensitivity, specificity, and accuracy of the binary classification of sleep (any sleep stage) and wakefulness were derived from epoch-by-epoch comparison to PSG. However, it should be noted that it was not possible to precisely synchronise the data, and raw PSG data were not available for one participant. Accelerometer 5-s
TABLE 2 Intraclass correlations, agreement and equivalence zones between accelerometers worn on the non-dominant wrist and polysomnography (n
Table S1. The median wear duration of the accelerometer during the habitual sleep assessment was 6.8 days and 6.0 nights.
Figure 1 and in Figure S1a-

| Comparison of sleep estimates by PSG and the sleep log method
The descriptive statistics of sleep estimates (mean [SD] or median [IQR] per night) measured by PSG and accelerometers worn on the non-dominant and dominant wrists using the sleep log method are shown in Table S2. The pattern of the results using the sleep log method was largely unaltered compared to the automated sleep window detection. Table S4 shows ICCs, mean bias, 95% LoA, and equivalence zones between each accelerometer worn on both wrists and PSG for SOL.
Reliability was poor (ICCs of −0.24 to −0.11) between each accelerometer brand worn on either wrist and PSG. SOL was underestimated by accelerometers by up to 5 min with wide 95% LoA (up to ±49 min) observed. Regardless of accelerometer brand or wrist placement the equivalence zones for SOL ranged from 40% to 73%.

| Comparison of sleep estimates by PSG and the zero-crossing algorithm
The descriptive statistics of sleep estimates (mean [SD] or median [IQR] per night) measured by PSG and accelerometers worn on the non-dominant and dominant wrists using the zero-crossing method are shown in Table S5.

| Epoch-by-epoch comparison to PSG
Relative to PSG, the overall accuracy of the algorithm was 84%. Sensitivity (identifying sleep as sleep) was generally high at 92%, while specificity (identifying wake as wake) was relatively low at 20%.
Results were similar for all brands of accelerometers on either wrist (Table S7).
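The epoch-by-epoch metrics reported here follow the usual definitions for a binary sleep/wake classification, with PSG as the reference. A minimal Python sketch, assuming both series are already aligned and coded 1 for sleep and 0 for wake (the function name and example values are hypothetical):

```python
def epoch_agreement(psg, accelerometer):
    """Return (sensitivity, specificity, accuracy) with PSG as the reference.

    Both inputs are equal-length sequences of 1 (sleep) or 0 (wake) per epoch."""
    pairs = list(zip(psg, accelerometer))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # sleep scored as sleep
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # wake scored as wake
    fp = sum(1 for p, a in pairs if p == 0 and a == 1)  # wake scored as sleep
    fn = sum(1 for p, a in pairs if p == 1 and a == 0)  # sleep scored as wake
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(pairs) if pairs else float("nan")
    return sensitivity, specificity, accuracy


# Hypothetical night of 10 epochs: 8 PSG sleep epochs and 2 PSG wake epochs.
psg_epochs = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
acc_epochs = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
sensitivity, specificity, accuracy = epoch_agreement(psg_epochs, acc_epochs)
```

Because sleep epochs greatly outnumber wake epochs in a typical night, accuracy is dominated by sensitivity, which is why high accuracy can coexist with low specificity.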

| Comparison of 1-night PSG and 7-day accelerometry
Habitual sleep duration measured using 7-day accelerometry did not significantly differ from 1-night PSG assessment. The mean (
An important finding of this study was that overall agreement between PSG and accelerometers for all sleep estimates, irrespective of the accelerometer brand or placement, was similar. Given that handedness had little impact on accelerometer-derived sleep estimates, researchers could decide on wrist placement based on other outcomes of interest. For instance, in physical activity research the non-dominant wrist is commonly used for device placement (Dieu et al., 2017). Of note, although until recently hip placement has been most widely used for physical activity assessment (Ainsworth et al., 2015), hip-mounted accelerometers (GT3X+) demonstrate lower agreement with PSG than wrist-worn accelerometers for sleep duration, sleep efficiency, WASO, and SOL (Full et al., 2018; Slater et al., 2015; Zinkhan et al., 2014).
et al. (1994) or Cole et al. (1992) (Quante et al., 2018; Slater et al., 2015). In this study specificity for WASO was lower (20%) than in previous studies. One explanation for low specificity could be that a few participants had very little wake during the sleep period. In accelerometry-based assessment, wrist movement indicates wakefulness and immobility indicates sleep. However, immobility is possible during periods of wakefulness and as such can be mistakenly identified as sleep periods by accelerometers. Bland-Altman plots also revealed a pattern in the bias between accelerometer-measured WASO and PSG such that accelerometer measures showed smaller differences compared to PSG when WASO decreased. Therefore, more wakefulness in the sleep period will likely result in misclassification of WASO.
Implementing the algorithm developed by Sadeh et al. (1994) based on zero-crossing in GGIR to generate sleep estimates demonstrated poor detection of sleep episodes within the SPT-window, resulting in substantial underestimation of WASO.
Further, this study compared sleep estimates from accelerometers with PSG using sleep onset and offset timings to guide the algorithm.
Overall, the use of a sleep log did not improve the level of agreement of sleep estimates between accelerometers and PSG. Although the participants were asked about their sleep onset and waking times not long after awakening, it appears that estimating these timings by self-report is challenging, particularly estimating timing of sleep onset.
A potential value of a sleep log is that when the timing of intention to fall asleep is recorded, it is possible to calculate SOL.  (Plekhanova et al., 2020).

| STRENGTHS AND LIMITATIONS
The strengths of this study include the simultaneous comparison of three research-grade accelerometers worn on each wrist with PSG.
Importantly, sleep data were generated using an automated sleep detection algorithm that can be applied to raw accelerometer data irrespective of accelerometer brand and is available as part of the open-source software. This method allows identical data handling and facilitates comparability of the results. The limitations of this study include the small sample size, which makes it hard to generalise findings beyond the specific population recruited for this study.
The sample consisted of healthy volunteers; thus, the findings cannot be generalised to individuals with sleep disorders. While the comparison of wrist accelerometers to the 'gold standard' PSG is a strength, 1 night of PSG assessment may not represent participants' habitual sleep due to the first-night effect. However, the habitual sleep duration of the participants was similar to that of 1 night of PSG as indicated by 7-day accelerometry. The influence of the first-night effect should also be a minor issue when evaluating the agreement between the measurement methods. Also, the GENEActiv was worn adjacent to the Axivity, which was taped on top of the ActiGraph; this may have affected the agreement between the three devices. Future studies should randomise the positioning of all devices and/or consider different set-ups of accelerometers to establish the impact on agreement between the devices and PSG.

| CONCLUSION
This study suggests that the automated sleep detection algorithm HDCZA applied to the Axivity, GENEActiv and ActiGraph acceler-

AUTHOR CONTRIBUTIONS
All authors developed the study concept and contributed to the study design. Tatiana Plekhanova collected and analysed the data and drafted the manuscript. Tatiana Plekhanova, Alex V. Rowlands, Charlotte L. Edwardson contributed to data interpretation. All authors contributed to critical review of the manuscript and approved the final version for submission.

ACKNOWLEDGMENTS
The authors would like to acknowledge Somaya Turk for assisting with data collection and thank all participants who volunteered to take part in this study. The research was supported by the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre (BRC). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health.
No external sources of funding were accessed.

CONFLICT OF INTEREST
The authors have declared no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.