Comparison of data analysis procedures for real-time nanoparticle sampling data using classical regression and ARIMA models

ABSTRACT Real-time monitoring is necessary for nanoparticle exposure assessment to characterize the exposure profile, but the data produced are autocorrelated. This study was conducted to compare three statistical methods used to analyze data, which constitute autocorrelated time series, and to investigate the effect of averaging time on the reduction of the autocorrelation using field data. First-order autoregressive (AR(1)) and autoregressive-integrated moving average (ARIMA) models are alternative methods that remove autocorrelation. The classical regression method was compared with AR(1) and ARIMA. Three data sets were used. Scanning mobility particle sizer data were used. We compared the results of regression, AR(1), and ARIMA with averaging times of 1, 5, and 10 min. AR(1) and ARIMA models had similar capacities to adjust autocorrelation of real-time data. Because of the non-stationary of real-time monitoring data, the ARIMA was more appropriate. When using the AR(1), transformation into stationary data was necessary. There was no difference with a longer averaging time. This study suggests that the ARIMA model could be used to process real-time monitoring data especially for non-stationary data, and averaging time setting is flexible depending on the data interval required to capture the effects of processes for occupational and environmental nano measurements.


Introduction
Nanoparticle exposure assessment in the workplace has been performed by many research groups because of the rapid growth of nanotechnology and the probable risks of nanoparticle exposure. Measurement strategies, sampling devices, and data analysis methods vary among studies. Therefore, standardization of measurement strategies and data analysis are necessary for risk assessment, epidemiology, and exposure control measures as well as for determining compliance with occupational exposure limits in the near future [4,19,34,36]. Sampling strategies have been published by many organizations and have been specified in detail [2,5,[10][11][12]14,26,27]. However, insufficient guidelines regarding data processing have been formulated, and researchers are using their own methods for analysis [20,37].
Several methods are available for airborne nanoparticle sampling in the workplace. Among them, real-time monitoring is mainly used for nanoparticle exposure assessments. Real-time data constitute a time series, which is a sequence of observations taken sequentially in time. There are advantages using a concentration profile over time to determine the source of particles, their mobility, and simplicity have advantages when employing a real-time monitor. However, the interpretation of time-series data requires caution because real-time data are highly correlated. This phenomenon, termed autocorrelation, means that the set of measurement results are not independent, that is, there is significant autocorrelation among samples in a series.
Due to the basic assumption of independence among data when employing statistics [23], it is necessary to adjust for autocorrelation when interpreting time-series data in exposure assessments [16]. Two methods are commonly used to control for autocorrelation. One entails modeling the autoregressive component of time-series data, an approach adopted by the first-order autoregressive (AR(1)) and autoregressive-integrated moving average (ARIMA) models, and the other is the use of a longer averaging time. AR(1) and ARIMA models are not very different. AR(1) could be one special case of the ARIMA model. Several studies have accounted for autocorrelation using an AR(1) model [9,22,38] in atmospheric environmental monitoring and modeling. When using an AR(1) model, the data should be stationary. Therefore, an artificial transformation of non-stationary data into stationary data is needed before data processing. In ARIMA(p, d, q) models [3], a differencing term (d) transforms stationary data, and its applicability to handling nanoparticle exposure assessment data has been suggested in recent studies [4,12,20]. The use of a longer averaging time may lead to a reduction in the autocorrelation of real-time monitoring data, but it could lose specific events, especially sudden emissions, and thereby impair the results of statistical analysis [16,22].
In the real-time monitoring of airborne nanoparticles, the time interval varies according to the principle of operation and type of instrument. For example, the P-trak (Model 8582, TSI Inc., USA) interval is 1 s to 30 min, whereas the TSI Optical Particle Sizer (OPS: Model 3330, TSI Inc., USA) interval is over 1 s. In the scanning mobility particle sizer (SMPS), the time interval is different from that of other real-time monitors because of its size bin. The sampling time varies according to the number of size bins, iteration, and range of measurements. One measurement cycle of a simple SMPS (Nanoscan Model 3910, TSI Inc., USA) is 1 min (fixed). The measurement interval for the full-size SMPS (Model 3936L75, TSI, Inc., USA) that we used in a previous study was 135 s (120 s upscan and 15 s retrace time) for one cycle, for particles from small to large in size (14.6-661 nm) [12] and one cycle of other SMPS was 406 s with 10 s of retrace time (SMPS+C, model 5.403, Grimm GmbH, Germany) for particles from large to small (11.1-1083.3 nm) [1].
Although autocorrelation should be considered first when handling real-time monitoring data, most previously reported studies of nanoparticle exposure assessment in the workplace have ignored it and used a central tendency (arithmetic mean, geometric mean, and median), measure of dispersion (range, interquartile range, standard deviation, and geometric standard deviation), or graphical presentation [6,13,24,25,31,32,35].
The ARIMA model was introduced in a study that considered autocorrelation using simulations and provided three examples using real-time data [20]. Therefore, the verification of statistical methods using real-time monitoring field data is necessary.
The aims of this study were (1) to compare the results of statistical methods (classical regression and the AR(1) and ARIMA models) and (2) to investigate the effect of different averaging times to account for autocorrelation using field data and suggest possible data interpretation methods for real-time nanoparticle monitoring data.

Statistical analysis
Classical regression, AR(1), and ARIMA model were used to apply for analyzing real-time monitored data which is time series in this study. AR(1) was used to perform for atmospheric research to account for autocorrelation [9,16,22]. ARIMA was recently suggested for nanoparticle exposure assessment data analysis [4,12,20].
The first step in any time-series data analysis is to plot the observed data with respect to time. This enables the determination of any trends, seasonality (in long-term data), discontinuities, and outliers in the data. Most of the theory regarding the analysis of time series considers stationary time series, and for this reason, the analysis requires the conversion of a non-stationary series into a stationary series. It is important to check whether the data are stationary because, in contrast to the ARIMA model, the data should be stationary for the AR(1) model. The ARIMA model should also use stationary data, but the data do not need to be transformed by the researcher because they are transformed to stationary data using the differencing term I(d) during the ARIMA procedure. Several methods can be used to transform the data into stationary data, such as the use of logarithms, square roots, and differencing. To check that the data are stationary, a unit root test should be performed. A Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test is used to check for a stationary series in autoregressive models [21]. Once the data are proven to be stationary, the following process can be followed. We separated AR(1) and ARIMA models because the ARIMA model is appropriate for real-time monitored data with relatively short sampling duration because it is difficult to get the stationary data and this model could be applied to both stationary and non-stationary.

AR(1) model
The AR(1) model is a special case of the ARIMA model. AR(1) is suitable for stationary data (little varied with the time) and is used as a simple model. An autocorrelation function (ACF) is independent of the scale of the measurement of the time series [3]. To determine the ACF coefficient of an AR(1) model, acf() was used to calculate the first lag of the ACF for imputing the coefficient. The ACF and partial autocorrelation function (PACF) graphs are shown in Figure S-1-S-3 for each workplaces. The gls() function was used for generalized least squares (GLS) regression. The estimation of GLS is commonly used in time-series regression [8]. The corAR1() function should be used to represent an autocorrelation structure of order 1 [33]. As mentioned above, it is necessary to check that the data are stationary. We performed differencing when the data were not stationary, but it was impossible to obtain estimated values for each task. Therefore, in this study, we ignored the need to check for stationary data when comparing with the ARIMA model, in which the data are automatically checked to confirm stationary, and an estimated value is provided.

ARIMA model
ACF and the PACF can be used to determine the AR and moving average (MA) terms. To determine the best ARIMA model according to the Akaike information criterion (AIC), the auto.arima() function conducts a search for all possible models. AIC is used to measure both the fit and unreliability of the model [30]. Here, it was used to obtain the values of p, d, and q, which were the coefficients for ARIMA(p, d, q) model. The parameters for ARIMA model are classified where p is the order of the AutoRegressive model, d is for order of differencing, and q is the order of the Moving Average model [7,17,18]. In this study, we used the auto.arima() function to determine the order of those parameters for the ARIMA model because it is difficult and ambiguous to determine the model using the ACF and PACF graphs for non-experts in ARIMA.
Data analysis was conducted using R (version 3.0.2; R Development Core Team, Vienna, Austria). An alpha level of 0.05 was used to determine statistical significance throughout the study. Classical regression analysis and the AR(1) and ARIMA models were used to examine differences among the work tasks or working statuses. For classical regression analysis, the lmtest package (version 0.9-32) was used. To run the AR(1) and ARIMA models, nlme (version 3.1-113) and forecast (version 4.8) packages were used [15,17,33].
Sampling site, strategy, and real-time measurement parts are described in Supplemental materials.

Results
Table S-1 summarizes the characteristics of the workplaces investigated with respect to the type of particles and ventilation and the particle number concentrations.
Time-series plots of workplaces A, B, and C are presented in Figures 1-3, respectively. As shown in Figure 1, the number concentration of engineered nanoparticle (ENP) measured by SMPS in workplace A varied among processes and over time. Three processes were identified in workplace A, indicated by the gray boxes with numbers. The uncolored part of Figure 1 shows the concentration profile for non-experimental periods. The profile fluctuated depending on the process being operated, with the highest concentration being during reaction (peak 3 in Figure 1), followed by sonication (peak 2), and weighing (peak 1) of Fe 2 O 3 and Ti particles. The reaction process was performed in an open space with no control measures. The boiling process proceeded at 100°C using a magnetic stirrer and without wrapping. The sonication process was also performed without wrapping. The weighing process was performed with a vertical partition installed outside of microbalance. Part 4 of Figure 1 shows the number concentration measured during off-duty time when no experimental work was underway. There was an unexpected concentration drift from 20:30 to about 00:00, which sometimes occurs when a real-time monitor is used and cannot be easily explained. The uncolored areas show time periods when no experimental work was ongoing, but researchers were in the laboratory.   Figure 2 shows a time-series plot of workplace B during one workshift. Two processes were in operation. For one, a worker collected the ENP from a cyclone inside the manufacturing facility (Figure 2(b)), and the other was a process that sieved the collected nanoparticles (Figure 2(d)). The sieving process was performed automatically by vibrating equipment with an local exhaust ventilation (LEV) system. Other periods in Figure 2 where the facilities were operated are labeled as (1). Part (3) of Figure 2 is the lunch time, during which there was a lower concentration than during the operation periods.

Comparison of statistical approaches
To find the best model for data, we compared the AIC values (Table S-2). The AIC values from classical regression were higher than AR(1) and ARIMA models. The AIC values for AR(1) and ARIMA were very similar. AIC values of 1 min data for classical regression are 48,333 and for both AR(1) and ARIMA are 45,890 in workplace A. For workplace B, AIC values of 1 min data of classical regression, AR(1) and ARIMA are 7211, 5855, and 5856, respectively. AIC values for workplaces are 23,265 (classical regression), 21,116(AR(1)), and 21,094 (ARIMA) by 1 min data. In 5 and 10 min averaging time data show that also the AIC values for classical regression are higher than AR(1) and ARIMA.
To compare the results of a classical regression with the AR(1) and ARIMA models for workplace A, the estimated parameters are summarized in Table 1. We set the offduty time during day 2 (from 1:00 to 8:00) as the background, because there was a stable concentration in workplace A at this time. The data show very high autocorrelation (autocorrelation coefficients for the first lag were 0.958 for 1 min, 0.898 for 5 min, and 0.900 for 10 min data). In the classical regression analysis, all tasks (weighing, the reaction and sonication processes, and non-experimental intervals during the day) produced significantly different (p < .01) results compared with the background. The AR(1) models showed no significant difference between the results obtained during the weighing process (p = .06) , Others: Non-experimental time. a ACC: Autocorrelation coefficient for the first lag. b A non-zero mean is allowed after the series has been differentiated. Therefore, the estimate from the ARIMA model might be an absolute term. and the background values. The ARIMA model showed no significant difference between the results obtained during the weighing process (p = .27) and non-experimental time (p = .07) compared with the off-duty background period. There was no background estimate value in the ARIMA model due to the first-order differentiation that was used to make the data stationary. Therefore, a background estimate was not calculated, and the other estimated value was an absolute value. Table 2 presents the results of three different analysis methods. For the classical regression, working, sieving, and collecting were significantly different compared with the lunch time period, which was used as the background reference. The data show very high autocorrelation (autocorrelation coefficients for the first lag were 0.987 for 1 min, 0.925 for 5 min, and 0.844 for 10 min data). The AR(1) model indicated no significant difference between the background and the results obtained during the sieving process (p = .19). In the ARIMA model, the results obtained during the sieving process (p = .50) and the working time (p = .07) were not significantly different from the lunchtime background values.  Sieving. a ACC: Autocorrelation coefficient for the first lag. b A non-zero mean is allowed after the series has been differentiated. Therefore, the estimate from the ARIMA model might be an absolute term.
To compare the results of the classical regression, and AR(1), and ARIMA in workplace C, the estimated parameters are shown in Table 3. The data show very high autocorrelation (autocorrelation coefficients for the first lag were 0.887 for 1 min, 0.797 for 5 min, and 0.456 for 10 min data).
Four categories were determined for workplace C: daytime shifts, meals (lunch, dinner), night shift, and the background (the non-working period during the night shift). From 00:32 to 2:12, there was no work activity in workplace C. Therefore, this non-working period was used to set the background concentration in the analysis. Significant differences were found between the background and the results obtained during the working time (shifts 1, 3) in the classical regression and in the AR(1) and ARIMA (1, 1, 2) models ( Figure 3). However, no significant differences were observed between the background and the results obtained during night working in the AR(1) and ARIMA models (p = .20, p = .22, respectively). There was a significant difference between the background and the results obtained during the night shift (p < .01) in the classical regression model. All three analyses found no difference between the background and the results obtained during meal times. ACC: Autocorrelation coefficient for the first lag. b A non-zero mean is allowed after the series has been differentiated. Therefore, the estimate from the ARIMA model might be an absolute term. Table 1 shows the results for the three different statistical models with different averaging times in workplace A. The classical regression model found significant differences between the results for 1, 5, and 10 min. The results of the AR(1) and ARIMA models showed significantly higher values for the reaction and sonication processes than for the off-duty period at all averaging times. In contrast, there was no significant difference between the results obtained for the off-duty period and the weighing process and non-experimental time according to the AR(1) and ARIMA models, with the exception of the result obtained from AR(1) for the 1-min average time period between the off-duty period and the nonexperimental time (p = .02). Table 2 shows the results for the three different statistical models with different averaging times in workplace B. The results of the classical regression analysis for working time, the sieving process, and collecting ENP from facilities were significantly different from those for the lunch time period (p < .01). However, in the AR(1) and ARIMA model results, only the collection of ENP was significantly different from the lunchtime results. The AR(1) and ARIMA model results for the lunch period were not significantly different from those for the sieving process. The results for the day shift from the AR(1) model were significantly different from those for lunch time (p = .02), unlike in the ARIMA model. Table 3 shows the results for the three different statistical models with different averaging times in workplace C. Meal times were used as a background because no work was performed during that time. The results of the classical regression model were significantly different from the background for both night and daytime work periods. The results of the AR(1) and ARIMA models showed significantly higher values for working time than for the background. However, the values derived by the AR(1) and ARIMA models were not significantly higher than the background for the night shift period at all-time intervals.

Discussion
This study compared three different statistical methods using three averaging times (1, 5, and 10 min) to examine nanoparticle exposure assessment in workplaces based on realtime data while taking autocorrelation into account. We used two methods to achieve this. The independence of observed samples is a basic assumption of regression analysis [28], but most of the data gathered from real-time monitoring are not independent. Hence, a classic regression analysis breaks this rule. Therefore we used the AR(1) and ARIMA models, which can process autocorrelated data. A longer averaging time is a further method used to reduce autocorrelation [16].

Comparison of statistical methods
The aim of the study was to compare three different statistical methods, namely classical regression, AR(1), and ARIMA models, for processing time-series data. All of the results obtained from classical regression analysis showed significant differences from the background, except for the comparison of meal times in workplace B with the background, where the difference in nanoparticle concentration was relatively small between background and meal time. Therefore, the use of a classical regression may result in erroneous conclusions as reported in a previous study [4]. As seen in Tables 1-3, the classical regression model showed that all tasks were significantly different from background because it ignored considering autocorrelation.
However, when we used the AR(1) and ARIMA models, the results for the reaction and sonication processes in workplace A, the collection of ENP from facilities in workplace B, and daytime working in workplace C all revealed values that differed significantly from the background values which was reference in analysis. The results for the weighing process in workplace A, the sieving process in workplace B, and the night shift in workplace C were not significantly different from each background because it considered autocorrelation. In workplace B, the sieving process was performed under an LEV system to capture particles, and in workplace C, the work load during the night shift was lower than that during the day. For the non-experimental time in workplace A, the AR(1) model indicated a significant difference compared with the background values, but the ARIMA model found none. This might be a result of the different principles underlying the two approaches.
The ARIMA model has more I(d) terms. All of the sampled data were not stationary. The ARIMA model could determine whether the data were stationary and make the necessary adjustments by estimating a value for each task using the initial data before comparisons were performed while using auto.arima() function in R. When the AR(1) model is used, it is necessary to differencing to make data stationary, but no estimated value which is the intercept value is obtained after differencing. In the previous study, the AR(1) model was used for real-time monitoring data but without considering data stationary [9,16,22,38]. The use of the ARIMA model has been suggested in recent nanoparticle monitoring studies [4,12,24], but has only been applied once [20]. The ARIMA model could be used to realtime monitoring data which is both highly autocorrelated and non-stationary data, but industrial hygienists are not familiar with its use. After checking for stationary data, an estimation of the autocorrelation procedure should be performed, especially for AR (1). The autocorrelation coefficients of first order for 1 min were 0.958, 0.987, and 0.887 for workplaces A, B, and C, respectively.
To decide the parameters for the AR(1) and ARIMA models, the AIC can be used to search for the best-fit model [20]. The AIC is a relative measure of model fit that balances improvement in model fit based on log-likelihood against the number of added model parameters. Thus, it can be used to compare two (or more) models, and the model with the lower AIC value is considered to provide a better fit to the data. To achieve the parsimony coefficient which means the lowest values for each parameter (p, d, q), the order of AR(p) and ARIMA (p, d, q) increases, and the complexity of the correlation structure increases and suggests the production of a parsimonious model to achieve better analysis results [3]. Therefore, a lower parameter (p, d, q) is recommended for ARIMA(p, d, q) modeling.
Background concentration could be measured at different times such as meal time or off-duty time at same place (near-field) or other location which is not affected by certain nanoparticles (far-field) [4]. Determining the background (reference) is important. In this study, we used the near-field background concentration as references because of the low values and stability of nanoparticle concentrations. This could be determined by an industrial hygienist. The measured nanoparticle concentrations during the off-duty period for workplace A, lunch time for workplace B, and during the non-working part of the night shift were used as background values for workplace C.
When the results of the classical regression, which ignores the autocorrelation in the time-series data, were used to summarize the three workplaces, the concentration profiles among the processes were significantly different for the three workplaces (Tables 1-3). A problem arises in that autocorrelation may lead to violations of the assumption underlying classical regression analysis, which dictates that data be independent. Furthermore, if the autocorrelation is ignored, a large degree of uncertainty ensues when considering predicted concentrations [38]. When the mean differences are small or the autocorrelation between samples is high, incorrect results may be obtained [20]. Therefore, the correct statistical methods should be used that correct for the autocorrelation effect. For nanoparticle exposure assessment, the ARIMA model is a more suitable method to compare the effects of a task using time-series data, because most real-time monitoring data are not stationary.

The effect of a longer averaging time
The second aim of the study was to investigate the changes in the results due to differences in the averaging time because a longer averaging time might reduce the autocorrelation for real-time monitoring data [16,29]. We suggested that the ARIMA model is used for real-time data in the previous discussion. As a result, we found that the averaging time did not directly affect the result when we applied the ARIMA model. A shorter averaging time could distinguish short-term variations in the nanoparticle emissions. It should be noted that a longer averaging time could unite different tasks that are undertaken in sequence into one similar exposure pattern, and therefore differences in the exposure pattern might become obscured. An expert judgment is required to categorize the importance of a series of processes when transforming the averaging time.
Rather than reducing autocorrelation, the main goal was to determine an appropriate analysis method for processing real-time monitoring data. There are advantages and disadvantages of different averaging times in time-series data. Houseman et al. [16] and Levy et al. [22] reported that longer averaging times may be inappropriate in microenvironmental studies because they may impair variance estimation. However, elsewhere, longer averaging times were recommended to reduce autocorrelation between measurements [29]. When the averaging time was set to 10 min, the influence of some processes, such as sonication for 7 min (peak 3 in Figure 1) in workplace A and the collection of ENP from facilities in workplace B, could be ignored or diluted in the results.
However, when the interval of a monitoring device is more than 1 minute, that is, 135 s for a full-scale SMPS [12] or 230 s for a short DMA and 406 s for a long DMA with a GRIMM SMPS [1], a longer averaging time should be used to analyze the data with flexibility of time interval. Caution should be exercised in the categorization process because the averaging time can cause the impact of different processes to overlap.

Conclusion
We evaluated three different statistical methods for processing highly autocorrelated realtime monitoring data. Classical regression was not an appropriate analysis method for time-series data due to the effects of autocorrelation. The AR(1) model could be used, but the data should be stationary. Therefore, we suggest that the ARIMA model is most appropriate for the analysis of real-time monitoring data because it accounts for autocorrelation and stationary of data. When using the ARIMA model, averaging times of 1, 5, and 10 min produced the almost same results. Flexible time averaging is suggested because it can be used with a full-size SMPS, which has a sampling interval longer than 1 minute. An averaging time with flexible time interval would allow a series of tasks to be determined, and the exact time used should be determined by the investigator or industrial hygienist.

Disclosure statement
No potential conflict of interest was reported by the authors.