Real-time 7-day forecast of pollen counts using a deep convolutional neural network

Several studies have used regression analyses to forecast pollen concentrations, yet few have applied a deep neural network. This study implements a deep convolutional neural network with great potential to recognize the patterns of pollen phenomena that enable the prediction of pollen concentrations. We train the model on data from 2009 to 2015, using multiple meteorological datasets, satellite data, and processed data reflecting pollen flux as input. The model forecasts pollen counts 1–7 days ahead for the entire year of 2016. In a comparison of daily forecasts with observations, the algorithm obtains a relatively high index of agreement and Pearson correlation coefficient of up to 0.90 and 0.88, respectively. An evaluation of categorical statistics based on defined threshold levels shows satisfactory results. The critical success index of the model forecasts is as high as 0.887 for weed pollen, 0.646 for tree pollen, and 0.294 for grass pollen. Forecasts of grass pollen exhibit the largest decrease in accuracy because of the strong variance in annual pollen concentrations. Forecasts of weed pollen exhibit the greatest consistency, with a 7-day forecast correlation and index of agreement of 0.82 and 0.77, respectively, during the peak season. This is consistent with the stable annual and seasonal trends of weed pollen within the study area. Compared to conventional modeling approaches, the convolutional neural network shows a promising ability to predict pollen multiple days ahead, allowing individuals with allergies to take proper precautions on high-pollen days.


Introduction
Pollen has deleterious effects on human health. Allergic rhinitis is estimated to affect up to 30% of adults and 40% of children [1], and hay fever and allergic asthma as much as 25% of the population [2]. Effective allergen avoidance has shown improvement in the allergy symptoms [1]. For people with allergic symptoms, awareness of where and when elevated concentrations of pollen will occur is critical for their health [3].
Annual pollen concentrations vary considerably due to species-specific and weather-related factors [4]. Several studies have examined the relationships between meteorology, pollen concentrations, and seasonal pollen trends [5][6][7][8]. Temperature has the strongest influence on pollen emissions and concentrations [9]. Several studies have applied regression analyses and models [3,10-14] to predict pollen concentrations. Studies have modeled and predicted pollen counts spatially [15,16] and temporally [17,18] with satisfactory results. Jeon et al. [12] developed the Community Multiscale Air Quality Modeling System pollen model (CPM) to predict oak pollen concentrations. Evaluations of pollen forecasting applications have shown an accuracy of 50% on average [19]. Although a few studies have shown a slight improvement of neural networks over regression analyses in air quality forecasting [20], few have applied them to forecasting pollen counts [21,22].
An artificial neural network (ANN) is a layered structure of algorithms. One form of neural network is the multi-layer perceptron (MLP), which in its most basic form consists of an input layer (the data that feed the neural network), hidden layers (which transform the inputs into information that the output layer can use), and an output layer (which transforms the hidden-layer activations into a scale we can define, such as a classification or regression output). Deep neural networks consist of multiple hidden layers, each of which contains multiple neurons (units that mathematically mimic a biological neuron using activation functions). These networks were inspired by an understanding of the brain, which is an interconnection of billions of neurons. Deep neural networks dealing with nonlinear variables yield higher accuracy than conventional neural networks and regression models [23,24]. One of the most complicated neural networks is the convolutional neural network (CNN) [25], which uses filters to convolve the input data into multiple convolved representations of that data. The motivation for our study is to develop a CNN system that forecasts real-time pollen counts with greater accuracy and less processing time than current models.

Data
We acquired pollen data from the Houston Department of Health and Human Services (HDHHS) archives. The study focused on daily pollen data from 2009 to 2016. The most common pollen producing species were comprised of tree and weed categories. Tree and weed pollen were composed of 25 and 14 species, respectively. Pollen concentrations were measured south of the Houston downtown area. Meteorological data were gathered at the Moody Tower station, 4.5 km east-northeast of the pollen station. Figure 1 exhibits a map of the study area with the station locations. We obtained meteorological data from the Texas Commission on Environmental Quality (TCEQ), which operates the Continuous Ambient Monitoring Sites (CAMS) in various metropolitan areas within the state of Texas. We selected data from CAMS station 695 (Moody Tower, near Downtown Houston) for its close proximity to the HDHHS pollen measurement station. We extracted the data of surrounding stations CAMS001, CAMS053, CAMS409, and CAMS416 as input for missing data from CAMS695 station. The hourly meteorological data were processed to daily intervals to correspond with the pollen data. The data were comprised of mean temperature (Celsius), total precipitation (mm), mean U and V wind components, mean wind speed (m/s), friction, and radiation (Langley). Relative humidity (%) and pressure (mb) were comprised of daily minimum, mean, and maximum measurements.
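The reduction of hourly meteorological records to daily values described above can be illustrated with a minimal sketch. It is not the authors' processing code: the function name `daily_stats`, the record layout, and the rule of skipping missing hours are assumptions made for illustration, and the filling of gaps from neighboring CAMS stations is not shown.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_stats(hourly):
    """Reduce (timestamp, value) hourly records to per-day statistics.

    Missing hours (value is None) are skipped, which is an assumption of
    this sketch. Returns {date: {"min", "mean", "max", "total"}}.
    """
    by_day = defaultdict(list)
    for timestamp, value in hourly:
        if value is not None:
            by_day[timestamp.date()].append(value)
    return {
        day: {
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
            "total": sum(values),
        }
        for day, values in sorted(by_day.items())
    }


# Hypothetical hourly temperatures (Celsius) spanning two days.
records = [
    (datetime(2016, 3, 1, 0), 10.0),
    (datetime(2016, 3, 1, 1), 14.0),
    (datetime(2016, 3, 2, 0), None),   # missing hour, skipped
    (datetime(2016, 3, 2, 5), 20.0),
]
daily = daily_stats(records)
```

Under this layout, the daily mean would be read for temperature and wind, the total for precipitation, and the minimum, mean, and maximum for relative humidity and pressure.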
We acquired the leaf area index (LAI) data from the NASA Moderate Resolution Imaging Spectroradiometer (MODIS) instrument aboard the Terra (EOS AM-1) and Aqua (EOS PM-1) satellites. The spatial resolution of the MODIS data was 500 m, and the LAI data consisted of 4-day measurements. LAI measurements were based on a square area that encompassed the entire city of Houston (see Fig. 1). The area was centered on the HDHHS pollen station, with sides of an estimated 45 km, for a total area of approximately 2025 km². Since the variations in mean LAI were minimal, we performed linear interpolation between the 4-day measurements to coincide with the daily measurements of the pollen and meteorological data.
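The linear interpolation of the 4-day LAI values to daily values can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `interpolate_daily`, the integer day indices, and the sample LAI values are hypothetical and are not taken from the study.

```python
def interpolate_daily(days, values):
    """Linearly interpolate values observed on sorted integer `days`
    to every day from the first to the last observation."""
    daily = {}
    for i in range(len(days) - 1):
        d0, d1 = days[i], days[i + 1]
        v0, v1 = values[i], values[i + 1]
        for d in range(d0, d1):
            fraction = (d - d0) / (d1 - d0)
            daily[d] = v0 + fraction * (v1 - v0)
    daily[days[-1]] = values[-1]  # keep the final observation as-is
    return daily


# Hypothetical mean LAI observed every 4 days.
lai_daily = interpolate_daily([0, 4, 8], [2.0, 3.0, 2.0])
```

Each daily value lies on the straight line between its two neighboring 4-day observations, which is reasonable only because the variations in mean LAI are small over such short intervals.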

Processed data
We applied processed data to represent multiple variables describing the initial conditions of pollen concentrations. The processed data consist of a meteorological adjustment factor (K_e), normal pollen distribution (C_e), characteristic concentration (C*), averaged frictional velocity (u_*), and pollen flux (F_p). The meteorological adjustment factor represents the resistance of pollen release based on meteorological conditions [25] and comprises three meteorological factors (temperature, relative humidity, and wind speed) that affect the pollen release from plants. In the equation for the adjustment factor, T_te, RH_te, and WS_te represent the threshold values for temperature, relative humidity, and wind speed, respectively. We calculated the adjustment factor from the threshold values of the most common species for the respective groups. Oak pollen comprised over 54% of total tree pollen, and ragweed accounted for over 93% of total weed pollen. Thus, oak [12] and ragweed [26] threshold values of temperature, relative humidity, and wind speed were used to represent the respective pollen vegetation. Grass pollen threshold parameters were not available; thus, oak pollen parameters were selected because of the seasonal similarity between the two pollen categories. C_1, C_2, and C_3 are weighting factors that weigh the influence of meteorological resistance. Table S1 lists the threshold values and weighting factors for each pollen category. Adjustment factors were computed for each pollen category.
Normal pollen distribution (C_e) represents the mean normal pollen distribution for each pollen vegetation category by imitating the seasonal pollen cycle for each category. In the definition of C_e, d is the number of consecutive days on which pollen measurements meet or exceed the pollen count, μ is the mean of the distribution, and σ is the standard deviation of the normal distribution curve. The respective statistical variables were not suitable because of the strong variance in the pollen time series. Therefore, we manually selected the μ and σ values for which C_e best represents the pollen trend for the years 2009-2012. The μ and σ parameters of tree pollen were 50 and 15, respectively, and those for both grass and weed were 30 and 10. The pollen flux (F_p) is the daily emission flux of pollen particles for each pollen vegetation type. In the computation of pollen flux, C_e represents the normal pollen distribution, K_e the meteorological adjustment factor, u_* the averaged frictional velocity, and C* the characteristic concentration. In the definition of C*, the canopy height (h_c) is the mean canopy height of the vegetation species. The canopy heights for each category were set at 6.38 m for tree [12], 0.1 m for grass [18], and 1.0 m for weed [13]. The LAI is the mean LAI computed from MODIS satellite image data for the respective time period over the area surrounding the pollen station (see Fig. 1). The pollen term p_q is set as 'Pollen Count + 1.' From multiple experiments with the data, we found that the model can be trained more efficiently with values greater than zero.
During training, zero values may cause the model to ignore the data, reducing the number of training sets. We added a value of 1 to the pollen count to reduce zero values within the data, preventing the model from becoming a naïve predictor. A naïve predictor ignores the importance of the other input variables: the model places greater weight on the previous day's pollen count, uses it as its forecast, and thereby artificially produces favorable statistical results. Previous studies [5,20,21] also observed this phenomenon in regression or simple neural network models when only pollen grain counts were included in the dataset.
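The processed variables described above can be made concrete with a short sketch. The equations themselves are not reproduced in this text, so every functional form below is an assumption: the seasonal curve is taken to be the standard Gaussian density, the characteristic concentration is taken to be the pollen term (pollen count plus one) divided by the product of LAI and canopy height, and the flux is taken to be the product of the four factors. The function names are hypothetical, and the meteorological adjustment factor is passed in as a number because its equation is not recoverable here.

```python
import math


def normal_pollen_distribution(d, mu, sigma):
    """Assumed form of C_e: Gaussian density evaluated at day index d."""
    return math.exp(-((d - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )


def characteristic_concentration(pollen_count, lai, canopy_height):
    """Assumed form of C*: (pollen count + 1) / (LAI * canopy height)."""
    pollen_term = pollen_count + 1  # the +1 offset avoids zero inputs
    return pollen_term / (lai * canopy_height)


def pollen_flux(c_e, k_e, u_star, c_star):
    """Assumed multiplicative form of the pollen flux F_p."""
    return c_e * k_e * u_star * c_star


# Tree parameters quoted in the text: mean 50, standard deviation 15,
# canopy height 6.38 m. LAI, K_e and u_* below are made-up placeholders.
c_e = normal_pollen_distribution(d=40, mu=50, sigma=15)
c_star = characteristic_concentration(pollen_count=120, lai=2.5, canopy_height=6.38)
f_p = pollen_flux(c_e, k_e=0.8, u_star=0.3, c_star=c_star)
```

With these assumed forms, a day with a zero pollen count still yields a positive characteristic concentration, which matches the stated purpose of adding 1 to the count.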
We normalized all input data to reduce the magnitude between the various input data. This prevents one feature from having more influence than another or causing dramatic changes in the weight matrix when the CNN model was optimized.
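The text does not state which normalization was used, so the following sketch assumes min-max scaling to the range 0 to 1, with the scaling bounds taken from the training years and reused for the forecast year. The function names and sample values are hypothetical.

```python
def fit_min_max(column):
    """Return the (minimum, maximum) of a training column."""
    return min(column), max(column)


def min_max_scale(column, low, high):
    """Scale values to [0, 1] using bounds fitted on the training data.

    A constant column is mapped to zeros to avoid division by zero.
    """
    span = high - low
    if span == 0:
        return [0.0 for _ in column]
    return [(x - low) / span for x in column]


# Hypothetical daily mean temperatures (Celsius).
train_temperature = [0.0, 5.0, 10.0]
low, high = fit_min_max(train_temperature)
scaled_train = min_max_scale(train_temperature, low, high)
scaled_test = min_max_scale([7.5], low, high)  # reuse the training bounds
```

Fitting the bounds on the training data only keeps information from the forecast year out of the preprocessing step. Values outside the training range would fall outside 0 to 1 under this scheme.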

Neural network system
Our CNN model used 24 normalized input variables for the prediction of the next 1-7 days counts of tree, grass, and weed pollen. Input variables are comprised of meteorology (13 variables), LAI, fraction of photosynthetically active radiation (FPAR), meteorological adjustment factor, normal pollen distribution, and pollen flux for each of the three pollen categories. See Fig. 2 for a graphical representation of the pollen forecasting CNN model.
The model comprised five main layers: an input layer, two one-dimensional convolutional layers, a fully connected layer, and an output layer. We applied a dropout [27] layer between the convolutional layers and optimized the system parameters for each vegetation category (see Table S2 for parameter details). We identified optimal system parameters by trial and error over multiple parameter ranges, covering kernel size (a line-segment shape of at least 1 × 2), number of filters, learning rate, batch size, and training epochs for each pollen category, and evaluated their performance based on 1-day forecast accuracy. Cases in which differing settings produced multiple favorable results underwent a second set of testing runs to evaluate the stability of the models and further identify the optimal parameters. We trained the model with 2009-2015 data, with 15-20% of the training set used for cross-validation of the pollen forecast. This procedure was conducted to avoid overfitting of the model. Once the training and validation run was complete, the model received the 2016 normalized data and predicted the next 1- to 7-day pollen for each vegetation category and for total pollen.
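The layer sequence described above (input, convolution, dropout, convolution, fully connected, output) can be sketched as a forward pass in plain Python. This is an illustration of the data flow only, not the authors' implementation. The filter count, the ReLU activation, the random weights, and a single seven-value output (one per forecast day) are assumptions, and all names are hypothetical. The optimized settings are those of Table S2, which is not reproduced here.

```python
import random


def conv1d(x, kernels, biases):
    """Valid 1-D convolution followed by ReLU.

    x: list of channels, each a list of equal length.
    kernels: [n_filters][n_channels][kernel_size].
    Returns [n_filters][length - kernel_size + 1].
    """
    kernel_size = len(kernels[0][0])
    out_length = len(x[0]) - kernel_size + 1
    out = []
    for f, kernel in enumerate(kernels):
        row = []
        for i in range(out_length):
            total = biases[f]
            for c, channel in enumerate(x):
                for j in range(kernel_size):
                    total += kernel[c][j] * channel[i + j]
            row.append(max(0.0, total))  # ReLU (assumed activation)
        out.append(row)
    return out


def dense(flat, weights, biases):
    """Fully connected layer with a linear output."""
    return [sum(w * v for w, v in zip(row, flat)) + b
            for row, b in zip(weights, biases)]


def init_params(n_inputs=24, n_filters=4, kernel=2, n_outputs=7, seed=0):
    """Random placeholder weights; real weights come from training."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.1, 0.1)

    flat_length = n_filters * (n_inputs - 2 * (kernel - 1))
    return {
        "k1": [[[rand() for _ in range(kernel)]] for _ in range(n_filters)],
        "b1": [0.0] * n_filters,
        "k2": [[[rand() for _ in range(kernel)] for _ in range(n_filters)]
               for _ in range(n_filters)],
        "b2": [0.0] * n_filters,
        "w": [[rand() for _ in range(flat_length)] for _ in range(n_outputs)],
        "b": [0.0] * n_outputs,
    }


def forward(features, params):
    """Inference pass: conv -> (dropout inactive) -> conv -> dense."""
    hidden = conv1d([features], params["k1"], params["b1"])
    # Dropout only acts during training; at inference it is the identity.
    hidden = conv1d(hidden, params["k2"], params["b2"])
    flat = [v for row in hidden for v in row]
    return dense(flat, params["w"], params["b"])


forecast = forward([0.5] * 24, init_params())  # 24 normalized inputs
```

With a kernel of size 2, each convolution shortens the 24-value input by one, so the fully connected layer in this sketch receives 4 filters × 22 positions = 88 values.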
The CNN model has been compared to the recurrent neural network (RNN) [28,29] and deep neural network (DNN) for the purposes of forecasting pollen concentration. We implemented a gated recurrent unit (GRU) neural network as representative and advanced form of an RNN [30][31][32] which is a form of neural network generally suited for temporal sequences. The DNN model uses multiple layers of artificial neurons with dropout layers in between. All models received the same input data with some adjustments to their initial parameters, see Table S3 in the Supplementary section for parameter details.
The neural network models were compared on their statistical prediction capability in forecasting tree, grass, and weed pollen 1 and 7 days ahead. The CNN model performed consistently better than both the DNN and GRU models in nearly all cases. Furthermore, the mean training time of the CNN model was the shortest of the models tested, with the DNN model close behind and the GRU model taking about five times longer than the CNN model. The GRU model was consistently the least accurate of the models in predicting tree and grass pollen for both 1- and 7-day predictions. The exception was the 7-day prediction of weed pollen, where the GRU achieved a 1% better index of agreement (IOA) than the CNN model. The DNN model was generally about 4% less accurate than the CNN model in both 1- and 7-day forecasting of all three pollen categories. See Table S4 in the Supplementary section for details on the performance of each model. Thus, this study implemented the CNN model for forecasting tree, grass, and weed pollen 1-7 days ahead because of the model's more accurate and stable results and its shorter training time.

Fig. 2 Representation of the pollen forecasting convolutional neural network model consisting of an input layer, two convolutional layers, a dropout layer in between the convolutional layers, a fully connected layer, and an output layer

Results and discussion
For our evaluation, we used the Pearson correlation coefficient (r), the index of agreement (IOA) [33], and a categorical statistics evaluation as presented by [34]. The categorical statistics evaluation consists of hit rate (HIT), critical success index (CSI), false alarm rate (FAR), equitable threat score (ETS), and proportion of correct (POC). For each pollen category and each forecast lead time, we compared observed to predicted pollen concentrations using r and IOA. The categorical statistics evaluation determined how well the model, compared to the observations, captured threshold levels based on the prevalence of allergy symptoms.

Evaluation of the categorical statistics
The evaluation of the categorical statistics was based on four quadrants. From these quadrants, we evaluated the categorical statistics as follows: HIT is the fraction of observed pollen concentrations above the threshold that are predicted correctly by the model (1 is the best). FAR is the fraction of predicted pollen concentrations above the threshold that are false (0 is the best). CSI is the fraction of correctly predicted pollen concentrations above the threshold after the removal of correctly predicted pollen concentrations below the threshold value (1 is the best). ETS measures the performance skill of the model (1 is the best). POC is the fraction of the model forecasts that matched the observations above and below the threshold (1 is the best). For the purpose of the categorical statistics evaluation, we defined the threshold levels according to the severity of symptoms and the most prevalent pollen species for the respective pollen categories. Oak (Quercus) pollen represents all tree pollen counts (grains m⁻³) in the tree pollen evaluation. Pollen levels defined by the National Allergy Bureau (NAB) are based on the percentile ranges of the pollen counts measured by all stations certified by the NAB. The NAB defines pollen counts between 15 and 89 grains m⁻³ as moderate. Soldevilla et al. [35] categorized biological air quality (BAQ) into four levels (good, acceptable, poor, and bad) based on the frequency of pollen types and their allergic potential. Poor BAQ is set as a threshold baseline, which refers to pollen types with moderate pollen counts but high allergic potential. A specific group of tree species (i.e., Cupressus, Pinus, Platanus, Populus, and Quercus), for which pollen counts of 51-200 grains m⁻³ are defined as moderate, accounted for 81% of the total tree pollen count. Another group of tree species, for which counts of 31-50 pollen grains m⁻³ are defined as moderate, accounted for roughly 9% of the total tree pollen count. As a compromise for the purpose of the evaluation, we defined the threshold for tree pollen counts as 50.
Grass pollen concentrations from 30 to 80 grains m⁻³ day⁻¹ substantially increased allergic nose and eye symptoms in children. When grass pollen concentrations exceeded 70 grains m⁻³ day⁻¹, the severity of lung dysfunction symptoms increased [36]. For the evaluation, we defined the grass pollen threshold as 30 because of the prevalence of allergy symptoms occurring with grass pollen counts of 30 grains m⁻³.
In our evaluation, ragweed (Ambrosia) pollen represents the majority of weed pollen, accounting for about 93% of all weed pollen counts. Ragweed pollen counts as low as 5 grains m⁻³ can cause allergic symptoms [37,38]. Other studies have indicated that symptoms caused by ragweed pollen occur at higher concentrations, ranging from 20 to 50 grains m⁻³ [39][40][41]. The NAB defines pollen counts between 10 and 49 pollen grains m⁻³ as moderate. Thus, because a large number of studies have indicated that patients experience allergy symptoms across a diverse range of conditions, we set the threshold value for ragweed pollen at 20 grains m⁻³ to evaluate the model.
To evaluate the categorical statistics of the total pollen count (sum of the tree, grass, and weed pollen) for the entire year of 2016, we used both the threshold values of each of the pollen categories and the mean threshold of all three categories. Thus, the total pollen model performance had four categorical evaluation thresholds (see Table S1).
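The categorical scores described in this section can be sketched from the four quadrants of a forecast versus observation table. The formulas below are the standard forecast-verification definitions, which match the verbal descriptions in the text. They are an assumption insofar as the original equations are not reproduced here. The function names, the sample counts, and the choice to treat a count equal to the threshold as an exceedance are hypothetical.

```python
def contingency(observed, predicted, threshold):
    """Count hits (a), false alarms (b), misses (c), correct negatives (d).

    A value equal to the threshold counts as an exceedance (assumption).
    """
    a = b = c = d = 0
    for obs, pred in zip(observed, predicted):
        obs_high = obs >= threshold
        pred_high = pred >= threshold
        if obs_high and pred_high:
            a += 1
        elif pred_high:
            b += 1
        elif obs_high:
            c += 1
        else:
            d += 1
    return a, b, c, d


def safe_div(numerator, denominator):
    """Return None instead of raising when a score is undefined."""
    return numerator / denominator if denominator else None


def categorical_scores(a, b, c, d):
    n = a + b + c + d
    hits_by_chance = (a + b) * (a + c) / n  # used by ETS
    return {
        "HIT": safe_div(a, a + c),
        "FAR": safe_div(b, a + b),
        "CSI": safe_div(a, a + b + c),
        "ETS": safe_div(a - hits_by_chance, a + b + c - hits_by_chance),
        "POC": safe_div(a + d, n),
    }


# Hypothetical daily counts evaluated at a threshold of 50 grains per m^3.
observed = [60, 10, 55, 5, 70, 20]
predicted = [65, 55, 40, 8, 80, 10]
scores = categorical_scores(*contingency(observed, predicted, threshold=50))
```

Because POC also counts correct forecasts below the threshold, it stays high whenever most days are below the threshold, even if few exceedances are captured.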
The results of the statistical evaluation showed that our CNN model (see Table 1) yielded favorable to mixed results for the respective pollen categories. The 1- to 7-day predictions of grass pollen had the least optimal scores for HIT (0.125-0.313), CSI (0.111-0.294), FAR (0.167-0.500), and ETS (0.093-0.271). The grass model, however, had a POC score higher than 0.9 in all predictions. An explanation for this score is the abundance of measured and predicted grass pollen counts falling below the threshold value. The model did not accurately forecast the few cases of threshold exceedances. The ratio of the model overpredicting to correctly forecasting pollen counts when exceeding the threshold was nearly equal in most cases. This resulted in a significantly higher FAR than that exhibited by the other pollen categories. The low CSI score indicates that the model mostly underpredicted during the threshold exceedances of the grass pollen season.
See Table S5 in the Supplementary Section of the paper.
The results indicate that categorical statistics were not able to sufficiently evaluate the overall accuracy of the model. Therefore, we determined that the IOA and r would be appropriate for an alternative evaluation using statistical analyses.

Index of agreement and the Pearson correlation coefficient
IOA measures the degree of model prediction error and whether our model accurately predicted the peaks of pollen concentrations. The r measures the linear correlation between the observed and predicted concentrations of our model. We used both methods to evaluate the accuracy of our model at predicting 1-7 days ahead for the 2016 time series. We ran the model through 25 iterations to evaluate its consistency and accuracy. See Fig. 3 for the model performance in IOA and r for the 7-day predictions. For the 1-day-ahead prediction of tree pollen, the model achieved a mean IOA and r of 0.88 and 0.85, respectively. The model accuracy ranged from 0.84-0.91 IOA and 0.79-0.88 r. The model accuracy decreased slightly to an IOA and r of 0.76 and 0.71, respectively, when forecasting the seventh day. However, the model underpredicted the concentrations during the peak tree pollen season (see Fig. 4). One explanation for this finding is that most of the training data had peak pollen concentrations below the 2016 concentrations. To address this issue, the model would need to be trained with more consistent pollen and meteorological data. The model would then be trained with more samples, further optimizing its predictive capabilities. These instances of the model not being able to ...
The model predictions of next-day grass pollen were the least favorable, with a mean IOA score of 0.81 and r of 0.75. The range of accuracy for first-day forecasts was also the largest for grass, with an IOA range and r range of 0.73-0.84 and 0.62-0.79, respectively. Grass pollen forecast accuracy decreased considerably as forecast time increased. By the seventh day, the accuracy was 0.56 IOA and 0.51 r.
The weed pollen forecast was the most favorable, with a mean IOA and r of 0.90 and 0.88, respectively. The model performed more consistently at predicting weed pollen counts. For the 1-day forecasts, the IOA score ranged from 0.87-0.93 and the r score from 0.87-0.90. The accuracy of the model decreased only slightly as forecast time increased. The model overpredicted the weed pollen concentration for 2016 during the peak weed season.
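The two continuous scores used in this subsection can be written compactly. The sketch below assumes the original Willmott formulation of the index of agreement, which is the common reading of IOA. The text cites [33] without reproducing the equation, so this choice is an assumption. The function names and sample series are hypothetical.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    covariance = sum((o - mean_obs) * (p - mean_pred)
                     for o, p in zip(observed, predicted))
    spread_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed))
    spread_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in predicted))
    return covariance / (spread_obs * spread_pred)


def index_of_agreement(observed, predicted):
    """IOA = 1 - sum((P - O)^2) / sum((|P - mean(O)| + |O - mean(O)|)^2)."""
    mean_obs = sum(observed) / len(observed)
    error = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    potential = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                    for o, p in zip(observed, predicted))
    return 1 - error / potential


# Hypothetical series: the prediction is exactly twice the observation.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(observed, predicted)
ioa = index_of_agreement(observed, predicted)
```

In this example the correlation is perfect while the index of agreement is well below 1, because r ignores a constant scaling error whereas IOA penalizes it. This is why the two scores are reported together.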
Deep CNN model accuracy declined substantially for the 3-4 day forecasts for the pollen categories. Tree and weed pollen forecasts showed minor improvements from the second- to the third-day forecast, suggesting that pollen emissions by plants respond to the pollen and weather phenomena of the current day with a delay. A similar phenomenon is observed on an annual temporal scale, where stronger precipitation in one year led to stronger pollen production in the following or second year [11].
Results reflected the magnitude of variation in pollen concentrations among the categories from year to year. The average pollen time series of 2009-2015 correlated with that of 2016 by 0.49 for grass, 0.76 for tree, and 0.84 for weed pollen. Grass pollen varied strongly from year to year, whereas weed pollen was more consistent, which agrees with the variation in model accuracy among the pollen categories. Hence, the accuracy of the model depends on the stability of the pollen seasons. An evaluation of the model on total pollen 1- to 7-day forecasts for 2016 produced more stable results. This finding can be explained by the inclusion of non- and low-pollen time periods during the year.
The model prediction accuracy of weed pollen counts was slightly less consistent as the model forecasts further ahead in time. The predictions of counts during the odd days (e.g., third and fifth days) were consistently better than those of the previous days (e.g., second and fourth days). The model results indicate that for 2016, current day weather had a stronger relationship to the third-day forecast of pollen concentrations than to the second-day forecast. Comparing yearly to seasonal time series for the respective pollen categories showed on average a 2% decrease in the prediction accuracy for IOA and r (see Fig S1 in the supplementary section of the paper).
The shortcomings of the predictive capability of the current model could be mitigated by certain approaches: (1) increasing the amount of pollen concentration data available for training; (2) reducing the amount of missing pollen data to improve the accuracy of interpolating across gaps; and (3) increasing the temporal resolution of pollen data, allowing the evaluation of short-term meteorological effects (e.g., precipitation) on pollen concentrations. Data for individual pollen species were not sufficient for the model to forecast individual species. The implementation of more layers (a deeper CNN) depends on the availability of more data. The CNN model's ability to better identify patterns of pollen phenomena was limited by the availability of data. This reflects the limitations of the CNN as a forecasting system model discussed by Eslami et al. [20].

Conclusion
Studies have shown that pollen concentrations are related to wind [5], temperature [9], precipitation [11], and relative humidity [41]. Despite these meteorological relationships, pollen concentrations vary considerably. This reflects the nonlinear annual variability of phenological pollen data, which is not always amenable to linear regression modeling [5]. Despite the variance in pollen concentrations, our deep CNN was capable of forecasting pollen up to 7 days ahead with sufficient accuracy. On average, our model predicted weed pollen concentrations with an IOA of nearly 0.9 for a 1-day forecast and a mean IOA of 0.81 for the seventh-day forecast. The first-day forecast of tree pollen attained 0.86 IOA and the 7-day forecast 0.78 IOA. Our model produced a satisfactory forecast of grass pollen beyond the next day with an accuracy of at least 0.8 IOA, which dropped considerably to 0.56 IOA in the 7-day forecast. Variation in the forecasting accuracy among the pollen categories was related to the variation in their annual pollen seasons. The more the pollen emissions differed from those of previous years, the more difficult it was for the model to accurately forecast pollen concentrations. The model was able to address some of the variability among the pollen concentrations and obtained stronger correlations in the 7-day forecasts than the direct correlations between 2016 and previous years.
Our CNN model predicted real-time concentrations of pollen with favorable statistics and generated results within minutes of initiating the model. Therefore, the computational efficiency of the deep CNN algorithm could supplement deterministic and regression models to forecast pollen concentrations more accurately and rapidly, offering a more reliable warning system for populations at high risk of pollen-related allergies.