A novel ensemble machine learning and time series approach for oil palm yield prediction using Landsat time series imagery based on NDVI

Abstract Accurate oil palm yield prediction is necessary to sustain oil palm production for food security and economic return. However, there are limited studies on comprehensive mapping and accurate oil palm yield prediction using advanced machine learning algorithms. Using multi-temporal remote sensing data, this paper proposed a new approach to predict oil palm yield based on the normalized difference vegetation index (NDVI) and ensemble machine learning algorithm. ReliefF algorithm with linear projection was employed to select the best combination of spectral indices in oil palm discrimination. Oil palm land cover was classified using random forest (RF) and modified AdaBoost algorithms. A time-series approach known as walk-forward validation was firstly introduced to train the model using the 2016-2019 data and the one-step prediction was performed for 2020 using RF and AdaBoost. Result of the study revealed that the RF model (RMSE = 0.384; MSE = 0.148; MAE = 0.147) outperformed the AdaBoost model (RMSE = 0.410; MSE = 0.168; MAE = 0.176). Our research has demonstrated the value of detailed mapping and subsequent yield prediction by developing a novel approach utilising time-series satellite imagery, ensemble machine learning, and NDVI, which will assist decision-makers in managing their practices related to oil palm.


Introduction
Oil palm (Elaeis guineensis) is a major cash crop in Malaysia that generates a large amount of revenue for the country. Palm oil derived from oil palm is cheaper than other vegetable oils due to the much higher productivity of oil palm trees per unit of land and is usually demanded globally, especially from people of third-world countries. One hectare of oil palm generally yields more than three tons of vegetable oil, whereas other cash crops like sunflower or rapeseed yield less than one ton of vegetable oil (FAOSTAT 2019).
Over the last decade, global demand for oil palm has increased, with Southeast Asia supplying nearly 90% of palm oil production. The expansion of oil palm cultivation has created new job opportunities to support the livelihoods of rural households and local communities (Edwards 2019). Higher farm profits benefit rural households and communities by generating promising income and thus reducing poverty at the local, regional and national levels (Kubitza et al. 2018;Santika et al. 2019). However, the expansion of oil palm plantations has sparked some debate about tropical forest deforestation. Recent oil palm expansion in forested localities, where more than 90% of global palm oil is produced, has led to concerns about the role of oil palm in deforestation (Meijaard et al. 2020). On the island of Borneo, at least 50% of all deforestation between 2005 and 2015 was linked to oil palm expansion (Meijaard et al. 2018). To alleviate the problem, a simple solution is to increase palm oil production without exploring new lands, owing to the utilisation of currently cultivated lands. Furthermore, global demand for vegetable oil is projected to increase by 46% by 2050.
Predicting the yield in oil palm is crucial, which can assist in the monitoring of oil palm production precisely in terms of yield variations across the location. Furthermore, precise treatments can be focused only on the predicted low yielding area, ensuring that material resources are not squandered and, as a result, no excessive chemical-based agriculture inputs are applied and the environment is maintained. The decision-makers can quickly and precisely respond to the current crop status. Previous conventional scouting of oil palm plantations is laborious, inaccurate, and prone to human error (Shamshiri et al. 2019). Nevertheless, traditional yield estimation approaches are no longer useful to decision-makers because they are time-consuming. Remote sensing technology is one of the most effective tools for obtaining critical information for oil palm monitoring and growing conditions over a wide region; it can be used for oil palm yield estimation. Most studies on oil palm yield prediction have used factors such as weather, agronomic information, and others (Oberth€ ur et al. 2017;Sarkar et al. 2020;Yousif et al. 2020). Oettli et al. (2018) developed a linear model to examine the influence of weather on the annual productivity of palm trees in the whole of Malaysia. The study used atmospheric temperature as an indicator for determining the relationships between oil palm yields and El Niño-Southern Oscillation (ENSO) Modoki. In a similar study, Siang et al. (2020) studied the effect of daily maximum and minimum temperatures, hourly air temperatures to estimate the relative humidity and hourly total solar irradiance on the oil palm growth and yield. In comparison to major factors such as climate and agronomy, studies on using remote sensing features are still very limited (Balasundram et al. 2013;Chong et al. 2017).
Remote sensing and geographic information systems (GIS) are capable of detecting, classifying and monitoring oil palm plantations using spectral features (Shaharum et al. 2020). Remote sensing at visible and near-infrared spectral regions (Vis-NIR) has been used to extract various spectral indices for deriving various crop properties for crop monitoring and prediction (Xue and Su 2017). This includes chlorophyll concentration, other photosynthetic pigments, crop-soil interaction, leaf area index, and other crop biophysical indicators. (Sims and Gamon 2002;Viña et al. 2011;Zarco-Tejada et al. 2012).
The utilisation of remote sensing for yield monitoring has its foundation from its close relation to the canopy leaf area index (LAI) and the fraction of absorbed photosynthetically active radiation (fAPAR) (Huang et al. 2014;Zhou et al. 2017;Rezn ık et al. 2020). The relationship between net primary production and absorbed photosynthetically active radiation (APAR) is linear (Bolton and Friedl 2013). Therefore, vegetation indices can be used as an indirect estimate of primary crop productivity (Al-Gaadi et al. 2016;Demirpolat and Lelo glu 2018). The relationship between vegetation indices and biomass/ fAPAR aids in crop yield estimation since the photosynthetic activity of agricultural plants during specific seasons determines the yield of various crops (Mkhabela et al. 2011;Newton et al. 2018;Panek and Gozdowski 2020). Newton et al. (2018) used NDVI to estimate potato yield using Landsat time-series images. A linear regression model was developed from 2010 to 2016 based on the day of the maximum NDVI. The findings showed a relationship between NDVI and yield, which was related to the calculation of maximum field NDVI for each growing cycle based on chronological changes. The overall coefficient of correlation (R 2 ) of yield prediction was found to be 0.81 between the mean of NDVI and potato yield.
Furthermore, machine learning has been widely used in crop yield prediction in recent years due to the large volume of data generated by predictors and its ability to integrate, analyse, and make the best-informed decision (Chlingaryan et al. 2018). Advanced machine learning algorithms are capable of producing accurate prediction methods. Xu et al. (2021) combined vegetation indices with backscattering and texture features to develop classification models based on grid search optimization using random forest (IGSO-RF). The proposed approach improved the oil palm detection accuracy and obtained the best model performance (OA ¼ 96.08 percent and kappa ¼ 0.9462) using the IGSO-RF classifier and optimal features. Yoosefzadeh-Najafabadi et al. (2021a) proposed an ensemble method for yield assessment based on bagging strategy and generic algorithms, given the features of the yield component traits information. These information were used to forecast total soybean seed yield using the Multilayer Perceptron (MLP), Radial Basis Function (RBF), and Random Forest (RF) machine learning (ML) algorithms, both individually and collectively via an ensemble method based on bagging strategy (E-B). The results showed that the E-B algorithm was able to improve prediction accuracy by increasing the values of R 2 , MAE, and RMSE by 0.1, 0.24, and 0.96 kg.ha-1, respectively. In another study, Yoosefzadeh-Najafabadi et al. (2021b) used hyperspectral vegetation indices to predict soybean seed yield and fresh biomass using ensemble-bagging (EB) and deep neural network (DNN). The results showed that DNN and EB generate coefficients of determination (R 2 ) of 0.76 and 0.77 for yield and 0.91 and 0.89 for fresh biomass, respectively. The DNN-SPEA2 hybrid was used to improve prediction accuracy, and it was found that it can differentiate superior genotypes in large breeding populations. Phan et al. (2020) utilised three machine learning algorithms to predict tea yield with other variables such as NDVI and mean temperature. Three models were developed: standard linear regression, support vector regression, and random forest (RF) regression, with the results revealing that support vector and RF regressions provided the highest accuracy over linear regression.
However, only limited studies have worked on comprehensive mappings and accurate oil palm yield prediction using advanced machine learning algorithms (Diana and Dharma 2019;Rodr ıguez et al. 2021). Based on our literature review, this is the first study to integrate remote sensing and ensemble machine learning in oil palm mapping and yield prediction. Therefore, the objectives of this study were to: i) optimise the oil palm classification workflow from 2016 to 2020 by applying statistical analytics and machine learning algorithms (RF and modified AdaBoost) for subsequent yield prediction, and ii) build a yield prediction model based on time-series NDVI coupled with RF and AdaBoost using monthly data from 2016 to 2020.

Study area
The state of Pahang is the study area, which is situated in Peninsular Malaysia and is one of the country's largest states, with rainforests and mountains. Figure 1 shows the location of the study area. The weather is tropical mainly hot and humid. The weather is mild to hot all year and the average water temperature is 29 degrees. The precipitations were low from February to June, in which the precipitations were the lowest among the months (WorldData n.d.).

Data used and sources
This study used Landsat-8 Top of Atmosphere (TOA) reflectance data in the periods of 2016-2020. Landsat-8 has eleven bands, which are categorized into two sensors: operational land imagers (OLI) and thermal infrared sensors (TIRS). In this study, six bands were chosen: band 2 (blue) (0.45-0.51 mm), band 3 (green) (0.53-0.59 mm), band 4 (red) (0.64-0.67 mm), band 5 (near-infrared) (0.85-0.88 mm), band 6 (SWIR 1) (1.57-1.65 mm), and band 7. (SWIR 2) (2.11-2.29 mm). TOA reflectance data from the Tier 1 collection contains higher geometric and terrain accuracy and the scenes in the Tier 1 collection are constantly geo-registered. This geometric registration ensures the pixel-to-pixel correspondence necessary for the combination of multi-temporal imagery. These tier 1 TOA reflectance satellite images have already been radiometrically calibrated using wellestablished methods to ensure consistency with other scenes, and these are now available on Google Earth Engine. The NDVI time series profiles retrieved from the Landsat-8 images were used to derive chronological parameters following the flowchart depicted in Figure 2.
A land cover map of Pahang was used and provided by the Department of Agriculture (DOA), Malaysia as the validation for more accurate classification. Meanwhile, the average fresh fruit bunch yield (tons/hectare) was derived from the Malaysian Palm Oil Board (MPOB) website under Economics and Industry Development Division (MPOB 2021).

Oil palm mapping and classification
2.3.1. Sampling strategy Stratified random samplings were applied in this sampling strategy. The samples were created with the assistance of a high-resolution Google Earth Pro and a reference map provided by DOA. A total of 15,586 pixels from the whole samples based on stratified systematic samplings were created and comprised of 5 classes (water, forest, oil palm, bare soil, built up, and background). Stratified systematic samplings are applied as a sampling technique. Stratified random sampling combines the benefits of stratified and random assignments with the advantages of systematic samplings, while avoiding the risk of bias when the cycle period coincides with the sampling interval. The focus was on sample size distribution when assigning the classes for each feature. Then, the minimum number of samples was selected from each stratum from three spatially diverse areas. First, a number to every class was assigned from each ground feature for three spatially diverse areas. The features were then chosen at random and the classes were chosen at regular intervals from all three diverse locations.

Pre-processing
The cloud masking and filling method were applied to remove the clouds and fill the region with cloud-free images by using multiple image combinations in different periods, and the google earth engine has provided an algorithm to get cloud-free images. Cloud masking algorithms were used to remove bands 10 and 11 from the image, causing the values in 10 and 11 to be classified as the cloud. The cloud is then filled into the mask pixels using the value 0 as the second step.

The generation of spectral indices
For the mapping and classification of the oil palm, ten spectral indices (Table 1) were derived and applied. Normalized difference vegetation index (NDVI), Tasselled Cap Transformation wetness band (TCT_W), Tasselled cap transformation greenness (TCT_G), Tasselled cap transformation brightness (TCT_B), Enhanced vegetation index (EVI), Enhanced vegetation index (EVI2), Normalized difference built-up index (NDBI), Soil adjusted vegetation index (SAVI), Land surface water index (LSWI), and Optimized soil adjusted vegetation index (OSAVI) were chosen for the extraction of SI. SIs were then stacked together and used for feature selection as described in following section.
2.3.4. Selection of the best spectral indices using feature selection and linear projection Ten spectral indices were generated for mapping and classification. The ten spectral indices were analysed and selected using a feature selection algorithm. Feature selection alleviates the redundant issues of the features by removing irrelevant and redundant features.
Baig et al. (2014) Tasselled cap transformation brightness (TCT_B) Huete (1988) Land surface water index (LSWI) Rondeaux et al. (1996) The filter-based approach is a statistical measure that is used to find the most correlated features while indicated the unrelated features based on certain mechanisms. The relief-based algorithm is further improved and proposed by ReliefF (Kononenko 1994). Kira and Rendell (1992) defined the relief algorithms inspired by instance-based learning (Aha et al. 1991). Relief measures a proxy statistic for each feature that can be used to estimate feature 'relevance' to the target concept. These feature statistics are directed to as feature weights that can range from À1 (least correlated) to þ1 (most correlated). Table 2 shows the pseudo-code for the ReliefF algorithm.
The workflow of the ReliefF algorithms is shown as below: 1. The Relief algorithm cycles through m random training instances (Ri), selected without replacement, where m is a user-defined parameter. R i is denoted as 'target' case and the feature score vector or weight is denoted as W. 2. W is modified per cycle based on feature value differences observed between the target and neighboring instances. Relief defines the target's two closest neighbours: one with the same class as the target, known as the nearest hit (H), and the other with the opposite class, known as the nearest miss (M). 3. If the feature value varies between the target instance R i and either the nearest hit H or the nearest miss M, the cycle's final stage updates the weight of a feature A in W. Features with differences between R i and H provide evidence to the contrary, so the quality estimation W[A] is decreased. Oppositely, the features with different values between Ri and M are informative of outcome is supported, so the output estimation W[A] is increased. 4. The weight is updated by averaging the contribution of all hits and all misses. The contribution for each class of misses is weighted by the class prior probability P. (C). The misses' probability weight sum should turn into 1 since the contribution of hit and misses is symmetric (in each step to be [0;1]. Each likelihood weight is divided by factor 1-P(class(Ri)) due to the class of hits is missing from the total. The process is repeated for m times. The fundamental difference from Relief is the selection of k hits and misses, which guarantees the algorithm's robustness in the face of noise. Userdefined parameter k controls the locality of estimates. To deal with missing data the diff function must be changed. Missing values of attributes are treated probabilistically.
For the 2016 to 2020 imagery datasets, oil palm mapping and classification were performed using a feature-selected dataset generated by a relief-based algorithm and containing the selected spectral index layers. After performing the feature selection, the selection process was further analysed by statistical analysis such as linear projection for the best discrimination among classes.
Linear projection is a linear function on a vector space, such that when it is applied to itself will produce a similar result, i.e., P 2 ¼P. The explanation can be implied through the v as a vector, which is a finite straight line pointing in a given direction. Assuming there is some class (x), i.e., Px is a function that returns another class (K) closer to x along with the vector line v. In most conditions, two points between them refer to the Euclidean distance, where i ranges over the dimensions of the vector space (in this case two dimensions) for class separation. The implementation of the feature selection algorithm was run using the Orange software (Demsar et al. 2013). The separation class was further verified and displayed in the linear projection. After the feature selection process, the featureselected dataset with selected spectral indices was compared to other datasets. Table 3 shows the details of the dataset.
2.3.5. Oil palm mapping and classification using machine learning algorithms Machine learning algorithms for oil palm mapping and classification were implemented in Scikit-learn (Pedregosa et al. 2011). Scikit-learn is a library in Python that provides much-unsupervised learning and supervised learning algorithms that can be used for classification purposes. It is efficient to build machine learning models using this library.
2.3.5.1. Random forest. A random forest (RF) is a supervised learning algorithm that is used for both classifications and regression problems. The bagging method was used by creating multiple bootstrap samples so that each new bootstrap sample will be an independent dataset from another. From that, each of the selected features is drawn randomly by the replacement of N examples to ensure the features are distributed in a random subset for fitting into the latter weak learners. After that, a weak learner such as a decision tree was fitted for each of these samples and then obtained a prediction from each of them, and finally chose the best solution through voting. This is an ensemble method that is better than a single decision tree because it reduces overfitting by averaging the result (Figure 3).
2.3.5.2. AdaBoost. AdaBoost is a common boosting classification algorithm and could be applied in a wide range of datasets. AdaBoost tends to induce a strong classifier from several classifiers. Weak models are added sequentially, trained using weighted training data. The process extends until a pre-set number of weak learners has been invented or no further improvement can be made on the training dataset. Stage values were formed for a pool of weak learners, each with a stage value. Predictions are decided by measuring the weighted average of the weak classifier. The method applied was based on SAMME (Stagewise additive modelling) (Hastie et al. 2009). The detail of the rules is mentioned below: As compared to general AdaBoost algorithms, the same method requires (1 À err(m)) > 1/K or the accuracy of each weak classifier be better than random guessing rather than 1/2 to maintain the positivity of a(m). Besides, although err(m) can be bigger than 1/2 (or equal to 1/2), the a(m) is still positive; hence, the misclassified training samples get more weight, and the test error keeps decreasing even after 600 iterations. Consequently, the new algorithm puts more weight on the misclassified data points in (2d) than AdaBoost and the new algorithm also combines weak classifiers a little differently from AdaBoost; for instance, it is differed by logðk À 1Þ P M m¼1 │ðT m ð Þ x ð Þ ¼ k: The extra term log (K À 1) in (1) is not artificial; it makes the new algorithm equivalent to fitting a forward stagewise additive model using a multi-class exponential loss function. In multi-class classification scenario, it is crucial to generalise it with the exponential loss functions as expressed below by applying Equation (1): where fk stands for class k and the significance will become apparent later. It is worth noting that f must be restricted to be estimable; otherwise, adding any constant to all f k 's does not affect the loss. We use the symmetric constraint: f1 þ … . fK ¼ 0, which reduces the multi-class exponential loss function for class flexibility.
GridSearchCV was applied to each hyperparameter combination, seeking for the optimal parameters to achieve the greatest performance. GridSearchCV is a function in the model selection package of Scikit-learn in Python.
2.3.6. Accuracy assessment Cross-validation was used as a resampling technique in this evaluation analysis to test machine learning models (Berrar 2018). The cross-validation technique is a model validation technique for evaluating how well the results of a statistical study can be generalised to a different collection of data. K-fold technique was applied to split every observation from the original dataset to split themselves into the equal chance of appearing in the training and test set. The important step has a parameter called k that refers to the number of groups that a given data sample is to be split into. In this study, 10 was set for K, which chose the (K-1) folds and validate the model using the remaining Kth fold. The process is repeated until every K-fold serves as a test set and is the average of the score. Cross-validation proved to be a less biased over training and testing approach, which can improve the model performance (Lyons et al. 2018). Cross-validation is applied for the evaluation matrices in overall accuracy, kappa (k), F-measure, precision and recall. Finally, the prediction was collected from a cross-validation loop; and the validation matrices or overall accuracy, kappa (k), F-measure, precision and recall were calculated with the given class labels.
The overall accuracy metric is the most common and straightforward performance metric for classifier evaluation; it simply assesses an algorithm's overall effectiveness. This metric is calculated from the confusion matrix's principal diagonal entries and is classified as a degree of correct classification. This metric is defined as the degree of correct classification of a model and is calculated using the confusion matrix's principal diagonal entries .
Kappa technique is another method for evaluation based on the discrete multivariate technique used in accuracy assessment (McHugh 2012). The statistical method defines the actual level of agreement between reference data and a detailed map. The value of kappa lies between 0 and 1, where 0 agreements are only due to chance, while 1 represents the total agreement between two data sets. Recall is a metric that quantifies the model's ability to avoid false negatives by calculating the proportion of positive examples that were correctly marked (Buckland and Gey 1994). While precision factored in how many of the samples that were expected as positives turned out to be accurate. The F1-Score is a metric that combines precision and recall (Sasaki 2007). It is sometimes referred to as the harmonic mean of the two. The harmonic mean is a method of calculating an 'average' of values that is said to be better for ratios (such as precision and recall) than the standard arithmetic mean. Before applying the model to test hypotheses, it ran numerous times to determine the best precision and recall. Pairwise comparisons were used to the proposed feature-selected dataset with other datasets (original, original þ NDVI and full selected dataset). The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare the mean ranks of related datasets on a single dataset (Pratt 1959).

Pre-processing and data transformation
All NDVI values were further masked based on the oil palm area. Finally, we calculated the maximum NDVI values from the oil palm within the geometry representing an oil palm plantation for each month. First, NDVI values for oil palm were derived from landsat-8 by selecting on band 5 (near-infrared ¼850-880nm) and band 4 (red ¼ 640-670nm). NDVI were then averaged and derived from the oil palm boundaries to produce NDVI for each month. Secondly, the geometry of the oil palm was scaled to include the centroid and approximately calculated the scale of a square geometry. The weighted reducer was then used to include the pixels which roughly include at least 0.5% of the pixel. By that, a mean reducer was chosen to cover as a small region as possible for the aggregation of pixels for oil palm within each polygon representing plantations to extract the maximum outcome of maximum NDVI.
Box cox transformations with lambda 4.61 were applied for normality. A box cox transformation is a transformation of a non-normal dependent variable into a normal shape. One of the benefits of box cox transformation is the capability of anomaly detection (Osborne 2010). The detail of the rule is shown below (Equation 2): (2) T (t) is the transformed variable, y is the target variable. When lambda equals one, no transformation is required.

Correlation and regression analysis with walk-forward validation
The relationship between NDVI and yield was tested using a spearman rank-order correlation analysis followed by a chi-squared test to see if it was statistically significant (p < 0.05). Machine learning algorithms such as RF and AdaBoost algorithms were applied for yield prediction. The mechanisms of RF and AdaBoost regression are similar to the classification. In this study, NDVI was used as a predictor for developing the model. Hyperparameter tuning was done for several estimators such as 50, 100, 500, 1000 in RF and AdaBoost algorithms with other parameters to set as default values to find the sets of optimal hyperparameters.
The dataset was first split into train and test sets by specifying a cut point, with all data except for the last 12 months of 2020 used for training and the last 12 months of 2020 used for testing walk-forward validation. The model runs upon a one-step forecast for the evaluation of the model by training on the training dataset and predicting the first step in the test dataset. Then, the real observations from the test set were added to the training dataset and the model is refitted and re-predicted for the second step in the test dataset. This process was reused for the entire dataset, resulting in a one-step forecast for the entire test dataset, from which an error measure can be determined to assess the ability of the model. Models can be trained and evaluated after a design for the test setup has been chosen. The minimum number of samples in the window was used to train a model starting at the beginning of the time series. The model predicts the next time step. The prediction is saved or assessed against the known value.
There are a few criteria to list: 1. The smallest number of observations possible is chosen to train the model. If a sliding window is used, this can be presumed as the window width. 2. Second, the model will determine whether to train on all the data it has access to or only on the most recent observations when appropriate. This decides whether the window will be sliding or expanding.
To assess the performance of the models, root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE) were calculated using equations 3, 4, and 5 as follows: RMSE ¼ root mean square deviation I ¼ variable i N ¼ number of non-missing data points Xi ¼ actual observations time serieŝ MAE ¼ mean square error Yi¼prediction Xi ¼ true value RMSE, MSE, and MAE are important statistical calculations that are applied to measure the error between the predictor and the response variables (Mood et al. 1974;Armstrong and Collopy 1992;Willmott and Matsuura 2005). A smaller RMSE, MSE, and MAE indicate a small error, while a large RMSE, MSE, and MAE indicate a large error between the predicted value and the observed value.

Selection of spectral indices based on feature selection algorithms and linear projection
ReliefF algorithms with sequence predictor rank were used, followed by statistical visualisation (linear projection) to find the best spectral indices for class discrimination from 2016 to 2020. In the year 2016, eight spectral indices such as NDBI, LSWI, TCT_W, TCT_B, TCT_G, EVI_3 and EVI_2 were selected for further classification due to higher predictor rank and showed good separation between classes. Whereas in the year 2017, seven spectral indices such as SAVI, TCT_W, NDBI, LSWI, NDVI, TCT_B and OSAVI showed the highest predictor rank and a good separation. In the year 2018, TCT_B, EVI_3, NDBI, LSWI, EVI_2, SAVI, OSAVI and NDVI were selected due to the highest rank and showed better separation between classes. Whereas in the year 2019, NDBI, EVI3, TCT_W, EVI2, OSAVI, NDVI, TCT_T, TCT_G and SAVI showed the highest rank and better separation between classes. Notably, TCT B, NDBI, and LSWI had the highest predictor rank and demonstrated a greater separation of classes in the year 2020. Figure 4 displays the result of the ReliefF algorithm for the spectral indices. Furthermore, class discrimination was visualised on the best set of spectral indices using linear projection, as shown in Figure 5. It can be seen that the classes were distinctly differentiated by using only three spectral indices (TCT_B, NDBI and LSWI) for the years 2020. The best selected spectral indices were chosen as feature-selected datasets.

Classification and accuracy assessment
The accuracy of the featured-selected dataset ( Figure S2) was assessed based on the dataset from 2016-2020. The accuracy of most of the years using RF achieved at least 90%, which is compatible with the other datasets. In the year 2019, the accuracy of feature- selected dataset achieved slightly higher than the full-selected dataset, which is 0.893 and 0.881, respectively. However, the feature-selected dataset achieved slightly lower in terms of accuracy compared to the others in the year 2018 and 2020. Besides, the accuracy of the feature-selected dataset for both algorithms was slightly lower in 2017 compared to the others. For instance, the feature-selected dataset using RF achieved 0.891 with the lowest value of F1 score (0.776) and recall (0.771). Meanwhile, the feature-selected dataset using AdaBoost recorded an accuracy of 0.897 with an F1 score of 0.734, precision score of 0.816, recall score of 0.727 and kappa coefficient of 0.823. Feature-selected dataset using a modified AdaBoost algorithm (SAMME) in the year 2017 achieved the lowest accuracy, which was 0.795 with a F1 score of 0.704 and a kappa coefficient of 0.691. The result showed that RF still achieved higher accuracy for oil palm classification with true positive based on the precision and F1-measure.
Mostly, oil palm and forest were unable to be correctly classified and oil palm were wrongly assigned as forest especially in the modified AdaBoost algorithm. These kinds of confusions could be explained by the similarity in spectral characteristics between classes belonging to the same large group, such as vegetation. Despite these confusion between forest and oil palm being worse in modified AdaBoost algorithm, our results showed that feature-selected dataset using modified AdaBoost algorithm (SAMME) achieved a higher corrected label in recognising the forest with only three layers, which is 0.97 compared to RF (0.94) in the year 2020 ( Figure 6). This can reduce further the confusion of the spectral similarity between forest and oil palm using the least spectral indices dataset. Overall, RF is consistent in identifying the oil palms from other features such as water, bare soil and built-up in these years (2016)(2017)(2018)(2019)(2020).
The Wilcoxon signed-rank test was used to determine whether the accuracy of the feature-selected dataset is substantially higher than that of the rest of the comparable dataset in each classification with a significance level of less than 5%. The null hypotheses were rejected for all cases, and this means that the proposed dataset containing only feature- selected datasets have no significant difference (p-values > 0.05) with the other datasets when detecting oil palm trees. Noticeably, the feature-selected layer showed a significant difference (p-values < 0.05) with the original dataset and original plus NDVI dataset, respectively using AdaBoost algorithms. Figure 7 shows the Pahang prediction map for oil palm classification and other classes.

Analysis of the chronological changes of oil palm NDVI on temporal and spatial
The chronological changes of maximum NDVI values of oil palm were determined between the ages of 8 and 18 years, as shown in Figure 8. It also showed the difference in NDVI based on the growing period. The oil palm NDVI indicated that the differences were not only temporally but also spatially (Figure 8). A total of 48 satellite images from January 2016 to December 2020 growing periods were chosen chronologically referring to the days from the yield information of oil palm plantation starting from 2016. From each NDVI image, the maximum values of NDVI were derived for oil palm in five years. The NDVI values on the oil palm plantation in these images were also changing constantly. The spatial distribution of NDVI values increased until it reached a maximum value. Overall, the NDVI difference was consistent from 2016 to 2020 (Figure 8). The average value of NDVI of oil palm was discovered to be 0.66. The maximum value of NDVI was 0.80 for November 2016 while the minimum value of NDVI was 0.51 for January 2018.   The minimum values of the NDVI were 0.58, 0.57, 0.51, 0.62 and 0.53 for 2016, 2017, 2018, 2019 and 2020 growing periods, respectively. The maximum values of the NDVI were 0.80, 0.78, 0.77, 0.78 and 0.79 for those growing periods, respectively. There was a small peak at three points which were at July 2016, January 2018 and July 2020. The outlier occurred in July 2018 because the sensor was unable to capture the NDVI value within the geometry with no valid values due to the possibility of cloudiness. It was noted that lower NDVI occurred every April and May due to the lesser rainfall especially in 2016.

Machine learning regression analysis of NDVI values and walk-forward validation approach over the year 2016-2020
The yield prediction model-based RF and AdaBoost regression models were built and developed based on yield data acquired from 2016 to 2020. The walk-forward validation was proposed in this study since it was a time series analysis, with the months in 2016 to 2019 serve as the training period and the year 2020 serves as the validation period.
Overall, the reaction of the RF (Figure 9) year by year had a significant level of accuracy (P < 0.05) with the curves of the predicted and the actual yield displayed the same patterns. However, adaBoost ( Figure 10) obtained an insignificant level of accuracy (P < 0.05). Figure 11 shows yield prediction models for the monthly prediction with RF and AdaBoost models for the validation period in 2020 in Pahang state. The yield prediction model based on the machine learning regression analysis was developed using monthly yield data from 2016, 2017, 2018 and 2019. Validation is needed to assess the performance of the model. Model performance can be calculated using the differences between the real and predicted values. As a result, yield data from the 2020 oil palm growing season was used to further validate the model. The yield predicted by the RF (P < 0.05) was quite close to the actual data obtained by the Malaysian Palm Oil Board under the Economic and Industry Development Division.  . The relationship between actual and predicted yield was significantly correlated based on Spearman's Rank Correlation analysis, which revealed that the correlation coefficient, for (a) rho ¼ 0.80, P < 0.05 and for (b) rho ¼ 0.7, P < 0.05. Figure 12 shows the overall prediction accuracy between the actual yield (Tonnes/hectare) against predicted yield (Tonnes/hectare). The actual yield and predicted yield of RF and AdaBoost performed the best in terms of yield estimation. The prediction of oil palm yield by the RF model was the best with RMSE¼ 0.384; MSE¼ 0.148; MAE¼ 0.147, followed by AdaBoost with RMSE¼ 0.410; MSE¼ 0.168; MAE¼ 0.176.

Comparison of the accuracy of the validated prediction model
3.6. Trend of prediction yield over the NDVI Figure 13 illustrates the best prediction trend from the RF model, highlighting that the prediction trend should be satisfactory in 2020. This prediction model (Figure 13) was initially supported by Spearman's Rank Correlation analysis, which revealed that the  . The best yield prediction trend with NDVI generated by RF regression. The relationship between NDVI and predicted yield was significantly correlated based on Spearman's Rank Correlation analysis, which revealed that the correlation coefficient, rho ¼ 0.806, P < 0.05. R ¼ 0.806 between NDVI and yield was statistically significant (P < 0.05). Influential patterns of NDVI in 2020 were predicted based on the projected yield. The yield decreased when NDVI decreased during June and from October until December. Yield prediction increased steadily when the NDVI slowly increased from March to May within the range of 1.3 till 1.5 tonnes per hectares. In June, there was a modest decrease in yield, with 1.4 tonnes per hectare recorded. It reached its peak with a yield of 1.7 tonnes per hectare in August when the NDVI was at its maximum of 0.79. There was a consistent pattern for the predicted yield from September to December, with yields ranging from 1.4 to 1.6 tonnes per hectare and NDVI values vary by around 0.10, as shown in Figure 13.

Feature selection before oil palm mapping
The ReliefF algorithm was successfully utilised to delineate features such as forest and oil palm. The use of the feature selection method has the advantage of reducing computation power by processing only the most relevant spectral indices. Through the filter-based approach as a feature selection algorithm, the combination of selected spectral indices has been selected as the uppermost features in the rank importance of predictors using ReliefF algorithm. Surprisingly, only three selected features such as TCT_B, NDBI and LSWI had similar performance with other datasets such as original and original plus NDVI in optimal feature selection. The delineation of oil palm can be done from most of the forest through feature selection and selective linear projection based on classes. Figure  5 visualised the discrimination of oil palm from other features using only three spectral layers (TCT_B, NDBI, LSWI) for the year 2020 in the form of linear projection. This result agreed with Kiala et al. (2019), where the classification of Parthenium weed (Parthenium hysterophorus) showed that ReliefF (a filter-based approach) was the best performing feature selection method with the least features as compared to other feature selection methods. Feature selection using ReliefF algorithm has successfully selected the best spectral indices that recognised the separation of classes with statistical visualisation by linear projection display. The combination of function selection and statistical visualisation has reaffirmed the class distinction.

Mapping and classification for oil palm
Mapping is important to ensure that the selected feature dataset could map the oil palm correctly for subsequent yield prediction. Overall, ensemble learning that was used in this study successfully mapped the oil palm with more than 85% accuracy. Meanwhile, our findings were found to agree with Kobayashi et al. (2020), which proved that ensemble learning algorithms achieved higher classification accuracy for the computation of combined vegetation indices as a good classifier that applied in heterogeneous crops and forests. This is because forests and oil palm trees have similar spectral information and characteristics. Surprisingly, this paper has also proved that the minimal number of subset features could delineate the spectral characteristics from different features such as the classification in the year 2020. This is an important finding if the crop mapping covered the areas that are made up of a vast variety of features.
However, the modified AdaBoost algorithm, which is SAMME, achieved lower accuracy for the selected feature dataset compared to the others due to the possible similarity in the spectral characteristics between forests and oil palm trees that belong to the vegetation. Another cause might be the temporal response which poses the difficulty in separating these vegetation classes (Yin et al. 2014). This has further shown that the modified AdaBoost was still unable to solve this issue properly. The problem of the temporal consistency of time-series satellite data should be investigated further. Also, losses of spectral information cannot be accounted for by the variety of features for this classification, as seen in the year 2020. Notably, the modified AdaBoost algorithm correctly identifies forest when the spectral indices layer is at the minimum with three selected spectral layers in the year 2020. This may be explained by the ability of the modified algorithm to update the augmented sample weight using alpha to increase the strength of the weak learner (Hastie et al. 2009). There is the necessity of selecting appropriate feature selection for improving classification accuracy (Chu et al. 2012). Further studies are needed to examine the enhancement of the machine learning algorithm. One of the benefits of comparing the selected spectral indices dataset with others using evaluation matrices and hypothesis testing is that it ensures accurate mapping performance, which can facilitate yield prediction and identification.

The status monitoring of oil palm by NDVI and the yield prediction using machine learning regression time-series approach
Numerous studies have proved that NDVI is an effective tool in monitoring the status of crops and predicting the yield. This is because NDVI involves the combination of NIR and red spectral region, which has a direct relationship to the biophysical properties and physiological development of crops (Cohen and Alchanatis 2019). Niinemets and Tenhunen (1997) suggested a conversion area of the rapid shift in leaf reflectance, caused by higher chlorophyll absorption in the red region and leaf scattering in the near-infrared spectrum, is related to crop health. Higher NDVI values show enhanced photosynthetic activities towards the peak growth stage of oil palm and this result agreed with the work of Phan et al. (2020). Ensemble learning algorithms such as RF and AdaBoost were applied in this study after hyperparameter tunings. The best model in predicting the oil palm is RF as shown in predicting the oil palm yield using NDVI with lowest prediction errors ( Figure 12).
Walk-forward validation was applied for yield prediction as to the validation approach. It is a model evaluation technique used largely with sequential datasets and time series (Stein 2007). We can see from the study that it used data from preceding years or months to train the model from 2016 to 2019 as the training set. As the model progresses through the validation set in 2020, it makes a prediction for each time-step and retrains and repredicts for each time-step until reaching the end of the month in 2020. For example, the model predicted the yield in 2020 by sliding through each time-step for a single step prediction. It helps with sequential data because the model has been trained to sequence with the time-steps.
When applying state-scale models to predict yield at the pixel level, we found that evaluation matrices such as RMSE, MSE and MAE had appreciably low MAPEs across the training dataset. Figure 13 indicates the best prediction trend with NDVI. This approach will have considerable utility in agriculture. Although this study is not exhaustive, it showed high predictability of oil palm yield at state-level from remote-sensing indices using the integration of advanced machine learning based on a time-series approach. Further targeted research will be pursued to explore more remote sensing features and agronomic factors as well as climatic information in the yield prediction model for improving yield prediction accuracy.

Limitations and future perspectives
There are several limitations to this study. Although this is the time-series model, the cause of the accuracy difference between each model was not investigated. Future studies may focus on the studies of these problems with the help of higher resolution satellite images and applying deep learning techniques such as recurrent neural networks (Kim and Lee, 2016). The presence of cloud has a profound noise signal on the NDVI signals, which is most likely the cause of the model inaccuracies (Alvarez-Mendoza et al. 2019). Furthermore, soil-induced variations in VIs can be one of the noises in the background. Besides, NDVI has certain limitations over perennial crops like oil palm, which is more stable in terms of leaf area index across the ages. This is because NDVI tends to be saturated, especially in crops with a bigger canopy and fully closed leaves, and increasing biomass does not improve reflectance on such occasions (Myneni and Williams 1994;Prabhakara et al. 2015). Therefore, more studies should be explored to eliminate these factors by using narrow-band indices. Previous studies have also shown the importance of meteorological data in identifying and estimating oil palm yield (Paterson et al. 2015;Sarkar et al. 2020;Siang et al. 2020). Although the predicted and estimated oil palm yield shows a strong relationship, this model-based forecasting methodology does not consider the unexpected weather condition. Climate and agronomic considerations should be considered in future research. The majority of the previous studies demonstrated the usefulness of vegetation indices in yield prediction (Balasundram et al. 2013, Setyowati et al. 2016Diana and Dharma 2019). However, these previous findings have a limitation in a more targeted and comprehensive satellite mapping techniques for subsequent oil palm yield prediction. In this study, we demonstrated the importance of comprehensive feature selection approaches for oil palm mapping and classification, which can then be used for subsequent yield prediction. Additionally, this study is the first to illustrate the significance of a time-series machine learning approach to resolve NDVI time series fluctuation issues. This time-series machine learning approach should be expanded in future studies to cover significant factors.

Conclusion
This study attempted to investigate oil palm mapping and yield prediction tasks using Landsat-8 time-series images-driven NDVI with ensemble learning method in the state of Pahang of Malaysia. The mapping task of the oil palm was done to show the location of the oil palm with optimal age. A feature selection algorithm with linear projection was applied to select the best combination of spectral indices in oil palm classification. The use of the feature selection method has the advantage of reducing computation power by processing only the most relevant spectral indices. Chronological changes of NDVI with time have been manifested to figure out the maximum NDVI for the oil palm growing period (2016-2020) and it was discovered that the average NDVI value was 0.66 and the maximum was 0.80 in November 2016. In this study, the prediction of oil palm yield by ensemble learning algorithms used a walk-forward validation approach. Using the walkforward validation approach, ensemble learning algorithms (RF and AdaBoost) were able to produce a superior prediction in this study. RF was the best model, followed by AdaBoost for estimating oil palm yield with the lowest prediction errors (RMSE, MSE and MAE). The best model was able to make a closer prediction trend similar to the actual observation. In future work, more focus should be given to yield-determining factors such as climate and agronomic data. Since this is a state-scale mapping and yield prediction in a larger context, block-based should be applied in the near future for precise area identification and yield prediction. Illustrating and highlighting such accurate prediction can help decision makers be aware of the problem; thus, regulating their management practices. More studies on temporal data should be made to identify temporal inconsistency. Furthermore, high-resolution satellite images should be used to improve mapping accuracy.