Comparing modelling performance of chemometric methods for wood discrimination by near infrared spectroscopy

ABSTRACT Comparative wood anatomy is the most widely accepted (traditional) method for wood identification. However, there is an ongoing search for an effective alternative for cases where traditional methods cannot distinguish wood at the species level. Near-infrared spectroscopy (NIRS) is one of the developing methods for wood identification. Near-infrared data of Scots pine, black pine, sessile oak and Hungarian oak were collected and examined in the spectral range of 12,000–4000 cm−1 with a resolution of 4 cm−1. Data were analyzed by partial least squares discriminant analysis (PLS-DA), decision trees (DT), artificial neural networks (ANN) and support vector machines (SVM). Raw data were subjected to multiplicative scatter correction (MSC), standard normal variate (SNV), Savitzky–Golay derivatives (first [FD], second [SD]) and smoothing (Sm), and combinations of these preprocessing methods (Sm + FD, Sm + SD, FD + MSC, FD + SNV). Model performance was compared through test accuracies. Accuracies varied between 99–100%, 76–98% and 73–96% for genus level, oak species and pine species, respectively. PLS-DA and SVM were found to be the most successful models. This study revealed that it is possible to discriminate Scots pine from black pine, and sessile oak from Hungarian oak, by near-infrared spectroscopy and multivariate data analysis.


Introduction
In the forest and related industries, knowledge of the tree species is very important, because wood varies widely in its properties and the species affects how the material is utilized and processed. The wood used as a raw material in factories is generally supplied from forests, cut into forms such as lumber, logs and timber, and is free from branches, leaves and even bark. For this reason, many features that help in identifying tree species are unavailable, and identification is made solely by examining the anatomical structure of the wood. In wood anatomy, identifying the tree species accurately and reliably requires an expert and is labor-intensive. Because wood is an organic (thus, a biological) material, its anatomical structure responds to a wide variety of environmental influences. Consequently, there are anatomical differences between individuals of the same species, and differences can be observed even with the position of the wood in the tree (e.g. trunk, branch) or between wood formed at different ages (mature wood, juvenile wood). Thus, the quantitative and qualitative features of different wood pieces differ from each other. The level at which identification will or can be carried out (e.g. family, genus, species) is one of the most difficult subjects in wood identification, and identification to the species level is not always possible (Wheeler and Baas 1998, Hather 2000, Schoch et al. 2004). On the other hand, wood identification is essential not only for commercial purposes; it is also valuable for historical, archaeological, paleontological, climatological and geological studies and for forensic analyses (Wheeler and Baas 1998).
In general, the wood of tree species is classified as softwood or hardwood. Although this distinction is quite easy for trained eyes, in some cases even experts may be unable to distinguish trees at the species or even the genus level. Although distinction at the genus level is sufficient industrially, distinction between species groups can gain importance depending on the place and purpose of use. Among the hardwoods, oaks (Quercus spp.) have a wide variety of species and an impressive distribution throughout the northern hemisphere (Axelrod 1983, INRAE 2020). This genus can be divided into three main species groups: white oaks, red oaks and evergreen oaks (Berkel and Bozkurt 1961). Within these species groups, identification at the species level from the wood alone is considered close to impossible, yet it is required in many cases. While Q. robur, Q. petraea and Q. alba in the white oak group are used and preferred for high-quality coopering, Q. frainetto in the same group is not preferred for this purpose. It should be noted that these species are in many cases all called Hungarian oak, and this practice may cause commercial confusion that should be avoided (Doussot et al. 2000, Anonymous 2015). As one of the softwood genera, pines are represented by a large number of species and, like oaks, have a wide distribution in the northern hemisphere. Pine species provide a frequently preferred material owing to their wide range of industrial uses (timber, pulp and paper, particleboard, etc.) and ease of processing (Burdon 2002, Plomion et al. 2007). Pines have two subgenera: hard pines and soft pines. As with the oaks, species distinction is both difficult and important for the pines. Scots pine, black pine and mugo pine (Pinus sylvestris L., P. nigra J.F.Arnold, Pinus mugo Turra), which are in the hard pine group, are accepted as inseparable on the basis of their anatomical structures (Hather 2000, Schoch et al. 2004). The distinction of these species gains importance because pine has been used as a construction material in buildings for many years, and the same tree species must be used to preserve originality during the restoration of historical buildings (Hwang et al. 2016).
Regardless of industrial, commercial or historical importance or area of use, the most valid diagnostic methods are the traditional ones, which are based on comparative wood anatomy through macroscopic and microscopic examinations made by an expert. However, traditional methods may be insufficient for distinguishing species in woods that carry no additional diagnostic information. With developing technology, the increase in computer-human interaction and the desire to reach information in a shorter time, non-destructive methods independent of the human factor are gaining ground day by day. Different methods have been tried for this purpose with the help of advanced statistics, computer science and machine learning. Near-infrared spectroscopy combined with chemometric methods is one of these, and it has given very promising results in studies on various wood species. These chemometric approaches involve preprocessing methods and multivariate analyses. Preprocessing is used because reflectance spectroscopy of solid materials such as wood is easily affected by physical phenomena and is prone to noise (Leblon et al. 2013, Tsuchikawa and Kobori 2015, Wang et al. 2021). To correct these effects, the raw near-infrared data are preprocessed before the modeling phase. In addition to raw datasets (Brunner et al. 1996, Adedipe et al. 2008, Tounis 2009, Nisgoski, de Oliveira, et al. 2017, Nisgoski et al. 2018, Leandro et al. 2019, Li et al. 2019, Pace et al. 2019, Vieira et al. 2021), various preprocessing methods and their combinations have been used in near-infrared studies for wood identification. While some researchers preferred to use only first derivatives (Haartveit and Flaete 2008, Russ et al. 2009, Bergo et al. 2016, Lang et al. 2017) or only second derivatives (Tsuchikawa et al. 2003, Horikawa et al. 2015, Lazarescu et al. 2017, Vieira et al. 2021), others combined derivatives with MSC (Tounis 2009, Espinoza et al. 2012, Yong et al. 2019, Zhou et al. 2020), Sm (Shou et al. 2014, Bergo et al. 2016, Leandro et al. 2019, Yong et al. 2019, Zhou et al. 2020) or MC (Haartveit and Flaete 2008, Bergo et al. 2016, Snel et al. 2018). The linear models used to distinguish tree species from each other were mainly principal component analysis (PCA), partial least squares regression and discriminant analysis, and SIMCA (Dawson-Andoh and Adedipe 2012, Bolzon de Muñiz et al. 2016, Lazarescu et al. 2017, Snel et al. 2018, Leandro et al. 2019, Yong et al. 2019). In addition to these studies, more complex nonlinear models such as artificial neural networks (Lazarescu et al. 2017, Nisgoski, de Oliveira, et al. 2017) and support vector machines (Li et al. 2019, Zhou et al. 2020) have also been tried.
A review of previous studies showed that tree distinctions made by NIRS were often made at the genus or species group level, or at the species level to distinguish endangered species from similar-looking species. In the literature search, no study was found on the discrimination of the pine (P. sylvestris-P. nigra) and oak (Q. petraea-Q. frainetto) species, which are important for commercial and historical reasons. Among the NIRS studies on these genera, Adedipe et al. (2008) worked on the distinction between white oaks and red oaks at the species group level, while Zhang et al. (2014) and Hwang et al. (2016) studied the distinction at the species level for different pine species. In addition, our previous study examined the discrimination of pine species (P. sylvestris, P. nigra) by PLS-DA and the efficiency of preprocessing methods, and investigated the effectiveness of different models in classifying these pine species (Tuncer et al. 2021).
In this study, the classification capability of NIRS was examined from genus to species level, in order to see how the system operates as the task moves from easy to difficult, and this analysis was demonstrated with oak and pine species (Q. petraea-Q. frainetto, P. sylvestris-P. nigra). Thus, the success of different models in classification between species and between genera was investigated in relation to the preprocessing methods. In this way, the aim is to improve an area where traditional methods are limited. By sharing the collected data as an output of the study, in keeping with the aim of making science universal, accessible and reproducible by everyone, it is planned to contribute to future studies by helping to establish the data banks needed for the development of this method.

Material and methods
In this study, two species each of pine and oak were studied to represent frequently utilized softwood and hardwood species. To represent softwood, Pinus sylvestris L. (Scots pine) and Pinus nigra J.F.Arnold (black pine), which are hard to distinguish from each other with traditional methods, were selected. To represent hardwood, Quercus frainetto Ten. (Hungarian oak) and Quercus petraea (Mattuschka) Liebl. (sessile oak) were investigated. Tree samples were taken from the Ayancik (Sinop) and Sariyer (Istanbul) regions in Turkey. To increase the number of specimens, samples were collected both by cutting wood discs and by taking increment cores at breast height (1.30 m) (Table 1). All samples were air-dried at 20 ± 2°C and 65 ± 5% relative humidity until they reached constant weight. The cross (transverse) section of each sample, from bark to pith, was sanded with 80 and 180 grit to homogenize the surface.

Near-infrared spectroscopy (NIRS)
Near-infrared spectra were collected with an Antaris FT-NIR Analyzer (Thermo Nicolet Scientific, USA) at the Forest Products Laboratory, Madison, USA. Diffuse reflectance measurements were made using an integrating sphere accessory in the spectral region from 4000 to 12,000 cm−1 with a resolution of 4 cm−1 (4151 variables), and 64 scans were averaged per spectrum. Since the chemical properties change from pith to bark in the radial direction (Zobel and van Buijtenen 1989, Schweingruber 2007), more than one measurement area (5-10 per sample), each with a diameter of 1 cm, was marked along the radial direction on each sample (Figure 1) (Tuncer 2021).

Data analysis
Discrimination of wood species by near-infrared spectroscopy coupled with chemometric techniques was carried out in four main data analysis steps (Figure 2). All chemometric analyses were performed in R software (R Core Team 2018) with some modifications of various packages/libraries (Appendix).

Preprocessing
Reflectance near-infrared data of solid samples are prone to noise and are easily affected by physical phenomena (such as changes in the light path, light scattering and baseline shifts), which reduce data quality (Rinnan et al. 2009). To remove these undesirable effects and enhance the spectral features of interest, preprocessing methods were applied. The main goal of the preprocessing stage is to obtain models that relate to the chemical composition of the samples (not to noise) and to improve the model built afterward (Rinnan et al. 2009, Gholizadeh et al. 2015, Olivieri 2018). In this study, commonly used preprocessing methods (Tuncer 2020) were applied to remove or at least reduce the effects of noise, light scattering and baseline shifts, and to improve the performance of the subsequent classification model. Raw data were preprocessed by multiplicative scatter correction (MSC) (Geladi et al. 1985), standard normal variate (SNV) (Barnes et al. 1989), the Savitzky-Golay algorithm (Savitzky and Golay 1964) for first and second derivatives (FD and SD) and smoothing (Sm), and combinations of these preprocessing methods (FD + SNV, FD + MSC, Sm + FD, Sm + SD) (Figure 2). For preprocessing, raw data were imported into the R environment, and code from several R packages was used with modifications (Kucheryavskiy 2018, Mevik et al. 2019).
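As an illustration of two of the scatter corrections named above, the following is a minimal Python sketch of SNV and MSC. It is not the code used in the study, which relied on modified R packages. The function names and the toy spectra are hypothetical, and the mean spectrum is assumed as the MSC reference.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre one spectrum on its own mean and
    scale it by its own sample standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(x - m) / s for x in spectrum]


def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    Each spectrum is regressed on a reference spectrum (assumed here to be
    the mean spectrum of the set) as spectrum = a + b * reference, and the
    corrected spectrum is (spectrum - a) / b.
    """
    n_vars = len(spectra[0])
    if reference is None:
        reference = [mean([s[j] for s in spectra]) for j in range(n_vars)]
    ref_mean = mean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = mean(s)
        # Ordinary least-squares slope and intercept against the reference
        slope = sum((r - ref_mean) * (x - s_mean)
                    for r, x in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected, reference


# Hypothetical toy spectra: the same shape with different offset and scaling
shape = [1.0, 2.0, 4.0, 3.0]
toy_spectra = [shape,
               [0.5 + 2.0 * x for x in shape],
               [-0.2 + 0.5 * x for x in shape]]
msc_spectra, msc_reference = msc(toy_spectra)
snv_spectra = [snv(s) for s in toy_spectra]
```

In the toy data the spectra differ only by an additive offset and a multiplicative factor, so both corrections bring them onto a common curve. Real spectra also differ chemically, and that difference is what remains after correction.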

Chemometric modelling
To provide diversity in sampling, trees from different areas, ages and diameters were studied, and measurements were made in more than one area from pith to bark in each tree. To preserve this diversity, the measurements collected from the same tree were assigned to the sets manually, so that they were all in the same group (i.e. training or test). Each data set (raw and preprocessed) was divided into two groups: calibration and test. The calibration set, consisting of two-thirds of the data, was used to build the chemometric models (Table 1). The test set, consisting of one-third of the data, was used to check the models. Classification of wood by genus and species was carried out with partial least squares discriminant analysis (PLS-DA), diagonal linear discriminant analysis (DL-DA), decision trees (DT), artificial neural networks (ANN) and support vector machines (SVM).
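The grouping rule described above, where all spectra from one tree go to the same set, can be sketched as follows. The study assigned trees manually; this Python sketch instead assumes a seeded random assignment of whole trees within each species, and the function name, tree identifiers and labels are hypothetical. Because it selects one-third of the trees rather than one-third of the spectra, the resulting proportions only approximate those in the text.

```python
import random
from collections import defaultdict


def split_by_tree(tree_ids, labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) such that every measurement of a given
    tree lands in the same set. Tree identifiers are assumed to be unique
    across species."""
    trees_by_class = defaultdict(set)
    for tree, label in zip(tree_ids, labels):
        trees_by_class[label].add(tree)

    rng = random.Random(seed)
    test_trees = set()
    for label in sorted(trees_by_class):
        trees = sorted(trees_by_class[label])
        rng.shuffle(trees)
        # Hold out roughly test_fraction of the trees of each species,
        # and at least one tree
        n_test = max(1, round(len(trees) * test_fraction))
        test_trees.update(trees[:n_test])

    train_idx = [i for i, t in enumerate(tree_ids) if t not in test_trees]
    test_idx = [i for i, t in enumerate(tree_ids) if t in test_trees]
    return train_idx, test_idx


# Hypothetical example: three trees per species, several spectra per tree
example_trees = ["p1"] * 3 + ["p2"] * 2 + ["p3"] * 4 + ["o1"] * 2 + ["o2"] * 3 + ["o3"] * 2
example_labels = ["pine"] * 9 + ["oak"] * 7
train_idx, test_idx = split_by_tree(example_trees, example_labels)
```

Splitting at the tree level keeps spectra from one tree out of both sets at once, so the test accuracy reflects prediction on unseen trees rather than on unseen spots of already-seen trees.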
Partial least squares discriminant analysis (PLS-DA) was carried out according to the SIMPLS algorithm, a straightforward implementation of a statistically inspired modification of the partial least squares (PLS) method (de Jong 1993, Kucheryavskiy 2018). Diagonal linear discriminant analysis (DL-DA), which is based on the assumption that the covariance matrix is diagonal, was carried out according to the linear discriminant rule and the Gaussian maximum likelihood function (naive Bayes) (Friedman 1988, Guo et al. 2005, Wehrens 2011). Decision trees (DT) were built according to the classification and regression trees (CART) algorithm. This algorithm is an implementation of recursive partitioning, using the Gini impurity index as the splitting criterion (Breiman et al. 1984, Therneau and

Model performance metrics
The performance of all models was evaluated with sensitivity (SEN), specificity (SPE) and correct classification rate (CCR)/accuracy metrics based on the confusion matrix (Table 2).

Interpretation
The selectivity ratio (SR) was used for variable selection and for interpreting the latent variables used in the PLS-DA models (Rajalahti et al. 2009, Tran et al. 2014, Farrés et al. 2015):

$$SR_i = \frac{v_{\mathrm{expl},i}}{v_{\mathrm{res},i}}$$

where $v_{\mathrm{expl},i}$ is the explained variance and $v_{\mathrm{res},i}$ is the residual variance for each variable $i$. To determine variable importance, the variables are ranked from high to low according to their SR values (Kvalheim 2020).
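The performance metrics and the selectivity ratio described above can be sketched in a few lines of Python. This is a hedged illustration, not the study's R code. It assumes the usual two-class definitions of sensitivity, specificity and correct classification rate, and it takes SR as the ratio of explained to residual variance per variable. The function names and all numbers are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and correct classification rate for a
    two-class problem, computed from the confusion-matrix counts.
    Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "SEN": tp / (tp + fn),          # true positives / all positives
        "SPE": tn / (tn + fp),          # true negatives / all negatives
        "CCR": (tp + tn) / len(pairs),  # correct predictions / all samples
    }


def selectivity_ratio(v_expl, v_res):
    """SR per variable: explained variance divided by residual variance."""
    return [e / r for e, r in zip(v_expl, v_res)]


def rank_variables(sr):
    """Variable indices ordered from highest to lowest SR."""
    return sorted(range(len(sr)), key=lambda i: sr[i], reverse=True)


# Hypothetical predictions for four pine and six oak test spectra
y_true = ["pine"] * 4 + ["oak"] * 6
y_pred = ["pine", "pine", "pine", "oak"] + ["oak"] * 6
metrics = binary_metrics(y_true, y_pred, positive="pine")

# Hypothetical explained and residual variances for three wavenumbers
sr = selectivity_ratio([1.0, 9.0, 2.0], [4.0, 1.0, 2.0])
ranking = rank_variables(sr)
```

For more than two classes, the same counts can be taken one class at a time by treating each class in turn as the positive class.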

Results and discussion
The near-infrared spectra of the untreated wood samples (raw data) and the mean spectra of the preprocessed data are presented in Figures 3 and 4, respectively. No visible difference could be detected in the raw spectral shapes of the pine and oak species. After preprocessing, clustering tendencies were observed among the spectral data of the same genus (with SNV, MSC and Sm). However, the visual differences useful for classification remained minor, since they arise from the amount of absorption (Figure 4, appendix). As mentioned above, five different modelling approaches were compared in this study. To determine the best-performing model, correct classification rates were calculated in terms of model fit. In addition, correct classification rates were analyzed separately for the training and test sets to avoid misinterpretation and overfitting. The correct classification rates of the calibration models with each preprocessing method are shown in Tables 3 and 4 for the training and test sets, respectively.
Generally, a calibrated model should represent the data well, but not too closely, because that may result in overfitting. Overfitting a model to the training data set can give overly optimistic results and prevents the model from generalizing (James et al. 2013). In this study, the possibility of overfitting of the calibration models to the training data sets was revealed by predicting the test sets. When the correct classification rates of the training and test sets were compared (Tables 3 and 4), the models that fitted the training set notably well had relatively low correct classification rates when tested with new data (the test set), as expected (Hastie et al. 2017). In addition, some of the DT models (such as raw, MSC and Sm for pine species) were considered to overfit the training data set, which is a common flaw of DT models (Nisbet et al. 2018).
Figure 3. Near-infrared spectra of wood samples. Scots pine (green), black pine (black), sessile oak (pink), Hungarian oak (grey).
As with the traditional methods, classification of wood species at the genus level with NIRS and chemometric models is quite easy. For genus-level separation, all calibrated models performed well with at least one preprocessing method (99-100%). At the species level, correct classification rates varied between 76-98% and 73-96% for oak and pine species, respectively. In terms of model performance (given in Table 4), the PLS-DA and SVM models can both be interpreted as successful in general. In line with this study, the correct prediction rates of previous studies that used partial least squares discriminant analysis (PLS-DA) in wood discrimination varied between 60 and 100%, depending on the wood species and preprocessing methods (Flaete et al. 2006, Tounis 2009, Horikawa et al. 2015, Yang et al. 2015, Bergo et al. 2016, Hwang et al. 2016, Lazarescu et al. 2017, Snel et al. 2018, Yong et al. 2019, Leandro et al. 2019, Pace et al. 2019, Zhou et al. 2020).
It was observed that correct prediction rates tended to decrease from studies performed on different genera to studies on different species with similar macroscopic and microscopic characteristics. Because this tendency also holds for traditional methods, it can be concluded that the results obtained with this technology are not a matter of data manipulation. It can be said that the use of near-infrared spectroscopy in wood classification is very convenient, non-destructive, fast and reliable when supported by accurate chemometric methods. For this, robust models should be established before any practical use. To establish successful models, reliable NIR databases of sufficient quantity and diversity must be built, as in the pharmaceutical sector.
In the literature, linear models (such as PLS-DA, PLS-R), which are relatively easy to interpret and perform well, have been preferred. In addition, nonlinear models, regarded as closed boxes, have been tried in tree species classification in recent years. ANN and SVM are the leading nonlinear methods used with NIRS in tree species classification. In ANN and SVM models, it is not known how much each independent variable affects the dependent variables. These models are acknowledged to be difficult to interpret; hence, they are called closed box/black box models (Xiaobo et al. 2010, Ciaburro and Venkateswaran 2017). Nonetheless, they are preferred for their good prediction rates.
As seen in Table 4, the correct prediction rates for genus, oak species and pine species were 100, 83 and 75% for ANN and 100, 97 and 94% for SVM, respectively. The variation in the correct prediction rates is also similar to that of the linear models. As with the general trend in traditional methods, the success of near-infrared spectroscopy in wood classification decreased from genus to species level.
The effects of the preprocessing methods were examined as the percentage improvement in correct prediction rate relative to the raw data sets (Table 5). As can be seen, within the same data set a preprocessing method can increase the correct prediction rate of one model while decreasing that of another. Similarly, when the effect of a preprocessing method on the same model is examined with a different data set (different wood species), the improvement differs even though the model and the preprocessing method are the same (Table 5).
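The comparison against raw data can be sketched as below. The exact formula behind Table 5 is not given in the text, so this Python sketch assumes one plausible reading: the relative change in correct prediction rate with respect to the raw data, in percent. The function name and the accuracy values are hypothetical.

```python
def improvement_over_raw(accuracy_by_preprocessing):
    """Relative change (%) in correct prediction rate versus raw data.

    accuracy_by_preprocessing maps a preprocessing name to the test-set
    correct prediction rate of one model; the key "raw" is the baseline.
    """
    raw = accuracy_by_preprocessing["raw"]
    return {
        name: 100.0 * (acc - raw) / raw
        for name, acc in accuracy_by_preprocessing.items()
        if name != "raw"
    }


# Hypothetical correct prediction rates (%) for a single model
example = {"raw": 80.0, "SNV": 88.0, "FD": 76.0}
changes = improvement_over_raw(example)
# A positive value means the preprocessing helped this model,
# a negative value means it hurt
```

Running the same function per model and per data set gives the kind of table in which one preprocessing method helps one model and hurts another.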
Examination of the mean spectra indicated that discrimination becomes harder from genus level to pine species, as expected (Figure 5). The differences between species arose from the amount of absorption, not from the shape of the spectra as at the genus level. In other words, discrimination between hardwood and softwood was due to the wavenumbers/shape of the spectra, while discrimination between species was related to the absorption level. At the genus level, this can be explained by the qualitative and quantitative differences in the chemical structure of hardwood and softwood. In particular, the chemical structures of the polyose and lignin components in hardwoods and softwoods are quite different from each other. Hardwoods contain more polyoses and have a high proportion of xylose units and acetyl groups, while softwoods contain more lignin and a high proportion of mannose units. The lignins (guaiacyl, syringyl and p-hydroxyphenyl) found in softwoods and hardwoods show significant differences in their content and functional groups (Fengel and Wegener 1983). These structural differences can be explained by different functional groups absorbing at different wavenumbers. In light of this information, it is expected that the distinguishing features in genus discrimination will be related to changes along the x-axis (wavenumbers), while the distinguishing features of species in the same genus will be related to changes along the y-axis (absorbance). To understand the modelled relationship, the selection of the original variables is essential (Xiaobo et al. 2010, Kvalheim 2020, Mehmood et al. 2020). In addition, the determination of important variables is highly related to the success of the model.
To interpret the modelled relationship, linear (selectivity ratio for PLS-DA) and nonlinear (decision trees: tree nodes) approaches were used and compared with the mean spectra (Table 6). The selected (important) variables and the absorption differences between species were compared and found to be quite compatible with each other and with the literature (Table 6 and Figure 6).
When Figure 6 and Table 6 were examined, it was seen that the wavenumbers used in the classification varied depending on the tree species: similar ranges were used at the genus level and different ranges at the species level. Only the variables selected by the decision tree (DT) covered almost the entire spectrum in different situations.
Although decision trees are easier to interpret, overfitting is a problem, as seen in this study, and it results in unstable models. This problem could be overcome by using Random Forest, which combines multiple decision trees but compromises interpretation. Apart from the wavenumbers determined in the decision trees, the general trend is focused on the 4000-10,000 cm−1 range, which is preferred in qualitative analyses such as classification of wood species or genus (… 2011, Bächle et al. 2010, Popescu et al. 2018). Most importantly, it was deduced that the wavenumbers/variables considered important by the classification approaches used in species or genus determination depend on the tree species. The important wavenumbers in wood identification by NIRS also varied according to the model and the method used to interpret that model. For this reason, changes in the selected wavenumbers in wood identification should be examined thoroughly and interpreted in more detail with different variable selection methods in future studies.

Conclusion
In this study, the classification capability of NIRS was examined from genus to species level, and these analyses were demonstrated with oak species for hardwoods (Q. petraea-Q. frainetto) and pine species for softwoods (P. sylvestris-P. nigra). In this way, the aim is to improve a field where traditional methods are limited. The outputs obtained from the results of this study can be summarized as follows:
- The use of near-infrared spectroscopy in combination with chemometric methods for tree species classification is not data manipulation.
- It is possible to discriminate P. sylvestris from P. nigra and Q. petraea from Q. frainetto by near-infrared spectroscopy and chemometric methods. Classification of these wood species was achieved by NIRS and multivariate data analysis.
- The best-performing models for wood discrimination via NIRS were found to be PLS-DA (linear) and SVM (nonlinear).
- The use of chemometric methods is necessary for the effective use of near-infrared spectroscopy.
- Preprocessing is a very significant, even essential, step that should not be ignored. Preprocessing methods have a large impact on the success of near-infrared spectroscopy, regardless of any instrumental changes.
- There is no perfect preprocessing method for near-infrared data. The most suitable preprocessing methods should be selected by trial in every modelling phase for every different wood species.
- The range of 10,000-4000 cm−1 was observed to be effective in wood identification.
It can be concluded from this study that the use of near-infrared spectroscopy in wood classification is very convenient, non-destructive, fast and reliable when supported by accurate chemometric methods. To develop robust models before any practical use, reliable NIR databases of sufficient quantity and diversity should be built. Last, but most certainly not least, an international compilation of an open-access NIRS database for different wood species is very important for this method to be considered alongside the traditional methods.