Quantifying Uncertainty in Lumber Grading and Strength Prediction: A Bayesian Approach

This article presents a joint distribution for the strength of a randomly selected piece of structural lumber and its observable characteristics. In the process of lumber strength testing, these characteristics are ascertained under strict grading protocols, as they have the potential to be strength reducing. However, for practical reasons, only a few such selected characteristics among the many present, are recorded. We present a data-generating mechanism that reflects the uncertainties resulting from the grading protocol. A Bayesian approach is then adopted for model fitting and construction of a predictive distribution for strength that accounts for the unrecorded characteristics. The method is validated on simulated examples, and then applied on a sample of specimens tested for bending and tensile strength. Use of the predictive distribution is demonstrated, and insights gained into the grading process are described. Details of the lumber testing experiments can be found in the online supplementary materials.


INTRODUCTION
Concern about the effect of climate change on the growth of trees (Andalo, Beaulieu, and Bousquet 2005) combined with technologically induced changes in the way they are grown and processed to make lumber, has led to the establishment of longterm monitoring programs (Kretschmann, Evans, and Brown 1999). These concerns and changes have in turn led a renewed interest in ways of assessing lumber properties, including innovative analytical approaches that exploit modern statistical theory.
A property of great importance is lumber strength since construction is a primary use of this product. It must be strong enough to meet future demands in the form of both dynamic and static loadings. However, strength is highly variable. So a system for classifying lumber into grades has been developed to reduce that variability within grade classes; design values are then set on the basis of the in-grade conditional distributions. Consumers select a grade of lumber appropriate for their anticipated loadings.
Grading is done in accordance with grading rules, which involve characteristics that can be determined without destructive testing, especially ones that relate to strength. This process, which yields the requisite conditional distributions and hence design values, has worked well and withstood the test of time. The possible long-term effects of changes in climate and lumber production technology on the strength of lumber, however, anticipate a future need to modify the grading rules to preserve the design values. Thus, we seek to leverage modern statistical methods and computational power for constructing the conditional strength distributions based on strength determining characteristics, and for evaluating the efficacy of current grading rules.
The task is to predict lumber strength based on its recorded characteristics from the grading process. There are a few broad types of characteristics; knots, for example, are the most commonly occurring type of characteristic. A piece of lumber often has multiple characteristics, and the difficulty in our context is that not all characteristics present are recorded; most are nonrandomly censored, which distinguishes this work from classical regression and missing-data problems. We must rely on records that have much missing information, while accounting for the processes described in the next section that generate the experimental data. Together these factors merit a Bayesian approach (Section 3) that ensures a coherent hierarchical framework for linking the elements of the process while reflecting that uncertainty. The next section describes the relevant basic features of our problem and thereby lays the groundwork for the remainder of the article. Using the framework, we can quantify the uncertainty involved in generating the data on the characteristics, and study how the characteristics affect the two major strength properties, namely the "modulus of rupture" (MOR) and the "ultimate tensile strength" (UTS). To that latter end, a predictive distribution for strength can be built from the fitted model and we illustrate its application to the prediction of strengths for future pieces of lumber.
The article is laid out as follows. Section 2 describes the methods prescribed for testing the strength of lumber. Section 3 describes the data-generating mechanism and proposes a Bayesian approach for handling the censored characteristics and constructing the distribution of interest. That model is validated through a series of simulation examples in Section 4. The method is then applied in Section 5 on an experimental dataset generated in a wood products testing laboratory. Section 6 gives our concluding remarks.

ASSESSING THE STRENGTH OF LUMBER
Grading assesses the features of each piece of lumber (specimen) that are likely to affect its strength or utility. "Knots" (formed when branches or limbs are incorporated in a piece of lumber), "shake" (a lengthwise separation of the wood), and "slope of grain" (the deviation of the line of fibers from a line parallel to the sides of the piece) are examples of characteristics that often, but not always, limit the strength of a piece of lumber when tested to failure.
Destructive strength tests are also carried out on a limited but representative sample of lumber for calibration purposes on a chosen population. Such a population might consist of all pieces currently available of a specified grade, size, and species. In a destructive test, the piece of lumber is subjected to an increasing load until it fails. The maximum stress reached immediately before failure is called its "strength." We now outline the protocol used for carrying out a laboratory experiment on a chosen population with the aid of a human grader, to obtain such calibration data. The steps of testing each piece of lumber proceed in the following order: 1. Grader confirms grade (piece is admissible). Wood processed at a mill is sorted into grades before being bundled for sale. This proceeds in accordance with particular standards involving (see, e.g., Green, Ross, and McDonald 1994) "grade controlling characteristics." Some characteristics, which may be purely cosmetic in nature, are included because they affect the commercial value of the piece. Other characteristics are also controlled for, as they are deemed to be strength reducing. In many modern mills, automated technology is used to classify sawn lumber into "grades." Therefore, in a bundle to be used for calibration purposes, a human grader first examines each piece to confirm that it indeed belongs in the population of interest. Specimens found to have been misclassified are marked as "off-grade" and excluded. 2. Grader selects and records M. The visual characteristic thought most likely to cause a piece to fail under a destructive test is called its "maximum strength-reducing characteristic" (MSRC, or M for short). Practical considerations have meant that M is recorded in a coded form for each piece. The record of M includes the category of the characteristic (e.g., knot, shake; see Table 4 in Sec-tion 5 for the full list of codes used by our grader), as well as a description. It is not unusual for failure to occur not at M, but at some other characteristic, as wood is highly unpredictable. Traditionally, M is visually selected by a human grader who has been trained to meet industry standards that "a maximum of 5% below grade as an allowable variation between agency qualified graders" can be maintained (NIST 2010, p. 10). More recently, machine vision or automated visual grading is being used to improve the efficiency of selecting M, designed to apply the same selection rules used by human graders. 3. Measure and record MOE. Methods have been developed to assess strength without destructive testing. The most widely used characteristic for this purpose is the "modulus of elasticity" (MOE), which can be measured in various ways, each of which quantifies the stiffness of a piece of lumber (Ross et al. 1991). A simple model of lumber strength could use MOE as a linear predictor of MOR (Kretschmann, Evans, and Brown 1999), as the two tend to be positively correlated. For the data in this study, MOE measurements are obtained by the transverse vibration method (Pellerin 1965). Measurements of MOE are typically given in millions of pounds per square inch (psi).

Perform destructive test and record breaking strength.
Two common destructive tests used in the industry are the "modulus of rupture" (MOR, or R for short), and "ultimate tensile strength" (UTS, or T for short). R is found by bending a specimen until it breaks, while T is found by stretching the piece longitudinally until it separates. The test to be performed on the specimen will be chosen in advance; in this step, the specimen is broken according to the chosen test and its breaking strength is recorded. Measurements of R and T are typically given in thousands psi. 5. Grader examines failure and records C. The characteristic at which the specimen broke is called the cause of failure (C for short). The grader examines the broken specimen and records C in coded form. The possible categories of C come from the same list as M. Ideally, the C will be the same characteristic as M, but it is not unusual for M and C to turn out being different characteristics. This shows the inherent difficulty of correctly identifying M beforehand, and this is one aspect we wish to quantify with the modeling approach that follows.
The use of such visual and physical characteristics to predict breaking strengths of individual pieces of lumber has been a much-studied subject. For example, Taylor et al. (1992) considered tensile strength models that treat a piece of lumber as consisting of a set of smaller contiguous segments. Lei, Zhang, and Jiang (2005) used a regression approach to predict MOR based on tree and stand characteristics. Divos and Tanaka (1997) also used a multiple regression approach to predict both bending and tensile strength. Significant regressors found by that study were MOE and modified knot diameter ratio, and it was found that both machine stress grading and appropriate visual grading are important for lumber. However, we have not found previous studies that attempt to coherently capture the entire grading process and the uncertainties involved.

A Data-Generating Mechanism
The simplest model for lumber strength relates a destructive strength property Y to the MOE v according to the linear regression model (Kretschmann, Evans, and Brown 1999) for regression coefficients β 0 , β 1 and normal error . This section extends this model in a way that accounts for the additional information recorded in the maximum strength-reducing characteristic M and the cause of failure C, and captures a meaningful description of the underlying process that generated the data.
Of interest are the visual characteristics deemed to potentially cause the failure in a destructive test. In this study, we work with three major categories of such visual characteristics: knots (k), shakes and grain deviations (s), other (o). The number of significant knots on a lumber specimen is random; some pieces have many large knots, while others have none at all. A plausible model for the number of significant knots N k would be a Poisson distribution. Some specimens have a long crack (shake) or a major grain deviation, while others do not. We shall indicate the presence of shake by N s = 1, and absence by N s = 0. Finally, for simplicity we always allow one characteristic in the category "other" to act as a baseline and cover all the remaining varieties of miscellaneous defects that are neither knots nor shake, and which may not be fully evident until the piece is broken.
Considering an individual specimen, it has a total J = N k + N s + 1 independent characteristics, where are independent. Each of the J characteristics has a latent effect on the strength, listed in the vector . Each characteristic in X o belongs to one of the three major categories, as indicated in the following vector corresponding to X o , For instance, a specimen with four total characteristicscomposed of two knots, the presence of shake, and other-would have t = [k, k, s, o]. The distribution of a latent effect will depend on the category of the characteristic, which we model using independent Normals for i = 1, . . . , J , Only the most severe one of the J characteristics will be the cause of failure, namely C = i : . Therefore, we postulate that a regression model including X o as a predictor will only depend on X o C , namely where the coefficient γ acts as a scaling factor for relating the latent effect to a corresponding change in the strength Y. The remaining error term ∼ N (0, σ 2 e ) is small and intends to capture measurement uncertainty in Y. Note that it is not possible to fit the model (4) directly, as the vector X o of latent effects is not observable.
A grader examining the specimen cannot know the true values X o , but instead observes the noise-contaminated vector of subjective assessments of effects For an unbiased grader, we would have b = 0, while a nonzero b can describe scenarios where the grader has a systematic bias in over/underestimating the effects of certain categories. If grader bias is suspected, we set b k = 0 as a baseline and include b s , b o as parameters to be estimated.
Prior to destructive testing, the grader selects the MSRC based on X, namely M = {i : X i = max(X)}. After destructive testing, the grader visually determines the characteristic at which the specimen failed and thusly identifies C, albeit without knowing the value of X o C . The frequency with which M and C agree depends on the size of σ 2 x ; smaller values of σ 2 x correspond to a larger probability that M = C for an unbiased grader.
Recall that grading rules stipulate that a coded category and description of the M and C be recorded by the grader. These must be converted to a quantitative measurement for modeling purposes. The coded description of a knot allows the grader's subjective assessment of its effect to be computed based on its size and location. We use these calculated quantities as values for elements of X in the case of knots, which is the most common characteristic. For shake and other, not enough detail is available to calculate a corresponding effect, and hence their corresponding elements in X will be treated as missing.
In summary, the data for a single test specimen consist of the following. The values [y, v, m, c] are always observed. In addition, if t m = k, then we observe the value x m . Likewise, if t c = k, we observe x c . The remaining elements of x are missing. Furthermore, the entire vector x o is missing, and the number of characteristics j ≥ 1 is missing.

Bayesian Inference
The parameters to be inferred are θ = [β 0 , β 1 , γ, μ k , σ 2 k , μ s , σ 2 s , μ o , σ 2 o , σ 2 x , b s , b o ] from a sample of independent test specimens. Expert knowledge provides values of λ k , p s , σ 2 e , which we assume to be known constants throughout the analysis. We adopt a Bayesian approach to handle this inference problem, with emphasis on the posterior predictive distribution of Y based on v, M, X M that is obtained by integrating over the missing characteristics and the posterior distribution of the parameters.
, iid for i = 1, . . . , J , we can represent X i by X i = X o i + Z i with all the X i and Z i independent. Then the events M = m and C = c can be expressed equivalently in terms of the X i 's and Z i 's, as With this representation, it follows that the term p(m, c|θ , n k , n s ) can be obtained from the CDF of the multivariate normal distribution.
Next, for computational purposes we expand the following joint probability, Any missing elements in x for each specimen must be integrated out in the likelihood computation. Priors π (θ ) are required to complete the specification of the posterior. While further expert knowledge concerning the effects of characteristics could be infused at this point, in our analysis we assume independent flat priors on the regression parameters as well as the μ's, and independent noninformative Inv-Gamma(0.001,0.001) priors on the σ 2 's. If grader bias is to be estimated, we give b s and b o Normal prior distributions with mean zero.
Denote the observed data from n specimens by D, and the observed and missing parts of x for an individual specimen by x obs and x mis , respectively. Then, after obtaining the posterior distribution π (θ|D) = π (θ) n l=1 p(y l , m l , c l , x l,obs |v l , θ ), this framework readily provides the strength predictive distribution of interest for a future piece of lumber. Its quantities v f , m f , x m f can be recorded without destructive testing, and using the likelihood function as expanded in (5) In cases where t m f ∈ {s, o}, then we must also integrate over the missing x m f .

Computational Methods
Computation of the likelihood of each specimen requires integration over a multidimensional vector consisting of x o c and the missing parts of x, for each potential combination of values n k and n s . To achieve usable computational speeds, a Monte Carlo integration procedure was used based on the expansion (7). Also, the maximum number of knots n k is truncated beyond six, which is a reasonable approximation when λ k ≤ 3. As a concrete illustration, consider the likelihood evaluation for n k = 1, n s = 1 where m = 1, c = 2, t = [k, s, o]. In this case, x o c and [x 2 , x 3 ] are missing. The likelihood (suppressing the parameters θ for simplicity) is Thus, the expectation is approximated by drawing samples of [X o 2 , U 2 , U 3 ] and averaging the values of the function within. Sample draws from the truncated multivariate normal are obtained via the R package tmvtnorm (Wilhelm and Manjunath 2014); drawing from the truncated distribution is a natural importance weight adjustment for handling the indicator functions I (x 2 < x 1 )I (x 3 < x 1 ) from (7) and increasing the efficiency of the Monte Carlo estimate. It was found empirically that 2000 draws is sufficient to reliably compute the total log-likelihood over n = 200 specimens. The likelihood computation for other values of n k , n s and other patterns of missingness in x proceeds in an analogous fashion.
The parameter vector θ has 12 dimensions, so a naive Markov chain Monte Carlo (MCMC) Metropolis sampling algorithm would be very inefficient. We first applied Nelder-Mead iterations on the posterior, from a set of crude parameter values. This yields a set of local modes from which to use as starting points for MCMC exploration. To improve the speed and reliability of subsequent convergence, we applied parallel tempering (Swendsen and Wang 1986) over 15 computing nodes for a range of temperatures geometrically spaced from 1 to 50. Metropolis-Hastings draws are used within the individual chains, with proposals adapted to the temperature of the chain. With this setup, the computational time required for 1000 posterior draws for a specimen size n = 200 is about 2 hr.
To obtain the posterior predictive distribution, p(y|v, m f , x m f ) is evaluated according to (8) on a grid of y values using the same techniques as in the likelihood computation. The approximate density at each value y is obtained by averaging (8) for MCMC samples that had been drawn from π (θ|D), after a suitable burn-in.

SIMULATION STUDIES
To explore the inference and prediction procedures in a controlled setting, we simulated data under three different scenarios.
The parameter values were chosen to roughly mimic the proportions of (k, s, o) in the real dataset, while expert knowledge provides the values σ 2 e = 0.04, λ k = 3, p s = 0.28. Parameters: To produce a simulated dataset according to given parameters, for each specimen we apply the following steps: ∼ N(1.6, 0.2 2 ), to mimic the distribution of MOE in our dataset. 4. Draw y ∼ N(β 0 + β 1 v + γ x o c , σ 2 e ). 5. Draw x and set m = {i : To study the properties of the method, the following scenarios were considered: 1. n k and n s known, and no grader bias. 2. n k and n s unknown, and no grader bias. 3. n k and n s unknown, and data generated with b s = b o = −10. We then set fairly diffuse priors N (0, 10 2 ) for b s and b o . The same dataset is used for Scenarios 1 and 2; the difference is in the model-fitting treatment of n k and n s . For the different scenarios, the counts of M and C are tabulated and shown in Table 1. Since there can be more than one knot, there are two cases to distinguish when t m = t c = k. The cell (k, k) indicates that M and C were the same knot, while (k , k) indicates that M and C turned out to be different knots. Scenario 3 is intended to mimic a pattern seen in the real data. The grader has a stronger preference for selecting knots as M, leading to asymmetry between the cells (k, o) and (o, k) for example.
Models were fitted to the three scenarios using the techniques described in the previous section. A summary of the posterior distributions of the parameters is shown in Table 2. The fits appear acceptable. The estimates relating to shake and other in Scenario 3 have comparatively more uncertainty, as grader bias results in relatively fewer pieces having those characteristics recorded; these must also be estimated together with grader biases, nonetheless fairly reasonable estimates were obtained.
Next, additional test sets of 100 pieces were generated to study the performance of the posterior predictive distribution p(y|v, m, x m ) under the scenarios. The summaries of the meansquared prediction errors (MSPEs) based on the posterior predictive modes are listed in Table 3. The difference between Scenarios 1 and 2 shows that the number of characteristics being considered J, if always recorded, has a role in improving predictive performance. In Scenario 3, grader bias skews the M that is recorded for new data, and increases prediction error slightly although bias terms have been fitted.
To quantify how much prediction uncertainty is due to parameter estimation, in Scenario 2 we compared the posterior predictive p(y|v, m, x m ) and p (y|v, m, x m , θ ), where the latter case refers to θ set at the true values. We found that using the latter for prediction decreases the MSPE for Scenario 2 from 1.17 to 1.06. To illustrate this graphically, Figure 1 plots  predictive density functions of a specimen computed for each of 100 posterior draws from π (θ|D) in gray and some spread is evident. The black curve shows the density based on the true values of θ.

Lumber Strength Dataset
We now illustrate the methodology proposed in this article on a real data example, obtained from a set of destructive experiments run in a wood products testing laboratory. The chosen population of interest for this study was 12-ft 1650f-1.5E Spruce-Pine-Fir (SPF) 2x4, with species composition as listed in appendix A of NIST (2010) and listed in the online supplementary materials for our sample bundles. This population is an example of machine-stress-rated lumber, which controls for the expected variability in MOE and R within the grade. Members of this population also satisfy certain restrictions on allowable visual characteristics, such as maximum allowable knot size and maximum length of shake. In other words, the grade of this population has been assigned based on machine-stress rating together with visual screening.
A sample of 710 pieces from this population was destructively tested, to collect data on the R (496 pieces) and T (214 pieces) strength properties. The steps involved in the preparation and testing of these samples are described in the supplementary materials available online, along with commentary on some examples of tested pieces. The M and C for these samples were coded by one human grader, following a system developed by a wood products firm, which lists 37 possible codes for characteristics. To test the effectiveness of the predictive model, the last set of R measurements (116 samples) and the same proportion of T measurements (50 samples) were not used for model fitting. Figure 2 shows the histograms of MOE (in millions psi), R (in thousands psi), and T (in thousands psi). Table 4 shows the frequency distribution of M and C over the raw codes, tabulated separately for R and T. The final column shows the category (k, s, o) that each raw code was assigned to for this analysis. Table 5 shows the frequencies of M and C after collapsing the raw codes into the three categories. The cell (k, k) indicates cases where the grader-selected knot as M is correct and turns out to be C. The cell (k, k ) indicates the cases where the grader selected one knot as M, but the specimen failed at a different knot C. The few specimens that were deemed by the grader to be "off-grade" (i.e., in this case not belonging to the grade 1650f-1.5E) were removed from the analysis.

Analysis Results
The category counts in Table 5 are strongly asymmetric, for example comparing the (k, o) and (o, k) cells in the bending table. This suggests that our grader overestimated the strength effects of knots compared to shake and other, and we include the bias terms b s and b o in the model fitting with N (0, 10 2 ) priors. With the full models fitted, the posterior distributions of the parameters are summarized in Table 6. Then applying the predictive distribution on the test sets, we found an MSPE of 1.45 for R and 1.35 for T.
The fitted mean effects of the three categories are not significantly different between the two destructive tests based on these samples. However, due to having fewer samples in the "shake" and "other" cells, the estimates of μ s and μ o have much uncertainty. The σ x values do appear to be significantly different between R and T. This suggests that it may intrinsically be more difficult to correctly select M in bending. From a physical perspective, the top and bottom edges of a specimen undergoing bending experience tensile forces and compressive forces, respectively; as the effect of a characteristic may depend on the type of force applied to it, additional uncertainty in the grader's effect assessments may thus ensue for bending specimens. In contrast, specimens undergoing tension experience uniform tensile forces throughout. There is significant grader bias in the selection of M for bending, as evidenced by the negative b s and b o values, which imply an underestimation of the effects of shake and other compared to knots. There is likely some bias for tension as well, but not as apparent. Even after accounting for grader bias, both the bending and tension σ x values are relatively large compared to the estimated latent effects. Taken together, these results suggest that the predictive power gained from incorporating M in its current form is small at best for a high-quality grade of lumber such as this one.

DISCUSSION AND CONCLUSIONS
This study presented a coherent framework for understanding the uncertainty involved in the process of grading lumber and for constructing a prediction rule to apply on future specimens. This permitted the effect of unrecorded characteristics, due to the design of grading protocols and the choices of the grader, to be modeled and analyzed. A fitted model provides quantification of uncertainty in the selection of M, which in turn informs the value of M as a predictor of C and ultimately breaking strength.
The model presented could quite easily be extended to include more than three categories. This would require more extensive testing, and would ideally take place on a more variable global population of lumber. The results of this immediate study do suggest some recommendations for such future testing. First, the amount of information recorded on the characteristics in the current grading protocol may be too heavily censored to be useful for prediction purposes. For instance, the number and types of characteristics present could be recorded without too much additional effort, even if they are not individually documented in detail, and yet provide valuable information toward prediction. Second, based on these results graders (whether human or automated) might require calibration to adjust for the bias currently seen in their subjective assessments of effects among the different characteristics, or alternatively the underlying rules for selecting M could be updated to increase the probability that M = C. A robust analysis toward this latter objective, however, would require a full documentation of all characteristics on test specimens, which we anticipate will be achievable in the future.

SUPPLEMENTARY MATERIALS
Lumber strength testing experiment: Details about laboratory experiment and the species of lumber samples used to produce the dataset analyzed in this article. Examples of specimens are also illustrated. (PDF file)