A general theory of effect size, and its consequences for defining the benchmark response (BMR) for continuous endpoints.

A general theory on effect size for continuous data predicts a relationship between the maximum response and the within-group variation of biological parameters, which is empirically confirmed by results from dose-response analyses of 27 different biological parameters. The theory shows how effect sizes observed in distinct biological parameters can be compared, and it provides a basis for a generic definition of small, intermediate and large effects. While the theory is useful for experimental science in general, it has specific consequences for risk assessment: it resolves the current debate on the appropriate metric for the benchmark response (BMR) in continuous data. The theory shows that scaling the BMR, expressed as a percent change in means, to the maximum response (in the way specified) automatically takes "natural variability" into account. Thus, the theory supports the underlying rationale of the BMR of 1 SD. For various reasons, however, it is recommended to use a BMR in terms of a percent change that is scaled to the maximum response and/or the within-group variation (averaged over studies), as a single harmonized approach.


Introduction
The preferred statistical method for analyzing dose-response data resulting from a toxicological study is the so-called Benchmark dose (BMD) approach (EFSA 2009; WHO 2009; USEPA 2012), which uses dose-response modeling (nonlinear regression) with the purpose of estimating the dose associated with a prespecified "small" effect for the relevant endpoints observed in the study. The prespecified effect is called the Benchmark response (BMR) and the associated dose the Benchmark dose. To account for sampling error in the data, the BMD is estimated as a confidence interval, the lower bound of which (the BMDL) is used as a POD (Point of Departure), also called RP (Reference Point), in human risk assessment.
While there appears to be consensus on how to apply the BMD approach in its main lines (EFSA 2009; WHO 2009; USEPA 2012), the issue of how to define the BMR is still under debate, in particular for endpoints that are observed and reported as continuous data. For continuous data, two distinct types of metrics are in use: those expressed relative to the variation (SD) in the controls (in this paper denoted as BMR_SD), and those that ignore the within-group variation and focus on the change in mean response (usually a percent change; this BMR metric is also called critical effect size and will be denoted as CES in this paper). The argument behind the use of the BMR_SD is that a given change in means does not seem to be comparable between endpoints that differ greatly in within-group variation ("natural variability"). The BMR_SD is a way to correct for that, where a larger change in means would be "acceptable" for endpoints with a relatively large "natural variability". The problem of this approach is that the within-group variation in the specific study depends on experimental error. First of all, it is subject to sampling error, so that a given fixed value of the BMR_SD in two perfectly replicated studies would relate to different changes in mean response. Further, studies examining the same dose-response may differ in within-group variation because the experimental conditions were not kept equally homogeneous, or because different analytical techniques resulted in different measurement errors for the same endpoint. Thus, "natural variability" is not an entirely appropriate term, as the observed within-group variation is study-specific and includes sources of variation that are irrelevant for the dose-response, and hence for the BMD. As a final difficulty of the BMR_SD, two identical dose-response relationships relating to populations that differ only in inter-individual variation will show different BMDs for the same value of the BMR_SD. This makes the extrapolation of animal BMDs to human BMDs problematic.
The BMR defined as a percent change in means (here denoted as CES, critical effect size; Slob and Pieters 1998), on the other hand, does not suffer from these disadvantages, simply because it ignores the within-group variation. The advantage of the CES is that it expresses the effect size in a way that is biologically interpretable. It can be interpreted as a change that may occur in an individual subject in an exposed versus unexposed situation, and the question of whether a change of that size might be adverse in a biological sense can indeed only be considered at that individual level. Note that the BMR_SD, in contrast, does not allow for such an interpretation. However, it is clear that a single value of the CES (such as 5%, the default recommended by EFSA 2009) does not reflect an equivalent effect size for all toxicological endpoints that may occur. For example, a 5% change in ALT (a liver enzyme in serum), or in mutation frequency, can hardly be considered equivalent to a 5% change in red blood cells. The CES lacks the option to adjust the change in means to any quantitative property of the endpoint, as the BMR_SD does (whether justified or not). Indeed, toxicologists sometimes propose larger percent changes to be used as the CES based on expert judgment (Woutersen et al. 2001). In general, however, they have great difficulty in quantifying a CES for each endpoint such that the values may be considered equivalent effect sizes among endpoints (Dekkers et al. 2001).
The fact that both BMR metrics have their own disadvantages hampers an objective decision on which of them is preferable. As a result, both are currently in use, which is unfortunate, as harmonized methodology is important in (international) risk assessment. So far, however, the two different metrics continue to coexist, which may result in conflicting outcomes from the same dose-response datasets.
This paper presents a generic theory that describes how the CES should be scaled based on quantitative properties related to the specific endpoint(s) considered, in particular the maximum response and the within-group variation. This theory is supported by an empirical relationship between maximum response and within-group variation for a wide range of biological parameters. It will be shown that this theory reconciles the two BMR metrics just discussed, such that the disadvantages of either of them are resolved.
The existence of two distinct ways of quantifying effect size, as just discussed for the BMR in toxicology, is a general phenomenon in experimental science, although usually not as explicitly debated as in the field of toxicology. For example, Cohen (1988), in his textbook on statistical power analysis, defines effect size (for continuous data) by taking the within-group variation into account, that is, as the fraction of nonoverlap between the control distribution and the treated distribution (with specific values for small, medium and large effects). In reporting experimental studies, some authors relate an observed change in means to the scatter in the data, while others report the change in observed means as such (not taking the scatter into account). When investigators report their observed effect sizes using distinct metrics, results are difficult to compare among studies, and a harmonized approach of measuring, reporting and interpreting effect sizes would be highly beneficial. The theory presented here directly applies to the question of how to quantify and interpret effect sizes in experimental science in general.

An empirical relationship between maximum response and within-group variation
The theory discussed below is supported by an empirical relationship between maximum response and within-group variation, which will be presented first. Both these parameters were estimated in a large number of toxicological dose-response datasets, with the responses always being measured as continuous data. As an example, Figure 1 shows a dose-response dataset together with a fitted dose-response curve. In this figure, M is the maximum fold change (according to the fitted model), and s the within-group standard deviation related to the log-transformed observations (the reason for the log-transformation is explained in the Annex). Throughout this paper, both M and s are used as just defined.
Not unexpectedly, the value of M appears to be a characteristic of the biological parameter considered (Slob and Setzer 2014) and not to depend on the applied treatment. Similarly, the value of s was found to be parameter-specific, although there may be substantial differences among individual studies due to estimation errors and to the fact that studies vary in the level of homogeneity of experimental conditions, measurement errors, and so on. As an illustration of the variation in s among studies, see Figure 2.
The fact that both the maximum response (M) and the within-group standard deviation (s) appear to be characteristics of the biological parameter raises the question of whether there is any relationship between these two characteristics. While the values of M and s estimated in Slob and Setzer (2014) were based on multiple dose-response datasets for each endpoint, resulting in relatively precise estimates of M and s, the number of endpoints was rather limited. To better establish a potential relationship between M and s, the number of endpoints was extended by adding a number of datasets related to other endpoints that were available to the author, while having the property of showing a clear dose-response based on relatively many dose groups. The latter is particularly important in trying to estimate M. None of the additional datasets were, in retrospect, discarded. The final number of endpoints included for estimating M and s was 27 (Supplementary Material). Figure 3 shows the estimated values of M plotted against the associated estimates of s. The two parameters are found to be positively correlated (see Table 1 for the numerical values and their confidence intervals). The relationship between M and s is adequately described by the function

M = exp(z·s), or equivalently, ln(M) = z·s,    (1)

where the proportionality constant z was estimated to be around 7, with 90% confidence interval (5.9, 8.9). The points show considerable scatter around the fitted curve, as may be expected from two facts. First, the estimation error for M is, in most cases, considerable (see the CIs in Table 1). Second, the estimate of s is for most points based on a single study (occasionally two; Table 1). As Figure 2 illustrated, there is considerable variation among studies regarding the estimated s, and an estimate based on multiple studies would better characterize the typical value of s for that endpoint, instead of depending on the coincidental value of s in a single study. By using an average value of s over studies, study differences, such as laboratories or strains, are averaged out. Similarly, an estimate of M based on multiple studies will result in a more reliable estimate of the true M for that endpoint. Indeed, as the confidence intervals in Table 1 show, using multiple studies results in much more precise estimates of M and s. An important observation is that the data points that were based on multiple studies are much closer to the fitted curve (see triangles in Figure 3), which supports the adequacy of the fitted curve.
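As a side note for readers who want to reproduce this kind of fit, the sketch below estimates z by least-squares regression through the origin of ln(M) on s, which is one straightforward way to fit expression (1). The (M, s) values are hypothetical stand-ins, not the actual estimates of Table 1, and the paper's exact fitting procedure may differ.

```python
import numpy as np

# Hypothetical (M, s) estimates for a handful of endpoints; the real analysis
# used the 27 endpoints of Table 1.
M = np.array([1.5, 2.0, 6.0, 34.0, 60.0])      # maximum fold changes
s = np.array([0.06, 0.09, 0.19, 0.15, 0.585])  # within-group SD of ln-observations

# Expression (1): ln(M) = z * s, a regression through the origin.
# The least-squares estimate of z is sum(s * ln(M)) / sum(s^2).
z_hat = np.sum(s * np.log(M)) / np.sum(s ** 2)
print(f"estimated z = {z_hat:.1f}")  # compare the paper's estimate of about 7
```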

The basic theory
A theory on effect size will now be discussed that predicts the empirical relationship between maximum response and within-group standard deviation, as shown in Figure 3. The basis of the theory is that different biological parameters differ in their way of "expressing" the impact of a treatment, just like people may differ in the way they express a given discomfort. When we know a particular person tends to be extrovert (or introvert), we will take that into account in estimating how serious the discomfort expressed by that person really is. Similarly, if we knew that some biological parameters are more "extrovert" than others, we would take that into account when appraising the underlying impact of the treatment. Here, the impact from a treatment on some parameter is not directly observable, just like the discomfort in the analogy just mentioned. However, the parameter expresses the underlying impact as a given effect size, which is observable, just like people express their discomfort to others, which is all the outsider can observe. Therefore, when comparing different biological parameters, we need to somehow adjust for the fact that different parameters may translate a given (invisible) impact into larger or smaller effect sizes depending on their expressiveness. To make this theory practically useful, we need to quantitatively specify how different biological parameters should be adjusted such that they better represent the underlying impact from the treatment. At first sight, it appears that any quantitative specification will remain a blind assumption that cannot be validated, since the impact itself is nonobservable. Nonetheless, it can be shown that a fairly simple quantitative scaling of effect sizes is supported by the empirical relationship discussed above (Figure 3). The underlying argument will now be briefly explained; a more detailed discussion is provided in the Annex.

Figure 3. Relationship between maximum response (M) and within-group standard deviation (s) of the (natural) log-observations estimated for 27 endpoints observed in in vivo toxicity studies. The fitted curve is M = exp(z·s), where z is estimated at around 7 with 90% confidence interval (5.9, 8.9). Both M and s (and the fitted curve) are plotted on log10 scales (note that the choice of axis scales has no impact on the relationship; the only purpose is to make the individual points better visible). The triangles relate to points that are based on multiple studies (≥5).
An important reference is the maximum effect size that can be reached in different parameters, as it always reflects the underlying maximum impact. Hence, maximum effect sizes observed in different parameters are associated with the same (namely, the maximum) impact in each of the parameters. A first idea for scaling effect sizes might be to divide a given effect size (in terms of a fold change; see Note 1) by the parameter-specific maximum fold change. However, this rule fails on first principles. For example, when we observe an ES of 2-fold in an endpoint with M = 10-fold, scaling to M leads to a fold change smaller than unity, which is impossible (note that a fold change is always larger than 1 in both increasing and decreasing dose-responses). Another simple rule is to scale the logarithm of a given effect size to the logarithm of M. In that case, the scaled effect size will lie in the range 0 to 1, for every M. Parenthetically, this scaling rule naturally follows from assumption 2 discussed in the Annex.
This rule can be tested against the empirical relationship of Figure 3. The underlying idea is that the within-group variability reflects the impact of the many experimental factors that experimenters try to neutralize to the extent possible, by making all experimental conditions as homogeneous as possible. As a result, all those unintended experimental factors will only have a relatively small impact on the biological parameter considered, and together they give rise to the within-group scatter in the observations (see Annex, assumption 1). Now, the (small) impact of each unintended experimental factor will be translated into some (small) effect size. The latter translation will depend on the expressiveness of the parameter. This explains why the within-group variation is larger in parameters with a larger maximum effect size: biological parameters that express maximum impact by a relatively large effect size also express small impacts (from the unintended experimental factors) by relatively large effect sizes, resulting in a relatively large within-group variation.
As shown in the Annex, scaling the logarithm of the effect size to the logarithm of M (both expressed as fold changes) predicts the fitted curve in the empirical relationship shown in Figure 3. Thus, the theory can be quantitatively specified, which makes it practically useful, with various important applications, as discussed below.

Comparing and interpreting effect sizes in experimental science
Comparing effect sizes observed in the same biological parameter, associated with different treatments or factors, is obvious: the treatment resulting in the larger effect size had more (biological) impact. However, when comparing different biological parameters, this is not so obvious. Suppose that a particular medicine shows two side effects: an increase of around 20% in ALT (a liver enzyme measured in serum, indicating liver damage), and a decrease of around 5% in hematocrit. Does the fact that 20% is more than 5% mean that the hematocrit side effect has less biological impact, so that it is better to focus on liver damage? Or, should we take into account that the concentration of ALT can be easily moved away from its background value after trivial events such as having a good meal, while hematocrit appears to be a stable parameter where smaller deviations will occur even after more serious events?
The theory on effect size (ES) provides the answer: before comparing effect sizes, they need to be normalized to the relative expressiveness of the parameters involved by

NES = ln(ES) / ln(M),    (2)

where M and ES are expressed as fold changes. The NES (normalized effect size) is a value between 0 and 1. In this way, the effect sizes observed in different biological parameters (as in the ALT vs. hematocrit example) are corrected for their relative "expressiveness", and therefore, the NES forms a better basis for comparison. In addition, this normalized ES scale forms a good basis for defining small, medium or large effects, as follows. The maximum response (NES = 1) might be qualified as a huge effect, halfway that maximum might be termed a large effect (NES = 1/2), the qualifier medium effect may be used for halfway a large effect (NES = 1/4), and a small effect halfway a medium effect (NES = 1/8). Here, halfway relates to the log-scale of the fold changes (see Figure 4, left panel), so that on the original scale a large effect would be the square root of the maximum effect, a medium effect the square root of a large effect, and a small effect the square root of a medium effect (Figure 4, middle panel). A numerical example for a parameter with maximum effect equal to 4 (fold change) is given in the right panel of Figure 4. Clearly, associating qualifiers like small, medium or large (in the absolute sense) with specific quantitative values always contains some arbitrary component, but they are clearly defined, and their use may be helpful in practice.
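To make the normalization concrete, the following minimal sketch applies expression (2) to the ALT vs. hematocrit example above; the maximum fold changes assumed for the two parameters are hypothetical values chosen purely for illustration, not estimates taken from Table 1.

```python
import math

def nes(effect_fold_change: float, max_fold_change: float) -> float:
    """Normalized effect size, expression (2): NES = ln(ES) / ln(M).

    Both arguments are fold changes (> 1), so the result lies between 0 and 1.
    """
    return math.log(effect_fold_change) / math.log(max_fold_change)

# ALT: a 20% increase is a 1.20-fold change; assume (hypothetically) M = 10.
# Hematocrit: a 5% decrease is a 1/0.95 ≈ 1.053-fold change; assume M = 1.3.
print(f"NES ALT        = {nes(1.20, 10.0):.2f}")     # ≈ 0.08
print(f"NES hematocrit = {nes(1 / 0.95, 1.3):.2f}")  # ≈ 0.20
# Despite the smaller percent change, the hematocrit effect corresponds to the
# larger normalized effect size under these assumed values of M.
```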
In practice, the maximum response (M) of a biological parameter is hard to estimate (see the large confidence intervals in Table 1), and often its value may not, or only poorly, be known. Then, it might be considered to use the within-group variability as a surrogate, by combining expressions (1) and (2):

NES = log(ES) / (z·s),    (3)

where z is the estimated constant in the empirical relationship of Figure 3, and s is the (within-group) standard deviation of the log-transformed observations. Obviously, in expression (3) the log should relate to the same base as used in calculating s (i.e. both log10 or both ln). The term s in expression (3) reflects the within-group variation (see Note 2) that is typical for that endpoint, which might be called the "natural variability" of that parameter. It should be kept in mind, however, that this parameter-specific "natural variability" can only be defined in the relative sense, that is, when considering different parameters observed in the same study, while differences in measurement errors among the parameters can be ignored. As soon as parameters are measured in different studies, the differences in s among those parameters will be blurred by study-specific properties (such as the relative heterogeneity of the experimental units or of the experimental conditions). Therefore, when comparing effect sizes among studies, the value of s in expression (3) should be based on a larger number of studies, as that would better characterize the "typical" value of s.

Table 2 lists equivalent effect sizes (small, medium and large) in relation to M for a range of M values, each related to some biological parameter. The table also contains a column with associated values of s, predicted to relate to the value of M based on the empirical relationship of Figure 3 (assuming that the constant z equals 7.0). In this way, the table can be used to check whether a given observed combination of M and s matches one of the rows in the table, which will often be useful, as the value of M is only roughly known for most parameters (at least at this point in time; see Discussion). If so, the required effect sizes may be read off from that row. Obviously, the table may also be used if M is not known at all while s is, with the appropriate caution, in particular when s is based on just one or only a few studies. Finally, it should be noted that the relationship between M and s in Table 2 is based on the empirical relationship found in in vivo toxicological studies (Figure 3), and other types of studies (e.g. in vitro) might show a somewhat different value of the constant z between ln(M) and s.
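The following sketch turns these definitions into arithmetic: by expression (2) with NES = 1/2, 1/4 and 1/8, the large, medium and small effects are M^(1/2), M^(1/4) and M^(1/8), respectively, and the s-based surrogate of expression (3) replaces ln(M) by z·s with z = 7.0. The helper function names are mine, not from the paper or PROAST.

```python
import math

Z = 7.0  # estimated constant in ln(M) = z*s (90% CI roughly 5.9 to 8.9)

NES_LEVELS = {"large": 0.5, "medium": 0.25, "small": 0.125}

def effect_sizes_from_M(M: float) -> dict:
    """Equivalent effect sizes as fold changes, from the maximum fold change M."""
    return {name: M ** nes for name, nes in NES_LEVELS.items()}

def effect_sizes_from_s(s: float, z: float = Z) -> dict:
    """Surrogate using the within-group SD on log-scale: ln(M) replaced by z*s."""
    return {name: math.exp(nes * z * s) for name, nes in NES_LEVELS.items()}

# The numerical example of Figure 4 (right panel): maximum effect M = 4.
print(effect_sizes_from_M(4.0))
# {'large': 2.0, 'medium': 1.414..., 'small': 1.189...}  i.e. 100%, 41%, 19%
```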

Deriving endpoint-specific values for CES in toxicology
The preceding discussion of how to compare and interpret effect sizes in experimental science directly applies to the question of how to deal with the BMR in toxicology and risk assessment. When the maximum fold change M is known for each of the endpoints considered, the ideal approach would be to choose an endpoint-specific CES that is scaled to M, using expression (2). When there is no information on M for the endpoints considered, one may use the surrogate approach and scale to the value of s, preferably estimated as an average over a larger range of different studies (of the same type, e.g. in vivo). Or, one may combine the available information on M and s and try to find a row in Table 2 that more or less matches both values of M and s. The examples below illustrate these options, where it is assumed that a "small" effect size (as defined in Figure 4) would be a suitable choice for the benchmark response (CES) in risk assessment. It is noted in advance that the outcomes of the CES may vary somewhat depending on whether M or s is used as the basis for scaling, due to the fact that in most cases M and s are only approximately known (see Discussion for how this problem could be solved).

Illustrative examples
As discussed in the Supplementary Material, the best way of estimating M and s for a given biological parameter is by combining dose-response data from different studies examining that same parameter. By including the factor "study" as a covariate in the dose-response analysis, potential differences in response level in the controls, as well as differences in sensitivity among studies, can be taken into account (for a comprehensive discussion, see Slob and Setzer 2014). In this way, the complete dataset informs the parameters M and s, leading to better precision, while the estimate of s will be closer to that of the typical study, as desired when applying the ES theory. The PROAST software (www.proast.nl) is directly suitable for that purpose, and fitting the four-parameter exponential model directly results in estimates of M and s.
As an illustration, Figure 5 shows the result of a dose-response analysis in PROAST for one of the biological parameters considered in this study: 8-OHdG DNA adducts in the liver. The legend on the right-hand side of Figure 5 shows the estimate of var, which is the within-group variance of the ln-responses. The estimate of s is the square root of that: sqrt(0.023) = 0.15. Note that var (and hence s) is estimated from the complete dataset, as the assumption is that the within-group variance is homogeneous (on log-scale; also see the Supplementary Material for validation). Further, the estimate of M can be read off from the right-hand side legend as well, as M is equivalent to c in PROAST notation (or to 1/c when c < 1, i.e. in decreasing curves). So, M is estimated to be around 34 in this case. The confidence intervals for both parameters can be established in PROAST (menu version) by selecting option 12 from the main menu.
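For readers without access to PROAST, a rough equivalent of such a fit can be sketched in Python. The model below is the four-parameter exponential model y = a[c − (c − 1)·exp(−b·x^d)] described by Slob and Setzer (2014), fitted on the log-response scale so that the residual variance plays the role of var = s². The data are simulated here, not the 8-OHdG data of Figure 5, and PROAST's actual estimation procedure may differ in detail.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_exp_model(x, a, b, c, d):
    """ln of the four-parameter exponential model y = a*(c - (c-1)*exp(-b*x**d)).

    For an increasing curve c > 1, and c corresponds to the maximum fold change M.
    """
    return np.log(a * (c - (c - 1) * np.exp(-b * x ** d)))

# Simulated example: 5 dose groups of 5 animals, lognormal within-group scatter.
rng = np.random.default_rng(1)
dose = np.repeat([0.0, 1.0, 3.0, 10.0, 30.0], 5)
true_ln = log_exp_model(dose, 2.0, 0.2, 4.0, 1.0)
y_ln = true_ln + rng.normal(0.0, 0.1, size=dose.size)  # ln-responses

popt, _ = curve_fit(log_exp_model, dose, y_ln, p0=[2.0, 0.1, 3.0, 1.0],
                    bounds=([0.01, 1e-6, 1.0, 0.25], [100.0, 10.0, 1000.0, 4.0]))
a, b, c, d = popt
resid = y_ln - log_exp_model(dose, *popt)
var = np.sum(resid ** 2) / (dose.size - len(popt))  # within-group variance (ln-scale)
print(f"M (= c) ≈ {c:.1f}, s = sqrt(var) ≈ {np.sqrt(var):.3f}")
```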
Once the values of M and s of the endpoints involved are established (with their confidence intervals), the value of CES may be assessed as illustrated in the following examples.
Consider the endpoints red blood cell counts and AChE (acetyl-cholinesterase) activity. Both endpoints are included in Table 1, and we might assume that the values of M are around 1.5 and 6, respectively. Then, from Table 2, we can read off that the associated "small" effect sizes are 1.05 and 1.25, respectively. Thus, the values for CES considered to be "small" effects are 5% and 25%, for red blood cell counts and AChE, respectively.
Next, suppose that for the same two endpoints we do not have any information on M, but we do have an estimate of s, in particular as given in Table 1. For red blood cell counts s = 0.063, and Table 2 tells us that the associated "small" effect size would be close to 1.06, so the associated CES of 6% is similar to the one that was based on M. For AChE activity, Table 1 reports s = 0.19, and the associated "small" effect size according to Table 2 would be around 1.18 (CES = 18%). This value is somewhat lower than the one based on M in the previous illustrative example.
Finally, consider RBC mutants as the endpoint, for which approximate information is available on both M and s, as given in Table 1. Considering the confidence intervals in this table, M may be, roughly speaking, somewhere between 50 and 150, while s may be in the range 0.5 to 0.6. In Table 2, the row with M = 60 and s = 0.585 comes closest to both these ranges, and the associated small effect size would be around 67%.

(Notes to Table 2: effect size and M are expressed as fold changes, which translate into percent changes by subtracting one and multiplying by 100; s is the within-group SD related to the natural logarithms of the observations; var is used in PROAST notation, see Figure 5.)
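These three examples can be verified directly from expressions (2) and (3), without consulting Table 2; the short check below (function names are mine) reproduces the values quoted above.

```python
import math

def small_from_M(M: float) -> float:
    return M ** 0.125             # expression (2) with NES = 1/8

def small_from_s(s: float, z: float = 7.0) -> float:
    return math.exp(z * s / 8.0)  # expression (3) with NES = 1/8

print(small_from_M(1.5), small_from_s(0.063))   # 1.052, 1.057 -> CES of 5% and 6%
print(small_from_M(6.0), small_from_s(0.19))    # 1.251, 1.181 -> CES of 25% and 18%
print(small_from_M(60.0), small_from_s(0.585))  # 1.668, 1.668 -> CES of about 67%
```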

BMR_SD versus CES: a reconciliation
Another important consequence of the ES theory presented here is that it reconciles the two distinct BMR metrics: the BMR_SD (taking "natural variability" into account) and the CES (ignoring "natural variability"). The theory showed that scaling the CES to the maximum response of the parameter according to expression (2) implies, at the same time, scaling to s, the "natural variability" of that parameter. Roughly speaking, both metrics now coincide. It is emphasized, however, that the theory only provides support to the underlying rationale of the BMR_SD, i.e. that a change in mean responses should be scaled to the parameter's "natural variability". There are two important caveats, however. First, the BMR_SD is supported by the ES theory (and the empirical relationship) only when the dose-response analysis is performed on the log-response scale (i.e. after logarithmic transformation). Fortunately, as the Annex and the Supplementary Material show, biological data can be expected to be lognormally distributed, for both theoretical and empirical reasons. A concomitant advantage of an analysis on log-scale is that the within-group variances will tend to be homogeneous, so that s can be estimated from the complete dose-response dataset rather than just the controls. Importantly, using the BMR_SD based on an estimate of the SD in the controls on the original response scale is not supported by the ES theory. Further, the general observation that the SD tends to be proportional to the associated mean (on the original scale) in biological parameters, which is firmly confirmed by the Supplementary Material, implies that within-group variation should be measured in terms of a coefficient of variation, or equivalently, by the SD on log-scale (i.e. s). This property of biological parameters invalidates the use of the BMR_SD on the original response scale.

Parenthetically, a dose-response analysis applied to the responses on the original scale is not recommended for the simple reason that in decreasing dose-responses the tail of the assumed distribution around the fitted curve (or even the mean response) may intersect the x-axis at higher doses, that is, the fitted model may predict negative responses. These problems will not occur when the log-transformation is applied (both to the observations and to the predicted responses) when fitting the model. It should be noted that the log-transformation does not change the information in the data; all it does is change the original ratio scale (which is reality) into an interval scale (which is the scale used in the statistical analysis). The PROAST software (www.proast.nl) fits the models for continuous data on the log-response scale as the default, while BMDS unfortunately uses the normal distribution as the default, with an option to switch to lognormal, but only for the exponential models. Therefore, BMDS users who want to make use of the ES theory presented here should not forget to change the BMDS default setting to lognormal.
In conclusion, the theory presented here makes clear that the rationale behind the BMR_SD is adequate, but only when the response data are log-transformed, and only when the value of the SD (on log-scale, i.e. s) represents the typical value for that endpoint in the long run (i.e. averaged over different studies). The preferred way of dealing with the benchmark response, however, is by using a value of the CES that is adjusted to the "expressiveness" of the particular endpoint by using information on M and s, where s is the typical value over a range of different studies. In this way, the benchmark response covers the rationale behind the BMR_SD, but with the crucial advantage that the scaled CES is expressed as a percent (or fold) change, which is biologically/toxicologically interpretable, while making the associated BMD suitable for extrapolation to an equipotent dose in the (median) human being (WHO 2014).
It may be interesting to express the proposed definition of a "small" effect, that is, (1/8)·ln(M), in terms of s. Using the estimate of z = 7.0 from the empirical relationship in Figure 3, this "small" effect size would be equivalent to about 0.9 standard deviations (z·s/8 = 7.0·s/8 = 0.88·s). In particular, when taking the 90% confidence interval (5.9, 8.9) for z into account, the proposed definition of a small effect size is close to a change of one s, that is, one SD on the log-response scale. Further, it may be noted that the present definition of a "small" effect size results in an approximate 5% change in the mean response for endpoints with a relatively small s, like hemoglobin or red blood cell counts, where s is around 0.05 or 0.06, as shown in Table 1. Table 2 shows that, for that value of s, a small effect is around 5%.

Discussion
The estimates of M used in establishing the empirical relationship were highly imprecise for a large fraction of the endpoints, in particular when based on a single study (see the confidence intervals for M in Table 1). Further, while the confidence intervals for s were relatively small, most of them were based on a single study as well, which means that they may not closely represent the typical value of s for that endpoint. Despite these imprecisions in the data, the correlation between M and s is beyond doubt. Further, it is reassuring that the points in Figure 3 that related to multiple studies, and hence resulted in much more reliable estimates of M and s, were much closer to the relationship predicted by the ES theory. This calls for a systematic effort to more precisely estimate both M and s for all toxicological endpoints (and biological parameters in general) that are frequently measured, comparable to the efforts to precisely estimate physical constants in the physical sciences. With more precise knowledge of the values of M and s, the interpretation and scaling of effect sizes may greatly improve. Thus, a totally new aspect of the value of experimental data arises: apart from being valuable in the context of answering a specific research question, they can also be used for answering more general questions, such as estimating generic characteristics of biological parameters, in this case M and s. Indeed, it would be highly valuable to systematically collect data for each biological parameter that is regularly studied in the life sciences. For the purpose of estimating M, specific studies with a large range of doses, including doses as high as practically and ethically possible, may be helpful, but combining many studies examining the same endpoint will also reduce the uncertainty in the estimate of M, as indicated by the confidence intervals in Table 1. For estimating the relative magnitude of s in different endpoints, there is no need to generate new data, as the purpose is to evaluate the within-group variation as realized in currently performed studies (of commonly accepted quality).
When better and more estimates of M and s become available, they can be used to re-evaluate the ES theory, by checking the prediction that the points related to multiple (or more informative) studies will be closer to the fitted curve (like the triangles in Figure 3).
The main conclusion from the ES theory is that in principle the best way to make effect sizes in different biological parameters comparable is by scaling to log(M), as M is a constant for a given biological parameter, independent of treatment or study conditions. This rule may be used for deriving endpoint-specific values for CES to be used in toxicology and risk assessment, or for interpreting effect sizes in experimental science in general. Unfortunately, estimating the value of M from experimental data is typically difficult. As long as no good information on M is available, effect sizes may be scaled to the typical value of s, as an approximate approach.
When the scaling of the CES is based on s only, caution is needed. For example, the relatively large value of s in thymus weights might partly be related to a larger measurement error in this small organ as compared to larger organs. Obviously, measurement error does not relate to the "expressiveness" of the endpoint, and should not have an impact on scaling effect sizes. If, for lack of data, one needs to resort to an estimate of s based on a single study, extra caution is needed: the resulting value of s might deviate from the typical value that would have been found from a large number of studies, even if the confidence interval for s were small.
The theory presented here further illuminates the concept of "natural variability". According to the theory, the relative expressiveness of an endpoint is reflected by the relative value of M, but also by the typical value of s, the within-group variation (on log-scale). Because of the relationship between s and the expressiveness of the biological parameter, it makes sense to talk about "natural variability", but only in a relative sense, and only for studies that are performed in a similar way, with similar sources of variation (e.g. in vivo animal studies). Further, s should relate to an average over studies, so that it does not depend on coincidental circumstances in a single study. Finally, the measurement errors of the endpoints considered should be relatively small. If these conditions are met, the relative magnitude of s reflects the expressiveness of the endpoint, and can indeed be called the (relative) natural variability of that endpoint.
It might be that the same biological parameters measured in in vitro studies will generally show smaller within-group variation than in in vivo studies, due to fewer factors in the study that (unintentionally) act as sources of variation. Therefore, applying Table 2 for deriving, say, a "small" effect size based on the typical value of s found in in vitro studies might not be valid: column s is based on the constant z estimated for in vivo studies (Figure 3). A similar empirical relationship would first need to be established based on in vitro studies, which might result in another estimate of the constant z.
The ES theory supports the rationale underlying the BMR_SD, but it must unfortunately be concluded that this BMR metric is invalid when applied in the usual way, that is, omitting the logarithmic transformation of the responses. Theoretically, it is valid when the log-transformation is applied, but even then the BMR_SD is not favorable, for two reasons. The first reason is that it uses the study-specific value of s, which is a very unstable parameter, as illustrated in Figure 2. This makes the BMR_SD incomparable among studies, even when those studies are highly comparable and examine exactly the same dose-response. The second reason is that the BMD associated with a given BMR_SD depends on the inter-individual variation in the population considered. This is a major problem in risk assessment, where one of the required steps involves an extrapolation of the BMD in the test animal to an equipotent dose in the (median) human individual (WHO 2014). Therefore, a BMD derived for a given BMR_SD in animals will not be the same as that in the (median) human being, even if the human is equally sensitive to the chemical as the test animal.
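The second point can be made concrete with a small numeric sketch: two populations with identical dose-responses in the (geometric) mean but different within-group SDs yield different BMDs for the same BMR_SD of 1 SD, whereas the BMD for a fixed CES is unaffected. The dose-response curve and parameter values below are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import brentq

# Invented increasing dose-response for the ln-mean, identical in both populations:
# ln f(x) = ln(a) + ln(M) * (1 - exp(-b*x))
ln_M, b = np.log(4.0), 0.1
delta_ln_mean = lambda x: ln_M * (1.0 - np.exp(-b * x))  # change from background

def bmd(delta: float) -> float:
    """Dose at which the ln-mean has risen by `delta` above background."""
    return brentq(lambda x: delta_ln_mean(x) - delta, 1e-9, 1e4)

# BMR of 1 SD (on log-scale): delta = s, so the BMD depends on the variability.
for s in (0.05, 0.15):  # two populations differing only in within-group SD
    print(f"s = {s}: BMD for 1 SD = {bmd(s):.2f}")

# CES of 5%: delta = ln(1.05), identical in both populations -> identical BMD.
print(f"CES of 5%: BMD = {bmd(np.log(1.05)):.2f}")
```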
While the ES theory may be applied to the experimental life sciences in general, its most important consequence for risk assessment is that it may end the current situation where different assessors may derive different BMDLs due to the fact that they used conceptually different BMRs. By using the proposed approach, the size of the effect is expressed as a percent change, which can be interpreted biologically, as it relates to the expected change at the level of an individual. At the same time, the "natural variability" of the specific endpoint is taken into account, defined as the (relative) value of the within-group variation over a larger range of (in vivo) studies.

Notes
1. A fold change is equivalent to a percent change, e.g. a 2-fold increase is a 100% increase. However, a 2-fold decrease is a 50% decrease. While expressing changes in terms of percent change is very common, it is better to think (and report) in terms of fold change. For example, a 5-fold increase and a 5-fold decrease reflect the same change, and thus both increasing and decreasing dose-responses could be included in creating Figure 3. Also see the discussion of plotting SDs against means in increasing and decreasing dose-responses in the Supplementary Material.
2. Here, it is assumed that s is constant over dose groups, or, equivalently, that on the original scale the SD is proportional to the mean (the CV is constant). The latter is in line with the basic assumption that effects act multiplicatively, and is confirmed by the data shown in the Supplementary Material.
The writing of this paper was not supported by external funding. The author has sole responsibility for the writing and content of the paper.

Supplemental material
Supplemental material for this article is available online.
concentrations in a bulk sample (which effectively means that the underlying concentrations in the sub-samples were added, thereby distorting the original lognormal distribution). In the former example, the statistical analysis should take the clustering into account rather than assume another distribution. In the latter example, the lognormal distribution indeed fails, however, without rejecting the multiplicativity assumption.
Quantitative specification of the basic theory: how do effect sizes scale?