PRACTICAL VARIANCE REDUCTION VIA REGRESSION FOR SIMULATING DIFFUSIONS

The well-known variance reduction methods, the method of importance sampling and the method of control variates, can be exploited if an approximation of the required solution is known. Here we employ conditional probabilistic representations of solutions together with the regression method to obtain sufficiently inexpensive (although rather rough) estimates of the solution and its derivatives, using a single auxiliary set of approximate trajectories starting from the initial position. These estimates can be used effectively for a significant reduction of variance and for further accurate evaluation of the required solution. The developed approach is supported by numerical experiments.

the MC technique requires different auxiliary sets of approximate trajectories because of the different starting points (t_k, x_k). This is too expensive: as a rule, such a procedure is more expensive than a simple increase of the number of trajectories starting from the initial position (t_0, x_0), at which we aim to find the value of the solution u.
So, a suitable method of constructing u(t_k, x_k) and ∂u/∂x_i(t_k, x_k) should be comparatively inexpensive. Therefore we cannot require considerable accuracy of the estimates for u(t_k, x_k) and ∂u/∂x_i(t_k, x_k): there is a trade-off between accuracy and computational expense. Our proposition is to exploit conditional probabilistic representations. Their employment together with the regression method allows us to evaluate u(t_k, x) and ∂u/∂x_i(t_k, x) using a single auxiliary set of approximate trajectories starting from the initial position (t_0, x_0) only. This plays a crucial role in obtaining sufficiently inexpensive (but at the same time useful for variance reduction) estimates û(t_k, x) and ∂û/∂x_i(t_k, x). The construction of û and ∂û/∂x_i is accompanied by a number of errors of a different nature. Although it is impossible to evaluate these errors satisfactorily, the suitability of û(t_k, x) and ∂û/∂x_i(t_k, x) for variance reduction can be verified directly during computations, since the MC error can always be estimated. We emphasize that the obtained (even rather rough) estimates can be used effectively for accurately evaluating the function u not only at the position (t_0, x_0) but at many other positions as well.
This paper is most closely connected with [6,12,13,14] (see also [16]) and with the works [21,20] by N. Newton. The method of importance sampling from [6,12] is exploited in [25] for some specific physical applications. Various other aspects of variance reduction related to simulating diffusions are considered, e.g., in [2,4,9,10,24] (see also the references therein). An extended list of works devoted to variance reduction of MC simulations can be found in [7].
In section 2 we recall some known facts concerning the MC technique for linear parabolic equations and the general scheme of the regression method for estimating conditional expectations. Section 3 is devoted to conditional probabilistic representations of solutions of parabolic equations and of their derivatives. These representations together with the regression approach play a decisive role in the economical estimation of u and ∂u/∂x_i at all points (t, x), given only the set of trajectories starting from the initial point (t_0, x_0). In section 3.2 we obtain the estimate û(s, x) and propose to estimate the derivatives ∂u/∂x_i(s, x) by ∂û/∂x_i(s, x). This estimation of the derivatives is computationally inexpensive, but it is rather rough. Section 3.3 is devoted to a more accurate way of estimating the derivatives, which applies a linear regression method directly to find ∂u/∂x_i(t_k, x). In section 3.4, we obtain ∂u/∂x_i(t_k, x) in the case of nonsmooth initial data by exploiting probabilistic representations for ∂u/∂x_i(s, x) which rest on the Malliavin integration by parts. To this aim, we derive a conditional version of the Malliavin integration-by-parts formula adapted to our context. It should be noted that if the dimension d is large, the procedures of sections 3.3 and 3.4 are computationally very demanding since they require integration of the d²-dimensional system of first-order variational equations whose solution is present in the probabilistic representations for ∂u/∂x_i(s, x). Therefore, in practice, the inexpensive procedure of section 3.2 is preferable if d is large. In section 4 we give a simple, analytically tractable example to illustrate the benefits of the proposed variance reduction procedure, and we also test it on a one-dimensional array of stochastic oscillators and on the Black-Scholes pricing model for a binary asset-or-nothing call option. Section 5 gives a summary of the proposed approach to variance reduction.

Preliminaries.
In this section we recall some known facts concerning probabilistic representations of the solutions of parabolic partial differential equations and the regression method of estimating conditional expectations in the form suitable for our purposes.

Probabilistic representations.
Let us consider the Cauchy problem for the linear parabolic equation

  ∂u/∂t + (1/2) Σ_{i,j=1}^d a_{ij}(t, x) ∂²u/∂x_i∂x_j + Σ_{i=1}^d b_i(t, x) ∂u/∂x_i + c(t, x) u + g(t, x) = 0,  t_0 ≤ t < T, x ∈ R^d,  (2.1)

with the initial condition prescribed at t = T:

  u(T, x) = f(x).  (2.2)

The matrix a(t, x) = {a_{ij}(t, x)} in (2.1) is symmetric and at least positive semidefinite. Let σ(t, x) be a matrix obtained from the equation

  σ(t, x) σ^⊤(t, x) = a(t, x).  (2.3)

Let (Ω, F, F_t, P), t_0 ≤ t ≤ T, be a filtered probability space. The solution to the problem (2.1)-(2.2) has the following probabilistic representation (the well-known Feynman-Kac formula):

  u(s, x) = E[ f(X_{s,x}(T)) Y_{s,x,1}(T) + Z_{s,x,1,0}(T) ],  (2.5)

where X_{s,x}(t), Y_{s,x,y}(t), Z_{s,x,y,z}(t), t ≥ s, is the solution of the Cauchy problem for the system of SDEs

  dX = b(t, X) dt + σ(t, X) dw(t),  X(s) = x,
  dY = c(t, X) Y dt,  Y(s) = y,  (2.4)
  dZ = g(t, X) Y dt,  Z(s) = z.

Here w(t) = (w_1(t), ..., w_d(t))^⊤ is a d-dimensional {F_t}_{t≥t_0}-adapted standard Wiener process, and Y and Z are scalars. If y = 1, z = 0, we shall use the notation Y_{s,x}(t) := Y_{s,x,1}(t), Z_{s,x}(t) := Z_{s,x,1,0}(t) (analogous notation will be used later for some other variables). So,

  u(s, x) = E[ f(X_{s,x}(T)) Y_{s,x}(T) + Z_{s,x}(T) ].

There are various sets of sufficient conditions ensuring the connection between the solutions of the Cauchy problem (2.1)-(2.2) and their probabilistic representations (2.5)-(2.4). For definiteness, we shall keep the following assumptions.
We assume that the coefficients b, σ, c, and g have bounded derivatives up to some order and that, additionally, c and g are bounded on [t_0, T] × R^d. Further, we assume that the matrix a(t, x) is positive definite and, moreover, that the uniform ellipticity condition holds: there exists σ_0 > 0 such that

  y^⊤ a(t, x) y ≥ σ_0² |y|²  for all (t, x) ∈ [t_0, T] × R^d and y ∈ R^d.

As for the function f(x), it is assumed to grow at infinity not faster than a polynomial. It can be either smooth or nonsmooth. We note that the results of this paper can be used under other sets of conditions. For instance, one can consider situations with nonglobally Lipschitz coefficients [18] or with a matrix a(t, x) which is only positive semidefinite. For example, in section 4.2 we consider a numerical example with nonglobally Lipschitz coefficients and a positive semidefinite matrix a(t, x), and the example from section 4.3 has a discontinuous f(x).
The value u(s, x) from (2.5) can be evaluated using the weak-sense numerical integration of the system (2.4) together with the MC technique. More specifically, we have

  u(s, x) ≈ E[ f(X̄_{s,x}(T)) Ȳ_{s,x}(T) + Z̄_{s,x}(T) ] ≈ (1/M) Σ_{m=1}^M [ f(mX̄_{s,x}(T)) mȲ_{s,x}(T) + mZ̄_{s,x}(T) ],  (2.6)

where the first approximate equality involves an error due to replacing X, Y, Z by X̄, Ȳ, Z̄ (the error is related to the approximate integration of (2.4)), and the error in the second approximate equality comes from the MC technique; mX̄_{s,x}(T), mȲ_{s,x}(T), mZ̄_{s,x}(T), m = 1, ..., M, are independent realizations. While the weak-sense integration of SDEs is sufficiently well developed and a lot of different effective weak-sense numerical methods have been constructed (see, e.g., [16]), the methods of reducing the second error in (2.6) are more intricate. Below we write

  Γ_{s,x} := f(X_{s,x}(T)) Y_{s,x}(T) + Z_{s,x}(T).  (2.7)

The error of the MC method is evaluated by

  ρ = c ( var Γ_{s,x} / M )^{1/2},  (2.8)

where, e.g., the values c = 1, 2, 3 correspond to the fiducial probabilities 0.68, 0.95, 0.997, respectively. Introduce

  Γ̄_{s,x} = f(X̄_{s,x}(T)) Ȳ_{s,x}(T) + Z̄_{s,x}(T).

Since var Γ̄_{s,x} is close to var Γ_{s,x}, we can assume that the error of the MC method is estimated by

  ρ̄ = c ( var Γ̄_{s,x} / M )^{1/2}.  (2.9)
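As a concrete (hypothetical) illustration of (2.6)-(2.9), the following sketch applies a weak Euler scheme to (2.4) for user-supplied scalar coefficients b, σ, c, g and payoff f, and returns the MC estimate together with the error estimate ρ̄. All function names and parameters here are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def mc_estimate(b, sigma, c, g, f, s, x, T, N, M, conf_c=3.0, rng=None):
    """Weak Euler scheme for the system (2.4) plus the MC estimate (2.6)
    and the MC error estimate (2.9) with fiducial constant conf_c."""
    rng = np.random.default_rng(rng)
    h = (T - s) / N
    X = np.full(M, x, dtype=float)
    Y = np.ones(M)
    Z = np.zeros(M)
    t = s
    for _ in range(N):
        dw = np.sqrt(h) * rng.standard_normal(M)  # Gaussian (or weak-sense) increments
        Z += g(t, X) * Y * h                      # dZ = g Y dt
        Y += c(t, X) * Y * h                      # dY = c Y dt
        X += b(t, X) * h + sigma(t, X) * dw       # dX = b dt + sigma dw
        t += h
    gamma = f(X) * Y + Z                          # realizations of the variate (2.7)
    u_hat = gamma.mean()
    rho = conf_c * np.sqrt(gamma.var() / M)       # MC error estimate (2.9)
    return u_hat, rho
```

For the trivial data b = c = g = 0, σ = 1, f(x) = x², this reproduces u(s, x) = x² + (T − s) up to the reported MC error.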

Variance reduction.
If var Γ_{s,x} is large, then to achieve satisfactory accuracy we have to simulate a very large number of independent trajectories. Clearly, variance reduction is of crucial importance for the effectiveness of any MC procedure. To reduce the MC error, one usually exploits some other probabilistic representations of solutions to the considered problems. To obtain various probabilistic representations of the solution to the problem (2.1)-(2.2), we introduce the system (see [13,14,16])

  dX = ( b(t, X) − σ(t, X) μ(t, X) ) dt + σ(t, X) dw(t),  X(s) = x,
  dY = c(t, X) Y dt + μ^⊤(t, X) Y dw(t),  Y(s) = y,  (2.10)
  dZ = g(t, X) Y dt + F^⊤(t, X) Y dw(t),  Z(s) = z,

where μ and F are column-vector functions of dimension d satisfying some regularity conditions (e.g., they have bounded derivatives with respect to x_i up to some order). We should note that X, Y, Z in (2.10) differ from X, Y, Z in (2.4); however, this does not lead to any ambiguity. The formula (2.5), i.e.,

  u(s, x) = E[ f(X_{s,x}(T)) Y_{s,x}(T) + Z_{s,x}(T) ] = E Γ_{s,x},  (2.11)

remains valid for the new X, Y, Z. While the mean EΓ does not depend on the choice of μ and F, the variance var Γ = EΓ² − (EΓ)² does. Thus, μ and F can be used to decrease the variance var Γ and, consequently, the MC error can be reduced.
The following theorem is proved in [14] (see also [13,16]).

Theorem 2.1. The variance of the variate Γ_{s,x} = f(X_{s,x}(T)) Y_{s,x}(T) + Z_{s,x}(T) defined by the system (2.10) is equal to

  var Γ_{s,x} = E ∫_s^T Y²_{s,x}(t) Σ_{i=1}^d [ Σ_{j=1}^d σ_{ji} ∂u/∂x_j − μ_i u + F_i ]² dt,  (2.12)

provided that the expectation in (2.12) exists. In (2.12) all the functions σ_{ij}, μ_j, F_j, u, ∂u/∂x_i have (t, X_{s,x}(t)) as their argument.
In particular, if μ and F are such that

  Σ_{j=1}^d σ_{ji} ∂u/∂x_j − μ_i u + F_i = 0,  i = 1, ..., d,  (2.13)

then var Γ_{s,x} = 0. We recall that if we put F = 0 here, then we obtain the method of importance sampling (first considered in [6,12,24]), and if we put μ = 0, then we obtain the method of control variates (first considered in [21]). Theorem 2.1 establishes the combining method of variance reduction proved in [13]; see also [16].
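For intuition, here is a small self-contained sketch (our own toy setup, with b = 0, c = g = 0, f(x) = x², and constant scalar σ; all names are assumptions) of the system (2.10): the mean of Γ is unchanged by the choice of F, while the control-variate choice F(x) = −2σx, which satisfies the zero-variance condition for this problem, collapses the variance.

```python
import numpy as np

def simulate_gamma(mu, F, M, N=20, T=1.0, sigma=1.0, seed=0):
    """Euler scheme for (2.10) with b = 0, c = g = 0, f(x) = x**2:
    dX = -sigma*mu dt + sigma dw, dY = mu*Y dw, dZ = F*Y dw.
    Returns samples of the variate Gamma = f(X(T))*Y(T) + Z(T)."""
    rng = np.random.default_rng(seed)
    h = T / N
    X, Y, Z = np.zeros(M), np.ones(M), np.zeros(M)
    for _ in range(N):
        dw = np.sqrt(h) * rng.standard_normal(M)
        m, f_ = mu(X), F(X)          # evaluate before updating the state
        Z = Z + f_ * Y * dw          # control-variate term
        Y = Y + m * Y * dw           # importance-sampling weight
        X = X - sigma * m * h + sigma * dw
    return X**2 * Y + Z

# Same mean u(0, 0) = T, very different variances:
plain = simulate_gamma(lambda x: 0 * x, lambda x: 0 * x, M=50000)
cv = simulate_gamma(lambda x: 0 * x, lambda x: -2.0 * x, M=50000, seed=1)
```

Both samples estimate the same solution value; only the spread of the variate changes, exactly as Theorem 2.1 predicts.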
Obviously, μ and F satisfying (2.13) cannot be constructed without knowing u(t, x), s ≤ t ≤ T, x ∈ R^d. Nevertheless, the theorem claims a general possibility of variance reduction by a proper choice of the functions μ_j and F_j, j = 1, ..., d. Theorem 2.1 can be used, for example, if we know a function û(t, x) connected with an approximating problem and close to u(t, x). In this case we take any μ_j, F_j, j = 1, ..., d, satisfying

  Σ_{j=1}^d σ_{ji} ∂û/∂x_j − μ_i û + F_i = 0,  i = 1, ..., d,  (2.14)

and then the variance var Γ, though not zero, is small. Let us emphasize that (2.13) serves only as guidance for getting suitable μ and F (recall that the mean EΓ does not depend on the choice of μ and F). In particular, an estimate ∂̂_i of the derivative ∂u/∂x_i can differ from ∂û/∂x_i. In such cases, instead of (2.14) we use

  Σ_{j=1}^d σ_{ji} ∂̂_j − μ_i û + F_i = 0,  i = 1, ..., d.  (2.15)

It might seem that the problem of at least roughly approximating the functions u(t, x) and ∂u/∂x_i(t, x) is not difficult, since they can be found approximately thanks to the Feynman-Kac formula, numerical integration of SDEs, and the MC technique. But numerical integration of the system (2.10) presupposes evaluating u(t_k, X̄_k) and ∂u/∂x_i(t_k, X̄_k) at many points (t_k, X̄_k). Their evaluation by the MC method requires different sets of auxiliary approximate trajectories because of the different starting points (t_k, X̄_k). This is too expensive; i.e., as a rule, such a procedure is more expensive than a simple increase of M in (2.6). Our aim is to propose a systematic method of approximating the functions u and ∂u/∂x_i, i = 1, ..., d, relatively cheaply, and hence to obtain systematic methods of variance reduction. To this end, we exploit the regression method of evaluating u(t_k, x) and ∂u/∂x_i(t_k, x), which allows us to use only one set of approximate trajectories starting from the initial position (t_0, x_0).

Pathwise approach for derivatives ∂u/∂x_i(s, x).
The probabilistic representation for the derivatives ∂_i(s, x) := ∂u/∂x_i(s, x) can be obtained by straightforward differentiation of (2.11) (see, e.g., [7,13]):

  ∂_i(s, x) = E[ Σ_{j=1}^d (∂f/∂x_j)(X_{s,x}(T)) δ_iX_j(T) Y_{s,x}(T) + f(X_{s,x}(T)) δ_iY(T) + δ_iZ(T) ],  (2.16)

where δ_iX(t) = ∂X_{s,x}(t)/∂x_i, δ_iY(t) = ∂Y_{s,x}(t)/∂x_i, δ_iZ(t) = ∂Z_{s,x}(t)/∂x_i satisfy the system of variational equations (2.17)-(2.19) associated with (2.10), obtained by formal differentiation of (2.10) with respect to x_i. Introduce a partition of the time interval [t_0, T], for simplicity the equidistant one: t_0 < t_1 < ... < t_N = T with step size h = (T − t_0)/N. Let us apply a weak scheme (see, e.g., [16]) to the systems of SDEs (2.10), (2.17)-(2.19) to obtain independent approximate trajectories (t_k, mX̄(t_k)), m = 1, ..., M, all starting from the point (t_0, x), together with mȲ(t_k), mZ̄(t_k), mδ_iX̄(t_k), mδ_iȲ(t_k), mδ_iZ̄(t_k). Then we obtain the following MC estimates of the derivatives ∂u/∂x_i(t_0, x) from (2.16) with (s, x) = (t_0, x):

  ∂̂_i(t_0, x) = (1/M) Σ_{m=1}^M [ Σ_{j=1}^d (∂f/∂x_j)(mX̄(T)) mδ_iX̄_j(T) mȲ(T) + f(mX̄(T)) mδ_iȲ(T) + mδ_iZ̄(T) ].

Clearly, the estimates ∂̂_i(t_k, x) for the derivatives ∂u/∂x_i(t_k, x) can be obtained analogously. Theorem 2.1 asserts that the variance in evaluating u by (2.11) can reach zero for some μ and F. In [13] it is proved that for the same μ and F the variance in evaluating ∂_i by (2.16) is equal to zero as well (we pay attention that not only μ and F but also their derivatives are present in (2.18) and (2.19)).
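The pathwise approach can be sketched on a scalar toy example of our own choosing (names and data are assumptions): for dX = aX dt + σX dw with f(x) = x², the variational equation dδX = a δX dt + σ δX dw, δX(s) = 1, is integrated along each path and the derivative is estimated by the sample mean of f'(X(T)) δX(T).

```python
import numpy as np

def pathwise_derivative(a, sig, x0, T, N, M, seed=0):
    """Estimate d/dx E f(X_T), f(x) = x**2, for dX = a*X dt + sig*X dw
    by Euler integration of X together with its first variation D = dX/dx0."""
    rng = np.random.default_rng(seed)
    h = T / N
    X = np.full(M, x0, dtype=float)
    D = np.ones(M)                      # deltaX(t), normalized: D(0) = 1
    for _ in range(N):
        dw = np.sqrt(h) * rng.standard_normal(M)
        # update X and its variation with the SAME Wiener increments
        X, D = X + a * X * h + sig * X * dw, D + a * D * h + sig * D * dw
    return np.mean(2.0 * X * D)         # E[f'(X_T) * deltaX(T)]
```

For this example E X²_T = x₀² exp((2a + σ²)T), so the exact derivative at x₀ = 1 is 2 exp((2a + σ²)T), which the estimator reproduces up to discretization and MC errors.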

Regression method of estimating conditional expectation.
Let us recall the general scheme of the linear regression method (see, e.g., [8]). Consider a sample (mX, mV), m = 1, ..., M_r, from a generic member (X, V), where X is a d-dimensional and V a one-dimensional random variable. We pay attention that we denote by M_r the size of the sample used in the regression, while M is the number of realizations used for computing the required quantity u(t_0, x_0) (see (2.6)). Let the values of X belong to a domain D ⊂ R^d. It is of interest to estimate the regression function

  c(x) = E( V | X = x ).  (2.21)

Let {φ_l(x)}_{l=1}^L be a set of basis functions, each mapping D to R. As an estimate ĉ(x) of c(x), we choose the function of the form Σ_{l=1}^L α_l φ_l(x) that minimizes the empirical risk:

  (1/M_r) Σ_{m=1}^{M_r} ( Σ_{l=1}^L α_l φ_l(mX) − mV )²,  (2.22)

so that

  ĉ(x) = Σ_{l=1}^L α̂_l φ_l(x),  (2.23)

where the α̂_l satisfy the system of linear algebraic equations (the normal equations)

  Σ_{l=1}^L α_l (1/M_r) Σ_{m=1}^{M_r} φ_l(mX) φ_{l'}(mX) = (1/M_r) Σ_{m=1}^{M_r} mV φ_{l'}(mX),  l' = 1, ..., L.  (2.24)
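The scheme above amounts to ordinary least squares in the span of the basis; a minimal sketch (the function and variable names are our own):

```python
import numpy as np

def fit_regression(X_s, V_s, basis):
    """Estimate c(x) = E(V | X = x) as the minimizer of the empirical risk (2.22):
    solve the least-squares problem for the coefficients alpha_l and return
    the fitted function x -> sum_l alpha_l * phi_l(x)."""
    A = np.column_stack([phi(X_s) for phi in basis])  # design matrix phi_l(mX)
    alpha, *_ = np.linalg.lstsq(A, V_s, rcond=None)   # solves the normal equations (2.24)
    return lambda x: sum(a * phi(x) for a, phi in zip(alpha, basis))
```

For instance, with V = X² + noise and the basis {1, x, x²}, the fitted ĉ recovers the regression function E(V | X = x) = x².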
Remark 2.2. Although in this paper we use linear regression, in principle other regression methods (see, e.g., [3,8]) can be exploited as well.

Conditional probabilistic representations and methods of evaluating u(s, x) and ∂u/∂x i (s, x) by regression.
The routine (unconditional) probabilistic representations are ideal for the MC evaluation of u(t 0 , x 0 ) by using a set of trajectories starting from the point (t 0 , x 0 ). To find u(s, x) by this approach, we need to construct another set of trajectories which starts from (s, x). However, we can use the previous set starting from (t 0 , x 0 ) to compute u(s, x), s > t 0 , if we make use of conditional probabilistic representations. In this section we introduce the conditional probabilistic representations for solutions of parabolic equations and for derivatives of the solutions.

Conditional probabilistic representations for u(s, x) and ∂u/∂x i (s, x).
Along with the unconditional probabilistic representation (2.11), (2.7), (2.10) for u(s, x), we have the following conditional one:

  u(s, X_{t_0,x_0}(s)) = E( f(X_{t_0,x_0}(T)) Y_{s,X_{t_0,x_0}(s)}(T) + Z_{s,X_{t_0,x_0}(s)}(T) | X_{t_0,x_0}(s) ).  (3.1)

This formula can be considered as the conditional version of the Feynman-Kac formula.
Analogously to (3.1), we get for the derivatives the conditional representation

  ∂_i(s, X_{t_0,x_0}(s)) = E( V_i | X_{t_0,x_0}(s) ),  (3.2)

where V_i is the pathwise variate from (2.16) started at (s, X_{t_0,x_0}(s)) (its explicit form is given in (3.11) below). So, we have two different probabilistic representations both for u(s, x) and for ∂_i(s, x): the first one is in the form of an unconditional expectation (see section 2), and the second one (i.e., (3.1) and (3.2)) is in the form of a conditional expectation. The first form is realized naturally by the MC approach and the second one by a regression method. As we discussed before, it is too expensive to run sets of trajectories starting from various initial points (s, x), while we do have the set of trajectories (t, mX_{t_0,x_0}(t)). Taking this into account, the second way (which relies on the conditional probabilistic representations and regression) is preferable although it is less accurate. A proof of (3.1) and (3.2) relies on the following assertion: if ζ is F̃-measurable and f(x, ω), x ∈ R^d, is independent of F̃, then E( f(ζ, ·) | F̃ ) = E f(x, ·)|_{x=ζ} (see, e.g., [11]). From this assertion, for any measurable g it holds (with ζ = X_{t_0,x_0}(s), F̃ = σ{X_{t_0,x_0}(s)}, f(x, ω) = g(X_{s,x}(T))) that

  E( g(X_{t_0,x_0}(T)) | X_{t_0,x_0}(s) ) = E g(X_{s,x}(T)) |_{x = X_{t_0,x_0}(s)};

hence (3.1) and (3.2).

Evaluating u(s, x).
In evaluating u(s, x) by regression, the pairs (X, V) and (mX, mV) have the form

  X = X_{t_0,x_0}(s),  V = f(X_{t_0,x_0}(T)) Y_{s,X}(T) + Z_{s,X}(T).  (3.3)

To realize a regression algorithm, we construct the set of trajectories (t, mX_{t_0,x_0}(t)). Of course, we construct them approximately at the time moments s = t_k and store the obtained values; so, in reality we have (t_k, mX̄_{t_0,x_0}(t_k)). The time s in (3.3) is equal to one of the t_k. We note that

  X_{s,X_{t_0,x_0}(s)}(t) = X_{t_0,x_0}(t),  t ≥ s,  (3.4)

i.e., X_{s,X}(t) is a continuation of the base solution starting at the moment t_0. Let us recall that Y_{s,X}(t) is the solution of the equation (see (2.10)) dY = c(t, X) Y dt + μ^⊤(t, X) Y dw(t), Y(s) = 1. Clearly,

  Y_{s,X}(t) = Y_{t_0,x_0}(t) / Y_{t_0,x_0}(s).  (3.6)

Therefore

  Z_{s,X}(t) = ( Z_{t_0,x_0}(t) − Z_{t_0,x_0}(s) ) / Y_{t_0,x_0}(s).  (3.7)

Having this sample, one can obtain û(s, x) by the linear regression method (see section 2.4):

  û(s, x) = Σ_{l=1}^L α̂_l(s) φ_l(x),  (3.8)

and estimate the derivatives by

  ∂û/∂x_i(s, x) = Σ_{l=1}^L α̂_l(s) ∂φ_l/∂x_i(x).  (3.9)

Then from (2.14) we find some μ̂(s, x), F̂(s, x) for any t_0 < s < T (in reality for any t_k) and construct the variate Γ̂(t_0, x_0) (see (2.5) and (2.7)) for u(t_0, x_0) via the system (2.10) with μ = μ̂ and F = F̂. We repeat that the variate Γ̂(t_0, x_0) is unbiased for any μ̂ and F̂. We note that it is sufficient to have rather rough (in comparison with the required accuracy in evaluating u(t_0, x_0)) approximations μ̂(s, x) and F̂(s, x) of some optimal μ and F from (2.13). Therefore, it is natural to use a coarser discretization and fewer MC runs in the regression part of evaluating û(s, x) by (3.8), i.e., to take M_r in (2.22) smaller than M and to construct the samples mX in (2.25) with a comparatively coarse discretization. Then, in computing u(t_0, x_0) with a finer discretization, the necessary values of μ̂ and F̂ at the intermediate points can be obtained after, e.g., linear interpolation of û with respect to time. The success of any regression-based approach clearly depends on the choice of basis functions. This is known to be a rather complicated problem, both in practice and in theory. In fact, it is necessary to use a special basis tailored to each particular problem. Fortunately, the variance can easily be evaluated during simulation.
Therefore, it is not very expensive from the computational point of view to check the quality of a given basis if we take coarse discretizations both in the regression part and in the main part of evaluating u(t_0, x_0) and if we take not too large numbers M_r and M of MC runs. This can help in choosing a proper basis.

Remark 3.1. Clearly, the α̂_l depend on s (on t_k). Let us note that the number L and the set {φ_l(x)}_{l=1}^L may depend on t_k as well.

Remark 3.2. It is obvious that in practice we use (2.10) with different μ and F in the implementation of the regression and in computing the required quantity u(t_0, x_0). Indeed, in the regression part of the procedure we can take arbitrary μ and F (e.g., both zero), while in computing u(t_0, x_0) we choose μ and F according to (2.14) with û obtained via the regression or according to (2.15) with û and the derivative estimates obtained via the regression.
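To make the construction of section 3.2 concrete, here is a toy sketch (entirely our own setup: Brownian paths, c = g = 0, f(x) = x², μ = F = 0, so that Y ≡ 1 and Z ≡ 0 in (3.3)) that forms the regression pairs from a single set of paths started at (t_0, x_0) = (0, 0) and fits û(t_k, ·) in the basis {1, x, x²}:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, M_r = 1.0, 20, 50000
h = T / N
dW = np.sqrt(h) * rng.standard_normal((N, M_r))
X_path = np.cumsum(dW, axis=0)       # Brownian paths started at (t0, x0) = (0, 0)

k = N // 2                           # regression time t_k = 0.5
X_s = X_path[k - 1]                  # X_{t0,x0}(t_k), i.e. the state after k steps
V = X_path[-1] ** 2                  # f(X_{t0,x0}(T)); Y = 1, Z = 0 here

A = np.column_stack([np.ones(M_r), X_s, X_s**2])     # basis {1, x, x^2}
alpha, *_ = np.linalg.lstsq(A, V, rcond=None)
u_hat = lambda x: alpha[0] + alpha[1] * x + alpha[2] * x**2   # estimate of type (3.8)
```

For this data the heat equation gives u(t, x) = x² + (T − t), so û(0.5, x) should be close to x² + 0.5 for x in the bulk of the distribution of X(0.5), even though no trajectory was started from (0.5, x).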
Remark 3.3. At s = t_0 the system (2.24) degenerates into the single equation (we suppose that not all of the φ_l(x_0) are equal to zero)

  Σ_{l=1}^L α_l φ_l(x_0) = (1/M_r) Σ_{m=1}^{M_r} mV.  (3.10)

Therefore, the coefficients α_1(t_0), ..., α_L(t_0) cannot be found from (3.10) uniquely. At the same time, the linear combination Σ_l α_l φ_l(x_0) is defined uniquely. Clearly, when t_k is close to t_0 (for instance, at t_1), the system (2.24), though not degenerate, is ill-conditioned. Nevertheless, for such t_k and for x close to x_0, the estimate û(t_k, x) can be found sufficiently accurately, even though the individual coefficients cannot be determined satisfactorily. In addition, let us emphasize that such difficulties are not essential for the whole procedure of variance reduction because the variance is equal to the integral (2.12), and unsatisfactory knowledge of u and ∂u/∂x_i on short parts of the interval [t_0, T] does not significantly affect the value of the integral.

The problem of evaluating ∂u/∂x_i(s, x) is of independent importance due to its connection with the numerical computation of Greeks in finance. Many articles are devoted to pathwise methods of estimating Greeks (see [7] and the references therein; see also [13]). In [17] a finite-difference-based method is developed, and [5,4] suggest using Malliavin calculus for computing Greeks. Several pathwise and finite-difference-based methods for calculating sensitivities of Bermudan options using regression methods and MC simulations are considered in [1] (see also the references therein). In section 3.3 we propose a conditional version of the pathwise method, and in section 3.4 we present a conditional version of the approach based on the Malliavin integration by parts for evaluating ∂u/∂x_i(s, x).

Evaluating ∂u/∂x_i(s, x) by regression.
As mentioned previously, differentiating the equality (3.8) gives an estimate for ∂_i(s, x) = ∂u/∂x_i(s, x) (see (3.9)); however, in general, it is rather rough. A more accurate way is to use the linear regression method directly.
In evaluating ∂_i(s, x) by regression, the pair (X, V_i) has the form (see (3.2) and (2.16))

  X = X_{t_0,x_0}(s),  V_i = Σ_{j=1}^d (∂f/∂x_j)(X_{t_0,x_0}(T)) δ^i_{s,X}X_j(T) Y_{s,X}(T) + f(X_{t_0,x_0}(T)) δ^i_{s,X}Y(T) + δ^i_{s,X}Z(T).  (3.11)

We already have expressions for X_{s,X}(T), Y_{s,X}(T), Z_{s,X}(T) via X_{t_0,x_0}(t), Y_{t_0,x_0}(t), Z_{t_0,x_0}(t), with t being equal to s and T (see the formulas (3.4), (3.6), (3.7)). Our nearest aim is to express δ^i_{s,X}X_j(T), δ^i_{s,X}Y(T), and δ^i_{s,X}Z(T) via quantities computed along the base trajectory as well.

We begin with δ^i_{s,X}X_j(t). The column-vector δ^i_{s,X}X(t) is the solution of the linear homogeneous stochastic system (2.17) whose coefficients depend on X_{s,X}(t) = X_{t_0,x_0}(t). Let the matrix Φ_{s,X}(t) := {δ^i_{s,X}X_j(t)} be the fundamental matrix of solutions of (2.17) normalized at time s, i.e., Φ_{s,X}(s) = I, where I is the identity matrix. Its element in the jth row and ith column is equal to δ^i_{s,X}X_j(t). Clearly,

  Φ_{s,X}(t) = Φ_{t_0,x_0}(t) Φ^{-1}_{t_0,x_0}(s).  (3.12)

Now let us turn to the column-vector δ_{s,X}Y(t) consisting of the components δ^i_{s,X}Y(t). We have (see (2.18))

  dδ_{s,X}Y = Y_{s,X}(t) Φ^⊤_{s,X}(t) ∇c(t, X_{s,X}(t)) dt + c(t, X_{s,X}(t)) δ_{s,X}Y dt,  δ_{s,X}Y(s) = 0.  (3.13)

Due to the equality X_{s,X}(t) = X_{t_0,x_0}(t) and to (3.6) and (3.12), it is not difficult to verify from (3.13) that

  δ_{s,X}Y(T) = Y_{s,X}(T) ∫_s^T Φ^⊤_{s,X}(t) ∇c(t, X_{t_0,x_0}(t)) dt.  (3.15)

In a similar way we obtain

  δ_{s,X}Z(T) = ∫_s^T [ Y_{s,X}(t) Φ^⊤_{s,X}(t) ∇g(t, X_{t_0,x_0}(t)) + g(t, X_{t_0,x_0}(t)) δ_{s,X}Y(t) ] dt.  (3.16)

Hence the column-vector ∂(s, x) with the components ∂_i(s, x) is equal to

  ∂(s, x) = E[ Y_{s,x}(T) Φ^⊤_{s,x}(T) ∇f(X_{s,x}(T)) + f(X_{s,x}(T)) δ_{s,x}Y(T) + δ_{s,x}Z(T) ],  (3.17)

where δ_{s,X}Y(T) and δ_{s,X}Z(T) are from (3.15) and (3.16).
The approximate sample pairs (mX, mV_i) are formed from the stored values along the base trajectories, where mΦ_{t_0,x_0}(s) is a realization of the fundamental matrix Φ_{t_0,x_0}(s) which corresponds to the same elementary event ω ∈ Ω as the realization mX_{t_0,x_0}(t). We use (mX, mV_i) for evaluating ∂_i(s, x), i = 1, ..., d, by the linear regression method, analogously to (3.8):

  ∂̂_i(s, x) = Σ_{l=1}^L α̂^i_l(s) φ_l(x).

Remark 3.4. This paper is most closely connected with [6,12,13,14] (see also [16]) and with the works [21,20] by N. Newton. In [21,20], both the method of control variates and the method of importance sampling for calculating solutions u(t, x) of parabolic partial differential equations by the MC method are considered. In both cases, a perfect variate (i.e., one which is unbiased and has zero variance) is constructed based on the Funke-Shevlyakov-Haussmann formula (see the corresponding references and details in [21]; such a formula is usually called the Clark-Ocone-Haussmann formula). Then some approximation methods of simulating the variates are proposed in [21,20] to yield unbiased estimators for the desired solution u(t, x) with reduced variances. If the dimension d is large, the most labor-consuming calculations are connected with the integration of the d²-dimensional system of first-order variational equations, which is required to construct the estimators. In this paper, we use variates in the form (2.11), (2.10) with μ and F satisfying (2.13). Due to Theorem 2.1, these variates are perfect if u and ∂u/∂x_i are exact. We evaluate u and ∂u/∂x_i based on conditional probabilistic representations and construct unbiased estimators for u(t, x) using (2.15) or (2.14). We note that (2.14) allows us to avoid estimating ∂u/∂x_i directly (see (3.8)-(3.9)) and hence to avoid integration of the equations of first-order variation; in addition, the estimator obtained by (2.14) remains unbiased. In spite of the fact that our approach and that of N. Newton clearly differ, they undoubtedly have profound connections. For example, the Clark-Ocone-Haussmann formula, being the basis for Newton's approach, can fairly easily be derived using the conditional probabilistic representations (3.1), (3.2).

Evaluating ∂u/∂x i (s, x) using the Malliavin integration by parts.
If f(x) is an irregular function, one can use the procedure recommended in section 3.2, where direct calculation of the derivatives ∂u/∂x_i is not needed. Another way consists in approximating f by a smooth function with the consequent use of the procedure from section 3.3. Because we do not pursue high accuracy in estimating u and ∂u/∂x_i, such an approximation of f can be quite satisfactory. For direct calculation of the derivatives ∂u/∂x_i without smoothing f, we can use a conditional version of the integration-by-parts (Bismut-Elworthy-Li) formula. This formula is successfully applied for evaluating deltas in the case of an irregular f (see, e.g., [5,4,22]).
For calculating ∂u/∂x_i in the case of u given by the representation (2.11), where X_{s,x}(T), Y_{s,x}(T), Z_{s,x}(T) satisfy the system (2.10), a variant of the integration-by-parts formula can be derived (formula (3.20)), in which μ, σ^{-1}, and F have (s', X_{s,x}(s')) as their arguments. In particular, if c = 0, g = 0, μ = 0, F = 0, we get the well-known integration-by-parts formula (see, e.g., [22]):

  ∂u/∂x_i(s, x) = E[ f(X_{s,x}(T)) · (1/(T − s)) ∫_s^T ( σ^{-1}(t, X_{s,x}(t)) Φ_{s,x}(t) )^⊤ dw(t) ]_i.  (3.21)

As in section 3.1, together with the unconditional probabilistic representation (3.20) for ∂_i(s, x), we have the following conditional one:

  ∂_i(s, X_{t_0,x_0}(s)) = E( D_i(s, X_{t_0,x_0}(s)) | X_{t_0,x_0}(s) ),  (3.22)

where D_i denotes the integration-by-parts variate from (3.20). Again, the formula (3.20) is natural for the MC approach and (3.22) for a regression method. An implementation of the regression method is based upon the corresponding approximation (mX, mV_i) of the pair (X, V_i) = (X_{t_0,x_0}(s), D_i(s, X_{t_0,x_0}(s))), following the ideas of section 3.3.

Two-run procedure.
The straightforward implementation of evaluating u(s, x) and ∂u/∂x_i(s, x) by regression as described in sections 3.2 and 3.3 requires storing the trajectory values

  Λ(t_k) := ( X_{t_0,x_0}(t_k), Y_{t_0,x_0}(t_k), Z_{t_0,x_0}(t_k) )

(or, more precisely, their approximations mΛ̄(t_k)) at all t_k, k = 1, ..., N, in the main computer memory (RAM) until the end of the simulation. This places too demanding a requirement on the RAM size and limits the practicality of the proposed approach, since in almost any practical problem a relatively large number of time steps is needed. However, this difficulty can be overcome: we can avoid storing mΛ̄(t_k) at all t_k by implementing the two-run procedure described below.
First, we recall that, as a rule, pseudorandom number generators used for MC simulations have the property that the sequence of random numbers obtained by them is easily reproducible (see, e.g., [16] and the references therein). Let us fix a sequence of pseudorandom numbers. The two-run procedure can schematically be presented as follows.
First run:
• Simulate M_r independent trajectories mΛ̄(t_k), k = 1, ..., N, with an arbitrary choice of μ and F (e.g., μ = 0 and F = 0).
• Compute and store the values mΓ̄ to form the component V needed for the regression in the second run, and compute and store the values (including mȲ(T)) needed to form the components V_i in the second run.

Second run:
• Reinitialize the random number generator so that it produces the same sequence as for the first run.
• For k = 1, ..., N:
  - simulate the same mΛ̄(t_k), m = 1, ..., M_r, as in the first run (i.e., they correspond to the same sequence of pseudorandom numbers as in the first run), keeping only the current mΛ̄(t_k) in RAM;
  - use the values stored in RAM during the first run and mΛ̄(t_k) from this run to find û(t_k, x) and the derivative estimates ∂̂_i(t_k, x) by regression (mΛ̄(t_k) and mΛ̄(T) form the pairs (mX, mV) and (mX, mV_i) needed for the regression); use the found û(t_k, x) and ∂̂_i(t_k, x) to obtain μ̂(t_k, x) and F̂(t_k, x) required for variance reduction (see section 2.2);
  - simulate (2.10) with μ = μ̂ and F = F̂ on this step and thus obtain M independent triples (mX̃_{t_0,x_0}(t_k), mỸ_{t_0,x_0}(t_k), mZ̃_{t_0,x_0}(t_k)), which we keep in RAM until the next step.
• Use the obtained (mX̃_{t_0,x_0}(T), mỸ_{t_0,x_0}(T), mZ̃_{t_0,x_0}(T)) to get the required u(t_0, x_0) (see (2.6)).

We emphasize that in the two-run procedure at each time moment s = t_k we need to keep in memory only the precomputed values stored at the end of the first run and the values mΛ̄(t_k) and (mX̃_{t_0,x_0}(t_k), mỸ_{t_0,x_0}(t_k), mZ̃_{t_0,x_0}(t_k)) (only at the current time step k), which is well within the RAM limits of a PC.
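The reproducibility trick can be sketched as follows (a generic illustration with NumPy's generator, not tied to any particular implementation of the paper): reseeding the generator makes the second run retrace the first one step by step, so only the current time slice ever resides in RAM.

```python
import numpy as np

def run(seed, n_steps, n_paths, keep_all=False):
    """Euler-type sweep driven by a seeded pseudorandom generator. With
    keep_all=False only the current slice x is held in memory; rerunning
    with the same seed reproduces exactly the same trajectories."""
    rng = np.random.default_rng(seed)   # (re)initialize the generator
    stored = []
    x = np.zeros(n_paths)
    for _ in range(n_steps):
        x = x + rng.standard_normal(n_paths)   # one step kept in RAM at a time
        if keep_all:
            stored.append(x.copy())
    return stored if keep_all else x

first_run = run(42, 10, 5, keep_all=True)   # "first run" (stores all slices, for comparison)
second_run = run(42, 10, 5)                 # "second run": same seed, same trajectories
```

Since both sweeps consume the same pseudorandom sequence, the terminal slice of the second run coincides with that of the first.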
We note that the two-run realization of the procedure from section 3.2 based on using regression for estimating u only is less computationally demanding (both on processor time and RAM and especially for problems of large dimension d) than the procedures of sections 3.3 and 3.4 which estimate the derivatives of u via regression.
The two-run procedure was used in the numerical experiments of sections 4.2 and 4.3.

Examples.
The first example is partly illustrative and partly theoretical. The second and third examples are numerical.

Heat equation. Consider the Cauchy problem

  ∂u/∂t + (σ²/2) ∂²u/∂x² = 0,  t_0 ≤ t < T,  (4.1)
  u(T, x) = x².  (4.2)
Its solution is

  u(t, x) = x² + σ²(T − t).  (4.3)

The probabilistic representation (2.10), (2.11) with μ = 0 takes the form

  dX = σ dw(t),  X(s) = x,  (4.4)
  dZ = F(t, X) dw(t),  Z(s) = 0,  (4.5)

with u(s, x) = E Γ_{s,x}, Γ_{s,x} = X²_{s,x}(T) + Z_{s,x}(T). Due to Theorem 2.1, we have var Γ_{s,x} = var[ X²_{s,x}(T) + Z_{s,x}(T) ] = 0 for the optimal choice of the function F(t, x) = −σ ∂u/∂x = −2σx. We note that in this example ∂u/∂x and the optimal F do not depend on the time t.
For the purpose of this illustrative example, we evaluate u(0, 0) = E Γ_{0,0}. Let us simulate (4.4) exactly (i.e., we have no error of numerical integration):

  X_{k+1} = X_k + σ Δ_k w,  X_0 = 0,  Δ_k w = w(t_{k+1}) − w(t_k),  k = 0, ..., N − 1,  (4.6)

and estimate

  u(0, 0) ≈ (1/M) Σ_{m=1}^M ( mX_N )²,

where the mX_N are independent realizations of X_N obtained by (4.6). Further, var Γ_{0,0} = 2σ⁴T², and hence the MC error is equal to (see (2.9))

  ρ = c σ² T (2/M)^{1/2}.  (4.7)

For instance, to achieve the accuracy ρ = 0.0001 for c = 3 (recall that there is no error of numerical integration here) in the case of σ = 1 and T = 10, one needs to perform M = 18 × 10^10 MC runs.
To reduce the MC error, we estimate ∂u/∂x by regression to get F̂(t_k, x) close to the optimal F = −2σx. As the basis functions for the regression, we take the first two Hermite polynomials:

  φ_1(x) = 1,  φ_2(x) = 2x.  (4.8)

We note that in this example the required derivative ∂u/∂x can be expanded in the basis (4.8); i.e., here we do not have any error due to the cut-off of a set of basis functions. In the construction of the estimate for ∂u/∂x, we put F = 0 in (4.5).
The variational equation associated with (4.4) has the form (see (2.17)) dδX = 0, δX(s) = 1, and hence δX(t) = 1, t ≥ s. Thus, the sample from (3.18) takes the form

  ( mX, mV ) = ( mX_{t_0,x_0}(s), 2 mX_{t_0,x_0}(T) ),

and the estimator ∂̂(t_k, x) for ∂u/∂x(t_k, x) is constructed as

  ∂̂(t_k, x) = α̂_1(t_k) φ_1(x) + α̂_2(t_k) φ_2(x),  (4.11)

where α̂_1(t_k) and α̂_2(t_k) solve the corresponding linear system (4.10) built from the sample (mX(t_k), 2 mX(T)). Here mX(t_k), m = 1, ..., M_r, k = 1, ..., N, are independent realizations of X(t_k) obtained by (4.6). We define

  F̂(t, x) = −σ ∂̂(t, x).  (4.13)

We simulate (4.5) with F = F̂(t, x) exactly (i.e., again we have no error of numerical integration):

  Z_{k+1} = Z_k + F̂(t_k, X_k) Δ_k w,  Z_0 = 0.  (4.14)

The increments Δ_k w are the same in both (4.6) and (4.14) and are independent of the ones used to estimate α̂_1 and α̂_2. We evaluate

  u(0, 0) ≈ (1/M) Σ_{m=1}^M [ ( mX_N )² + mZ_N ],  (4.15)

where mX_N and mZ_N are independent realizations of X_N and Z_N obtained according to (4.6) and (4.14). We note that the approximation (4.15) does not have the numerical integration error or the error due to the cut-off of the basis; it has the MC error only. Using Theorem 2.1, one can evaluate var Γ̂_{0,0} in the case of F = F̂ defined in (4.13) and obtain var Γ̂_{0,0} ≈ 4σ⁴T²/M_r. Then the MC error ρ in this case is equal to (compare with (4.7))

  ρ ≈ 2 c σ² T / (M M_r)^{1/2}.  (4.16)

This example illustrates that in the absence of the error due to the cut-off of a set of basis functions used in the regression and of the numerical integration error, the MC error is reduced by a factor ∼ 1/√M_r by the proposed variance reduction technique. This is, of course, a significant improvement. Indeed, let us return to the example discussed after (4.7). The estimate (4.16) implies that to achieve the accuracy ρ = 0.0001 for c = 3 in the case of σ = 1 and T = 10, one can take, e.g., M = M_r = 6 × 10^5; i.e., one can run about 10^5 times fewer trajectories than when variance reduction was not used (see the discussion after (4.7)). The gain in computational efficiency is significant in spite of the overhead cost of solving the linear system (4.10) in the "regression's runs."

Remark 4.1.
In the above analysis we assumed that "regression's runs" and the MC runs for computing the desired value u(0, 0) are independent. In practice, this assumption can be dropped, and we can use the same paths X(t) for both the "regression's runs" and the MC runs. Then, as a rule, we choose M r ≤ M.
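The whole procedure for this example can be condensed into a short script (our own illustrative implementation, with the regression estimate of ∂u/∂x fitted in the plain basis {1, x} and F̂ = −σ∂̂ used as a control variate); the variance of Γ drops by an order of magnitude or more, while the mean stays at u(0, 0) = σ²T:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, T, N, M_r, M = 1.0, 1.0, 20, 5000, 20000
h = T / N

# Regression runs: fit d(t_k, x) ~ du/dx from the pairs (X(t_k), 2 X(T)).
X_r = sigma * np.cumsum(np.sqrt(h) * rng.standard_normal((N, M_r)), axis=0)
V = 2.0 * X_r[-1]                    # pathwise variate for du/dx (here deltaX = 1)
coef = [np.linalg.lstsq(np.column_stack([np.ones(M_r), X_r[k]]), V, rcond=None)[0]
        for k in range(N)]           # coefficients of d(t_k, x) = a + b*x

# Main run: Z gets the control-variate increments F(t, X) dw with F = -sigma*(a + b*x).
X, Z = np.zeros(M), np.zeros(M)
for k in range(N):
    dw = np.sqrt(h) * rng.standard_normal(M)
    a, b = coef[k]
    Z += -sigma * (a + b * X) * dw
    X += sigma * dw
gamma = X**2 + Z                     # variance-reduced variate for u(0, 0) = sigma^2 * T
```

Here the regression recovers a ≈ 0, b ≈ 2 at every t_k (the exact derivative is 2x), and the residual variance of Γ comes only from the coefficient noise and the discreteness of the stochastic integral for Z.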
Remark 4.2. We expect (see also the experiments in section 4.2) that in the general case the MC error after application of this variance reduction technique has the form

(4.17)    ρ ≈ C / (M M_r)^{1/2} + C̄ h^p + err_B,

where the first term is of the same nature as in this illustrative example (see (4.16)); the second term is due to the error of numerical integration (it is assumed that a method of weak order p is used); and the third arises from the use of a finite set of functions as the basis in the regression, while the solution u(t, x) is usually expandable only in a basis consisting of an infinite number of functions (i.e., this error is due to the cut-off of the basis).

We note that finding an appropriate basis for the regression when applying this variance reduction approach to a particular problem can be a difficult task and requires some knowledge of the solution u(t, x) of the problem under consideration. Roughly speaking, in the proposed implementation of the variance reduction methods (the method of importance sampling, the method of control variates, or the combining method) we replace the task of finding an approximate solution of the problem of interest by the task of finding an appropriate basis for the regression. For complicated systems of SDEs, it is preferable to use regression to approximate the solution u(t, x) and then differentiate this approximation to approximate the derivatives ∂u/∂x_i.

In the case of this illustrative example we take the first three Hermite polynomials as the basis functions for the regression (see (4.18)); in this example the required function u(t, x) can be expanded in the basis (4.18). We construct the estimator û(t_k, x) = α̂_1(t_k)ϕ_1(x) + α̂_2(t_k)ϕ_2(x) + α̂_3(t_k)ϕ_3(x) for u(t_k, x), where α̂_1(t_k), α̂_2(t_k), α̂_3(t_k) satisfy the system of linear algebraic equations (2.24) with the corresponding coefficients. Further, we approximate the derivative ∂u/∂x(t_k, x) by ∂û/∂x(t_k, x) (see (4.19)), and we define F̂(t, x) (see (4.21)), which we use for variance reduction by putting F = F̂ in (4.5). In the experiments we simulate (4.5) with F = F̂(t, x) exactly (see (4.14)).
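The Hermite-basis regression step can be sketched as follows. We assume (an assumption of this sketch, not stated in the text) that the basis (4.18) consists of the first three probabilists' Hermite polynomials He_0, He_1, He_2, and we again use the illustrative setting dX = σ dw, f(x) = x^2, for which the conditional representation gives u(t_k, x) = E[X(T)^2 | X(t_k) = x] = x^2 + σ^2(T − t_k).

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Regression of u(t_k, .) on He_0, He_1, He_2 (assumed normalization of the
# basis (4.18)) for the assumed example dX = sigma dw, f(x) = x^2.
rng = np.random.default_rng(1)
sigma, T, N, Mr = 1.0, 10.0, 50, 100_000
h = T / N

dw = rng.normal(0.0, np.sqrt(h), size=(Mr, N))
X = np.concatenate([np.zeros((Mr, 1)), np.cumsum(sigma * dw, axis=1)], axis=1)

k = 25                                  # an intermediate time layer, t_k = 5
A = He.hermevander(X[:, k], 2)          # design matrix: He_0, He_1, He_2 at X(t_k)
coef, *_ = np.linalg.lstsq(A, X[:, -1] ** 2, rcond=None)  # least squares, cf. (2.24)

x = 0.5
u_hat = He.hermeval(x, coef)                 # u-hat(t_k, 0.5), exact value 5.25
du_hat = He.hermeval(x, He.hermeder(coef))   # d u-hat/dx(t_k, 0.5), exact value 1
print(u_hat, du_hat)
```

Differentiating the fitted expansion (here via `hermeder`) is exactly the recipe recommended above for obtaining ∂û/∂x_i from û.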
The new estimator for u(0, 0) has the form (4.15) again, but with the new Z_N corresponding to the choice of F̂(t, x) from (4.21).
We also performed similar experiments for the terminal condition u(T, x) = x^4 in (4.1). To estimate ∂u/∂x by regression, we took the basis consisting of the first four Hermite polynomials. The results were analogous to those given above for the case of x^2.
The SDEs (4.22) are ergodic with the Gibbs invariant measure μ. We are interested in computing the average of the potential energy with respect to the invariant measure associated with (4.22), i.e., E_μ U(Q) (see (4.23)). To this end (see further details in [19]), we simulate the system (4.22) on a long time interval and approximate the ergodic limit E_μ U(Q) by EU(Q(T)) for a large T. To illustrate variance reduction via regression, we simulate U(Q(T)) + Z(T), where Z(t), 0 ≤ t ≤ T, satisfies (4.24). We choose the n-dimensional vector function F(t, p, q) according to (2.14); it is expressed in terms of the derivatives ∂û/∂p_i, where û = û(t, p, q) is an approximation of the function u(t, p, q) := EU(Q_{t,p,q}(T)).
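The ergodic-limit computation can be illustrated by a plain MC sketch. The exact form of (4.22) is not reproduced in this excerpt, so everything below is an assumption: we take the standard underdamped Langevin system dQ = P dt, dP = −∇U(Q) dt − ν P dt + σ dW, whose Gibbs invariant density is proportional to exp(−(2ν/σ^2)(|p|^2/2 + U(q))). For the harmonic potential U(q) = (λ/2)|q|^2 with n = 3 and λ = ν = σ = 1 this gives E_μ U(Q) = 3 · (1/2) · (1/2) = 0.75, consistent with the value 0.7500 reported below.

```python
import numpy as np

# Assumed Langevin system (see lead-in): dQ = P dt,
# dP = -lam*Q dt - nu*P dt + sigma dW, harmonic potential U(q) = (lam/2)|q|^2.
rng = np.random.default_rng(2)
n, lam, nu, sigma = 3, 1.0, 1.0, 1.0
T, h, M = 10.0, 0.01, 20_000
N = int(round(T / h))

q = np.zeros((M, n))
p = np.zeros((M, n))
for _ in range(N):
    # weak Euler step (the paper uses a second-order weak method instead)
    dw = rng.normal(0.0, np.sqrt(h), (M, n))
    q, p = q + p * h, p + (-lam * q - nu * p) * h + sigma * dw

est = (0.5 * lam * np.sum(q * q, axis=1)).mean()   # MC estimate of E U(Q(T))
print(est)   # close to 0.75, up to O(h) bias and MC error
```

This is the F = 0 ("no variance reduction") baseline; the regression-based control variate described next reduces the number M of trajectories needed.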

G. N. MILSTEIN AND M. V. TRETYAKOV
We approximate (4.24) by the standard second-order weak method (see [16, p. 103]), obtaining the scheme (4.28), where ξ_{ik} and ζ_{ik} are mutually independent random variables, the ξ_{ik} are distributed according to the law (4.27), and the ζ_{ik} are distributed according to the law P(ζ = ±1) = 1/2. We consider two potentials: the harmonic potential (4.29) and the hard anharmonic potential (4.30). We define the approximation û(t, p, q) used in (4.25) at t = t_k, k = 0, . . . , N − 1, as follows. First, it is reasonable to put ∂û/∂p_i(t, p, q) = 0 for 0 ≤ t ≤ T_0 with some relatively small T_0, since for large T the function u(t, p, q), 0 ≤ t ≤ T_0, is almost constant due to ergodicity (the expectation in (4.23) is almost independent of the initial condition).
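Sampling the discrete random variables of the weak scheme is cheap. The law (4.27) of the ξ_{ik} is not reproduced in this excerpt; a standard choice in second-order weak schemes (an assumption of this sketch) is the three-point law P(ξ = 0) = 2/3, P(ξ = ±√3) = 1/6, which matches the moments of N(0, 1) up to order five; ζ = ±1 with probability 1/2 each, as stated in the text.

```python
import numpy as np

# Assumed three-point law for xi (the text's (4.27) is not shown here) and the
# stated two-point law for zeta.
rng = np.random.default_rng(3)
M = 1_000_000

xi = rng.choice(np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)]),
                size=M, p=[1 / 6, 2 / 3, 1 / 6])
zeta = rng.choice(np.array([-1.0, 1.0]), size=M)

# Moment check: E xi = 0, E xi^2 = 1, E xi^4 = 3, E zeta = 0.
print(xi.mean(), (xi ** 2).mean(), (xi ** 4).mean(), zeta.mean())
```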
Further, let T_0, T, h, N, and a nonnegative integer κ be such that the time layers θ_k at which the regression is recalculated fit within [T_0, T]. In the case of the harmonic potential the required function u(t, p, q) can be expanded in a basis consisting of the finite number of functions (4.31). In our experiments we deal with three oscillators (n = 3); the basis (4.31) in this case has 28 functions. We use the set of functions (4.31) as the set of basis functions for regression in both the harmonic and the hard anharmonic case. Namely, using regression as described in section 3.2, we construct the estimator û(θ_k, p, q) for u(θ_k, p, q) as

(4.32)    û(θ_k, p, q) = Σ_{l=1}^{L} α̂_l(θ_k) ϕ_l(p, q),

where the ϕ_l are defined in (4.31) and the α̂_l(θ_k) satisfy the system of linear algebraic equations (2.24). The matrix of this linear system is positive definite, and we solve the system by Cholesky decomposition. To find the estimator û, we use M_r independent trajectories.
Then, for T_0 < t_k < T, we put û(t_k, p, q) = û(θ_k, p, q) with θ_k ≤ t_k < θ_{k+1}. Recalculating the estimator û only once every κ steps reduces the cost of the procedure.
We note that for the basis (4.31) the corresponding function F from (4.25) is such that some terms in the scheme (4.28) cancel; in particular, it is not required to simulate the ζ_{ik} in this case.
We compute u(0, p, q) in the usual way, by simulating M independent realizations of Q_N, Z_N from (4.26), (4.28). In these experiments the two-run procedure described in section 3.5 was used. Suppose we would like to compute u(0, p, q) for the particular set of parameters n = 3, λ = 1, ν = 1, σ = 1, T = 10 and the potentials (4.29) and (4.30) with an accuracy of order 10^{-3}. Since we are using a scheme of order two, we can take h = 0.02.
Let us first consider the case of the harmonic potential (4.29). Without variance reduction (i.e., for F = 0), we obtain 0.7500 ± 0.0010 with the fiducial probability 95% by simulating M = 1.4 × 10^6 trajectories, taking ∼541 sec on a PC. When we use the variance reduction technique as described above, it is sufficient to take T_0 = 2, κ = 2, M_r = 2 × 10^4, M = 3 × 10^4 to get 0.7496 ± 0.0010 in ∼64 sec. In this example the procedure with variance reduction thus requires about an eighth of the computational time. All the expenses are taken into account, including the time required for the first run of the two-run procedure, which is less than 10% of the total time. We recall that in this case the required function u(t, p, q) can be expanded in the finite basis (4.31), unlike the case of the hard anharmonic potential, for which such a basis is infinite.

Now consider the case of the hard anharmonic potential (4.30). Without variance reduction (i.e., for F = 0), we obtain 0.6491 ± 0.0011 with the fiducial probability 95% by simulating M = 10^6 trajectories, taking ∼403 sec on a PC. With variance reduction, we reach the same level of accuracy, 0.6491 ± 0.0011, in ∼98 sec by choosing, e.g., T_0 = 2, κ = 2, M_r = 2.5 × 10^4, M = 5.5 × 10^4. Thus, the procedure with variance reduction requires about a quarter of the computational time.
Some other results of our numerical experiments are presented in Tables 2 and 3. They show the dependence of the MC error on M and M_r. The numerical integration error is relatively small here and does not essentially affect the results. The case M_r = 0 means that the simulation was done without variance reduction. We observe that in both tables, for a fixed M_r, the MC error decreases as ∼ 1/√M. Further, we see from Table 2 that the MC error is ∼ 1/√M_r for fixed M (for M_r > 0, of course) and, consequently, it is ∼ 1/√(M M_r) when the variance reduction is used (we recall that the time step is relatively small here). As noted before, the basis used in the variance reduction is such that the function u(t, x) can be expanded in it in the case of the harmonic potential; i.e., err_B in (4.17) is equal to 0. These observations are consistent with the MC error estimate (4.17). For the anharmonic potential, err_B is not equal to zero, and we see in Table 3 that increasing M_r has less impact on the MC error in this case.

Table 2. Harmonic potential: two standard deviations of the estimator (4.33) in the case of the potential (4.29).

Pricing a binary asset-or-nothing call option.
Consider the Black-Scholes equation for pricing a binary asset-or-nothing call option (4.34). The solution of this problem for x > 0 and K > 0 is given by (4.35). The purpose of this example is to illustrate that the approach to evaluating u(s, x) introduced in section 3.2 works, in principle, in the case of discontinuous initial conditions f(x). As the set of basis functions for regression, we use the set (4.39) consisting of three functions, the first of which is

ϕ_1(x) = (K/π)(arctan(α(x − K)) + arctan(αK)),

where α > 0, β > 0, and γ > 0 are parameters which can change from one time layer to another. We note that the functions are chosen so that ϕ_l(0) = 0, l = 1, 2, 3, and the payoff f(x) is well approximated by ϕ_1(x) + ϕ_2(x) with large α and small β.
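The role of ϕ_1 can be seen directly. Only ϕ_1 is legible in this excerpt; it vanishes at zero and, for large α, approximates the scaled step K · 1_{x > K}, so that together with a smoothed ramp ϕ_2 it can reproduce the asset-or-nothing payoff f(x) = x · 1_{x > K}. A minimal sketch (parameter values are ours):

```python
import numpy as np

# phi_1(x) = (K/pi)(arctan(alpha(x - K)) + arctan(alpha K)):
# phi_1(0) = 0 exactly, and for large alpha it is ~ 0 for x < K and ~ K for x > K.
def phi1(x, K=1.0, alpha=100.0):
    return (K / np.pi) * (np.arctan(alpha * (x - K)) + np.arctan(alpha * K))

x = np.array([0.0, 0.5, 0.99, 1.01, 1.5, 2.0])
print(phi1(x))   # ~ 0 below K = 1 and ~ K above, with a smooth transition
```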
In the experiments, we take the volatility ν = 0.2, the interest rate r = 0.02, and the maturity time T = 3, and we approximate the option price u(0, 1), whose exact value due to (4.35) is u(0, 1) ≈ 0.63548. We define the time-dependent α = α(t) and β = β(t) via linear interpolation, and we choose γ = 8. We simulate (4.37)–(4.38) using the weak Euler scheme with time step h = T/N = 0.001. In the first run (see section 3.5 for the description of the algorithm), we put F = 0 and store the values f(m X(T)) e^{−rT}, which are needed for the regression in the second run. In the second run, using regression with the set of basis functions (4.39), we construct the estimator û(θ_k, x) for u(θ_k, x), where θ_k = κk h, k = 1, . . . , N′; here κ and N′ are nonnegative integers such that κN′h = T. We use κ = 5; i.e., we recalculate the estimator û only once per five time layers to reduce the computational cost. Further, û(t_k, x) is set equal to zero for 0 ≤ t_k < 0.01.
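The quoted reference value can be checked against the closed-form asset-or-nothing price u(t, x) = x Φ(d_1), d_1 = (ln(x/K) + (r + ν^2/2)(T − t))/(ν√(T − t)). The strike K is not stated in this excerpt; K = 1 is an assumption of this sketch, and with ν = 0.2, r = 0.02, T = 3 it reproduces the quoted value.

```python
import math

# Closed-form asset-or-nothing call price; K = 1 is an assumed strike.
def asset_or_nothing_call(x, K, r, nu, T):
    d1 = (math.log(x / K) + (r + 0.5 * nu ** 2) * T) / (nu * math.sqrt(T))
    Phi = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))  # standard normal CDF
    return x * Phi

print(asset_or_nothing_call(1.0, 1.0, 0.02, 0.2, 3.0))   # ~ 0.63548
```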
In the second run we put F(t, x) = −ν∂û/∂x. In both runs we simulate M = 4 × 10^4 independent trajectories. As a result, we get u(0, 1) ≈ ū(0, 1) = 0.6358 ± 0.0018 with the fiducial probability 95%. To achieve a similar result without variance reduction, namely ū(0, 1) = 0.6342 ± 0.0019, one has to simulate M = 5 × 10^5 independent trajectories, which requires at least three times more computational time than the procedure with variance reduction. This experiment demonstrates that the simple and cheap estimation of ∂u/∂x by ∂û/∂x works even in the case of discontinuous initial conditions.
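A plain MC baseline for this price (the F = 0 case) can be sketched with exact sampling of the Black-Scholes asset, X(T) = x exp((r − ν^2/2)T + ν√T ξ), ξ ~ N(0, 1), instead of the weak Euler scheme used in the text; the strike K = 1 is again an assumption.

```python
import numpy as np

# Plain MC (no variance reduction) for the asset-or-nothing call with exact
# lognormal sampling; the discounted payoff is X(T) * 1_{X(T) > K} * e^{-rT}.
rng = np.random.default_rng(4)
x0, K, r, nu, T, M = 1.0, 1.0, 0.02, 0.2, 3.0, 400_000

XT = x0 * np.exp((r - 0.5 * nu ** 2) * T + nu * np.sqrt(T) * rng.normal(size=M))
price = np.exp(-r * T) * np.mean(XT * (XT > K))
print(price)   # ~ 0.6355
```

Because the sampling is exact, the only error here is the MC error, which makes the comparison with the variance-reduced run above transparent.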

Conclusions.
When starting an MC simulation, we first of all have to estimate the number of trajectories required to reach a prescribed accuracy. Fortunately, this can easily be done: a reliable estimate of the variance can be obtained in a preliminary numerical experiment using a relatively small set of trajectories. If the required number of trajectories is too large, we inevitably run into the problem of variance reduction. The known variance reduction methods (the method of importance sampling, the method of control variates, and the combining method) are based on the assumption that approximations of the solution u(t, x) of the considered problem and of its spatial derivatives ∂u(t, x)/∂x_i are known. In this paper we proposed to construct such approximations as part of the MC simulation itself, using conditional probabilistic representations together with the regression method, and thus to make the variance reduction methods practical. The basis used in the regression method can be chosen using some a priori knowledge of the considered problems, as illustrated in the examples.
As is known (see, e.g., [16]), the variance reduction methods are applicable in the case of boundary value problems for parabolic and elliptic equations as well. Although here we illustrated the proposed implementation of these variance reduction methods for the Cauchy problems for parabolic equations, the approach is straightforwardly applicable to boundary value problems.
We also note that the proposed technique of conditional probabilistic representations together with regression can be used for evaluating various Greeks for American- and Bermudan-type options (see [1]).