Optimal importance sampling parameter search for Lévy processes via stochastic approximation

Reiichiro Kawai

The author proposes stochastic approximation methods for finding the optimal measure change, via exponential tilting, for Lévy processes in Monte Carlo importance sampling variance reduction. In accordance with the structure of the underlying Lévy measure, either a constrained or an unconstrained stochastic approximation algorithm has to be chosen. For both cases, almost sure convergence to a unique stationary point is proved. Numerical examples are presented to illustrate the effectiveness of our method.


Introduction
The importance sampling method aims to reduce the variance of iid Monte Carlo summands by appropriately transforming the underlying probability measure, from which the random variables or stochastic processes of interest are generated, so as to put more weight on important events and less on undesirable ones. Due to its practical effectiveness, it has long been regarded as one of the most important variance reduction methods in Monte Carlo simulation and has been intensively studied in a wide range of applications, such as mathematical finance, queueing theory, and sequential analysis, to mention just a few. For its principle, with some numerical examples, see, for instance, Section 4.6 of Glasserman [7].
In importance sampling "variance" reduction, the optimal measure change is nothing but the one minimizing the variance of the iid Monte Carlo summands. In the Gaussian framework, it is well known by the Girsanov theorem that the measure change is indexed by a single parameter, namely the drift parameter, and several attempts have been made to find the optimum. In financial applications, for example, Glasserman, Heidelberger, and Shahabuddin [6] propose an optimization setting to find a nearly optimal measure change in pricing financial derivatives, while Su and Fu [14] and Arouna [1] apply stochastic approximation to locate the root of the gradient, with respect to the measure change parameter, of the Monte Carlo variance.
The aim of the present work is to apply the idea of [1, 14] to a class of Lévy processes with no Brownian component, or equivalently, after discretization, a class of infinitely divisible distributions without Gaussian component. In general, the measure change for Lévy processes involves every single jump forming the sample paths. (See Section 33 of Sato [12] for details. For an importance sampling method with such intricate measure changes, see Kawai [8].) In this paper, however, we restrict ourselves to the simplest measure change, often called the Esscher transform, which need not take the jumps of the Lévy process into account but only looks at the terminal marginals. The Esscher transform is nothing but the well-known exponential tilting of distributions and is thus indexed by a single parameter, just as in the Gaussian case. Indeed, the Gaussian measure change is also an exponential tilting, which happens to be equivalent to the drift change due to the structure of the Gaussian distribution. A crucial difference, however, is that depending on the structure of the underlying Lévy measure, the exponential-tilting parameter might have to lie in a bounded set, while the drift parameter of the Gaussian distribution may be taken arbitrarily.
The rest of the paper is organized as follows. Section 2 recalls the Esscher transform and the principle of importance sampling variance reduction, and constructs the basis of our analysis. In Section 3, the almost sure convergence of the stochastic approximation is proved separately for the constrained and unconstrained algorithms, depending on the structure of the underlying Lévy measure. Section 4 illustrates the effectiveness of our method via numerical examples for both constrained and unconstrained stochastic approximation algorithms, and Section 5 concludes.

Preliminaries
Let us begin with some notation used throughout the text. N is the collection of all positive integers, with N_0 := {0} ∪ N. R^d is the d-dimensional Euclidean space with the norm ‖·‖ and the inner product ⟨·, ·⟩, and (Ω, F, P) is our underlying probability space. Leb(·) denotes the Lebesgue measure, while P|_{F_t} is the restriction of a probability measure P to the σ-field F_t. Denote by ∇ the gradient, and by Hess[·] the Hessian matrix. The interval (0, −1] is understood to be [−1, 0). The expression f(x) ∼ g(x) means that f(x)/g(x) tends to 1. We say that a stochastic process {X_t : t ≥ 0} in R^d is a Lévy process if it has independent and stationary increments, if it is continuous in probability, and if X_0 = 0, a.s. By the Lévy-Khintchine representation theorem, the characteristic function of its marginal distributions is uniquely given by

E[e^{i⟨y, X_t⟩}] = exp[ t ( i⟨γ, y⟩ − (1/2)⟨y, Ay⟩ + ∫_{R^d} ( e^{i⟨y, z⟩} − 1 − i⟨y, z⟩ 1(‖z‖ ≤ 1) ) ν(dz) ) ], y ∈ R^d,

where γ ∈ R^d, A is a symmetric nonnegative-definite d × d matrix, and ν is a Lévy measure on R^d. When the above holds, we say that the Lévy process {X_t : t ≥ 0} is generated by the triplet (γ, A, ν). In this paper, we restrict ourselves to pure-jump Lévy processes, that is, we set A ≡ 0 throughout. Moreover, we also assume that all components of the marginal X_1 are non-degenerate. A function f : R^d → [0, ∞) is said to be submultiplicative if there exists a positive constant a such that f(x + y) ≤ a f(x) f(y) for all x, y ∈ R^d. The functions ‖x‖ ∨ 1 and e^{⟨c, x⟩} are submultiplicative, and a product of two submultiplicative functions is again submultiplicative. We recall an important moment property of Lévy processes, which will be used often in what follows.
Theorem 2.1. (Sato [12], Theorem 25.3) Let f be a submultiplicative, locally bounded, measurable function on R^d, and let {X_t : t ≥ 0} be a Lévy process in R^d with Lévy measure ν. Then, E[f(X_t)] is finite for every t > 0 if and only if ∫_{‖z‖ > 1} f(z) ν(dz) < +∞.

Esscher transform
Among the density transformations of Lévy processes, there is a simple class with a path-independent structure, built by exponential tilting. This class is often called the Esscher transform.
Let {X_t : t ≥ 0} be a Lévy process in R^d generated by (γ, 0, ν), and let (F_t)_{t≥0} be the natural filtration of {X_t : t ≥ 0}. Define

Λ_1 := {λ ∈ R^d : E_P[e^{⟨λ, X_1⟩}] < +∞} = {λ ∈ R^d : ∫_{‖z‖ > 1} e^{⟨λ, z⟩} ν(dz) < +∞},

where the second equality holds by Theorem 2.1, and assume that Leb(Λ_1) > 0 throughout. It is known that the set Λ_1 contains the origin and is convex. For λ ∈ Λ_1, we denote by ϕ the cumulant generating function of the marginal distribution at unit time of {X_t : t ≥ 0} under the probability measure P, that is, ϕ(λ) := ln E_P[e^{⟨λ, X_1⟩}]. For convenience in notation, we also write ϕ_t(λ) := ln E_P[e^{⟨λ, X_t⟩}], t > 0, in view of ϕ_t(λ) = ln E_P[e^{⟨λ, X_t⟩}] = t ϕ(λ), where the second equality holds by the infinite divisibility of the marginal distributions of Lévy processes. Note that ϕ(λ) is continuous in λ ∈ Λ_1. Under the probability measure Q_λ, where λ ∈ Λ_1 and which is defined via the Radon-Nikodym derivative, for every t ∈ (0, +∞),

dQ_λ/dP |_{F_t} = e^{⟨λ, X_t⟩ − ϕ_t(λ)}, P-a.s.,

the stochastic process {X_t : t ≥ 0} is again a Lévy process generated by (γ_λ, 0, ν_λ), where

γ_λ = γ + ∫_{‖z‖ ≤ 1} z (ν_λ − ν)(dz), and ν_λ(dz) = e^{⟨λ, z⟩} ν(dz). (2.1)

Then, the probability measures P|_{F_t} and Q_λ|_{F_t} are mutually absolutely continuous for every t ∈ (0, +∞). We also get E_{Q_λ}[e^{−⟨λ, X_1⟩}] < +∞, and dP/dQ_λ |_{F_t} = e^{−⟨λ, X_t⟩ + ϕ_t(λ)}, Q_λ-a.s. For t > 0, let p be a probability density function on R^d of the random vector X_t under P, provided that it is well defined. Then, the density function p_λ of X_t under Q_λ is given by

p_λ(x) = e^{⟨λ, x⟩ − ϕ_t(λ)} p(x), x ∈ R^d. (2.2)
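To make the exponential tilting concrete, the following minimal Python sketch (our own toy illustration, not from the paper) tilts the Poisson distribution, the unit-time marginal of the simplest Lévy process: with ϕ(λ) = μ(e^λ − 1), the tilted law p_λ(k) = e^{λk − ϕ(λ)} p(k) of (2.2) is again Poisson, with mean μe^λ.

```python
import math

def poisson_pmf(k, mu):
    """Poisson(mu) probability mass function."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def tilted_pmf(k, mu, lam):
    # Esscher transform: p_lambda(k) = exp(lam*k - phi(lam)) * p(k),
    # with cumulant generating function phi(lam) = mu*(exp(lam) - 1).
    phi = mu * (math.exp(lam) - 1.0)
    return math.exp(lam * k - phi) * poisson_pmf(k, mu)

# The tilt of a Poisson(mu) law is again Poisson, now with mean mu*exp(lam).
mu, lam = 2.0, 0.5
for k in range(6):
    assert abs(tilted_pmf(k, mu, lam) - poisson_pmf(k, mu * math.exp(lam))) < 1e-12
```

The identity checked in the loop mirrors the claim in the text that exponential tilting merely re-weights the terminal marginal, leaving the distributional family intact.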

Importance sampling variance reduction
Suppose we want to evaluate E_P[F(X)] by Monte Carlo simulation, where F(X) := F({X_t : t ∈ [0, T]}) ∈ L^2(Ω, F_T, P), and assume P(F(X) ≠ 0) > 0. In view of the equality E_P[F(X)] = E_{Q_λ}[(dP/dQ_λ) F(X)], define the set Λ_2 := {λ ∈ Λ_1 : E_P[e^{−⟨λ, X_T⟩} F(X)^2] < +∞}, and suppose that Leb(Λ_2) > 0. Let us now give a lemma, whose proof will often be adapted in what follows.
For λ ∈ Λ_2, the variance under the probability measure Q_λ is given by

V(λ) := Var_{Q_λ}[(dP/dQ_λ) F(X)] = E_P[e^{−⟨λ, X_T⟩ + ϕ_T(λ)} F(X)^2] − (E_P[F(X)])^2.

Define also the set Λ_3 := {λ ∈ Λ_2 : E_P[‖X_T‖^2 e^{−⟨λ, X_T⟩} F(X)^2] < +∞} and assume that Leb(Λ_3) > 0.
Proof. The convexity of Λ_3 can be proved in a similar manner to the proof of Lemma 2.2. Since λ ∈ Λ_3, by the Hölder inequality, the differentiated integrand is dominated by an integrable function, and thus with the help of the dominated convergence theorem, we can obtain the gradient

∇V(λ) = E_P[(∇ϕ_T(λ) − X_T) e^{−⟨λ, X_T⟩ + ϕ_T(λ)} F(X)^2],

and also the Hessian

Hess[V(λ)] = E_P[(Hess[ϕ_T(λ)] + (∇ϕ_T(λ) − X_T)(∇ϕ_T(λ) − X_T)^⊤) e^{−⟨λ, X_T⟩ + ϕ_T(λ)} F(X)^2].

Then, for y ∈ R^d_0, ⟨y, Hess[V(λ)] y⟩ > 0, since Hess[ϕ_T(λ)] reduces to the variance-covariance matrix of the random vector X_T under the probability measure Q_λ, which is clearly positive definite. The proof is complete.
Remark 2.4. The definition of the sets Λ_2 and Λ_3 is less intuitive and of less practical use. We may instead give a more intuitive definition in connection with the Lévy measure, at the cost of giving up some part of the domain. It is easy to check that both Λ_2 and Λ_3 are convex. The more intuitive sets are derived as follows. By the Hölder inequality, with 1/p + 1/q = 1 for some p > 1 and for k = 0, 2,

E_P[‖X_T‖^k e^{−⟨λ, X_T⟩} F(X)^2] ≤ (E_P[|F(X)|^{2p}])^{1/p} (E_P[‖X_T‖^{kq} e^{−q⟨λ, X_T⟩}])^{1/q}.

By Theorem 2.1, the finiteness of the second expectation of the above right hand side for each k = 0, 2 is equivalent to ∫_{‖z‖ > 1} ‖z‖^{kq} e^{−q⟨λ, z⟩} ν(dz) < +∞ for the corresponding k. This, with k = 0, asserts the definition of Λ_2, while the definition of Λ_3 is verified with k = 2.
Meanwhile, as soon as F(X) reduces to f(X_T) with f submultiplicative, the sets above can be characterized directly in terms of the Lévy measure by Theorem 2.1.
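Since V(λ) = E_P[e^{−⟨λ, X_T⟩ + ϕ_T(λ)} F(X)^2] − (E_P[F(X)])^2 is an expectation under P, it can be estimated for any λ from a single batch of P-samples. The Python sketch below (a toy illustration of ours, not from the paper) does so for a Poisson stand-in for the marginal X_T and a rare tail event F; the sampler, the parameters, and the tilt value 0.9 are all hypothetical choices.

```python
import math, random

def estimate_V(lam, samples, F, phi_T):
    # V(lam) = E_P[exp(-lam*X_T + phi_T(lam)) F(X)^2] - (E_P[F(X)])^2,
    # estimated from samples drawn under P only.
    m2 = sum(math.exp(-lam * x + phi_T(lam)) * F(x) ** 2 for x in samples) / len(samples)
    m1 = sum(F(x) for x in samples) / len(samples)
    return m2 - m1 ** 2

def poisson(mu):
    """Crude inversion-free Poisson sampler (product-of-uniforms method)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

random.seed(0)
mu_T = 3.0                                   # X_T ~ Poisson(mu_T), toy Levy marginal
phi_T = lambda lam: mu_T * (math.exp(lam) - 1.0)
F = lambda x: 1.0 if x >= 8 else 0.0         # rare tail event

samples = [poisson(mu_T) for _ in range(200000)]
v0 = estimate_V(0.0, samples, F, phi_T)      # variance of plain Monte Carlo
v1 = estimate_V(0.9, samples, F, phi_T)      # variance after tilting toward the tail
print(v0, v1)
```

A tilt that moves the mean of X_T toward the event {X_T ≥ 8} should make v1 noticeably smaller than v0, which is exactly the variance reduction the optimal λ seeks.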

Convergence of stochastic approximation algorithms
We begin by recalling the stochastic approximation algorithms. Let {X_{n,t} : t ∈ [0, T]}_{n∈N} be iid copies of the stochastic process {X_t : t ∈ [0, T]}. For simplicity in notation, we will write X_n := X_{n,T} for n ∈ N, and F(X)_n := F({X_{n,t} : t ∈ [0, T]}). Let H be a connected set in R^d with 0 ∈ H, and define a sequence {Y_n}_{n∈N} of random vectors in R^d by

Y_{n+1} := (∇ϕ_T(λ_n) − X_{n+1}) e^{−⟨λ_n, X_{n+1}⟩ + ϕ_T(λ_n)} F(X)_{n+1}^2,

where λ_0 ∈ H and {λ_n}_{n∈N} is a sequence of random vectors in R^d iteratively generated by

λ_{n+1} = Π_H(λ_n − ε_n Y_{n+1}), (3.1)

where Π_H is the projection onto the constraint set H and where {ε_n}_{n∈N_0} is a sequence of positive constants satisfying ∑_{n∈N_0} ε_n = +∞ and ∑_{n∈N_0} ε_n^2 < +∞. Moreover, define the filtration (G_n)_{n∈N_0} by G_n := σ(λ_0, X_1, . . ., X_n). In what follows, "the constrained algorithms" means the algorithms where the constraint set H in (3.1) is a compact set in R^d and the sequence {λ_n}_{n∈N_0} is required to stay in that set, while by "the unconstrained algorithms" we mean those whose constraint set H is extended to R^d, that is, the sequence {λ_n}_{n∈N_0} may lie anywhere in R^d.
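A minimal sketch of the projected recursion (3.1), reusing the toy Poisson stand-in for X_T (our own hypothetical setting, not the paper's example): the single-sample gradient estimate Y uses ϕ'_T(λ) = μ_T e^λ, the step sizes are ε_n = α/(n + 1), and H is a hypothetical compact interval.

```python
import math, random

random.seed(1)
mu_T = 3.0                                    # X_T ~ Poisson(mu_T) under P
phi = lambda lam: mu_T * (math.exp(lam) - 1.0)   # cumulant generating function
dphi = lambda lam: mu_T * math.exp(lam)          # phi'(lam) = E_{Q_lam}[X_T]
F = lambda x: 1.0 if x >= 8 else 0.0             # rare tail event

def poisson(mu):
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

# Projected stochastic approximation: lam_{n+1} = Pi_H(lam_n - eps_n * Y_{n+1}),
# with Y the one-sample estimate of grad V(lam) and H = [-2, 2].
lam, alpha, H = 0.0, 0.05, (-2.0, 2.0)
for n in range(20000):
    x = poisson(mu_T)                            # fresh sample under P
    y = (dphi(lam) - x) * math.exp(-lam * x + phi(lam)) * F(x) ** 2
    lam = min(max(lam - alpha / (n + 1) * y, H[0]), H[1])
print(lam)   # drifts toward the tilt centering mass near the event
```

Note that the iterate only moves on the rare iterations where F is nonzero, which is one reason the choice of α and of the constraint set matters in practice.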

Define the set Λ_4 := {λ ∈ Λ_1 : E_P[‖X_T‖^k e^{−2⟨λ, X_T⟩} F(X)^4] < +∞ for each k = 0, 1, 2}.
We will see below that as long as the iterates remain in Λ_4, the algorithm converges P-a.s. to a unique stationary point. Hence, if Λ_4 = R^d, then the algorithm is clearly unconstrained. It is, however, difficult to check whether or not that is the case, since the set Λ_4 depends on the functional F. Meanwhile, for the algorithm to be unconstrained, we need at least Λ_4 ⊆ Λ_1 = R^d. In this sense, let us give a rough illustration of the situation in the following.
Lemma 3.1. If the Lévy measure ν has a compact support, then Λ_1 = R^d.

Constrained algorithms
The following proves the almost sure convergence for the constrained algorithms; their gradient-based structure simplifies the argument. Moreover, V(λ*) ≤ V(0).
Proof. First, note that E_P[e^{−⟨λ, X_T⟩} F(X)^2] < +∞ since E_P[e^{−2⟨λ, X_T⟩} F(X)^4] < +∞, by the Hölder inequality. Hence, Λ_4 ⊆ Λ_3. The convexity of Λ_4 can be proved in a similar manner to the proof of Lemma 2.2. Moreover, expanding E_P[‖Y_{n+1}‖^2] and noting that ‖∇ϕ(λ)‖ < +∞ and ϕ(λ) < +∞ for λ ∈ H, the expectation is finite if and only if E_P[‖X_T‖^k e^{−2⟨λ, X_T⟩} F(X)^4] < +∞ for each k = 0, 1, 2. This proves sup_{n∈N} E_P[‖Y_n‖^2] < +∞. Since Λ_4 is convex, by Theorem 2.1 (p. 127) of Kushner and Yin [11], the sequence {λ_n}_{n∈N_0} converges P-a.s. to a unique stationary point in H. The last assertion follows from the strict convexity of V on H.
Remark 3.4. We may modify the set Λ_4 so that it looks more intuitive, as in Remark 3.3. If F(X) = f(X_T) with f submultiplicative, then Λ_4 can be rewritten by Theorem 2.1. Otherwise, the Hölder inequality yields a convex subset of Λ_4.

Unconstrained algorithms
The following result asserts the existence of a unique point λ* such that ∇V(λ*) = 0. Recall again that in the constrained algorithm case, the limiting stationary point does not necessarily attain the root of the gradient.
Proposition 3.5.If Λ 4 = R d and if there exists c > 0 such that then there exists a unique λ * ∈ R d such that ∇V(λ * ) = 0.
Remark 3.6. Most applications use Lévy processes with independent components, each possessing small jumps in both positive and negative directions. In such cases, the Lévy measure is supported on all the axes of R^d, that is, it is built from Lévy measures {ν_k}_{k=1,...,d} on R_0. In Example 4.2 below, we discretize a Lévy process with positive and negative jumps on a finite time horizon into a few independent increments, and thus the condition (3.3) holds true.
Proof. By Proposition 2.3 with Λ_3 ⊇ Λ_4 = R^d, it suffices to show that lim_{‖λ‖↑+∞} V(λ) = +∞. First, note that ϕ_T(λ) admits a decomposition with a suitable splitting of the Lévy integral. The first component of the resulting right hand side is bounded from below by −ν({z ∈ R^d_0 : ‖z‖ > 1}), and therefore we get a lower bound on V(λ) which explodes as ‖λ‖ ↑ +∞. This proves the assertion.

The unconstrained algorithms often show rough numerical behavior, and unfortunately, that is the case in our setting. This phenomenon is mainly due to the extremely fast growth of E_P[‖∇ϕ_T(λ) − X_T‖^2 e^{−2⟨λ, X_T⟩ + 2ϕ_T(λ)} F(X)^4] with respect to ‖λ‖. Alternatively, Chen, Guo, and Gao [4] propose a projection procedure. In essence, by forcing the iterates to stay in an increasing sequence of compact sets, the procedure avoids the explosion of the algorithm during the early stage. Meanwhile, we adapt the results of Chen and Zhu [3] and Delyon [5]. Let {H_n}_{n∈N_0} be an increasing sequence of compact sets such that ∪_{n∈N_0} H_n = R^d, and modify the algorithm (3.1) so that the projection at step n is onto H_{σ(n)}, where σ(n) counts the number of projections up to the n-th step.
Proof. Let m ∈ N and define, for n ∈ N_0, the stopped martingale sums M_n. By Proposition 3.5 and the results in [3, 5], we only need to show that for each m ∈ N, {M_n}_{n∈N_0} converges P-a.s. Since the sequence {M_n}_{n∈N_0} is a martingale with respect to the filtration (G_n)_{n∈N_0}, it suffices to show that {M_n}_{n∈N_0} is an L^2-martingale. To this end, for each m ∈ N, we will show that the relevant second moments are bounded, P-a.s. We begin by proving that for each m ∈ N, the four quantities C_1(m), . . ., C_4(m) are well defined. Clearly, C_1(m) is finite since Λ_1 = R^d and ν({z ∈ R^d_0 : ‖z‖ > 1}) < +∞, while the finiteness of C_2(m) follows from e^{⟨λ, z⟩} − 1 − ⟨λ, z⟩ ∼ ⟨λ, z⟩^2/2 ≤ ‖λ‖^2 ‖z‖^2/2 as ‖z‖ ↓ 0. For C_3(m), the Hölder inequality gives the assertion, that is, with 1/p + 1/q = 1 for some p > 1, a bound in terms of ∫ e^{q⟨λ, z⟩} ν(dz). Let us now proceed to the main part of the proof. First, as previously, both the first and the second integrals of the relevant decomposition are well defined due to the finiteness of C_1(m) and C_2(m), respectively. Next, the passages to the gradient operator are verified by the finiteness of C_3(m) and C_4(m). In total, we get for each m ∈ N a bound on the second moments of {M_n}_{n∈N_0} which is finite P-a.s., since Λ_4 = R^d. The proof is complete.
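The expanding-truncation idea behind the modified algorithm can be sketched as follows (a toy one-dimensional illustration of ours with a hypothetical quadratic target, not the paper's V): the iterate is kept in a growing family of compact sets H_m, and whenever a raw update escapes the current set, the recursion restarts in a larger one.

```python
import math, random

random.seed(2)

def expanding_truncation(grad_sample, n_iter=5000, alpha=0.5):
    """Expanding-truncation stochastic approximation (in the spirit of
    Chen-Zhu): project onto an increasing family H_m = [-(m+1), m+1];
    a candidate leaving the current set triggers a restart at the origin
    together with an enlargement of the set."""
    lam, sigma = 0.0, 0           # sigma counts the number of truncations
    for n in range(n_iter):
        cand = lam - alpha / (n + 1) * grad_sample(lam)
        bound = sigma + 1.0       # radius of the current compact set H_sigma
        if abs(cand) > bound:
            lam, sigma = 0.0, sigma + 1   # truncate: restart in a larger set
        else:
            lam = cand
    return lam

# Toy target V(lam) = (lam - 1)^2 observed through noisy gradient samples.
g = lambda lam: 2.0 * (lam - 1.0) + random.gauss(0.0, 1.0)
lam_hat = expanding_truncation(g)
print(lam_hat)   # settles near the minimizer 1.0
```

Because the early steps are large, the first few candidates may overshoot the initial set; the restarts are harmless, and once the set is wide enough the recursion behaves like an ordinary Robbins-Monro scheme.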

Numerical examples
In this section, we give two numerical examples, corresponding to the constrained and unconstrained stochastic approximation algorithms. We will evaluate the efficiency of the importance sampling variance reduction by the ratio of variances (vratio). Example 4.1. (Constrained algorithm) Let X := (X^1, . . ., X^5) be an infinitely divisible random vector with independent and identically distributed components under the probability measure P, where the common Lévy measure ν on R_0 for each component is of the Meixner type of Schoutens and Teugels [13], characterized by three parameters (a, b, d) with a > 0, b ∈ (−π, π), and d > 0. Conveniently, the probability density function p of X^1 is available in closed form. It is easy to identify Λ_1 explicitly and then, for λ ∈ Λ_1, to obtain ϕ in closed form. Since the exponential tail decay of the Lévy measure dominates each polynomial moment, we get, for λ ∈ Λ_1 and for each k ∈ N, ∫_{|z| > 1} |z|^k e^{λz} ν(dz) < +∞, or equivalently E_P[‖X‖^k e^{⟨λ, X⟩}] < +∞. This verifies the finiteness of every polynomial moment.
For a numerical example, we consider an Asian payoff. For the condition E_P[|F(X)|^{2p}] < +∞, it suffices that the corresponding exponential moments of each component are finite, and in view of the interval of q, we obtain Λ_2 as the fifth power of an interval. In a similar manner, we can prove Λ_3 = Λ_2 and identify Λ_4 likewise. We set the parameters for the Meixner distribution to (a, b, d) = (0.1, 0.0, 1.0), and thus the effective domain is approximately Λ_4 = (−13.707963, 13.707963)^5. The constraint set H must be compact, so it is safe to set H = [−13.70796, 13.70796]^5 ⊂ Λ_4. We generate N = 1e+5 Monte Carlo summands with the full help of the closed-form density function (4.1). With those runs, we perform the constrained algorithm (3.1) with ε_n = α/(n + 1) and λ_0 = 0. For the ATM case (K = 100), an OTM case (K = 125), and a deep OTM case (K = 150), we draw in Figure 1 the sequence {λ_n}_{n∈N_0} of iteratively generated exponential-tilting parameters in R^5 (left), and the sequence {‖∇V(λ_n)‖}_{n∈N_0} of absolute gradient levels (right), which is "desired" to achieve lim_{n↑+∞} ‖∇V(λ_n)‖ = 0, P-a.s. (As pointed out in Remark 2.4, it is not clear whether or not the constraint set H contains λ* such that ∇V(λ*) = 0.) Finally, each figure on the right presents the convergence of the Monte Carlo estimate E_P[F(X)] (MC) and that of the importance sampling Monte Carlo estimate E_{Q_{λ_N}}[(dP/dQ_{λ_N}) F(X)] (IS MC), where λ_N is the exponential-tilting parameter obtained after 1e+5 stochastic approximation iterations, while the three vertical lines indicate C := E_{Q_{λ_N}}[(dP/dQ_{λ_N}) F(X)], 0.99 C and 1.01 C.

Figure 1. The ATM (K = 100) case with α = 5e+0; the vratio is 5.063. {λ_n}_{n∈N_0} (left) and {‖∇V(λ_n)‖}_{n∈N_0} (middle), while E_P[F(X)] (MC) and E_{Q_{λ_N}}[(dP/dQ_{λ_N}) F(X)] (IS MC) (right).
The absolute gradient level tends to decrease as expected, and the resulting importance sampling succeeds in reducing the Monte Carlo variance. In one case, although the sequence {λ_n}_{n∈N_0} seems to have converged to a point in the interior of H, the absolute gradient level has not converged to zero. On the contrary, in the ATM (K = 100) and OTM (K = 125) cases, the absolute gradient level seems to have already converged to zero, while a component of {λ_n}_{n∈N_0} stays at the upper boundary (= +13.70796). These are delicate and unsolved matters in the constrained algorithms.
Example 4.2. (Unconstrained algorithm) Let X := (X^1, . . ., X^5) be an infinitely divisible random vector in R^5 with independent and identically distributed components, whose common Lévy measure ν on R_0 for each component is given by the standard Gaussian density function,

ν(dz) = (2π)^{−1/2} e^{−z^2/2} dz.

Evidently, for each λ ∈ R, ϕ(λ) = ∫_R (e^{λz} − 1) ν(dz) = e^{λ^2/2} − 1, so that Λ_1 = R.
Due to the compound Poisson structure, the random vector under the probability measure P can be generated as X^n = ∑_{k=1}^{N_n} W_{k,n}, n = 1, . . ., 5, where {N_n}_{n≤5} is a sequence of iid Poisson random variables with unit parameter and {W_{k,n}}_{k∈N, n≤5} is an iid standard Gaussian random array. Now, in view of (2.1), the Lévy measure under the probability measure Q_λ is given by

ν_{λ_n}(dz) = e^{λ_n z} ν(dz) = e^{λ_n^2/2} (2π)^{−1/2} e^{−(z − λ_n)^2/2} dz,

which is just like a drift shift of the Gaussian density by λ_n (up to the constant e^{λ_n^2/2}). Then, the random vector under the new probability measure Q_λ can be generated as X^n = ∑_{k=1}^{N_n} (W_{k,n} + λ_n), where {N_n}_{n≤5} is now a sequence of iid Poisson random variables with parameter e^{λ_n^2/2}, while the drift shift λ_n is further accelerated by a factor of e^{λ_n^2/2} on average.
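The two sampling schemes of this example can be sketched directly for a single component (a minimal Python illustration; the function names are ours): under P the component is a unit-rate compound Poisson sum of standard Gaussians, while under Q_λ the rate becomes e^{λ²/2} and the jumps are shifted by λ.

```python
import math, random

random.seed(3)

def poisson(mu):
    """Crude Poisson sampler via products of uniforms."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

def sample_P():
    # Under P: compound Poisson with unit rate and standard Gaussian jumps.
    N = poisson(1.0)
    return sum(random.gauss(0.0, 1.0) for _ in range(N))

def sample_Q(lam):
    # Under Q_lam: nu_lam(dz) = e^{lam z} nu(dz) = e^{lam^2/2} * phi(z - lam) dz,
    # i.e. Poisson rate e^{lam^2/2} and Gaussian jumps shifted by lam.
    N = poisson(math.exp(0.5 * lam ** 2))
    return sum(random.gauss(lam, 1.0) for _ in range(N))

# Sanity check of the tilted dynamics: E_{Q_lam}[X] = e^{lam^2/2} * lam
# (jump rate times mean jump size), while E_P[X] = 0 by symmetry.
lam, n = 0.8, 200000
mean_p = sum(sample_P() for _ in range(n)) / n
mean_q = sum(sample_Q(lam) for _ in range(n)) / n
print(mean_p, mean_q)
```

The empirical mean under Q_λ should sit near e^{0.32} · 0.8 ≈ 1.10, illustrating how the tilt both shifts and intensifies the jumps.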
As in Example 4.1, the algorithm reduces the absolute gradient level as expected, and the resulting importance sampling succeeds in reducing the Monte Carlo variance. Unlike in the constrained algorithm case, there exists a unique optimum λ* making the absolute gradient level zero, and it seems to be in fact nearly attained.

Concluding remarks
In this paper, we have studied stochastic approximation methods for finding the optimal measure change for Lévy processes in Monte Carlo importance sampling variance reduction. Our analysis is, on one hand, valid on the basis of the restriction to the exponential-tilting measure change, that is, limiting the density to a function of the terminal value X_T. It would thus be very interesting to extend to the intricate series representation setting of [8] by using characterizing parameters of the Lévy measure in the stochastic approximation procedure, which will be studied elsewhere. On the other hand, our method should be applicable to a variety of applications in the sense that its principle is not specific to the structure of the Monte Carlo estimator itself. Finally, the extension of our framework to combined importance sampling and control variates is investigated, using a two-time-scale version of the stochastic approximation algorithm, in subsequent papers [9, 10].

Figure 2. {λ_n}_{n∈N_0} (left) and {‖∇V(λ_n)‖}_{n∈N_0} (middle), while E_P[F(X)] (MC) and E_{Q_{λ_N}}[(dP/dQ_{λ_N}) F(X)] (IS MC) (right).