Multistart global optimization with tunnelling and an evolutionary strategy supervised by a martingale

ABSTRACT Global optimization is one of the most difficult research topics in the optimization area. In the past few decades, many stochastic methods have been proposed and proved to be effective. In this article, a new multistart global optimization method using tunnelling and an evolutionary strategy and supervised by a martingale (MGO-TEM) is proposed, which also belongs to the stochastic family. MGO-TEM operates under a multistart framework in which a local search method is called whenever a new sample point is generated. A sampling technique that is a hybrid of tunnelling and an evolutionary strategy is proposed in MGO-TEM, together with a set of stopping criteria whose derivation is based on martingale theory. Through experiments, MGO-TEM is compared with 10 other well-known algorithms on 55 single-objective test functions, and the results show that MGO-TEM is superior to all of them. Moreover, MGO-TEM is user-friendly since there are only two parameters to be tuned.


Introduction
With the flourishing of artificial intelligence nowadays, there is a higher demand than ever on the performance of global optimization algorithms (Chow et al. 2002; Bengio 2009; Cassioli et al. 2012). Non-convex optimization problems arise quite often in every layer of artificial intelligence applications, as well as in many other engineering applications such as multidisciplinary design optimization (Bottou 2010; Mu Li et al. 2014). However, the global optimization of arbitrary real continuous functions is still a difficult task, and research on this topic has been conducted by many researchers for decades. Many different methods have been proposed and proved to be effective.
Given that no characteristics exist that can theoretically differentiate a global minimum from a local minimum, deterministic approaches are hard to develop. One existing deterministic method, αBB, was proposed by Floudas (2005); it is non-heuristic and guarantees ε-suboptimality, but the convexifying procedure for various objective and constraint functions is demanding and the algorithm is often very slow. Other efficient deterministic methods, such as DIRECT and pattern search (PS), were also developed and have been applied to engineering problems (Suresh Kumar, Nagesh Kumar, and Madichetty 2017). However, those methods do not have guaranteed optimality in general. Contrary to deterministic approaches, many methods obtain the global minimum in a stochastic manner, such as genetic algorithms (GAs), simulated annealing (SA), differential evolution (DE), the covariance matrix adaptation evolution strategy (CMA-ES), etc. (Holland 1992; Bäck 1996; Storn and Price 1997; Hansen and Ostermeier 2001; Yang 2014). Typically, those methods are heuristic and require parameter tuning. As a result, the optimality of the obtained global (or even local) minima from those methods cannot be guaranteed. Compared with deterministic methods, stochastic ones are mainly used for high-dimensional problems, noisy or disturbed objective functions, or when the gradient information of objective functions is not available. Therefore, research on stochastic methods remains in the mainstream of global optimization. To improve the performance of stochastic methods, there are hybrid methods that first apply a stochastic method, followed by a local method, such as those described in Kelner et al. (2008), Mahinthakumar and Sayeed (2005) and Alsumait, Sykulski, and Al-Othman (2010). However, those articles focus on specific applications, and therefore the global convergence properties of the hybrid methods are not theoretically analysed.
In general, stochastic methods usually have heuristic structures that evolve iteration by iteration. By sampling the feasible domain, a better minimum is expected to be found. For example, GAs, CMA-ES, DE and particle swarm optimization (PSO) are population-based algorithms that generate new sample points via operations such as mutation and crossover among the population (Cortez 2014; Wang, Tan, and Liu 2018; Sengupta, Basak, and Peters 2019). To this end, the population size, mutation and crossover rates, possibly with other parameters, need to be tuned. CMA-ES and Bayesian optimization (BOA) assume a jointly Gaussian correlation among all the sample points and require tuning the kernel and acquisition functions (Brochu, Cora, and de Freitas 2010). Different heuristics and parameters for SA, global search (GS) and multistart (MS) can be found in Ugray et al. (2007). It should be noted that only the major parameters of the aforementioned stochastic methods are discussed. In reality, once the parameters are fine tuned, enabling the method to adjust itself smoothly to the landscape of the function, its performance will be good. Nevertheless, those parameters are usually empirical, making the tuning task very difficult. To summarize, stochastic methods have effective sampling techniques but their performance depends on parameter tuning. There exists, nevertheless, research on guaranteed convergence to local optimal solutions, such as Morinaga and Akimoto (2019); however, only the class of strongly convex functions with Lipschitz continuous gradients was considered in that article. Other research on guaranteed convergence to local optimal solutions can be found in Rudolph, Informatik, and Xi (1997) and Jägersküpper (2006, 2007).
In the stochastic family, memetic algorithms are a form of population-based hybrid GA coupled with an individual learning procedure capable of performing local refinements; first introduced in Moscato (2000), they have become one of the growing research topics in evolutionary computation. A broad review of memetic algorithms is presented in Neri and Cotta (2012), where the premature convergence issue is addressed. Theoretical analysis of the global convergence of a hybrid method is therefore important and can be achieved if two aspects, namely the sampling strategy and the stopping criteria, are well designed. In this article, major efforts are devoted to these two aspects, especially the second one. As a result, a new global optimization method called multistart global optimization with tunnelling and an evolutionary strategy supervised by a martingale (MGO-TEM) is proposed, which belongs to the stochastic family. In this method, sample points are generated using a proposed hybrid sampling technique, and a local search method is initiated at each sample point. Rigorous analysis is conducted to provide well-designed stopping criteria that ensure optimality. Martingale theory (Chung 2001; Steele 2001) is used to derive the stopping criteria, and therefore the optimization process is supervised by a martingale. In MGO-TEM, two straightforward parameters (representing two probabilities) are left for the users to tune. Since these two parameters have clear physical meanings, the tuning task is straightforward and simple. In the literature, global optimization using the concept of a martingale was mentioned in Lockett and Miikkulainen (2014); however, merely a concept was discussed and no theoretical results were derived using martingale theory. In this article, the proposed MGO-TEM method is verified on 55 single-objective test functions and compared with another 10 well-known methods.
The results obtained for the test functions are outstanding, which demonstrates the applicability and capability of MGO-TEM.
The rest of the article is organized as follows. Details of MGO-TEM are presented in Section 2. The stopping criteria are proposed in this section together with the sampling technique. In the same section, a test series is also provided and the pseudocodes are provided in the online supplemental data. In Section 3, experiments are carried out on 55 single-objective test functions. The performances of the 11 global optimization methods are compared. In Section 4, mathematical derivations of all the stopping criteria are presented. Conclusions are given in Section 5.

Proposed MGO-TEM
The standard optimization problem to be discussed in this work is given as

min_{x ∈ Q} f(x),

where f : R^n → [0, 1] is a continuous function, and the domain Q ⊆ R^n is closed and bounded. Note that neither f nor Q is necessarily convex. For functions with a range that is not contained in [0, 1], the 'normalization up to iteration t' technique can be found on the first page of the online supplemental data. Assume there are N local minima, denoted by f_i, i = 1, . . . , N. Then, at the tth iteration, the local minimum f*_t found by the local method will be one of them, i.e. f*_t ∈ {f_1, . . . , f_N}.
If the associated local minimizer of f*_t is denoted by x*_t, then f(x*_t) = f*_t; hence, it is possible that several minimizers have the same function value.
Furthermore, w_t is used to denote the transformed running minimum at the tth iteration, i.e. w_t := exp(−min_{s≤t} f*_s).
Therefore, the sequence (w_t) is non-decreasing and takes values in [e^{−1}, 1]. For the purpose of global minimization, how to generate sample points and how to terminate the searching process are two fundamental and important aspects. MGO-TEM fully considers these two aspects and provides novel sampling and terminating mechanisms. The contributions of MGO-TEM are as follows:
• by using martingale theory to analyse the trends of the obtained local minima and the running minima, well-designed stopping criteria and a well-designed sampling technique are proposed;
• compared with other stochastic methods, the sampling technique is more effective and the proposed stopping criteria enhance the efficiency and probability of obtaining the global minima upon termination.
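As a concrete illustration, the running-minimum transform w_t = exp(−min_{s≤t} f*_s) and its stated properties (non-decreasing, contained in [e^{−1}, 1]) can be checked with a minimal Python sketch; the function name is ours, not part of MGO-TEM:

```python
import math

def transformed_running_minimum(local_minima):
    """Given the local minima f*_1, ..., f*_t (each in [0, 1]),
    return the sequence w_t = exp(-min_{s<=t} f*_s)."""
    w, best = [], float("inf")
    for f in local_minima:
        best = min(best, f)          # running minimum of the f*_s's
        w.append(math.exp(-best))    # transformed running minimum
    return w

# The sequence (w_t) is non-decreasing and stays in [e^-1, 1].
w = transformed_running_minimum([0.9, 0.4, 0.7, 0.1, 0.1])
assert all(a <= b for a, b in zip(w, w[1:]))
assert all(math.exp(-1) <= v <= 1.0 for v in w)
```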
Overall, MGO-TEM consists of:
• a deterministic local method;
• a proposed sampling technique that generates initial points for the local method;
• a proposed set of stopping criteria.
Since a local method is called during the optimization process, multiple local minima will be obtained. By using martingale theory to analyse the obtained local minima, the sampling technique as well as the set of stopping criteria of MGO-TEM are proposed. Next, the sampling technique used in MGO-TEM is introduced. Then, four stopping criteria are presented and the physical meaning of each one will be discussed. To aid readability of this section, mathematical derivations of these four stopping criteria are deferred to Section 4.

The sampling technique
To proceed with the multistart process before its termination, initial points for the local method should be spread in the feasible domain, which demands a sampling technique. Generally speaking, it enhances efficiency to use a good sampling technique that balances the performance of exploration against that of exploitation. In the following, a hybrid sampling technique that is a combination of a martingale property (exploitation) and an existing evolutionary strategy (exploration) is presented.
Assuming that the multistart process has proceeded to the tth iteration, a design vector CurrentBest x_CB ∈ R^n is defined as x_CB := argmin_{x ∈ {x*_1, . . . , x*_t}} f(x), which is the best design vector found so far during the multistart process. This x_CB will be updated in each iteration.

The tunnelling process
The martingale, due to its special property, has been used to aid sampling such as in Qiu, Zhou, and Wu (2008). In MGO-TEM, the martingale property is utilized to design a tunnelling process based on the discussions in Section 4.1. Specifically, at the tth iteration (t ≥ 2), a sub-domain Q_sub is created as follows. When the local minimizer x*_t and its neighbouring local minimizer x*_nb satisfy Equation (1), a hypercube with x_CB and argmin_{x ∈ {x*_t, x*_nb}} f(x) being two vertices on the main diagonal is created. This hypercube is truncated when necessary to ensure Q_sub ⊆ Q. Then, a point is randomly spread in Q_sub and the local method, with that point as its initial point, runs to obtain a new local minimizer x*_{t+1}. This step is called Tunnelling, and Tunnelling is regarded as an operator from now on. Further, x*_{t+1} is called a one-level-deeper design vector compared with x_CB. Thus, the tunnelling step can be performed iteratively if x*_{t+1} and its neighbouring local minimizer also satisfy Equation (1). To avoid tunnelling too deeply, the maximum level of tunnelling is set to be √n, where n is the dimension of the problem. The pseudocode for the iterative tunnelling process is provided as Algorithm 3 in the online supplemental data.
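A minimal sketch of the Q_sub construction and the random spread inside it, assuming a box-constrained domain Q = [lb, ub]; the helper names are ours, and the full iterative procedure is Algorithm 3 in the supplemental data:

```python
import random

def tunnelling_subdomain(x_cb, x_better, lb, ub):
    """Build Q_sub: the axis-aligned hypercube with x_cb and x_better as two
    vertices on the main diagonal, truncated so that Q_sub stays inside Q."""
    lo = [max(min(a, b), l) for a, b, l in zip(x_cb, x_better, lb)]
    hi = [min(max(a, b), u) for a, b, u in zip(x_cb, x_better, ub)]
    return lo, hi

def sample_in_subdomain(lo, hi):
    """Spread one point uniformly at random in Q_sub; it then serves as the
    initial point of the local method."""
    return [random.uniform(l, h) for l, h in zip(lo, hi)]

lo, hi = tunnelling_subdomain([0.2, 0.8], [0.6, 0.3], lb=[0.0, 0.0], ub=[1.0, 1.0])
x_new = sample_in_subdomain(lo, hi)
assert all(l <= x <= h for x, l, h in zip(x_new, lo, hi))
```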
The purpose of the Tunnelling process is to improve the exploitation ability of the sampling process. Therefore, the tunnelling process may ignore certain good solutions in the searching space and that is why the tunnelling process needs to work together with the evolutionary strategy to achieve a balanced performance between exploration and exploitation. Please refer to the pseudocodes of Algorithm 1, Algorithm 2 and Algorithm 3 in the online supplemental data to see how they work together. The following section presents the technical details of the evolutionary sampling strategy.

Evolutionary sampling
The evolutionary sampling strategy is much like that in the DE method. The details of the strategy can be found in Ali and Törn (2004), Storn and Price (1997) and Ronkkonen, Kukkonen, and Price (2005), and it has been proved effective in many applications (Gao et al. 2019; Stanovov, Akhmedova, and Semenkin 2019; Ye et al. 2019). The only difference is the presence of the x_CB vector, which acts as the interface between the evolutionary sampling strategy and the tunnelling process. The terminology of the DE method will be used directly in what follows. The population size is fixed at 10n, i.e. Pop = {x_1, x_2, . . . , x_10n}, and the crossover probability is chosen to be 0.2. Then, the usual mutation and crossover in DE are carried out on Pop. Denote the new sample point generated from this evolutionary sampling strategy as x_s; both x_s and x_CB are updated if certain conditions are satisfied. To summarize, x_s is initialized to x_1 at the beginning, and then the following steps are performed for each x_j ∈ Pop: mutation produces a vector y; each entry y_i of y is restricted to the feasible domain by setting y_i = min(max(y_i, y̲_i), ȳ_i), where ȳ_i and y̲_i represent the upper and lower bounds of y_i, respectively; crossover is carried out by triggering one of two events, where event E1 happens with probability 0.8 and event E2 happens with probability 0.2, and the new extremum seed z collects the z_i's as its entries. These steps summarize how the evolutionary sampling strategy works as part of the proposed sampling technique. After obtaining x_s, the local method is conducted with x_s being the initial point. Hence, the entire sampling technique is a hybrid of the tunnelling process and the evolutionary strategy, as declared at the beginning of this subsection. The pseudocode for the evolutionary sampling strategy is provided as Algorithm 2 in the online supplemental data.
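A simplified sketch of one DE-style mutation-and-crossover step under the stated population size 10n and crossover probability 0.2. The scale factor F = 0.5 and the interpretation of E1/E2 (keep the parent entry versus take the mutated entry) are assumptions, and the usual DE safeguard of forcing at least one mutated entry is omitted for brevity:

```python
import random

def evolve_one(pop, j, lb, ub, F=0.5, cr=0.2):
    """Simplified DE-style step for population member j: mutate three distinct
    other members, clamp each entry to the bounds of Q, then apply binomial
    crossover with rate cr."""
    n = len(pop[j])
    a, b, c = random.sample([p for i, p in enumerate(pop) if i != j], 3)
    y = [a[i] + F * (b[i] - c[i]) for i in range(n)]       # mutation
    y = [min(max(y[i], lb[i]), ub[i]) for i in range(n)]   # restrict to the feasible domain
    # crossover: event E1 (keep the parent entry) with probability 0.8,
    # event E2 (take the mutated entry) with probability cr = 0.2
    return [y[i] if random.random() < cr else pop[j][i] for i in range(n)]

n = 2
lb, ub = [0.0] * n, [1.0] * n
pop = [[random.uniform(0.0, 1.0) for _ in range(n)] for _ in range(10 * n)]  # |Pop| = 10n
z = evolve_one(pop, 0, lb, ub)
assert len(z) == n and all(lb[i] <= z[i] <= ub[i] for i in range(n))
```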

Four stopping criteria
Martingale theory studies the trends of stochastic processes. By using it to analyse the trends of the f*_t's and w_t's, evidence on whether the global minimum has already been found can be obtained; when the evidence is strong enough, the searching process can be terminated. Therefore, in this subsection, four stopping criteria used in MGO-TEM are proposed using the statistics of the f*_t's and w_t's based on martingale theory, and they are used to decide whether the searching process should continue. The combined use of the stopping criteria enhances the probability of obtaining the global minima upon termination. Their analytical derivations are provided in Section 4.
Stopping Criterion 1. Based on Proposition 4.1 (which guarantees that the probability that f*_t has reached the ε-neighbourhood of E[f*_t] is larger than 1 − ε), the following stopping criterion is proposed:

|(1/t) Σ_{s=1}^t f*_s − (1/t) Σ_{r=1}^t (1/r) Σ_{s=1}^r f*_s| ≤ ε²,   (S1)

where ε ∈ (0, 1) is a user-specified value and t, r are the running indexes of f*_t. Within the absolute value sign is the difference between the sample mean and the mean of the temporal means. By the strong law of large numbers, these two mean processes both converge to the true mean, which is unique; hence they converge to each other. However, the sample mean process is biased when the number of samples is small, which is why this stopping criterion alone is not enough.
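Criterion (S1) can be evaluated directly from the history of local minima. The sketch below assumes the reconstructed form |sample mean − mean of temporal means| ≤ ε², matching the description above:

```python
def s1_satisfied(f_hist, eps):
    """(S1): |sample mean - mean of temporal means| <= eps^2,
    computed over the local minima f*_1, ..., f*_t found so far."""
    t = len(f_hist)
    sample_mean = sum(f_hist) / t
    temporal_means = [sum(f_hist[:r]) / r for r in range(1, t + 1)]
    mean_of_means = sum(temporal_means) / t
    return abs(sample_mean - mean_of_means) <= eps ** 2

# A flat history makes the two mean processes coincide, so (S1) fires;
# a history still trending downward keeps them apart.
assert s1_satisfied([0.3] * 20, eps=0.1)
assert not s1_satisfied([1.0, 0.8, 0.6, 0.4, 0.2], eps=0.1)
```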
Stopping Criterion 2. Based on Proposition 4.3 (which guarantees that the probability of finding a better local minimum is lower than δ), the following stopping criterion is proposed:

1 − (1/(t w_{t−1})) Σ_{s=1}^t w_s ≤ δ,   (S2)

where δ ∈ (0, 1) is another user-specified value. Since w_t is monotonically non-decreasing, the quotient gradually approaches one as t increases. Hence, this stopping criterion supervises the maturity of the running minimum process. That is, when the running minimum is mature enough (i.e. w_t no longer changes), the difference on the left-hand side should be small.
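Criterion (S2) depends only on the running-minimum history. A sketch assuming the form 1 − (1/(t w_{t−1})) Σ_{s≤t} w_s ≤ δ derived in Section 4.4:

```python
import math

def s2_satisfied(w_hist, delta):
    """(S2): 1 - (1/(t * w_{t-1})) * sum_{s<=t} w_s <= delta,
    where w_hist holds the running minima w_1, ..., w_t (t >= 2)."""
    t = len(w_hist)
    return 1.0 - sum(w_hist) / (t * w_hist[-2]) <= delta

# When w_t stopped changing long ago, the quotient is one and (S2) fires.
mature = [math.exp(-0.1)] * 30
assert s2_satisfied(mature, delta=0.05)
```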
Stopping Criterion 3. Based on Proposition 4.5 (which controls the quality of the obtained local minimum), the following stopping criterion is proposed:

exp(−f*_t) ≥ w_t − w_1 + e^{−1}.   (S3)

Different from the first two stopping criteria, this one is memoryless; it controls the quality of the newly obtained local minimum, since the right-hand side is monotonically non-decreasing. When performing global optimization using MGO-TEM, the simultaneous satisfaction of (S1)-(S3) should be required before termination. However, these three stopping criteria are stochastic, which may require a huge amount of computational resources. Hence, the following deterministic stopping criteria are also proposed.
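Criterion (S3) is a one-line check on the newest local minimum, using the form exp(−f*_t) ≥ w_t − w_1 + e^{−1} given in Section 4; the example history values are ours:

```python
import math

def s3_satisfied(f_star_t, w_t, w_1):
    """(S3): exp(-f*_t) >= w_t - w_1 + e^-1 -- a memoryless check on the
    quality of the newest local minimum against the running-minimum spread."""
    return math.exp(-f_star_t) >= w_t - w_1 + math.exp(-1)

# With an example history, a near-global minimum passes and a poor one fails.
w_1, w_t = math.exp(-0.9), math.exp(-0.05)
assert s3_satisfied(0.06, w_t, w_1)
assert not s3_satisfied(0.8, w_t, w_1)
```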
Stopping Criterion 4. In addition to the above three stopping criteria, the following deterministic stopping criteria, (S4a) and (S4b), based on the discussions in Section 4.5 are proposed, where R_a, R_b are the minimal numbers of consecutive w_t's that do not change during the optimization history. In practice, MGO-TEM terminates when either stopping criterion (S4b) is satisfied or stopping criteria (S1)-(S3) together with stopping criterion (S4a) are satisfied. The simultaneous satisfaction of stopping criteria (S1)-(S3) and stopping criterion (S4a) avoids premature termination but can be time consuming in reality; therefore, stopping criterion (S4b) can be applied when necessary. This set of stopping criteria provides a comprehensive way to terminate the searching process. Compared with the usual stopping criteria, such as a maximum number of function evaluations or the vanishing of the step size or gradient, the proposed stopping criteria enhance the efficiency of the searching procedure and the probability of obtaining the global minima upon termination.

A test series
With the pseudocodes provided in the online supplemental data, MGO-TEM is implemented in MATLAB 2017b. The Whitley 5D test function is solved by MGO-TEM with parameters ε = 0.3 and δ = 0.06. The optimization histories of the test series are shown in Figure 1 and Figure S1 in the online supplemental data. In both figures there are 12 subfigures, of which the first 10 form 5 pairs. In each pair, the upper subfigure shows the process of one stopping criterion and the lower one shows the logic state indicating whether that stopping criterion is fired in each iteration. When the process enters the stopping zone (the unshaded area in each subfigure), the corresponding stopping criterion is fired, and the logic state is shown in the bar chart. The eleventh subfigure shows the logic state of whether MGO-TEM should terminate, and the bottom subfigure shows the optimization history of the local minima found and the running minimum. There are several interesting observations from these two figures:
• the optimization history in Figure 1 terminates owing to the activation of stopping criteria (S1), (S2), (S3) and (S4a);
• the optimization history in Figure S1 in the online supplemental data terminates owing to the activation of stopping criterion (S4b);
• the stopping bound of stopping criterion (S3) varies once a better local minimum is found, which adaptively controls the quality of the local minimum obtained;
• both optimization histories find high-quality local minima (i.e. close to zero) upon termination;
• the 2nd, 6th, 15th, 17th, 18th and 25th samples in Figure 1 show that stopping criterion (S3) controls the quality, since it is not fired when the local minima found are of low quality;
• stopping criterion (S3) in Figure S1 in the online supplemental data is difficult to satisfy, and thus MGO-TEM terminates owing to stopping criterion (S4b).
This shows that MGO-TEM can avoid slow convergence, which enhances its computational efficiency.

Experiments
The 55 single-objective test functions are listed in Table 1 and their mathematical definitions are given in the last section of the online supplemental data. To evaluate the performance of each algorithm on each test function, the performance measures Δx, Δf and Δt proposed in Serani et al. (2016) are used, where R_i is the range of coordinate values of the design variable along the ith dimension, x_i,min is the coordinate value of the minimizer found along the ith dimension while x*_i,min is that of the global minimizer, f_min is the local minimum found, and f*_min, f*_max are the global minimum and global maximum of f, respectively. Since the performance of the algorithms may vary from one test run to another, each algorithm is run 400 times on each test function and the median value is reported as representative of the performance of the algorithm on that test function (Vasile, Minisci, and Locatelli 2011). In total, 55 median values are reported for each algorithm. To ensure the results are of practical value, the maximum number of function evaluations for each test function is set to 500n. All algorithms are fine tuned according to the literature and testing experience. The parameter settings and the tuning method reference for each algorithm are summarized in the following list.
(1) For MGO-TEM, the two parameters are set as ε = 0.1 and δ = 0.005, and the active-set method is chosen as the local method.
(2) For GA, the population size and the number of generations are set to be 10 √ n and 50 √ n , respectively. The crossover fraction is 0.8 and the mutation is Gaussian with the variance being 0.05 (Lockett and Miikkulainen 2014).
(3) GS generates a bunch of trial points using the scatter search algorithm and decides whether to start a local method from each trial point. The number of trial points of GS is set to be 34n while the number of stage-one points is 3n, and the local method is set to be the active-set method (Ugray et al. 2007).
(4) The annealing function of SA is configured to be Boltzmann with the reanneal interval being 100 (Lockett and Miikkulainen 2014).
(6) MS starts a local method from uniformly spread seeds inside the feasible domain. The number of seeds is set to be 20n and the local method is the active-set method (Ugray et al. 2007).
(7) Pattern search looks for a minimum based on an adaptive mesh that, in the absence of linear constraints, is aligned with the coordinate directions. The mesh contraction factor is set to be 0.5 while the mesh expansion factor is 2. The poll method uses a maximal positive basis of 2n (Weiming Li et al. 2014; Suresh Kumar, Nagesh Kumar, and Madichetty 2017).
(8) The population size of CMA-ES is set to be (4 + 3 log(n)). In each generation, 50% of the population is used to update the joint normal distribution (Hansen and Ostermeier 2001; Lockett and Miikkulainen 2014). Furthermore, when the covariance matrix is not positive definite, CMA-ES will be restarted (Auger and Hansen 2005; Hansen 2009).
(9) The population size of DE is set to be 10n and the crossover rate is set to be 0.2 while the mutation range is set to be [0.2, 0.9], according to the suggestion in Price, Storn, and Lampinen (2014).
(10) The DIRECT method generates a dense set of points over the feasible set and, at each iteration, some 'promising' hyperintervals are selected and further partitioned. The number of iterations is set to be 50 (Liuzzi, Lucidi, and Piccialli 2010).
(11) As for BOA, it is usually used for hyperparameter tuning in machine learning tasks.
It builds a surrogate model for the objective and then uses an acquisition function defined from this surrogate model to decide where to sample. The acquisition function is chosen to be expected-improvement while the kernel function is the radial basis function (Frazier 2018).
Moreover, after each generation of GA, PSO, CMA-ES and DE, an additional local method is applied to obtain a local minimum according to the suggestions in Garg (2010) and Neri and Cotta (2012). The local method is the active-set method too.
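The three performance measures defined above can be sketched as follows. The explicit formulas are assumptions in the spirit of Serani et al. (2016) (root-mean-square normalized coordinate error, normalized objective gap, and their quadratic mean), since only the symbols are defined in the text:

```python
import math

def perf_measures(x_found, x_star, ranges, f_found, f_star_min, f_star_max):
    """Assumed forms of the measures: delta_x (normalized distance to the
    global minimizer), delta_f (normalized objective gap) and delta_t
    (an aggregate of the two)."""
    n = len(x_found)
    dx = math.sqrt(sum(((xi - xs) / r) ** 2
                       for xi, xs, r in zip(x_found, x_star, ranges)) / n)
    df = (f_found - f_star_min) / (f_star_max - f_star_min)
    dt = math.sqrt((dx ** 2 + df ** 2) / 2)
    return dx, df, dt

# An exact hit on the global minimizer scores zero in all three measures.
dx, df, dt = perf_measures([0.0, 0.0], [0.0, 0.0], [10.0, 10.0], 0.0, 0.0, 100.0)
assert dx == df == dt == 0.0
```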
After running each algorithm on each test function, the results of Δx, Δf and Δt are collected and the corresponding box plots are shown in Figure 2. From the box plots, it can be concluded that, statistically, MGO-TEM performs better than all the other 10 algorithms in terms of both Δx and Δf, and therefore has a better Δt. To compare the performance of the algorithms over the test functions, bar plots are presented in Figure 3 showing the distribution of the 'best' algorithm considering Δx, Δf and Δt. Here, an algorithm is the 'best' among all the algorithms on a test function if it has the lowest Δx, Δf and Δt. A higher bar indicates that the corresponding algorithm performs best on more test functions. Given that the performance of an algorithm usually depends on the dimension of the problem, Figure 3 presents results separately for low-dimensional problems (2D-3D), medium-dimensional ones (4D-10D) and high-dimensional ones (15D-50D). From the bar plots, it is clear that MGO-TEM is superior to all the other algorithms for low-dimensional and medium-dimensional problems. As for high-dimensional ones, MGO-TEM is still the best considering Δf and Δt, although GS and PS are competitive in terms of Δx. This is because MGO-TEM terminates at a local minimizer that can be away from the global one, while PS and GS terminate at points that may be closer to the global minimizer but have less competitive function values.

Derivations of the stopping criteria
The derivation of the stopping criteria relies on martingale theory, which analyses the evolution of stochastic processes. In MGO-TEM, the local minimum obtained by the local method at the tth iteration, X_t, is regarded as a random variable that takes values in {f_i, i = 1, 2, . . . , N}. Hence, the sequence X_t, t = 1, 2, 3, . . ., can be regarded as a stochastic process denoted by (X_t)_{t≥1}, or simply (X_t). Further, let W_t := exp(−min_{s≤t} X_s); the stochastic process (W_t) is the process of the running minimum. The following derivations are conducted based on (X_t) and (W_t).

The tunnelling process
During the searching process, given the feasible domain Q and that the objective function does not change, the X_t's are a sequence of independent and identically distributed (i.i.d.) random variables. For the purpose of improving exploitation in global optimization, new samples are spread in sub-domains that are likely to have smaller function values than the expectation of the local minima. With limited information about the function and its domain, suppose a sample is spread in a sub-domain Q_sub; define M_s := X_s − E[X_1] and Z_t := Σ_{s=1}^t M_s, where E[X_1|Q_sub] is the expectation of X_1 conditioned on the sub-domain Q_sub. By the construction of Z_t, if Q_sub satisfies E[X_1|Q_sub] < E[X_1], then E[M_s] < 0 for every s and therefore the conditional expectation of Z_t has a decreasing trend, which makes it a super-martingale. That is, if a new sample is spread in such a Q_sub, the obtained local minimum is expected to be better than the expectation of all the local minima.
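The super-martingale trend can be checked numerically. The sketch below assumes the reconstructed construction M_s := X_s − E[X_1] and Z_t = Σ_{s≤t} M_s, with every X_s drawn from a sub-domain whose conditional mean lies below E[X_1]; the distributions and means are illustrative, not from the article:

```python
import random

random.seed(0)

def z_path(T, mu_all=0.5, mu_sub=0.3):
    """One sample path of Z_t = sum_{s<=t} (X_s - E[X_1]) when every X_s is
    drawn from a sub-domain with conditional mean mu_sub < mu_all = E[X_1]."""
    z, path = 0.0, []
    for _ in range(T):
        x = random.uniform(mu_sub - 0.2, mu_sub + 0.2)  # E[X_s | Q_sub] = mu_sub
        z += x - mu_all                                  # add M_s = X_s - E[X_1]
        path.append(z)
    return path

# Averaged over many paths, E[Z_t] decreases in t (the super-martingale trend).
mean_z = [sum(col) / 200 for col in zip(*(z_path(50) for _ in range(200)))]
assert mean_z[-1] < mean_z[0]
```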
During the tunnelling process, two local minimizers are used to construct Q sub . Furthermore, temporal means are used to estimate E[X 1 ] and E[X 1 |Q sub ]. Then, whether to sample in Q sub is decided by comparing two estimated values, which corresponds to the procedure where new samples will be spread in Q sub if Equation (1) is satisfied.

The (X t ) process and its stopping criterion
The stopping criterion for the convergence of X t is first studied and designed.

Proposition 4.1: If X_t is a non-negative random variable with E[X_t] < ∞ that satisfies

E[|X_t − E[X_t]|] ≤ ε²,   (2)

then P[|X_t − E[X_t]| < ε] > 1 − ε.
Proof: By Markov's inequality, it directly follows that P[|X_t − E[X_t]| ≥ ε] ≤ E[|X_t − E[X_t]|]/ε ≤ ε. Hence, Equation (2) is a sufficient condition for the probability that X_t has reached the ε-neighbourhood of E[X_t] to be larger than 1 − ε.
Given that, in practice, the X_t's are i.i.d., the following condition is supervised:

|E[X_t] − C[X_t]| ≤ ε²,   (3)

where C[X_t] is the mean of the temporal means of X_t. This can be done as follows. By the strong law of large numbers, a good estimator for m_t := E[X_t] is the temporal mean of the observations, written as m̂_t = (1/t) Σ_{s=1}^t X_s. Similarly, by the strong law of large numbers, the estimator for c_t := C[X_t] can be given by ĉ_t = (1/t) Σ_{r=1}^t m̂_r = (1/t) Σ_{r=1}^t (1/r) Σ_{s=1}^r X_s. Substituting the above two estimators into Equation (3) yields |(1/t) Σ_{s=1}^t X_s − (1/t) Σ_{r=1}^t (1/r) Σ_{s=1}^r X_s| ≤ ε², which is exactly the stopping criterion (S1).

Theorem 4.2: If Y is an integrable non-negative random variable (i.e. Y ≥ 0 almost surely and E[Y] < ∞), then for any λ > 0, P[Y > λ] ≤ E[1_{Y>λ} Y]/λ, which is a variant of the Markov inequality.
Proof: Using the Markov inequality yields λ P[Y > λ] = E[λ 1_{Y>λ}] ≤ E[Y 1_{Y>λ}], which ends the proof.

Proposition 4.3: For the non-negative sub-martingale (W_t), any λ > 0 and δ ∈ (0, 1), if E[1_{W_t>λ} W_t]/λ ≤ δ, then P[W_t > λ] ≤ δ.
Proof: According to Theorem 4.2, it is true that P[W_t > λ] ≤ E[1_{W_t>λ} W_t]/λ ≤ δ. Hence, with λ taken as the current outcome of W_{t−1}, it guarantees that the probability of the monotonically non-decreasing sequence W_t not increasing any more is larger than 1 − δ.
According to Proposition 4.3, choosing λ = w_{t−1} (the outcome of W_{t−1}) yields P[W_t > w_{t−1}] ≤ E[1_{W_t>w_{t−1}} W_t]/w_{t−1}. For the purpose of application, the above probability is expected to be small, representing that the likelihood that W_t is still increasing is small. A sufficient condition for P[W_t > w_{t−1}] ≤ δ is E[1_{W_t>w_{t−1}} W_t]/w_{t−1} ≤ δ. Then, the quantity E[1_{W_t>w_{t−1}} W_t] needs to be estimated using the W_{t−1}'s. However, since the W_t's are not i.i.d., the temporal mean cannot be used. Instead, this value is estimated as E[1_{W_t>w_{t−1}} W_t] = η P[W_t > w_{t−1}], where η ∈ [w_{t−1}, 1] is a value representing the expected value of W_t over {W_t > w_{t−1}}.
Hence, the quantity ψ := η P[W_t > w_{t−1}] needs to be estimated based on the history of the W_t's. Such an estimator should be asymptotically unbiased, that is, E[ψ̂_t] − ψ → 0 as t increases to +∞. To this end, the estimator is chosen as ψ̂_t = W_{t−1} − (1/t) Σ_{s=1}^t W_s. The above estimator is asymptotically unbiased since E[ψ̂_t] − ψ tends to zero as t increases to +∞: the W_t's are non-decreasing and bounded above by one, and therefore W_s = 1 for all s ≥ N, where N is a large enough iteration number. Substituting the above estimator into the sufficient condition, the stopping criterion (S2) for the sub-martingale (W_t) is obtained as 1 − (1/(t w_{t−1})) Σ_{s=1}^t w_s ≤ δ.

Stopping criterion (S3) controls the quality
To control the quality of the obtained local minimum, the stopping criterion (S3) is proposed, which relies on the application of Doob's Upcrossing Lemma (Steele 2001). Hence, the quality of the newly obtained local minimum can be guaranteed by setting y = E[W_t] − E[W_1] and then observing whether the current exp(−X_t) exceeds y + e^{−1}. By the upcrossing lemma, such an X_t exists in expectation. Moreover, given that the temporal mean of the W_t's cannot be used to estimate E[W_t] or E[W_1], since they are not i.i.d., y = W_t − W_1 is used in MGO-TEM. Hence the stopping criterion (S3) is given as exp(−X_t) ≥ W_t − W_1 + e^{−1}. According to the experiments, this stopping criterion works very well in practice.
The choice of W_t − W_1 does not follow the upcrossing lemma exactly, but it is a proper choice when MGO-TEM runs sequentially. Such a choice is conservative, since it is the largest difference between values in the history of the W_t's, meaning that a high requirement is set on the quality of the obtained local minimum. Moreover, since W_t is determined by the best local minimum ever found, at least one local minimum satisfies the requirement. Last but not least, if MGO-TEM runs in a parallel computing scenario, then E[W_t] and E[W_1] can be effectively estimated, since multiple initial points are spread in the same iteration.

Complexity analysis of stopping criterion (S1) and stopping criterion (S2)
Given that the mean process of the temporal means of the $X_t$'s converges faster than the sample mean process, only the complexity of the sample mean process needs to be discussed. The behaviour of the sample mean process can be studied by analysing the difference of two arbitrary sample means. Denote the sample mean in the $(t+m)$th iteration by $S_{t+m}$ and the sample mean in the $(t+n)$th iteration by $S_{t+n}$. Since the sample mean process $S_t$ converges almost surely (by the strong law of large numbers), for every $\epsilon_2 > 0$ there exist large enough $m, n, L \in \mathbb{N}$ with $m > n > L$ such that $|S_{t+m} - S_{t+n}| < \epsilon_2$. To satisfy this inequality, it is required that $\epsilon_2 > (m-n)/(t+m)$. Rearranging the above gives
$$t > \frac{m-n}{\epsilon_2} - m$$
for all $m > n > t$ and $\epsilon_2 > 0$. This shows that the total number of times the local method is called should be at least $(m-n)/\epsilon_2 - m$ so that the sample mean process of $X_t$ converges. For stopping criterion (S2), since the sequence of outcomes of $(W_t)$ is monotone, its behaviour is studied by evaluating its distance from one. The following proposition is proposed.

Proposition 4.6: Given that $(X_t)$ is a positive stochastic process with $X_t \in [0, 1]$ for all $t$, the process $W_t := \exp(-\min_{s \le t} X_s)$ has a non-decreasing sample path $\tilde w_t := W_t(\omega)$ for fixed $\omega$, and $\tilde w_1/\tilde w_t \in [e^{-1}, 1]$ for all $t$.

Proof:
The sample path of $W_t$ is non-decreasing since $\min_{s \le t} X_s$ is non-increasing in $t$, which means $\tilde w_1/\tilde w_t \le 1$ for all $t$. Moreover, given that $\tilde w_t$ is non-decreasing and $\tilde w_t \in [e^{-1}, 1]$ for all $t$ according to Section 2, then $\tilde w_1/\tilde w_t \ge e^{-1}$, and the proposition is proved.
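Proposition 4.6 can also be checked empirically on a simulated sample path; this sketch (with arbitrary random data, not from the article) verifies both claims:

```python
import math
import random

# Draw a random path X_t in [0, 1] and form W_t = exp(-min_{s<=t} X_s).
random.seed(0)
X = [random.random() for _ in range(1000)]
w, best = [], float("inf")
for x in X:
    best = min(best, x)
    w.append(math.exp(-best))

# Claim 1: the sample path is non-decreasing.
assert all(a <= b for a, b in zip(w, w[1:]))
# Claim 2: w_1 / w_t lies in [e^{-1}, 1] for all t.
assert all(math.exp(-1) <= w[0] / wt <= 1.0 for wt in w)
print("Proposition 4.6 holds on this sample path")
```

Both bounds follow directly: $\tilde w_t \ge \tilde w_1 \ge e^{-1}$ and $\tilde w_t \le 1$, so the ratio cannot leave $[e^{-1}, 1]$.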

Stopping criteria (S4a) and (S4b)
Combining Equation (6) with Equation (5), it is sufficient to require, for all natural numbers $m > n > t$ and $l = 1, 2, 3, \ldots$, that the resulting combined inequality holds, so that both the convergence of the sample mean process of $X_t^+$ and stopping criterion (S2) are satisfied. For practical use, $m - n = 1$ and $l = 1$ are used, which yields the stopping criterion (S4b). The satisfaction of stopping criterion (S4b) terminates MGO-TEM. Additionally, the stopping criterion (S4a) requires that $m_t \ge (1 - e^{-1})/\delta$; this criterion should be used together with stopping criteria (S1), (S2) and (S3) to prevent possible premature stopping, since those three criteria are stochastic.
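As a small illustration of the (S4a) threshold (the symbol $m_t$ is from the text; the function name is hypothetical):

```python
import math

def s4a_satisfied(m_t, delta):
    """Stopping criterion (S4a): the iteration count m_t must reach
    (1 - e^{-1}) / delta before the stochastic criteria (S1)-(S3)
    are allowed to terminate the run, preventing premature stopping."""
    return m_t >= (1.0 - math.exp(-1)) / delta

# With delta = 0.05 the threshold is (1 - e^{-1}) / 0.05 ≈ 12.64,
# so at least 13 iterations are required before termination is allowed.
print(s4a_satisfied(12, 0.05), s4a_satisfied(13, 0.05))  # → False True
```

A smaller $\delta$ (a stricter (S2) tolerance) thus directly enforces a longer minimum run before MGO-TEM may stop.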

Conclusions
In this article, a multistart global optimization method with tunnelling and an evolutionary strategy supervised by a martingale (MGO-TEM) is proposed. The major contributions of MGO-TEM include a new sampling technique and a set of stopping criteria whose derivation is based on martingale theory. The satisfaction of these stopping criteria enhances both the efficiency and the probability of obtaining the global minimum upon termination. Two parameters representing two probabilities, $\delta$ being one of them, are left as user-specified parameters. Theoretical derivations and analysis are presented to prove the effectiveness of the proposed stopping criteria. The pseudocode of MGO-TEM is provided in the online supplemental data, and a test series is presented to show the optimization history of MGO-TEM, illustrating the status of each stopping criterion during optimization.
In the experiments, MGO-TEM is compared with 10 other very popular and well-known global optimization methods over 55 single-objective test functions. The results verify that MGO-TEM performs better than the others in terms of commonly used performance measures. There are two future research directions for MGO-TEM. First, an extension of MGO-TEM to multi-objective cases deserves further study; second, a variation of MGO-TEM suitable for a parallel computing scenario should be considered in the future.

Disclosure statement
No potential conflict of interest was reported by the author(s).