Parameter estimation for univariate Skew-Normal distribution based on the modified empirical characteristic function

Abstract Parameter estimation for the skew-normal distribution is challenging, since the profile likelihood function of shape parameter has a stationary point at zero, which hampers the use of traditional methods, such as maximum likelihood method. We present a modified empirical characteristic function method to perform parameter estimation for the skew-normal distribution. The proposed approach is flexible and easy to implement. We show that the estimators converge to the true values in probability. The simulation study and data analysis suggest that the proposed method performs well, even for the case of small sample size.


Introduction
The skew-normal (SN) distribution, which extends the standard one via adding an extra shape parameter, first appeared in Roberts (1966) and was named by Azzalini Azzalini (1985). Azzalini and Capitanio (1999) demonstrated that it is rather practical to model real data sets with skewness, since the SN distribution possesses prominent properties in term of mathematical tractability. The probability density function of a SN random variable X is f X ðx; aÞ ¼ 2/ðxÞUðaxÞ, x, a 2 R, where a is a shape parameter that controls the shape of the density function. Here /ðxÞ and UðxÞ denote the probability density function and the cumulative distribution function of the standard normal distribution, respectively. If a ¼ 0, the distribution becomes the standard normal distribution; The distribution is left skewed if a > 0, and is right skewed if a < 0: Apart from that, the larger the absolute value of a is, the larger the skewness is. When a goes to infinity, the SN distribution will converge to the half-normal (or folded normal) distribution. It is general to consider a linear transformation, Y ¼ xX þ n: Then the transformed variable Y follows a skew-normal distribution with a parameter set h ¼ ðn, x, aÞ, one may write Y $ SNðhÞ, and the probability density function of Y is f ðy; n, x, aÞ ¼ 2 x / y À n x U a y À n x , y, n 2 R, x > 0, where n and x are the location and scale parameters, respectively. However, in many cases, maximum likelihood (ML) inference on the skew-normal parameters poses some difficulties. Firstly, two of the elements of the gradient of the log-likelihood may become linearly dependent as a goes to zero. Secondly, the profile likelihood function of a has a stationary point at zero, which is independent of the observed sample. The two problems result in singular Fisher information matrix and inconsistent estimators of a. Researchers have developed many methodologies to solve the two problems. For example, Azzalini Azzalini (1985) proposed a parametrization method to avoid singular Fisher information matrix, which can also be seen in Chiogna (1998) and Pewsey (2000). Azzalini and Capitanio Azzalini and Capitanio (1999) stated that the second obstacle is challenging. Azzalini and Arellano (2013) proposed the maximum penalized likelihood estimation (MPLE) by introducing an appropriate penalty function where c 1 ¼ 0:875913 and c 2 ¼ 0:856250: The modified log-likelihood function is where lðhÞ is the log-likelihood function. This method works well only for large sample size. According to Azzalini and Arellano (2013), this method results in infinite estimators for shape parameter a for small to moderate sample sizes. Besides, the moment method (MM) can also be applied to estimate a by inverting the skewness equation based on sample skewness. The estimator of the transformation of a is where d ¼ a ffiffiffiffiffiffiffi ffi 1þa 2 p , andĉ is the sample skewness. The sign ofd is same as that ofĉ: Thus, the estimator of a isâ ¼d ffiffiffiffiffiffiffiffiffiffiffiffiffi Notwithstanding it seems a reasonable way to represent the shape parameter, the method of moments shows even worse results. Because it could not find a solution for Eq. (4) when the sample skewness is large and the transformation jdj ! 1: Alternatively, by settingd ¼ 1 in Eq. (3), the maximum theoretical skewness isĉ % 0:9952717: Thus, let be the sample skewness, where y and s are the mean and standard deviation, respectively. The estimators of the location parameter and scale parameter arê (3), we notice that the efficiency of the MM is affected by the sample skewness. When a sample is extremely skew, the estimator of a will deviate far from its true value, which results in lowering estimation efficiency of the MM. Furthermore, the estimation accuracy of the MM is not high for a small sample as well.
The multivariate skew-normal distribution was generalized by Azzalini and Dalla (1996) and Azzalini and Capitanio Azzalini and Capitanio (1999). Unlike multivariate normal, where the parameters estimates can be obtained using the sample mean and sample variance-covariance matrix, there are some obstacles we will meet when we estimate multivariate skew-normal parameters. Similar to the univariate version, the estimators of the MPLE for a multivariate skew-normal distribution is not robust as well. Although the probability density function is asymmetric, Azzalini, Genton, and Scarpa (2010) introduced estimating equations for a multivariate skew-normal case based on its invariance. However, it is quite problematic to estimate parameters when we actually implement this method for the multivariate skew-normal distribution because of high multiplicity of its solutions.
In this paper, we approach the problem of parameter estimation in a manner that is different from the aforementioned methods. We introduce the modified empirical characteristic function (MECF) method to estimate the model parameters. A major motivation for this work is that the empirical characteristic function (ECF) method is a conceptually simple, flexible, and effective parameter estimation technique for which the only major drawback was the estimator may be far from optimal, as its error terms eðt, hÞ cannot be justified to follow a normal distribution. Hence, we will modify the distance function by the idea of M-estimator, which in turn improves estimation efficiency in small sample cases, and minimize the modified distance function. As we will demonstrate, it is more robust to the initial values than the original ECF method. The proposed new estimators converge to the true values in probability as well. For multivariate data, the MECF gives a way to test symmetry under a multivariate skew-symmetry model as well as provides a new direction to estimate its parameters.
The rest of this paper is arranged as follows. Section 2 reviews the method of ECF and introduce the proposed method concretely. In Section 3, the simulation will appraise the performance of the proposed method under different sample sizes. We analyze a data set for illustration in Section 4. Finally, we present conclusions of the paper and show a brief exploration of the proposed method for estimate parameters for multivariate skew-normal distribution.

Review of the ECF method
The ECF method shares the same mechanism with the empirical moment generating function method, as it replaces the moment generating function by the characteristic function. It was previously developed by Feuerverger and Mureika (1977) and Heathcote (1977), and applied in parameters estimation by Tran (1998) because of two essential advantages of characteristic functions. First, characteristic functions are uniformly bounded, thus the resulting solution is numerically stable. Second, when moment generating functions of distributions does not exist, it is appropriate to adopt the characteristic function for heavy tail distribution.
The characteristic function of an arbitrary distribution with the probability density function f ðy; hÞ is written as where i is an imaginary unit and h represents the unknown parameter vector of the distribution. If {y 1 , y 2 , :::, y n g is a random set of sample from the distribution, the ECF is defined as We can separate the characteristic function into two parts and unite them as a vector Fðt, hÞ ¼ Re cðt 1 , hÞ ½ , :::, Re cðt m , hÞ ½ , Im cðt 1 , hÞ ½ , :::, Im cðt m , hÞ where t 1 , :::, t m are m fixed grid points, Re½cðt, hÞ and Im½cðt, hÞ denote the real part and imaginary part of cðt, hÞ, respectively. The counterpart of the ECF C n ðtÞ is where Re½C n ðt k Þ ¼ 1 n P n i¼1 cos t k y i and Im½C n ðt k Þ ¼ 1 n P n i¼1 sin t k y i , k ¼ 1, :::, m, are real and imaginary parts, respectively.
Referring to the results of Feuerverger and Mcdunnough (1981), we have is the variance-covariance matrix with the (p, q)th elements in each partition If we set eðt, hÞ ¼ Z n ðtÞ À Fðt, hÞ, this problem can be regarded as the weighted leastsquare, where Z n ðtÞ is assumed as a dependent variable and Fðt, hÞ is assumed as a right-side explanatory function. Thus, for a set of grid points t 1 , :::, t m , the optimal estimator of h,ĥ, will be finalized simply by minimizing eðt, hÞ 0 X À1 h eðt, hÞ: Note that X À1 h would dramatically increase computational complexity while we optimize Eq. (7) directly. Therefore, we consider to replace X h by its consistent estimator, X À1 n , whose form isX with the (p, q)th elements in each partition being Taking the results of Feuerverger and Mcdunnough (1981) and the Delta-method into account, we can obtain that the ECF estimators of h,ĥ, is consistent and asymptotically normal with asymptotic covariance where A ¼ @Fðt, hÞ @h : However, the efficiency of the ECF is momentously affected by the choice of ft 1 , :::, t m g as m approaches the number of unknown parameters. Schmidt (1982) has proved that the asymptotic efficiency of the EFC will generally improve as m increases, but will converge to a constant while m is large enough. In fact, Eq. (10) can be close to the Cramer-Rao lower bound arbitrarily when m is sufficiently large. Furthermore, based on the analysis of Feuerverger and Mcdunnough (1981) and Schmidt (1982), grid points t 1 , :::, t m must distinguish from each other. Given a fixed m, we adopt the equal interval, s, for the grid points, t i ¼ is, i ¼ 1, 2, :::, m: Specifically, s is an arbitrary real constant and is obtained by minimizing the size of the asymptotic covariance matrix Eq. (10). Thus, one way is to initialize s at the preliminary values of the unknown parameters and iterate them together, since the optimal s only bonds with the model parameters.

The modified empirical characteristic function
Assume that y 1 , y 2 , :::, y n are sampled from the SNðhÞ with the characteristic function where d ¼ a ffiffiffiffiffiffiffi ffi 1þa 2 p and EfriðxÞ ¼ 2 ffiffi p p Ð x 0 e t 2 dt is imaginary error function (N.J.A 2007). The partial derivatives of the characteristic function (Feuerverger and Mcdunnough 1981) with respect to the three parameters are @cðt, hÞ ð1 þ a 2 Þ À 3 2 ðÀ sin tn þ i cos tnÞ: The rest of the estimation process is the same as the previous subsection 2.1. Although the ECF method could estimate parameters of the SN distribution, the estimator may be far from optimal, as the distribution of its error terms eðt, hÞ cannot be justified to follow a normal distribution. Many robust estimators have been proposed to achieve a tradeoff between efficiency and stability in the presence of infinite estimates. M-estimator, derived by Peter and Huber (2011), is regarded as a desirable method in regression modeling and is robust to departure from the normality assumption. The fundamental idea of M-estimator is to replace the squared residuals by a function qðÞ of residuals. Rey (1983) has proposed many different q functions, and all q functions have the following properties.
In practice, the Tukey biweight function, proposed by Yohai and Zamar (1988), is the most ubiquitous q function, defined as where c is the tuning parameter. Because the M-estimator with the Turkey biweight function has an efficiency of 95% under independent Gaussian error by setting c ¼ 4.6851. The following theorem gives rise to the asymptotic properties ofĥ obtained by minimizing Eq. (12). The proof of theorem 1 is similar to the proof of theorem 1 in Chen, Ye, and Zhao (2017).
Theorem 1. Let t 1 , t 2 , :::, t m be discrete fixed grid points. The estimator of h based on Equation (12),ĥ, converges to the true value h 0 in probability.
As for confidence intervals of parameters, we would meet obstacles if we directly construct them based on Eq. (12). Alternatively, resampling algorithms, including the bootstrap resampling method, are frequently adopted for intervals estimates. DiCiccio et al. (1996) proved that bootstrap is asymptotically more accurate than the standard intervals obtained using sample variance and assumptions of normality. Efron and Tibshirani (1986) exhibited that bootstrap percentile confidence intervals have invariant properties. Thus, we perform interval estimation using the bootstrap percentile confidence interval for the parameters of the SN distribution. The procedures are as follows.
Step 1 For the given sample size n, draw B samples with replacement from the observed data y 1 , y 2 , :::, y n : Step 2 For each bootstrap sample, estimate the parameters by minimising Expression (Feuerverger and Mureika 1977).

Simulation
This section gives a simulation to investigate the performance of different methods for the SN parameters. Specifically, the proposed method in this article is compared with the MPLE and the MM. In the proposed method, we choose the Tukey biweight function defined in Expression (Heathcote 1977) as the q function, where c ¼ 4.6851. Besides, we use the MM estimators as the initial value for performing the MEFC method. Due to the randomness, not all eigenvalues ofX n are positive. If such a case is encountered, we just let the negative eigenvalues as 0, then perform the optimal computation of (Feuerverger and Mureika 1977) to obtainĥ: Assume that y 1 , y 2 , :::, y n from SNðn, x, aÞ: Without loss of generality, we assume n ¼ 0 and x ¼ 1. The shape parameter a is chosen as À3, À 2, À 1, 1, 2 and 3, respectively. And the sample size is selected as n ¼ 50, 100 and 200, respectively. We generate 2000 times with different sample sizes n. The comparisons are based on the absolute bias and the mean squared error (MSE). We set the eigenvalue as 0, if it is less than 0.
First, we need to determine the optimal number of grid points, m, under the condition of different values of a. Figures 1 and 2 show that when n ¼ 50, the size of grid points has no significant effect on the estimation of the location parameter n and scale parameter x. As for the shape parameter a, the lines of biases and MSE fluctuate slightly when m 15, and keep steady when m ! 20: When m ¼ 10, the biases and MSEs become stable. The case of n ¼ 100 and 200 show the similar phenomena, and the figures are shown in the Supplemental file. Thus, for balancing the efficiency of estimation and computation, we choose the number of grid points m ¼ 10.
Then, with optimal m, we compare the proposed method with the MPLE and the MM. Azzalini Azzalini (1985) stated that compared with the shape parameter a, the location parameter n and the scale parameter x are easier to be estimated. This conclusion can also be verified in the Figures 3 and 4 because of small biases and MSEs in all the scenarios. Thus, we just utilize the results of the shape parameter a to assess Figure 1. When the sample size n ¼ 50 and shape parameter a ¼ À3, À 2 and -1, the number of grid points m versus the estimation efficiency in term of abstract biases (left panel) and MSEs (right panel). The different colors and shapes of lines represent estimated results of the parameters, that is, location parameter n by a green dot line, scale parameter x by a red dashed line, and shape parameter a by a solid black line.   estimation efficiency. As can be seen from Figures 3 and 4, the MSEs of the MM are much lager in all the case comparing with the other two methods, although it has small biases, especially for the case of the small sample size. Thus, the estimation efficiency of the MM is the lowest one. MPLE estimates being infinite is increasing as the sample size becomes small, which can be seen in the Table 1. Thus, the results obtained by MPLE may be non-convergent. Moreover, even if we take out these infinite values, the MPLE still works worse than the MEFC in term of biases and MSEs, but better than the MM. However, the proposed method, the MEFC, performs the best in overall, since its results are closer to the true value and show the smallest MSEs, even for the small sample size. When the sample size n ¼ 200, the MPLE also performs well, which proves the conclusion of Azzalini and Arellano (2013), and the results are listed in the Supplemental file. Therefore, the MEFC has a better performance comparing with the other two methods.

An application
In this section, we compare the performance of the bootstrap percentile confidence interval based on the MECF method, the MPLE method and the MM. We consider the dataset downloaded from http://azzalini.stat.unipd.it/SN/frontier.dat. It consists of simulated data from SNð0, 1, 5Þ with n ¼ 50. The ML estimation is not proper since the profile likelihood of a doesn't have a close-form equation. Some researchers have approached this problem in different methods. Sartori (2006) obtainedâ ¼ 9:14 by using the corrected ML method. Azzalini and Capitanio Azzalini and Capitanio (1999) used the graphical approach to fit the data by setting a ¼ 8:1: Liseo and Loperfido (2006) utilized the reference priors to estimate a. But its posterior does not have finite mean and its median, which is about 15.9, is regarded as the reasonable point estimate, with the 95% highest posterior density region given by ð4:2, 52:5Þ: We combine the MEFC with the bootstrap method to construct the percentile confidence intervals of the parameters, by setting number of bootstrap replications B ¼ 2000. The results are shown in Table 2. From Table 2, we can see that the bootstrap percentile confidence intervals of the parameters based on the MECF cover all the true values, and the lengths of intervals are short. However, the MM and the MPLE fail to achieve this results. Therefore, the MECF method outperforms the MM and the MPLE in this example.

Conclusion
We have presented the MECF method to estimate parameters of the SN distribution and construct their confidence intervals based on bootstrap. By leveraging qðÞ function, the modified ECF is extrapolated to more complex situations, where the distribution of its error terms eðt, hÞ cannot be justified to follow a normal distribution, thereby improving estimation accuracy even for small to mediate sample sizes. By comparison with the MPLE and the MM, the MECF works well no matter how small sample size is. Data analysis also shows that the MECF performs well.
As for multivariate versions, the MECF method could be a way to test symmetry under multivariate skew-symmetry models and explore new direction to estimate their parameters. It is well-known that the characteristic function of a symmetric distribution has only the real part. Then the multivariate data could be projected onto the direction of maximal skewness (Loperfido 2018) and the characteristic function of this univariate projection could be used to test its skewness. If the corresponding p-value were high enough, then the sampled multivariate distribution could be taken as symmetric. Combined with the method in Jim enez- Gamero et al. (2016), the parameters of multivariate skew-normal distribution could be obtained. Any skewed and multivariate probability density function can be expressed as the product of a symmetric probability density function and another function, the proposed method provides a possible new way to test symmetry under a multivariate skew-symmetry model.