Extended Expectation Maximization for Under-Fitted Models

In this paper, we generalize the well-known Expectation Maximization (EM) algorithm using the α-divergence for Gaussian Mixture Models (GMMs). This approach is useful in robust subspace detection, where the number of parameters is kept small to avoid overfitting and large estimation variances. The level of robustness can be tuned by the parameter α. When α → 1, our method is equivalent to the standard EM approach, and for α < 1 the method is robust against potential outliers. Simulation results show that the method outperforms standard EM under mismatches between the assumed noise model and its realizations. In addition, we use the proposed method to detect active brain areas in functional Magnetic Resonance Imaging (fMRI) data collected during task-related experiments.


INTRODUCTION
Signal detection has a wide range of applications in radar [1], cognitive AI [2], and medical imaging [3]. Most detectors rely on strong assumptions, such as knowledge of the exact noise density model [4], knowledge of the signal and interference subspaces [3], and the availability of a secondary signal-free dataset [5] for covariance estimation. A slight deviation from these assumptions can lead to a drastic change in the expected results. Suboptimal methods that have been developed to improve robustness against such deviations still rely on similar assumptions (e.g., see [6]). To relax the assumption of exact knowledge of the density model, mixture models are widely used in a variety of applications, including robust signal processing [7] and image processing [8]. Approaches to estimating the parameters of mixture models, particularly the Gaussian Mixture Model (GMM), include: (i) Maximum Likelihood (ML); (ii) Maximum a Posteriori (MAP); and (iii) Bayesian inference. ML-based algorithms, which are the main focus of this paper, estimate the model parameters by maximizing the likelihood function, treating the parameters as unknown deterministic quantities. The most well-known ML algorithm is the Expectation Maximization (EM) algorithm [9]. (This research was partially supported by ARC DP210101682.)
The model order is usually determined by adding a term to the EM objective that penalizes high-order models. Various penalties have been proposed in the literature, including Akaike's Information Criterion (AIC) [10], the Bayesian Information Criterion (BIC) [11], and the Minimum Description Length (MDL) [12]. There are a number of problems associated with EM, including convergence to local optima and sensitivity to initial values [13], and recent studies have attempted to address these issues [14]. There are, however, challenges in GMM learning that remain unaddressed. For example, several methods can estimate the order of the model using criteria such as AIC, BIC, or MDL. However, in applications such as radar or medical imaging, information about the number of components may be available due to physical factors. In this case, one needs to limit the model order to adequately approximate the underlying physical mechanism. Although the GMM is naturally robust, one should investigate parameter estimation methods that are robust against potential outliers caused by sources that are not of interest for modeling. The remainder of this paper is organized as follows. Section 2 provides background on the GMM and the EM approach. In Section 3, a robust α-divergence based approach is introduced. Results are presented in Section 4, and the paper is concluded in Section 5.

BACKGROUND
Assume the observations $y_1, \ldots, y_N$ are independent and identically distributed (i.i.d.) samples from the GMM

$$f(y \mid \beta) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(y \mid \mu_k, \sigma_k^2), \qquad (1)$$

where $\mathcal{N}(\cdot)$ is the multivariate normal density (written here in univariate notation for simplicity), $\pi_k$ are the mixing weights with $\sum_k \pi_k = 1$, and $\beta = \{\pi_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$ collects the unknown parameters. The goal is to estimate β using learning methods. EM is the most popular and commonly used approach for learning a GMM; it determines the unknown parameters by iteratively maximizing the likelihood function in (1). An EM algorithm consists of two steps: an Expectation (E) step and a Maximization (M) step. For a given value of K, the parameters of the GMM are updated in the M step by

$$\pi_k = \frac{1}{N}\sum_{n=1}^{N}\gamma_{nk}, \qquad \mu_k = \frac{\sum_{n=1}^{N}\gamma_{nk}\,y_n}{\sum_{n=1}^{N}\gamma_{nk}}, \qquad \sigma_k^2 = \frac{\sum_{n=1}^{N}\gamma_{nk}\,(y_n-\mu_k)^2}{\sum_{n=1}^{N}\gamma_{nk}},$$

where $\gamma_{nk}$ represents the responsibility of the $n$-th sample in forming the $k$-th component. This quantity is computed in the E step by

$$\gamma_{nk} = \frac{\pi_k\,\mathcal{N}(y_n \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(y_n \mid \mu_j, \sigma_j^2)}.$$

These two steps are repeated until convergence. From the above expressions, it can be seen that all data points are used to estimate the parameters, regardless of whether they are outliers or perfectly follow the GMM. As a result, the EM algorithm is generally not robust in the presence of a high proportion of outliers, or when the number of components is insufficient to track all of the complex observed patterns.
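For concreteness, the following is a minimal NumPy sketch of these E and M steps for a univariate GMM; the function name, initialization scheme, and fixed iteration count are illustrative choices, not part of the original formulation.

```python
import numpy as np
from scipy.stats import norm

def em_gmm(y, K, n_iter=100, seed=0):
    """Minimal EM for a univariate K-component GMM (a sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    pi = np.full(K, 1.0 / K)              # mixing weights
    mu = rng.choice(y, K, replace=False)  # initialize means from random data points
    var = np.full(K, np.var(y))           # initial variances
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k]
        dens = np.stack([pi[k] * norm.pdf(y, mu[k], np.sqrt(var[k]))
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form weighted updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma * y[:, None]).sum(axis=0) / Nk
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```

Note that every sample $y_n$ enters each update with weight $\gamma_{nk}$, which sums to one over the components; this is exactly why a gross outlier still pulls the estimates.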

PROPOSED METHOD
In standard EM, the goal is to maximize the log-likelihood of the data, which can be decomposed into two components:

$$\log f(\mathbf{Y} \mid \beta) = \mathcal{L}(q, \beta) + \mathrm{KL}\big(q(\mathbf{z}) \,\big\|\, f(\mathbf{z} \mid \mathbf{Y}, \beta)\big),$$

where z is the unobserved (latent) variable and L and KL are the lower bound and the Kullback-Leibler divergence, respectively. Instead of directly maximizing log f(Y | β), EM performs the maximization in a two-step procedure: (i) minimizing the KL term in the first step, and (ii) maximizing the lower bound in the second step. Together, these two steps increase the log-likelihood value. Because of the small number of components in the target applications, the M step is not robust against unexpected outliers. In the M step of the EM algorithm, the goal is to maximize the lower bound L with respect to β, which is precisely ML estimation on the complete-data log-likelihood. It is known, however, that ML with a large number of samples (N → ∞) is equivalent to minimizing the KL divergence between the observed empirical density and the model density [15]. It is therefore possible to reformulate the problem and replace the KL divergence with the α-divergence, that is, to extend the KL divergence to a form with valuable properties, including robustness [16]. Note that despite some similarities, the α-divergence is different from the Rényi divergence used in the literature. The α-divergence has recently been used to improve convergence and the estimation of weights in mixture models for variational inference [17, 18]. For two probability density functions g(y, β*) and f(y, β), the α-divergence is defined as [19]

$$D_\alpha\big(g(y,\beta^*) \,\big\|\, f(y,\beta)\big) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int g(y,\beta^*)^{\alpha}\, f(y,\beta)^{1-\alpha}\, dy\right),$$

where α is a hyperparameter. In the specific case α → 1, the α-divergence becomes the KL divergence [20]. As we will see below, robust detection benefits from adopting this measure for GMM learning.

Robust Hypothesis Testing: In many applications, such as communications, medical imaging, and radar, it is common that the observations do not follow the model perfectly and contain components that the model cannot capture [21]. For example, in many applications the observed vector y follows the general linear model (GLM)

$$\mathbf{y} = \mathbf{H}\boldsymbol{\theta} + \boldsymbol{\xi},$$

where H ∈ R^{N×p} is a known matrix whose columns span the signal subspace and θ is an unknown deterministic parameter vector. Let $\mathbf{h}_n^{\top}$ denote the $n$-th row of H. In this model, the entries of ξ are i.i.d. with an unknown distribution g that can be any arbitrary function modeling both the actual density and the outliers. For such distributions, GMMs can be used to approximate g by a parametric model f. If prior information is available, one can fix the number of GMM components K to a small integer to reduce the number of unknown parameters; in this case, the existence of outliers affects the parameter estimation, and a robust learning method is therefore required. In robust hypothesis testing, the goal is to decide between $\mathcal{H}_1: \boldsymbol{\theta} \neq \mathbf{0}$ and $\mathcal{H}_0: \boldsymbol{\theta} = \mathbf{0}$.
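As a quick numerical check of the α → 1 limit, the sketch below evaluates $D_\alpha$ on a grid for two Gaussians and compares it with the KL divergence; the grid-based integration and the particular densities are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(-20.0, 20.0, 200001)
dy = y[1] - y[0]
g = norm.pdf(y, 0.0, 1.0)   # "true" density g(y)
f = norm.pdf(y, 0.5, 1.5)   # model density f(y)

def d_alpha(g, f, alpha):
    # D_alpha = (1 - integral of g^alpha * f^(1 - alpha)) / (alpha * (1 - alpha))
    integral = np.sum(g ** alpha * f ** (1.0 - alpha)) * dy
    return (1.0 - integral) / (alpha * (1.0 - alpha))

kl = np.sum(g * np.log(g / f)) * dy  # KL(g || f) on the same grid
for a in (0.5, 0.9, 0.99, 0.999):
    print(f"alpha = {a}: D_alpha = {d_alpha(g, f, a):.4f} (KL = {kl:.4f})")
```

For α well below 1, regions where g has mass but f is small are penalized far less aggressively than under KL, which is the source of the robustness exploited below.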
Consider the following nominal likelihood function for a univariate Gaussian mixture model under the GLM:

$$f(y_n \mid \beta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(y_n - \mathbf{h}_n^{\top}\boldsymbol{\theta} \,\big|\, \mu_k, \sigma_k^2\big).$$

Using the empirical distribution $f_e$ of the observations in place of $g$, the defined divergence $D_\alpha$ is proportional to

$$c \sum_{n=1}^{N} f(y_n \mid \beta)^{1-\alpha},$$

where c is a negative constant. Therefore, the solution takes the form

$$\hat{\beta} = \arg\max_{\beta} \sum_{n=1}^{N} f(y_n \mid \beta)^{1-\alpha}.$$

Taking the derivative with respect to the unknown parameters, in particular θ, and setting it to zero, the unknown parameters can be computed iteratively. For example, it can be shown that the estimator of θ satisfies

$$\sum_{n=1}^{N}\sum_{k=1}^{K} \frac{w_{nk}}{\sigma_k^2}\,\big(y_n - \mathbf{h}_n^{\top}\boldsymbol{\theta} - \mu_k\big)\,\mathbf{h}_n = \mathbf{0}, \qquad w_{nk} = \gamma_{nk}\, f(y_n \mid \beta)^{1-\alpha},$$

where $w_{nk}$ is the weight of the $n$-th observation in forming the $k$-th component. Rewriting this expression in matrix form, we obtain

$$\hat{\boldsymbol{\theta}} = \Big(\sum_{k=1}^{K} \tfrac{1}{\sigma_k^2}\,\mathbf{H}^{\top}\mathbf{W}_k\mathbf{H}\Big)^{-1} \sum_{k=1}^{K} \tfrac{1}{\sigma_k^2}\,\mathbf{H}^{\top}\mathbf{W}_k\big(\mathbf{y} - \mu_k\mathbf{1}\big),$$

where $\mathbf{W}_k = \mathrm{diag}(w_{1k}, \ldots, w_{Nk})$ is a diagonal weight matrix. For α → 1 we have $w_{nk} \to \gamma_{nk}$, and the proposed method reduces to the standard EM algorithm. In the case K = 1, we obtain the standard weighted least-squares estimator under nominal Gaussian noise [15]. With the same approach, the remaining unknown parameters can be estimated as

$$\mu_k = \frac{\sum_{n} w_{nk}\,(y_n - \mathbf{h}_n^{\top}\hat{\boldsymbol{\theta}})}{\sum_{n} w_{nk}}, \qquad \sigma_k^2 = \frac{\sum_{n} w_{nk}\,(y_n - \mathbf{h}_n^{\top}\hat{\boldsymbol{\theta}} - \mu_k)^2}{\sum_{n} w_{nk}}, \qquad \pi_k = \frac{\sum_{n} w_{nk}}{\sum_{n}\sum_{j} w_{nj}}.$$

Algorithm 1: Proposed EMi Algorithm
Input: observation vector y, number of components K, α.
Output: $\mu_k$, $\sigma_k^2$, $\hat{\boldsymbol{\theta}}$.
Initialize the parameters, then run the following two steps until convergence:
1. E step: compute the responsibilities $\gamma_{nk}$ and the weights $w_{nk}$.
2. M step: update $\hat{\boldsymbol{\theta}}$, $\mu_k$, $\sigma_k^2$, and $\pi_k$ using the weighted expressions above.

Using the above formulation, the parameters of the Gaussian components are estimated from a weighted form of the observations, where the weights depend on the hyperparameter α; for α < 1 the robustness property is obtained. This method is useful in situations where the underlying distribution cannot be recovered exactly, due to outliers and the limited number of components available for approximation. The proposed method is summarized in Algorithm 1 and is called Expectation Minimization (EMi), since it minimizes the α-divergence. Finally, to decide between $\mathcal{H}_0$ and $\mathcal{H}_1$, we evaluate the energy of $\hat{\boldsymbol{\theta}}$ through the test

$$T(\mathbf{y}) = \|\hat{\boldsymbol{\theta}}\|^2 \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; \lambda,$$

where λ is an appropriate threshold value.
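A compact sketch of one possible realization of Algorithm 1 is given below; it assumes the weight form $w_{nk} = \gamma_{nk} f(y_n \mid \beta)^{1-\alpha}$ and the weighted updates reconstructed above, and the function name, initialization, and fixed stopping rule are illustrative.

```python
import numpy as np
from scipy.stats import norm

def emi(y, H, K, alpha, n_iter=200, seed=0):
    """Sketch of the EMi iteration for y = H @ theta + GMM noise."""
    rng = np.random.default_rng(seed)
    N, p = H.shape
    theta = np.zeros(p)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(y, K, replace=False)  # crude initialization from data
    var = np.full(K, np.var(y))
    for _ in range(n_iter):
        # E step: responsibilities and alpha-dependent weights on the residuals
        r = y - H @ theta
        dens = np.stack([pi[k] * norm.pdf(r, mu[k], np.sqrt(var[k]))
                         for k in range(K)], axis=1)
        fy = dens.sum(axis=1) + 1e-300    # nominal likelihood f(y_n | beta)
        gamma = dens / fy[:, None]
        w = gamma * fy[:, None] ** (1.0 - alpha)  # w_nk -> gamma_nk as alpha -> 1
        # M step: weighted least squares for theta ...
        A = sum(H.T @ ((w[:, k] / var[k])[:, None] * H) for k in range(K))
        b = sum(H.T @ (w[:, k] * (y - mu[k]) / var[k]) for k in range(K))
        theta = np.linalg.solve(A, b)
        # ... and weighted updates for the component parameters
        r = y - H @ theta
        Wk = w.sum(axis=0)
        mu = (w * r[:, None]).sum(axis=0) / Wk
        var = (w * (r[:, None] - mu) ** 2).sum(axis=0) / Wk
        pi = Wk / Wk.sum()
    return theta, pi, mu, var

# detection: decide H1 when the energy T = theta @ theta exceeds a threshold lam
```

Setting alpha = 1.0 makes the weights collapse to the standard responsibilities, so the same routine can also serve as the EM baseline in comparisons.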

EXPERIMENTS AND RESULTS
In order to examine the proposed method, 4000 data points were generated using the following distribution function:

$$g(y) = \pi_1\,\mathcal{N}(y \mid \mu_1, \sigma_1^2) + \pi_2\,\mathcal{N}(y \mid \mu_2, \sigma_2^2) + \pi_h\, h(y),$$

where $\mu_i$ is the mean and $\sigma_i^2$ the variance of the respective component, and $h$ is an unknown distribution which models the outliers. This distribution consists of three components, one of which is an outlier, with the remaining two forming the nominal model. Only two model components are fitted to the data, consistent with the available prior information about the noise.

Fig. 1 shows that when α = 0.8 and H consists of two cosine functions, the EMi approach has almost the same performance as the standard EM approach for π_2 = 0, π_1 = 0.25, μ_1 = 0, μ_2 = 4, K = 2, σ_0 = 1, and σ_1 = 0.2. On the other hand, when outliers are added as a point-mass distribution at x = 10 (Fig. 2) with probability 0.15, the standard EM method has a significantly lower power (probability of detection) than the proposed EMi method, which retains a high power level: it loses only a small amount of power, showing that the proposed method is robust against outliers. The least-squares (LS) estimate under a Gaussian noise model is also shown in Fig. 1 to illustrate the difference between GMM-based and LS estimation in non-Gaussian noise. The above result for EMi was achieved through better parameter estimation in the presence of outliers. Figure 2 shows the estimated components when θ = 0. The estimated parameters are exactly the same as for the standard EM approach when α = 1, while for α = 0.8 the third component (the outlier) is ignored.

The analysis of fMRI data is an important application of signal detection methods. Its aim is to identify the brain areas that are active during a particular task. Under the assumption that brain voxels (small cubic volumes in the brain) behave as linear time-invariant systems, the approximated signal can be obtained by convolving the task stimulus function with the impulse response of the brain, the Hemodynamic Response Function (HRF). When evaluating the performance of the proposed detector on real fMRI data, we assume that a number of outliers are present in the noise. In this study, we use the Block-Paradigm Right Finger Tapping (BPRFT) task and the Event-Related Right Finger Tapping (ERRFT) data [22]. Briefly, in the BPRFT data a 15-second task period was alternated with a 72-second rest period four times, followed by a 30-second rest period; the total recording time was 480 seconds. The total recording time in ERRFT was 650 seconds: following a dummy scan, the right finger was tapped 40 times at random instants, followed by a 30-second rest. These datasets contain 592,895 voxels. Several pre-processing steps were applied, including motion correction, temporal pre-whitening, drift removal, and slice-time correction. Fig. 3 shows the outputs of the proposed detector, i.e., the brain areas detected as active during the experiment, for α = 0.8. Although ground truth is lacking, the metabolic behavior of the brain is known, so a smoother area is expected to represent the activation map more accurately. Additionally, we show that after adding artificial point-mass outliers at levels of 20% and 30%, the proposed EMi method is more robust and its activation maps remain stable, whereas the EM-based detection method produces different activation maps.
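As an illustration of this simulation setup, a contaminated dataset of this kind could be generated as sketched below; the point mass at x = 10 with probability 0.15 follows the description above, while the cosine frequencies, the signal amplitudes in θ, and the nominal mixing weights are assumptions we make where the text leaves them open.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4000
theta = np.array([1.0, 0.5])                   # hypothetical signal amplitudes
t = np.arange(N) / N
H = np.stack([np.cos(2 * np.pi * 3 * t),       # two cosine regressors; the
              np.cos(2 * np.pi * 7 * t)], 1)   # frequencies are our assumption
eps = 0.15                                     # outlier probability, from the text
labels = rng.choice(3, size=N, p=[0.25 * (1 - eps), 0.75 * (1 - eps), eps])
noise = np.where(labels == 0, rng.normal(0.0, 1.0, N),   # component 1: mu=0, sigma=1
        np.where(labels == 1, rng.normal(4.0, 0.2, N),   # component 2: mu=4, sigma=0.2
                 10.0))                                  # point mass at x = 10
y = H @ theta + noise                          # observations under H1 (theta != 0)
```

Running `emi(y, H, K=2, alpha=0.8)` versus `alpha = 1.0` on such data is the kind of experiment in which the robustness difference described above would be expected to appear.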

CONCLUSIONS
This paper proposes a novel, high-performance generalized expectation maximization algorithm for under-fitted models. The proposed approach is based on a general parametric divergence family, the α-divergence. Based on this new measure, the proposed estimator is enriched by weight functions that downweight observations likely to be outliers. The reported experiments demonstrate that the proposed method has better robustness against outliers when the number of Gaussian components is limited. Application to fMRI data also shows more stable and smoother activation maps.

Fig. 3: Active areas in the BPRFT and ERRFT data. From top to bottom: EMi on BPRFT, EM on BPRFT, EMi on ERRFT, and EM on ERRFT. The left and right figures show the results for the clean and contaminated data, respectively.