Micro-expression Video Clip Synthesis Method based on Spatial-temporal Statistical Model and Motion Intensity Evaluation Function

Micro-expression (ME) recognition is an effective method to detect lies and other subtle human emotions. Machine-learning-based and deep-learning-based models have achieved remarkable results recently. However, these models are vulnerable to overfitting due to the scarcity of ME video clips, which are much harder to collect and annotate than normal expression video clips; this scarcity limits further improvement of recognition performance. To address this issue, we propose a micro-expression video clip synthesis method based on a spatial-temporal statistical model and a motion intensity evaluation function. In our proposed scheme, we establish a micro-expression spatial-temporal statistical model (MSTSM) by analyzing the dynamic characteristics of micro-expressions and deploy this model to provide the rules for micro-expression video synthesis. In addition, we design a motion intensity evaluation function (MIEF) to ensure that the intensity of the facial expressions in the synthesized video clips is consistent with that of real MEs. Finally, facial video clips with MEs of new subjects can be generated by deploying the MIEF together with the widely-used 3D facial morphable model and the rules provided by the MSTSM. The experimental results demonstrate that the accuracy of micro-expression recognition can be effectively improved by adding the synthesized video clips generated by our proposed method.


I. INTRODUCTION
Micro-expressions (MEs) are very subtle and involuntary facial expressions that appear when people try to conceal their genuine emotions [1]. The recognition of MEs has been widely used in many applications. For example, in the clinical field, detecting and recognizing micro-expressions is vital for assisting psychologists in diagnosing and treating patients with mental diseases such as autism [2] and schizophrenia [3], [4]. ME recognition is also useful in emotion monitoring [5]-[7] and teaching assistance [8], [9], and serves as a vital clue in law enforcement [10], evidence collection, and criminal investigations [11], [12].
Although machine-learning-based (ML) [13]-[16] and deep-learning-based (DL) methods [17], [18] have been widely applied to ME recognition and have yielded remarkable results, these models heavily depend on the availability and quality of ME datasets, which makes them vulnerable to overfitting if insufficient training data is provided. Unfortunately, it is very difficult to collect and annotate micro-expression data because, unlike normal facial expressions, micro-expressions are characterized by low intensity, fragmental Action Units (AUs), short duration, and unnoticeability (only trained psychologists can perceive and decode them [19], [20]). For example, one of the largest ME databases, CASME II [21], has only 255 samples. Therefore, the scarcity of training data limits the improvement of these models for ME recognition. Transfer learning is one of the popular techniques to tackle the data shortage problem. For instance, [18], [22], [23] propose deep neural network (DNN)-based ME classifiers trained with general expression data and then fine-tuned with the limited micro-expression data. However, the effectiveness of transfer learning for ME recognition still needs improvement because the general expression domain and the micro-expression domain differ substantially in both motion intensity and the occurrence patterns of action units.
Data augmentation is another widely applied technique to address the data shortage problem. It can be classified into image-preprocessing-based augmentation and generative-model-based augmentation. Both still face the following difficulties when applied to ME recognition: 1) Methods of the former category generate new images by partially altering their appearance, such as flipping images [16] or adding noise [17], to increase the number of ME images. The problem with this category is that it fails to increase the subject or emotion diversity of the data, which makes the generated data less informative.
2) The generative-model-based methods usually synthesize facial images by tracking expressions in videos and have been widely used for synthesizing general expression data. Although this approach has been used to address the data shortage problem for ME recognition, there remain three main difficulties in synthesizing micro-expressions. (1) Micro-expressions are too subtle. As illustrated in Fig. 1(b), the difference between the onset frame and the apex frame is nearly unnoticeable, while the difference for normal expressions, such as in Fig. 1(a), is apparent. This makes facial expression detection and tracking much harder for micro-expressions than for normal expressions. To date, no facial feature tracking or action unit detection method has been reported to track or detect micro-expressions reliably. Consequently, generative models [24]-[26] that depend on facial expression tracking have difficulty capturing micro-expressions and cannot produce such facial images accurately.
(2) The facial Action Unit (AU) occurrence patterns of normal expressions are different from those of micro-expressions, which decreases the effectiveness of transfer learning from normal expressions to micro-expressions [18]. The activated AUs of the normal smile expression in Fig. 1(a) include AU6, 12, 20, and 25, while the activated AUs of the micro-smile expression in Fig. 1(b) include only AU12. (3) The intensities of micro-expressions are very small, and it is difficult to synthesize facial expressions with such small intensities.
In this paper, we propose a novel micro-expression video clip synthesis scheme to address the difficulties of the generative-model-based methods for ME recognition, thereby overcoming the data scarcity problem. Different from the existing popular methods, our scheme does not depend on facial tracking. Instead, we propose a micro-expression spatial-temporal statistical model (MSTSM) that provides the rules of the dynamic characteristics of micro-expressions. Applying these rules to 3D facial morphable models, we generate various facial images with micro-expressions. Furthermore, different from existing generative models, our scheme includes a novel motion intensity checking algorithm that evaluates whether the generated micro-expressions have motion intensities consistent with real MEs while obeying the rules in the MSTSM. The key contributions of our paper are as follows: 1) A novel data augmentation method for micro-expressions is proposed. This method can synthesize 3D face models and facial video clips with micro-expressions for an arbitrary number of new subjects, thereby improving ME recognition accuracy effectively.
2) To the best of our knowledge, this is the first ME spatial-temporal statistical model (MSTSM) to statistically analyze the dynamic characteristics of micro-expressions. These characteristics include the AU occurrence patterns, the peak intensity of each micro-expression, and the relative positions of the peaks. By applying the dynamic characteristics provided by the MSTSM, we can produce realistic micro-expression images without using facial tracking.
3) A Motion Intensity Evaluation Function (MIEF) is designed to guarantee the accurate control of expression intensity in the synthesized video clips. The scheme searches the parameter space of facial morphable models and selects the optimal parameters to generate facial expressions whose intensity is consistent with that of real micro-expressions.

II. METHODS
Our scheme consists of two phases. In the first phase, we build a ME spatial-temporal statistical model (MSTSM) to represent three kinds of dynamic characteristics of micro-expressions: the AU occurrence patterns (AU combinations), the peak intensity of each micro-expression, and the relative positions of the peaks. In the second phase, we generate new ME samples with a novel motion intensity evaluation function (MIEF) and a 3D Morphable face Model (3DMM). By tweaking the shape and expression parameters of the 3DMM based on the dynamic characteristics in the MSTSM, we generate 3D face models with micro-expressions and render facial images from them. Finally, the MIEF is applied in the image synthesis procedure to make sure that the intensity of the facial expressions in the synthetic video clips is consistent with that of real micro-expressions. In this way, the generated micro-expression video clips with new subjects share similar dynamic characteristics with the clips of the existing ME dataset. The framework of our method is shown in Fig. 2.

A. Build Spatial-temporal Statistical Model (MSTSM) for Micro-expressions
The ME spatial-temporal statistical model (MSTSM) is built to analyze ME dynamic characteristics statistically. We extract three types of important metrics from ME video clips, which are common AU combinations, peak intensity, and positions of apex frames. Fig. 3 shows the main steps of this phase and the detailed steps are described below.

1) Extract the common AU combinations of micro-expressions
Facial expressions are related to the contractions of specific facial muscles, which can be described in terms of Action Units (AUs). There are a great many possible AU combinations (more than 7,000), and there is no consensus on whether the AU combinations of micro-expressions are the same as those of normal expressions. Therefore, we perform a statistical analysis to obtain the common AU combinations of micro-expressions in this study.
For each class of micro-expressions, such as surprise, repression, happiness, disgust, and others, we sort the AU combinations by their corresponding sample numbers from high to low. Let [x1, x2, …, xn] denote the sorted AU combinations and [y1, y2, …, yn] denote the corresponding sample numbers. The common AU combinations [x1, x2, …, xm] are selected as the smallest prefix satisfying

(y1 + y2 + … + ym) / (y1 + y2 + … + yn) ≥ β,  (1)

where m < n, and β ∈ (0,1) denotes the proportion of all samples that the common AU combinations account for.
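The selection rule of (1) can be sketched as follows. This is an illustrative reading of the threshold (the smallest prefix of the descending-sorted counts whose samples cover a proportion β); the function name and example counts are hypothetical, not from the paper.

```python
import numpy as np

def select_common(sorted_counts, beta=0.7):
    """Return m, the number of 'common' AU combinations to keep.

    sorted_counts: sample counts y_1 >= y_2 >= ... >= y_n for AU
    combinations already sorted in descending order of frequency.
    The smallest prefix whose cumulative proportion of samples
    reaches beta is selected, as in our reading of (1).
    """
    counts = np.asarray(sorted_counts, dtype=float)
    cumulative = np.cumsum(counts) / counts.sum()
    # first index where the cumulative proportion reaches beta
    return int(np.searchsorted(cumulative, beta) + 1)

# Hypothetical counts for five AU combinations of one emotion class:
# the top 3 combinations cover 80% of samples, which exceeds beta=0.7.
m = select_common([40, 25, 15, 12, 8], beta=0.7)
```

The same selection rule is reused later for peak intensities and apex-frame positions, so a single helper like this would serve all three statistics.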

2) Calculate the common peak intensities of micro-expressions
To quantify the intensity of an AU combination from a ME video accurately, we use optical flows as the measurement in our study. Peak intensity reflects the most intense facial movement during the process of micro-expression and thus is represented by the max optical flow intensity extracted from the aligned video clips. Following [8], all the faces in video clips are aligned and cropped based on a reference neutral face. In this manner, the variations of spatial appearance among different video clips are normalized.
To simplify the calculation and reduce the influence of image noise, we extract dense optical flows only from the face regions in each image frame where actual facial expressions may occur. Each video frame is divided into non-overlapping 8×8 blocks. The blocks that contain facial movements are called activated blocks. Only the dense optical flows in the activated blocks are computed to measure the expression intensity. For any AU combination, it is easy to tell which blocks are activated by checking the regions where the corresponding AUs occur.
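The activated-block bookkeeping above can be sketched as a simple mask builder. The function name and the assumption that activated block coordinates come from an AU-to-facial-region lookup are ours, not the paper's.

```python
import numpy as np

def activated_mask(block_coords, height, width, block=8):
    """Build a boolean (height, width) pixel mask from a list of
    activated 8x8 block indices (row, col).

    block_coords is assumed to come from a lookup of which facial
    regions the AUs of a given combination can move.
    """
    mask = np.zeros((height, width), dtype=bool)
    for r, c in block_coords:
        mask[r * block:(r + 1) * block, c * block:(c + 1) * block] = True
    return mask
```

Restricting the optical-flow summation to this mask is what keeps background noise and irrelevant facial regions out of the intensity estimate.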
The motion intensities are estimated by summing up the magnitudes of the optical flow vectors extracted between the first frame and each of the remaining frames. The largest motion intensity is taken as the peak ME intensity of the video clip. The intensity values are rounded to the nearest integers for ease of comparison in the later ME image synthesis phase. Finally, among the samples with common AU combinations, we count the number of samples for each peak intensity and sort the peak intensities by their sample numbers; the common peak intensities are then obtained in the same manner as (1), except that [x1, x2, …, xn] denotes the sorted peak intensities and [y1, y2, …, yn] denotes the corresponding sample numbers.
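A minimal sketch of the peak-intensity computation, assuming the dense optical flow fields from the first frame to each remaining frame have already been computed (e.g. with an off-the-shelf dense optical flow routine); the function name and array layout are our assumptions.

```python
import numpy as np

def peak_intensity(flows, mask):
    """Estimate the peak ME intensity of a clip, per our reading.

    flows: array of shape (F-1, H, W, 2) holding the dense optical
    flow from the first frame to each of the remaining F-1 frames.
    mask: boolean (H, W) array marking the activated blocks.
    Returns the maximum summed flow magnitude over frames, rounded
    to the nearest integer as in the paper.
    """
    magnitudes = np.linalg.norm(flows, axis=-1)          # (F-1, H, W)
    per_frame = (magnitudes * mask).sum(axis=(1, 2))     # one value per frame
    return int(round(float(per_frame.max())))
```

Only pixels inside activated blocks contribute, so a clip whose motion falls entirely outside the AU regions scores zero intensity.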

3) Count the common positions of apex frames in micro-expression videos
The relative position of the apex frame is another important facial movement feature, representing the relative speeds of the contraction and expansion of the facial muscles. In our work, we use a decimal to denote the position of the apex frame. Assuming a micro-expression video has L frames in total and the i-th frame is the apex frame, the position of the apex frame in this micro-expression video is computed as i/L. For all samples with common AU combinations, we count the number of samples for each position of the apex frame and sort the positions by their sample numbers; the common positions of the apex frames are then obtained in the same manner as (1).
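The apex-position statistic can be sketched as below. The rounding granularity used to bin the decimal positions is our assumption, as the paper does not state how positions are grouped before counting.

```python
from collections import Counter

def apex_position(i, L):
    """Relative apex position of a clip with L frames, apex at frame i."""
    return i / L

def common_positions(clips, beta=0.7, ndigits=1):
    """Keep the most frequent rounded apex positions covering a
    proportion beta of the samples, mirroring the selection in (1).

    clips: list of (apex_index, total_frames) pairs.
    ndigits: binning granularity for positions (an assumption).
    """
    counts = Counter(round(apex_position(i, L), ndigits) for i, L in clips)
    total = sum(counts.values())
    kept, covered = [], 0
    for pos, c in counts.most_common():
        kept.append(pos)
        covered += c
        if covered / total >= beta:
            break
    return kept
```

With, say, three clips peaking at mid-clip and one near the end, a beta of 0.7 keeps only the mid-clip position.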

B. Synthesize new Micro-expression video clips with 3D Morphable Model (3DMM) and MIEF
As illustrated in Fig. 4, our method includes three steps to synthesize a realistic ME video clip: the first step generates a 3D face model of an unknown subject with a neutral expression by setting the parameters of the 3DMM; the second step fits the 3DMM parameters with the MIEF to synthesize the 3D face model with the apex expression; and the third step renders a 2D video clip based on the generated 3D models with MEs.

1) Build the base 3D face model
Based on the 3D Morphable Model (3DMM) [27], the geometry and texture of a face can be represented by a shape vector S and a texture vector T. We define the base 3D face model with the neutral expression as (S_i, T_i). According to the performance-based technique [28], face models (S_i^e, T_i^e) with desired expressions can be obtained by transforming the base face model linearly as in (2) and (3):

S_i^e = S_i + λ (S_expression − S_neutral),  (2)
T_i^e = T_i + λ (T_expression − T_neutral),  (3)

where λ is the parameter that controls the expression intensity, the subscript "expression" denotes the corresponding vector related to the most intense expression, and "neutral" denotes the vector related to the base face model.
Micro-expression movements can be modeled by adjusting λ. When λ changes from zero to one, a normal facial expression changes from neutral to apex. λME denotes the parameter at the apex frame, which has the peak intensity of the micro-expression. Because micro-expressions are much weaker than normal expressions, the parameter λME related to the peak intensity of a micro-expression is far less than 1. The detailed method to calculate λME is presented in the next section.
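Equations (2) and (3) amount to a linear blend between the neutral vector and the full-intensity expression vector; the same one-line operation applies to both shape and texture. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def blend_expression(v_neutral, v_expression, lam):
    """Linear expression transfer of (2)/(3): as lam goes from 0 to 1
    the vector moves from the neutral face toward the full-intensity
    expression; micro-expressions use a small lam (lambda_ME << 1).

    v_neutral, v_expression: shape or texture vectors of equal length.
    """
    v_neutral = np.asarray(v_neutral, dtype=float)
    v_expression = np.asarray(v_expression, dtype=float)
    return v_neutral + lam * (v_expression - v_neutral)
```

At lam = 0 the function returns the base face unchanged, and at lam = 1 it returns the most intense expression, matching the endpoints described above.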
2) Calculate the 3DMM parameters for the apex frames with MIEF

λME is the 3DMM parameter related to the peak intensity of a micro-expression. The bigger λME is, the more intense the facial expression, and vice versa. However, there is no direct mapping between them. Here we propose an iterative searching method to find the optimal λME given a target ME. We name this method the Motion Intensity Evaluation Function (MIEF).
The idea of the MIEF is to search for the optimal λME whose corresponding expression intensity equals that of real MEs. The expression intensity is calculated from the optical flow between the 3DMM rendered with λ and the rendered base face model. The common peak intensities in the MSTSM are regarded as the expression intensities of real MEs. A binary search algorithm is applied to speed up the process. The details are described in Algorithm 1.

3) Generate micro-expression videos
With the calculated λME of the apex frame, we further compute the 3DMM parameters for the other frames. They are obtained by linear interpolation over the two processes of expression movement in a video, neutral to apex and apex to neutral, using the following formulas respectively:

λ_s^start = λME · s / (R·N),  (4)

where λ_s^start is the parameter corresponding to the s-th frame in the former process, N is the total number of frames in a synthetic micro-expression video, and R is the relative position of the apex frame (so the apex frame index is R·N);

λ_h^end = λME · (N − h) / (N − R·N),  (5)

where λ_h^end is the parameter corresponding to the h-th frame in the apex-to-neutral process.
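The interpolation of (4) and (5) can be sketched as a per-frame schedule of λ values. The exact index conventions (1-based frame numbering, apex index round(R·N)) are our assumptions, since the extracted text does not preserve them.

```python
def lambda_schedule(lam_me, n_frames, apex_pos):
    """lambda for each frame of a synthetic clip of n_frames frames
    whose apex sits at relative position apex_pos (R in the paper).

    Frames up to the apex rise linearly from 0 to lam_me (eq. 4);
    frames after it fall linearly back to 0 (eq. 5).
    """
    apex = round(apex_pos * n_frames)  # apex frame index, assumed R*N
    lams = []
    for k in range(1, n_frames + 1):
        if k <= apex:
            lams.append(lam_me * k / apex)                       # neutral -> apex
        else:
            lams.append(lam_me * (n_frames - k) / (n_frames - apex))  # apex -> neutral
    return lams
```

Each λ in the returned schedule is then fed to the linear blend of (2) and (3) to produce that frame's 3D face model.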
In this manner, we can synthesize ME video clips with the desired face shapes, appearances, and expressions. The synthetic ME video clips have dynamic characteristics and expression intensities similar to those of real ME videos, and they can help us improve the performance of ME recognition models.

III. EXPERIMENT RESULTS
Our proposed method aims to generate samples with realistic ME facial movements by matching the expression intensity and dynamic characteristics of existing ME samples. The main purpose of the generated synthetic samples is to improve ME recognition performance. Therefore, in this section we conduct experiments to evaluate whether enriching ME datasets with our synthetic samples improves the ME recognition rate, and by how much.

Algorithm 1: Binary search for λME

Input:
  U — the upper bound of λME; initially U = 1
  L — the lower bound of λME; initially L = 0
  peak intensity — the common peak intensity of a target micro-expression (rounded to an integer)
Output:
  λME

while (U − L) ≥ 0.01 do
  Median = (U + L) / 2
  calculate the expression intensity for λ = Median, saving the result in Intensity_n
  if Intensity_n > peak intensity then
    U = Median (L stays the same)
  else if Intensity_n < peak intensity then
    L = Median (U stays the same)
  else (Intensity_n = peak intensity)
    λME = Median
    return λME
  end if
end while
round U to 2 decimal places as λME
return λME

A. Experimental setting
In this study, we use the CASME II dataset, one of the popular ME databases with the largest number of samples among similar databases, for ME synthesis and for evaluating the improvement in micro-expression recognition. The CASME II dataset consists of 255 micro-expression video clips from 26 subjects, recorded at a resolution of 640×480 pixels and a frame rate of 200 fps. The video clips in CASME II cover seven classes of micro-expressions: surprise, happiness, fear, repression, disgust, sadness, and others. In our experiments, we group CASME II into four categories: negative, positive, surprise, and others, following [14]. The specific emotions that each category contains and the number of clips in CASME II are shown in Table I.
Here, we employ the methods in [14], [15], two state-of-the-art methods, for ME recognition. Both methods first extract LBP-TOP features from the ME database and then utilize an SVM to classify MEs. In our experiments, we set the proportion parameter β (defined in Section II-A) to 0.7. In total, 360 new ME samples are generated.
Section III (B) presents the statistical results of ME dynamic characteristics by using MSTSM. Section III(C) shows the comparisons between our method and the other popular augmentation methods for ME recognition.

B. The statistical result of ME dynamic characteristics
The common ME dynamic characteristics obtained statistically are shown in Table II. In our experiments, we randomize the shape parameters of the 3DMM to obtain eight new faces whose features are similar to those in the existing database, and we use Unity3D's built-in renderer to render the 3D models into images. Based on the statistical results, we set a different peak intensity for each AU combination, with exactly one peak intensity corresponding to each AU combination. For each new synthetic subject, we synthesized 12 (4 AU combinations × 3 positions of the apex frame) negative samples, 12 (4×3) positive samples, 18 (6×3) others samples, and 3 (1×3) surprise samples.

C. Comparison of different micro-expression augmentation methods for ME recognition
To evaluate the effect of our generated data on micro-expression recognition, we compare the overall recognition accuracy under different augmentation methods and further analyze the per-category performance with confusion matrices.
For all ME recognition experiments, we use three-fold cross-validation, which randomly divides CASME II into three folds, holds out one fold as the testing set, and combines the remaining folds with the synthetic samples as the training set. To the best of our knowledge, the existing methods for micro-expression data augmentation include adding noise [17] and flipping [16]. Hence, we compare the ME recognition rates of our approach with these two augmentation approaches. As our method concentrates on the dynamic information of MEs rather than texture information when generating new ME samples, we employ two state-of-the-art ME recognition methods [14], [15] that pay more attention to dynamic information. The recognition results are shown in Table III. The results in Table III show that all of the above data augmentation approaches improve ME recognition performance. The best accuracy is 69.72%, obtained by enriching the training set with our generated samples and using the recognition method in [14]. The recognition accuracy with our generated samples is 3.77% higher than that with no augmentation. Compared with the augmentations by flipping samples [16] and adding noise [17], the recognition accuracy of our method is 3.75% and 2.51% higher, respectively. Table III also lists the recognition accuracy of the different augmentation methods for the recognition method in [15]. Its recognition rate is improved by 7.04% with our generated samples compared with no augmentation. Compared with the augmentations by flipping samples [16] and adding noise [17], the recognition accuracy of the method in [15] with our generated samples is 6.69% and 1.66% higher, respectively.
These results demonstrate that our method can effectively improve recognition performance and outperforms the other two augmentation methods, which results from the effectiveness of the proposed ME spatial-temporal statistical model (MSTSM) and the expression intensity guarantee.
To further evaluate the effect of our generated data on micro-expression recognition, we tested the recognition accuracy for different expressions and generated confusion matrices. Fig. 5 and Fig. 6 show the confusion matrices of no augmentation, augmentation by adding noise, augmentation by image flipping, and augmentation by our method for weighted LBP-TOP [14] and LBP-IP [15], respectively. As shown in Fig. 5, compared with no augmentation, the recognition model in [14] with our generated data achieves better performance on all four categories of MEs, with improvements of 0%, 18.4%, 3.4%, and 0.1%, respectively. Although the recognition rates with our generated samples are lower than those of flipping and adding noise for the 'negative' category, the enrichment by our generated samples achieves better performance on the other three categories, 'surprise', 'positive', and 'others', with improvements of 10.6%, 4.7%, and 1.1% over adding noise and 14.6%, 4.7%, and 3.4% over flipping. These results demonstrate that the recognition method in [14] with our generated data outperforms that with no augmentation or with the existing augmentation methods in most cases.
As shown in Fig. 6, compared with no augmentation and augmentation by image flipping, the enrichment by our generated samples achieves better performance on all four categories of MEs for LBP-IP [15]. Specifically, for 'negative', 'surprise', 'positive', and 'others', the recognition rates with our generated samples are improved by 12.5%, 22.2%, 12.5%, and 4.5%, respectively, compared with no augmentation, and by 12.5%, 22.2%, 4.2%, and 4.5%, respectively, compared with flipping. Compared with adding noise, the enrichment by our generated samples achieves better performance on 'surprise', with an improvement of 8.1%, and comparable performance on 'negative', 'positive', and 'others'. These results demonstrate that the enrichment by our generated samples markedly improves recognition performance for LBP-IP [15], outperforming no augmentation and the existing augmentation methods in most cases.
The experimental results show that the ME video clips synthesized by our method effectively improve the performance of two state-of-the-art ME recognition methods compared with no augmentation, augmentation by adding noise, and augmentation by image flipping. Therefore, the effectiveness of the proposed micro-expression video clip synthesis method has been demonstrated by these experiments.

IV. CONCLUSIONS
In this paper, we propose a novel micro-expression video clip synthesis method to generate ME samples. In our method, a spatial-temporal statistical model is first established to represent the dynamic characteristics of micro-expressions, such as the AU occurrence patterns and the peak intensities of micro-expressions. To the best of our knowledge, this is the first attempt to analyze and summarize the dynamic characteristics of ME samples. Using these dynamic characteristics, the 3D facial morphable model is applied to generate facial video clips, avoiding facial tracking, which is difficult for micro-expressions. Furthermore, a motion intensity evaluation function is proposed to make sure that the intensity of the facial expressions in the synthetic video clips follows real micro-expressions. The experimental results have demonstrated that the samples generated by our method can effectively improve the predictive capability of the recognition model, confirming the effectiveness of our proposed micro-expression video synthesis method. For future work, we plan to improve our method by using nonlinear interpolation to obtain the expression intensity of each frame, making the expression intensity of the synthesized samples closer to that of real samples.

Fig. 5. Confusion matrices of the recognition method in [14] with no augmentation, noise-added samples, flipping samples, and our generated samples, respectively.
Fig. 6. Confusion matrices of the recognition method in [15] with no augmentation, noise-added samples, flipping samples, and our generated samples, respectively.