Learning Clusters in Autism Spectrum Disorder: Image-Based Clustering of Eye-Tracking Scanpaths with Deep Autoencoder

Autism spectrum disorder (ASD) is a lifelong condition characterized by social and communication impairments. This study applies unsupervised machine learning to discover clusters in ASD. The key idea is to learn clusters based on the visual representation of eye-tracking scanpaths. The clustering model was trained using compressed representations learned by a deep autoencoder. Our experimental results demonstrate a promising clustering tendency in the data. Further, the clusters are explored to provide interesting insights into the characteristics of the gaze behavior involved in autism.


II. MOTIVATION
The study can be viewed from the perspective of an exploratory analysis. The main motivation was to investigate whether the eye-tracking data could indicate a plausible clustering structure of ASD. We were generally intrigued to discover interesting patterns or structures underlying the visual representations of eye-tracking scanpaths. A set of motivational questions is listed as follows:
1. Would the visual patterns of eye-tracking scanpaths indicate an underlying structure of clusters?
2. If so, could the discovered clusters reveal possible connections related to the dynamics of gaze behavior (e.g. velocity, acceleration)?
3. Further, how would clusters vary with respect to the characteristics of participants (e.g. age)?

III. RELATED WORK
Numerous studies have sought to take advantage of eye-tracking technology for the study and analysis of ASD behavior. For instance, Pusiol [2] worked on the analysis of eye focus on faces during conversations. Various classification models were experimented with, including a Recurrent Neural Network (RNN) and a Support Vector Machine (SVM). With RNNs, they were able to reach an excellent classification accuracy of up to 91%. Likewise, a recent segment of our work demonstrated a quite promising classification accuracy (≈92%) using a variety of models including neural networks [3]. The models were trained based on the visual patterns of eye-tracking. As such, the diagnosis could be approached as a typical image classification task.
Compared to the literature, this study is concerned with gaining insights from unsupervised learning rather than developing predictive models. It is a continuation of our approach, which is exclusively based on the visual representation of eye-tracking scanpaths. In particular, the visual patterns are utilized here for learning clusters in the data. To the best of our knowledge, the idea of visual-based clustering of scanpaths has not yet been explored in the ASD context.

A. Participants
The study comprised a group of 59 children recruited from a number of French schools located in the region of Hauts-de-France. The participants were initially organized into two groups: i) ASD, or ii) non-ASD. Further, the severity of autism was classified by psychologists using the Childhood Autism Rating Scale (CARS) [4]. The CARS scheme has been largely adopted as a standard method for estimating the intensity of ASD symptoms [13]. Table 1 gives a summary of the participants.

B. Visualization of Eye-Tracking Scanpaths
A scanpath represents a sequence of consecutive fixations and saccades as a trace through time and space that may overlap itself [5]. Figure 1 gives a simple scanpath example that includes five fixations and four saccades.
Our representation of scanpaths follows the core idea of visualizing fixations and saccades. Furthermore, we aimed to visually encode the dynamics of gaze (e.g. velocity) using color gradients. With the coordinate/time data, we were able to calculate the velocity of gaze movement. Subsequently, the acceleration and jerk of movement could be computed based on the change in velocity and acceleration, respectively. Figure 2 presents a sample of the visualizations produced.
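As an illustration, the sketch below estimates these quantities from raw gaze samples by successive numerical differentiation. It is a minimal sketch, assuming the gaze data are available as arrays of x/y coordinates and timestamps; the function name and units are hypothetical rather than taken from our pipeline.

```python
import numpy as np

def gaze_dynamics(x, y, t):
    """Estimate velocity, acceleration, and jerk magnitudes from raw gaze samples.

    x, y : gaze coordinates (e.g. pixels); t : timestamps (e.g. seconds).
    Each successive derivative shortens the output by one sample.
    """
    dt = np.diff(t)
    # Velocity: displacement between consecutive samples over elapsed time
    velocity = np.hypot(np.diff(x), np.diff(y)) / dt
    # Acceleration: change in velocity over time
    acceleration = np.diff(velocity) / dt[1:]
    # Jerk: change in acceleration over time
    jerk = np.diff(acceleration) / dt[2:]
    return velocity, acceleration, jerk
```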
The visualization of scanpaths resulted in a dataset of 547 images (219 ASD, and 328 non-ASD). A comprehensive presentation of the data acquisition process was elaborated in an earlier publication [6]. Furthermore, the dataset, along with metadata files, has been made publicly available on the Figshare data repository [7].

A. Initial Features
Basic image processing was applied to reduce the problem dimensionality in general. First, the image dimensions were scaled down to 100x100. In addition, the RGB images were converted into grayscale for further simplification. This significantly reduced the number of features under consideration from 30K to 10K.
The initial feature set represented the grayscale pixels, which made up 10K features. However, this crude pixel-based representation unavoidably contained a large number of redundant features belonging to the image background. In this regard, data compression techniques were applied to reduce the feature set. Principal Component Analysis (PCA) was used to transform the 10K features into only 50 components. Likewise, the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique [8] was applied to reduce the dataset.
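For concreteness, the following sketch shows how such a pixel matrix can be built and compressed with Scikit-Learn. It is an illustrative outline under stated assumptions: the file list, the helper function, and the two-dimensional t-SNE embedding are our own choices for the example, not details reported above.

```python
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def load_grayscale(paths, size=(100, 100)):
    """Downscale each scanpath image to 100x100, convert to grayscale,
    and flatten it into a 10K-pixel feature vector scaled to [0, 1]."""
    return np.array(
        [np.asarray(Image.open(p).convert("L").resize(size), dtype=np.float32).ravel() / 255.0
         for p in paths]
    )

# image_paths is a hypothetical list of scanpath image files
X = load_grayscale(image_paths)                                 # shape: (n_samples, 10000)
X_pca = PCA(n_components=50).fit_transform(X)                   # 10K pixels -> 50 components
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # non-linear 2-D embedding
```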

B. Learning Features with Autoencoder
Autoencoders are a particular type of artificial neural network (ANN) that aims at learning compressed representations of input data, referred to as codings. In contrast to the usual applications of ANNs (e.g. classification), the learning process of autoencoders is entirely unsupervised. The compression and decompression functions are learned automatically from data examples rather than hand-crafted.
There is a variety of valid uses of autoencoders. Basically, autoencoders can be used as a means of dimensionality reduction, since the codings usually have a much lower dimensionality than the original input. Further, they can act as generative models; the learned codings allow for randomly generating synthetic samples very similar to the original data. Perhaps more importantly, autoencoders can serve as a potent mechanism for extracting features, which is their application in this study.
The autoencoder architecture is depicted in Figure 3. As shown, a multi-layered ANN performs the encoding and decoding of images. The encoder consists of five layers, starting from a 10K input layer that matches the image dimensions. The compression is conducted in sequence over four hidden layers. The number of neurons gradually decreases down to 500 units, which contain the final image codings. As such, the encoded representation was only 5% of the original dimensionality.
The decoder can simply be viewed as a flipped copy of the encoder. In an inverse fashion, the number of neurons progressively increases all the way back to the original dimension (i.e. 10K units). The autoencoder could then be trained by comparing the decoder output against the input image.
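A minimal Keras sketch of this architecture is given below. Only the 10K input and the 500-unit coding layer are taken from the description above; the intermediate layer sizes, activations, optimizer, and loss are illustrative assumptions rather than the exact configuration used here.

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, coding_dim = 10_000, 500   # 100x100 grayscale pixels -> 500-unit codings

# Encoder: the 10K input is compressed over four hidden layers down to the codings.
# The intermediate sizes (4000, 2000, 1000) are assumptions for illustration.
encoder = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(4000, activation="relu"),
    layers.Dense(2000, activation="relu"),
    layers.Dense(1000, activation="relu"),
    layers.Dense(coding_dim, activation="relu"),
])

# Decoder: a mirrored copy that expands the codings back to the input dimension.
decoder = keras.Sequential([
    layers.Input(shape=(coding_dim,)),
    layers.Dense(1000, activation="relu"),
    layers.Dense(2000, activation="relu"),
    layers.Dense(4000, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=100, validation_split=0.2)
# codings = encoder.predict(X)   # 500-dimensional features used for clustering
```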

A. K-Means Clustering
Four clustering models were developed using the standard K-Means algorithm along with the different feature sets explained before. In K-Means clustering, the very first question usually concerns the choice of the number of clusters (K). In our case, we largely had a plausible preconception regarding the clusters that could be expected to exist in the dataset, as follows.
The simplest clustering structure (K=2) would be partitioned into two broad clusters resembling the binary grouping of participants (i.e. ASD or non-ASD). A finer clustering, in contrast, would seek to organize the ASD samples into smaller chunks reflecting the intensity of autism as provided by the CARS score. Based on the severity of autism symptoms, the CARS scheme allows for organizing the autism spectrum into three groups: i) Low, ii) Mild, and iii) Severe. As such, the clustering models were experimented with for K = 2, 3, and 4.
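The clustering step can be reproduced along the following lines with Scikit-Learn. The dictionary layout and variable names are hypothetical; only the use of K-Means with K = 2 to 4 over the four feature sets reflects the setup described above.

```python
from sklearn.cluster import KMeans

# Hypothetical mapping of the four feature sets described above
feature_sets = {"pixels": X, "pca": X_pca, "tsne": X_tsne, "codings": codings}

cluster_labels = {}
for name, features in feature_sets.items():
    for k in (2, 3, 4):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        # Store the cluster assignment of each scanpath image
        cluster_labels[(name, k)] = km.fit_predict(features)
```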

B. Results: Quality of Clusters
The quality of clusters was examined based on the Silhouette method [9]. The Silhouette score has been widely used in clustering-related studies as an objective means to measure the robustness of cluster membership. Specifically, the score can be calculated for each point as follows:

S(i) = (b(i) − a(i)) / max(a(i), b(i))
where a(i) is the average distance of point i to all other points in its own cluster, and b(i) is the smallest average distance of point i to the points in any other cluster.
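In practice, both the per-point values and the mean score can be obtained directly from Scikit-Learn, as in the short sketch below; the features and labels are assumed to come from the clustering step above.

```python
from sklearn.metrics import silhouette_samples, silhouette_score

labels = cluster_labels[("codings", 2)]   # e.g. autoencoder features with K=2
features = feature_sets["codings"]

s_values = silhouette_samples(features, labels)   # per-point S(i)
mean_score = silhouette_score(features, labels)   # mean Silhouette over all samples
```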
Figure 6 compares the Silhouette scores achieved by the clustering models. As it appears, the score stood at its highest when K=2 in all cases. However, the quality of clusters declined for higher numbers of clusters (i.e. K=3, 4). In general, there was a poor separation of clusters in the case of pixel-based features as well as the compressed representations produced with PCA and t-SNE.
In contrast, the encoded features yielded a much better quality of clusters. Specifically, the Silhouette score surged to more than 0.6 for K=2. Moreover, the clusters largely maintained better quality for K=3 and 4 compared to the other clustering models. This clearly highlights the effectiveness of the features learned by the autoencoder.
The main outcome of the clustering experiments largely provides an answer to the first question included in our motivation. In other words, the experiments evidently indicated that the dataset inherently contains a clustering structure based on the visual representation of eye-tracking scanpaths.
The clustering experiments were implemented in Python using the Scikit-Learn library [10]. For the purpose of transparency and reproducibility, the output of the experiments and the code are shared in a Jupyter notebook [11] on the Azure platform. The experiments associated with the above-mentioned results can be seamlessly replicated.

VII. CLUSTER ANALYSIS
The analysis aimed mainly to investigate possible correlations in the clusters that could be linked to gaze behavior. Specifically, the clusters were inspected against the velocity and acceleration of eye movement. Figure 7 compares the average velocity of eye movement in the clusters (K=2). Interestingly, Cluster 2 was considerably higher in this regard, apart from a few outliers in Cluster 1. Similarly, Figure 8 shows that the acceleration of movement is higher in Cluster 2. It is worth mentioning that the majority of Cluster 2 belonged to the ASD class.
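This kind of per-cluster comparison can be summarized with a simple aggregation, sketched below under the assumption that the per-image average velocity, acceleration, and ASD flags are available as arrays; the column names and variables are hypothetical.

```python
import pandas as pd

# Hypothetical frame: one row per scanpath image
df = pd.DataFrame({
    "cluster": labels,               # cluster assignment (e.g. K=2 on the codings)
    "velocity": avg_velocity,        # average gaze velocity per image
    "acceleration": avg_acceleration,
    "asd": asd_flags,                # 1 if the participant is ASD-diagnosed
})

# Per-cluster means of the gaze dynamics and the proportion of ASD samples
print(df.groupby("cluster")[["velocity", "acceleration", "asd"]].mean())
```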

VIII. CONCLUSIONS
Reflecting on the motivational questions, the results provide a set of implications to be considered. First, the clustering experiments empirically confirmed that eye-tracking scanpaths could be grouped into coherent clusters, which largely resembled the original grouping of samples (i.e. ASD or non-ASD). This implies that scanpath visualizations can effectively discriminate the ASD-diagnosed samples from others. This gives promising prospects for employing such visual patterns to develop assistive diagnostic tools based on sophisticated classification models.
From a practical standpoint, the cluster analysis revealed potential connections between the dynamics of gaze behavior and autism. We believe the clusters could serve as a kernel for deeper analysis to bring interesting insights into the context of autism and diagnosis-related applications. Further, the study demonstrated how the deep autoencoder played a key role in learning compressed representations of sparse visual-based features.
Figure 4 plots the training and validation loss over 100 epochs of training the autoencoder model.

Fig. 5. Examples of autoencoder output. The first row gives the original images, while the second row shows the images reconstructed by the autoencoder.

Figure 5 demonstrates some examples output by the autoencoder. The figure compares a couple of images against their reconstructed representations. Though they are not identical, it appears that the reconstructed images capture the key features of the original examples. The autoencoder was implemented in Python using the Keras library [12].

Fig. 6. Quality of clusters based on the Silhouette score.

Figure 9 similarly compares the clusters (K=3) in terms of velocity. It appears again that the velocity of eye gaze was higher in the clusters that included larger proportions of ASD participants. This suggests possible links between autism and the dynamics of gaze behavior.

Table 2.
The table compares the clusters in terms of the proportion of ASD-diagnosed participants and the average age per cluster. For K=2, the ASD percentage was notably higher in Cluster 2. This indicates that the clusters had a coherent structure resembling the original grouping of participants (i.e. ASD or non-ASD). Likewise, for the finer clustering, the distribution of ASD samples was much higher in Cluster 2 and Cluster 3. In general, the average age of participants was nearly the same in all clusters.