Exploratory data analysis using scalable self-organising maps

Exploratory data analysis is used to derive insights from large volumes of data. Unsupervised learning methods, such as the self-organising map (SOM) and the growing self-organising map (GSOM), have gained popularity as data exploration tools because meta-information about real-world datasets is often limited. The key advantages of the SOM and the GSOM in exploratory data analysis are their visualisation and summarisation features. However, the application of SOM-based techniques to large-scale data exploration has been limited by their long processing times. Distributed computing has emerged as a means of providing large amounts of computing power for data-intensive and compute-intensive applications. A number of parallel and distributed algorithms have been proposed for SOM-based learning. However, none of the current distributed SOM algorithms possesses all the desirable features of data-intensive distributed algorithms: a distributed memory model, data parallelism, a horizontal data layout and the ability to process both sparse and dense data. This thesis presents a distributed SOM model built on a distributed memory architecture that uses data parallelism with a horizontal data layout. The distributed SOM model employs a divide-and-conquer architecture in four stages: data partitioning, SOM training in parallel, redundancy reduction and topographic mapping. Both sparse and dense datasets are used to demonstrate the efficiency and the clustering accuracy of the algorithm, which shows up to a 99% reduction in processing time while maintaining similar levels of clustering accuracy. The distributed SOM model has the advantage of being able to use any SOM technique as the learning engine. However, the results demonstrate that, in exploratory data analysis, the GSOM with its dynamic structure represents the dataset better than the SOM with its static structure.
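The four stages named above (data partitioning, SOM training in parallel, redundancy reduction and topographic mapping) can be illustrated with a minimal sketch in plain Python. This is not the thesis implementation: the tiny one-dimensional SOM, the greedy distance-threshold merge, the nearest-neighbour linking and every function name are assumptions chosen only to make the data flow concrete, and threads stand in for the separate machines of a distributed memory system.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def partition(rows, k):
    """Stage 1: horizontal layout, so each partition holds whole rows."""
    return [rows[i::k] for i in range(k)]


def train_som(rows, n_nodes=4, epochs=20, seed=0):
    """Stage 2: a tiny 1-D SOM (chain of nodes) trained on one partition."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(rows)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)           # decaying learning rate
        radius = 1 if epoch < epochs // 2 else 0  # shrinking neighbourhood
        for row in rows:
            # Best matching unit: the node closest to the input row.
            bmu = min(range(n_nodes), key=lambda i: dist(nodes[i], row))
            lo, hi = max(0, bmu - radius), min(n_nodes, bmu + radius + 1)
            for i in range(lo, hi):
                nodes[i] = [w + lr * (x - w) for w, x in zip(nodes[i], row)]
    return nodes


def reduce_redundancy(nodes, threshold):
    """Stage 3 (assumed rule): drop nodes within `threshold` of a kept node."""
    kept = []
    for node in nodes:
        if all(dist(node, other) > threshold for other in kept):
            kept.append(node)
    return kept


def topographic_links(nodes):
    """Stage 4 (stand-in): link every node to its nearest other node."""
    links = set()
    for i, node in enumerate(nodes):
        others = [j for j in range(len(nodes)) if j != i]
        if others:
            j = min(others, key=lambda j: dist(node, nodes[j]))
            links.add((min(i, j), max(i, j)))
    return sorted(links)


def distributed_som(rows, k=2, threshold=0.5):
    """Run the four stages end to end; threads stand in for worker machines."""
    parts = partition(rows, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        maps = list(pool.map(train_som, parts))
    merged = reduce_redundancy([n for m in maps for n in m], threshold)
    return merged, topographic_links(merged)
```

Because `train_som` only needs a list of rows and returns a list of weight vectors, any other SOM variant with the same interface could be substituted as the learning engine in this sketch.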
An incremental data integration model for the Distributed GSOM is proposed to keep the analysis current as new data becomes available. The incremental model reuses components of the Distributed GSOM to incorporate new data into an existing network, thus avoiding the need to re-train the entire network. Results indicate that the topographic mappings generated by incremental data presentation are almost identical to the maps generated by the Distributed GSOM on the entire dataset. The applicability of the distributed SOM model to real-world distributed computing technologies is demonstrated by implementing the Distributed GSOM on Hadoop, a popular MapReduce distributed computing framework. A MapReduce architecture is developed for the Distributed GSOM in which combiners are used to further improve the efficiency of the algorithm. The efficiency of the distributed SOM model is improved further by developing MapReduce processes for the four data partitioning methods. The effectiveness of the Distributed GSOM is demonstrated by exploring a real-world dataset for electricity consumption profile identification. Twelve gigabytes of smart electricity meter data are processed in order to profile customer behaviours. Heuristic-based partitioning is used to improve the quality of the output by incorporating outliers into the analysis process. Electricity consumption profiles are identified over multiple time intervals using the Distributed GSOM. A multi-granular profile generation framework is proposed to combine the outcomes of the analysis at short-term, medium-term and long-term granularity levels. The results demonstrate that the Distributed GSOM identifies the most prominent and distinctive electricity consumption profiles whilst reducing the processing time of the analysis by 95%.
In summary, this thesis presents a practical, distributed SOM model for exploratory analysis of large datasets. The work extends current knowledge and technology of self-organising maps and enhances the practical value of exploratory analysis in big data environments. The effectiveness of the model is demonstrated through implementing the Distributed GSOM on Hadoop and using the algorithm for customer profile identification from real-world data.
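The efficiency gain attributed to combiners in the Hadoop implementation can be illustrated with a small simulation in plain Python; it is not Hadoop code. The sketch assumes a batch-style update in which each node's new weight is the mean of the input rows it wins. The abstract does not state the update rule actually used by the Distributed GSOM, and all function names here are hypothetical.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mapper(row, nodes):
    """Emit (winning node index, (vector sum, count)) for one input row."""
    bmu = min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], row))
    return bmu, (list(row), 1)


def merge(a, b):
    """Associative, commutative merge of two (vector sum, count) pairs."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def combiner(pairs):
    """Pre-aggregate one mapper's output so fewer records are shuffled."""
    partial = {}
    for key, value in pairs:
        partial[key] = merge(partial[key], value) if key in partial else value
    return list(partial.items())


def reducer(key, values):
    """Turn the merged sums for one node into its new weight (the mean)."""
    total = values[0]
    for value in values[1:]:
        total = merge(total, value)
    return key, [s / total[1] for s in total[0]]


def run_job(splits, nodes, use_combiner=True):
    """Simulate one MapReduce pass; returns new weights and shuffle size."""
    shuffled = defaultdict(list)
    records_shuffled = 0
    for split in splits:  # one simulated mapper per input split
        pairs = [mapper(row, nodes) for row in split]
        if use_combiner:
            pairs = combiner(pairs)
        records_shuffled += len(pairs)
        for key, value in pairs:
            shuffled[key].append(value)
    updated = dict(reducer(k, v) for k, v in shuffled.items())
    return updated, records_shuffled
```

Emitting sums and counts rather than means keeps the merge associative and commutative, which matters because Hadoop may apply a combiner zero, one or several times. In this sketch the combiner changes only the number of records shuffled between the map and reduce phases, not the resulting weights.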