Statistical inference problems with applications to computational structural biology

2017-03-02T23:14:18Z (GMT) by Kasarapu, Parthan
In this data-pervasive world, efficient and accurate modelling of data is crucial for supporting reliable analyses and improving solutions to related problems. To describe given data, the problem of selecting a suitable model has to be addressed carefully. Traditional approaches to the problem of optimal model selection have relied predominantly on the number of model parameters rather than the actual parameters themselves. This limits the ability of traditional methods to correctly distinguish among models that are of different types but have the same number of parameters. To address the problem of model selection satisfactorily, this thesis explores the Bayesian information-theoretic principle of minimum message length (MML). The inference framework based on the MML principle enables the optimal selection of models by using the constituent parameters to better balance the trade-off between the model’s complexity and its goodness-of-fit to the data. The core of this thesis explores MML-based inference of some commonly used probability distributions whose parameters have not yet been characterized, and of mixtures of these distributions. These distributions allow accurate modelling of data in Euclidean space and of data that is directional in nature. These probabilistic models and their mixtures have widespread uses in statistical machine learning tasks. In this context, we have developed a general-purpose search method to determine the optimal number of mixture components and their parameters that describe the given data in a completely unsupervised setting. The use of the MML modelling paradigm and our proposed search method is explored in detail on a variety of real-world data, specifically on directional text data and on the spatial orientation data of protein three-dimensional structures.
Further, mixtures of directional probability distributions have facilitated the design of reliable computational models for protein structural data. In addition, the inference framework has been used to represent protein folding patterns concisely using a combination of non-linear parametric curves. The results of this work have a wide variety of important uses, including direct applications in protein structural biology.