figshare
Browse
2012MuzammalMPhd.pdf (1.47 MB)

Mining Sequential Patterns from Probabilistic Data

Download (1.47 MB)
thesis
posted on 2012-12-20, 11:37 authored by Muhammad Muzammal
Sequential Pattern Mining (SPM) is an important data mining problem. Although it is assumed in classical SPM that the data to be mined is deterministic, it is now recognized that data obtained from a wide variety of data sources is inherently noisy or uncertain, such as data from sensors or data being collected from the web from different (potentially conflicting) data sources. Probabilistic databases is a popular framework for modelling uncertainty. Recently, several data mining and ranking problems have been studied in probabilistic databases. To the best of our knowledge, this is the first systematic study of mining sequential patterns from probabilistic databases. In this work, we consider the kind of uncertainties that could arise in SPM. We propose four novel uncertainty models for SPM, namely tuple-level uncertainty, event-level uncertainty, source-level uncertainty and source-level uncertainty in deduplication, all of which fit into the probabilistic databases framework, and motivate them using potential real-life scenarios. We then define the interestingness predicate for two measures of interestingness, namely expected support and probabilistic frequentness. Next, we consider the computational complexity of evaluating the interestingness predicate, for various combinations of uncertainty models and interestingness measures, and show that different combinations have very different outcomes from a complexity theoretic viewpoint: whilst some cases are computationally tractable, we show other cases to be computationally intractable. We give a dynamic programming algorithm to compute the source support probability and hence the expected support of a sequence in a source-level uncertain database. We then propose optimizations to speedup the support computation task. Next, we propose probabilistic SPM algorithms based on the candidate generation and pattern growth frameworks for the source-level uncertainty model and the expected support measure. We implement these algorithms and give an empirical evaluation of the probabilistic SPM algorithms and show the scalability of these algorithms under different parameter settings using both real and synthetic datasets. Finally, we demonstrate the effectiveness of the probabilistic SPM framework at extracting meaningful patterns in the presence of noise.

History

Supervisor(s)

Raman, Rajeev; Hoffmann, Michael

Date of award

2012-10-01

Awarding institution

University of Leicester

Qualification level

  • Doctoral

Qualification name

  • PhD

Language

en

Usage metrics

    University of Leicester Theses

    Categories

    Keywords

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC