Exploring relational features and learning under distant supervision for information extraction tasks.

2017-03-01T05:25:49Z (GMT) by Nagesh, Ajay
Information Extraction (IE) has become an indispensable tool in our quest to handle the data deluge of the information age. IE can broadly be decomposed into Named-entity Recognition (NER) and Relation Extraction (RE). In this thesis, we view the task of IE as finding patterns in unstructured data, patterns which can take the form of features and/or be specified by constraints.

In NER, we study the categorization of complex relational features and outline methods to learn feature combinations through induction. We demonstrate the efficacy of induction techniques in learning: (i) rules for the identification of named entities in text, where the novelty is the application of induction techniques to learn in a very expressive declarative rule language; and (ii) a richer sequence labeling model, enabling optimal learning of discriminative features.

In RE, our investigations are in the paradigm of distant supervision, which facilitates the creation of large, albeit noisy, training data. We devise an inference framework in which constraints can be easily specified when learning relation extractors. In addition, we reformulate the learning objective in a max-margin framework. To the best of our knowledge, our formulation is the first to optimize multivariate non-linear performance measures, such as Fβ, for a latent-variable structured prediction task.

Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the Indian Institute of Technology Bombay, India, and Monash University, Australia.
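To make the distant-supervision setting concrete, the following is a minimal sketch (not code from the thesis) of the standard labelling heuristic: any sentence that mentions both entities of a knowledge-base fact is heuristically labelled with that fact's relation, which yields large but noisy training data. The knowledge-base triples and sentences here are illustrative assumptions.

```python
# Distant-supervision labelling heuristic (illustrative sketch).
# KB maps an entity pair to the relation asserted between them.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Mountain View"): "headquartered_in",
}

def distant_label(sentence, kb=KB):
    """Return (e1, e2, relation) labels for every KB entity pair
    co-occurring in the sentence. The labels are noisy: co-occurrence
    does not guarantee that the sentence expresses the relation."""
    labels = []
    for (e1, e2), rel in kb.items():
        if e1 in sentence and e2 in sentence:
            labels.append((e1, e2, rel))
    return labels

sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Barack Obama gave a speech in Honolulu.",  # noisy match: not born_in
    "Google opened a new office in Paris.",     # no KB pair co-occurs
]
for s in sentences:
    print(distant_label(s))
```

The second sentence illustrates the noise that motivates the thesis's constraint-based inference and max-margin reformulation: the heuristic labels it `born_in` even though the sentence does not express that relation.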