W15-3209.pdf (149.42 kB)

A Pilot Study on Arabic Multi-Genre Corpus Diacritization Annotation

journal contribution
posted on 2018-07-26, 00:00 authored by Houda Bouamor, Wajdi Zaghouani, Mona Diab, Ossama Obeid, Kemal Oflazer, Mahmoud Ghoneim, Abdelati Hawwari
Arabic script is typically underspecified for short vowels and other markup, referred to as diacritics. Apart from the lexical ambiguity found in words, similar to that exhibited in other languages, the lack of diacritics in written Arabic adds another layer of ambiguity that is an artifact of the orthography. Diacritization of written text has a significant impact on Arabic NLP applications. In this paper, we present a pilot study on building a diacritized multi-genre corpus in Arabic. We annotate a sample of non-diacritized words extracted from five text genres. We explore three annotation strategies: Basic, where we present only the bare undiacritized forms to the annotators; Intermediate (Basic forms plus their POS tags); and Advanced (automatically diacritized words). We report the impact of the annotation strategy on annotation quality. Moreover, we study different diacritization schemes in the process.

History

Publisher Statement

Published in Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 80–88, Beijing, China, July 26-31, 2015.

Date

2018-07-26
