Mā te kupu te whenua e ora ai: The Challenges of Geospatial Natural Language Processing with New Zealand Māori. NZGRC 2022.
Recent years have seen concerted efforts to revitalise New Zealand Māori, the indigenous language of Aotearoa New Zealand, after earlier attempts at suppression during colonisation. Automated methods for natural language processing, together with the increasing availability of written Māori language resources, have great potential for extracting knowledge from text to increase understanding of current and historical Māori worldviews. In the geographic domain, these methods can be used to increase knowledge of Māori conceptualisations of landscape and to enable information retrieval for purposes such as mapping species distributions and disaster events. However, most existing tools are based on the form and syntax of English and other well-resourced languages, and they pose challenges when applied to Māori, including a lack of annotated data, inappropriate grammatical assumptions and high levels of polysemy. We discuss these challenges as discovered during (1) the creation of a large Māori corpus through the amalgamation of multiple other language resources; and (2) the comparison of five rule-based and machine-learning bag-of-words methods for identifying the geographic senses of a set of 11 geographic feature-type words, many of which have multiple other meanings.