SahulPhonotactics_ALT_2017.pdf (724.66 kB)

High-definition phonotactic typology in Sahul: Choosing the data-rich approach

journal contribution
posted on 2018-02-06, 23:38 authored by Jayden L. Macklin-Cordes
Conference presentation at the 12th Conference of the Association of Linguistic Typology, Australian National University, Canberra, Australia, 13 December 2017.

Abstract:

The challenge for linguistic typology, of how best to compare like with like among different languages, is a constant concern but also an evolving one. As automated analyses using large-scale cross-linguistic databases (for example, Blasi et al., 2016) become more popular, typology becomes ever more sensitive to decisions made about how linguistic data is coded. Relatedly, challenges present themselves when expanding quantitative techniques from pilot studies to larger domains. Here I discuss such challenges and compare two kinds of solutions for the case of a recent methodological advance in phonological typology, arguing against one widely used approach based on data impoverishment, in favour of a novel, data-rich approach.

I first recap a quantitative method for investigating the phonotactic typology of Australian languages, which can generate several hundred characters (i.e., variables) per language (Macklin-Cordes & Round, 2015). The phonotactics of languages may be compared, in the simplest case, in terms of which sequences of segments each language permits or does not permit, and this kind of data is often provided in descriptive grammars (Fig. 1). On these terms, Australian languages have been described as notably homogeneous in their phonotactic systems (Hamilton, 1996), as in their phonological systems more generally (Dixon, 1980; Baker, 2014). However, Gasser & Bowern (2014) demonstrate that greater diversity among phonological inventories of Australian languages emerges when matters of frequency are considered. We extend this idea to phonotactics. For every possible two-segment sequence a+b in a language, we calculate the frequency with which a+b occurs relative to all sequences a+X (a followed by any segment X) in the language’s vocabulary. This yields an n×n matrix of continuous characters for each language, where n is the number of segments (Fig. 2). These data can be compared between languages and can even form the basis for diachronic investigation. A case study examining two Pama-Nyungan subgroups, Yolngu and Ngumpin-Yapa, found evidence of phylogenetic signal across continuous phonotactic characters extracted by the method described here; that is, variation among the data fits the hypothesis that the data evolved along an independently derived phylogeny (from Bowern & Atkinson, 2012).
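The frequency calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the toy lexicon, the function names `biphone_frequencies` and `to_matrix`, the choice to count over word types, and the decision to ignore word boundaries are all hypothetical.

```python
from collections import Counter

def biphone_frequencies(words):
    """Relative frequency of each sequence a+b among all sequences a+X.

    `words` is an iterable of words, each already split into segments.
    Assumption: counts are taken over word types, without word boundaries.
    """
    pair_counts = Counter()   # occurrences of each a+b
    first_counts = Counter()  # occurrences of a followed by any segment X
    for word in words:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def to_matrix(freqs, segments):
    """Arrange the frequencies as an n x n table; unattested pairs get 0.0."""
    return [[freqs.get((a, b), 0.0) for b in segments] for a in segments]

# Hypothetical toy lexicon (invented forms, not data from any language).
lexicon = [
    ["m", "a", "n", "a"],
    ["n", "a", "m", "i"],
    ["m", "i", "n", "a"],
]

freqs = biphone_frequencies(lexicon)
segments = sorted({seg for word in lexicon for seg in word})
matrix = to_matrix(freqs, segments)

for seg, row in zip(segments, matrix):
    print(seg, [round(value, 2) for value in row])
```

In this toy lexicon, m is followed by i twice and by a once, so the character for m+i is 2/3 and the character for m+a is 1/3. Each row of the matrix holds the continuous characters for one first segment.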

These tentative findings make us optimistic that our avenue of methodological development may drive future insight in typology and historical linguistics. However, expansion to areas with a relatively high diversity of segment types is a challenge. Non-trivial data-processing decisions must be made, since a two-segment sequence can only be compared between languages when both of its segments are treated as shared by the languages in question. A large diversity of segment types therefore produces many mismatches between languages and, as a result, too sparse a dataset. I compare two approaches which respond to this problem, in the context of expanding our proof-of-concept to the rest of Sahul, the continent of Australia and New Guinea. The first approach, that of the Automated Similarity Judgement Program (ASJP; Wichmann et al., 2016), subsequently used in several large-scale comparative studies (Blasi et al., 2016; Holman et al., 2011), is to bin segments into broader segment classes (for example, an ‘N’ bin for all nasal consonants). The second approach is to break up segments into their constituent features. I present evidence which favours the second approach. The first approach results in information loss and can introduce biases into the data (Macklin-Cordes, Moran & Round, 2016). In contrast, the second approach enables a rich level of phonological diversity to be represented in a commensurate fashion, while maintaining a large enough collection of characters for data-hungry statistical algorithms.
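The contrast between the two approaches can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `BINS` mapping is a hypothetical coarse classing and not the actual ASJP symbol set, the `FEATURES` table is a simplified invented feature system, and both toy languages are made up. The feature treatment shown (one sequence of values per feature) is one possible reading of "breaking up segments into their constituent features", not necessarily the procedure used in the study.

```python
from collections import Counter

# Hypothetical coarse bins (NOT the real ASJP coding).
BINS = {"m": "N", "n": "N", "ɳ": "N", "p": "P", "t": "T", "a": "V", "i": "V"}

# Hypothetical, simplified feature table.
FEATURES = {
    "m": {"manner": "nasal", "place": "labial"},
    "n": {"manner": "nasal", "place": "alveolar"},
    "ɳ": {"manner": "nasal", "place": "retroflex"},
    "p": {"manner": "stop", "place": "labial"},
    "t": {"manner": "stop", "place": "alveolar"},
    "a": {"manner": "vowel", "place": "low"},
    "i": {"manner": "vowel", "place": "high"},
}

def transition_frequencies(sequences):
    """Frequency of a+b relative to all a+X, for any kind of symbol."""
    pair_counts, first_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def binned(words):
    """Approach 1: replace each segment with its coarse class."""
    return [[BINS[seg] for seg in word] for word in words]

def by_feature(words, feature):
    """Approach 2: replace each segment with its value for one feature."""
    return [[FEATURES[seg][feature] for seg in word] for word in words]

def shared_characters(seqs_a, seqs_b):
    """Characters (symbol pairs) attested in both languages."""
    return set(transition_frequencies(seqs_a)) & set(transition_frequencies(seqs_b))

# Two invented toy languages with partly different segment inventories.
lang_a = [["m", "a", "n", "a"], ["n", "a", "m", "i"]]
lang_b = [["ɳ", "a", "t", "i"], ["m", "a", "ɳ", "a"]]

raw_shared = shared_characters(lang_a, lang_b)
bin_shared = shared_characters(binned(lang_a), binned(lang_b))
place_shared = shared_characters(by_feature(lang_a, "place"), by_feature(lang_b, "place"))
manner_shared = shared_characters(by_feature(lang_a, "manner"), by_feature(lang_b, "manner"))

print("raw segments:", raw_shared)
print("binned:", bin_shared)
print("place feature:", place_shared)
print("manner feature:", manner_shared)
```

With raw segments, only m+a is comparable across the two toy languages, which is the sparsity problem. Binning recovers comparable characters but erases the place distinctions among nasals. The feature-based version yields comparable characters for each feature separately, so place information is kept alongside manner.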

Phonological typology is as sensitive as ever to how data is coded. In the face of upscaling challenges, I present one methodological pathway for comparing high-definition phonotactics over the diverse macro-area of Sahul.

Baker, B. (2014). Word structure in Australian languages. In H. Koch & R. Nordlinger, The Languages and Linguistics of Australia: A Comprehensive Guide (pp. 139–214). Walter de Gruyter GmbH & Co KG.
Blasi, D. E., Wichmann, S., Hammarström, H., Stadler, P. F., & Christiansen, M. H. (2016). Sound–meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences, 113(39), 10818–10823.
Bowern, C., & Atkinson, Q. (2012). Computational Phylogenetics and the Internal Structure of Pama-Nyungan. Language, 88(4), 817–845.
Dixon, R. M. W. (1980). The languages of Australia. Cambridge; New York: Cambridge University Press.
Gasser, E., & Bowern, C. (2014). Revisiting Phonotactic Generalizations in Australian Languages. Proceedings of the Annual Meetings on Phonology.
Hamilton, P. J. (1996). Phonetic constraints and markedness in the phonotactics of Australian Aboriginal languages (Ph.D. thesis). University of Toronto (Canada).
Holman, E. W., Brown, C. H., Wichmann, S., Müller, A., Velupillai, V., Hammarström, H., ... Egorov, D. (2011). Automated dating of the world’s language families based on lexical similarity. Current Anthropology, 52(6), 841–875.
Macklin-Cordes, J. L., Moran, S. P., & Round, E. R. (2016). Evaluating information loss from phonological dimensionality reduction. Presented at the SST2016 Satellite Symposium: The role of predictability in shaping human language sound patterns, Western Sydney University, Australia. https://doi.org/10.6084/m9.figshare.4465976
Macklin-Cordes, J. L., & Round, E. R. (2015). High-definition phonotactics reflect linguistic pasts. In J. Wahle, M. Köllner, H. Baayen, G. Jäger, & T. Baayen-Oudshoorn (Eds.), Proceedings of the 6th Conference on Quantitative Investigations in Theoretical Linguistics. Tübingen: University of Tübingen.
Nichols, J. (1997). Sprung from two common sources: Sahul as a linguistic area. In P. McConvell & N. Evans (Eds.), Archaeology and Linguistics: Aboriginal Australia in global perspective. Melbourne: Oxford University Press.
Wichmann, S., Holman, E. W., & Brown, C. H. (Eds.). (2016). The ASJP database (version 17).

Funding

Australian Government Research Training Program (J. Macklin-Cordes); National Science Foundation grant NSF1423711 (C. Bowern); Australian Research Council grant DE150101024 (E. Round); ARC CoE in Dynamics of Language Evolution grant (E. Round)
