Reflections of Linguistic History in Quantitative Phonotactics
Advanced quantitative methods are at the cutting edge of historical linguistics; however, these methods ideally require many hundreds of data points per language. In order to generate reliable inferences at ever greater time depths, there is a need for typological datasets which are not only broader in coverage, but also contain a deeper store of information. We explore one avenue by extracting large numbers of high-definition phonotactic ‘traits’ per language. We show that these traits contain phylogenetic signal, thus demonstrating an important path towards the high-powered methods of the near future.
Methodology: Languages may be compared in terms of which two-segment sequences they permit. Moreover, such biphones possess distinct lexical frequencies, which can also be compared. We examined whether such data contain information about family-tree structure, i.e., phylogenetic signal. Two standard statistics are used: D [1] tests coarse-grained biphone ‘permissibility’ data; and K [2] tests higher-definition transition probabilities.
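The two kinds of trait described above can be illustrated with a minimal sketch. It assumes each lexicon is a list of words already split into phonemic segments, that word boundaries count as a segment (marked with a hypothetical `#` symbol), and that "transition probability" means the forward probability of the second segment given the first. None of these choices is confirmed by the text; they are illustrative only.

```python
from collections import Counter

def biphone_traits(lexicon, boundary="#"):
    """Return (permitted, transition_probs) for a list of segmented words.

    permitted: dict mapping each attested biphone to 1 (binary trait).
    transition_probs: dict mapping each attested biphone (a, b) to the
    relative frequency of b following a in the lexicon.
    Counting word boundaries as segments is an assumption of this sketch.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for word in lexicon:
        segments = [boundary, *word, boundary]
        for a, b in zip(segments, segments[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    permitted = {pair: 1 for pair in pair_counts}
    transition_probs = {
        (a, b): count / first_counts[a] for (a, b), count in pair_counts.items()
    }
    return permitted, transition_probs

# Toy lexicon (invented forms, not taken from any language in the study).
toy_lexicon = [["p", "a"], ["p", "i"], ["a", "p", "a"]]
permitted, transition_probs = biphone_traits(toy_lexicon)
```

To compare languages, a biphone missing from one language's `permitted` dictionary would be scored 0 for that language, for example with `permitted.get(pair, 0)`.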
We examined two subgroups of the Australian Pama-Nyungan family: 10 languages of Ngumpin-Yapa [3] and 7 of Yolngu [4], represented by phonemically-standardised lexicons from the CHIRILA database [5]. Phylogenetic signal was calculated with reference to phylogenies from C. Bowern (updated from [6]). Australian languages present a tough challenge, since phonotactically they are notoriously uniform [7–9]. Moreover, Ngumpin-Yapa has some of the world’s highest borrowing rates [10–11]. We therefore hypothesized that the coarse-grained D test would fail. The key question is whether the high-definition K test succeeds.
Results: D attempts to reject two null hypotheses: that traits’ distributions are (A) too uniform to reveal structure present in the reference tree; and (B) random. We extracted 184 (Ngumpin-Yapa) and 164 (Yolngu) traits per language. We were surprised to reject both hypotheses for Yolngu (Stouffer’s Z>100, p=0.00): thus, even binary permissibility data revealed some phylogenetic signal. For Ngumpin-Yapa, only null hypothesis (B) could be rejected (p=0.00), and further testing showed that even this rejection failed once the subgroup’s outermost language was removed. We conclude that binary phonotactic data contain weak phylogenetic signal at best; the Yolngu result may represent statistical noise, and more subgroups should be tested.
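Stouffer’s Z combines many per-trait p-values into one test statistic. The sketch below shows the standard unweighted form, Z = Σzᵢ/√k with zᵢ the normal quantile of 1 − pᵢ. The function name, the clipping constant `eps`, and the toy p-values are assumptions for illustration; the text does not say whether weights were used.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_z(p_values, eps=1e-12):
    """Combine one-sided p-values with the unweighted Stouffer method.

    Each p-value is clipped to (eps, 1 - eps) so the normal quantile is
    finite. Returns the combined Z and its one-sided p-value.
    """
    norm = NormalDist()
    z_scores = [norm.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps)) for p in p_values]
    z = sum(z_scores) / sqrt(len(z_scores))
    combined_p = 1.0 - norm.cdf(z)
    return z, combined_p

# Toy input: four traits, each individually at p = 0.05.
z, p = stouffer_z([0.05, 0.05, 0.05, 0.05])
```

The method assumes the combined tests are independent. Biphone traits drawn from one lexicon are unlikely to be fully independent, so a combined Z of this kind is best read as indicative.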
K attempts to reject one null hypothesis: that no phylogenetic signal is present. A value of K=0 represents random trait distribution relative to the reference tree; K=1 represents an exact match; and K>1 indicates that outermost languages are even more distinct in the test data than in the reference tree. With 451 (Ngumpin-Yapa) and 541 (Yolngu) traits per language, we reject the null hypothesis in both subgroups (Stouffer’s Z=9.87 and 17.6 respectively, p=0.00). In Ngumpin-Yapa, the confidence interval for K of [0.86, 0.92] indicates a very good match with the reference phylogeny, and in Yolngu, [1.15, 1.26] indicates an even stronger sorting of languages. Further testing, which removed the outermost language from each subgroup, showed that the result is stable: [0.81, 0.87] for Ngumpin-Yapa and [0.96, 1.00] for Yolngu.
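For readers unfamiliar with K, the following is a minimal sketch of one standard formulation: the observed ratio of mean squared errors (around the phylogenetically weighted mean, without and with the tree's covariance structure) divided by the ratio expected under Brownian motion. The matrix `cov` holds shared branch lengths between languages. The toy tree, the trait values, and the small pure-Python solver are assumptions of this sketch, and it is not claimed to reproduce the computation behind the reported results.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        partial = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - partial) / aug[row][row]
    return x

def blomberg_k(trait, cov):
    """Return K for one continuous trait given a phylogenetic covariance matrix."""
    n = len(trait)
    u = solve(cov, [1.0] * n)            # C^-1 applied to a vector of ones
    denom = sum(u)                       # 1' C^-1 1
    root = sum(ui * xi for ui, xi in zip(u, trait)) / denom  # weighted mean
    resid = [xi - root for xi in trait]
    w = solve(cov, resid)                # C^-1 applied to the residuals
    observed = sum(r * r for r in resid) / sum(r * wi for r, wi in zip(resid, w))
    trace = sum(cov[i][i] for i in range(n))
    expected = (trace - n / denom) / (n - 1)
    return observed / expected

# Toy tree ((A,B),(C,D)) of height 1 with both splits at depth 0.5.
toy_cov = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
k_structured = blomberg_k([1.0, 1.1, 3.0, 3.2], toy_cov)  # values follow the clades
k_scrambled = blomberg_k([1.0, 3.0, 1.1, 3.2], toy_cov)   # values cut across clades
```

In this toy case the clade-structured trait yields K above 1 and the scrambled trait yields K below 1, which mirrors the interpretation of K given above.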
Conclusion: As linguists attempt to scale up efforts in quantitative historical linguistics, we demonstrate the significant potential of high-definition phonotactics, which permits the extraction of several hundred traits per language and has revealed phylogenetic signal in two Australian subgroups.
| Subgroup | n(traits) | Mean D | SD | MFDR-CI (a) |
|---|---|---|---|---|
| Ngumpin-Yapa | 184 | 0.372 | 3.592 | [−0.23, 0.97] |
| Yolngu | 164 | −1.486 | 4.269 | [−2.97, −0.73] |
Figure 1: Results for coarse-grained, binary data (D test)
| Subgroup | n(traits) | Mean K | SD | MFDR-CI (a) |
|---|---|---|---|---|
| Ngumpin-Yapa | 451 | 0.893 | 0.27 | [0.86, 0.92] |
| Yolngu | 541 | 1.206 | 0.595 | [1.15, 1.26] |
Figure 2: Results for high-definition, continuous data (K test)
(a) MFDR-CI: Benjamini–Hochberg [12] mean false discovery rate adjusted confidence interval.
References
[4] B. Schebeck, “Dialect and Social Groupings in North East Arnhem Land,” typescript, Australian Institute of Aboriginal and Torres Strait Islander Studies Library, Canberra, 1968.
[5] C. Bowern, “Chirila: Contemporary and Historical Resources for Indigenous Languages of Australia,” Language Documentation and Conservation, vol. 10. http://nflrc.hawaii.edu/ldc/
[7] R.M.W. Dixon, The Languages of Australia, Cambridge: Cambridge University Press, 1980.
[8] P.J. Hamilton, “Phonetic constraints and markedness in the phonotactics of Australian languages,” Ph.D. dissertation, University of Toronto, 1996.
[9] B. Baker, “Word structure in Australian languages,” in The Languages and Linguistics of Australia: A comprehensive guide, H. Koch and R. Nordlinger, Eds. Berlin: De Gruyter Mouton, 2014, pp. 139–214.
[10] C. Bowern et al., “Does lateral transmission obscure inheritance in hunter-gatherer languages?” PLoS One, 2011: e25195.
[11] P. McConvell, “Loanwords in Gurindji, a Pama-Nyungan language of Australia,” in Loanwords in the world's languages: A comparative handbook, M. Haspelmath and U. Tadmor, Eds. Berlin: Mouton de Gruyter, 2009, pp. 790–822.