Comparing metrics for quantification of children’s tongue shape complexity using ultrasound imaging

ABSTRACT Speech sound disorders can pose a challenge to communication in children that may persist into adulthood. As some speech sounds are known to require differential control of anterior versus posterior regions of the tongue body, valid measurement of the degree of differentiation of a given tongue shape has the potential to shed light on development of motor skill in typical and disordered speakers. The current study sought to compare the success of multiple techniques in quantifying tongue shape complexity as an index of degree of lingual differentiation in child and adult speakers. Using a pre-existing data set of ultrasound images of tongue shapes from adult speakers producing a variety of phonemes, we compared the extent to which three metrics of tongue shape complexity differed across phonemes/phoneme classes that were expected to differ in articulatory complexity. We then repeated this process with ultrasound tongue shapes produced by a sample of young children. The results of these comparisons suggested that a modified curvature index and a metric representing the number of inflection points best reflected small changes in tongue shapes across individuals differing in vocal tract size. Ultimately, these metrics have the potential to reveal delays in motor skill in young children, which could inform assessment procedures and treatment decisions for children with speech delays and disorders.


Introduction
In the speech of typically developing children (Fletcher, 1989) and children with speech sound disorders (Gibbon, 1999), limitations in the child's capacity for isolated control of anterior versus posterior lingual regions may play an important role in nonadultlike speech patterns. This capacity for different regions of the tongue to operate semi-independently is referred to as 'lingual differentiation'; gestures that lack a typical degree of independent lingual control may be described as 'undifferentiated gestures' (Gibbon, 1999). In many cases, undifferentiated tongue shapes are associated with perceptually incorrect productions (i.e., a substitution or distortion) in child speech. However, for both typically developing (TD) children and children with speech sound disorder (SSD), it is possible for degree of lingual differentiation to dissociate from perceived accuracy for a given production (Gibbon, 1999). Most notably for children with SSD, previous literature has described cases in which undifferentiated or atypical gestures are present in productions that are perceptually transcribed or rated as accurate, sometimes termed 'covert error' (Cleland et al., 2017). For other individuals, there is documented physiological evidence of 'covert contrast', such as when perceptually neutralized productions are produced with measurably different tongue shapes (Gibbon, 1999). For example, for a four-year-old child with SSD who exhibited the phonological pattern of alveolar backing, Gibbon (1999) found that /g/ productions were produced with appropriate differentiated velar contact with the palate, whereas /d/ productions were produced with an undifferentiated shape involving both velar and alveolar contact. While the covert patterns described here refer to populations with SSD, children may also exhibit covert contrast over the course of typical phonological development, as originally described with respect to voicing contrasts by Macken and Barton (1980).
Therefore, assessing a child's degree of lingual differentiation has the potential to provide information about motor maturation that cannot be obtained from transcribed speech alone.
The goal of the present study is to lay a foundation for research to quantify TD children's degree of lingual differentiation, which we operationalize in the present study as 'degree of tongue shape complexity'. Previous attempts to automate quantification of the degree of tongue shape complexity using tongue contours from ultrasound imaging have yielded promising results. Dawson et al. (2016) used multiple methods for quantifying degree of tongue shape complexity in adult speakers producing various phonemes. Preston et al. (2019) applied an additional ultrasound-based metric of tongue shape complexity to child speakers. In the present study, we considered three of these established approaches to quantification of tongue shape complexity and applied them to adult data representing a range of phonemes. We first evaluated the extent to which the three measures differentiated between phonemes/phoneme classes theoretically expected to differ in articulatory complexity. Then we examined whether the same patterns found with adults were also present when the three measures were applied to ultrasound tongue shape data from children, whose articulation is known to be more variable than that of adults (Goffman & Smith, 1999).

Expected complexity of phoneme classes
Before applying metrics of tongue shape complexity to child or adult speech, it is essential to consider what degree of tongue shape complexity might be expected for a given phoneme. Not all speech sounds are expected to be produced with complex tongue shapes; for instance, it is widely agreed that vowels are articulatorily simple and the liquid sounds /l/ and /ɹ/ are articulatorily complex (e.g., Kent, 1992). In the context of a project to develop measures of tongue shape complexity, it is difficult to avoid circularity when defining the expected complexity of a phoneme. One alternative is to draw on taxonomies that describe the order in which phonemes are acquired in typical child speech development, since later-developing sounds tend to involve more complex tongue shapes. Such taxonomies are commonly derived from transcription-based studies of the perceived accuracy of children's speech at different ages. Clinicians assessing English-speaking children for speech disorders commonly make reference to Shriberg's (1993) system that groups consonants into early, middle, and late stages based on data from 64 children ages 3-6 with speech delays. The 'early eight' include /m/, /n/, /p/, /b/, /d/, /w/, /j/, and /h/; the 'middle eight' include /ŋ/, /t/, /k/, /g/, /f/, /v/, /ʧ/, /ʤ/; and the 'late eight' include /θ/, /ð/, /s/, /z/, /ʃ/, /ʒ/, /l/, and /ɹ/. More recently, Crowe and McLeod (2020) conducted a systematic review of fifteen studies comprising over 18,000 children acquiring American English and reported a slightly different set of three stages: the 'early 13' include plosives, nasals, and glides, the 'middle 7' include affricates, unvoiced fricatives, and laterals, and the 'late 4' include rhotics and voiced fricatives.
Other developmental taxonomies have been established with more explicit reference to the articulatory complexity of speech sounds. It is important to acknowledge at the outset that not all articulatory complexity is tongue shape complexity; some sounds have increased complexity because they require coordination of oral articulatory gestures with glottal gestures or opening/closing of the velopharyngeal port. We begin with a broad view that encompasses all aspects of articulatory complexity, and subsequently narrow our focus to the specific topic of tongue shape complexity. Studdert-Kennedy and Goldstein (2003) described a hierarchical relationship among classes of phonemes that corresponds with how much coordination among articulatory gestures is needed to achieve accurate production. Young children were reported to first show mastery of voiceless stops, nasals, glides, and /h/, and then later added voicing contrasts as they developed the ability to coordinate laryngeal gestures with lingual gestures. The third stage occurred when children were able to coordinate jaw height and/or degree of constriction with lingual gestures in order to produce fricatives and affricates. They described the final stage as occurring when multiple lingual constrictions began to occur, allowing for the production of liquid consonants including /l/ and /ɹ/ (Studdert-Kennedy & Goldstein, 2003).
In a similar description of stages of speech development in English, Kent (1992) described speech sounds according to the shape of the tongue and the type of movement involved in production. For consonants, Kent (1992) presented four stages of consonant sound development, classified according to the degree of 'ballistic' versus 'controlled' movement involved in articulation. He described how young children in the first stage produce consonants with mostly 'ballistic' movement, characterized by short durations with high-velocity accelerations and decelerations, as in /p/, /m/, and /n/. Some consonants in this stage also feature 'ramp' movements, which involve slow movements of relatively stable velocity and long duration, as in /w/ and /h/. The next class of sounds to develop involves more rapid ballistic movements (/b/, /k/, /g/, /d/) and ramp movements (/j/), and the emergence of the fricative /f/. In the third stage, an additional rapid ballistic sound (/t/) is added, along with controlled movements that allow voicing distinctions as well as the complex tongue shapes associated with /ɹ/ and /l/. Kent (1992) described the fourth and final stage as comprising the additional fricative sounds that require precise focal control at the point of lingual constriction.
Although there are differences across these four hierarchies of consonant development based on perceived accuracy and/or articulatory development (see Table 1), there is general agreement that nasals, voiceless stops, glides, and /h/ are developed relatively early due to their low level of articulatory complexity. Vowels are generally omitted from such taxonomies on the assumption that they are even simpler than these early consonants, and thus earlier-developing. It can also be observed that consonants produced using a labial place of articulation tend to develop before the same classes of sounds produced with a lingual constriction. Finally, there is general agreement that consonants requiring a single lingual constriction, such as /t/ and /k/, are usually acquired before sibilants produced with lingual grooving, such as /s/ and /z/, and also before liquids that require multiple lingual constrictions, such as /l/ and /ɹ/.
In their study specifically focusing on tongue shape complexity, Dawson et al. (2016) established multiple methods for quantifying tongue shape complexity and compared the metrics' collective ability to classify tongue shapes into complexity classes. Unlike the other proposed hierarchies, the complexity classes from Dawson et al. (2016) were not intended to represent a developmental hierarchy. They included vowels and did not consider consonantal voicing contrasts (i.e., with the exception of /θ/, only voiced consonants were included in their classification system). Their low complexity category included all unrounded vowels with a single lingual constriction, including /ɑ/, /æ/, /ɪ/, /ʌ/ and /ɛ/. Their medium complexity group included sounds involving lip rounding, including /w/ and /u/, sounds with lateral bracing, including /j/, and sounds with a constriction formed with the tongue dorsum, including /g/. Their high complexity group included all sounds with a constriction of the tongue tip, including /d/ and /l/, sounds with more than one lingual constriction, including /ɹ/, and all fricatives, including /z/, /θ/, and /ʒ/. Dawson et al. (2016) provided evidence from two-dimensional ultrasound-based midsagittal lingual contours that generally supported these a priori categories in adults.
The categories proposed by Dawson et al. (2016) agree in several respects with the other four phoneme hierarchies (see Table 1 for a summary), including classifying fricatives, laterals, and rhotics as having relatively higher complexity than other consonants, including nasals, glides, and stops. The most notable discrepancy is that the alveolar stops /t/ and /d/ are placed in the early or middle groups by Shriberg (1993), Crowe and McLeod (2020), and Studdert-Kennedy and Goldstein (2003). However, /t/ is developed relatively late according to Kent (1992) and /d/ is included in the 'high complexity' category according to Dawson et al. (2016). The present research drew on the categories from Dawson et al. (henceforth, 'Dawson categories') because they broadly align with existing hierarchies of articulatory development and because we re-analyzed data from their original study. However, we kept the other taxonomies in mind throughout our analyses, and we paid particular attention to the patterning of alveolar stops due to the discrepancy in their characterization across previous studies.

Instrumental measures of motor control
The preceding discussion suggests that tongue shape complexity could be a valuable measure for assessing motor skill in children with a suspected delay or disorder in speech development. It is thus important to consider different approaches that can be used to measure tongue shape complexity or other indicators of articulatory skill. Gibbon's (1999) foundational work on lingual differentiation used electropalatography (EPG), which is useful because it makes readily visible what areas of palatal contact are present, allowing researchers to infer what lingual regions are being used to form a constriction. EPG was instrumental in the discovery of developmental decreases in the amount of broad linguopalatal contact for a variety of targets (Fletcher, 1989). Gibbon (1999) distinguished between stop consonants produced in a 'differentiated' fashion (i.e., with contact isolated to a subregion of the palate) and those produced in an 'undifferentiated' fashion (i.e., with broad palatal contact), and found covert contrasts between stops produced with velar versus simultaneous alveolar and velar places of articulation. EPG has also been instrumental in revealing linguopalatal contact patterns in lateral bracing in sibilants (Gibbon et al., 1995). Although there is evidence supporting the use of EPG for diagnosis and treatment of speech disorders, clinical application of the method is limited by the high cost and time delay required to manufacture individually customized palatal prostheses. Additionally, the approach is not considered well-suited for children with growing vocal tracts, who may require several palates over the course of development (Gibbon, 1999). Therefore, there is a need for alternative approaches to measuring lingual differentiation that are more accessible and better suited for young children.
Ultrasound imaging is an increasingly available and affordable alternative that is also minimally invasive and thus suitable for use with small children. Data collected from EPG and ultrasound are not directly comparable because EPG provides discrete information about palatal contact patterns, whereas ultrasound provides a continuous representation of tongue shape in one anatomical plane (e.g., midsagittal). Despite this fundamental difference, it is reasonable to suggest that insights from the EPG-based lingual differentiation literature may dovetail nicely with the current efforts to quantify tongue shape complexity using ultrasound. Namely, ultrasound has been used to reveal covert contrasts for velar targets (McAllister Byun et al., 2016) as well as covert errors in perceptually accurate /k/ and /t/ productions (Cleland et al., 2017). It is highly probable that these ultrasound-based findings provide insight into the same covert articulatory phenomena previously observed using EPG. These findings suggest that both EPG and ultrasound can provide articulatory information that is distinct from and supplementary to readily available ratings of perceived accuracy. Extending ultrasound imaging of the tongue into the clinical domain, differences in degree of lingual differentiation have been quantified using ratio-based measures (Klein et al., 2013; Ménard et al., 2012; Zharkova et al., 2015). While such approaches are helpful for describing the shape and position of contours with one lingual constriction, they are not suitable for quantifying differences among contours with multiple lingual constrictions, such as /l/ and /ɹ/. Instead, qualitative descriptions have been used to describe differences in tongue shape between individual children with and without SSD producing rhotics (Klein et al., 2013).
In light of its relative accessibility, its potential to provide insight into a child's stage of motor development, its suitability for measuring a variety of speech targets, and its clinical applications, the ultimate goal of the present study is to use ultrasound to quantify tongue shape complexity in young children.
Although in the present study we chose to focus on measuring tongue shape complexity, it is important to note that previous research has identified other means of quantifying motor development, including measures of articulatory coupling and movement variability using such methods as electro-magnetic articulography (EMA). Tracking kinematic measures of lip and jaw movement, Green et al. (2000) found incremental increases in temporal coupling of these two articulators with increasing age. Children also exhibit a relatively high degree of motor variability that decreases over the course of development as lip and jaw motor targets are refined (Goffman & Smith, 1999; Grigos, 2009). Similarly, children with motor-based SSD show greater lip, jaw, and tongue tip movement variability than TD children (Terband et al., 2011). However, because EMA is limited to the anterior vocal tract, it is not appropriate for tracking posterior lingual constrictions, and therefore is not optimal for questions about late-emerging, complex sounds like the liquids /l/ and /ɹ/. It also is more invasive and therefore more challenging to use with young children than ultrasound. Therefore, we arrive at the present need for a valid and accessible metric that represents the degree of differentiation of lingual contours, including contours with multiple constrictions. As highlighted in the preceding discussion of taxonomies of articulatory development, it is important to keep in mind that the expected degree of lingual differentiation differs across phonemes; for instance, most vowels are produced with simple tongue shapes. As the term 'undifferentiated' implies an absence of differentiation that ought to be present, we favor the term 'tongue shape complexity' (as used by Dawson et al., 2016) because it can neutrally characterize both within- and across-speaker differences without suggesting an expected tongue shape.
In addition, continuous measures of tongue shape complexity might be more appropriate than a binary differentiated/undifferentiated distinction to evaluate small articulatory differences across phonemes and groups of speakers.

Ultrasound measurement of tongue shape complexity
Ultrasound is used to visualize boundaries between tissues with different densities, such as the boundary between the surface of the tongue and the air in the oral cavity, as detailed in Stone (2005). An ultrasound transducer, or probe, is placed beneath the chin and piezoelectric crystals inside the probe emit high-frequency sound waves in a fan-like shape through a section of the tongue (either sagittal or coronal, depending on the orientation of the probe). The time that it takes sound to return to the probe after being reflected by the density boundary is used to generate an image of the surface of the tongue, which appears as a white line. Sound waves do not readily pass through bone, which therefore appears as a black shadow in ultrasound images. When imaging in a midsagittal section, it is desirable for the field of view to extend from the shadow of the hyoid bone in the posterior aspect to the shadow of the mandibular bone anteriorly.
A variety of approaches have been used to quantify the shape of a lingual contour that has one major place of constriction. The simplest strategy is to derive measures from raw coordinates representing the surface of the tongue. However, these coordinates are relative to the position of the ultrasound probe, resulting in noise in the signal if the probe moves independently from the head. In order to compare any two sets of raw coordinates within or between individuals, it is essential to control for head movement or otherwise determine that the two contours have identical head position and probe orientation. One approach is for the probe to be physically stabilized relative to the head, as with headsets and collars (Cleland et al., 2017; Derrick et al., 2018; Stone, 2005), but this equipment can be heavy and uncomfortable and is not well tolerated by all participants, especially children. An alternative approach is to allow the head to move freely, but to measure its position and orientation relative to the probe using tracked visual or infrared sensors (Kabakoff et al., 2021; Whalen et al., 2005). Such measurements can then be used to normalize observed lingual position to a consistent head-centric coordinate system, or alternatively used to identify productions that are misaligned and should therefore be removed from the data set.
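As an illustration of the normalization idea described above, the following is a minimal sketch that maps a probe-frame contour into a head-centric frame using a two-dimensional rigid transform. It assumes that the tracking system has already yielded an in-plane head rotation angle and a head origin expressed in probe coordinates; the function name `to_head_centric` and its parameters are hypothetical and are not taken from any of the cited studies.

```python
import numpy as np

def to_head_centric(contour_xy, head_angle_rad, head_origin_xy):
    """Map probe-frame contour points into a head-centric frame (2-D sketch).

    contour_xy     : (N, 2) tongue-surface points in probe coordinates.
    head_angle_rad : assumed in-plane rotation of the head axes relative
                     to the probe axes (hypothetical tracker output).
    head_origin_xy : assumed head-frame origin in probe coordinates.
    """
    c, s = np.cos(head_angle_rad), np.sin(head_angle_rad)
    # Columns of R are the head axes expressed in probe coordinates,
    # so p_probe = R @ p_head + origin.
    R = np.array([[c, -s], [s, c]])
    pts = np.asarray(contour_xy, dtype=float)
    origin = np.asarray(head_origin_xy, dtype=float)
    # Inverse transform: p_head = R^T (p_probe - origin).
    # With points stored as rows, that is (p - origin) @ R.
    return (pts - origin) @ R

# Usage: two tokens recorded with different head poses become comparable
# once both are expressed in the same head-centric frame.
token = np.array([[10.0, 40.0], [25.0, 55.0], [40.0, 48.0]])
normalized = to_head_centric(token, np.deg2rad(5.0), [2.0, -1.0])
```

Measures that are themselves insensitive to rotation and translation would not need this step, which is part of the motivation for the shape-based metrics discussed in this section.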
Instead of or in addition to using preventative or corrective approaches to control for head movement, it is possible to derive measures that are relatively robust to the rotation and translation introduced by head movement. Previous research has proposed two such measures, curvature degree and curvature position, derived from the length of the contour from the mandible to the hyoid shadow and the height of the contour at the highest vertical point perpendicular to the length (Ménard et al., 2012; Zharkova et al., 2015). As these measures are ratios of two segment lengths, they intrinsically normalize across speakers of different sizes. While these metrics are equipped for describing one lingual constriction, as in most vowels and stop consonants, they cannot distinguish contours with multiple lingual constrictions, such as /l/ and /ɹ/. This highlights the need for metrics that can capture the degree of curvature along multiple lingual constrictions while remaining robust across individuals differing in vocal tract size. Dawson et al. (2016) developed and compared various approaches to obtaining continuous metrics that normalize for differences in vocal tract size and head movement with the goal of quantifying the complexity of lingual contours across a variety of phoneme targets in adults. They sought to determine which metrics best classified adult productions into the pre-established low, medium, and high complexity categories described above. The primary metrics of tongue shape complexity were a modified curvature index (MCI) and a Procrustes analysis. MCI is the average of unsigned curvature integrated at each point along the length of the arc of a traced lingual contour; it differs from another published curvature index (Stolar & Gick, 2013) in that the curve parameterization is used as the reference rather than the x-axis (which would require head stabilization for the values to be interpretable).
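To make the idea of integrating unsigned curvature along the contour's own parameterization concrete, the following is a minimal sketch rather than the published implementation. It assumes the contour is an ordered array of (x, y) points, resamples it at equal arc-length steps, estimates curvature by finite differences, and integrates with the trapezoid rule. The function names and the choice of 100 resampled points are assumptions made for illustration, and any smoothing or filtering used in the original work is omitted.

```python
import numpy as np

def resample_by_arc_length(xy, n_points=100):
    """Resample an ordered contour at equal arc-length steps (linear interp)."""
    xy = np.asarray(xy, dtype=float)
    seg = np.hypot(*np.diff(xy, axis=0).T)        # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    s_new = np.linspace(0.0, s[-1], n_points)
    x = np.interp(s_new, s, xy[:, 0])
    y = np.interp(s_new, s, xy[:, 1])
    return np.column_stack([x, y]), s_new

def modified_curvature_index(xy, n_points=100):
    """Integral of unsigned curvature over arc length (illustrative sketch)."""
    pts, s = resample_by_arc_length(xy, n_points)
    # First and second derivatives with respect to arc length.
    dx = np.gradient(pts[:, 0], s, edge_order=2)
    dy = np.gradient(pts[:, 1], s, edge_order=2)
    ddx = np.gradient(dx, s, edge_order=2)
    ddy = np.gradient(dy, s, edge_order=2)
    # Unsigned curvature of a planar parametric curve.
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    # Trapezoid rule over arc length.
    return float(np.sum(0.5 * (kappa[1:] + kappa[:-1]) * np.diff(s)))

# Usage: a single smooth arch, here a semicircle of arbitrary radius.
theta = np.linspace(0.0, np.pi, 400)
arch = 30.0 * np.column_stack([np.cos(theta), np.sin(theta)])
mci_arch = modified_curvature_index(arch)
```

Because curvature scales inversely with size while arc length scales directly, the integral is unchanged when the whole contour is enlarged or shrunk; for an ideal semicircle it equals π regardless of radius. It is also unchanged by rotation and translation, since the computation never refers to the x-axis as a baseline.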
MCI for a given tongue contour is determined by first computing the absolute curvature (the reciprocal of the radius of the tangent circle) at each of the normalized equidistant points along an outline of the tongue surface, and then integrating across these. The Procrustes analysis utilizes a lingual contour at rest to obtain a baseline measure intended to represent a minimal degree of tongue shape complexity. Dawson et al. (2016) described the 'resting' contour as a pre-phonatory position in which the tongue 'lay flat in the mouth, with no palate contact'. Tongue contours obtained during target phonemes are superimposed over this resting contour, and then translation, rotation, and scaling are applied to minimize the sum of squared differences between each frame and the resting state (Goodall, 1991). Finally, Dawson et al. (2016) also considered an analytical approach in which a Discrete Fourier Transform (DFT) was used to transform the tangent angles of a tongue contour into a characterization of tongue shape as a sum of its spatial frequency components (following Liljencrants, 1971). DFT yields coefficients with real and imaginary components, of which the first coefficients correspond to a wavelength equal to the contour length and higher coefficients correspond to multiples of this frequency. The real component of each coefficient provides phase information (at what point along the vocal tract a constriction occurs); the imaginary component provides the magnitude of the constriction. In Dawson et al. (2016), including all three coefficients (C1, C2, C3) did not improve categorization. DFT did provide more consistent categorizations than MCI and Procrustes, but did not make for an 'intuitive' interpretation in terms of complexity. Dawson et al. (2016) used linear discriminant analysis to determine which metrics or combination of metrics best classified various phonemes into complexity classes. In their analysis, Dawson et al. (2016) found that the metrics that were best at independently grouping individual productions into their proposed complexity categories were the imaginary component of C1 (77% accuracy), Procrustes (62% accuracy), and MCI (56% accuracy). However, they determined that the combination of the real and imaginary components of C1 from the DFT together was an even better classifier (81% accuracy), and that adding MCI and Procrustes to this combination improved classification accuracy even more (83% accuracy). However, it is important to note that although the imaginary component of C1 was successful at classifying tongue shapes into complexity categories in Dawson et al. (2016), it is difficult to interpret and compare DFT coefficients across speakers in the absence of how a given tongue shape maps into idiosyncratic vocal tract morphology, information not available from ultrasound alone. Based on these considerations, only MCI and Procrustes were examined as candidate measures of tongue shape complexity in the present analyses. Preston et al. (2019) proposed an additional metric for quantification of tongue shape complexity that is robust across differences in vocal tract size: an ordinal measure that represents the discrete number of inflection points (NINFL) determined by the number of curvature sign changes of a given lingual contour. To avoid including inflections due to small local changes in curvature, only changes exceeding a consistent threshold are counted. Comparing NINFL values of /ɹ/ sounds produced by school-aged children with and without SSD, Preston et al. (2019) found that children with /ɹ/ misarticulation had lower NINFL values than TD children. Additionally, NINFL values correlated with /ɹ/ accuracy ratings, such that higher values were associated with higher perceived accuracy. Finally, for those children enrolled in treatment for /ɹ/ misarticulation, lingual contours showed higher NINFL values after treatment than before treatment.
Although Preston et al. (2019) did not apply NINFL to other sounds, its success in quantifying changes in /ɹ/ production suggests that it could be useful to distinguish between phonemes involving dual lingual constrictions, such as /l/, and phonemes involving less complex tongue shapes.
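The two remaining ideas in this section, counting curvature sign changes and superimposing a contour on a resting shape, can be sketched as follows. This is an illustrative sketch under stated assumptions and not the procedure of Preston et al. (2019) or Dawson et al. (2016): the function names are hypothetical, the curvature threshold `min_abs_curvature` is a placeholder whose published value is not given here, contours are rescaled to unit length before thresholding, and the Procrustes step assumes both contours have the same number of corresponding points and are normalized to unit size.

```python
import numpy as np

def signed_curvature(xy):
    """Signed curvature of an ordered planar contour by finite differences."""
    dx, dy = np.gradient(xy[:, 0]), np.gradient(xy[:, 1])
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

def count_inflections(xy, min_abs_curvature=0.0):
    """Count curvature sign changes, ignoring low-curvature points.

    min_abs_curvature is a hypothetical threshold, applied after scaling
    the contour to unit arc length so that it does not depend on size.
    """
    xy = np.asarray(xy, dtype=float)
    length = np.sum(np.hypot(*np.diff(xy, axis=0).T))
    kappa = signed_curvature(xy / length)
    signs = np.sign(kappa[np.abs(kappa) > min_abs_curvature])
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def procrustes_distance(contour, resting):
    """Residual after translating, rotating, and scaling onto a resting shape.

    Assumes equal numbers of corresponding points; both shapes are centered
    and scaled to unit norm first, and reflections are not allowed.
    """
    a = np.asarray(resting, dtype=float)
    b = np.asarray(contour, dtype=float)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    u, sv, vt = np.linalg.svd(b.T @ a)
    if np.linalg.det(u @ vt) < 0:   # best fit would be a reflection
        sv[-1] = -sv[-1]            # restrict to proper rotations
    return float(max(0.0, 1.0 - sv.sum() ** 2))

# Usage: a shallow arch as a stand-in "resting" shape and an S-shaped contour.
t = np.linspace(0.0, 1.0, 50)
resting = np.column_stack([t, 0.1 * np.sin(np.pi * t)])
s_shaped = np.column_stack([t, 0.3 * np.sin(2 * np.pi * t)])
n_inflections = count_inflections(s_shaped)
distance = procrustes_distance(s_shaped, resting)
```

In this sketch, a contour that is only a moved, rotated, or rescaled copy of the resting shape yields a Procrustes residual near zero, while a count of curvature sign changes gives an ordinal value that does not change with contour size.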

The current study
The present study first compared the three above-mentioned metrics in the sample of adult speakers producing a variety of targets in Dawson et al. (2016), then applied them to child data with the ultimate goal of identifying the metrics that best represent degree of tongue shape complexity in children. The specific objectives of the current study are as follows: (1) To determine the extent to which the three metrics distinguish various adult speech targets expected to differ in articulatory complexity based on established taxonomies. (2) To determine the extent to which patterns of tongue shape complexity found with adults were also present in children for whom relatively late-developing (and therefore lingually-complex) targets may still be emerging.
For the first objective, we applied the three metrics to the adult participants from Dawson et al. (2016) to see how well they separated the adult productions into phonemes and phoneme classes. The rationale for conducting this initial re-analysis of the adult data was threefold. First, this analysis was intended to draw attention to any metric-specific categorization patterns for phonemes and phoneme classes, which were not readily apparent in Dawson et al. (2016) due to their focus on overall classification accuracy for the metrics. Second, the present analysis included NINFL, which was not one of the metrics considered in Dawson et al. (2016). Third, adult productions are known to show more articulatory stability than child configurations (Goffman & Smith, 1999), so it follows that analysis of tongue shape complexity in children would be premature without detailed knowledge of what patterns ought to be present in mature adult speech. For the second objective, we applied the same three metrics to a new data set of young TD children to determine whether the same patterns related to phoneme and phoneme class found in adults were also present in children.
We predicted that for both adults and children, we would see general agreement between measures of tongue shape complexity and articulation-based schemes of phoneme acquisition, such that later-emerging phonemes would be associated with higher tongue shape complexity. For adults, based on patterns observed across all of the developmental trajectories, we predicted that tongue shape complexity would be lowest for vowels and glides, higher for stops, and highest for liquids. For children for whom late-developing targets may still be emerging, we anticipated that there would be a reduced degree of separation across phonemes due to articulatory simplifications and increased variability while children try out different tongue shapes (Goffman & Smith, 1999; Grigos, 2009), which should especially affect late-developing targets.
Ultimately, we evaluated each metric's strength at measuring tongue shape complexity based on how well it ranked the adult tongue contours into phoneme classes along the developmental trajectory of complexity. As reflected in the developmental hierarchies, the ideal metric for adults would reveal the relatively fine-grained pattern in which vowels/glides have the lowest, stops have an intermediate, and liquids have the highest tongue shape complexity. As we expected a high degree of overlap across categories, each metric's ability to separate vowels from liquids was used as a relatively coarse evaluative criterion for each metric in the present study. Identifying a measure that broadly agrees with existing schemes for classifying articulatory complexity (i.e., vowels < liquids) in adults will increase confidence in the clinical utility of these measures when applied to children. Our proof-of-concept analyses would suggest that these measures may also be valid for child data if the predicted coarse separation between vowels and liquids is also observed in children, or if patterns of reduced tongue complexity or increased variability in the child data set prevent this separation from occurring. Finding these predicted patterns in children would support further research using the selected metrics to quantify differences within and between child speakers. Ultimately, such measures could support clinical efforts to identify the relative contribution of motor skill to a child's error patterns, with implications for diagnosis and treatment planning.

Adult data set
The adult data set from Dawson et al. (2016) was used with permission. This data set included 1125 productions from six participants between the ages of 24 and 45 with no history of speech or language impairment who were seen at the Graduate Center of the City University of New York (CUNY). The target vowels /ɑ/, /æ/, /ɛ/, /ɪ/, /u/, and /ʌ/ were produced in a /bVb/ context, and the target consonants /j/, /w/, /ɡ/, /d/, /z/, /ʒ/, /l/, /ɹ/, and /θ/ were produced in an /ɑCɑ/ context. After rehearsal of the complete set of stimuli, participants produced two sets of at least six repetitions of each stimulus, elicited in a random order. Ultrasound recordings were collected with an Ultrasonix SonixTouch with a C9-5/10 microconvex transducer (frequency range 5-9 MHz, 10 mm footprint) at 60 frames per second. A heavy-duty metal stand with a spring-loaded probe arm was used to minimize probe movement relative to each participant's jaw. The frame selected for measurement was the frame closest to the acoustic midpoint for vowels and the point of maximal constriction (i.e., greatest lingual displacement) for consonants. See Table 2 for details about the adult participants after exclusions (as described below), including gender and total number of tokens in the final data set.

Child data set
The child data set included 1132 productions from 17 typically developing children who participated in an evaluation at one of three sites, including Molloy College, Haskins Laboratories (Yale University), and Syracuse University. Data collection from this study was carried out in accordance with the Molloy College Institutional Review Board (no protocol ID provided), the Yale University Human Research Protection Program Institutional Review Boards (protocol ID #1610018484), and the Syracuse University Institutional Review Board for the Protection of Human Subjects (protocol ID #16-282). The participating children had a mean age of 5;2 (range 4;2-6;3) and included 11 females and 6 males. All children had normal hearing and no history of speech or language impairment. However, many of these children produced at least some errors on the late-developing sounds /ɹ/ and /l/ (as described below), as these phonemes were still emerging along a developmentally appropriate trajectory.
At Molloy College, ultrasound recordings were collected with a Siemens Acuson X300 with a C8-5 wideband curved array transducer (frequency range 3.1-8.8 MHz, 25.6 mm footprint, 109 degree field of view) at 43-49 frames per second with 60-70 mm depth. At Haskins, a Siemens Acuson X300 was used with a C6-2 wideband curved array transducer (frequency range 1.8-6 MHz, 73.0 mm footprint, 90 degree field of view) at 36-37 frames per second with 80 mm depth. At Syracuse University, a Telemed Echoblaster 128 was used with a PV 6.5 wideband curved array transducer (frequency range 5-8 MHz, 156 degree field of view) at 21-25 frames per second with 110 mm depth (see footnote 3). Ultrasound video recordings were captured at 60 frames per second on a PC through an AverMedia video capture card at Molloy and Haskins, and at 35 frames per second with Debut (NCH Software) at Syracuse. The ultrasound probe was placed in a microphone stand while the clinician supported alignment of the probe with each child's head. In addition, blue dots were placed on the forehead, nose, lips, chin, and on the ultrasound probe in order to measure the alignment of the probe with the head (see ultrasound measurement section below). Children were initially familiarized with the pictures used for elicitation prior to placement of the ultrasound probe, with the evaluating clinician providing cues or modeling the word as needed until the child could name each image. The researcher monitored the ultrasound image during data collection, and if they had concerns about the quality of the ultrasound image during a given production (e.g., due to movement of the child's head relative to the probe), an additional production was prompted. Sixteen words were elicited three times each in random order for a total of forty-eight productions (see footnote 4).
Consonants were targeted in initial position and included /j/ in 'yam', /w/ in 'wake' and 'wing', /k/ in 'cape', 'cat', 'coat', and 'key', /t/ in 'tape', 'tea', and 'toe', /l/ in 'lake' and 'lamb', and /ɹ/ in 'rake', 'rat', 'ring' and 'rope'. As word-initial consonants were the present focus, final consonants were not analyzed. However, for a point of comparison with simple tongue shapes, we did analyze two monophthongal vowels (/ae/ and /ɪ/) that occurred in words with final consonants. As such, the target vowels included /ae/ in 'cat', 'lamb', 'rat', and 'yam' and /ɪ/ in 'ring' and 'wing'. See Table 2 for details about the child participants, including age, gender, site, and total number of tokens in the final data set. Although accuracy ratings were not quantitatively analyzed as part of the present study, accuracy ratings based on narrow transcription (in which distortions were classified as errors) from Kabakoff et al. (2021) indicated that these same typically developing children produced /ae/ with 98.1% accuracy, /ɪ/ with 86.4% accuracy, /j/ with 92.3% accuracy, /w/ with 100% accuracy, /k/ with 98.6% accuracy, /t/ with 98.4% accuracy, /l/ with 84.4% accuracy, and /ɹ/ with 42.3% accuracy. See Kabakoff et al. (2021) for more information on error patterns for these individuals.
Footnote 3. The divergent frame rates used with the different systems had the potential to affect our ability to identify the optimal frame within a given acoustic interval. At our lowest frame rate of 21 frames per second, the selected frame could be at most 48 ms from the true frame of interest, which is judged to be sufficient for the present non-dynamic analysis. Although reduced zoom depth may result in fewer pixels available, MCI and NINFL computations are made from overlaid anchors and do not depend on the available number of pixels.
Footnote 4. Because re-elicitations were possible with the young child sample, more than 48 productions were elicited from some child participants. However, some elicitations were later determined to have misaligned ultrasound images or were otherwise unclear (as indicated in the ultrasound measurement section) and were removed from the final data set.

Ultrasound measurement
All processing of ultrasound data from adults was performed at CUNY following the protocol described in Dawson et al. (2016). All processing of ultrasound data from children was performed at New York University as part of a larger study that included children with SSD. The procedures used for the child data are described briefly below; see Kabakoff et al. (2021) for additional detail. Midsagittal ultrasound probe alignment was quantified using a procedure in which blue dots were placed along the vertical midline of the child's face and the probe. The position of the dots was automatically tracked in frontal-view video recorded concurrently with ultrasound data collection. This video was used in a Matlab (MathWorks Inc, 2000) script that temporally aligned the video of the child's face with the ultrasound video using cross-correlation of their mutual audio. The script then flagged video frames for further inspection if the tracked blue dots indicated more than one standard deviation of displacement across the child sample from Kabakoff et al. (2021). This threshold was 15.4 mm of lateral displacement or 13.3° of angular displacement of the probe relative to the face. Using this method, 24.4% of frames (276/1132) were discarded due to lateral misalignment (170/1132) and/or angular misalignment (197/1132), leaving 856 tokens.
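The temporal alignment step described above can be illustrated with a minimal sketch. This is not the Matlab script used in the study; it assumes two short audio tracks held as plain lists, and the function name `best_lag`, the toy samples, and the sample rate are all hypothetical. The idea is simply to slide one track against the other and keep the lag with the largest summed product.

```python
def best_lag(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`.

    A positive result means `delayed` starts later than `reference`.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += r * delayed[j]  # cross-correlation term at this lag
        if score > best_score:
            best, best_score = lag, score
    return best


# Toy audio tracks (invented): the second is the first delayed by 3 samples.
camera_audio = [0, 0, 1, 3, -2, 1, 0, 0, 0, 0]
ultrasound_audio = [0, 0, 0, 0, 0, 1, 3, -2, 1, 0]
sample_rate = 1000  # hypothetical samples per second

lag = best_lag(camera_audio, ultrasound_audio, max_lag=5)
offset_seconds = lag / sample_rate
print(lag, offset_seconds)
```

With real recordings the same logic would normally be run on much longer signals with an FFT-based routine; the brute-force loop here is only meant to make the computation explicit.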
For all sound files, trained university students who had taken courses in linguistics or phonetics and had received project-specific training viewed waveforms and spectrograms in Praat (Boersma & Weenink, 2014) in order to mark the relevant sonorant and obstruent intervals in the time-synced TextGrid file and label them by target phoneme. Each marked acoustic interval in the TextGrid file was viewed in the time-synced ultrasound video using GetContours, an ultrasound annotation program that supports navigation to the first frame within marked target intervals. The trained university students selected the frame judged to most clearly represent each target phoneme and placed sixteen anchors along the underside of the white line visible on the ultrasound image (see Figure 1). The software then automatically redistributed the points evenly along the traced contour, and the evenly distributed points were automatically extrapolated into 100 x- and y-coordinates.
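As a rough illustration of how sixteen hand-placed anchors can be turned into 100 evenly spaced coordinates, the sketch below resamples a polyline at equal arc-length steps using linear interpolation. This is an assumption made for illustration: GetContours may well fit a smooth curve through the anchors instead, and the function name `resample_contour` and the toy anchors are hypothetical.

```python
import math


def resample_contour(points, n_out=100):
    """Place n_out points at equal arc-length steps along a polyline."""
    # Cumulative arc length at each input anchor.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    out = []
    seg = 0
    for k in range(n_out):
        target = total * k / (n_out - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0 else (target - cum[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# Sixteen invented anchors tracing a simple arch.
anchors = [(float(i), 5 * math.sin(math.pi * i / 15)) for i in range(16)]
contour = resample_contour(anchors)
print(len(contour), contour[0], contour[-1])
```

Because the output is parameterized by arc length rather than by x-position, steep portions of the contour receive as many points as flat portions, which matters for the curvature-based metrics computed later.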
After initial data processing by students, a graduate student with specialized training in phonetic analysis assured consistency across ultrasound files by verifying that all target productions were traced and that all traces reflected the entire visible contour. For most frames, this meant that the tracing of the tongue's surface should extend from the hyoid shadow to the mandibular shadow. The student specialist discarded any frames that they subjectively judged to be off-center or unclear and retraced any frames that were not traced fully. As such, for the minority of cases where both shadows were not visible, tracings were either judged to represent both posterior and anterior lingual regions fully, or they were discarded. The first author exported all remaining contours (i.e., sets of 100 coordinates), read the coordinates into RStudio (RStudio Team, 2019), scaled (z-score) both the x- and y-axes within speaker and target, and plotted the coordinates for a final round of quality assurance. Consensus was reached by the first author and the student specialist that 16 tokens should be removed. Fifteen of these exhibited high degrees of perseverative coarticulation: 8 /ae/ productions and 7 /ɪ/ productions that showed multiple constrictions following /ɹ/ (in 'raft' and 'ring'). The remaining exclusion was one /k/ production (from 15M) whose contour was dissimilar to the shapes of the speaker's other /k/ productions; this was thought to be attributable to a difference in view range in which the most posterior regions of the contour were not captured. After this process of removing outliers based on visual inspection, there were a total of 840 tokens in the child data set.

Tongue shape complexity measurements
For both the adult data set and the child data set, MCI and Procrustes metrics were calculated using a custom script (Dawson, 2016) in Python (Python Software Foundation, 2016), and NINFL was calculated using a custom Matlab script (ComputeCurvature) available as supplemental material. Any NINFL value exceeding five (n = 15 for adults; n = 1 for children) was discarded as unlikely to be a valid representation of a possible tongue shape, following the procedure described in Preston et al. (2019). Similarly, all MCI values exceeding six (n = 5 for adults, n = 1 for children) were removed based on the range of MCI values included in Dawson et al. (2016). After exclusion of these outliers, the total number of adult contours was 1108 and the total number of child contours was 838. After calculation of the three metrics, all subsequent analyses were performed using the RStudio interface to R (R Core Team, 2019).
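To make the two curvature-based metrics more concrete, the following sketch computes simplified stand-ins for them on a traced contour. It is not the Dawson (2016) Python script or the ComputeCurvature Matlab script: it assumes that the MCI-like quantity can be approximated by the summed absolute turning angle along the contour (a discrete integral of absolute curvature over arc length, without the filtering used in the published metric) and that the NINFL-like quantity can be approximated by counting curvature sign changes above a small threshold. The function names and the `min_angle` value are hypothetical, and the published NINFL values may be counted on a different scale (the text reports values from one to five).

```python
import math


def turning_angles(points):
    """Signed turning angle (radians) at each interior point of a polyline."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        ax, ay = x1 - x0, y1 - y0
        bx, by = x2 - x1, y2 - y1
        # atan2(cross, dot) gives the signed angle between successive segments.
        angles.append(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return angles


def total_abs_curvature(points):
    """Discrete integral of |curvature| over arc length (sum of |turning angles|)."""
    return sum(abs(a) for a in turning_angles(points))


def count_sign_changes(points, min_angle=1e-3):
    """Count curvature sign changes, skipping near-straight points below min_angle."""
    signs = [a > 0 for a in turning_angles(points) if abs(a) >= min_angle]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


# Two invented 100-point contours: a single arch and a wave with two inflections.
arch = [(math.cos(t), math.sin(t))
        for t in (math.pi * (1 - i / 99) for i in range(100))]
wave = [(x, math.sin(x))
        for x in (3 * math.pi * i / 99 for i in range(100))]

arch_mci, wave_mci = total_abs_curvature(arch), total_abs_curvature(wave)
arch_inflections, wave_inflections = count_sign_changes(arch), count_sign_changes(wave)
print(arch_mci, arch_inflections, wave_mci, wave_inflections)
```

Both quantities depend only on how the contour bends, not on its overall size, which is the property the text relies on when comparing vocal tracts of different sizes.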

Analyses
Our first objective was to evaluate how well each of the three established metrics of tongue shape complexity, taken individually, classifies tongue contours in the adult data set into phonemes and into pre-established complexity classes. We qualitatively inspected whether distributions of values for each of the selected metrics were distinct across individual phonemes and across natural classes of phonemes. We then quantitatively tested whether mean tongue shape complexity differed between vowels, glides, stops, and liquids. Although our primary evaluative criterion for the metrics was whether liquids would have higher tongue shape complexity than vowels, we included glides and stops as intermediate levels in order to provide a more complete representation of the natural class patterns across the child and adult data sets. This represents a distinct objective from that of Dawson et al. (2016), which used linear discriminant analysis to find the metric or combination of metrics that yielded the highest classification accuracy, but did not conduct a detailed examination of phoneme-specific patterns with reference to expected articulatory complexity.
Our second objective was to determine whether the patterns found in the adult data set would also be found in children for whom complex targets may still be emerging. As for the adult data, we visually examined whether the distributions of complexity scores for the selected metrics would distinctly categorize the child productions into phonemes and into complexity categories. We also considered transcription-based accuracy in this qualitative analysis. We performed the same quantitative analysis as for adults in which we tested for differences in tongue shape complexity based on phoneme class (vowels, glides, stops, liquids). Because the adult data were elicited in nonwords with a constant phonetic context and the child data were elicited in real words with varying vowels and coda consonants, quantitative comparisons between the two data sets were not possible; therefore, only qualitative comparisons were reported.

Results
For our first objective looking at tongue shape complexity patterns in the adult data set, notched boxplots of the three metrics for each target phoneme were created to allow visual estimation of the extent to which MCI, NINFL, and Procrustes values differed across phonemes. Figure 2 shows complexity values for each target, pooled across participants, and sorted in increasing order by the median across participants. While MCI and Procrustes are continuous metrics that can be sorted meaningfully by the median value for each target phoneme, NINFL is an ordinal metric with values ranging from one to five that would lead to many instances of the same whole number when attempting to sort by the median values. Therefore, in order to avoid such ties between NINFL distributions with the same median, it was necessary to take the mean value across repetitions for each speaker for each phoneme, and then rank-order the phonemes based on the median of these individual speaker means. Calculating individual means in this way leads to fractional values that characterize the NINFL patterns of a given speaker. Therefore, the notched boxplots for NINFL show the median of these means, which characterizes speaker patterns more effectively than if we had plotted the median of the raw ordinal values. The boxes are colored by the Dawson categories. As can be seen in the figure, in most cases, the notches around the ranked medians overlap substantially and thus do not fully separate adult targets from one another. A few exceptions exist; for instance, for both the MCI and Procrustes measures, there is no overlap in notches between the liquids /ɹ/ and /l/ and the next most complex phoneme, suggesting that these sounds are produced with significantly greater tongue shape complexity than other phonemes. 
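The tie-breaking procedure for NINFL described above (average within speaker and phoneme, then rank phonemes by the median of those speaker means) can be sketched as follows. The data are invented for illustration, and the function name and phoneme labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def rank_by_median_of_speaker_means(tokens):
    """tokens: iterable of (speaker, phoneme, ninfl) tuples.

    Returns the phonemes ordered from lowest to highest median of
    speaker means, plus the per-phoneme summary values.
    """
    per_cell = defaultdict(list)
    for speaker, phoneme, value in tokens:
        per_cell[(speaker, phoneme)].append(value)

    speaker_means = defaultdict(list)
    for (speaker, phoneme), values in per_cell.items():
        speaker_means[phoneme].append(mean(values))

    summary = {p: median(m) for p, m in speaker_means.items()}
    return sorted(summary, key=summary.get), summary


# Invented tokens: both phonemes have a raw median of 2, so raw medians tie.
tokens = [
    ("S1", "ih", 2), ("S1", "ih", 2), ("S1", "r", 2), ("S1", "r", 3),
    ("S2", "ih", 1), ("S2", "ih", 2), ("S2", "r", 2), ("S2", "r", 2),
    ("S3", "ih", 2), ("S3", "ih", 2), ("S3", "r", 2), ("S3", "r", 3),
]
order, summary = rank_by_median_of_speaker_means(tokens)
print(order, summary)
```

In this toy example the raw ordinal medians cannot separate the two phonemes, whereas the fractional speaker means can, which is the motivation given in the text.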
Across all three metrics, vowels were generally associated with values ranked in the lower third (the lowest five phonemes out of all fifteen), and /ɹ/ and /l/ were generally associated with high values (the highest five phonemes out of all fifteen, where /ɹ/ for NINFL is the only exception as it is ranked as the sixth most complex phoneme). Figure 2 also shows that rank ordering mostly agreed with the Dawson categories, in that most of the phonemes with medians in the lower third belong to the low-complexity category and most of the phonemes in the higher third belong to the high-complexity category. A notable exception is that /ɑ/ and /ʌ/ were considered low and /w/ and /j/ were considered medium complexity according to the Dawson categories, but /ɑ/ and /ʌ/ were ranked in the intermediate third by NINFL, /w/ was ranked in the lower third by Procrustes and in the higher third by MCI, and /j/ was ranked in the higher third by NINFL. Additionally, /ʒ/ and /d/ were considered high complexity based on the Dawson categories, but MCI and NINFL placed /ʒ/ among the lower/intermediate thirds while MCI ranked /d/ in the intermediate third.
For our second objective looking at tongue shape complexity patterns across phonemes in the child data set, Figure 3 shows the same rank ordered notched boxplots as used for adults, where the median-based rank ordering is shown for MCI and Procrustes. For NINFL, since there were multiple word contexts for each target phoneme, the mean value for each speaker by word was calculated in order to break ties between distributions with the same median. As with adults, the NINFL plots are rank ordered by the median of these means. To a greater degree than observed in the adult data, the notches overlap across phonemes and do not separate child targets from one another. Based on NINFL only, there is minimal overlap in notches between the liquids /ɹ/ and /l/ and the next most complex phoneme, suggesting that these sounds are produced with significantly greater tongue shape complexity than other phonemes. Among the discrepancies within the child data, /w/ was ranked as higher complexity than expected based on all three metrics, as it is in the higher third across all metrics. Additionally, even though /l/ and /t/ belong to the Dawson high complexity category, they were in the lower third based on MCI, while /t/ was in the lower third based on Procrustes, and in the intermediate third based on NINFL. As expected, /l/ was ranked in the higher third based on Procrustes and NINFL, while /ɹ/ was ranked in the highest third according to all three metrics.
Because transcription-based accuracy ratings were available for all child productions, the plots show whether each production was considered 'correct' or 'incorrect'. For the plotted mean NINFL values for each subject and target, an intermediate 'mixed' accuracy rating category is included when the productions for that subject/target did not all receive the same rating. As can be seen in the plot, the children produced most vowels, glides, and stop consonants with a high degree of accuracy. Recall that /l/ and /ɹ/ were each produced with less than 90% accuracy across the children in this data set. For /l/, the incorrect productions do not appear to separate from the correct productions based on any metric, suggesting that accuracy does not mediate the unexpected ordering for /l/ based on MCI. However, for /ɹ/, the incorrect productions do appear lower than the correct productions based on both MCI and NINFL. This suggests that accurate /ɹ/ productions would show even greater separation from the other phonemes than presently shown.
As a quantitative analysis of each metric's ability to differentiate productions into vowel, glide, stop, and liquid phoneme classes, linear mixed-effects regression models were fit for both adults and children using the "lme4" package (Bates et al., 2015). For all three models for both adults and children, tongue shape complexity was the outcome variable and phoneme class (vowel, glide, stop, and liquid, with vowel as the reference level) was the predictor variable. Following the standards described in Harel and McAllister (2019), random intercepts were included because observations were nested within participants and phoneme classes, and by-participant random slopes on phoneme classes were included because relative patterns of complexity across phonemes could differ across participants. Because there were multiple target items per phoneme for the children (but only one target item per phoneme for adults), an additional random intercept for word was included in only the models for the child data. Figure 4 shows the means and 95% confidence intervals of each phoneme class for each metric, pooled across participants, targets, and word contexts. The models for all three metrics for adults indicated that the tongue complexity means for vowels were lower than those for glides (MCI: β = 0.76, SE = 0.067, p < 0.0001; NINFL: β = 0.93, SE = 0.09, p < 0.0001; Procrustes: β = 0.42, SE = 0.037, p < 0.0001), stops (MCI: β = 0.27, SE = 0.078, p = 0.0035; NINFL: β = 0.82, SE = 0.09, p < 0.0001; Procrustes: β = 0.60, SE = 0.023, p < 0.0001), and liquids (MCI: β = 1.46, SE = 0.056, p < 0.0001; NINFL: β = 0.79, SE = 0.09, p < 0.0001; Procrustes: β = 1.16, SE = 0.052, p < 0.0001). The MCI-based models for children indicated that the tongue complexity means for vowels were lower than those for glides (β = 0.15, SE = 0.061, p = 0.015), stops (β = 0.13, SE = 0.05, p = 0.027), and liquids (β = 0.33, SE = 0.069, p < 0.0001). 
The NINFL-based models for child data indicated that the means for vowels were lower than those for liquids (β = 0.40, SE = 0.13, p = 0.0055). There were no differences based on the Procrustes metric. See supplemental material for full model output.
For a complete representation of all pairwise patterns (and not just differences of the consonants from the vowels), we estimated marginal means and evaluated the significance of the relevant marginal contrasts using the "emmeans" package in R (Lenth, 2019). Results from this analysis are reported in Table 3 with all p-values adjusted using Holm's method for multiple comparisons (see footnote 5). For adults, tongue complexity based on MCI was significantly different across the following phoneme classes: vowels were less complex than stops and liquids, glides less complex than stops and liquids, and stops less complex than liquids. Based on NINFL, vowels were less complex than glides/stops/liquids. Based on Procrustes, vowels were less complex than glides/stops/liquids, glides were less complex than stops, and stops were less complex than liquids. For children, tongue complexity based on MCI indicated that vowels were less complex than liquids. No pairwise differences were found based on NINFL or Procrustes.
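For readers unfamiliar with Holm's method, the sketch below shows the step-down adjustment applied to a set of p-values. It is a generic illustration written for this purpose, not the emmeans implementation, and the example p-values are invented rather than taken from Table 3. With four phoneme classes there would be six pairwise contrasts per metric; three are used here to keep the arithmetic easy to follow.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values are forced to be non-decreasing in the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # invented p-values
adjusted = holm_adjust(raw)
print(adjusted)
```

The adjustment is less conservative than a Bonferroni correction because only the smallest p-value is multiplied by the full number of comparisons.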

Discussion
The present study explored the utility of ultrasound-derived measures of tongue shape complexity to characterize speech sounds produced by a group of adult speakers and a group of child speakers. Our overall objective was to determine which metrics, taken individually, best represented degree of tongue shape complexity in children as well as adults. We used both qualitative and quantitative analyses to address this goal, which we summarize comparatively in the following paragraphs. Our first objective was to determine whether phonemes or natural classes of phonemes patterned differently with respect to measures of tongue shape complexity for adults. Based on qualitative analyses, MCI and Procrustes measures replicated the original results from Dawson et al. (2016), and NINFL values broadly agreed with the Dawson complexity categories. At the phoneme class level, values of complexity were typically low for vowels and high for consonants, which was confirmed with the significant differences found between vowels and each of the three consonant classes (glides, stops, and liquids) based on all three metrics in the mixed models. Pairwise comparisons extended this coarse pattern to reveal a hierarchy in which, for MCI, vowels/glides were less complex than stops, which were less complex than liquids; for NINFL, vowels were less complex than glides/stops/liquids; and for Procrustes, vowels were less complex than glides, which were less complex than stops, which were less complex than liquids. However, there were unexpected results based on the qualitative analyses, such as the discrepancies found between certain metrics and the expected Dawson categories for /ɑ/, /ʌ/, /j/, /w/, /ʒ/, and /d/ in the adult data set. Overall, the qualitative and quantitative analyses for the adult sample agree based on all metrics, revealing a pattern in which each of the consonant classes had higher tongue complexity than vowels. 
This pattern broadly agrees with the Dawson categorization of vowels as low complexity, glides (and the velar stop) as medium complexity, and liquids (and the alveolar stop) as high complexity.
Our second objective was to determine whether the patterns found with adults could also be observed in a sample of TD children for whom some phonemes were still emerging. Based on qualitative analyses, we found that there was substantial overlap of tongue shape complexity values across phonemes for MCI, Procrustes, and NINFL. Tongue shape complexity values did not show an exact correspondence with the Dawson categories, primarily due to the high values obtained for /w/, the relatively low values obtained for /t/, and the low MCI values obtained for /l/. Even though the alveolar stop was categorized as high complexity according to the Dawson categories listed in Table 1, it was considered low complexity according to the other four taxonomies. Based on the mixed models in the quantitative analysis, MCI was the only metric for which the same phoneme class hierarchy observed in adults was also observed in children, such that glides, stops, and liquids had higher tongue shape complexity than vowels. For NINFL, only liquids had significantly higher tongue shape complexity than vowels, and Procrustes showed no significant differences. The pairwise comparisons broadly corroborated these findings for MCI, showing that liquids had higher tongue shape complexity than vowels; however, no pairwise differences were found based on NINFL. For the children, the quantitative results do not fully accord with what was observed qualitatively, particularly because NINFL appears to outperform MCI in ranking the four phoneme classes in a progressively increasing complexity hierarchy (see more discussion comparing MCI and NINFL below). However, the mixed models and estimated marginal means that we implemented are more comprehensive than visual inspection because the random effects take into consideration individual target-based and word-based patterns that may have been averaged out by looking only at pooled group data. 
Despite this partial incongruency between the qualitative and quantitative analyses, the quantitative analysis serves to identify the robust finding that liquids have higher tongue shape complexity than vowels for the child sample. When considering transcription-based accuracy for the children, /ɹ/ showed lower values for some incorrect productions relative to the correct productions based on both MCI and NINFL. Overall, these results suggest that tongue shape complexity measures (MCI, and to a lesser extent, NINFL) may be instrumental in revealing differences between phonemes and phoneme classes for both adults and children, but that these measures do not always accord perfectly with phonetically informed expectations. The current analyses serve as a proof of concept for how objective tongue shape complexity measurements can be applied to child data. Despite differences in size between adults and children, it can be observed that the ranges of values for both MCI and NINFL are similar, highlighting how these metrics function independently from vocal tract size. However, the Procrustes measure was associated with different ranges of values for the adult sample (0-5 units) and child sample (0-8 units), suggesting that this measure may be less optimal for comparisons across different-sized vocal tracts. It can also be observed that earlier-developing phonemes such as /ae/ and /ɪ/ have MCI, Procrustes, and NINFL values that are roughly the same for adults and children, whereas later developing phonemes such as /l/ and /ɹ/ appear to have notably higher values for MCI and NINFL (and to a lesser extent, Procrustes) in adults than in children. 
Although it is not possible to make direct quantitative comparisons between adults and children in the present study due to the different tasks used to collect the target sounds in the two samples, the observed differences between adult and child tongue contours for these sounds support the hypothesis that tongue shape complexity increases with age for certain phonemes, and that these differences may be detectable over the course of maturation. It remains unknown whether children's tongue contours continue to show reduced complexity after production becomes perceptually accurate (i.e., whether reduced tongue shape complexity would persist covertly). Our qualitative inspection suggests that tongue shape complexity and perceived accuracy may be correlated, particularly for rhotic targets. Future research should determine at what point in childhood tongue shape complexity becomes adultlike, and how this trajectory relates to changes in perceived accuracy across late-developing targets.
Above we noted that the Procrustes range for the child productions was larger (0-8) than the range for the adult productions (0-5), suggesting that this measure may be less optimal for analyses of speakers with varying vocal tract lengths. It is also important to consider methodological differences between the Procrustes measure as compared with MCI and NINFL. MCI and NINFL both rely on the degree of curvature at each point along the contour (the combined curvature of all points for MCI and the number of points with curvature above a set threshold for NINFL) to provide a quantitative representation of complexity. The Procrustes metric is understood to rely heavily on what resting shape was selected as the starting point for the subsequent translation, rotation, and scaling. However, there is no agreed-upon method for eliciting a resting contour, and the resting contour could in principle differ in complexity between individuals of different sizes. That is, for younger speakers, the tongue fills the oral cavity more completely, so a resting shape for a younger speaker might track the palate more closely than in larger/older speakers with more space in the oral cavity. Due to this possibility, it follows that Procrustes-based complexity values may not be comparable across individuals. Synthesizing across these considerations, we favor MCI and NINFL because they are the two metrics that quantified degree of curvature most directly and with equivalence across vocal tracts of different sizes.
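The dependence of a Procrustes-type measure on the chosen resting shape can be illustrated with a small sketch. It assumes the textbook full Procrustes distance for two-dimensional shapes, computed with complex numbers after removing translation, scale, and rotation; the Dawson (2016) script may define and scale its Procrustes value differently (the values reported here range up to 5 or 8, whereas this distance is bounded by 1). All function names and contours below are hypothetical.

```python
import cmath
import math


def _preshape(points):
    """Center a contour and scale it to unit size, as complex numbers."""
    z = [complex(x, y) for x, y in points]
    centroid = sum(z) / len(z)
    z = [v - centroid for v in z]
    size = math.sqrt(sum(abs(v) ** 2 for v in z))
    return [v / size for v in z]


def procrustes_distance(shape, reference):
    """Full Procrustes distance after removing translation, scale, and rotation."""
    a, b = _preshape(shape), _preshape(reference)
    inner = sum(u.conjugate() * v for u, v in zip(a, b))
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))


def arch(height, n=20):
    """Invented contour: a sine-shaped arch of the given height."""
    return [(i / (n - 1), height * math.sin(math.pi * i / (n - 1))) for i in range(n)]


target = arch(0.5)
flat_rest, curved_rest = arch(0.0), arch(0.1)  # two candidate resting shapes
d_flat = procrustes_distance(target, flat_rest)
d_curved = procrustes_distance(target, curved_rest)

# A rotated, scaled, and shifted copy of the target should score near zero.
spin = cmath.rect(2.0, 0.7)
moved = [((complex(x, y) * spin).real + 3, (complex(x, y) * spin).imag - 1)
         for x, y in target]
d_copy = procrustes_distance(moved, target)
print(d_flat, d_curved, d_copy)
```

The same target contour receives a smaller value when the resting shape is already somewhat arched, which mirrors the concern raised above about resting shapes that track the palate more closely in smaller speakers.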
We now reflect on the relative performance of these two preferred measures, MCI and NINFL, in dividing adult and child data by phonemes and by phoneme classes, focusing on the phonemes shared across the two data sets. As seen in Figure 2 for adults and in Figure 3 for children, separation across phonemes was relatively limited for both metrics. Especially for children, there was substantial overlap of the notched boxplot intervals across targets, with median values near 2 for virtually all phonemes in both cases. As previously noted, using our coarse criterion solely with visual inspection on the Dawson categories would select NINFL as the best metric with the child data because it most effectively ranked the vowels as the lowest in complexity and /ɹ/ and /l/ as the highest in complexity. However, quantitative analyses would select MCI as the best metric overall, as the models predicting this metric were the only ones that separated phonemes by phoneme categories (vowels < glides/stops/liquids) in both the child and adult samples. The pairwise comparisons for MCI corroborated this finding while extending it to include the other consonants: for adults, vowels and glides were less complex than stops and liquids, and stops were less complex than liquids; for children, vowels were less complex than liquids. For NINFL, the pattern in which vowels were less complex than glides/stops/liquids was observed for adults, but the expected pattern in which vowels were less complex than liquids was not observed for children. Overall, we conclude that MCI is the best metric at identifying the predicted coarse pattern distinguishing vowels from liquids; furthermore, MCI is the only metric to reveal a hierarchical pattern compatible with developmental hierarchies.
Synthesizing across both qualitative and quantitative analyses for both MCI and NINFL, contours for vowels (/ae/ and /ɪ/) were consistently among the lowest in complexity and the rhotic liquid (/ɹ/) was consistently high complexity. For adults, contours for the velar stop (/g/) were predominantly in or near the borders of the intermediate third and the lateral liquid (/l/) was predominantly in the higher third; for children, contours for glides (/j/ and /w/) and the velar stop (/k/) were primarily in the intermediate third. The discrepancies were that, in the adult sample, glides (/w/ for MCI and /j/ for NINFL) were ranked as higher complexity than expected and the alveolar stop (/d/ for MCI) was ranked as lower complexity than expected, whereas the alveolar stop (/t/) and lateral liquid (/l/) were ranked as lower complexity than expected based on MCI in the child sample. Despite this noise primarily affecting the intermediate categories, MCI and NINFL converge on the separation of a low complexity class comprising vowels, an intermediate class comprising glides and stops, and a high complexity class comprising liquids.
Finally, we reflect on methodological differences between MCI and NINFL as they relate to the discrepancies between these measures observed in both the adult and the child samples. Notably, /l/ was characterized by high complexity based on NINFL but low complexity based on MCI in the child data set only. Additionally, /d/ was characterized by relatively high complexity in the adult sample based on NINFL but not MCI, while /t/ was characterized by relatively low complexity in the child sample based on MCI (and to a lesser extent NINFL). The relative complexity of /w/ in adults was reversed across the two measures: MCI indicated relatively high complexity, whereas NINFL categorized /w/ with relatively low complexity. Recall that MCI is driven by curvature and NINFL is driven by the number of inflections; as such, MCI is higher when the radius of local curvature is small (as with the locally tight curvature of the tongue tip that occurs in a retroflex /ɹ/), whereas NINFL is not sensitive to how tight the curvature is. These computational differences may account for some of the discrepancies observed across metrics, particularly for the alveolar stops and lateral liquids because they involve tongue tip elevation. See Kabakoff et al. (2021) and Kabakoff et al. (under review) for more discussion of the differences between these metrics.
When considering some of the current unexpected findings relative to taxonomies of typical development, it is important to note that the metrics we describe are limited to a midsagittal section of the tongue. In many cases, looking at multiple sections of a tongue shape would reveal complexity that cannot be reflected in a single midline section of the same tongue shape. This may be especially relevant for rhotics and sibilants (such as /ʃ/ and /ʒ/), which are produced with a midline groove with bracing of the lateral edges of the tongue. This suggests that a combination of sagittal and coronal sections, or three-dimensional ultrasound imaging, may be necessary for a comprehensive characterization of the tongue shape complexity associated with different phoneme classes. In the present study, the unexpectedly low complexity ranking of the postalveolar fricative could be explained along these lines. That is, access to a coronal section may have revealed parasagittal complexity for such targets with lateral bracing. Likewise, as presented in Gibbon (1999) and discussed in Kabakoff et al. (2021), EPG has revealed that mature /t/ is produced with both an alveolar constriction and lateral bracing, suggesting that the complexity for this target sound also cannot be fully represented in the midsagittal section. In addition to exploring other anatomical sections of the tongue, alternative measures of ultrasound tongue shapes or new articulatory technologies may be needed to quantify tongue shape complexity in a manner that will reliably differentiate productions based on phoneme, age or disorder status of the speaker, and accuracy rating. Discrepancies between expected and observed complexity may also reflect noise generated from the specific frame that was analyzed. That is, tongue contours in frames just before versus at the point of maximal constriction might differ in tongue shape complexity but not reflect meaningful differences in motor skill. Using a higher frame rate is recommended in order to increase the likelihood of capturing the actual point of maximal constriction.
The present study provides a foundation for using ultrasound-based metrics of tongue shape complexity to characterize speech productions in children as well as in adults. Although any single token of a speech sound is unlikely to be accurately classified with respect to complexity category based on any of the present tongue complexity metrics, the current analyses provide a strong case that vowels have lower tongue shape complexity than liquids. This work represents an extension of the methods used in Dawson et al. (2016) to a child population, as well as an extension of what was previously observed with EPG to the relatively more affordable and accessible ultrasound technology. Establishing sensitive and valid metrics of tongue shape complexity could make a substantive contribution to a future understanding of how motor factors influence the course of speech development in children. Although the current study did not quantitatively control for degree of perceived accuracy of the children's productions, the wide notched intervals for the late-developing sounds in our TD child data suggest that both differentiated and undifferentiated gestures may have been represented. In addition to Kabakoff et al. (2021) and Kabakoff et al. (under review), further research should determine whether tongue shape complexity for these targets distinguishes perceptually correct from incorrect productions, as found for rhotic targets in Preston et al. (2019) and as suggested by visual inspection of our data.
As an additional future direction, it would be valuable to examine whether lingual differentiation is higher for TD children than children with SSD, as found in Gibbon (1999), Preston et al. (2019), and Kabakoff et al. (2021). Subsequent research could investigate whether tongue shape complexity measures can identify subtypes within the population of children with SSD, such as those who are most likely to show errors that persist later in development or those whose speech errors are most likely to have a motor-based etiology. This could pave the way for a clinical application in diagnosis and treatment planning. That is, if tongue shape complexity measures from a child with SSD are suggestive of motor involvement (i.e., reduced tongue shape complexity for late-developing phonemes), a motor-based approach to treatment, such as ultrasound biofeedback (e.g., Cleland et al., 2015), might be recommended. If tongue shape complexity measures do not provide evidence for motor involvement (i.e., relatively high tongue shape complexity), a phonological approach to intervention might instead be recommended. For those with motor-based impairments, the current metrics may also serve the additional purpose of helping to quantify a baseline level of tongue shape complexity and track progress over the course of treatment.

Conclusion
In the present study, we asked which among three metrics best reflects the degree of complexity of a given tongue shape. Results from applying MCI, NINFL, and Procrustes metrics to adult productions suggested that they group the contours broadly into the complexity categories proposed by Dawson et al. (2016), although there were exceptions. These metrics also can be applied to child productions, potentially providing insight into developmental patterns that are not observable through perceptual analyses alone. Our evidence suggests that MCI and (to a lesser extent) NINFL are well-suited for detecting differences in tongue shapes between vowels and liquids, whereas the Procrustes method poses additional analytical challenges. In order to determine the true utility of these metrics in clinical populations, future research should apply these metrics to child populations differing in age, disorder status, and degree of perceived accuracy, as well as compare tongue shape complexity before and after treatment targeting lingually complex phonemes.