Auditory discontinuities interact with categorization: Implications for speech perception a)

Behavioral experiments with infants, adults, and nonhuman animals converge with neurophysiological ﬁndings to suggest that there is a discontinuity in auditory processing of stimulus components differing in onset time by about 20 ms. This discontinuity has been implicated as a basis for boundaries between speech categories distinguished by voice onset time ! VOT " . Here, it is investigated how this discontinuity interacts with the learning of novel perceptual categories. Adult listeners were trained to categorize nonspeech stimuli that mimicked certain temporal properties of VOT stimuli. One group of listeners learned categories with a boundary coincident with the perceptual discontinuity. Another group learned categories deﬁned such that the perceptual discontinuity fell within a category. Listeners in the latter group required signiﬁcantly more experience to reach criterion categorization performance. Evidence of interactions between the perceptual discontinuity and the learned categories extended to generalization tests as well. It has been hypothesized that languages make use of perceptual discontinuities to promote distinctiveness among sounds within a language inventory. The present data suggest that discontinuities interact with category learning. As such, ‘‘learnability’’ may play a predictive role in selection of language sound inventories. © 2004 Acoustical Society of America. # DOI: 10.1121/1.1778838 $


I. INTRODUCTION
Considering the large set of physically realizable sounds of the human vocal tract, the regularity in phonetic inventories of surveyed languages is remarkable ͑Maddieson, 1984͒. There have been a number of hypotheses about factors that shape preferred sound inventories. A common theme running throughout these proposals is that sounds are favored if they are easy to produce and/or easy to distinguish auditorily from other sounds ͑Diehl et al., 2003;Liljencrants and Lindblom, 1972;Stevens, 1972. For example, according to the quantal theory ͑Stevens, 1972, 1989͒, there are regions along certain phonetic dimensions where articulatory changes have large acoustic consequences and other regions where comparable articulatory variation results in much more modest acoustic change. These alternating regions of acoustic instability and stability are assumed to create conditions for a kind of optimization of sound inventories. If sound categories are located in the stable regions, they require less articulatory precision and are thus easier to produce reliably than sound categories in unstable regions. These stable sound cat-egories will also tend to be auditorily distinctive because they are separated by regions of acoustic instability.
Of course, communicative benefits arise not from acoustic distinctiveness among sounds but from distinctiveness within the human auditory system. Mathematical simulations that incorporate distance metrics based on realistic auditory representations have been able to predict some of the regularities witnessed in phonetic inventories ͑Diehl et al., 2003;Lindblom, 1986͒. Auditory representations differ from acoustic representations because the auditory perceptual space may be ''warped'' in such a way that equal acoustic distinctions need not produce equivalent perceptual changes. Languages may capitalize on regions of perceptual space where sensitivity is enhanced, adopting sounds for which moderate changes in articulatory or acoustic characteristics result in disproportionately large perceptual consequences. For example, there are some regions of acoustic space where sharp peaks in discrimination functions are observed. These ''auditory discontinuities'' ͓natural boundaries in perceptual space across which discrimination is enhanced ͑Kuhl and Miller, 1975͔͒ 1 would be good candidates for placement of boundaries between phonetic categories because tokens from each category would have enhanced distinctiveness with minimal acoustic change. Enhanced distinctiveness in turn provides more robust communication. Stevens ͑1989͒ extended the quantal theory to the relation between acoustic a͒ Preliminary reports of this work were presented at the 145th Meeting of the Acoustical Society of America, June 2003 and at the 10th Annual Cognitive Neuroscience Society Meeting, March 2003. b͒ Electronic mail: lholt@andrew.cmu.edu signals and auditory responses. Certain nonlinearities in the auditory system were assumed to give rise to regions along acoustic dimensions that are auditorily stable and other regions along those same dimensions that are auditorily unstable. Again, it was hypothesized that sound categories tend to be located in the stable regions. An intervening unstable region would then correspond to a kind of auditory discontinuity or natural boundary that enhances the distinctiveness of sound categories.
In addition to enhanced distinctiveness, there is another benefit that may be conferred by utilizing auditory discontinuities as phonetic boundaries. A discontinuity may provide a parsing of the auditory space and delineate an initial ''natural'' boundary between phonetic categories ͑Kuhl, 1993͒. For a novice language-learner, these boundaries could facilitate the imposing task of forming phonetic categories from the considerable acoustic variance in speech. Phonetic systems that include boundaries coincident with auditory discontinuities may thus be more learnable. If this is the case, learnability may be a characteristic of communication systems that exerts selective pressure on the development of languages ͑Deacon, 1997; Lotto, 2000͒. That is, attributes of language systems that promote learning may tend to be favored. This includes aspects of phonetic categories. Evidence from studies of categorization in other domains suggests that some sets of stimuli are more readily categorized than others ͑Alfonso- Reese et al., 2002;Ashby et al., 1999͒. Many factors may contribute to the relative ease of categorization including stimulus dimensionality, covariance among stimulus dimensions, nonlinearity in category boundaries, and overlap in input distribution variability.
We propose here that the placement of category boundaries at preexisting auditory discontinuities may facilitate phonetic category formation and that this increase in learnability may be one of the sources of regularities observed in phonetic inventories. A particular example is the regularity in voicing category distributions seen across languages and the relation of this regularity to a well-documented nonmonotonicity in temporal processing.

A. Auditory discontinuities in temporal processing
Although a host of cues differentiate voiced from voiceless consonants ͑Lisker, 1986͒, the temporal interval between oral release and onset of voicing ͑voice onset time, VOT͒ is especially predictive of the contrast. Across languages, speakers tend to produce three kinds of voicing: lead consonants for which voicing precedes the release of the consonant occlusion ͑thus having a negative VOT value͒, short lag consonants for which voicing closely follows the oral release ͑with VOT values around 0-20 ms͒, and long lag consonants for which voicing lags well behind the consonant release ͑with large positive VOT values͒. Frequency distributions of VOT productions aggregated across languages reflect three modes, with modes centered at Ϫ100, ϩ10, and ϩ75 ms VOT ͑Cho and Ladefoged, 1999;Keating, 1984;Lisker and Abramson, 1964͒. 2 These characteristic distributions are cross-linguistically regular not just in their modes, but also in more precise statistical patterning. For example, distributions of VOT productions typically have a gap in the region cor-responding to small negative VOT values. These values appear very rarely in speech production cross-linguistically. Keating ͑1984͒ suggests that this gap may be a general characteristic of languages regardless of the number of voicing categories. The extent, or range, of the distributions appears to be fairly regular across languages as well. Perhaps partly as a consequence of the absence of VOT productions in the small negative VOT range, short-lag distributions tend to have quite constrained variability across VOT, with values clustered from about 0 to 20 ms VOT ͑Keating, 1984͒. Therefore, cross-linguistically, boundaries between the voicing distributions generally occur at roughly Ϫ20 and ϩ20 ms VOT.
Most languages make use of only two of the three distributions. English, for example, distinguishes short-lag from long-lag consonants, whereas Spanish contrasts lead with short-lag voicing. Of note, no language reported by Lisker and Abramson ͑1964͒ contrasts lead with long-lag voicing. This is of interest because theories predicting sound inventories from acoustic factors might expect to observe two-way voicing contrasts positioned at opposite ends of the VOT continuum, thus enhancing the acoustic distinctiveness of the contrast. However, each of the 51 languages surveyed by Keating et al. ͑1983͒ employed some type of short-lag stop consonant. Thus, rather than dispersing voicing productions acoustically, languages tend to contrast lead or long-lag voicing with the more intermediate ͑and less acoustically dis-tinct͒ short-lag distribution.
Provocative early findings in infant speech perception indicated that mechanisms responsible for patterns of voicing perception are functional at a very early age ͑Eimas et al., 1971;Trehub and Rabinovich, 1972͒. Moreover, experience appears to play a relatively minor role in shaping voicing discrimination early in development ͑Lasky et Streeter, 1976͒. Lasky et al. ͑1975͒, for example, found that infants from a Spanish-speaking environment discriminate differences in VOT if stimuli straddle either the Spanish ͑lead versus short-lag͒ or the English ͑short-lag versus long-lag͒ VOT boundary but show little evidence of discrimination otherwise. This pattern contrasts with Spanish adults, who exhibit a single peak in discriminability across VOT at the lead versus short lag boundary ͑Abramson and Lisker, 1972;Elman et al., 1977;Williams, 1977͒. Aslin et al., ͑1981͒ have found comparable patterns of discriminability for infants from English-speaking homes. Moreover, Kikuyu-reared infants exhibit a similar discrimination pattern along a labial VOT continuum despite the fact that Kikuyu adults do not contrast voicing for labial articulations ͑Streeter, 1976͒. Thus, infants may exhibit regions of betterthan-chance discriminability along a VOT dimension that are not characteristic of native adult perceivers. Moreover, the regions of best discrimination are relatively consistent across groups of infants and align fairly well with the boundaries at Ϫ20 and ϩ20 ms VOT that partition the trio of distributions observed in VOT productions across languages.
Research with nonhuman animals has suggested that general characteristics of mammalian auditory processing may be responsible for these patterns of discrimination. Chinchillas and macaque monkeys without significant expe-rience with speech discriminate the VOT dimension much like human listeners ͑Kuhl and Miller, 1975;1978;Kuhl and Padden, 1982͒. Kuhl and Miller interpreted these results as indicative of a general psychoacoustic discontinuity in VOT perception. Pisoni ͑1977͒ examined the generality of this proposal with human adult listeners by constructing a nonspeech dimension modeling a temporal characteristic of VOT ͑see also Miller et al., 1976͒. The nonspeech stimuli consisted of coterminous tones of 500 and 1500 Hz. Along a tone-onset-time ͑TOT͒ dimension, tones varied in their relative onset times such that the lower-frequency tone preceded the higher-frequency tone in onset, the two tones were simultaneous, or the lower-frequency tone lagged the higherfrequency tone. A schematic of three such stimuli is presented in Fig. 1. In a series of experiments, Pisoni ͑1977͒ found two regions of relatively more accurate TOT discrimination. Listeners were excellent at discriminating stimuli straddling approximately Ϫ20 and ϩ20 ms TOT, but were poorer at discriminating stimuli spanning the same acoustic distance in other regions along the TOT dimension. Interestingly, the two regions of enhanced nonspeech TOT discrimination align quite closely with the Ϫ20 and ϩ20 ms boundaries that partition the three distributions of VOT productions that are observed across languages ͑Lisker and Abramson, 1964͒. Figure 2 illustrates this discrimination pattern. 3 Similar patterns of TOT discrimination performance are obtained for 10-week-old infants ͑Jusczyk et al., 1980͒. Electrophysiological studies have revealed several neurophysiological correlates of discrimination peaks for VOT and TOT stimuli ͑e.g., Simos and Molfese, 1997;Simos et al., 1998;Sinex et al., 1991;Steinschneider et al., 1999͒. Overall, data from adult and infant perception of speech and nonspeech, neurophysiological measures, and animal behavioral experiments indicate that the acoustic VOT and TOT dimensions are not identical to their auditory representations. Rather, there is a discontinuity such that stimulus pairs differentiated by equal acoustic change do not necessarily produce equivalent perceptual change; despite the equivalence of acoustic differences, discriminability may be enhanced for one pair relative to the other. The crosslinguistic regularities in voicing distributions suggest that many languages may exploit this discontinuity by drawing voicing categories from either side of the region of enhanced sensitivity. One benefit of this would be greater perceptual distinctiveness of tokens with small changes in acoustics ͑Kuhl and Miller, 1975;Pisoni, 1977͒. A second possible benefit would be that the discontinuities provide an initial parsing of the VOT space that can facilitate learning voicing categories that are consistent with these natural boundaries. That is, in addition to preferring sound categories that are ''easy to say'' and ''easy to hear,'' languages may favor categories that are ''easy to learn.'' The present experiments were designed to test the hypothesis that operating characteristics of the auditory system constrain the learnability of sound categories. Specifically, they test the prediction that categories defined by onset asynchrony are easier to learn if they are consistent with the auditory discontinuity shown to occur at onset asynchronies near 20 ms. In these experiments, the categories are made up of TOT stimuli similar to those described by Pisoni ͑1977͒. The use of nonspeech stimuli provides control over stimulus attributes and participants' history of experience with the novel nonspeech TOT stimuli. As with the original work of Pisoni ͑1977͒, the results of nonspeech experiments can illuminate some of the general perceptual and cognitive processes that are involved in online speech perception and that may bias the development of linguistic systems.

II. EXPERIMENT 1
Participants in experiment 1 were trained with explicit feedback to assign TOT stimuli to two categories. One group of listeners learned two categories for which the boundary ͑as indicated by feedback and the tails of the distributions͒ was consistent with the reported peak in discriminability at approximately ϩ20 ms. Another group of listeners learned two categories for which the boundary was inconsistent with the peak in discriminability. For this group of listeners, the peak in discrimination at ϩ20 ms occurred within a category.
FIG. 1. Schematic spectrograms of TOT stimuli modeled after those of Pisoni ͑1977͒. Two tones ͑500 and 1500 Hz͒ vary in relative onset to create series of stimuli varying from Ϫ60 to ϩ120 ms TOT. Three representative stimuli are shown.
FIG. 2. Discrimination performance of participants in a pilot discrimination test illustrate the auditory discontinuity at approximately Ϫ20 ms and ϩ20 ms. Although each stimulus pair spans an equal acoustic distance, discrimination is relatively more accurate when stimulus pairs straddle approximately Ϫ20 or ϩ20 ms, as indicated by the arrows.
If learnability is enhanced by aligning category boundaries with perceptual discontinuities, then participants in the consistent condition should learn categories more quickly; that is, they should be faster to reach a criterion level of categorization performance. If learning does not interact with characteristics of the auditory space, then there should be no difference in the time to learn categories in the two conditions.
A. Methods

Participants
Twenty-eight listeners participated. One-half of the participants were randomly assigned to the consistent learning condition; the others were assigned to the inconsistent learning condition. All participants reported normal hearing, learned English as a first language, and were Carnegie Mellon University undergraduates or staff. Each participant was paid $15.

Stimuli
The stimuli were modeled after those of Pisoni ͑1977͒ and were synthesized using Matlab® ͑Mathworks, Inc.͒. A TOT dimension was constructed by creating two-tone stimuli with 500-and 1500-Hz tone components. The 1500-Hz tone had a constant duration of 230 ms and was approximately 12 dB lower in amplitude than the 500-Hz tone. Duration of the 500-Hz tone was varied to produce TOT differences. The only deviation from Pisoni's stimulus construction methods was the addition of 5-ms linear amplitude ramps at onset and offset. Stimuli were labeled with negative TOT values when the lower-frequency tone preceded the higher-frequency tone and with positive TOT values when the lower-frequency tone lagged the higher-frequency tone. A TOT value of zero indicated simultaneous onset. The full TOT series consisted of 37 stimuli varying from Ϫ60 to ϩ120 ms TOT in 5-ms steps. 4 Figure 1 illustrates representative pseudospectrograms for Ϫ50, 0, and ϩ50 ms TOT.
Acoustic presentation was controlled by TDT System II hardware ͑Tucker-Davis Technologies͒. Stimuli were converted from digital to analog, low-pass filtered at 4.8 kHz, amplified and presented binaurally over linear headphones ͑Beyer DT-100͒ at approximately 70 dB SPL͑A͒.

Procedure
Participants first received categorization training with feedback to reach a criterion level of performance. Following training, listeners completed an ABX discrimination task. Next, they received a short categorization refresher. Finally, they again categorized stimuli in a generalization test across the entire TOT series without feedback. Each segment of the experiment was completed while the listener sat comfortably in a sound-attenuating booth wearing headphones and holding an electronic response box. Instructions preceded each segment on an LCD display mounted at eye level in front of the participant. Participants were allowed to take short breaks between segments and the entire session lasted less than 2 h. a. Training. During training, listeners heard a single stimulus over headphones, made a two-alternative forced choice categorization response, and received feedback via a light above the correct response-box button. There were no labels on the response buttons.
Listeners in the consistent condition ͑for which auditory discontinuity, input distribution boundary, and boundary of feedback assignment were the same͒ heard stimuli drawn from the distributions illustrated as white bars in Fig. 3͑a͒. The height of the bar illustrates the relative frequency of stimulus presentation for a particular TOT value. The arrows indicate peaks in discrimination observed at Ϫ20 and ϩ20 ms TOT ͑see Fig. 2 pilot-test data͒. For the consistent condition, stimuli were sampled from two distributions separated along the TOT dimension at approximately ϩ20 ms. Stimuli with TOT values less than ϩ20 ms were assigned to one category and stimuli with TOT greater than ϩ20 ms were assigned to the other category.
The input distributions of the inconsistent condition ͑for which auditory discontinuity differed from the input distribution boundary and the boundary of feedback assignment͒ are shown as dark bars in Fig. 3͑a͒. The input distributions were separated along the TOT dimension at approximately ϩ40 ms, a region of the TOT space that has no known nonlinearities in discriminability. The psychoacoustic discontinuity at ϩ20 ms fell at the mode of one of the inconsistent condition distributions.
For each block of trials, listeners heard 46 stimuli that fully sampled the two input distributions. Listeners were instructed that some of the sounds corresponded to the left button whereas others corresponded to the right button. Following each response, the light above the correct button was illuminated as feedback. The mapping between distribution FIG. 3. Stimulus input distributions illustrate the relative frequency of stimulus presentation during training as a function of TOT. ͑a͒ Shows stimulus distributions of experiment 1. ͑b͒ Shows stimulus distributions for experiment 2. White bars correspond to consistent condition stimuli for which distributions and feedback assignment are coincident with the auditory discontinuity. Black bars show inconsistent condition stimuli where the auditory discontinuity is positioned within a distribution and feedback assignment is inconsistent with the auditory discontinuity. and response button was counterbalanced across participants. Listeners were encouraged to maximize performance and were told that the length of the experimental session was a function of performance.
After each block of training, listeners' percent correct identification across the 46 categorization trials was calculated. Listeners completed training blocks until reaching a criterion of 90% correct performance ͑42 of 46 trials correct͒ or a maximum of 20 blocks ͑920 trials overall͒. Categorization of stimuli within a block took less than 5 min to complete.
b. Discrimination. After reaching criterion performance on the training segment or reaching the 20-block maximum, listeners completed an ABX discrimination task. On each trial, a stimulus pair ͑A and B͒ separated by five steps ͑25 ms͒ along the TOT dimension spanning Ϫ25 to ϩ85 ms was presented with a 750-ms interstimulus silent interval. After another 750 ms of silence, a third stimulus ͑X, identical to either A or to B͒ was presented. Participants indicated whether the third stimulus was identical to either the first or to the second stimulus by pressing a button on an electronic response box. In all, participants discriminated 23 pairs of stimuli that occurred in four counterbalanced permutations per pair ͑ABA, ABB, BAA, BAB͒ for a total of 92 discrimination trials. Participants discriminated this full block of trials two times ͑making 184 overall discriminations͒. Collapsed across counterbalanced stimulus orderings, each subject made eight discrimination responses for each pair. Participants received feedback on each trial. Pisoni ͑1977; Pisoni et al., 1982͒ has argued that providing feedback in the discrimination test is critical because it forces participants to focus on the intended stimulus dimensions and to maintain the same focus from trial to trial.
c. Categorization refresher. To refamiliarize category mappings after the discrimination task, listeners heard two additional blocks, each fully sampling the input distributions. The procedure was identical to that of training. Listeners responded with a categorization response and received feedback on each trial. d. Generalization test. Finally, listeners completed a generalization test of categorization. This task provided a test of categorization across the entire TOT range including stimuli not experienced during training. From these data, identification boundaries estimated using probit analysis and generalization of the learning were assessed. Stimuli spanned Ϫ60 to ϩ120 ms TOT. Each stimulus was equally probable and was presented three times. Listeners were instructed to categorize the stimuli as in previous segments of the experiment. There was no feedback.

Training
The primary hypothesis of experiment 1 was that the auditory discontinuity at approximately ϩ20 ms TOT would interact with category learning. Specifically, it was predicted that category learning should be facilitated in the consistent condition, for which the category boundary coincided with the auditory discontinuity. As can be seen in Fig. 4, the data support this prediction. Listeners in the inconsistent condition required significantly more blocks of categorization training ͑Mϭ8.21͒ to learn the TOT categories to a criterion level than did consistent-condition listeners ͑Mϭ3.36͒, t(13)ϭ2.28, pϽ0.02.

Generalization
Listeners labeled TOT stimuli varying from Ϫ60 to ϩ120 ms TOT in the generalization segment of the experiment, thus responding to stimuli present in categorization training and to novel stimuli. These categorization results ͑presented as a function of an arbitrary category ''A'' indicating the input distribution with the lowest mean value along the TOT dimension͒ are shown with input distributions from training in Fig. 5. A 2ϫ37 ͑condition ϫ TOT stimulus͒ ANOVA revealed a significant main effect of condition, F(1,36)ϭ18.77, pϽ0.001. Inconsistent condition participants categorized more of the TOT series stimuli as members of the ''A'' category than consistent condition listeners. There was also a significant main effect of TOT Stimulus, F(1,36)ϭ81.57, pϽ0.001, indicating that categorization varied as a function of the TOT stimulus series as would be expected if listeners learned to accurately categorize the stimuli. There was also a significant condition ϫ TOT stimulus interaction, F(1,36)ϭ2.28, pϽ0.001, indicating that the identification functions differed in their shape across conditions. This pattern most likely arises from consistent condition listeners' categorization of TOT stimuli on the far negative end of the stimulus series ͑see Fig. 5͒. Consistent condition listeners imperfectly generalized to stimuli at this end of the series, instead rather consistently mimicking the frequency distribution of the trained categories in their generalization responses. Identification boundaries, as estimated from probit boundary analyses computed for individual subjects, support these analyses. There was a significant boundary shift across conditions, t(13)ϭ5.42, pϽ0.001 ͑33.17 ms TOT for consistent listeners versus 51.62 ms TOT for inconsistent listeners͒. There was no significant difference in the probit slopes, t(13)ϭ0.63, pϭ0.54; the steepness of the identification functions was statistically equivalent across conditions.

Discrimination
Listeners also discriminated TOT stimuli in an ABX task. Whereas categorization and generalization analyses demonstrated significant effects across groups, there was little evidence that patterns of discrimination shifted with categorization training. A 2ϫ23 ͑condition ϫ discrimination pair͒ ANOVA revealed that consistent group listeners were somewhat better at all discriminations although not significantly so, F(1,22)ϭ3.22, pϽ0.09. Even this slight trend is surprising, considering that consistent condition listeners had significantly less experience listening to the stimuli because they were faster to reach criterion categorization performance. Pairs were not equally discriminable, F(1,22) ϭ4.80, pϽ0.0001, but the patterns did not relate cleanly to categorization experience. There was no condition ϫ discrimination pair interaction, F(1,22)ϭ1.28, n.s. Overall, discrimination does not appear to be a reliable index of listeners' categorization competence in the present paradigm.

Summary
The main prediction of the learnability hypothesis was supported by the results of experiment 1. Participants trained with a category boundary that was consistent with the auditory discontinuity learned to categorize the sounds faster than participants in the inconsistent condition. The data also make it clear that the presence of a single discontinuity does not determine possible learnable categories. Participants in the inconsistent condition did learn the categories on which they were trained. The generalization test shows a shift in identification boundaries and no obvious change in the identification function related to the discontinuity region. Thus, characteristics of general auditory representations appear to affect the ease with which categories are learned, but the general learning system is capable of overcoming the initial parsing of the perceptual space.

III. EXPERIMENT 2
In experiment 1, the two conditions differed in the range of TOT values that were categorized as well as the consistency of the category boundary with the auditory discontinuity. Although the overall range of TOT was equivalent across conditions, the inconsistent distributions included more large TOT values than did consistent distributions. It is possible that the perceptual space spanned by the acoustic TOT stimuli is compressively nonlinear ͑Fechner, 1860͒, rendering perceived stimulus differences of the inconsistent condition less salient and thereby increasing the difficulty of the inconsistent categorization task. This, too, would be an interaction of general operating characteristics of the auditory system with categorization, but not the sort under investigation in experiment 1.
Experiment 2 controls for this possibility. For this experiment, the two input distributions of the experiment 1 inconsistent condition were modified such that the input distribution with a mode at ϩ65 ms TOT ͓see Fig. 3͑a͔͒ was reflected about ϩ20 ms TOT to create a new input distribution with negative TOT values. The other three input distributions of experiment 1 remained unchanged, as shown in Fig. 3͑b͒. Note that, in experiment 2, both of the inconsistent-condition input distributions spanned a region in auditory space defined by high discriminability. The peak in discriminability at approximately ϩ20 ms fell within one input distribution and the peak in discriminability at about Ϫ20 ms fell within the opposite distribution. As a result, experiment 2 provides an opportunity to observe whether categories inconsistent with both of these auditory discontinuities are even more difficult for listeners to learn.

Participants
Participants were recruited from the Carnegie Mellon University undergraduate psychology participant pool. Each participant earned a credit toward a psychology research requirement and $5 for participation. All participants had normal hearing and learned English as the first language. Fourteen listeners participated in each condition for a total of 28 participants.

Stimuli
The TOT series stimuli were identical to those of experiment 1. FIG. 5. Categorization responses ͑presented on the right-most y-axis as a function of an arbitrary category ''A'' indicating the input distribution with the lowest mean value along the TOT dimension͒ to stimuli presented in the generalization task are shown for consistent and inconsistent condition listeners in experiment 1. The histogram bars denote the position of the categories learned during training along the TOT dimension as a function of the relative frequency of stimulus presentation plotted on the left-most y-axis.

Procedure
The procedure was nearly identical to that of experiment 1. The only difference between experiments 1 and 2 was the position of one of the inconsistent condition distributions along the TOT dimension. Figure 3͑b͒ illustrates the input distributions for experiment 2. Three of the four input distributions are the same as those of experiment 1. The TOT input distribution from experiment 1 centered on ϩ65-ms TOT was replaced with an inconsistent condition distribution centered at Ϫ25-ms TOT, creating a boundary at approximately Ϫ5-ms TOT.
As a consequence of the reflected input distribution, inconsistent condition listeners in experiment 2 were required to categorize input distributions that crossed both Ϫ20 and ϩ20 ms discontinuities. As a result, the inconsistent condition learning task of experiment 2 may have been even more difficult than that of experiment 1, for which only one input distribution crossed a discontinuity. Stimulus distributions for the consistent condition were equivalent to those of experiment 1 and thus provide a replication.
Tasks and experimental apparatus for this experiment were identical to those of experiment 1. Figure 6 illustrates the mean number of blocks to criterion categorization performance across conditions. The results of experiment 1 do not appear to be merely a result of compressive nonlinearity in perception of large TOT values. Listeners in the inconsistent condition required significantly more blocks of categorization training ͑Mϭ16.86͒ to learn the TOT categories to a criterion level than did consistentcondition listeners ͑Mϭ6.57͒, t(13)ϭ4.71, pϽ0.0001. Thus, the results of experiment 1 are replicated in experiment 2.

Training
The consistent condition was identical to that of experiment 1. As would be expected, there was no significant dif-ference in the number of blocks listeners required to reach criterion in the consistent condition across experiments, t(13)ϭ1.908, pϭ0.79. However, it took listeners significantly longer to learn the inconsistent categories of experiment 2 than those of experiment 1 ͓16. 86 vs 8.21 blocks, t(13)ϭ2.99, pϽ0.01]. Note that one of the distributions for the inconsistent condition was identical in both experiments and the distance between distributions was constant. Despite these similarities, participants in experiment 2 performed more poorly in the categorization training task.
In fact, listeners in experiment 2 inconsistent condition had difficulty reaching 90% accuracy within the arbitrarily set maximum 20 blocks ͑920 trials͒ of training. In experiment 1, categorization performance in the last block of training was equivalent across conditions, t͑13͒ϭ1.84, pϭ0.09, suggesting that listeners ultimately learned their respective categories equally well. However, in experiment 2, consistent condition listeners were significantly more accurate in the last block of training than were inconsistent condition listeners, t(13)ϭ3. 866, pϭ0.002. This is primarily due to the high proportion of inconsistent condition listeners (N ϭ10, 71%͒ that did not reach 90% criterion within 20 training blocks. Categorization accuracy in the last block was only 72.05% for this subset of inconsistent condition listeners. Nevertheless, the task of inconsistent listeners was not impossible. One listener achieved criterion categorization performance in a single block of trials. 5 It should also be noted, however, that there was a difference in the listener samples between experiments 1 and 2 such that across conditions, listeners in experiment 2 were slower ͑Mϭ11.71 blocks͒ to reach criterion than experiment 1 listeners ͑Mϭ5.78 blocks͒, t(27)ϭ3.44, pϭ0.002, and the performance of experiment 1 Inconsistent condition listeners was not statistically distinguishable from the consistent condition listeners of experiment 2, t(13)ϭ0.583, pϭ0.57. This latter piece of evidence should be considered in light of the fact that the arbitrarily 20-block maximum for the training session differentially affected inconsistent condition listeners, creating the possibility of a ceiling effect for the inconsistent condition and artificially reducing the hypothesized effect between the inconsistent and consistent conditions. Thus, interpretation of between-experiment comparisons should be interpreted with some caution. Nevertheless, whatever the source of listener differences between the experiments, each group's performance supports the primary hypothesis that the operating characteristics of the auditory system interact with learning to make inconsistent condition categories more difficult to learn than consistent condition categories.

Generalization
In the generalization segment of the experiment, listeners labeled stimuli varying from Ϫ60 to ϩ120 ms TOT with no feedback. Average categorization functions ͑expressed in terms of an arbitrary category ''A''͒ for consistent and inconsistent condition listeners are shown in Fig. 7. A 2ϫ37 ͑condition ϫ TOT stimulus͒ ANOVA of these results revealed a significant main effect of TOT Stimulus, F(1,36) ϭ83.04, pϽ0.0001, corresponding to the sigmoid shape of the categorization function and indicating quite consistent categorization by listeners. Contrary to the results of experiment 1, there was no significant influence of condition upon generalization responses, F(1,36)ϭ1.06, pϭ0.3. As indicated in Fig. 7, there was no evidence of a category boundary shift. There was no significant difference in probit boundaries, t(12)ϭ0.272, pϭ0.79 (M Con ϭ34.4, M Incon ϭ37.2), and no difference in the slopes of the functions, t(12) ϭ1.84, pϭ0.09.

Discrimination
In a 2ϫ23 ͑condition ϫ stimulus pair͒ ANOVA, there was a main effect of stimulus pair, F(1,22)ϭ5.283, p Ͻ0.0001, indicating differences in discrimination accuracy as a function of stimulus pair. There was no condition main effect (FϽ1) and no condition by stimulus pair interaction (FϽ1).

Summary
The effect observed in experiment 1 was replicated in experiment 2. As in experiment 1, consistent condition listeners were faster to reach criterion categorization performance than listeners in the inconsistent condition. This result supports the original contention that it is the placement of the category boundary in relation to the auditory discontinuity and not the range over which TOT stimuli are categorized that makes the inconsistent distributions more difficult to learn.
It is of potential interest that experiment 2 inconsistent condition listeners were significantly slower to learn the categories than were experiment 1 inconsistent condition listeners. This appears to be due to the placement of experiment 2 inconsistent condition distributions on auditory discontinuities at both Ϫ20 and ϩ20 ms TOT. The difficulty of learning the categories in the inconsistent condition of experiment 2 may be responsible for the absence of a category boundary shift relative to the consistent condition in the generalization test.

IV. DISCUSSION
The present findings indicate that auditory categories are not equally learnable. Basic auditory sensitivities interact with auditory categorization, causing some distributions of sounds to be more easily categorized than others. Specifically, categories separated by regions of high discriminability are more easily learned than those that span such regions. Nevertheless, despite the advantage that they confer to learnability, psychoacoustic sensitivities do not necessarily dictate categorization. Listeners in the inconsistent conditions required more experience, but did ultimately learn to categorize the sounds with high accuracy in experiment 1.
These results may have implications for learning phonetic categories. TOT stimuli were originally created to mimic the onset asynchrony between lower and higher frequency energy that corresponds to VOT in speech ͑Pisoni, 1977͒. If this analogy holds, then considering the present results it may be hypothesized that the placement of phonetic category distributions on either side of the discontinuity around 20 ms may confer enhanced learnability of speech categories distinguished by VOT. If linguistic systems are selected to be learnable, then phonetic inventories would be expected to include categories separated by the ϩ20 and Ϫ20 ms VOT boundaries. This pattern is, in fact, widely attested among the world's languages ͑Keating, 1984; Lisker and Abramson, 1964͒. For example, the English labial voicing distinction ͑/b/ vs /p/͒ is realized with distributions of VOT values on either side of the 20-ms boundary. The fact that inconsistent listeners eventually reached criterion performance suggests that contrasts that do not follow this pattern are not unachievable, but may be more difficult to learn.
Previous attempts to account for regularities in phonetic inventories have focused on the communicative benefits arising from ease of production and ease of discrimination ͑e.g., Diehl et al., 2003;Liljencrants and Lindblom, 1972;Lindblom, 1986;Stevens, 1972͒. In the case of VOT, the ''easy to say''/''easy to hear'' approach proposes that aligning category boundaries with auditory discontinuities enhances discriminability of tokens from different categories without imposing substantial articulatory costs. The learnability account proposed here should be seen as augmenting, not replacing, such accounts. Phonetic systems should be easy to say, easy to hear, and easy to learn. In fact, if assignment of speech sounds to equivalence classes ͑e.g., phonetic categories͒ serves a communicative function, then the ease with which categories can be formed and the accuracy with which sounds are assigned to categories may be an essential value on which linguistic systems are selected.
The proposal that auditory discontinuities can provide an initial parsing of the perceptual space for categorization is similar to Kuhl's description of ''natural auditory boundaries'' in her native language magnet theory of phonetic acquisition ͑Kuhl, 1991acquisition ͑Kuhl, , 1993. According to Kuhl ͑1993͒, the initial discriminative abilities of infants reflect the presence of these boundaries. As infants are exposed to native language distributions of speech sounds, these boundaries are modified and sometimes deleted by the process of categorization. The results of the current experiments lend some support to this proposal. The onset asynchrony discon- FIG. 7. Categorization responses during generalization task for consistent and inconsistent condition listeners in experiment 2. tinuity provided a natural boundary for the consistent condition categories but the negative effects of this discontinuity in the inconsistent condition of experiment 1 were ameliorated by learning. However, we must be cautious in generalizing from this one example to all phonetic categories. The voicing contrast is a fairly clear case of alignment between linguistic boundaries and natural auditory boundaries. There is also evidence for natural boundaries between some placeof-articulation contrasts for syllable-initial consonants ͑Eimas, 1974;Kuhl and Padden, 1983͒. However, there is no empirical evidence that auditory discontinuities are pervasive and underlie all or even most phonetic distinctions ͑e.g., Macmillan, 1987;Rosen and Howell, 1987͒. For example, recent studies have not found strong support for proposed auditory discontinuities in the vowel space ͑Fahey and Hoemeke and Diehl, 1994;Molis et al., 1998͒. Inasmuch as auditory discontinuities do not dictate the categories that can be learned ͑recall that inconsistent condition listeners could learn the present categories with more experience͒, it is also important to acknowledge that auditory discontinuities themselves are malleable. Auditory discontinuities appear not to be immutable boundaries independent of other stimulus characteristics, but rather are a function of the spectral and temporal makeup of the stimulus and its surrounding context. For example, the boundary for TOT identification shifts with changes in the frequency separation between the two tones ͑Parker, 1988͒. Similarly, animal studies of VOT identification demonstrate boundary shifts with increased frequency separation of F1 ͑at energy onset͒ and higher formants ͑Kluender and Lotto, 1994;Kuhl and Miller, 1978͒. Recent intracranial electrophysiological recording experiments in humans and monkeys provide physiological evidence that the psychophysical discontinuity typically described as the ''20-ms boundary'' is dependent on the spectral characteristics of the stimuli ͑Steinschneider et al., 2004͒. Neurons' response properties across a VOT series shift as the F1 frequency of the speech syllables varies. Thus, although we have spoken of a 20-ms auditory discontinuity, it is the case that this boundary, in fact, is dependent upon the spectral characteristics of the acoustic components that define the onset asynchrony. As a result, we would expect the relative impact of the placement of categorization training boundaries to vary with changes in stimulus characteristics.
This more stimulus-bound way of viewing auditory discontinuities aligns with behavioral evidence of contextsensitivity of perception along VOT series ͑e.g., Miller and Liberman, 1979;Miller et al., 1986;Summerfield, 1981͒. That is, the present data do not argue that auditory discontinuities determine category boundaries absolutely. Categories with boundaries violating an auditory discontinuity ͑as in the inconsistent conditions͒ may be learned with more experience and, likewise, category boundaries may be shifted with adjacent context or changes in stimulus characteristics. What the present data suggest is that learnability of categories is enhanced when the boundaries of category distributions are aligned with the operating characteristics of auditory processing. When the discontinuities shift because of stimulus characteristics, so, too, should category boundaries. This is exactly what is seen for place of articulation and VOT.
Despite converging evidence regarding the 20-ms asynchrony discontinuity, there has been considerable debate over its existence ͑e.g., Kewley-Port et al., 1988;Pastore, 1988;Pastore and Farrington, 1996͒. One reason is that the behavioral correlates of the discontinuity are not consistently observed across measurement procedures and changes in spectra and may sometimes be impacted by differences in the relative degree of experience across participants ͑Watson and Kewley-Port, 1988͒. However, the present results are difficult to reconcile without suggesting that the auditory system may differentially process stimuli along the TOT dimension. A possible experience-bound alternative interpretation of the present findings is that English-speaking listeners' category boundary for voicing in speech influenced their perception and category learning of nonspeech stimuli along the TOT dimension. Consistent with this proposal, recent results have demonstrated an influence of native-language experience upon nonspeech perception ͑Bent et al., 2003͒. However, in experiment 2 inconsistent categories spanned the natural boundaries at both ϩ20 ms ͑close to the native English voicing boundary͒ and Ϫ20 ms ͑in a region of space not used contrastively in English voicing͒. Native-English listeners were significantly poorer at learning the categories in this condition than in the inconsistent condition of experiment 1 that spanned only the ϩ20 ms boundary. Therefore, the possibility that native-language-specific categories along the VOT dimension interact with TOT learning does not account fully for the present findings.
If learnability is a determinant of the structure of phonetic systems, then empirical work on general perceptual categorization is relevant for comprehensive theories of speech perception. Whereas there is a rich tradition of categorization work in vision, there is a dearth of information on the factors that may facilitate the formation of complex auditory categories. Some of these factors may be low dimensionality of the stimulus space, covariance among stimulus dimensions, and linear category boundaries. With respect to the last factor, Diehl et al. ͑2001͒ report that boundaries for vowel categories in psychoacoustically scaled ͑e.g., Mel or Bark͒ formant space appear to be linear. They point to results from experiments in which participants are asked to identify tokens broadly sampled across the formant space ͑e.g., Molis, 1999͒. In these studies, the boundaries that provide the best discriminative separation of the identification scores were often ͑although not exclusively͒ simple linear functions. Many boundaries were horizontal or vertical lines reflecting unidimensional distinctions between vowel categories, whereas most other boundaries were horizontal lines that reflected distinctions based on formant frequency averaging or difference scores. It is intuitively ͑and computationally͒ clear that categories defined by linear boundaries would be easier to learn than categories with complex nonlinear boundaries. However, in order to make strong claims about learnability, the ease of learning nonspeech auditory categories with different boundary configurations must be examined.
Factors affecting phonetic acquisition extend beyond the structure of the categories to the details of the learning environment. Kuhl et al. ͑2003͒ recently demonstrated that in-fants exposed to live adult models picked up phonetic information better than infants exposed to recordings of speakers. This result suggests the importance of social interaction in the learning situation, perhaps to direct the attention of the learner to important aspects of the sensory input. In addition, the characteristics of speech directed at children may be tailored to enhance learnability of phonetic categories. Childdirected speech has led to more efficient learning of phonetic categories by a computer model than adult-directed speech ͑de Boer and  The degree to which these particular factors of input distributions and learning situations enhance category formation can be tested using methods similar to those of the current study. With nonspeech stimuli, the researcher has control over the training set and the training conditions. Data can be collected from identification, discrimination, and category goodness rating tasks to provide a clear picture of how these factors interact with learning and the structure of the resulting categories. As exemplified by the experiments described here, these results have implications for explaining regularities in phonetic systems specifically and linguistic systems more generally.

ACKNOWLEDGMENTS
This work was supported by a National Science Foundation grant ͑NSF BCS-0078768͒ to LLH and AJL, by a James S. McDonnell Foundation award for Bridging Mind, Brain, and Behavior to LLH, and by a research grant ͑5 R01 DC00427͒ from NIDCD to RLD. The authors thank Christi Adams for her help in conducting the experiments. 1 A better term for this region may be a ''non-monotonicity.'' The present arguments apply not only to regions of perceptual space that result in a qualitative change in perception or a step-function in responding, but more generally to regions of perceptual space warped in relation to physical acoustic space. 2 See Cho and Ladefoged ͑1999͒ for language-specific nuances that supplement this broad description. 3 The data depicted in Fig. 2 were collected in a pilot ABX discrimination study using the same TOT stimuli as in the categorization studies described in this report. 4 Pisoni ͑1977͒ investigated a range of TOT spanning Ϫ50 to ϩ50 ms. 5