Central locus for nonspeech context effects on phonetic identification (L)

Recently, Holt and Lotto [Hear. Res. 167, 156–169 (2002)] reported that preceding speech sounds can influence phonetic identification of a target syllable even when the context sounds are presented to the opposite ear or when there is a long intervening silence. These results led them to conclude that phonetic context effects are mostly due to nonperipheral auditory interactions. In the present paper, similar presentation manipulations were made with nonspeech context sounds. The results agree qualitatively with the results for speech contexts. Taken together, these findings suggest that the same nonperipheral mechanisms may be responsible for effects of both speech and nonspeech context on phonetic identification. © 2003 Acoustical Society of America.


I. INTRODUCTION
There exists a class of perceptual phenomena known as phonetic context effects in which the perceived phonemic identity of a speech sound is moderated by the identity of neighboring speech sounds. That is, identical acoustics can lead to different identifications depending on the identity of precursor speech sounds. For example, the reported identification of a syllable-initial stop can be changed from /g/ to /d/ by changing the preceding context syllable from /al/ to /ar/ (Mann, 1980).
Holt and Lotto (2002) attempted to ascertain the level of the auditory system at which the stimulus interactions underlying phonetic context effects occur. In one experiment, they presented context syllables (e.g., /al/ or /ar/) and target syllables (/da/-/ga/ series members) to opposite ears. The identity of the context syllable affected identifications of the target syllable even in this dichotic presentation condition. However, the size of the identification boundary shift was slightly smaller than for diotic presentation conditions. In a second experiment the duration of the silent gap between the context and target syllable was varied from 25 to 400 ms (this gap was typically 50 ms in previous experiments). A significant effect of context was evident even when context offset and target onset were separated by as much as 275 ms. Holt and Lotto (2002) argue that these results suggest that context effects are partially mediated by nonperipheral mechanisms. That is, it is unlikely that they are due to masking or interactions at the level of the auditory nerve or perhaps even cochlear nucleus. In agreement with these conclusions, Holt and Rhode (2000) failed to find evidence for appropriate speech-sound stimulus interactions in recordings from chinchilla VIIIth nerve.
Recently, there have been a number of demonstrations of shifts in phonetic identification caused by nonspeech context sounds such as sine-wave tones (Lotto and Kluender, 1998; Holt et al., 2000). Lotto and Kluender (1998) presented listeners with consonant-vowel (CV) syllables preceded by sine-wave tones that modeled the frequency trajectory of the third formant (F3) of /al/ or /ar/. Listeners identified the CVs more often as /ga/ following the sine wave modeling /al/ and more often as /da/ following the sine wave modeling /ar/. Because these nonspeech context sounds had no perceived phonetic content, the authors proposed that the spectral content of the context sounds moderates the shift in identity of the target speech sounds. In this case, high-frequency spectral energy (F3 offset of /al/ or high-frequency sine wave) leads to more /ga/ responses (/g/ has a low-frequency F3 onset) and low-frequency spectral energy (F3 offset of /ar/ or low-frequency sine wave) results in more /da/ responses (/d/ has a high-frequency F3 onset). This pattern of results has been referred to as spectral contrast (Holt et al., 2000). The question that is immediately raised is whether the processes responsible for nonspeech context effects are the same as those underlying speech context effects. Fowler et al. (2000) suggest that nonspeech context effects are primarily due to masking. On the other hand, they propose that speech context effects are due specifically to perception of speech gestures.
In addition to a masking account, it is possible that nonspeech context effects are complex demonstrations of auditory enhancement (Viemeister, 1980; Viemeister and Bacon, 1982; Summerfield et al., 1984). Auditory enhancement refers, generally, to a class of effects in which energy in a frequency region is perceptually enhanced if it is preceded by a sound that lacks energy in that region. However, the results of Holt and Lotto (2002) appear incompatible with an auditory enhancement or peripheral masking account of speech context effects. In particular, the time course of auditory enhancement appears to differ from that of the speech context effects. Holt and Lotto demonstrated that speech context effects are present out to at least 275 ms of intervening silent gap, whereas Viemeister and Bacon (1982) found no appreciable auditory enhancement in a masking study beyond about 100 ms of intervening silence. In addition, auditory enhancement appears to be a strictly monaural phenomenon. Summerfield and Assmann (1989) failed to find effects of a precursor stimulus in auditory enhancement vowel experiments when the context was presented to the contralateral ear. In contrast, Holt and Lotto demonstrated robust effects of speech contexts presented to the ear opposite the target syllables. The purpose of the two experiments presented here is to determine whether auditory enhancement or peripheral masking can completely account for nonspeech context effects. The manipulations utilized by Holt and Lotto (2002) are replicated here with nonspeech context sounds. Experiment 1 examines the effect of dichotic versus diotic presentation on nonspeech context effects. In experiment 2, subjects are presented context sounds and target syllables with varying durations of intervening silence. The question is whether these manipulations moderate nonspeech context effects in a qualitatively different manner than witnessed for speech context effects.
If not, then it may be reasonable to suggest that similar mechanisms are responsible for both speech and nonspeech context effects.

II. EXPERIMENT 1

A. Method

Subjects
Twenty-four undergraduate students at Washington State University participated in the experiment for course credit. All were native English speakers who reported no hearing deficits or disorders.

Stimuli
A ten-member series of synthetic speech varying acoustically in F3 onset frequency and varying perceptually from /ga/ to /da/ was created using the cascade branch of the Klatt (1980) synthesizer. For these stimuli, F3 onset frequency varied from 1800 to 2700 Hz in 100-Hz steps. From onset, F3 frequency changed linearly to a steady-state value of 2450 Hz across 80 ms. All other synthesis parameters were constant across series members. The first formant frequency (F1) increased linearly from 300 to 750 Hz and the second formant (F2) frequency declined from 1650 to 1200 Hz across 80 ms. The fourth formant (F4) had a steady-state value of 2850 Hz. Fundamental frequency (f0) was 110 Hz over the first 200 ms and decreased to 95 Hz over the last 50 ms. Total stimulus duration was 250 ms. This CV series is identical to that used by Holt and Lotto (2002) in experiments 1b and 2b.
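The F3 manipulation above can be sketched numerically. The following is a minimal illustration of how the ten F3 contours could be constructed; the function name and array layout are ours, not taken from the original synthesis scripts:

```python
import numpy as np

def f3_contour(onset_hz, steady_hz=2450.0, glide_ms=80, total_ms=250, fs=20000):
    """Piecewise-linear F3 contour: a linear glide from the onset frequency
    to the 2450-Hz steady state over the first 80 ms, then constant for the
    remainder of the 250-ms syllable (sampled at 20 kHz, as in the paper)."""
    n_glide = int(fs * glide_ms / 1000)
    n_total = int(fs * total_ms / 1000)
    glide = np.linspace(onset_hz, steady_hz, n_glide)
    steady = np.full(n_total - n_glide, steady_hz)
    return np.concatenate([glide, steady])

# Ten series members: F3 onsets from 1800 Hz (/ga/-like) to 2700 Hz (/da/-like)
onsets = np.arange(1800, 2701, 100)
contours = [f3_contour(f) for f in onsets]
```

The contour array would then drive the F3 parameter track of a synthesizer; all other parameter tracks are shared across series members.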
The nonspeech context stimuli were based on the speech precursors used in Holt and Lotto (2002). An analog of /al/ and /ar/ was created by using the synthesis parameters from Holt and Lotto in the parallel branch of the Klatt (1980) synthesizer. Amplitudes for all formants other than F3 were set to zero. This resulted in a 250-ms harmonic complex (f0 = 110 Hz) with a single frequency-varying amplitude peak. In terms of synthesis parameters, the frequency of this single formant was set at 2450 Hz for the first 100 ms for both contexts. The two contexts differed in the formant frequency trajectory over the final 150 ms. For the context modeling /al/ (referred to as highfreq), the formant increased linearly in frequency to 2700 Hz. For the context modeling /ar/ (lowfreq), the formant decreased linearly to 1600 Hz. These context sounds are not perceived as speech and certainly contain no identifiable phonemic content.
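As a rough illustration of this kind of signal, one could build a harmonic complex whose single spectral peak follows the stated trajectory by weighting each harmonic with an envelope centered on the moving peak frequency. This is only a sketch under our own assumptions (a Gaussian envelope with a 150-Hz bandwidth); the actual stimuli were generated with the Klatt synthesizer, not this code:

```python
import numpy as np

def single_peak_complex(kind, fs=20000, f0=110.0, dur_ms=250, bw_hz=150.0):
    """Harmonic complex (f0 = 110 Hz) with one amplitude peak that sits at
    2450 Hz for 100 ms, then glides linearly over the final 150 ms:
    up to 2700 Hz ('highfreq', the /al/ analog) or
    down to 1600 Hz ('lowfreq', the /ar/ analog)."""
    n = int(fs * dur_ms / 1000)
    t = np.arange(n) / fs
    end_hz = 2700.0 if kind == "highfreq" else 1600.0
    n_steady = int(fs * 0.100)
    peak = np.concatenate([np.full(n_steady, 2450.0),
                           np.linspace(2450.0, end_hz, n - n_steady)])
    sig = np.zeros(n)
    for k in range(1, int((fs / 2) // f0) + 1):   # all harmonics below Nyquist
        fk = k * f0
        # Gaussian spectral envelope centered on the moving peak frequency
        amp = np.exp(-0.5 * ((fk - peak) / bw_hz) ** 2)
        sig += amp * np.sin(2 * np.pi * fk * t)
    return sig / np.max(np.abs(sig))
```

Because every harmonic of a 110-Hz fundamental is present but only those near the moving peak carry appreciable energy, the result is periodic and buzzy rather than speechlike, consistent with the description above.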
All stimuli were synthesized with 16-bit resolution at a 20-kHz sampling rate and stored on a computer disk following synthesis. Stimulus presentation was under the control of a microcomputer and Tucker Davis Technologies (TDT) hardware. Context sounds and target syllables were appended online with a 50-ms intervening silent interval. Following D/A conversion (TDT, DD1), stimuli were low-pass filtered at a 9.8-kHz cutoff frequency (TDT, FTG2), attenuated (TDT, PA4), and presented over headphones (Sennheiser HD 285) at 75 dB SPL (A).

Procedure
One to three subjects were tested concurrently in a sound-attenuated booth during a single experimental session. During each trial, listeners heard the appended stimuli (context followed by target syllable) over headphones. The listeners' task was to identify the target syllable as "da" or "ga" by pressing a labeled button on an electronic response box. The intertrial interval was approximately 3 s.
The experiment was divided into two blocks corresponding to diotic and dichotic presentation. Each subject completed both blocks and order of block presentation was counterbalanced across subjects. In the dichotic block, context and target were presented to opposite ears, with ear of context presentation randomized across trials. In the diotic block, both context and target were presented to both ears on each trial. In each block, listeners responded to 10 repetitions of each of the context/target combinations (2 contexts × 10 target CVs × 10 repetitions = 200 trials per block). In all, the experiment lasted approximately 45 min.

B. Results and discussion
Previous context effect experiments (e.g., Lotto and Kluender, 1998) have used a performance criterion for inclusion of data in analyses. For the current two experiments, data were withheld from analyses for subjects who failed to correctly identify the two endpoint CVs (the best /da/ and /ga/) at least 80% of the time across conditions. In experiment 1, this led to the exclusion of data from two subjects. Identification boundaries were computed from the percentage of "ga" responses through probit analysis. These boundaries (in terms of F3 frequency of the CV series) are presented in Fig. 1. The results of experiment 1 are consistent with the results of experiment 1b of Holt and Lotto (2002). In the latter study, speech context effects were present for both diotic and dichotic presentation, but the effect of context was numerically smaller in the dichotic condition. A 2 (speech versus nonspeech) × 2 (presentation condition) × 2 (context) mixed-model ANOVA confirmed the agreement of the results from the two experiments. There were no significant interactions involving the speech versus nonspeech variable (ps > 0.10). This agreement of results suggests that similar mechanisms may underlie both speech and nonspeech context effects on phonemic identification. In both cases, it is unlikely that the identification shifts are caused solely by peripheral masking or auditory enhancement, as these mechanisms are monaural in nature. It is still possible that these peripheral mechanisms play some role in both speech and nonspeech context effects, since both effects are smaller when context and target cannot interact in the periphery.
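The paper does not specify how the probit fits were computed. As a hypothetical sketch, the identification boundary can be taken as the 50% crossover of a cumulative-normal fit to the labeling function; the proportions below are fabricated for illustration only and do not reproduce any reported result:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def probit_boundary(f3_onsets, prop_ga):
    """Fit a cumulative normal to 'ga' proportions (which decline as F3 onset
    rises) and return the 50% point, i.e., the category boundary in Hz."""
    def model(x, mu, sigma):
        return 1.0 - norm.cdf(x, loc=mu, scale=sigma)
    (mu, _sigma), _cov = curve_fit(model, f3_onsets, prop_ga,
                                   p0=[np.mean(f3_onsets), 200.0])
    return mu

# Hypothetical identification proportions over the 1800-2700 Hz series
onsets = np.arange(1800, 2701, 100)
demo_prop_ga = 1.0 - norm.cdf(onsets, loc=2250.0, scale=150.0)
```

A contrastive context shift would then show up as a difference in the fitted boundary between the two context conditions, which is the quantity compared statistically above.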

III. EXPERIMENT 2

A. Method

Subjects
Twenty undergraduates at Washington State University participated for course credit. All were native speakers of English who reported no hearing deficits or disorders. None of the subjects participated in experiment 1.

Stimuli
Stimuli were identical to those used in experiment 1. Only the duration of the intervening silent interval differed. The six intervening silent intervals were 25, 50, 100, 175, 275, and 400 ms. These duration intervals are identical to those used in experiment 2b of Holt and Lotto (2002).

Procedure
The task for the subjects was the same as in experiment 1. Each subject participated in three blocks of 120 trials (2 contexts × 6 gap durations × 10 CV target stimuli). Within each block, presentation order of stimuli was randomized. The context and target stimuli were presented to both ears.

B. Results and discussion
Data from three subjects who failed to identify 80% of endpoint stimuli were excluded from further analysis. Probit boundaries for each gap duration × context condition are presented in Table I.
Planned paired-sample t-tests were used to examine the context effect at each duration of intervening silence. The spectral content of the contextual sound caused a significant shift in identification boundaries for all silent gap durations up to and including 175 ms (ps < 0.05; see Table I). No significant effect of context was obtained at the 275- and 400-ms gap durations. This pattern of results is quite similar to that obtained by Holt and Lotto (2002) in experiment 2b. They found an effect of speech context on CV identification out to 275 ms of intervening silence. No effect was present for a 400-ms silent gap. In both the nonspeech and speech context experiments, the size of the context effect decreases monotonically with increasing gap duration.

TABLE I. Means and standard deviations (in parentheses) of identification boundaries as a function of context and silent interval duration from experiment 2. Differences between contexts were tested with paired-sample t-tests.
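The mechanics of these per-gap comparisons can be sketched with paired-sample t-tests on per-subject boundaries. The subject-level values below are simulated stand-ins (the real group means are in Table I), so only the analysis structure, not the numbers, mirrors the paper:

```python
import numpy as np
from scipy.stats import ttest_rel

def context_effect_test(low_bounds, high_bounds):
    """Paired-sample t-test comparing per-subject probit boundaries after
    lowfreq vs. highfreq contexts at a single gap duration."""
    return ttest_rel(low_bounds, high_bounds)

# Simulated per-subject boundaries (Hz) for one gap duration, 17 subjects
# (20 participants minus 3 exclusions, as in experiment 2)
rng = np.random.default_rng(1)
low = rng.normal(2200.0, 30.0, 17)              # boundaries after lowfreq context
high = low + 60.0 + rng.normal(0.0, 20.0, 17)   # hypothetical ~60-Hz contrastive shift
t_stat, p_value = context_effect_test(low, high)
```

One such test per gap duration, with significance assessed at each of the six silent intervals, corresponds to the planned comparisons described above.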

IV. GENERAL DISCUSSION
The pattern of results from both experiments described here matches the pattern obtained by Holt and Lotto (2002) with speech contexts. A significant context effect on CV identification remains when the context is presented contralaterally to the target. In both cases, the dichotic context effect is robust though smaller than for diotic presentation conditions. Effects of context also remain across substantial durations of intervening silence. For speech contexts, this gap can extend to at least 275 ms. For nonspeech contexts, significant shifts were demonstrated out to 175 ms. Fowler et al. (2000) propose that speech and nonspeech context effects are different in kind. However, the agreement of the current results with those of Holt and Lotto (2002) implicates similar mechanisms in both kinds of context effects. This agreement can be added to the mounting evidence for a general auditory role in speech context effects. Several previous studies have demonstrated nonspeech context effects that are equivalent in size of boundary shift to corresponding speech context effects (Lotto and Kluender, 1998; Holt et al., 2000). The current studies extend these similarities across a series of presentation manipulations.
The results of these experiments support the contention of Lotto and Kluender (1998; Lotto et al., 1997) that general mechanisms of the auditory system are at least partially responsible for the kinds of speech context effects examined here. The result of these general mechanisms is the perceptual emphasis of energy in frequency regions that are less represented in context sounds. That is, changes in the pattern of spectral energy are enhanced. The behavioral input-output function can be described as spectral contrast, and it appears to be a general property of auditory systems. Lotto et al. (1997) demonstrated that birds (Japanese quail, Coturnix japonica) trained to respond to /da/ and /ga/ stimuli also show contrastive response shifts with /al/ and /ar/ contexts. Lotto and Kluender's description of the pattern of contrastive output does not implicate any particular mechanism. However, the results of the current set of experiments provide evidence against some proposed mechanisms.
Given the monaural nature of peripheral masking and auditory enhancement, it is unlikely that either of these mechanisms is solely responsible for context effects. The fact that dichotic context effects were smaller than diotic ones suggests that peripheral mechanisms may play some role. However, a complete explanation will require a description of more central processes that take input from both ears. The relative temporal robustness of the context effects described in experiment 2 is also consistent with a central mechanism.
In general, as one observes effects of interactions at more central levels of the auditory system, there is a longer temporal window over which auditory events interact and influence one another (Popper and Fay, 1992). These results are in agreement with neurophysiological investigations of speech context effects that found little evidence for contrast at the auditory nerve (Holt and Rhode, 2000).