Speaker discrimination as a function of vowel realization: does focus affect perception?

Appendices: Appendix A: Statement-question pairs used to elicit [±focus] versions of the target words; Appendix B: Visualization of the two significant clusters for the by-trial random slope for the interaction

Abstract
The acoustic-phonetic characteristics of speech sounds are influenced by their linguistic position in an utterance. Because of acoustic-phonetic differences between different speech sounds, sounds vary in the amount of speaker information they contain. However, do spectral and durational differences between various realizations of the same sound that were sampled from different linguistic positions also impact speaker information? We investigated speaker discrimination in [−focus] versus [+focus] word realizations. Twenty-one Dutch listeners participated in a same-different speaker discrimination task, using stimuli varying in focus, vowel ([aː], [u]), and word context ([ɦ_k], [v_t]), spoken by 11 different speakers. Results show that an effect of focus on speaker-dependent information was present, but limited to words containing [u]. Moreover, performance on [u] words was influenced by (interactions of) word context and trial type (same- vs. different-speaker). Context-dependent changes in a speech sound’s acoustics may affect its speaker-dependent information, albeit under specific conditions only.


Introduction
Different speech sounds vary in the amount of acoustic speaker information they carry (Van den Heuvel, 1996; Kavanagh, 2012); vowels tend to contain more acoustic information on the speaker than consonants do, and within the class of consonants, nasals (e.g. [n, m]) may carry more speaker information than fricatives (e.g. [f, z]) or stops (e.g. [t, b]). In speech perception too, the speech sounds that make up an utterance influence how well a listener can discriminate speakers (Andics et al., 2007) or identify them (Amino & Arai, 2007). Much acoustic-phonetic research, however, has shown that the realization of one particular speech sound in one particular phonetic context varies as a function of linguistic position. Examples from Dutch are how the realization of a speech sound varies with the presence/absence of stress (Sluijter & Van Heuven, 1996), or with word class (Van Bergem, 1993). For instance, in a stressed versus unstressed position, the syllable 'kom' in 'ˈkom·ma' (comma) as opposed to that in 'kom·ˈpas' (compass), has a longer duration and less vowel reduction (Van Bergem, 1995, p. 28). Such variation in a speech sound's acoustics might in turn affect the speaker information carried by the sound. Whereas earlier work has compared the speaker information carried by different speech sounds, the main research question in the current investigation is whether speaker discrimination performance on the same speech sound depends on whether tokens of the speech sound are sampled from focused versus non-focused contexts. Focus may be used to indicate new or contrastive information, and this is accompanied by acoustic changes (e.g., Hanssen et al., 2008; Sluijter & Van Heuven, 1996).
In the rest of this introductory section, speaker information in the acoustics of speech sounds (henceforth also referred to as segments) will be explained further in 1.1. This is followed by a discussion of prior research on how speaker discrimination may depend on stimulus composition in 1.2. The research questions and hypotheses are formulated in section 1.3.

Acoustic-phonetic speaker-dependent information in segments
In theory, an acoustic-phonetic speech feature (such as duration, fundamental frequency or formant frequency) would be highly speaker-specific when individual speakers show little variation in that feature when producing different tokens, i.e. there is little within-speaker variation, whereas different speakers would produce tokens that are very different with respect to the feature, i.e. there is large between-speaker variation. In practice, the speaker-specificity or speaker-discriminatory potential of individual speech features tends to be low to moderate. But crucially, both within-speaker and between-speaker variation of particular features differ in magnitude across linguistic contexts (e.g., McDougall, 2006; Smorenburg & Heeren, 2020): in some situations speakers show more variation whereas in others they show less. This creates the option of particular linguistic contexts yielding higher speaker-specificity than others.
Most research on the speaker-specificity of acoustic-phonetic information has investigated how much speaker information is carried by individual speech sounds, and by vowels in particular. An early comparative investigation into differences in speaker-dependent acoustics for various speech sounds in Dutch was done by Van den Heuvel (1996). The segments included in that study were the vowels [i, a, u] and consonants [p, t, k, d, s, m, n, r]. Analyses of productions gathered under controlled circumstances, i.e. read pseudo-words, showed that the vowel [a] was the most speaker-dependent segment and that plosives, such as [p, d], contained the least speaker information. The speaker-discriminating potential of a large set of German phonemes was studied by Schindler and Draxler (2013), who found that some consonants, that is [s, n, m, f], contained more speaker-dependent information than most of the vowels. As in Dutch (see Van den Heuvel, 1996), the vowel [aː] was found to contain more speaker-dependent information than [i, u] (Schindler & Draxler, 2013). In Czech, the vowel [iː] was found to pattern more with [aː] than [u] (Fejlová et al., 2013). Finally, various diphthongs were also found to differ in the speaker-specific information they carry (e.g. Morrison, 2009); the reason may be that, in addition to differences in vowel quality, diphthongs vary in the amount and direction of inherent spectral change. Taken together, the results on vowels suggest that in addition to vowel quality, inherent spectral change and vowel duration may contribute speaker-dependent acoustics. For consonants, the longer segments in which specific spectral characteristics (resonances) can be found seem to carry most speaker information.
There is some evidence that the linguistic context from which a speech sound is sampled influences the amount of within- and between-speaker variation it contains. He and Dellwo (2017) found more between-speaker variability in mouth-closing than mouth-opening gestures in read speech. This was explained as the second half of a syllable having lower articulatory demands relative to the first half, thus allowing for more articulatory freedom in individual speakers. In a follow-up investigation, He, Zhang and Dellwo (2019) demonstrated that the first formant, which reflects the degree of mouth opening during vowel articulation, also shows more between-speaker variability over the second half of a syllable than the first. In a study investigating speaker-dependent information in Dutch fricatives [s] and [x], Smorenburg and Heeren (2020) found that coda fricatives showed more between-speaker variation than onset fricatives, whereas the within-speaker variation showed a change in the opposite direction. Speaker classification scores were slightly better in codas than onsets for Dutch [x], whereas for [s] no positional difference in accuracy was found. A later study on speaker-dependent information in Dutch nasals [n] and [m] showed higher speaker classification scores for [n] in codas than onsets, but an effect in the opposite direction for [m] (Smorenburg & Heeren, 2021). This difference in classification was explained by effects of the phonetic context on speech sound acoustics; context effects were larger in onset [m]s and coda [n]s. Hence, in addition to an effect of articulatory freedom depending on syllable position, the speech sound's susceptibility to co-articulatory influences may also affect how much speaker information can be carried by a speech sound.
As was mentioned in section 1, the acoustics of a particular speech sound are also influenced by higher-level linguistic factors, such as the presence/absence of stress on the syllable containing the speech sound (Sluijter & Van Heuven, 1996) or the word class that its carrier belongs to (Van Bergem, 1993; Heeren, 2020). Another construct that is known to affect the realization of utterances is focus, which for instance expresses which part of a sentence is new (narrow focus) or contrastive (contrastive focus). In Information Structure theory (Chafe, 1976), focus has been defined as indicating 'the presence of alternatives that are relevant for the interpretation of linguistic expressions' (Krifka, 2007, p. 18). Focus may be marked in different ways; several experiments on Dutch word realization demonstrated that segmental and syllabic acoustic measures may vary with focus condition. For instance, in comparison with broad focus, narrow and contrastive focus were found to affect segmental duration and fundamental frequency contours (e.g. Hanssen et al., 2008; see also Chen, 2009).
In sum, a speech sound's spectral and durational characteristics may change with its position in an utterance. It has been demonstrated that between-sound differences in e.g. spectral composition and duration affect speaker discrimination (see section 1.2 for more details). The question then is whether the spectral and durational changes found within the same sound as a function of its linguistic context would impact speaker discrimination.

Speaker discrimination by linguistic content and context
In listeners, short words of different segmental composition elicit differential speaker discrimination performance. In Andics et al. (2007), native Dutch listeners heard CVC words and gave same-different speaker decisions for subsequent tokens that were auditorily presented. The CVCs were made up of onset [m]  . In a trained-to-familiar speaker recognition experiment, Drozdova et al. (2017) found that listener performance was positively affected by the presence of vowels and nasals. These findings roughly correspond to findings from speech acoustics on which speech sounds contain most speaker-dependent information (Van den Heuvel, 1996; Kavanagh, 2012; Schindler & Draxler, 2013, and discussion in section 1.1).
Consistent with Fant's source-filter model of speech production (Fant, 1960), Baumann and Belin (2010) found that the two principal components explaining unfamiliar voice discrimination on sustained vowel data were based on the vocal source on the one hand, and the vocal filter on the other. This suggests that voiced speech sounds have an advantage over unvoiced speech sounds in talker perception, because the former benefit from both information sources. This is supported by results from Orchard and Yarmey (1995), who showed that speaker identification in a whispered voice line-up after having heard a whispering 'perpetrator' was worse than identification in an all-normal voice condition. They furthermore found that samples of longer duration resulted in better performance than shorter samples (see also Cook & Wilding, 1997). Earlier, Bricker and Pruzansky (1966) showed that not only duration per se, but also the segmental content within the speech fragment affected listener performance; better identification was obtained with more phonemes in a same-duration sample. The importance of measures of acoustic variation was recently also reported by Lee, Keating and Kreiman (2019), when investigating the acoustic dimensions that explain within-speaker and between-speaker variation.
The task of speaker discrimination not only entails perceiving differences between speakers, but also perceiving that certain differences belong to the same speaker (assuming that non-identical stimulus pairs are presented). It has been shown that such within-speaker variability is judged differently for familiar than unfamiliar voices (Lavan et al., 2019); given 30 speech samples produced by two voices (15 fragments each), listeners who were familiar with the voices clustered the samples into fewer speaker identities than listeners who were unfamiliar with the voices. Therefore, both within-speaker and between-speaker variation of various acoustic parameters seemingly contribute to the processing of speaker information.
In addition to effects of segmental differences and acoustic variation on the processing of speaker information, earlier work has shown effects of higher levels of linguistic information. At the semantic level, for instance, Van Berkum et al. (2008) demonstrated that indexical information in a voice rapidly influences the semantic processing of what the voice is saying: the mentioning of an alcoholic beverage in a child's voice elicited a different ERP response than in an adult voice. Also, the linguistic relationship between words influences speaker discrimination performance: using a same-different speaker discrimination task, Narayan, Mak and Bialystok (2017) showed that listeners had a tendency to assign linguistically-related word combinations to the same speaker (e.g. words with the same phonological rhyme 'bay-day' as opposed to unrelated words 'day-bee').
Earlier perception studies have thus demonstrated a connection between speaker discrimination and higher-level linguistic information. The current study investigated how the presence versus absence of focus, which would alter the precise phonetic-acoustic content of the same speech sound, affects speaker discrimination. This investigation of sub-segmental differences thus adds to the literature on voice perception, which has so far shown that segments vary in speaker-dependent information, and that higher levels of linguistics may influence voice perception.

Research questions and hypotheses
This study primarily investigated whether speaker discrimination performance is affected by the presence versus absence of focus on a word. This was done using a same-different perception task, in which listeners judged whether subsequent words were produced by the same or by different speakers, and word stimuli were sampled from focused and non-focused sentence positions (further details in 2.2). In order to include some variation in word stimuli, two vowel nuclei were chosen from the three Dutch corner vowels included in Van den Heuvel (1996), namely those vowels that differed most in acoustic speaker information. This study thus also investigated whether the difference in speaker-specific information between the Dutch corner vowels [aː] and [u] that is found in acoustics is also observed in perception.
We hypothesized that speaker discrimination is more accurate on [aː] than [u], following the acoustic literature. Differences in both spectral and durational information between the vowels may contribute to this effect. As for the effect of focus, the main topic of this study, the prediction is less straightforward. In the comparison of unfamiliar voices there is an important role for low-level acoustic information (cf. Stevenage, 2018). The comparison of acoustic information between two speech samples leads to a same- or different-voice decision when the listener considers whether the perceived variation between two non-identical tokens falls under within-speaker or between-speaker variation. From the literature it is known that listeners tend to underestimate within-speaker variation in unfamiliar speakers (Lavan et al., 2019). Therefore, the closer two realizations are acoustically, the more likely they are to be judged as 'same-speaker'. On the one hand, it can be argued that the relatively precise articulation in a [+focus] word may be more comparable from one token to the next than in a [−focus] condition. Moreover, [+focus] words are expected to be longer, thus giving them a perceptual advantage. On the other hand, with an expected lower occurrence frequency of these canonical forms relative to unfocused realizations in everyday speech, it may also be the case that less articulatory routine results in more within-speaker variation in [+focus] than in [−focus] forms. Moreover, between-speaker variation has been found to be higher in locations where articulatory demands are lower, such as in coda consonants and closing syllable gestures (e.g., He & Dellwo, 2017; Smorenburg & Heeren, 2020). In the current investigation the [−focus] words are therefore predicted to show larger between-speaker variation. Note that higher speaker-specificity is expected only when this is combined with smaller within-speaker variation.
We assessed whether effects of vowel quality and focus would hold across word contexts by including two carrier words. The word contexts [ɦ_k] and [v_t] were selected so that the speaker information contained by the different onset and offset consonants was estimated to be low and roughly comparable. Schindler and Draxler (2013) showed that the ratio of between- to within-speaker variation was quite low for [v] and [h] sampled from spontaneous German speech, with perhaps a small advantage for [v]. Moreover, according to Van den Heuvel (1996), Dutch stop consonants are expected to contribute relatively little speaker information. In addition, these word contexts gave target word frequencies falling in the mid-frequency range (haak/hoek, vaat/voet); word frequency is known to influence articulation (Bell et al., 2009), as high-frequency words are more subject to reduction than low-frequency words.
The speaker discrimination results are accompanied by an acoustic analysis, including within- and between-speaker variation, of how speech sounds differ between conditions. Underlying this investigation is the assumption that speech sounds' acoustics vary with focus condition (Eefting, 1991; Van Heuven, 1997), but to ensure that this was also the case for the stimuli presented to our listeners, the acoustic analysis was carried out.

Method
Following Andics et al. (2007), a same-different forced-choice one-back task was used for the speaker discrimination perception task. This means that listeners were presented with a series of stimuli and decided whether the speaker of the current stimulus was the same as or different from the speaker of the previous stimulus.
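The one-back logic described above can be illustrated with a minimal sketch. It assumes a hypothetical representation in which each stimulus carries a speaker label and each response is the string "same" or "different"; the function name and data layout are illustrative and not taken from the study.

```python
def one_back_trials(speakers, responses):
    """Score a one-back same-different sequence.

    speakers:  speaker label of each stimulus, in presentation order
    responses: listener's answer to each stimulus ("same" or "different");
               responses[0] is ignored because the first stimulus has no
               predecessor to compare against
    Returns a list of (trial_type, correct) tuples, one per scored stimulus,
    with correct coded as 1 (correct) or 0 (incorrect).
    """
    scored = []
    for i in range(1, len(speakers)):
        # The trial type follows from comparing the current and previous speaker
        trial_type = "same" if speakers[i] == speakers[i - 1] else "different"
        scored.append((trial_type, int(responses[i] == trial_type)))
    return scored


# Hypothetical mini-sequence of four stimuli
speakers = ["S1", "S1", "S2", "S3"]
responses = [None, "same", "same", "different"]
scored = one_back_trials(speakers, responses)
print(scored)
```

Because every stimulus after the first serves both as the second member of one pair and the first member of the next, a sequence of n stimuli yields n − 1 scored trials.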

Participants
Twenty-one Dutch listeners without self-reported hearing problems volunteered to take part in this perception study (12 females, 9 males). Their mean age was 23 years (SD = 1.5 years). All participants gave their informed consent before taking part in the study, and afterwards received a modest thank-you gift.

Stimuli
Two Dutch minimal word pairs were used as stimulus contexts for the vowels [aː] and [u] (haak/hoek, vaat/voet). These four words were each recorded in a sentence context to evoke realizations with and without focus. The target sentences were answers to statement-question pairs that speakers saw on a computer screen, and speakers were instructed to produce answers in the form of full sentences. Two examples of the statement-question pairs and their intended answers are given here (see the appendix for a full list), where the first pair was intended to elicit [−focus] tokens of the target word, and the second pair to elicit [+focus] tokens:
Prompt 1: Hij zet zijn vaat in de wasbak. Waar zet hij zijn vaat?
He puts his dishes in the sink. Where does he put his dishes?
Answer 1: Hij zet zijn vaat in de wasbak.
He puts his dishes in the sink.
Prompt 2: Hij zet zijn vaat in de wasbak. Wat zet hij in de wasbak?
He puts his dishes in the sink. What does he put in the sink?
Answer 2: Hij zet zijn vaat in de wasbak.
He puts his dishes in the sink.
Eleven male speakers of Standard Dutch, aged 20 to 26 years, were recruited for the recordings. The stimuli were recorded in a sound-attenuated booth at the Leiden University Centre for Linguistics, using Praat software (Boersma & Weenink, 2018), a Sennheiser MKH 416T microphone, and a Focusrite Scarlett 2i4 sound card. Recordings were saved as mono wave files (22,050 Hz, 16 bits). In an information sheet, speakers were instructed to produce the answer to a question about a short statement (see example above). Both the statement and question were shown orthographically on a computer screen in the recording booth. Before participation, speakers gave their informed consent. Each speaker produced each target word-focus combination six times, so that a sufficient number of tokens would be available for the perception experiment, in which five repetitions of each would be needed. Tokens with the highest signal intensity relative to the background noise were kept for further processing. Target words were cut from the carrier sentences, resulting in 440 stimuli (11 speakers × 5 repetitions × 2 focus conditions × 2 vowel nuclei ([aː], [u]) × 2 word contexts ([v_t], [ɦ_k])). Stimulus intensities were all set to 65 dB SPL.

Stimulus acoustics and their statistical assessment
As a first step, we evaluated statistically whether stimulus acoustics varied by focus condition and by vowel, as they are commonly assumed to do. The results are presented in section 3.1. In addition to stimulus duration, the mean fundamental frequency (F0) was taken as a measure of vocal source information, and as vocal filter parameters the first and second formants (F1 and F2) were extracted. Using Praat (Boersma & Weenink, 2018), F0 was measured over the full duration of a stimulus using an autocorrelation method, and formants were measured using the Burg method at the point in time where vowel intensity was maximal (window size = 25 ms).
Acoustic effects of the fixed factors Vowel, Focus and Word Context were evaluated in linear mixed-effects models (with α Bonferroni-corrected to .05/4 = .0125, given four acoustic measures). Factor levels [aː], [−focus] and [ɦ_k] were used as default levels, with predictions for acoustic differences between factor levels being directional, e.g. duration is expected to shorten from [aː] to [u], and to lengthen from [−focus] to [+focus]. Using function buildmer from R package buildmer (Voeten, 2020), the maximally-converging models were obtained following a stepwise forward procedure. Duration was log10-transformed and F0, F1 and F2 were transformed to the Bark frequency scale before modelling.
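As a rough illustration of the transformations and the corrected significance threshold, the sketch below converts frequencies to Bark and durations to log10 units. The text does not say which Hertz-to-Bark formula was applied; the Traunmüller (1990) approximation used here is an assumption, and the function names are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark.

    ASSUMPTION: Traunmueller's (1990) approximation; the source does not
    state which Hz-to-Bark conversion was applied.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def log10_duration(duration_ms):
    """Log10-transform a duration given in milliseconds."""
    return math.log10(duration_ms)


# Bonferroni correction over the four acoustic measures named in the text
# (duration, F0, F1, F2)
ALPHA = 0.05 / 4

# Hypothetical measurements, for illustration only
print(hz_to_bark(1000.0))
print(log10_duration(100.0))
print(ALPHA)
```

Both transformations compress the upper end of their scales, so that equal steps in the transformed values correspond more closely to equal perceptual or proportional steps.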
Both within-speaker and between-speaker variances were determined by vowel, by focus condition and by word context. The ratio of between- to within-speaker variances is called the 'speaker-specificity index' (SSI, Van den Heuvel, 1996, p. 53), which was computed per acoustic parameter; the larger the variation between speakers relative to that within speakers, the higher the SSI, and the better speakers can presumably be separated. Because this is a descriptive view on the data, the measurements in Hertz and milliseconds were used for the sake of interpretability.
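The ratio described above can be sketched as follows. The exact variance estimators are not given in the text, so this example makes an assumption: between-speaker variance is taken as the sample variance of the speaker means, and within-speaker variance as the mean of the per-speaker sample variances. The data are hypothetical.

```python
from statistics import mean, variance


def speaker_specificity_index(tokens_by_speaker):
    """Ratio of between- to within-speaker variance for one acoustic measure.

    ASSUMPTION: between-speaker variance = sample variance of speaker means;
    within-speaker variance = mean of per-speaker sample variances.
    The estimator used in the study may differ.
    """
    speaker_means = [mean(values) for values in tokens_by_speaker.values()]
    between = variance(speaker_means)
    within = mean([variance(values) for values in tokens_by_speaker.values()])
    return between / within


# Hypothetical F0 values (Hz) for three speakers, three tokens each
f0 = {
    "S1": [100.0, 102.0, 104.0],
    "S2": [120.0, 122.0, 124.0],
    "S3": [140.0, 142.0, 144.0],
}
ssi = speaker_specificity_index(f0)
print(ssi)
```

In this toy data set the speakers are far apart relative to their own token-to-token spread, so the index is large; overlapping speakers would push it towards or below 1.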

Design and procedure of the perception task
For each of the four words included (haak, hoek, vaat, voet), a pseudo-random presentation list was made, in which each different-speaker pair occurred once (11 × 10/2) and each same-speaker pair occurred five times (11 × 5). In this way the same number (i.e. 55) of different- and same-speaker trials were included per word. The latter trial type made use of the different recordings from the same speaker, so that no identical recordings were used in a comparison. During the full experiment, each individual token was used twice, and different tokens by the same speaker were presented not more than three times in a row. Stimulus lists were distributed over 24 presentation blocks, containing one of eight vowel+word context+focus combinations each (3 blocks/combination). Per combination, there were 110 pairs for comparison, divided into blocks of 37, 37 and 36 trials each. A block lasted for about 1.5 minutes and the order of the blocks was randomized per listener. Across listeners 18,480 responses were collected. In this one-back discrimination task, not all speaker pairs occurred in both orders equally frequently. In the statistical analysis this is controlled for by the inclusion of random effects in the modelling.
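The trial counts in this design can be checked with a few lines of arithmetic. The sketch below only recomputes numbers stated in the text (11 speakers, 5 same-speaker repetitions, 8 vowel + word context + focus combinations, 21 listeners); the variable names are illustrative.

```python
from itertools import combinations

N_SPEAKERS = 11
N_SAME_REPETITIONS = 5   # same-speaker pairs per speaker
N_COMBINATIONS = 8       # vowel x word context x focus
N_LISTENERS = 21

# Each unordered different-speaker pair occurs once: 11 * 10 / 2
different_pairs = list(combinations(range(N_SPEAKERS), 2))
n_different = len(different_pairs)

# Each speaker contributes five same-speaker pairs: 11 * 5
n_same = N_SPEAKERS * N_SAME_REPETITIONS

trials_per_list = n_different + n_same

# Three blocks per combination
block_sizes = [37, 37, 36]

total_responses = trials_per_list * N_COMBINATIONS * N_LISTENERS

print(n_different, n_same, trials_per_list, sum(block_sizes), total_responses)
```

The total matches the 18,480 responses reported only if the 110 trials are counted once per combination of vowel, word context and focus condition.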
To each token's onset and offset a 5-ms fade-in or fade-out was applied in order to prevent clicks at trial onset or offset. Between subsequent stimuli, i.e. during the listener's response time, pink noise at an intensity of 50 dB was played. The next stimulus started 2,400 ms after the onset of the previous one.
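A fade of this kind can be sketched as below. The ramp shape is not specified in the text, so a linear ramp is assumed; the default sample rate of 22,050 Hz is the recording rate reported earlier, and the function name is hypothetical.

```python
def apply_fades(samples, sample_rate=22050, fade_ms=5.0):
    """Return a copy of the signal with a fade-in and a fade-out.

    ASSUMPTION: linear amplitude ramps; the source does not state the
    ramp shape that was used.
    """
    n_fade = int(round(sample_rate * fade_ms / 1000.0))
    # Never let the two ramps overlap on very short signals
    n_fade = min(n_fade, len(samples) // 2)
    out = list(samples)
    for i in range(n_fade):
        gain = i / n_fade          # rises from 0 towards 1 across the ramp
        out[i] *= gain             # fade-in at the onset
        out[-1 - i] *= gain        # mirrored fade-out at the offset
    return out


# Hypothetical one-second constant signal, for illustration only
signal = [1.0] * 22050
faded = apply_fades(signal)
print(faded[0], faded[55], faded[11025], faded[-1])
```

Starting and ending the waveform at zero amplitude removes the abrupt discontinuity that would otherwise be heard as a click.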
The perception experiment was run in a sound-attenuated booth at the phonetics laboratory of the Leiden University Centre for Linguistics, using E-Prime (Psychology Software Tools, 2012). Stimuli were presented at a standardized, comfortable listening level, over Beyerdynamic DT 770 PRO headphones.
Listeners were instructed to listen carefully to the subsequent tokens, and to respond, after every token (but the first), whether the speaker of the latter token was the same as that of the former token or not. Responses were given by pressing one of two buttons on a QWERTY keyboard, 'X' or 'N', one for 'same' and the other for 'different' speaker. The response buttons were counterbalanced across listeners.
Before the actual experiment started, listeners completed a short practice round including tokens of the word vis ('fish') to get used to the task. Including instruction, practice and breaks, the experiment lasted for about 45 minutes.

Statistical analysis of the perception data
Initial examination of the mean correct responses per speaker showed that none of the eleven voices were especially hard or easy for the listeners. Hence, all speakers were included in the analysis.
In earlier work on speaker discrimination by speech sound (Andics et al., 2007), responses from same-speaker trials were analyzed separately from responses to different-speaker trials. We chose to analyze all data together, but added trial type (same-speaker, different-speaker) as a factor into the design. The other fixed factors in the design were focus (+, −), vowel nucleus ([aː], [u]), and word context ([v_t], [ɦ_k]), with full interactions. All factors were coded using deviation coding. Random intercepts by participants and by trials (that is, speaker1-speaker2 combinations) were included in the design, as were analogous random slopes for all factor combinations included as fixed effects. A diagonal random-effects covariance matrix was assumed. The dependent variable was the correctness of the response given by the listener, coded as 0 for incorrect and 1 for correct. The data were analyzed using a mixed-effects logistic-regression tree (Fokkema et al., 2018; see Tagliamonte & Baayen, 2012 for an accessible introduction to tree-based models in linguistics). Function buildmertree from R package buildmer (Voeten, 2020) was used to find the maximal random-effects structure that still converged non-singularly, based on the random effects' contributions to the AIC (Akaike, 1971) of the model. The results of the perception experiment are presented in section 3.2.
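To make the factor coding concrete, the sketch below codes the four two-level factors numerically. It assumes that deviation coding means −0.5/+0.5 per level and that the first-listed level receives the negative code; both choices are assumptions, as the text does not give the numeric values. Only one interaction column is shown for brevity.

```python
from itertools import product

# ASSUMPTION: deviation coding as -0.5 / +0.5; the sign assignment per
# level is arbitrary here and not taken from the source.
CODES = {
    "focus": {"-": -0.5, "+": 0.5},
    "vowel": {"aː": -0.5, "u": 0.5},
    "context": {"ɦ_k": -0.5, "v_t": 0.5},
    "trial_type": {"same": -0.5, "different": 0.5},
}


def code_trial(levels):
    """Map a dict of factor levels to coded main effects plus one interaction."""
    coded = {name: CODES[name][level] for name, level in levels.items()}
    # An interaction column is the product of the coded main-effect columns
    coded["focus:vowel"] = coded["focus"] * coded["vowel"]
    return coded


# All 2 x 2 x 2 x 2 = 16 design cells
cells = [dict(zip(CODES, combo)) for combo in product(*(CODES[f] for f in CODES))]
coded_cells = [code_trial(cell) for cell in cells]

# In a balanced design every coded column sums to zero
column_sums = {key: sum(c[key] for c in coded_cells) for key in coded_cells[0]}
print(column_sums)
```

With centred codes of this kind, each main effect is estimated at the average of the other factors rather than at one reference level, which keeps main effects interpretable in the presence of interactions.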

Stimulus acoustics: the effect of focus condition
Before a perceptual effect of [±focus] on speaker discrimination was evaluated, the effect of [±focus] realizations on stimulus acoustics was assessed. Per acoustic measure, the model output on effects of Focus, Vowel and Word Context is given in Table 1. Significant terms are printed in bold. Three out of four acoustic measurements showed a main effect of or interaction with Focus; these mainly reflected a higher F0 and longer stimulus durations in focused realizations. The effect was not equally strong in all stimulus conditions. As for F1, there was a tendency for more mouth opening in [aː] (given the marginal main effect of Focus, t = 2.2), but in [u] this effect was countered under focus, presumably because of increased rounding in its pronunciation. Moreover, stimulus acoustics were influenced by vowel quality, as may be expected from intrinsic differences between the vowels, and to a lesser extent by the word context and interactions between the linguistic factors.
Table 2 shows, in a descriptive manner, the within-speaker variance by condition and by acoustic measure as well as the speaker-specificity index (SSI, the ratio of between- to within-speaker variances). F0 has higher within-speaker variation in focused than unfocused words, but also a higher SSI, meaning that between-speaker variances also increase with focus. F1 shows less within-speaker variance with focus, whereas SSI tends to increase. F2 shows an increase in within-speaker variance with focus in the [v_t] condition, but a decrease in [ɦ_k], whereas SSI shows behavior in the opposite direction. For [aː] duration, but not for [u], SSI is higher when focused.
Together, these results show that stimulus acoustics as well as the SSI vary by focus condition, supporting the hypothesis of differential acoustic speaker information by factor combination.

Perception results: effects of focus condition and vowel
The main research questions on speaker discrimination as a function of focus condition and vowel quality were evaluated using a same-different perception task. The results of the statistical analysis are shown in Figure 1, and reflect that listeners were generally quite successful in speaker discrimination, but also that performance varied by factor combination. The model also contained a number of random effects of participant and of trial, including random slopes. To further investigate the random effects and thus look for potential confounds, a cluster analysis was applied, following the method in Voeten (2020). No interpretable structure was found in the by-participants or by-items random-effects structure, which suggests that the experiment was free from confounds due to these random effects.

Discussion and conclusion
This study investigated whether speaker discrimination performance varies with the presence of focus on target words and whether the difference in speaker-specific information between the Dutch corner vowels [aː] and [u] found in acoustic analyses (Van den Heuvel, 1996) is also observed in perception. On average, listeners reached over 80% correct responses for both same-speaker and different-speaker trials. The regression tree further showed that different factor combinations, reflecting different stimulus acoustics, affected the listeners' ability to discriminate speakers.
In line with expectations, listeners gave more correct responses to stimulus pairs containing [aː] words than [u] words. These results complement the perceptual results obtained by Andics et al. (2007) and are consistent with acoustic findings for Dutch (Van den Heuvel, 1996) and for other languages (Fejlová et al., 2013; Schindler & Draxler, 2013). Perception of tokens containing the vowel [aː] was not influenced further by linguistic context. For [u], the situation was different: the first split in the model was made by Focus, where focused realizations reached higher discrimination performance than unfocused ones. Best performance was observed for [+focus] same-speaker trials, where the phonetic-acoustic differences at the surface are apparently small enough for good-quality voice matching. This result shows that the phonetic-acoustic content resulting from the linguistic position in which a word is pronounced may influence the speaker-dependent information available to listeners. Yet, the effect obtained here is restricted to the less informative vowel, [u].
In the literature it has been reported that listeners tend to underestimate within-speaker variation when listening to unfamiliar speakers (Lavan et al., 2019), and are thus more likely to perceive same-speaker samples as coming from different speakers. This would especially lower listeners' accuracy in conditions where within-speaker variance is high. The finding that listeners performed better using focused than unfocused [u], however, suggests that the mere amount of within-speaker variance does not fully explain the discrimination results; as Table 2 shows, within-speaker variance of F0, an important parameter in speaker discrimination (Baumann & Belin, 2010), tended to be larger in focused than unfocused words. At the same time, however, the SSIs of F0 showed that there was also a tendency for between-speaker variance to increase under focus, and also to increase more than within-speaker variance. This suggests that listeners use both within-speaker and between-speaker variance in perception.
Additional evidence that within-speaker variance alone may not explain the speaker discrimination results comes from the comparison of listener performance on [aː] versus [u]. For F1, within-speaker variances are larger for [aː], whereas for F2, within-speaker variances are larger for [u]. SSI, however, is in most cases larger in [aː] than [u], again suggesting a contribution of both sources of variation in perception. The acoustic analysis in this study was limited to one vocal source and two vocal filter parameters, and these are only a small subset of the parameters that have been included in the recent literature (Lee et al., 2019). That work furthermore showed an important contribution of measures capturing phonetic change within stimuli, which were not included here. More detailed analyses of which information listeners use to perform speaker discrimination tasks are left for future research.
The two splits on Word Context, within [u], reflect that speaker discrimination was better on [vut] than [ɦuk] contexts, even though word contexts had been selected to contain similar amounts of acoustic speaker information. As plosives have been reported to contain the least speaker-dependent information (Van den Heuvel, 1996), the word-context effect must be attributed mainly to the differential information contained by the onsets. Schindler and Draxler (2013) demonstrated that there was slightly more speaker-dependent information in the spectra of [v] than of [h]. In addition, in the current investigation [v] words were in most cases longer than [ɦ] words (see Table 1), and additional duration is generally beneficial in perceptual tasks (Bricker & Pruzansky, 1966; Orchard & Yarmey, 1995; Cook & Wilding, 1997). Finally, the two fricatives are expected to differ in coarticulation with the vowel, with [v] presumably showing larger between-speaker differences in coarticulation than [ɦ]. This may also contribute to explaining the word context difference (Bricker & Pruzansky, 1966; Lee et al., 2019).
The current investigation was limited in its scope, studying the effect of focus using two vowels, in two word contexts each. The higher-order interactions in the results showed that listeners are sensitive to subtle differences conditioned by linguistic structure when processing voice information. However, in the current experiment very short utterances were presented, produced by a relatively homogeneous group of male speakers. In natural communicative settings, more variation between (unseen) speakers is likely to occur, and is expected to improve speaker discrimination. Those circumstances potentially reduce the effect of relatively subtle linguistic cues; such cues here affected speaker discrimination only in the more challenging words containing [u], but not in those containing [aː]. Moreover, if longer utterances had been used, as would be found in natural interaction, listeners would have received additional speaker information from the utterance. However, if within-speaker variability increases over longer phrases, speaker discriminability may also be compromised.
To conclude, an effect of focus on speaker-dependent information contained by a word was present, but limited. Moreover, additional evidence was found for the claim that different speech sounds differ in speaker-dependent information ([aː] versus [u], and possibly [v] versus [ɦ]). At the same time, under more real-world conditions than the current discrimination task, the detection of speaker changes in speech is unlikely to be affected by focus. When needed, however, listeners seem skilled at exploiting the little information that is available.