Assessing second language speaking

Introduction
While the viva voce (oral) examination has always been used in content-based educational assessment (Latham 1877: 132), the assessment of second language (L2) speaking in performance tests is relatively recent. The impetus for the growth in testing speaking during the 19th and 20th centuries is twofold. Firstly, in educational settings the development of rating scales was driven by the need to improve achievement in public schools, and to communicate that improvement to the outside world. Chadwick (1864, see timeline) implies that the rating scales first devised in the 1830s served two purposes: providing information to the classroom teacher on learner progress for formative use, and generating data for school accountability. From the earliest days, such data was used by parents to select schools for their children in order to 'maximize the benefit of their investment' (Chadwick 1858). Secondly, in military settings it was imperative to be able to predict which soldiers were able to undertake tasks in the field without risk to themselves or other personnel (Kaulfers 1944, see timeline). Many of the key developments in speaking test design and rating scales are linked to military needs.
The speaking assessment project is therefore primarily a practical one. The need for speaking tests has expanded from the educational and military domain to decision making for international mobility, entrance to higher education, and employment. But investigating how we make sound decisions based on inferences from speaking test scores remains the central concern of research. A model of speaking test performance is essential in this context, as it helps focus attention on facets of the testing context under investigation. The first such model, developed by Kenyon (1992), was subsequently extended by McNamara (1995), Milanovic & Saville (1996), Skehan (2001), Bachman (2001), and most recently by Fulcher (2003: 115), providing a framework within which research might be structured. The latter is reproduced as Figure 1 to indicate the extensive range of factors that have been and continue to be investigated in speaking assessment research, and these are reflected in my selection of themes and associated papers for this timeline.
Overviews of the issues illustrated in Figure 1 are discussed in a number of texts devoted to assessing speaking that I have not included in the timeline (Lazaraton 2002; Fulcher 2003; Luoma 2004; Taylor 2011). Rather, I have selected publications based on 12 themes that arise from these texts, from Figure 1, and from my analysis of the literature.

Figure 1. An expanded model of speaking test performance (Fulcher 2003: 115)
Themes that pervade the research literature are rating scale development, construct definition, operationalization, and validation. Scale development and construct definition are inextricably bound together because it is the rating scale descriptors that define the construct. Yet, rating scales are developed in a number of different ways. The data-based approach requires detailed analysis of performance. Others are informed by the views of expert judges using performance samples to describe levels. Some scales are a patchwork quilt created by bundling descriptors from other scales together based on scaled teacher judgments. How we define the speaking construct and how we design the rating scale descriptors are therefore interconnected. Design decisions therefore need to be informed by testing purpose and relevant theoretical frameworks.
Underlying design decisions are research issues that are extremely contentious. Perhaps these can be presented in a series of binary alternatives to show stark contrasts, although in reality there are clines at work.
Specific purposes tests vs. Generalizability. Should the construct definition and task design be related to specific communicative purposes and domains? Or is it possible to produce test scores that are relevant to any and every type of real-world decision that we may wish to make? This is critical not least because the more generalizable we wish scores to be, the more difficult it becomes to select test content.
Psycholinguistic criteria vs. Sociolinguistic criteria. Closely related to the specific purpose issue is the selection of scoring criteria. Usually, the more abstract or psycholinguistic the criteria used, the greater the claims made for generalizability. These criteria or 'facilities' are said to be part of the construct of speaking that is not context dependent. These may be the more traditional constructs of 'fluency' or 'accuracy', or more basic observable variables related to automaticity of language processing, such as response latency or speed of delivery. The latter are required for the automated assessment of speaking. Yet, as the generalizability claim grows, the relationship between score and any specific language use context is eroded. This particular antithesis is not only a research issue, but one that impacts upon the commercial viability of tests; it is therefore not surprising that from time to time the arguments flare up, and research is called into the service of confirmatory defence (Chun 2006; Downey et al. 2008).
Normal conversation vs. Domain specific interaction. It is widely claimed that the 'gold standard' of spoken language is 'normal' conversation, loosely defined as interactions in which there are no power differentials, so that all participants have equal speaking rights. Other types of interaction are compared to this 'norm' and the validity of test formats such as the interview is brought into question (e.g. Johnson 2001). But we must question whether 'friends chatting' is indeed the 'norm' in most spoken interaction. In higher education, for example, this kind of talk is very rare, and scores from simulated 'normal' conversations are unlikely to be relevant to communication with a professor, accommodation staff, or library assistants. Research that describes the language used in specific communicative contexts to support test design is becoming more common, such as that in academic contexts to underpin task design (Biber 2006).
Rater cognition vs. Performance analysis. It has become increasingly common to look at 'what raters pay attention to'. When we discover what is going on in their heads, should it be treated as construct irrelevant if it is at odds with the rating scale descriptors and/or an analysis of performance on test tasks? Or should it be used to define the construct and populate the rating scale descriptors? Do all raters bring the same analysis of performance to the task? Or are we merely incorporating variable degrees of perverseness that dilute the construct? The most challenging question is perhaps: Are rater perceptions at odds with reality?
Freedom vs. Control. Left to their own devices, raters tend to vary in how they score the same performance. The variability decreases if they are trained, and it decreases over time through the process of social moderation. With repeated practice raters start to interpret performances in the same way as their peers. But when severed from the collective for a period of time, judges begin to reassert their own individuality, and disagreement rises. How do we identify and control this variability? This question now extends to interlocutor behaviour, as we know that interlocutors provide differing levels of scaffolding and support to test takers. This variability may lead to different scores for the same test taker depending on which interlocutor they work with. Much work has been done on the co-construction of speech in test contexts. And here comes the crunch. For some, this variation is part of a richer speaking construct and should therefore be built into the test. For others, the variation removes the principle of equality of experience and opportunity at the moment of testing, and therefore the interlocutors should be controlled in what they say. In face-to-face speaking tests we have seen the growth of the interlocutor frame to control speakers, and proponents of indirect speaking tests claim that the removal of an interlocutor eliminates subjective variation.
The selection of publications to illustrate a timeline is inevitably subjective to some degree, and the list cannot be exhaustive. My selection avoids clustering in particular years or decades, and attempts to show how the contrasts and themes identified play out historically. You will notice that themes H and I are different from the others in that they are about particular methodologies. I have included these because of their pervasiveness in speaking assessment research, and because they may help others to identify key discourse studies or multi-faceted Rasch measurement (MFRM) studies. What I have not been able to cover is the assessment of pronunciation and intonation, or the detailed issues surrounding semi-direct (or simulated) tests of speaking, both of which require separate timelines. Finally, I am very much aware that the assessment of speaking was common in the United Kingdom from the early 20th century. Yet, there is sparse reference to research outside the United States in the early part of the timeline. The reason for this is that, apart from Roach (see timeline; reprinted as an appendix in Weir, Vidaković & Galaczi (eds.) 2013), there is very little published research from Europe (Fulcher 2003: 1). The requirement that research is in the public domain for independent inspection and critique was a criterion for selection in this timeline. For a retrospective interpretation of the early period in the United Kingdom with reference to unpublished material and confidential internal examination board reports to which we do not have access, see Weir & Milanovic (2003) and Vidaković & Galaczi (2013).

The earliest record of an attempt to assess L2 speaking dates to the first few years after Rev. George Fisher became Headmaster of the Greenwich Royal Hospital School in 1834. In order to improve and record academic achievement, he instituted a 'Scale Book', which recorded performance on a scale of 1 to 5 with quarter intervals. A scale was created for L2 French, with typical speaking prompts to which boys would be expected to respond at each level. The Scale Book has not survived.

Scales of various kinds were developed by social scientists like Galton and Cattell towards the end of the 19th century, but it was not until the work of Thorndike in the early 20th century that the definition of each point on an equal interval scale was revived. With reference to speaking German, he suggested that performance samples should be attached to each level of a scale, along with a descriptor that summarizes the ability being tested.

Roach was among the first to investigate rater reliability in speaking tests. He was concerned primarily with maintaining 'standards', by which he meant that examiners would agree on which test takers were awarded a pass, a good pass, and a very good pass, on the Certificate of Proficiency in English. He was the first to recommend what we now call 'social moderation' (see MISLEVY 1992): familiarization with the system through teamwork, which results in agreement evolving over time.

1952/1958
Foreign Service Institute (1952/1958).

The first construct validation studies were carried out in the early 1980s, using the multitrait-multimethod technique and confirmatory factor analysis. These demonstrated that the FSI OPI loaded most heavily on the speaking trait, and lowest of all methods on the method factor. These studies concluded that there was significant convergent and divergent evidence for construct validity in the OPI.
In the 1960s the FSI approach to assessing speaking was adopted by the Defense Language Institute, the Central Intelligence Agency, and the Peace Corps. In 1968 the various adaptations were standardized as the Interagency Language Roundtable (ILR), which is still the accepted tool for the certification of L2 speaking proficiency throughout the United States military, intelligence and diplomatic services (www.govtilr.org/). Via the Peace Corps it spread to academia, and to the assessment of speaking proficiency worldwide. It also provides the basis for the current NATO language standards, known as STANAG 6001.

Lantolf & Frawley were among the first to question the ACTFL approach. They claimed the scales were 'analytical' rather than 'empirical', depending on their own internal logic of non-contradiction between levels.

A, C, D
The claim that the descriptors bear no relationship to how language is acquired or used set off a whole chain of research into scale analysis and development.
Kramsch's research into interactional competence spurred further research into task types that might elicit interaction, and the construction of 'interaction' descriptors for rating scales. This research had a particular impact on future discourse-related studies by YOUNG & HE (1998).

This very influential paper questioned the use of the native speaker to define the top level of a rating scale, and the notion of zero proficiency at the bottom. Secondly, the researchers questioned reference to context within scales as confounding constructs with test method facets, unless the test is for a defined ESP setting. This paper therefore set the agenda for debates around score generalizability, which we still wrestle with today.

B, D, F
1987

Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. English Language Teaching Journal 41.4, 287-291.
Using discourse analysis of native speaker interaction, this paper provided the first evidence that rating scales did not describe what typically happened in naturally occurring speech, and advocated a data-based approach to writing descriptors and constructing scales. This was the first use of discourse analysis to understand under-specification in rating scale descriptors, and was expanded into a larger research agenda (see FULCHER 1996).
In another discourse analysis study, Van Lier showed that interview language was not like 'normal conversation'. Although the work of finding formats that encouraged 'conversation' had started with REVES (1980) and colleagues in Israel, this paper encouraged wider research in the area.

Rater variation had been a concern since the work of ROACH (1945) during the war, but only with the publication of Linacre's FACETS did it become possible to model rater harshness/leniency in relation to task difficulty and learner ability. MFRM remains the standard tool for studying rater behaviour and test facets today (the basic model is sketched below), as in the studies by LUMLEY & MCNAMARA (1995) and BONK & OCKEY (2003).

Based on research driving the IELTS revision project, Alderson categorized rating scales as user-oriented, rater-oriented, and constructor-oriented. These categories have been useful in guiding descriptor content with audience in mind.
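For readers unfamiliar with MFRM, the model implemented in FACETS can be sketched in its most common form (a standard textbook formulation, not one taken from the studies cited above). For test taker n attempting task i, rated by rater j on a scale whose categories are indexed by k:

log(P_nijk / P_nij(k-1)) = B_n - D_i - C_j - F_k

where P_nijk is the probability of being awarded category k rather than k-1, B_n is the test taker's ability, D_i the difficulty of the task, C_j the severity of the rater, and F_k the difficulty of the step from category k-1 to k. Because these parameters are estimated jointly on a common logit scale, rater harshness/leniency can be separated from task difficulty and learner ability, which is what makes the method so useful for the studies of rater and interlocutor variability discussed in this timeline.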

A

1992
Young, R. & M. Milanovic (1992).

Chalhoub-Deville investigated the inter-relationship of diverse tasks and raters using multidimensional scaling to identify the components of speaking proficiency that were being assessed. She found that these varied by task and rater group, and therefore called for the construct to be defined anew for each task/rater combination. The issue at stake is whether the construct has any psychological reality independently of context-specific performances.

A methodological study to compare the 'informational and interactional functions' produced on speaking test tasks with those the test designer intended to elicit. The instrument proved to be unwieldy and impractical, but the study established the important principle for examination boards that evidence of congruence between intention and reality is an important aspect of construct validation.
A much-quoted study into variation in the speech of the same test taker with two different interlocutors. Building on ROSS & BERWICK (1992), LAZARATON (1996) and MCNAMARA (1996), Brown demonstrated that scores also varied, although not by as much as one might have expected. The paper raises the critical issue of whether variation should be allowed because it is part of the construct, or controlled because it leads to inequality of opportunity.
Using FACETS, the researchers investigated variability due to test taker, prompt, rater, and rating categories. Test taker ability was the largest facet. Although there was evidence of rater variability, this did not threaten validity, and indicated that raters became more stable in their judgments over time. This adds to the evidence that socialization over time has an impact on rater behaviour.

An important prototyping study. Pre-operational tasks were shown to experts who judged whether they represented the kinds of tasks that students would undertake at university. The experts were also presented with their own students' responses to the tasks and asked whether these were 'typical' of their work. The study shows that test development is a research-led activity, and not merely a technical task. Design decisions and the evidence for those decisions are part of a validation narrative.
Based on many years of research into personality and speaking test performance, Berry shows that levels of introversion and extroversion impact on contributions to conversation in paired and group formats, and result in differential score levels when ability is controlled for.

B, C, L

2008
Galaczi, E. D. (2008). Peer-peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly 5.2, 89-119.
Galaczi presents a discourse analytic study of the paired test format, in which two candidates are required to converse with each other, as well as with the examiner/interlocutor. The research identified three interactive patterns in the data: 'collaborative', 'parallel' and 'asymmetric'. Tentative evidence is also presented to suggest that there is a relationship between these patterns and scores on an 'Interactive Communication' rating scale.

Building on BERRY (2007), Ockey investigates the effect of levels of 'assertiveness' on speaking scores in a group oral test, using MANCOVA analyses. Assertive students are found to have lower scores when placed in all-assertive groups, and higher scores when placed with less assertive participants. The scores of non-assertive students did not change depending on group makeup. The results differ from BERRY, indicating that much more research is needed in this area.

B, C, L
Integrated task types have become widely used since their incorporation into TOEFL iBT. However, little research has been carried out into the use of source material in spoken responses, or how the integrated skill can be described in rating scale descriptors. The 'integration' remains elusive. In this study a discourse approach is adopted, following ideas in DOUGLAS & SELINKER (1992) and FULCHER (1996), to define content-related aspects of validity in integrated task types. The study provides evidence for the usefulness of integrated tasks in broadening construct definition.
Following KRAMSCH (1986), MCNAMARA (1997) and YOUNG (2002), May problematizes the notion of the speaking construct in a paired speaking test. However, she attempts to deal with the problem of how to award scores to individuals by looking at how raters focus on features of the speech of individual participants. The three categories of interpretation (understanding the interlocutor's message, responding appropriately, and using communicative strategies) are not as important as the attempt to disentangle the individual from the event, while recognizing that discourse is co-constructed.

B, C, K
2011

Nakatsuhara, F. (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing 28.4, 483-508.
Building on BONK & OCKEY (2003) and other research into the group speaking test, Nakatsuhara used conversation analysis to investigate group size in relation to proficiency level and personality type. She discovered that more proficient extroverts talked more and initiated topics more when in groups of four than in groups of three. However, proficiency level resulted in more variation in groups of three. With reference to GALACZI (2008), she concludes that groups of three are more collaborative.
Very much against the trend, Van Moere makes a case for a return to assessing psycholinguistic speech 'facilitators', related to processing automaticity. These include response latency, speed of speech, length of pauses, and the reproduction of syntactically accurate sequences, with appropriate pronunciation, intonation and stress. Task types are sentence repetition and sentence building. This approach is driven by an a priori decision to use an automated scoring engine to rate speech samples. The validation argument stresses the objective nature of the decisions, compared with the unreliable and frequently irrelevant judgments of human raters. This is an exercise in reductionism par excellence, and is likely to reignite the debate on prediction to domain performance from 'atomistic' features that last raged in the early communicative language testing era.

This paper applies fuzzy logic to our understanding of how raters score performances. This approach takes into account both rater decisions, and the levels of uncertainty in arriving at those decisions.
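Because the psycholinguistic measures mentioned in the Van Moere annotation above (response latency, speed of speech, length of pauses) are defined operationally, a brief sketch may help readers picture what an automated scoring engine actually extracts from a response. The Python fragment below is purely illustrative: the word-level timestamps, the 0.25-second pause threshold and the function names are my own assumptions for the example, not details taken from Van Moere or from any particular scoring system.

    # Illustrative sketch only: simple timing-based speaking measures of the kind
    # an automated scoring engine might compute. All values and thresholds here
    # are hypothetical.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Word:
        text: str
        start: float  # onset in seconds from the start of the recording
        end: float    # offset in seconds

    def response_latency(prompt_end: float, words: List[Word]) -> float:
        """Seconds between the end of the prompt and the first word onset."""
        return words[0].start - prompt_end

    def speech_rate(words: List[Word]) -> float:
        """Words per minute over the whole response, pauses included."""
        duration = words[-1].end - words[0].start
        return len(words) / duration * 60.0

    def mean_pause_length(words: List[Word], threshold: float = 0.25) -> float:
        """Mean length of inter-word silences longer than `threshold` seconds."""
        pauses = [b.start - a.end for a, b in zip(words, words[1:])
                  if b.start - a.end > threshold]
        return sum(pauses) / len(pauses) if pauses else 0.0

    if __name__ == "__main__":
        # Hypothetical time-aligned response to a sentence-repetition prompt.
        response = [Word("the", 2.1, 2.3), Word("train", 2.35, 2.7),
                    Word("leaves", 3.4, 3.8), Word("at", 3.85, 3.95),
                    Word("noon", 4.0, 4.4)]
        print(round(response_latency(prompt_end=1.5, words=response), 2))  # 0.6
        print(round(speech_rate(response), 1))                             # 130.4
        print(round(mean_pause_length(response), 2))                       # 0.7

In practice such timings would come from a speech recognizer's forced alignment and would then be weighted into a score; the point of the sketch is simply that each 'facilitator' reduces to arithmetic over timings, which is what makes fully automated rating feasible, and also what fuels the reductionism charge noted above.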
Nitta & Nakatsuhara investigate the effects of providing test takers with planning time prior to undertaking a paired speaking test. The unexpected findings are that planning time results in stilted, prepared output and reduced interaction between speakers.