Cambridge Michigan Language Assessments

Shaping English Language Assessments
with Research + Experience

Working Papers

CaMLA Working Papers present the findings from research undertaken as part of the Spaan Research Grant Program and by CaMLA staff.

2016

Investigating Lexico-grammatical Complexity as Construct Validity Evidence for the ECPE Writing Tasks: A Multidimensional Analysis

Author(s): Yan, Xun; Staples, Shelley
Working Paper Number: 2016-01
Source: CaMLA Working Papers
Page Count: 16

Abstract: The complexity of lexico-grammatical features is widely recognized as an integral part of writing proficiency in second language (L2) writing assessment. However, a remaining concern for the construct validation of writing tasks lies in the scalability of representative linguistic features in writing performances. Previous research suggests that distinctions across different levels of writing proficiency are not necessarily associated with individual lexico-grammatical features, but rather with the co-occurrence of multiple features (Biber, Gray, & Staples, 2016; Friginal, Li, & Weigle, 2014; Jarvis, Grant, Bikowski, & Ferris, 2003).

In an effort to investigate the scalability of lexico-grammatical complexity, this study used a multidimensional (MD) analysis to examine the salience and co-occurrence patterns of 31 lexico-grammatical features in 595 writing performances on a large-scale, advanced-level English language proficiency examination, the Examination for the Certificate of Proficiency in English (ECPE). The linguistic features were classified into four categories: fluency, lexical sophistication, semantic categories for word classes, and general grammatical features, all of which have been found to characterize written discourse and advanced L2 writing proficiency (e.g., Biber, Gray, & Staples, 2016).

Results of the MD analysis indicate five underlying factors, representing five functional dimensions of lexico-grammatical complexity in ECPE writing performances: literate vs. oral discourse, topic-related content, prompt dependence vs. lexical diversity, overt suggestions, and stance vs. referential discourse. Together, the five dimensions accounted for 35% of the holistic score variance. While factor scores on the prompt dependence vs. lexical diversity dimension did not correlate significantly with the holistic ECPE writing scores awarded by human raters, correlations for the other four dimensions were linear and statistically significant. Among these four dimensions, only three demonstrated significant differences across essays of different score levels. Findings of this study provide supporting evidence for different layers of construct validity of the ECPE writing tasks and suggest the scalability of the ECPE writing scale with respect to lexico-grammatical complexity.
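For readers who want to try this kind of analysis on their own data, the sketch below shows one minimal way to factor-analyze standardized feature rates and correlate the resulting dimension scores with holistic scores. It is an illustration only: the file name and column labels are hypothetical, and scikit-learn's varimax-rotated factor analysis stands in for the principal-axis, promax-rotated procedure typically used in MD studies.

```python
# Minimal sketch (not the authors' code): an exploratory factor analysis over
# standardized lexico-grammatical feature rates, in the spirit of an MD analysis.
# Feature names, file paths, and column labels are hypothetical.
import pandas as pd
from scipy.stats import pearsonr, zscore
from sklearn.decomposition import FactorAnalysis

essays = pd.read_csv("ecpe_features.csv")        # hypothetical: one row per essay
feature_cols = [c for c in essays.columns if c not in ("essay_id", "holistic_score")]

X = essays[feature_cols].apply(zscore)           # standardize each feature (normalized rates upstream)
fa = FactorAnalysis(n_components=5, rotation="varimax", random_state=0)
dim_scores = fa.fit_transform(X)                 # one score per essay on each dimension

# Correlate each dimension score with the holistic writing score
for d in range(5):
    r, p = pearsonr(dim_scores[:, d], essays["holistic_score"])
    print(f"Dimension {d + 1}: r = {r:.2f}, p = {p:.3f}")
```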


2015

A Practical Guide to Investigating Score Reliability under a Generalizability Theory Framework

Author(s): Lin, Chih-Kai
Working Paper Number: 2015-01
Source: CaMLA Working Papers
Page Count: 11

Abstract: The current study aims to compare the precision of two analytical approaches to estimating score reliability in performance-based language assessments. The two methods operate under a generalizability theory framework and have been successfully applied in various language assessment contexts to deal specifically with sparse rated data. Given the advantages of working with fully crossed data, the two methods were designed to transform a sparse dataset into variants of fully crossed data. The rating method conceptualizes individual ratings, irrespective of the raters, as a random facet. The rater method identifies all possible blocks of fully crossed subdatasets from a sparse data matrix and estimates score reliability based on these fully crossed blocks. Results suggest that when raters are expected to have similar score variability, the rating method is recommended for operational use given that it is as precise as the rater method but much easier to implement in practice. However, when raters are expected to have varying degrees of score variability, such as a mixture of novice and seasoned raters rating together, the rater method is recommended because it yields more precise reliability estimates. Informed by these results, the current study also demonstrates a step-by-step analysis plan for investigating the score reliability of the speaking component of the Examination for the Certificate of Proficiency in English (ECPE).
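As a concrete reference point for the generalizability-theory machinery discussed above, here is a minimal sketch of variance-component estimation and a G coefficient for a small, fully crossed persons-by-raters dataset. The numbers are invented, and the code is a generic textbook illustration rather than either of the two methods' implementations.

```python
# Minimal sketch (assumed, not from the paper): variance components and a
# generalizability (G) coefficient for a fully crossed persons x raters design,
# estimated from two-way ANOVA mean squares. Data are hypothetical.
import numpy as np

scores = np.array([  # rows = persons, columns = raters (or "ratings" under the rating method)
    [3, 4], [4, 4], [2, 3], [5, 5], [3, 3], [4, 5],
], dtype=float)
n_p, n_r = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
resid = scores - person_means[:, None] - rater_means[None, :] + grand
ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

var_pr = ms_pr                        # person-by-rater interaction (plus error)
var_p = max((ms_p - ms_pr) / n_r, 0)  # universe-score variance
var_r = max((ms_r - ms_pr) / n_p, 0)  # rater main effect

g_coefficient = var_p / (var_p + var_pr / n_r)   # relative G coefficient with n_r raters
print(var_p, var_r, var_pr, g_coefficient)
```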



The Characteristics of the Michigan English Test Reading Texts and Items and their Relationship to Item Difficulty

Author(s): Barkaoui, Khaled
Working Paper Number: 2015-02
Source: CaMLA Working Papers
Page Count: 36

Abstract: This study aimed, first, to describe the linguistic and discourse characteristics of the Michigan English Test (MET) reading texts and items and, second, to examine the relationships between the characteristics of MET texts and items, on the one hand, and item difficulty and bias indices, on the other. The study included 54 reading texts and 216 items from six MET forms that were administered to 6,250 test takers. The MET texts and items were coded in terms of 22 features. Next, item difficulty and bias indices were estimated. Then, the relationships between these text and item characteristics and the difficulty and bias indices were examined. The findings indicated that the sample of MET texts and items included in the study exhibited several desirable features that support the validity argument of the MET reading subsection. Additionally, some problematic characteristics of the texts and items were identified that need to be addressed in order to improve the test. The study demonstrates how to combine task and score analyses in order to examine important questions concerning the validity argument of second-language reading tests and to provide information for improving texts and items on such tests.
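A minimal sketch of the score-analysis side of such a design is shown below: classical item facility is computed from scored responses and then correlated with coded text and item features. The file names, column labels, and the use of classical (rather than IRT-based) difficulty indices are assumptions made for illustration, not details taken from the study.

```python
# Minimal sketch (assumptions, not the study's code): classical item facility
# per reading item, correlated with coded item/text features.
import pandas as pd
from scipy.stats import spearmanr

responses = pd.read_csv("met_responses.csv")   # rows = test takers, columns = item_1 ... item_n (0/1)
features = pd.read_csv("met_item_features.csv", index_col="item")  # coded text/item characteristics

item_cols = [c for c in responses.columns if c.startswith("item_")]
facility = responses[item_cols].mean()         # proportion of correct responses per item
difficulty = 1 - facility                      # higher = harder

# Assumes the feature table lists items in the same order as the response columns
for feat in features.columns:
    rho, p = spearmanr(difficulty.values, features[feat].values)
    print(f"{feat}: rho = {rho:.2f}, p = {p:.3f}")
```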



A Validation Study of the Reading Section of the Young Learners Tests of English (YLTE)

Author(s): Winke, Paula; Lee, Shinhye; Jieun Ahn, Irene; Choi, Ina; Cui, Yaqiong; Yoon, Hyung-Jo
Working Paper Number: 2015-03
Source: CaMLA Working Papers
Page Count: 30

Abstract: In this study we investigated the validity of the reading and writing sections of CaMLA’s Bronze and Silver Young Learners Tests of English (YLTE). A test’s validity can be analyzed from many angles. We took the following approach: First, we evaluated whether the tests are appropriate for measuring the reading and writing skills of a particular group of learners: 19 children ages 7 to 9 (7 native English speakers and 12 English language learners [ELLs]). We also looked specifically at the cognitive validity (Weir, 2005) of the tests, that is, whether the tests measure the skills intended by the test developers. We followed Green’s (2014) suggestions for monitoring a test’s cognitive validity: (a) we observed how the children performed and analyzed (qualitatively) their test-taking behaviors, and (b) we interviewed the children to try to understand what they thought about the test, how they found a correct answer, or how they decided on their responses.

Seven native speakers and 12 ELLs (with Korean or Mandarin Chinese native languages) took the tests. We videotaped the children as they took the tests, had each draw a picture of how he or she felt during each test, and interviewed the children about their test-taking experiences. Given the score outcomes, the tests appear reliable and consistent in discriminating learners from native speakers. Analyses indicated that three items on the Bronze test (out of 25 items) and five on the Silver (out of 40) were more difficult for native speakers than for ELLs. We showcase those eight items and use our qualitative data and research into child language development to propose reasons why the items were inversely discriminating. We argue that piloting on native speakers can reveal when incorrect responses stem from something other than reading or writing problems, such as a lack of assessment literacy, developmentally appropriate overgeneralizations of grammatical rules, or age-related limitations in morphological-rule learning or cognitive control. We conclude that all tests can be improved, even those that are already structurally and psychometrically reliable and valid.
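The group comparison of item difficulty can be illustrated with a small sketch like the following, which computes per-item facility (proportion correct) separately for native speakers and ELLs. The data layout, file name, and group labels are hypothetical.

```python
# Minimal sketch (hypothetical data layout): item facility by group, used to flag
# items that turned out harder for native speakers than for ELLs.
import pandas as pd

df = pd.read_csv("ylte_bronze_responses.csv")   # one row per child, 0/1 item columns, plus "group"
item_cols = [c for c in df.columns if c.startswith("item_")]

facility = df.groupby("group")[item_cols].mean().T    # rows = items, columns = "ELL" / "native"
facility["gap"] = facility["ELL"] - facility["native"]
print(facility.sort_values("gap", ascending=False))   # large positive gap = harder for native speakers
```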



Variability in the MELAB Speaking Task: Investigating Linguistic Characteristics of Test-Taker Performances in Relation to Rater Severity and Score

Author(s): LaFlair, Geoffrey T.; Staples, Shelley; Egbert, Jesse
Working Paper Number: 2015-04
Source: CaMLA Working Papers
Page Count: 22

Abstract: The overall goal of this study was to examine the extent to which variability across test-taker performances is captured by score and affected by variability in rater severity. First, a Rasch analysis examined rater severity and rater use of the MELAB speaking scale. Second, the linguistic characteristics of test-taker performances were investigated in relation to assigned scores and rater severity. The results of the Rasch analyses indicated a wide range of rater severity and underuse of the lower end of the scale. The results of the linguistic analyses showed significant correlations between test-taker scores and features of speech, interaction, and language. However, no significant correlations were found between linguistic features of test takers’ performances and rater severity. These results provide evidence that the linguistic features typical of conversation occur more frequently as MELAB speaking performance increases. Additionally, they provide partial evidence that the linguistic features of test-taker language elicited by the MELAB speaking task do not vary across raters.
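The second step can be illustrated with a minimal sketch that correlates normalized linguistic feature rates in each performance with (a) the assigned score and (b) the severity of the rater who scored it. The file name and column labels are hypothetical, and the rater severity values are assumed to come from a many-facet Rasch model fitted in dedicated software beforehand.

```python
# Minimal sketch (not the authors' pipeline): Spearman correlations between
# linguistic feature rates and score / rater severity. Names are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

perf = pd.read_csv("melab_speaking_features.csv")   # one row per speaking performance
feature_cols = [c for c in perf.columns if c not in ("score", "rater_severity")]

for feat in feature_cols:
    r_score, p_score = spearmanr(perf[feat], perf["score"])
    r_sev, p_sev = spearmanr(perf[feat], perf["rater_severity"])
    print(f"{feat}: score rho = {r_score:.2f} (p = {p_score:.3f}), "
          f"severity rho = {r_sev:.2f} (p = {p_sev:.3f})")
```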



Linguistic Features in MELAB Writing Task Performances

Author(s): Jung, YeonJoo; Crossley, Scott A.; McNamara, Danielle S.
Working Paper Number: 2015-05
Source: CaMLA Working Papers
Page Count: 17

Abstract: This study explores whether linguistic features can predict second language writing proficiency in the Michigan English Language Assessment Battery (MELAB) writing tasks. Advanced computational tools such as Coh-Metrix (Graesser, McNamara, Louwerse, & Cai, 2004), the Tool for the Automatic Analysis of Cohesion (TAACO; Crossley, Kyle, & McNamara, under review), and the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, in press) were used to automatically assess linguistic features related to lexical sophistication, syntactic complexity, cohesion, and text structure in writing samples graded by expert raters. The findings of this study show that an analysis of linguistic features can significantly predict human judgments of essay quality on the MELAB writing tasks. Furthermore, the findings indicate the relative contribution of a range of linguistic features in MELAB essays to overall L2 writing proficiency scores. For instance, linguistic features associated with text length and lexical sophistication were found to be more predictive of writing quality on the MELAB than those associated with cohesion and syntactic complexity. This study has important implications for defining writing proficiency at different levels of achievement in L2 academic writing as well as for improving the current MELAB rating scale and rater training practices. Directions for future research are also discussed.
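For illustration, a minimal sketch of the prediction step is shown below: a cross-validated linear regression of holistic scores on a table of automatically extracted indices. The file name, column labels, and modeling choices are assumptions for the sketch, not details taken from the study.

```python
# Minimal sketch (assumed workflow): predicting holistic writing scores from
# automatically extracted indices (e.g., Coh-Metrix / TAACO / TAALES output
# merged into one table). File and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("melab_writing_indices.csv")
X = data.drop(columns=["essay_id", "holistic_score"])
y = data["holistic_score"]

model = LinearRegression()
r2_folds = cross_val_score(model, X, y, cv=10, scoring="r2")
print(f"Mean cross-validated R^2: {r2_folds.mean():.2f}")

# Fit on the full set to inspect which indices carry the most weight
model.fit(X, y)
weights = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(weights.head(10))
```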


2014

Treating Either Ratings or Raters as a Random Facet in Performance-Based Language Assessments: Does it Matter?

Author(s): Lin, Chih-Kai
Working Paper Number: 2014-01
Source: CaMLA Working Papers
Page Count: 13

Abstract: This paper compares two analytical methods of estimating variance components in performance-based language assessments under the generalizability theory framework. Both methods deal with the analysis of sparse data commonly observed in rated language tests. First, the rater method identifies blocks of fully crossed subdatasets and then estimates variance components based on a weighted average across these subdatasets. Second, the rating method forces a sparse dataset to be a fully crossed one by conceptualizing ratings as a random facet and then estimates variance components by the usual crossed-design procedures. Specifically, the current paper compares the estimation precision of the two methods via Monte Carlo simulations. Results show that when raters exhibit similar variability in their scoring, either method yields good estimates of variance components. However, when raters are heterogeneous in their score variability, the rater method yields more precise estimates than the rating method. Implications for methodological approaches to handling sparse rated data are discussed. Finally, the study demonstrates applications of the two methods in analyzing operational noncrossed datasets from the writing component of the Michigan English Language Assessment Battery (MELAB).
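The data restructuring behind the rating method can be shown in a few lines: each examinee's ratings are stacked as rating 1, rating 2, and so on, irrespective of which rater gave them, turning a sparse persons-by-raters matrix into a fully crossed persons-by-ratings matrix. The sketch below is an assumed illustration of that idea, not the paper's implementation, and the data are invented.

```python
# Minimal sketch (assumptions, not the paper's code) of the "rating method"
# restructuring described above.
import pandas as pd

long = pd.DataFrame({             # sparse long-format data: each essay scored by 2 of many raters
    "person": [1, 1, 2, 2, 3, 3],
    "rater":  ["A", "B", "B", "C", "A", "C"],
    "score":  [3, 4, 4, 5, 2, 3],
})

# Drop rater identity; keep only the order of ratings within each person
long["rating_slot"] = long.groupby("person").cumcount() + 1
crossed = long.pivot(index="person", columns="rating_slot", values="score")
print(crossed)   # persons x ratings matrix, ready for a crossed G-study
```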


Predicting Listening Item Difficulty with Language Complexity Measures: A Comparative Data Mining Study

Author(s): Aryadoust, Vahid; Goh, Christine C. M.
Working Paper Number: 2014-02
Source: CaMLA Working Papers
Page Count: 39

Abstract: Modelling listening item difficulty remains a challenge to this day. Latent trait models such as the Rasch model, which are used to predict the outcomes of test takers’ performance on test items, have been criticized as “thin on substantive theory” (Stenner, Stone, & Burdick, 2011, p. 3). The use of regression models to predict item difficulty also has its limitations, because linear regression assumes linearity and normality of the data; if these assumptions are violated, the model fits poorly. In addition, classification and regression trees (CART), despite their rigorous algorithm, do not always yield a stable tree structure (Breiman, 2001).

Another problem pertains to the operationalization of dependent variables. Researchers have relied on content specialists or verbal protocols elicited from test takers to determine the variables predicting item difficulty. However, even though content specialists are highly competent, they may not be able to determine precisely the lower-level comprehension processes used by low-ability test takers just by reading test items. Furthermore, verbal protocols elicited during test-taking may interfere with the cognitive task (Sawaki & Nissan, 2009).

Previous reading research has used CART to investigate item difficulty, but despite being competently conducted, the resulting regression trees have been inconsistent across test forms (Gao, 2006). In the current study, two classes of artificial neural networks (i.e., the Multilayer Perceptron ANN and the Adaptive Neuro-Fuzzy Inference System, or ANFIS) are used to explore the effect of the lexical and syntactic complexity of items and texts on item difficulty across seven Michigan English Test (MET) listening tests. In addition, Coh-Metrix measures, which have conventionally been used to measure reading text complexity, are also applied in this investigation (Riazi & Knox, 2014). To our knowledge, these methods have not previously been applied to investigate the lexical and syntactic complexity of listening texts and items. Findings from the study will contribute to the validity argument for the MET and provide additional empirical evidence to assist CaMLA in evaluating the quality of listening test items (see also Goh & Aryadoust, 2010).
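A minimal sketch of the neural-network side of this comparison appears below: a scikit-learn multilayer perceptron predicting item difficulty from complexity measures, evaluated with cross-validation. ANFIS is not included, and the file name, column labels, and hyperparameters are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch (assumed setup): an MLP regressor predicting listening item
# difficulty (e.g., Rasch logits) from lexical, syntactic, and Coh-Metrix-style
# complexity measures of items and texts. Names are hypothetical.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

items = pd.read_csv("met_listening_items.csv")
X = items.drop(columns=["item_id", "difficulty"])
y = items["difficulty"]

mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0),
)
r2 = cross_val_score(mlp, X, y, cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {r2.mean():.2f}")
```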