By Jaylin Nesbitt and Sarah Quesen 

Nesbitt is a Research Associate and Quesen is the Director for the Assessment Research and Innovation team at WestEd.  

With ongoing technological advancements, many high-stakes assessment scoring programs are seeking more efficient methods to evaluate student responses. How do we ensure that the introduction of machine scoring increases the reliability of scores without reproducing human biases?

In this post, we focus on the use of machine scoring to evaluate students on performance-based tasks and how practical, culturally responsive and sustaining approaches in these scoring systems can help ensure all students are evaluated fairly.

How Machine Scoring Is Currently Done 

When scoring performance-based tasks on any assessment—formative, interim, or summative—a rubric defines how to rate a student’s ability to show what they know on the constructs measured. Scoring can be done by humans or machines. In the age of generative AI, like ChatGPT, low-stakes machine scoring looks much different than just a few years ago. However, in high-stakes assessments, the machine-scoring model is trained on human-scored responses—it takes hundreds or thousands of human-scored papers to train most models.

The process starts with carefully scoring sample papers representing each score point; these papers are later used for training and for monitoring the validity of scoring. Additional human scorers are then trained to score responses in accordance with the rating rubric. Typically, at least some papers are double scored by humans to check scoring agreement, and validity papers are dispersed throughout the process to monitor scoring.

A sample of these human-scored responses is divided into a set to train and a set to test an automated scoring model. Once the scoring model is on par with humans, the scoring engine can score alongside or take over for the human scorers. Most programs keep humans in the loop to score papers when the confidence in the engine's score is low and to monitor scoring throughout.
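
To make this workflow concrete, here is a minimal sketch, assuming a small set of placeholder human-scored responses: it splits them into training and test sets, fits a simple text classifier in place of a real scoring engine, and checks agreement with the held-out human scores. The data, model, and agreement statistic are stand-ins; operational engines use far richer features and far larger training sets.

```python
# Minimal sketch of the train/test workflow; all responses and scores below
# are placeholders, not real assessment data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# A real program would use hundreds or thousands of human-scored papers.
responses = [
    "the character changes because of the storm",
    "storm make character sad",
    "the author uses the storm to show how the character grows and changes",
    "i dont know",
    "the storm is a symbol of the character's inner conflict and growth",
    "character is sad",
    "the character learns from the storm and changes her view of her family",
    "no answer",
]
human_scores = [2, 1, 3, 0, 3, 1, 2, 0]

# Hold out a test set so the engine is judged on responses it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    responses, human_scores, test_size=0.25, random_state=0
)

# A simple bag-of-words classifier standing in for the scoring engine.
engine = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
engine.fit(X_train, y_train)

# Quadratic weighted kappa is one common way to check that engine scores are
# "on par with humans" before the engine scores alongside or replaces them.
engine_scores = engine.predict(X_test)
print(cohen_kappa_score(y_test, engine_scores, weights="quadratic"))
```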

Ensuring Scoring Systems Are Fair to All Students 

A scoring system can be improved at several points in this process, as described below.

  1. When developing rubrics, keep in mind that students from different backgrounds and cultures may have different approaches to showing what they know. Diverse responses should be included as correct responses and coded into the rubric as such.
  2. Papers representing each score point used for training and monitoring human scorers should include papers representing all students: students of different races, ethnicities, and income levels, multilingual learners, and students with an IEP. While this information need not be disclosed to the human scorers, it should be used to inform the selection of training and validity papers.
  3. Similarly, the training and test sets for the automated scoring engine should be pulled in such a way as to ensure that they include all students. Stratified sampling may be needed to ensure that there are sufficient counts of students in each group for statistical analyses of scoring performance, as sketched below.
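
As a rough illustration of point 3, the sketch below uses stratified sampling so that each student group keeps the same share of papers in the training and test sets. It is a minimal example with placeholder column names, group labels, and counts, not a prescription for any particular program.

```python
# Minimal stratified-split sketch; all data, column names, and group labels
# here are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

papers = pd.DataFrame({
    "response": ["text a", "text b", "text c", "text d"] * 25,
    "human_score": [0, 1, 2, 3] * 25,
    "group": ["group_1"] * 60 + ["group_2"] * 40,
})

# Stratifying on the group column keeps each group's share of papers the same
# in both sets, so group-level analyses have enough papers on each side.
train_set, test_set = train_test_split(
    papers, test_size=0.3, random_state=0, stratify=papers["group"]
)
print(test_set["group"].value_counts())
```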

How do we ensure the scores from a scoring system are fair to all students? When conducting statistical analyses of scores to evaluate scoring system agreement and accuracy, it is imperative to disaggregate these analyses by student group. Typically, analyses include correlations and standardized mean differences (SMDs). These metrics can be evaluated against set thresholds to flag areas where scoring agreement is low (e.g., an absolute value of the SMD no greater than 0.10). Again, it is not enough to rely on overall summary measures; we must look at each student group separately to ensure that all students receive the same quality of scores from a scoring engine as they would from the human scorers.
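
As a minimal sketch of such a disaggregated check, the example below computes the correlation and SMD between human and engine scores within each student group, using placeholder data, and flags any group whose absolute SMD exceeds the 0.10 threshold mentioned above. The SMD here uses a pooled standard deviation; programs may define the denominator differently.

```python
# Disaggregated agreement check sketch; scores and group labels are placeholders.
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "group": ["group_1"] * 50 + ["group_2"] * 50,
    "human": np.random.default_rng(0).integers(0, 4, 100),
})
scores["engine"] = (scores["human"]
                    + np.random.default_rng(1).integers(-1, 2, 100)).clip(0, 3)

def smd(human, engine):
    """Standardized mean difference between engine and human scores,
    using a pooled standard deviation as one possible denominator."""
    pooled_sd = np.sqrt((human.var(ddof=1) + engine.var(ddof=1)) / 2)
    return (engine.mean() - human.mean()) / pooled_sd

# Compute correlation and SMD within each student group and flag any group
# whose absolute SMD exceeds the 0.10 threshold.
for group, g in scores.groupby("group"):
    r = g["human"].corr(g["engine"])
    d = smd(g["human"], g["engine"])
    print(f"{group}: r = {r:.2f}, SMD = {d:.2f}, flag = {abs(d) > 0.10}")
```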

A Culturally Responsive and Sustaining Approach to Scoring Systems 

A framework for evaluating automated scoring with an emphasis on fairness is being developed as part of larger work focusing on a culturally responsive and sustaining approach to scoring systems (White et al., in review). The main ideas of the approach to automated scoring evaluation include:

  • Oversampling with intention to ensure that there are sufficient papers for statistical analyses.
  • Disaggregating scores by student group and carefully examining descriptive statistics including summary measures like correlations and standardized mean differences.
  • Evaluating measures of score point agreement by student group when summary measures are flagged (see the sketch after this list).
  • Including all results of group-wise descriptive and statistical analyses in a technical report.
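
For the agreement step, a sketch like the one below could compute exact agreement, adjacent agreement, and quadratic weighted kappa by student group; the handful of scores shown are placeholders only.

```python
# Score point agreement by group; all scores and labels are placeholders.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

scores = pd.DataFrame({
    "group": ["group_1"] * 4 + ["group_2"] * 4,
    "human": [0, 1, 2, 3, 0, 1, 2, 3],
    "engine": [0, 1, 2, 2, 1, 1, 2, 3],
})

for group, g in scores.groupby("group"):
    exact = (g["human"] == g["engine"]).mean()             # exact agreement rate
    adjacent = (abs(g["human"] - g["engine"]) <= 1).mean()  # within one score point
    qwk = cohen_kappa_score(g["human"], g["engine"], weights="quadratic")
    print(f"{group}: exact = {exact:.2f}, adjacent = {adjacent:.2f}, QWK = {qwk:.2f}")
```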

This evaluation helps to ensure that if papers are routed to an automated scoring engine, they are scored comparably to how humans would score them. It also makes the results of the evaluation transparent and available to other researchers. However, if the human scorers on whose scores the engine was trained are biased, then this evaluation will fall short of the goal of supporting valid score interpretations.

One step toward more inclusive approaches to automated scoring systems is better understanding the student population and integrating steps to prevent specific populations of students from being penalized relative to others. A recent study, The Presence of African American English (AAE) Markers in Student Writing (Nesbitt et al., 2023), investigated student responses on a large-scale assessment to determine whether bias was present when features of AAE appeared in student writing. NLP methods were used to create an index for AAE in the response data and to investigate whether students using these features were experiencing any form of penalization in their scores. Without an index for language, it is not possible to statistically examine the relationship between a student's language background and their score. This novel use of NLP yields valuable insight into the language features of AAE speakers, and the approach can readily be extended to build a better understanding of the language features of other student populations.
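
To make the idea of a language index concrete, here is a heavily simplified sketch that counts pre-defined marker patterns in each response and normalizes by response length. The patterns are placeholders, not the AAE features analyzed in the study; a real index would be built from a linguistically grounded feature set.

```python
# Simplified language-feature index; the marker patterns are hypothetical
# placeholders, not the features used in Nesbitt et al. (2023).
import re

MARKER_PATTERNS = [
    r"\bplaceholder_feature_one\b",
    r"\bplaceholder_feature_two\b",
]

def language_feature_index(response: str) -> float:
    """Count marker matches and normalize by response length in words."""
    words = response.split()
    if not words:
        return 0.0
    hits = sum(len(re.findall(p, response.lower())) for p in MARKER_PATTERNS)
    return hits / len(words)

# With an index per response, the relationship between the index and the
# assigned score can then be examined statistically by student group.
print(language_feature_index("an example response with placeholder_feature_one"))
```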

This study, and others like it, paves the way for future studies that explore large language models to inform better training of both human scorers and automated scoring models. Human raters are considered the gold standard, but humans are known to carry a wealth of biases based on their experiences and beliefs, whether intentional or not. Scoring programs need to do more than train scorers to rate papers to match a validity set. We need to include bias and sensitivity training to mitigate the transfer of human bias into the scoring system.

And before this, we need to consider training the rubric writers. And before this, we need to consider training the item writers. And before then, we need to consider training the standards writers. There are infinite steps before, before, before … that require us to question our assumptions about race, ethnicity, gender, language, and exceptionalities to ensure fairness and equity are centered in all we do.

We must take these steps, no matter how many, with intention so we can truly create and uphold an approach to assessment that is inclusive of the students we serve and prioritizes and honors their ways of knowing, being, and experiencing the world.