Feedback Prize Competition Series
The Feedback Prize series was designed to spur the development of open-source algorithms that help struggling students dramatically improve in writing.
The first competition in the series, Feedback Prize – Evaluating Student Writing, tasked participants with identifying argumentative elements in essays written by students in grades 6-12. The second competition, Feedback Prize – Predicting Effective Arguments, built on its predecessor by asking participants to create models that rate those argumentative elements as adequate, effective, or ineffective. The third and final competition, Feedback Prize – English Language Learning, focused specifically on English Language Learners (ELLs) and on using language proficiency to score writing more accurately.
The last two Feedback Prize competitions also featured a dual-prize track: one track focused on standard models and the other on computationally efficient models. The computationally efficient track gave participants an incentive to create models that are more environmentally friendly and easier to adapt to real-world educational contexts. The algorithms developed across all competitions will help students receive more individualized feedback on their writing!
To learn more about the datasets and competition series, read the case study:
Effective writing is a critical skill for success in college and career, but few students graduate high school as proficient writers: fewer than a third of high school seniors are proficient, according to the National Assessment of Educational Progress (NAEP). Low-income, Black, and Hispanic students fare far worse, with fewer than 15 percent considered proficient writers.
One way to help students improve their writing is to give students more opportunities to write and receive feedback on their writing. However, assigning more writing to students places a larger burden on teachers to generate timely feedback. One potential solution is the use of automated writing evaluation (AWE) systems, which can evaluate student writing and provide feedback independently.
The Feedback Prize Competition Series is the first of several series that GSU, Vanderbilt, and the Lab plan to launch to develop the algorithms that drive these systems. These new algorithms will be open-source and will help educators provide timely feedback to students.
As a part of this work, we collected a large number of student essays. In addition to hosting competitions, we plan to make the essay database available as a resource for educators and researchers to conduct analyses and research related to improving writing outcomes.
AUTOMATED WRITING EVALUATION
AWE systems have a long history. As early as 1968, Ellis Page developed the Project Essay Grade (PEG) program, which is now used by Measurement Inc. (Page, 1968). By 1982, programs like The Writers’ Workbench could review essays and provide feedback on spelling and grammar (Macdonald, Frase, Gingrich, and Keenan, 1982). The feedback algorithms found in AWE systems rely on corpora of essays that have generally been hand-coded by raters for specific elements related to writing. These elements may include holistic scores of writing quality, analytic scores that focus on specific text elements like organization, grammar, or vocabulary use, or annotations of argumentative elements like claims.
Over the past two decades, there has been a sea change in AWE systems. In 2012, the Hewlett Foundation hosted the Automated Student Assessment Prize (ASAP), which sought to demonstrate the reliability of automated essay scoring (AES) and generate machine learning innovations for analyzing student writing. More recently, AWE systems have seen many advances, including the ability not just to evaluate text but also to provide feedback on specific elements of student writing. There is evidence that these systems are effective. For instance, Revision Assistant reports that districts that use its products outperform state averages, and PEG Writing reports a decreased grading burden for teachers who use it.
However, these systems are not without their problems. Foremost is that they are proprietary, so the algorithms that drive their feedback and the effectiveness of that feedback are not available for examination. Many have argued that such tools are too easy to “game” and are not as reliable as humans. AWE systems largely do not focus on argumentation and lack the capability to identify and evaluate discourse structures found in argumentative writing. Additionally, many of these tools show bias towards specific subgroups. For example, ETS found that Black students were more likely to receive below-average scores when evaluated by its automated system.
When it comes to ELLs, there are few, if any, AWE systems that focus specifically on English language development. Writing is often the most challenging skill of language development for ELLs because it requires more language processing than reading and speaking.
Even if an English learner has tested “proficient” by state requirements, they have a different language development trajectory than a native English speaker. Yet, they don’t have the same level of monitoring. Moreover, some states allow students to reach English proficiency using an average of the scores across reading, speaking, and writing. As a result, some students no longer have ELL status at school but have not scored proficient in writing.
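To make the averaging issue concrete, here is a minimal sketch in Python. The 1-5 scale, the cutoff of 3.0, the function names, and the student's scores are all hypothetical values chosen for illustration; actual state scales and exit rules differ.

```python
from statistics import mean

# Hypothetical cutoff on a hypothetical 1-5 scale; real state rules differ.
PROFICIENCY_CUTOFF = 3.0


def is_proficient_by_average(scores, cutoff=PROFICIENCY_CUTOFF):
    """Averaging rule: proficient if the mean across domains meets the cutoff."""
    return mean(scores.values()) >= cutoff


def domains_below_cutoff(scores, cutoff=PROFICIENCY_CUTOFF):
    """List the individual domains in which the student scored below the cutoff."""
    return sorted(domain for domain, score in scores.items() if score < cutoff)


# Illustrative student: strong in reading and speaking, weak in writing.
student = {"reading": 4.0, "speaking": 3.5, "writing": 2.0}

exits_ell_status = is_proficient_by_average(student)  # mean is 9.5 / 3, about 3.17
weak_domains = domains_below_cutoff(student)

print(exits_ell_status, weak_domains)
```

Under these assumed numbers the student clears the averaged cutoff while still falling below it in writing, which is the situation described above.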
If tools helped teachers and schools monitor English proficiency and writing proficiency, especially among English learners, schools could identify ongoing challenges at both the group and individual level and target interventions appropriately.
DATASETS AND ANNOTATION
GSU, Vanderbilt, and the Lab collected a large number of essays from state and national education agencies, as well as non-profit organizations. From this collection, we developed The Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus, consisting of argumentative essays written by students in grades 6-12, and the ELL Insight, Proficiency and Skills Evaluation (ELLIPSE) corpus, consisting of essays written by ELLs in grades 8-12.
The PERSUADE corpus contains over 25,000 argumentative essays written by U.S. students in grades 6-12. The corpus includes demographic information on gender, race/ethnicity, and socioeconomic status. These essays were written as part of national and state standardized writing assessments from 2010-2020.
Each essay in the PERSUADE corpus was annotated by human raters for argumentative and discourse elements as well as hierarchical relationships between argumentative elements. In addition, each argumentative and discourse element and each essay received a score for holistic quality. The corpus was annotated using a double-blind rating process with 100 percent adjudication, such that each essay was independently reviewed by two expert raters and adjudicated by a third expert rater.
The annotation rubric was developed to identify and evaluate discourse elements commonly found in argumentative writing. The rubric went through multiple revisions based on feedback from two teacher panels as well as feedback from a research advisory board comprising experts in the fields of writing, discourse processing, NLP, and machine learning.
The ELLIPSE corpus contains over 7,000 essays written by ELLs in grades 8-12. The corpus includes demographic information on gender, race/ethnicity, and socioeconomic status. These essays were written as part of state standardized writing assessments from the 2018-19 and 2019-20 school years.
Essays in the ELLIPSE corpus were annotated by human raters for language proficiency levels using a five-point scoring rubric that comprised both holistic and analytic scales. The holistic scale focused on the overall language proficiency level exhibited in the essays, whereas the analytic scales included ratings of cohesion, syntax, phraseology, vocabulary, grammar, and conventions. Each essay was independently rated by two raters. All raters had an educational background in Applied Linguistics and were trained for scoring. Score differences of two or more points between the two raters were adjudicated through discussion by the raters.
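The adjudication rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual tooling: the function and variable names are hypothetical, the scores are invented, whole-number scores are assumed, and the sketch assumes the two-point threshold is checked separately on each scale, which the text does not specify.

```python
# The holistic scale ("overall") plus the six analytic scales named in the text.
SCALES = ("overall", "cohesion", "syntax", "phraseology",
          "vocabulary", "grammar", "conventions")

# Differences of two or more points are sent to discussion between the raters.
ADJUDICATION_GAP = 2


def scales_needing_adjudication(rater_a, rater_b, gap=ADJUDICATION_GAP):
    """Return the scales on which the two raters differ by `gap` points or more."""
    return [s for s in SCALES if abs(rater_a[s] - rater_b[s]) >= gap]


# Invented scores for one essay on the five-point rubric.
rater_a = {"overall": 3, "cohesion": 3, "syntax": 4, "phraseology": 3,
           "vocabulary": 2, "grammar": 4, "conventions": 3}
rater_b = {"overall": 3, "cohesion": 4, "syntax": 2, "phraseology": 3,
           "vocabulary": 2, "grammar": 1, "conventions": 3}

flagged = scales_needing_adjudication(rater_a, rater_b)
print(flagged)
```

With these invented scores, only syntax (a two-point gap) and grammar (a three-point gap) are flagged for discussion; the one-point gap on cohesion is not.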
The rubric was developed based on a number of state and industry assessments used to assess language proficiency in ELLs, such as the Arizona English Language Learner Assessment (AZELLA), English Language Proficiency Assessment for the 21st Century (ELPA21), LAS Links, and WIDA. These assessments were modified to fit the student population and the writing task in our ELL corpus. Experts in the fields of writing instruction and ELL education advised on the development of the rubric.