By Viraj Kamdar
There’s been much ado about ChatGPT and its impact on academic writing. If this thing can write essays all on its own, will students ever write again? Will teachers be able to tell the difference between a student’s writing and the computer’s? The short answer is yes. If you’ve ever taught a class where a student tried to plagiarize something, you know it almost immediately. It’s not easy to pass off an adult’s writing as that of a high school or middle school student. In college and graduate school, the question gets harder. However, many students these days submit their work through tools like Turnitin, which uses similar AI (natural language processing and machine learning) to detect more traditional forms of plagiarism before teachers and professors read a student’s work. In addition, tools like ECREE can use samples of your own writing not only to track your progress across multiple dimensions and assignments, but to tell if someone or something else has picked up the keyboard for you. And almost every week, I hear about a new tool being used to safeguard classrooms against the use of ChatGPT. That said, the key here is to recognize that writing is digital now, and there are both prizes and pitfalls in that undiscovered country.
AI and Academic Writing: A Brief History
Four and a half years ago, the Bill & Melinda Gates Foundation launched an Advanced Research and Development (R&D) Portfolio of Investments focused on solving the problem of argumentative writing in America. This portfolio was launched in response to the National Assessment of Educational Progress, which highlights that only 8% of Black students and 11% of Latino students are proficient in argumentative writing. This suggests, to me, that some of the most disenfranchised populations in our country haven’t been prepared to effectively advocate for themselves, their families, or their communities. There are a host of factors that got us here, but the area most ripe for new Advanced R&D investments in ed tech is assessment tools that can be used for practice and feedback.
If the average teacher in New York City has 35 students in their class and teaches an average of five periods a day, that’s 175 students. If that teacher asks their students to write a two-page paper, then they are reading, grading, and tracking errors on 350 pages of student work for a single assignment. It takes at least a week to turn this around with feedback. Oftentimes, that work is then hung up on bulletin boards as an example of what students shouldn’t do next time, with no opportunity to meaningfully apply the feedback. Consequently, metacognition breaks down, problems become entrenched, and progress in writing gets stymied. This can go on for years in a student’s life.
Large-scale assessment players faced similar challenges with the volume of essays that needed to be scored each year on the SAT, GRE, LSAT, and other exams. Nearly a decade ago, a group of testing and philanthropic organizations realized that assessment makers were spending heavily on the human capital needed to grade essays. To improve efficiency, the sector issued a challenge called ASAP 1.0 (the Automated Student Assessment Prize), funded in part by the Hewlett Foundation. This competition galvanized the AI community in machine learning and natural language processing, which created algorithmic solutions that could grade students’ argumentative writing as accurately as the best human raters. Largely, these algorithms relied on complex pattern recognition or rules-based systems that could identify the use of claims, evidence, and transitional phrases. More importantly, these tools offered real-time, actionable feedback to students while they were writing. In addition, teachers saw massive time savings and were able to quickly target areas for intervention.
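As a rough illustration of what a rules-based scorer of this era might look like (this is a hypothetical sketch, not any vendor’s actual system, and the phrase lists and thresholds are invented for the example), such systems counted surface cues like transitional phrases and evidence markers and mapped those counts to a holistic score:

```python
import re

# Hypothetical, simplified cue lists; real scoring engines used far richer
# feature sets, but the basic idea was counting surface patterns like these.
TRANSITIONS = ["however", "therefore", "for example", "in addition", "consequently"]
EVIDENCE_MARKERS = ["according to", "the author states", "the text says", "studies show"]
CLAIM_MARKERS = ["i believe", "this shows that", "this suggests that", "clearly"]

def extract_features(essay: str) -> dict:
    """Count occurrences of each cue category in a student essay."""
    text = essay.lower()
    def count_any(phrases):
        return sum(len(re.findall(re.escape(p), text)) for p in phrases)
    return {
        "transitions": count_any(TRANSITIONS),
        "evidence": count_any(EVIDENCE_MARKERS),
        "claims": count_any(CLAIM_MARKERS),
        "length": len(text.split()),
    }

def rough_score(features: dict) -> int:
    """Map feature counts to a 1-4 holistic score with hand-set thresholds."""
    points = features["transitions"] + 2 * features["evidence"] + features["claims"]
    if features["length"] < 50:
        return 1  # too short to earn credit, regardless of cues
    return min(4, 1 + points // 2)
```

The appeal for assessment makers was that a system like this runs in milliseconds per essay and applies the same rubric every time; the limitation, of course, is that it rewards the presence of cue phrases rather than the quality of the argument.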
Over the next decade, private industry, from large-scale testing organizations (like Educational Testing Service and the College Board) to curriculum and supplemental tool developers (like Khan Academy, Newsela, and Turnitin), was doing its own R&D around automated essay scoring tools that could not only reduce overhead on grading summative assessments, but also provide formative feedback to students in the moment of writing. At this point, the Gates Foundation came in, offering a host of investments around research, development, early evidence capture (around usability), and ultimately validation studies (around student outcomes) with many of these tools. While the studies were largely delayed by COVID fallout in schools, a host of solutions and findings emerged. This writing portfolio at the Gates Foundation can help us understand the next steps for the field, as well as provide potential ideas for future prize competitions and funding directions.
Prize Competitions: Recommendations and Next Steps
The portfolio of investments at the Gates Foundation was largely centered on proving the efficacy of Automated Essay Scoring (AES) tools, with a few notable exceptions. After initial usability testing and early evidence capture, Mathematica recommended randomized controlled trials (RCTs) to understand the impact on student outcomes. However, teachers in the study suggested that they were more interested in tools that could help students in lower-stakes, regular writing opportunities. At the National Council of Teachers of English (NCTE) 2019 convention, teachers offered their feedback on AI tools in the English Language Arts (ELA) classroom. Nearly two-thirds of survey respondents stated that they provided more short-form writing opportunities than long-form writing opportunities. Participants also used scaffolded strategies in short-form writing, such as short reflections, to build students up to long-form writing. These kinds of short-form writing assignments also exist in curricula and make up most of a student’s writing experience.
Therefore, future competitions should focus on developing algorithms that power tools that detect errors in short responses and help students employ question-specific strategies to find the answer. In addition, teachers should receive real-time reports on student progress and challenge areas so that they can launch small-group interventions.
The opportunity to invest in education technology R&D is often constrained by capital and the current state of the art in AI. Back in 2018, natural language processing and machine learning had made real strides in formative assessment, and tools were capable of rating full essays with accuracy equivalent to the best human graders. There was both potential and interest in conducting a study on the efficacy of the existing technology to improve student outcomes in argumentative writing. That said, a meta-analysis of writing research (Graham, 2007) suggests that you need to go further back in a student’s writing process and develop pre-writing strategies. Helping students with comprehension, evidence identification, organization, and synthesis will produce the most dramatic outcomes. Currently, students using curricula do most of their writing in short responses that scaffold comprehension, evidence identification, organization, and synthesis. To create a moonshot in education, we need to focus future competitions on short-response assessment and feedback. Creating tools that offer AI feedback on short responses could happen directly with the support of large-scale assessment players, curriculum providers, and supplemental tool developers. The following are three recommendations for where AI can have the most outsized impact on student writing:
Build Comprehension Tools: Quill Reading for Evidence
Quill Reading for Evidence is an open-source digital tool that builds knowledge by having students write sentences rather than answer multiple-choice questions. Multiple choice is the primary question type used in educational technology, but these questions limit opportunities for improving comprehension and retention. Quill’s technology is a substitute for multiple choice: it asks students to analyze texts and demonstrate comprehension through short responses to open-ended prompts. Once a student has responded, the tool coaches that student to extend and enhance their short response so that it is accurate, specific, and uses evidence from the associated text. Quill conducted a study at a school serving low-income students in New York City, where students demonstrated 0.20 standard deviations of growth. This represents significant learning gains in the tool as well as improvement on external measures, including “great” growth. These kinds of results are almost unheard of in the education technology marketplace.
The key outcomes of Quill’s study indicate that students learned how to identify and use precise evidence from a source text. In addition, students demonstrated their comprehension of a text by using key evidence from the text in their own writing. In the initial sessions, students used weak evidence more frequently; with repeated feedback from Quill, they learned how to demonstrate strong evidence in writing. This work represents the development of a new, scalable question type that should be implemented not only in curricula, but also in large-scale assessments and their associated preparatory courses. To move the entire field forward, prize competitions should create low-cost engines for local and large-scale assessment. These assessments should be designed to capture data that informs implementation, intervention strategy identification, and professional learning in emerging digital curricula, so that practice and feedback opportunities yield stronger outcomes.
Build Outline/Synthesis Tools: CommonLit 360’s Pre-Writing Workflow
CommonLit 360 is a free, comprehensive English Language Arts curriculum with engaging units aligned to grade-level skills in reading, writing, discussion, and vocabulary. It features flexible lessons with pacing options to support all educators. In 2019, the Gates Foundation funded the development of a pre-writing workflow that would help students gain greater comprehension and organize their thinking before they began writing argumentative essays.
In addition, the Foundation generously supported a second study of CommonLit 360 during the 2021 academic year across five under-resourced school districts. This time, with the use of the digital workflow, Mathematica measured CommonLit 360’s effect on reading, writing, and social-emotional learning. The study isn’t published yet, but so far we know that students in the CommonLit 360 group saw average gains of 0.30 standard deviations in reading and 0.26 standard deviations in writing. For those of you who might not be as familiar with educational effect sizes, these numbers are really exciting. The Institute of Education Sciences (IES) says that programs delivering a +0.25 effect size have “national policy importance.” Raising student achievement by two-tenths of a standard deviation results in a 2 percent increase in annual lifetime earnings. And 0.23 standard deviations is the expected reading gain for one full year of learning in grades 6–8 (Bloom et al., 2008).
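For readers new to this metric, the arithmetic behind a standardized effect size is simple: it is the difference between the treatment and control group means divided by the pooled standard deviation. The numbers below are invented purely to illustrate the calculation, not the study’s actual raw scores:

```python
import math

def cohens_d(mean_treat, mean_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (((n_treat - 1) * sd_treat**2 + (n_ctrl - 1) * sd_ctrl**2)
                  / (n_treat + n_ctrl - 2))
    return (mean_treat - mean_ctrl) / math.sqrt(pooled_var)

# Hypothetical scale scores, chosen only to illustrate the calculation:
# treatment mean 215, control mean 212, both SDs 10, 100 students per group.
d = cohens_d(215, 212, 10, 10, 100, 100)
# d = 3 / 10 = 0.30
```

Expressing gains in standard-deviation units is what lets a reading result from one assessment be compared against benchmarks like the IES +0.25 threshold, regardless of the test’s raw score scale.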
While these outcomes are profound, the framework is not unique. There are other such tools currently on the market, like ThinkCERCA or, more recently, Outline (recently funded by NewSchools Venture Fund). While all these tools scaffold student writing and professional learning about writing instruction (for teachers), they do not attend to the overhead on teacher time for reviewing student work. To achieve the study results, students had to engage in at least 13 assignments over the course of the year. To create a real moonshot in education, AI like Quill’s could be applied to grade student work and save teacher time on the tasks teachers attend to most. Practice and feedback tools like Quill’s coach students in developing more thorough responses that leverage evidence. In addition, real-time feedback on areas of challenge (as in automated essay scoring) could help foster small-group work and teacher-mediated remediation during the outlining process.
Digital Coach/Chatbots: National Writing Project
OpenAI helped us start this conversation. I think we should return to it here with a final, ambitious recommendation. The engineers at OpenAI have made a splash with ChatGPT. However, Stanford researchers recently fine-tuned Meta’s LLaMA to create Alpaca 7B, a model that performs at a similar level to ChatGPT on many tasks. The big takeaway here is that open-source tools (like LLaMA) are catching up to closed-source tools (like OpenAI’s) with relatively little cost or time. This is important for the educational landscape because low-cost, open-source tools will be necessary to scale innovation in education technology. While OpenAI chatbots fit neatly into systems across business cases, deep knowledge of the art and science of each industry’s vertical will be required to build truly effective tools that surface the most relevant, aligned, and effective responses for that industry.
To this end, the National Writing Project (NWP) has done extensive work in writing across genres with a focus on testing and improving national outcomes. It developed the College, Career, and Community Writers Program (C3WP), which is designed to improve students’ argument writing through intensive teacher professional development, instructional resources, and formative assessment. In 2016, based on evidence of C3WP’s prior success in improving student achievement, NWP received a federal Investing in Innovation (i3) scale-up grant, which was used to test C3WP in new contexts. As part of this grant, SRI International (SRI) conducted a one-year random-assignment evaluation of C3WP in grades 7–9 that found consistent program implementation and positive, statistically significant impacts on student writing achievement. Finally, a third evaluation of C3WP in secondary grades found positive and statistically significant effects on student achievement. The size, scale, rigor, and independence of these three studies provide a strong evidence base for C3WP’s effectiveness in improving students’ secondary writing achievement at scale and in diverse contexts. In this use case, the instructional foundation has been laid, and human-centered design work could help map the coaching that teachers do with students to inform future chatbot interactions.
Furthermore, NWP has been thinking with teachers about the use of AI since 2019 with its We Write Now Teacher’s Studio. Elyse Eidman-Aadahl, Executive Director of NWP, has said that “there is definitely an interest in creating a charter of a ‘writing coach’ to provide process guidance at the moment of need.” NWP and organizations like 826 National would be ideal in spearheading the development of an AI Chatbot that could coach students of all ages through their writing in a variety of genres.
I have always held that the figure of the innovator is that of Prometheus. In the myth, the Titan Prometheus looked down from Mount Olympus and said to Zeus, ‘Look, the people are cold and hungry. Let us give them fire so that they may warm their homes and feed their families.’ To which Zeus replied, ‘No, they will use it to make war.’ It turns out they were both right. As I said earlier, there are potential prizes and pitfalls to ChatGPT and AI, more generally, in education. One major concern in the field is that if technology begins to read our papers en masse, will we have eviscerated the primary purpose of writing: to engage in a discursive process with an authentic audience? To this end, we need to think first about principles in writing, as well as the coaching of writing, to guide development teams in their pursuit of a true moonshot in literacy.
Human interaction in the writing process would ideally come throughout, but particularly during ideation, planning, and revision, with formative AI feedback in between. One starting point might be the work Greater Good Studio did on the ethical and equitable use of AI in the writing classroom. For example, in their first finding, about allowing students to write their own way, they note that students should be offered flexible drafting and composition spaces so that they can comfortably move through their unique processes and ask questions of their peers or teachers. In these spaces, AI could prompt reflection, suggest edits and additions, or call out strengths built over time. AI could also be used to assess areas of challenge in order to link peers to mentors.
Finally, these recommendations, taken together, could act as the vanguard for a digital portfolio in writing assessment. If the algorithms and tools were centralized and open-sourced to curriculum and testing organizations, the field could build an open-source corpus of student writing (with privacy protections) that could fuel innovation for years to come. If states were to get on board with these new vehicles for assessment, they would have continuous, formative data on areas of challenge, around which policy and advocacy efforts could be launched to address skills gaps or needs for support among the rank and file in schools. And if instruction and professional development organizations could run continuous feedback loops on challenges and opportunities in assessment feedback, as well as on teacher baseline knowledge and skills, we could better prepare teachers to use real-time formative feedback to drive instruction. This instrumentation of AI assessment tools in curricula should be aligned with state benchmarking for summative assessments and professional development. AI assessment tools could act as a means of fulfilling the promise of No Child Left Behind by offering truly data-driven direction to instruction.
Viraj Kamdar is a former senior program officer at the Bill & Melinda Gates Foundation, and a former director with the New York City Department of Education.