AIDE Dataset
Overview
Can AI be trained to detect plagiarism?
The influx of artificial intelligence and large language models (LLMs) into the classroom is a source of both excitement and concern among educators. As LLMs like ChatGPT become increasingly sophisticated, they can generate text that is difficult to distinguish from human-written text. But what if AI could help detect and resolve the very problem it poses? A new dataset can now help learning scientists train AI to automatically detect and expose AI-generated essays and content. Now available from the Learning Exchange, the AI Detection for Essays (AIDE) dataset is intended to foster open research and greater transparency around real-world AI detection techniques.
The AIDE dataset comprises a mix of 10,000 student-written essays and essays generated by a variety of LLMs. All of the essays were written in response to one of seven essay prompts.
Vanderbilt University, together with The Learning Agency Lab, an independent nonprofit based in Arizona, collaborated with Kaggle on the competition.
The goal of AIDE is to prevent plagiarism and help protect the learning gains and skills development of middle and high school students who might otherwise rely on LLMs to do their writing – and learning – for them.
Potential Uses
- Train artificial intelligence/natural language processing algorithms to automatically distinguish student-written essays from AI-generated essays
- Evaluate the performance of other AI models trained to detect AI-generated content
- Analyze the strengths and weaknesses of student writing relative to AI-generated writing
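
As an illustration of the second use above, the following is a minimal sketch of how a detector's output might be scored against known labels. It assumes, hypothetically, that each essay carries a label (1 for AI-generated, 0 for student-written) and that the detector emits a score where higher means "more likely AI-generated". The choice of ROC AUC as the metric, the function name `roc_auc`, and the toy data are assumptions made for illustration only; they are not taken from the dataset or its documentation.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen AI-generated essay
    (label 1) gets a higher score than a randomly chosen student-written
    essay (label 0). Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one essay of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six essays with made-up labels and detector scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.3f}")
```

A value of 0.5 corresponds to a detector that ranks essays no better than chance, and 1.0 to one that ranks every AI-generated essay above every student-written one. The pairwise loop is quadratic in the number of essays, so for a corpus of this size a library implementation would likely be the more practical choice.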