AIDE Dataset

Overview

Can AI be trained to detect plagiarism?

The influx of artificial intelligence and large language models (LLMs) into the classroom is a source of both excitement and concern among educators. As LLMs like ChatGPT become increasingly sophisticated, they can generate text that is difficult to distinguish from human writing. But what if AI could help detect and resolve the very problem it poses? A new dataset can now help learning scientists train AI to automatically detect AI-generated essays and other content. Now available here from the Learning Exchange, the AI Detection for Essays (AIDE) dataset is intended to foster open research and greater transparency around real-world AI detection techniques.

The AIDE dataset is derived from a competition that concluded in early 2024 and challenged participants to develop a machine learning model that accurately detects whether an essay was written by a student or an LLM. Such models can ease the time-consuming and often impossible task teachers face in determining which essays were written by students and which were generated or augmented by AI. Automated AI detection tools are still in the early stages of development and their reliability is unclear, but this new dataset can help learning engineers support educators in new and more precise ways.

The AIDE dataset comprises a mix of 10,000 student-written essays and essays generated by a variety of LLMs. All of the essays were written in response to one of seven essay prompts.

Vanderbilt University, together with The Learning Agency Lab, an independent nonprofit based in Arizona, collaborated with Kaggle on the competition.

The goal of AIDE is to prevent plagiarism and help protect the learning gains and skills development of middle and high school students who might otherwise rely on LLMs to do their writing – and learning – for them.

AIDE © 2024 by The Learning Agency Lab is licensed under CC BY 4.0. This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

The AIDE dataset was used in the LLM-Detect AI Generated Text competition on Kaggle. The resulting algorithms from the competition can be found here. Many prize winners also provide further details on their algorithms on the discussion board here.

Potential Uses