PIILO Dataset

Overview

Can AI be trained to detect and protect users’ personal information?

As the use of artificial intelligence (AI) in education and classroom settings grows, a core challenge persists: protecting student and user privacy.

Personally identifiable information (or PII) is often a barrier to analyzing and creating open datasets that advance education because the public release of data with PII puts students at risk. To reduce these risks, it is crucial to screen and cleanse educational data, a process that data science could streamline.

The presence of PII in educational data can be a barrier to more widespread AI adoption. By reducing or eliminating it, developers could help lower the cost of releasing educational datasets. Safe, anonymized datasets can then be used to support learning science research that helps advance the development of new educational tools.

The Learning Exchange’s new PIILO dataset is an integral step toward better protecting personal information while enabling the creation of supportive AI tools and platforms in schools. Vanderbilt University, together with The Learning Agency Lab, an independent nonprofit based in Arizona, collaborated with Kaggle on the development of this dataset and an open data science competition to train AI models on it.

The PIILO dataset comprises approximately 22,000 essays written by students enrolled in an open online course and comes from a recent Kaggle competition that concluded early in 2024. All of the essays were written in response to a single assignment prompt, which asked students to apply course material to a real-world problem. The competition’s goal was to annotate the PII found within the essays.

The PIILO dataset can better equip learning engineers with the tools needed to develop reliable automated PII-detection techniques for educational data. These techniques could, in turn, allow researchers and industry leaders to tap into the potential of large public educational datasets that help generate more effective tools and interventions to assist teachers and students.

The PIILO dataset was used in the PII Data Detection competition on Kaggle. The algorithms resulting from the competition can be found here. Many prize winners also provide further details about their algorithms on the discussion board here.

 

PIILO © 2024 by The Learning Agency Lab is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/. This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

Potential Uses