PIILO Dataset
Overview
Can AI be trained to detect and protect users’ personal information?
As the use of artificial intelligence (AI) in education and classroom settings grows, a core challenge persists: protecting student and user privacy.
Personally identifiable information (or PII) is often a barrier to analyzing and creating open datasets that advance education because the public release of data with PII puts students at risk. To reduce these risks, it is crucial to screen and cleanse educational data, a process that data science could streamline.
The presence of PII in educational data can be a barrier to more widespread AI adoption. By reducing or eliminating it, developers could help lower the cost of releasing educational datasets. Safe, anonymized datasets can then be used to support learning science research that helps advance the development of new educational tools.
The Learning Exchange’s new PIILO dataset is an integral step toward better protecting personal information while enabling the creation of supportive AI tools and platforms in schools. Vanderbilt University, together with The Learning Agency Lab, an independent nonprofit based in Arizona, collaborated with Kaggle on the development of this dataset and an open data science competition to train AI models on it.
The PIILO dataset comprises approximately 22,000 essays written by students enrolled in an open online course and comes from a recent Kaggle competition that concluded early in 2024. All of the essays were written in response to a single assignment prompt, which asked students to apply course material to a real-world problem. The competition’s goal was to annotate the PII found within the essays.
The PIILO dataset can better equip learning engineers to build reliable automated techniques for detecting and removing PII from educational data. These techniques could, in turn, allow researchers and industry leaders to tap into the potential of large public educational datasets that help generate more effective tools and interventions to assist teachers and students.
The PIILO dataset was used in the PII Data Detection competition on Kaggle. The resulting algorithms from the competition can be found here. Many prize winners also provide further details about their algorithms on the discussion board here.
PIILO © 2024 by The Learning Agency Lab is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/. This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
Potential Uses
- Train artificial intelligence/natural language processing algorithms for automatic detection of PII in student discourse
- Develop and test data anonymization techniques
- Evaluate other PII data detection models for accuracy
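As a rough illustration of the anonymization and evaluation uses listed above, the following Python sketch flags two kinds of PII with regular expressions, redacts them, and scores the detections against gold annotations. It is a minimal sketch under stated assumptions, not the method used in the competition: the patterns, the function names (detect_pii, redact, f_beta), and the (start, end, label) span format are all hypothetical and do not reflect the dataset's actual annotation schema.

```python
import re

# Hypothetical patterns covering only two PII types; a real detector
# (for example, a trained token classifier) needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")


def detect_pii(text):
    """Return a set of (start, end, label) character spans flagged as PII."""
    spans = {(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)}
    spans |= {(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)}
    return spans


def redact(text, spans):
    """Replace each span with a placeholder such as [EMAIL]."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def f_beta(predicted, gold, beta=1.0):
    """Exact-match span precision, recall, and F-beta against gold spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


if __name__ == "__main__":
    essay = "Contact me at jane.doe@example.com or 555-123-4567."
    found = detect_pii(essay)
    print(redact(essay, found))
    # Hypothetical gold annotations: the two spans above plus one the
    # regex baseline cannot find (a name-like span at offsets 0-7).
    gold = found | {(0, 7, "NAME")}
    print(f_beta(found, gold))
```

Setting beta above 1 weights recall more heavily than precision, which suits PII screening because a missed identifier is usually costlier than an unnecessary redaction.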