As text analysis techniques become more common in educational research, concerns about student privacy are growing. Students may purposefully or inadvertently provide personally identifiable information (PII) in written or spoken texts, which may make those texts difficult and/or illegal to share in open science domains.
Specifically, in the United States, the Family Educational Rights and Privacy Act (FERPA) protects the privacy of student education records. However, these records may be important artifacts that can help researchers better understand the learning and educational process.
Thus, the process of de-identifying student data is an important element of student text analysis, but in big data situations, de-identifying student information manually is time-consuming and expensive.
There are a number of natural language processing (NLP) tools that can help automatically de-identify PII. Much of the work on PII has focused on removing names and identifying information from medical records (14). Early methods relied on template matching approaches to find identifying information in medical records (20). While successful, this approach requires prior information about the patients in the dataset (or about the students, if the approach were extended to educational data).
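To make the limitation concrete, here is a minimal sketch of roster-based matching, one simple reading of the template matching idea. It is not the method from the cited work; the function name, the placeholder, and the sample roster are all hypothetical. The point it illustrates is that the matcher can only find names it has been given in advance.

```python
import re

def redact_known_names(text, roster, placeholder="[NAME]"):
    """Replace every roster name found in text with a placeholder.

    `roster` is a hypothetical list of known student names; without it,
    this kind of matcher has nothing to look for.
    """
    # Sort longest first so a full name wins over a first name alone.
    names = sorted((n for n in roster if n.strip()), key=len, reverse=True)
    if not names:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)

roster = ["Maria Lopez", "Maria"]  # made-up roster for illustration
essay = "My name is Maria Lopez. Like Harry Potter, I changed schools."
print(redact_known_names(essay, roster))
```

Because only roster names are matched, references such as "Harry Potter" are left alone, but any identifying detail missing from the roster is missed as well.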
Removing every named entity from a text is problematic because many of the named entities in students’ texts provide important information about students’ knowledge base and may be important predictors of educational outcomes. Removing them may influence text analyses and predictive models of student success.
For example, the de-identification program called Philter was developed at the University of California, San Francisco and was designed to remove PII from medical data to make them HIPAA compliant. Philter uses a combination of regular expressions, part-of-speech (POS) tagging, and named entity recognition (NER) to achieve very high rates of text de-identification (~95%) for medical records.
However, for student data, Philter is an unsatisfactory solution because it does not take context into consideration: it removes every entity it encounters and replaces it with the placeholder “PHI.” So, for student essay data, Philter would remove popular culture references (“Harry Potter” and “Voldemort,” for example) and other entities that students may be using to support their arguments (i.e., as evidence). A sample of a student essay run through Philter returns the following:
How to Protect Student Privacy in Education NLP?
One solution to this problem is a hybrid approach that relies on NER but keeps a human in the loop. Specifically, we have developed a new program that searches student text for potentially personally identifiable information using the NER in spaCy. (Download the new tool on GitHub here.) The program outputs the named entities for each text in rows for humans to examine. The human raters can then flag, text by text, the named entities that seem to contain PII. This approach lets raters make quick decisions about which named entities may qualify as PII without having to read through the entire text. Once a text is flagged as potentially containing PII, the raters go back to that text and manually de-identify it if necessary.
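The workflow above can be sketched roughly as follows. This is not the released tool's code, only an illustration under stated assumptions: the function names, the column names (`text_id`, `entities`, `pii_flag`), and the CSV layout are hypothetical, and the spaCy model name `en_core_web_sm` is simply a commonly used pretrained English pipeline. The entity extractor is passed in as a function so that the row-building step does not depend on any particular NER library.

```python
import csv
import io

def make_spacy_extractor(model="en_core_web_sm"):
    """Return a function mapping text to (entity_text, label) pairs.

    Assumes spaCy and the named model are installed; the import is
    deferred so the rest of the sketch works without spaCy.
    """
    import spacy
    nlp = spacy.load(model)

    def extract(text):
        return [(ent.text, ent.label_) for ent in nlp(text).ents]

    return extract

def entity_rows(texts, extract):
    """Build one row per text: its ID, its unique entities, an empty flag."""
    rows = []
    for text_id, text in texts.items():
        seen = []
        for ent_text, label in extract(text):
            item = f"{ent_text} ({label})"
            if item not in seen:  # list each entity once, in order of appearance
                seen.append(item)
        # pii_flag is left blank for a human rater to fill in.
        rows.append({"text_id": text_id, "entities": "; ".join(seen), "pii_flag": ""})
    return rows

def rows_to_csv(rows):
    """Serialize the rows as CSV text that raters can open in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text_id", "entities", "pii_flag"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Usage (requires spaCy and a downloaded model):
#   extract = make_spacy_extractor()
#   essays = {"essay_01": "Like Harry Potter, I moved to a new school."}
#   print(rows_to_csv(entity_rows(essays, extract)))
```

A rater would scan the `entities` column, mark `pii_flag` for any text whose entities look identifying, and open only the flagged texts for manual de-identification.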
The algorithms that underlie NER programs are “greedy” and will extract all information related to entities.
-Scott Crossley, Professor of Applied Linguistics and English as a Second Language at Georgia State University.