Whether it is feedback or the measurement of emotions, natural language processing (NLP) will radically change education.
This column looks at some of the top applications. It gives an overview of available open-source NLP tools, their reported reliability, and their potential shortcomings.
Given the power of feedback, we focused on ways that researchers, designers, and educators can use NLP to analyze written text and give feedback to students. But we also flagged tools that provide students with graphic organizers.
The tools and applications identified here are all open source and thus can be accessed by anyone at any time.
How To Use NLP Tools
- Target writing constructs. To develop software that uses these tools, a developer should determine which writing constructs the software is intended to focus on and become familiar with the open-source tools that target those constructs. Most, if not all, of the tools listed below have guides on how to install them and how to interpret their output. Use these guides as a starting point to identify the specific indices and key metrics each tool produces. Each tool is unique in its reporting and scaling, so it is important to review the guides before running your corpus through the various interfaces.
- Download selected open source tools. All of these tools are either web-based or require a local download onto your computer. The column provides the specific download information under the description of each tool.
- Understand the output. The output from these tools can be used for more than predictive validity. It can also provide formative feedback to the writer by identifying weak spots in their writing style or structure. The grammar, spelling, and structure tools are examples where the NLP tool can identify a mistake and send it back to the writer along with an example of what “good” looks like (a rough sketch of this feedback loop follows this list). A caveat is that what counts as “good” writing style, structure, and semantics may vary as a function of domain and audience. So it is important to understand the corpus you are using and how human raters (who tend to be the gold standard) have defined “good” in that domain as it relates to style, syntax, and semantics.
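As a rough illustration of that feedback loop, the sketch below maps tool-detected issues to student-facing messages paired with an example of what “good” looks like. The issue types, message text, and input format are hypothetical rather than drawn from any particular tool.

```python
# A rough sketch of turning NLP-tool output into formative feedback.
# The issue types, messages, and input format below are hypothetical --
# each real tool has its own output format and error taxonomy.

FEEDBACK_BANK = {
    "missing_transition": (
        "Consider adding a transition to connect this idea to the previous one.",
        'Example: "However, the second trial produced different results."',
    ),
    "sentence_fragment": (
        "This looks like an incomplete sentence.",
        'Example: "The experiment failed because the sample was too small."',
    ),
    "spelling": (
        "This word may be misspelled.",
        "Check the suggested correction before accepting it.",
    ),
}

def feedback_for(issues):
    """Map detected issues (dicts with 'type' and 'excerpt') to messages
    that pair the problem with an example of what 'good' looks like."""
    messages = []
    for issue in issues:
        hint, exemplar = FEEDBACK_BANK.get(issue["type"], ("Review this passage.", ""))
        messages.append(f"Near '{issue['excerpt']}': {hint} {exemplar}".strip())
    return messages

# Issues as a hypothetical tool might report them.
detected = [
    {"type": "missing_transition", "excerpt": "The results were mixed."},
    {"type": "spelling", "excerpt": "recieve"},
]
for line in feedback_for(detected):
    print(line)
```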
For research, NLP applications can provide helpful insight into the writing process. These tools are most effective when a specific research question has been identified beforehand, as the vast amount of data they generate can at times be overwhelming. To use these tools effectively, consider your research question, your domain of interest, and your predictive-validity evidence; a brief sketch of gathering that evidence appears below.
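The sketch below shows one common way to gather predictive-validity evidence: correlating a single tool-reported index with human holistic scores. The file name and column names are placeholders; substitute the actual output of whichever tool you run on your corpus.

```python
# A minimal sketch of gathering predictive-validity evidence: correlate one
# tool-reported index with human holistic scores for the same texts.
# The file name and column names are placeholders, not any tool's actual output.
import pandas as pd
from scipy.stats import pearsonr

scores = pd.read_csv("tool_output_with_human_scores.csv")  # hypothetical file

r, p = pearsonr(scores["cohesion_index"], scores["human_holistic_score"])
print(f"Pearson r = {r:.2f} (p = {p:.3f}), n = {len(scores)}")

# Rough benchmarks used in this column: r of .0-.3 small, .3-.5 medium, .5-1 large.
```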
Key NLP Terms
Accuracy: The ratio of correct predictions or categorizations over all observations, that is, the percentage of total items classified correctly. Ranges from 0 to 1 (or 0 to 100%). A score of 1 (100%) means every classification was correct, whereas a score of 0.3 means only 30% were correct. The higher the accuracy, the better the tool is performing.
Precision: The ratio of correct positive predictions over all positive predictions the tool makes. Ranges from 0 to 1; higher is better. Precision can be thought of as a measure of exactness: it tells us how often the tool's labels are right and how often it produces false positives (items placed in a category they do not belong to).
Recall (Sensitivity): The ratio of correct positive predictions over all relevant observations. That is, of all the items that should have been labeled (i.e., were relevant), how many did the tool actually label? Ranges from 0 to 1; higher is better. Recall can be thought of as a measure of completeness: high recall indicates a low number of false negatives.
F1: A single score that represents both precision and recall; it is their harmonic mean, which is generally more informative than accuracy alone. Ranges from 0 to 1; higher is better. A high F1 score means the tool has both high precision and high recall. When possible, use this metric to evaluate tools (a short worked example of these classification metrics follows this list).
Pearson r: The strength of association between two variables. Ranges from -1 to 1. For both positive and negative values, .0 to .3 is a small association, .3 to .5 a medium association, and .5 to 1 a large association. This metric is often used when comparing human raters to automated tools. You may also see R^2, which is simply the squared value of r and reflects the amount of variance accounted for by the model.
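To make the classification metrics above concrete, here is a small worked example with invented labels for a binary task (for Pearson r, see the correlation sketch earlier in this column).

```python
# Worked example of accuracy, precision, recall, and F1 for a binary task
# (e.g., "does this sentence contain a transition word?"). Labels are invented.
human = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]   # gold labels from human raters
tool  = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]   # labels predicted by an NLP tool

tp = sum(1 for h, t in zip(human, tool) if h == 1 and t == 1)  # true positives
fp = sum(1 for h, t in zip(human, tool) if h == 0 and t == 1)  # false positives
fn = sum(1 for h, t in zip(human, tool) if h == 1 and t == 0)  # false negatives
tn = sum(1 for h, t in zip(human, tool) if h == 0 and t == 0)  # true negatives

accuracy  = (tp + tn) / len(human)                          # all correct / all items
precision = tp / (tp + fp)                                  # exactness
recall    = tp / (tp + fn)                                  # completeness
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```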
Top NLP Tool By Focus Area
The following section highlights tools that identify writing constructs, or focus areas.
Conventions
Operationalization: Identifies issues of grammar, spelling, and punctuation
Recommended Tool: After The Deadline
Who uses this tool: After The Deadline is an open-source tool designed to provide grammar, punctuation, spelling, and structure feedback on written text. It is widely used by researchers, teachers, and writers, often during the writing process or to help with grading.
Reliability: For real-word error correction, reported accuracy is 89.4% and recall is 27.1%. The accuracy is at an acceptable/reliable level, though the low recall means many real-word errors go undetected.
Caveats: Reliability numbers for grammar checking are not readily available.
Interface: Download the AtD extension, add-on, or plugin, then follow the step-by-step guide on how to input text for analysis.
Operationalization: Use of appropriate transition words and structures
Recommended Tool: Coh-Metrix – Coh-Metrix is an online NLP tool that analyzes texts on many levels of language and discourse, such as word concreteness, syntax, cohesion, and narrativity. Specifically, use the “CNCALL” index (a small post-processing sketch follows this section).
Who uses this tool: Coh-Metrix is used by a variety of researchers at universities and educational technology companies. It is a well-respected tool that can calculate hundreds of indices for essay and writing evaluation in research and in schools, and it is often incorporated into other tools to help grade or evaluate written discourse.
Reliability: Previous work has shown that CNCALL reached acceptable levels of accuracy, with precision of .83 and recall of .83.
Caveats: Coh-Metrix has over 100 NLP indices, which can make interpretation overwhelming. Please refer to the index guide, which documents the range of measures the tool captures.
Interface: Easy-to-use interface with step-by-step instructions.
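For readers who run a batch of texts through Coh-Metrix, the sketch below shows one way to post-process the exported results around the “CNCALL” index mentioned above. The file name, identifier column, and the meaning of CNCALL should all be checked against your actual export and the official index guide; they are assumptions here.

```python
# A post-processing sketch for Coh-Metrix transition results. It assumes you
# have exported indices for a batch of texts to a CSV with one row per text,
# an identifier column ("TextID"), and a "CNCALL" column; check the official
# index guide for what CNCALL measures and adjust names to your actual export.
import pandas as pd

results = pd.read_csv("cohmetrix_output.csv")  # hypothetical export

# Flag texts in the bottom quartile on the CNCALL index as candidates for
# feedback on transitions; review them alongside human judgments before acting.
threshold = results["CNCALL"].quantile(0.25)
flagged = results[results["CNCALL"] < threshold]
print(flagged[["TextID", "CNCALL"]])
```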
Operationalization: Maintains the argument and flow of ideas through transition terms, sentence structure, and pronoun use; shows the strength of the argument. Cohesion generally refers to the presence or absence of explicit cues in the text that allow the reader to make connections between the ideas in the text.
Recommended Tool: Coh-Metrix – Coh-Metrix is a web-based tool that analyzes texts on many levels of language and discourse, such as word concreteness, syntax, cohesion, and narrativity. There are many cohesion indices within Coh-Metrix. One set in particular is the referential cohesion indices: a text with high referential cohesion contains words, arguments, and ideas that overlap across sentences and the entire text, forming explicit threads that connect the text for the reader. Low-cohesion text is typically more difficult to process because there are fewer connections that tie the ideas together for the reader (a simplified overlap sketch follows this section).
Who uses this tool: Coh-Metrix is used by a variety of researchers at universities and educational technology companies. It is a well-respected tool that can calculate hundreds of indices for essay and writing evaluation in research and in schools, and it is often incorporated into other tools to help grade or evaluate written discourse.
Reliability: Previous work has shown that Coh-Metrix cohesion indices can classify high- versus low-cohesion texts at a rate of 76.3%.
Caveats: Coh-Metrix has over 100 NLP indices, which can make interpretation overwhelming. Please refer to the index guide, which documents the range of measures the tool captures.
Interface: Easy-to-use interface with step-by-step instructions.
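To make the idea of referential cohesion concrete, the sketch below computes a much simpler stand-in: the proportion of adjacent sentence pairs that share a content word. It only illustrates the overlap idea; it is not a reimplementation of Coh-Metrix's referential cohesion indices, and the stopword list is arbitrary.

```python
# A simplified stand-in for referential cohesion: the share of adjacent
# sentence pairs that have at least one content word in common. This only
# illustrates the overlap idea; it is not a reimplementation of Coh-Metrix.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is", "was", "it"}

def content_words(sentence):
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}

def adjacent_overlap(text):
    """Proportion of adjacent sentence pairs sharing at least one content word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if content_words(a) & content_words(b))
    return shared / len(pairs)

high = "The cell divides. The division of the cell produces two daughter cells."
low = "The cell divides. Paris is a large city."
print(adjacent_overlap(high), adjacent_overlap(low))  # 1.0 vs 0.0
```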
Operationalization: Maintains the argument and flow of ideas through transition terms, sentence structure, and pronoun use. Coherence, as compared to cohesion, refers to the understanding that the reader derives from the text (i.e., the coherence of the text in the mind of the reader). This includes organization, the structural elements of writing that promote overall text comprehension (e.g., thesis, topic sentences, conclusion).
Recommended Tool: TAACO – A freely available text analysis tool that incorporates over 150 classic and recently developed indices related to text cohesion.
Who uses this tool: Researchers use this tool to help grade and analyze written discourse. It was developed at the University of Hawaii.
Reliability: Previous work has shown that TAACO indices related to coherence had a medium to high correlation with human judgments of coherence (r = .47).
Caveats: This tool has over 150 indices. Just like Coh-Metrix, please review the index guide and evaluate what you are trying to measure (a sketch of screening these indices against human ratings follows this section).
Interface: Easy-to-use interface; download the user guide for step-by-step instructions.
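The sketch below illustrates the screening step mentioned above: merging TAACO's exported indices with human coherence ratings and ranking the indices by their correlation with those ratings. The file names, column names, and shared key are assumptions to adapt to your own data.

```python
# A sketch of screening TAACO's many indices against human coherence ratings.
# The file names, column names, and shared key are assumptions to adapt;
# TAACO writes its indices to a spreadsheet-style file that you would merge
# with your own human ratings.
import pandas as pd

indices = pd.read_csv("taaco_output.csv")       # hypothetical TAACO export
ratings = pd.read_csv("human_coherence.csv")    # hypothetical human scores
merged = indices.merge(ratings, on="Filename")  # assumed shared key column

numeric = merged.select_dtypes("number").drop(columns=["human_coherence"])
correlations = numeric.corrwith(merged["human_coherence"]).sort_values(ascending=False)

# With 150+ indices, some correlations will look strong by chance alone,
# so treat this as exploratory screening rather than confirmatory evidence.
print(correlations.head(10))
```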
Operationalization: Variety of evidence, trusted sources, explanation of how evidence supports the argument, and use of counterargument.
Recommended Tool: ArgumenText
Who uses this tool: This tool was developed at the University of Darmstadt and is used by researchers, academics, and industry; the website claims it is used by start-ups, industry, and politicians.
Reliability: Previous work has reported an F1 score of 0.53 for identifying arguments in text.
Caveats: This tool has lower accuracy than others on this list and is still in development by researchers. Researcher permission is needed for access, so it may not be easily accessible to all.
Operationalization: Sense of energy, appropriate audience, command of literary devices, sophistication of words, and commonly misused words.
Recommended Tool: The Coh-Metrix Common Core Text Ease and Readability Assessor (T.E.R.A.) is a web-based tool that analyzes texts on many levels of language and discourse, such as word concreteness, syntax, cohesion, and narrativity. Specifically, it provides measures of text “easability” and readability, profiling texts on five dimensions: narrativity, syntactic simplicity, word concreteness, referential cohesion, and deep cohesion. Though each of these components is of interest for language sophistication, syntactic simplicity and narrativity are particularly relevant.
Who uses this tool: This tool is used by researchers and educators. It builds on Coh-Metrix, a well-respected tool with hundreds of indices used for essay and writing evaluation in research and in schools, and it is often incorporated into other tools to help grade or evaluate written discourse.
Reliability: When classifying texts (science, history, narrative), recall and precision ranged from .47 to .92, with an average F-measure of .68.
Caveats: This tool has only five indices, so interpretation is less overwhelming than with other tools on this list. However, it is still important to review and understand what each one is meant to examine.
Interface: Easy-to-use interface with step-by-step instructions.
Plagiarism Checker
Operationalization: Overlap between a student’s work and other publicly available (or previously submitted) work.
Recommended Tool: Sherlock – Sherlock can be used on intra-corpal collections of source code or plain text. The tool supports most procedural and object-oriented languages, with specific optimisations for the Java programming language (a generic overlap sketch follows this section).
Who uses this tool: It is integrated into the BOSS online submission system.
Reliability: This tool has not yet reported reliability or accuracy metrics, although the developers report “comparable results to other intra-corpal plagiarism checkers.”
Caveats: This tool has no reported accuracy or reliability metrics. It is also under active development and is still being researched and iterated upon.
Interface: User guides and tools on the website provide step-by-step instructions.
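To show the general idea behind overlap-based plagiarism checking, here is a generic word n-gram similarity sketch. This is not Sherlock's algorithm; it is only meant to illustrate the kind of comparison such tools build on, and the sample submissions are invented.

```python
# This is not Sherlock's algorithm -- just a generic illustration of the kind
# of n-gram overlap that intra-corpal plagiarism checkers build on.
def ngrams(text, n=5):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b, n=5):
    """Jaccard similarity of word n-grams between two submissions."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Toy submissions; in practice you would read each student's file instead.
submission_1 = "the quick brown fox jumps over the lazy dog near the river bank"
submission_2 = "a quick brown fox jumps over the lazy dog by the river"
print(f"5-gram overlap: {overlap(submission_1, submission_2):.2%}")
```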
Operationalization: A way to visualize and organize information.
Recommended Tool: Essay Maker – An open source, online graphic organizer for essay writing. After completing the organization template, the essay can be exported as text.
Who uses this tool: Students, writers, teachers, and academics.
Reliability: This tool has not yet reported reliability or accuracy metrics.
Caveats: This tool has no reported accuracy or reliability metrics. It is also hosted on GitHub, which may make it harder to access and use for those not familiar with Git and its workflow.
Summarize
Operationalization: Ability to summarize main ideas and takeaways, or to self-explain the content.
Recommended Tool: Recall-Oriented Understudy for Gisting Evaluation (ROUGE). ROUGE compares overlapping units (such as words and word sequences) between a target summary and an ideal reference summary, which is usually human-written. The tool has primarily been used to evaluate computer-generated summaries, but it can be applied to any summary (a simplified calculation follows this section).
Who uses this tool: This tool is widely used and researched, with a large body of publications on the accuracy of its metrics. It is open source, has been referenced by Microsoft as a tool for NLP, and appears to have adoption by both academics and industry.
Reliability: Correlations between human ratings and ROUGE scores generally fall between 0.77 and 0.99 for single-document and short summaries, but between 0.40 and 0.78 for multi-document summaries. This is a wide range and may indicate that reliability varies as a function of the text being summarized.
Caveats: This tool has a wide reliability range. It is also hosted on GitHub, which may make it harder to access and use for those not familiar with Git and its workflow. For more, see the program’s how-to guide.
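To make the “overlapping units” idea concrete, here is a simplified ROUGE-1 (unigram) calculation. The official ROUGE package supports many more variants (ROUGE-2, ROUGE-L, stemming, stopword removal), so treat this as an illustration rather than a substitute; the sample summaries are invented.

```python
# A simplified ROUGE-1 calculation to make the "overlapping units" idea
# concrete. The official ROUGE package supports many variants (ROUGE-2,
# ROUGE-L, stemming, stopword removal); this covers unigram overlap only.
from collections import Counter

def rouge_1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)   # clipped unigram matches
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

reference = "the study found that sleep improves memory in students"
candidate = "sleep improves memory according to the study"
print("ROUGE-1 (recall, precision, F1):", rouge_1(candidate, reference))
```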
-Ulrich Boser