In 2007, Professor Fei-Fei Li started assembling a massive dataset of 14 million pictures, labeled with the objects that appeared in those images. This dataset, dubbed ImageNet, spurred dramatic progress over the next decade in computer vision, the field of artificial intelligence that trains computers to understand images and videos. 

Such datasets can serve as “benchmark” challenges that researchers compete on, incentivizing advances in both fundamental and domain-specific research. So what are the potential datasets that could drive a similarly transformative impact in education? How can the field benefit from data science competitions? The Learning Agency Lab aims to answer these questions by pulling together five data science competitions over the next year.


Crowdsourcing has been shown to kickstart new fields and generate novel solutions. When a problem is narrowly defined and has an algorithmic solution, data science competitions have yielded impressive results in areas ranging from home valuation to cervical cancer screening.

While these kinds of competitions are prominent in other fields, data science competitions in education remain substantially underleveraged. In the past 10 years, there have only been three education data competitions that have attracted more than 500 teams. 

This low number of competitions is likely due to several factors. One is the absence of deliberate effort to design and host competitions, as opposed to simply posting datasets. For example, CMU’s DataShop sees very little usage despite the quality of the data posted there. Another is the lack of collaboration with competition platforms like Kaggle, which know how to develop competitions that draw broad interest and have talent readily available. For example, the NSF funded an education data science competition in 2019 that was announced only at a research conference, and fewer than 70 teams competed. In comparison, over 3,400 teams competed in the recent education-focused Data Science Bowl on Kaggle.

Given the large amounts of data being produced by ed tech platforms, the Lab wants to help seed a dedicated track on a crowdsourced platform focused on soliciting, designing, and launching education data science competitions. We believe these competitions can have a transformative impact on education research. This effort will include creating an open collection of educational datasets, an “ImageNet for Education,” giving researchers access to high-quality datasets and to any algorithms the competitions produce.

Just as the ImageNet moment revealed the potential of open datasets and competitions, the field of learning engineering would benefit from a set of data science challenges that bring in fresh perspectives and multidisciplinary expertise.


When Fei-Fei Li started building a database called “ImageNet,” she wanted “to do something that was completely historically unprecedented and map out the entire world of objects.”

She knew that better algorithms would improve image recognition, but also that those algorithms would only be as good as the dataset they were trained on. What started as a published open dataset quickly morphed into an annual competition to create algorithms that identify objects with the lowest error rate. In 2017, the competition’s final year, the winning algorithm reached 97.3 percent accuracy, outperforming humans.
This moment is viewed by many as the catalyst for the AI boom: it proved that high-quality datasets can drive enormous algorithmic innovation. It also showed what is possible when datasets are open and accessible to diverse groups.

As it stands, there is no “ImageNet” for education. The Lab wants to change that by leveraging the high-quality datasets that already exist in education to solve problems facing educators today.


One area that seems ripe for innovation is predicting high school dropouts. In the United States, an estimated 1.2 million students drop out of high school each year, and about 25% of high school freshmen fail to graduate on time. A number of indicators are associated with dropping out, including race, disability status, and parental education level, to name a few.

There are longitudinal datasets that provide insights on whether or not a student is likely to drop out of school. A data science competition based on these datasets could help develop an algorithm that accurately assesses a student’s likelihood of dropping out of high school. Given the number of dropouts in the United States, such an algorithm could help educators better identify students at risk and plan interventions to keep them on track.
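To make the framing concrete: dropout prediction is typically cast as a binary classification problem over per-student indicators. The sketch below is purely illustrative, with invented features, invented data, and a bare-bones logistic regression kept to the Python standard library; a real competition entry would use richer longitudinal features and an established ML library.

```python
import math

def sigmoid(z):
    # Clamp to avoid math.exp overflow on extreme inputs
    z = max(min(z, 60.0), -60.0)
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression model with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical per-student features: [absences per month, courses failed, GPA]
X = [[1, 0, 3.6], [8, 2, 1.9], [2, 0, 3.2], [10, 3, 1.5], [0, 0, 3.9], [7, 1, 2.1]]
y = [0, 1, 0, 1, 0, 1]  # 1 = dropped out

w, b = train_logistic(X, y)
at_risk = predict(w, b, [9, 2, 1.8])   # profile resembling the dropout group
on_track = predict(w, b, [1, 0, 3.7])  # profile resembling the graduate group
print(f"estimated dropout risk: at-risk {at_risk:.2f}, on-track {on_track:.2f}")
```

The model outputs a risk probability rather than a hard yes/no, which is what an early-warning system needs: educators can rank students and prioritize interventions.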

Another possible area to explore is automated scoring of short essays. Back in 2012, the Hewlett Foundation hosted the Automated Student Assessment Prize (ASAP) to generate algorithms that could quickly and effectively grade student essays. It was a huge success and demonstrated that software has the potential to score short-response essays. However, no comparable competition has been run since then. These algorithms are still in their infancy, and there is plenty of room for improvement.

A competition based on a dataset of short student essays (under 150 words) could help develop new algorithms that effectively score these types of responses, or improve the algorithms that already exist, many of which drive the essay scoring tools on the market. Such an algorithm would be valuable for overburdened teachers who spend hours grading student responses.
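As a minimal, hypothetical sketch of what automated short-response scoring involves: given a set of teacher-graded responses, a naive baseline assigns a new response the rubric score of its most lexically similar graded neighbor. The prompt, responses, and rubric below are all invented; production systems trained on datasets like ASAP use far richer linguistic features than word overlap.

```python
def tokens(text):
    """Crude tokenizer: lowercase, whitespace-split, deduplicated."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-set overlap between two tokenized responses (0 to 1)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_score(essay, graded):
    """Assign the score of the most lexically similar graded response (1-NN)."""
    toks = tokens(essay)
    best = max(graded, key=lambda item: jaccard(toks, tokens(item[0])))
    return best[1]

# Hypothetical training set: (short response, teacher score on a 0-3 rubric)
graded = [
    ("photosynthesis converts sunlight water and carbon dioxide into glucose", 3),
    ("plants use sunlight to make food", 2),
    ("plants are green", 1),
    ("i do not know", 0),
]

score = predict_score("plants turn sunlight carbon dioxide and water into glucose", graded)
print(f"predicted score: {score}")  # matches the most complete reference answer
```

Even this crude baseline illustrates the competition framing: entrants would be judged on how closely their predicted scores agree with human graders on a held-out set.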


We are actively looking for partners who want to help develop this effort. If you have a high-quality dataset and/or an idea for a competition related to education and learning, we’d love to hear from you! We are open to datasets from different subjects and ultimately want to create an open database where educators and researchers can actively engage with the data.

Ideal datasets would contribute to potential solutions for a challenge in current education practice and could drive algorithmic innovation through a competition. For example, a dataset that describes learner engagement, learner activity, and outcome data on a digital learning platform could help identify which activities or points of engagement significantly affect outcomes.
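As a small sketch of that kind of analysis, the snippet below computes the Pearson correlation between engagement metrics and a quiz outcome. All of the records and field names are invented for illustration; a real analysis would control for confounders rather than rely on raw correlations.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical platform logs: per-learner engagement metrics and a quiz outcome.
records = [
    {"videos_watched": 12, "forum_posts": 5, "quiz_score": 88},
    {"videos_watched": 3,  "forum_posts": 0, "quiz_score": 52},
    {"videos_watched": 9,  "forum_posts": 2, "quiz_score": 75},
    {"videos_watched": 1,  "forum_posts": 1, "quiz_score": 49},
    {"videos_watched": 7,  "forum_posts": 4, "quiz_score": 70},
]

outcome = [r["quiz_score"] for r in records]
correlations = {
    feature: pearson([r[feature] for r in records], outcome)
    for feature in ("videos_watched", "forum_posts")
}
for feature, r in correlations.items():
    print(f"{feature}: r = {r:+.2f}")
```

Ranking engagement signals this way is only a first step, but it shows how an open engagement-and-outcomes dataset could immediately generate testable hypotheses for a competition.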

If you have a dataset you would be interested in sharing or ideas for a competition, feel free to reach out to Aigner Picou or fill out this contact form.