What the University of Michigan Student Data Incident Reveals about Higher Ed Data Collection and Practices

Blog Post
April 8, 2024

Our personal data has become a hot commodity—traded and sold with little regard for the individuals it represents. The education policy ecosphere was recently atwitter over a data-related incident that ostensibly involved the University of Michigan (U-M). The incident has reignited concerns about the ethical use of student data and the responsibilities of academic institutions in safeguarding privacy and data security.

Here’s how it all unfolded: An engineer at DeepMind, a leading artificial intelligence company, received a targeted advertisement that appeared to be from the University of Michigan. The ad made a simple offer: access to an extensive dataset containing student papers and audio recordings of academic lectures, all for the (relatively) low price of $25,000. The backlash was swift, leading many to ask how U-M, a prestigious public university, could sell students’ work in such a cavalier fashion.

Students have a right to know exactly how their data is being used by a university or the third-party vendors with which a university chooses to contract. The U-M incident serves as a timely reminder of why we need greater transparency and accountability around how universities manage and safeguard sensitive data, something U-M has itself prioritized in the past. Data transparency and privacy practices are necessary not just for the sake of compliance but to foster a culture of trust and integrity within academia and beyond.

Digging a bit deeper into the matter revealed a more complex story in which U-M had been wrongly vilified. In fact, the dataset in question was not a clandestine marketplace for student data but an open-access resource that has long been made available by U-M to “academics for free.” Further, the intentionally misleading ad was not posted by U-M but rather by Catalyst Research, which seems to sell scraped open-access data to other companies. U-M clarified that the dataset, spanning the years 1997 to 2007, was made up of voluntary contributions by students and purportedly stripped of identifying information. While the university emphasized that it had neither endorsed the advertisement nor worked with Catalyst Research to sell the data, the relationship between U-M and Catalyst Research remains unclear. U-M referred to the company as a “new third party vendor.”

Why Does the University of Michigan Incident Matter?

The U-M case raises questions about institutions’ responsiveness to developments in technology, particularly given how much the landscape for artificial intelligence and data analytics has evolved since the data was collected. For example, participants in 1997 would likely not have envisioned their data being available to anyone on the internet in perpetuity. Moreover, the inclusion of voice recordings raises valid concerns about privacy and the potential for re-identification, highlighting the delicate balance between data sharing and individual rights. Questions also remain about the ethical implications of making such research data available for commercial licensing in the first place.

The dataset from the University of Michigan includes lecture recordings, small group discussions, and even one-on-one office hour meetings. While students were told to “just ignore [the recording],” many of the participants in the early stages of the study would likely not have had a strong grasp of the internet’s open-access nature, let alone the way one’s digital footprint can become permanent. While the majority of comments are innocuous, some contain personal information about a student’s health history or relationship status, anecdotes that participants may not have shared had they known that information would be available on the internet at large. They also may not have agreed to participate had they known how advances in technology would make data more revealing and allow it to be used in ways that had not then been conceptualized, whether in training large language models (LLMs) or in targeted attacks.

The Wider Context of Data Collection and Higher Education

The backlash that U-M faced reflects a broader societal perception that universities may not always act as responsible data stewards. This perception is exacerbated by the lack of transparency surrounding data collection, storage, and usage within academic settings as well as the questionable choices that have been made by some institutions. High-quality data is a foundational requirement of both academic research and effective governance—we need to understand a problem before we can solve it, and understanding a problem requires data. Data usage and privacy need not be in tension but should be jointly prioritized.

The controversy around this particular incident comes at a time when higher education is grappling with a wide range of issues around data usage and privacy, including the need for more accountability around students’ post-college outcomes, the dangers of algorithmic decision-making, and the high number of data breaches within higher education.

What’s the Path Forward?

While some institutions may already adhere to best practices, establishing a baseline framework for data governance is essential for safeguarding privacy and promoting transparency.

At OTI, we believe a baseline framework should include at least the following elements:

  1. Establishing dedicated cybersecurity and privacy programs at higher education institutions
  2. Recognizing that data anonymization is not an adequate stand-alone technique and should be considered (and re-considered) through a risk management lens that minimizes the risk of re-identification (see the sketch after this list)
  3. Working together to establish standards around privacy-enhancing technologies that build on existing research and recommendations
  4. Prioritizing student privacy and data security, especially when working with third-party vendors or creating open-access datasets
  5. Ensuring that data sharing practices align not only with legal requirements but also with evolving ethical standards and student preferences
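
To make the second point above concrete, here is a minimal sketch in Python (the field names, threshold, and toy records are hypothetical, not drawn from the U-M dataset) of the kind of check a risk management lens implies: even after names and IDs are stripped, a record whose combination of quasi-identifiers, such as graduation year, major, and hometown, is shared by only a handful of records remains easy to single out—the basic idea behind k-anonymity.

```python
from collections import Counter

# Hypothetical quasi-identifiers; a real assessment would use the dataset's
# actual fields (e.g., course, speaker role, recording year).
QUASI_IDENTIFIERS = ("grad_year", "major", "hometown")

def k_anonymity_report(records, k=5):
    """Flag quasi-identifier combinations shared by fewer than k records.

    Records with a rare combination remain easy to re-identify even after
    names and other direct identifiers have been removed.
    """
    combos = Counter(
        tuple(record[field] for field in QUASI_IDENTIFIERS) for record in records
    )
    return {combo: count for combo, count in combos.items() if count < k}

# Toy example: names are gone, but the second record is still unique.
records = [
    {"grad_year": 1999, "major": "Linguistics", "hometown": "Ann Arbor"},
    {"grad_year": 1999, "major": "Astrophysics", "hometown": "Marquette"},
    {"grad_year": 1999, "major": "Linguistics", "hometown": "Ann Arbor"},
]

for combo, count in k_anonymity_report(records, k=2).items():
    print(f"Only {count} record(s) share {combo}: high re-identification risk")
```

A real risk assessment would go further, weighing the auxiliary data an attacker might hold, the sensitivity of voice recordings, and how long a release will remain available, but even this simple test illustrates why stripping direct identifiers is not enough on its own.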

Concerns about the misuse and exploitation of student data underscore the need for robust privacy protections. Just as with human subjects in research, individuals must actively and fully consent to the use of their data and should have the right to revoke consent, especially in scenarios in which technological advancements outpace ethical considerations.

The case of U-M's dataset serves as a poignant reminder of the ethical complexities inherent in data sharing practices. By embracing transparency, accountability, and ethical principles, universities can navigate the digital landscape with integrity while advancing knowledge and innovation for the greater good.
