Working with data is inherently complex. It requires an extremely high level of attention to detail, even for what seems like a simple analysis, so any credible data analyst has at least one horror story of the embarrassment and anxiety that come with missing the mark. Unfortunately, when you work for a government agency, the stakes are considerably higher. All the more reason to have someone check your work. After high-profile errors involving data from the Department of Education, the Department must change its secretive data practices and release its source code.
Recently, Department of Education (ED) officials admitted that College Scorecard, their ambitious attempt to show student outcomes at every college that receives federal student aid, contained a coding error that inflated measures of loan repayment. When the data was first released in 2015, many praised it for promoting transparency, but it also drew its fair share of criticism from conservatives and from college and university representatives, who argued that the data was incomplete and therefore misleading. The recently announced coding error made repayment rates for the average school appear nearly 20 percentage points higher than they really were and distorted the relationship between repayment rates and student income, which makes this argument far more substantive.
It’s not hard to imagine these and other critics using ED’s mistake in future debates around postsecondary data: if we can’t put out data that is accurate and reliable, why should we have it at all? And while the Department deserves credit for issuing its correction, the timing of the announcement and the lack of transparency around the exact error leave a lot to be desired. As a consequence, data advocates will find it hard to counter this line of attack in the future without real steps towards transparency and openness.
Perhaps more troublingly, the error chips away at any remaining notion that the government is a trustworthy source of reliable information. In 2015, just 19% of Americans said they believed the government did the right thing most of the time. With the recent election of a president who actively plays into that sentiment, even seemingly inconsequential errors in data reporting could contribute to a troubling trend. The Wall Street Journal’s scathing accusations that the Department may have manipulated the repayment data for political gain are likely to gain traction despite no evidence of ill intent, especially in light of the ongoing controversy over inaccuracies in estimating student loan costs.
ED deserves credit for an unprecedented effort to improve the information available on the costs and outcomes of different colleges, yet the ambitious nature of its task meant that the risk of error was substantially higher. The agency has announced plans to improve transparency by opening access to student aid data to qualified researchers with a restricted-use license. That is a step in the right direction, but it likely does not go far enough.
The clearest step the Department can take to prevent mistakes, and the criticisms that go along with them, is to make its source code publicly available to any interested party. Such resources are generally not useful to the average consumer of higher education. But for researchers looking to thoroughly understand how a particular data element was constructed, the original code is by far the best resource available, and it is far more precise than any translation found in written documentation. Repayment rates, which are both complex and newly available to the public through the Scorecard, generated significant interest at the time of their release. For these reasons, the research community would almost certainly have identified the error much more quickly than the Department was able to. Additionally, while privacy concerns need to be addressed when releasing actual data, they very rarely arise when it comes to code. In fact, some state education agencies have already released their code, and open source standards are common in other industries such as software development.
Reliable data is critical to creating policy that works, but errors such as these feed the worst instincts of many to assume that incompetence and deceitful practices are at work. The truth of the matter is that everyone makes (data) mistakes. Making source code freely available to the public can help identify errors that slip through internal quality checks, while also sending a strong signal about the integrity of the operation.