More Data, Fewer Problems?

Chenxi Wang, Ph.D.

Feb. 8, 2017

The work of lifting countries out of poverty is never easy. But Eric Rozier thought he could make it far more effective.

Rozier is a computer science professor at Iowa State with a penchant for data science who saw a big problem at the World Bank. Every year the Bank issues development loans across the world. Some are earmarked for projects that help countries or regions improve infrastructure and basic living standards, such as access to clean water and medicine. But each project is complex and requires many international suppliers to provide the goods and services that make it a reality. Unsurprisingly, the supplier selection process is competitive. In an ideal world, the contract would go to the company that offers the best combination of price and quality, and the development project would then get underway.

But all too often, that’s not how it works. The Bank sees a non-trivial amount of fraud in the supplier bid and selection process. From setting up fake companies to collusion and price fixing, fraud in this process can lead to the embezzlement or siphoning of funds. That could mean no road, no well, and no access to medicine for the intended country or region.

In 2014, Rozier started a research project with Data Science for Social Good (DSSG) to “fight data with data.” His project focused on helping the Bank identify fraud scenarios by building an automated reasoning system that studies data patterns and identifies what he calls “data integrity attacks,” which correlate highly with potential fraud.

Rozier sees this as an attacker-vs-defender problem. The attackers (the fraudsters) aim to inject bad data into the system or subvert its decision process. The defender’s task is to spot the bad data and the patterns of subversion.

An example of a data integrity attack, Rozier says, is collusion: multiple entities work together to artificially inflate their bids on a loan so that a moderately high bid looks competitive. In some cases, a single fraudster submits nearly all of the bids, sometimes over 90% of them, via fake companies. In others, a supplier colludes with other bidders to submit artificially inflated bids, which then dictate what counts as a “reasonably priced” bid.
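The “one fraudster behind most of the bids” pattern can be illustrated with a minimal sketch. It assumes each bid record carries some attribute that fake companies might share, such as a contact phone number or a registered address; the field name `contact`, the sample records, and the 90% cutoff applied here are illustrative assumptions, not details of Rozier’s system.

```python
from collections import Counter

def dominant_source_share(bids, key="contact"):
    """Return (value, share) for the most common value of `key` among the bids.

    `key` is a hypothetical field that shell companies might have in common.
    """
    if not bids:
        return None, 0.0
    counts = Counter(bid[key] for bid in bids)
    value, count = counts.most_common(1)[0]
    return value, count / len(bids)

def is_suspicious(bids, key="contact", threshold=0.9):
    """Flag a tender when one shared attribute accounts for most of its bids."""
    _, share = dominant_source_share(bids, key)
    return share >= threshold

# Made-up example: nine of ten bids list the same contact number.
sample_bids = [{"bidder": f"Company {i}", "contact": "555-0100"} for i in range(9)]
sample_bids.append({"bidder": "Independent Co", "contact": "555-0199"})

source, share = dominant_source_share(sample_bids)
flagged = is_suspicious(sample_bids)
```

A real system would need more than one shared attribute and a way to handle legitimate overlaps, such as subsidiaries of the same parent company.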

Another commonly seen fraud tactic involves the World Bank’s “debarred” list, which names entities prohibited from submitting bids because of past violations or dubious practices. To get around the list, a debarred company may obfuscate its identity by taking on a new name that resembles a well-recognized, legitimate organization (think “PricewaterhoseCoopers” to resemble “PricewaterhouseCoopers”) or by changing its own name slightly to confuse automated identification algorithms (think “Amce inc” instead of “Acme Inc”).
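To see why small name changes can slip past exact-match checks, and how approximate matching can catch them, here is a minimal sketch using Python’s standard `difflib` module. This is plain character-level similarity, not the semantic and syntactic clustering Rozier’s project uses; the function names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def similarity(a: str, b: str) -> float:
    """Character-level similarity between two normalized names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_lookalikes(bidder_name, reference_names, threshold=0.85):
    """Return (reference_name, score) pairs at or above the threshold, best first.

    `reference_names` could be a debarred list or a list of well-known firms.
    """
    hits = []
    for reference in reference_names:
        score = similarity(bidder_name, reference)
        if score >= threshold:
            hits.append((reference, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Names taken from the examples above, plus one made-up unrelated name.
references = ["Acme Inc", "PricewaterhouseCoopers", "Global Water Works"]
acme_hits = find_lookalikes("Amce inc", references)
pwc_hits = find_lookalikes("PricewaterhoseCoopers", references)
```

Both misspellings from the examples above clear the 0.85 threshold against their originals, while an exact string comparison would treat them as unrelated names. The threshold is a trade-off: lower values catch more disguises but also flag more legitimate companies with similar names.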

And here, we get back to “fighting data with data.” Rozier saw that the defender, in this case the World Bank, has fundamentally more data than a typical fraudster. The defender can use this advantage to fight data fraud. For instance, the World Bank has the entire history of all bids submitted, both winning and losing ones, while a fraudster may have only a partial view of that history. In one case, Rozier’s study identified a number of fake bids that stood out because they followed a model of what bids looked like 10 years ago. “The bidding process has changed significantly in the last 5 years,” Rozier said. “These bids had the wrong parameters and patterns, and it was clear that they were manufactured bids.”
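The idea of catching bids that match an outdated pattern can be sketched with a simple outlier test. The example below assumes a single numeric feature per bid (a hypothetical unit price) and two reference samples, one from recent bids and one from older bids. It uses a robust z-score based on the median and the median absolute deviation. The feature, the numbers, and the 3.5 cutoff are all illustrative assumptions; the article does not describe the actual model at this level of detail.

```python
from statistics import median

def robust_z(value, reference):
    """Robust z-score of `value` against `reference`, using median and MAD."""
    med = median(reference)
    mad = median(abs(x - med) for x in reference)
    if mad == 0:
        return 0.0 if value == med else float("inf")
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return 0.6745 * (value - med) / mad

def looks_outdated(value, recent, old, cutoff=3.5):
    """True if `value` is an outlier among recent bids but typical of old ones."""
    unusual_now = abs(robust_z(value, recent)) > cutoff
    typical_then = abs(robust_z(value, old)) <= cutoff
    return unusual_now and typical_then

# Made-up unit prices for the same item in two eras.
recent_bids = [100, 104, 98, 101, 103, 99, 102]
old_bids = [60, 62, 58, 61, 59, 63, 57]

stale = looks_outdated(61, recent_bids, old_bids)    # resembles the old era
normal = looks_outdated(100, recent_bids, old_bids)  # fits current bids
```

The sketch only works because the defender holds both samples. A fraudster who has seen only old bids has no way to know that the pattern has shifted, which is the data advantage described above.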

Rozier had discovered the key to fighting these attacks: “You need a superior data model than your adversaries.”

Understanding how to preserve data integrity, and to build a superior data model to fight data fraud, is becoming ever more crucial. We rely on the integrity of financial records, election results, news, and medical records, to name just a handful of sources. If these data and the algorithms that process them become compromised, the very foundation of our society may be at risk. “The question is: how vulnerable is your system to data manipulation, gaming, and other forms of data integrity attacks, and what will you do about it?” Rozier asks.

It’s a question that’s just beginning to be answered. Data science for fraud and security is a relatively young field. Rozier’s work using semantic and syntactic clustering to resolve name conflicts, when tested on a World Bank data set, greatly improved upon previous results obtained with open-source tools like OpenRefine.

In addition, recent advances in deep learning applied in conjunction with data science, such as those behind the victory of Google’s AI over the world’s best Go player, are “incredibly encouraging,” Rozier said. “The same deep learning techniques can be applied to fraud detection: We can label the data, learn something about the governing dynamics, refine the model, iterate, and eventually build sound analysis.”

Rozier is deep in the second phase of the project, in which he aims to apply deep learning principles to tackle data integrity. He believes that data science can help to effectively eliminate supplier fraud. “The hope is more projects can go forward unimpeded, and we’ll see more clean water, better infrastructure, and improved healthcare for more regions sooner rather than later.”
