The Center for Data Innovation spoke with David Hand, a data scientist at Imperial College London. Hand discussed the opportunities and challenges of big data, and the role of data in policy and services.
Nick Wallace: Your book, The Improbability Principle, argues that highly improbable events are, in fact, commonplace. What lessons do you think there are for policymakers in this principle?
David Hand: I suppose the obvious one is that they’ve got to be prepared for improbable, unexpected, and dramatic events. Extreme floods, disasters overwhelming accident and emergency (A&E) departments, or unexpected election and referendum outcomes, for instance! Extreme flooding is a good example: climate change means we’re going to see events that might once have been thought highly improbable happening more and more often.
I think in general there’s a fundamental lesson people need to take on board. People aren’t very good at thinking about probability. If I say that the probability that something will happen is 0.9, or 9/10, and it doesn’t happen, that doesn’t mean I got it wrong. It just means I thought it was more likely that it would happen. Even events with very low probability happen—just not very often. If the weather forecaster says there’s an 80 percent chance of sunshine and then it rains, that doesn’t mean they got it wrong. The whole point about probability and chance and uncertainty is that they are not predictable, so you just work out what you think is the most likely event. And, in a sense, the fact that events with low probability happen is the driver behind the book: it’s about what makes those improbable events more likely.
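To see the scale of this effect, here is a rough back-of-the-envelope sketch (the event probability and population size are invented for illustration, not taken from the interview): even a one-in-a-million-per-day event is all but certain to happen to someone when tens of millions of people are exposed to it every day.

```python
# Sketch: why very-low-probability events are seen all the time.
# Both numbers below are assumed, illustrative figures.
p = 1e-6          # chance the event happens to one given person on one given day
n = 60_000_000    # number of people exposed to that chance each day

# Probability it happens to at least one person today, and the expected count.
p_at_least_one = 1 - (1 - p) ** n
expected_per_day = p * n

print(f"P(at least one occurrence today) = {p_at_least_one:.4f}")   # ~1.0000
print(f"Expected occurrences per day     = {expected_per_day:.0f}")  # ~60
```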
This is where the difference between a scientist and a politician comes into play. A scientist can do the calculation, and then sit on the fence and say, “I think it’s most likely that this will happen tomorrow, but it might not.” The politician, however, has to make a decision on the basis of these predictions, and has to choose action A or action B, and in doing that they won’t take into account just the probabilities, they’ll also take into account the consequences. Suppose the low-probability outcome carries a much greater cost than the high-probability one; then the politician might choose to plan for the low-probability outcome, because the consequences could be so adverse. So, I really think the politicians have a much tougher job than the academics or the scientists, because the academics can say, “look: it could be this, or it could be this,” but the politician has to act.
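A minimal expected-cost calculation makes the politician’s trade-off concrete; the action names, probabilities, and costs below are hypothetical, chosen only to illustrate the point.

```python
# Hypothetical decision sketch: weigh probabilities *and* consequences.
p_no_flood, p_flood = 0.9, 0.1   # assumed outcome probabilities

# Assumed costs (arbitrary units) of each action under each outcome.
cost = {
    "do nothing":     {"no flood": 0,  "flood": 1000},  # catastrophic if the flood comes
    "build defences": {"no flood": 50, "flood": 100},   # an upfront cost either way
}

for action, c in cost.items():
    expected = p_no_flood * c["no flood"] + p_flood * c["flood"]
    print(f"{action:14s} expected cost = {expected:.0f}")

# "Do nothing" has expected cost 100; "build defences" has expected cost 55.
# The most likely outcome is "no flood", yet the sensible decision is to
# prepare for the low-probability, high-cost outcome.
```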
Wallace: You’ve worked on bringing more data into the domain of retail credit by establishing credit ratings and predicting consumer fraud. Tell us a little more about this: what does data innovation bring to this sector, and what does it achieve for consumers?
Hand: Credit scoring in the retail sector—loans, car finance, mortgages, this sort of thing for individuals—has been around for a long time, long before the phrase “big data” ever became popular, but it certainly was big data. About twenty years ago, major credit card companies were handling a billion transactions a year. I’ve been working with these sorts of organizations to develop models that are better, more predictive: “will this person default?” “will they be interested in other financial products from the same organization?” and so on. You can’t get it 100 percent right, of course, because unexpected events intrude—the sort of things we were talking about just a moment ago—somebody loses their job or gets divorced or suffers a bereavement. But you can do pretty well.
People started developing these sorts of models in the late 1950s and 1960s, and since then they’ve been gradually improving them: gradually adding more data, gradually improving the machine learning and statistical modelling processes that these kinds of tools use. And so, the tools they’ve got now, the predictive models, are really very, very good. But it is based on very large data sets; it’s rather different from corporate credit rating, where you don’t have such massive, homogeneous data sets. The individual credit scoring stuff is pretty good, because it’s been around for a long time and you’ve got big data sets.
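As a hedged sketch of the kind of model being described, the snippet below fits a logistic regression to synthetic applicant data; the features, coefficients, and the choice of scikit-learn are illustrative assumptions rather than details from the interview.

```python
# Minimal credit-scoring sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Invented applicant features: income (in thousands) and debt-to-income ratio.
income = rng.normal(30, 10, n).clip(5, None)
debt_ratio = rng.beta(2, 5, n)

# Synthetic "truth": default is more likely with low income and high debt.
true_logit = -1.5 - 0.05 * income + 4.0 * debt_ratio
default = rng.random(n) < 1 / (1 + np.exp(-true_logit))

X = np.column_stack([income, debt_ratio])
model = LogisticRegression().fit(X, default)

# Score a new applicant: the model's estimated probability of default.
new_applicant = np.array([[25.0, 0.6]])
print("P(default) =", round(model.predict_proba(new_applicant)[0, 1], 3))
```

Industry scorecards use far more variables and far more data than this toy, but the shape of the task is the same: learn from past applicants’ outcomes, then score new ones.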
It’s good news from the consumer point of view as well. A bank won’t want to lend to somebody who is unlikely to be able to repay. So, I think anything you can do to build a model which will more accurately predict the outcome is beneficial to both sides. Using a poor model is bad news for everybody: the bank makes loans to people who are then going to suffer all sorts of problems because they can’t repay, and the bank is going to suffer because it won’t be repaid.
There are also situations where the model wasn’t very good in the past, and rejected people because it didn’t take into account all aspects of an individual. It didn’t take into account that one individual was a special case, because of such and such, and so on. It works both ways. For example, doctors are generally good credit risks. Perhaps the information being fed into the scorecard before was unable to record that a doctor has just come out of prison after serving a sentence for embezzlement. In that case, perhaps they wouldn’t be a good credit risk, but the old model was unable to take that into account. More sophisticated modern models use a much wider range of data. So it’s entirely possible that current models might not always make the same decision as previous models—although one would expect that for the bulk of applicants, the decision would be the same. It’s near the borderline where you would expect some differences: in the past, some professions were treated as one group, but over the course of time, more information allows you to discriminate more effectively between different people.
Wallace: You’ve also been part of efforts to make better use of public sector data with the UK Statistics Authority. What does the term “data-driven policy making” mean to you, and how far off do you think it is?
Hand: I think of the phrase “data-driven policymaking” as almost synonymous with “evidence-based policymaking”—I think of evidence and data as almost synonymous. Data is a particular kind of evidence. It’s telling you how the world really is and what’s really going on. We can contrast that with ideology-driven policymaking, which basically says, “never mind the facts”—we could perhaps talk about this notion of a post-truth society.
The key point I want to make is “evidence equals data.” Ideology-driven decision making might say, “I think this is the right thing to do, I don’t care if it makes things worse”. Data-driven decision making is saying, “if we do it this way, the evidence suggests it will be beneficial,” and then we can evaluate policies. That’s not at all easy in the social sciences—in medicine, it’s tough enough, evaluating different treatments for example—but it’s much tougher in the social sciences. But it can be done. And so, with data-driven policymaking, you’re looking at what the needs actually are and what actually works.
One shouldn’t expect miracles. It doesn’t mean we can always get things right the first time. But with data, we can monitor and evaluate and assess, and if it’s not really working we can explore and experiment with adapting and changing things so that we can improve them—sensitively, and in an informed way. Data-driven policymaking should mean that things get better and better.
But I think there are issues that people are not properly aware of, and need to be aware of. Look back at the traditional statistical methods: because it took a lot of effort to collect the data, people very carefully formulated the questions and worked out the best way to gather the data that would enable them to answer those questions. Increasingly nowadays, we’re moving to this big data world where people are using operational data—data exhaust (data produced as a by-product) from credit transactions, for example—which wasn’t collected with a view to answering a specific question. And such data carries risks. There is always the potential that the data sets are distorted in subtle ways that you are not aware of.
Let me give you a very simple example from the credit-scoring world. The aim of building a credit-scoring machine-learning model is to try to predict whether an applicant will default on a loan. You get a load of past data describing applicants, along with their outcomes—whether they defaulted or not. And then you build a statistical model for new applicants, which will compare them to past applicants. But the problem is that the data you’re building this model on is based on people that you gave a loan to in the past. And those people were the ones you thought would be good risks—you didn’t just give loans to everyone who applied. You had a previous scorecard or selection method to decide who to give the loan to. So the data that you’re building the model on is distorted. It’s already been through some sort of selection process. And I think these sorts of issues are pervasive, and quite a lot of the work going on in the big data and computational science world doesn’t take into account the risks that can come out of this.
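That distortion is easy to reproduce on synthetic data. In the sketch below (all numbers invented for illustration), an old acceptance rule means low-income applicants never appear in the training data, and a model trained only on the accepted applicants understates their risk.

```python
# Sketch of the selection problem described above (synthetic, illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

income = rng.normal(30, 10, n)   # invented feature, in thousands
debt_ratio = rng.beta(2, 5, n)   # invented feature

# Invented "truth": default risk jumps sharply for low incomes.
true_logit = -3.0 + 4.0 * debt_ratio + 2.5 * (income < 20)
default = rng.random(n) < 1 / (1 + np.exp(-true_logit))
X = np.column_stack([income, debt_ratio])

# The previous selection rule: only applicants with income above 25 got loans,
# so only their repayment outcomes were ever observed.
accepted = income > 25
rejected = ~accepted

model_all = LogisticRegression().fit(X, default)                      # the unobservable ideal
model_acc = LogisticRegression().fit(X[accepted], default[accepted])  # what you can actually train

print("actual default rate among previously rejected applicants:",
      round(default[rejected].mean(), 3))
print("mean risk predicted for them by the model trained on everyone:",
      round(model_all.predict_proba(X[rejected])[:, 1].mean(), 3))
print("mean risk predicted for them by the model trained on accepted only:",
      round(model_acc.predict_proba(X[rejected])[:, 1].mean(), 3))

# The accepted-only model has never seen a low-income applicant, so it
# extrapolates from the people it was trained on and understates their risk.
```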
There are data-selection issues and there are data-quality issues which people working with modern operational data sets—data exhaust data sets—need to be aware of, and I think are not fully aware of. I think there are real potential dangers there. Statisticians are intrinsically cautious about missing data. Computer scientists knew enough to coin the phrase, “garbage in, garbage out.” The data going in might not be complete garbage, but if it is subtly distorted, or biased, or inadequate in some other way, then how much can you trust the numbers it spits out?
Wallace: This seems relevant to the debate over algorithmic bias, and the fact it’s more a question of bias in the data than bias encoded into the algorithm itself. What are your thoughts on how policymakers should deal with that?
Hand: It really comes down to what you mean by bias. In the past, in credit scoring, by law you were not allowed to discriminate on the grounds of protected characteristics—including gender. Whereas in insurance, you could—women are generally safer drivers than men, and so you were allowed to charge them lower insurance premiums than men. And then a few years ago, the EU discussed all this and decided to make a change. And I expected that it would make credit more like insurance. Because the point is that insurers were calculating the best estimates of risk they could.
But what the EU did was go the other way. It made insurance more like credit, and said you can’t discriminate on the basis of gender. Does this mean that a man and a woman with identical scores should be treated in the same way? Or does it mean that if you insure 90 percent of female applicants, you also have to insure 90 percent of male applicants? Those two requirements contradict each other: you can’t have both, they’re incoherent. So, the fundamental issue here is working out exactly what you mean by “bias” and “discrimination” and so on—and this is not really an algorithmic question, or even a data question. Once you’ve worked that out, then you can build models which will adjust for these sorts of distortions in the data sets.
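A toy calculation shows why the two readings clash; the score distributions and acceptance figures below are invented for illustration. If the groups’ score distributions differ at all, one threshold for everyone gives different acceptance rates, and equal acceptance rates require different thresholds.

```python
# Toy illustration of the two readings of "non-discrimination" above.
import numpy as np

rng = np.random.default_rng(2)

# Invented risk scores: group A scores a little higher than group B on average.
scores_a = rng.normal(0.60, 0.15, 100_000)
scores_b = rng.normal(0.55, 0.15, 100_000)

# Reading 1: identical scores are treated identically (one threshold for everyone).
threshold = 0.5
print(f"One threshold: accept {(scores_a >= threshold).mean():.1%} of A, "
      f"{(scores_b >= threshold).mean():.1%} of B")

# Reading 2: equal acceptance rates (say 70 percent) need group-specific thresholds...
target = 0.70
thr_a = np.quantile(scores_a, 1 - target)
thr_b = np.quantile(scores_b, 1 - target)
print(f"Equal {target:.0%} acceptance: thresholds {thr_a:.3f} for A vs {thr_b:.3f} for B")

# ...which means two applicants with identical scores can be treated differently.
# Unless the two distributions coincide, you cannot satisfy both readings at once.
```

Which reading to adopt is exactly the definitional question Hand says has to be settled before any model is built.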
Wallace: What do you think are the most difficult barriers to data innovation in the public sector?
Hand: I was going to say public suspicion, but I don’t think that’s right—I think it’s media suspicion, and what makes a good story. It’s the lack of the bigger picture, and narrow, one-sided views.
In the UK a few years ago, we had a bit of a fiasco: they were making plans to merge data from doctors’ surgeries with hospital data to produce a giant database, which would be immensely valuable for identifying disease courses, who was likely to suffer from a disease, what you should do about it, and so on. A huge amount of benefit to the community and to individuals would have emerged from this. But there was a huge outcry, which I think was in large part drummed up by the media, and which presented one side. It gave the impression that individuals’ data would become public—which of course it wouldn’t—rather than emphasizing the fact that individuals would benefit.
This brings me back to the previous point about bias and distorted data sets. If you allow people to say, “I don’t want to be included in that database,” you’re immediately distorting the database. One of the consequences is that you can’t draw valid conclusions from that database, which has an adverse effect on the community, and on individual people, who may be suffering from particular diseases. Allowing people to say “I don’t want to play a part in this” has a bad effect on other people and, potentially, on those individuals themselves. If one person says it, other people will say it too, and some of those people may go on to develop the very condition that the first person gets.
You often hear people say it’s easy to lie with statistics. That may be true, but it’s a great deal easier to lie without statistics, without the data, and without the evidence. And any advanced technology can be used for good or bad. You only have to look at nuclear technology, or biotechnology—they provide fuel and medical advances, as well as all sorts of horrible weapons. Exactly the same applies to data technology. The technology itself is amoral—it’s how you use it that counts.