The Center for Data Innovation spoke to Ben Pellegrini, co-founder and chief executive officer of Intellegens, a British startup that uses AI to find correlations in datasets with a lot of missing data, for a variety of purposes, such as improving how researchers predict the effects of experimental medicines. Pellegrini discussed how sparse data can still be big data, and how Intellegens’ technology grew out of research into experimental alloys.
This interview has been edited for clarity.
Nick Wallace: One of Intellegens’ use cases is drug discovery. Most readers will probably grasp why missing data might be a problem in drug testing. But how can AI help to alleviate the problem, and what effect does that have on the development of new medicines?
Ben Pellegrini: In drug discovery, Intellegens has been working with large, sparse data matrices of compound or drug-protein interactions, where we're looking at, say, 2 million drugs against 6,000 proteins. Typically, that dataset is very incomplete, down to 0.05 percent complete. That means there is only a recorded experiment, or known data, on the activations between certain drugs and certain proteins. When a drug company is looking for a new protein to activate, they'll only test the drug against a handful of proteins.
Typically, AI is not very good at learning from poor or incomplete data. That is where Intellegens’ technology comes in. Our unique architecture can take in all of those values to learn any correlations that may exist. When you’re talking about a matrix of 2 million by 6,000, that’s 12 billion values that need to be calculated. If it’s only 0.05 percent complete, which sounds very small, you’re still looking at 6 million values within that matrix that can give you information about the missing data.
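To put those figures in context, here is a back-of-the-envelope sketch using the drug and protein counts quoted above; the use of a SciPy sparse matrix and the sample entries are purely illustrative, not a description of Intellegens' implementation.

```python
from scipy import sparse

# Figures quoted in the interview.
n_drugs, n_proteins = 2_000_000, 6_000
total_cells = n_drugs * n_proteins            # 12,000,000,000 possible drug-protein pairs
completeness = 0.0005                         # 0.05 percent of pairs actually measured
known_values = int(total_cells * completeness)
print(f"{total_cells:,} cells, of which {known_values:,} are measured")
# -> 12,000,000,000 cells, of which 6,000,000 are measured

# Storing only the measured activations as (drug index, protein index, value)
# triples keeps the data tractable: roughly 6 million entries instead of
# 12 billion dense cells. The three entries below are hypothetical.
rows = [0, 0, 3]
cols = [1, 5, 2]
vals = [0.8, 0.1, 0.4]
activations = sparse.coo_matrix((vals, (rows, cols)), shape=(n_drugs, n_proteins))
print(activations.nnz, "known activations stored")
```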
With each one of our predictions, we also give a measure of uncertainty. This allows drug discovery companies to use the data that we generate to feed into their existing models and existing workflows. It widens the starting point of where they might want to look for new drugs. Using AI, we can predict which compounds and proteins are likely to have an effect, with a given level of uncertainty. The companies can then either rule them out, or concentrate their efforts on clusters of data where AI has given a high prediction that there could be something interesting happening there.
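As a rough illustration of how a prediction paired with an uncertainty might feed into that triage, the sketch below applies a simple, hypothetical scoring rule to randomly generated values; it is not Intellegens' actual model or workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imputed drug-protein activations and their uncertainties
# (one standard deviation per prediction), for 1,000 candidate pairs.
predicted = rng.uniform(0.0, 1.0, size=1000)
uncertainty = rng.uniform(0.05, 0.5, size=1000)

# One simple triage rule: follow up pairs whose predicted activation stays
# high even after discounting by the uncertainty, and rule out pairs that
# look inactive even in the optimistic case.
promising = np.flatnonzero(predicted - uncertainty > 0.7)
ruled_out = np.flatnonzero(predicted + uncertainty < 0.2)
print(len(promising), "pairs to follow up,", len(ruled_out), "pairs ruled out")
```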
Wallace: Intellegens’ technology also has uses in designing new materials. What are the benefits of using AI for this purpose?
Pellegrini: This technology was born out of materials research. Dr. Gareth Conduit, a co-founder of Intellegens, was trying to design a new superalloy to be used in jet engines. He looked at the historical data available and decided to try to use AI and neural networks to predict the best composition for a new superalloy. When he looked at the historical data, he realized it was quite sparse: there is a lot of data out there on millions of different materials, but for each material researchers may have measured only certain properties, such as density or conductance, out of perhaps a hundred properties a material can have. The compositional elements and the treatment processes are also significant in the development of these alloys. When you put all of those pieces of data together, you realize you are looking at quite a sparse matrix again. When he tried to apply off-the-shelf or existing neural network architectures, he realized that they couldn't handle that level of incomplete data. So he developed this way of using incomplete data to discover all of the correlations. From there, the task is to predict the ideal composition and treatment processes for a new alloy to achieve certain targets: certain strengths, certain weights, and certain costs. He optimized the material for 14 targets.
Wallace: You mentioned that you provide confidence levels for the correlations. How does Intellegens determine what level of confidence to have in correlations, and just how sparse can data be to still yield reliable conclusions?
Pellegrini: I think the sparser it gets, the more powerful our technology becomes. We have worked with datasets that were down to 0.05 percent complete at the start. Really, you just need two values in each row of the dataset, and the larger the dataset gets, the more data points you'll get. In that 0.05 percent case, we still had 6 million values to work with. Identifying the patterns obviously gets easier when you've got even more data, even if it's sparse. But the way that's calculated is part of Gareth's secret sauce.
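A minimal illustration of that "two values per row" floor, assuming the data arrives as a dense array with NaN marking missing entries; the array size, sparsity level, and threshold here are chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sparse dataset: 8 rows by 6 columns, most entries missing (NaN).
data = np.full((8, 6), np.nan)
mask = rng.random((8, 6)) < 0.25          # roughly a quarter of entries observed
data[mask] = rng.normal(size=mask.sum())

# Keep only rows with at least two observed values: a row with a single
# value gives the model nothing to correlate that value against.
observed_per_row = np.sum(~np.isnan(data), axis=1)
usable = data[observed_per_row >= 2]
print(f"{usable.shape[0]} of {data.shape[0]} rows have two or more observed values")
```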
Wallace: Are there any other domains where you can see this technology having an impact in the future?
Pellegrini: When the materials work began proving successful, Gareth realized that the technique is generic and can be applied to anything. The first side-step was into drug discovery. That’s where he did his first contract, and it was around that time that Intellegens was formed. But since then, as a start-up, we’re always looking for opportunities where sparse data exists.
Part of our problem is explaining to people how their data is “sparse,” which requires us to go into a new sector, understand the language and the problems they’re facing, and then look at their data slightly differently, to explain what data is missing. Everyone is talking about big data now because people are measuring a lot of data. But if people could, they’d measure a whole lot more. What we’re trying to do is to help them not need to make all those measurements.
We’re talking to lots of different people, and we’ve got proof-of-concept projects in patient analytics, where we’re looking at historical patient profiles, which are sparse because everyone’s different, and trying to suggest or optimize potential treatments to maximize outcomes. We’re also looking at predictive maintenance for infrastructure, where you’ve got assets within, say, the road network. You have data on the asset you want to maintain, but you might also want local weather data, local geology data, local traffic data, all these extra data points that may or may not be available. The result is an incomplete dataset, yet all of those elements have an impact on the outcomes.
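The sketch below shows how joining those extra sources can produce exactly that kind of incomplete table; the column names, values, and the use of pandas are hypothetical and only meant to illustrate the shape of the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical road-network assets with condition measurements.
assets = pd.DataFrame({
    "asset_id": [1, 2, 3, 4],
    "condition": [0.9, 0.6, np.nan, 0.4],   # one asset never inspected
})

# Local context that is only available for some assets.
weather = pd.DataFrame({"asset_id": [1, 2], "annual_rainfall_mm": [800, 1200]})
traffic = pd.DataFrame({"asset_id": [2, 4], "daily_vehicles": [15000, 4000]})

# Left-joining everything keeps every asset but leaves NaN wherever a source
# has no reading, producing an incomplete matrix like the one described above.
combined = (
    assets.merge(weather, on="asset_id", how="left")
          .merge(traffic, on="asset_id", how="left")
)
print(combined)
```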
We’ve got a couple of other minor projects: one looking at oil prices, for example, and one looking at the design of experiments. Because our tool puts out an uncertainty for each of our predictions, it can almost guide experimentalists to see where a model is most uncertain, and where the next experiment would add the most value.
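In spirit, that guidance can be as simple as ranking unmeasured points by the model's own uncertainty and proposing the most uncertain ones as the next experiments, in the active-learning sense of uncertainty sampling. The sketch below is a hypothetical illustration of that idea, not Intellegens' actual procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical uncertainties for 500 candidate experiments that have not
# been run yet.
candidates = np.arange(500)
uncertainty = rng.uniform(0.0, 1.0, size=500)

# Propose the experiments where the model is least sure of itself: measuring
# those points should teach the model the most.
next_experiments = candidates[np.argsort(uncertainty)[::-1][:5]]
print("Suggested next experiments:", next_experiments)
```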
Wallace: If you’re relying on AI to find correlations where otherwise none would be found, is explainability ever a challenge? How do you ensure the algorithm’s conclusions are falsifiable?
Pellegrini: It’s a challenge, and that could be a huge barrier in patient analytics. Is AI a medical device? If so, do you need to validate the entire process of how the decision was made?
In the materials world and the drug discovery world, at the moment, when we get a new customer, we typically ask them to hold back some of their data and then verify the results we give them against it. We also run several cross-validation techniques as part of our process, so that we can be confident in the results we provide.
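A minimal sketch of that hold-back check, assuming the known entries of a sparse property matrix arrive as (row, column, value) triples; the per-column mean used here is only a stand-in baseline for whatever model produces the imputed values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical known entries of a sparse property matrix: (row, col, value).
rows = rng.integers(0, 100, size=2000)
cols = rng.integers(0, 20, size=2000)
vals = rng.normal(size=2000)

# Hold back roughly 20 percent of the known values; the model never sees them.
holdout = rng.random(2000) < 0.2
train_cols, train_vals = cols[~holdout], vals[~holdout]

# Stand-in "model": predict each held-out entry as the mean of the training
# values in the same column.
predictions = np.array([train_vals[train_cols == c].mean() for c in cols[holdout]])

# Compare the predictions against the held-back measurements.
rmse = np.sqrt(np.mean((predictions - vals[holdout]) ** 2))
print(f"Held-out RMSE: {rmse:.3f}")
```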
Giving that detailed level of explanation of how the results were achieved, and how that level of confidence was arrived at, is not something we currently do, but I do see it as a potential issue in the future. That said, the technology was developed through research, and the outcome of that research was a new superalloy, which has been experimentally verified and patented.