The Center for Data Innovation spoke to James Field, co-founder and chief executive officer of LabGenius, a London-based startup using AI to develop biological proteins. Field discussed how AI tools rapidly apply evolutionary methods to bioengineering, helping scientists to overcome design challenges and build new proteins with benefits ranging from medicine to manufacturing.
Nick Wallace: LabGenius uses AI to engineer new proteins. What uses are there for the proteins you create? What’s the goal?
James Field: Effectively what we’re building is a protein engineering platform that is agnostic to the types of proteins that you want to produce. So I guess the real question you’re asking is, “what’s the use case of proteins?”
Typical high-value use cases of proteins can be segmented into different applications based on functionality. Broadly, proteins segment into enzymes or binders, like antibodies. The biggest existing class of high-value proteins is in the therapeutic space, but they’re also massively important for powering a lot of industrial reactions.
Some of the most important drugs in the world are proteins. I don’t have the statistics at hand, but therapeutics segment into small molecules, which are non-proteins, and biologics. Those biologics are either antibody therapeutics, which are often used to treat diseases like cancer and inflammation, or enzyme replacement therapies.
Proteins are also used for manufacturing. This could be for the manufacturing of therapeutics; often they’re used in small molecule manufacture. But they’re also used in the manufacture of a lot of other stuff. For example, take most of the cheese that you eat: the enzyme used to be extracted from calves’ stomachs, and now it’s made recombinantly.
You don’t often see them, but proteins have a huge role. They touch our lives in all sorts of ways, from processing textiles to laundry detergents, to medicines, to processing food, a whole bunch of stuff really.
Wallace: Why is AI useful in bioengineering?
Field: It’s a really interesting question. The way I think of it, biology is this hugely complex problem, and as humans, we’re not really equipped to fully understand the complexity of biology. If you look at how protein engineers do protein engineering today: firstly, it’s a highly artisanal process that relies heavily on humans both for experimental design and execution. And as I say, humans are cognitively incapable as a species of fully understanding the complexity of biology, so this results in a process that’s inefficient and prone to failure.
So the value of deploying AI in this space is that it enables you to more intelligently understand the challenges of biology and traverse them. Practically, that means better experimental design and execution resulting in a lower probability of failure in engineering these molecules.
The development of any new biological product has a discovery phase, where you’re trying to make the new product itself, and the second phase is manufacture. AI is useful in both, but in very different ways. In the discovery phase, which is the sort of work we do, the primary question is, “how will different protein sequences behave?” This is not something that’s very easy to predict using traditional methods. But you can empirically generate a lot of data, and that’s something that you can feed into an AI to make future predictions in a way that you couldn’t using traditional rational design-based approaches. And then on the manufacturing side, it’s more around process optimization. But we deal less with that, we’re really a kind of discovery shop.
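To make the discovery-phase idea concrete, here is a minimal sketch of what feeding empirical data into a model to predict how protein sequences behave can look like. The one-hot encoding, the ridge regressor, and the toy sequence–fitness data are illustrative assumptions for this example, not a description of LabGenius’s actual pipeline.

```python
# Minimal sketch: learn to predict a protein property from empirical
# sequence-fitness measurements (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> np.ndarray:
    """Flatten a fixed-length protein sequence into a one-hot feature vector."""
    vec = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

# Toy empirical data: variant sequences and their measured fitness scores.
sequences = ["MKTAY", "MKSAY", "MRTAY", "MKTAW"]
fitness   = [0.82,    0.65,    0.91,    0.40]

X = np.stack([one_hot(s) for s in sequences])
model = Ridge(alpha=1.0).fit(X, fitness)

# Score an unseen variant before committing to building and testing it.
candidate = "MRSAY"
print(model.predict(one_hot(candidate)[None, :]))
```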
Wallace: On your website, you describe your work in evolutionary terms, where your AI system, aptly called EVA, takes on the role of natural selection in predicting what mutations constitute improvement. But nature only selects for reproductive success, it does not know or care what humans think is good. So what do you teach your algorithms to select for?
Field: Just so I understand the question correctly, you’re asking, “how do you de-couple organismal fitness from the fitness of the thing or system that you’re interested in?”
Wallace: That’s a much better way of putting it.
Field: You would traditionally face that problem if you were testing these molecules within the context of a living system. Now the beauty of what we can do is that we can de-couple that pretty nicely. We’ll make a trillion unique physical DNA sequences that we’ll have in the lab. Then each of those DNA sequences can be converted to protein. And then each of those proteins can be individually evaluated in the real world. Because you’re doing that whole process outside of an organism, there is no coupling of the organism’s fitness to the fitness of the actual molecule that you’re interested in. So the answer is that by physically decoupling the evaluation of the protein from a cellular context, you don’t suffer from that problem.
You still have to have an objective model of fitness. It’s often the precise engineering of a molecule’s biochemical and biophysical characteristics that is challenging, and these are things that you can simultaneously measure when evaluating one or more proteins. For every different protein and every different application space, the precise requirements of the biochemical and biophysical characteristics will be different.
For example, if you wanted to deliver a protein to the gastrointestinal tract, this is an environment that has specifically evolved over millions of years to break proteins down. You have to re-engineer that molecule such that it can resist that kind of onslaught. You can build a trillion variants of a protein and simultaneously test them for resistance and retained functionality in that environment, and then you’re effectively applying a scoring function to each of those molecules.
What you end up with is a sequence of DNA with a corresponding fitness, and you can have millions of those, and then the way you apply the AI is to understand why some of those sequences perform better than others, to extract the underlying genetic design rules that determine fitness, and then to iterate on that in successive rounds of evolution.
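A minimal sketch of that iterative loop, as one way to picture it: train a surrogate model on measured sequence–fitness pairs, use it to rank proposed mutants, and carry the best ones into the next round of evolution. The mutation operator, the random-forest model, and the placeholder fitness measurement below are assumptions for illustration; the physical build-and-test step is only stubbed out.

```python
# Sketch of model-guided directed evolution (illustrative, not LabGenius's
# actual system): measure fitness, learn design rules, propose and rank
# mutants, iterate.
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    x = np.zeros(len(seq) * len(AMINO_ACIDS))
    for i, aa in enumerate(seq):
        x[i * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return x

def mutate(seq):
    """Single random point mutation."""
    pos = random.randrange(len(seq))
    return seq[:pos] + random.choice(AMINO_ACIDS) + seq[pos + 1:]

def measure_fitness(seqs):
    """Placeholder for the physical build-and-test step in the lab."""
    return [random.random() for _ in seqs]  # stand-in values only

population = [mutate("MKTAYIAK") for _ in range(8)]
for generation in range(3):
    scores = measure_fitness(population)                       # test
    model = RandomForestRegressor().fit(                       # learn
        np.stack([one_hot(s) for s in population]), scores)
    candidates = [mutate(s) for s in population for _ in range(20)]
    predicted = model.predict(np.stack([one_hot(c) for c in candidates]))
    ranked = sorted(zip(predicted, candidates), reverse=True)  # design
    population = [c for _, c in ranked[:8]]                    # next round
```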
Wallace: What data do you use to train EVA, and where does it come from?
Field: Maybe to answer that question it would be helpful if I talk you through our process. I’ll start at a very high level and then drill down.
At a very high level, what I think of is that we’re building a computer where some of the operations are conducted in the virtual world and some of them are conducted in the physical world. To talk you through that process: we’ll generate a protein design in the virtual world. Then that design will actually be built and tested in the real world, and then the data will feed back into the virtual world, where you learn from it, you extract knowledge from it, and then you generate a new design. This kind of approach is really good for problems where an accurate in silico model is unavailable, and you can generate high-throughput empirical data cheaply.
That’s how the high-level process works, and if I was to talk you through it with a little bit more granularity, we will design trillions of DNA sequences that we will physically construct ourselves in our own laboratory. Then we’ll score each of those sequences empirically based on a fitness function. Using sequencing, you pull back which sequences work well and which ones don’t work so well, and that’s what enables you to train your algorithm and build your model.
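The interview doesn’t spell out the fitness function, but one common way to “pull back which sequences work well” from sequencing data is to compare each variant’s read counts before and after a selection step. The hypothetical sketch below uses a log-enrichment score purely as an illustration of that idea; the counts and the pseudocount choice are made up.

```python
# Hypothetical sketch: score variants from sequencing read counts by
# comparing abundance before and after selection. Log-enrichment is just
# one common choice of fitness score, assumed here for illustration.
import math

pre_counts  = {"MKTAY": 1200, "MKSAY": 1150, "MRTAY": 1300, "MKTAW": 1250}
post_counts = {"MKTAY":  900, "MKSAY":  100, "MRTAY": 2400, "MKTAW":   30}

pre_total = sum(pre_counts.values())
post_total = sum(post_counts.values())

fitness = {}
for seq in pre_counts:
    pre_freq = pre_counts[seq] / pre_total
    post_freq = (post_counts.get(seq, 0) + 1) / post_total  # pseudocount avoids log(0)
    fitness[seq] = math.log2(post_freq / pre_freq)

# Higher scores mean the variant survived selection better; these
# (sequence, fitness) pairs are what a model would be trained on.
print(sorted(fitness.items(), key=lambda kv: -kv[1]))
```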
So to answer your question of where does the data come from, we generate proprietary datasets in-house, and the reason that we do that is it’s very cheap for us to generate high quality datasets. We’ll pull all of the additional insight that we can from open-access databases, but the kinds of insights that you can glean are often quite broad and non-specific. The way I think about it is you can leverage broad design rules and narrow design rules. Narrow design rules are easier to extract from very focused proprietary datasets, and broad design rules, such as the genetic design rules that underpin a lot of the fundamental processes, you can extract those from open data sets.
Wallace: Where do you think protein engineering is going to have the greatest impact over the next few years? What are the most uncrackable nuts in protein engineering?
Field: These molecules, although you don’t see them, touch all parts of our lives. And I think the way that this kind of technology plays out in these spaces is, number one, improving how those molecules work, so optimizing them so that we can gain incremental improvements across those different products. But additionally, and perhaps more valuably, enabling us to solve protein engineering problems that to date haven’t been solvable with traditional approaches. There is a whole range of different molecules that scientists haven’t been able to take to market because they haven’t been able to engineer those molecules with the right biochemical and biophysical characteristics.
So I guess the summary would be: existing proteins working better, novel proteins brought to market, and perhaps because of increased efficiency during the discovery phase, an increased number of new products being brought to market.
As for the most uncrackable nuts: if you ask any protein engineer that question they’ll give you a different answer, and it’s because there are so many difficult problems across so many different areas. In every single application space of proteins there are challenges. It could be that the molecule isn’t soluble, it could be the activity isn’t high enough, it could be the stability isn’t high enough.
Just to give you some real-world examples of what that actually means: if you look at a lot of vaccines, because they’re not thermostable, you need a cold chain to transport them. A lot of therapeutic antibodies may be prone to aggregation, which again limits their utility or shelf life. A lot of industrial enzymes’ activity may be too low, which makes the process economically unviable. In every single area of protein engineering, there are different challenges.