This partial transcript has minor edits to improve readability. The complete podcast is available on Spotify, Apple Podcasts, Stitcher, and much more.
Hello and welcome to Forward. Our guest today is Casper Wilstrup, the CEO and founder of Abzu.
Abzu was born in 2018 in Barcelona and Copenhagen, when seven co-founders set out to challenge the paradigm of traditional black-box AI. Today, Abzu has customers among the top 10 pharmaceutical companies, biotech enterprises, and academic institutions.
Casper is an AI researcher with 25 years of experience building high performance systems for data processing, analysis, and AI. At Abzu, Casper applies his passion for advanced data analysis and adaptive systems to achieve his dream of revolutionizing artificial intelligence.
In this episode, we discuss Abzu’s ability to address use cases generally considered unsuitable for modern AI, where data is scarce or causality must be established. We also talk about their work with life science and healthcare organizations, how to teach the machine system 1 and system 2 thinking, and Abzu’s path to general artificial intelligence.
Casper, welcome to the show.
Thank you, Artem, very nice to be here.
The AI system you are building at Abzu is unlike any other system we hear about these days, like GPT-3, or PaLM, or Stable Diffusion.
What do you see as the main shortcomings of these modern machine learning systems that led you to start working on Abzu?
Shortcomings is maybe too strong of a word. I think most of the technologies out there are all based on deep learning, the concept of deep learning, and neural networks, and have some unique properties that are very promising and useful in certain contexts.
But it also has limitations or situations where it’s not the most appropriate way of doing machine learning. Or maybe I should even start using the term machine reasoning. And that has to do with what it was designed for.
Neural networks are designed to be as predictive as possible. Given a set of data, a neural network can find arbitrarily complex interactions in that data and very accurately predict what will happen, or properties of that data set, based on the data itself. And for that, it’s very powerful. It has a few disadvantages, one being that neural networks are very data hungry. But the other, and maybe the main disadvantage that motivated me to do what we’re doing here at Abzu, is that deep learning models are just not designed to be particularly explainable. They are designed with prediction in mind. And they’re good at that if you have sufficient data, but they’re not good at giving reasons for why they came up with the predictions they came up with.
How do you see the reasons for that, in terms of the technical implementation of those systems? Is it an intrinsic reason that cannot be overcome? Or is it just us not being yet able to develop that quality?
It’s fundamentally an intrinsic reason that cannot be overcome. But you can go a long way with different types of machine interpretation approaches, what some people call explainable AI, to try to tease out the underlying structures in the behavior of a deep learning network. So you can mitigate the problem by applying a lot of different algorithms – the machine learning listeners in the audience will have heard about SHAP and LIME and approaches like that – that can actually go some way in explaining the behavior of an otherwise black-box machine learning model like a neural network.
But at the end of the day, you are explaining a behavior that is learned and which is arbitrarily complex. So you’re never getting to the bottom of what the machine learning system actually does, if it’s based on deep learning.
So again, the answer is yes and no. There’s something intrinsic in the way that deep learning is designed that will make it forever a little opaque how it arrives at the predictions or decisions that it arrives at, but we can do various things to get closer to understanding it. And that’s valid in its own right. But I think there is another way, at least for the kinds of situations where you don’t really need a deep learning network: you can go more directly for the explanation in the data and bypass the entire problem.
And I guess some listeners will have heard about Daniel Kahneman and his book Thinking, Fast and Slow, which is a very popular book. And there’s a lot of interesting stuff in that book. But one of the things in it, a metaphor carried throughout the book, is the idea of system 1 and system 2 being two modes of reasoning in the human brain.
And system 1 is fast, intuitive decisions. It’s like when I see a picture of a bus I immediately know it’s a bus. It doesn’t really take any conscious processing for me to realize that here I’m looking at a bus. And that’s what a neural network is like: It’s really good at coming up with decisions based on large input data sets, like in this case, a very complicated picture and identifying a bus in that picture.
But that’s a system 1 style analysis that happens in the human brain. But we also have another mode of operating, a mode where we switch into thinking mode. Actually, most neuroscientists think that this resides in our neocortex, where we can really think about how things fit together from a logical or causal point of view. So if I say, for instance, “How many red cars are there in California?” Then, quite likely, you will activate your system 2 in your brain, and you’ll start to think, “How many people are in California? And how many of those have cars? And what fraction of those cars are red?” And then you’ll arrive at some answer.
But the interesting thing is how different that is from how you recognize a bus in a picture. One is a fast, intuitive, data-driven, almost inborn ability that the human brain has. And the other is a learned thing that we can practice, we can become better at reasoning, and it requires energy. And it certainly also takes willpower to reason through it. And Daniel Kahneman refers to these two ways of thinking as system 1 and system 2.
If you think about that in the context of artificial intelligence, it’s very clear that deep learning is a system 1 approach. It does exactly that. And it does it very well, just like the human visual cortex does exactly that very well for object recognition. So it shouldn’t be a surprise that deep learning also turns out to be the best technology for quickly identifying objects in pictures, or for finding patterns in unstructured audio data, for instance, and sequences of data, where it’s like a learned, intuitive, quick decision that takes place.
But it will forever be difficult to really explain how a system 1 arrives at its decision. So that’s why a system 2 style approach is also useful in artificial intelligence. And I guess that’s one way of phrasing what we set out to do at Abzu: to build an AI that is purely and clearly a system 2 style reasoning approach.
This is a very good framing of what you’re working on. And, interestingly, Kahneman also talks about how – in that example of the cars in California – if you use system 1 to try to estimate the number of cars, you will get the wrong answer. But you might still be sure that the answer is good, because system 1 can hallucinate just like our neural networks can and give you a completely wrong answer. And still, you will be convinced it’s a good one.
And he actually warns about not using or not relying on system 1, as much as we do in our daily lives and in our decision making.
I think for anyone who wants to understand neural networks’ capabilities today, using Kahneman’s book as a guide is a good idea.
Yes. And actually, Kahneman even takes a step further in his analogy, and I’m not sure if he thinks about it this way himself. But a very clear guiding thought throughout the book is that you can activate system 2 on demand. It’s a limited resource, but you can actually activate it, and it can then go in and reexamine the decisions made by system 1 and justify them.
This is where Kahneman gets very interested in how it behaves more like a lawyer than an explainer. But nonetheless, the idea of system 2 being activated to check the conclusions of system 1 is a very important idea. Think about it in the context of the bus recognition example from before. If you close your eyes and quickly imagine seeing a bus, then right away you’ll be able to recreate what a bus looks like. So that’s an experience that’s fairly easy for us to recreate. But it’s interesting that even if it’s an imagined bus, if I asked you, “How did you know that what you just imagined was a bus?” then you don’t phrase it in terms of anything that has anything to do with the picture. You start justifying your decision based on the fact that it is long and square and has wheels and has windows and has all the properties that we have come to be familiar with in a bus. But actually, you most likely didn’t see that when you saw the bus, right? So it’s an afterthought. When somebody asks you why this is a bus, then you start justifying your claim that it is a bus based on things that you didn’t actually see before deciding it was a bus. And it even works with a mental picture of a bus. Even if you didn’t imagine the windows when you were imagining the bus, you will still say that that’s part of the reason you think that what you were imagining was a bus.
So this is an interesting property of the human mind, that we can actually introspect so strongly on the behavior of system 1. And if you look at the way the brain is structured, that’s not really surprising: the neocortex is wrapped around the entire older parts of the brain and reaches into the inner parts of the brain and actually has access to a lot of the workings there. So maybe, and this is me speculating about how our brains evolved to be the way they are, but maybe that’s exactly the point. The neocortex is a high energy-consuming, limited resource that is able to go in and analyze the behavior of a more original primitive brain that we inherited from our forebears.
One way of thinking about the state of AI today is that we have run forward with the ideas of system 1. And I think we’ve become fairly good at mimicking that in a lot of the research into neural networks. But we’re lacking sorely in developing a system 2 to go in conjunction with a system 1 in our AI approaches.
The way I see it, is quite similar to what you just said: I have these modules of our mind that I call intuitive estimator, then imaginator (a dreamer), then muscle memory, and then reasoning and logical thinking.
And we started with the intuitive estimator, the first neural networks. Some call them predictors, but to predict kinda implies you can look into the future somehow, and they don’t. They just estimate based on the past data. Then we moved to the muscle memory types of things, like the things that can keep your car within a lane based on a pure visual data stream or can control plasma in fusion reactors. All those things that we can train the systems to do automatically, like we train ourselves. And then we went to dreamers, like GPT-3, or DALL·E, or Stable Diffusion, where we can come up with texts, coherent texts, or coherent pictures. They won’t be perfect, because the reasoning piece is lacking, so the horse might have five legs instead of four, and all other issues. But they will be quite, quite impressive.
And now, the next important piece is the logical reasoning, the thing that you work on. So let’s talk about the QLattice. It is a symbolic regressor, as far as I understand. What is symbolic regression, and how does the QLattice address these issues of the ML models that we talked about?
The QLattice is currently used mainly for symbolic regression. But at its core, it’s actually not a symbolic regressor. At its core, it’s a graph-finding algorithm. And by “graph-finding algorithm”, here I mean the word “graph” in the computer science sense of the word.
So a graph is a thing that has nodes and edges. So if you think about relating some property – let’s be a little mathematical and say that you’re relating Y to X1 and X2 – then there’s various ways you can do that. You can take X1, and then you can connect it with an add node, and then you can take X2 and connect them to the same add node, and then you connect that output to Y. Then you have a graph where X1 and X2 are at the initial nodes, and the plus is a node in and of itself that the other two are connected to, and so on.
If you can imagine such a graph in your head, it’s like a little two-legged fork with Y at the end and X1 and X2 at the two legs.
That kind of thing is what a computer scientist would call a graph. And they can be arbitrarily complex, they can be much bigger than that with nodes and edges all over the place. And in a way, almost everything that we come across in terms of flow diagrams, decision trees, mathematical expressions, and so on, can be written out as a graph. So in being able to find graphs, we can solve problems from finding mathematical expressions to finding logical flow diagrams to finding Bayesian networks and so on. All the kinds of things that you would be able to write out as a graph can be found using a graph-finding algorithm.
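To make the two-legged fork concrete, here is a minimal Python sketch of that graph written out as nodes and edges. The dictionary layout and the helper names `edges` and `evaluate` are illustrative assumptions for this transcript only; they are not how the QLattice represents graphs internally.

```python
import operator

# Hypothetical set of operations a node may apply (an assumption for this sketch).
OPS = {"add": operator.add, "multiply": operator.mul}

# Each node maps to (operation, list of nodes feeding into it).
# X1 and X2 are the two legs, "add" is the plus node, Y is the end of the fork.
graph = {
    "X1": ("input", []),
    "X2": ("input", []),
    "add": ("add", ["X1", "X2"]),
    "Y": ("output", ["add"]),
}


def edges(graph):
    """Return the directed edges of the graph as (source, target) pairs."""
    return [(src, name) for name, (_, inputs) in graph.items() for src in inputs]


def evaluate(graph, name, values):
    """Compute the value of a node by recursively evaluating what feeds into it."""
    op, inputs = graph[name]
    if op == "input":
        return values[name]
    args = [evaluate(graph, src, values) for src in inputs]
    if op == "output":
        return args[0]  # the output node just passes its single input through
    return OPS[op](*args)


print(edges(graph))
print(evaluate(graph, "Y", {"X1": 2.0, "X2": 3.0}))
```

Swapping the "add" entry for "multiply", or chaining more operation nodes between the inputs and Y, gives a different graph and therefore a different mathematical expression, which is the space a graph-finding algorithm has to search.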
Now, the problem with graph-finding algorithms is that if you have a set of starting points and you’re looking for the best graph to get to another point – the endpoint – then there’s an infinite set of routes that you can actually take. That’s the annoying thing. Some people would say that’s why graph search algorithms tend to be what computer scientists would call “NP-hard”. It’s like there is no logical way to say that you’ve exhaustively searched through the entire space of all possible graphs that you could search.
This has traditionally been tackled in a range of different ways with methods from evolutionary computing, such as something called genetic programming, but there have been other evolutionary computing approaches to try to tackle that problem. Unsuccessfully! And this is actually as old as AI itself, to try to do AI with graph-searching algorithms. It goes all the way back to McCarthy, actually, the inventor of even the term AI back in the 50s. And that entire family of AI algorithms has been dubbed “symbolic artificial intelligence”. They didn’t call it that themselves back then, Turing and McCarthy; they called it “AI”. But nowadays, we tend to call that “symbolic AI” because it’s looking for graphs that can be symbolically represented to explain how things relate to each other.
Now, the idea of symbolic AI kind of went south around the second AI winter, when it didn’t actually quite work. And then deep learning showed up (neural networks, as we called them at the time before they were rebranded). And it turned out to work very well. And then neural networks, in a way, squeezed the idea of symbolic AI into the background.
This happened back in ’89, ’90, ’91, ’92, actually, when I was a fairly young physics student. And since then, we’ve kind of given up on symbolic artificial intelligence and focused exclusively on deep learning. The technical term for deep learning and FiLM and similar approaches is actually “sub-symbolic artificial intelligence.”
And I guess today, most researchers who know these terms would say sub-symbolic artificial intelligence won, but I would say sub-symbolic artificial intelligence is system 1. You’re never going to solve system 2 style problems with a system 1 style approach. So in order to revisit this problem, we have to look at symbolic artificial intelligence again.
And symbolic artificial intelligence is, as I said, very well phrased as graph-searching algorithms. They’re just hard because you’re searching through these annoying infinite spaces.
So the QLattice, which is the algorithm that we have invented here at Abzu, is essentially an algorithm that uses some tricks that were developed to study quantum fields, which is something I worked with a long time ago, to search these infinite spaces of possible graphs and come up with graphs that match certain criteria after looking through an infinite set of possible graphs.
It’s sort of intricate: how can you actually look through an infinite space? And maybe we’ll get to talk more about that. But the outcome, essentially, is that you get the graphs out that are the best fit to the question you ask.
So I think what you can say is that we here at Abzu have solved that problem. We’ve found the missing piece to make symbolic artificial intelligence work. Now, symbolic artificial intelligence and symbolic regression aren’t exactly the same. Symbolic regression is about finding graphs that are at the same time also mathematical expressions.
So currently, we here at Abzu are using the QLattice mainly for symbolic regression. And if you use the freely available version of our package – which you can find on PyPI if you search for QLattice – then you will get the symbolic regressor. But there’s much more to it. And we have much bigger plans for uses in other graph-search problems in the future.
Okay, so we definitely should talk about both parts, starting with the part that is currently available to your users and customers: the symbolic regression, and the ability to find the right formula to approximate the data that the user has.
Talk about how it works, what kind of benefits it gives to the potential user, and what kind of data it needs.
Starting with the benefit, and then the data. And then we can say, particularly if you remind me, we can return to how it works in the end.
But the benefit. So the analogy – it actually isn’t mine, it’s from one of my colleagues, Meera, who came up with this analogy – is Kepler’s Law of Planetary Motion.
There once was a Danish astronomer called Tycho Brahe, who sat on an island in Øresund here, the water that I’m actually looking at if I look out the window, and he was studying the orbit of Mars.
So he wrote down very clearly how Mars traverses the sky. And if you take that data, and you put it into a neural network, you will get a model that can accurately predict where Mars will be in Earth’s sky at any point in time in the future, right? There’s sufficient data to train a good neural network on it. So you can actually predict where Mars will be in the heavens on a given day, with fair accuracy.
Now, if you just want to know where Mars is, then all good, right? That’s what you needed. You don’t care about why Mars is there and why it moves the way it does. But if you instead take the same data and put it into the QLattice, what comes out is a formula. And that formula expresses the location of Mars as an elliptical mathematical equation. This elliptical mathematical equation is actually what we nowadays know as Kepler’s Law of Planetary Motion. It is this elegant rule that says that planets move in ellipses, and the sun is in one of the foci of the ellipse, and the rate of motion is also spelled out by Kepler’s Second Law.
So this is a parametric formula that describes the motion of Mars. It just so happens that it also describes the motion of every other planet, or every other celestial object in the universe, because it is fundamentally governed by an even more fundamental rule, which is Newton’s Law of Universal Gravitation.
But the point here is: If you care about Kepler’s Law, then a neural network will do nothing for you. It will stop you short with a predictor. And that’s how far you will get and you won’t get any further. If you care about the underlying rules governing the observations you make, then you need a system like ours.
And again, if you think back in the context of system 1 and system 2, it stands to reason that you could teach – I don’t know, an ant – to always look at where Mars will be, because it’s an intuitive judgment thing that can be learned to do intuitively. But it required system 2 and a very thorough reasoning process on behalf of Johannes Kepler to actually take these observations and turn them into Kepler’s Laws of Planetary Motion.
So using the QLattice gives you Kepler’s Laws. And using a neural network gives you an estimator for where Mars will be. And therein lies the benefit. If you care about the generalizable picture, if you want to understand the underlying question of “why,” then you should go with a symbolic approach. And that’s why the users we have today are the people who care, right?
So there are many situations – I’m not saying that you should always care – there are many situations where it really doesn’t matter, you just want to predict. But there are more situations than you would think where you really do care about the “why.” Maybe you just have forgotten because you’ve given up on getting the “why.”
Let’s say, for instance, that you’re studying churn, customer churn at some e-commerce store. Yes, you can use various types of machine learning to very accurately predict who will churn next. And maybe that’s good enough; then you can call them and say, “Please don’t leave my business.” And that’s fine. But again, if there existed a technology that was able to tell you not just that people would churn but why they churn – “The reason they’re churning is because they’re waiting for more than three hours for support calls” or some other reason – then I think most people would prefer that.
So there’s always an added benefit in the “why” I would say, but the importance of “why” scales from almost irrelevant to the most crucial thing of all, depending on the actual use case.
So our users tend to be toward the end of that scale where they care more about the “why” than the “what”. And that’s the benefit.
In terms of data and the data needed, the answer is a little complicated, actually. So neural networks are very wasteful with data. The QLattice always works a lot better than neural networks with the same amount of data if there is an underlying causal explanation to be found in the data. But that’s because neural networks are actually, probably, pretty much the poorest example of a machine learning technique if you don’t have huge amounts of data available.
There are other machine learning techniques that are better fitted for less data, like random forest approaches, or various kinds of ensemble estimators, gradient boosting, and any Bayesian networks or self information networks.
There are approaches that work better with small amounts of data than deep learning does. The QLattice scales, interestingly, with the amount of data available, because the QLattice itself, being a graph-search algorithm, will just give you the statistically most likely explanation present in the data.
So say you’ve collected data about 18 patients (we have customers who have data about the full gene sequencing of 18 patients with a rare disease). If you just have 18 patients, there are limits to what you can statistically infer from that data set. And the QLattice will comply with those limits.
So it will tell you that, “In this model, I can only say that it seems to be this gene and not this gene that could be supported by the data with a certain probability.” And that means that the QLattice works well, even at, say, 18 data points in this case. But, of course, it cannot then find very complicated relationships because they are just not statistically supported in the data set.
As more and more data becomes available, the QLattice can find more and more hidden, more and more complex nonlinear interactions in the data set. Until in the limit of infinite data, it approaches the same behavior as a neural network. If you let the QLattice run wild, say, “Just give me the most arbitrarily complex equation you can find given this huge data set of millions of observations”, then what comes out actually will look a little bit like a neural network, because a neural network is also a very large mathematical expression in its own right.
So the point here is that the QLattice, in that it complies with the rules of statistics, will work for arbitrarily small data sets, but it will thereby necessarily give you simpler and simpler explanations, because they need to be supported by the actual data that was shown.
Usually symbolic regression tries to find the simplest formula that would approximate the data. Is that the case for the QLattice?
As I say, it finds a graph that matches the criteria that we ask it to. So if we define what we mean by simplicity, then it will find the model that is simplest. But “simple” is a hard concept to define. The way we deal with it is we define it in terms of statistical likelihood.
So the default behavior for our symbolic regressor is to sort the graphs, and thereby the formulas that it returns, by how statistically likely they are to not be spurious. The risk is always that if you make an arbitrarily complex equation, you can match anything, right? Generate a large random sequence of data and find a function that can translate that into another arbitrary randomly generated sequence. That’s always possible if you allow the equation to be arbitrarily complex.
But the complexity also spells out the number of degrees of freedom in the model. And thereby you can use it to estimate the probability that this isn’t itself spurious. That’s what’s called the “Akaike information criterion” in statistics. There’s another one, an adaptive one, called the “Bayesian information criterion”. And it’s debated which one is correct – we support both.
So the default behavior for the freely available version of the QLattice is to use the Akaike information criterion in the sorting, which means that you just get the models that are statistically most likely. And that means that the more data you have, the more complex the models can be while still retaining the property of being the statistically most likely.
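As a rough illustration of sorting candidate models by an information criterion, the Python sketch below fits polynomials of increasing degree to synthetic data and ranks them by AIC. It assumes Gaussian errors and the common least-squares form AIC = n·ln(RSS/n) + 2k, with BIC replacing 2k by k·ln(n). The data, the polynomial candidates, and the function names are made up for illustration; this is not the QLattice's own implementation.

```python
import numpy as np


def aic(rss, n, k):
    """Akaike information criterion for a least-squares fit (Gaussian errors assumed)."""
    return n * np.log(rss / n) + 2 * k


def bic(rss, n, k):
    """Bayesian information criterion for the same setting."""
    return n * np.log(rss / n) + k * np.log(n)


# Synthetic data: a quadratic relationship plus noise (illustrative only).
rng = np.random.default_rng(0)
n = 30
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.2, size=n)

# Candidate "formulas": polynomials of degree 0 through 8.
candidates = []
for degree in range(9):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 1  # number of fitted coefficients, i.e. degrees of freedom
    candidates.append(
        {"degree": degree, "rss": rss, "aic": aic(rss, n, k), "bic": bic(rss, n, k)}
    )

# Lower AIC is better: the penalty 2k offsets the gain from extra complexity.
ranked = sorted(candidates, key=lambda m: m["aic"])
for model in ranked[:3]:
    print(model["degree"], round(model["rss"], 3), round(model["aic"], 2))
```

The most complex polynomial always has the smallest residual error on the data it was fitted to, but the penalty term means it does not automatically come out on top of the ranking. With more data points, the same penalty weighs less against the fit term, so more complex candidates can earn their place.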
I would like to say, though, that one of the things that most of our users appreciate the most, actually, about the QLattice, is not really the model that it gives, but the fact that it gives you an infinite list of models. Because the way you can think about this is that there’s an infinite set of possible formulas or equations you could try out. And you can think about that as being the input into the QLattice. And the output is the same list, now sorted by statistical probability.
So item number one is going to be the most likely one, item number two is probably going to be almost as likely, and the difference is most likely very marginal. So in the hands of a researcher who can interpret the formulas that get shown to him or her, it allows the researcher to combine the hypotheses or models they get shown with their own prior knowledge of how the system works.
So quite often, the users we work with, our customers, will actually not take the most likely model. They’ll take, say, number seven or the 23rd most likely model, because they already know this model is likely for some other reason, for example that this gene that impacts the growth of the tumor is known to also impact the growth of other tumor cells, things like that. So in that sense, the QLattice doesn’t chop off the ability of the user to think for themselves.
I’ve spoken with another journalist once who compared it to a thinking hat, and I kind of liked that metaphor, where the QLattice not so much gives you an answer as augments your own ability to find answers by showing you what could be true and then allowing you to decide which of these things to explore further. Yeah, I like the metaphor, and I think it’s also quite true.
I have to add, though, that actually, Einstein’s Special Theory of Relativity is, from a mathematical standpoint, actually simpler than Newton’s Three Laws of Motion.
But that’s a little detail. They’re just really weird. But actually, the equation itself – one over the square root of one minus v squared over c squared – that’s the Lorentz Contraction Factor, and it has one degree of freedom. That’s v squared.
But anyway, that is true. We work mostly with disease mechanisms and disease etiology, understanding how diseases arise and how they evolve, and also with the pharmaceutical companies at modeling how you can then produce molecules that will prevent this process.
So that’s kind of our business domain. If you think about etiology, or how diseases arise, it’s actually what a statistician might call a degenerate problem. Take a thing like, let’s say, breast tumors, a disease that we’ve studied a lot.
A breast tumor isn’t one thing. There are a lot of different, unfortunate combinations of genes that can radically increase the probability of developing a breast tumor. And if we take a data set, let’s say a case that we ran with 700 women with breast tumors, then there are some explanations that jump out the moment you put them through the QLattice, right?
So for 60% of the women, the occurrence of breast cancer can very accurately be predicted by the property of them having two specific genes in an unfortunate combination. In this specific case, the combination was not linear. So it had never been found before because, apart from the QLattice, pretty much the only thing you do with this kind of data is linear regression.
But this was not linear, it’s like a hotspot: unfortunate levels of these two genes interacting in a bad way that created this rapid tumor growth that accelerated the body’s own repair mechanisms.
So that explained about 65% of the tumor growth cases. But the explanations for the remaining cases are probably more complicated than that. They probably involve three or four or five genes, and therefore they cannot be found in this data set. So if we collect more data, we can find the additional explanations and piece by piece say, alright, this is one etiological mechanism. This is how the disease develops in some women. And this is how it develops in the remainder, in a smaller part of women, and so on and so on. And the more data we get, the more we can divide and conquer the problem until only the rarest occurrences of breast cancer remain unexplained.
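To illustrate why a non-linear "hotspot" interaction between two variables can be invisible to linear regression, here is a small Python sketch on purely synthetic data. Nothing here comes from the breast cancer study: the two "gene levels", the Gaussian shape of the hotspot, and the hand-written hotspot term standing in for what a symbolic search might propose are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins for the levels of two genes (NOT real patient data).
g1 = rng.uniform(-1.0, 1.0, size=n)
g2 = rng.uniform(-1.0, 1.0, size=n)

# Assumed hotspot: the outcome is high only when both levels sit in a narrow
# joint range (here, both close to zero at the same time).
hotspot = np.exp(-(g1**2 + g2**2) / (2 * 0.3**2))
outcome = hotspot + rng.normal(scale=0.05, size=n)


def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()


# Linear regression on the two levels alone barely explains the outcome.
r2_linear = r_squared([g1, g2], outcome)

# Adding the non-linear interaction term explains most of it.
r2_hotspot = r_squared([g1, g2, hotspot], outcome)

print(f"linear only:        R^2 = {r2_linear:.3f}")
print(f"with hotspot term:  R^2 = {r2_hotspot:.3f}")
```

In this toy setting the effect is symmetric around the hotspot, so neither variable has a linear trend for regression to pick up, even though the two together determine the outcome almost completely.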