Hi, my name is Casper, and I’m here to tell you about a company called Abzu that I founded together with six other people.
And actually, the idea for Abzu came to me about 25 years ago, when I was a young student at the University of Copenhagen working on quantum field simulations and building computers to analyze quantum fields. Back then I had the idea that perhaps something in this methodology could be reused to explore data sets and do what people were at that time trying to do with the then state-of-the-art artificial intelligence methods – which were, by the way, also neural networks. And that’s still what everybody’s talking about.
This idea bubbled in the back of my head for some 20 or 25 years. And then, about three years ago, I went around and collected a group of people to found the company Abzu together with me, to see if we could realize this somewhat crazy, nerdy idea in real life. So that became Abzu. We then worked for a couple of years to actually build the core technology we’d envisioned. And we, perhaps against all odds, succeeded, and have created what is essentially a completely new class of artificial intelligence.
So if people here are familiar with things like, say, neural networks, random forests, gradient boosting, or ensemble methods, then first of all, kind of park those at the side of your mind, because what I’m going to talk to you about today is not like that. It has many similar use cases, but it’s not the same kind of technology you see being applied when most other people talk about artificial intelligence.
The idea is that you have some kind of data that you know represents some phenomenon that you want to study. But you actually want to extract the real meaning of that: why does that data look the way it does?
Scientists are generally interested in explaining things. They want to explain phenomena; they want to explain why Mars moves in the orbit around the sun that it does. They don’t just want to predict where Mars is going to be. If you know the explanation, the prediction is easy. But you can also be in a situation where you can predict but cannot actually explain.
Today, I’m going to talk to you about that on two different levels. First, I’ll quickly introduce the technology: what it does, how it works, and what it looks like to the user. Then, for the second half of the presentation, I’ll zoom into a specific use case where we’ve applied this technology in the health science and pharmacological research space. So first we’ll start general, and then we’ll dive into a more specific, concrete example that comes from studying the actual causes of breast cancer.
But let’s start with the actual technology. So imagine you’re a researcher – some kind of researcher – who has a question in your mind that you would like answered. That could be, “Can we predict the revenue of companies based on information we have about those companies today?” Or, “Could we predict when a windmill is going to break down?” Or, “Could we perhaps understand why certain drugs are toxic to the liver, while others are not?” It doesn’t so much matter.
Imagine you are a researcher looking for new knowledge…
The point is: you have a set of perhaps limited data that you’ve collected that describes the phenomenon, and you have a thought in your head that this data can somehow reveal its underlying relationships. The problem you run into is that once you’ve collected that set of data and start trying to use it to explain the phenomenon, the truth is there’s a near-infinite set of possible ways you can combine that data into a model. Say your data set has 2,000 features: just picking two of them is the binomial coefficient 2000 choose 2, which is about 2 million different potential two-feature models. And that’s not even taking the interaction terms into consideration. So no human can actually explore that data.
What researchers will do is, they’ll have some hunch, and then they’ll try to compose a model: say, maybe it’s this feature times that feature that, multiplied together, yields the thing I want to explain. Oh well, that didn’t work. Try, try again – say, 2 million times in a row.
So that doesn’t really work, right? You can’t really do this. And 2 million is a low number, and that’s just the simple models. If the models use four features, for instance, then we’re talking about hundreds of billions of different combinations. Researchers have traditionally relied on intuition to solve this problem. But intuition can only take you so far.
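To get a feel for how quickly this search space explodes, you can count the combinations directly. A minimal sketch in plain Python, assuming the 2,000-feature data set from the example above:

```python
from math import comb

n_features = 2000

# Two-feature models: choose 2 out of 2,000 features
# (ignoring interaction terms, as in the talk).
two_feature_models = comb(n_features, 2)
print(two_feature_models)  # 1999000 -- about 2 million

# Four-feature models: the count jumps by several orders of magnitude.
four_feature_models = comb(n_features, 4)
print(four_feature_models)  # 664668499500 -- hundreds of billions
```

The jump from 2 million to hundreds of billions is why adding even two more features to the candidate models puts exhaustive manual exploration completely out of reach.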
What the QLattice does is create a mathematical representation of all of those billions of potential models at the same time. That’s the “Q” part of it. It then seeks through this virtually infinite space and returns the models that are most likely to actually match the data. So if each dot on this slide represents a possible model, then the highlighted dots are the models that will efficiently explain the data.
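The core idea – scoring candidate formulas against the data and keeping the ones that fit – can be illustrated with a toy brute-force search. This is a minimal sketch in plain Python with made-up data; the QLattice’s actual simulator explores the space very differently and at a vastly larger scale:

```python
import itertools

# Toy data set: three candidate features and a target generated as x0 * x2.
rows = [
    {"x0": 1.0, "x1": 5.0, "x2": 2.0, "y": 2.0},
    {"x0": 2.0, "x1": 1.0, "x2": 3.0, "y": 6.0},
    {"x0": 3.0, "x1": 4.0, "x2": 1.0, "y": 3.0},
    {"x0": 4.0, "x1": 2.0, "x2": 2.0, "y": 8.0},
]

# Candidate two-feature model shapes.
shapes = {
    "a*b": lambda a, b: a * b,
    "a+b": lambda a, b: a + b,
    "a-b": lambda a, b: a - b,
}

def loss(name_a, name_b, f):
    """Sum of squared errors of model f over the data set."""
    return sum((f(r[name_a], r[name_b]) - r["y"]) ** 2 for r in rows)

# Exhaustively score every feature pair under every shape, keep the best.
candidates = [
    (loss(a, b, f), f"{label} with a={a}, b={b}")
    for a, b in itertools.permutations(["x0", "x1", "x2"], 2)
    for label, f in shapes.items()
]
best_loss, best_model = min(candidates)
print(best_model, best_loss)  # a*b with a=x0, b=x2 0.0
```

Even this toy version makes the scaling problem obvious: the candidate list grows combinatorially with the number of features and model shapes, which is exactly the explosion the QLattice is built to handle.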
The QLattice finds the simplest – and otherwise hidden – explanations.
So here’s an example: perhaps a model – in this case a bivariate Gaussian equation – is a good explanation of the data.
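For reference, the simplest form of a bivariate Gaussian of two features looks like this. The means and widths below are placeholders, not the parameters on the slide, and the uncorrelated form is an assumption for illustration:

```python
from math import exp

def bivariate_gaussian(x, y, mx=0.0, my=0.0, sx=1.0, sy=1.0):
    """Unnormalized bivariate Gaussian with independent components.

    mx, my are the means and sx, sy the standard deviations of the two
    features. These defaults are placeholders, not fitted values.
    """
    return exp(-0.5 * (((x - mx) / sx) ** 2 + ((y - my) / sy) ** 2))

print(bivariate_gaussian(0.0, 0.0))  # 1.0 at the peak
print(bivariate_gaussian(2.0, 2.0))  # falls off away from the peak
```

The point is that a formula like this is something a researcher can read, plot, and reason about directly, which is what distinguishes it from a black-box model.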
You get the ability to not only predict whatever you’re trying to study with a black-box model, but to actually get potentially causal – or at least scientifically sound – relations in the data that can explain the thing you’re trying to understand.
For the researcher, you can cut away a lot of what we call “aimless exploration”, where you’re just trying and trying and trying different potential explanations of your data. And to do that, all you need is access to our QLattice. A QLattice is a high-performance computer simulator that we operate, but as an end user you don’t have to care much about that. If you have access to a QLattice, you install a library on your own machine that behaves like a traditional data science library.
The QLattice is a high-performance, easy-to-use simulator.
And this library will simply give you a list of potential mathematical equations that can explain your data. We call that library Feyn®, after the American physicist Richard Feynman, who invented the formulation of quantum field theory that we use in the simulator. From the point of view of the user, this is a quite traditional data science tool. But in terms of what you get out of it, you get simple mathematical explanations of phenomena rather than convoluted black-box models that are hard for humans to interpret.
This is, of course, important in many situations in life, as quite often you want to know why something happens, not just that it will happen. And that’s relevant in virtually any field where machine learning is applied today.
So although we at Abzu have developed a technology that is very general – it could essentially be applied wherever you apply machine learning – we’ve chosen to zoom in on a specific market vertical, which is pharmaceutical and health science research.
So the case I’m going to switch to now demonstrates the value of the technology in this specific field. The actual case is breast cancer. It’s a pretty significant health problem; a lot of women are hit by breast cancer every year. You can split breast cancer up into two categories. There are ductal carcinomas, cancers that arise in the milk ducts of the breast, and then there are lobular carcinomas, which arise in the actual milk-producing lobules of the breast. And they are quite different. I just want to highlight that because it will become important a few slides down the road. But from the point of view of the patient, a carcinoma is a carcinoma: a severe and dangerous cancer.
You can split breast cancer up in two categories: lobular and ductal.
The problem with studying these kinds of problems is that there are so many data points you could take into consideration. We know that the genome plays a role in causing cancer. We also know that the actual genes that get expressed in your body play a role in causing cancer. And we know that a lot of environmental factors play a role in causing cancer. So if you imagine all the different things we could consider as causal factors for cancer, or even for poor outcomes of cancer, that becomes a very, very large space. It’s exactly the kind of situation that the QLattice is well suited to deal with. The data can come from the smallest scale – the genome level – all the way up to microorganisms, or even behavioral patterns, or clinical data like smoking versus non-smoking and other health and lifestyle factors. And all these things can play into the final model.
Using small and wide data sets.
In this case we collected a bunch of data on 705 different women, all suffering from breast cancer. Almost 100 of these women ended up dying from the disease; fortunately, most of them were actually cured. What we did was collect data about things that came from the genome itself and things that came from mutations to the genome. We all have a genome, and the genome changes during our lifetime – and when a cancer starts evolving, the actual genome in the cells of the breast changes. Finally, there’s some data about the proteins these genes create. So we collected all of that in one data set, and then we asked the QLattice to give us the simplest possible model it could to explain this data.
Multi-omics data sets can be small and wide!
And what the QLattice did was seek through the millions and millions of potential models for the data, and then it came up with this very simple model. I want to highlight here that this very same data set has been studied by researchers using random forest models, which is kind of the state-of-the-art method when you have a very wide data set with not that many individuals in it. So we know roughly how good a predictive accuracy you can get out of this data set.
Exploring hypotheses with simple and interpretable AI models.
What you have here is actually two specific genes being expressed, APOB and MYOC, which code for two different proteins that our body secretes based on these genes. Their expression has to be at a certain level in order for you to have a good – or, conversely, a bad – outcome from your breast cancer. The numbers here, 0.65 and 0.68, show the relative difference between a model that takes everything into consideration and this very simple model. So what is it actually saying?
This is what a what a data scientist would call a "model decision boundary".
This is what a data scientist would call a model decision boundary. The green highlighted area in the middle of the graph marks the women who are likely to have a very poor outcome. I apologize for the colors; they should be inverted. But essentially, if you’re in this green sphere in the middle, you will most likely die from your breast cancer, and if you’re not, then you are most likely not to die from it. So what the model is really saying is that there’s a hotspot: there’s a certain range of APOB and MYOC expression that you just don’t want to be in. Because if you’re in that range, you’re very likely to have a very poor outcome.
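A hotspot rule of this kind can be written down directly: flag a patient when both expression levels fall inside the risk band. This is a minimal sketch only – the band limits below are made-up placeholders, not the fitted values from the actual model:

```python
# Hypothetical risk bands for the two gene-expression levels.
# These limits are illustrative placeholders, not fitted values.
APOB_BAND = (0.4, 0.7)
MYOC_BAND = (0.3, 0.6)

def poor_outcome_predicted(apob: float, myoc: float) -> bool:
    """Flag the 'hotspot': both expression levels inside their risk band."""
    in_apob = APOB_BAND[0] <= apob <= APOB_BAND[1]
    in_myoc = MYOC_BAND[0] <= myoc <= MYOC_BAND[1]
    return in_apob and in_myoc

print(poor_outcome_predicted(0.5, 0.4))  # True: inside the hotspot
print(poor_outcome_predicted(0.9, 0.4))  # False: APOB outside its band
```

A rectangular band is the simplest version of such a boundary; the model on the slide traces a smooth region, but the interpretation – a joint range of two expression levels you don’t want to be in – is the same.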
This model was not as good as the random forest model across the whole data set. But it has this simplicity, and that actually opens the door to a further very interesting scientific question, which is: “Which women in this data set is the model bad at predicting?” This is where the ductal and lobular carcinoma distinction comes in again. When we studied this model, what we could immediately see is that the women for whom this model works perfectly are the women who have a ductal carcinoma, whereas for the women who have a lobular carcinoma, the model was actually not particularly good.
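Checking where a model fails in this way amounts to stratifying its accuracy by subgroup – here by carcinoma type. A minimal sketch with made-up records, purely to show the bookkeeping:

```python
from collections import defaultdict

# Made-up (subtype, model_was_correct) records for illustration only.
records = [
    ("ductal", True), ("ductal", True), ("ductal", True), ("ductal", False),
    ("lobular", True), ("lobular", False), ("lobular", False), ("lobular", False),
]

hits = defaultdict(int)
totals = defaultdict(int)
for subtype, correct in records:
    totals[subtype] += 1
    hits[subtype] += correct  # True counts as 1, False as 0

accuracy = {subtype: hits[subtype] / totals[subtype] for subtype in totals}
print(accuracy)  # {'ductal': 0.75, 'lobular': 0.25}
```

With a black-box model this kind of subgroup analysis is all you can do; with a simple explicit model you can additionally ask *why* one subgroup fits the formula and the other doesn’t.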
So already here we’re seeing two different causal pathways to poor outcome in breast cancer, depending on the type of carcinoma – something we didn’t actually know played that role before. That’s another important scientific conclusion. Anyway, we didn’t stop here. We went on, of course, and asked for a more complicated model: “What can you do if you can use up to four features, QLattice?”
Allowing for larger, more complex models - but still simple explanations.
And what the QLattice then did was build this four-feature model, which brings in some gene expressions together with some specific mutations, and which now completely outperforms the competing random forest model.
There are some details here about how that gene works, but I’ll skip those, because I want to finish up by saying that this is a single example. I used the example of breast cancer, where we discovered interesting – sad, but interesting – causes of poor outcomes of breast cancer.
But of course, it’s not the only thing we’ve studied. We have a very similar case where we explain liver cancer, where we also, as this graph shows, outperform the black-box models in both predictive accuracy and scientific factfulness. And there’s a case here where a simple three-feature model predicted that certain patients would develop Alzheimer’s, which is yet another important health problem.
The QLattice outperforms traditional machine learning and black-box models.
So that’s why we at Abzu – the 25 people we are today – find this so stimulating to work with. We have built a technology that I personally think is pretty cool, but we’re actually using it to solve problems that are so incredibly important. It’s the best thing to go to work every day knowing that what you’re doing is helping solve problems of this caliber.
We need to sell our QLattice, and we sell it to pharmaceutical companies. But we also want to put it in the hands of the research community at large.
You can actually sign up to use a scaled-down version of the QLattice that runs on spillover capacity on our computer cluster – we call it the community QLattice. If you visit our website, you’ll see the details about how to do that; it’s not that hard. Then you can do your own data science project: analyze data that you would perhaps usually have analyzed using one or another machine learning technique, using the QLattice instead, and see the benefit of getting a simple mathematical model out. So, if you want to, I encourage you to try it.