Using LLM Models and Explainable ML to Analyse Biomarkers at Single Cell Level for Improved Understanding of Diseases.

Jonas Elsborg, Marco Salvatore.


Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced our understanding of the diversity of cells and how this diversity is implicated in diseases. Yet, translating these findings across various scRNA-seq datasets poses challenges due to technical variability and dataset-specific biases. To overcome this, we present a novel approach that employs both an LLM-based framework and explainable machine learning to facilitate generalization across single-cell datasets and identify gene signatures to capture disease-driven transcriptional changes.

Our approach uses scBERT, which harnesses shared transcriptomic features among cell types to establish consistent cell-type annotations across multiple scRNA-seq datasets. Additionally, we employ a symbolic regression algorithm to pinpoint highly relevant yet minimally redundant models and features for inferring a cell type’s disease state from its transcriptomic profile. We then assess how well these cell-specific gene signatures generalize across datasets, demonstrating their robustness as molecular markers for identifying and characterizing disease-associated cell types.
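
To make the idea of a simple, symbolic gene signature concrete, here is a minimal sketch. It is not the QLattice and not the authors' method: it brute-forces a tiny, assumed set of expression templates over pairs of genes and ranks each candidate by how well it separates disease from healthy cells (ROC AUC). The gene names, expression values, and templates are all hypothetical, and the sketch scores relevance only, without any redundancy penalty.

```python
from itertools import combinations

# Hypothetical toy data: per-cell expression of three made-up genes,
# plus a label (1 = disease, 0 = healthy). Not taken from the study.
CELLS = [
    {"GENE_A": 5.0, "GENE_B": 1.0, "GENE_C": 2.0, "disease": 1},
    {"GENE_A": 4.0, "GENE_B": 2.0, "GENE_C": 3.0, "disease": 1},
    {"GENE_A": 6.0, "GENE_B": 2.5, "GENE_C": 1.0, "disease": 1},
    {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 2.5, "disease": 0},
    {"GENE_A": 4.5, "GENE_B": 5.0, "GENE_C": 3.0, "disease": 0},
    {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 3.5, "disease": 0},
]

# Candidate symbolic forms combining two genes (assumed for illustration).
TEMPLATES = {
    "{a} + {b}": lambda a, b: a + b,
    "{a} - {b}": lambda a, b: a - b,
    "{a} * {b}": lambda a, b: a * b,
}


def roc_auc(scores, labels):
    """Fraction of (disease, healthy) pairs where the disease cell scores higher.

    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def search_signatures(cells, genes):
    """Score every template on every gene pair; return results best-first."""
    labels = [c["disease"] for c in cells]
    results = []
    for g1, g2 in combinations(genes, 2):
        for form, fn in TEMPLATES.items():
            scores = [fn(c[g1], c[g2]) for c in cells]
            results.append((roc_auc(scores, labels), form.format(a=g1, b=g2)))
    return sorted(results, reverse=True)


ranked = search_signatures(CELLS, ["GENE_A", "GENE_B", "GENE_C"])
best_auc, best_expr = ranked[0]
print(best_expr, best_auc)
```

On this toy data the top-ranked expression is a two-gene difference. That shows the appeal of symbolic models: the resulting signature can be read directly as a statement about which genes move in opposite directions in disease.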

Validation is carried out using four publicly available scRNA-seq datasets from both healthy individuals and those suffering from ulcerative colitis (UC). This demonstrates our approach’s efficacy in bridging disparities specific to different datasets, fostering comparative analyses. Notably, the simplicity and symbolic nature of the retrieved gene signatures facilitate their interpretability, allowing us to elucidate underlying molecular disease mechanisms using these models.



As illustrated in Figure 1a, we implemented a workflow to investigate the distinction between healthy and ulcerative colitis samples. We integrated four distinct datasets containing such samples and fine-tuned the language model scBERT on the Gut Cell Atlas to characterize the molecular profiles and cell types of the samples. We then used the fine-tuned scBERT to train the QLattice on each dataset, which let us analyze and interpret complex relationships within the data. The outcome was a set of cell-type-specific gene signatures that could improve our understanding of ulcerative colitis and potentially aid the development of more targeted treatments. The workflow is also versatile: given similar datasets, it can be readily adapted to study other diseases or conditions.

We focused on the transferability and performance of predictive models trained on individual cell types using key gene features. The models performed well, particularly in predicting disease samples. The Arterial Capillary model reached a PR AUC of 0.95 and a score of 0.94, the Goblet Cell model a PR AUC of 0.94 and a score of 0.91, and the BEST2+ Goblet Cell model a PR AUC of 0.96. Models trained on specific cell types also performed well when applied to new datasets. This transferability is especially evident in the Intestinal Stem model, which achieved a PR AUC of 0.95 and a score of 0.86. These findings suggest that predictive models based on key gene features can classify disease samples across diverse cell types, which holds promise for practical applications in disease diagnosis and precision medicine. Tailoring the workflow to other diseases could likewise uncover characteristics unique to each condition and support more individualized therapeutic strategies.

On the machine learning side, our analysis shows that when the resolution of the data is increased, the resolution of the models can be decreased. This in turn leads to better model interpretability, which is critical for turning computational discoveries into translational insights. In our case, finely resolving the cell subspace of scRNA-seq datasets, using a performant large language model for annotation, allowed us to discover simple but highly predictive and transferable gene signatures. We analyzed these for their biological significance and were able to show that the signatures that transfer best between datasets have plausible underlying reasons for performing better than others.

In conclusion, our workflow, which integrates diverse datasets with advanced analytical techniques, holds promise for advancing our knowledge of various diseases and for moving towards more targeted and personalized healthcare.
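
The PR AUC values reported above summarize how well a model ranks disease cells ahead of healthy ones. The following minimal sketch shows one common estimator, average precision, applied to a fixed signature on a dataset that was not used to derive it, which is the essence of a transferability check. The signature, gene names, and data are hypothetical, and the estimator used in the study may differ.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision@k over the ranks k of true positives.

    Cells are ranked by descending score; ties keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def signature(cell):
    """Hypothetical two-gene signature, standing in for a fitted symbolic model."""
    return cell["GENE_A"] - cell["GENE_B"]


# Hypothetical external dataset (1 = disease, 0 = healthy); the signature
# is applied as-is, with no refitting on these cells.
EXTERNAL = [
    {"GENE_A": 6.0, "GENE_B": 1.0, "disease": 1},
    {"GENE_A": 5.0, "GENE_B": 2.0, "disease": 1},
    {"GENE_A": 4.0, "GENE_B": 2.0, "disease": 0},
    {"GENE_A": 3.0, "GENE_B": 2.0, "disease": 1},
    {"GENE_A": 1.0, "GENE_B": 4.0, "disease": 0},
]

scores = [signature(c) for c in EXTERNAL]
labels = [c["disease"] for c in EXTERNAL]
pr_auc = average_precision(scores, labels)
print(f"PR AUC (average precision) on external data: {pr_auc:.3f}")
```

In practice a library routine such as scikit-learn's `average_precision_score` would normally be used. The pure-Python version here only makes the computation explicit.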
