
Extracting and analyzing relevant medical information from large-scale databases such as biobanks poses considerable challenges. To exploit such ‘big data’, previous attempts have focused on large-scale sampling algorithms that model individual data points. However, since these algorithms sample the entire dataset millions of times, their theoretically very high precision comes at a prohibitive computational cost and therefore remains unattainable in practice. To overcome this, scientists previously developed approaches that sacrifice accuracy for speed.
In a bid to optimize both precision and performance, researchers from the groups of Matthew Robinson and Marco Mondelli at the Institute of Science and Technology Austria (ISTA) developed an algorithm that can extract and analyze information from the world’s most extensive biobank with unprecedented accuracy and speed. Ultimately, their method, presented here using the model complex trait of human height, could advance personalized medicine in the context of diagnostics and, further down the line, even forensics.
Algorithmic innovation using human height
The team’s approach draws on the recently established mathematical framework known as "approximate message passing" (AMP), to which Mondelli has made significant contributions. Their new method, dubbed "genomic Vector Approximate Message Passing" or gVAMP, enhances the framework’s ability to extract complex information from the dataset at hand. "Whereas other methods tend to analyze one snippet at a time before combining the results, gVAMP functions as a ‘joint estimation’ method. Therefore, it provides a detailed overview of the effects on a trait in the context of all variants across massive-scale genetic datasets," says ISTA PhD candidate Al Depope, the study’s first author. "We can speak of an algorithmic innovation."
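The difference between analyzing one variant at a time and estimating all effects jointly can be illustrated with a small simulation. The sketch below is not gVAMP: it uses plain ridge regression as a stand-in for joint estimation, and every name, parameter, and data value in it (simulate, marginal_estimates, joint_estimates, the correlation rho, the effect size 0.5) is an assumption made purely for illustration. It shows only the general principle that, when neighbouring variants are correlated, one-at-a-time estimates smear a true effect onto nearby variants, whereas a joint fit can attribute it to the right position.

```python
import numpy as np


def simulate(n=2000, p=50, rho=0.8, seed=0):
    """Toy data: p correlated 'variants' for n individuals (illustrative only)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = noise[:, 0]
    # Each column is correlated with its left neighbour, loosely mimicking
    # the correlation between nearby positions in a genome.
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * noise[:, j]
    beta = np.zeros(p)
    beta[np.arange(5, p, 10)] = 0.5  # five hypothetical causal variants
    y = X @ beta + rng.standard_normal(n)  # trait = genetic effects + noise
    return X, y, beta


def marginal_estimates(X, y):
    """One variant at a time: a separate simple regression slope per column."""
    return (X.T @ y) / (X**2).sum(axis=0)


def joint_estimates(X, y, ridge=1.0):
    """All variants at once: ridge regression (a stand-in, not gVAMP)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ y)


X, y, beta = simulate()
print("error, one at a time:", np.linalg.norm(marginal_estimates(X, y) - beta))
print("error, joint fit:    ", np.linalg.norm(joint_estimates(X, y) - beta))
```

In this toy setting the joint fit recovers the simulated effects more closely because it accounts for the correlation between columns. Real biobank data involve millions of variants, which is where a dense solve like the one above stops being practical and scalable methods become necessary.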
To develop their method, the team chose human height, an established model for the genetic analysis of complex traits. "Examining human height allowed us to explore the limits of computational scalability with gVAMP, both in the number of genome sequences as well as the number of variants involved," says Depope. Indeed, the trait is influenced by a whopping 17 million variants, which the team could analyze simultaneously in hundreds of thousands of whole-genome sequences from anonymized volunteers contained in the UK Biobank, the world’s most comprehensive dataset of biological, health, and lifestyle information. "What I find particularly important is the interpretability of our algorithm when applied in biology. In addition to allowing us to predict people’s height from their DNA more accurately than before, it also allows us to pinpoint the specific DNA regions involved," says ISTA postdoc and co-author Jakub Bajzik.
Outperforming existing methods
When gVAMP predicts human height and the contribution of individual genetic variants, the algorithm creates this data for the first time. As a result, there is no pre-existing data on human height against which to benchmark the method. "Essentially, the question here is ‘how do we know that gVAMP picked out the true variants?’" Depope explains.
From personalized medicine to forensics?
The interdisciplinary study combines expertise in information theory, mathematics, genomics, and software engineering. Bajzik’s background in computer science complemented Depope’s focus on theory and math. Robinson, who specializes in state-of-the-art statistical models for genomic data, co-supervised the project with Mondelli, who seeks to develop robust inference methods in information theory to address data-driven challenges in engineering and natural sciences.


