VariantSpark: machine learning for ultra-high dimensional data


A machine learning suite tailored specifically to clinical and genomic data analysis.

The challenge

Clinical and genomic datasets are becoming more complex.

Researchers are sourcing data from more people and collecting more clinical information per person, resulting in high-dimensional data.

Current machine learning (ML) solutions are not designed to deal with ultra-high dimensional data. A common workaround has been to pre-filter the “raw” datasets or to limit the analysis to independent markers only, in order to reduce the computational cost.

However, eliminating information, or assessing only independent drivers in complex diseases, can bias results and limit accuracy.

Our response

We developed VariantSpark, a machine learning framework able to deal with the depth and complexity of clinical data.

VariantSpark can cluster clinical profiles or identify disease-associated features from trillions of data points (thousands of samples with millions of features each) in just 30 minutes.

VariantSpark is available through the Digital Marketplace on Amazon Web Services (AWS). It can also be deployed on Microsoft Azure and Google Cloud Platform (GCP, via Terra notebook).

The underlying random forest algorithms offer explainable machine learning. An interactive graph visualises how much each feature contributed to the overall prediction outcome.
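To illustrate what a per-feature contribution can look like, here is a minimal sketch that scores three toy binary markers by how much a single split on each reduces Gini impurity, then normalises the scores to proportions. Random forests commonly average this kind of impurity decrease over many trees; the one-split simplification, the marker names and the data are all assumptions made for illustration and are not taken from VariantSpark.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = control, 1 = case)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def impurity_decrease(feature, labels):
    """Drop in Gini impurity after splitting the samples on one binary feature."""
    absent = [y for x, y in zip(feature, labels) if x == 0]
    present = [y for x, y in zip(feature, labels) if x == 1]
    weighted = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
    return gini(labels) - weighted


# Hypothetical toy data: 4 cases followed by 4 controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "marker_a": [1, 1, 1, 1, 0, 0, 0, 0],  # separates cases from controls perfectly
    "marker_b": [1, 1, 1, 0, 1, 0, 0, 0],  # partially informative
    "marker_c": [1, 0, 1, 0, 1, 0, 1, 0],  # uninformative
}

decreases = {name: impurity_decrease(col, labels) for name, col in features.items()}
total = sum(decreases.values())
# Proportion of the total impurity decrease attributed to each marker.
proportions = {name: d / total for name, d in decreases.items()}

for name, share in proportions.items():
    print(f"{name}: {share:.0%}")
```

A bar chart of `proportions` would be the simplest static counterpart to the kind of interactive graph described above.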

An image showing a three-stage process. Box one: VariantSpark is available through the Digital Marketplace on Amazon Web Services (AWS). Box two: it can be deployed with digital instructions on Microsoft Azure and Google Cloud Platform (GCP, via Terra notebook). Box three: it is constructed automatically using cloud construction.

Obtaining, deploying and constructing VariantSpark (construction is automatic, using cloud construction).

VariantSpark is 90% faster than traditional compute frameworks and requires 80% fewer samples to detect a statistically significant signal.

It detects stronger predictive markers by considering the full dataset and the interaction between multiple clinical features.

In a genome-wide association case/control setup, VariantSpark can identify higher-order interactions between genomic features.
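A toy example may help show why interactions matter. In the sketch below, a hypothetical phenotype depends on two binary markers in an XOR pattern: a sample is a case only when exactly one marker is present. Each marker looks uninformative when tested on its own, yet the pair predicts the phenotype perfectly. The data and function names are invented for illustration and say nothing about how VariantSpark itself searches for interactions.

```python
from itertools import product

# Hypothetical case/control data: (marker_a, marker_b, phenotype), where the
# phenotype is 1 only when exactly one of the two markers is present.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25


def case_rate(rows):
    """Fraction of rows whose phenotype (last column) is a case."""
    return sum(r[-1] for r in rows) / len(rows)


def marginal_effect(rows, idx):
    """Difference in case rate between carriers and non-carriers of one marker."""
    carriers = [r for r in rows if r[idx] == 1]
    others = [r for r in rows if r[idx] == 0]
    return case_rate(carriers) - case_rate(others)


def joint_case_rates(rows):
    """Case rate for every combination of the two markers."""
    return {
        (a, b): case_rate([r for r in rows if r[0] == a and r[1] == b])
        for a, b in product([0, 1], repeat=2)
    }


effect_a = marginal_effect(samples, 0)  # single-marker view of marker A
effect_b = marginal_effect(samples, 1)  # single-marker view of marker B
joint = joint_case_rates(samples)       # two-marker view

print("marginal effects:", effect_a, effect_b)
print("joint case rates:", joint)
```

An analysis restricted to independent markers would discard both markers here, which is the kind of lost signal that considering features jointly is meant to recover.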

In the clinical setting, VariantSpark enables personalised genomic insights at the point of care by finding "patients like mine", based on genomic similarity to other patients.
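As a rough sketch of the idea of ranking patients by genomic similarity, the example below scores genotype vectors with a simple identity-by-state measure and returns the closest matches to a query patient. The similarity measure, the 0/1/2 alternate-allele coding, the patient IDs and the function names are all assumptions chosen for illustration; the source does not state which measure VariantSpark uses.

```python
def ibs_similarity(g1, g2):
    """Identity-by-state similarity between two genotype vectors.

    Each entry is an alternate-allele count (0, 1 or 2). The result is the
    fraction of alleles shared across all sites, between 0 and 1.
    """
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    shared = sum(2 - abs(a - b) for a, b in zip(g1, g2))
    return shared / (2 * len(g1))


def patients_like(query, cohort, k=2):
    """Return the IDs of the k cohort patients most similar to the query."""
    ranked = sorted(
        cohort,
        key=lambda pid: ibs_similarity(query, cohort[pid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical cohort: patient ID -> genotypes at five variant sites.
cohort = {
    "P1": [0, 1, 2, 0, 1],
    "P2": [2, 1, 0, 2, 1],
    "P3": [0, 1, 2, 1, 1],
    "P4": [1, 1, 1, 1, 1],
}
query = [0, 1, 2, 0, 0]

nearest = patients_like(query, cohort, k=2)
print(nearest)
```

With millions of sites per patient, an exhaustive comparison like this would need to be distributed, which is where a Spark-based framework becomes relevant.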

An image showing a flow diagram. A box labelled "Wide" data flows to a VariantSpark box (machine learning for high-dimensional data, operating on core Spark), which flows to an Insights box (finding disease genes in people).

VariantSpark is machine learning for high-dimensional data. It can be used for applications such as disease gene discovery.
