fastqq R package

Creating quantile-quantile plots faster

TLDR: This is a wrap-up of a small project I made in 2021, where I managed to speed up the creation of QQ-plots by a factor of 80 in the case I was working on (project on GitHub).

In 2021 I created some quantile-quantile plots at work, mainly using packages such as qqman. A QQ-plot is an analysis plot that compares the distribution of the data with another distribution (typically a theoretical one). You can use these kinds of plots to informally answer questions like: “Is my data from a normal distribution?” or “Are all these p-values just random?”. QQ-plots are not a statistical test, but a quick way to gauge whether the similarity of the two distributions should be investigated further.

The qqman package is implemented in R and targeted at Genome Wide Association Studies (GWAS). A GWAS has millions of p-values, and one of the objectives of a quantile-quantile analysis is to compare the distribution of the GWAS p-values to a uniform distribution. If there is a large deviation from the uniform distribution (an enrichment of strong associations), we informally assume that the associations from the GWAS are not just noise, and that there are possibly some biologically relevant signals in the data. When the qqman package was implemented, the number of associations in a typical GWAS was not as large as it is today, and more and more genetic variants are continuously being discovered and added to these analyses.
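To make the comparison with the uniform distribution concrete, here is a minimal sketch of how the coordinates of such a plot can be computed. It is not code from qqman or fastqq: the name `qq_points`, the `QQPoint` struct, and the plotting positions (i - 0.5)/n are assumptions chosen for illustration. Both axes are on the -log10 scale, so strong associations end up in the upper right corner.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One point of the QQ-plot, both coordinates on the -log10 scale.
struct QQPoint {
    double expected;  // -log10 of the quantile expected under a uniform distribution
    double observed;  // -log10 of the observed (sorted) p-value
};

// Hypothetical helper (not part of qqman or fastqq).
// Takes the p-values by value because they are sorted in place.
std::vector<QQPoint> qq_points(std::vector<double> pvalues) {
    const std::size_t n = pvalues.size();
    std::sort(pvalues.begin(), pvalues.end());  // smallest p-value first

    std::vector<QQPoint> points;
    points.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        // p-values must lie in (0, 1] for the logarithm to be finite.
        assert(pvalues[i] > 0.0 && pvalues[i] <= 1.0);
        // Assumed plotting position for the (i + 1)-th smallest of n values.
        const double expected_p =
            (static_cast<double>(i) + 0.5) / static_cast<double>(n);
        points.push_back({-std::log10(expected_p), -std::log10(pvalues[i])});
    }
    return points;
}
```

If the p-values really are uniform, the points fall close to the diagonal where `observed` equals `expected`; an enrichment of small p-values shows up as points rising above that diagonal.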

The scatter plot function in R has a constant time and memory overhead for each added point, which meant that for tens of millions of p-values it took more than 10 minutes to create one of these plots. A typical modern GWAS (as of 2022, when this was written) will have 20-100 million p-values. Another issue is that the plot object that is created contains all the underlying data. This object is bulky, and when I was using RStudio it could crash or hang.

I decided to take the matter into my own hands and implement a solution. Looking at the QQ-plots, it is obvious that a lot of the points are redundant. The points are plotted so close to each other that removing many of them has no visible impact on the final plot. Because I had always wanted to release a package on CRAN with part of the implementation in C++, I decided to write this pruning part in C++, using Rcpp. I also added drop-in replacement functions for the qqnorm and qqplot functions from the stats package.
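The pruning idea can be sketched in a few lines of plain C++. This is only an illustration of one simple approach, not necessarily what the package does internally: snap every point to a grid cell and keep the first point that lands in each cell. The names `Point`, `prune_points` and `cell_size` are hypothetical, and the sketch leaves out the Rcpp glue.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

struct Point {
    double x;
    double y;
};

// Hypothetical helper: keep at most one point per grid cell of side cell_size.
// Points that share a cell would be drawn on top of each other anyway
// when cell_size is smaller than what the plot can resolve.
std::vector<Point> prune_points(const std::vector<Point>& points,
                                double cell_size) {
    assert(cell_size > 0.0);
    std::set<std::pair<std::int64_t, std::int64_t>> seen;  // occupied cells
    std::vector<Point> kept;
    for (const Point& p : points) {
        const auto cell = std::make_pair(
            static_cast<std::int64_t>(std::floor(p.x / cell_size)),
            static_cast<std::int64_t>(std::floor(p.y / cell_size)));
        // insert() reports whether the cell was new; keep only the first hit.
        if (seen.insert(cell).second) {
            kept.push_back(p);
        }
    }
    return kept;
}
```

With this approach the number of points that survive is bounded by the number of grid cells the curve passes through rather than by the number of p-values, which is why the bulk of a plot with tens of millions of points collapses to a small fraction of its size while the sparse tail of strong associations is kept.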

The package is now on CRAN, and the development version is on GitHub. I do not plan any major changes, but please post issues or feature requests on GitHub.

Guðmundur Einarsson
Research Scientist

I am interested in statistics, genetics and machine learning.
