Covid-19 serology studies: a meta-analysis using hierarchical modelling

Serology studies are front-and-center in the news these days. Reports out of Santa Clara County, California; San Miguel County, Colorado; and Los Angeles suggest that a non-trivial fraction, more than 1%, of the population has SARS-CoV-2 antibodies in their bloodstream. European cities are following suit: they too are conducting serology studies and finding sizeable fractions. The catch is that many of these studies find an antibody prevalence comparable to the false positive rate of their respective serology tests. The low statistical power of each study has invited criticism, in particular that the results cannot be trusted and that the study authors should temper their conclusions.

But all is not lost. Jerome Levesque (also a federal data scientist and the manager of the PSPC data science team) and I performed a meta-analysis on the results from Santa Clara County (CA), Los Angeles County (CA), San Miguel County (CO), Chelsea (MA), Geneva (Switzerland), and Gangelt (Germany). We used hierarchical Bayesian modelling with Markov Chain Monte Carlo (MCMC), and also generalized linear mixed modelling (GLMM) with bootstrapping. By painstakingly sleuthing through pre-prints, local government websites, scientific briefs, and study spokesperson media interviews, we not only obtained the data from each study, but also found details of the serology test used in each study. In particular, we obtained data on each serology test’s false positive rate and false negative rate through manufacturer websites and other academic studies. We take the data at face value and we do not correct for any demographic bias that might exist in the studies.

Armed with this data, we build a generalized linear mixed model and a full Bayesian model with a set of hyper-priors. The GLMM does the usual shrinkage estimation across the study results and across the serology test false positive/negative rates, while the Bayesian model generates a multi-dimensional posterior distribution that includes not only the false positive/negative rates but also the prevalence. We use Stan for the MCMC estimation. With the GLMM, we estimate the prevalence by bootstrapping with the shrunk estimators, including the false positive/negative rates. Both methods give similar results.
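To make the role of the false positive/negative rates concrete, here is a minimal single-study sketch in Python. It is not our hierarchical Stan model or the GLMM: it assumes a flat prior on prevalence, treats the test's sensitivity and specificity as known constants, and evaluates the posterior on a grid. The function names and all the numbers (40 positives out of 1,000 tested, 90% sensitivity, 98% specificity) are hypothetical and chosen only for illustration.

```python
import math


def apparent_positive_rate(prevalence, sensitivity, specificity):
    """Probability that a randomly chosen participant tests positive."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos + false_pos


def prevalence_posterior(positives, tested, sensitivity, specificity,
                         grid_size=2001):
    """Grid posterior for prevalence under a flat prior (an assumption)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_like = []
    for prev in grid:
        p = apparent_positive_rate(prev, sensitivity, specificity)
        # Clamp to keep log() finite if the test is assumed perfect.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        log_like.append(positives * math.log(p)
                        + (tested - positives) * math.log(1.0 - p))
    # Subtract the peak before exponentiating to avoid underflow.
    peak = max(log_like)
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    return grid, [w / total for w in weights]


# Hypothetical study: 40 positives out of 1,000 tested.
grid, posterior = prevalence_posterior(
    positives=40, tested=1000, sensitivity=0.90, specificity=0.98)

# Posterior mode, and the probability mass below 1% prevalence.
mode = grid[posterior.index(max(posterior))]
mass_below_1pct = sum(w for g, w in zip(grid, posterior) if g < 0.01)

print(f"posterior mode: {mode:.4f}")
print(f"P(prevalence < 1%): {mass_below_1pct:.3f}")
```

In this toy setting the raw positive rate is 4%, but half of that is what the assumed 2% false positive rate would produce on its own, so the posterior sits well below the raw rate. The full models additionally treat the false positive/negative rates as uncertain and pool information across studies, which this sketch does not attempt.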

We find that there is evidence of high levels of antibody prevalence (greater than 1%) across all reported locations, but also that a significant probability mass exists for levels lower than the ones reported in the studies. As an important example, Los Angeles County shows a mode of approximately 4%, which would mean that about 400,000 people in the county have SARS-CoV-2 antibodies. Given the importance of determining society-wide exposure to SARS-CoV-2 for correct inferences of the infection fatality rate and for support to contact tracing, we feel that the recent serology studies contain an important and strongly suggestive signal.

Our inferred distributions for each location:

Prevalence density functions (marginal posterior distribution) from the Bayesian MCMC estimation.
Prevalence density functions from the GLMM bootstrap.
Prevalence with the false positive rate (Bayesian MCMC).
Prevalence with the false positive rate (GLMM bootstrap).

2 thoughts on “Covid-19 serology studies: a meta-analysis using hierarchical modelling”

  1. Hi, these are pretty interesting. In a similar vein (also motivated by all the arguments about Santa Clara in particular: ) ), I came up with an algorithm that directly computes the prevalence posterior for an imperfect test. You can run it at

    (there’s also a preprint describing the math and the algorithm: )

    It’s a beta-binomial model at heart, and the math is not very complicated. But it results in a very efficient MC algorithm that runs directly in the browser in JavaScript. I am hoping research groups can use this to get quick estimates of prevalence on the fly, without needing to be well versed in Python, R, Stan, Bayesian statistics, etc. There are several examples linked at the site: all the various versions of the Santa Clara study and one from Kobe, Japan. I will include the relevant examples from your paper shortly. All feedback appreciated!

    One thing worth noting is that the prevalence posterior acquires a second mode at zero when the false-positive rate is sufficiently close to the prevalence. You can see that in this plot of the original Santa Clara study, Scenario 3:

    The implementation is also on github:
