Covid-19 branching model: details and calibration to data

Last month my collaborators, Jerome Levesque and David Shaw, and I built a branching process model for describing Covid-19 propagation in communities. In my previous blog post, I gave a heuristic description of how the model works. In this post, I want to expand on some of the technical aspects of the model and show how the model can be calibrated to data.

The basic idea behind the model is that an infected person creates new infections throughout her communicable period. That is, an infected person “branches” as she generates “offspring”. This model is an approximation of how a communicable disease like Covid-19 spreads. In our model, we assume that we have an infinite reservoir of susceptible people for the virus to infect. In reality, the susceptible population declines – over periods of time that are much longer than the communicable period, the recovered population pushes back on the infection process as herd immunity builds. SIR and other compartmental models capture this effect. But over the short term, and especially when an outbreak first starts, disease propagation really does look like a branching process. The nice thing about branching processes is that they are stochastic, and have lots of amazing and deep relationships that allow you to connect observations back to the underlying mechanisms of propagation in an identifiable way.

In our model, both the number of new infections and the length of the communicable period are random. Given a communicable period, we model the number of infections generated, $Q(t)$, as a compound Poisson process,

(1) $\begin{equation*}Q(t) = \sum_{i=1}^{N(t)} \, Y_i,\end{equation*}$

where $N(t)$ is the number of infection events that arrived up to time $t$, and $Y_i$ is the number infected at each infection event. We model $Y_i$ with the logarithmic distribution,

(2) $\begin{equation*}\mathbb{P}(Y_i =k) = \frac{-1}{\ln(1-p)}\frac{p^k}{k}, \hspace{2em} k \in \{1,2,3,\ldots\},\end{equation*}$

which has mean $\mu = -\frac{p}{(1-p)\ln(1-p)}$. The infection events arrive as a Poisson process with rate $\lambda$, that is, with exponentially distributed inter-arrival times. The characteristic function for $Q(t)$ reads,

(3) $\begin{align*}\phi_{Q(t)}(u) &=\mathbb{E}[e^{iuQ(t)}] \\ &= \exp\left(rt\ln\left(\frac{1-p}{1-pe^{iu}}\right)\right) \\ &= \left(\frac{1-p}{1-pe^{iu}}\right)^{rt},\end{align*}$

with $\lambda = -r\ln(1-p)$ and thus $Q(t)$ follows a negative binomial process,

(4) $\begin{equation*}Q(t) \sim \mathrm{NB}(rt,p).\end{equation*}$
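The step from eqs. (1)-(3) to eq. (4) can be checked numerically. The sketch below is only an illustration, not our production code: it simulates the compound Poisson process directly (exponential inter-arrival times, logarithmic batch sizes) and compares the sample mean and variance with the negative binomial values $rtp/(1-p)$ and $rtp/(1-p)^2$. The function names and the parameter values are made up for the example.

```python
import math
import random


def sample_logarithmic(p, rng):
    """Draw Y from the logarithmic distribution of eq. (2) by CDF inversion."""
    u = rng.random()
    k = 1
    prob = -p / math.log(1.0 - p)      # P(Y = 1)
    cdf = prob
    while u > cdf and prob > 0.0:
        prob *= p * k / (k + 1)        # P(Y = k + 1) obtained from P(Y = k)
        k += 1
        cdf += prob
    return k


def sample_q(r, p, t, rng):
    """Draw Q(t) of eq. (1): Poisson arrivals, logarithmic batch sizes."""
    lam = -r * math.log(1.0 - p)       # arrival rate, lambda = -r ln(1 - p)
    total = 0
    clock = rng.expovariate(lam)       # time of the first infection event
    while clock <= t:
        total += sample_logarithmic(p, rng)
        clock += rng.expovariate(lam)  # exponential gap to the next event
    return total


rng = random.Random(1)
r, p, t = 2.0, 0.4, 1.5                # illustrative values only
draws = [sample_q(r, p, t, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)

print("mean:    ", sample_mean, "vs NB value", r * t * p / (1.0 - p))
print("variance:", sample_var, "vs NB value", r * t * p / (1.0 - p) ** 2)
```

The variance exceeding the mean is the over-dispersion that the negative binomial is used to capture.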

The negative binomial process is important here. Clinical observations suggest that a minority of infected people account for most Covid-19 transmission, each infecting many others. Research suggests that the negative binomial distribution describes the number of infections from infected individuals. In our process, during a communicable period, $t$, an infected individual infects $Q(t)$ people based on a draw from the negative binomial with mean $rtp/(1-p)$. The infection events occur continuously in time according to the Poisson arrivals. However, the communicable period, $t$, is in actuality a random variable, $T$, which we model as a gamma process,

(5) $\begin{equation*}f_{T(t)}(x) = \frac{b^{at}}{\Gamma(at)} x^{at-1}e^{-b x},\end{equation*}$

which has a mean of $\mathbb{E}[T] = at/b$. By promoting the communicable period to a random variable, the negative binomial process changes into a Lévy process with characteristic function,

(6) $\begin{align*}\mathbb{E}[e^{iuZ(t)}] &= \exp(-t\psi(-\eta(u))) \\ &= \left(1- \frac{r}{b}\ln\left(\frac{1-p}{1-pe^{iu}}\right)\right)^{-at},\end{align*}$

where $\eta(u)$, the Lévy symbol, and $\psi(s)$, the Laplace exponent, are respectively given by,

(7) $\begin{align*}\mathbb{E}[e^{iuQ(t)}] &= \exp(t\,\eta(u)) \\\mathbb{E}[e^{-sT(t)}] &= \exp(-t\,\psi(s)), \end{align*}$

and so,

(8) $\begin{align*}\eta(u) &= r\ln\left(\frac{1-p}{1-pe^{iu}}\right), \\\psi(s) &= a\ln\left(1 + \frac{s}{b}\right).\end{align*}$

$Z(t)$ is the random number of people infected by a single infected individual over her random communicable period and is further over-dispersed relative to a pure negative binomial process, getting us closer to driving propagation through super-spreader events. The characteristic function in eq.(6) for the number of infections from a single infected person gives us the entire model. The basic reproduction number $R_0$ is,

(9) $\begin{align*}R_0 &= \left(\frac{at\lambda}{b}\right)\left(\frac{-p}{\ln(1-p)(1-p)}\right).\end{align*}$
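As a rough check on eq. (9), the sketch below draws a gamma-distributed communicable period as in eq. (5), runs the compound Poisson infection process over that random period, and compares the average number of offspring with $R_0$. It also compares the variance with that of a negative binomial evaluated at the mean period, to show the extra over-dispersion. The helper name `sample_z` and all parameter values are assumptions made for this illustration.

```python
import math
import random


def sample_z(a, b, r, p, t, rng):
    """Draw Z(t): offspring of one individual over a random communicable period."""
    period = rng.gammavariate(a * t, 1.0 / b)   # shape a*t, rate b (scale 1/b)
    lam = -r * math.log(1.0 - p)                # arrival rate of infection events
    total = 0
    clock = rng.expovariate(lam)
    while clock <= period:
        # Logarithmic batch size by CDF inversion, eq. (2).
        u, k = rng.random(), 1
        prob = -p / math.log(1.0 - p)
        cdf = prob
        while u > cdf and prob > 0.0:
            prob *= p * k / (k + 1)
            k += 1
            cdf += prob
        total += k
        clock += rng.expovariate(lam)
    return total


rng = random.Random(2)
a, b, r, p, t = 5.0, 1.0, 0.5, 0.5, 1.0         # illustrative values only
lam = -r * math.log(1.0 - p)
r0 = (a * t * lam / b) * (-p / (math.log(1.0 - p) * (1.0 - p)))   # eq. (9)

draws = [sample_z(a, b, r, p, t, rng) for _ in range(50_000)]
z_mean = sum(draws) / len(draws)
z_var = sum((z - z_mean) ** 2 for z in draws) / (len(draws) - 1)
nb_var_fixed_period = r * (a * t / b) * p / (1.0 - p) ** 2

print("simulated mean:", z_mean, "vs R_0 from eq. (9):", r0)
print("simulated variance:", z_var, "vs fixed-period NB:", nb_var_fixed_period)
```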

From the characteristic function we can compute the total number of infections in the population through renewal theory. Given a random characteristic $\chi(t)$, such as the number of infectious individuals at time $t$ (e.g., $\chi(t) = \mathbb{I}(t \in [0,\lambda_x))$ where $\lambda_x$ is the random communicable period), the expectation of the process follows,

(10) $\begin{equation*}\mathbb{E}(n(t)) = \mathbb{E}(\chi(t)) + \int_0^t\mathbb{E}(n(t-u))\mathbb{E}(\xi(du)),\end{equation*}$

where $\xi(du)$ is the counting process (see our paper for details). When an outbreak is underway, the asymptotic behaviour for the expected number of counts is,

(11) $\begin{equation*}\mathbb{E}(n_\infty(t)) \sim \frac{e^{\alpha t}}{\alpha\beta},\end{equation*}$

where,

(12) $\begin{align*}\alpha &= \lambda\mu \left(1 - \left(\frac{b}{\alpha +b}\right)^{at}\right) \\\beta & = \frac{1}{\alpha}\left(1 - \frac{at\lambda \mu}{b}\left(\frac{b}{\alpha +b}\right)^{at+1}\right).\end{align*}$
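The first line of eq. (12) defines $\alpha$ only implicitly, so in practice it has to be solved numerically. A minimal sketch, assuming eq. (12) exactly as written and a supercritical process ($R_0 > 1$): $\alpha = 0$ always solves the equation, and the root of interest is the positive one, which lies below $\lambda\mu$. Plain bisection is used here as one convenient choice; the function name and the parameter values are hypothetical.

```python
import math


def malthusian_alpha(lam, mu, a, b, t=1.0, n_steps=200):
    """Positive root of alpha = lam*mu*(1 - (b/(alpha+b))**(a*t)), or None."""
    r0 = a * t * lam * mu / b
    if r0 <= 1.0:
        return None                    # this sketch only covers the outbreak case

    def f(alpha):
        return lam * mu * (1.0 - (b / (alpha + b)) ** (a * t)) - alpha

    # f is positive just above zero and negative at lam*mu, so the positive
    # root is bracketed by (0, lam*mu).
    lo, hi = 0.0, lam * mu
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r, p = 0.5, 0.5                                    # illustrative values only
lam = -r * math.log(1.0 - p)                       # arrival rate
mu = -p / ((1.0 - p) * math.log(1.0 - p))          # mean of eq. (2)
alpha = malthusian_alpha(lam, mu, a=5.0, b=1.0)

print("Malthusian parameter:", alpha)
print("doubling time:", math.log(2.0) / alpha)
```

The doubling time $\ln 2/\alpha$ follows directly from the $e^{\alpha t}$ growth in eq. (11).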

The parameter $\alpha$ is called the Malthusian parameter and it controls the exponential growth of the process; an outbreak grows when $\alpha > 0$. Because the renewal equation gives us eq.(11), we can build a Bayesian hierarchical model for inference with just cumulative count data. We take US county data, curated by the New York Times, to give us an estimate of the Malthusian parameter and therefore the local $R$-effective across the United States. We use clinical data to set the parameters of the gamma distribution that controls the communicable period. We treat the counties as random effects and estimate the model using Gibbs sampling in JAGS. Our estimation model is,

(13) $\begin{align*}\log(n) &= \alpha t + \gamma + a_i t + g_i + \epsilon \nonumber \\a_i &\sim \text{N}(0,\sigma_1^2) \nonumber \\g_i & \sim \text{N}(0,\sigma_2^2) \nonumber \\\epsilon &\sim \text{N}(0,\sigma^2),\end{align*}$

where $i$ is the county label; the variance parameters use half-Cauchy priors and the fixed and random effects use normal priors. We estimate the model and generate posterior distributions for all parameters. The result for the United States using data over the summer is the figure below:

Summer 2020 geographical distribution of R-effective across the United States: 2020-07-01 to 2020-08-20.

Over mid-summer, we see that the geographical distribution of $R_{eff}$ across the US singles out the Midwestern states and Hawaii as hot-spots while Arizona sees no county with exponential growth. We have the beginnings of a US county-based app which we hope to extend to other countries around the world. Unfortunately, count data on its own does not allow us to resolve the parameters of the compound Poisson process separately.
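To make the structure of eq. (13) concrete, the sketch below simulates log-counts for a set of hypothetical counties, each with its own random slope $a_i$ and intercept $g_i$, and then fits an ordinary least squares line per county. This is not the JAGS fit described above; per-county least squares is swapped in purely to show that the average slope recovers $\alpha$ and the spread of the slopes reflects $\sigma_1$. All names and numbers are invented for the example.

```python
import random
import statistics


def simulate_county(alpha, gamma, sigma1, sigma2, sigma, days, rng):
    """Simulate log(n) for one county following eq. (13)."""
    a_i = rng.gauss(0.0, sigma1)       # county-level slope deviation
    g_i = rng.gauss(0.0, sigma2)       # county-level intercept deviation
    return [(alpha + a_i) * t + gamma + g_i + rng.gauss(0.0, sigma)
            for t in days]


def ols_slope(xs, ys):
    """Least squares slope of ys regressed on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(3)
days = list(range(30))
alpha_true, sigma1_true = 0.05, 0.01   # hypothetical values
slopes = []
for _ in range(200):                   # 200 simulated counties
    log_n = simulate_county(alpha_true, 2.0, sigma1_true, 0.5, 0.1, days, rng)
    slopes.append(ols_slope(days, log_n))

mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
print("average county slope:", mean_slope, "(alpha =", alpha_true, ")")
print("spread of county slopes:", sd_slope, "(sigma_1 =", sigma1_true, ")")
```

A hierarchical fit does the same job in one pass, while also shrinking noisy county estimates toward the common $\alpha$.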

If we have complete information, which might be possible in a small community setting, like northern communities in Canada, prisons, schools, or offices, we can build a Gibbs sampler to estimate all the model parameters from data without having to rely on the asymptotic solution of the renewal equation.

Define a complete history of an outbreak as a set of $N$ observations taking the form of a 6-tuple:

(14) $\begin{equation*}(i,j,B_{i},D_{i},m_{i},o_{i}),\end{equation*}$

where,

$i$: index of individual, $j$: index of parent, $B_{i}$: time of birth, $D_{i}$: time of death, $m_{i}$: number of offspring birth events, $o_{i}$: number of offspring.

With the following summary statistics:

(15) $\begin{align*}L & = \sum_{i} (D_{i} - B_{i});\,\, \Lambda = \prod_{i} (D_{i} - B_{i}) \nonumber \\ M & = \sum_{i} m_{i};\,\, O = \sum_{i} o_{i} \nonumber \end{align*}$

we can build a Gibbs sampler over the model's parameters as follows:

(16) $\begin{align*}p\,|\,r,L,O & \sim \text{Beta}\left(a_{0} + O,b_{0} + r L\right) \nonumber \\r\,|\,p,L,M & \sim \text{Gamma}\left(\eta_{0}+M,\rho_{0}-L\log(1-p)\right) \nonumber \\b\,|\,a,L,N & \sim \text{Gamma}\left(\gamma_{0}+aN,\delta_{0}+L\right)\nonumber \\a\,|\,b,\Lambda,N & \sim\text{GammaShape}\left(\epsilon_{0}\Lambda,\zeta_{0}+N,\theta_{0}+N\right)\end{align*}$

where $a_0, b_0, \eta_0, \rho_0, \gamma_0, \delta_0, \zeta_0, \epsilon_0, \theta_0$ are hyper-parameters.
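The first three conditionals in eq. (16) are standard Beta and Gamma draws, so they are easy to sketch. The code below is a minimal illustration under several assumptions: the Gamma distributions are read as shape and rate, the shape parameter $a$ is held fixed (the GammaShape update is not attempted here), the hyper-parameters are all set to 1, and the summary statistics are hypothetical numbers chosen to be consistent with $r = 0.5$, $p = 0.5$, $b = 1$ rather than real data.

```python
import math
import random


def gibbs_sampler(L, M, O, N, a, n_iter, rng,
                  a0=1.0, b0=1.0, eta0=1.0, rho0=1.0, gamma0=1.0, delta0=1.0):
    """Gibbs draws of (p, r, b) from eq. (16) with the shape a held fixed."""
    p, r = 0.3, 1.0                    # arbitrary starting point
    samples = []
    for _ in range(n_iter):
        # p | r, L, O ~ Beta(a0 + O, b0 + r L)
        p = rng.betavariate(a0 + O, b0 + r * L)
        # r | p, L, M ~ Gamma(shape eta0 + M, rate rho0 - L log(1 - p))
        rate_r = rho0 - L * math.log(1.0 - p)
        r = rng.gammavariate(eta0 + M, 1.0 / rate_r)
        # b | a, L, N ~ Gamma(shape gamma0 + a N, rate delta0 + L)
        b = rng.gammavariate(gamma0 + a * N, 1.0 / (delta0 + L))
        samples.append((p, r, b))
    return samples


# Hypothetical summary statistics (not real outbreak data).
L, M, O, N = 10_000.0, 3466, 5000, 2000
rng = random.Random(4)
chain = gibbs_sampler(L, M, O, N, a=5.0, n_iter=3000, rng=rng)
kept = chain[500:]                     # discard burn-in draws

p_mean = sum(s[0] for s in kept) / len(kept)
r_mean = sum(s[1] for s in kept) / len(kept)
b_mean = sum(s[2] for s in kept) / len(kept)
print("posterior means (p, r, b):", p_mean, r_mean, b_mean)
```

Note that Python's `random.gammavariate` takes a shape and a scale, hence the `1.0 / rate` in each Gamma draw.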

Over periods of time that are comparable to the communicable window, such that increasing herd immunity effects are negligible, a pure branching process can describe the propagation of Covid-19. We have built a model that matches the features of this disease – high variance in infection counts from infected individuals with a random communicable period. We see promise in our model’s application to small population settings as an outbreak gets started.

2 thoughts on “Covid-19 branching model: details and calibration to data”

Matt Hurst says:

September 22, 2020 at 6:17 pm

Really interesting work! Can I ask a question? Was there a specific need for a Bayesian framework versus frequentist? A philosophical choice or something else? Just curious. Sometimes it’s not so clear to me why one approach is taken over another. Cheers,
Matt

1. David says:
  
  September 23, 2020 at 1:56 am
  
  Thanks, Matt. We rely heavily on mixed modelling techniques and Bayesian methods work rather naturally in those settings. There are ways to do mixed modelling without Bayesian methods – we use whatever takes best advantage of the data. If you would like to learn more about mixed modelling, including MCMC approaches, you might find Data Analysis Using Regression and Multilevel/Hierarchical Models, by Andrew Gelman, and Jennifer Hill worth a peek.
