Ontario ICU arrivals and departures as a Skellam process

Figure 1: Net daily change in the Ontario ICU census with Skellam-inferred rates. The red lollipop points are days of net inflow and the blue lollipop points are days of net outflow. The red and blue ribbons are the inferred daily ICU arrival and departure rates from the Skellam process. Lockdown imposition and lifting dates are indicated.

As Ontario enters a new lockdown, the main concern is a potential overrun of our ICU capacity. A year ago, the Ontario government announced a planned major expansion of provincial ICU capacity, but, given our current limits, it appears that either the plan never came to fruition or that the efforts woefully underestimated possible COVID-19 ICU demand. The Ontario COVID-19 data portal gives the daily ICU COVID-19 census, from which we can compute daily changes. As of April 4, Ontario has 451 ICU patients associated with COVID-19. The daily changes give us the net flow, but the census alone does not tell us how many people actually arrived or departed each day. When we see a daily uptick of 25 people, is that 25 arrivals with no departures, or is it 50 arrivals with 25 departures? Both scenarios give the same net change.

Even though the COVID-19 data show only the net change, we can infer the arrival and departure rates using the Skellam distribution. The Skellam distribution is the distribution of the difference of two independent Poisson random variables, k = n_1 - n_2, obtained by convolving one Poisson distribution with the reflection of the other, namely,

(1)   \begin{equation*} s(k; \mu_1,\mu_2) = \sum_{n=-\infty}^\infty\, p(k+n; \mu_1) p(n; \mu_2),\end{equation*}

where s(k; \mu_1, \mu_2) has support on \{\ldots, -2,-1,0,1,2,\dots\}, and p(k;\mu) = \frac{\mu^k}{k!}e^{-\mu} is the Poisson probability mass function. The resulting distribution can be expressed in terms of the modified Bessel function,

(2)   \begin{equation*}s(k; \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)}\left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_{|k|}(2\sqrt{\mu_1\mu_2}) .\end{equation*}

I take the daily difference in the Ontario ICU census data and estimate the Skellam distribution over a 21-day sliding window. The estimated Skellam parameters are the ribbons in figure 1, smoothed with cross-validated cubic splines and superimposed on the daily net flow. The red ribbon is the arrival process and the blue ribbon is the departure process. Each lollipop point is the net change in the ICU occupation on that day. In figure 2 I show a Q-Q plot of the Pearson residuals from the Skellam model – the model is a reasonably good description of the data generating process. In my analysis I use the R packages skellam and mgcv (for spline estimation). For clarity, I indicate the lockdown imposition and lockdown lifting dates. As of early April 2021, it looks like we are approaching an average of 50 admissions to the ICU every day with an average of about 40 departures.
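For readers who want to reproduce the idea, here is a minimal sketch of the sliding-window fit. It is not the exact code behind figure 1 (which uses the skellam and mgcv packages); to stay self-contained it implements the Skellam log-likelihood directly from eq. (2) with base R's Bessel function, and it assumes a data frame named icu with columns date and icu_census.

    # Minimal sketch (not the exact code behind figure 1): fit Skellam arrival and
    # departure rates over a 21-day sliding window of daily ICU census changes.
    # Assumes a data frame `icu` with columns `date` and `icu_census`.

    # Skellam log-pmf written directly from eq. (2), using an exponentially
    # scaled Bessel function for numerical stability.
    dskellam_log <- function(k, mu1, mu2) {
      sapply(k, function(ki) {
        -(mu1 + mu2) + (ki / 2) * log(mu1 / mu2) +
          log(besselI(2 * sqrt(mu1 * mu2), nu = abs(ki), expon.scaled = TRUE)) +
          2 * sqrt(mu1 * mu2)
      })
    }

    # Maximum likelihood fit of (mu1, mu2) for one window of net daily changes.
    fit_skellam <- function(k) {
      nll   <- function(par) -sum(dskellam_log(k, exp(par[1]), exp(par[2])))
      start <- log(pmax(c(mean(pmax(k, 0)), mean(pmax(-k, 0))), 0.5))
      exp(optim(start, nll)$par)   # c(arrival rate, departure rate)
    }

    net   <- diff(icu$icu_census)   # daily net change in the census
    win   <- 21                     # sliding window length in days
    rates <- t(sapply(win:length(net),
                      function(i) fit_skellam(net[(i - win + 1):i])))
    colnames(rates) <- c("arrival_rate", "departure_rate")

The fitted rates can then be smoothed and plotted against the net daily changes, as in figure 1.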

Figure 2: Regression diagnostics

Causation is hard to determine in a regression model such as this one, but we see that after the lockdown date of December 26, 2020, the arrival rate quickly plateaued only to begin rising again near the end of the lockdown period. It appears that the lockdown this winter had the effect of tempering arrivals into the ICU. But as always with statistics, we need to be cautious about making causal statements. The arrivals in the ICU are from people who were infected at an earlier date, probably in the neighbourhood of one week earlier. Thus, we see that the arrival rate remains high from the beginning of the lockdown on December 26 through the end of December and early January. We should expect this inertia from infections that occurred near or before the imposition of the lockdown, although there is a distinct possibility that people started to change their behaviour before the lockdown date. But notice that the arrival rate begins to rise sharply almost exactly coincident with the end of the lockdown on February 16, 2021. This observation suggests that infections were beginning to accelerate at least a week, maybe two, prior to the end of the lockdown. It appears that we would have seen a surge at some level in Ontario’s ICUs even if the lockdown had been maintained through the end of February or longer. Of course it’s quite probable that the resulting surge would have been smaller than the one we are currently witnessing, but to gain insight into such a counterfactual we need to know what caused the acceleration in infections prior to the end of the lockdown. If infections started to rise before the end of the lockdown, is it the result of the new, more contagious variants? Is it lockdown fatigue? Is it spread among essential workers? Maybe it’s a combination of effects; I don’t know. Lockdown efficacy rests on that understanding.

Notice that the arrival process turned downward, or at least plateaued, about two weeks after the lockdown date of December 26, 2020, but also notice that the daily number of departures has been steadily increasing throughout the entire winter. Of course we expect that as arrivals increase, departures will eventually increase as well – everyone who enters the ICU eventually leaves. In a steady-state queue, the average departure rate will equal the average arrival rate. With time variation in the arrival process, the departure rate will lag as the arrival rate waxes and wanes. It might be instructive to think about the average server load (ICU occupation) of an M/M/c queue (exponentially distributed inter-arrival times, exponentially distributed service times, with c servers) as an approximation to the ICU dynamics. In an M/M/c queue the average number of busy servers (ICU bed occupation) is,

(3)   \begin{equation*} \bar c = \frac{\mu}{\tau}, \end{equation*}

where \mu is the arrival rate and \tau is the average service rate (the reciprocal of the average length of stay). For example, given an arrival rate of 50 people per day and an average length of stay of 14 days in the ICU, the steady-state average number of people in the ICU would be 700 patients, provided the number of beds (servers) is larger than 700. I don’t know if the length of stay has been varying over time. It would be great to do a full time-dependent competing-risk survival analysis on that data, but the steady rise in the departure rate that we see over the winter might indicate that the length of stay in the ICU has also been changing with the arrival rate. I don’t know.
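A quick back-of-the-envelope version of eq. (3), using the illustrative numbers from the paragraph above:

    # Steady-state average ICU occupancy from eq. (3): arrivals per day divided
    # by the service rate, i.e. arrivals per day times the mean length of stay.
    arrival_rate <- 50      # patients per day (illustrative)
    mean_stay    <- 14      # average length of stay in days
    arrival_rate * mean_stay   # 700 occupied beds, provided capacity exceeds 700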

There are a number of ways to extend this work. First, the estimator is crude – instead of a naive 21-day sliding window, it could be built with a Bayesian filter. The Q-Q plot in figure 2 shows that the naive sliding window generates a good fit to the data, but if we want to back out robust estimates of the timing of the infections associated with the arrivals, it might be worth exploring filtering. Second, it is possible to promote the parameters of the Skellam process to functions of other observables such as positive test rates, lagged test rates, time, or any other appropriate regressor and then estimate the model through non-linear minimization. In fact, I initially approached the problem using a polynomial basis over time with regression coefficients, but the sliding window generates a better fit.

Where does this leave us? The precipitous drop in ICU arrivals throughout the summer of 2020 with low social distancing stringency and no lockdowns, along with the steady rise in arrivals beginning in the fall, suggests that Covid-19 will become part of the panoply of seasonal respiratory viruses that appear each year. Hopefully with the vaccines and natural exposure, future seasonal case loads will occur with muted numbers relative to our current experience. Given that the largest mitigating effect, next to vaccines, seems to be the arrival of spring and summer, lockdown or not, the warming weather accompanied by greater opportunities to be outdoors should have a positive effect. Covid-19 cases are plummeting across southern US states like Texas and Alabama, yet cases are rising in colder northern states like New York and New Jersey. While increased herd immunity in the US from higher levels of vaccination and higher levels of natural exposure must be having an effect on reducing spread, it really does seem like Covid-19 behaves as a seasonal respiratory illness. Hopefully we can expect better days ahead, but, in the end, it is very hard to maintain a large susceptible population in the face of a highly contagious virus that has already become widespread. Perhaps we should have been looking for better public health policies all along.

COVID-19 death count conspiracy theories don’t add up, but won’t someone think of the children?

The Canadian media reminds us daily of the COVID-19 death count. They never fail to mention each time we reach a grim milestone. But I got to thinking, how are COVID-19 deaths really counted? What is the effect size on mortality? In some cases it might be obvious, but I am sure there are many medical cases which are difficult to label, especially since severe illness from COVID-19 strongly tracks the frail, people with comorbidities, and people of advanced age. Are we under-counting or over-counting COVID-19 deaths? Conspiracy theories abound with wild claims about inflated COVID-19 death counts.

One way we might be able to help calibrate our counting is to look at changes in the Canadian mortality rate. Sure, labelling a COVID-19 death might be difficult, but counting deaths themselves should be much easier. Statistics Canada has several weekly mortality time series for the Canadian population. Using the website https://www.mortality.org/, I download the Statistics Canada mortality data. I have no way of determining the data's accuracy, but I will assume that the data are complete enough for my modelling purposes. As always, conclusions are only as good as the data we have.

The first plot shows the raw Statistics Canada mortality rate data. The data are split across cohorts along with the total population. We see some interesting patterns. First, the mortality rate in Canada has been increasing over the last ten years. This increase is not surprising as Canada’s median age is increasing – the effect is probably due to our aging population. Second, the death rate in our oldest cohorts is decreasing. We are probably seeing healthier people reaching these cohorts each year given that smoking rates decreased over the last 30 years and we are probably also seeing the effect of better medicine. That we see declines in the mortality rate of the oldest cohorts but an increasing rate for the entire population is an example of Simpson’s Paradox. The fraction of the Canadian population entering the oldest cohorts is increasing, but we all die at some point!

Annualized Canadian mortality rates by cohort with weekly observations.

A casual glance at the plot doesn’t seem to scream a large COVID-19 effect in 2020, but we have to be careful – looks can be deceiving. We have a long term trend superimposed on a noisy periodic time series and we are trying to detect changes at the tail end of the process. A nice way to proceed is to use generalized additive models (GAM). I will be using Simon Wood’s mgcv R package.

The model I have in mind is a GAM with a cyclic cubic spline with a 52-week period, s_1, and a trend spline, s_2,

(1)   \begin{equation*}\text{rate}_i = s_1(\text{week.number}_i) + s_2(\text{time}_i) + \epsilon_i.\end{equation*}

It’s a simple model, but it should be able to pull out a signal or trend, if one exists.
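A minimal mgcv sketch of eq. (1) might look like the following; the data frame and column names (mort, rate, week.number, time) are assumptions, and the basis dimensions are illustrative rather than the ones used for the figures.

    library(mgcv)

    # Sketch of eq. (1): a cyclic cubic spline over the week of the year plus a
    # smooth trend over continuous time. Assumes a data frame `mort` with
    # columns `rate`, `week.number` (1-52) and `time` (running week index).
    m <- gam(rate ~ s(week.number, bs = "cc", k = 20) + s(time, k = 30),
             knots = list(week.number = c(0.5, 52.5)),  # close the 52-week cycle
             data = mort, method = "REML")

    summary(m)          # effective degrees of freedom of the cyclic and trend terms
    plot(m, pages = 1)  # left: seasonal cycle; right: long-term trend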

Given that COVID-19 has proven more dangerous to the 75-84 year old cohort, I will start there. The second plot shows the results of the model normalized to the downward linear drift in the data. My model fits the data well – well enough to understand the broad features – but an examination of the residuals shows that, while approximately normal in their distribution, they exhibit some autocorrelation. There is subtle time series structure that my model misses, but for the purposes of detecting the long term trend, this model does reasonably well. We see that the cyclic part peaks during the fall/winter, and the spline trend term moves up sharply through 2020, the effect from COVID-19. The trend component also shows the years with a particularly strong flu season.

GAM: Cyclic and trend components with normalized mortality data for 75-84 year old cohort.

In the next figure I show the long term trend in the mortality rate for the 75-84 year old cohort relative to the background. We see that COVID-19 has lifted the mortality rate for this group by about 4.5%, about 4 times the effect of a bad flu year. The final figure shows the long term trend with the original data. Repeating this analysis for the entire population, we see that the overall effect of COVID-19 on the Canadian population is to lift the mortality rate by about 5% above background. We can see the effect in the next two figures.

GAM trend component relative to the background mortality rate in the 75-84 year old cohort.

This helps us answer the question about over-counting. If the effect is about 5% and if it is sustained for all of 2020, then, given that about 280,000 Canadians died in 2019, the results suggest an annual total COVID-19 effect of roughly 15,000 excess deaths in Canada. While it looks like the current official number of 21,000 COVID-19 deaths might be a little bit high, we need to temper that conclusion because of substitution effects. People travelled less in 2020, so some types of accidents are probably down, with deaths from COVID-19 making up part of the difference. Those types of substitution effects would mask the severity of COVID-19 in the mortality rate data. On the other hand, there are other substitution effects, such as self-harm, that have probably increased over 2020, which push up the overall mortality rate from non-COVID-19 (or at least indirect) causes. A much more detailed analysis of all the substitution effects would be required to really know just how much over- or under-counting is present in the COVID-19 death totals. The official count broadly accords with my simple GAM analysis of the mortality rate. My guess is that if there is bias in the official tally, it probably leans a touch in the over-counting direction, but nothing like the crazy conspiracy theories.

GAM: inferred trend in the mortality rate for the 75-84 year old cohort.
GAM: inferred trend in the mortality rate for all Canadians.

The shocking part of what the data teach us is in the 0-14 year old cohort. I repeat the same analysis and show the long term trend in the figures below. We see that in 2020, the mortality rate for children under 15 years old is up over 20%! That equates to approximately an extra 10 childhood deaths each week across Canada. We know that COVID-19 has a very low mortality rate among children, lower in fact than the seasonal flu, and yet we see a dramatic rise in childhood mortality in the data. This analysis is not causal, but if COVID-19 is an unlikely culprit, it must be some other environmental factor. We know that child abuse and domestic violence have increased as a result of stay-at-home orders. Teachers and school staff have a difficult time detecting abuse and neglect in online classes. When this pandemic ends, it will be very much worthwhile combing through the 0-14 year old cohort data and unpacking the effects of COVID-19 mitigation efforts on childhood well-being.

GAM trend component relative to the background mortality rate in the 0-14 year old cohort.
GAM: inferred trend in the mortality rate for children under 15 years old.

Covid-19 branching model: details and calibration to data

Last month my collaborators, Jerome Levesque and David Shaw, and I built a branching process model for describing Covid-19 propagation in communities. In my previous blog post, I gave a heuristic description of how the model works. In this post, I want to expand on some of the technical aspects of the model and show how the model can be calibrated to data.

The basic idea behind the model is that an infected person creates new infections throughout her communicable period. That is, an infected person “branches” as she generates “offspring”. This model is an approximation of how a communicable disease like Covid-19 spreads. In our model, we assume that we have an infinite reservoir of susceptible people for the virus to infect. In reality, the susceptible population declines – over periods of time that are much longer than the communicable period, the recovered population pushes back on the infection process as herd immunity builds. SIR and other compartmental models capture this effect. But over the short term, and especially when an outbreak first starts, disease propagation really does look like a branching process. The nice thing about branching processes is that they are stochastic, and have lots of amazing and deep relationships that allow you to connect observations back to the underlying mechanisms of propagation in an identifiable way.

In our model, both the number of new infections and the length of the communicable period are random. Given a communicable period, we model the number of infections generated, Q(t), as a compound Poisson process,

(1)   \begin{equation*}Q(t) = \sum_{i=1}^{N(t)} \, Y_i,\end{equation*}

where N(t) is the number of infection events that arrived up to time t, and Y_i is the number infected at each infection event. We model Y_i with the logarithmic distribution,

(2)   \begin{equation*}\mathbb{P}(Y_i =k) = \frac{-1}{\ln(1-p)}\frac{p^k}{k}, \hspace{2em} k \in \{1,2,3,\ldots\},\end{equation*}

which has mean, \mu = -\frac{p}{(1-p)\ln(1-p)}. The infection events arrive exponentially distributed in time with arrival rate \lambda. The characteristic function for Q(t) reads,

(3)   \begin{align*}\phi_{Q(t)}(u) &=\mathbb{E}[e^{iuQ(t)}] \\ &= \exp\left(rt\ln\left(\frac{1-p}{1-pe^{iu}}\right)\right) \\ &= \left(\frac{1-p}{1-pe^{iu}}\right)^{rt},\end{align*}

with \lambda = -r\ln(1-p) and thus Q(t) follows a negative binomial process,

(4)   \begin{equation*}Q(t) \sim \mathrm{NB}(rt,p).\end{equation*}
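As a sanity check on eq. (4), here is a small simulation sketch that draws the compound Poisson process with logarithmic jumps and compares its mean and variance with the negative binomial values rtp/(1-p) and rtp/(1-p)^2; the parameter values are illustrative only.

    set.seed(1)

    # Illustrative parameters (not calibrated values)
    p <- 0.4; r <- 1.5; t_comm <- 10
    lambda <- -r * log(1 - p)      # infection-event arrival rate

    # Draw from the logarithmic distribution by inverting its CDF on a truncated support
    rlogarithmic <- function(n, p, kmax = 1000) {
      k <- 1:kmax
      sample(k, n, replace = TRUE, prob = -p^k / (k * log(1 - p)))
    }

    # Simulate Q(t): a Poisson number of infection events with logarithmic group sizes
    q <- replicate(20000, {
      n_events <- rpois(1, lambda * t_comm)
      if (n_events == 0) 0 else sum(rlogarithmic(n_events, p))
    })

    c(sim_mean = mean(q), nb_mean = r * t_comm * p / (1 - p))      # should agree
    c(sim_var  = var(q),  nb_var  = r * t_comm * p / (1 - p)^2)    # should agree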

The negative binomial process is important here. Clinical observations suggest that Covid-19 is spread mostly by a minority of people, each of whom infects many others. Research suggests that the negative binomial distribution describes the number of infections generated by an infected individual. In our process, during a communicable period, t, an infected individual infects Q(t) people based on a draw from the negative binomial with mean rtp/(1-p). The infection events occur continuously in time according to the Poisson arrivals. However, the communicable period, t, is actually a random variable, T, which we model as a gamma process,

(5)   \begin{equation*}f_{T(t)}(x) = \frac{b^{at}}{\Gamma(at)} x^{at-1}e^{-b x},\end{equation*}

which has a mean of \mathbb{E}[T(t)] = at/b. By promoting the communicable period to a random variable, the negative binomial process changes into a Levy process with characteristic function,

(6)   \begin{align*}\mathbb{E}[e^{iuZ(t)}] &= \exp(-t\psi(-\eta(u))) \\ &= \left(1- \frac{r}{b}\ln\left(\frac{1-p}{1-pe^{iu}}\right)\right)^{-at},\end{align*}

where \eta(u), the Levy symbol, and \psi(s), the Laplace exponent, are respectively given by,

(7)   \begin{align*}\mathbb{E}[e^{iuQ(t)}] &= \exp(t\,\eta(u)) \\\mathbb{E}[e^{-sT(t)}] &= \exp(-t\,\psi(s)), \end{align*}

and so,

(8)   \begin{align*}\eta(u) &= r\ln\left(\frac{1-p}{1-pe^{iu}}\right), \\\psi(s) &= a\ln\left(1 + \frac{s}{b}\right).\end{align*}

Z(t) is the random number of people infected by a single infected individual over her random communicable period and is further over-dispersed relative to a pure negative binomial process, getting us closer to driving propagation through super-spreader events. The characteristic function in eq. (6) for the number of infections from a single infected person gives us the entire model. The basic reproduction number R_0 is,

(9)   \begin{align*}R_0 &= \left(\frac{at\lambda}{b}\right)\left(\frac{-p}{\ln(1-p)(1-p)}\right).\end{align*}

From the characteristic function we can compute the total number of infections in the population through renewal theory. Given a random characteristic \chi(t), such as the number of infectious individuals at time t (e.g., \chi(t) = \mathbb{I}(t \in [0,\lambda_x)), where \lambda_x is the random communicable period), the expectation of the process follows,

(10)   \begin{equation*}\mathbb{E}(n(t)) = \mathbb{E}(\chi(t)) + \int_0^t\mathbb{E}(n(t-u))\mathbb{E}(\xi(du)).\end{equation*}

where \xi(du) is the counting process (see our paper for details). When an outbreak is underway, the asymptotic behaviour for the expected number of counts is,

(11)   \begin{equation*}\mathbb{E}(n_\infty(t)) \sim \frac{e^{\alpha t}}{\alpha\beta},\end{equation*}

where,

(12)   \begin{align*}\alpha &= \lambda\mu \left(1 - \left(\frac{b}{\alpha +b}\right)^{at}\right) \\\beta & = \frac{1}{\alpha}\left(1 - \frac{at\lambda \mu}{b}\left(\frac{b}{\alpha +b}\right)^{at+1}\right).\end{align*}

The parameter \alpha is called the Malthusian parameter and it controls the exponential growth of the process; the outbreak grows when \alpha > 0. Because the renewal equation gives us eq. (11), we can build a Bayesian hierarchical model for inference with just cumulative count data. We take US county data, curated by the New York Times, to give us an estimate of the Malthusian parameter and therefore the local R-effective across the United States. We use clinical data to set the parameters of the gamma distribution that controls the communicable period. We treat the counties as random effects and estimate the model using Gibbs sampling in JAGS. Our estimation model is,

(13)   \begin{align*}\log(n) &= \alpha t + \gamma + a_i t + g_i + \epsilon \nonumber \\a_i &\sim \text{N}(0,\sigma_1^2) \nonumber \\g_i & \sim \text{N}(0,\sigma_2^2) \nonumber \\\epsilon &\sim \text{N}(0,\sigma^2),\end{align*}

where i is the county label; the variance parameters use half-Cauchy priors and the fixed and random effects use normal priors. We estimate the model and generate posterior distributions for all parameters. The result for the United States using data over the summer is the figure below:

Summer 2020 geographical distribution of R-effective across the United States: 2020-07-01 to 2020-08-20.

Over the mid-summer, we see that the geographical distribution of R_{eff} across the US singles out the Midwestern states and Hawaii as hot-spots while Arizona sees no county with exponential growth. We have the beginnings of a US county based app which we hope to extend to other countries around the world. Unfortunately, count data on its own does not allow us to resolve the parameters of the compound Poisson process separately.
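To make the random-effects structure in eq. (13) concrete, here is a hedged rjags sketch; it is not our production code, and the data list, the priors, and the half-Cauchy scale are assumptions.

    library(rjags)

    # Illustrative sketch of eq. (13); not our production code. `dat` is assumed
    # to hold y (log cumulative counts), t (time), county (index), N and J.
    model_string <- "
    model {
      for (n in 1:N) {
        y[n] ~ dnorm(alpha * t[n] + gamma + a[county[n]] * t[n] + g[county[n]],
                     pow(sigma, -2))
      }
      for (j in 1:J) {
        a[j] ~ dnorm(0, pow(sigma1, -2))      # county random slopes
        g[j] ~ dnorm(0, pow(sigma2, -2))      # county random intercepts
      }
      alpha  ~ dnorm(0, 0.001)                # fixed growth-rate term
      gamma  ~ dnorm(0, 0.001)                # fixed intercept
      sigma  ~ dt(0, pow(2.5, -2), 1) T(0,)   # half-Cauchy scale priors
      sigma1 ~ dt(0, pow(2.5, -2), 1) T(0,)
      sigma2 ~ dt(0, pow(2.5, -2), 1) T(0,)
    }"

    jm   <- jags.model(textConnection(model_string), data = dat,
                       n.chains = 3, n.adapt = 2000)
    post <- coda.samples(jm, c("alpha", "a", "sigma1"), n.iter = 10000)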

If we have complete information, which might be possible in a small community setting, like northern communities in Canada, prisons, schools, or offices, we can build a Gibbs sampler to estimate all the model parameters from data without having to rely on the asymptotic solution of the renewal equation.

Define a complete history of an outbreak as a set of N observations taking the form of a 6-tuple:

(14)   \begin{equation*}(i,j,B_{i},D_{i},m_{i},o_{i}),\end{equation*}

where,

i: index of individual, j: index of parent, B_{i}: time of birth, D_{i}: time of death, m_{i}: number of offspring birth events, o_{i}: number of offspring.

With the following summary statistics:

(15)   \begin{align*}L & = \sum_{i} (D_{i} - B_{i});\,\,  \Lambda  = \prod_{i} (D_{i} - B_{i}) \nonumber \\ M & = \sum_{i} m_{i};\,\,  O = \sum_{i} o_{i} \nonumber \end{align*}

we can build a Gibbs sampler over the model’s parameters as follows:

(16)   \begin{align*}p\,|\,r,L,O & \sim \text{Beta}\left(a_{0} + O,b_{0} + r L\right) \nonumber \\r\,|\,p,L,M & \sim \text{Gamma}\left(\eta_{0}+M,\rho_{0}-L\log(1-p)\right) \nonumber \\b\,|\,a,L,N & \sim \text{Gamma}\left(\gamma_{0}+aN,\delta_{0}+L\right)\nonumber \\a\,|\,b,\Lambda,N & \sim\text{GammaShape}\left(\epsilon_{0}\Lambda,\zeta_{0}+N,\theta_{0}+N\right)\end{align*}

where a_0, b_0, \eta_0, \rho_0, \gamma_0, \delta_0, \epsilon_0, \zeta_0, \theta_0 are hyper-parameters.
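To give a flavour of how these updates could be coded, here is a hedged sketch. The hyper-parameter values and the Gamma prior on a are assumptions, and the paper's non-standard GammaShape conditional is replaced by a Metropolis-within-Gibbs step on log(a) that targets the same full conditional (the product Λ enters only through its logarithm).

    # Sketch of the Gibbs sampler in eq. (16) from a complete outbreak history.
    # x = communicable periods D_i - B_i, m = offspring birth events per
    # individual, o = offspring per individual. Hyper-parameters and the
    # Gamma(shape_prior, rate_prior) prior on `a` are assumptions; the
    # non-standard GammaShape conditional is handled by a Metropolis step.
    gibbs_sketch <- function(x, m, o, n_iter = 5000,
                             a0 = 1, b0 = 1, eta0 = 1, rho0 = 1,
                             gamma0 = 1, delta0 = 1,
                             shape_prior = 1, rate_prior = 1) {
      N <- length(x); L <- sum(x); M <- sum(m); O <- sum(o)
      sum_log_x <- sum(log(x))            # log of Lambda in eq. (15)
      p <- 0.5; r <- 1; a <- 1; b <- 1
      draws <- matrix(NA_real_, n_iter, 4,
                      dimnames = list(NULL, c("p", "r", "a", "b")))

      # Full conditional of the gamma shape `a` (up to an additive constant)
      log_cond_a <- function(a, b) {
        N * a * log(b) - N * lgamma(a) + (a - 1) * sum_log_x +
          dgamma(a, shape_prior, rate_prior, log = TRUE)
      }

      for (it in seq_len(n_iter)) {
        p <- rbeta(1, a0 + O, b0 + r * L)
        r <- rgamma(1, shape = eta0 + M, rate = rho0 - L * log(1 - p))
        b <- rgamma(1, shape = gamma0 + a * N, rate = delta0 + L)
        a_prop  <- a * exp(rnorm(1, 0, 0.1))   # random-walk proposal on log(a)
        log_acc <- log_cond_a(a_prop, b) - log_cond_a(a, b) + log(a_prop / a)
        if (log(runif(1)) < log_acc) a <- a_prop
        draws[it, ] <- c(p, r, a, b)
      }
      draws
    }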

Over periods of time that are comparable to the communicable window, such that increasing herd immunity effects are negligible, a pure branching process can describe the propagation of Covid-19. We have built a model that matches the features of this disease – high variance in infection counts from infected individuals with a random communicable period. We see promise in our model’s application to small population settings as an outbreak gets started.

An effective theory of Covid-19 propagation

To fight Covid-19, we need an understanding of how the virus propagates in a community. Right now, the workhorse engines in the literature and in the media are compartmental models: (S)usceptible (I)nfected (R)ecovered and its cousins. The most popular versions of these models start with a set of coupled ordinary differential equations which, when solved, generate paths for each compartment. For example, in the simple SIR model, the infection starts with a large susceptible population which diminishes as the infected population rises. The infected population eventually declines as the recovered population increases and eventually the infected population goes to zero as the outbreak ends. The differential equations govern the dynamics of each compartment, generating equations of motion for the population.

Covid-19 propagation as a gamma subordinated negative binomial branching process

SIR models work well when we are discussing large populations, when we are interested in population averages, and when the random nature of the transmission doesn’t particularly matter. The solution to the coupled set of differential equations is deterministic. But when an outbreak gets started or when we are interested in the dynamics in the small population limit, we need more than deterministic models. An SIR compartmental model, with its focus on averages, is not enough when dealing with the very early stages of an outbreak – and it’s the early stages where we really want our mitigation strategies to be the most successful. We need a stochastic model of transmission to understand the early stages of an outbreak.

Together with my colleagues Jerome Levesque and David Shaw, I built a branching model of Covid-19 propagation. The idea is that an infected person randomly infects other people over the course of the communicable period. That is, we model transmission by imagining that an infected person generates “offspring”, continuously in time, during the communicable period and that each “child” follows the same statistical law for generating more “offspring”. The infections branch out from each infected person into a tree that makes up the infected community. So while on average an infected person will infect R0 other people (the basic reproduction number) during the communicable period, there is a range of possible outcomes. We could get lucky and an initially infected person might not spread the virus at all, or we could get unlucky and the initially infected person might become a super-spreader, all in a model with the same R0. In fact, even with R0>1, there can be a substantial probability that the outbreak will go extinct on its own, all depending on the statistical branching nature of transmission.

In some research communities, there is a tendency to use agent-based models to capture the stochastic nature of an outbreak. Such models simulate the behaviour of many different individuals or "agents" in a population by assigning a complicated set of transmission rules to each person. In the quest for high fidelity, agent-based models tend to have lots of parameters. While agent-based approaches have merit, and they have enjoyed success in many fields, we feel that in this context these models are often too difficult to interpret, contain many layers of hidden assumptions, are extraordinarily difficult to calibrate to data while containing lots of identifiability issues, are easily susceptible to chaotic outputs, and obscure trade-off analysis for decision makers. In a decision-making context we need a parsimonious model, one that gets the essentials correct and generates insight for trade-off analysis that decision makers can use. We need an effective theory of Covid-19 propagation in which we project all the complicated degrees of freedom of the real world down to a limited number of free parameters around which we can build statistical estimators.

The abstract of our paper:

We build a parsimonious Crump-Mode-Jagers continuous time branching process of Covid-19 propagation based on a negative binomial process subordinated by a gamma subordinator. By focusing on the stochastic nature of the process in small populations, our model provides decision making insight into mitigation strategies as an outbreak begins. Our model accommodates contact tracing and isolation, allowing for comparisons between different types of intervention. We emphasize a physical interpretation of the disease propagation throughout which affords analytical results for comparison to simulations. Our model provides a basis for decision makers to understand the likely trade-offs and consequences between alternative outbreak mitigation strategies particularly in office environments and confined work-spaces.

We focus on two parameters that decision makers can use to set policy: the average waiting time between infectious events from an infected individual, and the average number of people infected at an event. We fix the communicable period (distribution) from clinical data. Those two parameters go into the probabilistic model for branching the infection through the population. The decision maker can weigh trade-offs like restricting meeting sizes and interaction rates in the office while examining the extinction probabilities, growth rates, and size distributions for each choice.

You can find our paper here: https://medrxiv.org/cgi/content/short/2020.07.08.20149039v1

Covid-19 and decisions under uncertainty

Three excellent essays recently appeared in the Boston Review by Jonathan Fuller, John Ioannidis, and Marc Lipsitch on the nature of epidemiology and the use of data in making public health decisions. Each essay makes great points, especially Professor Ioannidis' emphasis that Covid-19 public health decisions constitute trade-offs – other people will die based on our decisions to mitigate Covid-19. But I think all three essays miss the essential question, which transcends Covid-19 or even public health:

What is the optimal way to make irreversible decisions under uncertainty?

The answer to this question is subtle because it involves three competing elements: time, uncertainty, and irreversibility. In a decision making process, time gives us the opportunity to learn more about the problem and remove some of the uncertainty, but it’s irreversibility that makes the problem truly difficult. Most important problems tend to have a component of irreversibility. Once we make the decision, there is no going back or it is prohibitively expensive to do so, and our opportunity to learn more about the problem is over.

Irreversibility coupled with uncertainty and time means there is value in waiting. By acting too soon, we lose the opportunity to make a better decision, and by waiting too long, we miss the opportunity altogether. Waiting and learning incurs a cost, but that cost is often more than offset by the chance to make a better and more informed decision later. The value of waiting in finance is part of option pricing and that value radically changes optimal decision making. There is an enormous amount of research on option valuation with irreversible decisions. The book Investment Under Uncertainty by Avinash Dixit and Robert Pindyck provides a wonderful introduction to the literature. When faced with an irreversible decision, the option value can be huge, even dwarfing the payoff from immediate action. At times, learning is the most valuable thing we can do. But for the option to have value, we must have time to wait. In now-or-never situations, the option loses its value completely simply because we have no opportunity to learn. The take-away message is this: the more irreversible the decision, the more time you have, and the more uncertainty you face, the larger the option value. The option value increases along each of these dimensions, thereby increasing the value of waiting and learning.

Covid-19 has all of these ingredients – time, uncertainty, and irreversibility. Irreversibility appears through too many deaths if we wait too long, and economic destruction requiring decades of recovery if we are too aggressive in mitigation (while opening up global financial risks). There is a ton of uncertainty surrounding Covid-19, with varying, and often limited, windows of time in which to act.

Those who call for immediate and strong Covid-19 mitigation strategies recognize irreversibility – we need to save lives while accepting large economic costs – and that even though we face enormous uncertainty, the costs incurred from waiting are so high compared to immediate action that the situation is ultimately now-or-never. There is no option value. Those who call for a more cautious and nuanced approach also see the irreversibility but feel that while the costs from learning are high and time is short, the option value is rescued by the enormous uncertainty. With high uncertainty, it can be worth a lot to learn even a little. Using the lens of option valuation, read these two articles by Professor Ioannidis and Professor Lipsitch from this March and you can see that the authors are actually arguing over the competing contributions of limited time and high uncertainty to an option’s value in an irreversible environment. They disagree on the value of information given the amount of time to act.

So who’s right? In a sense, both. We are not facing one uncertain irreversible decision; we face a sequence of them. When confronted by a new serious set of problems, like a pandemic, it can be sensible to down-weight the time you have and down-weight the uncertainty (by assuming the worst) at the first stage. Both effects drive the option value to zero – you put yourself in the now-or-never condition and you act. But for the next decision, and the next one after that, with decreasing uncertainty over time, you change course, and you use information differently by recognizing the chain of decisions to come. Johan Giesecke makes a compelling argument about the need for a course change with Covid-19 by thinking along these lines.

While option valuation can help us understand the ingredients that contribute to waiting, the uncertainty must be evaluated over some probability measure, and that measure determines how we weigh consequences. There is no objectively correct answer here. How do we evaluate the expected trade-off between excess Covid-19 deaths among the elderly vs a lifetime of lost opportunities for young people? How much extra child abuse is worth the benefit of lockdowns? That weighing of the complete set of consequences is part of the totality of evidence that Professor Ioannidis emphasizes in his essay.

Not only do time, uncertainty, and irreversibility drive the option value, but so does the probability measure. How we measure consequences is a value judgment, and in a democracy that measure must rest with our elected officials. It’s here that I fundamentally disagree with Professor Lipsitch. In his essay, he frames the philosophy of public health action increasingly in terms of purely scientific questions. But public action, the decision to make one kind of costly trade-off against another – and it’s always about trade-offs – is a deeply political issue. In WWII, President Truman made the decision to drop the bomb, not his generals. Science can offer the likely consequences of alternative courses of public health action, but it is largely silent on how society should weigh them. No expert, public health official, or famous epidemiologist has special insight into our collective value judgment.

Covid-19 serology studies: a meta-analysis using hierarchical modelling

Serology studies are front-and-center in the news these days. Reports out of Santa Clara County, California, San Miguel County, Colorado, and Los Angeles suggest that a non-trivial fraction, more than 1%, of the population has SARS-CoV-2 antibodies in their bloodstream. European cities are following suit – they too are conducting serology studies and finding important fractions as well. The catch is that many of these studies find an antibody prevalence comparable to the false positive rate of their respective serology tests. The low statistical power associated with each study has invited criticism, in particular, that the results cannot be trusted and that the study authors should temper their conclusions.

But all is not lost. Jerome Levesque (also a federal data scientist and the manager of the PSPC data science team) and I performed a meta-analysis on the results from Santa Clara County (CA), Los Angeles County (CA), San Miguel County (CO), Chelsea (MA), Geneve (Switzerland), and Gangelt (Germany). We used hierarchical Bayesian modelling with Markov Chain Monte Carlo (MCMC), and also generalized linear mixed modelling (GLMM) with bootstrapping. By painstakingly sleuthing through pre-prints, local government websites, scientific briefs, and study spokesperson media interviews, we not only obtained the data from each study, but we also found information on the details of the serology test used in each study. In particular, we obtained data on each serology test’s false positive rate and false negative rate through manufacturer websites and other academic studies. We take the data at face value and we do not correct for any demographic bias that might exist in the studies.

Armed with this data, we build a generalized linear mixed model and a full Bayesian model with a set of hyper-priors. The GLMM does the usual shrinkage estimation across the study results, and across the serology test false positive/negative rates while the Bayesian model ultimately generates a multi-dimensional posterior distribution, including not only the false positive/negative rates but also the prevalence. We use Stan for the MCMC estimation. With the GLMM, we estimate the prevalence by bootstrapping with the shrunk estimators, including the false positive/negative rates. Both methods give similar results.
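Our paper pools the sites and the test-kit validation data hierarchically, but the core measurement relation that any such model must encode can be sketched for a single site as follows; the counts and test characteristics below are placeholders, not study data.

    # Single-site sketch of the measurement relation (placeholder numbers, not
    # study data): the probability of a positive serology test mixes the true
    # prevalence with the test's error rates,
    #   P(test +) = prev * sensitivity + (1 - prev) * false_positive_rate.
    loglik_prev <- function(prev, n_pos, n_tested, sens, fpr) {
      dbinom(n_pos, n_tested, prev * sens + (1 - prev) * fpr, log = TRUE)
    }

    prev_grid <- seq(0, 0.10, by = 1e-4)
    ll <- loglik_prev(prev_grid, n_pos = 50, n_tested = 3300,
                      sens = 0.85, fpr = 0.005)
    prev_grid[which.max(ll)]   # maximum likelihood prevalence for this one site

The hierarchical models shrink these site-level estimates toward each other and propagate the uncertainty in the false positive and false negative rates, which is what widens the posterior toward lower prevalence values.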

We find that there is evidence of high levels of antibody prevalence (greater than 1%) across all reported locations, but also that a significant probability mass exists for levels lower than the ones reported in the studies. As an important example, Los Angeles shows a mode of approximately 4%, meaning that about 400,000 people in that city have SARS-CoV-2 antibodies. Given the importance of determining society-wide exposure to SARS-CoV-2 for correct inference of the infection fatality rate and for support of contact tracing, we feel that the recent serology studies contain an important and strongly suggestive signal.

Our inferred distributions for each location:

Prevalence density functions (marginal posterior distribution) from the Bayesian MCMC estimation.
Prevalence density functions from the GLMM bootstrap.
Prevalence with the false positive rate (Bayesian MCMC).
Prevalence with the false positive rate (GLMM bootstrap).

Covid-19: A case fatality rate app for US counties

I made a web app that estimates the case fatality rate (CFR) of Covid-19 across all US counties. I use a binomial GLMM with nested random effects (state, state:county) fit with the R package lme4. Every time you reload the app, it fetches the most recent data and re-estimates the model.

The model "shrinks" the simple CFR estimates (dividing deaths by cases) at the county level by "learning" across the other counties within the state and by "learning" across the states themselves. The model pulls in or pushes out estimates that are too large or too small because they come from a county with a small sample size. It’s a bit like trying to estimate the true scoring rate of the NHL teams after watching only the first 10 games of the season. There will be a couple of blow-outs and shut-outs and we need to correct for those atypical results in small samples – but we should probably shrink the Leafs’ ability to score down to zero just to be safe 😉
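A minimal lme4 sketch of the nested binomial GLMM behind the app looks like the following; the column names are assumptions, and the app itself also handles the data refresh and the prediction intervals discussed below.

    library(lme4)

    # Nested binomial GLMM sketch; assumes `cfr_data` with columns
    # `deaths`, `cases`, `state`, `county`.
    fit <- glmer(cbind(deaths, cases - deaths) ~ 1 + (1 | state/county),
                 data = cfr_data, family = binomial)

    # Naive versus shrunk county-level CFR on the probability scale
    cfr_data$cfr_naive  <- with(cfr_data, deaths / cases)
    cfr_data$cfr_shrunk <- predict(fit, type = "response")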

The CFR data has limitations because the people who get tested are biased toward being sick, often very sick. The infection fatality rate (IFR), which is what we all really want to know, requires testing far more people. Recent evidence suggests that the IFR will end up much lower than the current CFR estimates.

The app shows how the naive empirical estimate of the CFR compares to the shrunk estimator from the model. I also provide a forest plot to show the prediction intervals of the model estimates, including the contributions from the random effects. The prediction intervals I report are conservative. I use R's merTools predictInterval() to include uncertainty from the residual (observation-level) variance and the uncertainty in the grouping factors by drawing values from the conditional modes of the random effects using the conditional variance-covariance matrix. I partially corrected for the correlation between the fixed and random effects. Prediction interval estimation with mixed models is a thorny subject and, short of a full Bayesian implementation, a full bootstrap of the lme4 model is required for the best estimates of the prediction interval. Unfortunately, bootstrapping my model takes too long for the purposes of my app (and so does the MCMC in a Bayesian implementation!). For details on the use of merTools::predictInterval(), see Prediction Intervals from merMod Objects by Jared Knowles and Carl Frederick.

Hopefully Covid-19 will pass soon. Stay safe.

Become a GAMM-ateur climate scientist with mgcv

I love tennis. I play tennis incessantly. I follow it like a maniac. This January, my wife and I attended the Australian Open, and then after the tournament we played tennis every day for hours in the awesome Australian summer heat. During a water break one afternoon, I checked the weather app on my phone; the mercury reached 44 C!

The Aussie Open 2019: Rafael Nadal prepares to serve in the summer heat.

It got me to thinking about climate change and one of the gems in my library, Generalized Additive Models: An Introduction with R by Professor Simon N. Wood – he is also the author of the R package mgcv (Mixed GAM Computation Vehicle with Automatic Smoothness Estimation).

First, Wood’s book on generalized additive models is a fantastic read and I highly recommend it to all data scientists – especially for data scientists in government who are helping to shape evidence based policy. In the preface the author says:

“Life is too short to spend too much time reading statistical texts. This book is of course an exception to this rule and should be read cover to cover.”

I couldn’t agree more. There are many wonderful discussions and examples in this book with breadcrumbs into really deep waters, like the theory of soap film smoothing. Pick it up if you are looking for a nice self-contained treatment of generalized additive models, smoothing, and mixed modelling. One of the examples that Wood works through is the application of generalized additive mixed modelling to daily average temperatures in Cairo, Egypt (section 7.7.2 of his book). I want to expand on that discussion a bit in this post.

Sometimes we hear complaints that climate change isn’t real, that there’s just too much variation to reveal any signal. Let’s see what a bit of generalized additive modelling can do for us.

A generalized linear mixed model (GLMM) takes the standard form:

    \begin{align*}\boldsymbol{\mu}^b &= \mathbb{E}({\bf y}\mid{\bf b}), \\ g(\mu_i^b) &= {\bf X}_i\boldsymbol{\beta}+ {\bf Z}_i{\bf b}, \\ {\bf b} &\sim N({\bf 0}, {\boldsymbol{\psi}}_\theta), \\ y_i\mid{\bf b} &\sim \text{exponential family dist.,}\end{align*}

where g is a monotonic link function, {\bf b} contains the random effects with zero expected value and with a covariance matrix {\boldsymbol{\psi}}_\theta parameterized by \theta. A generalized additive model uses this structure, but the design matrix {\bf X} is built from spline basis evaluations with a "wiggliness" penalty, rather than from the regressors directly (the model coefficients are then the spline coefficients). For details, see Generalized Additive Models: An Introduction with R, Second Edition.

The University of Dayton has a website with daily average temperatures from a number of different cities across the world. Let’s take a look at Melbourne, Australia – the host city of the Australian Open. The raw data has untidy bits, and in my R Markdown file I show my code and the clean up choices that I made.

The idea is to build an additive mixed model with temporal correlations. Wood's mgcv package allows us to build rather complicated models quite easily. For details on the theory and the implementation in mgcv, I encourage you to read Wood's book. The model I'm electing to use is:

    \begin{equation*} \text{temp}_i = s_1(\text{time.of.year}_i) + s_2(\text{time}_i) + e_i,\end{equation*}

where

e_i = \phi_1 e_{i-1} + \phi_2 e_{i-2} + \epsilon_i with \epsilon_i \sim N(0,\sigma^2), s_1(\cdot) is a cyclic cubic smoothing spline that captures seasonal temperature variation on a 365-day cycle, and s_2(\cdot) is a smoothing spline that tracks a temperature trend, if any. I'm not an expert in modelling climate change, but this type of model seems reasonable – we have a seasonal component, a component that captures daily autocorrelations in temperature through an AR(2) process, and a possible trend component if one exists. To speed up the estimation, I nest the AR(2) residual component within year.
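Concretely, the model can be specified with mgcv's gamm and an nlme correlation structure along the following lines; the data frame and column names (melb, temp, time.of.year, time, year) and the basis dimensions are assumptions.

    library(mgcv)   # gamm(); mgcv attaches nlme, which supplies corARMA()

    # Sketch of the model above; assumes a data frame `melb` with columns
    # `temp`, `time.of.year` (day of year), `time` (running day index), `year`.
    ct <- gamm(temp ~ s(time.of.year, bs = "cc", k = 20) + s(time, k = 30),
               data = melb,
               knots = list(time.of.year = c(0.5, 366.5)),     # close the annual cycle
               correlation = corARMA(form = ~ 1 | year, p = 2))  # AR(2) within year

    summary(ct$gam)          # smooth terms
    plot(ct$gam, pages = 1)  # seasonal cycle and the long-term trend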

The raw temperature data for Melbourne, Australia is:

Daily mean temperature in Melbourne: 1995 – 2019.

We see a clear seasonal pattern in the data, but there is also a lot of noise. The GAMM will reveal the presence of a trend:

Climate change trend in Melbourne: 1995 – 2019.

We can see that Melbourne has warmed over the last two decades (by almost 2 C). Using the Dayton temperature dataset, I created a website based on the same model that shows temperature trends across about 200 different cities. Ottawa, Canada (Canada’s capital city) is included among the list of cities and we can see that the temperature trend in Ottawa is a bit wonky. We’ve had some cold winters in the last five years and while the Dayton data for Ottawa is truncated at 2014, I’m sure the winter of 2018-2019 with its hard cold spells would also show up in the trend. This is why the phenomenon is called climate change – the effect is, and will continue to be, uneven across the planet. If you like, compare different cities around the world using my website.

As a point of caution, climate change activists should temper their predictions about how exactly climate change will affect local conditions. I recall that in 2013 David Suzuki wrote about what climate change could mean for Ottawa, saying

…one of Canada’s best-loved outdoor skating venues, Ottawa’s Rideau Canal, provides an example of what to expect…with current emissions trends, the canal’s skating season could shrink from the previous average of nine weeks to 6.5 weeks by 2020, less than six weeks by 2050 and just one week by the end of the century. In fact, two winters ago, the season lasted 7.5 weeks, and last year it was down to four. The canal had yet to fully open for skating when this column was written [January 22, 2013].

The year after David Suzuki wrote this article, the Rideau Skateway enjoyed its longest run of consecutive skating days ever and nearly one of the longest seasons on record. This year (2019) has been another fantastic skating season, lasting 71 days (with a crazy cold winter). My GAMM analysis of Ottawa's daily average temperature shows just how wild local trends can be. Unfortunately, statements like the one David Suzuki made fuel climate change skepticism. Some people will point to his bold predictions for 2020, see the actual results, and then dismiss climate change altogether. I doubt that David Suzuki intends that kind of advocacy! Climate change is complicated; not every place on the planet will see warming, and certainly not evenly. And if the jet stream becomes unstable during the North American winter, climate change may bring bitterly cold winters to eastern Canada on a regular basis – all while the Arctic warms and melts. There are complicated feedback mechanisms at play, so persuading people about the phenomenon of climate change with facts instead of cavalier predictions is probably the best strategy.

Now, establishing that climate change is real and persuading people of its existence is only one issue – what to do about it is an entirely different matter. We can agree that climate change is real and mostly anthropogenic, but that does not imply that the climate lobby's policy agenda inexorably follows. Given the expected impact of climate change on the global economy, and the need to think about its economic consequences in a world of scarce resources, we should seek the best evidence-based policy solutions available.

Let’s use the best evidence, both from climate science and economics, as our guide for policy in an uncertain future.

Improve operations: reduce variability in the queue

I recently came across a cute problem at Public Services and Procurement Canada while working on queue performance issues. Imagine a work environment where servers experience some kind of random interruption that prevents them from continuously working on a task. We’ve all been there—you’re trying to finish important work but you get interrupted by an urgent email or phone call. How can we manage this work environment? What are the possible trade-offs? Should we try to arrange for infrequent but long interruptions or should we aim for frequent but short interruptions?

Let’s rephrase the problem generically in terms of machines. Suppose that a machine processes a job in a random run-time, T, with mean \mathbb{E}(T) = \bar t and variance \text{Var}(T) = \sigma_t^2, but with an otherwise general distribution. The machine fails with independent and identically distributed exponential inter-arrival times. The mean time between failures is \bar f. When the machine is down from failure, it takes a random amount of time, R, to repair. The machine failure interrupts a job in progress and, once repaired, the machine continues the incomplete job from the point of the interruption. The repair time distribution has mean \mathbb{E}(R) = \bar r and variance \text{Var}(R) = \sigma_r^2 but is otherwise general. The question is: what are the mean and variance of the total processing time? The solution is a bit of fun.

The time to complete a job, \tau, is the sum of the (random) run-time, T, plus the sum of repair times (if any). That is,

    \begin{equation*}\tau = T + \sum_{i=1}^{N(T)} R_i,\end{equation*}

where N(T) is the random number of failures that occur during the run-time. First, condition \tau on a run-time of t and thus,

    \begin{equation*}\tau|t = t + \sum_{i=1}^{N(t)} R_i.\end{equation*}

Now, since failures arrive with exponentially distributed inter-arrival times, the number of failures by time t, N(t), is a Poisson counting process:

    \begin{align*} \mathbb{E}(\tau|t) &= t + \mathbb{E}\left(\sum_{i=1}^{N(t)} R_i\right) \\ &= t + \sum_{k=0}^\infty \mathbb{E}\left[\left.\sum_{i=1}^k R_i \right| N(t) = k\right]\mathbb{P}(N(t) = k) \\ &= t + \sum_{k=1}^\infty (k\bar r) \frac{(t/\bar f)^k}{k!}e^{-t/\bar f} \\ &= t + \bar r \frac{t}{\bar f}e^{-t/\bar f} \sum_{k=1}^\infty \frac{(t/\bar f)^{k-1}} {(k-1)!} \\ &= t \left(\frac{\bar f + \bar r}{\bar f}\right). \end{align*}

By the law of iterated expectations, \mathbb{E}(\mathbb{E}(\tau|t)) = \mathbb{E}(\tau), and so,

    \begin{equation*} \mathbb{E}(\tau) = \mathbb{E}\left[t \left(\frac{\bar f + \bar r}{\bar f}\right)\right] = \bar t \left(\frac{\bar f + \bar r}{\bar f}\right), \end{equation*}

which gives us the mean time to process a job. Notice that in the limit that \bar f\rightarrow\infty, we recover the expected result that the mean processing time is just \mathbb{E}(T) = \bar t.

To derive the variance, recall the law of total variance,

    \begin{equation*}\text{Var}(Y) = \text{Var}(\mathbb{E}(Y|X))+ \mathbb{E}(\text{Var}(Y|X)).\end{equation*}

From the conditional expectation calculation, we have

    \begin{equation*}\text{Var}(\mathbb{E}(\tau|t)) = \text{Var}\left[ t \left(\frac{\bar f + \bar r}{\bar f}\right)\right] = \sigma_t^2 \left(\frac{\bar f + \bar r}{\bar f}\right)^2.\end{equation*}

We need \mathbb{E}(\text{Var}(\tau|t)). For fixed t, we use the Laplace transform of the sum of the random repair times, S = \sum_{i=1}^{N(t)} R_i, that is,

    \begin{align*} \mathcal{L}(u) = \mathbb{E}[\exp(uS)] &= \mathbb{E}\left[\exp\left(u\sum_{i=1}^{N(t)} R_i\right)\right] \\ &= \sum_{k=0}^\infty (\mathcal{L}_r(u))^k\, \frac{(t/\bar f)^k}{k!}e^{-t/\bar f} \\ &= \exp\left(\frac{t}{\bar f}(\mathcal{L}_r(u) -1)\right), \end{align*}

where \mathcal{L}_r(u) is the Laplace transform of the unspecified repair time distribution. The second moment is,

    \begin{equation*}\mathcal{L}^{\prime\prime}(0) = \left( \frac{t\mathcal{L}^\prime_r(0)}{\bar f}\right)^2 + \frac{t}{\bar f}\mathcal{L}^{\prime\prime}_r(0).\end{equation*}

We have the moment and variance relationships \mathcal{L}^\prime_r(0) = \bar r and \mathcal{L}^{\prime\prime}_r(0) - (\mathcal{L}^\prime_r(0))^2 = \sigma_r^2, and thus,

    \begin{align*} \mathbb{E}[\text{Var}(\tau|T=t)] &= \mathbb{E}[\text{Var}(t + S)|T=t] \\ &= \mathbb{E}[\mathcal{L}^{\prime\prime}(0) - (\mathcal{L}^\prime(0))^2] \\ &= \mathbb{E}\left[ \left(\frac{t\bar r}{\bar f}\right)^2  + \frac{t(\sigma_r^2 + \bar r^2)}{\bar f} - \left(\frac{t\bar r}{\bar f}\right)^2 \right] \\ &= \mathbb{E}\left[\frac{t(\sigma_r^2 + \bar r^2)}{\bar f}\right] = \frac{\bar t(\sigma_r^2 + \bar r^2)}{\bar f}. \end{align*}

The law of total variance gives the desired result,

    \begin{align*} \text{Var}(\tau) &= \text{Var}(\mathbb{E}(\tau|t)) + \mathbb{E}[\text{Var}(\tau|t)] \\ &= \left(\frac{\bar r + \bar f}{\bar f}\right)^2\sigma_t^2 + \left(\frac{\bar t}{\bar f}\right) \left(\bar r^2 + \sigma_r^2\right). \end{align*}

Notice that the equation for the total variance makes sense in the \bar f \rightarrow \infty limit; the processing time variance becomes the run-time variance. The equation also has the expected ingredients, depending on both the run-time and repair-time variances. But the equation holds a bit of a surprise: it depends on the square of the mean repair time, \bar r^2. That dependence leads to an interesting trade-off.
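Before turning to that trade-off, here is a quick Monte Carlo sketch that checks the two formulas; the gamma run-time and repair-time distributions and the parameter values are illustrative choices, since the derivation only fixes the means and variances.

    set.seed(42)

    # Monte Carlo check of the mean and variance of the total processing time.
    t_bar <- 8;  sd_t <- 3    # run-time mean and standard deviation
    f_bar <- 5                # mean time between failures (exponential)
    r_bar <- 2;  sd_r <- 1    # repair-time mean and standard deviation

    sim_one <- function() {
      run  <- rgamma(1, shape = (t_bar / sd_t)^2, rate = t_bar / sd_t^2)
      n_f  <- rpois(1, run / f_bar)          # failures during the run-time
      reps <- if (n_f == 0) 0 else
        sum(rgamma(n_f, shape = (r_bar / sd_r)^2, rate = r_bar / sd_r^2))
      run + reps
    }

    tau <- replicate(2e5, sim_one())
    c(sim_mean = mean(tau), formula_mean = t_bar * (f_bar + r_bar) / f_bar)
    c(sim_var  = var(tau),
      formula_var = ((r_bar + f_bar) / f_bar)^2 * sd_t^2 +
                    (t_bar / f_bar) * (r_bar^2 + sd_r^2))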

Imagine that we have a setup with fixed \sigma_t, \sigma_r, and \bar t, and fixed \mathbb{E}(\tau) but we are free to choose \bar f and \bar r. That is, for a given mean total processing time, we can choose between a machine that fails frequently with short repair times or we can choose a machine that fails infrequently but with long repair times. Which one would we like, and does it matter since either choice leads to the same mean total processing time? At fixed \mathbb{E}(\tau) we must have,

    \begin{equation*} \left(\frac{\bar f + \bar r}{\bar f}\right)=K,\end{equation*}

for some constant K. But since the variance of the total processing time depends on \bar r, different choices of \bar f will lead to different total variance. The graph shows different iso-contours of constant mean total processing time in the total variance/mean time between failure plane. Along the black curve, we see that as the mean total processing time increases, we can minimize the variance by choosing a configuration where the machine fails often but with short repair times.

Minimizing total variance as a function of mean time between failures.

Why is this important? Well, in a queue, all else remaining equal, increasing server variance increases the expected queue length. So, in a workflow, if we have to live with interruptions, for the same mean processing time, it’s better to live with frequent short interruptions rather than infrequent long interruptions. The exact trade-offs depend on the details of the problem, but this observation is something that all government data scientists who are focused on improving operations should keep in the back of their mind.