Improve operations: reduce variability in the queue

I recently came across a cute problem at Public Services and Procurement Canada while working on queue performance issues. Imagine a work environment where servers experience some kind of random interruption that prevents them from continuously working on a task. We’ve all been there—you’re trying to finish important work but you get interrupted by an urgent email or phone call. How can we manage this work environment? What are the possible trade-offs? Should we try to arrange for infrequent but long interruptions or should we aim for frequent but short interruptions?

Let’s rephrase the problem generically in terms of machines. Suppose that a machine processes a job in a random run-time, T, with mean \mathbb{E}(T) = \bar t and variance \text{Var}(T) = \sigma_t^2, but with an otherwise general distribution. The machine fails with independent and identically distributed exponential times between failures; the mean time between failures is \bar f. When the machine is down from a failure, it takes a random amount of time, R, to repair. The failure interrupts a job in progress and, once repaired, the machine continues the incomplete job from the point of the interruption. The repair time distribution has mean \mathbb{E}(R) = \bar r and variance \text{Var}(R) = \sigma_r^2 but is otherwise general. The question is: what are the mean and the variance of the total processing time? The solution is a bit of fun.

The time to complete a job, \tau, is the sum of the (random) run-time, T, plus the sum of repair times (if any). That is,

    \begin{equation*}\tau = T + \sum_{i=1}^{N(T)} R_i,\end{equation*}

where N(T) is the random number of failures that occur during the run-time. First, condition \tau on a run-time of t, which gives

    \begin{equation*}\tau|t = t + \sum_{i=1}^{N(t)} R_i.  \end{equation*}

Now, since N(t) counts the failures by time t and the inter-failure times are exponential with mean \bar f, N(t) is a Poisson counting process with rate 1/\bar f:

    \begin{align*} \mathbb{E}(\tau|t) &= t + \mathbb{E}\left(\sum_{i=1}^{N(t)} R_i\right) \\ &= t + \sum_{k=0}^\infty \mathbb{E}\left[\left.\sum_{i=1}^k R_i \right| N(t) = k\right]\mathbb{P}(N(t) = k) \\ &= t + \sum_{k=1}^\infty (k\bar r) \frac{(t/\bar f)^k}{k!}e^{-t/\bar f} \\ & = t + \bar r \frac{t}{\bar f}e^{-t/\bar f} \sum_{k=1}^\infty \frac{(t/\bar f)^{k-1}} {(k-1)!} \\ &= t \left(\frac{\bar f + \bar r}{\bar f}\right) \end{align*}

By the law of iterated expectations, \mathbb{E}(\mathbb{E}(\tau|T)) = \mathbb{E}(\tau), and so,

    \begin{equation*} \mathbb{E}(\tau) = \mathbb{E}\left[T \left(\frac{\bar f + \bar r}{\bar f}\right)\right] = \bar t \left(\frac{\bar f + \bar r}{\bar f}\right), \end{equation*}

which gives us the mean time to process a job. Notice that in the limit that \bar f\rightarrow\infty, we recover the expected result that the mean processing time is just \mathbb{E}(T) = \bar t.
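As a sanity check, here is a minimal Monte Carlo sketch of the process. The distribution choices and parameter values are hypothetical (gamma run-times, exponential repairs); only the means matter for this check:

    import numpy as np

    rng = np.random.default_rng(42)
    n_jobs = 200_000

    # Hypothetical parameters: mean run-time, mean time between failures, mean repair time.
    t_bar, f_bar, r_bar = 2.0, 5.0, 1.5

    # Run-times: gamma distributed with mean t_bar (shape 4, scale t_bar/4).
    T = rng.gamma(shape=4.0, scale=t_bar / 4.0, size=n_jobs)
    # With exponential inter-failure times, the number of failures during a run
    # of length t is Poisson with mean t / f_bar.
    N = rng.poisson(T / f_bar)
    # Total repair time per job: the sum of N i.i.d. repairs (exponential here).
    repair_total = np.array([rng.exponential(r_bar, size=n).sum() for n in N])
    tau = T + repair_total

    print(f"simulated mean: {tau.mean():.3f}")                       # ~2.6
    print(f"formula:        {t_bar * (f_bar + r_bar) / f_bar:.3f}")  # 2.6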

To derive the variance, recall the law of total variance,

    \begin{equation*}\text{Var}(Y) = \text{Var}(\mathbb{E}(Y|X))+ \mathbb{E}(\text{Var}(Y|X)).\end{equation*}

From the conditional expectation calculation, we have

    \begin{equation*}\text{Var}(\mathbb{E}(\tau|T)) = \text{Var}\left[ T \left(\frac{\bar f + \bar r}{\bar f}\right)\right] = \sigma_t^2 \left(\frac{\bar f + \bar r}{\bar f}\right)^2.\end{equation*}

We still need \mathbb{E}(\text{Var}(\tau|T)). For fixed t, we use the moment generating function of the sum of the random repair times, S = \sum_{i=1}^{N(t)} R_i, that is,

    \begin{align*} \mathcal{L}(u) = \mathbb{E}[\exp(uS)] &= \mathbb{E}\left[\exp\left(u\sum_{i=1}^{N(t)} R_i\right)\right] \\ &= \sum_{k=0}^\infty \mathbb{E}\left[\left.\exp\left(u\sum_{i=1}^{k} R_i\right) \right| N(t) = k\right]\mathbb{P}(N(t) = k) \\ &= \sum_{k=0}^\infty (\mathcal{L}_r(u))^k \frac{(t/\bar f)^k}{k!}e^{-t/\bar f} \\ &= \exp\left(\frac{t}{\bar f}(\mathcal{L}_r(u) -1)\right), \end{align*}

where \mathcal{L}_r(u) is the moment generating function of the otherwise unspecified repair time distribution. Differentiating twice and evaluating at u = 0 gives the second moment of S,

    \begin{equation*}\mathcal{L}^{\prime\prime}(0) = \left( \frac{t\mathcal{L}^\prime_r(0)}{\bar f}\right)^2 + \frac{t}{\bar f}\mathcal{L}^{\prime\prime}_r(0).\end{equation*}
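As a check on the differentiation, a few lines of sympy reproduce this expression for a generic repair-time transform (a sketch; Lr is just a placeholder for the unspecified \mathcal{L}_r):

    import sympy as sp

    u, t, f = sp.symbols("u t f", positive=True)
    Lr = sp.Function("Lr")  # placeholder for the repair-time moment generating function

    # L(u) = exp((t/f) * (Lr(u) - 1)) from the derivation above.
    L = sp.exp((t / f) * (Lr(u) - 1))

    # Differentiate twice and evaluate at u = 0, using Lr(0) = 1.
    second_moment = sp.diff(L, u, 2).subs(u, 0).subs(Lr(0), 1)
    print(sp.expand(second_moment))
    # -> (up to term ordering)
    #    t*Subs(Derivative(Lr(u), (u, 2)), u, 0)/f
    #    + t**2*Subs(Derivative(Lr(u), u), u, 0)**2/f**2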

We have the moment and variance relationships \mathcal{L}^\prime_r(0) = \bar r and \mathcal{L}^{\prime\prime}_r(0) - (\mathcal{L}^\prime_r(0))^2 = \sigma_r^2, and thus,

    \begin{align*} \mathbb{E}[\text{Var}(\tau|T)] &= \mathbb{E}[\text{Var}(T + S\,|\,T)] = \mathbb{E}[\mathcal{L}^{\prime\prime}(0) - (\mathcal{L}^\prime(0))^2] \\ &= \mathbb{E}\left[ \left(\frac{T\bar r}{\bar f}\right)^2  + \frac{T(\sigma_r^2 + \bar r^2)}{\bar f} - \left(\frac{T\bar r}{\bar f}\right)^2 \right] \\ & =  \mathbb{E}\left[\frac{T(\sigma_r^2 + \bar r^2)}{\bar f}\right] = \frac{\bar t(\sigma_r^2 + \bar r^2)}{\bar f} . \end{align*}

The law of total variance gives the desired result,

    \begin{align*} \text{Var}(\tau) &=  \text{Var}(\mathbb{E}(\tau|T)) + \mathbb{E}[\text{Var}(\tau|T)] \\ & = \left(\frac{\bar r + \bar f}{\bar f}\right)^2\sigma_t^2 + \left(\frac{\bar t}{\bar f}\right) \left(\bar r^2 + \sigma_r^2\right). \end{align*}

Notice that the equation for the total variance makes sense in the \bar f \rightarrow \infty limit: the processing time variance reduces to the run-time variance. The equation also has the expected ingredients, depending on both the run-time and repair-time variances. But it holds a bit of a surprise: it also depends on the square of the mean repair time, \bar r^2. That dependence leads to an interesting trade-off.
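Extending the earlier Monte Carlo sketch to the second moment (with the same hypothetical gamma run-times, so \sigma_t^2 = 1, and exponential repairs, so \sigma_r^2 = \bar r^2) confirms the variance formula:

    import numpy as np

    rng = np.random.default_rng(7)
    n_jobs = 500_000
    t_bar, f_bar, r_bar = 2.0, 5.0, 1.5
    sigma_t2 = 1.0          # Var(T) for gamma(shape=4, scale=t_bar/4)
    sigma_r2 = r_bar ** 2   # Var(R) for exponential repairs

    T = rng.gamma(shape=4.0, scale=t_bar / 4.0, size=n_jobs)
    N = rng.poisson(T / f_bar)
    repair_total = np.array([rng.exponential(r_bar, size=n).sum() for n in N])
    tau = T + repair_total

    K = (f_bar + r_bar) / f_bar
    formula = K ** 2 * sigma_t2 + (t_bar / f_bar) * (r_bar ** 2 + sigma_r2)
    print(f"simulated variance: {tau.var():.3f}")   # ~3.49
    print(f"formula:            {formula:.3f}")     # 3.49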

Imagine that we have a setup with fixed \sigma_t, \sigma_r, and \bar t, and a fixed \mathbb{E}(\tau), but we are free to choose \bar f and \bar r. That is, for a given mean total processing time, we can choose between a machine that fails frequently with short repair times and one that fails infrequently but with long repair times. Which one should we prefer, and does the choice even matter, given that either leads to the same mean total processing time? At fixed \mathbb{E}(\tau) we must have,

    \begin{equation*} \left(\frac{\bar f + \bar r}{\bar f}\right)=K,\end{equation*}

for some constant K. But since the variance of the total processing time depends on \bar r, different choices of \bar f lead to different total variances. The graph below shows iso-contours of constant mean total processing time in the total-variance/mean-time-between-failures plane. Along the black curve, we see that as the mean total processing time increases, the variance-minimizing choice shifts toward a configuration where the machine fails often but with short repair times.

Figure: Minimizing total variance as a function of mean time between failures.
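We can make the trade-off concrete with a short sketch. Along an iso-contour, \bar r = (K-1)\bar f, so the total variance reduces to K^2\sigma_t^2 + \bar t(K-1)^2\bar f + \bar t\sigma_r^2/\bar f, which is minimized at \bar f = \sigma_r/(K-1). The code below, reusing the hypothetical parameters from the earlier sketches, confirms this numerically:

    import numpy as np

    # Hypothetical fixed parameters from the earlier sketches.
    t_bar, sigma_t2, sigma_r2 = 2.0, 1.0, 2.25

    def total_variance(f_bar, K):
        # Along the iso-contour (f_bar + r_bar) / f_bar = K, r_bar = (K - 1) * f_bar.
        r_bar = (K - 1.0) * f_bar
        return K ** 2 * sigma_t2 + (t_bar / f_bar) * (r_bar ** 2 + sigma_r2)

    for K in (1.2, 1.3, 1.5):
        f_grid = np.linspace(0.1, 20.0, 4000)
        variances = total_variance(f_grid, K)
        f_star = f_grid[variances.argmin()]
        print(f"K = {K}: minimizing f_bar ~ {f_star:.2f} "
              f"(analytic sigma_r/(K-1) = {np.sqrt(sigma_r2) / (K - 1):.2f})")

Note that as K grows, the minimizing \bar f shrinks, which is exactly the shift toward frequent failures with short repairs described above.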

Why is this important? Well, in a queue, all else being equal, increasing server variance increases the expected queue length. So, in a workflow, if we have to live with interruptions, then for the same mean processing time it’s better to live with frequent short interruptions than with infrequent long ones. The exact trade-offs depend on the details of the problem, but this observation is something that all government data scientists focused on improving operations should keep in the back of their minds.
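To see why variance matters, consider the standard M/G/1 queue: Poisson arrivals at rate \lambda, general service times with mean \bar s and variance \sigma_s^2, and utilization \rho = \lambda\bar s < 1. The Pollaczek–Khinchine formula gives the mean queue length,

    \begin{equation*} L_q = \frac{\rho^2 + \lambda^2\sigma_s^2}{2(1-\rho)}, \end{equation*}

which grows linearly in the service time variance, \sigma_s^2, with everything else held fixed.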

Data science in government is really operations research

Colby Cosh had an interesting article in the National Post this week, “Let’s beat our government-funded AI addiction together.” In it, he refers to a Canadian Press story about the use of artificial intelligence in forest fire management, quoting a researcher’s description of the approach:

“You start with your observations. What have you seen in the past decades in terms of where wildfires have occurred and how big they got? And you look for correlations with any factor that might have any impact. The question is which data really does have any correlation. That’s where the AI comes in play. It automatically figures those correlations out.”

Cosh responds: “As a reader you might be saying to yourself ‘Hang on: up until the part where he mentioned “AI”, this all just sounds like… regular scientific model-building? Didn’t statistics invent stuff like “correlations” a hundred years ago or so?’ And you’d be right. We are using ‘AI’ in this instance to mean what is more accurately called ‘machine learning.’ And even this, since it mentions ‘learning,’ is a misleadingly grandiose term.”

Cosh has a point. Not only are labels like artificial intelligence being attached to just about everything involving computation these days, but just about everyone who works with data is now calling themselves a data scientist. I would like to offer a more nuanced view, and provide a bit of insight into how data science actually works in the federal government as practiced by professional data scientists.

Broadly, data science problems fall into two areas:

1) Voluminous, diffuse, diverse, usually cheap data with a focus on finding needles in haystacks. Raw predictive power largely determines model success. This is the classic Big Data problem and is tightly associated with the realm of artificial intelligence. The term Big Data sometimes creates confusion among the uninitiated – I’ve seen the occasional business manager assume that large data sets refer to a file that’s just a bit too large to manage with Excel. In reality, true Big Data comprises data sets that cannot fit into memory on a single machine or be processed by a single processor. Most applications of artificial intelligence require truly huge amounts of training data along with a host of specialized techniques to process it. Examples include finding specific objects within a large collection of videos, voice recognition and translation, handwriting and facial recognition, and automatic photo tagging.

2) Small, dense, formatted, usually expensive data with a focus on revealing exploitable relationships for human decision making. Interpretability plays a large role in determining model success. Unlike the Big Data problems, the relevant data almost always fit into memory on a single machine and are amenable to computation with a limited number of processors. These moderate-sized problems fit within the world of operations research, and theoretical models of the phenomenon provide important guides. Examples include modelling queues, inventories, optimal stopping, and trade-offs between exploration and exploitation. A contextual understanding of the data and the question is paramount.

Government data science problems are almost always of the second type, or can be transformed into the second type with a bit of effort. Our data is operational in nature, expensive, dense, small (under 30 GB), rectangular, and approximately well-formatted (untidy with some errors, but not overly messy), with a host of privacy and sometimes security concerns. Government decision makers seek interpretable relationships. The real world is more complicated than any mathematical model, hence the need for a decision maker in the first place. The decision maker’s experience is an essential part of the process. As Andrew Ng points out in the Harvard Business Review, What Artificial Intelligence Can and Can’t Do Right Now,

“If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

Government decision making usually does not conform to that problem type. Data science in government is really operations research by another name.

Analysts often confuse the two types of data science problems. Too often we have seen examples of the inappropriate use of black-box software. Feeding a few gigabytes of rectangular SQL data into a black-box neural net package to make predictions in a decision-making context is almost certainly misplaced effort. Is the black-box approach stronger? As Yoda told Luke, “No, no, no. Quicker, easier, more seductive.” There is no substitute for thinking about the mathematical structure of the problem and finding the right contextual question to ask of the data.

To give a more concrete example: in the past I was deeply involved with a queueing problem that the government faced. Predicting wait times, queue lengths, and arrivals is not a black-box, plug-and-play problem. To help government decision makers better allocate scarce resources, we used queueing theory along with modern statistical inference methods. We noticed that the servers in our queue came from a population heterogeneous in experience and skill, nested within teams. We estimated production using hierarchical models and Markov chain Monte Carlo, which we then used to infer aspects of our queueing models. We were not thinking about driving data into black boxes; we were concerned with the world of random walks, renewal theory, and continuous-time Markov chains. Our modelling efforts engendered management discussions focused on the trade-offs between reducing server time variance, increasing average service speed, and adding queue capacity, all of which play a role in determining the long-term average queue length and all of which have their own on-the-ground operational quirks and costs. Data science, as we practice it in the civil service, moves management discussions to a higher level, so that the decision maker’s unique experience and insight becomes crucial to the final decision. Raw predictive power is usually not the goal – an understanding of how to make optimal trade-offs in complex decisions is.
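For a flavour of that style of modelling, here is a minimal sketch of a hierarchical production model with servers nested within teams, written with the PyMC library. It is not the model we used; the data, priors, and structure here are entirely hypothetical.

    import numpy as np
    import pymc as pm

    # Entirely hypothetical data: job counts for 8 servers nested in 3 teams.
    team_of_server = np.array([0, 0, 0, 1, 1, 2, 2, 2])
    jobs_completed = np.array([38, 45, 31, 52, 48, 24, 29, 27])
    hours_observed = np.full(8, 40.0)

    with pm.Model() as production_model:
        # Population-level mean production rate (jobs per hour).
        mu = pm.HalfNormal("mu", sigma=2.0)
        # Spread of team rates around the population rate, and of
        # server rates around their team's rate.
        team_sd = pm.HalfNormal("team_sd", sigma=0.5)
        server_sd = pm.HalfNormal("server_sd", sigma=0.5)
        # Team rates, partially pooled toward the population rate.
        team_rate = pm.Gamma("team_rate", mu=mu, sigma=team_sd, shape=3)
        # Server rates, partially pooled toward their team's rate.
        server_rate = pm.Gamma("server_rate", mu=team_rate[team_of_server],
                               sigma=server_sd, shape=8)
        # Observed job counts over each server's observation window.
        pm.Poisson("jobs", mu=server_rate * hours_observed, observed=jobs_completed)
        idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

In a setup like this, the posterior distribution of the server rates would feed the service side of a queueing model, rather than serving as a black-box prediction.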

Data science in government is about improving decisions through a better understanding of the world. That’s mostly the application of operations research and that is how our group applies computation and mathematics to government problems.