Data science in government is really operations research

Colby Cosh had an interesting article in the National Post this week, "Let's beat our government-funded AI addiction together." In it he refers to a Canadian Press story about the use of artificial intelligence in forest fire management, and quotes the story's description of the approach:

“You start with your observations. What have you seen in the past decades in terms of where wildfires have occurred and how big they got? And you look for correlations with any factor that might have any impact. The question is which data really does have any correlation. That’s where the AI comes in play. It automatically figures those correlations out.”

As a reader you might be saying to yourself “Hang on: up until the part where he mentioned ‘AI’, this all just sounds like… regular scientific model-building? Didn’t statistics invent stuff like ‘correlations’ a hundred years ago or so?” And you’d be right. We are using “AI” in this instance to mean what is more accurately called “machine learning.” And even this, since it mentions “learning,” is a misleadingly grandiose term.

Cosh has a point. Not only are labels like artificial intelligence being attached to just about everything involving computation these days, but nearly everyone who works with data now calls themselves a data scientist. I would like to offer a more nuanced view and provide some insight into how data science actually works in the federal government, as practiced by professional data scientists.

Broadly, data science problems fall into two types:

1) Voluminous, diffuse, diverse, usually cheap data, with a focus on finding needles in haystacks. Raw predictive power largely determines model success. This is the classic Big Data problem and is tightly associated with the realm of artificial intelligence. The term Big Data sometimes creates confusion among the uninitiated; I’ve seen the occasional business manager assume it refers to a file that’s just a bit too large to manage in Excel. In reality, true Big Data comprises data sets that cannot fit into memory on a single machine or be processed by a single processor. Most applications of artificial intelligence require truly huge amounts of training data, along with a host of specialized techniques to process it. Examples include finding specific objects within a large collection of videos, voice recognition and translation, handwriting and facial recognition, and automatic photo tagging.

2) Small, dense, formatted, usually expensive data, with a focus on revealing exploitable relationships for human decision making. Interpretability plays a large role in determining model success. Unlike Big Data problems, the relevant data almost always fit into memory on a single machine and are amenable to computation with a limited number of processors. These moderate-sized problems fit within the world of operations research, and theoretical models of the underlying phenomenon provide important guides. Examples include modelling queues, inventories, optimal stopping, and trade-offs between exploration and exploitation; a small sketch follows this list. A contextual understanding of the data and the question is paramount.
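To make that second type concrete, here is a minimal sketch of the kind of interpretable queueing model meant above; it is my own illustration rather than anything from the article. It uses the standard Pollaczek–Khinchine formula for an M/G/1 queue, which ties the long-run average queue length to three quantities a manager can reason about: the arrival rate, the mean service time, and the service-time variance.

```python
# Mean number waiting in an M/G/1 queue via the Pollaczek-Khinchine formula:
#   Lq = (lam^2 * var_s + rho^2) / (2 * (1 - rho)),  where rho = lam * mean_s.
# Every symbol is interpretable: lam is the arrival rate, and mean_s / var_s
# are the mean and variance of the service time.

def mean_queue_length(lam: float, mean_s: float, var_s: float) -> float:
    """Long-run average number of customers waiting; requires utilisation < 1."""
    rho = lam * mean_s  # server utilisation
    if rho >= 1:
        raise ValueError("unstable queue: utilisation must be below 1")
    return (lam ** 2 * var_s + rho ** 2) / (2 * (1 - rho))

# Illustrative numbers only: a busy queue where management can either make
# service more consistent or make it faster.
print(mean_queue_length(lam=0.9, mean_s=1.0, var_s=1.0))   # baseline:       ~8.1
print(mean_queue_length(lam=0.9, mean_s=1.0, var_s=0.25))  # lower variance: ~5.1
print(mean_queue_length(lam=0.9, mean_s=0.9, var_s=1.0))   # faster service: ~3.9
```

Every term in the formula maps onto something a decision maker can act on, which is exactly the interpretability this second type of problem demands.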

Government data science problems are almost always of the second type, or can be transformed into it with a bit of effort. Our data is operational in nature, expensive, dense, small (under 30 GB), rectangular, and approximately well-formatted (untidy, with some errors, but not overly messy), with a host of privacy and sometimes security concerns. Government decision makers seek interpretable relationships. The real world is more complicated than any mathematical model, hence the need for a decision maker in the first place. The decision maker’s experience is an essential part of the process. As Andrew Ng points out in the Harvard Business Review article What Artificial Intelligence Can and Can’t Do Right Now,

“If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

Government decision making usually does not conform to that problem type. Data science in government is really operations research by another name.

Analysts often confuse the two types of data science problems, and too often we have seen the inappropriate use of black-box software that results. Feeding a few gigabytes of rectangular SQL data into a black-box neural net package to make predictions in a decision-making context is almost certainly misplaced effort. Is the black-box approach stronger? As Yoda told Luke, “No, no, no. Quicker, easier, more seductive.” There is no substitute for thinking about the mathematical structure of the problem and finding the right contextual question to ask of the data.

To give a more concrete example, in the past I was deeply involved with a queueing problem the government faced. Predicting wait times, queue lengths, and arrivals is not a black-box, plug-and-play problem. To help government decision makers better allocate scarce resources, we used queueing theory along with modern statistical inference methods. We noticed that our servers came from a population heterogeneous in experience and skill, nested within teams. We estimated production using hierarchical models fitted with Markov chain Monte Carlo, and used those estimates to infer aspects of our queueing models. We were not thinking about driving data into black boxes; we were thinking about the world of random walks, renewal theory, and continuous-time Markov chains.

Our modelling efforts engendered management discussions focused on the trade-offs between reducing service-time variance, increasing average service speed, and adding queue capacity; all three play a role in determining the long-run average queue length, and each has its own on-the-ground operational quirks and costs. Data science, as we practice it in the civil service, moves management discussions to a higher level, so that the decision maker’s unique experience and insight become crucial to the final decision. Raw predictive power is usually not the goal; an understanding of how to make optimal trade-offs in complex decisions is.
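For a flavour of what the estimation step can look like, here is a hedged sketch of a hierarchical model of server service times nested within teams, fitted with MCMC. The code is a hypothetical illustration in PyMC with invented names and simulated stand-in data, not the model we actually used.

```python
import numpy as np
import pymc as pm  # assumes PyMC v5 is installed

# Hypothetical data: log service times for servers nested within teams.
rng = np.random.default_rng(1)
n_teams, servers_per_team, obs_per_server = 4, 5, 30
n_servers = n_teams * servers_per_team
team_of_server = np.repeat(np.arange(n_teams), servers_per_team)
server_of_obs = np.repeat(np.arange(n_servers), obs_per_server)
log_times = rng.normal(1.0, 0.3, size=server_of_obs.size)  # stand-in observations

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)                     # population mean log service time
    sigma_team = pm.HalfNormal("sigma_team", 0.5)      # between-team variation
    team_eff = pm.Normal("team_eff", 0.0, sigma_team, shape=n_teams)
    sigma_server = pm.HalfNormal("sigma_server", 0.5)  # between-server variation within teams
    server_eff = pm.Normal("server_eff", 0.0, sigma_server, shape=n_servers)
    server_mean = mu + team_eff[team_of_server] + server_eff
    sigma_obs = pm.HalfNormal("sigma_obs", 0.5)        # observation noise
    pm.Normal("obs", server_mean[server_of_obs], sigma_obs, observed=log_times)
    trace = pm.sample(1000, tune=1000, chains=2)       # MCMC posterior draws

# trace now holds posterior samples of every server's expected service time.
```

Posterior draws of each server’s mean service time, and of its variance, can then feed queueing formulas like the one sketched after the list above, keeping every quantity interpretable to management.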

Data science in government is about improving decisions through a better understanding of the world. That is mostly the application of operations research, and that is how our group applies computation and mathematics to government problems.
