Become a GAMM-ateur climate scientist with mgcv

I love tennis. I play tennis incessantly. I follow it like a maniac. This January, my wife and I attended the Australian Open, and then after the tournament we played tennis every day for hours in the awesome Australian summer heat. During a water break one afternoon, I checked the weather app on my phone; the mercury reached 44 C!

The Aussie Open 2019: Rafael Nadal prepares to serve in the summer heat.

It got me to thinking about climate change and one of the gems in my library, Generalized Additive Models: An Introduction with R by Professor Simon N. Wood – he is also the author of the R package mgcv (Mixed GAM Computation Vehicle with Automatic Smoothness Estimation).

First, Wood’s book on generalized additive models is a fantastic read and I highly recommend it to all data scientists – especially data scientists in government who are helping to shape evidence-based policy. In the preface the author says:

“Life is too short to spend too much time reading statistical texts. This book is of course an exception to this rule and should be read cover to cover.”

I couldn’t agree more. There are many wonderful discussions and examples in this book with breadcrumbs into really deep waters, like the theory of soap film smoothing. Pick it up if you are looking for a nice self-contained treatment of generalized additive models, smoothing, and mixed modelling. One of the examples that Wood works through is the application of generalized additive mixed modelling to daily average temperatures in Cairo, Egypt (section 7.7.2 of his book). I want to expand on that discussion a bit in this post.

Sometimes we hear complaints that climate change isn’t real, that there’s just too much variation to reveal any signal. Let’s see what a bit of generalized additive modelling can do for us.

A generalized linear mixed model (GLMM) takes the standard form:

    \begin{align*}\boldsymbol{\mu}^b &= \mathbb{E}({\bf y}\mid{\bf b}), \\ g(\mu_i^b) &= {\bf X}_i\boldsymbol{\beta} + {\bf Z}_i{\bf b}, \\ {\bf b} &\sim N({\bf 0}, {\boldsymbol{\psi}}_\theta), \\ y_i\mid{\bf b} &\sim \text{exponential family dist.,}\end{align*}

where g is a monotonic link function and {\bf b} contains the random effects, with zero expected value and a covariance matrix {\boldsymbol{\psi}}_\theta parameterized by \theta. A generalized additive model uses the same structure, but the design matrix {\bf X} is built from spline basis evaluations rather than from the regressors directly, and the corresponding spline coefficients carry a “wiggliness” penalty. For details, see Generalized Additive Models: An Introduction with R, Second Edition.
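
To give a rough sense of what that penalty does: in the penalized regression spline setting of Wood’s book, the coefficients are estimated by maximizing a penalized log-likelihood of roughly the form

    \begin{equation*}\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}\; \ell(\boldsymbol{\beta}) - \frac{1}{2}\sum_j \lambda_j \boldsymbol{\beta}^{\sf T}{\bf S}_j\boldsymbol{\beta},\end{equation*}

where each {\bf S}_j is a known penalty matrix measuring the wiggliness of the j-th smooth and the smoothing parameters \lambda_j are themselves estimated from the data (for example by REML). This is also the bridge to mixed models: the penalized parts of each smooth can be treated as random effects whose variances are governed by the \lambda_j, which is exactly what a GAMM exploits.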

The University of Dayton has a website with daily average temperatures from a number of different cities across the world. Let’s take a look at Melbourne, Australia – the host city of the Australian Open. The raw data has untidy bits, and in my R Markdown file I show my code and the clean-up choices that I made.
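
For a flavour of the tidying involved, here is a rough sketch. I’m assuming (from memory – check the actual files) that the archive serves whitespace-separated month/day/year/temperature columns, with temperature in Fahrenheit and -99 marking missing days; the file name and column names below are placeholders.

    library(dplyr)

    # Hypothetical file name and layout - adjust to the real Dayton download.
    raw <- read.table("melbourne.txt",
                      col.names = c("month", "day", "year", "temp_f"))

    melb <- raw %>%
      filter(temp_f > -99) %>%                              # drop the missing-value sentinel
      mutate(temp = (temp_f - 32) * 5 / 9,                  # Fahrenheit to Celsius
             date = as.Date(paste(year, month, day, sep = "-")),
             time = as.numeric(date - min(date)),           # running day index
             time.of.year = as.numeric(format(date, "%j"))) # day within the year

    # Drop day 366 in leap years so the seasonal cycle stays on a 365-day period
    # (one simple choice; not necessarily the one I made in the R Markdown).
    melb <- filter(melb, time.of.year <= 365)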

The idea is to build an additive mixed model with temporal correlations. Wood’s mgcv package allows us to build rather complicated models quite easily. For details on the theory and its implementation in mgcv, I encourage you to read Wood’s book. The model I’m electing to use is:

    \begin{equation*} \text{temp}_i = s_1(\text{time.of.year}_i) + s_2(\text{time}_i) + e_i,\end{equation*}

where e_i = \phi_1 e_{i-1} + \phi_2 e_{i-2} + \epsilon_i with \epsilon_i \sim N(0,\sigma^2), s_1(\cdot) is a cyclic cubic smoothing spline that captures seasonal temperature variation on a 365-day cycle, and s_2(\cdot) is a smoothing spline that tracks a temperature trend, if any. I’m not an expert in modelling climate change, but this type of model seems reasonable – we have a seasonal component, a component that captures daily autocorrelations in temperature through an AR(2) process, and a possible trend component if it exists. To speed up the estimation, I nest the AR(2) residual component within year.
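
In mgcv the whole model is one call to gamm(). Here is a minimal sketch, assuming a data frame melb with the columns from the tidying sketch above (temp, time.of.year, time, year):

    library(mgcv)   # gamm()
    library(nlme)   # corARMA()

    fit <- gamm(temp ~ s(time.of.year, bs = "cc", k = 20) +  # cyclic cubic seasonal smooth
                       s(time, bs = "cr"),                   # long-run trend smooth
                data = melb,
                correlation = corARMA(p = 2,                 # AR(2) residuals ...
                                      form = ~ 1 | year))    # ... nested within year

    summary(fit$gam)          # smooth terms and approximate significance
    plot(fit$gam, pages = 1)  # seasonal component and trend on one page

The fit comes back in two pieces: fit$gam holds the estimated smooths, while fit$lme holds the underlying mixed-model fit, including the estimated AR(2) coefficients.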

The raw temperature data for Melbourne, Australia is:

Daily mean temperature in Melbourne: 1995 – 2019.

We see a clear seasonal pattern in the data, but there is also a lot of noise. The GAMM model will reveal the presence of a trend:

Climate change trend in Melbourne: 1995 – 2019.

We can see that Melbourne has warmed over the last two decades (by almost 2 C). Using the Dayton temperature dataset, I created a website based on the same model that shows temperature trends across about 200 different cities. Ottawa, Canada (Canada’s capital city) is included among the list of cities and we can see that the temperature trend in Ottawa is a bit wonky. We’ve had some cold winters in the last five years and while the Dayton data for Ottawa is truncated at 2014, I’m sure the winter of 2018-2019 with its hard cold spells would also show up in the trend. This is why the phenomenon is called climate change – the effect is, and will continue to be, uneven across the planet. If you like, compare different cities around the world using my website.
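
If you want to put a rough number on the Melbourne trend yourself, one way (using the hypothetical fit object from the sketch above) is to pull out the estimated trend term and look at its overall range:

    # Contribution of each smooth to the fitted temperature, day by day.
    terms <- predict(fit$gam, type = "terms")
    trend <- terms[, "s(time)"]

    # Overall range of the estimated trend - roughly the warming in degrees Celsius,
    # provided the trend is more or less monotone.
    diff(range(trend))

    # Or plot just the trend smooth with its approximate 95% confidence band.
    plot(fit$gam, select = 2, shade = TRUE, xlab = "day", ylab = "trend (C)")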

As a point of caution, climate change activists should temper their predictions about how exactly climate change will affect local conditions. I recall that in 2013 David Suzuki wrote about what climate change could mean for Ottawa, saying

…one of Canada’s best-loved outdoor skating venues, Ottawa’s Rideau Canal, provides an example of what to expect…with current emissions trends, the canal’s skating season could shrink from the previous average of nine weeks to 6.5 weeks by 2020, less than six weeks by 2050 and just one week by the end of the century. In fact, two winters ago, the season lasted 7.5 weeks, and last year it was down to four. The canal had yet to fully open for skating when this column was written [January 22, 2013].

The year after David Suzuki wrote this article, the Rideau Skateway enjoyed the longest run of consecutive skating days in its history and came close to having one of the longest seasons on record. This year (2019) has been another fantastic skating season, lasting 71 days (with a crazy cold winter). My GAMM analysis of Ottawa’s daily average temperature shows just how wild local trends can be. Unfortunately, statements like the one David Suzuki made fuel climate change skeptics. Some people will point to his bold predictions for 2020, see the actual results, and then dismiss climate change altogether. I doubt that David Suzuki intends that kind of advocacy! Climate change is complicated: not every place on the planet will see warming, and certainly not evenly. And if the jet stream becomes unstable during the North American winter, climate change may bring bitterly cold winters to eastern Canada on a regular basis – all while the Arctic warms and melts. There are complicated feedback mechanisms at play, so persuading people about the phenomenon of climate change with facts instead of cavalier predictions is probably the best strategy.

Now, establishing that climate change is real and persuading people of its existence is only one issue – what to do about it is an entirely different matter. We can agree that climate change is real and mostly anthropogenic, but it does not follow that the climate lobby’s policy agenda is the inevitable response. Given the expected impact of climate change on the global economy, and the need to weigh its economic consequences in a world of scarce resources, we should seek out the best evidence-based policy solutions available.

Let’s use the best evidence, both from climate science and economics, as our guide for policy in an uncertain future.

Data science in government is really operations research

Colby Cosh had an interesting article in The National Post this week, Let’s beat our government-funded AI addiction together. In his article he refers to a Canadian Press story about the use of artificial intelligence in forest fire management. He has this to say:

“You start with your observations. What have you seen in the past decades in terms of where wildfires have occurred and how big they got? And you look for correlations with any factor that might have any impact. The question is which data really does have any correlation. That’s where the AI comes in play. It automatically figures those correlations out.”

As a reader you might be saying to yourself “Hang on: up until the part where he mentioned ‘AI’, this all just sounds like… regular scientific model-building? Didn’t statistics invent stuff like ‘correlations’ a hundred years ago or so?” And you’d be right. We are using “AI” in this instance to mean what is more accurately called “machine learning.” And even this, since it mentions “learning,” is a misleadingly grandiose term.

Cosh has a point. Not only are labels like artificial intelligence being attached to just about everything involving computation these days, but just about everyone who works with data is now calling themselves a data scientist. I would like to offer a more nuanced view, and provide a bit of insight into how data science actually works in the federal government as practiced by professional data scientists.

Broadly, data science problems fall into two areas:

1) Voluminous, diffuse, diverse, usually cheap data with a focus on finding needles in haystacks. Raw predictive power largely determines model success. This situation is the classic Big Data data science problem and is tightly associated with the realm of artificial intelligence. The term Big Data sometimes creates confusion among the uninitiated – I’ve seen the occasional business manager assume that large data sets refer to a file that’s just a bit too large to manage with Excel. In reality, true Big Data consists of data sets that cannot fit into memory on a single machine or be processed by a single processor. Most applications of artificial intelligence require truly huge amounts of training data along with a host of specialized techniques to process it. Examples include finding specific objects within a large collection of videos, voice recognition and translation, handwriting and facial recognition, and automatic photo tagging.

2) Small, dense, formatted, usually expensive data with a focus on revealing exploitable relationships for human decision making. Interpretability plays a large role in determining model success. Unlike the Big Data problems, the relevant data almost always fit into memory on a single machine and are amenable to computation with a limited number of processors. These moderate-sized problems fit within the world of operations research, and theoretical models of the phenomenon provide important guides. Examples include modelling queues, inventories, optimal stopping, and trade-offs between exploration and exploitation. A contextual understanding of the data and the question is paramount.

Government data science problems are almost always of the second type, or can be transformed into the second type with a bit of effort. Our data is operational in nature, expensive, dense, small (under 30 GB), rectangular, approximately well-formatted (untidy with some errors, but not overly messy), with a host of privacy and sometimes security concerns. Government decision makers seek interpretable relationships. The real world is more complicated than any mathematical model, hence the need for a decision maker in the first place. The decision maker’s experience is an essential part of the process. As Andrew Ng points out in the Harvard Business Review, What Artificial Intelligence Can and Can’t Do Right Now,

“If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

Government decision making usually does not conform to that problem type. Data science in government is really operations research by another name.

Often analysts confuse the two types of data science problems. Too often we have seen examples of the inappropriate use of black-box software. Feeding a few gigabytes of SQL rectangular data into a black-box neural net software package for making predictions in a decision making context is almost certainly misplaced effort. Is the black-box approach stronger? As Yoda told Luke, “No, no, no. Quicker, easier, more seductive.” There is no substitute for thinking about the mathematical structure of the problem and finding the right contextual question to ask of the data.

To give a more concrete example, in the past I was deeply involved with a queueing problem that the government faced. Predicting wait times, queue lengths, and arrivals is not a black-box, plug-and-play problem. To help government decision makers better allocate scarce resources, we used queueing theory along with modern statistical inference methods. We noticed that servers across our queue came from a heterogeneous population of experience and skill, nested within teams. We estimated production using hierarchical models fit by Markov chain Monte Carlo, and used the results to inform aspects of our queueing models. We were not thinking about driving data into black boxes; we were concerned with the world of random walks, renewal theory, and continuous-time Markov chains. Our modelling efforts engendered management discussions that focused on the trade-offs between reducing service-time variance, increasing average service speed, and adding queue capacity; all of these play a role in determining the long-term average queue length, and all have their own on-the-ground operational quirks and costs. Data science, as we practice it in the civil service, moves management discussions to a higher level, so that the decision maker’s unique experience and insight become crucial to the final decision. Raw predictive power is usually not the goal – an understanding of how to make optimal trade-offs with complex decisions is.
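
To make that flavour concrete (this is not the model we actually used, and the function name is mine), even the textbook M/G/1 queue shows why service-time variance matters as much as raw speed, via the Pollaczek-Khinchine formula:

    # Expected number of customers waiting in an M/G/1 queue (Pollaczek-Khinchine).
    # Illustrative only - not the government model described above.
    expected_queue_length <- function(arrival_rate, mean_service, var_service) {
      rho <- arrival_rate * mean_service   # server utilization; must be < 1
      stopifnot(rho < 1)
      (rho^2 + arrival_rate^2 * var_service) / (2 * (1 - rho))
    }

    # Same arrival rate and mean service time; only the service-time variance differs.
    expected_queue_length(arrival_rate = 0.9, mean_service = 1, var_service = 1)     # about 8.1
    expected_queue_length(arrival_rate = 0.9, mean_service = 1, var_service = 0.25)  # about 5.1

Cutting the service-time variance, without touching the average speed at all, shrinks the expected queue substantially, which is exactly the kind of trade-off those management discussions revolved around.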

Data science in government is about improving decisions through a better understanding of the world. That’s mostly the application of operations research and that is how our group applies computation and mathematics to government problems.