I am NOT the decider: the limits to science in public policy and decision making

In the last decade, Western politicians and government officials have made evidence-based decision making a key plank in their platforms and operations. From climate change to Covid-19, governments around the world are increasingly leaning on scientists and other experts to help form policy. I welcome scientific input; without it we are blind. At the same time I fear that we sometimes expect too much from science. Science cannot answer moral questions, and it cannot determine our values.

Parliament: Our collective decision making home.

Today, within some circles of our chattering classes it’s in vogue to complain that our democracies are too slow, too ineffectual, and too unresponsive; that somehow an administrative state run by experts and only lightly guided by politicians would offer superior results. But for all its shortcomings and imperfections in process, accountability at the election booth provides the best mechanism to ensure that our collective decision making lines up with our collective values. We invest the power of decision making in our elected officials for a reason – we demand that our leaders take responsibility, and then we hold them accountable.

Science can never replace public decision making. How many of our civil liberties should we suspend to fight Covid-19? How much global warming is worth the extra economic growth? How much poverty should we tolerate in our country? These are not scientific questions; they all require a value judgment, and there is no ultimate right answer. In an increasingly technical and scientific age, we need our democracy more than ever. Scientists, economists, and other professional experts are not elected and are not accountable to the public the way an elected official is. The real decision involves many competing issues on which scientists and other experts are just as dumb as the next guy. There is no “science machine” that can spit out the right course of action for our elected officials to take. The real strength of science is not certitude but doubt. With my data science team, I stress our role in government decision making with our team motto:

We draw conclusions from data, not recommendations.

By focusing on conclusions that the data can support, we help decision makers understand the likely consequences of alternative courses of action. We emphasize that, for all its sophistication and mathematics, our input is a simplification of reality – but with enough fidelity to help ring-fence the decision. We are under no illusions about how difficult the real problem is, and we never present the decision maker with an ultimatum in the form of a recommendation. We are not elected.

In digesting expert advice, I think Lord Salisbury’s insights from 1877 still apply:

No lesson seems to be so deeply inculcated by the experience of life as that you never should trust experts. If you believe the doctors, nothing is wholesome: if you believe the theologians, nothing is innocent: if you believe the soldiers, nothing is safe. They all require to have their strong wine diluted by a very large admixture of insipid common sense.

The biggest wealth transfer in history – from our children to us

As the Western world grapples with Covid-19, trying to find the right balance between limiting human contact and keeping our economies at least partly open, we are embarking on perhaps the biggest wealth transfer in human history. We are in the process of transferring a very large portion of our children’s future consumption to the present in the form of increased safety. Between creditor bailouts and new spending, it is our children who will have to pay the bill in the form of higher taxes.

Someone has to pay!

In normal circumstances we use debt to finance an asset that is expected to generate a return. For example, a business like a restaurant might borrow to finance renovations or start-up costs, and the debt is paid back through business profits. Occasionally the restaurateur will fail and the loan might not get paid back in full, but that is why business loans don’t offer riskless interest rates. The higher interest rate is compensation for the possibility of failure. Government deficits operate in a similar fashion. The increased government debt is supposed to generate societal returns while recognizing that the debt must be paid back through taxation. As Ricardian equivalence points out, there is no free lunch – society internalizes the government’s budget constraint. To first order, people’s consumption decisions do not depend on how the government finances its spending, only on the spending itself. With increasing public debts, people anticipate the higher future taxes and adjust their consumption accordingly.
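The no-free-lunch logic can be sketched with a toy two-period example (all numbers purely illustrative): whether the government taxes today or borrows today and taxes tomorrow, the household’s lifetime resources – and hence its consumption plan – are unchanged.

```python
# Two-period Ricardian equivalence sketch (illustrative numbers only).
# The government must finance g = 100 of spending. It either taxes 100
# today, or borrows 100 today and taxes 100 * (1 + r) tomorrow.

r = 0.05                  # one-period interest rate
income = (1000, 1000)     # household income today and tomorrow
g = 100                   # government spending today

def lifetime_resources(tax_today, tax_tomorrow):
    """Present value of the household's after-tax income."""
    return (income[0] - tax_today) + (income[1] - tax_tomorrow) / (1 + r)

tax_financed = lifetime_resources(g, 0)             # pay for g with taxes now
debt_financed = lifetime_resources(0, g * (1 + r))  # borrow now, tax later

print(tax_financed, debt_financed)  # identical (up to floating point)
```

Because lifetime resources are the same under both schemes, a household that smooths consumption chooses the same consumption path either way; only the spending itself matters.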

In the current situation, debts public and private are not financing productive assets; they are just keeping the lights on. There are no extra business profits and no extra economic growth that we can expect from all this new debt to pay back the burden. This situation is the definition of a financial hole. Someone will have to cover that hole, and that someone is our children.

The total cost of Covid-19 mitigation is not just the current direct costs, but also the lost future economic growth as our children pay taxes to cover the hole instead of using their wealth to make investments and generate innovation. And these costs are really beginning to pile up. I wonder what the total cost per life-year saved will turn out to be, because in the end, that is what our children are buying with all the debt we are creating. How much of our children’s and our children’s children’s future consumption is today’s extra safety – almost exclusively for senior citizens – worth? I don’t know, but I do know that our children and our children’s children don’t get a say.

Honestly, I find this all a little strange. We are waiting for a vaccine, but society went about its business long before Salk, and long before antibiotics. We built railways across the country and skyscrapers in our cities under what today would be considered prohibitively dangerous working conditions. You and I continue to benefit from that inheritance, but what will we bequeath to our children? Life was more hazardous in the past. I’m not suggesting that we return to 19th or early 20th century standards, but Covid-19 has made life only a little bit more dangerous again. Instead of living with and accepting some extra degree of danger, as previous generations did, apparently we are willing to risk destroying the opportunities of the generations coming up so that we can keep our safety as high as absolutely possible. That trade-off is not a public health issue, it’s a moral one.

It’s a good thing that our ancestors didn’t shy away from risk; after we are done with Covid-19, maybe our children won’t either.

No better than a Fermi estimate?

Enrico Fermi, the great Italian-American physicist who contributed immensely to our understanding of nuclear processes and particle physics, was known for saying that any good physicist who knows anything about the scale of a problem should be able to estimate any result to within half an order of magnitude or better without doing a calculation. You only need to solve difficult equations when you want to do better than a factor of 2 or 3.

Enrico Fermi: How many piano tuners live in Chicago?

When I taught at Carleton University, I used to teach my students how to make Fermi estimates. I would ask them to estimate (without using Google!) the number of police officers in Ottawa, the number of marriages that took place in Ontario last summer, or the number of people who die in Canada every day. Fermi estimation isn’t magical, it’s just focused numeracy.
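A minimal sketch of how one of these estimates goes, using Fermi’s classic question about Chicago’s piano tuners (every number below is a rough guess, which is exactly the point):

```python
# Fermi estimate: how many piano tuners are there in Chicago?
# Every input is an order-of-magnitude guess, not data.

population = 3_000_000         # people in Chicago, roughly
households = population / 2    # ~2 people per household
pianos = households / 20       # maybe 1 in 20 households owns a piano
tunings_per_year = pianos * 1  # each piano tuned about once a year

# One tuner does ~2 tunings a day, 5 days a week, 50 weeks a year
tunings_per_tuner = 2 * 5 * 50

tuners = tunings_per_year / tunings_per_tuner
print(round(tuners))  # ~150 -- within a factor of 2 or 3 of reality
```

The arithmetic is trivial; the skill is decomposing the question into quantities you can actually bound.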

There is an article in the CBC this morning, “What national COVID-19 modelling can tell us — and what it can’t”. Unfortunately, the author misses an opportunity to critically question the purpose of modelling and forecasting. The article carries the subtitle “Uncertainty not a reason for doubt” (really?!). On the numerical side, the article tells us that forecasts for Alberta predict between 400 and 3,100 Covid-19 deaths by the end of the summer, and that Quebec could see between 1,200 and 9,000 deaths by the end of April. Beyond the silliness of reporting two significant figures with such uncertainty, if that’s what the models are telling us, they don’t offer much, because they are no better than a Fermi estimate. You can get these results by counting on your fingers, just like Enrico Fermi.

People want answers; I understand that. People don’t like not knowing things, especially when they are frightened. But “models” that offer forecasts no better than Fermi estimates aren’t really models. There’s no need to solve differential equations when your model uncertainty exceeds that of a simple Fermi estimate. That doesn’t mean we shouldn’t work hard at building models, but it does mean that the Covid-19 prediction models need far better calibration against real-world data before they can be useful in helping us understand the reality of future Covid-19 fatalities.

I will leave you with a wonderful story, told at the September 2005 meeting of the Federal Open Market Committee (the Federal Reserve’s monetary policy committee), which highlights the absurdity that can result from forecasting behind a veil of ignorance:

During World War II, [Nobel laureate, Ken] Arrow was assigned to a team of statisticians to produce long-range weather forecasts. After a time, Arrow and his team determined that their forecasts were not much better than pulling predictions out of a hat. They wrote their superiors, asking to be relieved of the duty. They received the following reply, and I quote, “The Commanding General is well aware that the forecasts are no good. However, he needs them for planning purposes.”

Choosing Charybdis

The West needs immediate plans to restart its economies in the most virus-safe way possible. If we don’t begin restarting our economies soon, the West will have chosen Charybdis over Scylla. It’s no longer hypothetical. In the United States alone, the response is costing nearly $2 trillion per month. To put that in perspective, the annual output of the entire US economy before the Covid-19 pandemic was roughly $21 trillion. The economic contraction that we already face rivals the largest year-over-year falls in production during the Great Depression. We risk not having an economy left to restart. The next phase could be a sovereign debt run across the globe – the bond markets are already beginning to signal trouble.

South Koreans are winning the war against Covid-19 by testing as many people as possible, isolating the infected – including asymptomatic carriers – and employing aggressive triage policies. Let’s learn from each other, and slowly and safely reopen our economies while employing best social distancing practices. If the world can’t get back to some kind of functioning economy soon, the law of unintended consequences may come into sharp focus. And what can emerge from those unintended consequences truly frightens me. In the 20th century, the most tyrannical ideologies grew out of instability and hardship. No society is immune to those forces.

Covid-19: Between Scylla and Charybdis, only difficult choices. (Alessandro Allori)

We are in completely uncharted territory. Never in history have we tried to shut down our economies for an indefinite period of time. There is no experience to guide us here; no one knows what awaits us beneath the whirlpool. In addition to our quarantine efforts, we also need to start thinking seriously about the statistical value of life-years remaining as the beginning of some kind of cost-benefit analysis.

People are comparing our current situation to WWII. I think that comparison is apt, but in a way that most people don’t intend.

In 1939 (1941 for our American cousins) we went to war against the Axis powers to protect our way of life, our prosperity, and to build a world in which liberty could grow. If we had let Germany succeed in Europe by surrendering at Dunkirk, we would have survived, and with few Allied casualties. There would be no Allied military cemeteries in Normandy today, or elsewhere in France and Europe. British civilians would have been spared the Blitz. But we would have inherited a world with little opportunity, little prosperity, and a hopeless future for our children. Instead, Canada sacrificed the lives of 42,000 young men – all in the prime of life – with another 55,000 wounded. Our young country of 11 million people put 10% of its citizens directly in harm’s way so that you and I could enjoy a world full of potential, growth, freedom, and peace. The Allies together lost millions. We marched straight forward with resolve and determination and we refused to be swallowed. In the coming weeks, even while employing our best containment efforts, the West may once again be put in the most awful of positions: we may need to ask the literal sons and daughters of the generation that ensured our freedom 80 years ago for a similar sacrifice, this time by accepting only a slightly higher level of risk, and to stand in harm’s way to protect us from what lies beneath.

In 1939, we chose Scylla and we won.

Covid-19: Between Scylla and Charybdis. A word of caution from Professor John Ioannidis

Like Odysseus, the Western world finds itself caught between Scylla and Charybdis. We have embarked on a policy path to combat the Covid-19 pandemic that has no precedent in our collective history. The eurozone is looking at a 24% economic contraction in the second quarter on an annualized basis. With numbers that large, I can’t help but think that all kinds of geopolitical risks lurk around the corner. (In the lead-up to WWI, nearly all intellectuals and leaders of the European powers believed any conflict would last a matter of weeks, or at most a few months. They badly miscalculated.)

Odysseus facing the choice between Scylla and Charybdis, Henry Fuseli.

In Italy, limited capacity is forcing physicians and medical staff into difficult moral choices. We may reach another moral choice in the very near future – placing a hard upper bound on the “value of a statistical life”, corrected for remaining years of life expectancy. How much are we willing to throttle our economy to save some lives with policies that will eventually cost other lives down the road? There are no easy answers here, only trade-offs.
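To make the idea concrete, the standard life-expectancy correction divides a value of statistical life (VSL) by an annuity factor over the remaining years, giving a value per statistical life-year (VSLY). The figures below are illustrative assumptions only, not official numbers:

```python
# Back-of-envelope value of a statistical life-year (VSLY).
# All inputs are illustrative assumptions, not official figures.

vsl = 7_000_000        # assumed value of a statistical life, in dollars
remaining_years = 40   # assumed remaining life expectancy behind the VSL
discount = 0.03        # annual discount rate

# Annuitize the VSL over the remaining life-years
annuity_factor = sum(1 / (1 + discount) ** t
                     for t in range(1, remaining_years + 1))
vsly = vsl / annuity_factor

print(round(vsly))  # roughly $300,000 per statistical life-year
```

With a number like this in hand, a policy’s cost per life-year saved can at least be compared against it – the beginning of the trade-off analysis, not the end.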

But before we can do any trade-off analysis, we need good data. John Ioannidis, professor of Medicine, of Health Research and Policy, and of Biomedical Data Science at the Stanford University School of Medicine, and professor of Statistics at the Stanford University School of Humanities and Sciences, has a new article in STAT: “A fiasco in the making? As the coronavirus pandemic takes hold, we are making decisions without reliable data”. Professor Ioannidis is an expert in statistics, data science, and meta-analysis (combining data and results from multiple studies on the same research question). He is also the author of the celebrated paper “Why Most Published Research Findings Are False” in PLOS Medicine. In “A fiasco in the making?”, Professor Ioannidis asks,

“Draconian countermeasures have been adopted in many countries. If the pandemic dissipates — either on its own or because of these measures — short-term extreme social distancing and lockdowns may be bearable. How long, though, should measures like these be continued if the pandemic churns across the globe unabated? How can policymakers tell if they are doing more good than harm?”

He also points out that we truly don’t understand the current infection level,

“…we lack reliable evidence on how many people have been infected with SARS-CoV-2 (Covid-19) or who continue to become infected. Better information is needed to guide decisions and actions of monumental significance and to monitor their impact…The data collected so far on how many people are infected and how the epidemic is evolving are utterly unreliable. Given the limited testing to date, some deaths and probably the vast majority of infections due to SARS-CoV-2 are being missed. We don’t know if we are failing to capture infections by a factor of three or 300…The most valuable piece of information for answering those questions would be to know the current prevalence of the infection in a random sample of a population and to repeat this exercise at regular time intervals to estimate the incidence of new infections. Sadly, that’s information we don’t have.”

In the article, he details the analysis of the natural experiment offered by the quarantined passengers on the Diamond Princess cruise ship and what it could mean for bounding the case fatality ratio of SARS-CoV-2. He ends the article on a cautionary note about the importance of weighing consequences against expected results:

“…with lockdowns of months, if not years, life largely stops, short-term and long-term consequences are entirely unknown, and billions, not just millions, of lives may be eventually at stake. If we decide to jump off the cliff, we need some data to inform us about the rationale of such an action and the chances of landing somewhere safe.”

I encourage you to read Professor Ioannidis’ article. We are stuck between Scylla and Charybdis, but we can make better decisions with better data. Our choices over the next couple of weeks may change the course of human history.

UPDATE March 20, 2020

A commenter, Gittins Index (thanks!), has found a freely accessible copy of W. Kip Viscusi’s classic paper on the value of statistical life: “The Value of Risks to Life and Health”, Journal of Economic Literature, Vol. XXXI (December 1993), pp. 1912-1946.

Covid-19: Fighting a fire with water or gasoline? Whispers from the 1930s

I’ve been reflecting on the global Covid-19 situation for the last couple of weeks, and I fear government failures around the world. World governments’ reactions to the novel coronavirus risk pushing our economies into a deep global recession. There is often an enormous cost to “an abundance of caution”. Are the risks worth the trade-offs?

Covid-19: Ground zero of a global recession?

The World Health Organization’s statement last week, claiming that 3.4% of those who caught Covid-19 died, is in all likelihood a gross upward bias of the true mortality rate. South Korea, a country hit particularly hard by the infection, has administered more than 1,100 tests per million citizens. Analysis of its data suggests a mortality rate of 0.6%. As a point of comparison, the seasonal flu has a mortality rate of about 0.1%. The high mortality in early estimates of Covid-19 seems to result from extreme truncation – a statistical problem that is not easy to solve. People who present themselves at medical facilities tend to be the worst affected, making observation of those individuals easy, while those with mild symptoms are never heard from. Covid-19 is probably more dangerous than the flu for the elderly and those with pre-existing conditions, which is almost certainly the main driver of the higher mortality rate relative to the seasonal flu. Italy’s numbers seem to be an outlier, but it’s unclear exactly what testing strategy they are using. At any rate, what worries me is not Covid-19 but the seemingly chaotic, on-the-fly government responses around the world that threaten to turn a bad but manageable problem into a global catastrophe. We have a precedent for such government policy failures: the Great Depression.
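The truncation effect is easy to see in a toy simulation (parameters purely illustrative): if only severe cases show up for testing, the naive fatality ratio computed from observed cases can overstate the true infection fatality ratio by an order of magnitude.

```python
import random

# Toy selection-bias simulation (illustrative parameters only).
random.seed(0)

true_ifr = 0.006        # assumed true infection fatality ratio
n_infected = 200_000

observed_cases = 0
observed_deaths = 0
for _ in range(n_infected):
    severe = random.random() < 0.10                      # 10% get severe symptoms
    dies = severe and random.random() < true_ifr / 0.10  # deaths occur among severe cases
    if severe:                     # only severe cases present for testing
        observed_cases += 1
        observed_deaths += dies

naive_cfr = observed_deaths / observed_cases
print(f"true IFR = {true_ifr:.3f}, naive CFR from observed cases = {naive_cfr:.3f}")
```

The overall fatality rate in the simulated population is 0.6%, but the rate among observed (severe) cases is about 6% – ten times higher – purely because the mild infections never enter the data.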

In the late 1920s, in an attempt to limit speculation in securities markets, the Federal Reserve increased interest rates. This policy slowed economic activity to the point that by August 1929 the US economy had fallen into recession. Through gold-standard channels, the Federal Reserve’s policy induced recessions in countries around the world. In October the stock market crashed. By themselves, even these poor policy choices should not have caused a depression, but the Federal Reserve compounded its mistakes by adopting a policy of monetary contraction. By 1933 the stock of money had fallen by over a third. Since people wished to hold more money than the Federal Reserve supplied, they hoarded money and consumed less, choking the economy. Prices fell. Unemployment soared. The Federal Reserve, based on erroneous policy and further misdiagnoses of the economic situation, turned a garden-variety but larger-than-average recession into a global catastrophe. Ben Bernanke, former chairman of the Federal Reserve and an expert on the Great Depression, says:

Let me end my talk by abusing slightly my status as an official representative of the Federal Reserve. I would like to say to Milton [Friedman] and Anna [Schwartz]: Regarding the Great Depression, you’re right. We did it. We’re very sorry. But thanks to you, we won’t do it again.

Unintentionally, the Federal Reserve’s poor decision making created a global disaster. This is the face of government failure. Poor policies can lead to terrible consequences that last decades and scar an entire generation.

When an entire economy largely shuts down by government fiat for a few weeks or a month, it is not as simple as reopening for business and making back the losses when the crisis passes. During the shutdown, long-term contracts still need to get paid, employees still need to get paid, business loans still need to get repaid, taxes are still owed, and so on. When everything restarts, businesses are in a hole, so it’s not back to business as usual. Some businesses will fail; they will never catch up. Some people will accordingly lose their jobs. Production and supply chains will need to adjust; an economic contraction becomes likely. Quickly shutting down an economy is a bit like quickly shutting down a nuclear reactor: you must be really careful or you risk a meltdown. With Covid-19 policies, governments around the world are risking the economic equivalent. The stock market is rationally trying to price the probability of a policy-induced catastrophe, hence the incredible declines and massive volatility.

Every day 150,000 people die on this planet. How much do we expect that number to change as a result of Covid-19? Are policies that risk a serious global recession or worse worth it? Maybe. But that’s a serious set of consequences to consider. Maybe we will get lucky, largely escape unscathed, and it will all pass soon. Or maybe not. Yet a comparison to the Federal Reserve’s policy actions in the late 1920s and early 1930s generates an unsettling feeling of déjà vu: made-on-the-fly government responses, rooted in an “abundance of caution” with more than a touch of panic, are putting the world economy on the cusp of a global catastrophe.

The depth of a serious government failure is beyond measure. It’s not climate change that’s the biggest threat to humanity; it’s unforeseen events coupled with risky policy responses, like the situation we currently find ourselves in, that should really worry us. Real problems come out of nowhere, just like Covid-19, not stuff that might happen over a hundred years with plenty of time to adapt. Let’s all demand careful policy responses and weigh the risks and consequences appropriately. Otherwise, we just might find out how true the aphorism is:

History might not repeat itself, but it does rhyme.

UPDATE – March 13, 2020

Policy choices have trade-offs. When policy is slapped together in a panic, more often than not the hastily constructed policy produces little value in solving the problem but creates enormous secondary problems that eclipse the original problem’s severity. We need to be careful. That doesn’t mean we ignore the problem – of course saving lives matters, and we should all do our part to help. But swinging from pillar to post with large policy shifts that appear faster than the 24-hour news cycle, as we have seen in some countries, is a very risky policy response. We don’t want to do more harm than good. Fortunately, it appears that governments around the world are beginning to have more coordinated conversations.

More than anything, I think this experience points to the need for serious government and economic pandemic plans for the future. It’s a bit ironic that policy has started to demand stress testing the financial system for slow-moving climate change effects, but no one seemed to include pandemics. How many financial stress tests evaluated the impact of what’s happening right now? This event is a wake-up call for leadership around the world to be more creative in thinking about what tail events really look like. Having better plans and better in-the-can policy will protect more lives while preserving our economic prosperity.

Finally, a serious global recession or worse is not something to take lightly. Few of us have experienced a serious economic contraction. If the global economy backslides to a significant extent, the opportunity we have to lift billions of people out of poverty gets pushed into the future. That costs lives too. Economic growth is a powerful poverty-crushing machine. Zoonotic viruses like SARS-CoV-2 almost always emerge where people live in close proximity to livestock, a condition usually linked to poverty. In a world as affluent as Canada, the chance of outbreaks like the one we are witnessing drops dramatically. I hope that the entire world will one day enjoy Canada’s level of prosperity.

A paper to read by Gelman and Shalizi: Philosophy and the practice of Bayesian statistics

The great 20th century physicist Richard Feynman supposedly quipped “Philosophy of science is about as useful to scientists as ornithology is to birds.” As always, Feynman has a point, but in the fields of statistics, machine learning, and data science, understanding at least some of the philosophy behind techniques can prevent an awful lot of silliness and generate better results.

Feynman: You philosophers!

In their paper, Philosophy and the practice of Bayesian statistics, (British Journal of Mathematical and Statistical Psychology 2013, 66, 8-38) Andrew Gelman and Cosma Shalizi offer a thoughtful piece on what is really going on – or what really should be going on – in Bayesian inference. This paper is a short, highly interesting read, and I strongly suggest that all data scientists in the federal government put it on their reading lists.

For the uninitiated, statistical inference falls into two broad schools. The first, often called “classical statistics”, follows Neyman-Pearson hypothesis tests, Neyman’s confidence intervals, and Fisher’s p-values. Statistical inference rests on maximizing the likelihood function, leading to parameter estimates with standard errors. This school is usually the first one people encounter in introductory courses. The second school – Bayesian statistical inference – starts with a prior distribution over the parameter space and uses data to transform the prior into a posterior distribution. The philosophies behind the two schools are often said to be deductive in the classical case and inductive in the Bayesian one. The classical school follows a method that leads to rejection or falsification of a hypothesis, while the Bayesian school follows an inductive “learning” procedure with beliefs that rise and fall with posterior probabilities. Basically, if it’s not in the posterior, the Bayesian says it’s irrelevant. The Bayesian philosophy has always made me feel a bit uncomfortable. Bayesian methods are not the issue – I use them all the time – it’s the interpretation of pure inductive learning that has always bothered me. To me, the prior-to-posterior procedure is actually a form of deductive reasoning, but with regularization over the model space.
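The prior-to-posterior mechanics I have in mind can be sketched with the simplest conjugate example, a Beta-Binomial model (numbers illustrative):

```python
# Beta-Binomial conjugate update: with a Beta(a, b) prior on a success
# probability p, observing k successes in n trials gives a
# Beta(a + k, b + n - k) posterior. Illustrative numbers only.

a, b = 2, 2        # prior pseudo-counts, weakly centred on p = 0.5
k, n = 7, 10       # data: 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)

prior_mean = a / (a + b)                     # 0.5
posterior_mean = a_post / (a_post + b_post)  # 9/14, about 0.643

print(prior_mean, posterior_mean)
```

On the inductive reading, the posterior is updated belief; on the reading I prefer, the prior acts like a regularizer and the whole update is a deduction from the model’s assumptions plus the data.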

Gelman and Shalizi go right to the heart of this issue claiming that “this received view [pure inductive learning] of Bayesian inference is wrong.” In particular, the authors address the question: What if the “true” model does not belong to any prior or collection of priors, which is always the case in the social sciences? In operations research and anything connected to the social sciences, all models are false; we always start with an approximation that we ultimately know is wrong, but useful. Gelman and Shalizi provide a wonderful discussion about what happens with Bayesian inference in which the “true” model does not form part of the prior, a situation they label as the “Bayesian principal-agent problem”.

In the end, Gelman and Shalizi emphasize the need for model testing and checking, through new data or simulations. They demand that practical statisticians interrogate their models, pushing them to the breaking point and discovering what ingredients can make them stronger. We need to examine carefully how typical or extreme our data are relative to what our models predict. The authors highlight the need for graphical and visual checks when comparing data to simulations. This model-checking step applies equally to Bayesian model building, and in that sense both schools of statistics are hypothetico-deductive in their reasoning. Indeed, much of the real power of Bayesian inference lies in its ability to deduce checkable consequences from a fitted model across many inferences at once. The authors essentially advocate the model-building approach of George Box and hold to a largely Popperian philosophy.
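The checking step Gelman and Shalizi advocate can be sketched with a toy posterior predictive check on a Beta-Binomial coin-flip model (all numbers illustrative): simulate replicated datasets from the fitted model and ask whether the observed data look extreme against them.

```python
import random

# Toy posterior predictive check (illustrative numbers only).
random.seed(1)

k_obs, n = 7, 10                             # observed: 7 heads in 10 flips
a_post, b_post = 2 + k_obs, 2 + (n - k_obs)  # Beta posterior from a Beta(2, 2) prior

# Draw p from the posterior, simulate a replicated dataset, and record how
# often the replication has at least as many heads as we actually observed.
reps = 10_000
extreme = 0
for _ in range(reps):
    p = random.betavariate(a_post, b_post)
    k_rep = sum(random.random() < p for _ in range(n))
    extreme += (k_rep >= k_obs)

ppp = extreme / reps  # posterior predictive p-value
print(f"posterior predictive p-value: {ppp:.2f}")
```

A posterior predictive p-value near 0 or 1 flags model-data conflict; a middling value, as here, means the model reproduces data like ours – exactly the hypothetico-deductive move the authors describe.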

Finally, Gelman and Shalizi caution us that viewing Bayesian statistics as subjective inductive inference can lead us to complacency in picking and averaging over models rather than trying to break our models and push them to the limit.

While Feynman might have disparaged the philosopher, he was a bit of a philosopher himself from time to time. In an address to the Caltech YMCA Lunch Forum on May 2, 1956, he said:

That is, if we investigate further, we find that the statements of science are not of what is true and what is not true, but statements of what is known to different degrees of certainty: “It is very much more likely that so and so is true than that it is not true;” or “such and such is almost certain but there is still a little bit of doubt;” or – at the other extreme – “well, we really don’t know.” Every one of the concepts of science is on a scale graduated somewhere between, but at neither end of, absolute falsity or absolute truth.

It is necessary, I believe, to accept this idea, not only for science, but also for other things; it is of great value to acknowledge ignorance. It is a fact that when we make decisions in our life we don’t necessarily know that we are making them correctly; we only think that we are doing the best we can – and that is what we should do.

I think Feynman would have been very much in favour of Gelman’s and Shalizi’s approach – how else can we learn from our mistakes?

What does it take to be a successful data scientist in government?

Oh, no…yet another blog post on what it takes to be successful (fill in the blank). What a way to start 2020!

But, last month I was conducting job interviews and at the end of one interview the candidate asked me this very question. So, I thought I would share my answer.

There is endless hype about data science, especially in government circles: AI and deep learning will solve everything (including climate change), chatbots are the future of “human” service interaction, and so on. Yes, all of these methods are useful and have their place, but when you ask the enthusiastic official jumping up and down about AI exactly which problem he hopes to solve and how he thinks AI or deep learning applies, you get a muddleheaded response. Most of the problems people have in mind don’t require any of these techniques. Unfortunately, in the rise toward the peak of inflated expectations, people often promote “solutions” in search of problems instead of the other way around.

My zeroth rule to becoming a successful data scientist: Avoid the hype and instead concentrate on building your craft. Read. Code. Calculate.

Data science applied in government requires three pillars of expertise: mathematics and statistics, hard coding skills, and a thorough contextual understanding of operations.

Mathematics and Statistics

With the explosion of data science there are, to be frank, a lot of counterfeits. Expertise is not something that can be built in a day, in a couple of months, or through a few online courses – it takes years of dedication and hard work. Data science in government and most business operations is not about loading data into black boxes and checking summary statistics. To make better decisions, we almost always seek a causal understanding of the world, generating the ability to answer counterfactual questions while providing a basis for interpreting new observations. Causal constructions require careful mathematical modelling. In the end, the data scientist attached to operations presents decision makers with the likely consequences of alternative courses of action. By quantitatively weighing trade-offs, the data scientist helps the decision maker use his or her expertise in non-quantitative reasoning to reach the best possible decision.

Turning the quantitative part of the decision problem into mathematics requires the data scientist to be an applied mathematician. This requirement goes well beyond the usual undergraduate exposure to linear algebra and calculus. Mathematical maturity, the ability to recognize the nature of the mathematical or statistical inference problem at hand and develop models, is essential to the successful application of data science in business. Think the “physics” structure, not black boxes.
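To make the point concrete, here is a toy illustration of my own (the model, variable names, and coefficients are all invented): a simple structural model answers a counterfactual question that a naive black-box fit gets wrong.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical structural model: spend x drives outcome y, but a
# confounder z drives both. All coefficients are invented; the true
# causal effect of x on y is set to 1.0.
n = 10_000
z = rng.normal(size=n)                       # confounder
x = 2.0 * z + rng.normal(size=n)             # x responds to the confounder
y = 1.0 * x + 3.0 * z + rng.normal(size=n)   # structural equation for y

# A naive regression of y on x alone absorbs the confounding and is
# badly biased (the theoretical slope here is 2.2, not 1.0)...
naive_slope = np.polyfit(x, y, 1)[0]

# ...while the model-based regression that adjusts for z recovers the
# causal slope needed to answer "what happens if we change x?"
X = np.column_stack([x, z, np.ones(n)])
adjusted_slope = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(f"naive: {naive_slope:.2f}, adjusted: {adjusted_slope:.2f}")
```

The black box reproduces the observed correlations perfectly well; only the structural model supports the counterfactual.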

Coding skills

I get silly questions all the time about computer languages. Is Python better than R? Should I use Julia? Telling a computer what to do in a particular language, while technical, should not be the focus of your concerns. Learn how to write quality, intelligible code; after that, the particular language you use is a secondary concern. Use the tool appropriate for the job (R, Python, SQL, Bash, whatever). In our team, we have co-op students who have never seen R before, and by the end of their work term they are R mini-gods, building and maintaining custom R packages, Shiny websites, and R Markdown documents, all within our Git repos.

Whatever data science language you choose as your primary tool, focus on building coding skills and your craft. Data cleaning and tidying is a large part of data science so at least become proficient with split/apply/combine coding structures. Communication is key, not only for clients but also for fellow data scientists. Learn how to build targeted and clean data visualizations in your final products and in your diagnostics. Think functional structure and communication, not obsessing over computer languages.
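As a minimal pandas sketch of split/apply/combine (the dataset and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical service-delivery data (invented for illustration)
df = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "West"],
    "wait_min": [12, 18, 7, 9, 11],
})

# Split by region, apply summaries to each group, combine the results
summary = (
    df.groupby("region")["wait_min"]
      .agg(["mean", "count"])
      .reset_index()
)
print(summary)
```

The same split/apply/combine pattern appears as `dplyr::group_by()` + `summarise()` in R, or `GROUP BY` in SQL; the idea matters more than the syntax.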

Operational context

Understanding the business that generates your datasets is paramount. All datasets have their own quirks. Those quirks tell you something about the history of how the data were collected, revealing not only messages about the data generating process itself, but also about the personalities, the biases, and working relationships of people within the business. Learning about the people part of the data will help you untangle messiness, but more importantly, it will help you identify the key individuals who know all the special intricacies of the operations. A couple of coffee conversations with the right people can immensely strengthen the final product while shortening production timelines.

From a statistical point of view, you need to understand the context of the data – which conclusions the data can support, which ones it can’t, and which new data, if made available, would offer the best improvements to future analysis. This issue ties us back to mathematics and statistics since in the end we desire a deeper causal and counterfactual understanding of operations. Predictions are rarely enough. Think data structure and history, not raw input for algorithms.

Machine learning in finance – technical analysis for the 21st century?

I love mathematical finance and financial economics. The relationships between physics and decision sciences are deep. I especially enjoy those moments while reading a paper when I see ideas merging with other mathematical disciplines. In fact, I will be giving a talk at the Physics Department at Carleton University in Ottawa next month on data science as applied in the federal government. In one theme, I will explore the links between decision making and the Feynman-Kac Lemma – a real options approach to irreversible investment.

I recently came across a blog post which extols the virtues of machine learning as applied to stock picking. Here, I am pessimistic about its long-term prospects.

So what’s going on? Back in the 1980s, time series and regression software – not to mention spreadsheets – started springing up all over the place. It suddenly became easy to create candlestick charts, calculate moving average convergence/divergence (MACD) indicators, and locate exotic “patterns”. And while there are funds and people who swear by technical analysis to this day, on the whole it doesn’t offer superior performance. There is no “theory” of asset pricing tied to technical analysis – it’s purely observational.

In asset allocation problems, the question comes down to a theory of asset pricing. It’s an observational fact that some types of assets have a higher expected return than government bonds over the long run. For example, the total US stock market has returned roughly 9% per annum over the long run, a substantial premium over US Treasuries. Some classes of stocks enjoy higher returns than others, too.

Fundamental analysis investors, including value investors, have a theory: they attribute the higher return to business opportunities, superior management, and risk. They also claim that if you’re careful, you can spot useful information before anyone else can, and, that when that information is used with theory, you can enjoy superior performance. The literature is less than sanguine on whether fundamental analysis provides any help. On the whole, most people and funds that employ it underperform the market by at least the fees they charge.

On the other hand, financial economists tell us that fundamental analysis investors are correct up to a point – business opportunities, risk, and management matter in asset valuation – but because the environment is so competitive, it’s very difficult to use that information to spot undervalued cash flows in public markets. In other words, it’s extraordinarily hard to beat a broadly diversified portfolio over the long term.

(The essential idea is that price, p(t), is related to an asset’s payoff, x(t), through a stochastic discount factor, m(t), namely: p(t) = E[m(t)x(t)]. In the simple riskless case, m(t) = 1/R, where R is 1 + the interest rate (e.g., 1.05), but in general m(t) is a random variable. The decomposition of m(t) and its theoretical construction is a fascinating topic. See John Cochrane’s Asset Pricing for a thorough treatment.)
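A back-of-the-envelope check of p = E[m x], with invented numbers: two equally likely states, a payoff that is high in the good state, and a discount factor m that is high in the bad state (when consumption is scarce).

```python
import numpy as np

# Two equally likely states (probabilities are assumptions for illustration)
probs = np.array([0.5, 0.5])

# Riskless case: m = 1/R with R = 1.05, payoff of 1 in both states
R = 1.05
p_riskless = np.sum(probs * (1.0 / R) * np.array([1.0, 1.0]))
print(f"riskless price: {p_riskless:.4f}")  # 1/1.05 ≈ 0.9524

# Risky case: m is a random variable, high when times are bad
m = np.array([0.90, 1.10])   # discount factor: low in good state, high in bad
x = np.array([1.20, 0.80])   # payoff: high in good state, low in bad

p = np.sum(probs * m * x)    # p = E[m x]
print(f"risky price: {p:.4f}")  # 0.5*0.90*1.20 + 0.5*1.10*0.80 = 0.98
```

Because m and x move in opposite directions, E[m x] = E[m]E[x] + Cov(m, x) = 1.00 − 0.02 = 0.98: the risky payoff sells below its expected payoff, which is a risk premium in miniature.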

So where does that leave machine learning? First, some arithmetic: the average actively managed dollar earns the index return, before costs (Sharpe’s “arithmetic of active management”). That is, on average, every actively managed dollar that outperforms does so at the expense of an actively managed dollar that underperforms. Active management is zero-sum relative to the index. So, if machine learning leads to sustained outperformance, the gains must come from other styles of active management, which must also mean that the other managers never learn. We should expect that if some style of active management offers any consistent advantage (corrected for risk), that advantage will disappear as it gets exploited (if it existed at all). People adapt; styles change. There are lots of smart people on Wall Street. In the end, the game is really about identifying exotic beta – those sources of non-diversifiable risk which have very strange payoff structures and thus require extra compensation.
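That arithmetic can be verified in a few lines, with invented weights and returns: however the individual managers do, their dollar-weighted average return is the index return, and their dollar-weighted excess returns sum to zero.

```python
# Hypothetical market: three active managers together hold the whole market
# (dollar weights and gross returns are invented for illustration)
weights = [0.5, 0.3, 0.2]          # share of actively managed dollars
returns = [0.12, 0.06, 0.045]      # each manager's gross return

# The value-weighted average of active dollars IS the index return
index_return = sum(w * r for w, r in zip(weights, returns))
print(f"index return: {index_return:.3%}")  # 8.700%

# Each manager's outperformance is exactly offset by underperformance
excess = [r - index_return for r in returns]
net = sum(w * e for w, e in zip(weights, excess))
print(f"dollar-weighted excess returns sum to: {net:+.6f}")  # +0.000000
```

Add fees and trading costs, and the average active dollar must underperform the index by exactly those costs.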

Machine learning on its own doesn’t offer a theory – the 207,684th regression coefficient in a CNN doesn’t have a meaning. The methods simply try to “learn” from the data. In that sense, applied to the stock market, machine learning seems much like technical analysis of the 1980s – patterns will be found even when there are no patterns to find. Whatever its merits, to be useful in finance, machine learning needs to connect back to some theory of asset pricing, helping to answer the question of why some classes of assets enjoy higher return than others. (New ways of finding exotic beta? Could be!) Financial machine learning is not equal to machine learning algorithms plus financial data – we need a theory.

In some circumstances theory doesn’t matter at all when it comes to making predictions. I don’t need a “theory” of cat videos to make use of machine learning for finding cats on YouTube. But when the situation is a repeated game with intelligent players who learn from each other, immersed in a super-competitive, highly remunerative environment, playing without a theory of the game usually doesn’t end well.

Climate change: Evidence based decision making with economics

Climate change is in the news every day now. The CBC has a new series on climate change, and news sources from around the world constantly remind us about climate change issues. As we might expect, the political rhetoric has become intense.

In my previous blog post, I showed how using even relatively crude statistical models of local daily mean temperature changes can easily extract a warming signal. But to make progress, we must understand that climate change has two parts, both of which require separate but related scientific reasoning:

1) What is the level of climate change and how are humans contributing to the problem?

2) Given the scientific evidence for climate change and human contributions, what is the best course of action that humans should take?

The answers to these two questions get muddled in the news and in political discussions. The first question has answers rooted in atmospheric science, but the second belongs to the realm of economics. Given all the problems that humanity faces, from malaria infections to poor air quality to habitat destruction, climate change is just one among many issues competing for scarce resources. The second question is much harder to answer, and I won’t offer an opinion. Instead, I would like to leave you with a question that might help center the conversation about policy and how we should act. I leave it for you to research and decide.

If humanity did nothing about climate change and the upper end of the climate warming forecasts came to pass – 6 degrees Celsius of warming by the year 2100 – how much smaller would the global economy be in 2100 relative to a world with no climate change at all? In other words, how does climate change affect the graph below going forward?
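To see why compounding dominates the answer, here is pure arithmetic with hypothetical numbers (the 3% growth rate and the damage fractions are illustrative assumptions, not forecasts):

```python
import math

# Hypothetical: world GDP grows at 3% per year for the 80 years to 2100,
# and climate damages shave a level percentage off 2100 output.
g = 0.03       # assumed annual world GDP growth (illustrative)
years = 80     # roughly 2020 to 2100

growth_factor = (1 + g) ** years
print(f"GDP multiple by 2100 with no climate change: {growth_factor:.1f}x")

for damage in (0.05, 0.10, 0.25):  # hypothetical fractions of 2100 GDP lost
    # A level loss of d is equivalent to forgoing -log(1-d)/log(1+g)
    # years of growth at rate g.
    years_lost = -math.log(1 - damage) / math.log(1 + g)
    print(f"{damage:.0%} damage ≈ {years_lost:.1f} years of growth forgone")
```

Under these made-up numbers, even a large level loss in 2100 amounts to a delay of a few years of growth; the interesting policy debate is over what the damage fraction actually is, and whether warming hits the growth rate g itself rather than the level.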

A cause for celebration: World GDP growth since 1960.