tl;dr: data is the new oil, right? College classes and books, from machine learning to marketing, tell us that everything can be learnt from it.
My personal experience drinking at the fountain of miracles has been a tad different: I often end up with more questions than answers when I look at charts, for data often contains variations that are meaningless without context.
We often joke that everyone has become an epidemiologist or a doctor. Yet there is some truth in that: in terms of public awareness, mainstream media coverage of COVID has indeed helped a great deal. We were talking in terms that sounded like “112 deaths per 100 000 persons over 7 days” or “1.3 reproduction factor”. In other words, we were presented with figures that took geographical differences, seasonality, noise and context into account, and that were relative to a population or to a previous measure, so as to capture exponential evolution and enable meaningful comparisons across time and locations.
It is a wonderful thing compared to, for example, how we discuss social issues: “In X, Y% of people live on less than Z€ per day”, while nobody ever mentions what Z€ can buy in X, or what Y was two years ago.
I will stick to this shared experience as a reference example, but I have very similar observations about most business metrics I see.
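To make the arithmetic behind those COVID-style figures concrete, here is a tiny Python sketch, with made-up daily counts and population, of how a “deaths per 100 000 persons over 7 days” figure is typically derived from raw counts:

```python
# Made-up inputs: raw daily death counts for the last 7 days and a population size.
daily_deaths = [12, 9, 15, 11, 14, 10, 13]
population = 1_200_000

# Normalize: sum over the window, then express per 100 000 inhabitants.
rate_per_100k_7d = sum(daily_deaths) / population * 100_000
print(f"{rate_per_100k_7d:.1f} deaths per 100 000 persons over 7 days")  # 7.0
```

The raw counts alone would tell us very little; dividing by the population and fixing the time window is what makes the figure comparable across regions and weeks.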
Know your question
First, let’s observe that we always have an implicit question as soon as we look into any kind of data. It might be as simple as “is the situation getting any better?”.
All else being equal
Economists and other scientists often introduce their assumptions by saying “all else being equal”, usually meaning that they changed only a single causal variable and measured a target variable. Here, the causal variable could be a policy, and the question: “does such a policy reduce the reproduction rate?”.
Real life is slow: COVID has an incubation time, so the answer is not instant, and sometimes a policy’s implementation also takes time. Is the effect we are observing linked to the lever we just pushed?
Real life is messy: across countries and people, the willingness to respect a law or properly wear a mask may go from being the norm to being an unreasonable expectation. Will what we learn have universal value?
Real life is complex: key decisions often work in combination, and it can be hard to isolate a single one. Here, the idea of having a single causal variable is just wrong. Can we really say that we have isolated a single effect?
In other words, things are rarely constant.
Crises are times when good and informed decisions are most required. Unfortunately, they are also often at the worst end of the information quality spectrum.
Figures only look neutral
COVID was a worldwide topic that saw a wide range of responses across the globe, but also a wide range of measurement practices:
- Test availability
- Price of said tests
- Presence of testing campaigns
- Rules for attributing a death to the disease
It means that one might not even be aware that they are comparing apples with oranges, and unless they are aiming for a smoothie, it is not going to end well.
Trade-off is key
We want to act toward the best solution. The best solution is often whatever leads to a desired long-term situation, yet we may need to act and react quickly in a short-term environment. In this search, one possible trade-off is to use a proxy: a metric that is correlated with our objective but is not exactly our objective.
For example, the goal could be to reduce the number of COVID deaths, yet it is easier to assess the current situation through the number of hospitalizations, which itself can be anticipated through the number of cases.
To make things harder, the validity of proxies also changes over time. The arrival of vaccines was supposed to shut down the disease completely, then to stop its propagation; it ended up reducing severity. Suddenly the number of cases and the number of deaths a few weeks later were not that correlated anymore, or not with the same factor.
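One possible way to notice this kind of drift is to re-check the proxy relationship on recent data instead of trusting it once and for all. The sketch below uses synthetic numbers standing in for real case and death series, and the 21-day lag, the split point and the change of factor are all assumptions made for illustration: it measures how well cases predict deaths three weeks later on two sub-periods, and what the implied deaths-per-case factor is.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for real series: deaths loosely follow cases with a
# ~3-week lag, and the coupling factor drops in the second half of the period
# (think: vaccines reducing severity).
days = pd.date_range("2021-01-01", periods=360, freq="D")
cases = 1000 + 800 * np.sin(np.linspace(0, 6 * np.pi, 360)) + rng.normal(0, 50, 360)
factor = np.where(np.arange(360) < 180, 0.02, 0.005)
deaths = factor * np.roll(cases, 21) + rng.normal(0, 2, 360)
df = pd.DataFrame({"cases": cases, "deaths": deaths}, index=days)

# Re-estimate the proxy relationship on two sub-periods: does the correlation
# still hold, and is the implied deaths-per-case factor still the same?
for label, chunk in [("first half", df.iloc[:180]), ("second half", df.iloc[180:])]:
    aligned = pd.DataFrame({
        "cases": chunk["cases"],
        "deaths_3w_later": chunk["deaths"].shift(-21),
    }).dropna()
    corr = aligned["cases"].corr(aligned["deaths_3w_later"])
    ratio = (aligned["deaths_3w_later"] / aligned["cases"]).mean()
    print(f"{label}: lagged correlation {corr:.2f}, deaths per case {ratio:.4f}")
```

On this synthetic example the correlation survives better than the factor does, which is exactly the kind of silent change that breaks a proxy without anyone noticing.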
If you deal with lagging data, I will give you a pro-tip here: try to maintain a timed log of the key actions that are supposed to impact your metrics. I typically used a dedicated Google Calendar for that. It makes life much sweeter when you are wondering what could explain that sudden change in your organic traffic three months ago.
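To show what I mean, here is a hypothetical sketch of how such a log can be laid over a metric; the dates, numbers and action labels are invented, and in real life the actions would come from that calendar export:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented daily metric (organic sessions) and a hand-maintained log of key
# actions; in practice the log could be an export of a dedicated calendar.
traffic = pd.Series(
    [320, 340, 330, 360, 410, 600, 640, 630, 650, 700],
    index=pd.date_range("2023-01-01", periods=10, freq="D"),
)
actions = {
    pd.Timestamp("2023-01-05"): "new landing page",
    pd.Timestamp("2023-01-08"): "newsletter sent",
}

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(traffic.index, traffic.values, label="organic sessions")
for date, label in actions.items():
    # One vertical marker per logged action, so a jump in the curve can be
    # read next to what was actually done at the time.
    ax.axvline(date, color="grey", linestyle="--", alpha=0.7)
    ax.text(date, traffic.max(), label, rotation=90, fontsize=8, va="top")
ax.legend()
fig.tight_layout()
plt.show()
```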
Volumes
Business cases often involve an issue that is less pressing in COVID discussions: volume. The numbers are simply not big enough to reliably assess anything. “Our conversion rate changed from 3% to 4%”: great if you had 10 000 customers; if you had 30, it probably does not mean anything.
Unfortunately, most business people are not aware of what statistical tests are.
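For the conversion-rate example above, even a rough significance check already separates the two situations. Here is a minimal sketch using Fisher’s exact test from scipy; the helper function and the assumption that the “before” and “after” customers form two independent groups are illustrative choices, not the only reasonable setup:

```python
from scipy.stats import fisher_exact

def conversion_p_value(n_before, rate_before, n_after, rate_after):
    """p-value of Fisher's exact test on a 2x2 converted / not-converted table."""
    conv_before = round(n_before * rate_before)
    conv_after = round(n_after * rate_after)
    table = [
        [conv_before, n_before - conv_before],
        [conv_after, n_after - conv_after],
    ]
    _, p_value = fisher_exact(table)
    return p_value

# The same 3% -> 4% move, at two very different volumes.
print(conversion_p_value(10_000, 0.03, 10_000, 0.04))  # ~1e-4: very unlikely to be pure noise
print(conversion_p_value(30, 0.03, 30, 0.04))          # 1.0: with 30 customers, 3% and 4%
                                                       # both round to a single conversion
```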
Conclusion
I will not blabber that data has no added value, but numbers often look honest and straightforward. They simply are not.
I have only written about interpretation issues here, which excludes psychological biases, conscious or not: the analyst often “wishes” to read a specific result, or at least anything significant.
Using data to guide decisions requires much more than looking at a single chart, and this holds true even if you are tracking the right things. Yes, data needs to be cleaned and metrics need to be well thought out, but that is not all. Context also needs to be considered: few are the databases that contain data that can be considered homogeneous over time or across segments.
If we get back to the “data is the new oil” metaphor: oil requires a lot of refining before being useful, and some oil is so hard to extract that it makes little economic sense to drill for it.