Big data and the traps of spurious correlations

In Spain, an estimated 68% of industrial companies are considered "digital novices" or "digital followers": they have not yet fully adopted digitalization and need to do so to improve their competitiveness. Analyzing large data sets carries the risk of finding spurious correlations, where variables appear related without any real connection, or where a third variable is influencing both. It is therefore essential to interpret data with caution, to remember that correlation does not imply causation, and to understand how graphs and visualizations are constructed in order to avoid erroneous conclusions.

The pace of today's environment can lead us into statistical traps when interpreting data. Faced with an enormous volume of information, we need an attentive, informed and unbiased eye to read it well.

Estimates indicate that 68% of Spanish industrial companies are “digital novices” or “digital followers”. This means that some 130,000 companies, most of them with fewer than 10 employees, have yet to fully commit to digitalizing their businesses and improving their competitiveness.

In a scenario where data is generated by the millions, whoever interprets that information best has a clear competitive advantage, because they are better placed to make business decisions. If Big Data is the door to better results, we must also focus on how to manage it.

One very important point here is the danger of discovering endless correlations in those huge data sets. Why is this a risk? Because, given a large enough amount of data, it is possible to find variables that correlate even when they should not. In very large databases, arbitrary correlations always appear, not because of the nature of the data but merely because of its volume.
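The effect of sheer volume can be illustrated with a minimal sketch. It assumes purely random, unrelated series of 10 observations each (for example, ten yearly values) and counts how many pairs reach a Pearson correlation of at least 0.8 in absolute value. The function names, the threshold and the series length are illustrative assumptions, not part of the original argument.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_strong_pairs(n_vars, n_obs=10, threshold=0.8, seed=0):
    """Count pairs of independent random series with |r| >= threshold.

    Every series is pure noise, so any "strong" correlation found here
    is spurious by construction.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1
        for x, y in combinations(series, 2)
        if abs(pearson(x, y)) >= threshold
    )


if __name__ == "__main__":
    for n_vars in (20, 200):
        n_pairs = n_vars * (n_vars - 1) // 2
        strong = count_strong_pairs(n_vars)
        print(f"{n_vars} variables, {n_pairs} pairs, {strong} with |r| >= 0.8")
```

The number of pairs grows roughly with the square of the number of variables, so the count of accidental "strong" correlations grows with it even though no variable has anything to do with any other.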

First, the definitions: when we talk about correlation, we mean that two things vary together. However, while correlation implies association, it does not necessarily imply causation, because two variables can be related without one causing the other. Conversely, causation implies association, but not necessarily correlation, since a causal relationship need not be linear.
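The last point, causation without correlation, can be shown with a small worked example. In this sketch, y is fully determined by x (y = x²), yet the Pearson coefficient, which only measures linear association, is zero because x is symmetric around zero. The data and function name are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# x takes the integer values -5..5; y depends entirely on x.
x = [float(v) for v in range(-5, 6)]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(f"Pearson r between x and x squared: {r:.3f}")
```

A coefficient near zero therefore rules out a linear relationship, not a relationship of any kind.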

So-called “spurious correlations” relate two variables in a way that may make sense mathematically but that, seen objectively (from a broader or more specific contextual view), makes no sense at all, or that is actually driven by a third variable the analysis does not consider.

The examples used to illustrate these errors of interpretation are well known, even famous. They range from “storks bring babies” (based on a curious phenomenon in medieval northern Europe, where couples married on the summer solstice and the storks returned from their migration to Africa the following spring, exactly nine months later) to the relationship between the monthly number of drownings at a beach and the amount of ice cream sold in the same period. Is ice cream the cause of more drownings? No, but people tend to eat more ice cream on hot days, when they are also more likely to go swimming, so temperature is a third variable behind both.
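The ice cream example can be sketched with synthetic data. Everything below is an assumption made for illustration: the temperatures, the coefficients linking temperature to ice cream sales and to a drowning-risk index, and the noise levels are invented, not real measurements. The sketch compares the raw correlation between the two variables with their partial correlation once temperature is held constant.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_beach(n_days=2000, seed=1):
    """Synthetic data: temperature drives both series; they never interact."""
    rng = random.Random(seed)
    temperature = [rng.gauss(25, 5) for _ in range(n_days)]
    # Invented coefficients: both variables depend only on temperature + noise.
    ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
    drownings = [0.5 * t + rng.gauss(0, 1) for t in temperature]
    return temperature, ice_cream, drownings


if __name__ == "__main__":
    temperature, ice_cream, drownings = simulate_beach()
    raw = pearson(ice_cream, drownings)
    controlled = partial_corr(ice_cream, drownings, temperature)
    print(f"Raw correlation, ice cream vs drownings: {raw:.2f}")
    print(f"Partial correlation, given temperature:  {controlled:.2f}")
```

In this simulation the raw correlation is strong, while the partial correlation falls close to zero, which is the signature of a third variable explaining the apparent link.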

It is important to note that there may be a causal relationship between two variables, but the correlation does not indicate the direction of causality. A clear example is the link between an active lifestyle and better cognitive functioning in older people, often presented as if activity were a “guarantee” of cognitive health. There is evidence that the causal direction may be the opposite: higher cognitive functioning may result in a more active lifestyle.

All of these caveats about how to interpret data become even more evident when we talk about visualization, where we need to understand how the axes are constructed and which questions guide the reading of a graph. The goal is to avoid falling for erroneous correlations, and to make this a habit when applying data science strategies in our organizations.

A very useful resource here is the website by Tyler Vigen, a Harvard criminology student. Vigen created a program whose algorithms detect correlations between unrelated data sets, with results that range from the funny to the ridiculous. It finds relationships between US R&D spending and the number of suicides by hanging, strangulation or suffocation over a decade; between per capita cheese consumption and the number of people who died tangled in their bedsheets; and between the average age of Miss America and deaths caused by steam, hot vapors or hot objects.

Imagine falling into this type of deception regularly when analyzing business data: what would the results be? It sounds absurd when taken to the extreme that Vigen proposes, but his exercise shows that stumbling into the traps that statistics, and Big Data even more so, can set for us is much more common than we think.

The pace of our context can push us to seek increasingly automatic results. What we need, on the contrary, is to pause and interpret the data more carefully, from a broad and informed perspective.

Julio Cesar Blanco – July 27, 2022
