Big Data and the Pitfalls of Spurious Correlations

In Spain, an estimated 68% of industrial companies are "digital novices" or "digital followers": they have not yet fully digitalized their businesses and need to do so to stay competitive. Analyzing large datasets carries the risk of finding spurious correlations, in which variables appear related for no meaningful reason or because a third variable is influencing both. It is therefore crucial to interpret data with caution, to remember that correlation does not imply causation, and to understand how graphs and visualizations are constructed in order to avoid incorrect conclusions.

The pace of today's business environment can lead us into statistical traps when interpreting data. Given the enormous volume of information, reading it well requires a careful, informed, and unbiased perspective.

Estimates indicate that 68% of Spanish industrial companies are “digital novices” or “digital followers.” This means that about 130,000 companies—most with fewer than 10 employees—still need to fully engage in digitalizing their businesses and improving their competitiveness.

In a scenario where millions and millions of data points are generated, those who can better interpret this information have a clear competitive advantage, as they will be in a better position to make business decisions. If Big Data is the key to achieving better results, focus must also be placed on how to manage it.

An important point here is the danger of discovering countless correlations within these vast datasets. Why is this a risk? Because with enough data it is possible to find variables that correlate even though nothing connects them. In very large databases, arbitrary correlations are bound to appear, not because of the nature of the data but merely because of its volume: the more variables there are, the more pairs can be compared, and some of those pairs will line up by chance alone.
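The effect of volume can be reproduced with purely random numbers. The following is a minimal sketch, not a method taken from the article: the function names, the choice of 10 observations per variable, and the variable counts are all illustrative assumptions. It generates independent random series, so any correlation it finds is spurious by construction, and reports the strongest one.

```python
import itertools
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_observations, seed=0):
    """Largest |r| found among all pairs of independent random series.

    Hypothetical helper for illustration: every series is drawn
    independently, so no pair has any real relationship.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_observations)]
              for _ in range(n_variables)]
    return max(abs(pearson(a, b))
               for a, b in itertools.combinations(series, 2))


# More variables means more pairs to compare, and a stronger "best" match.
for n_variables in (5, 50, 200):
    r = strongest_chance_correlation(n_variables, n_observations=10)
    print(f"{n_variables:>3} unrelated variables -> strongest |r| = {r:.2f}")
```

With 200 variables there are 19,900 pairs to compare, so a very strong correlation between two unrelated series is almost guaranteed to turn up. Searching a large dataset for its "best" correlation behaves the same way.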

First, let’s define the terms. Correlation means that two things vary together. Correlation implies association, but it does not necessarily imply causation: two variables may be related without one causing the other. Conversely, causation implies some form of association, but not necessarily a correlation in the usual linear sense, since a strongly nonlinear causal relationship can produce a correlation coefficient close to zero.
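The last point, causation without correlation, is the least intuitive, so a small worked case may help. This is an illustrative sketch with made-up numbers: y is completely determined by x through a hypothetical rule (y equals x squared), yet the Pearson coefficient over a symmetric range of x is zero because the rising and falling halves cancel out.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical cause-and-effect rule: y depends on x and on nothing else.
xs = list(range(-5, 6))      # -5, -4, ..., 5
ys = [x ** 2 for x in xs]

r_full = pearson(xs, ys)                    # whole symmetric range
r_positive_half = pearson(xs[5:], ys[5:])   # only x >= 0

print(f"r over the full range:     {r_full:.2f}")
print(f"r over the x >= 0 portion: {r_positive_half:.2f}")
```

Restricting the data to non-negative x gives a strong positive coefficient, which shows that the dependence was there all along. A coefficient near zero rules out a linear relationship, not a relationship.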

Thus, so-called “spurious correlations” link two variables whose relationship holds up mathematically but makes no sense when viewed in a broader or more specific context, or that are both influenced by a third variable not considered in the analysis.

Examples of these interpretation errors are well known, even famous. One is “storks bring babies,” based on a curious phenomenon in medieval Northern Europe: couples married at the summer solstice, and storks returned from their migration to Africa the following spring, exactly nine months later. Another is the relationship between the monthly number of drownings at a beach and the amount of ice cream sold in the same period. Does ice cream cause drownings? No. People eat more ice cream on hot days, which is also when they are more likely to go swimming, so temperature is a third variable driving both figures.
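The ice cream case can be sketched with synthetic data. Everything below is an assumption made for illustration: the numbers are invented, the two series are built so that each depends only on temperature, and the check used is the standard first-order partial correlation, which is one common way to account for a third variable and not something the article itself prescribes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(42)

# Synthetic data: temperature drives both series; neither affects the other.
temperature = [rng.uniform(10, 35) for _ in range(500)]
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

raw = pearson(ice_cream, drownings)
controlled = partial_correlation(ice_cream, drownings, temperature)

print(f"ice cream vs. drownings:              r = {raw:.2f}")
print(f"same, controlling for temperature:    r = {controlled:.2f}")
```

The raw coefficient is strongly positive, while the coefficient that controls for temperature falls to roughly zero. In real data the third variable is rarely this obvious, which is why domain knowledge matters as much as the calculation.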

It is important to note that even when two variables are causally related, correlation does not indicate the direction of causality. A clear example is the link often drawn between an active lifestyle and better cognitive functioning in older adults, with activity presented as a “guarantee” of cognition. Evidence suggests that the causal direction may be the opposite: higher cognitive functioning may result in a more active lifestyle.

All these warnings become even more relevant when data is visualized. It is necessary to understand how the axes of a graph are constructed and which questions guide its reading. The goal is to avoid drawing erroneous correlations and to make this a habit when applying data science strategies in our organizations.

A very useful resource in this regard is the website created by Tyler Vigen, a Harvard criminology student. Vigen wrote a program that detects correlations between unrelated datasets, with results ranging from the amusing to the ridiculous: U.S. R&D spending and the number of suicides by hanging, strangulation, or suffocation over a decade; per capita cheese consumption and the number of people who died tangled in their bedsheets; and the age of Miss America and murders by steam, hot vapours, and hot objects.

Imagine falling into such deceptions regularly when analyzing business data: what would the results be? Taken to Vigen’s extreme it sounds absurd, but the exercise shows that stumbling into these traps is much more common than we think, especially with statistics and even more so with Big Data.

The pace of the environment pushes us to seek ever more automatic results. On the contrary, we need to pause and interpret data more thoroughly, from a broad and informed perspective.

By Julio Cesar Blanco – July 27, 2022
