The problem with Exploratory Data Analysis
The surprising way philosophy of science can help you become a better data scientist.
More often than not, data scientists start a project because business stakeholders told them to “look at the data”. And by that, they often mean “let’s skip the part where I need to figure out what to do with the data and get directly to the part where you provide me with actionable business insights”. Because that’s what data science, machine learning, and all these shiny buzzwords are about, right?
Even to most data scientists, “looking at the data” sounds like a sensible first step, so they duly engage in a so-called Exploratory Data Analysis (EDA). Yet I argue that EDA is often the first step toward the failure of the entire project. Am I being hyperbolic? Allow me to articulate why this approach, when misunderstood, can lead to catastrophe.
Exploratory Data Analysis is fundamentally ill-suited to meeting the expectations of stakeholders, that is, providing actionable insights.
John Tukey was a strong promoter, if not the inventor, of EDA, which he saw as an encouragement to explore the data and formulate hypotheses that could lead to new data collection and experiments. If you use EDA for this purpose, great! But that’s usually not the case, and we will come back to this.
One must remember that Tukey lived in a world where computational resources were extremely limited and he advocated for quite basic (by today’s standards) methods (e.g., the five-number summary). Like many other things (social media and the smartphone come to mind), a perfectly sensible idea turned into a disaster when others took it to an extreme.
Proper scientific inquiry never starts with data: it ends with data. Data must have the last word, but certainly not the first; instead, models and theories must drive the process. The order matters.
The multiple comparisons problem
But, really, what are the issues at play here? First, one must be on the lookout for the so-called multiple comparisons problem (MCP). Whenever you perform more than one statistical inference or test more than one hypothesis on a dataset, the probability of making erroneous discoveries (aka the family-wise error rate) increases. There are controlling procedures to account for this effect, but they reduce statistical power and require as much, if not more, care in their execution. Many data scientists are not even aware of this problem, let alone try to do something about it.
The consequence of the MCP is that one cannot just “search for patterns” in a large dataset. Chances are, you will find some that don’t exist! For instance, DNA microarrays have enabled the study of millions of genetic markers, e.g., for genetic association studies. However, such searches, conducted with no prior hypothesis about an effect, have led to replication failures in the literature [1].
The consequence of the multiple comparisons problem is that one cannot just “search for patterns” in a large dataset. Chances are, you will find some that don’t exist!
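To make this concrete, here is a minimal, purely illustrative simulation (synthetic noise, not a real dataset, using NumPy, SciPy, and statsmodels): we test 50 noise “features” against a noise target, count the spurious “discoveries” an uncorrected search produces, and then apply Holm’s procedure to keep the family-wise error rate in check.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_obs, n_features, alpha = 200, 50, 0.05

# Pure noise: by construction, there is no real pattern to find
target = rng.normal(size=n_obs)
features = rng.normal(size=(n_obs, n_features))

# One correlation test per feature, i.e. 50 comparisons on the same dataset
p_values = np.array([
    stats.pearsonr(features[:, j], target)[1] for j in range(n_features)
])

# Uncorrected search: a few features typically come out "significant" by chance
print("Uncorrected 'discoveries':", int((p_values < alpha).sum()))

# Holm's step-down procedure controls the family-wise error rate
reject, _, _, _ = multipletests(p_values, alpha=alpha, method="holm")
print("Discoveries after Holm correction:", int(reject.sum()))
```

With 50 independent tests at a 5% significance level, the chance of at least one false positive is roughly 1 − 0.95^50 ≈ 92%, which is exactly why the uncorrected search almost always “finds” something.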
But there is more to this issue than meets the junior data scientist’s eye. The fundamental flaw of EDA is that it seldom fulfills the (often untold) expectation of stakeholders: they don’t pay top-of-the-range salaries for nice-sounding stories about their business. They need actionable insights they can use to drive decision-making. EDA is fundamentally unsuited to this requirement. To see why, we must turn to a discipline rarely discussed in data science bootcamps: philosophy of science.
Make inductivism great again
EDA and all kinds of “pattern mining” strategies are modern incarnations of the old (and still commonplace) idea of inductivism, which postulates that scientific theories (which, to recall, embody scientific knowledge) can be derived and established from facts. On the surface, this may sound like a perfectly reasonable claim, and many scientists still think this is what their job is about.
Karl Popper, for one, did not agree! Arguably the most influential philosopher of science of the past century, he realized the deep flaws of inductivism, which he rejected in favor of empirical falsification, a much more potent and reliable method of scientific inquiry.
Popper realized that reality and facts can be deceptive, and no amount of correlation or experimental agreement can ever prove a theory. However, he argued, a single counterexample can be decisive. One should therefore seek to design experiments whose outcomes could be used to falsify a theory.
The fundamental flaw of inductivism and, by extension, EDA is that these methods lead to insights that agree by construction with facts and data. They never ask the question: do my data contradict my insights? In other words, EDA and inductivism remain in the realm of observational cognition, in the language of Judea Pearl [2]. To make matters worse, they provide a false sense of certainty reminiscent of the dynamics of cognitive bubbles in social networks: in the absence of contradictory evidence, incorrect conclusions become “truths”, and the facts remain undiscovered.
The fundamental flaw of inductivism and, by extension, EDA is that these methods lead to insights that agree by construction with facts and data. They never ask the question: do my data contradict my insights?
In contrast, empirical falsification relies on a pre-existing model of the world (not an ML model, mind you). This model has no claim to correctness at first, but it encapsulates whatever one thinks they know about the system under study. Then, based on this model, one makes predictions about how the system should behave under various circumstances. Such predictions must be falsifiable, that is, one must be able to perform an experiment whose outcome may contradict the predictions of the model.
In data science terms, one must first seek to build up an understanding of the system that generated the data we are asked to analyze. This step never happens in a vacuum and invariably requires the help of subject matter experts. Then, one may formulate one or more testable hypotheses. Only then do data come into play, often by performing hypothesis testing. Again, order matters!
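As a sketch of what this order looks like in practice (the cohort sizes and churn counts below are made up for illustration, and I am assuming a simple A/B-style rollout tested with statsmodels’ two-proportion z-test), the hypothesis is written down before the data are touched, and a single pre-registered test is run:

```python
from statsmodels.stats.proportion import proportions_ztest

# Pre-registered hypothesis (formulated with subject matter experts, before
# looking at the data): "raising the fee increases the churn rate".
# Fictional counts: [fee-increase cohort, control cohort]
churned = [165, 120]
cohort_sizes = [1000, 1000]

# One-sided two-proportion z-test; alternative="larger" encodes the claim
# that churn in the fee-increase cohort exceeds churn in the control cohort.
z_stat, p_value = proportions_ztest(
    count=churned, nobs=cohort_sizes, alternative="larger"
)
print(f"z = {z_stat:.2f}, one-sided p = {p_value:.4f}")
```

The key point is not the particular test but the sequence: one hypothesis, stated in advance, confronted with data that could have contradicted it.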
Better names for the kind of hypotheses data scientists should come up with are predictions or counterfactuals, that is, statements of the form “If I do X, then I will obtain Y” or “Had I done X, I would have obtained Y”. For instance, a good prediction could be “If I increase subscription fees by $5, the churn rate will increase by 10%”. Incidentally, this closely matches what business stakeholders expect from data science: an actionable insight.
The bad news is, EDA cannot produce counterfactuals, only observational insights, such as “when we increased our subscription fee by $5, the churn rate increased by 10%”. Notice the past tense: this statement is an observation, not a prediction. Does observing such a pattern in the data, even if statistical testing rules out a correlation arising by chance, allow turning it into a prediction? Of course not! Maybe the churn increase wasn’t caused by the subscription fee change at all, but by an outage that occurred in the same week (i.e., confounding). And even if the 10% churn increase was indeed caused by this $5 increase, another increase within a short time span might lead to 25% churn this time. Absent a causal model of churn dynamics as a function of subscription fees, one can only guess.
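A tiny simulation (synthetic data, invented numbers) makes the confounding point concrete: below, churn is driven entirely by an outage that happens to coincide with the fee rollout, yet a naive, EDA-style comparison still “shows” that the fee raises churn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000

# The real driver of churn is a service outage...
outage = rng.binomial(1, 0.5, n)
# ...and the fee increase happens to be rolled out mostly during outage weeks
fee_increase = rng.binomial(1, 0.2 + 0.6 * outage)
# True causal effect of the fee on churn: exactly zero
churn = rng.binomial(1, 0.05 + 0.10 * outage)

df = pd.DataFrame({"fee_increase": fee_increase, "outage": outage, "churn": churn})

# Naive, EDA-style comparison: the fee appears to raise churn by several points
print(df.groupby("fee_increase")["churn"].mean())

# Stratify by the confounder: within each outage stratum, the fee does nothing
print(df.groupby(["outage", "fee_increase"])["churn"].mean())
```

The observational comparison is not wrong as a description of the past; it simply cannot be read as a prediction of what another fee change would do.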
You might object that it is impossible to answer these questions, at least not without performing further experiments, and you would be right. This is the crux of the matter, really: most data scientists are up against an insurmountable obstacle. Asked if they can derive actionable insights from a dataset, they should reply:
No, we cannot derive actionable insights from purely observational data. But we can check whether hypotheses about your business are consistent with the data collected so far. Alternatively, we can suggest experiments whose outcomes may provide such insights.
Few stakeholders are willing to hear it. Again, top-of-the-range salaries and all that. “Forget Popper, multiple comparisons, and family-wise error rates, just give us something, quick.”
Did you learn something today? Did you spot a typo or a conceptual error? Positive or negative, leave me a comment below.
References
[1] Ioannidis, John P. A., et al. “Replication validity of genetic association studies.” Nature Genetics 29.3 (2001): 306–309.
[2] Pearl, Judea. Causality. Cambridge University Press, 2009.