Missing data – the statistician’s perspective

Tom Breur

26 April 2023

Missing data are an unavoidable fact of life – and an inconvenient one at that for data scientists. Sometimes data are “missing” because there is no value expected; these data elements are legitimately absent, and a label like N/A may be applied. But the tricky ones are values that are missing when there should be something. That latter category is the subject of this blog post.

Generally, we recognize three different kinds of patterns in missing data. The first is called “missing completely at random” (MCAR), and this corresponds with the layperson’s notion of randomness: every data element has the exact same chance (hence: random) of being absent when an actual value was to be expected. The second pattern is called “missing at random” (MAR), and somewhat counterintuitively, in this case the pattern of missingness is not completely random, but instead precisely and strictly defined in statistical terms (which I will clarify shortly). The third pattern is “not missing at random” (NMAR), and here, as the name implies, there is a systematic non-random pattern to the missing values.

Another way to think of these patterns in missing data is as a mechanism that gives rise to the distributions in the data we observe, as well as the pattern in missingness. Confusingly, the statistical community refers to this as the “data model”, a term that database managers use as well, albeit with a very different meaning. Much of my career as a data scientist (perforce a multi-disciplinary profession) has been spent traversing domains of expertise and decoding all these synonyms and homonyms. It never ceases to amaze me how little standardization there is in terminology. And when the same term (homonym) gets used differently across groups of professionals, it gets really confusing!

The strongest assumption to make is that the data are MCAR (“missing completely at random”). In this case, the odds of a value Y being “missing” depend neither on the value of Y itself, nor on any other variable in the dataset. As I mentioned in the opening, this pattern of missingness corresponds with laypeople’s notion of “truly” random. One way to test the MCAR assumption is to recode Y into a Boolean indicator with 0 for “missing” and 1 otherwise, and then regress that indicator on all other variables in the dataset. If any of the coefficients are significant, then the pattern in Y is not MCAR. Unfortunately, you cannot test whether “missing” depends on the value of Y itself (e.g., whether “income” is missing more often for higher values) because that would require knowledge of the missing data elements. Under this (arguably rare) scenario, there is nothing systematic about what makes certain data elements missing and others not.
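To make this diagnostic concrete, here is a minimal numpy-only sketch on simulated data. The variable names (`age`, `income`) and the sigmoid missingness mechanism are invented for illustration; a fuller version would run a logistic regression of the missingness indicator on all other columns, but a simple correlation check against each variable catches the same linear signal:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Simulate two variables where age influences whether income is observed,
# so the missingness in income is NOT completely at random (not MCAR).
age = rng.normal(40, 10, n)
income = 20_000 + 800 * age + rng.normal(0, 5_000, n)
p_missing = 1 / (1 + np.exp(-(age - 40) / 5))  # older -> more likely missing
is_missing = rng.random(n) < p_missing          # Boolean missingness indicator

# Diagnostic: correlate the missingness indicator with another variable.
# Under MCAR this correlation should be indistinguishable from zero.
r = np.corrcoef(is_missing.astype(float), age)[0, 1]
print(f"corr(missing, age) = {r:.2f}")  # clearly non-zero here -> MCAR is suspect
```

In a real analysis you would repeat this for every observed variable (or fit one logistic regression on all of them) and treat any significant association as evidence against MCAR.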

MAR (“missing at random”) is a weaker assumption: “missing” values in Y may now be associated with predictor variables X1 through Xn, but may not depend on the value of Y itself. For example, suppose you asked about “annual income”, and people with higher incomes were more likely not to answer that question. Then “missing” is obviously correlated with the value of income: the odds that a value is missing are higher for affluent respondents. That would be a violation of MAR. Strictly speaking, the MAR assumption is relaxed somewhat further: the pattern of “missing” in Y may not depend on Y after controlling for X1 through Xn. Although this assumption is (much) more relaxed, there is unfortunately no way to test it in the data, since that would require knowledge of the missing data points in Y. In our example, we don’t know what income our most affluent respondents had because they declined to answer.
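The “no dependence on Y after controlling for X” idea can be illustrated with a small simulation. The setup below is hypothetical: income depends on years of education, and whether income is reported depends only on education (an observed X), which makes the missingness MAR rather than MCAR:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: income depends on education; whether the income
# question is answered depends ONLY on education, not on income itself.
education = rng.normal(14, 2, n)
income = 5_000 * education + rng.normal(0, 10_000, n)
p_missing = 1 / (1 + np.exp(-(education - 14)))  # more educated -> more refusals
observed = rng.random(n) >= p_missing

# Marginally, missingness IS correlated with income (via education)...
print(income[observed].mean(), income[~observed].mean())

# ...but within a narrow education band the two groups look alike:
# missingness no longer depends on income once education is controlled for.
band = np.abs(education - 14) < 0.25
print(income[observed & band].mean(), income[~observed & band].mean())
```

Note that if we instead made `p_missing` depend on `income` directly, no amount of conditioning on `education` would remove the dependence, and the data would be NMAR.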

NMAR (“not missing at random”) is the weakest position to be in: it is what remains when the assumptions for MAR are violated. In real-world datasets, claiming MAR is often unreasonable (sometimes obviously incorrect), and then standard imputation mechanisms are problematic because they lead to biased and inaccurate estimates.

This third pattern is sometimes referred to as “non-ignorable” because now the relation between other variables of interest and the column containing missing data needs to be modeled explicitly. An example could be longitudinal data collected in a clinical trial. You will have an array of measurements over time, where sometimes a patient “isn’t available” for participation (data cannot be collected) because they got particularly sick as a result of a side effect of their treatment. Thankfully, after they recover and continue participating, their time series extends, but the “missing” data point occurred at a time they were not well; in other words, when their reading would have been atypical (worse). One would need to develop a statistical model to account for such latent variables. This stands in contrast with MAR and MCAR, where the pattern of missingness is called “ignorable” because it does not require explicit modeling of the missingness mechanism.

It is often desirable to “fix” records with missing values rather than delete them. There are two main reasons for this. First, we want to retain records (rather than delete them because they have missing values) to preserve the valuable information contained in incomplete records; principally, this serves to increase statistical power to the extent possible. Second, imputation avoids the bias associated with listwise deletion when the data are not MCAR: whenever the pattern of missingness is non-random, wholesale deleting records with missing data elements biases the dataset.

Contemporary innovations in software, and advances in methods (largely Bayesian methods), provide data scientists with a range of tools to perform imputation. All statistical packages provide such functions nowadays, so there is very little excuse to ignore these innovations. If nothing else, and no matter how bad the time pressure, an assay of the association between missing-data patterns and the other data present in your dataset gives some diagnostics to justify a highly pragmatic approach. For those interested, I have written a considerably longer (4356 words) whitepaper titled “The Many Faces of Missing Data” that explores some finer points and clarifies the requirements and assumptions of various imputation mechanisms.
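Both motivations can be seen in one toy example. The sketch below (my own illustration, not from the whitepaper) makes y missing more often when x is large, so the data are MAR rather than MCAR; listwise deletion then biases the estimated mean of y, while even a deliberately simple regression imputation largely removes that bias. Proper multiple imputation would additionally add random draws to reflect uncertainty:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Toy MAR setup: y is correlated with x, and y is missing more often
# when x is large. The true mean of y is 0.
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 1 / (1 + np.exp(-2 * x))

# Listwise deletion: the surviving rows skew toward low x, so the
# complete-case mean of y is biased downward.
complete_case_mean = y[~missing].mean()

# Regression imputation: fit y ~ x on complete cases, fill in the rest.
# Because missingness depends only on x, this fit is unbiased.
slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
y_imputed = np.where(missing, slope * x + intercept, y)
imputed_mean = y_imputed.mean()  # much closer to the true mean of 0

print(f"complete cases: {complete_case_mean:+.2f}, imputed: {imputed_mean:+.2f}")
```

This single-imputation version understates variance (every imputed point sits exactly on the regression line), which is precisely why the modern Bayesian and multiple-imputation tools mentioned above draw imputed values with noise rather than plugging in point predictions.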
