The curse of the denominator – why the average doesn’t always work so well…

Tom Breur

7 July 2023     

Descriptive statistics is about collecting, organizing, summarizing, and presenting data (Bluman, 2018). I am constantly surprised how few people take an interest in learning the pros and cons of the various measures of central tendency (e.g. mean, mode, median, midrange), and which mathematical operations they allow – or not! I am amused, although underwhelmed, to see people habitually perform all kinds of computations on median values, even though none of those procedures are mathematically valid. Another common transgression is calculating the average of averages, which is equally invalid. How many people are aware of this, I honestly wonder. Do they even care?
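
To make this concrete, here is a small Python sketch (the numbers are made up, purely for illustration) of why the average of averages goes wrong whenever group sizes differ:

segment_sizes = [10, 990]           # two segments of very different size
segment_means = [100.0, 10.0]       # each segment's own average

# The "average of averages" ignores how many observations each segment contributes
average_of_averages = sum(segment_means) / len(segment_means)   # 55.0

# The true overall mean weighs each segment by its size
overall_mean = sum(n * m for n, m in zip(segment_sizes, segment_means)) / sum(segment_sizes)   # ~10.9

print(average_of_averages, overall_mean)

The two results only coincide when every group is equally large – which in practice it rarely is.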

Because we can (almost) never collect all of the data, inferences are made based on samples – presumably unbiased ones. Practically speaking, in the real world it is near impossible to conduct any field sampling while avoiding bias altogether. Needless to say, it is risky to neglect the possible impact this bias might have on decision making. In this post I want to draw attention to situations where sources of such bias might go unnoticed. Not sampling bias – that people can relate to intuitively – but bias that results from misusing proportions, in particular when the misrepresentation stems from non-obvious shifts in the denominator. This is most likely to occur when the measure of central tendency is imperfectly understood, and therefore misused.

In Frequentist statistics, we always have to make assumptions about the underlying distributions. Although egregious violations of those assumptions might get flagged, I find that by and large the assumptions themselves remain unknown to most practitioners, too. There are many heuristics one can apply for a quick sanity check on whether violations of core assumptions might be problematic. I make an effort to keep those handy in my proverbial back pocket, so that I can always assess when taking a closer look might be warranted.

In this post, I highlight a phenomenon that I am going to coin “the curse of the denominator.” I can hardly imagine being the first to use that term, but my Google search didn’t turn up any noteworthy citations, so who knows? What I mean by “the curse of the denominator” is the misinterpretation of proportions that occurs when you compare two ratios with distinct (different) denominators, and that shift in denominator is non-obvious. Let me try and illustrate with an example.

An example

I am drawing from an example that was published elsewhere (https://www.opentable.com/state-of-industry), and discussed by some authorities on statistics (Prof. Carl Bergstrom and Nate Silver, among others). It has everything to do with how to represent numbers truthfully, and how to avoid falling victim to “the curse of the denominator.” The example pertains to restaurant occupancy rates before and after the COVID pandemic, used to illustrate a trend that all of us can relate to in the “post-COVID” world, namely the return to a new normal. Personally I’m not sure about “new” nor about “normal”, but that’s beside the point for this example. When you compare two (or more) proportions – occupancy rates before and after the pandemic – it appears that mentally we tend to average those averages, a well-known invalid mathematical operation.

Most people know OpenTable as a booking platform for restaurant tables. Obviously, restaurant reservations hit a big slump during COVID, and the quoted report suggests that in most geographies occupancy rates (!) are almost back to what they were before the pandemic. In an earlier version of this report (May 2022), they used a table to illustrate their conclusion that occupancy rates were almost back to pre-COVID levels, at least in some geographies – notably the United States (the bottom row of that table).

The problem with this conclusion, however, is that the comparison is between rates. And when you compare percentages, people intuitively compare (“average”) those numbers. What that comparison obfuscates is that the denominator can change (due to restaurants going out of business during COVID). When a lower number of reservations is divided by a lower number of restaurants (the denominator of that occupancy rate), the occupancy rate may not seem (much) lower, even though the absolute number of reservations has dropped considerably! In all fairness, they do make the data available for review, so the astute data scientist can go in and double-check some of these findings. Of course, only those (like me) who find a conclusion like this odd will bother, because I myself got the impression that people are dining out less often! Needless to say, I am delighted for the people in the hospitality industry who had been suffering the last few years.
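
A back-of-the-envelope calculation shows how this plays out. The numbers below are entirely hypothetical (not OpenTable’s actual figures); they merely show how a shrinking denominator can make a rate look “recovered” while the underlying volume is still well down:

# Hypothetical figures, for illustration only
restaurants_2019, reservations_2019 = 1000, 50_000    # pre-pandemic baseline
restaurants_2022, reservations_2022 = 700, 36_000     # fewer restaurants, fewer diners

rate_2019 = reservations_2019 / restaurants_2019      # 50.0 reservations per restaurant
rate_2022 = reservations_2022 / restaurants_2022      # ~51.4 – looks fully "recovered"

print(f"per-restaurant rate change: {rate_2022 / rate_2019 - 1:+.1%}")              # +2.9%
print(f"total reservations change:  {reservations_2022 / reservations_2019 - 1:+.1%}")  # -28.0%

The rate is flat (even up a little), yet nearly a third of the reservations have disappeared – the “recovery” lives in the denominator.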

Note how common it is to compare percentages, and how often the denominator may be subject to slight, barely noticeable shifts. Statisticians refer to that denominator as the base population, the universe you are generalizing to. When that population drifts (changes composition), you can hardly generalize to it – that becomes a proverbial apples-to-oranges comparison. This effect is compounded by the fact that the denominator is typically the larger number, so a unit change in the denominator does not move a proportion as much as a unit change in the numerator does. Yet at the same time the focus of attention is usually on the numerator, so changes in a proportion are “mentally” attributed to changes in the numerator – not the denominator! The bottom line is that you need to be mindful of shifts in either, which is also one of the mechanisms behind Simpson’s Paradox, that I wrote about in an earlier blog.
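
That asymmetry is easy to verify with a couple of lines of (again, purely illustrative) arithmetic: for a proportion p = n / d with d larger than n, a one-unit change in the numerator moves p more than a one-unit change in the denominator does:

n, d = 30, 100
p = n / d                        # 0.300

p_numerator_up = (n + 1) / d     # 0.310  -> shift of +0.010
p_denominator_up = n / (d + 1)   # ~0.297 -> shift of only about -0.003

print(p_numerator_up - p, p_denominator_up - p)

Small wonder, then, that our attention gravitates to the numerator – which is exactly why a quietly shifting denominator is so easy to miss.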

Conclusion

Much of this comes down, in my opinion, to a wholesale neglect of the importance of data literacy. We live in the era of “Big Data”, but our school curricula have not caught up with that reality yet. Neither has journalism, for the most part. We are all taught to be very careful and diligent with our choice of wording. The same care and accuracy doesn’t seem to have set in when it comes to framing numerical questions, and how to present evidence derived from data. The data never lies, but statisticians make great liars!

In business, if we are going to be serious about data-driven decision making, we’d better catch up on data literacy first. Foundational concepts like measurement levels of data, and which mathematical operations are valid (one cannot simply average median values, nor proportions), are essential data hygiene in this regard. All too often I cringe when I hear news reports misrepresent data, although I mostly don’t (even) suspect them of deliberately trying to mislead. This problem needs to be solved at the root, beginning with school curricula. Journalism needs to take good (better…) care to avoid many of these traps, as John Allen Paulos, author of the classic “Innumeracy” (2001), already showed ten years ago in “A Mathematician Reads the Newspaper” (2013). I am encouraged when I see new titles appear like “The Art of Statistical Thinking” (2022) by Rutherford & Kim, or “The Digital Mindset” (2022) by Leonardi & Neeley. I hope some of those go on to become best sellers like “How to Lie With Statistics” (1993) by Darrell Huff. Huff was a best-selling author with the dubious reputation of having been hired by the tobacco industry to defend their cause. It took decades before statisticians would acknowledge the causal link between smoking and cancer, and I would argue that innumeracy played a significant role in that (no pun intended!). It would be nice if a new generation of data-savvy thought leaders would grow their repute on a more noble cause…
