Pathological Probability Distributions

Bayesianism vs Frequentism and Pathological Probability Distributions

I’ve been thinking recently about Bayesian vs frequentist perspectives on statistics. These two perspectives differ in how they approach probabilistic problems. The frequentist perspective can roughly be characterized by arguments about the limits of random processes. For example, when flipping a fair coin, the probability that the coin lands heads-up can be said to be 50% because, if the coin were flipped infinitely many times, the ratio of the number of heads to the total number of flips would approach 1/2. From the frequentist perspective, then, assigning a probability to an empirical observation is an idealization. There is no way to actually flip a coin infinitely many times, but in the real world we often deal with sample sizes or datasets large enough that the limiting quantities are close enough to be useful.

In contrast, Bayesian reasoning considers probability not as the limit of an infinite experiment, but as a degree of certainty about an outcome. For the same coin, a Bayesian might say that the coin has a 50% chance of landing heads-up because, until we observe it, we are only half certain of its state. As you might expect, both types of reasoning reach the same conclusion here. However, each form of reasoning can be more useful for certain problems.

One example of such a problem is incorporating previous data when evaluating an experiment. Such prior information is called, appropriately, a ‘prior’. In the Bayesian framework, it’s clear what this information represents: it is simply the prior certainty about a given state, and when that information is combined with a current measurement, the result (the ‘posterior’) gives the final certainty about the quantity. A similar conclusion can be reached with frequentist reasoning, but it is much less intuitive.
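To make this concrete, here is a minimal sketch of the update in Python (assuming scipy is available), using the standard conjugate Beta prior for a coin’s heads probability. The prior parameters and flip counts below are made up for illustration.

```python
from scipy.stats import beta

# Prior: Beta(2, 2) encodes a mild initial belief that the coin is roughly fair.
a_prior, b_prior = 2, 2

# New measurement: say we observe 7 heads in 10 flips.
heads, flips = 7, 10

# Posterior: for a Beta prior and coin-flip data, the update is just
# adding the observed counts: Beta(a + heads, b + tails).
a_post = a_prior + heads
b_post = b_prior + (flips - heads)

print("posterior mean:", beta.mean(a_post, b_post))            # 9/14, about 0.64
print("95% credible interval:", beta.interval(0.95, a_post, b_post))
```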

Another place where frequentism struggles is with pathological, heavy-tailed probability distributions like the Cauchy distribution. From the frequentist perspective, a probability distribution describes the relative frequencies of the values a random variable may take, whereas from the Bayesian perspective it encodes the certainty we have about an unknown variable. Two of the most fundamental parameters of a distribution are its mean and variance. The mean is a rough average of what a ‘typical’ draw from the distribution looks like, while the variance encodes the ‘spread’ of the draws about that mean. Let’s start by conducting a frequentist experiment to measure, empirically, the mean and variance of a normal distribution.

Consider, for example, draws from a normal distribution with a mean of zero and a variance of one. We can plot the inferred mean and variance as a function of the number of draws: the running mean is shown on the left, the running variance in the center, and the distribution itself on the right, with the true values marked by red lines. As we draw more and more numbers, our calculations quickly converge to the predicted values.

[Figure: Normal distribution — running mean, running variance, and histogram of draws]
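A minimal sketch of this experiment in Python (assuming numpy, and printing the running statistics rather than plotting them):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is reproducible
draws = rng.normal(loc=0.0, scale=1.0, size=100_000)  # true mean 0, true variance 1

n = np.arange(1, draws.size + 1)
running_mean = np.cumsum(draws) / n
# running variance via E[x^2] - E[x]^2
running_var = np.cumsum(draws**2) / n - running_mean**2

for k in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={k:>7,}: mean={running_mean[k-1]:+.4f}, var={running_var[k-1]:.4f}")
```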

We can repeat the exact same experiment with a different distribution, in this case a uniform distribution between 0 and 1. The distribution (the plot on the right) looks totally different from the first example, but our trick still works: we just take a bunch of samples, compute a running mean and variance over them, and they approach the true mean and variance!
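In the sketch above, only the sampling line changes; for the uniform distribution on (0, 1) the true mean is 1/2 and the true variance is 1/12 ≈ 0.0833:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.uniform(0.0, 1.0, size=100_000)  # true mean 1/2, true variance 1/12

n = np.arange(1, draws.size + 1)
running_mean = np.cumsum(draws) / n
running_var = np.cumsum(draws**2) / n - running_mean**2
print(running_mean[-1], running_var[-1])  # should land near 0.5 and 0.0833
```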

Unfortunately, there’s more to the story: there are some types of randomness that don’t behave this way. To see why, let’s repeat the same experiment one final time.

The Cauchy distribution is the traditional example of a ‘pathological’ probability distribution that breaks these rules.

Here the distribution looks reasonable (like a wider version of the normal distribution), but as we take more and more draws, neither the mean nor the variance converges! Unlike in the previous examples, our running statistics keep jumping around without ever settling down. Intuitively, the Cauchy distribution accomplishes this by having such heavy tails that, no matter how many draws we take, the occasional extreme ‘outlier’ is large enough to overwhelm all the ‘typical’ values, so no stable mean or variance can be calculated.
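The same running-statistics sketch as above, swapped to Cauchy draws, shows the problem (again assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.standard_cauchy(size=1_000_000)  # heavy tails: no finite mean or variance

n = np.arange(1, draws.size + 1)
running_mean = np.cumsum(draws) / n
running_var = np.cumsum(draws**2) / n - running_mean**2

# Unlike the normal and uniform cases, these don't settle down as n grows:
for k in (100, 10_000, 1_000_000):
    print(f"n={k:>9,}: mean={running_mean[k-1]:+.3f}, var={running_var[k-1]:.3e}")
```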

The Cauchy distribution is not the only distribution for which asking ‘what is the average?’ doesn’t make sense; in fact, there are infinitely many distributions whose mean or variance is undefined. While one does not necessarily need to describe probability distributions in terms of their mean and variance, it is troubling that the frequentist perspective fails so spectacularly on such a simple example!

Of course, after critiquing frequentism so much, I’m inclined to provide at least one example in which frequentism is more useful than Bayesian reasoning. Frequentist arguments tend to excel when you have complicated systems that can be easily simulated; a good example of such a system is the coin counting project I embarked on a few years ago. In such a system one can compute error bars and other useful metrics of extremely complicated systems using nothing but repeated simulation. If each individual draw is computationally cheap, this is an excellent way of characterizing a system with no need for Bayesian mathematics, or indeed much mathematics at all. In this way, frequentist arguments can be extremely fast and convenient to implement computationally.
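As an illustration of that style of argument, here is a minimal sketch with a made-up toy system standing in for a genuinely complicated one; the derived quantity and its noise levels are pure assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_once(rng):
    """One run of a toy system (a hypothetical stand-in for something
    genuinely complicated): a derived quantity built from noisy inputs."""
    x = rng.normal(10.0, 0.5)  # one noisy measurement
    y = rng.normal(2.0, 0.1)   # another noisy measurement
    return x / y**2            # the derived quantity we care about

# The frequentist 'error bar' is just the spread of outcomes over many
# repeated simulations; no analytic error propagation is required.
outcomes = np.array([simulate_once(rng) for _ in range(10_000)])
print(f"estimate: {outcomes.mean():.3f} +/- {outcomes.std():.3f}")
```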

GitHub link to code: https://github.com/r-zachary-murray/archive/tree/master/Pathological_Distributions