I’m feeling bad about having used $p$-values in my paper about Bell inequalities, despite considering them bullshit. The reason we decided to do it is because we wanted the paper to be useful for experimentalists, and *all* of them use $p$-values instead of Bayes factors[1]. We are already asking them to abandon their beloved Bell inequalities and use nonlocal games instead, and that is the main point of the paper. We thought that also asking them to give up on $p$-values would be too much cognitive dissonance, and distract from the main issue. Good reasons, no?

Still, still. The result is that there’s yet another paper perpetuating the use of $p$-values, and moreover my colleagues might see it and think that *I* defend them! The horror! My reputation would never recover. To set things right, I’m writing this post here to explain why $p$-values are nonsense, and restate our results in terms of Bayes factors.

First of all, what is a $p$-value? It is the probability of observing the data you got, or something more extreme, under the assumption that the null hypothesis is true[2]. If this probability is less than some arbitrary threshold, often set at $0.05$ or $10^{-5}$, the null hypothesis is said to be rejected.
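To make the definition concrete, here is a minimal sketch in Python (the numbers are made up for illustration, not taken from our paper): the one-sided $p$-value for observing $k$ or more victories in $n$ rounds of a nonlocal game whose winning probability under the null hypothesis is $\omega_\ell$.

```python
import math

def p_value(k, n, w_l):
    """One-sided p-value: probability of k *or more* victories in n
    rounds, assuming the null-hypothesis winning probability w_l."""
    return sum(math.comb(n, i) * w_l**i * (1 - w_l)**(n - i)
               for i in range(k, n + 1))

# Made-up example: 100 rounds, local winning probability 0.75,
# 85 victories observed.
print(p_value(85, 100, 0.75))  # below the conventional 0.05 threshold
```

Note that already here the "or more extreme" clause forces a choice: we summed the upper tail, implicitly deciding that only *more victories* counts as more extreme.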

The very definition already raises several red flags. Why are we considering the probability of “something more extreme”, instead of just the probability of the data we actually got? Also, what counts as “more extreme”? It doesn’t only sound ambiguous, it is ambiguous. And what about this arbitrary threshold? How should we set it, and why should we care about this number? More worryingly, how can we possibly regard the null hypothesis in isolation? In practice we always have another hypothesis in mind, but the $p$-value takes no notice of it. What if we do reject the null hypothesis, but the alternative hypothesis attributes only slightly higher probability to the observed outcome? Obviously the experiment was just inconclusive, but apparently not if you take $p$-values seriously.

Perhaps these are just counterintuitive properties of a definition that has a solid basis in probability theory, and we need to just grow up and accept that life is not as simple as we would want. Well, as it turns out, there’s absolutely no basis for using $p$-values. Pearson just pulled the definition out of his ass, for use in his $\chi^2$-test. People then ran with it and started applying it everywhere.

In stark contrast, a Bayes factor is defined simply as the ratio of the probabilities of observing the data you got, given the null or the alternative hypotheses:

\[ K := \frac{p(D|H_0)}{p(D|H_1)}.\] It suffers none of the problems that plague the definition of $p$-value: We only consider the actually observed data, no “more extreme” events are relevant. It explicitly depends on both hypotheses, there’s no attempt to reject a hypothesis in isolation. There’s no arbitrary threshold to set, it is just a number with a clear operational interpretation.

More importantly, it doesn’t come out of anybody’s ass, but from Bayes’ theorem: it is exactly what we need to calculate the posterior probability $p(H_0|D)$. In the case of two hypotheses, we have that

\[ \frac{p(H_0|D)}{p(H_1|D)} = \frac{p(D|H_0)p(H_0)}{p(D|H_1)p(H_1)} = K\frac{p(H_0)}{p(H_1)}.\] How do our results look when written in terms of the Bayes factor then?
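As a sketch of how this works in practice (with made-up winning probabilities, not values from our paper), one can compute the Bayes factor for a binomial Bell test and update the prior odds with it:

```python
import math

def bayes_factor(k, n, w_l, w_q):
    """K = p(D|H0)/p(D|H1) for k victories in n rounds, with winning
    probability w_l under the null hypothesis and w_q under the
    alternative. The binomial coefficients cancel in the ratio."""
    return (w_l / w_q)**k * ((1 - w_l) / (1 - w_q))**(n - k)

# Made-up example: 85 victories in 100 rounds.
K = bayes_factor(85, 100, 0.75, 0.85)
posterior_odds = K * 1.0  # with even prior odds p(H0)/p(H1) = 1
print(K)  # K well below 1: the data favour H1
```

No tail sums, no extremeness ranking: only the probability of the data actually observed, under each hypothesis.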

Before we see that, we have to undo another bad decision we made in our paper. Our goal was to minimize the expected $p$-value from a Bell test. But we don’t observe an expected $p$-value, we observe a $p$-value. And the $p$-value we expect to observe is the $p$-value of the expected number of victories $n\omega_q$. This is given by a hideous expression we have no hope of optimizing directly, but luckily there’s a simple and powerful upper bound, the Chernoff bound[3]:

\[p_n \le \exp(-n D(\omega_q||\omega_\ell)),\] where $D(\omega_q||\omega_\ell)$ is the relative entropy. And the Bayes factor of the expected number of victories is

\[K_n = \exp(-n D(\omega_q||\omega_\ell)),\] exactly the same expression. Neat, no? We don’t even need to fight about $p$-values and Bayes factors, we just maximize the relative entropy. The difference is that for the $p$-value this is just an upper bound, whereas for the Bayes factor it is the whole story.
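A quick numerical sanity check (with made-up values of $n$, $\omega_q$, $\omega_\ell$): since the binomial coefficients cancel in the likelihood ratio, the Bayes factor at exactly $n\omega_q$ victories equals $\exp(-n D(\omega_q||\omega_\ell))$ whenever $n\omega_q$ is an integer.

```python
import math

def rel_entropy(p, q):
    """Binary relative entropy D(p||q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n, w_q, w_l = 200, 0.85, 0.75   # made-up values; n * w_q = 170 is an integer
k = round(n * w_q)

# Log of the Bayes factor at the expected number of victories:
log_K = k * math.log(w_l / w_q) + (n - k) * math.log((1 - w_l) / (1 - w_q))
print(math.isclose(log_K, -n * rel_entropy(w_q, w_l)))  # True
```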

In the paper we maximized the gap of the nonlocal game $\chi := \omega_q-\omega_\ell$, because the expected $p$-value was too difficult to minimize directly, and because a large gap is a sufficient condition for having a small expected $p$-value. Now the Bayes factor is a simple expression that we could minimize directly, but at least our work was not in vain: a large gap is also a sufficient condition for a small Bayes factor, so it is still a good idea to maximize it. To see that, we only need to note that

\[D(\omega_q||\omega_\ell) \ge -\chi\log(1-\chi).\] Interestingly, having a large *ratio* $\omega_q/\omega_\ell$ is not sufficient for a small Bayes factor. The ratio is of widespread interest in the literature because a large ratio implies a large resistance to noise. Intuitively, this should be useless if $\omega_q$ is too small, as the favourable outcome would happen too seldom in an experiment. Indeed, in the Khot-Vishnoi game we have $\omega_q \ge 1/\log^2(d)$ and $\omega_\ell \le e^2/d$, so the ratio gets arbitrarily large with $d$, but the relative entropy gets arbitrarily close to 0, and thus the Bayes factor arbitrarily close to 1[4]. It’s a near miss, though, and the intuition is wrong. If we had instead $\omega_q \ge 1/\sqrt{\log(d)}$ and the same $\omega_\ell$, the relative entropy would actually diverge with increasing $d$, even though the gap gets arbitrarily small.
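Here is a small numerical sketch of that asymptotic behaviour, plugging the quoted bounds in directly (so treating them as if they were the actual values, which is only an illustration):

```python
import math

def rel_entropy(p, q):
    """Binary relative entropy D(p||q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for e in (6, 12, 24, 48):
    d = 10.0**e
    w_l = math.e**2 / d                                   # local value bound
    actual = rel_entropy(1 / math.log(d)**2, w_l)         # Khot-Vishnoi scaling
    hypothetical = rel_entropy(1 / math.sqrt(math.log(d)), w_l)
    print(f"d = 1e{e}: {actual:.4f} (goes to 0), {hypothetical:.2f} (diverges)")
```

Roughly speaking, $D(\omega_q||\omega_\ell) \approx \omega_q \log(\omega_q/\omega_\ell)$ in this regime, which scales as $1/\log(d)$ in the first case and as $\sqrt{\log(d)}$ in the second.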

It’s easy to show that

\[ D(\omega_q||\omega_\ell) \le \log\left( \frac{\omega_q}{\omega_\ell} \right), \] so having a large ratio is a necessary condition for a small Bayes factor. We have then a nice duality between the ratio and the gap: the gap is sufficient, but not necessary, and the ratio is necessary, but not sufficient.
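A brute-force spot check of both bounds, on an arbitrary grid of made-up values $(\omega_\ell, \omega_q)$ with $\omega_q > \omega_\ell$:

```python
import math

def rel_entropy(p, q):
    """Binary relative entropy D(p||q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for w_l in (0.1, 0.3, 0.5, 0.7):
    for w_q in (w_l + 0.05, w_l + 0.15, w_l + 0.25):
        chi = w_q - w_l
        D = rel_entropy(w_q, w_l)
        assert D >= -chi * math.log(1 - chi)   # gap bound (sufficient)
        assert D <= math.log(w_q / w_l)        # ratio bound (necessary)
print("both bounds hold on the grid")
```

The upper bound is easy to see directly: the second term of $D$ is negative when $\omega_q > \omega_\ell$, and the first term is at most $\log(\omega_q/\omega_\ell)$.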

I don’t think that beauty implies truth; otherwise neutrinos would be massless. But it’s a joy when you have an independent argument for something being true, and it turns out to be quite beautiful as well.