I’m feeling bad about having used $p$-values in my paper about Bell inequalities, despite considering them bullshit. We decided to do it because we wanted the paper to be useful for experimentalists, and *all* of them use $p$-values instead of Bayes factors. We are already asking them to abandon their beloved Bell inequalities and use nonlocal games instead, and that is the main point of the paper. We thought that also asking them to give up on $p$-values would be too much cognitive dissonance, and would distract from the main issue. Good reasons, no?

Still, still. The result is that there’s yet another paper perpetuating the use of $p$-values, and moreover my colleagues might see it and think that *I* defend them! The horror! My reputation would never recover. To set things right, I’m writing this post here to explain why $p$-values are nonsense, and restate our results in terms of Bayes factors.

First of all, what is a $p$-value? It is the probability of observing the data you got, or something more extreme, under the assumption that the null hypothesis is true. If this probability is less than some arbitrary threshold, often set at $0.05$ or $10^{-5}$, the null hypothesis is said to be rejected.

The very definition already raises several red flags. Why are we considering the probability of “something more extreme”, instead of just the probability of the data we actually got? Also, what counts as “more extreme”? It doesn’t only sound ambiguous, it is ambiguous. And what about this arbitrary threshold? How should we set it, and why should we care about this number? More worrisome, how can we possibly regard the null hypothesis in isolation? In practice we always have another hypothesis in mind, but come on. What if we do reject the null hypothesis, but the alternative hypothesis attributes only slightly higher probability to the observed outcome? Obviously the experiment was just inconclusive, but apparently not if you take $p$-values seriously.
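To make the last complaint concrete, here is a minimal sketch with made-up numbers: a coin-flip experiment with 60 heads in 100 tosses, a null hypothesis of a fair coin, and a hypothetical alternative with heads probability 0.51. None of these numbers come from the paper; they are chosen only so that the alternative is barely better than the null.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Illustrative assumptions, not data from any real experiment.
n, k = 100, 60               # tosses, observed heads
p_null, p_alt = 0.5, 0.51    # heads probability under each hypothesis

# One-sided p-value: probability of k or more heads under the null.
p_value = sum(binom_pmf(j, n, p_null) for j in range(k, n + 1))

# Ratio of the probabilities of the data actually observed.
likelihood_ratio = binom_pmf(k, n, p_alt) / binom_pmf(k, n, p_null)

print(f"one-sided p-value: {p_value:.4f}")
print(f"p(D|alt) / p(D|null): {likelihood_ratio:.2f}")
```

The $p$-value comes out just under $0.03$, so the null is "rejected" at the $0.05$ threshold, while the alternative makes the observed data only about $1.5$ times more likely than the null does.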

Perhaps these are just counterintuitive properties of a definition that has a solid basis in probability theory, and we need to just grow up and accept that life is not as simple as we would want. Well, as it turns out, there’s absolutely no basis for using $p$-values. Pearson just pulled the definition out of his ass, for use in his $\chi^2$-test. People then ran with it and started applying it everywhere.

In stark contrast, a Bayes factor is defined simply as the ratio of the probabilities of observing the data you got, given the alternative or the null hypothesis:

\[ K := \frac{p(D|H_1)}{p(D|H_0)}.\] It suffers from none of the problems that plague the definition of the $p$-value: we only consider the actually observed data, so no “more extreme” events are relevant. It explicitly depends on both hypotheses, so there’s no attempt to reject a hypothesis in isolation. There’s no arbitrary threshold to set; it is just a number with a clear operational interpretation.

More importantly, it doesn’t come out of anybody’s ass, but from Bayes’ theorem: it is exactly the quantity we need to calculate the posterior probability $p(H_0|D)$ from the priors. In the case of two hypotheses, we have that

\[ p(H_0|D) = \frac{p(D|H_0)p(H_0)}{p(D|H_0)p(H_0) + p(D|H_1)p(H_1)} = \frac1{1+K\frac{p(H_1)}{p(H_0)}}.\]

How do our results look when written in terms of the Bayes factor then? Our goal was to minimize the expected $p$-value from a Bell test. The expected $p$-value is a hideous expression that we have no hope of optimizing directly, so with a lot of work we showed that the expected $p$-value after $n$ rounds $\langle p_n \rangle$ is upper bounded by

\[ \langle p_n \rangle \le (1-\chi^2)^n, \] where $\chi := \omega_q-\omega_\ell$ is the gap of the nonlocal game, and $\omega_q$ and $\omega_\ell$ are the Tsirelson and local bounds. We can easily maximize the gap, and a large gap is a sufficient condition for having a small $\langle p_n \rangle$, so that’s what we did. We suspect that a large gap is also a necessary condition, and that the lower bound $\frac12(1-\chi)^n \le \langle p_n \rangle$ holds, but we couldn’t prove it.

With the Bayes factor we have pure bliss. The expression for the expected Bayes factor after $n$ rounds $\langle K_n \rangle$ is simple, no bound is needed:

\[ \langle K_n \rangle = \left( 1 + \frac{\chi^2}{\omega_\ell(1-\omega_\ell)} \right)^n. \] Maybe we can even optimize that directly. In any case, a large gap is also a sufficient condition for a large $\langle K_n \rangle$, as we have the bound

\[ \frac1{(1-\chi)^n} \le \langle K_n \rangle, \] valid for $\chi \ge 1/2$. In contrast to the $p$-value case, it’s obvious from the expression that a large gap is *not* necessary for a large $\langle K_n \rangle$. We even have an explicit example of a nonlocal game with arbitrarily small gap and arbitrarily large $\langle K_n \rangle$, the Khot-Vishnoi game. Ironically enough, we use this same example in the paper to argue for the opposite conclusion, as it has $\langle p_n \rangle$ close to 1.
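
The lower bound on $\langle K_n \rangle$ stated above can be checked directly from the closed form. Here is a short sketch, assuming only $0 < \omega_\ell < \omega_q \le 1$. Since $\omega_q \le 1$ we have $\omega_\ell \le 1-\chi$, and $\chi \ge 1/2$ gives $1-\chi \le 1/2$. The function $x(1-x)$ is increasing on $[0,1/2]$, so $\omega_\ell(1-\omega_\ell) \le \chi(1-\chi)$, and therefore

\[ 1 + \frac{\chi^2}{\omega_\ell(1-\omega_\ell)} \ge 1 + \frac{\chi}{1-\chi} = \frac1{1-\chi}. \] Raising both sides to the power $n$ gives the bound, and the argument shows where the condition $\chi \ge 1/2$ enters.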

The Khot-Vishnoi game is of widespread interest in the literature because its ratio $\omega_q/\omega_\ell$ is arbitrarily large, which implies a large resistance to noise. We wanted to argue that this is useless if $\omega_q$ is too small, as the favourable outcome would happen too seldom in a finite experiment. As pointed out above, this is not true for this game, but the reasoning is still sound: if we have a nonlocal game with $\omega_q = 1/d^{3/4}$ and $\omega_\ell = 1/d$, the ratio is $d^{1/4}$, which does get arbitrarily large with $d$, but $\langle K_n \rangle$ goes roughly like $1+n/\sqrt{d}$ for large $d$, so the large ratio doesn’t help at all.
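
The scaling in this toy example is easy to check numerically. The sketch below evaluates the closed form for $\langle K_n \rangle$ and compares it with $1+n/\sqrt{d}$ and with the ratio $\omega_q/\omega_\ell$. The function names are mine, and the per-round check rests on an assumption that is not spelled out in this post: each round is simply won or lost, with winning probability $\omega_q$ under the quantum hypothesis and $\omega_\ell$ under the local one, and the expectation is taken under the quantum hypothesis.

```python
from math import sqrt

def expected_bayes_factor(omega_q, omega_l, n):
    """Closed form for <K_n> quoted in the text."""
    chi = omega_q - omega_l
    return (1 + chi**2 / (omega_l * (1 - omega_l)))**n

def per_round_expectation(omega_q, omega_l):
    """Expected single-round Bayes factor under the win/lose assumption."""
    win = omega_q * (omega_q / omega_l)                     # round won
    lose = (1 - omega_q) * ((1 - omega_q) / (1 - omega_l))  # round lost
    return win + lose

def toy_game(d):
    """Hypothetical game from the text: omega_q = d^(-3/4), omega_l = 1/d."""
    return d**-0.75, 1 / d

n = 100
for d in (10**6, 10**8, 10**10):
    omega_q, omega_l = toy_game(d)
    ratio = omega_q / omega_l
    exact = expected_bayes_factor(omega_q, omega_l, n)
    approx = 1 + n / sqrt(d)
    print(f"d={d:.0e}  ratio={ratio:8.1f}  <K_n>={exact:.5f}  1+n/sqrt(d)={approx:.5f}")
```

The approximation $1+n/\sqrt{d}$ is only meaningful when $n \ll \sqrt{d}$, which is why the loop starts at $d = 10^6$ for $n = 100$: the ratio keeps growing with $d$ while $\langle K_n \rangle$ stays close to 1.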

It’s easy to show that

\[ \langle K_n \rangle \le \left( \frac{\omega_q}{\omega_\ell} \right)^n, \] so having a large ratio is a necessary condition for a large $\langle K_n \rangle$. We have then a nice duality between the ratio and the gap: the gap is sufficient, but not necessary, and the ratio is necessary, but not sufficient.
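
For the record, here is a sketch of why the upper bound holds, starting from the closed form for $\langle K_n \rangle$ and assuming $0 < \omega_\ell < 1$ and $\chi > 0$. Since $\omega_q/\omega_\ell = 1 + \chi/\omega_\ell$, the per-round inequality reads

\[ \frac{\chi^2}{\omega_\ell(1-\omega_\ell)} \le \frac{\chi}{\omega_\ell} \iff \chi \le 1-\omega_\ell \iff \omega_q \le 1, \] which always holds because $\omega_q$ is a probability. Raising both sides to the power $n$ gives the bound.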

I don’t think that beauty implies true; otherwise neutrinos would be massless. But it’s a joy when you have an independent argument for something to be true, and it turns out to be quite beautiful as well.