It happens all the time. You run an experiment on nonlocality or steering, and you want to test whether the data you collected are compatible with hidden variables. You plug them into the computer and the answer is no, they are not. You examine them a bit more closely, and you see that they are also incompatible with quantum mechanics, because they are signalling. After a bit of cold sweating, you realize that they are very close to non-signalling; all the trouble happened because the computer needs them to be exactly non-signalling. You then relax, project them onto the non-signalling subspace, and call it a day.
Never do this. Experimental data is sacred. You can’t arbitrarily chop it off to fit your Procrustean bed.
First of all, remember that even if your probabilities are strictly non-signalling, the probability of obtaining relative frequencies that respect the no-signalling equations exactly is effectively zero. There’s nothing wrong with “signalling” frequencies. On the contrary, if some experimentalist reported relative frequencies that were exactly non-signalling I’d be very suspicious. What you should get in a real experiment are frequencies that are very close to non-signalling, but not exactly1.
“That doesn’t help me”, you reply. “I can accept signalling frequencies all day long, but the computer still needs them to be non-signalling in order to test hidden variable models.”
Sure, but what the computer needs are non-signalling probabilities, which you should infer from the signalling frequencies.
“Exactly, and to infer non-signalling probabilities I just project the frequencies onto the non-signalling subspace.”
No! Inferring probabilities from frequencies is the oldest problem in statistics. People have studied this problem to death, and have come up with several respectable methods. There’s no point in reinventing the wheel. And if you do insist on reinventing the wheel, you’d better be damn sure that it’s round.
To make it clear that this projection technique is a square wheel, I’ll examine in detail a toy version of the problem of getting non-signalling probabilities. The simplest case of the real problem involves getting from a 12-dimensional space of frequencies to an 8-dimensional non-signalling subspace, which is too much to do by hand for even the most dedicated PhD students2. Instead I’ll go for the minimal scenario, a 2-dimensional space of frequencies that goes down to a 1-dimensional subspace.
Consider then an experiment with 3 possible outcomes, 0, 1, and 2, where our analogue of the no-signalling assumption is that $p_1 = 2p_0$. The possible relative frequencies we can observe lie in the triangle bounded by $p_0 \ge 0$, $p_1 \ge 0$, and $p_0 + p_1 \le 1$. The possible probabilities are just the line $p_1 = 2p_0$ inside this triangle. Again, if we generate data according to these probabilities they will almost surely not fall on the line $p_1 = 2p_0$. Let’s say we observed $n_0$ outcomes 0, $n_1$ outcomes 1, and $n_2$ outcomes 2, for a total of $n = n_0 + n_1 + n_2$. What is the probability $p_0$ we should infer from this data?
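To drive the point home, here’s a minimal simulation sketch, assuming numpy, with an arbitrary choice of $p_0 = 0.2$ and $n = 1000$ samples: we generate counts from a distribution that lies exactly on the line and check how far the observed frequencies are from it.

```python
# Sample counts from a distribution that satisfies p1 = 2*p0 exactly,
# and check whether the observed relative frequencies do as well.
import numpy as np

rng = np.random.default_rng(0)

p0 = 0.2                      # arbitrary choice, must satisfy 0 <= p0 <= 1/3
p = [p0, 2 * p0, 1 - 3 * p0]  # a distribution exactly on the line

n = 1000
counts = rng.multinomial(n, p)            # the observed n0, n1, n2
f0, f1, f2 = counts / n                   # relative frequencies

print(f"f1 - 2*f0 = {f1 - 2 * f0:+.4f}")  # almost surely nonzero
```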
Let’s start with the projection technique. Compute the relative frequencies $f_0 = n_0/n$ and $f_1 = n_1/n$, and project the point $(f_0,f_1)$ onto the line $p_1 = 2p_0$. Which projection, though? There are infinitely many. The most natural one is an orthogonal projection, but that already weirds me out. Why on Earth are we talking about angles between probability distributions? They are vectors of real numbers, sure, we can compute angles, but we shouldn’t expect them to mean anything. Doing it anyway, we get that
\[ p_0 = \frac15(f_0 + 2f_1)\quad\text{and}\quad p_1 = \frac25(f_0 + 2f_1),\]which do not respect positivity: if $f_0=0$ and $f_1=1$ we have that $p_0+p_1 = 6/5$, which implies that $p_2 = -1/5$.3 What now? Arbitrarily make the probabilities positive? Invent some other method, such as minimizing the distance from the point $(f_0,f_1)$ to the line $p_1 = 2p_0$? Which distance then? Euclidean? Total variation? No, it’s time to admit that it was a bad idea to start with and open a statistics textbook.
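Here’s a little sketch of the projection, evaluated at the corner $f_0 = 0$, $f_1 = 1$, where the negativity is most blatant:

```python
# Orthogonal projection of the frequencies (f0, f1) onto the line p1 = 2*p0.
def project(f0, f1):
    s = (f0 + 2 * f1) / 5
    return s, 2 * s          # the projected (p0, p1), which lies on the line

# The corner f0 = 0, f1 = 1 makes the problem obvious:
p0, p1 = project(0.0, 1.0)
p2 = 1 - p0 - p1
print(p0, p1, p2)            # 0.4, 0.8, and a p2 of -1/5
```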
There you’ll find a very popular method: maximum likelihood. We write the likelihood function
\[L(p_0) = p_0^{n_0} (2 p_0)^{n_1} (1-3p_0)^{n_2},\]which is just the probability of the data given the parameter $p_0$, and maximize it, finding
\[p_0 = \frac13(f_0 + f_1)\quad\text{and}\quad p_1 = \frac23(f_0+f_1).\]Now maximum likelihood is probably the shittiest statistical method one can use, but at least the answer makes sense. The resulting probabilities are normalized, and they mean something: they are the ones that assign the highest probability to the observed data. My point is that even the worst statistical method is better than arbitrarily chopping off your data. Moreover, it’s very easy to do, so there’s no excuse.
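If you don’t trust the algebra, here’s a sketch that checks the closed-form estimate against a brute-force numerical maximization of the likelihood; the counts are made up, and scipy is assumed.

```python
# Maximum-likelihood estimate of p0, in closed form and by brute force.
import numpy as np
from scipy.optimize import minimize_scalar

n0, n1, n2 = 230, 385, 385    # made-up counts, n = 1000
n = n0 + n1 + n2

# Closed form, from setting the derivative of log L to zero:
p0_mle = (n0 + n1) / (3 * n)

# Numerical check: maximize log L(p0), dropping the constant n1*log(2).
neg_log_L = lambda p0: -((n0 + n1) * np.log(p0) + n2 * np.log(1 - 3 * p0))
res = minimize_scalar(neg_log_L, bounds=(1e-9, 1/3 - 1e-9), method="bounded")

print(p0_mle, res.x)          # the two estimates agree
```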
If you want to do things properly, though, you have to do Bayesian inference. You have to multiply the likelihood function by the prior, normalize that, and compute the expected $p_0$ from the posterior in order to obtain a point estimate. It’s a bit more work, but in this case it’s still easy, and for a flat prior it gives
\[p_0 = \frac13\frac{n_0 + n_1+1}{n+2}\quad\text{and}\quad p_1 = \frac23\frac{n_0 + n_1+1}{n+2}.\]Besides getting a more sensible answer and the ability to change the prior, the key advantage of Bayesian inference is that it gives you the whole posterior distribution. It naturally provides you with a credible region around your estimate, the beloved error bars any experimental paper must include. It’s harder to do, sure, but none of you got into physics because it was easy, right?
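Here’s a sketch of that computation, with the same made-up counts and again assuming scipy. With a flat prior on $p_0$ over $[0,1/3]$ the posterior of $3p_0$ is a Beta distribution, so the posterior mean and a credible interval come essentially for free:

```python
# Bayesian inference for the toy model with a flat prior on p0 over [0, 1/3].
# The posterior of x = 3*p0 is Beta(n0 + n1 + 1, n2 + 1), and p0 = x/3.
from scipy.stats import beta

n0, n1, n2 = 230, 385, 385          # the same made-up counts as before

posterior = beta(n0 + n1 + 1, n2 + 1)

p0_mean = posterior.mean() / 3                 # the point estimate above
lo, hi = posterior.ppf([0.025, 0.975]) / 3     # a 95% credible interval

print(f"p0 = {p0_mean:.4f}, 95% credible interval [{lo:.4f}, {hi:.4f}]")
```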