# Understanding Bell’s theorem part 2: the nonlocal version

Continuing the series on Bell’s theorem, I will now write about its most popular version, the one that people have in mind when they talk about quantum nonlocality: the version that Bell proved in his 1975 paper The theory of local beables.

But first things first: why do we even need another version of the theorem? Is there anything wrong with the simple version? Well, a problem that Bohmians have with it is that its conclusion is heavily slanted against their theory: quantum mechanics clearly respects no conspiracy and no action at a distance, but clearly does not respect determinism, so the most natural interpretation of the theorem is that trying to make quantum mechanics deterministic is a bad idea. The price you have to pay is having action at a distance in your theory, as Bohmian mechanics has. Because of this the Bohmians prefer to talk about another version of the theorem, that lends some support to the idea that the world is in some sense nonlocal.

There is also legitimate criticism to be made against the simple version of Bell’s theorem: namely that the assumption of determinism is too strong. This is easy to see, as we can cook up indeterministic correlations that are even weaker than the deterministic ones: if Alice and Bob play the CHSH game randomly they achieve $p_\text{succ} = 1/2$, well below the bound of $3/4$. This implies that just giving up on determinism does not allow you to violate Bell inequalities. You need to lose something more precious than that. What exactly?
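The $1/2$ for random play is easy to check by brute force. Here is a minimal Python sketch, assuming the CHSH winning condition from the simple version of the theorem (outputs equal unless both inputs are 1, i.e. $a \oplus b = xy$) and uniformly distributed inputs. The names `chsh_success` and `uniform` are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def chsh_success(p):
    """Success probability in the CHSH game for a behaviour p(a, b, x, y).

    Assumes the four input pairs (x, y) are equally likely, and that the
    players win when a XOR b equals x AND y.
    """
    total = Fraction(0)
    for a, b, x, y in product((0, 1), repeat=4):
        if (a ^ b) == (x & y):
            total += Fraction(1, 4) * p(a, b, x, y)
    return total


def uniform(a, b, x, y):
    """Alice and Bob ignore their inputs and answer with fair coin flips."""
    return Fraction(1, 4)


print(chsh_success(uniform))  # 1/2
```

For every pair of inputs exactly two of the four output pairs win, and each has probability $1/4$, which is where the $1/2$ comes from.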

The first attempt to answer this question was made by Clauser and Horne in 1974. Their proof goes like this: from no conspiracy, the probabilities decompose as
$p(ab|xy) = \sum_\lambda p(\lambda)p(ab|xy\lambda)$
Then, they introduce their new assumption

• Factorisability:   $p(ab|xy\lambda) = p(a|x\lambda)p(b|y\lambda)$.

which makes the probabilities reduce to
$p(ab|xy) = \sum_\lambda p(\lambda)p(a|x\lambda)p(b|y\lambda)$
Noting that for any coefficients $M^{ab}_{xy}$ the Bell expression
$p_\text{succ} = \sum_{abxy} \sum_\lambda M^{ab}_{xy} p(\lambda)p(a|x\lambda)p(b|y\lambda)$
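is upper bounded by its value on deterministic probability distributions $p(a|x\lambda)$ and $p(b|y\lambda)$ (the expression is linear in each of them separately, so its maximum is attained at an extreme point), the rest of the proof of the simple version of Bell’s theorem applies, and we’re done.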
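As a numerical sanity check of this step (not a proof), one can sample random factorised strategies for a single $\lambda$ and compare them with the deterministic ones. The sketch below assumes the CHSH coefficients $M^{ab}_{xy} = \frac14\delta_{a\oplus b,xy}$ from the simple version; the sample size and seed are arbitrary choices of mine.

```python
import random
from itertools import product


def success(pa, pb):
    """CHSH success probability for one fixed lambda.

    pa[x] is Alice's probability of answering 0 on setting x,
    pb[y] is Bob's probability of answering 0 on setting y.
    """
    total = 0.0
    for a, b, x, y in product((0, 1), repeat=4):
        if (a ^ b) == (x & y):
            prob_a = pa[x] if a == 0 else 1 - pa[x]
            prob_b = pb[y] if b == 0 else 1 - pb[y]
            total += 0.25 * prob_a * prob_b
    return total


rng = random.Random(0)  # fixed seed, only for reproducibility

# Best value found among randomly sampled indeterministic factorised strategies
best_random = max(
    success([rng.random(), rng.random()], [rng.random(), rng.random()])
    for _ in range(10_000)
)

# Best value among the 16 deterministic strategies (probabilities 0 or 1)
best_deterministic = max(
    success(pa, pb)
    for pa in product((0.0, 1.0), repeat=2)
    for pb in product((0.0, 1.0), repeat=2)
)

print(best_random <= best_deterministic, best_deterministic)
```

No sampled indeterministic strategy beats the best deterministic one, as the linearity argument says it must be.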

So they can prove Bell’s theorem only from the assumptions of no conspiracy and factorisability, without assuming determinism. The problem is how to motivate factorisability. It is not a simple and intuitive condition like determinism or no action at a distance, that my mum understands, but some weird technical stuff. Why would she care about probabilities factorising?

The justification that Clauser and Horne give is just that factorisability

…is a natural expression of a field-theoretical point of view, which in turn is an extrapolation from the common-sense view that there is no action at a distance.

What are they talking about? Certainly not about quantum fields, which do not factorise. Maybe about classical fields? But only those without correlations, because otherwise they don’t factorise either! Or are they thinking about deterministic fields? But then there would be no improvement with respect to the simple version of the theorem! And anyway why do they claim that it is an extrapolation of no action at a distance? They don’t have a derivation to be able to claim such a thing! It is hard for me to understand how anyone could have taken this assumption seriously. If I were allowed to just take some arbitrary technical condition as an assumption I could prove anything I wanted.

Luckily this unsatisfactory situation only lasted one year, as in 1975 Bell managed to find a proper motivation for factorisability, deriving it from his notion of local causality. Informally, it says that causes are close to their effects (my mum is fine with that). A bit more formally, it says that probabilities of events in a spacetime region $A$ depend only on stuff in its past light cone $\Lambda$, and not on stuff in a space-like separated region $B$ (my mum is not so fine with that). So we have

• Local causality:   $p(A|\Lambda,B) = p(A|\Lambda)$.

How do we derive factorisability from that? Start by applying the product rule (the definition of conditional probability)
$p(ab|xy\lambda) = p(a|bxy\lambda)p(b|xy\lambda)$
and consider Alice’s probability $p(a|bxy\lambda)$: obtaining an outcome $a$ certainly counts as an event in $A$, and Alice’s setting $x$ and the physical state $\lambda$ certainly count as stuff in $\Lambda$. On the other hand, $b$ and $y$ are clearly stuff in $B$. So we have
$p(a|bxy\lambda) = p(a|x\lambda)$
Doing the analogous reasoning for Bob’s probability $p(b|xy\lambda)$ (and swapping $A$ with $B$ in the definition of local causality) we have
$p(b|xy\lambda) = p(b|y\lambda)$
and substituting this back we get
$p(ab|xy\lambda) = p(a|x\lambda)p(b|y\lambda)$
which is just factorisability.

So there we have it, a perfectly fine derivation of Bell’s theorem, using only two simple and well-motivated assumptions: no conspiracy and local causality. There is no need for the technical assumption of factorisability. Because of this it annoys me to no end when people implicitly conflate factorisability and local causality, or even explicitly state that they are equivalent.

Is there any other way of motivating factorisability, or are we stuck with local causality? A popular way to do it nowadays is through Reichenbach’s principle, which states that if two events A and B are correlated, then either A influences B, B influences A, or there is a common cause C such that
$p(AB|C) = p(A|C)p(B|C)$
It is easy to see that this directly implies factorisability for the Bell scenario.

It is often said that Reichenbach’s principle embodies the idea that correlations cry out for explanations. This is bollocks. It demands the explanation to have a very specific form, namely the factorised one. Why? Why doesn’t an entangled state, for example, count as a valid explanation? If you ask an experimentalist that just did a Bell test, I don’t think she (more precisely Marissa Giustina) will tell you that the correlations came out of nowhere. I bet she will tell you that the correlations are there because she spent years in a cold, damp, dusty basement without phone reception working on the source and the detectors to produce them. Furthermore, the idea that “if the probabilities factorise, you have found the explanation for the correlation” does not actually work.

I think the correct way to deal with Bell correlations is not to throw your hands in the air and claim that they cannot be explained, but to develop a quantum Reichenbach principle to tell which correlations have a quantum explanation and which not. This is currently a hot research topic.

But leaving those grandiose claims aside, is there a good motivation for Reichenbach’s principle? I don’t think so. Reichenbach himself motivated his principle from considerations about entropy and the arrow of time, which simply do not apply to a simple quantum state of two qubits. There may be a motivation other than his original one, but I don’t know of any.

To conclude, as far as I know local causality is really the only way to motivate factorisability. If you don’t like the simple version of Bell’s theorem, you are pretty much stuck with the nonlocal version. But does it also have its problems? Well, the sociological one is its name, which leads to the undying idea in popular culture that quantum mechanics allows for faster-than-light signalling or even travelling. But the real one is that it doesn’t allow you to do quantum key distribution based on Bell’s theorem (note that the usual quantum key distribution is based on quantum mechanics itself, and only uses Bell’s theorem as a source of inspiration).

If you use the simple version of Bell’s theorem and believe in no action at a distance, a violation of a Bell inequality implies not only that your outcomes are correlated with Bob’s, but also that they are in principle unpredictable, so you managed to share a secret key with him, which you can use for example for a one-time pad (which raises the question of why Bohmians don’t march in the street against funding for research in QKD). But if you use the nonlocal version of Bell’s theorem and violate a Bell inequality, you only find out that your outcomes are not locally causal: they can still be deterministic and nonlocal.[1]

Update: Rewrote the paragraph about QKD.

# Understanding Bell’s theorem part 1: the simple version

To continue with the series of “public service” posts, I will write the presentation of Bell’s theorem that I would like to have read when I was learning it. My reaction at the time was, I believe, similar to most students’: what the fuck am I reading? And my attempts to search the literature to understand what was going on only made my bewilderment worse, as the papers disagree about what the assumptions in Bell’s theorem are, what the names of the assumptions are, what conclusion we should take from Bell’s theorem, and even what Bell’s theorem is! Given this widespread confusion, it is no wonder that so many crackpots obsess about it!

This is the first of a series of three posts about several versions of Bell’s theorem. I’m starting with what I believe is by consensus the simplest version: the one proved by Clauser, Horne, Shimony, and Holt in 1969, based on Bell’s original version from 1964.

The theorem is about explaining the statistics observed by two experimenters, Alice and Bob, that are making measurements on some physical system in a space-like separated way. The details of their experiment are not important for the theorem (of course, they are important for actually doing the experiment). What is important is that each experimenter has two possible settings, named 0 and 1, and for each setting the measurement has two possible outcomes, again named 0 and 1.

Of course it is not actually possible to have only two settings in a real experiment: usually the measurement depends on a continuous parameter, like the angle with which you set a wave plate, or the phase of the laser with which you hit an ion, and you are only able to set this continuous parameter with finite precision. But this is not a problem, as we only need to define in advance that “this angle corresponds to setting 0” and “this angle corresponds to setting 1”. If the angles are not a good approximation to the ideal settings you are just going to get bad statistics.

Analogously, it is also not actually possible to have only two outcomes for each measurement, most commonly because you lost a photon and no detector clicked, but also because you can have multiple detections, or you might be doing a measurement on a continuous variable, like position. Again, the important thing is that you define in advance which outcomes correspond to the 0 outcome, and which outcomes correspond to the 1 outcome. Indeed, this is exactly what was done in the recent loophole-free Bell tests: they defined the no-detection outcome to correspond to the outcome 1.

Having their settings and outcomes defined like this, our experimenters measure some conditional probabilities $p(ab|xy)$, where $a,b$ are Alice and Bob’s outcomes, and $x,y$ are their settings. Now they want to explain these correlations. How did they come about? Well, they obtained them by measuring some physical system $\lambda$ (that can be a quantum state, or something more exotic like a Bohmian corpuscle) that they did not have complete control over, so it is reasonable to write the probabilities as arising from an averaging over different values of $\lambda$. So they decompose the probabilities as
$p(ab|xy) = \sum_\lambda p(\lambda|xy)p(ab|xy\lambda)$
Note that this is not an assumption, just a mathematical identity. If you are an experimental superhero and can really make your source emit the same quantum state in every single round of the experiment you just get a trivial decomposition with a single $\lambda$ (incidentally, by Caratheodory’s theorem one needs only 13 different $\lambda$s to write this decomposition, so the use of integrals over $\lambda$ in some proofs of Bell’s theorem is rather overkill).
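A sketch of the counting behind the number 13, assuming Carathéodory’s theorem is applied to the space of normalised probabilities $p(ab|xy)$: there are $2^4 = 16$ numbers $p(ab|xy)$, subject to 4 normalisation constraints $\sum_{ab} p(ab|xy) = 1$ (one for each pair of settings), so they live in a space of dimension $d = 12$. Carathéodory’s theorem says that a point in the convex hull of a set in $\mathbb{R}^d$ can be written as a convex combination of at most $d+1$ points of that set, so
$16 - 4 = 12, \qquad d + 1 = 12 + 1 = 13$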

The first assumption that we use in the proof is that the physical system $\lambda$ is not correlated with the settings $x$ and $y$, that is $p(\lambda|xy) = p(\lambda)$. I think this assumption is necessary to even do science, because if it were not possible to probe a physical system independently of its state, we couldn’t hope to be able to learn what its actual state is. It would be like trying to find a correlation between smoking and cancer when your sample of patients is chosen by a tobacco company. This assumption is variously called “freedom of choice”, “no superdeterminism”, or “no conspiracy”. I think “freedom of choice” is a really bad name, as in actual experiments nobody chooses the settings: instead they are determined by a quantum random number generator or by the bit string of “Doctor Who”. As for “no superdeterminism”, I think the name is rather confusing, as the assumption has nothing to do with determinism: it is possible to respect it in a deterministic theory, and it is possible to violate it in an indeterministic theory. Instead I’ll go with “no conspiracy”:

• No conspiracy:   $p(\lambda|xy) = p(\lambda)$.

With this assumption the decomposition of the probabilities simplifies to
$p(ab|xy) = \sum_\lambda p(\lambda)p(ab|xy\lambda)$

The second assumption that we’ll use is that the outcomes $a$ and $b$ are deterministic functions of the settings $x$ and $y$ and the physical system $\lambda$. This assumption is motivated by the age-old idea that the indeterminism we see in quantum mechanics is only a result of our ignorance about the physical system we are measuring, and that as soon as we have a complete specification of it (given by $\lambda$) the probabilities would disappear from consideration and a deterministic theory would be recovered. This assumption is often called “realism”. I find this name incredibly stupid. Are the authors that use it really saying that they cannot conceive of an objective reality that is not deterministic? And that a concept as complex as realism reduces to merely determinism? And furthermore they are blissfully ignoring the existence of collapse models, which are realistic but fundamentally indeterministic. As far as I know the name realism was coined by Bernard d’Espagnat in a Scientific American article from 1979, and since then it caught on. Maybe people liked it because Einstein, Podolsky and Rosen defended that a deterministic quantity is for sure real (but they did not claim that indeterministic quantities are not real), I don’t know. But I refuse to use it, I’ll go with the very straightforward and neutral name “determinism”.

• Determinism:   $p(ab|xy\lambda) \in \{0,1\}$.

An immediate consequence of this assumption is that $p(ab|xy\lambda) = p(a|xy\lambda)p(b|xy\lambda)$ and therefore that the decomposition of $p(ab|xy)$ becomes
$p(ab|xy) = \sum_\lambda p(\lambda)p(a|xy\lambda)p(b|xy\lambda)$
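To see why determinism gives the factorisation for free, fix $x$, $y$ and $\lambda$: a deterministic $p(ab|xy\lambda)$ puts all its weight on a single pair of outcomes, so its marginals are also 0 or 1 and their product reproduces the joint distribution. The short Python check below goes through the four possibilities; the helper `factorises` and the contrasting `correlated` distribution are my own illustrative additions.

```python
from itertools import product

outcomes = list(product((0, 1), repeat=2))  # all pairs (a, b)


def factorises(joint):
    """Check whether joint[(a, b)] equals the product of its marginals."""
    pa = {a: sum(joint[(a, b)] for b in (0, 1)) for a in (0, 1)}
    pb = {b: sum(joint[(a, b)] for a in (0, 1)) for b in (0, 1)}
    return all(joint[(a, b)] == pa[a] * pb[b] for a, b in outcomes)


# The four deterministic joint distributions: all weight on one pair (a, b)
deterministic = [
    {ab: int(ab == winner) for ab in outcomes} for winner in outcomes
]
all_factorise = all(factorises(joint) for joint in deterministic)

# Contrast: perfectly correlated fair coins are not deterministic,
# and they do not factorise.
correlated = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

print(all_factorise, factorises(correlated))  # True False
```

The contrast case is the whole point of the nonlocal version: once determinism is dropped, factorisation is no longer automatic and needs its own justification.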

The last assumption we’ll need is that the probabilities that Alice sees do not depend on which setting Bob used for his measurement, i.e., that $p(a|xy\lambda) = p(a|x\lambda)$. The motivation for it is that since the measurements are made in a space-like separated way, a signal would have to travel from Bob’s lab to Alice’s faster than light in order to influence her result. Relativity does not like it, but does not outright forbid it either, if you are ok with having a preferred reference frame (I’m not). Even before the discovery of relativity Newton already found such action at a distance rather distasteful:

It is inconceivable that inanimate Matter should, without the Mediation of something else, which is not material, operate upon, and affect other matter without mutual Contact… That Gravity should be innate, inherent and essential to Matter, so that one body may act upon another at a distance thro’ a Vacuum, without the Mediation of any thing else, by and through which their Action and Force may be conveyed from one to another, is to me so great an Absurdity that I believe no Man who has in philosophical Matters a competent Faculty of thinking can ever fall into it.

Without using such eloquence, my own worry is that giving up on this would put into question how we can ever isolate a system in order to do measurements on it whose result does not depend on the state of the rest of the universe.

This assumption has been called in the literature “locality”, “no signalling”, and “no action at a distance”. My only beef with “locality” is that this word is overused, so nobody really knows what it means; “no signalling”, on the other hand, is just bad, as the best example we have of a theory that violates this assumption (Bohmian mechanics) does not actually let us signal with it. I’ll go again for the more neutral word and stick with “no action at a distance”.

• No action at a distance:   $p(a|xy\lambda) = p(a|x\lambda)$ and $p(b|xy\lambda) = p(b|y\lambda)$.

With this assumption we have the final decomposition of the conditional probabilities as
$p(ab|xy) = \sum_\lambda p(\lambda)p(a|x\lambda)p(b|y\lambda)$
This is what we need to prove a Bell inequality. Consider the sum of probabilities
\begin{multline*}
p_\text{succ} = \frac14\Big(p(00|00) + p(11|00) + p(00|01) + p(11|01) \\ + p(00|10) + p(11|10) + p(01|11) + p(10|11)\Big)
\end{multline*}
This can be interpreted as the probability of success in a game where Alice and Bob receive inputs $x$ and $y$ from a referee, and must return equal outputs if the inputs are 00, 01, or 10, and must return different outputs if the inputs are 11.

We want to prove an upper bound to $p_\text{succ}$ from the decomposition of the conditional probabilities derived above. First we rewrite it as
$p_\text{succ} = \sum_{abxy} M^{ab}_{xy} p(ab|xy) = \sum_{abxy} \sum_\lambda M^{ab}_{xy} p(\lambda)p(a|x\lambda)p(b|y\lambda)$
where $M^{ab}_{xy} = \frac14\delta_{a\oplus b,xy}$ are the coefficients defined by the above sum of probabilities. Note now that
$p_\text{succ} \le \max_\lambda \sum_{abxy} M^{ab}_{xy} p(a|x\lambda)p(b|y\lambda)$
as the convex combination over $\lambda$ can only reduce the value of $p_\text{succ}$. And since the functions $p(a|x\lambda)$ and $p(b|y\lambda)$ are assumed to be deterministic, there can only be a finite number of them (in fact 4 different functions for Alice and 4 for Bob), so we can do the maximization over $\lambda$ simply by trying all 16 possibilities. Doing that, we see that
$p_\text{succ} \le \frac34$
for theories that obey no conspiracy, determinism, and no action at a distance. This is the famous CHSH inequality.
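The “trying all 16 possibilities” step is small enough to do by machine. Below is a minimal Python sketch of it, using the coefficients $M^{ab}_{xy} = \frac14\delta_{a\oplus b,xy}$ defined above; representing a deterministic strategy as a tuple of outputs, and the names `functions` and `success`, are my own choices.

```python
from fractions import Fraction
from itertools import product

# M[a, b, x, y] = 1/4 if a XOR b == x AND y, and 0 otherwise
M = {
    (a, b, x, y): Fraction(1, 4) if (a ^ b) == (x & y) else Fraction(0)
    for a, b, x, y in product((0, 1), repeat=4)
}

# A deterministic strategy for one party is a function setting -> outcome,
# stored as the tuple (outcome for setting 0, outcome for setting 1).
functions = list(product((0, 1), repeat=2))  # the 4 functions per party


def success(f, g):
    """p_succ when Alice answers f[x] and Bob answers g[y]."""
    return sum(M[f[x], g[y], x, y] for x, y in product((0, 1), repeat=2))


values = {(f, g): success(f, g) for f in functions for g in functions}
best = max(values.values())

print(len(values), best)  # 16 3/4
```

The reason no strategy wins on all four input pairs is a parity clash: the four winning conditions would require $a_0 \oplus b_0 = a_0 \oplus b_1 = a_1 \oplus b_0 = 0$ and $a_1 \oplus b_1 = 1$, but XORing the four left-hand sides together always gives 0.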

On the other hand, according to quantum mechanics it is possible to obtain
$p_\text{succ} = \frac{2 + \sqrt2}{4}$
and a violation of the bound $3/4$ was observed experimentally, so at least one of the three assumptions behind the theorem must be false. Which one?
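For the quantum value quoted above, here is a minimal numerical sketch. The state and measurements are an assumption on my part (one standard choice, not the only one): the maximally entangled state $|\Phi^+\rangle = (|00\rangle + |11\rangle)/\sqrt2$, with Alice measuring $Z$ or $X$ and Bob measuring $(Z+X)/\sqrt2$ or $(Z-X)/\sqrt2$. It uses the identity $\langle\Phi^+|P\otimes Q|\Phi^+\rangle = \frac12\sum_{ij}P_{ij}Q_{ij}$ to avoid building $4\times4$ matrices.

```python
from itertools import product
from math import isclose, sqrt

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
Z = [[1.0, 0.0], [0.0, -1.0]]
X = [[0.0, 1.0], [1.0, 0.0]]


def combine(c1, m1, c2, m2):
    """Return the 2x2 matrix c1*m1 + c2*m2."""
    return [[c1 * m1[i][j] + c2 * m2[i][j] for j in range(2)] for i in range(2)]


def projector(observable, outcome):
    """Projector (I + (-1)^outcome * observable) / 2 for a +-1 observable."""
    sign = 1 - 2 * outcome
    return combine(0.5, IDENTITY, 0.5 * sign, observable)


# Assumed measurement choices (settings 0 and 1 for each party)
alice = [Z, X]
bob = [combine(1 / sqrt(2), Z, 1 / sqrt(2), X),
       combine(1 / sqrt(2), Z, -1 / sqrt(2), X)]


def p(a, b, x, y):
    """p(ab|xy) on |Phi+>, via <Phi+| P (x) Q |Phi+> = (1/2) sum_ij P_ij Q_ij."""
    P = projector(alice[x], a)
    Q = projector(bob[y], b)
    return 0.5 * sum(P[i][j] * Q[i][j] for i in range(2) for j in range(2))


p_succ = sum(
    0.25 * p(a, b, x, y)
    for a, b, x, y in product((0, 1), repeat=4)
    if (a ^ b) == (x & y)
)

print(p_succ, (2 + sqrt(2)) / 4)
```

With these choices each of the four input pairs is won with probability $(1 + 1/\sqrt2)/2$, which is exactly $(2+\sqrt2)/4 \approx 0.854$, comfortably above $3/4$.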