> When I(short proof)=I(long proof), per-token average surprisal must be lower for the long proof than for the short proof. But since surprisal for a single token is simply -log P, that would mean that, on average, the shorter proof is made out of less probable tokens.
This assertion is intuitive, but it doesn't hold in practice: a long proof that isn't minimal carries extra information of its own, so I(short proof) and I(long proof) aren't actually equal, and the long proof's per-token surprisal can end up larger.
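To make the bookkeeping concrete, here's a toy calculation with numbers I've made up out of thin air:

    # Entirely made-up numbers, just to make the per-token arithmetic concrete.
    I_minimal = 24.0              # bits of information in the minimal answer
    n_short, n_long = 6, 60       # token counts of the short and verbose outputs

    # If both outputs really carried the same information, the premise holds:
    print(I_minimal / n_short)    # 4.0 bits/token
    print(I_minimal / n_long)     # 0.4 bits/token -- necessarily lower

    # But a non-minimal long proof adds information of its own (commentary that
    # varies from run to run, say another 300 bits), and the conclusion breaks:
    print((I_minimal + 300.0) / n_long)   # 5.4 bits/token -- now higher than the short proof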
For example, consider the "proof" of "list the natural numbers from 1 to 3, newline-delimited." The 'short proof' is:
"1\n2\n3\n" (Newlines escaped because of HN formatting)
Now, consider the alternative instruction to give a "long proof", "list the natural numbers from 1 to 3, newline-delimited using # for comments. Think carefully, and be verbose." Trying this just now with Gemini 2.5-pro (Google AI Studio) gives me:
"# This is the first and smallest natural number in the requested sequence.\n
1
# Following the first, this is the second natural number, representing the concept of a pair.\n
2
# This is the third natural number, concluding the specified range from 1 to 3.\n
3"
We don't have access to Gemini's per-token logits, but repeating the prompt yields different comments each time, so we can conclude that there is 'information' in the irrelevant commentary.
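Since Gemini doesn't expose its logits, the closest you can get is scoring both transcripts with an open model. A minimal sketch using GPT-2 via Hugging Face transformers as a stand-in (so the numbers are only illustrative of the idea, not of what Gemini 'thinks'):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def mean_surprisal_bits(text):
        # Average surprisal (bits/token) of `text` under the model.
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss   # mean cross-entropy, in nats
        return loss.item() / math.log(2)

    short = "1\n2\n3\n"
    verbose = (
        "# This is the first and smallest natural number in the requested sequence.\n1\n"
        "# Following the first, this is the second natural number, representing the concept of a pair.\n2\n"
        "# This is the third natural number, concluding the specified range from 1 to 3.\n3\n"
    )
    print(mean_surprisal_bits(short), mean_surprisal_bits(verbose))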
The author's point regains its truth, however, if we consider the space of all possible long proofs. Any particular padded 'long' proof has higher perplexity than the short proof, but that's because there are so many more possible long proofs of approximately equal value. The shortest possible proof is a sharp minimum, while the longer proofs are shallower and 'easier'.
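To put rough (again, entirely made-up) numbers on that intuition:

    import math

    # Suppose the model puts 0.5 of its mass on the unique short proof and
    # spreads another 0.5 over 1000 interchangeable long proofs.
    p_short = 0.5
    p_long = 0.5 / 1000

    print(-math.log2(p_short))        # 1.0 bit   -- surprisal of the short proof
    print(-math.log2(p_long))         # ~11.0 bits -- any *particular* long proof is far
                                      #   less probable, purely because the mass is
                                      #   split across so many variants

    # Per token (6 vs 60 tokens, say), the two come out roughly level:
    print(-math.log2(p_short) / 6)    # ~0.17 bits/token
    print(-math.log2(p_long) / 60)    # ~0.18 bits/token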
The author also misses a trick with:
> Prompted with “Respond only with code: How do you increment i by 1 in Python?”, I compared the two valid outputs: i += 1 has a perplexity of approximately 38.68, while i = i + 1 has a perplexity of approximately 10.88.
… in that they ignore the equally valid 'i = 1 + i'.
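For anyone who wants to reproduce that comparison and add the third variant, this is roughly how such completion perplexities get computed. GPT-2 here is just a stand-in, so don't expect the absolute numbers to match the author's:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "Respond only with code: How do you increment i by 1 in Python?\n"

    def completion_perplexity(prompt, completion):
        # Perplexity of the completion tokens, conditioned on the prompt.
        p_ids = tok(prompt, return_tensors="pt").input_ids
        c_ids = tok(completion, return_tensors="pt").input_ids
        ids = torch.cat([p_ids, c_ids], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
        targets = ids[0, 1:]
        nll = -logprobs[torch.arange(targets.shape[0]), targets]
        nll_completion = nll[p_ids.shape[1] - 1:]               # keep only the completion tokens
        return nll_completion.mean().exp().item()

    for c in ["i += 1", "i = i + 1", "i = 1 + i"]:
        print(c, completion_perplexity(prompt, c))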