The paper uses a number representation designed to make the task easy for attention to learn: each digit is a separate token and the least significant digit comes first, so the first digit of the output depends only on the first digits of the inputs, the second digit depends only on the second digits plus a possible carry from the first position, and so on.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
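Roughly, the reason this ordering helps is that addition becomes a left-to-right, digit-by-digit process with a single carry. A quick Python sketch of that carry logic (my illustration of the representation, not the paper's actual tokenization or code):

```python
def add_reversed_digits(a_digits, b_digits):
    """Add two numbers given as digit lists, least significant digit first.

    Each output digit depends only on the two input digits at the same
    position plus a single carry from the previous position, so the result
    can be emitted left to right without looking ahead.
    """
    out, carry = [], 0
    for i in range(max(len(a_digits), len(b_digits))):
        a = a_digits[i] if i < len(a_digits) else 0
        b = b_digits[i] if i < len(b_digits) else 0
        s = a + b + carry
        out.append(s % 10)   # current output digit
        carry = s // 10      # carry into the next position
    if carry:
        out.append(carry)
    return out

# 47 + 85 = 132, written least significant digit first:
assert add_reversed_digits([7, 4], [5, 8]) == [2, 3, 1]
```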
The paper then looks at multiplication, still with the least significant digit first, as a toy task that requires several such additions as intermediate steps, and asks why a model that is in principle large enough to perform those additions fails to learn to do so in practice.
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" in a specific format) and then has that CoT progressively shortened during training until nothing is left of it. That second model does learn to multiply.
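To make the intermediate steps concrete: in this format, multiplication breaks into shifted partial products that are accumulated with repeated additions, which is roughly what the CoT spells out before it is trained away. Another sketch, reusing add_reversed_digits from above (again my illustration; the paper's actual CoT format differs):

```python
def multiply_reversed_digits(a_digits, b_digits):
    """Multiply two numbers in the same least-significant-digit-first format
    by forming one shifted partial product per digit of the second factor
    and accumulating them with repeated additions.
    """
    total = [0]
    for shift, b in enumerate(b_digits):
        partial, carry = [0] * shift, 0    # partial product, shifted by `shift` digits
        for a in a_digits:
            p = a * b + carry
            partial.append(p % 10)
            carry = p // 10
        if carry:
            partial.append(carry)
        total = add_reversed_digits(total, partial)   # the intermediate additions
    return total

# 47 * 85 = 3995, least significant digit first:
assert multiply_reversed_digits([7, 4], [5, 8]) == [5, 9, 9, 3]
```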
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
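To spell out what I mean, such a CoT would just walk through the identity a*b = exp(log a + log b), something like this (plain floating-point arithmetic, nothing from the paper):

```python
import math

def multiply_via_logs(a, b):
    """Hypothetical chain of thought for a * b via logarithms:
    write down log(a) and log(b), add them, then exponentiate.
    The intermediate values are only approximate, so the final
    result has to be rounded back to an integer.
    """
    log_a = math.log(a)                   # step 1: logarithm of each factor
    log_b = math.log(b)
    log_product = log_a + log_b           # step 2: add the logs
    return round(math.exp(log_product))   # step 3: exponentiate back

assert multiply_via_logs(47, 85) == 47 * 85
```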
> but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
I suppose you're probably right, but LLMs do have a lot of log tables in their training data, so I'm not so sure.
The paper is about the ability of transformers to learn a task based on training data for that task only, not about LLMs pretrained on much of the internet. And training on log tables doesn't necessarily allow the model to always output the correct logarithm, just as training on multiplication tables doesn't necessarily confer the ability to multiply.