They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).
Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.
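For concreteness, here's a minimal sketch of what such a check can look like in Python/numpy (the toy op, shapes, and tolerance are placeholders, not from any particular project): perturb each element a little, evaluate the loss twice, and compare the finite-difference estimate against the analytical gradient in float64.

    import numpy as np

    def numerical_grad(f, x, eps=1e-5):
        # Central finite differences, one element at a time, in float64.
        g = np.zeros_like(x)
        for i in range(x.size):
            old = x.flat[i]
            x.flat[i] = old + eps; f_plus = f(x)
            x.flat[i] = old - eps; f_minus = f(x)
            x.flat[i] = old
            g.flat[i] = (f_plus - f_minus) / (2 * eps)
        return g

    # Toy op: y = sum(x**2), whose analytical gradient is 2*x.
    x = np.random.randn(8)
    analytic = 2 * x                  # what your backward kernel should produce
    numeric = numerical_grad(lambda v: np.sum(v ** 2), x.copy())
    rel = np.abs(analytic - numeric) / np.maximum(np.abs(analytic) + np.abs(numeric), 1e-12)
    assert rel.max() < 1e-6

The same harness works for a real kernel if you wrap it in a function that copies data to the device, runs the op, and reduces the output to a scalar.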
Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but mutually consistent, so a gradient check alone won't catch it. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing -- and then compare results against that simple implementation.
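Something along these lines (an illustrative sketch only; here np.matmul stands in for the optimized kernel you'd actually be testing):

    import numpy as np

    def matmul_reference(a, b):
        # Deliberately naive: triple loop, scalar accumulator, no tricks.
        m, k = a.shape
        _, n = b.shape
        out = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += float(a[i, p]) * float(b[p, j])
                out[i, j] = acc
        return out

    a = np.random.randn(16, 32).astype(np.float32)
    b = np.random.randn(32, 8).astype(np.float32)
    fast = a @ b                          # stand-in for the optimized kernel
    ref = matmul_reference(a, b)
    print(np.max(np.abs(fast - ref) / (np.abs(ref) + 1e-8)))

The simple version is slow, but it only has to run on tiny shapes to be useful as a reference.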
It sounds like a lot of work, but writing an optimized kernel takes far longer than writing the numerical gradient check and the simple kernel, and since in numerical code it's basically impossible to pin down the source of a bug without doing the equivalent of all of those checks anyway, it only takes one bug in the whole project for the effort to pay off.
Thanks a lot for the pointers. I think I've taken a similar approach to what you suggest: lots of tiny (relative) tests for each step in the process, plus some sanity checking between the naive version I wrote first, which works and does inference correctly, and the new kernel, which is a lot more performant but currently incorrect and produces incoherent outputs.
I'll try replacing bits with simplified versions though; that could probably at least help narrow down where the issue is.
If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious", as I'm more or less about to lose my mind.
You may be running into Jensen (Huang)'s inequality:
E(loss).cuda() <= E(loss.cuda())
That would make sense, I suppose, if I were using two different GPUs for the same thing and getting two different outcomes. But instead I have two implementations (one naive, one using tensor cores) running on the same GPU and getting different outcomes, where they should be the same.
But then, this joke might be going over my head as well.
Tensor cores use lower precision, so small numerical differences should be expected.
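To put a rough number on "small", here's a quick numpy-only sketch; it only simulates rounding the inputs to fp16 (similar in spirit to what FP16/TF32 tensor-core paths do before accumulating at higher precision), so treat it as a ballpark rather than a model of the hardware:

    import numpy as np

    a = np.random.randn(64, 64).astype(np.float32)
    b = np.random.randn(64, 64).astype(np.float32)

    # Full-precision reference vs. the same product with inputs rounded to fp16.
    ref = a.astype(np.float64) @ b.astype(np.float64)
    lowp = a.astype(np.float16).astype(np.float64) @ b.astype(np.float16).astype(np.float64)

    err = np.abs(lowp - ref).max() / np.abs(ref).max()
    print(err)   # typically on the order of 1e-4 to 1e-3 -- small, not orders of magnitude

If the mismatch you're seeing is far bigger than that, it's almost certainly a logic bug rather than precision.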
Consumer-visible hardware bugs are extremely uncommon nowadays. There are roughly 10x as many people working in design verification as in actual hardware design.
I say "consumer-visible" because the bugs still exist and people who can catch them early get promoted quickly and paid a lot. It's very exciting work if you can get it, since you really have to understand the full GPU to break it.
Good luck!!
How big is the numerical difference? If it's small it might be within the precision of the operation itself.
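As a crude way to size "within the precision" (a heuristic of mine, not a hard bound): the relative error of a length-K reduction tends to grow roughly like the input type's machine epsilon times sqrt(K), so you can compare your observed relative error against something like:

    import numpy as np

    K = 4096                                       # reduction length, e.g. the K dimension of a matmul
    print(np.finfo(np.float16).eps * np.sqrt(K))   # fp16 inputs: ~6e-2 for K=4096
    print(np.finfo(np.float32).eps * np.sqrt(K))   # fp32 throughout: ~8e-6 for K=4096

Anything well beyond that is a bug, not rounding.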
Orders of magnitude away (maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing the scaling wrong somewhere, but I can't help occasionally sliding into "maybe there's something deeper wrong" territory in the evening after another long day...