david-gpu
5 months ago
I began writing GPU drivers in 2006 and had to support some legacy fixed-function chips from the late 90s at one point.
I think you and other commenters pretty much summarized what it was like. Documentation was often poor, so you would sometimes have to reach out to the folks who had actually written the hardware (or the simulator) for guidance.
That is the easy part of writing a driver, even today. Just follow the specification. The code in a GPU driver is relatively simple, and doesn't vary that much from one generation to the next. In the 90s some features didn't have hardware support, so the driver would do a bunch of math in the CPU instead, which was slow.
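To make the CPU-fallback idea concrete, here is a sketch of what that pattern looked like; every name in it (the capability bit, the function names) is invented for illustration, not taken from any real driver:

```c
#include <stddef.h>

/* Hypothetical capability bit: set when the chip has hardware
 * transform support. Chips without it get the slow CPU path. */
#define CAP_HW_TRANSFORM 0x1u

struct vec3 { float x, y, z; };

/* 4x4 column-major matrix times a point (w = 1), done on the CPU. */
static struct vec3 transform_point(const float m[16], struct vec3 v)
{
    struct vec3 r;
    r.x = m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12];
    r.y = m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13];
    r.z = m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14];
    return r;
}

void submit_vertices(unsigned caps, const float mvp[16],
                     struct vec3 *verts, size_t n)
{
    if (!(caps & CAP_HW_TRANSFORM)) {
        /* No hardware transform unit: do the math on the CPU for
         * every vertex before handing the buffer to the chip.
         * Correct, but slow. */
        for (size_t i = 0; i < n; i++)
            verts[i] = transform_point(mvp, verts[i]);
    }
    /* ...hand verts off to the hardware here... */
}
```

The point is just that the driver quietly substitutes a software path when a feature bit is missing; the application never notices, except in performance.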
In contrast, the fun part is the times when the hardware deviates from the specification, or where the specification left things out and different people filled in the blanks with their own ideas. This is less common nowadays, as the design process has become more refined.
But yeah, debugging hardware bugs essentially boils down to:
(1) writing the simplest test that triggers the unexpected behavior that you had observed in a more complex application, then
(2) providing traces of it to the folks who wrote that part of the hardware or simulator,
(3) waiting a few days for them to painstakingly figure out what is going wrong, clock by clock, and
(4) implementing the workaround that they suggest, often something like "when X condition happens on chips {1.23, 1.24 and 1.25}, then program Y register to Z value, or insert a command to wait for the module to complete before sending new commands".
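A workaround of the kind described in step (4) tends to look something like the sketch below when it lands in the driver. Everything here is hypothetical (the revision numbers echo the example above; the command opcodes, register address, and value are made up):

```c
#include <stdint.h>
#include <stdbool.h>

/* Invented command-stream opcodes for illustration. */
enum { CMD_DRAW = 1, CMD_WAIT_IDLE = 2, CMD_SET_REG = 3 };

struct cmdbuf {
    uint32_t words[64];
    unsigned len;
};

static void emit(struct cmdbuf *cb, uint32_t w)
{
    cb->words[cb->len++] = w;
}

/* Affected steppings, as in "chips {1.23, 1.24 and 1.25}". */
static bool rev_needs_workaround(unsigned major, unsigned minor)
{
    return major == 1 && minor >= 23 && minor <= 25;
}

void emit_draw(struct cmdbuf *cb, unsigned major, unsigned minor,
               bool condition_x)
{
    if (condition_x && rev_needs_workaround(major, minor)) {
        /* Workaround: program register Y to value Z, then make the
         * front end drain the affected module before the draw. */
        emit(cb, CMD_SET_REG);
        emit(cb, 0x1234u); /* hypothetical register Y */
        emit(cb, 0x1u);    /* hypothetical value Z */
        emit(cb, CMD_WAIT_IDLE);
    }
    emit(cb, CMD_DRAW);
}
```

Drivers accumulate dozens of these revision-gated branches over a chip family's lifetime, which is part of why the user-mode side gets so much larger than the kernel side.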
It was more tedious than anything. Coming up with the simplest way to trigger the behavior could take weeks.
Well, that's what it was like to write user mode drivers. The kernel side was rather different and I wasn't directly exposed to it. Kernel drivers are conceptually simpler and significantly smaller in terms of lines of code, but much harder to debug.