This is one of the most unfortunate cases of UB in C. It's necessary because C largely doesn't forbid pointer aliasing (accessing an object using multiple, distinct pointers). But the compiler needs to have some idea of when pointers might alias; otherwise, it's forced to generate extremely inefficient code that redundantly reloads values through pointers over and over again just in case the value changed unexpectedly.
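To make the reload concrete, here is a minimal sketch. The function name `store_both` is made up for illustration. Both parameters have the same type, so they may legally point at the same object, and the compiler has to read `*a` back from memory after the store through `b`.

```c
/* Both parameters are int *, so they are allowed to alias. */
int store_both(int *a, int *b)
{
    *a = 1;
    *b = 2;     /* overwrites *a when a == b */
    return *a;  /* must be reloaded: 1 if distinct, 2 if aliased */
}
```

Calling `store_both(&x, &y)` returns 1, while `store_both(&x, &x)` returns 2. Both calls are well defined, which is exactly why the compiler cannot fold the return value to a constant.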
Within the body of a single function, the compiler can see where addresses came from and figure out which pointers may or may not alias. But when pointers are passed in as function parameters, the compiler has no idea. Thus, the C standard allows compilers to assume that pointers to different types never refer to the same object (character types being the main exception), an assumption that's true for almost all code anyone would want to write. But this adds surprising undefined behavior to some things you might try, like casting between layout-compatible structs, or the fast inverse square root trick from Quake. (Casting a pointer to a different type would still be tricky to get right even if it weren't UB, though. Alignment requirements are another footgun: for example, it's not legal to cast a uint8_t pointer to a uint16_t pointer unless you're sure the address is suitably aligned for uint16_t, which typically means even.)
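As a rough illustration of the type-based assumption (the function name `store_int_float` is hypothetical), here the parameters have different types, so an optimizing compiler is allowed to assume the store through `f` cannot change `*i` and may return the constant 1 without reloading.

```c
/* int and float are incompatible types, so under strict aliasing the
 * compiler may assume i and f never refer to the same object. */
int store_int_float(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;  /* assumed not to touch *i */
    return *i;  /* may be folded to 1 */
}
```

Calling it with two separate objects is fine. Calling it with both pointers aimed at the same memory (via a cast) is the undefined behavior being discussed, so the result in that case is whatever the compiler happens to produce.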
The blessed-by-the-standard way to do type punning is to either use a union or a memcpy, not a pointer cast.
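A minimal sketch of both approaches, plus a memcpy-based load that sidesteps the alignment problem. All function names are made up for illustration, and the code assumes `float` is a 32-bit IEEE 754 type, which the standard does not guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float and uint32_t have the same size. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Copy the object representation; compilers typically turn this into a
 * single register move. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Write one union member, read another (allowed in C, not in C++). */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* Read a uint16_t from a byte buffer at any address, aligned or not,
 * in the host's byte order. */
uint16_t load_u16(const uint8_t *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On an IEEE 754 platform, `float_bits_memcpy(1.0f)` yields `0x3F800000`.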
For comparison, Rust prohibits mutable pointer aliasing for safe reference types, so the compiler knows when references may or may not alias. (Raw pointers are assumed to always be aliasable unless the compiler can prove that they're unique, e.g. by being derived from a safe reference). This leads to more efficient codegen in many cases, but it also means that type punning through pointer casting is fully legal in unsafe code (provided validity and alignment requirements are met), since the compiler does not need the type information to figure out aliasing.
> But when pointers are passed in as function parameters, the compiler has no idea.
"restrict" was added to give the compiler an idea.
As far as standard ISO C goes, nothing. That's just how UB can work in theory, though in practice compiler (ab)use of UB is somewhat more indirect (optimizing under the assumption that UB is not present in the program, rather than observing UB and making decisions based on that).
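One commonly cited illustration of "optimizing under the assumption that UB is not present" is a signed overflow check, sketched below with made-up function names. Because signed overflow is UB, a compiler is allowed to treat `x + 1 < x` as always false and drop the check; whether a given compiler actually does so depends on the version and flags.

```c
#include <limits.h>

/* Naive check: relies on wraparound, which is UB for signed int.
 * A compiler may legally reduce this to "return 0". */
int next_overflows_naive(int x)
{
    return x + 1 < x;
}

/* Well-defined check: never performs the overflowing addition. */
int next_overflows_checked(int x)
{
    return x == INT_MAX;
}
```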
Beyond that, I believe Linux and other kernels technically use a slight variant of C by taking advantage of compiler extensions/flags to better fit their use cases. For example, Linux compiles (compiled?) with -fno-strict-aliasing and -fwrapv and uses a GCC extension that allows type punning via unions [0], so that they can compile what the standard calls "incorrect" C code without worry.
I'm not sure whether recent versions of C have changed their stance on this particular UB, but since it'll probably be a while until they're adopted (if ever) kernels will be making do with their workarounds for a while longer.
[0]: https://lkml.org/lkml/2018/6/5/769
Type punning through unions is not undefined behavior in C, only in C++.
Yep, you're right. Think I got confused with something else :(
It's not UB if the "common initial sequence" exemption applies as it does here.
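For readers who haven't run into that rule, a hypothetical sketch (not the code under discussion; all names are invented): when a union contains several structs that begin with the same sequence of members, the shared leading part may be inspected through any of them while the union type is visible.

```c
/* Every struct in the union starts with "int kind", so that member can
 * be read through any of them regardless of which one was last stored. */
union shape {
    struct { int kind; } header;
    struct { int kind; double radius; } circle;
    struct { int kind; double w, h; } rect;
};

int shape_kind(const union shape *s)
{
    return s->header.kind;
}
```

Storing `s.circle.kind = 1` and then calling `shape_kind(&s)` returns 1.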
Nothing in the language, though the permissions mechanism of the system may disallow it. Unless the compiler is running with root permissions.
C is not a safe language. The more abrasive among fans of the language would say "skill issue" and "git gud" if you want to avoid footguns.
What kind of mechanism is there in place to prevent any compiler from just replacing your main() function with formatting /dev/sda?
That's nothing to do with the language, that's always on the compiler. Nothing stops the compiler from taking even correctly-specified code and doing whatever it wants.
That's a rather obtuse response. You can typically operate under the assumption that the compiler works correctly. The question of the GP was - what prevents the compiler from (correctly) exploiting UB to change the program behaviour from the desired one? If the Linux devs want to rely on their compilers' output, they have to somehow be obeying the contract around UB.
I think the more reasonable assumption is that the practical needs of the biggest users will probably trump what any specification demands.
The thing stopping the compiler from doing dangerous behaviours in response to commonly abused UB is obviously that people wouldn't use the compiler if it did that. Just like how the thing stopping the compiler from doing dangerous behaviours in response to spec-legal code is that people wouldn't use it if it did that.
He was referring to the U in "UB".