Exploiting Undefined Behavior in C/C++ Programs: The Performance Impact [pdf]

87 points, posted 4 days ago
by luu

70 Comments

pcwalton

19 hours ago

I notice that the paper doesn't claim to eliminate all reasoning about undefined behavior for optimizations. For example:

    int f() {
        int arr[3], i = 0;
        arr[3] = 5;   // out-of-bounds write: valid indices are 0..2
        return i;
    }
Optimizing this to "return 0" is relying on UB, because it's assuming that i wasn't laid out directly after arr in the stack frame. I believe this is what the paper calls "non-guardable UB".

I don't agree with the claim in the paper that their semantics offers a "flat memory model". A flat memory model would rule out the optimization above. Rather, the memory model still has the notion of object bounds; it's just simplified in some ways.

dooglius

18 hours ago

>it's assuming that i wasn't laid out directly after arr in the stack frame

The compiler isn't "assuming" that so much as choosing not to put i in the stack frame at all. And I don't think it's correct to view the lack of a register spill as an "optimization" per se. It does remain true that code writing past the end of an array will be UB in typical scenarios (i.e. when not using asan/valgrind).

(Now, if the compiler also removed the store, that could legitimately be called an optimization based on assuming no-UB)

pcwalton

17 hours ago

"Exploiting undefined behavior" occurs when a simple semantics (however one defines "simple") results in behavior A, but the compiler chooses behavior B instead based on the actual, more complex, language semantics. The code snippet in question passes that test. If I flip the declaration order of i and arr, then I get this [1] at -O0 (the "simple" semantics):

        push    rbp
        mov     rbp, rsp
        mov     dword ptr [rbp - 4], 0    ; i = 0
        mov     dword ptr [rbp - 4], 5    ; arr[3] = 5 lands on i's slot
        mov     eax, dword ptr [rbp - 4]  ; return i, which is now 5
        pop     rbp
        ret
Which indeed returns 5. But at -O2 clang optimizes it to this:

        xor     eax, eax                  ; return value = 0
        ret
Which returns 0. So the simple semantics produces one result, and the complex semantics produces another. Hence, it's exploiting undefined behavior.

[1]: https://godbolt.org/z/df4dhzT5a

dooglius

16 hours ago

Maybe this is just arguing semantics, but I don't agree with the definition you've given, and I don't think that your definition is what TFA means. "Exploiting undefined behavior" I think refers narrowly to the implementation assuming that undefined behavior does not occur, and acting on that assumption. Undefined behavior naturally resulting in unpredictable behavior is not exploitation in the same sense. For example,

  printf("A");
  bool x;
  if ( x ) {printf("B");} else {printf("C");}
  printf("\n");
If at -O0 "AB" is printed and at -O2 "AC" is printed (due to the vagaries of whatever was left on the stack), then that would meet your definition, but I would not regard that as "exploiting undefined behavior", merely as the manifestation of the inherent unpredictability of UB. If the compiler didn't print anything (i.e. the whole block was removed due to UB detection) then that _would_ be a case of exploiting undefined behavior.
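For contrast, the kind of thing I do mean by exploitation -- my own sketch, not from TFA -- is a compiler assuming the UB (here a null dereference) cannot happen and acting on that assumption by deleting a later check:

    int deref(int *p) {
        int v = *p;      // if p were null, this dereference would be UB,
        if (!p)          // so the compiler may assume p is non-null here
            return -1;   // ... and delete this branch entirely
        return v;
    }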

pcwalton

16 hours ago

That example is an instance of unspecified vs. undefined behavior, but the correctness of the pointer provenance-based optimization example I gave doesn't depend on whether writing to an out-of-bounds pointer is unspecified or undefined.

artikae

16 hours ago

What about this: https://godbolt.org/z/xP9xG3Ee3

Here the compiler "register allocates" i for some reads but not for others.

i gets stack allocated, but some uses of it act as though they were register allocated.

dooglius

14 hours ago

I'm not quite sure what you're asking for exactly, given the link is for clang trunk and doesn't have the modifications discussed in TFA, and I don't dispute that clang does UB-based reasoning at -O3. But I will argue that the assembly shown can be accomplished without resorting to what I call "reasoning about UB", and within a flat memory model, supporting the claim that these sacrifices are often not necessary.

I'm going to draw a distinction between stack memory being "private", in the sense that only the compiler is allowed to alter it, and "public", where the address can be written to by something else and the compiler needs to handle that. Local variables are at first tracked privately. After the address of a variable is taken with &x, or at the point in time when an array variable is indexed, the associated memory is public. Conceptually, the use of private memory can be indirect; the compiler could encode/decode a stack variable x as (x XOR 0xDEADBEEF) on the stack and it would be fine (but the compiler does the simple thing in practice, naturally). Note that this notion of "private"/"public" stack memory is a property of addresses, not of the provenance of the accessing pointers, and so is fully compatible with a flat memory model.

The compiler's ability to use private memory isn't a case of "reasoning around UB" in a meaningful sense -- otherwise you could just as well argue that returning from a function call is "reasoning about UB", because the return address can't be clobbered.

In your provided snippet, the correctness argument for the assembly in a non-UB-reasoning universe goes like this: at first, i is stored privately on the stack with value zero, and so as an optimization we can assume that value is still zero without rereading. Only later, when &i is taken, is that memory made public and the compiler has to worry about something altering it. In actual execution, the problem is that the write function alters compiler-private memory (and note again, that being private is a property of the underlying address, not the fact that it's accessed via an out-of-bounds array indexing), and this is UB and so the program breaks. But, the compiler didn't need to make _assumptions_ around UB.
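To make that concrete, a rough sketch of the shape of code I have in mind (my own illustration, not the exact snippet from the godbolt link; `observe` is a hypothetical external function):

    void observe(int *p);   /* hypothetical: may read or write *p */

    int example(int n) {
        int arr[3];
        int i = 0;
        arr[n] = 5;         /* arr is public once indexed; with n == 3 this
                               would land just past arr in a flat model    */
        int a = i;          /* i is still private here: the compiler may
                               treat this read as 0 without reloading      */
        observe(&i);        /* &i taken: i's slot becomes public           */
        int b = i;          /* now i must actually be reloaded from memory */
        return a + b;
    }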

jcranmer

17 hours ago

I've only briefly skimmed the paper, but on that glance, it looks like what they did was (effectively) drop all the attributes in LLVM that can indicate UB, e.g., the inbounds flags on getelementptr instructions, or the nsw flags on arithmetic operations.

As you note, it doesn't remove the more core UB behaviors in LLVM, in particular LLVM's reliance on pointer provenance.

nikic

15 hours ago

They do a bit more than that. One of the options (-disable-object-based-analysis under AA2) disables the assumption that distinct identified objects do not alias, which is disabling pointer provenance in at least one key place.

So I think this option very roughly approximates a kind of "no provenance, but with address non-determinism" model, which still permits optimizations like SROA on non-escaping objects.
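As a rough illustration of what that still permits (my sketch, not an example from the paper): a local whose address never escapes can be scalarized regardless of how other pointers are treated.

    /* p never has its address escape the function, so SROA can split it
       into scalars and fold the result, without any provenance-based
       reasoning about unrelated stores. */
    struct Pair { int a, b; };

    int sum(void) {
        struct Pair p = { 1, 2 };   /* non-escaping local aggregate    */
        return p.a + p.b;           /* may be folded to the constant 3 */
    }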

jcranmer

15 hours ago

That's what I get for relying on a skimming of the paper.

Also, hi, didn't know you commented on this site.

throwawayqqq11

18 hours ago

Sorry, I don't get why the memory layout should have any effect when it's clear in the AST that i=0 should be returned.

maartenscholl

18 hours ago

I think in the example the parent gave `arr[3]` is past the end of the 3 element array, where `i` might reside, potentially changing its value.

bregma

18 hours ago

It's clear in the AST that there is undefined behaviour and it is malformed code. It is not valid C code, so what the compiler chooses to do with it is not defined by the language.

pcwalton

18 hours ago

Note that if you change the code to this you have the same issue:

    int g(int n) {
        int arr[3], i = 0;
        arr[n] = 5;   // out of bounds when n >= 3 (or n < 0)
        return i;
    }
Without "exploiting UB" it's incorrect to optimize this to "return 0", because of the possibility that i was allocated right after arr and n == 3.

screcth

17 hours ago

If we consider writing out of bounds to be legal, we make it impossible to reason about the behavior of programs.

pcwalton

17 hours ago

Hence why I (and many other compiler developers) are inherently skeptical whenever anyone says "just stop exploiting undefined behavior".

tialaramex

13 hours ago

I'm not a compiler developer, but I'm at least as skeptical as you, because there is no sign that the "just stop exploiting UB" people actually want any specific semantics; IMO they want Do What I Mean, which isn't a realizable language feature.

If you could somehow "stop exploiting UB", they'd just be angry either that you're still exploiting an actual language requirement they don't like (and so have decided ought to be excluded), or that you followed the rules too literally and obviously the thing they meant ought to happen even though that's not what they actually wrote. It's lose-lose for compiler vendors.

gavinhoward

12 hours ago

I am in the "stop exploiting UB" camp. [1]

I agree that some of us are unreasonable, but I do recognize that DWIM is not feasible.

I just want compilers to treat UB the same as unspecified behavior, which cannot be assumed away.

[1]: https://gavinhoward.com/2023/08/the-scourge-of-00ub/

muldvarp

6 hours ago

> I just want compilers to treat UB the same as unspecified behavior, which cannot be assumed away.

Unspecified behavior is defined as the "use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance".

Which (two or more) possibilities should the standard provide for out-of-bounds writes? Note that "do what the hardware does" wouldn't be a good specification because it would either (a) disable all optimizations or (b) be indistinguishable from undefined behavior.

tialaramex

11 hours ago

You mention that "Note that those surprised programmers are actually Rust compiler authors" but I can't figure out which of the many links is to some "surprised programmers" who are actually rustc authors, and so I don't even know if you're right.

Rust's safe subset doesn't have any UB, but the unsafe Rust can of course cause UB very easily, because the rules in Rust are extremely strict and only the safe Rust gets to have the compiler ensure it doesn't break the rules. So it seems weird for people who work on the compiler guts to be "surprised".

pcwalton

8 hours ago

I'm a Rust compiler author, and I'm fully in favor of "UB exploitation". In fact, LLVM should be doing more of it. LLVM shouldn't be holding back optimizations in memory-safe languages for edge cases that don't really matter in practice.

gavinhoward

7 hours ago

muldvarp

6 hours ago

I don't see any surprised compiler authors in that thread. The reporter immediately suggests the correct underlying reason for the bug and another compiler author even says that they wondered how long it would take for someone to notice this.

Even if you read any surprise into their messages, they wouldn't be surprised that C does something completely unreasonable; they would be surprised that LLVM does something unreasonable (by default).

Dwedit

10 hours ago

There is also a completely different scenario where out-of-bounds writes aren't undefined behavior anymore. And that's when you've manually defined the arrays in an assembly source file, and exported their symbols. In that situation, you know what's before the array or after the array, so doing pointer math into an adjacent area has a well known effect.

im3w1l

17 hours ago

My thinking disagrees with yours, but I can't say I have fully made up my mind yet. To me a compiler that is not "exploiting UB" has several valid choices about how to compile this. i may be stored in memory, in a register or may be deleted entirely as it's known at compile time that it will have value 0. The store to arr[n] may go through or it may be deleted as it's a dead store.

You may say I'm "exploiting UB" when making these deductions, but I would disagree. My argument is not of the form "x is UB so I can do whatever I want".

To elaborate on the example, if arr was volatile then I would expect the write to always go through. And if i was volatile then I would expect i to always be read. However it's still not guaranteed that i is stored immediately after arr, as the compiler has some discretion about where to place variables afaik. But if i is indeed put immediately after, then the function should indeed return 5 for n=3. For n>3 it should either return 0 (if writing to unused stack memory), page fault (for small n outside of the valid stack space), or stomp on random memory (for unlucky and large n). For negative n, many bad things are likely to happen.
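As a rough sketch of the volatile variant I have in mind (same shape as the g above, names mine):

    int g_volatile(int n) {
        volatile int arr[3];
        volatile int i = 0;
        arr[n] = 5;     /* volatile store: must be emitted as written    */
        return i;       /* volatile load: i must actually be read back,
                           whatever the chosen stack layout did to it    */
    }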

Edit: I think I mixed up which way the stack grows but yeah.

dgrunwald

16 hours ago

> But if i is indeed put immediately after, then the function should indeed return 5 for n=3.

That's not how compilers work. The optimization changing `return i;` into `return 0;` happens long before the compiler determines the stack layout.

In this case, because `return i;` was the only use of `i`, the optimization allows deleting the variable `i` altogether, so it doesn't end up anywhere on the stack. This creates a situation where the optimization only looks valid in the simple "flat memory model" because it was performed; if the variable `i` hadn't been optimized out, it would have been placed directly after `arr` (at least in this case: https://godbolt.org/z/df4dhzT5a), so the optimization would have been invalid.

There's no infrastructure in any compiler that I know of that would track "an optimization assumed arr[3] does not alias i, so a later stage must take care not to place i at that specific point on the stack". Indeed, if the array index were a runtime value, the compiler would be prevented from ever spilling to the stack any variable that was involved in any optimizations.

So I think your general idea -- "the allowable behaviors of an out-of-bounds write are specified by the possible actual behaviors in a simple flat memory model for various different stack layouts" -- could work as a mathematical model as an alternative to UB-based specifications, but it would end up not being workable for actual optimizing compiler implementations. Unless the compiler could guarantee that a variable can always stay in a register and will never be spilled (how would the compiler do that across function calls?), it'd have to essentially treat all variables as potentially modified by basically any store-via-pointer, which would essentially disable all optimizations.
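To sketch why (with a hypothetical `opaque` call standing in for anything that might force a spill):

    void opaque(void);      /* hypothetical: may force x into a stack slot */

    int h(int *p) {
        int x = 0;
        opaque();           /* x may be spilled around this call            */
        *p = 5;             /* in a flat model with defined OOB stores, this
                               could legally hit x's spill slot             */
        return x;           /* so a conservative compiler must reload x
                               rather than just returning 0                 */
    }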

nikic

15 hours ago

One peculiar thing about the benchmark results is that disabling individual UB seems to fairly consistently reduce performance without LTO, but improve it with LTO. I could see how the UB may be less useful with LTO, but it's not obvious to me why reducing UB would actually help LTO. As far as I can tell, the paper does not attempt to explain this effect.

Another interesting thing is that there is clearly synergy between different UB. For the LTO results, disabling each individual UB seems to be either neutral or an improvement, but if you disable all of them at once, then you get a significant regression.

mwkaufma

19 hours ago

Reading e.g. the 13% perf regression in simdjson from disabling UB:

  A simpler alternative is to compile the program with LTO. We confirmed that LLVM’s inter-procedural analyses can propagate both alignment and dereferenceability information for this function, which allows the LTO build to recover the performance loss.
"can" is doing a lot of heavy-lifting here. Guaranteeing expected optimizations "will" be applied are hard-enough, without leaving it entirely to an easily-derailed indirect side-effect.

UebVar

19 hours ago

This is "can" has exactly the same meaning as in "UB can make your programms faster". You could replace it with "it does, at least with clang". LTO is, in this regard, the same as UB, and unlike guaranteed optimizations, such as the single member optimization, or the empty base optimization.

mwkaufma

18 hours ago

Concretely, here, the UB-exploitation in question in this case is assuming that the "this" pointer in C++ is aligned and non-null, meaning it's a pervasive annotation throughout C++ codebases, not an edge-case.

Relying on LTO to "discover" this annotation through interprocedural analysis -- based on my experience of looking at LTO in practice -- will not be as comprehensive, and even when it works it accomplishes its task in an achingly-slow and expensive way.

This is a real devil-is-in-the-details case.

quotemstr

17 hours ago

I love when papers disagree with their own abstracts.

gitroom

19 hours ago

perfect, this is right up my alley - honestly I keep wondering if teams avoid optimizations like LTO just because build pain sucks or if there's some deeper trust issue around letting the toolchain be clever. you think people would deal with slow builds if it bought way more speed for the final product?

jonstewart

21 hours ago

> The results show that, in the cases we evaluated, the performance gains from exploiting UB are minimal. Furthermore, in the cases where performance regresses, it can often be recovered by either small to moderate changes to the compiler or by using link-time optimizations.

_THANK YOU._

Rusky

19 hours ago

It's worth noting (and the paper does go into this) that this is limited to a very specific subset of UB, which they call "guardable."

They are not removing UB around things like out-of-bounds or use-after-free, which would likely be more expensive.

jonstewart

15 hours ago

I don't understand the downvotes. Conducting empirical research on the performance impact of undefined behavior is fantastically needed, as the C++ committee's obsession with undefined behavior strictness (in contrast with longstanding semantics, e.g., uninitialized memory accesses being just fine) has been justified largely by the optimizations it enables. This research shows that many types of UB have a negligible impact on performance.

saagarjha

14 hours ago

You're getting downvoted because you're looking for a particular result ("UB optimizations don't help performance") rather than actually evaluating the quality of this analysis (which doesn't really support what you want anyway).

atq2119

14 hours ago

Possibly somebody downvoted because "thank you" in all caps is not a substantial contribution to discussion. It feels like the kind of low effort stuff you'd see on reddit.

Also, commenting on downvotes is generally frowned upon.

ryao

21 hours ago

> by using link-time optimizations

These are almost never used by software.

mgaunard

20 hours ago

The only places where I've seen LTO not be used are ones with bad and unreliable build systems that systematically introduce undefined behaviour by violating the ODR.

tialaramex

20 hours ago

Violating the ODR doesn't introduce UB; it's IFNDR, Ill-Formed No Diagnostic Required, which is much worse in principle and, in such cases, probably also in practice.

UB is a runtime phenomenon: it happens, or it doesn't, and we may be able to ensure the case where it happens doesn't occur with ordinary human controls.

But IFNDR is a property of the compiled program. If you have IFNDR (by some estimates that's most C++ programs), your program has no defined behaviour and never did, so there is no possible countermeasure; too bad, game over.

jeffbee

20 hours ago

The only organization I've worked in that had comprehensive LTO for C++ code was Google. I've worked at other orgs even with 1000s of engineers where LTO, PGO, BOLT, and other things you might consider standard techniques were considered voodoo and too much trouble to bother with, despite the obvious efficiency improvements being left on the table.

com2kid

20 hours ago

I helped with PGO work at Microsoft over 15 years ago, back when it was a Microsoft Research project.

The issue with early PGO implementations was getting a really good profile, as you had to have automation capable of fully exercising the code paths that you knew would be hot in actual usage, and you needed good instrumentation to know which code paths those were!

The same problem exists nowadays, but programs are instrumented to hell and back to collect usage data.

jeffbee

20 hours ago

I am willing to assume that organizations dedicated to shipping software to customers like Microsoft or Autodesk or somebody like that are almost certainly all in on optimization techniques. The organizations where I worked are ones that are operating first party or third party software in the cloud where they're responsible for building their own artifacts.

astrange

15 hours ago

PGO is pretty difficult. In my experience compilers don't seem to know the difference between "this thing never runs" and "we don't have any information about whether this thing runs". Similarly, it might be useful to know "is this branch predictable" rather than just "what % of the time is it taken".

CPUs are so dynamic anyway that there often isn't a way to pass down the information you'd get from the profile, e.g. I don't think Intel actually recommends any way of hinting branch directions.

jeffbee

15 hours ago

It's implied by the target offset: likely-taken branches are laid out to jump backwards, unlikely branches to jump forward.

atq2119

14 hours ago

Not generally, no. This is true for some chips, especially (very) old or simple cores, but it's not something to lean on for modern high end cores.

jeffbee

13 hours ago

Generally, yes. This is not for "simple" cores; this is the state-of-the-art static branch prediction algorithm as described by Intel in their optimization manual.

"Branches that do not have a history in the BTB ... are predicted using a static prediction algorithm: Predict forward conditional branches to be NOT taken. Predict backward conditional branches to be taken."

It then goes on to recommend exactly what every optimizing compiler and post-link optimizers like BOLT do:

"Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target."

This is why a reduction in taken forward branches is one of the key statistics that BOLT reports.
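In source terms this is what likely/unlikely hints are for: keep the hot path as the fall-through and push the cold path behind a forward branch. A minimal sketch using the GCC/Clang builtin (handle_error is a placeholder):

    void handle_error(int err);                 /* hypothetical */

    int process(int err) {
        if (__builtin_expect(err != 0, 0)) {    /* hinted unlikely */
            handle_error(err);                  /* cold path, forward-branch target */
            return -1;
        }
        return 0;                               /* hot path: straight-line fall-through */
    }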

saagarjha

14 hours ago

Surely you are not putting code behind an if/else

UncleMeat

19 hours ago

Google doesn't have full-lto either, since binaries are way too big. Thin-lto is vastly less powerful.

mgaunard

18 hours ago

In practice the default ABI on Linux x86-64 still limits you to binaries that are 4G or thereabouts.

Not exactly a problem for LTO since any reasonable build machine will have 128GB of ram.

pcwalton

18 hours ago

Yeah, I would have liked to see the paper specify whether the LTO they tried is fat LTO or ThinLTO.

jeffbee

19 hours ago

"Vastly" eh? I seem to recall that LLVM ThinLTO has slight regressions compared to GCC LTO on specCPU but on Google's own applications the superior whole-program devirtualization offered only with ThinLTO is a net win.

UncleMeat

17 hours ago

I'll adjust my phrasing.

As a user, building with thin-lto vs full-lto generally produces pretty similar performance in no small part because a huge amount of effort has gone into making the summaries as effective as possible for key performance needs.

As a compiler developer, especially when developing static analysis warnings rather than optimization passes, the number of cases where I've run into "this would be viable if we had full-lto" has been pretty high.

loeg

20 hours ago

Facebook uses LTO/PGO for C++ pretty broadly.

jeffbee

19 hours ago

Yeah they just never hired me. They also invented BOLT.

I think there is a valley in terms of organization size where you have tons of engineers but not enough to accomplish peak optimization of C++ projects. These are the orgs that are spending millions to operate, for example, the VERY not-optimized packages of postgresql from Ubuntu, in AWS.

spookie

11 hours ago

Well, Ubuntu isn't really a good project to look up to :)

Hell, their latest upgrade broke one of their flavours. Not to mention how fragile their installer is.

jandrewrogers

20 hours ago

LTO is heavily used in my experience. If it breaks something that is indicative of other issues that need to be addressed.

yxhuvud

20 hours ago

The main issue isn't that it breaks stuff but that it tends to be pretty slow to compile with.

jorvi

19 hours ago

.. that's why you compile without LTO during development and do a final 'compile with LTO > profile > fix / optimize > compile with LTO' pass.

Compilation happens once; the result then runs on hundreds of thousands up to billions of devices. Respect your users.

astrange

15 hours ago

This assumes that LTO is strictly better than no LTO, i.e. only gets faster, has the same optimization hotspots, and doesn't break anything.

I would recommend only doing things that fit within the 'build > test > fix' loop.

Sharlin

19 hours ago

Which doesn't matter at all in a release build. And in a dev build it's rarely necessary.

pcwalton

18 hours ago

At FAANG scale the cost is prohibitive. Hence the investment in ThinLTO.

KerrAvon

18 hours ago

At FAANG scale, you absolutely want to have a pass before deployment that does this or you're leaving money on the table.

scott_s

11 hours ago

It's not as obvious a win as you may think. Keep in mind that for every binary that gets deployed and executed, it will be compiled many more times before and after for testing. For some binaries, this number could easily reach the hundreds of thousands of times. Why? In a monorepo, a lot of changes come in every day, and testing those changes involves traversing a reachability graph of potentially affected code and running their tests.

steveklabnik

19 hours ago

It's on by default for Rust release builds, so at least the codepaths in LLVM for it are well-exercised.

vlovich123

17 hours ago

I don't think that's right unless the docs are stale:

    [profile.release]
    lto = false
https://doc.rust-lang.org/cargo/reference/profiles.html#rele...

steveklabnik

17 hours ago

So the thing is that false means thinlto is used depending on other settings, see https://doc.rust-lang.org/cargo/reference/profiles.html#lto

> false: Performs “thin local LTO” which performs “thin” LTO on the local crate only across its codegen units.

I think this is kind of confusing but whatever. I should have been more clear.

LegionMammal978

12 hours ago

There is no cross-crate LTO with 'lto = false', but there is cross-crate thin LTO with 'lto = "thin"'. The codepaths might still be getting hit, but individual CGUs within a crate are generally invisible to the user, which can create the impression that LTO doesn't occur. (That is, if you operate under the mental model of the crate being the basic compilation unit, then 'lto = false' means you'll never see LTO.)
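Concretely, in Cargo terms (a sketch of the main values, per the profile docs linked above):

    [profile.release]
    # lto = false    # default: "thin local" LTO within each crate's codegen units
    # lto = "thin"   # cross-crate ThinLTO
    lto = "fat"      # full cross-crate LTO (slowest to build)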

alpaca128

18 hours ago

That must have been changed sometime in the last year then. When I enable LTO for one of my projects on a Rust compiler from 2024 the compilation time more than doubles.

steveklabnik

17 hours ago

I should have been more clear: thin LTO is on by default, not full "fat" LTO, for exactly that reason.