AVX Bitwise ternary logic instruction busted

259 points | posted 17 hours ago
by msephton

57 Comments

red_admiral

28 minutes ago

Is this similar to the Windows (since at least 3.1 I think?) BitBlt function, which takes an `op` parameter to decide how to combine the source, destination and mask?

I remember there are names for some of the codes, like BLACKNESS for producing black whatever the inputs are, COPY (or something like that) to just copy the source to the destination, etc. I always thought BLACKNESS and WHITENESS had a kind of poetic ring to them.

As far as I know (I think this is from Petzold), it's implemented in software, but the opcode is actually converted to custom assembly inside the function when you call it: a rare example of self-modifying code in the Windows operating system.
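
For reference, a minimal sketch of the call (BitBlt and the raster-op constants are real GDI names; the wrapper function and its arguments are just placeholders):

    #include <windows.h>

    /* Combine a source DC into a destination DC with a few of the raster ops. */
    void blit_demo(HDC dst, HDC src, int w, int h)
    {
        BitBlt(dst, 0, 0, w, h, src, 0, 0, SRCCOPY);   /* dest = src            */
        BitBlt(dst, 0, 0, w, h, src, 0, 0, SRCINVERT); /* dest = dest XOR src   */
        BitBlt(dst, 0, 0, w, h, src, 0, 0, BLACKNESS); /* dest = all 0s (black) */
    }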

mmozeiko

14 hours ago

There is a simple way to get that immediate from the expression you want to calculate. For example, if you want to calculate the following expression:

    (NOT A) OR ((NOT B) XOR (C AND A))
then you simply write

    ~_MM_TERNLOG_A | (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))
Literally the expression you want to calculate. It evaluates to the immediate, using the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:

    typedef enum {
      _MM_TERNLOG_A = 0xF0,
      _MM_TERNLOG_B = 0xCC,
      _MM_TERNLOG_C = 0xAA
    } _MM_TERNLOG_ENUM;
For MSVC you define them yourself.
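
A minimal sketch of how that plugs into the intrinsic (the constant values and _mm512_ternarylogic_epi32 are real; the wrapper names here are just for illustration). For the expression above, the constants fold to the immediate 0x9F at compile time:

    #include <immintrin.h>

    /* gcc/clang ship _MM_TERNLOG_A/B/C; on MSVC define the same values yourself. */
    enum { TERNLOG_A = 0xF0, TERNLOG_B = 0xCC, TERNLOG_C = 0xAA };

    /* (NOT A) OR ((NOT B) XOR (C AND A))  ->  0x9F */
    enum { MY_IMM = (~TERNLOG_A | (~TERNLOG_B ^ (TERNLOG_C & TERNLOG_A))) & 0xFF };

    /* Each result bit is a truth-table lookup of the corresponding a/b/c bits. */
    __m512i my_ternlog(__m512i a, __m512i b, __m512i c)
    {
        return _mm512_ternarylogic_epi32(a, b, c, MY_IMM);
    }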

o11c

11 hours ago

To take the magic away, write it in binary:

  A = 0b11110000
  B = 0b11001100
  C = 0b10101010
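
Each bit position of those patterns is one of the eight (A, B, C) input combinations, so any bitwise expression over them spells out the truth table directly. A couple of illustrative values, worked from the patterns above:

    A AND B AND C = 0b10000000 = 0x80
    A XOR B XOR C = 0b10010110 = 0x96
    NOT A         = 0b00001111 = 0x0F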

meow_catrix

6 hours ago

The Amiga manual suggests normalizing to a conjunctive normal form.

bcoates

5 hours ago

The constant-math method above doesn't require that and works on denormalized expressions, too.

That said, the trick for turning bitwise operations with four or more arguments into a series of vpternlogd operations has yet to be posted.

Sniffnoy

16 hours ago

Oh, I thought the title was saying that the instruction doesn't work properly! (The article actually just explains how it works.)

SomeHacker44

16 hours ago

Agreed on initial interpretation. Terrible title!

foota

15 hours ago

Amusingly, I had a third interpretation, reading "busted" as meaning too strong. I realized that's probably not what they meant when the author started talking about the Amiga, though (as busted is a fairly modern gaming term, and I'd be surprised to see someone old enough to be familiar with the Amiga use it; sorry to anyone who feels personally attacked by this description :P).

thfuran

14 hours ago

Is broken not cool enough anymore?

aspenmayer

12 hours ago

I’ve heard broken and cracked used in this way, usually in a game-mechanics balance context implying something is overpowered, but busted in that same context has a negative connotation to me for some reason, signifying something being underpowered or bugged, if that makes sense. I can’t even tell if this is ironic or unironic at this point.

Lerc

16 hours ago

My teenage self did not write "CRAP!" on that page of the hardware manual, but I stared at it for so long trying to figure it out.

In the end I did what pretty much everyone else did: found the BLTCON0 for Bobs and straight copies, and then pretended I never saw the thing.

I did however get an A+ in computational logic at university years later, so maybe some of the trauma turned out to be beneficial.

vardump

3 hours ago

Yup. Learned so much from that page in the Amiga Hardware Reference Manual!

cubefox

14 hours ago

About the title: "ternary logic" usually means "logic with three truth values". But this piece covers a CPU instruction which handles all binary logic gates with three inputs.

dzaima

14 hours ago

The x86 instruction is named 'ternlog', though, and the intrinsic 'ternarylogic'; so while perhaps unfortunate, the title is appropriate. (And even then, 'bitwise' already sort of takes the place of what 'ternary'-as-three-valued would, and 'ternary' also very often means three-input, so much so that 'a ? b : c' is often called the ternary operator. In fact ternlog can simulate this ternary operation, and the article is about exactly that.)

cubefox

12 hours ago

Yeah, though the article describes 0xE2, which is 'b ? a : c'. 'a ? b : c' would be 0xCA.
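
Both fall out of the bit-pattern method from upthread (a quick check with A = 0xF0, B = 0xCC, C = 0xAA):

    b ? a : c  =  (B AND A) OR ((NOT B) AND C)  =  0xC0 OR 0x22  =  0xE2
    a ? b : c  =  (A AND B) OR ((NOT A) AND C)  =  0xC0 OR 0x0A  =  0xCA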

orlp

3 hours ago

That's not what ternary refers to here.

In C, '+' is a binary operator because it accepts two inputs. '?:' is a ternary operator because it accepts three inputs. It is usually referred to as the ternary operator because it is unique in C, but there's nothing fundamental about that.

vpternlogd implements all bitwise ternary operators - those operators that accept three inputs.

eapriv

7 hours ago

It’s “ternary (logic instruction)”, not “(ternary logic) instruction”.

someguydave

12 hours ago

Agree, I was also confused on this point. I guess the name “evaluate a three-term binary expression” is less snappy though.

cubefox

13 minutes ago

I'm also not sure what was "busted" here.

abecedarius

9 hours ago

Re the choice of function "E2" for the example in the docs: it's sort of the most basic, canonical boolean function on 3 inputs, named mux: A if B else C. It's universal -- you don't need to be an Amiga fan to pick it (though for all I know they might've been).

kens

16 hours ago

I'll point out that this is the same way that FPGAs implement arbitrary logic functions, as lookup tables (LUTs).

dreamcompiler

2 hours ago

All logic can be implemented with memory, and all memory can be implemented with logic (in some sort of feedback configuration).

Neither is typically done in practice except for specialized purposes like FPGAs and the instruction described in this article. High-speed registers and static RAM are sometimes built with logic but it's more common to build them directly with transistors than with gates.

okanat

11 hours ago

Basically CPUs, GPUs and FPGAs all converge to the crab equivalent of computation. They all expose the same capability with different areas of optimization.

kevin_thibedeau

16 hours ago

Most but not all. Actel/Microsemi use a small tree of muxes and gates.

pjmlp

2 hours ago

> The Amiga blitter user manual didn’t help much either. The “Amiga Hardware Reference Manual” from 1989 tried to explain minterm calculation using confusing symbols, which frustrated many young demo makers at the time.

That is super normal logical calculus that any worthwhile CS degree teaches about.

Granted, probably not what a teenager without access to a BBS, or Aminet, would be able to figure out.

pwrrr

2 hours ago

Holy cow. I remember reading that page in the Amiga reference manual, thinking it was utter crap and made up my own way of calculating the value (which worked, lol).

unwind

4 hours ago

As someone who fits the description rather too well (although neither my teenage nor current self would ever use a marker in the Hardware Reference, omg), this was really nice and satisfying to read.

In a weird sense it kind of helped me feel that, yes, I would probably understand stuff better if I tried re-learning the Amiga hardware today and also like I got a bit of it for free already! Is there such a thing as being protected from a nerd snipe? "This article was my nerd trench" ... or something. Thanks! :)

worstspotgain

5 hours ago

It's nice that they're finally starting to "compress" the instruction space.

To take a related concept further, it would be nice if there were totally unportable, chip-superspecific ways of feeding uops directly, particularly with raw access to the unrenamed register file.

Say you have an inner loop, and a chip is popular. Let your compiler take a swing at it. If it's way faster than the ISA translation, add a special case to the fat binary for a single function.

Alas, it will probably never happen due to security, integrity, and testing costs.

larodi

5 hours ago

Is it only security that stands in the way of compilers devising their own instructions? Certain architectures could potentially allow this, and most chips, even MCUs like the ESP32 (which run Xtensa LX7 or RISC-V cores), have microcode. It is apparent what the gains could be; what is the typical showstopper: security, or the proprietary microcode language?

worstspotgain

4 hours ago

The uops aren't smart or terribly proprietary IIRC. Aside from security/integrity/testing, the documentation would have to be automated, and releasing it early might tip off competitors to some stuff. I don't think you'd even need to roll your own ISA, you'd just issue uops as a native ultra-low-level ISA.

One issue (without further architectural help) is that you'd have to save and restore every register you touch, which is fine for an inner loop that's consequential enough. Another obstacle is any hardware that's shared among cores, such as memory coherency.

In general, moving super-ad-hoc memory management into the compiler would suck for the compiler writers. Maybe some of your compiler can be an LLM?

If performance is key, I would totally deploy this if I have faith in my tests - and maybe a way for the user to turn it off just in case.

leogao

14 hours ago

Nvidia SASS has a similar instruction too (LOP3.LUT)

Findecanor

15 hours ago

I didn't have the official Amiga hardware manual, but instead the book "Mapping the Amiga". It said the same thing in a slightly more verbose way. I don't remember which minterms I used back then, but I think I managed to work things out from this book to do shadebobs, bobs, XOR 3D line drawing and other things.

The page in Mapping the Amiga: https://archive.org/details/1993-thomson-randy-rhett-anderso...

ChuckMcM

11 hours ago

This is an instruction I would like to implement in RISC-V if it isn't already there (which, yeah, I know, isn't very RISC-like):

   movei (%r1),(%r2),(%r3),value
Move the contents of memory pointed to by r1 to the memory pointed to by r2, applying the boolean operator <value> with the memory pointed to by r3. Then increment all three registers by 4 to point to the next word. There was something similar to this in the Intel 82786 graphics chip, which had a sort of minimal CPU part that could run simple "programs".

And yeah, I really enjoyed the blitter on the Amiga. It was a really cool bit of hardware.

magicalhippo

8 hours ago

What's the value (no pun intended) of such an instruction if "value" is an immediate and not register-based?

If it's an immediate, then the compiler (human or machine) knows what the operation will be, so it could just write that instead?

ChuckMcM

6 hours ago

The value is that a two instruction pipeline can do a blit at cache speed.

magicalhippo

5 hours ago

I totally get why you'd want mem-to-mem instructions, it just seemed more interesting in a blitting context (and others) to have the operation not be static but rather register-based. But perhaps in practice it matters not, been a long time since I wrote blitting code.

RISC-V already has what you want in terms of R-type instructions[1], ie "dest = src1 op src2" where "op" is effectively an immediate value, but being based on a load-store architecture it's only register-to-register of course.

Though I suppose ISA-wise there's nothing in the way of making an extension that introduces "M-type instructions" which act like the R-type instructions but are mem-to-mem instead. How much that messes with everything else I have no idea.

edit: ah, forgot about you wanting it to behave like movsb. Still, something that could be handled by the extension instruction.

[1]: https://github.com/riscv/riscv-isa-manual/releases/download/... (section 2.4.2)

Someone

5 hours ago

But the “two instruction” part isn’t important there on most RISC-V hardware, is it? That “memory to memory” instruction still routes the data through the CPU, so what does replacing a load instruction and a store instruction by a more complex but likely only slightly shorter load-store instruction bring if your CPU is much faster than your memory?

stevefan1999

5 hours ago

If you want to calculate the minterms why don't you just get a K-Map?

dreamcompiler

2 hours ago

You don't even need to do that. The 1s and 0s of the 8-bit word are the 1s and 0s of a Karnaugh map.

notfed

14 hours ago

Couldn't every Boolean operation be "busted" as a lookup table?

ekimekim

13 hours ago

Yes! But your lookup table will need 2^N bits for a function with N inputs. In this way you can easily enumerate all possible functions from N bits to 1 bit.

As a fun exercise, you can do this for all 2-bit -> 1-bit functions. There's only 16 of them, and most of them have very well known names like "and" (LUT 1000) or "xor" (LUT 0110). Some of them don't depend on some of the inputs (eg. LUT 1100 / 1010 which is "return A" and "return B" respectively) or even any of them (eg. LUT 0000 which always returns 0).
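
A tiny sketch of evaluating such a LUT in C (the function name and layout are just an illustration): bit (2*a + b) of the 4-bit table is the output for inputs (a, b), the same indexing the 8-bit ternlog immediate uses for three inputs.

    #include <stdio.h>

    /* Evaluate a 2-input boolean function from its 4-bit truth table. */
    static int lut2(unsigned lut, int a, int b)
    {
        return (lut >> ((a << 1) | b)) & 1;
    }

    int main(void)
    {
        printf("AND(1,1) = %d\n", lut2(0x8 /* 1000 */, 1, 1)); /* -> 1 */
        printf("XOR(1,0) = %d\n", lut2(0x6 /* 0110 */, 1, 0)); /* -> 1 */
        return 0;
    }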

somat

10 hours ago

The epiphany for me was that any boolean logic can be replaced by memory.

https://www.youtube.com/watch?v=BA12Z7gQ4P0 (ben eater)

I mean, it is obvious in retrospect, sort of along the lines of memoizing a function. But it was mind-blowing when I first saw that.

Not that at the hardware level memory is actually any simpler than whatever boolean logic it is pretending to be, but it feels simpler and is easily available in large quantities.

ggerules

8 hours ago

It looks like someone paid attention in their undergraduate Discrete Math class.

londons_explore

14 hours ago

Do compilers actually output this instruction?

So many super-clever instructions are next to impossible for compilers to automatically use.

dzaima

8 hours ago

This instruction at least is very trivial to emit: any sequence of bitwise logic ops over three inputs always compiles to one ternlog. This isn't at all a super-clever instruction; it's more a super-boring one. If anything, it simplifies codegen compared to having to emit a varying number of instructions for three-operand logic and having to pick the best sequence.
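
For instance (a sketch; the function name is mine), recent gcc and clang with AVX-512 enabled typically fold a chain of two-input logic intrinsics like this into a single vpternlogd:

    #include <immintrin.h>

    /* Bitwise majority of three vectors, written as ordinary two-input ops;
       the compiler is free to fuse them into one vpternlogd. */
    __m512i majority(__m512i a, __m512i b, __m512i c)
    {
        return _mm512_or_si512(
            _mm512_or_si512(_mm512_and_si512(a, b), _mm512_and_si512(b, c)),
            _mm512_and_si512(a, c));
    }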

486sx33

16 hours ago

it’s fundamentally just a lookup table

red_admiral

24 minutes ago

Any pure function is just a glorified lookup table.

(Yes, I know, finite input domain etc.)

hvenev

12 hours ago

> an obscure instruction

Come on, vpternlog* is not obscure. It subsumes _all_ bitwise instructions, even loading the constant (-1) into a register.

LegionMammal978

11 hours ago

Not if, like most people, you still aren't using a CPU with AVX-512 support. And I don't recall ever seeing it in compiler output in any case. It's not like boolean operations on three variables occur very frequently in most programs, especially (EDIT: this last part apparently isn't the case) not operations that can't be decomposed into a pair of two-variable operations with no worse performance.

dzaima

10 hours ago

As far as everything on uops.info goes, ternlog has the same throughput and latency as the two-operand logic instructions everywhere (with the mild exception of Zen 4 where it goes from 0.50 to 0.56 cycles/instr; which also shows as having 2-cycle latency to one operand but I think that might be measurement error), so it's always bad to decompose ternlog into two two-operand logic ops.

transfire

14 hours ago

Great little article! Thank you.