Action was the best 8-bit programming language

91 points, posted 5 days ago
by rbanffy

68 Comments

SeenNotHeard

5 days ago

One limitation not mentioned is that Action! didn't support recursion. This had to do with how local variables were stored.

Whether or not it was the best language for 8-bit programming, it was certainly a great fit for the 6502, as the language targeted the peculiarities of that chip. Accessing hardware-specific features of the 8-bit Ataris was a snap, which was necessary in order to do anything more interesting than sieves or print loops.

Action! probably could've been ported to the Apple line, but 8-bits were winding down by the time it was released. Porting to 16-bit machines like the IBM PC or Mac (or even the Atari ST) would have been a tougher sell, since Pascal and C were better established by that point, and worked well on those machines.

Two bad things about Action!: Charging a license fee to distribute the runtime, and that dumb bang in the name.

eschneider

4 days ago

The lack of recursion was not a practical limitation on the Atari. Base systems were 16K and you really didn't have space for recursion. And the license fee for the runtime didn't feel that painful compared to the other options.

NetMageSCW

4 days ago

I think a bigger problem for recursion support is the 6502 being limited to a 256 byte stack by its 8-bit stack pointer.

keyle

4 days ago

Wasn't recursion a problem for early C and Pascal in those days anyway? They didn't have tail call optimisations.

As in, after X levels of recursion you'd get in trouble, as memory kept being allocated for a new stack frame on every recursive call...

vincent-manis

4 days ago

No, recursion still works well without TCO (though as a Schemer, I love TCO). I was programming in BCPL in the early 1970s, and it handled recursive procedures with aplomb. The big revolution was realizing that, if you don't allow access to automatic variables declared in outer scopes, you could store all the variables in the stack frame, and access them with a small offset from the stack or frame pointer. That made automatic variables just about as fast as static ones (which, on System/360, had to be accessed via a base register), with small overheads at call and return sites.

Again on System/360, I benchmarked BCPL procedure call costs against subroutine call costs in Fortran G (the non-optimizing compiler). BCPL was about 3 times faster.

That said, as soon as you added multi-tasking (what we'd now call threads), it all went to hell. It's not an accident that one IBM PL/I manual of the 1960s said “Do not use procedures, they are expensive.”

As mentioned by others, it was the tiny stack in the 6502 that killed this approach. I appreciate all those who pine for the 6502, but it made implementing modern (even for the 1970s) languages almost impossible.
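
To make that concrete, here's a sketch of what frame-pointer-relative access costs on the 6502 (assuming a hypothetical two-byte software frame pointer fp in zero page, since the 256-byte hardware stack can't hold frames):

    ; load the 16-bit automatic variable at frame offset 4
        ldy #4
        lda (fp),y     ; low byte: 5+ cycles, and it ties up Y
        sta lo         ; lo/hi are assumed zero-page scratch locations
        iny
        lda (fp),y     ; high byte
        sta hi
    ; a static variable in zero page is just "lda var" -- 3 cycles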

pasc1878

4 days ago

C and Pascal still don't guarantee TCO.

But you still sometimes need recursion, and C and Pascal can do that.

keyle

4 days ago

When would you absolutely need recursion that couldn't be fulfilled with a while loop?

unnouinceput

4 days ago

Never, but I can counter with "when do you absolutely need switch that can't be fulfilled with if-else"?

It's a nice-to-have that makes your life easier.

gavinray

4 days ago

Many algorithms are more simply expressed as recursive functions than as iteration with an explicit stack or a "while" loop.

AnimalMuppet

2 days ago

Adaptive quadrature is an algorithm for numerical integration. You have a function of one variable, and two endpoints of a range, and an error limit. You want to return the value of the integral of that function over that range, a value that is no further from the correct value than the error limit.

What you do is, you do a three-point approximation and a five-point approximation. The difference between the two gives you a fairly good estimate of the error. If the difference is too high, you cut the region in half, and recursively call the same function on each half.

That calling twice is what makes it hard for a while loop. I mean, yes, you could do it with a work queue of intervals or something, but it would be much less straightforward than a recursive call.

kragen

a day ago

Hey, this is awesome! I'd never heard of adaptive quadrature before. Thanks for the tip! Here's what I hacked together from your description and looking up the Newton–Cotes formulas to get a reasonable order of convergence:

    local function q(f, start, fstart, mid, fmid, stop, fstop, ε)
       -- Scaled approximation of integral on 3 points using Simpson’s
       -- rule.
       -- <https://en.wikipedia.org/wiki/Newton%E2%80%93Cotes_formulas#Closed_Newton%E2%80%93Cotes_formulas>
       local Q3 = (1/6)*(fstart + 4*fmid + fstop)

       -- Scaled approximation of integral on 5 points.
       local mid1, mid2 = (1/2)*(start + mid), (1/2)*(mid + stop)
       local fmid1, fmid2 = f(mid1), f(mid2)
       -- Boole’s rule, giving exact results for up to cubic polynomials
       -- and an O(h⁷) error.  This means that every additional level of
       -- recursion, adding 1 bit of precision to the independent
       -- variable, gives us 7 more bits of precision for smooth
       -- functions!  So we need like 256 evaluations to get results to
       -- 53-bit machine precision?
       local Q5 = (7/90)*(fstart+fstop) + (32/90)*(fmid1+fmid2) + (12/90)*fmid

       --print(mid, Q3, Q5)

       -- Approximate the error by comparing the two.
       if (stop - start)*math.abs(Q5 - Q3) <= ε then return (stop - start) * Q5 end

       -- Recursively subdivide the interval.
       return
          q(f, start, fstart, mid1, fmid1, mid, fmid, ε) +
          q(f, mid, fmid, mid2, fmid2, stop, fstop, ε)
    end

    function adaquad(f, start, stop, ε)
       local mid = (1/2)*(start + stop)
       -- In some quick testing, ε/2 seems to work okay.
       return q(f, start, f(start), mid, f(mid), stop, f(stop), (1/2)*ε)
    end

    return adaquad
http://canonical.org/~kragen/sw/dev3/adaquad.lua

It's kind of insane that 16 lines of code can do this!

stevefan1999

4 days ago

You still don't have guaranteed TCO anyway unless you use something like clang's [[clang::musttail]] (or the upcoming become keyword in Rust)

wduquette

5 days ago

The OP says that 8-bit CPUs couldn't handle Pascal well, and that Action! (released in 1983) was the first IDE for 8-bit machines.

But Apple Pascal was released for the Apple II in 1979. Based on UCSD Pascal, the Apple Pascal system was basically an OS that doubled as an IDE, and it worked perfectly well on 8-bit hardware. I had quite a lot of fun with it back in the day.

ajross

5 days ago

Apple Pascal was a UCSD Pascal descendant, which means it was a P-Code interpreter.

The article is broadly correct. The 6502 had what amounts to a mixed-performance address space. All indirect addressing had to be done via two-byte pointers in the zero page, at addresses 0-255. Essentially all the "pointers" in your application naturally wanted to live as one of these 128 "pointer registers". But that's not the way natural code generation wants to work, where the pointers get stored in data memory along with everything else.

So compiled languages need to have some kind of trampoline for every pointer to copy it into the memory where it needed to live, which was a really tall order for the optimizers of 1983.
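
Concretely, a sketch (ptr is an assumed zero-page pointer pair, q an assumed pointer variable in ordinary data memory):

    ; pointer already living in zero page: the cheap case
        ldy #0
        lda (ptr),y    ; load the byte ptr points at

    ; pointer stored out in data memory at q: copy it into ZP first
        lda q
        sta ptr
        lda q+1
        sta ptr+1
        ldy #0
        lda (ptr),y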

Or they could just cheat and compile to a virtualized instruction set and feed that to an interpreter. Apple chose this, twice: once with Woz's SWEET16 (the 16-bit virtual machine embedded in the Integer BASIC ROM), and again with the port of the P-Code interpreter for Pascal.

sema4hacker

5 days ago

Around 1979 or 1980 I was working for an 8080-based CRT terminal manufacturer and ported UCSD Pascal to our 8080 system, which worked flawlessly. I don't remember the details, but I believe all I had to do was implement a few BIOS-style routines. I got hung up for a few days because I had initialized the heap pointer to a byte boundary instead of a word boundary, but after that everything booted and ran as advertised.

kjs3

5 days ago

The OP says that 8-bit CPUs couldn't handle Pascal well

The 6502 might not have been able to handle Pascal well, but Borland Turbo Pascal for CP/M (Z80, 8080, etc.) worked very, very well. It was also released in 1983 or so, but dunno whether it or Action! was 'first'.

guenthert

4 days ago

Yes, Turbo Pascal was awesome and quite possibly the best programming environment for the Z80. The register-starved 6502, however, relies on zero-page addressing for efficient code, which is difficult to exploit in compilers for high-level languages (and hence why you don't see that in general-purpose CPUs anymore). One way around that is to compile to a byte code which is then interpreted (e.g. BCPL, UCSD Pascal).

wduquette

2 days ago

And in fact I eventually went from Apple Pascal to Turbo Pascal on a Kaypro 4 running CP/M-80. It was like coming home.

cmrdporcupine

5 days ago

If I'm not mistaken Apple Pascal ran a virtual machine which executed "p-Code" and the compiler emitted that.

Because, yeah, the 6502 is a difficult target for high level languages.

forinti

5 days ago

I learnt Pascal on a Beeb. It had a compiler, an editor, and a runtime in two 16KB ROMs.

eschneider

4 days ago

I remember Kyan Pascal on the Atari. It worked, but I'm not sure it worked _well_. Better than BASIC, but still quite a bit slower than ASM. Action! was much nicer if you were looking to deliver software on the Atari.

brucehoult

5 days ago

As others have pointed out, Apple Pascal was a bytecode interpreter, not compiled to native 6502 code.

Even worse, the bytecode was standard UCSD Pascal, designed for mainframes, and not really how you'd design bytecode if you knew you were going to implement it on a 6502.

We can actually compile Pascal / C style languages to the 6502 pretty well these days. The way to do it is treat it the same as a RISC machine with a lot of registers (let's say 32 2-byte pairs in zero page, leaving 75% of ZP for globals and a handful of important runtime functions that benefit from self-modifying code). Split the 32 pseudo-registers up into A, S, T registers the same as you do for any RISC (or x86_64 for that matter).

It's no problem to handle recursion. Just the same as for any function call, put anything you still want after the function call into S registers, and in the called function, if it needs to use S registers, it needs to save them on a stack (NOT the 256-byte hardware stack) first and restore them after. For more than one or two saved registers it's better to use a runtime function for this -- 12 cycles overhead, but lots of bytes of code saved. RISC-V's -msave-restore provides a good model for this.
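
A sketch of what such a save helper might look like (names hypothetical: sp is a two-byte software stack pointer in zero page, s0/s1 are two of the zero-page pseudo-registers):

    save2:              ; push s0 and s1 onto the software stack
        sec             ; grow the stack down by 4 bytes
        lda sp
        sbc #4
        sta sp
        lda sp+1
        sbc #0
        sta sp+1
        ldy #0
        lda s0          ; copy the 4 bytes out to the stack
        sta (sp),y
        iny
        lda s0+1
        sta (sp),y
        iny
        lda s1
        sta (sp),y
        iny
        lda s1+1
        sta (sp),y
        rts
The caller pays only a jsr save2 -- the 12 cycles of jsr/rts overhead mentioned above -- and a matching restore2 copies the bytes back and moves sp up again.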

You can make a nice set of 2-address arithmetic routines between 16 or 32 bit Zero-Page registers by using the X and Y registers to pass which locations to work on.

    add16:
        clc
        lda 0,x
        adc 0,y
        sta 0,x
        lda 1,x
        adc 1,y
        sta 1,x
        rts

        ...
        ldx #dst
        ldy #src
        jsr add16
This reduces the call site from 13 bytes to 7 bytes (and you can often reuse an X or Y), while increasing the execution time for an inline add from 20 cycles to 42 (1 cycle extra per instruction for the indexed addressing = 6, 4 cycles for loading X and Y, 12 cycles for the jsr/rts)

For 32 bit arithmetic the time overhead is much reduced. For floating point it would be negligible!

There is no convenient way to make 3-address routines, but `mov16` is only 5 instructions and 28 cycles. Or just 4 instructions (8 bytes) and 12 cycles if inlined, which might be a better tradeoff i.e.

        lda src1
        sta dst
        lda src1+1
        sta dst+1
        ldx #dst
        ldy #src2
        jsr add16
vs

        ldx #dst
        ldy #src1
        jsr mov16
        ldy #src2
        jsr add16

Stack-based 6502 code is both bigger and slower than (pseudo) register-based code -- a zero-argument `jsr add16` itself is smaller, but it's significantly slower, and loading and storing values between stack and somewhere else will be much more code and much slower.
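
For comparison, a sketch of the zero-argument stack version (assuming a Forth-style data stack in zero page indexed by X, with the top 16-bit cell at 0,x):

    add16s:
        clc
        lda 0,x        ; low byte of top of stack
        adc 2,x        ; low byte of second cell
        sta 2,x
        lda 1,x        ; now the high bytes
        adc 3,x
        sta 3,x
        inx            ; pop one 16-bit cell
        inx
        rts
The call site is just a 3-byte jsr add16s, but every operand first has to be pushed onto the data stack and the result stored back somewhere, which is where the extra code and cycles go.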

Stack-based bytecode has a whole lot more overhead again to fetch the next instruction and dispatch to the correct code for it. Register-based bytecode would be a much better idea, as it uses several times fewer bytecode instructions, and is more compact to boot (see Dalvik vs JVM ... sad that webasm didn't pay attention).

kragen

2 days ago

Typically I've found that stack-based bytecode uses about twice as many instructions as register-based bytecode but significantly less space. (I'm sorry that's kind of handwavy; I'd be delighted to see a more rigorous study of the question.) For Wasm I don't think the number of instructions was a concern, since that only affects the time to "load" it by compiling to native code, but size seems to have been a primary concern.

I agree about the general approach to taking advantage of the zero page, but I don't have the 6502 competency to comment on your proposed instruction sequences.

brucehoult

2 days ago

> stack-based bytecode uses about twice as many instructions as register-based bytecode but significantly less space

That's not my experience, and I don't see how it can be true, even theoretically, for typical programs.

I don't at all mind more instructions, if that works out smaller or at least not bigger. E.g. RISC-V using `c.slli; c.srli` or `c.slli; c.srai` (all 2-byte opcodes) to extract unsigned or signed bitfields instead of having a 4-byte "extract field" instruction.

It is true that a stack-based ISA can have a size advantage in evaluating very complex expressions, such as those typical in scientific / engineering computing, when you get a lot of arithmetic operators in a row with no arguments needed.

But in general purpose programs the vast majority of lines of C code have only very simple expressions. `x=x+y`. `x=x+10`. `if x<y`.

Register ISAs provide short addressing for the most frequently-used 8 or 16 or 32 variables in a function (the registers). Experience shows 8 isn't really enough, 32 is usually a little more than you need -- I think some study showed 24 was enough but you don't get a code size advantage from that, so ...

Stack-based ISAs do similarly. The JVM can load or store any of four local variables with a 1-byte instruction. For more than that, you can refer to 256 variables with a 2-byte instruction. Four really isn't enough, but 256 is a waste of code size. WASM doesn't have any option to include a variable number in the main opcode byte -- `local.get 1` is two bytes.

A very typical statement in programs is `x += y`.

JVM 4 bytes:

    iload 0  // 1A
    iload 1  // 1B
    iadd     // 60
    istore 0 // 3B
If x and y aren't in the first 4 locals then it's 7 bytes

WASM 7 bytes:

    (local.get 0) // 20 00
    (local.get 1) // 20 01
    (i32.add)     // 6A
    (local.set 0) // 21 00
ARMv7 / ARMv6-M / ARMv4T 2 bytes:

    add r0, r0, r1 // 08 44
RISC-V with C extension 2 bytes:

    add a0, a0, a1 // 2E 95
The Arm example works with 16 local variables (only the most common operations MOV, ADD, and CMP can use all 16 registers, other operations can only use 8), the RISC-V example works with all 32 registers (again only C.MV, C.ADD, C.ADDI, C.SLLI work with all 32 registers, other 2-byte opcodes use only 8).

8086, M68k, PDP-11, Super-H, MSP430 can all also do this very common operation with a 2-byte instruction, with anything from 8 to 16 local variables.

Exactly the same analysis applies to `x += 10`.

Most programs consist almost entirely of statements like these, not things like `det = a*(e*i − f*h) − b*(d*i − f*g) + c*(d*h − e*g)`

There are two main problems with stack code (and two corresponding advantages for register code):

- you need four opcodes: `load`, `load`, `add`, `store`. Each opcode needs at minimum 3 or 4 bits, so no matter what you do you're going to have 12-16 bits to specify four actions. In contrast, the register machine can just use a single `add` opcode, again using 3 or 4 bits.

- all the common stack ISAs use a minimum of 8 bits for an instruction (even the extremely common `iadd` or `i32.add`) and use a multiple of 8 bits for every instruction. Four instructions are going to be a minimum of 4 bytes.

- the register machine needs only one 3 or 4 bit opcode for `add` AND can pack the other operands arbitrarily into a 16-bit instruction, using whatever field size and position makes sense.

- the register machine needs to mention the dst/rs1 variable only once.

Stack machine code could be improved in size a little if you bit-packed the instructions, but you'll still have the problems of needing more opcode fields and needing to repeat variable names.

Perhaps ironically, some accumulator machines do better. For example the much-hated PIC microcontroller actually does this well:

    MOVF  0x21, W    ; Load y (from address 0x21) into W register
    ADDWF 0x20, F    ; Add W (containing y) to x (at address 0x20), store result in x
Instructions such as ADDWF have a bit to specify whether to put the result of the add into W (the accumulator) or the source register.

On the smaller chips each of those instructions is 12 bits (with 32 registers, many with special purposes), though those are very limited and most applications use mid-range chips with 14 bit instructions (allowing 128 directly accessible registers).

So this is 24 or 28 bits of code, still not as compact as the 16 bits for that extensive list of register machines: RISC-V, Arm Thumb, 8086, M68k, PDP-11, Super-H, MSP430.

kragen

2 days ago

Thank you! This is indeed very thought-provoking!

It certainly does depend on which stack machine. I'm surprised that Wasm doesn't have a one-byte encoding for (local.get 0) or (local.set 0)! As you point out, even the JVM has one. My survey of bytecode instruction sets in https://dercuano.github.io/notes/tiny-interpreters-for-micro... found such an 8-bit or 5-bit instruction in nearly every example except CPython, with a 3-to-5-bit operand field in the local.get and local.set instruction byte, but of course Wasm didn't exist in 02007. The JVM only offering 2 bits is atypically stingy.

RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502. The comparable stack-machine code would be something like Chuck Moore's x18/F18A with its four instructions per 18-bit word or James Bowman's J1A.

It also depends, as you say, on the code. Most code is somewhere in between the extremes of x += y and your example of

    det = a*(e*i − f*h) − b*(d*i − f*g) + c*(d*h − e*g)
with statements like these:

  dstRect.bottom := Max(dstRect.bottom,dstRect.top + txMinHeight);
  thePort^.grafProcs := Nil;   { restore to normal }
  if (n != 0 && --n != 0) ... 
  target_command = entry_i + CAM_CONTENT_COUNT * j;
  tfd = open(ptr, O_RDWR|O_TRUNC|O_EXCL|O_CREAT, 0600);
  for (size_t i = 1; i < n; i++) ...
  if (event == EV_enter_state) ... else if (event == EV_tick) ... else if (event == EV_reenter_state) ...
Those are all taken from my examples in https://dernocua.github.io/notes/c-stack-bytecode.html. There are a lot of subroutine calls, some reuse of the same value in multiple computations, many more variable reads than writes, and an occasional infix expression with more than one operator. These are all factors that favor stack instruction sets more than x += y does.

Subroutine calls in particular are typically a single bytecode once all the operands are on the stack, referring to a subroutine name or address stashed elsewhere with a short literal index, which similarly is usually allocated a 3–5-bit field in the opcode byte. Very popular subroutines like car or #at:put: can get their own bytecode.

There's a bit of a difference between the 8 registers in the PDP-11 or the 8086 and the 3-bit local variable field in an Emacs Lisp bytecode byte. The stack pointer soaks up one of those slots; on the PDP, the program counter soaks up another. (On ARM the link register sometimes is a third. Thumb and RVC do better here by excluding these registers from their 3-bit namespaces.) Analogously, one of the 8 local-variable indices in Elisp is "see next byte". Also, though, stack machines don't need to spend local variable indices on temporaries, and they often have instance-variable-access opcodes as well. So in effect register machines have fewer local variables than it at first appears.

Incidentally, as that note mentions, Aztec C for the 6502 did support a bytecode interpreter to more than halve code size:

> As an alternative, the pseudo-code C compiler, CCI, produces machine language for a theoretical machine with 8, 16 and 32 bit capabilities. This machine language is interpreted by an assembly language program that is about 3000 bytes in size.

> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.

But I don't know whether it was a stack bytecode, an accumulator bytecode, a register bytecode, or something weirder.

My experience programming in FORTH is that often I can profitably keep about one variable on the operand stack, so accessing that variable takes no instructions and no bytes. In theory this is something that a compiler could do, but the only case I know of is when you store to a variable in Smalltalk inside a larger expression. Similarly it's common in FORTH to spend no instructions on accessing arguments or returning a value; you just don't drop it before returning.

brucehoult

a day ago

With respect to your tiny interpreters page, I'd just like to point out that Thumb code for recursive fib() is 26 bytes, very close to Squeak (this is out of GCC with -Os, I haven't checked if I can do better manually).

    00010448 <fib>:
       10448:       b570            push    {r4, r5, r6, lr}
       1044a:       0004            movs    r4, r0
       1044c:       2500            movs    r5, #0
       1044e:       2c01            cmp     r4, #1
       10450:       dd05            ble.n   1045e <fib+0x16>
       10452:       1e60            subs    r0, r4, #1
       10454:       f7ff fff8       bl      10448 <fib>
       10458:       3c02            subs    r4, #2
       1045a:       182d            adds    r5, r5, r0
       1045c:       e7f7            b.n     1044e <fib+0x6>
       1045e:       1960            adds    r0, r4, r5
       10460:       bd70            pop     {r4, r5, r6, pc}
RISC-V is 30 bytes using newish instructions found on e.g. RP2350

    00010186 <fib>:
       10186:       b872                    cm.push {ra,s0-s2},-16
       10188:       842a                    mv      s0,a0
       1018a:       4481                    li      s1,0
       1018c:       4905                    li      s2,1
       1018e:       00895863                bge     s2,s0,1019e <fib+0x18>
       10192:       fff40513                addi    a0,s0,-1
       10196:       3fc5                    jal     10186 <fib>
       10198:       1479                    addi    s0,s0,-2
       1019a:       94aa                    add     s1,s1,a0
       1019c:       bfcd                    j       1018e <fib+0x8>
       1019e:       00940533                add     a0,s0,s1
       101a2:       be72                    cm.popret       {ra,s0-s2},16
The instructions are identical, the difference comes from RISC-V not being able to encode 3-address adds in 2-byte instructions while Arm can.

Falling back to original 2015 RV32IC (no newfangled extensions) adds 2 bytes to 32:

    0001018a <fib>:
       1018a:       050002ef                jal     t0,101da <__riscv_save_0>
       1018e:       842a                    mv      s0,a0
       10190:       4481                    li      s1,0
       10192:       4905                    li      s2,1
       10194:       00895863                bge     s2,s0,101a4 <fib+0x1a>
       10198:       fff40513                addi    a0,s0,-1
       1019c:       37fd                    jal     1018a <fib>
       1019e:       1479                    addi    s0,s0,-2
       101a0:       94aa                    add     s1,s1,a0
       101a2:       bfcd                    j       10194 <fib+0xa>
       101a4:       00940533                add     a0,s0,s1
       101a8:       a899                    j       101fe <__riscv_restore_0>
MSP430 comes to 38 bytes, primarily because all literals and offsets are a full 16 bits.

    00000000 <fib>:
       0:   1a 15           pushm   #2,     r10     ;16-bit words
       2:   0a 4c           mov     r12,    r10     ;
       4:   49 43           clr.b   r9              ;
    
    00000006 <.L3>:
       6:   5c 43           mov.b   #1,     r12     ;r3 As==01
       8:   0c 9a           cmp     r10,    r12     ;
       a:   00 34           jge     $+2             ;abs 0xc
       c:   0c 4a           mov     r10,    r12     ;
       e:   3c 53           add     #-1,    r12     ;r3 As==11
      10:   b0 12 00 00     call    #0              ;
      14:   3a 50 fe ff     add     #-2,    r10     ;#0xfffe
      18:   09 5c           add     r12,    r9      ;
      1a:   30 40 00 00     br      #0x0000         ;
    
    0000001e <.L2>:
      1e:   0c 4a           mov     r10,    r12     ;
      20:   0c 59           add     r9,     r12     ;
      22:   19 17           popm    #2,     r10     ;16-bit words
      24:   30 41           ret                     
But as mentioned in another message it's a very simply-encoded ISA which would be easy to write an interpreter for e.g.

    0a 4c           mov     r12,    r10
    
    0: register addressing mode, Word size
    a: dst = r10
    4: mov
    c: src = r12

kragen

a day ago

Both the RISC-V and Thumb code here are no longer recursive fib; GCC has quite correctly transformed one of the recursions into a normal nonrecursive loop by exploiting the commutative and associative properties of integer addition. I infer that perhaps the same thing may be true of the MSP430 code from the presence of only a single call instruction, but I don't see the loop; perhaps the br to 0 will be relinked to go to .L3 before execution?

I haven't seen MSP430 code before, due to my somewhat embarrassing and unfair distaste for TI as a company, and I agree that it is a very nice encoding indeed!

If you insist that the code contain at least two recursive calls, two subtractions, an addition, and a conditional for the base case, even Thumb2 and RVC no longer look so competitive with stack machines. If not, there is no particular reason to leave one of the recursions in; you may as well optimize that out too, thus reducing an exponential-time algorithm to a linear-time one (or linear to logarithmic, if the parameter is the number of bits of input.) This also makes it smaller, something like (untested):

      mov r1, #1
      mov r2, #1
  1:  subs r0, #1
      itt lo
      movlo r0, r1
      bxlo lr
      mov r3, r2
      adds r2, r1
      mov r1, r3
      b 1b
I think that's maybe 22 bytes?

But I don't think that tells us much about the relative densities of stack code, register code, accumulator code, and belt code, because it's a very different mix of operations.

brucehoult

a day ago

> Both the RISC-V and Thumb code here are no longer recursive fib

If the programmer wrote a doubly-recursive fib() and the compiler transformed one to tail-recursion, then that's a doubly-recursive fib() as far as I'm concerned. If you don't want that to happen then use a benchmark that isn't so easily optimised, such as maybe Ackermann.

Note that this optimisation doesn't change the calculation performed or the execution time from being exponential, it merely saves some stack manipulation.

Having access to good compilers is part of the attraction of using a real ISA instead of a made-up one.

Changing the algorithm to an iterative linear-time form one is not an optimisation I'd expect a compiler to do.

kragen

a day ago

I agree, but what we're trying to look at here is how space-efficient the ISA is at encoding computational operations. If one of the example programs encodes a totally different set of computational operations, it stops being an informative comparison.

But, you may reasonably object, this is handwaving! If a subroutine call is a "computational operation", why isn't duplicating a stack item or storing a variable? The stack code is cheating by not storing all its temporaries into variables! And I think the answer is that there's a bit of a slippery slope there, where more extensive program transformations give us less interesting answers. I would argue that transformations like not allocating registers to temporaries (or, say, subroutine arguments) are more broadly applicable than the transformation you're exhibiting, where the compiler has changed a non-tail call to dumbfib, followed by an addition, into an addition followed by a tail-recursive call implemented as a jump. That depends on being able to reassociate the entire chain of additions through the entire call chain, which is something I wouldn't have expected a compiler to be able to do.

I agree that it's more interesting to see benchmarks that aren't so easily gamed by optimizers, which is why in https://dernocua.github.io/notes/c-stack-bytecode.html I selected a bunch of small C and Pascal subroutines quasi-randomly and hand-compiled them into different candidate stack bytecode representations. I'm not sure that Ackermann or Tak would be as interesting.

brucehoult

a day ago

I have my own toy benchmark too :-)

https://hoult.org/primes.txt

kragen

a day ago

That seems like a significantly better microbenchmark than dumbfib!

brucehoult

13 hours ago

The main weakness is -- opposite of dumbfib -- it has no function calls at all!

But I find it correlates well with other CPU core & L1 cache/SRAM benchmarks such as Dhrystone and Coremark. The Embench guys added a cut-down version -- smaller arrays, shorter execution time [1] -- to their benchmark suite a couple of years back.

[1] baseline ~4s on typical microcontrollers ... I'd started at ~5s on Ivy Bridge, 10-30s on the Arm boards I had back then

kragen

9 hours ago

It does have loops, integer multiplication, and array indexing (mostly reading), and all of those are pretty important.

Subroutine calls are particularly difficult to benchmark because they are so critical to real-world performance and so easy for compilers to remove from a benchmark if they can guess that a particular callsite will be hot. (LuaJIT takes this to the extreme.)

You'd think non-tail recursion would stymie this, but I've seen GCC unroll a recursive loop several layers deep and constant-fold away several layers over the base case. Impressive, useful, and completely invalidated my dumb benchmark.

I would guess that most programs have a subroutine call on the order of once every 100 instructions, spending something like 15% of their time on subroutine call and return. This is a very difficult number to quantify because of the pervasive indirect costs of subroutine calls: less efficient register allocation, forgone opportunities for constant folding and specialization, objects stored in memory rather than in registers, allocation on the heap rather than a stack, etc.

kragen

a day ago

That's 24 bytes because gas is using Thumb2 mov.w instructions to initialize r1 and r2. It took me a long time to figure out that the problem was mov vs. movs; this is the correct 20-byte version:

            .cpu cortex-m4
            .thumb
            .syntax unified
            .thumb_func
            .globl fib
    fib:    movs r1, #1
            movs r2, #1
        1:  subs r0, #1
            itt lo
            movlo r0, r1
            bxlo lr
            mov r3, r2
            adds r2, r1
            mov r1, r3
            b 1b
I swear, I keep making the same n00b mistakes, year after year!

It might be possible to get it down to 18 bytes, but I don't see how.

brucehoult

2 days ago

> My survey of bytecode instruction sets in https://dercuano.github.io/notes/tiny-interpreters-for-micro...

Oh, cool, I'll take a look. Having started with the Apple ][ in 1980 I'm already well aware of SWEET16 and admire its effectiveness, its integration with native code, and the small size of the interpreter.

> it's common in FORTH to spend no instructions on accessing arguments or returning a value

Yes, good point. Stack machines might not be good for code inside a function, but they often are good on function call/return, if you can arrange things to not need extra stack shuffling. Again, this matches very well to evaluating complex arithmetic expressions where some of the operations are function calls not inline code.

RISC-V has recent extensions -- so far the RP2350 is the most common chip that implements them -- that optimise function prologues and epilogues, and marshalling function arguments to S registers and S registers to function argument registers for a call. See the Zcmp extension.

https://docs.openhwgroup.org/projects/cva6-user-manual/01_cv...

Also, Zcmt provides optimised 2-byte jumps and calls to up to 256 well-known arbitrary entry points: 32 jumps and 224 calls.

> RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502.

I wasn't thinking of 6502. I don't think either JVM or WASM runs there :-) If you're JITing then decoding complexity really doesn't matter much, providing it doesn't increase the VM's code size too much.

I've a couple of times worked on some notes and the beginnings of an interpreter for MSP430 on 6502. I think it can be relatively small and relatively fast. There are very few instructions, and the encoding is very regular. And there is a good optimising GCC for it.

If you're not familiar with it, think "PDP-11 extended to 16 registers and a 4-bit opcode (plus byte/word) instead of 3 bits, by trimming src addressing modes to 4 (Rs, @Rs, @Rs+, 0xNNNN(Rs), plus immediate and absolute (OK, PC-rel) by punning on PC) and trimming dst addressing modes to just 2 (Rd, 0xNNNN(Rd), plus the punning on PC)".

Decoding is fairly 6502-friendly, with the src and dst registers in the low 4 bits of each byte of the instruction, the opcode in the high 4 bits of one byte, and the two addressing modes and B/W flag in the high bits of the other byte.

If you keep the hi and lo bytes of each register in separate arrays in Zero Page then you don't even need to shift the register numbers -- just mask them off and shove them into the X & Y registers. Similarly, the opcode can just be masked and then shoved into a location that is the bottom half of a modified JMP or JSR, or a JMP (zp), giving 16 bytes of code to get started. Or shift it once if 8 bytes is enough. Similarly, the two addressing modes and B/W flag don't need to be parsed with shifting and masking (at least if you care about speed more than interpreter size); you can just mask that byte and also use a jump table to "decode" it.
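
A sketch of that dispatch (names hypothetical: the fetched instruction sits in zero-page inslo/inshi, reglo/reghi are the split register arrays, and handlers is a page-aligned block with 16 bytes of code per opcode):

    decode:
        lda inslo      ; low byte: modes and B/W in the high bits...
        and #$0F
        tax            ; ...dst register number in the low nibble
        lda inshi      ; high byte: opcode in the high nibble...
        and #$0F
        tay            ; ...src register number in the low nibble
        lda inshi
        and #$F0       ; opcode << 4, i.e. 16 bytes per handler
        sta vec+1      ; self-modify the low byte of the JMP target
    vec:
        jmp handlers   ; page-aligned, so this lands on the right stub
A register-to-register MOV handler can then be as short as lda reglo,y / sta reglo,x plus the same for the high bytes.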

--

Also, there is an intermediate stage between full interpretation and full JITing. I don't know if you've looked at the "Spike" RISC-V reference emulator. It doesn't normally decode instructions, it hashes them to look up a pre-decoded form with src and dst registers and any immediate/offset decoded into a struct, along with a pointer to the code to interpret the instruction. The hash table size is 8000 and that gets a good hit rate on most code. Of course if there is a hash table miss then the instruction is actually decoded the hard way and inserted into the hash table.

This wouldn't be a good fit for an 8 bit micro (unless you knocked the hash table size down a lot) but would be fine with even 1 MB of RAM.

--

>> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.

I'll just point out again that the native code technique I showed in my original message up-thread cuts a 16 bit register-to-register add down from 13 bytes of code to 7 bytes (or fewer for sequential operations on the same register) at a cost of increasing execution time from 20 cycles to 42 i.e. not a 5x-20x slowdown but only a 2.1x slowdown.

For a 32 bit add it reduces the code size from 25 bytes to 7, and increases the execution time from 38 cycles to 66, a 1.7x slowdown.

Well, I don't know how "native" the native code compilation for C65 is. I'm assuming all operations are inline for speed.

kragen

a day ago

I'm interested to hear what you used SWEET16 for.

cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency? How do they affect page fault handling in practice?

If you are going to add non-RISC features to your ISA to ensmallen prologues and epilogues, even at the expense of interrupt latency, another possibility is to add semantics to the call and return instructions. Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.

The MSP430 sounds very appealing in some ways, and Mecrisp is a usable self-licking development environment for it. Too bad it's 16-bit. How small is the smallest FPGA implementation? OpenMSP430 says it's 1650 6-LUTs.

I had no idea about the Spike hash table approach. It's a totally new approach to emulation to me. It makes a lot of sense especially in the RISC-V context where EEs threw the instruction word layout in a blender to reduce their propagation delays—a decision I endorse despite my initial horror.

brucehoult

a day ago

> I'm interested to hear what you used SWEET16 for.

Just generic coding that needed 16 bit calculations but wasn't speed-critical. Pointer-chasing code is particularly tiresome on 6502.

> cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency?

They are designed to be easily interruptible / restartable with a variety of possible implementations (non-interruptible, complete restart, continue where you left off). The tradeoff is up to the designer of the individual CPU. A basic idea is to hold them in the decoder and emit normal instructions to the execution pipeline, while simplifying the original instruction towards a base case. A correct (interrupt-safe) sequence of normal instructions to generate is given in the specification.

Also, that code didn't show other instructions to copy `a0` and `a1` to two arbitrary `sN` registers, or to copy two arbitrary `sN` registers to `a0` and `a1`. This helps a lot both implementing and calling functions with two or more arguments, if you're not using arithmetic to marshal the arguments anyway.

> How do they affect page fault handling in practice?

It is unlikely that anyone will ever make a machine with both Zcmp (and Zcmt) and page faults. Architects of high-performance CPUs would have a fit, and besides which they use the same opcode space as RVC instructions for floating point load/store, making them incompatible with standard Linux software.

> Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.

VAX, begone!

As Patterson showed, JSR subroutines outperform CALLS/G subroutines.

kragen

a day ago

I see, thanks!

Yes, VAX CALLS/CALLG is pretty much what I'm talking about, but with arguments in registers.

I think the performance landscape has changed somewhat since Patterson; if your instruction decoder is shattering each instruction into a dependency graph of separately executable micro-ops anyway to support OoO, you don't have to slow down your whole pipeline to support elaborate instructions like cm.push or CALLG. As I see it, though, a couple of things haven't changed:

- Having arguments in registers is faster than having arguments in memory.

- Page fault handling is unfriendly to architectural instructions having multiple effects, because the page fault can't be handled at the micro-op level; it has to be handled at the architectural level. But maybe this is a smaller problem with OoO and especially speculative execution, I don't know.

If you can safely get by with a complete restart of the instruction, though, the page fault problem goes away.

I think that solves the problems Patterson identified?

cmrdporcupine

5 days ago

"I found it endearing that to end an IF block you used FI (IF spelled backwards) and to end a DO block you used OD. That is some interesting symmetry although I’m not really sure it helps readability."

This comes straight from Algol if I'm not mistaken. It seems weird to us now (though Bourne shell / bash kinda has this in spots) but it was in the air in the 60s/70s.

When I've looked at it in the past I definitely got the sense that Action was very much inspired by Algol-68, but with some accommodations for the niche of the 6502.

6502 is a terrible target for C (and even Pascal) compilation, I have often wondered if it made sense for someone to try and revive Action for the 21st century as a general purpose 6502 high level PL.

rootbear

5 days ago

That was from Algol 68. Algol 60 used BEGIN/END blocks when the body of a do loop (or a then or else block, etc.) had more than one statement. Bash was influenced by Algol 68.

hmmokidk

5 days ago

I like it because I definitely spend brain cycles trying to figure out what is being closed!

It’s like HTML hah.

rtpg

5 days ago

I have this theory that Go tickles people because, like Basic or something like Action, it has all of these sorts of abstraction ceilings that lead to "straight down the middle" procedural code.

Definitely leads to a feeling of velocity. I don't like the language that much but I do get the fun from that feeling!

jdndnfbdn

4 days ago

I personally do not get the feeling of velocity, since I need to struggle with the language and tooling to get the most basic things done.

However I love the ease of cross compiling and distributing cross compiled binaries that much that Go is still the best tool for some jobs

jhallenworld

4 days ago

I never had Action!, but I did try Deep Blue C for the Atari 800. One issue was the lack of curly braces in the character set... I think it used $( and $) instead.

https://www.atariarchives.org/APX/showdocs.php?cat=20166

There were some other weird 8-bit programming languages: PL/M and MPL. These are PL/I clones for the 8080 and 6800. I used MPL long ago and wrote an article about it:

https://github.com/jhallen/exorsim/tree/master/mpl

PL/M was much more popular and better:

https://en.wikipedia.org/wiki/PL/M

smackeyacky

4 days ago

You could argue that the “best” programming environment available for DOS machines was dBase. dBase III in particular. For storing rows of data and building text interfaces it was very impressive. Not for games, but for the information systems that ran a lot of small businesses back then.

NetMageSCW

4 days ago

When I started college my computer was an Atari 400 I had purchased with money from my after-school job programming commercial software. I had replaced the membrane keyboard with a third party “real” keyboard and had the Action! cartridge. I used the editor primarily to write papers for class printed on my dot-matrix electric arc printer (it produced the dots by scorching the paper with a tiny carbon electrode thus needing no ink, though the print was brown and lower contrast). Sadly I sold the 400 to someone I knew for their child and loaned the Action! cartridge to a co-worker who moved across the country and we lost touch. I’ve been searching eBay for years for a copy of the manual as well - at one time I had an ambition to create a version for iPad or Windows.

brianpaul

3 days ago

I loved programming in Action! The editor was great and both compilation and runtime were really fast. I used it for several years from high school into college until I got an Amiga. I wrote a paint program, 3D modeler and 3D renderer with Action! No floating point. Fixed-point math with lookup tables for sin/cos/etc.

jasperry

5 days ago

Another 8-bit "better than BASIC" language was COMAL. Similar to the language in the article, it also had structured programming constructs, and the C64 version had built-in turtle graphics, sprite, and sound commands. I remember picking a version up at a mall kiosk that sold PD disks and it expanded my horizons!

jjwiseman

4 days ago

Same here. I knew BASIC very well and had done some 6502 assembly, but the COMAL disk I picked up somewhere for the C64 was my first exposure to structured programming, and it definitely expanded my young mind.

pikeangler

4 days ago

There was the fantastic COMAL 80 cartridge for the Commodore 64 with high-level commands for sound and graphics, and a full-screen editor. It bank-switched so you could use the full 64K RAM on the C64.

charcircuit

5 days ago

This article doesn't really prove why it's the best. I feel like if it's the best it would have been ported to more systems.

wk_end

5 days ago

I think it hinges on:

  The Action! language may not have been as advanced as C or Pascal, but because it was designed with the 6502 CPU in mind, compiling the language was astonishingly fast.

  The original Atari Pascal system from APX needed multiple disk drives and could take several minutes to compile a small program. The only C package available in 1983 (Deep Blue C) was at least as limited as Action!, but also not an integrated package and compiled slowly. Draper Pascal only compiled to pseudo-code.

  Action! compiled your program to machine code in memory and in seconds. Typing C (to compile) and then R (to run) was hardly slower than just typing RUN in BASIC.
So less about the language itself (unless it had some particular properties that facilitated compiling it quickly) and more about the tooling.

buescher

5 days ago

Micro-SPL was cut down from the HP SPL systems programming language to run on the Xerox Alto, in microcode, and was ported pretty directly to the 6502 as Action! The Action! folks were pretty coy about this back in the eighties.

kragen

a day ago

https://en.wikipedia.org/wiki/Systems_Programming_Language?

  BEGIN
     INTEGER A:=0, B, C:=1;
     PROCEDURE N(X,Y,Z);
         INTEGER X,Y,Z;
         X:=X*(Y+Z);
     FOR B:=1 UNTIL 20 DO
         N(A,B,C);
  END.
The "Action!" code samples don't look like a dialect of this, but of a different branch of the Algol tree:

  PROC hello()
  ; This is a comment.
    DO
      PrintE("Goto 10")
    OD
  RETURN

  PROC timestwo()
    CARD i=[0],j
    DO ;start of DO - OD loop
      i==+1     ;add 1 to
      j=i*2     ;set 'j' equal to i*2
      PrintC(i)                 ;**** See the following
      Print(" times 2 equals ") ;PROGRAMMING NOTE for
      PrintCE(j)                ;an explanation
    OD ;end of DO - OD loop
  RETURN

  PROC hithere()
    BYTE ctr       ;counter used in FOR loop
    FOR ctr=1 TO 5 ;this FOR loop has no 'STEP', so
      DO           ;an increment of 1 is assumed.
      PrintE("Hi there")
      OD
  RETURN
The divergences I see that gave me this idea:

1. SPL used PROCEDURE to define procedures, and "Action!" uses PROC.

2. SPL used semicolon to separate statements, and "Action!" evidently used newlines or some kind of Lua-like rule, instead using semicolon for comments.

3. SPL used := for assignment statements (and FOR loops), and "Action!" used =.

4. In SPL, as in Algol 60 or Pascal, the body of a FOR loop was evidently a statement, using DO as a keyword to separate the loop setup from the body, so if you wanted a multi-statement loop you would presumably use a BEGIN END block, while "Action!" used Algol-68-style DO-OD (and IF-THEN-ELSEIF-ELSE-FI).

5. SPL used the UNTIL keyword to introduce upward-counting loop limits, while "Action!" used TO.

6. "Action!" seems to have used =[0] to initialize a variable to zero, while SPL used =0; in "Action!", =0 would have put the variable at address 0. Most versions of Algol, annoyingly including Pascal, had no way to initialize variables at their declaration. Presumably, if "Action!" were derived from SPL, it would have used the SPL syntax to initialize variables, and some other syntax to specify their location, like =@0 or =^0 or =[0] or something.

7. Also, not seen here, "Action!" evidently spelled INTEGER as INT.

There are a number of other differences on view here, such as the use of BYTE and CARD (unsigned) variables, the absence of the program-terminating period character, the presence of an augmented assignment operator ==+, and having the top-level syntactic construct be a procedure rather than a block, but those seem to me like the kinds of things you might change because you were building an IDE or because you were running on a tiny microcomputer. By contrast, these purely syntactic choices seem to me to be freely chosen on the basis of aesthetics or tradition, with almost no technical advantages or disadvantages accruing to any possible choice.

However, the Micro-SPL manual is at http://www.bitsavers.org/pdf/xerox/alto/micro-spl/Micro-SPL_.... On points of difference 1, 2, 4, and 7, it is similar to "Action!" rather than SPL, though it does follow SPL on point 3. As for points 5 and 6, Micro-SPL didn't have FOR loops or variable initialization, explaining the syntactic divergence. Also the Micro-SPL manual is by Henry Baker and "Action!"'s author Clinton Parker, who apparently confirmed the identity on a podcast in 02015: https://ataripodcast.libsyn.com/antic-interview-111-clinton-...

So I guess you're right that "Action!" is indeed Micro-SPL in a trenchcoat. Sure fooled me!

As for compilation speed, I don't think there's anything particularly special about "Action!"'s syntax or semantics that would make it especially fast to compile. Not using a textual preprocessor like C, maybe. And it's probably an advantage that most statements are introduced by a keyword so that simple recursive-descent parsing (with a separate tokenizer) is all you need, but that's more about the ease of writing the compiler than about its performance.

This was the same time Baker was working on COMFY-65, a more interesting language that got less adoption because it was not released for decades.

buescher

2 hours ago

The Action! source is out there and it’s very easy to read. I forget if it mentions Micro-SPL. I’ve long thought it would be an interesting language for embedded: maybe in some alternate timeline something like Action! was viable for 90s-00s microcontrollers that were poor targets for C.

kragen

2 hours ago

That's a good point. PICs are still poor targets for C actually. sdcc has a bunch of flags to only pay the cost of re-entrancy for the functions where you really need it, but "the usual arithmetic conversions" can be a serious performance problem on 8-bitters too.

charcircuit

5 days ago

I missed the "was" in the title. In the last 40 years a lot has changed, so I don't think Action would be the best anymore.

jnaina

4 days ago

Blast from the past. Still have my Action cartridge.

veltas

5 days ago

Any relation to ActionScript?

hn_acc1

5 days ago

ActionScript came ~10-15 years later. I would be very surprised if there was any relation.

cosmotic

5 days ago

ActionScript is based on ECMA/JavaScript

fourthark

5 days ago

Originally it was based on HyperTalk, then it switched to ECMA later on.

jjtheblunt

5 days ago

Not historically, but it merged back into implementing ECMA/JavaScript later. It predates JavaScript by years in its earlier revisions.

veltas

4 days ago

That's why I asked, fishing for surprising but interesting information.