jsheard
4 days ago
I'm curious to see what the performance impact of using 64-bit memory ends up being. WASM runtimes currently use a clever trick on 64-bit host platforms where they reserve a solid 4GB block of virtual memory up front, and then rely on the MMU to catch out-of-bounds accesses by faulting on unmapped pages, so bounds checking is effectively free. That won't work with 64-bit WASM though, so runtimes will have to fall back to old-fashioned explicit bounds checking on every single heap access.
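Roughly, the difference between the two strategies looks like this (a Python model of the semantics only, not of a real runtime; all names and sizes are illustrative):

```python
# Illustrative model of the two bounds-checking strategies.
# With the wasm32 guard-page trick, the runtime reserves enough virtual
# address space that every possible 32-bit index lands inside the
# reservation; out-of-bounds accesses hit unmapped pages and fault.

MEM_SIZE = 64 * 1024   # one committed 64KiB wasm page (illustrative)
RESERVED = 2**32       # 4GiB reservation covering the wasm32 index range

def load_with_explicit_check(memory: bytearray, index: int) -> int:
    """What a wasm64 runtime must do: an explicit branch on every access."""
    if index >= len(memory):
        raise IndexError("out-of-bounds wasm memory access")
    return memory[index]

def load_with_guard_pages(memory: bytearray, index: int) -> int:
    """Model of the wasm32 trick: the check is done by the MMU for free.
    Any 32-bit index is inside the reservation; touching an uncommitted
    page traps (modeled here by raising MemoryError)."""
    assert 0 <= index < RESERVED    # guaranteed by 32-bit zero-extension
    if index >= len(memory):        # uncommitted page -> hardware fault
        raise MemoryError("page fault (caught by the runtime's signal handler)")
    return memory[index]
```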
pstoll
4 days ago
I also wonder what the perf overhead will be for programs that only need i32. I didn't dig deep enough into the implementation, but these types of variadic runtime options often cause perf regressions because of all the "ifs" / lookups of type configs/parameters, etc. I just imagine some inner loops where mem.get adds a few instructions from the "if u32/u64" dispatch.
Unless it ends up being seamless / no cost via loading different u32/u64 implementations.
I mostly agree with the old c++ mantra of - no feature should have a runtime cost unless used.
jsheard
4 days ago
I would hope it's set up such that the runtime always knows whether a memory access is 32-bit or 64-bit at JIT-time, so it can generate unconditional native code. Runtimes that use an interpreter instead of a JIT might be hurt by having to branch every time, but interpreters being slow is already kind of a given.
cornstalks
4 days ago
Yes, the bitcode specifies whether a memory operation/address uses 32-bit or 64-bit addressing, so a runtime/JIT can determine this ahead of time during the static analysis phase.
wild_pointer
4 days ago
Can't the same trick be used for slightly more bits? 64-bit uses 48 bits nowadays AFAIK, which is 256 TB. It's usually 50/50 for user/kernel mode. If, say, WASM takes half of the user mode address space, it's still 46 bits which is 64 TB (ought to be enough for anybody?). Or maybe I'm way off, I don't really know the specifics of the trick you're referring to.
Findecanor
4 days ago
The trick takes advantage of 32-bit registers automatically being zero-extended to 64 bits. It actually uses 8GB of reserved address space, because WASM's address mode takes both a "pointer" and a 32-bit offset, making the effective index 33 bits. On x86-64, there are address modes that can add those and the base address in a single instruction.
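The arithmetic behind the 8GB figure, sketched in Python (constants taken from the comment above):

```python
# wasm32 address mode: effective index = 32-bit pointer + 32-bit offset.
# Both operands max out at 2^32 - 1, so the sum can reach 2^33 - 2: a
# 33-bit value, hence the (nearly) 8GiB reservation.

MAX_U32 = 2**32 - 1

def effective_index(ptr: int, offset: int) -> int:
    assert 0 <= ptr <= MAX_U32 and 0 <= offset <= MAX_U32
    # On x86-64 this add folds into a single address mode, no branch.
    return ptr + offset

assert effective_index(MAX_U32, MAX_U32) == 2**33 - 2  # needs ~8GiB reserved
```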
When that trick can't be used, I think the most efficient method would be to clamp the top of the address so that the max would land on a single guard page. On x86-64 and ARM that would be done with a `cmp` and a conditional move. RISC-V (RVA22 profile and up) has a `max` instruction. That would typically be one additional cycle.
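That clamping scheme, modeled in Python (the guard index and memory size are illustrative; the conditional select stands in for cmp + cmov or a Zbb min/max-type instruction):

```python
# Clamp any out-of-range index onto a guard page at the top of the
# reservation, so the hardware fault still performs the bounds check
# without a branch in the hot path.

GUARD_INDEX = 2**33   # first index inside the guard region (illustrative)

def clamp_index(idx: int) -> int:
    # cmp + cmov on x86-64/ARM; a single instruction with RISC-V Zbb.
    return idx if idx < GUARD_INDEX else GUARD_INDEX

def load_clamped(memory: bytearray, idx: int) -> int:
    idx = clamp_index(idx)
    if idx >= len(memory):   # stands in for the guard-page hardware fault
        raise MemoryError("guard page fault")
    return memory[idx]
```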
The new proposal is for using a 64-bit pointer and a 64-bit offset, which would create a 65-bit effective index. So neither method above could be used. I think each address calculation would first have to add and check for overflow, then do the bounds check, and then add the base address of the linear memory.
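That three-step sequence, sketched in Python (a model of the semantics, not generated code; Python's unbounded integers make the 65th bit explicit):

```python
# wasm64: a 64-bit pointer plus a 64-bit offset can overflow 64 bits,
# so the runtime must check the add, then bounds-check, then add the
# linear-memory base.

MAX_U64 = 2**64 - 1

def wasm64_load(memory: bytearray, ptr: int, offset: int) -> int:
    ea = ptr + offset
    if ea > MAX_U64:            # the 65th bit set: the add overflowed
        raise IndexError("effective address overflow")
    if ea >= len(memory):       # explicit bounds check
        raise IndexError("out-of-bounds wasm memory access")
    return memory[ea]           # indexing models the base-address add
```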
Joker_vD
3 days ago
> RISC-V (RVA22 profile and up) has a `max` instruction.
You know, it's kind of insane that things like this are nigh impossible to learn from the official RISC-V site. Searching the site itself doesn't yield anything; you have to go to the "Technical > Specifications" page, which has PDFs on the basic ISA from 2019 but not on the newer frozen/ratified extensions, plus a link to the "RISC-V Technical Specifications" page on their Atlassian subdomain. There you can find a link to a Google doc, "RISC-V profiles", where you will learn that, indeed, the Zbb extension is mandatory for the RVA22U64 profile (and that, apparently, there is no RVA22U32 profile).
And of course, you have to already know that Zbb is the extension that has the max/min/minu/maxu instructions in it; grepping for "min" or "max" won't find you anything. Because why bother listing the mnemonics of the mandatorily supported instructions in a document about the sets of instructions with mandatory support? If someone's reading a RISC-V document, they've obviously already read all the other RISC-V docs that predate it; that's the target audience, after all.
camel-cdr
3 days ago
There is ongoing work on addressing this.
For one, all of the specs are getting merged into a unified isa-manual; you can get the latest version from GitHub: https://github.com/riscv/riscv-isa-manual/releases
The other project is the riscv-unified-db: https://github.com/riscv-software-src/riscv-unified-db The goal is that this database will hold all specification information in a machine readable format.
That allows you to e.g. generate a manual for RVA22 including the ISA specifications: https://riscv-software-src.github.io/riscv-unified-db/pdfs/R...
Or a website where you can search for instructions and get all of the relevant information about them: https://riscv-software-src.github.io/riscv-unified-db/exampl...
The above two links are still WIP, so there are still a lot of things missing.
BeeOnRope
3 days ago
> When that trick can't be used, I think the most efficient method would be to clamp the top of the address so that the max would land on a single guard page.
If you are already doing a cmp + cmov, wouldn't you be better off just doing a cmp + jmp (to OOB handler)? The cmp + jmp can fuse, so it's probably strictly better in an execution cost sense, plus it doesn't add to the critical data-dependent chain of the load address, which would otherwise add a couple of cycles to the address data chain.
Of course, it does require you have these landing pads for the jmp.
lossolo
3 days ago
cmp + jmp has a branch, which can be mispredicted, and then you may pay 5-10x more than for cmp + cmov.
BeeOnRope
3 days ago
They won't be mispredicted nor take predictor resources since the default prediction is "not taken" and these branches are never taken (except perhaps once immediately before an OOB crash if that occurs). So they are free in that sense.
Retr0id
3 days ago
A saturating add instruction could help do the same trick without checking for overflow first, although they seem fairly uncommon outside of SIMD instruction sets (aarch64 has UQADD, for example).
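The idea, modeled in Python (UQADD-style semantics; a sketch of the arithmetic, not aarch64 output):

```python
# Saturating unsigned add: clamp to the maximum value instead of
# wrapping. An effective address that saturates to 2^64 - 1 is
# guaranteed to be >= any plausible memory size, so one ordinary
# bounds check catches both overflow and normal OOB accesses.

MAX_U64 = 2**64 - 1

def saturating_add_u64(a: int, b: int) -> int:
    assert 0 <= a <= MAX_U64 and 0 <= b <= MAX_U64
    s = a + b
    return s if s <= MAX_U64 else MAX_U64
```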
jsheard
4 days ago
The problem is you need the virtual memory allocation to span every possible address the untrusted WASM code might try to access, so that any possible OOB operation lands in an unmapped page and triggers a fault. That's only feasible if the WASM address space is a lot smaller than the host address space.
I suppose there might be a compromise where you cap a 64-bit WASM instance to (for example) an effective 35-bit address space, allocate 32GB of virtual memory, and then generate code which masks off the top bits of pointers so OOB operations safely wrap around without having to branch, but I'm not sure if the spec allows that. IIRC illegal operations are required to throw an exception.
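The masking scheme sketched in Python (the 35-bit width is the example figure from the comment; everything here is illustrative):

```python
# Mask off the top bits so every index wraps into a 32GiB reservation.
# No branch, no fault, but OOB accesses silently wrap instead of
# trapping, which is why it may conflict with the spec's required
# trap-on-OOB semantics.

ADDR_BITS = 35
ADDR_MASK = (1 << ADDR_BITS) - 1   # keep only the low 35 bits

def masked_index(idx: int) -> int:
    return idx & ADDR_MASK
```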