For a modern Forth on a modern platform, not so much, because so much work has gone into optimizing C compilers. You can get pretty close, though.
On some of the older platforms, certain implementations were very low level. There are Forth implementations for the 6800, 6809, 6502, 8086 (CP/M, DOS, and embedded) where all the core words are precompiled and all expansions to the library get iteratively replaced with their definitions until they’re also native code. There are probably a few for the 8080 and Z80 too.
Not everything needs to be as fast as C or hand-tuned assembly (which these days is sometimes slower than C that’s been through an optimizer). My main point, though, is that the gap between C and some other solution can vary wildly.
There are a lot of languages that get acceptably close to C, but as you get into more demanding tasks that list gets shorter. Fortran, Pike, C++, Rust, OCaml, Ada, and a few others are in that list for a lot more scenarios than CPython or Ruby. Perl has a big startup time, but for long-running tasks is acceptably close on its opcode VM. Many Forths would be too, and many Lisps. Both of those languages have native compilers here and there, though, that get you even closer.
What I mean by a penalty of threaded code isn't related to whether words are implemented in native code or not. For example:
: square dup * ;
is going to generate a square word that does 2 calls, regardless of whether "dup" and "*" are native words or not.
The equivalent in C:
int square(int x) { return x*x; }
will generate code that contains no call, even if your C compiler is not a very optimized one.
With STC (subroutine-threaded code), it becomes possible for an elaborate Forth to inline "dup" and "*", but STC is less popular on the 8-bit architectures you mentioned because it's much less compact.
It's in that context that I mention that threaded code entails a speed tax. It's those 2 calls.
Of course, in your Forth system, you could rewrite "square" in native code to get rid of the penalty, but then it's not threaded code anymore, it's native code.
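To make the "2 calls" concrete, here's a rough C sketch of how a threaded "square" might be executed. It's a toy model, not how any particular Forth does it: the names (execute, square_thread, dispatch_count and so on) are all made up for illustration, and real inner interpreters usually dispatch with jumps through NEXT rather than a C loop of function-pointer calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy VM: every name here is invented for illustration. */
typedef void (*prim_fn)(void);

enum { STACK_SIZE = 16 };
static int stack[STACK_SIZE];
static size_t sp = 0;                /* number of items on the data stack */
static unsigned dispatch_count = 0;  /* indirect calls made by the inner loop */

static void push(int v) { assert(sp < STACK_SIZE); stack[sp++] = v; }
static int pop(void) { assert(sp > 0); return stack[--sp]; }

/* Primitives: these stand in for the "native" core words. */
static void prim_dup(void) { int a = pop(); push(a); push(a); }
static void prim_mul(void) { int b = pop(); int a = pop(); push(a * b); }

/* : square dup * ;  modeled as a NULL-terminated list of word addresses. */
static const prim_fn square_thread[] = { prim_dup, prim_mul, NULL };

/* Simplified inner interpreter: one indirect call per word in the thread. */
static void execute(const prim_fn *ip) {
    while (*ip != NULL) {
        dispatch_count++;
        (*ip++)();
    }
}

/* Threaded version: two dispatches plus stack traffic. */
int threaded_square(int x) {
    push(x);
    execute(square_thread);
    return pop();
}

/* Native version: the body is a single multiply, with no calls. */
int native_square(int x) { return x * x; }
```

Calling threaded_square(7) goes through the dispatch loop twice (once for "dup", once for "*") before returning 49, while native_square(7) just multiplies. The dispatches happen even though both primitives are native code, which is the tax I'm talking about.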
Oh, yeah. The call overhead specifically isn’t all that onerous, though, is it? For your example you’re also talking about making a memory copy, and unless you have hardware multiply you’re doing looping addition.
Most Forths I’ve dealt with also offer inline assembly as part of a word definition, so I suppose you could do it that way if really desired. I can see what you mean though about the penalty being completely acceptable, because it shouldn’t be super large.