Rust clean-slate POSIX CLI utilities 0.2.1 release: Awk, M4, ftw and more

80 points, posted 10 months ago
by jgarzik

20 Comments

nialv7

10 months ago

Should've called it Oreutils.

ramon156

10 months ago

Or like uutils, ooreutils

teo_zero

10 months ago

I like standards and abhor bloat, but I must admit there are GNU extensions that are so useful and well known that it would be difficult to do without. Probably this happens when POSIX specs are too strict or feature-poor to be of use even for medium-complexity tasks.

One example is "make": I'm afraid that a POSIX-only implementation wouldn't run most Makefiles out there!

marcus0x62

10 months ago

They acknowledge this. From the project's README.md:

> Popular GNU options will be supported by virtue of the "don't break scripts" rule. Unpopular options will not be implemented, to prevent bloat.

rybosome

10 months ago

I’d love to see what performance benchmarks look like. The old ones were highly optimized, but perhaps for different challenges than today’s architectures present.

simonask

10 months ago

Would definitely be interesting, but from a cursory look at the repository, it doesn't look like squeezing the last percentage points of performance has been a priority yet.

Things that stand out:

- The `awk` implementation uses the Pest parser generator (https://pest.rs/), which is known to not generate the fastest possible parsers, but is great for getting up and running.

- They are using the `clap` crate for argument parsing, which is also known to not be the fastest, but again is very user friendly (for example, it does Unicode linebreaks in the output of `--help`). It's marginal, but for a tiny utility being invoked many times from a shell script, this can add up.

It's very probably "fast enough", and it makes sense to prioritize like this at this point, but people shouldn't use this expecting a performance improvement right now.

jgarzik

10 months ago

I wouldn't assume. The awk implementation likely stands up well for a 1-billion-row challenge, with its thoughtful bytecode-based design.

Redditors ran some quick performance tests on parsing, also: https://www.reddit.com/r/rust/comments/1fd7qgl/comment/lmelo...

dundarious

10 months ago

Parsing the awk input? While not an irrelevant concern, that is obviously on the low-priority end when considering awk performance in general. My most often used awk program is just '{print $1}', but I use it on enormous files. What matters is the performance when operating on the enormous file, not the initial parse of '{print $1}' or of command line arguments.

I know you're just directly responding to the concerns of a parent comment though.
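To make the hot path concrete, here is a minimal Rust sketch of the work `'{print $1}'` does per line. It is an illustration only, not code from the project: the names `first_field` and `print_first_fields` are hypothetical, fields are split on runs of spaces and tabs only, and input is assumed to be valid UTF-8 (a real awk has no such restriction).

```rust
use std::io::{self, BufRead, BufWriter, Write};

/// Return the first field of a line using awk-like default splitting:
/// fields are separated by runs of spaces or tabs, leading blanks are ignored.
/// (Hypothetical helper for illustration.)
fn first_field(line: &str) -> &str {
    line.split(|c: char| c == ' ' || c == '\t')
        .find(|field| !field.is_empty())
        .unwrap_or("")
}

/// Stream `input` line by line, writing the first field of each line to `out`.
/// One String buffer is reused so the loop does not allocate per line.
fn print_first_fields<R: BufRead, W: Write>(mut input: R, mut out: W) -> io::Result<()> {
    let mut line = String::new();
    loop {
        line.clear();
        // read_line returns Ok(0) at end of input.
        if input.read_line(&mut line)? == 0 {
            break;
        }
        let record = line.trim_end_matches('\n');
        writeln!(out, "{}", first_field(record))?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    // Buffer stdout so each record does not trigger its own write syscall.
    let out = BufWriter::new(io::stdout().lock());
    print_first_fields(stdin.lock(), out)
}
```

On an enormous file, nearly all of the time goes into this loop (reading, splitting, writing), which is why the one-time parse of the program text barely registers.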

jgarzik

10 months ago

The hope is that the modern, bytecode-based design of posixutils' awk will keep performance high, theoretically higher than ancient C-based awks.

The design was inspired by Ray Gardner's "wak" awk implementation: https://github.com/raygard/wak

A volunteer benchmarking our awk on a 1-billion line text file would be welcome.
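For readers unfamiliar with the term, "bytecode-based" generally means the program is compiled once into a flat list of instructions that a small loop then executes for every record. The Rust sketch below shows that general idea only; the `Op` enum and `run` function are hypothetical and are not the project's (or wak's) actual instruction set.

```rust
/// Hypothetical instruction set, for illustration only.
#[derive(Debug, Clone, Copy)]
enum Op {
    Push(f64),
    Add,
    Mul,
}

/// Execute a flat instruction list on a value stack.
/// Returns the value left on top, or None if the stack underflows.
fn run(program: &[Op]) -> Option<f64> {
    let mut stack: Vec<f64> = Vec::new();
    for op in program {
        match *op {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs + rhs);
            }
            Op::Mul => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs * rhs);
            }
        }
    }
    stack.pop()
}

fn main() {
    // The expression (1 + 2) * 4, already "compiled" to postfix order.
    let program = [Op::Push(1.0), Op::Push(2.0), Op::Add, Op::Push(4.0), Op::Mul];
    println!("{:?}", run(&program));
}
```

The appeal of this shape is that the per-record cost is a tight loop over a compact array rather than a walk over a syntax tree. Whether that yields a real advantage here is exactly what a benchmark would need to show.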

dundarious

10 months ago

Yes, that's the most important part to focus on. I have nothing to add about it, since I haven't seen or gathered any numbers on it.

littlestymaar

10 months ago

Yeah, and I don't think performance matters that much for these utilities (and AFAIK many of the originals haven't been particularly optimized for performance anyway).

dundarious

10 months ago

Check the list: performance matters a great deal for many of these utilities, and the GNU project version is often pretty well optimized (often best in class among the POSIX-compliant impls that ship with an OS/distribution).

I do not want a "performance will be looked at later" version of m4, awk, grep, cp, find, diff, sort, uniq, etc., right now in my personal or dev environments. I can understand using a not-yet-optimized version written in a memory-safe language to compete with OpenBSD, but it's not for me right now.

littlestymaar

10 months ago

You are already using “performance will be looked at later” versions of grep and find, as they are much slower than alternative implementations (ripgrep and fd are way faster), and these are the ones in your list for which perf matters most…

The fact that performance doesn't really matter for these tools in the general case is the reason why most people still stick with those slow POSIX-compliant tools when much faster alternatives exist. (POSIX compliance only matters when using them in existing scripts, not when using them on the command line or when writing new scripts.)

wmf

10 months ago

I didn't realize The Open Group still exists and is updating POSIX.

jgarzik

10 months ago

Updated in 2024, no less! (2024 version still has UUCP though, heh)

7bit

10 months ago

I always wondered, is there an actual POSIX documentation or spec out there? The last time I researched it, it seemed that POSIX is a number of documents behind a massive paywall. Yet, so many people seem to know what is and isn't POSIX compliant, that it just seems unlikely that POSIX is locked behind a paywall.
