nialv7
10 months ago
Should've called it Oreutils.
10 months ago
I like standards and abhor bloat, but I must admit there are GNU extensions so useful and well known that it would be difficult to do without them. This probably happens when the POSIX specs are too strict or feature-poor to be of use even for medium-complexity tasks.
One example is "make": I'm afraid that a POSIX-only implementation wouldn't run most Makefiles out there!
10 months ago
They acknowledge this. From the project's README.md:
> Popular GNU options will be supported by virtue of the "don't break scripts" rule. Unpopular options will not be implemented, to prevent bloat.
10 months ago
I’d love to see what performance benchmarks look like. The old ones were highly optimized, but perhaps for different challenges than today’s architectures present.
10 months ago
Would definitely be interesting, but from a cursory look at the repository, it doesn't look like squeezing the last percentage points of performance has been a priority yet.
Things that stand out:
- The `awk` implementation uses the Pest parser generator (https://pest.rs/), which is known to not generate the fastest possible parsers, but is great for getting up and running.
- They are using the `clap` crate for argument parsing, which is also known to not be the fastest, but again is very user friendly (for example, it does Unicode linebreaks in the output of `--help`). It's marginal, but for a tiny utility being invoked many times from a shell script, this can add up.
It's very probably "fast enough", and it makes sense to prioritize like this at this point, but people shouldn't use this expecting a performance improvement right now.
10 months ago
I wouldn't assume. The awk implementation likely stands up well for a 1-billion-row challenge, with its thoughtful bytecode-based design.
Redditors ran some quick performance tests on parsing, also: https://www.reddit.com/r/rust/comments/1fd7qgl/comment/lmelo...
10 months ago
Parsing the awk input? While not an irrelevant concern, that is obviously on the low priority end when considering awk performance in general. My most often used awk program is just '{print $1}', but I use it on enormous files. The performance when operating on the enormous file is the concern wrt performance, not the initial parse of '{print $1}' or of command line arguments.
I know you're just directly responding to the concerns of a parent comment though.
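To make the distinction concrete, here is a rough Python sketch of what a '{print $1}' run amounts to. It is only an illustration, not how any real awk is implemented: `print_first_field` is a made-up name, and `str.split` only approximates awk's default field splitting. The point is that the program is "parsed" once, while the loop body runs once per input line, so on an enormous file the loop is where the time goes.

```python
import io


def print_first_field(src, dst):
    """Approximate awk '{print $1}' (illustrative only, not a real awk).

    Writes the first whitespace-separated field of every input line,
    or an empty line when the input line has no fields.
    Returns the number of lines processed.
    """
    count = 0
    for line in src:
        # maxsplit=1: stop after the first field instead of splitting the
        # whole line, since only $1 is needed.
        parts = line.split(None, 1)
        dst.write((parts[0] if parts else "") + "\n")
        count += 1
    return count


# Tiny in-memory demo; a real run would stream a large file instead.
demo_in = io.StringIO("alpha beta gamma\n  indented\tline\n\nlast")
demo_out = io.StringIO()
print_first_field(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Any benchmark worth reading would time that per-line loop on a large input, not the startup cost.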
10 months ago
The hope is that the modern, bytecode-based design of posixutils' awk will keep performance high, theoretically higher than that of the ancient C-based awks.
Inspired by Ray Gardner's "wak" awk implementation https://github.com/raygard/wak
A volunteer benchmarking our awk on a 1-billion line text file would be welcome.
10 months ago
Yes, that's the most important part to focus on. I have nothing to remark about it, I haven't seen/gathered any numbers on it.
10 months ago
Yeah, and I don't think performance matters that much for these utilities (and AFAIK many of the originals haven't been particularly optimized for performance anyway).
10 months ago
Check the list: performance matters a great deal for many of these utilities, and the GNU project's version is often pretty well optimized (often best in class among the POSIX-compliant implementations that ship with an OS/distribution).
I do not want a "performance will be looked at later" version of m4, awk, grep, cp, find, diff, sort, uniq, etc., right now in my personal or dev environments. I can understand using a not-yet-optimized memory safe language version to compete with OpenBSD though, but not for me right now.
10 months ago
You are already using “performance will be looked at later” versions of grep and find, as they are much slower than alternative implementations (ripgrep and fd are way faster), and these are the ones in your list for which perf is the most sensitive…
The fact that performance doesn't really matter for these tools in the general case is the reason why most people still stick with those slow POSIX-compliant tools when much faster alternatives exist. (POSIX compliance only matters when using a tool in existing scripts, not at all when using it on the command line or when writing new scripts.)
10 months ago
I didn't realize The Open Group still exists and is updating POSIX.
10 months ago
Updated in 2024, no less! (2024 version still has UUCP though, heh)
10 months ago
I always wondered, is there an actual POSIX documentation or spec out there? The last time I researched it, it seemed that POSIX is a number of documents behind a massive paywall. Yet, so many people seem to know what is and isn't POSIX compliant, that it just seems unlikely that POSIX is locked behind a paywall.
10 months ago