QRS: Epsilon Wrangling

16 pointsposted 2 days ago
by zdw

4 Comments

o11c

2 days ago

I also have been playing with regexes recently. One observation that I've made: if you're willing to setting for something slightly weaker than regexes, you can make your code trivial to understand (and go straight to a DFA instead of going through NFA). AFAICT the only "hard" case (which I'm erroring out on) involves things like /(AB)+|(AC)+/ (where A, B, and C are arbitrarily complex patterns), since everything else can be easily simplified. And at least for the contexts I care about, that kind of regex is exceptionally rare.

... I probably should actually read the papers on how to do it properly though. Last time I tried, I got stuck in a tangle of "why does C make efficient hashmaps so hard!" - this time, I'm avoiding C until I have a fully-tested prototype (current status: 1.0KLoC logic, 0.5KLoC comments, 4.4KLoC test suite, 40% tests failing after a recent refactor [edit: I forgot Python enums don't compare equal to integers], 100% of the time being annoyed at how stupidly slow Python is if you use obscure programming features like "loops", "strings", "branching", or "functions").

kragen

2 days ago

Plausibly this approach is trivial to understand and implements full regexes, but it is slower than the NFA or DFA approach: http://canonical.org/~kragen/sw/dev3/redl.py

PEGs are in some ways more powerful than regexes, and in other ways less powerful, but usually the former matter more. This is not trivial to understand but I think it's not that hard either; it's a page of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md

That version doesn't memoize and so doesn't enjoy Packrat parsing's linear-time guarantee, but you can easily modify it to do so.

Another subset of regexes that's easy to understand is this single-page pattern matcher by Rob Pike from TPOP: https://www.cs.princeton.edu/courses/archive/spr09/cos333/be... which is enormously less code than my single-page PEG parser generator above. But then, it doesn't have to compile itself.

o11c

2 days ago

Unfortunately, neither "waste time" nor "waste space" are approaches worth pursuing.

We already have too many programs being written that are too simple and thus slow and/or wrong. We need to write code that is as simple as possible, but no simpler.

kragen

2 days ago

In practice you can always make a program take less space if it can use more time, or take less time if it can use more space; the guiltless perfection you seek does not exist.

I feel like PEG parsing can be fast and space-efficient, but I haven't seen an existence proof yet.