danielparks
13 days ago
Cool, I’m interested to see where you go with this.
I found the operator precedence unnatural, and it looks like a lot of other folks in this thread did too. I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`.
bane
13 days ago
This is a very interesting idea for a few reasons.
> I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`
It was confusing to me too until I remembered that we all kind of use regexes sort of wrong. They're "really" supposed to be considered as generators and not matchers. So IIR cat|dog as a "regular expression" (not a regex) is supposed to formaly expand to
{catog,cadog}
For matching, this set of strings can then be substring matched against some larger text.
The problem is that almost no regex matching engine actually does this, and so now they'll do all kinds of strange things either to meet our expectations, or for efficiency or something.
If you go and try a bunch of different regex tools you'll get variations that either service (cat)|(dog) or (cat)|(dog)|(ca[td]og) or something else.
So from a more formal conceptualization I think cat:dog should produce ca(t:d)og not (cat):(dog). But our experience with "regex" tools has subverted that formalization and now everybody just puts parens around expressions they want to alternate.
My real minor issue with this proposal, as interesting and well thought out as it is, is that it feels like it's just trying to get back at regular expressions as generators, which they actually are and it's coming from a place on the other side of a few decades of how we've been abusing them as regexes for user expectations. In other words, the problem is the tooling, not the syntax.
source: I've worked adjacent to this space in the past and if you've never thought of regexes as string set generators you can toy with the idea here
https://onlinestringtools.com/generate-string-from-regex
but again, how these generator tools works is also very specific. The ones I used to work with had a variety of ways to specify constraints on closures and such to restrict the generators.
Timwi
12 days ago
There’s no reason to say that “ca(t:d)og” is a “more correct” parsing than “(cat):(dog)”. You did hit the nail on the head insofar as that you realized that we as programmers have built strong habits and make assumptions on the basis of those habits. But you didn’t take it to the logical conclusion and didn’t realize that having a text-based syntax to represent regexes is also such a habit/assumption.
In pure theoretical computer science, regular expressions exist as an abstract concept independent from syntax or parsers. They are an “algebra”, which means they are composed of elements connected with operators, but they are not inherently tied to a syntax. In the most fundamental formulation of regular expressions (the one in the Chomsky hierarchy), the only operators are alteration (which modern syntaxes express as “|”), the Kleene star (“*”) and — notably — concatenation, which modern syntaxes simply omit, in a way comparable to how modern mathematics notation omits the multiplication operator when you write “2x”.
In the same way that maths needs rules to define whether “2x²” means “(2x)²” or “2(x²)”, regex syntax needs such rules too. This is called operator precedence. I’m sure you’ve heard that before, but you just might not have realized that the regular expression “ab” has an operator in it because it is typically not written.
Now I’m not going to argue that the operator precedence in maths notation is haphazard or without reason — but it is arbirary. It was arbitrarily chosen to be the most useful to mathematicians using the notation. And it turns out that giving exponentiation higher precedence than (invisible) multiplication (meaning: “2x²” means “2(x²)” rather than “(2x)²”) is more useful.
So coming back to the original example, whether “cat:dog” means “ca(t:d)og” or “(cat):(dog)” is simply a matter of defining the precedence of the “:” operator relative to the concatenation operator. You can argue (and I would agree with you) that one is more useful than the other, and therefore preferable (in the same way that “(cat)|(dog)” is more useful than “ca(t|d)og”), but neither of them is more fundamentally correct or primal or, as you put it, “supposed to formally expand to”.
c0nstantine
12 days ago
I agree with the point that precedence is arbitrary. The current version looks like this:
1 Escaped characters
2 []
3 ()
4 * + ? {m,n}
5 :
6 . (implicit concatenation)
7 |
I have some reasons to put it that way. I want : to be somewhat 'atomic'. If you think about '*' or '+' they can be lower in the table as well. Anyway, I will try to put : lower in the next version and see how it goes.
user
13 days ago
c0nstantine
13 days ago
Thank you for the feedback. Yes, the precedence is a question for me. Maybe I will change this.
If I shift it behind concatenation there could be another problem. E.g. with non-associative : should be illegal. And I am not sure how to treat this:
cat:dog:mouse
In the current version I inject the epsilon (empty string). It looks natural E.g. to remove every second letter I could run '..:' which is technically '.(.:eps)':
echo 'abcde' | ./trre '..:'
result: 'ace'
actually ':' association could have a meaning as a composition of regular relations; but I found it too complicated for now.
kccqzy
13 days ago
I do not understand the rules by which you inject the epsilon and I think this is a source of confusion for many people. I had thought that an epsilon could be injected anywhere REGEX can appear (effectively allowing epsilon as a REGEX) but of course that just leads to infinite number of parses. Manually injecting epsilon is a highly hacky thing to do; better to consider that when you design the grammar.
I would not worry about "cat:dog:mouse" because intuitively it is clearly correct and it means replacing cat with mouse. With parentheses it could be written as "((cat:dog):mouse)".
c0nstantine
11 days ago
Epsilon injection appears whenever right or left side of ':' has no operand. E.g.
(:a)
(a:)
a:|b
a|b:
etc
I will try to change the precedence and see how it works. Btw what do you think about explicit operators '>' '<' where '<' works as usual regex matcher, and '>' as a generator. For example to change 'cat' to 'dog' there could be something like '<cat>dog' where '<cat' part is a parser and '>dog' is a generator. Thanks.
kccqzy
11 days ago
I think your epsilon injection rule is trying to achieve this kind of production:
TRRE <- TRRE ':' REGEX | ':' TRRE | TRRE ':' | REGEX | ...
I think this would work better, but ':a:' is still ambiguous: it has two parse trees.twiss
13 days ago
Yeah. Similarly, for the range transformations, instead of `[a:A-z:Z]`, I would suggest `[a-z:A-Z]`; and instead of `[a:b-y:zz:a]`, something like `[a-y:b-z;z:a]`, perhaps.
kazinator
13 days ago
I would suggest simply [a-z]:[A-Z], inspired by tr.
Then there is no syntactic special case. This is just EXPR:EXPR; the special case is that both EXPR are character class syntax, and so the tr-like range mapping applies.
c0nstantine
12 days ago
[a-z] is equivalent to 'a|b|...|z' in the normal regex language.
So if we do [a-z]:[A-Z] it should be expanded to:
(a|b|...|z):(A|B|...|Z)
which is pretty legal in trre but has different meaning of mapping any a-z to ALL the A-Z (generating A-Z on each occurrence of lowercase letter).
kazinator
12 days ago
[a-z] is a semantically equivalent regex to a|b|..|z, but the two are not equivalent syntactic forms.
Distinct syntactic forms can be given distinct semantics, as long as there is rhyme and reason.
Moreover, the right side of the colon is not the normal regex language, it only borrows its syntax. So there, we may be justified in denying that the usual equivalence holds between character class syntax and a disjunction of the symbols denoted by the class.
c0nstantine
12 days ago
The right side is a normal regex language syntactically. Semantically it is a generator instead of a parser (consumer).
But I got your point. Maybe there could be some ways to do it in consistent way. Just straight tr-like syntax won't work, e.g I really want it something like this to be valid:
[a-b]:(x|y) (pairs a:x, b:x, a:y, b:y)
and I prefer not handle these in some ad-hoc way.
kazinator
12 days ago
I also go your point. The right side is a regular expression because it denotes a regular set.