i_have_to_speak
2 months ago
This really has nothing to do with CTEs.
The SQL standard does not specify how many times a subquery must or can be run. In this case, Postgres is free to run the uncorrelated subquery many times. Unfortunately, the subquery result is not idempotent.
The real solution is not to rely on the optimizer for the correctness of your query. The query should be logically correct, irrespective of the query plan, right? The correct solution would be to ensure that the subquery runs only once, using a materialized cte or temp table. Do not base your "solution" on the discovery that for this particular version of Postgres, the planner runs this type of subquery once for the '=' operator and multiple times for 'IN'.
Diggsey
2 months ago
Based on my understanding, your comment is the most accurate in this thread.
However, I think it's telling that almost every single comment here has a different understanding of how PostgreSQL works, and nobody can link to any documentation that conclusively addresses this case. I think that indicates that this is a failing if not of implementation, then of documentation.
For example: another comment says that `=` forces PostgreSQL to evaluate the subquery exactly once. I couldn't find any justification for that statement. Is it guaranteed? I don't know.
Another comment says "if only the subquery was ordered then the result would be consistent", but AFAIK, this is not true in the presence of "SKIP LOCKED".
I think the problem stems from the fact that the SQL standard is written from an academic point of view: the ordering of execution of subqueries is not defined because it is irrelevant within the relational algebra, so optimizations which change the execution order are just that: optimizations, with no observable side effects.
Real database systems do not actually have the nice algebraic properties of the relational algebra... The same subquery may give different results over multiple executions, etc. Given this, it's important that the database system give sufficient and sufficiently clear guarantees such that they can still be used correctly, and so that the optimizations implemented by the system can be justified.
swid
2 months ago
The comment about = is mine and I thought about rewriting that part of the comment after the fact. It is soooo hard to explain these things without writing a book.
The reason = mean it runs once is because the outer query will only run once, and in this case that query, when using =, can only delete based on a single id. But if that outer query was subquery in a context where it could be run more than once, you are back to where you started. Hence me saying their fix was sort of incidental.
Diggsey
2 months ago
I think your logic is flawed (even if PostgreSQL may behave this way in practice).
First, let's make sure we're talking about the same two examples.
A:
DELETE FROM task_queue
WHERE id = (
SELECT id FROM task_queue
WHERE queue_group_id = 15
LIMIT 1
FOR UPDATE SKIP LOCKED
)
B: DELETE FROM task_queue
WHERE id IN (
SELECT id FROM task_queue
WHERE queue_group_id = 15
LIMIT 1
FOR UPDATE SKIP LOCKED
)
You seem to be saying that B may exhibit the problem whilst A does not. (ie. there is a different between using `=` vs `IN`). I would like to see some documentation justifying that.Here's my logic:
- In both cases, the outer query is run once.
- In both cases the outer WHERE clause is evaluated for each row in `task_queue`.
- In both cases, it is up to the optimizer whether the result of the subquery is materialized or not.
- In both cases, if the subquery is not materialized, multiple rows in the outer query may match the condition.
In practice, it may be that the optimizer always materializes uncorrelated subqueries on the RHS of an `=` expression. My contention is whether that is a formal guarantee.
swid
2 months ago
Most of what you say I agree with. But if this outer query is run only once with version A there is a caveat with where it says “where id = X”.
This cannot match more than one X at a time. So that forces the inner query to be run once, as we can only have one id, and running it twice may produce 2.
I am not sure though to be honest.
…
We are too deep for me to reply now, but to your next comment I didn't mean only one row, but only one id. It is easy for a small difference in word choice to get things wrong.
I think if it did return two rows; ie limit 2- the query with = will fail. Hmmm maybe that will happen even with limit 1 under certain plans. I wouldn’t trust it.
Diggsey
2 months ago
Why would it only be run once? The WHERE condition of the outer query is run multiple times: once for each row, so of course it can return TRUE multiple times.
For example:
DELETE FROM example WHERE id = (SELECT RANDOM(0, 100) FROM other_table LIMIT 1)
This could delete multiple rows in principle, since there may be multiple rows where the `=` expression is true.