SQL, Homomorphisms and Constraint Satisfaction Problems

151 points, posted 3 days ago
by xlinux

19 Comments

mbid

3 days ago

The post mentions the idea that querying a database D can be understood algebraically as enumerating all morphisms Q -> D, where Q is the "classifying" database of the query, i.e. a minimal database instance that admits a single "generic" result of the query. You can use this to give a neat formulation of Datalog evaluation. A Datalog rule then corresponds to a morphism P -> H, where P is the classifying database instance of the rule body and H is the classifying database instance for matches of both body and head. For example, for the transitivity rule

  edge(x, z) :- edge(x, y), edge(y, z).
you'd take for P the database instance containing the two rows (a_1, a_2) and (a_2, a_3), and for H the instance that additionally contains (a_1, a_3). Now saying that a database D satisfies this rule means that every morphism P -> D (i.e., every match of the premise of the rule) can be completed to a commuting diagram

  P --> D
  |    ^
  |   /
  ⌄  /
  H
where the additional map is the arrow H -> D, which corresponds to a match of both body and head.
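
To make the lifting property concrete, here's a small brute-force sketch (my own toy encoding, not anything from the post or from eqlog): an instance is just a set of `edge` pairs, and D satisfies the rule exactly when every homomorphism P -> D extends to one from H.

    from itertools import product

    # P: classifying instance of the rule body  edge(x, y), edge(y, z)
    P = {"edge": {("a1", "a2"), ("a2", "a3")}}
    # H: the same instance with the head row edge(x, z) adjoined
    H = {"edge": {("a1", "a2"), ("a2", "a3"), ("a1", "a3")}}

    def elements(inst):
        return sorted({e for row in inst["edge"] for e in row})

    def homomorphisms(src, dst):
        """All maps from elements of src to elements of dst that send
        every row of src to a row of dst."""
        xs, ys = elements(src), elements(dst)
        for image in product(ys, repeat=len(xs)):
            h = dict(zip(xs, image))
            if all((h[a], h[b]) in dst["edge"] for (a, b) in src["edge"]):
                yield h

    def satisfies_rule(D):
        """Every morphism P -> D must extend along P -> H to some H -> D."""
        return all(
            any(all(n[x] == m[x] for x in elements(P)) for n in homomorphisms(H, D))
            for m in homomorphisms(P, D)
        )

    print(satisfies_rule({"edge": {(1, 2), (2, 3)}}))          # False: (1, 3) is missing
    print(satisfies_rule({"edge": {(1, 2), (2, 3), (1, 3)}}))  # True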

This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal reflection construction/the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is also well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.
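
In this toy encoding, the "free" completion for the transitivity rule is just naive Datalog evaluation: keep adjoining the missing head rows until every lifting problem has a solution. A sketch (again mine, not eqlog's actual algorithm):

    def saturate(D):
        """Freely complete D for the transitivity rule: add edge(x, z) whenever
        edge(x, y) and edge(y, z) are present, until a fixpoint is reached."""
        edges = set(D["edge"])
        changed = True
        while changed:
            changed = False
            for (x, y) in list(edges):
                for (y2, z) in list(edges):
                    if y == y2 and (x, z) not in edges:
                        edges.add((x, z))
                        changed = True
        return {"edge": edges}

    print(saturate({"edge": {(1, 2), (2, 3), (3, 4)}})["edge"])
    # adds (1, 3), (2, 4) and finally (1, 4): the transitive closure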

[1] https://github.com/eqlog/eqlog

[2] https://github.com/egraphs-good/egglog

snthpy

2 days ago

Thank you @xlinux and @mbid. Very interesting and not something I knew much about before.

I had a look at eqlog and egglog, and if I'm understanding things correctly then one possible use case is type inference and optimization. In particular, I looked at the example in [1].

I'm thinking that this could be useful in the PRQL [2] compiler, in particular for: a) inference of type restrictions on input relations and resultant output relation types, b) optimization of resultant SQL queries.

Would you be able to comment on whether that's correct?

Any links to related examples, papers, or work would be appreciated. Thanks!

1: https://egglog-python.readthedocs.io/latest/tutorials/sklear...

2: https://www.prql-lang.org/

mbid

2 days ago

I actually started working on Eqlog because I wanted to use it to implement a type checker. You might want to skim the posts in my series on implementing a Hindley-Milner type system using Eqlog, starting here [1]. The meat is in posts 3 - 5.

The type checker of Eqlog is mostly implemented in Eqlog itself [2]. The general idea is that your parser populates a database with syntax nodes, which are represented as `...Node` types in the Eqlog program at [2], and then you propagate type information with Datalog/Eqlog evaluation. Afterwards, you check whether the database contains certain patterns that you want to rule out, e.g. a variable that doesn't have a type [3].
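
As a caricature of that architecture, here's a toy Python sketch with made-up relation names (not the actual Eqlog code): the parser output is just rows, a Datalog-style rule propagates types to a fixpoint, and a final pass looks for the patterns you want to rule out.

    # Facts a parser might emit for `let x = 1 in x + y` (the `+` node is omitted):
    lit_node = {("e1",)}                          # e1 is an integer literal
    var_node = {("e2", "x"), ("e3", "y")}         # e2, e3 are variable uses
    let_node = {("e4", "x", "e1", "e2")}          # let x = e1 in ... (toy shape)

    expr_type = {("e1", "Int")}                   # literals are Ints

    # Rule, run to a fixpoint: a use of a variable bound by
    # `let name = defn in ...` gets the type of `defn`.
    changed = True
    while changed:
        changed = False
        for (_let, name, defn, _body) in let_node:
            for (e, t) in list(expr_type):
                if e != defn:
                    continue
                for (use, use_name) in var_node:
                    if use_name == name and (use, t) not in expr_type:
                        expr_type.add((use, t))
                        changed = True

    # Afterwards, check for patterns to rule out: a variable use with no type.
    untyped = [use for (use, _) in var_node
               if not any(e == use for (e, _) in expr_type)]
    print(untyped)  # ['e3'] -- y is unbound, so it never received a type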

There are still some unsolved problems if you're interested in writing the whole type checker in Datalog. For example, variable lookup requires quadratic memory when implemented in Datalog. I mention this and a possible solution at [4]. However, Datalog as is can probably still be useful for some subtasks during type checking; I believe the Rust compiler uses Datalog in some parts of its type checker, for example. Reach out via e.g. github to mbid@ if you'd like to discuss in more detail.

Regarding optimization, you probably want to talk with somebody working on egglog; they have a dedicated Zulip [5]. I'd imagine that for prql you want to encode the algebraic rules of pipeline transformations, e.g. a filter distributing over an append. Given the query AST, eqlog or egglog would give you all equivalent ways to write the query according to your rules. You'd then select the representation you estimate to be the most performant based on a score you assign to each (sub)expression.
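
As a crude illustration of that last step in plain Python (not the egglog or eqlog API; the names and the cost model are made up): encode one rewrite rule, say a filter distributing over an append, enumerate the equivalent forms of a toy query AST, and keep the cheapest one.

    def forms(ast):
        """ASTs equivalent to `ast` under
        filter(p, append(a, b)) == append(filter(p, a), filter(p, b)),
        applied at the root only to keep the sketch short."""
        yield ast
        if ast[0] == "filter" and ast[2][0] == "append":
            p, (_, a, b) = ast[1], ast[2]
            yield ("append", ("filter", p, a), ("filter", p, b))

    def cost(ast):
        """Made-up estimate: a filter directly on a base table is cheap (it
        could use an index); a filter over a derived relation is not."""
        op = ast[0]
        if op == "table":
            return 1
        if op == "filter":
            return cost(ast[2]) + (1 if ast[2][0] == "table" else 10)
        if op == "append":
            return 1 + cost(ast[1]) + cost(ast[2])

    query = ("filter", "country == 'de'",
             ("append", ("table", "orders_2023"), ("table", "orders_2024")))

    best = min(forms(query), key=cost)
    # best is the pushed-down form:
    # append(filter(.., orders_2023), filter(.., orders_2024))
    print(best)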

[1] https://www.mbid.me/posts/type-checking-with-eqlog-parsing/

[2] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...

[3] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...

[4] https://www.mbid.me/posts/dependent-types-for-datalog/#morph...

[5] https://egraphs.zulipchat.com

snthpy

21 hours ago

Thank you. Will try to get into this on the weekend. I'll reach out once I can ask a more informed question.

bubblyworld

2 days ago

Very interesting perspective I hadn't heard before on datalog, thanks. How far does it go - can you interpret extensions of datalog (say negation or constrained existentials) in a nice categorical way, for instance? I've given this very little thought but I imagine you'd have issues with uniqueness of these "minimal" database instances, and I'm not sure what that means for these lifting properties.

(if my question even makes sense, pardon the ignorance)

mbid

2 days ago

If you're interested in the details, you might want to have a look at papers [1] or [2].

You can add existentials in this framework, which basically means that the lifting problems mentioned above don't need to have unique solutions. But as you say, then the "minimal" databases aren't determined uniquely up to isomorphism anymore. So the result of Datalog evaluation now depends on the order in which you apply rules.

If I recall correctly, [3] discusses a logic corresponding to accessible categories (Datalog + equality corresponds to locally presentable categories) which includes the theory of fields. The theory of fields involves the negation 0 != 1, so perhaps that might give you a nicer way to incorporate negations without stratification.

[1] https://www.mbid.me/eqlog-semantics/

[2] https://arxiv.org/abs/2205.02425

[3] Locally presentable and accessible categories, https://www.cambridge.org/core/books/locally-presentable-and...

bubblyworld

2 days ago

Thanks for the references, those papers look great! Will dig into them this evening =)

babel_

3 days ago

For anyone curious: the performance difference between Clang and GCC on the example C solution for verbal arithmetic comes down to Clang's auto-vectorisation (deducing SIMD), whilst GCC here sticks with scalar code, which is why the counter brings Clang closer in line with GCC (https://godbolt.org/z/xfdxGvMYP). It's actually a pretty nice example of auto-vectorisation (and its limitations) in action, which is a fun tangent from this article, given its relevance to high-performance SMT/SAT solving for CSP.

pcblues

2 days ago

When SQL can't internally optimise a query into a more efficient constraint problem, unrolling joins is the key. This once-MSSQL hacker got to the point of optimising queries with large numbers of joins or CTEs by just populating a single table's columns with one query per one to a few columns at a time (two-minute locking queries down to about two seconds). After that, I started using SQL to generate SQL and running that for really curly requirements. That gives you the ability to write queries that can search for a particular value in any column of any table, or find changes in the past 5 minutes in any column of any table, within a fairly quick timeframe. And that's great for debugging applications that interface with the database or identifying rogue table changes, without needing a transaction log. Programmer's paradise :)
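
Roughly, the generator reads column metadata (e.g. from INFORMATION_SCHEMA.COLUMNS) and emits one small probe query per column. A hedged Python sketch of the idea (I do the real thing directly in T-SQL, and all table/column names here are made up):

    def column_search_queries(columns, needle):
        """`columns` is a list of (schema, table, column, data_type) rows, as
        fetched from INFORMATION_SCHEMA.COLUMNS; yields one probe query per
        text column. (A real version would parameterise/escape `needle`.)"""
        text_types = {"char", "nchar", "varchar", "nvarchar", "text"}
        for schema, table, column, data_type in columns:
            if data_type in text_types:
                yield (f"SELECT '{schema}.{table}.{column}' AS hit, COUNT(*) AS n "
                       f"FROM [{schema}].[{table}] WHERE [{column}] = '{needle}';")

    cols = [("dbo", "customers", "email", "nvarchar"),
            ("dbo", "orders", "status", "varchar"),
            ("dbo", "orders", "amount", "decimal")]
    for q in column_search_queries(cols, "shipped"):
        print(q)
    # emits one SELECT per text column, e.g.:
    # SELECT 'dbo.orders.status' AS hit, COUNT(*) AS n
    #   FROM [dbo].[orders] WHERE [status] = 'shipped';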

lovasoa

3 days ago

The topic of huge queries on tiny databases makes me think of this recent discussion on the SQLite forum: https://sqlite.org/forum/forumpost/0d18320369

Someone had an issue because SQLite failed to optimize the following query

    select * from t where x = 'x' or '' = 'x'
Someone said that SQLite could not optimize out the "or '' = 'x'" because it would be too expensive to compute, which is obviously true only for huge queries on tiny datasets.

jiggawatts

3 days ago

> SQLite

Well... there's your problem. SQLite is not a general-purpose RDBMS; it is marketed as a replacement for "fopen()", a purpose for which it excels.

A similar product is the Microsoft Jet database engine, used in products such as Microsoft Exchange and Active Directory. Queries have to be more-or-less manually optimised by the developer, but they run faster and more consistently than they would with a general-purpose query engine designed for ad-hoc queries.

cerved

2 days ago

I hate Jet with a vengeance

recursive

3 days ago

It's not obviously true at all. Optimizing out `'' = 'x'` can be done for a fixed cost regardless of record count.

lovasoa

3 days ago

Optimizing out static expressions can be done in linear time at best. So if the number of clauses in WHERE is huge and the size of the underlying table is tiny (such as in the examples shown in the article we are commenting on), it will be better not to run the optimization.
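
For concreteness, folding the constant comparison is a single walk over the WHERE expression tree, so the cost scales with the query text rather than the table. A rough Python sketch (the expression representation is made up):

    def fold(expr):
        """expr is a string constant, ("col", name), ("eq", a, b) or ("or", a, b);
        one recursive pass folds comparisons whose operands are both constants."""
        if not isinstance(expr, tuple) or expr[0] == "col":
            return expr
        op, a, b = expr
        a, b = fold(a), fold(b)
        if op == "eq" and isinstance(a, str) and isinstance(b, str):
            return a == b                   # both sides constant: fold now
        if op == "or":
            if a is True or b is True:
                return True
            if a is False:
                return b
            if b is False:
                return a
        return (op, a, b)

    where = ("or", ("eq", ("col", "x"), "x"), ("eq", "", "x"))
    print(fold(where))  # ('eq', ('col', 'x'), 'x') -- the `or '' = 'x'` arm is gone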

But of course, in normal life, outside of the world of people having fun with Homomorphisms, queries are much smaller than databases.

recursive

3 days ago

Parsing the expression in the first place is already linear time.

thaumasiotes

2 days ago

True, but that doesn't mean doing additional work during the parse is free. Optimizing out static expressions will take additional time, and in general that additional time will be linear in the query size.

recursive

2 days ago

My argument is that, on average, it will more than pay for itself.

The only losing case, if there are any measurable ones, is where you have long queries and short data. I'd call that a case of "doing it wrong". Wrong tool for the job.

hinkley

3 days ago

Why would it be too expensive to optimize out static subexpressions?

jjice

3 days ago

My guess is that the expense can be tricky to estimate, since the additional optimization before executing the query may take longer than just running the query as written (depending on the dataset, of course). I wonder if even calculating a heuristic for this is too expensive, so it just lets the query execute.

Just a guess.