bccdee
5 days ago
My favourite way of doing this is
with movieIds as (select id from Movie where title = $1),
actorIds as (select Actor.id from Actor join ActorMovie on [...]
where ActorMovie.Movie in (select id from movieIds)),
alsoActedIn as (select id from ActorMovie where actor in (select id from actorIds)),
movieResults as (select * from Movie where id in (select id from movieIds)),
actorResults as (select * from Actor where id in (select id from actorIds)),
alsoActedInResults as (select * from Movie join ActorMovie on [...]
where ActorMovie.id in (select id from alsoActedIn))
select * from movieResults
full outer join actorResults on false
full outer join alsoActedInResults on false;
Not every database supports "full outer join on false," and sometimes you have to add "as not materialized" to the "with" subqueries in order for it to be performant, but it works in Postgres, and you end up with a results table that looks something like this:
MovieId,MovieTitle,ActorId,ActorName,MovieId,MovieTitle,ActorMovie.actor
0,"Indiana Jones",null,null,null,null,null
null,null,0,"Harrison Ford",null,null,null
null,null,null,null,0,"Indiana Jones",0
null,null,null,null,1,"Star Wars",0
null,null,null,null,2,"The Fugitive",0
With a bit of client-side logic, it's easy to transform that into this: [{
id: 0,
title: "Indiana Jones",
actors: [{
id: 0,
name: "Harrison Ford",
alsoActedIn: [
{id: 1, title: "Star Wars"},
{id: 2, title: "The Fugitive"},
]
}]
}]
without the need to use json as a middleman. This pattern has saved me from some truly awful query logic.
timeinput
5 days ago
I've been trying that, and I just keep running into miserable query performance.
What order of magnitude do you use this on and find it acceptable? 1MB, 1GB, 1TB, 1PB? At 1GB it seems okay. At 10GB things aren't great, at 100GB things are pretty grim, and at 1TB I have to denormalize the database or it's all broken (practically).
I'm not a database expert, and I don't feel like I'm asking hard questions, but I'm running into trouble with something I thought was 'easy' and that matches what you're describing.
bccdee
5 days ago
Try using the `explain analyze` statement. My guess is, either you need to add indexes, or you're not using the indexes you do have because some of your subqueries are being materialized when they shouldn't be. I can't be sure of the specific problem, but you should be able to fix it with some run-of-the-mill query optimization.
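For example, running the whole query under `explain analyze` will show whether the junction table is being hit with an index or sequentially scanned; if it's the latter, indexes along these lines (the names here are just placeholders) usually help:
explain analyze <the full query from above>;
create index if not exists actormovie_movie_idx on ActorMovie (Movie);
create index if not exists actormovie_actor_idx on ActorMovie (actor);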
mycall
4 days ago
Chained CTEs that reference previous expressions can cause the query plan to explode sometimes. My only work around for that situation is to switch to using temp tables with indexes, replacing some (or all) of the table expressions until things are fast again.
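Roughly this shape (just a sketch, reusing the table names from the example upthread):
create temp table movie_ids as select id from Movie where title = $1;
create index on movie_ids (id);
analyze movie_ids;
-- then reference movie_ids in the rest of the query instead of the movieIds CTE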
bccdee
4 days ago
Yeah, last time I had that issue, I found Postgres was materializing some of the subqueries and then naively scanning those materialized results even though the original tables had indexes. I fixed it by adding "not materialized" and hitting the original tables directly, but if you want to materialize a subquery AND have it be indexed, a temp table is the only way I know to make that happen.
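For reference, the hint is just a keyword on the CTE (Postgres 12 and later), something like:
with movieIds as not materialized (select id from Movie where title = $1),
-- ...remaining CTEs and the final select unchanged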
kaelwd
3 days ago
My favourite way of doing this is with EdgeQL:
select Movie {
title,
actors: {
name,
alsoActedIn := (
.movies { title }
filter .id != Movie.id
),
},
} filter .title = <str>$title;
gerad
5 days ago
couldn't you do a union all instead of outer join on false?
bccdee
5 days ago
Nope, because unions merge columns. `select * from Actor union all select * from Movie` gives you [(0, "Harrison Ford"), (0, "Indiana Jones")], which is a problem because we can't tell which rows are actors and which rows are movies. What we need is [(0, "Harrison Ford", null, null), (null, null, 0, "Indiana Jones")].
lixtra
5 days ago
Yes, you could use union. But then you have to pad the columns of the other tables with NULLs to arrive at the same output and carefully count. And we all hate counting.
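Sketched against the CTEs from the example upthread, the union version would be roughly:
select id, title, null, null from movieResults
union all
select null, null, id, name from actorResults;
and every extra table in the result means another set of null columns to line up by hand.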
taffer
5 days ago
> With a bit of client-side logic, it's easy to transform that into this [example here] without the need to use json as a middleman.
The result in your example looks exactly like JSON. Am I missing something?
Also, what is the point of avoiding JSON? Your client has to unmarshal the result either way.
bccdee
5 days ago
The json in my example would be generated in client code from the rows returned by the query, which I represented as CSV. The client's representation could just as easily be objects or protobufs or whatever, but I figured json would be a convenient way to portray nested data.
It's worth avoiding json on the wire if possible, because the extra layer of encoding will complicate things. Any good postgres integration will know how to safely deserialize and discriminate between (e.g.) DateTime and DateTimeWithTimeZone. But with json, everything's stringified.