hackernews client

SQL Anti-Patterns

296 pointsposted 4 months ago

(datamethods.substack.com)

221 Comments

EvanAnderson

4 months ago

> Overusing DISTINCT to “Fix” Duplicates

Any time I see DISTINCT in a query I immediately become suspicious that the query author has an incomplete understanding of the data model, a lack of comprehension of set theory, or more likely both.

sigwinch28

4 months ago

Or it’s simply an indicator of a schema that has not been excessively normalised (why create an addresses_cities table just to ensure no duplicate cities are ever written to the addresses table?)

echelon

4 months ago

DISTINCT, as well as the other aggregation functions, are fantastic for offline analytics queries. I find a lot of use for them in reporting, non-production code.

It depends when you see it, but I agree that DISTINCT shouldn't be used in production. If I'm writing a one off query and DISTINCT gets me over the finish line sparing me a few minutes then that's fine.

viraptor

4 months ago

Which categories did the user post in? Which projects did the user interact with in the last week? That's all normal DISTINCT usage.

ndsipa_pomu

4 months ago

There's nothing wrong with using DISTINCT correctly and it does belong in production. The author is complaining about developers that just put in DISTINCT as a matter of course rather than using it appropriately.

ndsipa_pomu

4 months ago

One reason to have excessively normalised tables would be to ensure consistency so that you don't have to worry about various records with "London", "LONDON", "lindon" etc.

sgarland

4 months ago

Because a city/region/state can be uniquely identified with a postal code (hell, in Ireland, the entire address is encapsulated in the postal code), but the reverse is not true.

At scale, repeated low-cardinality columns matter a great deal.

virissimo

4 months ago

There are ZIP codes that overlap a city and also an unincorporated area. Furthermore, there are zip codes that overlap different states. A data model that renders these unrepresentable may come back to bite you.

Breza

3 months ago

This assumption got me in trouble as a junior analyst years ago. I was asked to analyze our customer base and wrote something like the below. Management congratulated me on finding thousands more customers than we'd ever had before.

SELECT zipcode.rural_urban_code, COUNT(*) AS n_customer FROM customer INNER JOIN zipcode USING(zipcode) GROUP BY 1;

pbnjay

4 months ago

FYI this is not true in the US. Zip codes identify postal routes not locations

bdangubic

4 months ago

saying zipcodes uniquely identify city/state/region is like saying John uniquely identifies a human :)

sgarland

4 months ago

EDIT: TIL that there are cross-state ZIP codes.

lucyjojo

4 months ago

these kinds of things are almost never true in the real world.

bts89

4 months ago

That’s almost always my experience too.

Though fairly recently I learned that even with all the correct joins in place, sometimes adding a DISTINCT within a CTE can dramatically increase performance. I assume there’s some optimizations the query planner can make when it’s been guaranteed record uniqueness.

Breza

3 months ago

I agree with you. I also find that adding DISTINCT can sometimes make it easier for my colleagues to understand code, especially when I'm using multiple CTEs and it might be easy to miss a one-to-many join.

yxhuvud

4 months ago

I've seen similar effects when changing a bunch of left outer joins to lateral joins with a limit 1 tacked on. The limit do nothing to the end result, but speed up the query by a factor of 1000..

dotancohen

4 months ago

I've been told similar nasty things for adding LIMIT 1 to queries that I expect to return at most a single result, such as querying for an ID. But on large tables (at least in sqlite, mysql, and maybe postgress too) the database will continue to search the entire table after the given record was found.

Guillaume86

4 months ago

Only if your table is missing an unique index on that column, which it should have to enforce your assumption, so yeah LIMIT 1 is a code (or schema in the case) smell.

dotancohen

4 months ago

IDs are typically unique primary key. But in my experience, adding LIMIT 1 would on average halve the time taken to retrieve the record.

I'll test again, really the last time I tested that was two decades ago.

EvanAnderson

4 months ago

That seems like your RDBMS wasn't handling something right there or there wasn't a unique index on the column.

Do you recall what the database server was?

dotancohen

4 months ago

Yes, I was using Mysql exclusively at the time. I don't recall which version.

I also tested this once years later when doing a Python app with sqlite. Similar result, but admittedly that was not a very big table to begin with.

I am meticulous with my database schemas, and periodically review my indexes and covering indexes. I'm no DBA, but I believe that the database is the only real value a codebase has, other than maybe a novel method here and there. So I put care into designing it properly and testing my assumptions.

Guillaume86

4 months ago

You should use the DB explain or equivalent command to spit out the query plan, limit 1 shouldn't change anything in your case, if it's not the case you should file an issue, it's pretty much 101 of query optimization.

viraptor

4 months ago

That would be a reportable bug. Of a pretty high priority.

buckle8017

4 months ago

You are certainly doing something wrong if that's true.

I'm curious, can you demo this?

dotancohen

4 months ago

I'm curious as well to see if this still holds up. I'll try this week.

giovannibonetti

4 months ago

I've noticed that LIMIT 1 makes a huge difference when working with LATERAL JOINs in Postgres, even when the WHERE condition has a unique constraint.

sgarland

4 months ago

If you include an ORDER BY, the DB _may_ continue searching. MySQL (and, I assume, MS SQL Server, since it also can cluster the PK) can stop early in some circumstances.

But if you just have a LIMIT, then no - any RDBMS should stop as soon as it’s reached your requested limit.

dotancohen

4 months ago

Right, that's why I add it.

fipar

4 months ago

In mysql, the db will continue reading even if the limit condition has been met, and then anything beyond the limit will be discarded before returning the result.

dotancohen

4 months ago

Even without an ORDER BY clause?

fipar

4 months ago

Nope, that does work as expected, unless a filesort is required, good point.

mcv

4 months ago

It's the exact opposite in Cypher. I'm currently working with some complex data in neo4j, and wondered why my perfectly fine looking queries were so slow, until I remembered to use DISTINCT. It's very easy to get duplicate nodes in your results, especially when you use variable length relationships, and DISTINCT is the only fix I'm aware of that fixes that.

dleeftink

4 months ago

Yeah, similarly combining distinct with recursive CTE's in SQL can be the difference between a n×n blowout or a performant graph walk that only visits nodes once.

bandrami

4 months ago

IDK, "which ZIP codes do we have customers in?" seems like a reasonable thing to want to know

mbb70

4 months ago

The very next ask will be "order the zipcodes by number of customers" at which point you'll be back to aggregations, which is where you should have started

wvbdmp

4 months ago

Anti-Patterns You Should Avoid: overengineering for potential future requirements. Are there real-life cases where you should design with the future in mind? Yes. Are there real-life cases where DISTINCT is the best choice by whatever metric you prioritize at the time? Also yes.

RHSeeger

4 months ago

> Are there real-life cases where DISTINCT is the best choice by whatever metric you prioritize at the time

Indeed, along that line, I would say that DISTINCT can be used to convey intent... and doing that in code is important.

- I want to know the zipcodes we have customers in - DISTINCT

- I want to know how many customers we have in each zipcode - aggregates

Can you do the first with the second? Sure.. but the first makes it clear what your goal is.

dleeftink

4 months ago

Partly in jest, but maybe we need a NON-DISTINCT signaller to convey the inverse and return duplicate values only.

SOMEWHAT-DISTINCT with a fuzzy threshold would also be useful.

RHSeeger

4 months ago

I hear you. It's not all _that_ uncommon for me to query for "things with more than one instance". Although, to be fair, it's more common for me to that when grep/sort/uniqing logs on the command line.

majormajor

4 months ago

Here we start to get close to analytics sql vs application sql, and I think that's a whole separate beast itself with different patterns and anti-patterns.

bandrami

4 months ago

Ah, yeah, you beat me to it. I do reporting, not applications.

sql_nitpicker

4 months ago

distinct seems like an aggregation to me

user

4 months ago

[deleted]

kristjansson

4 months ago

Whole seconds will have been wasted!

bandrami

4 months ago

I do reporting, not application development. If somebody wants to know different information I'd write a different query.

edoceo

4 months ago

count(id) group by post_code order by 1

DavidWoof

4 months ago

In OP's defense, "becoming suspicious" doesn't mean it's always wrong. I would definitely suggest an explaining comment if someone is using DISTINCT in a multi-column query.

ryandv

4 months ago

Set theory...

There are self-identifying "senior software engineers" that cannot understand what even an XOR is, even after you draw out the entire truth table, all four rows.

BuyMyBitcoins

4 months ago

I am surprised at common it is for software engineers to not treat booleans properly. I can’t tell you how many times if seen ‘if(IsFoo(X) != false)’

It never used to bug me as a junior dev, but once a peer pointed this out it became impossible for me to ignore.

furyofantares

4 months ago

The most egregious one I saw, I was tracking down a bug and found code like this:

    bool x;

    ...

    if (x == true) {
        DoThing1();
    } else if (x == false) {
        DoThing2();
    }

And of course neither branch was hit, because this is C, and the uninitialized x was neither 0 nor 1, but some other random value.

Rexxar

4 months ago

Maybe it was initially supposed to be a sort of "3-value boolean" (true/false/undefined) and not a standard bool. You can (rarely) meet this pattern in c++ if you use boost::tribool or in c# if you have a nullable bool. There is probably similar thing in other languages.

furyofantares

4 months ago

It was definitely just bad code.

tomjakubowski

4 months ago

Sometimes this kind of thing happens after a few revisions of code, where in earlier versions the structure of the code made more sense: maybe several conditions which were tested and then, due to changing requirements, they coalesced into something which now reads as nonsense.

When making a code change which touches a lot of places, it's not always obvious to "zoom out" and read the surrounding context to see if the structure of the code can be updated. The developer may be chewing through a grep list of a few dozen locations that need to be changed.

1718627440

4 months ago

I think of comparisons as a type conversion to a boolean. You wouldn't convert a boolean, but I like it to convert other types like an integer, even when the language rules would already specify the same I'm writing.

munchlax

4 months ago

People do that? This hurts my brain. if(IsFoo(X)) is clear and readable.

catlifeonmars

4 months ago

Clearly the correct spelling is

`if(X&IsFooMask != 0)`

hyperman1

4 months ago

I've spent a lot of time not seeing how xor is just the 'not equals' operator for booleans.

layer8

4 months ago

Or, for a boolean type, that XOR is the same as the inequality operator.

avalys

4 months ago

Maybe it’s confusing because it’s misnamed?

ryandv

4 months ago

This is like saying the non-negative integers under addition, lists under append, and strings under concatenation are all just misnamings of the semigroup operator.

https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-...

layer8

4 months ago

Is it? Two things are equal exactly when they aren’t exclusive.

catlifeonmars

4 months ago

XOR is for key splitting.

ryandv

4 months ago

PostgreSQL's `DISTINCT ON` extension is useful for navigating bitemporal data in which I want, for example, the latest recorded version of an entry, for each day of the year.

There are few other legitimate use cases of the regular `DISTINCT` that I have seen, other than the typical one-off `SELECT DISTINCT(foo) FROM bar`.

dotancohen

4 months ago

Without DISTINCT ON (which I've never used) you can use a window function via the OVER clause with PARTITION BY. I'm pretty sure that's standard SQL.

ryandv

4 months ago

Yes, this is the implementation I have seen in other dialects.

jmull

4 months ago

I'd be wary of overgeneralizing on that. I guess it depends on whose queries you're usually reading.

RHSeeger

4 months ago

I think you're reading more into what was said than is really there

> I immediately become suspicious

All I read from that is, when DISTINCT is used, it's worth taking a look to make sure the person in question understands the data/query; and isn't just "fixing" a broken query with it. That doesn't mean it's wrong, but it's a "smell", a "flag" saying pay attention.

grumpylittleted

4 months ago

So how do you "know" when you can safely omit DISTINCT for your shiny new query SELECT x FROM t ?

Oh you looked the schema for t and it said x has a PRIMARY or UNIQUE constraint?

Ah well two minutes after you looked at the schema Tom removed the UNIQUE constraint. Now your scratching your head when you get duplicates.

Sql is a bag language not a set language. The contract with relation t is that if the runtime can find there rel t and attribute x it will return it. You may end up with rows or not, and you may end up with duplicates or not, and the type of x may change between subsequent execution.

So if you want a set you need to say so using DISTINCT. At runtime the query planner will check the schema and if the attribute is UNIQUE or PRIMARY it will not have to do a deduplication.

fipar

4 months ago

I'm not sure I understand the part about set theory. If anything, a valid use of DISTINCT is if you want the result to be (closer to) a set, as otherwise (to your point, depending on the data model) you may get a bag instead.

In fact, IIRC, using DISTINCT (usually bad for performance, btw) is an SQL advice by CJ Date in https://www.oreilly.com/library/view/sql-and-relational/9781...

dragonwriter

4 months ago

In my experience, its nearly as often a problem with the design of the database as the query author.

ch2026

4 months ago

Or maybe they’re on OLAP not OLTP.

9rx

4 months ago

Or believe more in Codd’s relational model than SQL’s tabulational model.

kpcyrd

4 months ago

SQL is somehow "ask two people, get three different opinions" for something as basic as:

"given a BTreeMap<String, Vec<String>>, how do I do .keys() and .len()".

qcnguy

4 months ago

SQL isn't very intuitive. Lots of people claim it is but then lots of people claim Haskell is, market outcomes suggest they are outliers.

The big justification for its design is to enable compiler optimizations (query planning) but compilers can optimize imperative code very well too, so I wonder if you could get the same benefits with a language that's less declarative.

leptons

4 months ago

And that's okay. Not every developer knows every single thing there is to know about every single tech. Sometimes you just need a solution, and someone with more specific knowledge can optimize later. How many non-database related mistakes would you make if you had to build every part of a system yourself?

pessimizer

4 months ago

But what if they don't know that they need your approval not to know things?

Sesse__

4 months ago

Or just doesn't know how to do semijoins in SQL, since they don't follow the same syntax as normal joins for whatever historical reason.

wvbdmp

4 months ago

Eh, sometimes you need a quick fix and it’s just extremely concise and readable. I’ll take an INNER JOIN over EXISTS (nice but insanely verbose) or CROSS APPLY (nice but slow) almost every time. Obviously you have to know what you’re dealing with, and I’m mostly talking about reporting, not perf critical application code.

Distinct is also easily explained to users, who are probably familiar with Excel’s “remove duplicate rows”.

It can also be great for exploring unfamiliar databases. I ask applicants to find stuff in a database they would never see by scrolling, and you’d be surprised how many don’t find it.

Sesse__

4 months ago

The less verbose way of doing semijoins is by an IN subquery.

wvbdmp

4 months ago

>subquery

>less verbose

Well…

In any case, it depends. OP nicely guarded himself by writing “overusing”, so at that point his pro-tip is just a tautology and we are in agreement: not every use of DISTINCT is an immediate smell.

Sesse__

4 months ago

What do you mean? Here are your real alternatives for doing a semijoin (assuming ANSI SQL, no vendor extensions):

  SELECT * FROM t1 WHERE EXISTS ( SELECT * FROM t2 WHERE t2.x = t1.x );
  SELECT * FROM t1 WHERE x IN ( SELECT x FROM t2 );
  SELECT * FROM t1 JOIN ( SELECT DISTINCT x FROM t2 ) s1 USING (x);

Now tell me which one of these is the less verbose semijoin?

You could argue that you could fake a semijoin using

  SELECT DISTINCT * FROM t1 JOIN t2 USING (x);

  SELECT * FROM t1 JOIN t2 USING (x) GROUP BY t1.*;

but it doesn't give the same result if t1 has duplicate rows, or if there is more than one t2 matching t1. (You can try to fudge it by replacing * with something else, in which case the problem just moves around, since “duplicate rows” will mean something else.)

wvbdmp

4 months ago

No, sorry, you’re certainly correct, I just meant that any subqueries are generally crazy verbose. And then you usually want additional Where clauses or even Joins in there, and it starts to stop looking like a Where clause, so I’m often happy when I can push that logic into From.

Sesse__

4 months ago

Yes, I would certainly prefer if you could write

SELECT * FROM t1 SEMIJOIN t2 USING (x);

although it creates some extra problems for the join optimizer.

Little_Kitty

4 months ago

It's great being able to use an any join (and the counterpart anti join) in Clickhouse to deal with these operations.

petalmind

4 months ago

> Overusing DISTINCT to “Fix” Duplicates

I wrote a small tutorial (~9000 words in two parts) on how to design complicated queries so that they don't need DISTINCT and are basically correct by construction.

https://kb.databasedesignbook.com/posts/systematic-design-of...

joshmn

4 months ago

Nice articles in there. Bookmarked.

Edit: it’s also actually a book!

aerzen

4 months ago

These "anti-patterns" are just workarounds for bad language design of SQL (or lack of design actually). I'm working on a language that can run on SQL databases, so I hope it will do better with every one of these points.

If anyone wants to check out a half-done lang with lacking documentation, I'd be happy to read your feedback: https://lutra-lang.org

gregjor

4 months ago

"SQL database" doesn't describe anything. Variations of SQL have implementations on relational and non-relational databases. SQL and relational often get used interchangeably but given your goal you might want to use the terms more precisely.

Experts including Codd recognized the problems with SQL since that language got traction. Some alternatives got proposed, perhaps most notably Tutorial D by Chris Date and Hugh Darwen. No SQL replacement goes anywhere because of the vast quantity of SQL code and supporting tools dating back decades. Chris Date wrote the textbook on databases, and at least one book going through the problems with SQL and various implementations of the relational model.

SQL perfectly illustrates what Strostrup meant by "There are only two kinds of languages: the ones people complain about and the ones nobody uses." In some sense I would welcome a better query language. On the other hand I attribute decades of job security and steady income to knowing SQL and dealing with its problems.

mkeedlinger

4 months ago

Hey, this looks really cool! Best wishes and I’ll try to watch out for when this is more ready

SoftTalker

4 months ago

A big one that isn't listed is looking for stuff that isn't there.

Using != or NOT IN (...) is almost always going to be inefficient (but can be OK if other predicates have narrowed down the result set already).

Also, understand how your DB handles nulls. Are nulls and empty strings the same? Does null == null? Not all databases do this the same way.

magicalhippo

4 months ago

> Also, understand how your DB handles nulls.

Also in regards to indexing. The DBs I've used have not indexed nulls, so a "WHERE col IS NULL" is inefficient even though "col" is indexed.

If that is the case and you really need it, have a computed column with a char(1) or bit indicating if "col" is NULL or not, and index that.

SoftTalker

4 months ago

NULL should generally never be used to "mean" anything.

If your business rules say that "not applicable" or "no entry" is a value, store a value that indicates that, don't use NULL.

crazygringo

4 months ago

Not sure what you mean.

If you have a table of customers and someone of them don't have addresses, it's standard to leave the address fields NULL. If some of them don't belong to a company, it's standard to leave the company_id field NULL.

This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.

If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.

MaxBarraclough

4 months ago

> This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.

Kinda. You need null for outer joins, but you could have a relational DBMS that prohibits nullable columns in tables. Christopher Date thought that in properly normalised designs, tables should never use nullable columns. Codd disagreed. [0]

> If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.

The way to do it without using a nullable column is to introduce another table for the 'optional' data, and use a left outer join.

[0] https://en.wikipedia.org/wiki/First_normal_form#Christopher_...

crazygringo

4 months ago

> The way to do it without using a nullable column

I mean, you could, but having separate tables for every optional field would be an organizational and usability nightmare. Queries would be longer and slower for no good reason. Not to mention a gigantic waste of space with all those repeated primary keys and their indexes.

And you could have databases that prohibited NULL values, but we mostly don't, because they're so useful.

MaxBarraclough

4 months ago

Right, Date's idea didn't catch on. I'm not aware of any industry-strength RDBMS prohibiting storing null.

Agreed that queries would tend be longer as you'd need joins, although views could help, especially for read operations.

Regarding storage-efficiency and query speed, agreed that it could well hurt both, but it's going to depend. If a column holds null in almost all rows, we would expect it to be more space-efficient to use a separate table and a left outer join. Query speed could also improve for queries that don't reference the nullable column, as the 'main' table would be smaller in storage. (I'm assuming a rowstore here.)

naasking

4 months ago

> but having separate tables for every optional field would be an organizational and usability nightmare

I think this indicates that declaring and managing state is too onerous in SQL.

crazygringo

4 months ago

Or maybe NULLs are actually a great solution here, and it's fine.

The idea that having a separate table for every optional field is too unworkable isn't an issue with SQL. It's a fundamentally overcomplicated idea. It's like a programming language where every variable that could be null had to be put in its own file. It would be terrible design.

naasking

4 months ago

> The idea that having a separate table for every optional field is too unworkable isn't an issue with SQL.

It sure is. Consider a database language that innately supported algebraic data types. Instead of:

    table Contact { int Id; Nullable<string> email; Nullable<string> phoneNo; }

you have:

    type PhoneOrEmail = Phone<string> | Email<string>
    table Contact { int Id; PhoneOrEmail info; }

This completely clarifies the relationships between the nullable columns (can both be null, or only one null?), and the database storage layer would manage how to actually store this. This is a direct consequence of SQL's semantics and how it implements the relational calculus.

MaxBarraclough

4 months ago

I'm reminded of table inheritance in Postgres, it's not quite the same thing but it's close.

https://www.postgresql.org/docs/current/tutorial-inheritance...

crazygringo

4 months ago

That seems like a logical nightmare to me. When thinking about which fields get returned, which fields do you index over, etc. It's a nice idea in theory, but would add an astronomical level of complexity in practice.

naasking

4 months ago

If the database supports algebraic data types, you wouldn't have to think about which fields get returned, it understand the needed semantics and handles that for you. Like I said, these are SQL limitations.

crazygringo

4 months ago

You'd need to handle it in your code. The last thing I want is a database that returns different columns per row.

No, these aren't SQL limitations, it's design that is super complex. Figuring out how to index over these multi-type fields isn't a SQL limitation, it's a hard engineering problem.

naasking

4 months ago

> The last thing I want is a database that returns different columns per row.

Actually that's exactly what you'd want, because it saves you from running two different queries in those cases with properly normalized disjoint data sets, and moves more of the domain's constraints into the database schema where it belongs.

crazygringo

4 months ago

Definitely not what I want. I'm perfectly happy getting both fields at the same time in a single query the way I do now.

> and moves more of the domain's constraints into the database schema where it belongs.

No, I prefer to keep my domain constraints at the application level where they're far more flexible.

The database is for storing information, not for validating my business logic.

I mean, I realize some people want to build some of that logic into the database, especially when many applications interact with it. But it's not a superior design pattern. If you have a single application, it's perfectly valid and desirable to put all business logic in the application, not the database.

solumunus

4 months ago

Over 90% of my optionals have their own tables and it’s the cleanest and most maintainable database I’ve ever worked with. I will always design databases this way going forward. That’s my experience.

I remember working on ERP systems with 40+ column tables, most of which were null. With no clear constraints on which options should or shouldn’t enable or make mandatory other options. This becomes incredibly obvious and natural when you group in additional tables.

My tables are incredibly concise and the cache loves this.

magicalhippo

4 months ago

> I remember working on ERP systems with 40+ column tables, most of which were null.

Those are rookie numbers. Add a zero to that number and we're talking.

And for us, a good portion of the data, a considerable fraction of those fields will have data, and which fields will vary between customers.

All except some key fields are NULL-able since the user can save and resume their work.

Just to display our main screen would require 100-150 joins using a separate table per optional.

I'm pretty sure the database would not love that.

naasking

4 months ago

400 columns all of which are nullable screams "dynamic field set" to me. Why have all of those as columns rather than something like:

    table Entity_Fields (ParentEntityId : int not null, FieldId : int not null, IntValue : int, TextValue : varchar(MAX), DateValue : datetime, ...)

Or per the OP's suggestion, a table per field type:

    table Entity_IntFields (ParentEntityId : int not null, FieldId : int not null, Value : int not null)
    table Entity_TextFields (ParentEntityId : int not null, FieldId : int not null, Value : varchar(MAX) not null)
    table Entity_DateFields (ParentEntityId : int not null, FieldId : int not null, Value : datetime not null)

magicalhippo

4 months ago

> 400 columns all of which are nullable screams "dynamic field set" to me

Could have been, but no. And doing it like you suggest would mean overview grids would have to do 50+ subqueries for each row, and loading a record would mean hundreds of queries.

And insertion performance would crater I assume, since the DB now needs to do hundreds of inserts per record rather than a single row. We do have some cases where we get 100k inserts per hour once a day or so. And this is just the main table, we have many child tables already, though they're not nearly as wide.

I think a more realistic split would have been to split the main table into maybe 10 tables or so at most. Still would result in a fair bit of subqueries for grids and such, but not that bad.

naasking

4 months ago

> And doing it like you suggest would mean overview grids would have to do 50+ subqueries for each row, and loading a record would mean hundreds of queries.

Only if you insist on loading the data into a flat record type with all of those fields. You have a dynamically evolving datatype, so your class should also be dynamic:

    public enum FieldId
    {
        Field1,
        Field2,
        //...
    }

    public class YourRecord
    {
        public Dictionary<FieldId, string> TextValues { get; set; }
        public Dictionary<FieldId, int> IntValues { get; set; }
        public Dictionary<FieldId, DateTime> DateValues { get; set; }
        // ...
    }

At most the number of subqueries increases by the number types your class needs. SQL has a fixed number of datatypes, and most apps use a small subset of these (dates, text, integral, floating point and decimal types, that's typically 5 at most). You could collapse these all into a single table with optional columns for each value too and then it's only one query.

> And insertion performance would crater I assume, since the DB now needs to do hundreds of inserts per record rather than a single row. We do have some cases where we get 100k inserts per hour once a day or so.

That's a fairly small amount of data, I don't really see an issue here.

Also, you're neglecting the fact that since the fields are independent, adding a field value doesn't involve locking and rewriting a large 400 column row to disk, but only appending a tiny 4 column row to a separate distinct table.

magicalhippo

4 months ago

> At most the number of subqueries increases by the number tables.

Fair enough. Application logic will want to access the data as a flat record, but that could be handled through getters. We have views which are also used by customers and our reporting tool, but the customers we're moving to API access and reporting could probably be done with something better if starting from scratch.

> That's a fairly small amount of data, I don't really see an issue here.

Well it turns 100k inserts into tens of millions. On tables where users need to work without slowdown while this is going on.

It probably works if you throw enough hardware at it, but currently we get by with quite modest DB hardware.

That said, how do you create compound indexes over these fields? Say you need to index one date and one varchar column? Such demands can arise down the line, often hard to predict up front.

naasking

4 months ago

> Well it turns 100k inserts into tens of millions.

I don't think "number of inserts" is the right metric because the total amount of bytes being written is almost the same, it's just written in different areas and still mostly contiguously. I think "number of distinct tables being written" is a better metric. Assuming all 400 columns become 400 records in the data type tables, say evenly divided among the 4 most common data types (int, decimal, text, date), that would be more like (4 or 5) x 100k = 400k-500k. I would still hesitate to naively compare it this way without a benchmark though, because with 4 or 5 tables being written there's also less contention than there is when everyone is writing to 1 table.

Regarding indexes, you can index the field tables but obviously this applies to the whole table. Without more understanding of your domain I can't really say if this breaks down. There's also the possibility that this one table is serving two competing goals, eg. perhaps the user should be able to add data incrementally (so this pattern applies), but then at some point when all of the data is filled it should be migrated to an actual table with all non-null columns that can be indexed and queried/processed as you usually do.

In any case, what I've sketched out isn't really new, it's been used at least since the 90s under "Dynamic Object Model" and "Entity-Attribute-Value" [1,2], so if they were using on hardware in the 90s I can't imagine the pattern would be unusable on modern hardware.

[1] https://www.cs.sjsu.edu/~pearce/oom/patterns/new/Riehle.pdf

[2] https://softwarepatternslexicon.com/patterns-sql/4/5/

crazygringo

4 months ago

At that point why not abandon databases entirely and use a key-value store?

The answer, obviously, is because traditional tables make lots of things really easy. SQL is designed for that use case, and it's performant.

You're not even talking about traditional relational databases anymore. You're trying to construct tables inside of tables, which means abandoning performance, indexes, etc.

naasking

4 months ago

> At that point why not abandon databases entirely and use a key-value store?

Because there is a schema, it's just a dynamically evolving one, and also, presumably the rest of your system depends on the relational DB, so why multiply your dependencies unnecessarily?

> You're not even talking about traditional relational databases anymore.

There's nothing non-relational about the schema I've outlined, it matches what the domain requires with less noise than 400 nullable columns.

Breza

3 months ago

This sounds like the kinds of healthcare claims data that I often analyze for work. Just in case you have a team of seven doctors working on you at the same time while you're in hospice, they keep a column for just about everything.

crazygringo

4 months ago

I don't know -- the 40+ columns feels simple and maintainable.

Having a bunch of grouped optional values in other tables adds a ton of maintenance and query complexity.

The "clear constraints" belong in business logic, whether in triggers at the database level or before queries get executed at the application level.

Now, splitting up tables can produce performance optimizations. I'm not saying to never do it. But it's a tradeoff that increases complexity.

SoftTalker

4 months ago

No null is fine if you don’t know or there’s literally no value. But don’t interpret a null phone number to mean the customer doesn’t have a phone number. You can’t infer anything from that, other than you don’t have it.

crazygringo

4 months ago

I'm not sure I agree.

If I have a column for the ID of the customer's current active subscription, and that column is NULL, it seems perfectly fine to interpret that the customer has no active subscription.

That is a valid inference. You don't need a separate has_active_subscription field.

On the other hand, your phone number example is just common sense. The database doesn't represent the external world. The database just knows the customer didn't provide a phone number.

Breza

3 months ago

I've found it's easier just to keep things NULL in many cases. Sure, if you have a student database with the column favorite_sport, maybe it makes sense to use the value "Not a sports fan." But think about an email list with the column unsubscribe_date. For active recipients, you'll want that field to be NULL.

rplnt

4 months ago

Interesting, I don't think I've seen that while NULLs are very common.

I guess you would handle it in the application and not in the query, right?

SoftTalker

4 months ago

I've seen it too, very often. But it's good if you can just keep NULL meaning NULL (i.e. "the absence of any value"), because otherwise you will eventually be surprised by behavior.

geysersam

4 months ago

> Using != or NOT IN (...) is almost always going to be inefficient.

Why do you say that?

My understanding is that as long as the RHS of NOT IN is constant (in the sense that it doesn't depend on the row) the condition is basically a hash table lookup, which is typically efficient if the lookup table is not massive.

What's the more efficient alternative?

Sesse__

4 months ago

I'm going to assume here that we're talking about a subquery here (SELECT * FROM t1 WHERE x NOT IN ( SELECT x FROM t2 )). If you're just talking about a static list, then the basic problem is the amount of data you get back. :-)

The biggest problem with NOT IN is that it has very surprising NULL behavior: Due to the way it's defined, if there is any NULL in the joined-on columns, then _all_ rows must pass. If the column is non-nullable, then sure, you can convert it into an antijoin and optimize it together with the rest of the join tree. If not, it usually ends up being something more complicated.

For this reason, NOT EXISTS should usually be preferred. The syntax sucks, but it's much easier to rewrite to antijoin.

dfe

4 months ago

It's not just the nullability behavior. My experience with several databases is that IN is always (or almost always) executing the subquery then using its results to match the outer predicate. But EXISTS can work the other direction, matching the predicates from the outer query then passing those keys into the exists, allowing use of an FK index on the inner query's table.

Sesse__

4 months ago

What databases are those? If you convert to a semijoin, both strategies should be doable.

Note that these caveats do _not_ apply to IN, only NOT IN.

geysersam

4 months ago

That's useful, thanks!

SoftTalker

4 months ago

Because they can't use indexes.

If I have a table of several million rows and I want to find rows "WHERE foo NOT IN ('A', 'B', 'C')" that's a full table scan, or possibly an index scan if foo is indexed, unless there are other conditions that narrow it down.

btilly

4 months ago

The biggest SQL antipattern is failing to recognize that SQL is actually a programming language.

Therefore you should create a consistent indentation style for SQL. See https://bentilly.blogspot.com/2011/02/sql-formatting-style.h... for mine. Second, you should try to group logical things together. This is why people should move subqueries into common table expressions. And finally, don't be afraid of commenting wisely.

xyzzy_plugh

4 months ago

Style opinions are borderline irrelevant without appropriate linters.

javcasas

4 months ago

Go and use Google BigQuery auto-formatter in a complex query with CASE and EXTRACT YEAR FROM date, and you will have a totally different opinion.

How that auto-formatter indents is borderly almost a hate crime. A thousand times better to indent manually.

OscarCunningham

4 months ago

I've even seen the BigQuery formatter change the behaviour of a query, by mixing a keyword from a comment into the real code.

FuriouslyAdrift

4 months ago

The single biggest thing that helped me speed up my queries and lower resource usage on the server was focusing on making my queries more sargable.

https://en.wikipedia.org/wiki/Sargable

https://www.brentozar.com/blitzcache/non-sargable-predicates...

jdnier

4 months ago

Looking up the etymology of "sargeable", I found this StackOverflow answer: https://dba.stackexchange.com/a/217983

And Google explains "The term 'sargable' is a portmanteau of "Search ARGument ABLE," formed by combining the words from a SQL database context."

MaxBarraclough

4 months ago

I'm sceptical of that article, it's making guesses about the limitations of SQL query optimisers.

Consider the Simple example it presents. The article is in effect implying that no query optimiser would be able to figure out the equivalence of the two predicates.

(Let's ignore that the two predicates aren't actually equivalent; the first version may raise an exception if myIntColumn is negative, depending on the DBMS.)

FuriouslyAdrift

4 months ago

Brent Ozar isn't guessing about MS SQL Server... he knows the eval engine inside and out.

MaxBarraclough

4 months ago

I should have been clearer, I meant the Wikipedia article.

Ozar's article is much better. It doesn't make sweeping assumptions about the limitations of all query optimisers, or basic oversights in contrasting supposedly equivalent predicates.

crazygringo

4 months ago

I'm really curious, what communities use that word?

I've been working with SQL for 20+ years, and have literally never come across that word a single time in any documentation, tutorial, Stack Overflow answer, or here on HN. Working in Postgres, MySQL and SQLite.

Is it used at some particular company, or open source community, or with a particular database, or something?

cowboylowrez

4 months ago

My guess is mssql as I've seen the term quite a bit with those guys.

FuriouslyAdrift

4 months ago

Oracle, DB2, SQl Server... I've seen in the more Enterprise targeted docs for Postgres.

spprashant

4 months ago

Not all of these are "anti-patterns", your query clause not matching your index is a problem of not understanding how indexes work.

Some of these have nothing to do with SQL the language itself, and more to do with database schema design. If you have to do a DISTINCT, it means your primary key design is likely not right. If you are layering too many views, something is broken in the base table design, requiring the creation of all these views.

A good database model goes a long way to avoiding all this.

kijin

4 months ago

Some of these things happen because people try to come up with a single clever query that does everything at once and returns a perfect spreadsheet.

Translating status codes into English or some other natural language? That's better done in the application, not the database. Maybe even leave it to the frontend if you have one. As a rule of thumb, any transformation that does not affect which rows are returned can be applied in another layer after those rows have been returned. Just because you know SQL doesn't mean you have to do everything in SQL.

Deeply nested subqueries? You might want to split that up into simpler queries. There's nothing shameful about throwing three stones to kill three birds, as long as you don't fall into the 1+N pattern. Whoever has to maintain your code will thank you for not trying to be too clever.

Also, a series of simple queries often run faster than a single large query, because there's a limit to how well the query planner can optimize an excessively complicated statement. With proper use of transactions, you shouldn't have to worry about the data changing under your feet as you make these queries.

thehours

4 months ago

> Mishandling Excessive Case When Statements

User Defined Functions (UDFs) are another option to consolidate the logic in one place.

> Using Functions on Indexed Columns

In other words, the query is not sargable [0]

> Overusing DISTINCT to “Fix” Duplicates

Orthogonal to author's point about dealing with fanout from joins, I'm a fan of using something like this for 'de-duping' records that aren't exact matches in order to conform the output to the table grain:

    ROW_NUMBER() OVER (PARTITION BY <grain> ORDER BY <deterministic sort>) = 1

Some database engines have QUALIFY [1], which lends itself to a fairly clean query.

[0] https://en.wikipedia.org/wiki/Sargable

[1] https://docs.aws.amazon.com/redshift/latest/dg/r_QUALIFY_cla...

andersmurphy

4 months ago

Non sargability easy to solve with expression indexes. At least in sqlite.

wmonk

4 months ago

The section of using functions on indexes could do with more explicit and deeper explanation. When you use the function on the index it becomes a full scan of the data instead as the query runner has to run the function on every row and column, effectively removing any benefit of the index.

Unfortunately I learned this the hard way!

LikesPwsh

4 months ago

Some well known docs on the topic- https://use-the-index-luke.com/sql/where-clause/obfuscation

tremon

4 months ago

The given solution (create an indexed UPPER(name) column) is not the best way to solve this, at least not on MS SQL Server. Not sure if this is equally supported in other databases, but the better solution is to create a case-insensitive computed column:

  ALTER TABLE example ADD name_ci AS name COLLATE SQL_Latin1_General_CI_AS;

(season to taste)

layer8

4 months ago

It depends on the database system, but for systems that support functional indexes, you can create an index using the same function expression that you use in the query, and the query optimizer will recognize that they match up and use the index.

For example, you define an index on UPPER(name_column), and in your query you can use WHERE UPPER(name_to_search_for) = UPPER(name_column), and it will use the index.

readthenotes1

4 months ago

"Unfortunately I learned this the hard way!" ... Seems to be the motto of SQL developers.

Otoh, it seems a fairly stable language (family of dialects?) so finding the pitfalls has long leverage

crazygringo

4 months ago

The blog has a typo. The first line needs to have the text in uppercase:

> query WHERE name = ‘ABC’

> create an indexed UPPER(name) column

The point is that the index itself is already on the data with the function applied. So it's not a full scan, the way the original query was.

Of course, in this particular example you just want to use a case-insensitive collation to begin with. But the general concept is valid.

user

4 months ago

[deleted]

anthonyIPH

4 months ago

"Instead you should:

query WHERE name = ‘abc’

create an indexed UPPER(name) column"

Should there be an "or" between these 2 points, or am I missing something? Why create an UPPER index column and not use it?

karmakaze

4 months ago

[and a third] OR use a case-insensitive collation for the name column.

MiscCompFacts

4 months ago

I think they reversed the 2 expressions. You should use “WHERE UPPER(name) = ‘ABC’” if you want to use the index.

dgb23

4 months ago

If „select *“ breaks your code, then there‘s something wrong with your code. I think Rich Hickey talked about this. Providing more than is needed should never be a breaking change.

Certain languages, formats and tools do this correctly by default. For the others you need a source of truth that you generate from.

sql_nitpicker

4 months ago

I don't see anything wrong with what the article is saying. If you have a view over a join of A and B, and the view uses "select *", then what is gonna happen when A adds a column with the same name as a column in B?

In sqlite, the view definition will be automatically expanded and one of the columns in the output will automatically be distinguished with an alias. Which column name changes is dependent on the order of tables in the join. This can absolutely break code.

In postgres, the view columns are qualified at definition time so nothing changes immediately. But when the view definition gets updated you will get a failure in the DDL.

In any system, a large column can be added to one of the constituent tables and cause a performance problem. The best advice is to avoid these problems and never use "select *" in production code.

naasking

4 months ago

Seems like a database failure if it can't notify you that introduced a breaking change. All of the schema information is available to the database after all, so it should be able to tell you about the duplicate column breaking that view.

rileymat2

4 months ago

The reasoning is in the article, and true.

> Schema evolution can break your view, which can have downstream effects

Select * is the problem itself in the face of schema evolution and things like name collision.

0xbadcafebee

4 months ago

`select *` is bad for many reasons, but the biggest is that the "contract" your code has with the remote data store isn't immutable. The database can change, for many different reasons, independent of your code. If you want to write reliable code, you need to make as few assumptions as possible. One of those assumptions is what the remote schema is.

hvb2

4 months ago

Sure but columns can change data types too which 'select column's doesn't protect you from either

layer8

4 months ago

A column changing its data type is generally considering a breaking change for the schema (for obvious reasons), while adding more columns isn’t. Backwards-compatible schema evolution isn’t practical without the latter — you’d have to add a new secondary table whenever you want to add more columns.

This mirrors how adding additional fields to an object type in a programming language usually isn’t considered a breaking change, but changing the type of an existing field is.

tremon

4 months ago

If you have select * in your code, there already is something wrong with your code, whether it breaks or not: the performance and possibly output of your code is now dependent on the table definition. I'm pretty sure Rich Hickey has also talked about the importance of avoiding non-local dependencies and effects in your code.

onli

4 months ago

The performance and partly the output of the code is always dependent on the table definition. * instead of column names just removes an output limiter, which can be useful or can be irrelevant, depending on the context.

Though sure, known to negatively affect performance, I think in some database systems more than in others?

jasonpbecker

4 months ago

We did the views on view thing once when triggers, at least how we implemented them failed. This became a huge regret that we lived with for years and not-so affectionately called "view mountain". We finally slayed viewed mountain over the last 2 years and it feels so good.

bambax

4 months ago

> Excessive View Layer Stacking

Guilty as charged. I love to do this. Materialized views aren't really possible on sqlite, and so I find stacking views on top of one another very readable and manageable. But it's true other people find it a little obscure and weird.

CuriousRose

4 months ago

I don’t fully agree with the nested view argument. In our context (POS software) we use them heavily to have a single source of truth for a clean transaction view, joining common tables like product, category, etc and then using that as the backbone for all user reporting that might be more/less complex. Not doing this means that we need to accommodate for each where clause in each table on each report. For example eliminating voided lines, voided transactions, returned transactions, etc. Not having this means that a single logic change would need to update 20+ views/stored procs so for our case I think its valid to nest.

skybrian

4 months ago

> three or four layers of subqueries, each one filtering or aggregating the results of the previous one, totaling over 5000 lines of code

In a better language, this would be a pipeline. Pipelines are conceptually simple but annoying to debug, compared to putting intermediate results in a variable or file. Are there any debuggers that let you look at intermediate results of pipelines without modifying the code?

tremon

4 months ago

This is not a pipeline in the control flow sense; the full query is compiled into a single processing statement, and the query compiler is free to remove and/or reorder any of the subqueries as it sees fit. The intermediate results during query execution (e.g. temp table spools) do not follow the structure of the original query, as CTEs and subqueries are not execution boundaries. It's more accurate to compare this to a C compiler that performs aggressive link-time optimization, including new rounds of copy elision, loop unrolling and dead code elimination.

If you want to build a pipeline and store each intermediate result, most tooling will make that easy for you. E.g. in dbt, just put each subquery in its separate file, and the processing engine will correctly schedule each subresult after the other. Just make sure you have enough storage available, it's not uncommon for intermediate results to be hundreds of times larger than the end result (e.g. when you perform a full table join in the first CTE, and do target filtering in another).

skybrian

4 months ago

Sure, a sufficiently smart compiler can do what it wants, but it's often conceptually a pipeline and could be implemented as one in debug mode, without having to rewrite the code. Not in production, though, since you don't want to store stuff in temporary files when you're not debugging them.

In some languages, a series of assignments and a large expression will often compile to the same thing, but if written as assignments, it will make it easier to set breakpoints.

physicles

4 months ago

I wrote some tooling to help debug sql queries with many CTEs. It parses the sql, finds all the CTEs, and prints the result of each CTE formatted as csv. If the .sql file changes on disk, it reruns the query and tells you which CTEs’ output changed. Saved me hours in debugging.

mattnewport

4 months ago

> Are there any debuggers that let you look at intermediate results of pipelines without modifying the code?

F# in the visual studio debugger does a pretty good job of this in recent versions.

chongli

4 months ago

When working with larger enterprise software, it is common to have large CASE WHEN statements translating application status codes into plain English. For example, status code 1 could mean the item is out of stock.

Why wouldn’t you store this information in a table and query it when you need it? What if you need to support other languages? With a table you can just add more columns for more languages!

megaman821

4 months ago

I usually use generated columns for this. It still uses CASE WHEN but it is obvious to all consumers of the table that it exists.

chuliomartinez

4 months ago

SQL makes it very hard to express real world requirements.

1. No easy way to limit child records count in joins - find all orders with orderproduct.amount is greater than X. Obviously this will genereate duplicates for orders that have more than one such orderproduct. So you slap a distinct on it… but what if you need an aggregation?

The possible fixes are highly non-trivial: subqueries, window functions, or vendor specific: outer apply.

2. Or queries, that is when you group where conditions with OR are very hard (impossible) to optimize.

Apart from the trivial case where the conditions are all on the same column, you are better of leaving the declarative world and imperatively tell sql to do a union.

I wrote a bit about it here: https://www.inuko.net/blog/platform_sql_or_conditions_on_joi...

procaryote

4 months ago

I don't really understand the problem in 1

in 2, looking at your article, from your first query it looks like person_relationship contains both (A,B) and (B,A) for all related people A and B; otherwise the left join won't work. If you also make people related to themselves and store (A,A) and (B,B) there your query becomes much simpler:

    SELECT other.id, other.name
    FROM person p
      JOIN person_relationship r ON r.from_person_id = p.id
      JOIN person other ON r.to_person_id = other.id
    WHERE p.family_id = @familyId;

chuliomartinez

4 months ago

Creative solution! Of course this creates more maintenance to ensure such records exist. I guess you kinda proved my point, you have to work around it.

dfe

4 months ago

Your own article points out that exists handles the first case. Exists is not actually implemented as a subquery, it is merely syntactically a subquery.

chuliomartinez

4 months ago

Sure exists make sense if you dont need columns from the child table. Exists is also far from basic sql.

ddxv

4 months ago

I've built myself a few problems that I haven't fixed yet:

Many materialized views that rely on materialized views. When one at the bottom, or a table, needs a changed all views need to be dropped and recreated.

Using a warm standby for production. I love having a read only production database, but since it's not the primary, it always feels like it's on the losing end of the system. Recently upgraded to Postgres 18 and forgot that means I need to rm rf the standby and pg_basebackup to rebuild... That wasn't fun.

physicles

4 months ago

Why do you need a warm standby for production? Do you need >= 3 nines?

Our staging environment has its own instance that is rebuilt from prod, with pii removed, every day outside working hours (this normally takes about 15 minutes). It’s fantastic for testing migrations, and is easy to support compared with a warm standby.

ddxv

4 months ago

"Do you need >= 3 nines?" No, it's a one person project. I'd be happy with a single 9 or even no 9s just a 99.0 haha

I switched to warm standby to reduce stress on the production db which was in the cloud. There is just a single production server and having it constantly run the heavy data processing MVs + handle queries was CPU intensive and slowed everything down. The CPU was costly.

To fix those issues, especially the CPU, I run the primary on a home server where it can crank the CPU as much as it wants running the data processing MVs and then sends the processed WALs to the warm standby that just handles the queries.

This has fixed those CPU and slow queries (when an MV is updating a table that is being constantly read). But introduced headaches anytime I update postgres.

My understanding is the 'fix' is to move data processing to another postgresql DB or flow? My biggest reason for not using another DB is I didn't like the idea of losing direct relations for keys.

Anyways, I appreciate the input, it's been a thorny issue I hit once or twice a year and am always unsure if what I'm doing is 'normal' or what I should do to fix it.

echelon

4 months ago

I'd like to call views, triggers, and integrity constraints antipatterns.

Your code should handle the data model and never allow bad states to enter the database.

There's too much performance loss and too many footguns from these "features".

Arch-TK

4 months ago

Forgot to add (all seen in production):

* Don't store UUIDs as strings.

* Don't use random UUID variants for your primary key (or don't use UUIDs for your primary key).

* Don't use a random column in your clustered index.

MichaelNolan

4 months ago

I guess things are DB dependent. Spanner for instance not only recommends using uuidv4 as a PK, it also stores it as string(36). Uuidv4 as a PK works fine on Postgres as well.

darkerside

4 months ago

Can you share any details? A teammate wants to change primary identifiers to a GUID, but I'm not sure it's a good idea.

Arch-TK

4 months ago

UUIDs take up 36 bytes as strings so store them natively as 16 bytes instead if you can.

This is still 2x the space of an auto increment number.

This is overhead for every table, every index, and every relationship.

That might be acceptable in your case though, the case where it became unacceptable in my experience was in a MSSQL Express context. But it was an idiotic decision to use MSSQL to begin with in that scenario.

Regarding random clustered indexes. Broadly speaking you want your clustered index to be made up of some incremental unique set of fields.

I mean, technically there is not a massive issue, but the largest tables in your database will be the non-indexes (indexes are just tables) and you want your big, mainly append only, tables to be nicely compact so a bunch of space isn't taken up by half full pages.

But again, I should honestly have clarified that the problem was mainly an MSSQL Express problem where databases are limited to 10GiB.

You might honestly be fine, but do look for documentation on your specific database.

BrenBarn

4 months ago

> SQL is one of those languages that looks simple on the surface but grows in complexity as teams and systems scale.

The funny thing is it's actually several of those languages. :-)

seanhunter

4 months ago

I can’t take any article like this seriously if it doesn’t lead with the #1 sql antipattern which kills performance all the time - doing things row-by-row instead of understanding that databases operate on relations, so you need to do operations over whole relations.

Very often I have seen this problem buried in code design and it always sucks. Sometimes an orm obscures this but the basic antipattern looks like

   Select some stuff
   For each row in stuff:
      … do some important things …
      Select a thing to do with this row
      … maybe do some other things …

Early on in my career an old-hand sql guru said to me “any time you are doing sql in a loop, you are probably doing it wrong”.

The non-sucky version of the code above is

   Select some stuff, joining on all the things you need for the rows because databases are great
   For each row in stuff:
      … do some important things …
      … maybe do some other things …

jacknews

4 months ago

"When handling large CASE WHEN statements, it is better to create a dimension table or view, ideally sourced from the landed table where the original status column is populated."

Is this code for 'use a lookup table' or am I falling behind on the terminology? The modern term should be 'sum table' or something similar surely.

LikesPwsh

4 months ago

"Dimension table" is the name for lookup tables in a star or snowflake schema.

jacknews

4 months ago

TIL, Thanks.

'Landed table'? Is that the 'fact table', the one that contains the codes that need to be looked-up?

tremon

4 months ago

I'm pretty sure the landed table refers to the local copy of the original source. In an ETL* pipeline, the place where source data is stored for further processing is usually called the landing zone. Fact and Dimension tables are outputs of the process, whereas the landing tables are the inputs.

* in whatever order they're used

parpfish

4 months ago

but sometimes large case statements cant be turned into a simple dimension table/lookup table because it's not a simple key-value transformation.

if your case statement is just a series of straighahead "WHEN x=this THEN that", you're very lucky.

the nasty case statements are the ones were the when expression sometimes uses different pieces of data and/or the ordering of the statements is important.

egeozcan

4 months ago

I don't know about anti patterns but what I like to do is putting 1=1 after each WHERE to align ANDs nicely and this is enough to create huge dramas in PR reviews.

DrewADesign

4 months ago

> what I like to do is putting 1=1 after each WHERE to align ANDs nicely

Frankly, that sounds like one of those things that totally makes sense in the author’s head, but inconsiderately creates terrible code ergonomics and needless cognitive load for anyone reading it. You know to just ignore those expressions when you’re reading it because you wrote it and know they have no effect, but to a busy code reviewer, it’s annoying functionless clutter making their job more annoying. “Wait, that should do nothing… but does it actually do something hackish and ‘clever’ that they didn’t comment? Let’s think about this for a minute.” Use an editor with proper formatting capability, and don’t use executable expressions for formatting in code that other people look at.

tombert

4 months ago

Using `WHERE 1=1` is such a common pattern that I seriously doubt it's realistically increasing "cognitive load".

I've seen it used in dozens of places, in particular places that programmatically generate the AND parts of queries. I wasn't really that confused the first time I saw it and I was never confused any time after that.

MobiusHorizons

4 months ago

I use `WHERE true` for this. Very little cognitive load parsing that. And it makes AND conditions more copy pastable. Effectively the trailing comma of SQL where clauses

DrewADesign

4 months ago

I absolutely cannot see how this would do what IDE formatting can’t, but admittedly the last time I wrote any significant amount of SQL directly was in a still-totally-relevant Perl 5 application. Could you give an example or link to a file in a public repository or whatever that would show this practice in context?

MobiusHorizons

4 months ago

These were all adhoc analytics queries, not the sort of thing that get checked in.

It means you can copy paste any where condition because they are all of the form

   AND <condition>

otherwise one condition is of the form

  WHERE <condition 1>

So adding it to a query that already has a where clause is a bit more awkward.

I use this technique for some analytics queries that all rely on the same base table. It’s not uncommon to start with copying an old query and just adding or removing conditions and grouping/aggregating until I get the right data. Using this format also makes commenting out any condition trivial.

  WHERE true
  --    AND some_column = “some value”
      AND event = “SOME_EVENT_TYPE”
      AND EXISTS(SELECT * FROM UNNEST(array_column) as v WHERE v = “some value”)

I don’t see how you could achieve this result with just IDE formatting.

ffsm8

4 months ago

It's always perfectly aligned for me, because enter prefixes 2 whitespace in my ide in SQL files, ending with

    where a=1
      And k=2
      And v=3

egeozcan

4 months ago

But the first condition looks special while it isn't and it sometimes leads to changes touching one too many lines.

exceldrawing

4 months ago

Oracle DATE field stores a time component. You have to be aware and adjust your queries to be specific.

ftchd

4 months ago

the points are fine and helpful, but they seem like a note from the author to themself rather than a cheatsheet that tries to be exhaustive.

was surprised to not see anything about dates/time.

jwsteigerwalt

4 months ago

That’s my rap sheet…

0xbadcafebee

4 months ago

At this point it's malpractice not to use AI to analyze your SQL statements and tables for optimizations

jpnc

4 months ago

Are we on bizarro HN?

No, you ask the DB to EXPLAIN itself to you.

Arch-TK

4 months ago

Next you'll be telling me that instead of asking AI to find my bug I should just use print statements or a debugger to observe the state of my program over time to find where it deviates from expectations and figure it out that way.

mjb

4 months ago

I agree. Modern code models tend to do a great job advising in SQL, especially if you include the table definition and EXPLAIN output in the context. Alternatively, I've found that an EXPLAIN MCP tool works well.

JohnHaugeland

4 months ago

these aren’t anti patterns. these are just things you shouldn’t do

hobs

4 months ago

https://pragprog.com/titles/bksqla/sql-antipatterns/ There's an actual book on them that had me nodding along the entire time.

evanelias

4 months ago

Agreed, it’s an excellent book by a great author. Bill is also quite prolific on Stack Overflow, and generally if you see an answer from him there, you can be confident it’s solid advice.

JohnHaugeland

4 months ago

that's a fantastic book; one of the best i've read, and i'm glad to see it get brought up

but also, the book anti-patterns is pretty clear here

em500

4 months ago

Still waiting for the definitive article on how using the term anti-pattern is an anti-pattern.

readthenotes1

4 months ago

If a pattern is a common problem (e.g., becoming accustomed to a spectacular view) and generally-useful solution to that problem (blocking the view so that effort is required to obtain it), then an anti-pattern is what?

I think most people think an anti-pattern is an aberration in the "solution" section that creates more problems.

So here, the anti-pattern is that people use a term so casually (e.g., DevOps) that no one knows what it's referring to anymore.

(The problem: need a way to refer to concept(s) in a pithy way. The solution: make up or reuse an existing word/phrase to incorporate the concept(s) by reference so that it can can, unambiguously, be used as a replacement for the longer description. )

JadeNB

4 months ago

> If a pattern is a common problem (e.g., becoming accustomed to a spectacular view) and generally-useful solution to that problem (blocking the view so that effort is required to obtain it), then an anti-pattern is what?

Strange choice of example! I'm not sure I agree that your example is a common problem, and I'm even less sure that the proposed solution to it is generally useful.

readthenotes1

4 months ago

It's name is Zen View, and is one of the memorable patterns from Alexander's catalog of design patterns

JohnHaugeland

4 months ago

> If a pattern is a common problem

it isn't, is the thing.

if you read the book design patterns, they spell out what a pattern is.

if you read the book anti-patterns, he spells out what an anti-pattern is.

people have gotten the wrong idea by learning the phrases from casual usage.

MaxBarraclough

4 months ago

Pointing to books isn't very helpful here. Please just state the definition you are advocating.

anotherevan

4 months ago

Well you do have to be careful, because if patterns and anti-patterns come into contact it could cause an explosive conflagration of regular expressions all over the place.

karmakaze

4 months ago

I'm waiting for the anti-patterns we shouldn't avoid.

KronisLV

4 months ago

No mentions of EAV/OTLT, I will use this opportunity to have a crashout about it and how in some companies/regions for whatever reason it's overused to the point where you'll see it in most projects and it's never nice to work with: https://softwareengineering.stackexchange.com/questions/9312...

If I have to work with one more "custom field" or "classifier" implementation, I am going to cry. Your business domain isn't too hard to model, if you need a 100 different "entities" as a part of it, then you should have at least 100 different tables, instead of putting everything into an ill fitting grab bag. Otherwise you can't figure out what is connected to what by just looking at the foreign keys pointing to and from a table, because those simply don't exist. Developers inevitably end up creating shitty polymorphic links with similarly inevitable data integrity issues and also end up coupling the schema to the back end, so you don't get like "table" and "table_id" but rather "section" and "entity_id" so you can't read the schema without reading the back end code either. Before you know it, you're not working with the business domain directly, but it's all "custom fields this" and "custom fields that" and people end up tacking on additional logic, like custom_field_uses, custom_field_use_ids, custom_field_periods, custom_field_sources and god knows what else. If I wanted to work with fields that much, I'd go and work on a farm. Oh, you're afraid of creating 100 tables? Use codegen, even your LLM of choice has no issues with that. Oh, you're too afraid that you're gonna need to do blanket changes across them and will forget something? Surely you're not above a basic ADR, literally putting a Markdown file in a folder in the repo. Oh, you're afraid that something will go wrong in those 100 migrations? How is that any different than you building literally most of your app around a small collection of tables and having fewer migrations that will affect pretty much everything? Don't even get me started on what it's like when the data integrity issues and refactoring gone bad starts. Worst of all, people love taking that pattern and putting it literally everywhere, feels like I'm taking crazy pills and nobody seems to have an issue what it's like when most of the logic in your app has something to do with CustomFieldService.

Fuck EAV/OTLT, thanks for coming to my rant. When it comes to bad patterns, it's very much up there, alongside using JSON in a relational database for the data that you can model and predict and put into regular columns, instead of just using JSON for highly dynamic data.

> Excessive View Layer Stacking

> In larger data environments, it’s easy to fall into the trap of layering views on top of views. At first, this seems modular and organized. But over time, as more teams build their own transformations on top of existing views, the dependency chain becomes unmanageable. Performance slows down because the database has to expand multiple layers of logic each time, and debugging turns into an archaeological dig through nested queries. The fix is to flatten transformations periodically and materialize heavy logic into clean, well-defined base views or tables.

I will say that this is nice to strive for, but at the same time, I much prefer having at least a lot of views instead of dynamically generated SQL by the application (see: myBatis XML mappers), because otherwise with complex logic it's impossible to predict exactly how your application will query the DB and you'll need to run the app locally with logging debug levels on so you see the actual SQL, but god forbid you have noisy DB querying or an N+1 problem somewhere, log spam for days, so unpleasant to work with. It's even more fun when people start nesting mappers and fucking around with aliases, just give me MongoDB at this point, it's web scale.