What I wish someone told me about Postgres

480 points, posted 6 days ago
by todsacerdoti

189 Comments

munk-a

6 days ago

While Postgres is indeed case insensitive, writing queries with keywords in all caps is usually an effort to increase legibility for visual pattern matching. It absolutely isn't needed, but if I'm debugging a query of yours I will send it through my prettifier so that I can breeze through your definitions without getting hung up on minor weird syntax things.

It's like prettification in any other language - visual structures that we can quickly recognize (like consistent indentation levels) make us waste less time on comprehension of the obvious so we can focus on what's important.

The only thing I really object to is "actuallyUsingCaseInIdentifiers". I never want to see columns that require double quotes for me to inspect on the CLI.

wiredfool

6 days ago

I find all-caps identifiers wind up just looking like interchangeable blocks, whereas lowercase ones have word shapes. So all caps just slows down my reading.

WorldMaker

6 days ago

I feel similarly, and I also have friends with dyslexia who have even stronger opinions on it. All caps, in addition to being "shouting" to my ancient internet-using brain (and thus rude in most cases), creates big, similar rectangular blocks as word shapes and is a big speed bump to reading speed for everyone (whether or not they notice it). For some of my friends with dyslexia who have a tough time with word shapes at the best of times, all caps can be a hard-stop "cannot read" blocker. They say it is like trying to read a redacted document where someone just made rectangular black marker cross-outs.

Personally, given SQL's intended similarity to English, I find that I like English "sentence case" for it, with the opening keyword starting with a capital letter and nearly every remaining letter lower case (except for Proper Nouns, the truly case-sensitive parts of SQL like table names). Sentence case has helped me in the past spot things like missing semicolons in dialects like Postgres' that require them (and near keywords like `Merge` that require them), and it helps visually confirm that the `select` under a `Merge` is intended as a clause rather than the start of a new "sentence".

akira2501

5 days ago

> I find that I like English "sentence case" for it,

I could go either way, but if you want to go back and modify a query, this makes it more difficult for me. I just use a block syntax for my queries:

    SELECT   *
    FROM     the_table
    WHERE    some_column = 12
    AND      other_column IS NOT NULL
    ORDER BY order_column;

WorldMaker

5 days ago

It's a bit of a "Why not both?" situation, I think? You can have blocks and sentence case:

    Select   *
    from     the_table
    where    some_column = 12
    and      other_column is not null
    order by order_column;

That seems so much more readable to me. As I said, that single capital `S` in the outermost "select" has come in surprisingly handy in my experience when scanning through a collection of statements or a transaction or a stored procedure or even just a statement with a bunch of nested sub-selects. It's an interesting advantage I find over "all lower case" or "all upper case" keywords.

4ggr0

5 days ago

your example is less readable for me. not by a lot, but still.

the other example has the commands, in caps, on the left and the values on the right, in lowercase. your example removes one of those aspects and makes everything lowercase. my brain can ignore all-caps stuff, as these are just commands and the things i actually care about mostly are the values.

but i mean, in the end, it's just preferences. if you write SQL-queries, #1 is that you understand them well :)

Ntrails

5 days ago

> your example is less readable for me. not by a lot, but still

Agree, but I wonder how much of that is just the lack of colouring. My brain is suuuuper hard wired to expect all the special keywords to be identified that way as well as by case.

Mostly I'm a caps guy because my ahk scripts expand text like "ssf","w" and "gb"* to be the way I learned to write them at first.

wruza

5 days ago

I use a similar structure but without column alignment.

  SELECT
    a,
    b
  FROM t1
  JOIN t2 ON …
  WHERE cond1
    AND cond2
  ORDER/GROUP/HAVING etc

cwbriscoe

5 days ago

   SELECT *
     FROM the_table
    WHERE some_column = 12
      AND other_column IS NOT NULL
 ORDER BY order_column;

I usually don't bother with ALL-CAP keywords.

conductr

5 days ago

I prefer lower case for my personal legibility reasons, and it seems like a prettifier should be able to adjust to the user's preference. It's not a team sport for me, so I've never had a conflict of styles other than converting public code samples to match my way.

yen223

5 days ago

I've always found it funny that SQL was designed the way it is to be as close to natural English as possible, but then they went ahead and made everything all-caps

paulryanrogers

5 days ago

Some old terminals didn't have lower case, like 1960s-era ones.

andrei_says_

5 days ago

Also, SQL editors like DataGrip color SQL syntax very well.

gwbas1c

6 days ago

It's really useful to know this when working with SQL interactively.

Specifically, if I'm banging out an ad-hoc query that no one will ever see, and I'm going to throw away, I don't worry about casing.

Otherwise, for me, all SQL that's checked in gets the commands in ALL CAPS.

archsurface

6 days ago

My understanding is that the caps were syntax highlighting on monochrome screens; no longer needed with colour. Can't provide a reference, it's an old memory.

_Wintermute

6 days ago

Most of the SQL I write is within a string of another programming language, so it's essentially monochrome unless there's some really fancy syntax highlighting going on.

jaredklewis

6 days ago

Aside, but JetBrains IDEs seem to have some way to detect embedded SQL and highlight it. I don't remember configuring anything to get this feature.

wmfiv

6 days ago

More than highlighting: they'll also do schema validation against inline SQL strings.

TRiG_Ireland

5 days ago

VS Code does (did?) detect embedded SQL in PHP and correctly colour it, but only if it's on a single line. Any linebreaks and the colour is turned off. Also, if you're using prepared statements and have an @label, and that label is at the end of the string (so immediately followed by a closing quote), the SQL colouring continues into the rest of the PHP beyond the string. So it's important that single-line SQL statements ending in a @label be edited into multi-line statements to turn off the broken SQL colouring. Odd.

munk-a

5 days ago

PHP strings tend to have better syntax highlighting with here/now docs (i.e. starting with `<<<TOKEN`). I've found SublimeText to have excellent SQL detection when using these tokens to delineate queries (and the syntax lends itself well to block strings anyways).

formerly_proven

6 days ago

If you're working with an established framework and project structure, their IDEs pull that information out of it; otherwise you'll need to at least tell it the dialect. But if you e.g. configure the database as a data source in the IDE, you'll get full schema cross-referencing.

tomjen3

6 days ago

Another aside: that is true for a huge range of programming languages as well as things like HTML. I believe it can automatically add \ to " in strings when those strings are marked to the IDE as HTML.

somat

5 days ago

Same, I am a bit conflicted: I enjoy SQL and don't really like ORMs, but I hate seeing big blocks of SQL in my code.

So I wrote a thing that lets me use SQL text as a function. It is several sorts of terrible, in the way that you should never write clever code, but I am not really a programmer and most of my code is for myself. I keep using it more and more. I dread the day someone else needs to look at my code.

http://nl1.outband.net/extra/query.txt

WorldMaker

6 days ago

I find that in the worst cases I can always copy and paste to a quick temporary buffer that is highlighted. I might be doing that naturally anyway if I'm trying to debug it, just to run it in a Data IDE of my choice, but sometimes even just using a scratch VS Code "Untitled" file can be useful (its SQL auto-detect is usually good enough, but switching to SQL is easy enough if it doesn't auto-detect).

CoastalCoder

6 days ago

I think the "color is all we need" idea makes sense in proportion to how many of our tools actually support colorization.

E.g., the last time I used the psql program, I don't think it had colorization of the SQL, despite running in a color-capable terminal emulator.

It probably doesn't help that terminal colors are a bit of a mess. E.g., piping colored output through 'less' can result in some messy control-character rendering rather than having the desired effect.

dllthomas

6 days ago

> piping colored output through 'less' can result in some messy control-character rendering rather than having the desired effect.

It can, but -G or -R can fix that.

vundercind

6 days ago

Like sigils in that regard. Perl-type sigils are extremely nice... if you're editing in Notepad or some ancient vi without syntax highlighting and the ability to ID references on request. Little point to them, if you've got more-capable tools.

o11c

6 days ago

Note that case handling is a place where postgres (which folds to lowercase) violates the standard (which folds to uppercase).

This is mostly irrelevant since you really shouldn't be mixing quoted with unquoted identifiers, and introspection largely isn't standardized.
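
A quick sketch of the folding behaviour (hypothetical names, Postgres semantics):

    -- Unquoted identifiers are folded: to lower case in Postgres, to upper case per the standard
    CREATE TABLE Accounts (UserId int);   -- actually creates table accounts with column userid
    SELECT userid FROM accounts;          -- works
    SELECT "UserId" FROM "Accounts";      -- fails: relation "Accounts" does not exist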

yen223

5 days ago

Given that other mainstream RDBMSes let you configure how case handling should happen, Postgres is arguably the closest to the standard.

Usual caveat of how nobody sticks to the ANSI standard anyway applies.

emmanuel_1234

6 days ago

Any recommendations for a prettifier / SQL linter?

gnulinux

6 days ago

I'm curious about this for DuckDB [1]. In the last couple months or so I've been using DuckDB as a one-step solution to all problems I solve. In fact my development environment rarely requires anything other than Python and DuckDB (and some Rust if native code is necessary). DuckDB is an insanely fast and featureful analytic db. It'd be nice to have a linter, formatter etc specifically for DuckDB.

There is sqlfluff etc but I'm curious what people use.

[1] The DuckDB SQL dialect is very close to Postgres; it's compatible in many ways but has some extra QOL features related to analytics, and lacks a few features like `vacuum full`.

homebrewer

6 days ago

IDEA if you want to use it for other things (or any other JetBrains IDE). Nothing comes close feature-wise.

If you don't:

- https://www.depesz.com/2022/09/21/prettify-sql-queries-from-...

- https://gitlab.com/depesz/pg-sql-prettyprinter

Or https://paste.depesz.com for one-off use.

jillyboel

5 days ago

i think the idea sql prettifier is pretty silly sometimes. it really likes indenting stuff to make sure things are aligned, which often results in dozens of whitespaces

homebrewer

5 days ago

It makes it easy to distinguish null vs not null columns and other similar things, so I personally don't mind.

jillyboel

5 days ago

It's more about queries like (dummy example)

    RETURN CASE
               WHEN a = 1 THEN 1
               ELSE 2
        END;

where it insists on aligning WHEN past CASE. I think it would be perfectly reasonable to indent WHEN and ELSE 4 spaces less, for example. Similar things happen with nested conditions like (a or (b and c)) all getting pushed to the right.

mannyv

5 days ago

The uppercase is usually there to show people what's SQL vs what's custom to your database. In books it's usually set in courier.

I thought he was talking about psql's case sensitivity with table names, which is incredibly aggravating.

avg_dev

6 days ago

i agree; but i use caps in my codebase, and lowercase when testing things out manually, just for ease of typing.

munk-a

6 days ago

Ditto - if I'm throwing out an inspection query just to get a sense of what kind of data is in a column I won't bother with proper readable syntax (i.e. a `select distinct status from widgets`). I only really care about code I'll need to reread.

MetaWhirledPeas

5 days ago

> writing queries with keywords in all caps is an effort to increase legibility for visual pattern matching

Considering most programming languages do just fine without ALL CAPS KEYWORDS I'd say it's a strange effort. I wish SQL didn't insist on being different this way.

I agree with you on prettification though. As long as the repository chooses a prettifier you can view it your way then commit it their way. So that's my advice: always demand prettification for pull requests.

sensanaty

5 days ago

I don't write or read SQL too often, but I prefer the ALL_CAPS because it usually lets me know if something is part of the SQL syntax itself or a function or whatever, or if it's referencing a table/column/etc.

Obviously not very foolproof, but automated linters/prettifiers like the one in DataGrip do a good job with this for every query I've ever thrown at it.

grahamplace

5 days ago

For checked-in SQL queries, we follow: https://www.sqlstyle.guide/

The combination of all caps keywords + following "the river" whitespace pattern dramatically improves readability in my opinion

zusammen

5 days ago

The all caps syntax also helps queries stand out as distinct from typical source languages. It is often helpful, since SQL tends to end up in all sorts of applications.

neves

5 days ago

You can convey more info with color. Any half decent editor can color your SQL.

All caps letters are more similar and harder to read.

marcosdumay

6 days ago

Well, as long as you aren't imposing the noisy syntax on everybody by pushing the case change back into the code...

But getting some editor that highlights the SQL will completely solve your issue.

munk-a

6 days ago

I think fighting in PRs over syntax preferences is pretty useless so dev shops should generally have code style guidelines to help keep things consistent. In my company we use all caps casing since we have momentum in that direction but I think that decision can be reasonable in either direction as long as it's consistent - it's like tabs vs. spaces... I've worked in companies with both preferences, I just configure my editor to auto-pretty code coming out and auto-lint code going in and never worry about it.

troyvit

5 days ago

> no longer needed with colour.

I think the increased legibility for visual pattern matching also makes SQL easier to read for many of the 350 million color blind people in the world.

jillyboel

5 days ago

what is your prettifier of choice for postgres?

christophilus

6 days ago

I’d never stumbled across the “don’t do this” wiki entry[0] before. Very handy.

[0] https://wiki.postgresql.org/wiki/Don%27t_Do_This

gwbas1c

5 days ago

Why don't they deprecate some of these features? If they're such easy stumbling blocks, seems like it makes sense to disable things like table inheritance in new schemas, and require some kind of arcane setting to re-enable them.

Macha

5 days ago

e.g. the suggested replacement for timestamp is timestamptz, which has its own problems (notably, it eagerly converts to UTC, which means it cannot account for TZ rule changes between storing the date and reading it). If medium term scheduling across multiple countries is something that needs to work in your app, you're kind of stuck with a column with a timestamp and another column with a timezone.

Terr_

5 days ago

> stuck with a column with a timestamp and another column with a timezone.

I've been tinkering with a weird hack for this issue which might help or at least offer some inspiration.

It's similar to the foot-gun of "eagerly convert straight to UTC" except you can easily recalculate it later whenever you feel like it. Meanwhile, you get to keep the same performance benefits from all-UTC sorting and diffing.

The trick involves two additional columns along with time_zone and time_stamp, like:

    -- Column that exists as a kind of trigger/info
    time_recomputed_on TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),

    -- You might be able to do this with special triggers too
    -- This coalesce() is a hack so that the above column changing causes re-generation
    estimated_utc TIMESTAMP GENERATED ALWAYS AS (COALESCE(timezone('UTC', timezone(time_zone, time_stamp)), time_recomputed_on)) STORED

PostgreSQL will recalculate estimated_utc whenever any of the other referenced columns change, including the useless dependency on time_recomputed_on. So you can force a recalc with:

    UPDATE table_name SET time_recomputed_on = now() WHERE time_recomputed_on < X;

Note that if you want time_recomputed_on to be more truthful, it should probably get updated if/when either time_zone or time_stamp is changed. Otherwise it might show the value as staler than it really is.

https://www.postgresql.org/docs/current/ddl-generated-column...

stickfigure

5 days ago

`timestamptz` doesn't convert to UTC, it has no timezone - or rather, it gets interpreted as having whatever TZ the session is set to, which could change anytime. Postgres stores the value as a 64 bit microseconds-since-epoch. `timestamp` is the same. It's sad that even the official PG docs get this wrong, and it causes problems downstream like the JDBC driver natively mapping it to OffsetDateTime instead of Instant.

But you're right that timestamptz on postgres is different from timestamptz on oracle, which _does_ store a timezone field.

Ndymium

5 days ago

Here's the PostgreSQL documentation about timestamptz:

> For timestamp with time zone, the internally stored value is always in UTC (Universal Coordinated Time, traditionally known as Greenwich Mean Time, GMT). An input value that has an explicit time zone specified is converted to UTC using the appropriate offset for that time zone. If no time zone is stated in the input string, then it is assumed to be in the time zone indicated by the system's TimeZone parameter, and is converted to UTC using the offset for the timezone zone.

> When a timestamp with time zone value is output, it is always converted from UTC to the current timezone zone, and displayed as local time in that zone. To see the time in another time zone, either change timezone or use the AT TIME ZONE construct (see Section 9.9.4).

To me it seems to state quite clearly that timestamptz is converted on write from the input offset to UTC and on read from UTC to whatever the connection timezone is. Can you elaborate on which part of this is wrong? Or maybe we're talking past each other?

stickfigure

4 days ago

That is, unfortunately, a lie. You can look at the postgres source, line 39:

https://doxygen.postgresql.org/datatype_2timestamp_8h_source...

Timestamp is a 64 bit microseconds since epoch. It's a zone-less instant. There's no "UTC" in the data stored. Times are not "converted to UTC" because instants don't have timezones; there's nothing to convert to.

I'm guessing the problem is that someone heard "the epoch is 12am Jan 1 1970 UTC" and thought "we're converting this to UTC". That is false. These are also the epoch:

* 11pm Dec 31 1969 GMT-1

* 1am Jan 1 1970 GMT+1

* 2am Jan 1 1970 GMT+2

You get the picture. There's nothing special about which frame of reference you use. These are all equally valid expressions of the same instant in time.

So somebody wrote "we're converting to UTC" in the postgres documentation. The folks writing the JDBC driver read that and now they think OffsetDateTime is a reasonable mapping and Instant is not. Even though the stored value is an instant. And the only reason all of this works is that everyone in the universe uses UTC as the default session timezone.

To make it extra confusing, Oracle's (and possibly others') TIMESTAMP WITH TIME ZONE actually stores a timezone. [1am Jan 1 1970 GMT+1] <> [2am Jan 1 1970 GMT+2]. So OffsetDateTime makes sense there. And the generic JDBC documentation suggests that OffsetDateTime is the natural mapping for that type.

But Postgres TIMESTAMP WITH TIME ZONE is a totally different type from Oracle TIMESTAMP WITH TIME ZONE. In Postgres, [1am Jan 1 1970 GMT+1] == [2am Jan 1 1970 GMT+2].

Macha

4 days ago

You are thinking of UTC offsets as zones here, which is wrong. Yes, you can interpret an offset from the epoch in any utc offset and that's just a constant formatting operation. But interpreting a zoned datetime as an offset against a point in UTC (or UTC+/-X) is not.

You do not confidently know how far away 2025-03-01T00:00:00 America/New_York is from 1970-01-01T00:00:00+0000 until after that time. Even if you decide you're interpreting 1970-01-01T00:00:00+0000 as 1969-12-31T19:00-0500. Postgres assumes that 2025-03-01T00:00:00 America/New_York is the same as 2025-03-01T00:00:00-0500 and calculates the offset to that, but that transformation depends on mutable external state (NY state laws) that could change before that time passes.

If you get news of that updated state before March, you now have no way of applying it, as you have thrown away the information of where that seconds since epoch value came from.

stickfigure

4 days ago

I'm not quite sure what your point is. Postgres doesn't store time zones. "The internally stored value is always in UTC" from the documentation is false. It's not stored in UTC or any other zone. "it is always converted from UTC to the current timezone zone" is also false. It is not stored in UTC.

Macha

4 days ago

This is pointless pedantry: Expressing it as the number of seconds since a point that is defined in UTC is a conversion to UTC by anyone else's logic (including, clearly, the documentation writers), even if there's not some bits in the database that say "this is UTC", even if that point can be expressed with various UTC offsets.

The internal representation is just a integer, we all agree on that, this is not some great revelation. The fact that the internal representation is just an integer and the business rules surrounding it say that integer is the time since 1970-01-01T00:00:00Z is in fact the cause of the problem we are discussing here. The internal implementation prevents it being used as a timestamp with time zone, which its name and the ability to accept IATA TZs at the query level both in datetime literals and features like AT TIME ZONE or connection timezones strongly imply that it should be able to do. It also means the type is flawed if used to store future times and expecting to get back what you stored. We complain about behaviours like MySQL's previous silent truncation all the time, documented as they may have been, so "read the source code and you'll see it's doing XYZ" is not relevant to a discussion on if the interface it provides is good or bad.

Nor is the link you shared the full story for the source code, as you'd need to look at the implementation for parsing of datetime literals, conversion to that integer value, the implementation of AT TIME ZONE, etc.

stickfigure

4 days ago

This is not pedantry. It has real-world consequences. Based on the misleading text in the documentation, the folks writing the Postgres JDBC driver made a decision to map TIMESTAMPTZ to OffsetDateTime instead of Instant. Which caused some annoying bugs in my 1.5M line codebase that processes financial transactions. Which is why I'm a bit pissed about all this.

If you walk into a javascript job interview and answer "UTC" to the question "In what timezone is Date.now()?", you would get laughed at. I don't understand why Postgres people get a pass.

If it's an instant, treat it as an instant. And it is an instant.

Terr_

5 days ago

Seeing this topic/documentation gives me a sense of deja vu: I think it's been frustrating and confusing a great many perfectly decent PostgreSQL-using developers for over 20 years now. :P

Ndymium

5 days ago

I agree, "timestamp with time zone" is a terribly misleading name and personally I don't use that type very much.

bvrmn

5 days ago

timestamptz is no more than a glorified Unix timestamp with nice formatting by default. And it's great! It provides easy zone conversions directly in SQL and lets you use your language's actual "date with time zone" type.

Naming is highly misleading though.
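
For instance, the zone conversions it enables look roughly like this (example values made up):

    -- timestamptz -> wall-clock time in a named zone
    SELECT now() AT TIME ZONE 'America/New_York';

    -- plain timestamp interpreted as wall-clock time in a zone -> timestamptz
    SELECT TIMESTAMP '2025-03-01 09:00' AT TIME ZONE 'Europe/Paris';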

mixmastamyk

5 days ago

Several of the broken features are part of the SQL standard.

FridgeSeal

5 days ago

At what point can we have an update to the standard that fixes a good number of these old hangups?

__loam

5 days ago

Changing defaults can screw over existing users.

SoftTalker

5 days ago

Presumably there are rare exceptions where you DO want to do the thing.

Parodper

5 days ago

The money one honestly sounds like a bug.

datadrivenangel

6 days ago

This reminds me of SQL Antipatterns, which is a book that everyone who works with databases should read.

samarthr1

5 days ago

That was a fun read, thanks!

Made me reconsider a few habits I picked up from MySQL land

pavel_lishin

6 days ago

A lot of these aren't postgres-specific. (null weirdness, index column order, etc.)

For example, how nulls work - especially how they interact with indexes and unique constraints - is also non-intuitive in MySQL.

If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.
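
A minimal sketch of that behaviour (hypothetical table, default Postgres settings):

    CREATE TABLE users (
        email    text NOT NULL,
        username text,
        UNIQUE (email, username)
    );

    INSERT INTO users VALUES ('a@example.com', NULL);
    INSERT INTO users VALUES ('a@example.com', NULL);  -- also succeeds: NULL is not equal to NULL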

anarazel

6 days ago

> If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.

FWIW, since Postgres 15 you can influence that behaviour with NULLS [NOT] DISTINCT for constraints and unique indexes.

https://www.postgresql.org/docs/devel/sql-createtable.html#S...

EDIT: Added link

bpicolo

6 days ago

I think this is a good pragmatic default. The use case for the alternative is much more rare.

pavel_lishin

5 days ago

I totally agree - but it's not an intuitive default.

jedberg

5 days ago

My number one tip: Vacuum every day!

I didn't know this when I started, so I never vacuumed the reddit databases. Then one day I was forced to, and it took reddit down for almost a day while I waited for it to finish.
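
For anyone heading down the same path, a sketch of the sort of routine this implies (table names hypothetical; autovacuum should normally stay on as well):

    -- e.g. a nightly maintenance job
    VACUUM (ANALYZE, VERBOSE);

    -- check how far tables are from transaction ID wraparound
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;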

FreakLegion

5 days ago

No autovacuum? At reddit's scale I'm surprised you didn't run out of transaction IDs.

jedberg

5 days ago

I turned it off because it would run at inopportune times.

And we did run out of transaction IDs, which is why I was forced to do it.

I never turned on the auto-vacuumer but I did set up a daily vacuum.

Keep in mind I left Reddit 13 years ago and I’m sure they’ve made improvements since.

gtaylor

5 days ago

Auto vacuuming is enabled now. We did have some near misses due to long running vacuums that barely completed before wraparounds, but got things tuned over time.

I did none of that work but was on the team where it happened.

marcosdumay

6 days ago

> Normalize your data unless you have a good reason not to

Ouch. You don't want to just say that and move on.

The author even linked to a page citing 10 different kinds of normalization (11 with the "non-normalized"). Most people don't even know what those are, and have no use for 7 of those. Do not send people on wild-goose chases after those higher normal forms.

pavel_lishin

6 days ago

But the author did have a paragraph explaining, in general, what they mean.

And they're right! I've had to fix a few issues of this in a project I recently got moved to. There's almost never a reason to duplicate data.

cryptonector

6 days ago

The general rule is to normalize to the max, then denormalize till you get the performance that you need.

DanHulton

6 days ago

The one exception I'll make from the very start is "add tenant identifier to every row, yes, even if it's linked to another table that has tenant identifiers."

Sure, this means you will have some "unnecessary" `tenant_id` columns in some tables that you could get through a relation, but it saves you from _having_ to include that relation just to limit by tenant, which you will some day almost be guaranteed to want. (Especially if you want to enable row-level security¹ someday.)

¹ - https://www.postgresql.org/docs/current/ddl-rowsecurity.html
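
A rough sketch of how that pays off with row-level security (all names made up; the app would set `app.current_tenant` per connection):

    CREATE TABLE invoices (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        tenant_id bigint NOT NULL,
        order_id  bigint NOT NULL,
        total     numeric(12,2) NOT NULL
    );

    ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;

    -- no join needed to scope rows to the current tenant
    CREATE POLICY tenant_isolation ON invoices
        USING (tenant_id = current_setting('app.current_tenant')::bigint);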

magicalhippo

5 days ago

We do a variant of this across all our tables. If we have a parent-child-child relationship, then all child tables, regardless of depth, will have the parent id.

This way we can load up all the data needed to process an order or invoice etc easily, without a ton of joins.

We don't do multi-tenant, instead separate databases per tenant so far.

wadadadad

5 days ago

How is this working for you so far? Do you ever need to report across the multiple tenants, and how do database migrations go? I'm starting to look into this, and purely for reporting and database migrations I'm leaning towards multi-tenant.

magicalhippo

5 days ago

The product I'm working have not needed to report across multiple tenants.

We do have some customers which have separate daughter companies which might technically be individual tenants, and where we might need to report across those. But in all those cases so far we've been able to host the databases on the same server, so can easily join tables from the individual databases into the same query.

Database migrations are very smooth, given each tenant has their own database, so we can do it when they're ready instead of needing a service window that fits all our tenants. We have several that are essentially 24/7 operations, and while they do have periods of low activity that fits a service window, they usually don't line up well between them.

Likewise, schema upgrades are trivial. We have written our own little tool that updates the database given a source schema description in XML (since we've had it for 15+ years now), and as it does not do destructive updates it is low risk. So schema upgrades are done automatically as part of our automated update deployment (ala Windows Update).

Of course this requires a bit more thought during schema design, and sometimes we do add some manual cleanup or similar of tables/columns to our tool that we know are safe to remove/change.

One upside is that performance is easy to manage. A single tenant will almost never cause performance issues for others than themselves. If they start doing that we can always just trivially move them to their own server.

A downside is that we have a lot of large code lists, and currently these are kept per-database as we've traditionally been deployed on-prem. So we're looking to consolidate those.

We do have another product that I haven't worked on, it's a more traditional SAAS web app thing, and that does have multi-tenant. It's not as business-critical, so the service window constraint isn't an issue there.

Given that it has an order of magnitude more tenants than the application I work on, I think that multi-tenant was a decent choice. However it has also had some complications. I do recall hearing some issues around it being a more critical point of failure, and also that certain larger customers have started to want their own separate database.

I think ideally we would have combined the two approaches. Allow for multi-tenant by having a tenant id everywhere, but also assume different tenants will run in different databases.

wadadadad

5 days ago

Thank you very much for your time, I appreciate it! It definitely seems automated schema updating is necessary if you're doing more than a couple of databases, and you raised many other good points that I hadn't fully considered. I can definitely appreciate larger clients wanting their own dedicated databases, so planning for that initially could be a wise choice. Thank you again!

getcrunk

5 days ago

Nice one! I'm working on an app that is user/consumer facing but will eventually have a teams/business offering. Everything has a uid with it because users are the core of your app. But yeah, if you're multi-tenant then tenants should be treated like that as well.

jherdman

5 days ago

What happens if the row is shared between two tenants?

zbentley

5 days ago

In a great number of (probably most) business-to-business software domains, most rows are only relevant to one tenant at a time. It’s a partition/shard key, basically: queries will only uncommonly retrieve rows for more than one tenant at a time.

Hassles emerge when the tenant that owns a row changes (client splits or mergers), but even then this is a common and robust architecture pattern.

marcosdumay

6 days ago

Oh, no, it's absolutely not.

It's usually to normalize into the 3rd normal form. But that's not enough in some cases, it's too much in other cases, and the reason it breaks is performance about as often as it's not.

jayknight

6 days ago

cryptonector

5 days ago

Isn't 6NF essentially a flavor of EAV? I think essentially it is.

6NF means having one non-PK column, so that if the value would be NULL then the row need not exist, and so the value column can be NOT NULL.

But when you need all the columns of what you'd think of as the <thing> then you need to go gather them from all those rows in all those tables. It's a bit annoying. On the other hand a proper EAV schema has its positives (though also its negatives).

jayknight

5 days ago

It's similar, but instead of one table holding lots of attributes, there are separate tables that hold optional fields that might otherwise be null if they were in the main table.
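
A rough sketch of that shape (hypothetical tables):

    CREATE TABLE person       (id bigint PRIMARY KEY);
    CREATE TABLE person_email (id bigint PRIMARY KEY REFERENCES person, email text NOT NULL);
    CREATE TABLE person_phone (id bigint PRIMARY KEY REFERENCES person, phone text NOT NULL);

    -- no NULLs anywhere, but reassembling the "whole person" means joining it all back together
    SELECT p.id, e.email, ph.phone
    FROM person p
    LEFT JOIN person_email e  USING (id)
    LEFT JOIN person_phone ph USING (id);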

magicalhippo

5 days ago

Every time I see a suggestion like this, I wonder what kind of projects people work on that only require a few columns.

In my experience every non-trivial project would need several dozen columns in core tables, all of which could be NULL while the user works on the record during the day.

In our current project we'd have to do several hundred joins to get a single main-table record.

I also wonder why people have such an aversion for NULL. Crappy tools?

jayknight

4 days ago

No, it's just a dedication to pure relational algebra. Nulls introduce "three-valued logic" into your code/SQL. A nullable boolean column can contain 3 different "values" instead of two. True is not false, true is not null, false is not null (and null is neither true nor false). Or in a nullable numeric field, you can have rows where that column is neither greater than zero nor less than zero.

On the other hand, even if you eliminate all nulls, you still have to do some three-valued logic ("A join B on B.is_something", "A join B on not B.is_something" and the "anti-join": "A where no B exists"), but only in the context of joins.

It feels a little like Rust's Option type. If something is "nullable", these are tools that force you to deal with it in some way so there are fewer surprises.

marcosdumay

4 days ago

AFAIK, the 6NF is exclusively for theory research.

yen223

5 days ago

"normalise til it hurts, denormalise til it works"

hobs

6 days ago

I guess this is targeted towards noobs, but the answer is pretty much always 3rd normal form if you are clicking this and are not sure.

makr17

5 days ago

Interesting, I would have said Boyce-Codd unless you have a good reason to vary in either direction.

philipwhiuk

5 days ago

The problem with BCNF is that you're generally enforcing a fairly complex and subject-to-change relation at the database level rather than in application logic.

ellisv

6 days ago

I really wish developers cared more about normalization and would stop shoving everything into a JSON(B) column.

abraae

6 days ago

Long before databases could even store structured JSON data, junior developers used to bikeshed viciously over the correct degree of normalization.

More experienced developers knew that the correct answer was to duplicate nothing (except for keys obviously) and then to denormalize only with extreme reluctance.

Then databases like mongo came along and encouraged those juniors by giving them something like a database, but where normalization was difficult/irrelevant. The result was a brief flowering of horrible database designs and unmaintainable crap towers.

Now the pendulum has swung back and people have rediscovered the virtues of a normalized database, but JSON columns provide an escape hatch where those bad practices can still flower.

christophilus

6 days ago

Eh. JSON has its place. I have some stateful data that is fairly transient in nature and which doesn’t really matter all that much if it gets lost / corrupted. It’s the sort of thing I’d throw into Redis if we had Redis in our stack. But the only storage in my project is S3 and Postgres. Postgres allows me to trivially query, update, analyze the usage of the feature, etc. Normalization wouldn’t buy me much, if anything, for my use case, but it would make “save this stuff for later” more of a nuisance (a sync across a bunch of rows vs a single upsert).

That said, I’ve worked on projects that had almost no normalization, and it was pure hell. I’m certainly not arguing against normalizing; just saying that data blobs are useful sometimes.

codr7

6 days ago

Yeah, I'm definitely not taking any more MongoDB jobs if I can avoid it.

I'm fine with using it for simple throw away stuff, but deciphering someone else's ball of json is soul killing.

yen223

5 days ago

There are two reasons to use a jsonb column:

1. To store JSON. There's a pattern where when your webserver calls into some third-party API, you store the raw API response in a JSONB column, and then process the response from there. This gives you an auditable paper trail if you need to debug issues coming from that 3rd-party API.

2. To store sum types. SQL not supporting sum types is arguably the biggest deficiency when modelling data in SQL databases. There are several workarounds - one of them being "just chuck it in a JSONB column and validate it in the application" - but none of the workarounds is particularly great.
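
A sketch of pattern 1 (names are made up):

    CREATE TABLE api_responses (
        id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        provider     text NOT NULL,
        received_at  timestamptz NOT NULL DEFAULT now(),
        raw_response jsonb NOT NULL   -- the untouched third-party payload, kept for auditing
    );

    -- dig back into it later when debugging
    SELECT raw_response->'error'->>'message'
    FROM api_responses
    WHERE provider = 'some_provider'
      AND raw_response ? 'error';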

zie

5 days ago

I would add:

3. End-user extra fields. Stuff you don't care about, but someone somewhere does.

devin

6 days ago

Even if you care about it, you will still often wind up with a junk drawer of JSONB. I don't really see it as a problem unless people are writing bad queries against it instead of lifting values out of the JSONB into their own columns, etc.

mnahkies

5 days ago

Yeah exactly, and I'll take a JSON(B) column over TEXT with maybe valid serialised JSON, maybe RON, maybe something completely random any day

randomdata

6 days ago

Most developers using these kinds of tools these days are actually building their own database management systems, just outsourcing the persistence to another DBMS, so there isn't a strong imperative to think about good design so long as it successfully satisfies the persistence need.

Whether we actually should be building DBMSes on top of DBMSes is questionable, but it is the current state of affairs regardless.

tightbookkeeper

6 days ago

A previous employer thought that SQL databases didn't understand graphs, so they made their own system for serializing/deserializing graphs of objects into Postgres. They never used queries and instead had their own in-memory operators for traversing the graph, and had to solve problems like deleting an entry and removing all references, and partial graph updates.

And I still don’t think it works.

theamk

6 days ago

This needs a working schema migration process, including the ability to undo a schema change if the new column tanks performance or breaks stuff.

If there are CLI tools involved, you also need to ensure you can handle some downtime, or do a synchronized version update across the company, or support both old and new schemas for a while.

If a database is not part of the team's primary product, all of this could be missing.

leighleighleigh

2 days ago

I've been learning Postgres and SQL on the job for the first time over the last six months - I can confirm I've learnt all of these the hard way!

I'd also recommend reading up on the awesome pg statistics tables, and leveraging them to benchmark things like index performance and macro call speeds.

avg_dev

6 days ago

this is really nice. i am glad the author put it together. i didn't know the pg docs were 3200 pages long! i have been using it for a while and try to learn as i go. i really do like the docs. and i also like to read articles on various particular subjects as i find a need to.

i think the author might find it helpful to readers to add a note to https://challahscript.com/what_i_wish_someone_told_me_about_... that if someone is selecting for b alone, then an index on columns (b, a) would work fine. i think this is kind of implied when they talk about selecting on a alone, but maybe it wouldn't hurt to be extra explicit.

(i didn't spend much time on the json/jsonb part since i rarely use that stuff)
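
for what it's worth, a sketch of the composite-index point (hypothetical table and columns):

    CREATE INDEX widgets_b_a_idx ON widgets (b, a);

    SELECT * FROM widgets WHERE b = 42;            -- can use the index (b is the leading column)
    SELECT * FROM widgets WHERE b = 42 AND a = 7;  -- can use the index
    SELECT * FROM widgets WHERE a = 7;             -- generally can't use it efficiently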

nasretdinov

6 days ago

Nice article! One thing I'd add is that almost all of it applies to other MVCC databases like MySQL too. While some details might be different, it too suffers from long transactions, holds metadata locks during ALTERs, etc., all the good stuff :).

whiskey-one

6 days ago

Really great post! It belongs on a reading list somewhere for everyone who is using Postgres independently or as a part of their stack.

ahachete

5 days ago

> Most notably, 'null'::jsonb = 'null'::jsonb is true whereas NULL = NULL is NULL

Because 'null' in the JSON spec is a literal value (a constant), not SQL's NULL. Nothing to see here.

https://datatracker.ietf.org/doc/html/rfc7159

hibachrach

5 days ago

Right--it makes sense and shouldn't be changed, but it's a bit unintuitive for newcomers.

sroerick

5 days ago

It's always a relief to read articles like this, realize I know 90% of it, and feel that I've deserved the jobs I've had.

Great and super useful notes

leoqa

5 days ago

I had an interesting problem with the pg stats. We were migrating and had a version column, i.e. key, val, version.

We were migrating from version 1 to version 2, double writing into the same table. An index on (key, val, version) was being hit by our reader process using a where clause like key=k and version=1.

When we flipped the reader to read version 2, the latency jumped from 30ms to 11s. Explain showed a sequential scan even though the index could serve the query. I was able to use MATERIALIZED and reorder CTEs to get the planner to do the right thing, but it caused an outage.

We were autovacuuming as well. I ended up deleting the old version and rebuilding the index.

My theory is that because the read load was on 50% of the data, the stats were super skewed.
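
Roughly the shape of that workaround, plus the stats refresh (table/column names made up):

    WITH wanted AS MATERIALIZED (
        SELECT key, val
        FROM kv_versions
        WHERE key = 'some-key'
          AND version = 2
    )
    SELECT * FROM wanted;

    -- and recomputing the skewed statistics directly
    ANALYZE kv_versions;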

zmmmmm

5 days ago

> It’s possible that adding an index will do nothing

This is one of the more perplexing things to me, where Postgres ideology is a bit too strong, or at least the way it works is too hard for me to understand (and I've tried - I'm not going to claim I'm a genius but I'm also not a moron). I hear there may finally be support for some kind of hints in upcoming versions, which would be very welcome to me. I've spent way too much time trying to divine the voodoo of why a slow query is not using indexes when it seems obvious that it should.
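
The usual starting point, for what it's worth (names hypothetical; the SET is for diagnosis only):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 123;

    ANALYZE orders;             -- refresh planner statistics
    SET enable_seqscan = off;   -- check whether the planner *can* use the index; never leave this on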

jaza

5 days ago

Only some of these are really Postgres specific (use "text" / "timestamptz"; make psql more useful; copy to CSV). Most of them apply to relational databases in general (learn how to normalise!; yes, NULL is weird; learn how indexes work!; locks and long-running transactions will bite you; avoid storing and querying JSON blobs of doom). Not that that detracts from the usefulness of this article - pretty much all of them are important things to know when working with Postgres.

nodakai

6 days ago

That nested JSON query operator chains such as json_col->'foo'->'bar'->>'baz' internally return (copy) entire sub-objects at each level and can be much slower than jsonb_path_query(json_col, '$.foo.bar.baz') for large JSONB data

... although I haven't had the chance to verify this myself

mdaniel

6 days ago

I got nerd-sniped on this, because I actually hadn't heard that before and would be horrified if it were true. It took some non-trivial digging to even get down into the "well, what does foo->>'bar' even map to in C?" level. I for sure am not claiming certainty, but based merely on "getIthJsonbValueFromContainer" <https://sourcegraph.com/github.com/postgres/postgres@REL_17_...> it does seem that they do palloc copies for at least some of the JSONB calls

anarazel

6 days ago

Postgres does have infrastructure to avoid this in cases where the result is reused, and that's used in other places, e.g. array constructors / accessors. But not for jsonb at the moment.

hans_castorp

6 days ago

You can also use #>> operator for that:

    json_col #>> '{foo,bar,baz}'
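
For reference, these pull the same nested value (hypothetical table `t`); the first two return text, while jsonb_path_query returns jsonb matches:

    SELECT json_col->'foo'->'bar'->>'baz' FROM t;
    SELECT json_col #>> '{foo,bar,baz}'   FROM t;
    SELECT jsonb_path_query(json_col, '$.foo.bar.baz') FROM t;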

vivzkestrel

6 days ago

since we are on the topic and since your article clearly mentions "Normalize your data unless you have a good reason not to" I had to ask. I am trying to build a news aggregator and I have many websites. Each of them has a slightly different format. Even though I use feedparser in python, it still doesn't change how some of them put html text inside content and some of them break it down into a separate media xml attribute while retaining only basic textual summary inside a summary attribute.

Do you think it makes more sense to store a single rss item as a column inside postgres or should it be stored after parsing it? I can see upsides and downsides to both approaches. Store it as XML and you have the ability to change your processing logic down the line for each stored item, but you lose the flexibility of querying metadata and you also have to parse it on the fly every single time. Store it in multiple columns after processing it and it may require different logic for different websites, plus changing your overall python processing logic requires a lot of thinking on how it might affect some source. What do you guys recommend?

sea6ear

5 days ago

With the caveat that you probably shouldn't listen to me (or anyone else on here) since you are the only one who knows how much pain each choice will be ...

Given that you are not really dealing with structured data - you've said that different sites have different structures - I assume that even with processing you may not be able to generate identical metadata structures from each entry.

I think I would go for one column of XML, plus maybe another column that holds a parsed data structure that represents the result of your processing (basically a cache holding the post-processed version of each site). Hopefully that could be re-evaluated by whatever language (Python?) you are using for your application. That way you don't have to do the full parsing each time you want to examine the entry, but you have access to something that can quickly give you whatever metadata is associated with it, but which doesn't tie you to the rigid structure of a table based database.

Once you know what you are really doing with the data, then you could add additional metadata columns that are more rigid, and which can be queried directly in SQL as you identify patterns that are useful for performance.

vivzkestrel

5 days ago

i am using the feedparser library in python https://github.com/kurtmckee/feedparser/ which basically takes an RSS url and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example look at how https://beincrypto.com/feed/ has a long description (containing actual HTML) inside but this website https://www.coindesk.com/arc/outboundfeeds/rss/ completely cuts the description out. I have about 50 such websites and they all have slight variations. So you are saying that in addition to storing parsed data (title, summary, content, author, pubdate, link, guid) that I currently store, I should also add an xml column and store the raw <item></item> from each url till I get a good hang of how each site differs?

krab

5 days ago

Just shooting without knowing your real needs - take this with a grain of salt.

Store some parsed representation that makes it easier for you to work with (probably normalized). Keep an archive of raw data somewhere. That may be another column, table or even S3 bucket. Don't worry about schema changes but you need to not lose the original data. There are some pitfalls to schema migrations. But the schema should be the representation that works for you _at the moment_, otherwise it'll slow you down.

setr

5 days ago

If you’re going to be parsing it anyways, and there’s the possibility of actually doing something with that parsed info beyond just reprinting it, then storing the post-parse results is probably better. Especially if you’re only going to need a reduced subset of the information and storage matters.

If you’re just reprinting — you’re parsing only for the sake of rendering logic — then storing the parse-result is probably just extra unnecessary work.

Also if storage isn’t a concern, then I like using the database as intermediate storage for the pipeline. Grab the RSS, stuff it in the DB as-is. Take it out of the DB, parse, store the parse results. Etc. You’ll have to do this anyways if you’re going to end up with a processing queue (and you can use PG as a simple queue.. SELECT…FOR UPDATE), but it’s nice-to-have if your pipeline is going to eventually change as well — and you’re able to reprocess old items

> Store it in multiple columns after processing it and it may require different logic for different websites + changing your overall python processing logic requires a lot of thinking on how it might affect some source.

Don’t you need to deal with this problem regardless? Ideally you’ll find a common subset you actually care about and your app-logic will look like website —> website handler -> extract data subset —> add to common structure —> render common structure

Even if you don’t use a database at all your app-code will need to figure out the data normalization

cryptonector

6 days ago

With all RDBMSes the rule is "normalize to the max, then denormalize till you get the performance that you need".

roydivision

5 days ago

Really like the

Don't <thing>

Why not?

When should you?

format of the Postgres Don't Do This page.

deadbabe

6 days ago

Don’t create views that reference other views.

cyanydeez

6 days ago

I'd say I tend to ignore the standard docs because they rarely have examples and rely on the arcane procedure of trying to decipher the super command options with all its "[OR THIS|THAT]".

I assume _someone_ can read this pseudo programming, but it's not me.

PhilippGille

6 days ago

I love the diagrams from the SQLite documentation: https://www.sqlite.org/syntaxdiagrams.html

o11c

6 days ago

I've been bitten by those before because they are not generated from the actual syntax-parsing code, and thus are sometimes out of sync and wrong (or at least misleading).

christophilus

6 days ago

SQLite diagrams plus examples are pretty much the way I wish all of my documentation worked.

Clubber

6 days ago

FWIW square brackets indicate an optional parameter and the pipe inside indicates all the parameters available; but I agree, I don't particularly care for that notation and prefer actual running SQL examples.

hobs

6 days ago

Railroad-esque diagrams can be weird but they say a lot in a very short space, I would highly recommend spending a little extra time working on thinking through them, they are everywhere!

Clubber

5 days ago

>Railroad-esque diagrams

Now and forever I will think of railroads when I see these.

hobs

5 days ago

Well, you should! But, I didn't invent anything there, that's what they are named :)

Syntax diagrams (or railroad diagrams) are a way to represent a context-free grammar. They represent a graphical alternative to Backus–Naur form, EBNF, Augmented Backus–Naur form, and other text-based grammars as metalanguages... https://en.wikipedia.org/wiki/Syntax_diagram

paulryanrogers

6 days ago

Sometimes I find it annoying but mostly it works well for me. I've come to embrace the find feature and visually scanning over any parenthetical stuff.

The alternative is they have to break it up into several quasi examples, each with their own optional modifiers.

RedShift1

5 days ago

I can't horizontally scroll on mobile, can't see the full query texts...

hibachrach

5 days ago

(author here) hmm it seems to work on my device. Are you scrolling within the code block?

Lvl999Noob

5 days ago

Unrelated to that issue, the right hand side TOC does not display completely. I am using firefox on windows with a default zoom of 120%. The TOC ends up going below the screen and hiding the last few entries. It also floats above the footer and hides the text there.

If I may suggest, use `position: sticky` instead of `position: fixed` or some equivalent to whatever trick you used.

RedShift1

5 days ago

I'm trying to but the entire page scrolls horizontally instead of just the code block.

hibachrach

5 days ago

It might be because of the larger tables. Thanks for letting me know--I'll take a look at it soon.

cryptonector

6 days ago

Watch out, for row/record values, if a column in the row/record is NULL then IS NULL will be true! You want to use IS [NOT] DISTINCT FROM NULL, full stop.

fatih-erikli-cg

5 days ago

Storing data in text costs less. A tcp connection to get some blog posts from another process is not necessary.

bvrmn

5 days ago

On the other hand, it's a mere blog post. You should not be bothered by TCP cost. But reliable data storage and reliability/restore (in case of backups) cost is a concern.

fatih-erikli-cg

5 days ago

It is mostly a blog post. A use case for a database that holds tables and rows is very rare in the real world. I know no one who uses the contacts app on their mobile phone. Access has been there with checkboxes, select inputs and everything for years. No one uses it.

teddyh

5 days ago

  \pset null '␀'

5Qn8mNbc2FNCiVV

5 days ago

Your code sections are almost unscrollable on mobile

hibachrach

5 days ago

(author here) hmm it seems to work on my device. what happens when you try to scroll?

mikrotikker

5 days ago

Shout-out to my postgres server that has been sitting untouched doing its thing perfectly for 10 years, you're a real one.

philipwhiuk

5 days ago

> Name your tables in snake_case

This bit me. It's highly tedious that case isn't preserved honestly.

michaelcampbell

5 days ago

> null weirdness

Oracle enters the chat... (last I used it it considered an empty string '' the same as NULL)

onoke

6 days ago

Haha, good job. :)

TwoNineFive

5 days ago

"going on a journey similar to mine"

At this point "journey" is a cringe word because of its excessive usage in blogspamverts. It tells me you are a gross aspiring influencer.

hibachrach

5 days ago

lol this is the first blog post i've written in 4 years and i didn't even post it here so I think you're imagining things

cynicalsecurity

5 days ago

People who use PostgreSQL instead of MySQL just want to suffer while pretending "they are better than others".

Jean-Papoulos

5 days ago

>"Normalize your data"

>"You don't need to write SQL in uppercase"

>"What's an index ?" section

From reading this, it's clear that the author never sat down and learned to use databases from the ground up. The author started using them and learned as he went, so his "tips" include things you'll be told in the first hour of any course.

This doesn't hold any value for someone who's been using databases for almost any length of time.

andrewstuart

6 days ago

Dump your schema, drop the entire thing into ChatGPT or Claude and ask it to write your Postgres query.

Then ask it to rewrite the query three different ways and explain the pros and cons of each.

Do the same for any existing queries in your system … drop in the schema then the query and ask for an analysis and suggested improvements.

bilekas

6 days ago

People have too much faith in LLMs at the moment. This might be able to give some insights, but if you're trying to LEARN something in detail, such as the pros and cons of a particular custom query on your unique-ish schema, the LLM will be prone to making stuff up by design - it wasn't trained on YOUR data.

It will give some good basic results and probably some functional queries, but it misses the mark with the finer points of optimization etc. If you don't know how to properly build the queries, go learn it properly with confidence that what you're learning is actually reliable.

cpursley

6 days ago

This is similar to what I've done with my SaaS, where most of the logic lives in Postgres. I went through much of the logic and refactored it with GPT a while back and improved it significantly. It's soooo good with SQL. Like insanely good. And these days I'm using Claude for all things code, which is significantly better than GPT.

sgarland

5 days ago

Agreed that LLMs are quite good at SQL (and relational algebra). However, ask yourself this: how many programming languages do you know to at least a comfortable level? If N > 1, why not add another (SQL)? It’s not a difficult language, which is in part why LLMs are so good at it – it’s easy to predict the next token, because there are so few of them.

cpursley

5 days ago

I know SQL pretty well (enough to write complex procedures) and prefer it over doing biz logic in application code (yuck ORMs). But SQL is just such a wonky syntax that I appreciate any help that I can get.

cyanydeez

6 days ago

then have corporate knock on your team's window asking why all its IP ended up with a Chinese competitor?

junaru

6 days ago

Schema != data.

jrpelkonen

6 days ago

“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.” - Fred Brooks

josephg

6 days ago

Yep, but if you think your database schema is "secret sauce", you're fooling yourself. You can almost always guess the database schema if you're given 10 minutes to play with the application.

hobs

6 days ago

Generally your schema is your code, which IS your IP - your data is often NOT your IP (and often might just be a list of facts, a thing which is not copyrightable)

PittleyDunkin

6 days ago

It's certainly data to the extent that it's copyrightable. I doubt this sentiment would stand up in court.