MaxGabriel
18 days ago
This might stem from the domain I work in (banking), but I have the opposite take. Soft delete pros to me:
* It's obvious from the schema: If there's a `deleted_at` column, I know how to query the table correctly (vs. assuming rows never get DELETEd, or having to know which other table to look in)
* One way to do things: Analytics queries, admin pages, it all can look at the same set of data, vs having separate handling for historical data.
* DELETEs are likely fairly rare by volume for many use cases
* I haven't found soft-deleted rows to be a big performance issue. Intuitively this makes sense, since indexed queries should be O(log n)
* Undoing is really easy, because all the relationships stay in place, vs data already being moved elsewhere (In practice, I haven't found much need for this kind of undo).
In most cases, I've really enjoyed going even further and making rows fully immutable, using a new row to handle updates. This makes it really easy to reference historical data.
If I was doing the logging approach described in the article, I'd use database triggers that keep a copy of every INSERT/UPDATE/DELETEd row in a duplicate table. This way it all stays in the same database—easy to query and replicate elsewhere.
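A rough sketch of that trigger approach in Postgres (table and column names are purely illustrative, not a production design; syntax assumes Postgres 11+):

    -- Hypothetical base table
    CREATE TABLE accounts (
        id      bigint PRIMARY KEY,
        owner   text,
        balance numeric
    );

    -- Duplicate table holding a copy of every changed row
    CREATE TABLE accounts_audit (
        audit_id   bigserial PRIMARY KEY,
        operation  text        NOT NULL,            -- 'INSERT', 'UPDATE' or 'DELETE'
        changed_at timestamptz NOT NULL DEFAULT now(),
        row_data   jsonb       NOT NULL             -- full copy of the affected row
    );

    CREATE FUNCTION accounts_audit_fn() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO accounts_audit (operation, row_data) VALUES (TG_OP, to_jsonb(OLD));
            RETURN OLD;
        ELSE
            INSERT INTO accounts_audit (operation, row_data) VALUES (TG_OP, to_jsonb(NEW));
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_audit_trg
        AFTER INSERT OR UPDATE OR DELETE ON accounts
        FOR EACH ROW EXECUTE FUNCTION accounts_audit_fn();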
nine_k
18 days ago
> DELETEs are likely fairly rare by volume for many use cases
All your other points make sense, given this assumption.
I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
> Undoing is really easy
Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.
In short, there are cases when soft-deletion works well, and is a good approach. In other cases it does not, and is not. Analysis is needed before adopting it.
whizzter
17 days ago
If a mere 50-70% of your data being dead is causing issues, then you probably have an underlying indexing problem anyhow (because scaling to 2x-3x the customers would cause the same issues by sheer magnitude).
That said, we've had soft deletes, and during discussions about keeping them one argument was that they're really only a half-assed measure (data lost due to updates rather than deletes isn't saved at all).
KronisLV
17 days ago
> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
I think we largely need support for "soft deletes" to be baked into SQL or its dialects directly and treated as something transparent (selecting soft deleted rows = special case, regular selects skip those rows; support for changing regular DELETE statements into doing soft deletes under the hood).
https://news.ycombinator.com/item?id=43781109
https://news.ycombinator.com/item?id=41272903
And then make dynamically sharding data by deleted/not deleted really easy to configure.
You soft deleted a few rows? They get moved to another DB instance, an archive/bin of sorts. Normal queries wouldn't even consider it; it would only be reached when you explicitly try to select soft-deleted rows.
indigo945
17 days ago
Well, Microsoft SQL Server has built-in Temporal Tables [1], which even take this one step further: they track all data changes, such that you can easily query them as if you were viewing them in the past. You can not only query deleted rows, but also the old versions of rows that have been updated.
(In my opinion, replicating this via a `validity tstzrange` column is also often a sane approach in PostgreSQL, although OP's blog post doesn't mention it.)
[1]: https://learn.microsoft.com/en-us/sql/relational-databases/t...
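A rough sketch of that `tstzrange` approach in PostgreSQL (illustrative schema, not from the article; needs the btree_gist extension for the exclusion constraint):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE prices (
        product_id bigint    NOT NULL,
        amount     numeric   NOT NULL,
        validity   tstzrange NOT NULL DEFAULT tstzrange(now(), NULL),
        -- no two versions of the same product may overlap in time
        EXCLUDE USING gist (product_id WITH =, validity WITH &&)
    );

    -- "Updating" a price = closing the current version and inserting a new one
    UPDATE prices SET validity = tstzrange(lower(validity), now())
        WHERE product_id = 1 AND upper_inf(validity);
    INSERT INTO prices (product_id, amount) VALUES (1, 9.99);

    -- Current rows
    SELECT * FROM prices WHERE validity @> now();
    -- State as of some moment in the past
    SELECT * FROM prices WHERE validity @> timestamptz '2024-01-01';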
warpspin
16 days ago
MariaDB has system-versioned tables, too, albeit a bit worse than MS SQL as you cannot configure how to store the history, so they're basically hidden away in the same table or some partition: https://mariadb.com/docs/server/reference/sql-structure/temp...
This has, at least with current MariaDB versions, the annoying property that you really cannot ever again modify the history without rewriting the whole table, which becomes a major pain in the ass if you ever need schema changes and history items block those.
MariaDB still has to find a proper balance here between change safety and developer experience.
pif
17 days ago
> I think we largely need support for "soft deletes" to be baked into SQL
I think web and GUI programmers must stop expecting the database to contain the data already selected and formatted for their nice page.
KronisLV
17 days ago
> I think web and GUI programmers must stop expecting the database to contain the data already selected and formatted for their nice page.
So a widespread, common, and valid practice shouldn't be better supported, and should instead rely on awkward hacks like "deleted_at", where sooner or later people or ORMs will forget those semantics and select the wrong thing? I don't think I agree. I also don't think it has much to do with how or where you represent the data. Temporal tables already do something similar, just with slightly different semantics.
lazide
17 days ago
What way of making it better supported wouldn't require custom semantics that people would forget, and then select the wrong thing?
KronisLV
17 days ago
> custom semantics
Making those custom semantics (enabled at per-schema/per-table level) take over what was already there previously: DELETE doing soft-deletes by default and SELECT only selecting the records that aren't soft deleted, for example.
Then making the unintended behavior (for 90% of normal operational cases) require special commands, be it a new keyword like DELETE HARD or SELECT ALL, or query hints (special comments like /*+DELETE_HARD*/).
Maybe some day I'll find a database that's simple and hackable enough to build it for my own amusement.
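In the meantime, a rough emulation that already works in stock PostgreSQL: hide the base table behind a view and rewrite DELETE into a soft delete (a sketch only; table, view, and function names are made up):

    CREATE TABLE orders_raw (
        id         bigint PRIMARY KEY,
        total      numeric,
        deleted_at timestamptz               -- NULL = live row
    );

    -- The "table" the application actually talks to
    CREATE VIEW orders AS
        SELECT id, total FROM orders_raw WHERE deleted_at IS NULL;

    CREATE FUNCTION orders_soft_delete() RETURNS trigger AS $$
    BEGIN
        UPDATE orders_raw SET deleted_at = now() WHERE id = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_delete_trg
        INSTEAD OF DELETE ON orders
        FOR EACH ROW EXECUTE FUNCTION orders_soft_delete();

    -- DELETE FROM orders WHERE id = 1;    -- now a soft delete
    -- SELECT * FROM orders;               -- never sees soft-deleted rows
    -- Hard deletes and "select all" go against orders_raw directly.

Not the baked-in support I'm asking for, but it gets the default semantics right without every query having to remember the `deleted_at` filter.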
da_chicken
18 days ago
> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
At that point you should probably investigate partitioning or data warehousing.
nextaccountic
17 days ago
What would be the benefit of data warehousing in this case?
da_chicken
17 days ago
The reason to soft delete is to preserve the deleted data for later use. If a significant amount of the system's use doesn't need to query that data, and 75% soft-deleted rows is a performance problem, then you either need to move the soft-deleted data out of the way inside the table (partitioning) or to another table entirely.
The correct thing to do if your retention policy is causing a performance problem is to sit down and actually decide what the data is truly needed for, and whether some transformations/projections can move only the data you actually use to a different location so you can discard the rest. That's just data warehousing.
Data warehouse doesn't only mean "cube tables". It also just means "a different location for data we rarely need, stored in a way that is only convenient for the old data needs". It doesn't need to be a different RDBMS or even a different database.
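For the "another table entirely" route, a common pattern is a single statement that moves old soft-deleted rows out (a sketch, assuming a `deleted_at` column and a hypothetical `orders_archive` table with matching columns):

    WITH moved AS (
        DELETE FROM orders
        WHERE deleted_at IS NOT NULL
          AND deleted_at < now() - interval '1 year'
        RETURNING *
    )
    INSERT INTO orders_archive SELECT * FROM moved;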
redman25
17 days ago
Exactly, partition the table by month. Surprised no one else seems to be mentioning this.
whatevaa
16 days ago
This only works if the data is actually historical. Not everything is "monthly".
tharkun__
18 days ago
Agreed. And if deletes are soft, you likely really just wanted a complete audit history of all updates too (at least that's been the case for the projects I've been part of). And then performance _definitely_ would suffer if you don't have a separate audit/archive table for all of those.
pixl97
18 days ago
I mean, yes, growth forever doesn't tend to work.
I've seen a number of apps that require audit histories work on a basis where data is archived at a particular time; that's when the deletes occur and the indexes are fully rebuilt. This is typically scheduled during the least busy time of the year, as it's rather IO intensive.
tharkun__
17 days ago
The oldest system I've worked with was a project started in ~1991. I don't recall when they started keeping history or for how long (they might have trimmed it after some shorter legal retention period), but I worked on it ~15 years after that. And that's, what, 15 to 20 years ago by now, and I doubt they've changed that part of the system. You've all likely bought products that were administered through this system.
FWIW, there were no "indexes fully rebuilt" upon "actual deletion" or anything like that. The regular tables were always just "current" tables. History was kept in archive tables that were kept up-to-date via triggers. Essentially, the current tables never suffered any performance issues and history was available whenever needed.
If history access was needed for extensive querying, read replicas could provide it without any cost to the main database, but if something required "up to the second" consistency, the historic tables were available on the main database as well, with good performance. (As you can tell from the timelines, this was pre-SSD, so multi-path I/O over fibre was what they had at the time I worked with it, with automatic hot-spare failover between database hosts; no clouds of any kind in sight.)
Replication was done by replaying the actual SQL queries that modified the data on each replica (multiple read replicas across the world), rather than by replicating the data itself. Much speedier, so the application itself was able to use read replicas around the globe without requiring multi-master for consistency. Weekends were used to "diff" the replicas to ensure there were no inconsistencies, since applying the modifying SQL queries on each replica does, at least theoretically, have the potential to let the data drift out of sync.
Gee, I'm old, lol!
scott_w
17 days ago
> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
Depending on your use case, having soft deletes doesn't mean you can't clean out old deleted data anyway. You may want a process that grabs all data soft-deleted more than X years ago and just hard-deletes it.
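For example (a sketch, assuming a `deleted_at` column and an illustrative two-year retention window):

    DELETE FROM orders
    WHERE deleted_at IS NOT NULL
      AND deleted_at < now() - interval '2 years';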
> Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.
Yes but this is no more complex than the current situation, where you have to always create the audit records.
user
17 days ago
paulddraper
17 days ago
50-70% as the worst case isn't even necessarily that bad.
(Again, a lot of it is O(log n), right?)
postexitus
17 days ago
Soft deletes in banking are just a Band-Aid for the much bigger problem of auditability. You may keep the original record by soft deleting it, but if you don't take care of amendments, you will still lose auditability. The correct way is to use EventSourcing, with each change to an otherwise immutable state being recorded as an Event, including a Delete (both of an Event and the Object). This is even more problematic from a performance standpoint, but Syncs and Snapshots are there for that exact purpose - or you can back the main table with a separate events table, with periodic "reconstruct"s.
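A minimal sketch of what such a backing events table might look like (names are illustrative, not from any particular banking system):

    CREATE TABLE account_events (
        event_id    bigserial   PRIMARY KEY,
        account_id  bigint      NOT NULL,
        event_type  text        NOT NULL,   -- e.g. 'Opened', 'Amended', 'Deleted'
        payload     jsonb       NOT NULL,
        occurred_at timestamptz NOT NULL DEFAULT now()
    );

    -- "Reconstruct" the current state of an account by replaying its events in order
    SELECT event_type, payload
    FROM account_events
    WHERE account_id = 42
    ORDER BY occurred_at, event_id;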
infamia
17 days ago
> The correct way is to use EventSourcing, with each change to an otherwise immutable state being recorded as an Event, including a Delete (both of an Event and the Object).
Another great (and older) approach is adding temporal information to your traditional database, which gives you immutability without the eventual consistency headaches that normally come with event sourcing. Temporal SQL has its own set of challenges of course, but you get to keep 30+ years of relational DB tooling, which is a boon. Event sourcing is great, but we shouldn't forget about the other tools in our toolbelt!
postexitus
16 days ago
I am using Temporal tables in SQL Server right now - I agree it's a bit of a best-of-both-worlds solution, but they are also painful to manage. I believe there could be a better solution that doesn't sacrifice SQL tooling.
taeric
17 days ago
Isn't this, essentially, backing into double-entry accounting for all things banking? Which, fair, it makes sense.
postexitus
16 days ago
Good analogy: double-entry bookkeeping, generalized. (Nothing specific to banking, btw.)
taeric
16 days ago
Fair that I shouldn't have said it was specific to banking.
gleenn
18 days ago
If you're implementing immutable DB semantics, maybe you should consider Datomic or alternatives, because then you get that for free, for everything, and you also get time travel, which is an amazing feature on top. It lets you see the full, coherent state of the DB at any moment!
arter45
17 days ago
My understanding is that Datomic uses something like Postgres as a storage backend. Am I right?
Also, it doesn't support non-immutable use cases AFAIK, so if you need both you have to use two database technologies (interfaces?), which can add complexity.
ndr
17 days ago
Datomic can use various storage services. Yes, pg is one option, but you can have DynamoDB, Cassandra, SQLServer and probably more.
> Also, it doesn't support non-immutable use cases AFAIK
What do you mean? It's append-only, but you can have CRUD operations on it. You get a view of the db at any point in time if you so wish, but it can support any CRUD use case. What is your concern there?
It will work well if you're read-heavy and the write throughput is not insanely high.
I wouldn't say it's internally more complex than your pg with whatever code you need to make it work for these scenarios like soft-delete.
From the DX perspective it's incredibly simple to work with (see Simple Made Easy by Rich Hickey).
ndr
17 days ago
Also good real-world use case talk: https://www.youtube.com/watch?v=A3yR4OlEBCA
arter45
17 days ago
Thanks, I'll look into it. My current setup for this kind of use case is pretty simple. You essentially keep an additional field (or key if you're non-relational) describing state. Every time you change state, you add a new row/document with a new timestamp and the new state values. Because I'm not introducing a new technology for this use case, I can easily mix mutable and immutable use cases in the same database (arguably even in the same table/collection, although that probably makes little sense, at least to me).
arnsholt
17 days ago
The core system at my previous employer (an insurance company) worked along the lines of the solution you outline at the end: each table is an append-only log of point-in-time information about some object. So the current state is in the row with the highest timestamp, and all previous states can be observed with appropriate filters. It's a really powerful approach.
arter45
17 days ago
So basically something like this?
(timestamp, accountNumber, value, state)
And then you just
SELECT state FROM Table WHERE accountNumber = ... ORDER BY timestamp DESC LIMIT 1
right?
arnsholt
17 days ago
Yeah, basically. The full system actually has more date stuff going on, to support some other more advanced stuff than just tracking objects themselves, but that's the overall idea. When you need to join stuff it can be annoying to get the SQL right in order to join the correct records from a different table onto your table of interest (thank Bob for JOIN LATERAL), but once you get the hang of it it's fairly straightforward. And it gives you the full history, which is great.
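A sketch of the kind of lateral join involved, using two hypothetical append-only tables `accounts(account_number, customer_id, value, created_at)` and `customers(customer_id, name, created_at)`, where the newest row per key wins:

    SELECT a.account_number, a.value, c.name
    FROM accounts a
    JOIN LATERAL (
        -- the customer version that was current when this account row was written
        SELECT name
        FROM customers
        WHERE customer_id = a.customer_id
          AND created_at <= a.created_at
        ORDER BY created_at DESC
        LIMIT 1
    ) c ON true
    WHERE a.account_number = 12345
    ORDER BY a.created_at DESC
    LIMIT 1;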
arter45
17 days ago
Sounds cool! Do you keep all data forever in the same table? I assume you need long retention, so do you keep everything in the same table for years or do you keep a master table for, let's say, the current year and then "rotate" (like logrotate) previous stuff to other tables?
Even with indices, a table with, let's say, a billion rows can be annoying to traverse.
arnsholt
17 days ago
I wasn't involved in the day-to-day operations of the system, but it had records going back to the 90s, at least I think. Data related to non-accepted offers was deleted fairly quickly (since they didn't end up becoming actual customers), but outside of that I think everything was kept more or less indefinitely.
user
17 days ago
ndr
17 days ago
This is also a recurring pattern when using Bigtable.
ozim
18 days ago
> DELETEs are likely fairly rare by volume for many use cases
I think one of our problems is getting users to delete stuff they don’t need anymore.
eddd-ddde
18 days ago
I never got to test this, but I always wanted to explore using table partitions in Postgres to store soft-deleted items on a different drive as a kind of archive storage.
I'm pretty sure it is possible, and it might even yield some performance improvements.
That way you wouldn't have to worry about deleted items impacting performance too much.
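A rough sketch of how that could look in Postgres (list-partitioning on an `is_deleted` flag and pointing the archive partition at a tablespace on the slower drive; all names and paths are illustrative):

    -- Tablespace on the archive drive (directory must already exist)
    CREATE TABLESPACE archive_space LOCATION '/mnt/archive/pgdata';

    CREATE TABLE items (
        id         bigint  NOT NULL,
        payload    text,
        is_deleted boolean NOT NULL DEFAULT false
    ) PARTITION BY LIST (is_deleted);

    CREATE TABLE items_live    PARTITION OF items FOR VALUES IN (false);
    CREATE TABLE items_deleted PARTITION OF items FOR VALUES IN (true)
        TABLESPACE archive_space;

    -- UPDATE items SET is_deleted = true WHERE id = 1;  -- row moves to items_deleted
    -- Queries filtering on is_deleted = false never touch the archive partition.

Postgres 11+ handles the cross-partition row movement on UPDATE automatically.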
gleenn
18 days ago
It's definitely an interesting approach, but the problem is that now you have to change all your queries, and undeleting gets more complicated. There are strong trade-offs with almost all the approaches I've heard of.
snuxoll
18 days ago
With partitioning? No you don't. It gets a bit messy if you also want to partition a table by other values (like tenant id or something), since then you probably need to get into using table inheritance instead of the easier declarative partitioning - but either technique just gives you a single effective table to query.
edmundsauto
17 days ago
Pg moves the data between positions on update?
bandrami
17 days ago
If you are updating the parent table and the partition key is correctly defined, then an update that puts a row in a different partition is translated into a delete on the original child table and an insert on the new child table, since v11 IIRC. But this can lead to some weird results if you're using multiple inheritance so, well, don't.
tomnipotent
17 days ago
I believe they were just pointing out that Postgres doesn't do in-place updates, so every update (with or without partitions) is a write followed by marking the previous tuple deleted so it can get vacuumed.
snuxoll
17 days ago
That's not at all what the comment replying to me was saying, even on a generous reading.
But HOT updates are a thing, too.
tomnipotent
17 days ago
What do you think they were saying? I don't see any other way to read it.
HOT updates write to the same tuple page and can avoid updating indexes, but it's still a write followed by marking the old tuple for deletion.
snuxoll
17 days ago
> Pg moves the data between positions on update?
I assume they typo'd "partitions" as "positions", and thus the GP comment was the correct reply.
paulddraper
17 days ago
IDK if the different drive is necessary, but yes partitioning on a deleted field would work.
Memory >>>>> Disk in importance.
MaxGabriel
16 days ago
One thing to add about performance: it's also pretty easy in Postgres to index only non-soft deleted data.
I think this is likely unnecessary for most use cases and is mostly a RAM saving measure, but could help in some cases.
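For instance, a partial index (table and column names are just for illustration):

    CREATE INDEX users_email_live_idx
        ON users (email)
        WHERE deleted_at IS NULL;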
rawgabbit
18 days ago
I have worked with databases my entire career. I hate triggers with a passion. The issue is no one “owns” or has the authority to keep triggers clean. Eventually triggers become a dumping ground for all sorts of nasty slow code.
I usually tell people to stop treating databases like firebase and wax on/wax off records and fields willy nilly. You need to treat the database as the store of your business process. And your business processes demand retention of all requests. You need to keep the request to soft delete a record. You need to keep a request to undelete a record.
Too much crap in the database, you need to create a field saying this record will be archived off by this date. On that date, you move that record off into another table or file that is only accessible to admins. And yes, you need to keep a record of that archival as well. Too much gunk in your request logs? Well then you need to create an archive process for that as well.
These principles are nothing new. They are in line with “Generally Accepted Record Keeping Principles” which are US oriented. Other countries have similar standards.
indigo945
17 days ago
What you describe is basically event sourcing, which is definitely popular. However, for OLAP, you will still want a copy of your data that only has the actual dimensions of interest, and not their history - and the easiest way to create that copy and to keep it in sync with your events is via triggers.
rawgabbit
17 days ago
The business processes and database systems I described (and built) existed before event sourcing was invented. I had built what is essentially event sourcing using nothing more than database tables, views, and stored procedures.