> The companies that win won’t be those with the most or even the best features. AI will democratize those. The winners will be built on a data model that captures something true about their market, which in turn creates compounding advantages competitors can’t replicate.
And that's why I think the future of the software industry is data-driven, and we will end up with another GNU-like movement around free and open data models/schemas. I think we already have a good starting point: Linked Data[1] and schema.org[2]
[1]: https://www.w3.org/wiki/LinkedData
[2]: https://schema.org/
Open Science folks understood this fact around 2018 IIRC, and there are a couple of nice standards for encapsulating research data such as RO-Crate [0].
Moreover, the science folks are not a picky bunch and they tend to use what works well, whether it be CSV or XML. As long as there's tooling and documentation, everything is acceptable, which is something I like.
[0]: https://www.researchobject.org/ro-crate/
This was also the aim of RDF and the various metadata schemas like Dublin Core, to standardise ontologies for marking up knowledge.
I totally agree and would like to shill for the FOCUS project - https://focus.finops.org/focus-specification/ - which is an open source project to normalize and standardize the billing format of cloud vendors and SaaS vendors alike. It brings greater transparency and efficiency to understanding that massive cloud bill your company pays every month.
I've used this schema to merge together AWS, GCP, and Azure into 1 unified cloud bill, which unlocks a ton of understanding of where the money is going inside the cloud bills.
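For a rough idea of what that unlocks, here's a minimal SQL sketch. The table and column names are only illustrative and loosely FOCUS-shaped, not copied from the spec:

```sql
-- Illustrative only: names are loosely FOCUS-shaped, not taken verbatim from
-- the specification. Each *_focus_export holds one vendor's billing data
-- already normalized to the common schema.
CREATE VIEW unified_bill AS
SELECT provider_name, service_name, billed_cost, billing_currency, charge_period_start
FROM aws_focus_export
UNION ALL
SELECT provider_name, service_name, billed_cost, billing_currency, charge_period_start
FROM gcp_focus_export
UNION ALL
SELECT provider_name, service_name, billed_cost, billing_currency, charge_period_start
FROM azure_focus_export;

-- One query answers "where is the money going?" across all three clouds.
SELECT provider_name, service_name, SUM(billed_cost) AS total_cost
FROM unified_bill
GROUP BY provider_name, service_name
ORDER BY total_cost DESC;
```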
I don't have much to say about this post other than to vigorously agree!
As an engineer who's full-stack and has frequently ended up doing product management, I think the main value I provide organizations is the ability to think holistically, from a product's core abstractions (the literal database schema), to how those are surfaced and interacted with by users, to how those are talked about by sales or marketing.
Clear and consistent thinking across these dimensions is what makes some products "mysteriously" outperform others in the long run.
It's one of the core ideas of Domain-Driven Design. In the early stages of the process, engineers should work closely with stakeholders to align on the set of terms (primitives, as another commenter has put it), define them, and put them in neat little contextual boxes.
If you get this part right, then everything else becomes an implementation effort. You're no longer fighting the system, you flow with it. Ideas become easier to brainstorm and the cost of changes is immediately visible.
> If you get this part right
And yet it's so easy to get wrong.
We ended up with something like five microservices - that, in principle, could have been used by anyone else in the company to operate on the Domains they were supposed to represent and encapsulate. This one holds Users and User data! This one holds Products, and Product interaction data!
Nobody touched any of those except us, the engineers working on this one very specific product. We could have - should have - just put it all on one service, which would have also allowed us to trivially run database joins instead of having to have services constantly calling each other for data, stitching it together in code.
Sigh.
Immediately thought DDD too!
DDD suggests continuous two-way integration between domain experts <-> engineers, to create a model that makes sense for both groups. Terminology enters the language from both groups so that everyone can speak to each other with more precision, leading to the benefits you stated.
Now how do you get your company (or yourself) to hire those people in such a way that you can basically just have a team of people like this working with PMs to build their ideas?
I like doing this FS journey myself, but I'm stuck "leading teams" of FS/BE/FE mixes and trying to get them to build things I clearly understand and could build myself given enough time. All I have is a team of FE or BE people, or even FS people, who can't just do the above. You need to be very involved with these people to get them there, and it just doesn't scale.
I've recently tried AI (Claude specifically) and I feel like I can build things with Claude much quicker than with the FE/BE/FS people I have. Even including all the frustrations that Claude brings when I have to tell it that it's bullshitting me.
Is that bad? How do you deal with that? Advice?
I have exactly the same experience as you. I tried educating people, but all those developers (and beyond, up to stakeholders), no matter their seniority, don't want to get involved in the domain more than the bare minimum. That naturally leads to me micromanaging everything, which doesn't scale and eventually ends in burnout. As soon as I stop doing micro, all the stuff starts to break down pretty fast. I wrote a book per project trying to get everyone on the same level, but nah (more than 3000 pages in the last decade, 20+ projects). Tried everything in hiring too, found almost nobody during all that time.
I've now left that job and will devote time to trying AI, because I concluded it can't be worse than that.
Same here. No matter how hard I try, using different approaches (coaching, sharing videos, pointing out why this can benefit you personally, showing exactly how it creates results), there simply is no interest. People don't care.
It's even worse than that: even the owner of the company I worked for didn't care that his own company's product would be mediocre, while loudly proclaiming that quality was the goal. It turns out it was the goal only as long as it was incidental and free (no such thing, but it looks that way if you're not deeply involved) and because it sounds good. As soon as reputation collides with immediate profit, profit always wins.
That's something I relate to as well. I like working on different abstraction levels throughout the system.
The only way to cope was to let things go and pick my battles.
I always think of the joke where a sailor goes down to the dock and asks the dock workers if they speak French, English, or German; they just shake their heads no. Later the dock workers chat, and one says to the other that he could learn languages so he'd be able to talk with the sailor. The other replies that the sailor knew three, and it didn't help him.
Reading this thread brought back fond memories of sitting with front-line staff and just chatting with them while watching them work from the corner of my eye. My gimmick was to turn up for morning tea (the staff were older ladies that took homemade cakes to work), and by lunchtime have some frustration of theirs resolved.
It’s such a great feeling when you can make someone’s work better, for the life of me I can’t understand why others wouldn’t jump at the opportunity!
Sadly at current $dayjob, the devs are held at arm's length from the customer. On purpose!
Decoder for people reading:
PM - Product Manager
FS - Fullstack developer
FE - Frontend developer
BE - Backend developer
Huh, my experience has been generally the opposite - most FS/BE/FE folks want to understand the business, and while a good PM will enhance that, the median PM is actively detrimental.
Frankly, if the people you have aren't good enough, then you need to get good at training, get better at hiring (does your hiring process test the skills you want? Without being so long that anyone decent is going to get a better offer before they reach the end of it?), or maybe your company is just not offering enough to attract decent talent. There are plenty of better-than-AI programmers out there, and even in this job market they still have decent options.
The truth is that everyone is correct when going by past experience. With many millions of developers and PMs, all combinations happen.
> Is that bad? How do you deal with that? Advice?
Everything is too recent; nobody can give sure advice on how to deal with your situation. From my view as a fullstack engineer who has worked with LLMs for the past 3 years, your generated product is probably crap, and your only way to assess it is by asking Claude or ChatGPT if it's good, to which it'll probably say yes to make you feel good.
Now go ahead and publish it. If your app brings in revenue, then you did build something quicker. A Claude-generated prototype is as much a product as some PowerPoint slides.
> sales or marketing
Also operations and customer support. They are your interface to real, not hypothetical, customers.
I don't suppose you have any tips on how to get this going in an org? I love where I work and I love the products we make, but my team (phone apps) are treated very often like an afterthought; we just receive completed products from other teams, and have to "make them work." I don't think it's malicious on the part of the rest of the teams, we're just obviously quite a bit smaller and younger than the others, not to mention we had a large departure just as I arrived in the form of my former boss who was, I'll fully admit, far more competent in our products than I am.
I've worked on learning all I can, and I have a much easier time participating in discussions now; however, we still feel a bit siloed off.
There's a term for this - inventing a new primitive. A primitive is a foundational abstraction that reframes how people understand and operate within a domain.
A primitive lets you create a shared language and ritual ("tweet"), compound advantages with every feature built on top, and lock in switching costs without ever saying the quiet part out loud.
The article is right that nearly every breakout startup manages to land a new one.
I call it lego pieces. We want to enable teams to compose useful units together; to enable builders (generally internal teams) to build things with a clear mental model. "Primitives" are the same: base unit of abstraction for the domain.
Another industry term for this is defined in the Domain Driven Design world as a domain's "ubiquitous language"[0]:
> These aspects of domain-driven design aim to foster a common language shared by domain experts, users, and developers—the ubiquitous language. The ubiquitous language is used in the domain model and for describing system requirements.
0 - https://en.wikipedia.org/wiki/Domain-driven_design#Overview
I think you're actually serious, but this is excellent satire.
A well-engineered data model can also be used as the basis for a business rules engine. This is popular in enterprise environments that use technology like Oracle DB or MSSQL. It is possible to implement all the core business logic as stored procedures and functions. These can be directly invoked from something like a web server. Instead of putting all the session validation logic in backend code, it could live in PL/SQL, T-SQL, etc.
The benefit to having the logic and the data combined like this is difficult to overstate. It makes working in complex environments much easier. For example, instead of 10 different web apps each implementing their own session validation logic in some kind of SSO arrangement, we could have them call a procedure on the sql box. Everyone would then be using the same centralized business logic for session validation. Any bugs with the implementation can be fixed in real time without rebuilding any of the downstream consumers.
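A minimal T-SQL-style sketch of what that centralized session check could look like (schema and names are hypothetical):

```sql
-- Hypothetical schema: sessions(token, user_id, expires_at).
-- Every app calls this one procedure instead of reimplementing the check.
CREATE PROCEDURE dbo.ValidateSession
    @Token  NVARCHAR(64),
    @UserId INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT @UserId = user_id
    FROM dbo.sessions
    WHERE token = @Token
      AND expires_at > SYSUTCDATETIME();

    -- Fixing a bug here fixes it for every downstream consumer at once.
    IF @UserId IS NULL
        THROW 50001, 'Invalid or expired session', 1;
END;
```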
Counterpoint: spooky code at a distance is bad. Splitting your code to live partly in source control and partly in the database means keeping multiple layers in sync. This is coupling, and coupling multiple things together, especially if that means teams, means increased overhead.
I have seen business rules as stored procedures lock a business into their current model across a dozen teams, effectively making system improvements impossible. And because they needed some OLAP atop OLTP in some cases, their very beefy Postgres solution crawled down to a max of 2k queries per second. I worked with them for over a year trying to pull apart domain boundaries and unlock teams from one another. Shared stored procedures were a major pole in the tent of things making scaling the technical org incredibly hard.
Repeat after me: uncoupled teams are faster.
A veritable thread with bad advice!
For me, this is a "near miss" in that the data model is an implementation detail. Instead, the subtitle identifies where the value resides:
> Your product's core abstractions determine whether new features compound into a moat or just add to a feature list.
Which is captured by the Domain Model[0]. How it is managed in a persistent store is where a data model comes into play.
See also Domain Driven Design[1].
0 - https://en.wikipedia.org/wiki/Domain_model
1 - https://en.wikipedia.org/wiki/Domain-driven_design
There is a subtle difference: it is not just domain driven design. It is basically trying to innovate a new way to think about an existing domain (e.g. docs vs blocks in note taking). ~ "Your data model is your destiny. The paradox is that this choice happens when you know the least about your market, but that’s also why it’s so powerful when you get it right. Competitors who’ve already built on different foundations can’t simply copy your insight. They’d have to start over, and by then, you’ve compounded your advantage."
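To make the docs-vs-blocks example concrete, here's a PostgreSQL-flavoured sketch of a block-based model. It's purely illustrative, not Notion's actual schema:

```sql
-- Purely illustrative, not Notion's actual schema. Everything (pages,
-- paragraphs, to-dos, database rows) is a block; a document is just a
-- tree of blocks.
CREATE TABLE blocks (
    id         BIGINT PRIMARY KEY,
    parent_id  BIGINT REFERENCES blocks(id),  -- NULL for top-level pages
    block_type TEXT   NOT NULL,               -- 'page', 'paragraph', 'todo', 'table_row', ...
    position   INT    NOT NULL,               -- ordering among siblings
    properties JSONB  NOT NULL DEFAULT '{}'   -- type-specific content
);

-- A doc-centric model, by contrast, treats the page body as an opaque atom:
-- CREATE TABLE documents (id BIGINT PRIMARY KEY, title TEXT, body TEXT);
-- New features (databases, embeds, backlinks) compound naturally on blocks,
-- but have to be bolted onto the opaque body in the document model.
```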
You’re describing core features of Domain Driven Design.
Innovating, evolving, creating, and capturing new domain concepts to create Blue Ocean solutions inside and outside the Enterprise. Iterating on core concepts, via subject matter expert led/involved discussions and designs, and using new concepts to better articulate the domain. Managing that change over time and accounting for ontological and taxonomical overlap versus Enterprise System development needs.
That’s the foundation that can actively copy insights, and doesn’t rely on Immaculate Specification or premature data modelling. No need to start over, thanks to clearly separated concerns.
Note: copying an insight is a far cry from having the wherewithal to make that insight; there are numerous downstream benefits to articulating your business domain clearly and early.
I was expecting a discussion of the foundations of data modelling: star schema vs snowflake schema data models vs one big table. The benefits of 3NF vs when you have to denormalize.
This underlying choice of data model actually does define your destiny. What I think the author was thinking of is domain modelling and correct entity identification, which is also important. It's a layered approach - and if you ignore the foundations (the actual data model), you hit limitations higher up.
For example, in real-time AI systems, you might want users to provide a single value (like an order number) to retrieve precomputed features for a model. With Snowflake Schema data models, it works. But for Star Schema data models, you have to provide entity IDs for all tables containing precomputed features - which leads to big problems (the need for a mapping table, a new pipeline, and higher latency).
Reference:
https://www.hopsworks.ai/post/the-journey-from-star-schema-t...
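A rough sketch of the lookup difference described above, with hypothetical table and column names:

```sql
-- Hypothetical names. Snowflake-style: dimensions are normalized and linked,
-- so a single order_id can be joined outward to the related feature tables.
SELECT o.order_id,
       c.customer_features,
       p.product_features
FROM   fact_orders   o
JOIN   dim_customers c ON c.customer_id = o.customer_id
JOIN   dim_products  p ON p.product_id  = o.product_id
WHERE  o.order_id = 12345;

-- Star-style with independent feature tables keyed by different entities:
-- the caller has to supply customer_id AND product_id up front (or you add
-- a mapping table and an extra pipeline), which is where the latency and
-- plumbing problems described above come from.
```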
I prefer your terminology. That being said, domain modelling (what the article describes) comes first, hence is more foundational and important than data modelling.
This is an application of an engineering term to a product-level concept, but it fits. I guess you'd say "domain model" in product-speak, but to my engineering brain it doesn't evoke the cascading consequences of the model for the rest of the system. It's a rare product manager who treats the domain model as a consequential design product and a potential site of innovation.
I totally agree. Early-days Cloudflare was a great example of this. We treated IP addresses as data, not as configuration. New subnet? INSERT INTO and we're done. Blocked IP? DELETE FROM, and tadam. This was a huge differentiator from other CDNs and allowed us extreme flexibility. The real magic and complexity was in automatically generating and managing HTTPS certs (days before SNI).
Can you explain more? I don’t understand the distinction in this case between data and configuration in the context of IP addresses.
In the simplest scenarios, software is not aware of the IP space. You bind to 0.0.0.0:443 and move on.
In more sophisticated configs, adding / removing IPs or TLS certs requires restarting servers and reconfiguring applications. This gets out of hand quickly. Like what if your server's primary IP is removed because the IP space is recycled?
At CF all these things were just rows in a database, and systems were created to project them down to HTTP server config, network card settings, BGP configurations, etc. All of this is fully automated.
So an action like "adding an IP block" is super simple. This is unique. AFAIK everyone else in the industry, back in 2012, was treating IPs and TLS more like hardware. Like a disk: you install it once and it stays there for the lifetime of the server.
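A toy sketch of the idea (PostgreSQL-flavoured; not the actual CF schema):

```sql
-- Toy sketch, not Cloudflare's actual schema: the address space is just rows.
CREATE TABLE ip_prefixes (
    prefix    CIDR    PRIMARY KEY,          -- e.g. '198.51.100.0/24'
    purpose   TEXT    NOT NULL,             -- 'anycast-edge', 'customer', ...
    announced BOOLEAN NOT NULL DEFAULT FALSE
);

-- "New subnet? INSERT INTO and we're done."
INSERT INTO ip_prefixes (prefix, purpose, announced)
VALUES ('198.51.100.0/24', 'anycast-edge', TRUE);

-- "Blocked IP? DELETE FROM, and tadam."
DELETE FROM ip_prefixes WHERE prefix = '203.0.113.7/32';

-- Downstream automation projects these rows into HTTP server config,
-- NIC settings, BGP announcements, etc.
```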
Not OP but I think the insight was to treat them as first class objects that are interacted with directly. The implementation itself seems secondary.
I enjoyed the article and wanted to share a story that highlights how a good data model can even be used for incident management tickets:
- Incident ticket gets created
- It used to go to a department wide alias
- Head of Dept used to open the email, hit forward and then have to To/CC the owner of the system affected
- JIRA (which we used) already had the idea of a Component, and you could tie an owner to each Component
- We updated the notifications so that the "To" field contained the Component owner and whoever opened the ticket, and the "Bcc" field contained the department-wide alias
- Now, the Dept Head could just hit reply and get to the right people. The BCC meant that everyone knew something had occurred.
I 100% agree on the importance of the data model, and the examples show that it often makes more sense to start from the user's perspective of it rather than from your DB schema.
Interestingly, AI agents are all about disrupting the hard bounds of existing data and interaction models, and it turns out the lowest common denominator is often the winner. E.g.: file system > database, grep > embeddings, markdown > PDF, generative UI > portals, computer use > APIs, etc.
There simply is no need for all that abstraction / interface / infrastructure to, e.g., answer questions about documents or keep track of todo lists, workflows, or messaging, when you have glue that can translate between the data models.
It seems that any problem solving starts by defining the data.
« Always define your variables » is the first thing I learned during my engineering studies, in both math and physics class. Professors insisted on it a lot. I still consider it the most important thing I ever learned, 10 years later.
I recently spent a week or so creating a library for my project. There's not a lot of code, but it was hard to reason about the data model, what I wanted the API to look like, and what I wanted actually rendered on the other side.
I was proud after getting it working, but when I had to run dozens of files through it, it was horribly slow. I don't tend to write a lot of hot code, so I was excited by the fact I had to profile it and make it efficient.
I decided that I should rewrite the system, as my mental model had improved and the code was doing a lot more than it should be doing for no reason. I've spent the last few days redesigning it according to Data-Oriented Design. I'm trying to get the wall-clock time down by more than an order of magnitude. I'm not sure how it's going to turn out, wish me luck :)
Since I mentioned DoD, these three links will probably come up in conversation:
Mike Acton's famous performance talk: https://www.youtube.com/watch?v=rX0ItVEVjHc
DoD in the Zig compiler: https://www.youtube.com/watch?v=IroPQ150F6c
The DoD book: https://dataorienteddesign.com/dodbook.pdf
One nit: while I think Notion's data model is probably superior to that of Google Docs, I don't think their data model is what allowed them to succeed. Much stronger, I think, is their execution.
I would think their data model choice _is_ part of the execution?
Exactly. Google Docs couldn't be Notion because Google tried to build Microsoft Office online, while Notion tried to build Lovable without AI and accidentally made a better Google Doc.
Sure, like a transmission is part of a car. No car could work without one, and a bad one makes an otherwise good car bad. However, a great one can’t make an otherwise bad car good.
I was working for a company recently and we were exploring how to model what a minor can do with their guardian-managed account.
I initially looked at RBAC frameworks, but since that was too complex for a small greenfield project, I went with one or more accounts linked to a user's profile, with an RBAC junction table linking account and profile IDs in a relational database.
The junction table was the secret sauce: it allows you to stuff the RBAC permissions into its rows.
I could get very far with this model. For example, it lets you express who can pay for features (the guardian, not the minor), have multiple people manage a minor, and validate permissions for a logged-in account.
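Roughly what that looks like, with hypothetical names:

```sql
-- Names hypothetical; the junction table carries the RBAC grant itself.
CREATE TABLE accounts (
    id    BIGINT PRIMARY KEY,
    email TEXT   NOT NULL
);

CREATE TABLE profiles (
    id           BIGINT  PRIMARY KEY,
    display_name TEXT    NOT NULL,
    is_minor     BOOLEAN NOT NULL
);

CREATE TABLE account_profile_roles (
    account_id BIGINT NOT NULL REFERENCES accounts(id),
    profile_id BIGINT NOT NULL REFERENCES profiles(id),
    role       TEXT   NOT NULL,  -- 'guardian', 'minor', ...
    PRIMARY KEY (account_id, profile_id, role)
);

-- Who is allowed to pay for features on this profile? (guardian, not minor)
SELECT a.email
FROM account_profile_roles r
JOIN accounts a ON a.id = r.account_id
WHERE r.profile_id = 42
  AND r.role = 'guardian';
```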
Decoder: RBAC = role based access control.
Whatever happened to people being charitable enough to readers to define their acronyms and abbreviations? This page is full of "insider" talk.
Google exists... It's not the responsibility of a commenter sharing knowledge for free to also expand every term that might be unknown to somebody else.
I really like this post. The only caveat I would add is it is possible to change your data model, but it requires constant and sustained high-effort work. It can pay off in spades, and it's always preferable to get it right.
I've led a change like that - the very core of our data model was compromised from the early days of our company and we knew it... and knew it... and four years into working there I started a serious effort that ended up taking about a year and a half to pay off. These efforts always need a lot of careful planning and you usually want to work within the constraints of early model decisions as much as possible, but it is quite possible to gracefully transition. When you're doing something like this it's important to be extremely greedy with SMEs to try and understand as much as you can about the field to future-proof your new solution - our company did that once - there's not a chance it'd do it twice.
I did it for my own startup. Messed up the whole "how do we break down what constitutes a tenant" thing in the initial design at 0 customers. Made me really feel the whole "experience is reading your own code from 5 years ago and wondering what idiot wrote that" thing.
Worked out OK in the end, but took substantial effort to fix.
As a customer I often look for the data model without a "moat". I want to be able to move my data to a different supplier without too much hassle
luckily for SaaS builders, most people aren't like you
The examples are great but... I think... the conclusion is naive/wrong in a lot of cases.
Aren't both right and left sides of most examples fundamentally different views of the same exact underlying data/relationships?
If you're locked in to one view, then your code sucks. I'm not an expert in this stuff, but I think this is just a natural consequence of thinking in terms of tables and SQL.
If you think in terms of triple stores and graph databases then you can derive either a left side or right side view as needed and you can operate on either abstraction
The data might be a graph, but we're interested in conceptual units of data / abstraction.
Organizations as a whole need to talk the same language. "We have the data somewhere, somehow - so figure it out" doesn't work when communicating.
A good proxy measure is how quickly two random employees can get to talking about the same thing. It's the effective knowledge-transfer speed after taking into account the data density / compression.
The infinite many views of a graph don't compress.
Or more practically, any graph will continuously evolve a set of 'common views' that an organization understands and uses.
I think you're mixing things up.
People don't need to have the same language at all.
Compression can happen by "users" at a higher level when interacting with the data model.
The data model itself does not need to encode this.
HR may want to look at employees from one view and your accountant might want to look from an entirely different one. They communicate with people in their own field using their own compression. You don't need to settle on one abstraction for everyone.
The data-model designers need to communicate with each other.. sure.. but it's their job to think in a non-compressed way about the data model.
Data model here is not necessarily what is in the DB; I think TFA is using "data model" more abstractly: it's the model of the business domain that is prevalent in the organization and structures how they talk about the domain.
An interesting take. In the abstract it probably makes sense. You can't make a large org like Instagram start doing a Twitter or something like that.. Though I read the subtext to be talking about code
> By the time the architecture solidifies around these implicit choices, it’s nearly impossible to change.
> Incumbents couldn’t match this without rebuilding from scratch.
My point is that if your actual code-level data model is flexible, then you often can architecturally pivot to a different view (the Rippling example). My guess as to why HipChat couldn't change to a Slack model is that they "coded it wrong". It's not somehow inherently orthogonal. The essay presents the models as a fait accompli when in fact, if you step back, they're for the most part just views on the same fundamental data/relationships, and if modeled correctly I don't see any reason why you wouldn't be able to change between views.
The central thesis that "Your data model is your destiny" does not have to be true
Though that said, there are some examples that probably can't pivot due to some inherent design limitation (ex: the Adobe case)
This reminds me of "Good programmers worry about data structures and their relationships. (https://read.engineerscodex.com/p/good-programmers-worry-abo...).
From Linus Torvalds:
"git actually has a simple design, with stable and reasonably well-documented data structures. In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful […] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important.
...
Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
Is Linus actually a good programmer though? Linux is certainly popular, but he had a lot of help with it. As a person he seems prone to lashing out and childishness. I doubt he'd last in the real world with his attitude; the only place he could succeed is heading his own open source project.
He can act however he wants considering his contributions to humanity at large
He built git alone in a week as a side project. If you think you are a good programmer, there's no way Linus isn't many orders of magnitude better.
Sure. Now show us your best creation and we will compare.
Now there is a difference between what the article is talking about and what you are talking about, and I think that's quite important, because we tend to mix these things up often.
The article describes domain modeling, what you describe is computational modelling. The former lives at a higher abstraction closer to the user. The latter is about data processing.
A lot of people have mentioned DDD (or similar) in this thread, but I think that is an example of mixing up computational modeling and domain modeling. I think this is what object orientation and its descendants like micro services generally have been doing wrong: Applying domain structure at a level where it makes no sense anymore. This mismatch can add a lot of friction, repetition and overhead.
The git data model isn't ideal though, it misses content-defined chunking of file content and directory entries, which leads to lots of duplicate data with large text files or directories containing large numbers of files. Newer backup tools like restic/borg support this though.
That seems like an implementation detail, not a fundamental design decision as it should be easy to change how packfiles are implemented. I'm not sure it would be an improvement though: it already only stores deltas for similar objects.
Am I the only one who is bothered by the gradual shift of the expression "data model" from something that actually meant something to a vaguely defined, buzzwordy idea which can be brandished to mean anything from "ontology" to "data flow diagram" to even less precise business-like entities?
It's everywhere in my current company, where top management has somehow agreed that data is the future and likes to talk about data products all the time, but with an actual understanding of what it means, requires, and entails that is close to nil. It's very lucrative for our suppliers, however.
It feels to me like some kind of repackaging of what the semantic web was promising, with very little in terms of actual novelty and no real solution to the problems encountered at the time. It's everywhere in this discussion: "domain driven design". I saw a post this week about knowledge graphs.
Where is the push coming from? I'm missing some fundamental innovation somewhere which makes this more practical?
In my experience, it always ends the same: a slow death by governance. Business object owners end up lagging behind what's needed in the field, gateways and caches start popping up everywhere as your model doesn't fit what the software you're buying requires, data quality becomes uneven across the IT system, and costs creep up through duplication, until someone higher up (probably promoted for putting the mess in place) decides it's time to decouple and simplify, and gets their next promotion by ending what they created.
Isn’t this more of a modal usage thing than the actual data model?
Isn't the Slack data model presented here totally possible with HipChat's actual data model?
How it's presented in the UI is roughly a function of how the underlying data is structured and manipulated. You can put in a lot of effort and construct a different view on top of a data model that "wants to" be seen in a different way (Delta Chat being an example of this on top of email), sure. But it increases the complexity of the implementation and makes the abstraction thicker, making iteration harder and introducing space for users (and onboarding developers) to misunderstand how things actually work.
What about your algorithm destiny? code = data
Wow the illustrations are good! They really help in understanding what the text says - the slack one is one that made me go "ooooooh"
My data model is going to be Apache Arrow. Any kind of Arrow table.
What is my destiny?
Is it data model or product?
Are they effectively the same?
While the title reads a bit dramatic, I find it hard to disagree with the concepts.
This is why I come to HN.
yeah.. point me to a business where the data model is more important than the bottom line...
The data model is what drives the bottom line.
Why is Goldman Sachs so profitable? They have a good data model and have spent 20+ years refining and applying it.
Optimize your organization for dual-write migrations and log replays. Now you can do what many cannot: change the data model.
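One way that can look in practice; this is only a sketch with hypothetical names (PostgreSQL trigger syntax), and the dual write can just as well live in application code:

```sql
-- Sketch only, hypothetical names. While both models coexist, every write to
-- the old table is projected into the new one.
CREATE OR REPLACE FUNCTION copy_to_new_model() RETURNS trigger AS $$
BEGIN
    INSERT INTO messages_v2 (id, room_id, thread_id, body)
    VALUES (NEW.id, NEW.room_id, NULL, NEW.body)
    ON CONFLICT (id) DO UPDATE SET body = EXCLUDED.body;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER dual_write
AFTER INSERT OR UPDATE ON messages_v1
FOR EACH ROW EXECUTE FUNCTION copy_to_new_model();

-- Backfill history by replaying the log (or the old table) into messages_v2,
-- verify the two stay consistent, then cut reads over to the new model.
```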
Agree with the first half of the article, but every example the author points out predates AI. What are examples of companies founded in the past 3 years that prove the author's point that the data model is the definitive edge?
What does AI have to do with anything here?
Just had a chat with AI to see how we could address the issues mentioned in the article. You can create models that cater to multiple use cases. You can split the domain model into facts (tables) and perspectives (views). This gives you a lot of flexibility in addressing the different perspectives presented in the article from a shared domain model.
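A minimal sketch of that facts-vs-perspectives split, with hypothetical names:

```sql
-- Hypothetical names. Facts live in one table; each perspective is a view.
CREATE TABLE employees (
    id         BIGINT  PRIMARY KEY,
    full_name  TEXT    NOT NULL,
    department TEXT    NOT NULL,
    salary     NUMERIC NOT NULL,
    manager_id BIGINT  REFERENCES employees(id)
);

-- HR's perspective: org structure, no pay data.
CREATE VIEW hr_org_chart AS
SELECT id, full_name, department, manager_id
FROM employees;

-- Finance's perspective: cost per department, no names.
CREATE VIEW finance_department_costs AS
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department;
```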
The value of the data model in the post is spot on. AI has the potential to offer a mapping from the old to the ideal (materialising a view), potentially offering an evolutionary path out for the smarter orgs.