Don't let dicts spoil your code

151 points posted 13 hours ago
by juniperplant

82 Comments

cardanome

8 hours ago

This is absolute key advice.

Another way to look at it is the functional core, imperative shell pattern.

Wrapping up your dict in a value object (dataclass or whatever that is in your language) early on means you handle the ugly stuff first. Parse, don't validate. Resist the temptation of optional fields. Is there really anything you can do if the field is null? No? Then don't make it optional. Let it crash early on. Clearly define your data.

If you have put your data in neat value objects, you know what is in them. You know the types. You know all required fields are there. You will be so much happier. No checking for null throughout the code, no checking for empty strings. You can just focus on the business logic.
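A minimal sketch of that idea in Python, assuming a hypothetical `User` record with `id` and `email` fields (the names and the `parse_user` helper are illustrative, not from the article). The untrusted dict is converted once at the boundary, and anything missing or mistyped fails immediately:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class User:
    """Hypothetical value object; the field names are assumptions."""
    id: int
    email: str


def parse_user(raw: dict[str, Any]) -> User:
    """Convert an untrusted dict into a User, failing loudly at the boundary."""
    try:
        user_id = raw["id"]
        email = raw["email"]
    except KeyError as exc:
        raise ValueError(f"missing required field: {exc.args[0]}") from exc
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(user_id, int) or isinstance(user_id, bool):
        raise ValueError("id must be an int")
    if not isinstance(email, str) or not email:
        raise ValueError("email must be a non-empty string")
    return User(id=user_id, email=email)


# Extra keys are ignored; required keys are guaranteed afterwards.
user = parse_user({"id": 7, "email": "ada@example.com", "unused": "ignored"})
print(user.email)
```

After this point the business logic only ever sees `User`, so it has no reason to check for `None` or empty strings again.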

Seriously so much suffering can be avoided by just following this pattern.

mcdeltat

an hour ago

The "loosey-goosey" approach to data in coding is one of my biggest pet peeves. Some people absolutely insist on making everything as dynamic as possible, and then wonder why we end up with a buggy mess. I always found it very natural to move as much as possible into the type system, because why wouldn't I want the machine to find all my inevitable mistakes for me?

Kinrany

2 hours ago

There's little to disagree with here, and yet this comment reads like a slogan soup.

Jean-Papoulos

4 hours ago

>"unstructured data is problematic"

>"solution : use dataclasses"

Damn, it's almost like using an untyped language for large projects is not a great idea.

mkesper

an hour ago

Python is absolutely typed. By default, it's really dynamic, though.

ktosobcy

2 hours ago

And yet we are overwhelmed by javascript nonsense... I get it - it's so easy to get up to speed with tiny snippets but it quickly becomes a hot mess.

Yes, decades ago I was also fascinated by python and its ease of doing stuff (compiler doesn't complain that I missed something) but with time I grew fond of statically typed languages... they simply catch swaths of errors earlier...

kabes

an hour ago

Are we still overwhelmed by js? I almost only see TS code these days.

jimmytucson

8 hours ago

Here’s an out-there take, but one I’ve held loosely for a long time and haven’t shed yet: dicts are not appropriate for what people mostly use them for, which is named access to member attributes.

dict is an implementation of a hash table. Hash tables are designed for O(1) lookup of items. As such, they are arrays which are much bigger than the number of items they store, to allow hashing items into integers and sidestep collisions. They’re meant to act like an index that contains many records, not a single record.

A single record is more like a tuple, except you want named access instead of, title = movie[0], release_year = movie[1], etc. And Python had that, in NamedTuple, but it was kinda magical and no one used it (shoutout Raymond Hettinger).
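As a rough illustration, here is the same record as a dict and as a `typing.NamedTuple`. The `Movie` class and its values are made up for the example, and the byte counts printed at the end are shallow container sizes that vary between Python versions:

```python
import sys
from typing import NamedTuple


class Movie(NamedTuple):
    title: str
    release_year: int


as_dict = {"title": "Alien", "release_year": 1979}
as_tuple = ("Alien", 1979)
as_record = Movie(title="Alien", release_year=1979)

# Named access instead of movie[0], movie[1].
print(as_record.title, as_record.release_year)

# A NamedTuple instance is still a plain tuple underneath.
print(as_record == as_tuple)

# Shallow container sizes in bytes; exact numbers depend on the interpreter.
print(sys.getsizeof(as_dict), sys.getsizeof(as_record))
```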

Granted, this rant is pretty much the meme with the guy explaining something to a brick wall, in that dicts are so firmly entrenched as the "record" type of choice in Python (but not so in other languages: struct, case class, etc. and JSON doesn’t just deserialize to a weak type but I digress).

fallingsquirrel

7 hours ago

NamedTuples are great, but they let you do too much with the objects. You probably don't want users of your GitHubRepo class to be able to do things like `repo[1]` or `for foo in repo`. Dataclasses have more constrained semantics, so I reach for them by default. In my ideal world they would default to frozen=True, kw_only=True, slots=True, but even without those they're a big improvement.

jsyang00

8 hours ago

I think most modern Python codebases are using dataclasses or something like Pydantic. I think dicts are mostly seen, like the author suggests, because something which you hacked up to work quickly ends up turning into actual software and it's too much work to refactor the types.

aatarax

6 hours ago

Dicts in python are for when you have a thing and you aren't sure what the keys are. Dataclasses are for when you have a thing and you're sure what the keys (attributes) are. The trouble is when you have a thing and you're sort of sure, but not entirely sure, and some things are definitely there but not everything you might be thinking of.

travisjungroth

5 hours ago

I think I once heard a Clojure talk where they were referred to as big and small maps. Small ones are what you’re comparing to arrays.

A place where dicts for hard coded keys makes sense is notebooks. The convenience is worth it and it’s unlikely to get out of hand.

seabrookmx

5 hours ago

Subclassing NamedTuple is very ergonomic, and given they're immutable unlike data classes I often reach for them by default. I still use Pydantic when I want custom validation or when it ties into another lib like FastAPI.

jonathrg

3 hours ago

dicts are used internally in the language to look up class and module attributes. They are optimized for this use case. How can it be wrong to use them that way when the very fabric of the language depends on it?
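A quick way to see this in CPython. The `Repo` class here is a throwaway example:

```python
import json


class Repo:
    default_branch = "main"          # class attribute

    def __init__(self, name: str) -> None:
        self.name = name             # instance attribute


repo = Repo("cpython")

# Instance attributes live in a per-instance dict.
print(vars(repo))                        # {'name': 'cpython'}

# Writing to that dict is the same as setting an attribute.
vars(repo)["stars"] = 60000
print(repo.stars)                        # 60000

# Class and module namespaces are dict-backed too.
print("default_branch" in vars(Repo))    # True
print(isinstance(vars(json), dict))      # True
```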

namedtuple is widely used in Python code, especially before the introduction of dataclasses.

ungamedplayer

4 hours ago

Can someone educate me in why dicts are uncool for explained reasons, but clojure (which seems to be highly recommended on hn) seems to suffer the same issues when dealing with a map as a parameter (ring request etc).

I know how to deal with missing values or variability in maps, and so do a lot of people.. what am I missing here?

bloppe

4 hours ago

Dicts are great when the data is uniform and dynamic, like an address book mapping names to contact info. You never assume that a key must be in there. Lookups can always fail. That's normal for this kind of use-case.

When the data is not uniform (different keys point to differently-typed values), and not as dynamic (maybe your data model evolves over time, but certain functions always expect certain keys to be present), a dict is like a cancer. Sure, it's simple at first, but wait until the same dict gets passed around to a hundred different functions instead of properly-typed parameters. I just quit my tech job at a company that shall remain nameless, partially because the gigantic Ruby codebase I was working on had a highly advanced form of this cancer, and at that point it was impossible to remove. You were never sure if the dict you were supplying to some function had all the necessary keys for the function it would eventually invoke 50 layers down the call stack. But changing every single call-site would involve such a major refactor that everybody just kept defining their functions to accept these opaque mega-dicts. So many bugs resulted because of this. That was far from the only problem with that codebase, but it was a major recurring theme.

I learned this lesson the hard way.

cornholio

3 hours ago

This should be the top answer. It's not about using dicts in their primary use case, it's about abusing them as a catch all variadic parameter for quick prototyping and "future expansion"

scotty79

an hour ago

I think the problem is that different data containers have completely different interfaces.

If getting a field of your object had the same syntax as getting a value from a dict you could easily replace dicts with smarter, more rigid types at any point.

My dream is a language that has the containers share as much interface as possible so you can easily swap them out according to your needs without changing most of the code that refers to them. Like easily swap dict for BTreeMap or Redis.

I think the closest is Scala but it had fallen out of favor before I had a chance to get to know it.

lispisok

4 hours ago

Maps aren't nearly as problematic in clojure because data is immutable by default, on top of the functional paradigm where your program is basically a big composition of functions and the language is built around using maps. In Python I largely agree with the author. In clojure I love my maps.

Here is Rich Hickey with an extreme counter example although I would argue he's really demonstrating against getters and setters. https://www.youtube.com/watch?v=aSEQfqNYNAc

nlitened

3 hours ago

In Clojure, maps don’t have either of the flaws highlighted in the article. They are neither opaque (they are self-describing with namespaced keys) nor mutable.

As a result, they are very powerful and simple to use.

orf

3 hours ago

They also work fine with JavaScript.

The issue is that the concrete types are implicit. Depending on the language, runtime or type system expressing the type in a “better” way might be very hard or un-ergonomic.

bigstrat2003

9 hours ago

For better or for worse, Python doesn't do typing well. I don't disagree that I prefer well defined types, but if that is your desire then I think Python is perhaps not the correct choice of language.

Ey7NFZ3P0nzAe

7 hours ago

Personally I became a huge fan of beartype : https://pypi.org/project/beartype/

leycec, the magic dev behind it, managed to make a full python type checker with super advanced features and about 0 overhead. It's crazy

skeledrew

5 hours ago

I tried using it, but beartype quickly became a pain with having to decorate things manually. Then I found typeguard which goes even further and never looked back. Instead of manually decorating each individual function, an import hook can be activated that automatically decorates any function with type annotation. Massive QoL improvement. I have it set to only activate during testing though as I'm unsure of the overhead.

nerdponx

8 hours ago

Python does typing pretty darn well now for data like API requests and responses.

"Typed Python" does poorly (compared to e.g. Typescript) on things like overloading functions, generics, structural subtyping, et al.

est

7 hours ago

> Python doesn't do typing well

Golang does typing, but JSONs are PITA to handle.

Try parsing something like `[{"a": 1, "b": "c", "d": [], "e": {}}, null, 1, "2"]` in go.

Types are a blessing as well as a curse.

Aditya_Garg

7 hours ago

That's only because your list has different types. It's a badly formed API and if you really need to support that use case then you can use maps and reflection to handle it.

est

7 hours ago

The problem is, programmers can't dictate what JSON should look like in the wild.

We used to have strict typed XML. Nobody even bothered.

a57721

5 hours ago

> The problem is, programmers can't dictate what JSON should look like in the wild.

Not JSONs in general, but a sane API would never return something like that.

> We used to have strict typed XML. Nobody even bothered.

Nowadays there is OpenAPI, GraphQL, protobuf, etc. and people do bother about such things.

mook

5 hours ago

Unfortunately, a lot of the time you need to deal with other people's APIs.

shiroiushi

5 hours ago

>We used to have strict typed XML. Nobody even bothered.

Yeah, because it was ugly as hell and not human-readable.

Turskarama

4 hours ago

And if you got that JSON back in Python, how would you do anything with it? This API is essentially useless. You can deserialise it, sure, but then what?
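To make the "then what?" concrete, here is a sketch using a well-formed version of the list quoted upthread. `describe` is a made-up helper; the point is that every consumer ends up branching on the runtime type of each element:

```python
import json

payload = '[{"a": 1, "b": "c", "d": [], "e": {}}, null, 1, "2"]'


def describe(item: object) -> str:
    """Dispatch on the runtime type of each decoded JSON value."""
    if item is None:
        return "null"
    if isinstance(item, bool):  # bool is a subclass of int, so check it first
        return f"bool {item}"
    if isinstance(item, (int, float)):
        return f"number {item}"
    if isinstance(item, str):
        return f"string {item!r}"
    if isinstance(item, list):
        return f"array of {len(item)}"
    if isinstance(item, dict):
        return f"object with keys {sorted(item)}"
    raise TypeError(f"unexpected JSON value: {item!r}")


kinds = [describe(item) for item in json.loads(payload)]
print(kinds)
```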

Garlef

3 hours ago

I don't think dicts themselves are the problem.

In typescript using plain JS objects is very straightforward. Of course you have to validate the schema at your system boundaries. But you'll have to do this either way.

So: If this works very well in TS it can't be dicts themselves but must be the way they integrate into- and are handled in python.

This leads me to the conclusion that arguments presented in the article might be the wrong ones.

(But I still think the conclusion the article arrives at is okay. But I don't think there's a strong case being made in the article about whether to prefer data classes or typed dicts.)

soulchild77

2 hours ago

This. I think types really make the difference here. You can get very far with just plain old JS objects as long as you've got strong types in place.

Attummm

35 minutes ago

Python has made its rise as an antithesis to Java thinking. Classes used to be seen by some in the community as an anti-pattern. [0] The coding style used to focus on "Pythonic-ness," which meant using Python's expressiveness to write code in such a way that type information could be inferred without explicitly stating the type.

Most developers will carry their previous language paradigms into their new ones. But if types, DDD (Domain-Driven Design), and classes are what you're looking for, then Python isn't the best fit. Python doesn't have compiler features that work well with those paradigms, such as dead code removal/tree shaking. However, starting out with dictionaries and then moving over to dataclasses is a great strategy.[1] As a small note, it's kind of ironic that the statically typed language Go took inferred typing with their := operator, while there is now a movement in Python to write foo: str = "bar".

[0] https://youtu.be/o9pEzgHorH0?si=pv0QQyM-iBrHuXUN

[1] https://docs.python.org/3/library/dataclasses.html

CraigJPerry

3 hours ago

This has merit in some cases but let me try to make a counterpoint.

You lose the algebra of dicts - and it’s a rich algebra to lose since in python it’s not just all the basic obvious stuff but it’s also powerful things like dict comprehensions and ordering guarantees (3.7+ only).
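A few of the operations in question, as a short sketch. The config keys are invented for the example, and the `|` merge operator assumes Python 3.9+:

```python
defaults = {"host": "localhost", "port": 8080}
overrides = {"port": 9090, "debug": True}

# Merge: the right-hand side wins on conflicts (the | operator needs Python 3.9+).
merged = defaults | overrides

# Dict comprehension: derive a new mapping from an existing one.
stringified = {key: str(value) for key, value in merged.items()}

# Key views support set operations.
shared_keys = defaults.keys() & overrides.keys()

# Insertion order is preserved (guaranteed since Python 3.7).
print(list(merged))    # ['host', 'port', 'debug']
print(stringified)     # {'host': 'localhost', 'port': '9090', 'debug': 'True'}
print(shared_keys)     # {'port'}
```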

You tightly couple to a definition - in the simple GitHubRepository example this is unlikely to be problematic. In the real world, coupling like this[1] to objects trying to capture domain data with dynamic structures is regularly the stuff of nightmares.

The over-arching problem with the approach given is that it puts code above data. You take what could be a schema, inert data about inert data, and instead use code. But it might also be an interesting case to consider as a slippery slope - if you can put code concerns above data concerns then maybe soon you will see cases where code concerns rank higher than the users of your software?

[1] - by coupling like this I mean the “parse don’t validate” school of thought which says as soon as you get a blob of data from an external source, be it a file, a database or in this case a remote service, you immediately tie yourself to a rocket ship whose journey can see you explosively grow the number of types to accurately capture the information needed for every use case of the data. You could move this parsing operation to be local to the use case of the data (much better) rather than have it here at the entry point of the data to the system but often times (although not always) we can arrive at a simpler solution if we are clever enough to express it in a style that can easily be understood by a newbie to programming. That often means relying on the common algebra of core types rather than introducing your own types.

zmgsabst

3 hours ago

You also make a nightmare of dynamically adding middleware — which can piggyback on a generic dict and have no meaningful way to insert themselves into your type maze.

fhdsgbbcaA

9 hours ago

Seems like the issue is less using dicts than not treating external APIs as input that needs to be sanitized.

physicsguy

18 minutes ago

The code in the examples doesn't even check the API response code, let alone the structure of the response.

pmarreck

9 hours ago

Agreed. If you sanitize/allowlist API data you should not have issues with dicts.

imron

9 hours ago

You'll have issues if you ever rename things in the dict.

Linting tools will pick up on every instance where you forgot to rename the fields of a class, but won't do the same for dicts.

FreakLegion

8 hours ago

TypedDicts solve the linting problem, but refactoring tools haven't caught up (unlike e.g. ForwardRef type annotations, which are strings but can be transformed alongside type literals).

tomjakubowski

8 hours ago

Is there any advantage to using a TypedDict for a record over a dataclass?

FreakLegion

6 hours ago

TypedDicts "aren't real" in the sense that they're a compile-time feature, so you're getting typing without any deserialization cost beyond the original JSON. Dataclasses and Pydantic models are slow to construct, so that's not nothing.

This of course means TypedDicts don't give you run-time validation. For that, and for full-blown custom types in general, I tend to favor msgspec Structs: https://jcristharif.com/msgspec/benchmarks.html#json-seriali....
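A minimal demonstration of the "no run-time validation" part. `RepoDict` is a hypothetical type; a static checker such as mypy or pyright would be expected to flag the assignment, but the interpreter does nothing:

```python
from typing import TypedDict


class RepoDict(TypedDict):
    name: str
    stars: int


# Wrong value type for "stars": a type checker complains, the runtime does not.
bad: RepoDict = {"name": "cpython", "stars": "lots"}

print(type(bad))                      # <class 'dict'>
print(isinstance(bad["stars"], int))  # False
```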

orf

4 hours ago

> Dataclasses and Pydantic models are slow to construct

Citation needed? Pydantic is really quite fast, and you can pass raw JSON responses into it.

It may be slower (depending on the validators or structure), but I’d expect it to be comparably fast to the stdlib JSON module.

FreakLegion

3 hours ago

Pydantic's JSON parsing is faster than the built-in module, on par with orjson, but creating model instances and run-time type checking net out to be much slower. I linked msgspec's benchmarks in the previous post.

cle

6 hours ago

Dicts can be a problem, but this particular example isn't that great, like in this diagram from the article:

  External API <--dict--> Ser/De <--model--> Business Logic

Life's all great until "External API" adds a field that your model doesn't know about, it gets dropped when you deserialize it, and then when you send it back (or around somewhere else) it's missing a field.

There's config for this in Pydantic, but it's not the default, and isn't for most ser/de frameworks (TypeScript is a notable exception here).

Closed enums have a similar tradeoff.
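One way to avoid the dropped-field problem without a framework is to carry unknown keys along explicitly. This is only a sketch with a hypothetical `Ticket` model and an `extra` field chosen for illustration, not how any particular ser/de library does it:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Ticket:
    """Hypothetical model that keeps fields it does not understand."""
    id: int
    title: str
    extra: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Ticket":
        known = {"id", "title"}
        return cls(
            id=raw["id"],
            title=raw["title"],
            extra={k: v for k, v in raw.items() if k not in known},
        )

    def to_dict(self) -> dict[str, Any]:
        # Unknown fields are written back unchanged.
        return {**self.extra, "id": self.id, "title": self.title}


incoming = {"id": 1, "title": "Fix login", "priority": "high"}  # "priority" is new
ticket = Ticket.from_dict(incoming)
ticket.title = "Fix login flow"
print(ticket.to_dict())
```

The business logic still works against typed `id` and `title`, while `priority` survives the round trip.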

mjr00

5 hours ago

If external API adds a new field but your software already worked, you didn't need it in the first place, so why should it matter?

Dropping unknown/unused fields makes sense in 99% of cases.

buzer

5 hours ago

Unfortunately some APIs assume that they will get all the fields as part of the update. If a field doesn't exist in the input the API gets, it will drop the original value during the update.

_ZeD_

4 hours ago

yet, again, most of the libraries already deal with extra fields... i.e. for pydantic https://docs.pydantic.dev/latest/concepts/models/#extra-fiel...

vouwfietsman

3 hours ago

I don't deal with external APIs often, but this is a development nightmare. You can't just magically let data flow through your system without knowing about it, because this is not how programming works. Your API has a contract and your code is written to support that contract, if the contract changes it should either be a very consciously decided breaking change that is versioned somehow, or it should be an unversioned non breaking change. Apparently whatever data is added like this is completely meaningless to your program so why do you need to be in charge of passing it back to the API.

Changing your API and assuming everything just keeps working is a nonsense cowboy attitude to software compatibility, even if some frameworks bend over backwards to support it through magic that's hidden from the developer. Furthermore, many programming languages are simply incapable of doing this, and this approach to APIs is immediately restricting those languages from use.

Finally, transforming objects to an internal domain model is really the cornerstone of a lot of recent well-thought-out programming discipline, and this API design is throwing that in the garbage. It's explicitly asking you to mess up your service architecture, spreading bad architecture like a virus to all systems that interact with the API.

cschneid

9 hours ago

I generally support this. When dealing with API endpoints especially I like to wrap them in a class that ends up being. I also like having nested data structures as their own class sometimes too. Depends on complexity & need of course.

    class GetThingResult
      def initialize(json)
        @json = json
      end
    
      # single thing
      def thing_id
        @json.dig('wrapper', 'metadata', 'id')
      end
    
      # multiple things
      def history
        @json['history'].map { |h| ThingHistory.new(h) }
      end
      # ... two dozen more things
    end

cranium

5 hours ago

Python dataclasses are a good start for internal use. They are just a bit of a pain to serialize/deserialize natively. When it comes to that, I prefer to use Pydantic objects and have all the goodies, at the cost of some complexity.

xenoxcs

5 hours ago

I'm a big fan of using Protobuf for the third-party API validation task. After some slightly finicky initial schema definition (helped by things like json-to-proto.github.io), I can be sure the data I'm consuming from an external API is strongly typed, and the functions included in Protobuf which convert JSON to a Proto message instance blow up by default if there's an unexpected field in the API data they're consuming.

I use it to parse and validate incoming webhook data in my Python AWS Lambda functions, then re-use the protobuf types when I later ship the webhook data to our Flutter-based frontend. Adding extensions to the protobuf fields gives me a nice, structured way to add flags and metadata to different fields in the webhook message. For example, I can add table & column names to the protobuf message fields, and have them automatically be populated from the DB with some simple helper functions. Avoids me needing to write many lines of code that look like:

MyProtoClass.field1 = DB.table.column1.val

MyProtoClass.field2 = DB.table.column2.val

pansa2

4 hours ago

> convert [dicts] immediately to data structures providing semantics [...] You can simplify your work by employing a library that makes “better classes” for you

Python seems to have many different kinds of "better classes" - the article mentions `dataclass` and `TypedDict`, and AFAIK there are also two different kinds of named tuple (`collections.namedtuple` and `typing.NamedTuple`).

What are the advantages of these "better classes" over traditional classes? How would you choose which of the four (or more?) kinds to use?
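For reference, a side-by-side sketch of the four, using a made-up two-field point. It only shows what each one is at run time, not which to prefer:

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple, TypedDict

PointA = namedtuple("PointA", ["x", "y"])   # tuple subclass, no annotations


class PointB(NamedTuple):                    # tuple subclass with annotations
    x: int
    y: int


class PointC(TypedDict):                     # plain dict at run time
    x: int
    y: int


@dataclass
class PointD:                                # regular class, mutable by default
    x: int
    y: int


a, b, c, d = PointA(1, 2), PointB(1, 2), PointC(x=1, y=2), PointD(1, 2)

print(isinstance(a, tuple), isinstance(b, tuple))  # True True
print(type(c) is dict)                             # True
print(a == b == (1, 2))                            # True: tuples compare by value
d.x = 10                                           # allowed; a.x = 10 would raise AttributeError
print(d)                                           # PointD(x=10, y=2)
```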

pansa2

4 hours ago

To me, the proliferation of "better classes" implies there's a problem with Python's built-in classes - but what's wrong? Are they just too flexible and/or too verbose? Or actually deficient in some way?

zmgsabst

3 hours ago

People enjoy the flexibility and many Python systems rely on duck-typing via dicts, etc.

So people are trying to force Python to be something it isn’t in adherence to their ideology, but it fails to gain consensus because there’s a sizable cohort that use Python because it isn’t those things.

So we get repeated implementations, from each ideologically motivated group.

karmakurtisaani

7 hours ago

I've cleaned up code where input parameters came in a dict form. Absolute shit show.

- The only way to figure out which parameters are even possible was to search through the code for the uses of the dict.

- Default values were decided on the spot all over the place (input.getOrDefault(..)).

- Parameter names had to be typed out each time, so better be careful with correct spelling.

- Having a concise overview how the input is handled (sanitized) was practically impossible.

0/10 design decision, would not recommend.
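A sketch of the alternative in Python, with a hypothetical `JobParams` object (the names and defaults are invented). The possible parameters, their defaults and the sanitising all live in one place, and a misspelled name fails at construction:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobParams:
    """Hypothetical parameter object; names and defaults are made up."""
    input_path: str
    retries: int = 3
    verbose: bool = False

    def __post_init__(self) -> None:
        # Sanitising happens once, in one place.
        if self.retries < 0:
            raise ValueError("retries must be >= 0")


def run(params: JobParams) -> str:
    return f"{params.input_path} (retries={params.retries}, verbose={params.verbose})"


print(run(JobParams(input_path="data.csv")))

try:
    JobParams(input_path="data.csv", retires=5)  # misspelled parameter name
except TypeError as exc:
    print("rejected:", exc)
```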

est

7 hours ago

dicts are OK, because at least they do have a `key` and it does mean something.

un-annotated tuples and too many func params are cancer.

ramraj07

7 hours ago

Who does this still??

directevolve

3 hours ago

In bioinformatics, one of our main dataflow platforms, Nextflow, is built with unnamed tuples in mind. Implementing the ability to conveniently pass data with HashMaps instead of unnamed tuples was a huge boost to usability for me.

stonethrowaway

7 hours ago

No no,

Un-annotated tuples and too many func params are OK, because at least they are pushed and popped from the stack.

Calls and rets without a prologue and epilogue on the other hand…

est

7 hours ago

> from the stack

Or many, many stacks you can't comprehend nor amend.

I dare to add a new `key` to a dict, can you modify a func call or a tuple with confidence?

Waterluvian

9 hours ago

I think one really nice thing about Python is duck typing. Your interfaces are rarely asking for a dict as much as they’re asking for a dict-like. It’s pretty great how often you can worry about this kind of problem at the appropriate time (now, later, never) without much pain.

There’s useful ideas in this post but I’d be careful not to throw the baby out with the bath water. Dicts are right there. There’s dict literals and dict comprehensions. Reach for more specific dict-likes when it really matters.
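As a rough example of asking for a dict-like rather than a dict: the function below only relies on the `collections.abc.Mapping` interface, so a plain dict and a custom class both work. `EnvConfig` and the setting names are hypothetical:

```python
from collections.abc import Iterator, Mapping


class EnvConfig(Mapping):
    """Hypothetical read-only dict-like that upper-cases keys on lookup."""

    def __init__(self, values: dict[str, str]) -> None:
        self._values = {k.upper(): v for k, v in values.items()}

    def __getitem__(self, key: str) -> str:
        return self._values[key.upper()]

    def __iter__(self) -> Iterator[str]:
        return iter(self._values)

    def __len__(self) -> int:
        return len(self._values)


def connection_string(settings: Mapping[str, str]) -> str:
    # Only relies on the Mapping interface: [] lookup and .get().
    return f"{settings['host']}:{settings.get('port', '5432')}"


print(connection_string({"host": "db.local", "port": "6543"}))  # db.local:6543
print(connection_string(EnvConfig({"HOST": "db.local"})))       # db.local:5432
```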

turnsout

8 hours ago

Duck typing is so fragile… Once you have implementations that are depending on your naming or property structure, you can’t update the model without breaking them all.

If you use a real type, you never have to worry about this.

pistoleer

5 hours ago

You would still have to update everything if you rename a field in a struct, what do you mean you never have to worry?

dwattttt

4 hours ago

If you use type checking, the breakage occurs when you introduce the change: the author of the change is the one who can figure out what it means if 'foo' is no longer being passed into this function.

If you're duck typing, you find this out in the best case when your unit tests exercise it, and in the worst case by a support call when that 1/1000 error handling path finally gets exercised in production.

pistoleer

2 hours ago

I agree with that, in the context of dynamically typed languages.

Slowly but surely, new languages are starting to develop with static duck typing. Implicit interfaces if you will.
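Python's type hints have a version of this as well: `typing.Protocol` gives structural ("implicit interface") checks to static type checkers. A small sketch with made-up `User` and `Repo` classes:

```python
from dataclasses import dataclass
from typing import Protocol


class HasName(Protocol):
    name: str


@dataclass
class User:
    name: str
    email: str


@dataclass
class Repo:
    name: str
    stars: int


def greet(thing: HasName) -> str:
    # Any object with a `name: str` attribute satisfies HasName;
    # neither User nor Repo inherits from it.
    return f"hello, {thing.name}"


print(greet(User(name="ada", email="ada@example.com")))  # hello, ada
print(greet(Repo(name="cpython", stars=60000)))          # hello, cpython
```

Renaming `name` on either class would then be reported by the type checker at the `greet` call sites rather than discovered in production.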

zmgsabst

3 hours ago

And now inserting every middleware is an exercise in retyping the system, rather than piggybacking on the parameter dict.

pmarreck

9 hours ago

Less important in Elixir (where they are "maps") due to the immutable nature of them as well as the Struct type which is a structured map.

nesarkvechnep

5 hours ago

Yes, usually my APIs in Elixir receive their arguments as a well-typed map, not stringly keyed, and transform them to structs which the core business logic expects.

mikhmha

7 hours ago

Yup! I find Elixir makes it really intuitive to know when to represent a collection as a map and when to use a list of tuples. And it's easy to transform between the two when needed.

scotty79

an hour ago

> Ignore fields coming from the API if you don’t need them. Keep only those that you use.

This is great if you know what you need from the start. If you only find out what you need after passing your data through multiple layers and modules of your system then you need to backtrack through all your code to the place of creation.

If you have immutable data structures then you have to backtrack through multiple places where your data is used from previous structures to create new ones to pass your additional data through all that.

So if your data travels through let's say 3 immutable types to reach the place you are working on then even if you know exactly where the new field that you need originates, you need to alter 3 types and 3 places where data is read from one type and crammed into another.

If you have a dict that you fill with all you got from the api there's zero work involved with getting the new piece of information that you thought you didn't need but you actually do. It's just there.

leoh

7 hours ago

Big structs as params in rust have similar issues

saintfire

6 hours ago

In what way? They're not opaque or mutable (by default).

They can be unwieldy but they do define a pretty strongly typed API.

klyrs

9 hours ago

Lists and sets suffer the same drawbacks. If the advice is to not use any of the batteries included of the language, why are we using Python?

If you want an immutable mapping, why not use an enum?

o11c

8 hours ago

This isn't arguing against them in general, but against the unfortunate Javascript-esque abandonment of specified semantics.

In particular, whenever anyone thinks that "deep clone vs shallow clone" is a meaningful distinction, that means their types are utterly void of meaning.

gotoeleven

6 hours ago

Personally I find it is often helpful to keep Dicts in a BigBag ie:

BigBag<Dict>

Barrin92

7 hours ago

It's a bit of an odd article because the second part kind of shows why dicts aren't a problem. You basically just need to apply the most old school of OO doctrines: "recipients of messages are responsible for how they interpret them", and that's exactly what the author advocates when he talks about treating dict data akin to data over the wire, which is correct.

If you're programming correctly and take encapsulation seriously, then whatever shape incoming data in a dict has isn't something you should take an issue with, you just need to make sure if what you care about is in it (or not) and handle that within your own context appropriately.

Rich Hickey once gave a talk about something like this talking about maps in Clojure and I think he made the analogy of the DHL truck stopping at your door. You don't care what every package in the truck is, you just care if your package is in there. If some other data changes, which data always does, that's not your concern, you should be decoupled from it. It's just equivalent to how we program networked applications. There are no global semantics or guarantees on the state of data, there can't be because the world isn't in sync or static, there is no global state. There's actually another Hickey-ism along the lines of "program on the inside the same way you program on the outside". Dicts are cool, just make sure that you're always responsible for what you do with one.

alfons_foobar

4 hours ago

I assume you're basically referring to this quote from the article?

"Ignore fields coming from the API if you don’t need them. Keep only those that you use."

IMO this addresses only one part of the problem, namely "sanitize your inputs". But if you follow this, and therefore end up with a dict whose keys are known and always the same, using something "struct-like" (dataclasses, attrs, pydantic, ...) is just SO much more ergonomic :)