qixxiq
20 hours ago
> Current implementation has the following limitations:
> Maximum object size: 65534 keys
> The order of object keys is not preserved
> ...
> These limitations may be lifted by using more bytes to store offset pointers and counts on binary level. Though it's hard to imagine a real application which would need that.
I've worked on _many_ applications which have needed those features. Object key ordering is a per-implementation detail, but failing at 65k keys seems like a problem people would likely hit if this were used at larger scales.
halayli
20 hours ago
I don't know what kind of data you are dealing with, but it's illogical and against all best practices to have this many keys in a single object. It's equivalent to saying that tables with 65k columns are very common.
On the other hand, most database decisions are about finding the sweet-spot compromise tailored toward the common use case they are aiming for, but your comment sounds like you are expecting a magic trick.
jerf
19 hours ago
Every pathological case you can imagine is something someone somewhere has done.
Sticking data into the keys is definitely a thing I've seen.
One I've done personally is dump large portions of a Redis DB into a JSON object. I could guarantee for my use case it would fit into the relevant memory and resource constraints but I would also have been able to guarantee it would exceed 64K keys by over an order of magnitude. "Best practices" didn't matter to me because this wasn't an API call result or something.
There are other things like this you'll find in the wild. Certainly some sort of "keyed by user" dump value is not unheard of, and you can easily have more than 64K users; there's nothing a priori wrong with that. It may be a bad solution for some specific reason, and I think it often is, but it is not automatically wrong.
I've written streaming support for both directions, so while JSON may not be optimal it is not necessarily a guarantee of badness. Plus, with the computers we have nowadays, sometimes "just deserialize the 1GB of JSON into RAM" is a perfectly valid solution for some case. You don't want to do that a thousand times per second, but not every problem is a "thousand times per second" problem.
Groxx
19 hours ago
Redis is a good point: I've made MANY >64k-key maps there in the past, some up to half a million (and likely more if we hadn't rearchitected before we got bigger).
pests
13 hours ago
re: storing data in keys
FoundationDB makes extensive use of this pattern, sometimes with no data beyond the key at all.
kevincox
19 hours ago
You seem to be assuming that a JSON object is a "struct" with a fixed set of application-defined keys. Very often it can also be used as a "map". So the number of keys is essentially unbounded and just depends on the size of the data.
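To make the struct/map distinction concrete, here is a minimal sketch (the type and field names are made up for illustration):

```typescript
// "Struct" usage: the set of keys is fixed by the application schema.
interface UserRecord {
  id: string;
  name: string;
  age: number;
}

// "Map" usage: the keys come from the data itself, so their count grows
// with the dataset; 65k+ keys is just a matter of data volume.
type UsersById = Record<string, UserRecord>;

const usersById: UsersById = {
  "user-0001": { id: "user-0001", name: "Alice", age: 34 },
  "user-0002": { id: "user-0002", name: "Bob", age: 27 },
  // ...one key per user in the dataset
};
```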
mpweiher
16 hours ago
Erm, yes, structs seem to be the use case this is consciously and very deliberately aiming at:
SICK: Streams of Independent Constant Keys
And "maps" seems to be a use case it is deliberately not aiming at.
zamadatix
13 hours ago
Isn't the example in the README just a smaller map-style object instead of a larger one?
zarzavat
19 hours ago
Let's say you have a localization map: the keys are localization keys and the values are the localized strings. 65k is a lot, but it's not out of the question.
You could store this as two columnar arrays, but that is annoying and hardly anyone does that.
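As a sketch of the two layouts (the strings are invented for illustration):

```typescript
// Map form: one object key per localization key, so 65k keys is plausible.
const messages: Record<string, string> = {
  "home.title": "Welcome",
  "home.subtitle": "Get started",
  // ...
};

// Columnar form: two parallel arrays sidestep the huge object,
// but lookups and updates become awkward.
const keys: string[] = ["home.title", "home.subtitle"];
const values: string[] = ["Welcome", "Get started"];
```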
duped
19 hours ago
A pattern I've seen is to take something like `{ "users": [{ "id": string, ... }]}` and flatten it into `{ "user_id": { ... } }` so you can deserialize directly into a hashmap. In that case I can see 65k+ keys easily, although for an individual query you would usually limit it.
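A rough sketch of that flattening (the `User` shape here is hypothetical):

```typescript
interface User {
  id: string;
  name: string;
}

// Turn an array of users into an object keyed by id, so it deserializes
// straight into a hashmap; a full dump can easily exceed 65k keys.
function keyById(users: User[]): Record<string, Omit<User, "id">> {
  const byId: Record<string, Omit<User, "id">> = {};
  for (const { id, ...rest } of users) {
    byId[id] = rest;
  }
  return byId;
}
```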
mpweiher
16 hours ago
Hmm... would all the user IDs be known beforehand in this use case?
duped
15 hours ago
I wouldn't get worked up about the actual names of things I used here, but there's no difference between having the key contained in the user data versus lifted up to the containing object... every language supports iterating objects by (key, value).
You would do a query like "give me all users with age over 18" or something and return a `{ [id: string]: User }`
mpweiher
11 hours ago
There is a huge difference between a fixed, constant set of keys vs. the keys being an open-ended set that depends on user data.
duped
9 hours ago
Not really, this is a very minor difference that people exploit all the time to minimize the size of serialized data or make it more readable. This is a great example of bikeshedding.
mpweiher
2 hours ago
Yes there is if you optimize for the former case.
If that optimization isn't for you, choose a different library.
If that optimization works for your use-case, it can make a huge difference.
xienze
20 hours ago
> I don't know what kind of data you are dealing with, but it's illogical and against all best practices to have this many keys in a single object.
The whole point of this project is to efficiently parse "huge" JSON documents. If 65K keys is considered outrageously large, surely you can make do with a regular JSON parser.
pshirshov
19 hours ago
> If 65K keys is considered outrageously large
You can split it yourself. If you can't, replace Shorts with Ints in the implementation and it would just work, but I would be very happy to know your use case.
Just bumping the pointer size to cover relatively rare use cases is wasteful. It can be partially mitigated with more tags and tricks, but it would still be wasteful. A tiny chunking layer is easy to implement and I don't see any downsides to that.
cogman10
17 hours ago
How wasteful?
Presumably 4 bytes dedicated to the keys would be dwarfed by any strings thrown into the dataset.
Regardless, other than complexity, would there be any reason not to support a dynamic key size? You could dedicate the first 2 bits of the key to its length: 1 byte would work if there are only 64 keys, 2 bytes would give you 16k keys, and 3 bytes 4M. And if you wanted, you could use a frequency table to order the pointers such that more frequently used keys get smaller values in the dictionary.
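A rough sketch of that prefix scheme (this is not how SICK works, just an illustration of the idea): the top two bits of the first byte say how many bytes the index occupies.

```typescript
// Encode an index into 1..4 bytes; the top 2 bits of the first byte hold (width - 1).
// The 4-byte form assumes the index fits in 30 bits.
function encodeIndex(n: number): Uint8Array {
  if (n < 1 << 6) return Uint8Array.of(n);                                                  // 6-bit payload
  if (n < 1 << 14) return Uint8Array.of((1 << 6) | (n >> 8), n & 0xff);                     // 14-bit payload
  if (n < 1 << 22) return Uint8Array.of((2 << 6) | (n >> 16), (n >> 8) & 0xff, n & 0xff);   // 22-bit payload
  return Uint8Array.of((3 << 6) | (n >>> 24), (n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff); // 30-bit payload
}

// Decode: read the width from the tag bits, then shift in the remaining bytes.
function decodeIndex(bytes: Uint8Array, offset: number): { value: number; width: number } {
  const width = (bytes[offset] >> 6) + 1;
  let value = bytes[offset] & 0x3f;
  for (let i = 1; i < width; i++) value = (value << 8) | bytes[offset + i];
  return { value, width };
}
```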
pshirshov
16 hours ago
Most of the data the library was originally written for consists of small objects and arrays with high levels of duplication (think the state of the world in a video game, with tons of slightly varying objects). Pointer sizes really matter.
MangoToupe
16 hours ago
Data shape often outlives the original intentions.
paulddraper
18 hours ago
That's like saying it's illogical to have 65k elements in an array.
What is the difference?
pshirshov
18 hours ago
If the limitation affects your use case, you can chunk your structures.
The limitation comes with benefits.
paulddraper
13 hours ago
Fair enough. Implementation details matter.
I was just responding to the "X is an absurd way to do JSON" claim, which seemed to single out objects vs. arrays.
Maybe it applies in this case, but I don't see a reason to make that general statement.
jiggawatts
13 hours ago
I do not miss having to use “near” and “far” pointers in 16-bit mode C++ programming!
pshirshov
20 hours ago
In our use case, for which we created the library, we made this tradeoff to save several bytes per pointer and keep the binary form more compact. The application splits large objects into smaller chunks. 99% of the structures there are relatively small, but there are tons of them. Most likely you can do the same: just split large structures into smaller ones.
If you need support for larger structures, you may create your own implementation or extend ours (and I would really like to hear about your use case).
SICK as a concept is simple. SICK the library was created to cover some particular use cases and may not be suitable for everyone. We would welcome any contributions.
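For reference, the external chunking mentioned above can be as simple as this sketch (the function name and the limit value are illustrative, not part of SICK's API):

```typescript
// Split an oversized map into sub-objects of at most `limit` keys each,
// so every chunk stays within a 16-bit index.
function chunkObject<T>(obj: Record<string, T>, limit = 65534): Record<string, T>[] {
  const chunks: Record<string, T>[] = [];
  let current: Record<string, T> = {};
  let count = 0;
  for (const [key, value] of Object.entries(obj)) {
    if (count === limit) {
      chunks.push(current);
      current = {};
      count = 0;
    }
    current[key] = value;
    count++;
  }
  if (count > 0) chunks.push(current);
  return chunks;
}
```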
duped
19 hours ago
If you use varints for pointers you can have the best of both worlds, and achieve even better compaction for the smallest objects.
pshirshov
19 hours ago
Yep, it can be done, but varints aren't free because of their variable size. With constant-size pointers I can access const-sized elements directly; with varints I would have to do some rituals.
I have another binary encoding for different purposes (https://github.com/7mind/baboon) which relies on varints; in the case of SICK I decided to go with constant-size pointers to save some pennies on access efficiency.
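To illustrate the "rituals" (a generic sketch, not SICK's actual layout): with fixed-size entries the i-th entry sits at a computable offset, while with varints the offset depends on everything before it.

```typescript
// Fixed-size entries: random access is a single read at i * ENTRY_SIZE.
function readFixed(view: DataView, i: number): number {
  const ENTRY_SIZE = 2; // e.g. a 16-bit pointer
  return view.getUint16(i * ENTRY_SIZE, true);
}

// Varint (LEB128-style) entries: to reach entry i you must skip over
// entries 0..i-1 first, or keep a separate offset index.
function readVarint(bytes: Uint8Array, i: number): number {
  let offset = 0;
  for (let skipped = 0; skipped < i; skipped++) {
    while (bytes[offset] & 0x80) offset++; // continuation bytes
    offset++;                              // final byte of the entry
  }
  let value = 0;
  let shift = 0;
  while (bytes[offset] & 0x80) {
    value |= (bytes[offset++] & 0x7f) << shift;
    shift += 7;
  }
  return value | (bytes[offset] << shift);
}
```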
nine_k
19 hours ago
I'd say that it's generally unwise to use fixed-width integers in a data structure where the required width can vary widely and has no logical upper limit. Arbitrary-size integers are well known, used in practice, and not hard to implement.
pshirshov
19 hours ago
Even if it's "generally unwise", it was a well-thought-out decision in this particular case. See my other comments. An array of elements of constant size is indexed for free; an array of elements of varying size needs a separate index. SICK's binary representation (EBA) was created with several particular use cases in mind. I needed the most compact representation and the fastest access (for very underpowered devices); large objects were not a big concern, since they can be chunked externally.
nine_k
10 hours ago
For indexes, I completely agree!
If I were to add support for a larger number of keys, I would probably introduce two versions of the data structure, with 16-bit and 32-bit indexing. And, maybe, 8-bit indexing for tiny numbers of keys. But that would definitely complicate the design and should be done only when there's a real need.
Every such decision is a trade-off; I think yours is fine.
pshirshov
10 hours ago
I guess that can even be done without breaking the format. I may introduce an Obj32 table and maintain backward compatibility. But I'm not sure that's necessary; external chunking works just fine.
eknkc
18 hours ago
.NET has a polymorphic serializer where the output JSON contains a $type field for the deserializer to choose the concrete type.
It needs to be the very first key in the object. I've been bitten by this because PostgreSQL's jsonb also does not preserve key ordering.
I believe the latest .NET release addresses this, but key ordering does matter sometimes.
pshirshov
17 hours ago
When order is important, it can be maintained by an external layer with, e.g., an encoding into a list of pairs.
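A minimal sketch of that list-of-pairs encoding (the $type example follows the .NET case above; the values are made up):

```typescript
// Order is carried explicitly by the array, so it survives parsers and
// stores that reorder object keys.
type OrderedObject = Array<[key: string, value: unknown]>;

const ordered: OrderedObject = [
  ["$type", "Example.Dog"], // stays first regardless of key reordering
  ["name", "Rex"],
  ["age", 4],
];

// Collapse back into a plain object once ordering no longer matters.
const plain = Object.fromEntries(ordered);
```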
ericmcer
14 hours ago
Isn't the order of JSON keys not guaranteed by the official spec? I don't remember when I learned that but I have always behaved as if that cannot be relied upon.
btschaegg
14 hours ago
Since RFC 4627 (the original):
> An object is an unordered collection of zero or more name/value pairs, [...]
Further, since RFC 7159:
> JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible to calling software. Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences.
Both are in the current version (RFC 8259).
OTOH, I find the "but the order is not supposed to be guaranteed!" debate REALLY stupid when it comes to software where it's clear that at some point, a human will have to look at the content and correlate it with another system.
There's nothing more evil than re-shuffling JSON just for the fun of it and making everyone who has to look at the result miserable. Yes, I'm talking about you, ELK devs.
Edit: (And/or whoever wrote the underlying Java/Go libs they use for JSON that don't allow developers to patch ordering in. I remember reading GitHub issues about this.)
harrall
12 hours ago
Less than 1% of the hash maps I use have ever needed order.
The underlying data structures of the two are different. Ordered hash maps use more memory, are slower, and are more complicated.
Knowing CS fundamentals, using an ordered hash map should be a deliberate choice like renting a box truck when you need to move a lot of stuff. Don’t just drive a box truck everywhere because you might pick up a couch from a thrift store one day.
btschaegg
12 hours ago
All well and true if all you have to do is process the data programmatically.
And yet, as I said, if the same thinking gets applied to e.g. a store of JSON documents (like ELK), chances are good the thing will ruin the UX for countless people who have to deal with the result. Note that you need exactly no hash maps to store the JSON as it is text.
To expand your analogy: …and yet roads are built so that you can drive your regular car or a box car over them, depending on your use case. You make the choice. A JSON library that doesn't afford such choices (and isn't hyper focused on performance) isn't a good one in my book.
Edit: As a sidenote: Or do you mean a freight train wagon? Then replace "road" with "rails" and "car" with "draisine" :)
pshirshov
12 hours ago
No, according to the spec the order is not preserved, but most parsers preserve the order (or maintain some order based on side effects), and engineers rely on that (and sometimes it backfires).
Essentially, SICK also maintains some strange order based on the values of a homegrown trivial hash function, but the only right approach to JSON objects is to treat their keys as an unordered set.
pcthrowaway
14 hours ago
An object is unordered, but most (nearly all?) parsing software will preserve the order, and some software may rely on that behaviour, so it's worth mentioning.