Java 24 to Reduce Object Header Size and Save Memory

31 points, posted 4 days ago

30 Comments

cesarb

4 days ago

Part of this complexity comes from what IMO was a mistake in Java's design (which was AFAIK copied by C#): the base Object class does too much. It has equality comparison, string conversion, object hashing, and a per-object re-entrant lock. Other than equality comparison (which is also bad because it contributes to the perennial confusion between identity equality and value equality), these need extra storage for each and every object in the system (string conversion contains the object hash code as part of its default output). Some tricks are used to avoid most of the space overhead for the per-object lock, at the cost of extra complexity.

pulse7

9 hours ago

On the other hand: Smalltalk had many, many more methods on Object than Java has today...

ygra

9 hours ago

Isn't just the lock something that potentially needs space per object?

layer8

6 hours ago

Object::hashCode returns System.identityHashCode(Object) by default. Since GC can move objects around in memory, and the hash code of an object needs to be stable, this default hash code can’t be based on the memory address of the object, and thus needs to be stored per object.

Since System.identityHashCode() can be invoked (for example by IdentityHashMap) even for objects of classes that implement a custom hash code, it also can’t be optimized away even for such classes.

Conceivably it could be optimized away for an object unless or until System.identityHashCode() is invoked for it. It could thus be allocated on demand similarly to how the object locks are. Of course, this has all kinds of performance trade-offs.
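A minimal sketch of the two points above, assuming a hypothetical `Point` class with a value-based `hashCode()`: the identity hash stays the same across a GC request, and `IdentityHashMap` still relies on it even though the class defines its own hash.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

class IdentityHashDemo {
    // Hypothetical value class with a custom, value-based hashCode.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
        }
        @Override public int hashCode() { return 31 * x + y; }
    }

    // The identity hash must not change during the object's lifetime,
    // even if a garbage collection relocates the object in between.
    static boolean identityHashIsStable(Object o) {
        int before = System.identityHashCode(o);
        System.gc(); // only a hint; the JVM may or may not move objects here
        return before == System.identityHashCode(o);
    }

    // Puts two equal-but-distinct points into both map types and
    // returns { HashMap size, IdentityHashMap size }.
    static int[] mapSizes() {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);
        Map<Point, String> byValue = new HashMap<>();
        Map<Point, String> byIdentity = new IdentityHashMap<>();
        byValue.put(a, "a");
        byValue.put(b, "b");
        byIdentity.put(a, "a");
        byIdentity.put(b, "b");
        return new int[] { byValue.size(), byIdentity.size() };
    }

    public static void main(String[] args) {
        System.out.println(identityHashIsStable(new Point(1, 2))); // true
        int[] sizes = mapSizes();
        System.out.println(sizes[0] + " " + sizes[1]); // 1 2
    }
}
```

The `HashMap` collapses the two points into one key because it uses `equals`/`hashCode`, while the `IdentityHashMap` keeps both because it compares references and uses the identity hash.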

guipsp

9 hours ago

You can lock on any object - if you dynamically create these locks, then you need to coordinate creation among threads.
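To illustrate "you can lock on any object", here is a small sketch in which a plain `Object` that was never declared as a lock guards a shared counter. The class and method names are made up for the example.

```java
class AnyObjectLockDemo {
    // Increments a shared counter from several threads, guarded by the
    // intrinsic lock of an ordinary Object.
    static int countWithLock(int threads, int perThread) {
        final Object lock = new Object(); // any object can serve as a monitor
        final int[] counter = { 0 };
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    synchronized (lock) {
                        counter[0]++;
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join(); // join also makes the workers' writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithLock(4, 10_000)); // 40000
    }
}
```

Because the runtime cannot know in advance which objects will be used this way, every object has to be able to acquire lock state on demand, which is the coordination problem the comment refers to.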

masklinn

9 hours ago

They're saying that only the lock should need storage in the object header, everything else can be computed on the fly (obviously at a CPU cost rather than memory).

Equality is computed (and the default is trivial, it just compares addresses), hashcode is computed (and the default is trivial, it just returns the object's address), string conversion is computed (and the default is trivial, it just prints the class name and the hashcode IIRC).
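For reference, the default string conversion and equality described above can be checked with a short sketch. The format used in `expectedDefault` is the one documented for `Object.toString()`; the class name `DefaultToStringDemo` is just for illustration.

```java
class DefaultToStringDemo {
    // Documented default: class name, '@', and the hash code in hex.
    static String expectedDefault(Object o) {
        return o.getClass().getName() + "@" + Integer.toHexString(o.hashCode());
    }

    static boolean matchesDocumentedFormat() {
        Object o = new Object();
        return o.toString().equals(expectedDefault(o));
    }

    // Default equals is reference comparison: an object equals only itself.
    static boolean defaultEqualsIsIdentity() {
        Object a = new Object();
        Object b = new Object();
        return a.equals(a) && !a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(matchesDocumentedFormat());  // true
        System.out.println(defaultEqualsIsIdentity());  // true
    }
}
```

Note that this only shows the string is derived from the hash code; it says nothing about where that hash code comes from or whether it has to be stored.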

greiskul

9 hours ago

I'm not sure you can just return the object address for hashcode, because with GC the object address can move right? So if you are using it for hashcode, you do need to persist it somewhere after it's called the first time.

Nevermark

8 hours ago

Would an optional "Hashable" class interface be a good way to reduce the classes (and objects) that need to have a hash? Or is having a hash an unavoidable primitive feature?

Also, would an optional "Lockable" class interface be a good way to drastically reduce the classes (and objects) that need to maintain lock information?

I am religiously averse to unused but implemented "Positive-Cost Abstractions". I dream of a root class with no methods except "new", and instance methods delete() and isSelf(x).

Even "Polymorphism" could be an opt-in interface. Subclasses of non-polymorphic classes would inherit functionality (yay, reusability), but not be a subtype of their root class.

(With commutativity between subclassing & the polymorphic interface. I.e. If A is a non-polymorphic class, and Ap is a polymorphic version of class A, then all subclasses of Ap and all polymorphic subclasses of A, would be subtypes of Ap.)

_old_dude_

9 hours ago

> the default is trivial, it just returns the object's address

This was trivial a long time ago. Now, all Java GCs move objects in memory.

SpaghettiCthulu

9 hours ago

The default `toString` implementation isn't cached, is it?

layer8

6 hours ago

Not the OP, but my beef with toString is that for some classes it is an essential part of the interface contract that requires a stable and documented string mapping (e.g. for value types like BigInteger or URI, and for String itself), whereas for other classes it just serves as a way to provide a debugging/logging representation that may change from one version to the next, and whose exact representation should not be relied on. These are really two separate purposes with a different interface contract.

It would have been better for Object to have a toDebugString method, and to restrict implicit string conversion (concatenation) to classes implementing a StringConvertible interface with a corresponding separate toString method.
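A rough sketch of how that split might look, with every name here hypothetical: `DebugPrintable` stands in for a `toDebugString` on `Object`, and `StringConvertible` exposes `toValueString()` because current Java cannot declare a second `toString` separate from the one on `Object`.

```java
class DebugStringSketch {
    // Hypothetical: stands in for the debug method the comment wishes Object had.
    interface DebugPrintable {
        default String toDebugString() {
            return getClass().getName() + "@"
                    + Integer.toHexString(System.identityHashCode(this));
        }
    }

    // Hypothetical opt-in interface for a stable, documented value representation.
    interface StringConvertible {
        String toValueString();
    }

    // Has both a value representation and a richer debug representation.
    static final class Port implements DebugPrintable, StringConvertible {
        private final int number;
        Port(int number) { this.number = number; }
        @Override public String toValueString() { return Integer.toString(number); }
        @Override public String toDebugString() { return "Port[number=" + number + "]"; }
    }

    // Has no meaningful value representation, only the default debug one.
    static final class Session implements DebugPrintable { }

    // Only types with a defined value representation can be concatenated here,
    // so passing a Session would be a compile-time error.
    static String concat(String prefix, StringConvertible value) {
        return prefix + value.toValueString();
    }

    public static void main(String[] args) {
        System.out.println(concat("port=", new Port(8080))); // port=8080
        System.out.println(new Port(8080).toDebugString());  // Port[number=8080]
        System.out.println(new Session().toDebugString());
    }
}
```

The point of the design is that an accidental `"@xxxxxxxx"` in a user-facing message becomes a type error rather than a runtime surprise.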

mcdeltat

6 hours ago

Astounding how so many languages and programmers don't make the clear distinction between "debug string", "canonical string", "human readable string", etc. There is no such thing as a totally generic "to string" function for any nontrivial program.

The approach I'm most a fan of is functional languages where everything has a fixed canonical string representation (even cooler when you can convert the string directly back to code), and everything else you must explicitly create a function for.

josefx

4 hours ago

> These are really two separate purposes with a different interface contract.

This is a basic feature of inheritance in an object oriented language, you can take an interface that guarantees "this returns some string" and offer a more concrete guarantee "this returns the objects value as decimal" in the implementation.

> and to restrict implicit string conversion (concatenation) to classes implementing a StringConvertible interface with a corresponding separate toString method.

So anyone wanting to make their code trivially loggable now has to implement StringConvertible by copy pasting String toString(){ return toDebugString(); } into every single class they are implementing? You managed to make Java more verbose for no gain at all, please collect your AbstractAwardInstanceFactoryBuilder on your way out.

layer8

3 hours ago

> So anyone wanting to make their code trivially loggable now has to implement StringConvertible by copy pasting String toString(){ return toDebugString(); } into every single class they are implementing?

If you actually want to output a debug representation, you’ll explicitly call toDebugString(). (And a debugger would call it by default.) This would also make the purpose explicit in the code. And you wouldn’t accidentally output a random debug representation (like the default "@xxxxxxxx") as part of regular string concatenation/interpolation, like on a user-level message, or as a JSON value or whatever. This is why it would be wrong to have a toString() forward to toDebugString().

Currently, for most classes I have to add javadoc for the toString() method saying something like: “Returns a debug representation of this object. WARNING: The exact representation may change in future versions, do not rely on it.” For some of these classes a reliable non-debug string representation would conceivably make sense, but I chose not to have one because there is no immediate need. However, callers need to know which it is, and therefore the documentation is needed.

Conversely, whenever I want to use the toString() of a third-party class, I have to check what kind of output it generates, but unfortunately it’s often not documented. And if testing it (or looking at the source) seems to produce a stable value representation, one has no choice but to hope that it will remain stable, instead of that being part of the contract.

Furthermore, for classes with a value representation it often makes sense to have a different debug representation (for example, with safely escaped control characters or additional meta-data). In current Java, it’s safer to have those in a different, non-standard method than toString() (because users expect the latter to provide the value representation), but then there’s the inconvenience that the debug representation won’t be picked up by debuggers by default, due to the non-standard method.

This is all a symptom of the same method being used for different purposes. And a debug representation always makes sense (as evidenced by the default implementation), while a value representation only sometimes makes sense, and might be absent even when it would make sense. But you generally can’t tell from the method.

Having different methods would solve those issues. With a toDebugString() method, one wouldn’t have to document anything, because the javadoc I paraphrased above would already be contained in the Object class. And the toString() method would only be present for classes that do provide a defined value representation that makes sense on the business/domain level of the class.

invalidname

9 hours ago

It is not but strings are cached (interned) in Java which is a different thing.

layer8

5 hours ago

String literals are interned, strings in general aren’t.
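The distinction can be seen with reference comparison. This is a small sketch with an arbitrary literal; `==` is used deliberately here to compare identity, not content.

```java
class InternDemo {
    // Returns { literal == sameLiteral, literal == built, literal == built.intern() }.
    static boolean[] compare() {
        String literal = "header";
        String sameLiteral = "header";
        String built = new String("header"); // always allocates a distinct object
        return new boolean[] {
            literal == sameLiteral,    // both refer to the same interned literal
            literal == built,          // distinct object, same content
            literal == built.intern()  // intern() returns the canonical instance
        };
    }

    public static void main(String[] args) {
        boolean[] r = compare();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // true false true
    }
}
```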

specialist

8 hours ago

Ya. Hindsight's 20/20.

I've half expected the Java/JVM team to change Object to extend a new "NakedObject", and implement new interfaces Equalable, Hashable, Finalizable, and Waitable. (Current interface Cloneable is a goof, so maybe deprecate it and replace with Copyable.)

Then "NakedObject" would only need getClass method, right?

Then values and records could also extend NakedObject, right?

Then equals and clone/copy could be generic, right?

Alternately, maybe prevent the gotchas with missing equals, hashCode, and toString by having the runtime autogenerate something reasonable.

layer8

6 hours ago

This would break the invariant that x instanceof Object is true for all non-primitive values x. This assumption is baked into too much code and too many APIs.

For example, you couldn’t add a NakedObject-but-not-Object to a java.util.List, because what Object would List::get(index) then return for it? (Note that the List’s type parameter doesn’t exist at runtime and may also not exist at compile time.)
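A sketch of the assumption being described, using a raw `List` to mimic code where the type parameter is absent: whatever goes in comes back out typed as `Object`, and existing code takes for granted that this always holds.

```java
import java.util.ArrayList;
import java.util.List;

class EverythingIsAnObject {
    @SuppressWarnings({ "rawtypes", "unchecked" })
    static boolean allElementsAreObjects() {
        List raw = new ArrayList(); // raw type: no type parameter at compile time
        raw.add("text");
        raw.add(42);                      // autoboxed to Integer
        raw.add(new int[0]);              // arrays are objects too
        raw.add(new ArrayList<String>());
        for (int i = 0; i < raw.size(); i++) {
            Object element = raw.get(i);  // raw List::get returns Object
            if (!(element instanceof Object)) {
                return false;             // unreachable for non-null references today
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allElementsAreObjects()); // true
    }
}
```

A reference type that was not an `Object` would have no valid return type to flow through `get` in code like this.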

masklinn

8 hours ago

> maybe prevent the gotchas with missing equals, hashCode, and toString

There's no actual gotcha to them not existing. It works perfectly fine in haskell or rust for instance.

Although it's not a fundamentally useful change to make objects for which a sensible equals/hashcode is trivial not have it, and still have it for objects for which it's not. So without the ability to reach back and remove those properties being universal I fail to see what the point would be.

layer8

6 hours ago

> This means that [with Compact Object Headers] the number of different class types we can load into a JVM process is [reduced to] around ~4 million [from previously 4 billion].

Comparing the class count of some of today’s Java projects (including dependencies) to two decades ago, I wonder if we won’t risk hitting that limit in another two decades or so, and then revert back to the bigger header size again. ;).
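The two figures line up with a simple power-of-two calculation. The bit widths below (22 and 32) are an assumption chosen because they match the quoted "~4 million" and "4 billion"; the quote itself does not state them.

```java
class ClassIdSpace {
    // Number of distinct values representable in the given number of bits.
    static long distinctValues(int bits) {
        return 1L << bits;
    }

    public static void main(String[] args) {
        // Assumed widths, not taken from the thread.
        System.out.println(distinctValues(22)); // 4194304    (~4 million)
        System.out.println(distinctValues(32)); // 4294967296 (~4 billion)
    }
}
```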

cesarb

5 hours ago

> Comparing the class count of some of today’s Java projects (including dependencies) [...]

You are forgetting one important use case: defining classes dynamically. You have to count not only every class (including inner classes) of every dependency, but also all classes created at runtime through direct bytecode manipulation.

layer8

5 hours ago

You are right, and I actually thought of that, but I suspect it’s usually still dominated by, or at most roughly on par with, static class count. Dynamic proxy classes, for example, are cached. And one probably shouldn’t dynamically create classes based on uncontrolled external input.
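The caching of dynamic proxy classes can be observed directly. In this sketch (names are illustrative), two proxies created for the same class loader and interface list are distinct instances that share one generated class, so repeated proxy creation does not add to the class count.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class ProxyClassCacheDemo {
    static boolean proxiesShareOneClass() {
        ClassLoader loader = ProxyClassCacheDemo.class.getClassLoader();
        Class<?>[] interfaces = { Runnable.class };
        // Two different handlers; the generated proxy class does not depend on them.
        InvocationHandler first = (proxy, method, args) -> null;
        InvocationHandler second = (proxy, method, args) -> null;
        Object a = Proxy.newProxyInstance(loader, interfaces, first);
        Object b = Proxy.newProxyInstance(loader, interfaces, second);
        return a != b && a.getClass() == b.getClass();
    }

    public static void main(String[] args) {
        System.out.println(proxiesShareOneClass()); // true
    }
}
```

Libraries that generate bytecode themselves are not covered by this cache, so whether they reuse classes depends on the library.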

The other thought I had is that if AI-generated code takes off, this could explode class count. On the other hand, AI could then also be instructed to refactor to minimal class count.

sctb

9 hours ago

For more information, there's a recent talk on Project Lilliput by Roman Kennke: https://www.youtube.com/watch?v=kHJ1moNLwao.

nwellnhof

9 hours ago

> This is lightweight, by way of comparison: until quite recently, Python's header tax was 308 bytes

Really? I thought that PyObject_HEAD only contains two machine words.

masklinn

9 hours ago

You are correct, but PyObject_HEAD is not the full header for the average object.

- the base requirement is a class pointer and a refcount, that's PyObject_HEAD (and PyObject)

- then unless you have disabled this at compilation time, they have two pointers for the cycle-breaking part of the GC

- a dict pointer (or as many instance value pointers as there are object members when using slots)

- and a weakrefs pointer (except for slotted classes, unless you added it back)

That is however only 6 words (48 bytes), or 4 (32) for slotted classes with no members or weakrefs.

I believe the header is larger when running without GIL on 3.13 because PyObject has 2 more words (a local refcount, a gc bitset, and a mutex packed in a word with some padding, and a tid pointer).

Still nowhere near 308 though, I've no idea where they got that. Maybe whoever wrote that article included the instance dict in their calculations? That would add 200~300 bytes. Or maybe they got mixed up between bits and bytes, calculated 308 bits somehow then wrote that up as bytes.

exabrial

4 days ago

I always kind of found it interesting that specifying the size of object headers was part of the JVM specification. This sort of seems like an arbitrary implementation detail, since the programmer will never have to know anything about it.

papercrane

4 days ago

The JVM specification does not specify anything about how objects should be represented. The only time internal representation of objects is discussed in the JVM spec is a section that says "The Java Virtual Machine does not mandate any particular internal structure for objects."

In this case the JEP is scoped to just the hotspot runtime. Other implementations are free to represent objects however they want.

https://docs.oracle.com/javase/specs/jvms/se23/html/jvms-2.h...

exabrial

3 days ago

Ah! that makes sense, thanks!