As far as I understand, the "chunking" of input bytes is learned completely end to end, so it's basically up to the model to figure out how to most efficiently delineate and aggregate information from the input according to the patterns it sees during training.
Since it's end to end, they can apply this process not only to raw byte encodings but to representations at basically any level, for example by stacking two stages of aggregation one after another.
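To make that concrete, here's a toy sketch of the general idea (not the paper's actual routing mechanism; the cosine scorer, mean pooling, and threshold below are all stand-ins for things the model would learn end to end):

```python
# Toy sketch of learned chunking: a scorer flags likely boundaries where adjacent
# representations differ a lot, and the bytes between boundaries are pooled into
# one vector per chunk for the next (coarser) stage.
import numpy as np

rng = np.random.default_rng(0)

def embed_bytes(data: bytes, dim: int = 16) -> np.ndarray:
    # stand-in byte embedding table (learned in the real model)
    table = rng.normal(size=(256, dim))
    return table[np.frombuffer(data, dtype=np.uint8)]

def boundary_probs(h: np.ndarray) -> np.ndarray:
    # score each position by how different it is from its predecessor;
    # a learned projection would normally sit in front of this comparison
    prev = np.vstack([h[:1], h[:-1]])
    cos = np.sum(h * prev, axis=1) / (
        np.linalg.norm(h, axis=1) * np.linalg.norm(prev, axis=1) + 1e-8
    )
    return 0.5 * (1.0 - cos)  # very dissimilar neighbour -> likely boundary

def chunk(h: np.ndarray, threshold: float = 0.5) -> list[np.ndarray]:
    p = boundary_probs(h)
    starts = [0] + [i for i in range(1, len(h)) if p[i] > threshold]
    ends = starts[1:] + [len(h)]
    # one pooled vector per chunk, to be fed to the next stage
    return [h[s:e].mean(axis=0) for s, e in zip(starts, ends)]

h = embed_bytes(b"the quick brown fox")
print(f"{len(h)} bytes -> {len(chunk(h))} chunks")
```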
So in principle they could either let the model do its thing on the raw bytes of an image, or cut it up into tiny patches ViT-style and feed those to their H-Net.
I wonder how hard it would be to adapt the chunking to work in 2D, and what that would even look like.
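The ViT-style route at least has an obvious starting point: cut the image into fixed patches and flatten them into a sequence the model can then chunk further. A minimal sketch (the patch size is arbitrary; the genuinely 2D-aware chunking is the part that still needs figuring out):

```python
# Cut an (H, W, C) image into p x p patches and flatten each one,
# producing a 1D sequence of patch vectors an H-Net-like model could eat.
import numpy as np

def patchify(img: np.ndarray, p: int = 4) -> np.ndarray:
    # assumes H and W are divisible by p
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)  # (num_patches, patch_dim)

img = np.random.rand(32, 32, 3)
seq = patchify(img)
print(seq.shape)  # (64, 48) -- a sequence the chunking stage could aggregate further
```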
Some other notes on how multimodal inputs could be handled with this architecture are mentioned (though only briefly) in Albert Gu's blog (he's one of the authors); there's still much to figure out, it would seem: https://goombalab.github.io/blog/2025/hnet-future/#alternati...
Thanks for sharing, this blog post is a great speculative deep-dive.
You can make image networks (unet-like things) by chunking rectangles in 2D (with some convolution steps)... I wonder if there is an image-specific architecture a bit like this that could work well?
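For comparison, the fixed-grid version of "chunking rectangles" is just a strided pooling/conv step, as in a unet encoder. Here's a minimal stand-in (a plain 2x2 mean-pool, no learned weights); the interesting question is what a learned, content-dependent 2D boundary would replace this with:

```python
# Fixed-grid 2D "chunking": every 2x2 rectangle of a feature map is averaged
# into one cell, halving each spatial dimension (a unet-style downsampling step).
import numpy as np

def pool2x2(x: np.ndarray) -> np.ndarray:
    # x: (H, W, C) with even H and W
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

x = np.random.rand(32, 32, 8)
print(pool2x2(x).shape)  # (16, 16, 8)
```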
It mentions native multimodality somewhere, in either the arXiv paper or the blog post -- seems like it might handle it well?