cooljoseph
9 days ago
This sounds somewhat like a normalizing flow from a discrete space to a continuous space. I think there's a way you can rewrite your DDN layer as a normalizing flow which avoids the whole split and prune method.
1. Replace the DDN layer with a flow between images and a latent variable. During training, compute in the direction image -> latent. During inference, compute in the direction latent -> image.
2. For your discrete options 1, ..., k, have trainable latent variables z_1, ..., z_k. This is a "code book".
Training looks like the following: Start with an image and run a flow from the image to the latent space (with conditioning, etc.). Find the closest option z_i, and compute the L2 loss between z_i and your flowed latent variable. Additionally, add a loss corresponding to the log determinant of the Jacobian of the flow. This second loss is the way a normalizing flow avoids mode collapse. Finally, I think you should divide the resulting gradient by the softmax of the negative L2 losses for all the latent variables. This gradient division is done for the same reason as dividing the gradient when training a mixture-of-experts model.
During inference, choose any latent variable z_i and flow from that to a generated image.
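To make this concrete, here's a very rough PyTorch sketch of the training step and sampling I have in mind. The `flow` object (which I assume returns a latent plus log|det J| and has an `.inverse()`), its conditioning argument, and all names and shapes are placeholders, not your actual DDN code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical pieces: `flow` is an invertible net, `codebook` holds the trainable z_1..z_k.
    k, latent_dim = 512, 128
    codebook = nn.Parameter(torch.randn(k, latent_dim))  # trainable z_1, ..., z_k

    def training_step(flow, image, cond=None):
        z, log_det = flow(image, cond)                    # image -> latent direction
        d2 = ((z.unsqueeze(1) - codebook) ** 2).sum(-1)   # squared L2 to every z_i, shape (B, k)
        p = F.softmax(-d2, dim=-1)                        # softmax of the negative L2 losses
        i = d2.argmin(dim=-1)                             # index of the closest code
        l2 = d2.gather(1, i.unsqueeze(1)).squeeze(1)      # L2 loss to the chosen z_i
        nll = l2 - log_det                                # add the log|det J| term, flow-style
        scale = 1.0 / p.gather(1, i.unsqueeze(1)).squeeze(1).detach()  # "divide the gradient"
        return (scale * nll).mean()

    @torch.no_grad()
    def sample(flow, i, cond=None):
        return flow.inverse(codebook[i].unsqueeze(0), cond)  # latent -> image direction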
diyer22
8 days ago
Thanks for the idea, but DDN and flow can’t be flipped into each other that easily.
1. DDN doesn’t need to be invertible.
2. Its latent is discrete, not continuous.
3. As far as I know, flow keeps input and output the same size so it can compute log|detJ|. DDN’s latent is 1-D and discrete, so that condition fails.
4. To me, “hierarchical many-shot generation + split-and-prune” is simpler and more general than “invertible design + log|detJ|.”
5. Your design seems to have abandoned the characteristics of DDN. (ZSCG, 1D tree latent, lossy compression)
The two designs start from different premises and are built differently. Your proposal would change so much that whatever came out wouldn’t be DDN any more.
godelski
4 days ago
Fwiw, I'm not convinced it's a Flow, and that's my niche. But there are some interesting similarities that actually make me uncertain. A deeper dive is needed.
But to address your points:
> 1. DDN doesn’t need to be invertible
The flow doesn't need to be invertible at every point in the network. As long as you can do the mapping, the condition will hold. Like the classic coupling layer is [x_0, s(x_0)*x_1 + t(x_0)], where s and t are parametrized by arbitrary neural networks. But some of your layers look more like invertible convolutions. I think it is worth checking. FWIW I don't think an equivalence would undermine the novelty here.
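To be concrete, here's a minimal sketch of the coupling layer I mean (RealNVP-style affine coupling; the tiny one-layer s and t are just placeholders for arbitrary networks):

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        # Invertible as a whole even though s and t themselves are arbitrary, non-invertible nets.
        def __init__(self, dim):
            super().__init__()
            half = dim // 2
            self.s = nn.Sequential(nn.Linear(half, half), nn.Tanh())  # placeholder scale net
            self.t = nn.Linear(half, half)                            # placeholder shift net

        def forward(self, x):
            x0, x1 = x.chunk(2, dim=-1)
            s, t = self.s(x0), self.t(x0)
            y1 = x1 * torch.exp(s) + t          # transform x1, conditioned on the untouched x0
            log_det = s.sum(dim=-1)             # log|det J| of this step is just sum(s)
            return torch.cat([x0, y1], dim=-1), log_det

        def inverse(self, y):
            y0, y1 = y.chunk(2, dim=-1)
            x1 = (y1 - self.t(y0)) * torch.exp(-self.s(y0))
            return torch.cat([y0, x1], dim=-1)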
> 2. Its latent is discrete, not continuous.
That's perfectly fine. Flows aren't restricted that way. Technically all flows aren't exactly invertible, since you noise the data to dequantize it. Also note that there are discrete flows. I'm not sure I've seen an implementation where each flow step is discrete, but that's more an implementation issue.
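By dequantization I mean the usual trick of adding uniform noise to the integer pixel values so the discrete data has a proper density (a minimal sketch, assuming 8-bit images):

    import torch

    def dequantize(x_int, num_levels=256):
        # x_int: integer pixel values in [0, num_levels - 1]
        u = torch.rand_like(x_int.float())       # uniform noise in [0, 1)
        return (x_int.float() + u) / num_levels  # continuous values in [0, 1)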
> 3. As far as I know, flow keeps input and output the same size so it can compute log|detJ|.
You have a U-Net, right? Your full network is doing T: R^n -> R^n? Or at least excluding the extra embedding information? Either way, I think you might be interested in "Approximation Capabilities of Neural ODEs and Invertible Residual Networks". At minimum, their dimensionality discussion and reference to the Whitney Embedding Theorem is likely valuable to you (I don't think they say it by name?). You may also want to look at RealNVP, since they have a hierarchical architecture which does splitting.
Do note that NODEs are flows. You can see Ricky Chen's work on i-ResNets.
As for the Jacobian, I actually wouldn't call that a condition for a flow but it sure is convenient. The typical Flows people are familiar with use a change of variables formula via the Jacobian but the isomorphism is really the part that's important. If it were up to me I'd change the name but it's not lol.
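For reference, the change-of-variables objective I'm talking about is (with f the flow from data x to latent z, and p_Z the base density):

    log p_X(x) = log p_Z(f(x)) + log|det df(x)/dx|

but like I said, the bijection is the part that really matters, not this particular way of scoring it.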
> 5. Your design seems to have abandoned the characteristics of DDN. (ZSCG, 1D tree latent, lossy compression)
I think you're on the money here. I've definitely never seen something like your network before. Even if it turns out to not be its own class, I don't think that's an issue. It's not obviously something else, but I think it's worth digging into. FWIW I think it looks more like a diffusion model. A SNODE. Because I think you're right that the invertibility conditions likely don't hold. But in either case, remember that even though you're estimating multiple distributions, that's equivalent to estimating a single distribution.
I think the most interesting thing you could do is plot the trajectories like you'll find in Flow and diffusion papers. If you get crossings, you can quickly rule out flows, since ODE flow trajectories can't cross.
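Something like the sketch below; `intermediate_outputs` is a made-up hook that would return the selected sample at each DDN level (coarse to fine), not your real API:

    import torch
    import matplotlib.pyplot as plt

    @torch.no_grad()
    def plot_trajectories(model, latents, coord=(0, 0, 0)):
        for z in latents:
            states = model.intermediate_outputs(z)   # hypothetical: one tensor per level
            ys = [s[coord].item() for s in states]   # track one coordinate through depth
            plt.plot(range(len(ys)), ys, alpha=0.5)
        plt.xlabel("level"); plt.ylabel("value at tracked coordinate")
        plt.show()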
I'm definitely going to spend more time with this work. It's really interesting. Good job!