How does a mixture-of-experts architecture work? Are the experts debating, or merely delegating?
From what I've read, for each token or input patch, the gate computes a set of probabilities (or scores) over the experts, then selects a small subset (often the top-k) and routes that input only to those.
I.e., each expert computes its own transformation on the same original input (or a shared intermediate representation), and their outputs are combined at the next layer via the gate's weights.
That's post hoc combination, not expert B reasoning over expert A's reasoning.
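Here's a minimal numpy sketch of one token passing through a top-k gate, assuming toy linear experts; the names (`moe_forward`, `expert_weights`, `gate_weights`) are made up for illustration and don't come from any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Hypothetical toy weights: each "expert" is just an independent linear map.
expert_weights = rng.standard_normal((n_experts, d_model, d_model))
gate_weights = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route one token vector x to the top-k experts and mix their outputs."""
    logits = x @ gate_weights                     # gate scores over experts
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over experts
    chosen = np.argsort(probs)[-top_k:]           # indices of the top-k experts
    mix = probs[chosen] / probs[chosen].sum()     # renormalize over the subset
    # Every chosen expert transforms the SAME input x; the gate just
    # computes a weighted sum of their independent outputs.
    out = sum(w * (x @ expert_weights[e]) for w, e in zip(mix, chosen))
    return out, chosen, mix

x = rng.standard_normal(d_model)
y, chosen, mix = moe_forward(x)
print("routed to experts", chosen, "with weights", np.round(mix, 3))
```

Note that both chosen experts receive the identical input and neither ever sees the other's output, which is why it's delegation plus averaging rather than debate.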
An MoE model is one model whose expert sub-networks each see fewer tokens, which makes it easier for an expert to diverge from the others toward a better specialized optimum. It's easier to only need to know medicine, and to keep medicine cleanly separated from everything else, even when certain names, concepts, etc. overlap.
AI agents discussing things with each other would be more like one reasoning model thinking through the problem with different personas.
With different underlying models, you can leverage the best model for each persona. Like people said before (six months ago; no clue if this is still valid), they preferred GPT for planning and Claude for executing / coding.
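For concreteness, a hypothetical two-persona pipeline might look like the sketch below; `call_model` is a stand-in I'm inventing here, not a real SDK function, and the model names are placeholders:

```python
def call_model(model: str, system: str, prompt: str) -> str:
    # Placeholder: swap in your actual provider's SDK call here.
    return f"[{model} reply to: {prompt[:40]}]"

def plan_then_execute(task: str) -> str:
    # Persona 1: a planner, backed by one underlying model.
    plan = call_model("planner-model", "You are a careful planner.", task)
    # Persona 2: an executor, backed by a different model, reading the plan.
    # Unlike MoE experts, the executor consumes the planner's actual output.
    return call_model("coder-model", "You are a coder; follow the plan.", plan)

print(plan_then_execute("Add retry logic to the upload endpoint."))
```

The key contrast with MoE: here the second call literally reads the first call's output, which is the "reasoning over reasoning" that a gated weighted sum never does.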