I’d love your opinion here!
Right now, we assume the first call is correct, and eagerly take the first match we find while traversing the tree.
One of the worst things that could currently happen is we cache a bad run, and now instead of occasional failures you’re given 100% failures.
A few approaches we’ve considered:
- maintain a staging tree, and only promote to live if multiple sibling nodes (messages) look similar enough. The decision to promote could be made via templating, regex, fuzzy matching, semantic similarity, or an LLM judge
- add some feedback APIs for a client to score end-to-end runs so that path could develop some reputation
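To make those two ideas concrete, here’s a rough sketch, not how we actually do it. Every name here (`StagingNode`, `PathReputation`, `min_samples`, `threshold`) is made up for illustration, and I’m assuming messages are plain strings and using stdlib `difflib` fuzzy matching as the similarity check. Any of the other options (regex, semantic, LLM-judged) could be swapped in at `_agree`.

```python
from collections import deque
from difflib import SequenceMatcher
from itertools import combinations


class StagingNode:
    """Hypothetical staging slot: holds recent candidate messages for one
    tree position and promotes one to live only once they agree."""

    def __init__(self, min_samples=3, threshold=0.8):
        self.threshold = threshold
        # Only the most recent `min_samples` candidates are compared, so a
        # single bad run eventually ages out instead of blocking promotion.
        self.recent = deque(maxlen=min_samples)
        self.live = None  # promoted message, or None while still staging

    def observe(self, message):
        """Record a candidate; return the live message, or None if staging."""
        if self.live is not None:
            return self.live
        self.recent.append(message)
        if len(self.recent) == self.recent.maxlen and self._agree():
            self.live = message
        return self.live

    def _agree(self):
        # Assumption: "similar enough" means every pair of recent candidates
        # clears a fuzzy-match ratio threshold.
        return all(
            SequenceMatcher(None, a, b).ratio() >= self.threshold
            for a, b in combinations(self.recent, 2)
        )


class PathReputation:
    """Hypothetical feedback tracker: running mean of client scores in [0, 1]."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def record(self, score):
        self.total += score
        self.count += 1

    def mean(self):
        # A path with no feedback has no reputation yet.
        return self.total / self.count if self.count else None
```

With this shape a cached path only goes live after `min_samples` consecutive runs agree, and a client-facing feedback call would just map to `record(score)` on whichever path served the run. The threshold and sample count are arbitrary here and would need tuning.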
I’d assume RL would be baked into the request structure. I’m surprised the OAI spec doesn’t include it, but I suppose you could hijack a conversation flow to do so