So that’s what it is! I was wondering why reducing context and summarising still makes it make mistakes and forget the steering. And couldn’t find explanation to why it starts ignoring instructions when context is not full at all.
How did you find that tool call is what degrades it?
Isn’t this a biggest problem there is and not just “design tension”?
Yeah, this is basically what I ran into too. I actually wrote about this in Stage 6 (https://ivanmagda.dev/posts/s06-context-compaction/) I went with your option (1): once history crosses a token threshold, the agent asks the model to summarize everything so far, then swaps the full history for that summary. Keeps the context window clean, though you do lose the ability to go back and reference exact earlier tool outputs.
The hard part was picking when to trigger it. Too early and you're throwing away useful context. Too late and the model's already struggling. I ended up just using a simple token count — nothing clever, but it works.
And yeah, the Swift angle was genuinely fun. Defining tool schemas as Codable structs that auto-generate JSON schemas at compile time, getting compiler errors instead of runtime API failures is a huge win.