oggreen
9 hours ago
How is prompt caching different from caching responses in a database? If you use the same prompt, wouldn't you want the same answer? Or can this be used for some kind of intermediate process, where different questions share the same prompt within a workflow?
kreidema
8 hours ago
Fair question! The main thing to keep in mind: we are not caching responses, we are caching intermediate calculation results (see first graphic in the post). Most of the time those cover only part of the full prompt: for example just the system prompt, but not what comes after it.
This saves providers such as Anthropic the effort of recalculating all the linear algebra needed to prefill the system prompt, and they reward us with a reduced input token cost. The result is the same whether cached or not; it's just less computation.
For the same prompt you would still get the same answer (assuming temperature = 0).
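To make "caching intermediate results, not responses" concrete, here is a minimal toy sketch. It is not how any provider actually implements this: `step`, `prefill_uncached`, and `PrefixCache` are hypothetical names, and the "state" is a single integer standing in for the real per-token tensors. The point is only that two different questions sharing a system prompt reuse the work for the shared prefix, and the final state is identical with or without the cache.

```python
from typing import Dict, List, Tuple


def step(state: int, token: str) -> int:
    """Toy stand-in for the per-token prefill work (hypothetical, not a real model)."""
    return (state * 31 + sum(map(ord, token))) % 1_000_003


def prefill_uncached(tokens: List[str]) -> int:
    """Process every token from scratch."""
    state = 0
    for token in tokens:
        state = step(state, token)
    return state


class PrefixCache:
    """Stores the intermediate state reached after each prefix seen so far."""

    def __init__(self) -> None:
        self.states: Dict[Tuple[str, ...], int] = {}
        self.steps_computed = 0

    def prefill(self, tokens: List[str]) -> int:
        # Find the longest prefix whose state is already cached.
        start, state = 0, 0
        for n in range(len(tokens), 0, -1):
            key = tuple(tokens[:n])
            if key in self.states:
                start, state = n, self.states[key]
                break
        # Only the tokens after that prefix need fresh computation.
        for i in range(start, len(tokens)):
            state = step(state, tokens[i])
            self.steps_computed += 1
            self.states[tuple(tokens[: i + 1])] = state
        return state


system = "You are a helpful assistant that answers tersely".split()
q1 = system + "What is a KV cache ?".split()
q2 = system + "Why is prefill expensive ?".split()

cache = PrefixCache()
s1 = cache.prefill(q1)
first_cost = cache.steps_computed               # every token is new
s2 = cache.prefill(q2)
second_cost = cache.steps_computed - first_cost  # only the new question

print(first_cost, second_cost)
print(s2 == prefill_uncached(q2))
```

The second request only pays for the tokens of its own question, yet ends in the same state as a from-scratch run. A response cache in a database would have missed entirely here, because the two prompts are not identical.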
The big savings come in an agentic/conversational context: every new turn puts the full message array back into the request. If we don't cache the calculation result at every step, the provider has to recalculate the early turns potentially hundreds of times (see second/third graphic).
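A rough back-of-the-envelope sketch of that effect, under made-up numbers (a 1,000-token system prompt and 50 turns that each add 200 tokens). It counts tokens the provider has to prefill, assuming an idealized cache that always hits on the previous request's prefix; it ignores cache expiry and the separate prices providers charge for cache writes and reads. `prefill_work` is a hypothetical helper, not part of any API.

```python
def prefill_work(system_tokens: int, turn_tokens: int, turns: int, cached: bool) -> int:
    """Total tokens prefilled across all requests of one conversation."""
    total = 0
    seen = 0  # length of the prefix already processed by an earlier request
    for turn in range(1, turns + 1):
        # Every request resends the system prompt plus all turns so far.
        request_len = system_tokens + turn * turn_tokens
        if cached:
            total += request_len - seen  # only the new suffix is computed
            seen = request_len
        else:
            total += request_len         # whole request is recomputed
    return total


uncached = prefill_work(1000, 200, 50, cached=False)
cached = prefill_work(1000, 200, 50, cached=True)
print(uncached, cached)  # 305000 11000
```

Without caching the work grows roughly quadratically with the number of turns, while with caching each token is processed once, which is where the gap in this toy case (about 28x) comes from.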