Contextual Retrieval

194 points, posted 15 hours ago
by loganfrederick

52 Comments

postalcoder

9 hours ago

To add some context, this isn't that novel of an approach. A common approach to improving RAG results is to "expand" the underlying chunks using an LLM, so as to increase the semantic surface area to match against. You can further improve your results by running query expansion using HyDE[1], though it's not always an improvement. I use it as a fallback.
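For anyone unfamiliar, HyDE boils down to embedding a hypothetical answer rather than the raw query. A minimal sketch of the idea; generate, embed, and index.nearest_neighbors are hypothetical stand-ins for whatever LLM, embedding model, and vector store you use:

  def hyde_search(query, index, generate, embed, top_k=5):
      """HyDE: search with the embedding of a hypothetical answer document."""
      # Ask the LLM to write a plausible (possibly wrong) answer passage.
      hypothetical_doc = generate(
          f"Write a short passage that answers the question: {query}"
      )
      # Embed that passage and use it as the query vector instead of the query.
      return index.nearest_neighbors(embed(hypothetical_doc), top_k)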

I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".

The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.

However, other than that, the only thing I see introduced here is a cookbook showing how to do a particular RAG workflow.

As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, and unlike anything offered by other providers. I highly recommend it.

1: https://arxiv.org/abs/2212.10496

resiros

9 hours ago

I think the innovation is using caching so as to make the cost of the approach manageable. The way they implemented it, each time you create a chunk you ask the LLM to produce an atomic (self-contained) chunk from the whole document. You need to do this for all tens of thousands of chunks in your data, which costs a lot. By caching the document, you can cut those costs.
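Roughly, each chunk gets its own call, but the (long) document portion of the prompt is marked cacheable so the follow-up calls are cheap. A sketch of how that might look with the Anthropic SDK; the prompt wording, model string, and beta header are assumptions on my part, not the cookbook verbatim:

  import anthropic

  client = anthropic.Anthropic()

  def situate_chunk(document: str, chunk: str) -> str:
      """Generate a short context for one chunk, caching the full document."""
      response = client.messages.create(
          model="claude-3-5-sonnet-20240620",
          max_tokens=150,
          system=[
              {"type": "text", "text": "You situate chunks within documents."},
              # The whole document is marked cacheable, so the calls for the
              # other chunks of the same document hit the cache.
              {"type": "text", "text": document,
               "cache_control": {"type": "ephemeral"}},
          ],
          messages=[{
              "role": "user",
              "content": f"Here is a chunk from the document:\n{chunk}\n"
                         "Write a short context situating this chunk within "
                         "the overall document, to improve retrieval.",
          }],
          extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
      )
      return response.content[0].text

The returned context then gets prepended to the chunk before it is embedded and BM25-indexed.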

skeptrune

9 hours ago

You could also just save the first generated atomic chunk, store it, and re-use it each time yourself. Easier and more consistent.

IanCal

5 hours ago

I don't understand how that helps here. They're not regenerating each chunk every time, this is about caching the state after running a large doc through a model. You can only do this kind of thing if you have access to the model itself, or it's provided by the API you use.

postalcoder

9 hours ago

To be fair, that only works if you keep chunk windows static.

postalcoder

9 hours ago

Yup. Caching is very nice, but the framing is weird. "Introducing," to me, connotes a product release, not a new tutorial.

bayesianbot

7 hours ago

I was trying to do this with Prompt Caching about a month ago, but then noticed there's a five-minute maximum lifetime for cached prompts - that doesn't really work for my RAG needs (or probably most), where the queries would be run over the next month or year. I can't see any changes to that policy. A little surprised to see them talk about Prompt Caching in relation to RAG.

spott

2 hours ago

They aren’t using the prompt caching on the query side, only on the embedding side… so you cache the document in the context window when ingesting it, but not during retrieval.

ValentinA23

an hour ago

Interesting. One problem I'm facing is using RAG to retrieve applicable rules instead of knowledge (chunks): only rules that may apply should be injected into the context. I haven't done any experiments, but one approach I think could work would be to train small classifiers to determine whether a specific rule might apply. The main LLM would then be tasked with determining whether the rule indeed applies for the current context.

An example: let's suppose you're using an LLM to play a multi-user dungeon. In the past your character has behaved badly with taxis, so the game has created a rule that whenever you try to enter a taxi you're kicked out: "we know who you are, we refuse to have you as a client until you formally apologize to the taxi company director". Upon apologizing, the rule is removed. Note that the director of the taxi company could be another player, and could be the one who issued the rule in the first place, to be enforced by his NPC fleet of taxis.

I'm wondering how well this could scale (with respect to the number of active rules) and to what extent traditional RAG could be applied. Deciding whether a rule applies seems like a more abstract and difficult problem than deciding whether a chunk of knowledge is relevant.

In particular, the main problem I've identified that makes it harder is the following dependency loop, which doesn't appear with knowledge retrieval: you need to retrieve a rule in order to decide whether it applies. Does anyone know how this problem could be solved?

will-burner

28 minutes ago

I wish they had included the datasets they used for the evaluations. As far as I can tell, in appendix II they include some sample questions, answers, and golden chunks, but they don't give the entire dataset or any explicit information on exactly what the datasets are.

Does anyone know if the datasets they used for the evaluation are publicly available or if they give more information on the datasets than what's in appendix II?

There are standard publicly available datasets for this type of evaluation, like MTEB (https://github.com/embeddings-benchmark/mteb). I wonder how this technique does on the MTEB dataset.

underlines

6 hours ago

We're building a corporate RAG system for a government entity. What I've learned so far by applying an experimental A/B testing approach to RAG using RAGAS metrics:

- Hybrid retrieval (semantic + vector) followed by LLM-based reranking made no significant change when measured with synthetic eval-questions

- HyDE decreased answer quality and retrieval quality severely when measured with RAGAS using synthetic eval-questions

(we still have to do a RAGAS eval using expert and real user questions)

So yes, hybrid retrieval is always good - that's no news to anyone building production-ready or enterprise RAG solutions. But no single method always wins. We found the semantic search in Azure AI Search sufficient as a second method, next to vector similarity. Others might find BM25 great, or a fine-tuned query post-processing SLM. Depends on the use case. Test, test, test.
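For anyone curious what that kind of measurement looks like in practice, here is a rough sketch with the ragas library (column names and metric imports may differ slightly between ragas versions); run it once per pipeline variant and compare the scores:

  from datasets import Dataset
  from ragas import evaluate
  from ragas.metrics import (answer_relevancy, context_precision,
                             context_recall, faithfulness)

  # One row per eval question: generated answer, retrieved contexts,
  # and the reference answer.
  eval_data = Dataset.from_dict({
      "question": ["What is the maximum file size for uploads?"],
      "answer": ["Uploads are limited to 25 MB per file."],
      "contexts": [["Section 4.2: Files larger than 25 MB are rejected."]],
      "ground_truth": ["25 MB per file."],
  })

  scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy,
                                        context_precision, context_recall])
  print(scores)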

Next things we're going to try:

- RAPTOR

- SelfRAG

- Agentic RAG

- Query Refinement (expansion and sub-queries)

- GraphRAG

Learning so far:

- Always use a baseline and an experiment to try to refute your null hypothesis using measures like RAGAS or others.

- Use three types of evaluation questions/answers: 1. Expert written q&a, 2. Real user questions (from logs), 3. Synthetic q&a generated from your source documents

williamcotton

4 hours ago

Could you explain or link to explanations of all of the acronyms you’ve used in your comment?

jiggawatts

3 hours ago

It makes me chuckle a bit to see this kind of request in a tech forum, particularly when discussing advanced LLM-related topics.

This is akin to a HN comment asking someone to search the Internet for something on their behalf, while discussing search engine algorithms!

williamcotton

3 hours ago

It adds useful context to the discussion and spurs further conversation.

williamcotton

3 hours ago

HyDE: Hypothetical Document Embeddings [1]

RAGAS: RAG Assessment [2]

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval [3]

Self-RAG: Self-Reflective Retrieval-Augmented Generation [4]

Agentic RAG: Agentic Retrieval-Augmented Generation [5]

GraphRAG: Graph Retrieval-Augmented Generation [6]

[1] https://docs.haystack.deepset.ai/docs/hypothetical-document-...

[2] https://docs.ragas.io/en/stable/

[3] https://arxiv.org/html/2401.18059v1

[4] https://selfrag.github.io

[5] https://langchain-ai.github.io/langgraph/tutorials/rag/langg...

[6] https://www.microsoft.com/en-us/research/blog/graphrag-unloc...

justanotheratom

18 minutes ago

Looking forward to some guidance on "chunking":

"Chunk boundaries: Consider how you split your documents into chunks. The choice of chunk size, chunk boundary, and chunk overlap can affect retrieval performance1."

simonw

10 hours ago

My favorite thing about this is the way it takes advantage of prompt caching.

That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made financial sense.

I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.

My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual... and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-cl...

thruway516

6 hours ago

I follow your blog and read almost everything you write about LLMs. Just curious (if you haven't already written about it somewhere and I missed it): how much do you spend monthly exploring all the various LLMs and their features? (I think it's useful context for getting a grasp of how much I would have to spend to keep up with the models out there and the latest features.)

simonw

4 hours ago

Most months I spend less than $10 total across the OpenAI, Anthropic and Google APIs - for the kind of stuff I do I’m just not racking up really high token counts.

I spend $20/month on ChatGPT Plus and $20/month on Claude Pro. I get GitHub Copilot for free as an open source maintainer.

davedx

an hour ago

Cost is one aspect, but what about ingest time? You're adding significant processing time to your pipeline with this method, right?

jillesvangurp

9 hours ago

You could do a lot of stuff by pre-calculating things for your embeddings. Why cache when you can pre-calculate? That brings into play a whole lot of things people commonly do as part of ETL.

I come from a traditional search background. It's quite obvious to me that RAG is a bit of a naive strategy if you limit it to just vector search with some off-the-shelf embedding model. Vector search simply isn't that good. You need additional information retrieval strategies if you want to improve the context you provide to the LLM. That is effectively what they are doing here.

Microsoft published an interesting paper on graph RAG some time ago, where they combine vector search with a conceptual graph constructed from the indexed data using entity extraction. This allows them to pull in contextually relevant information for matching chunks.

I have a hunch that you could get quite far without doing any vector search at all. It would be a lot cheaper too: simply use a traditional search engine and some tuned query. The trick, of course, is query tuning, which may not work that well for general-purpose use cases but could work for more specialized ones.
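Fusing a traditional keyword ranking with a vector ranking doesn't have to be fancy either; reciprocal rank fusion is a few lines. A generic sketch, not tied to any particular engine:

  def reciprocal_rank_fusion(result_lists, k=60):
      """Merge ranked lists of doc ids into a single fused ranking."""
      scores = {}
      for results in result_lists:
          for rank, doc_id in enumerate(results):
              # Documents ranked highly by any retriever float to the top.
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
      return sorted(scores, key=scores.get, reverse=True)

  # Example: fuse a BM25 ranking with an embedding-search ranking.
  fused = reciprocal_rank_fusion([
      ["doc3", "doc1", "doc7"],  # keyword/BM25 results
      ["doc1", "doc5", "doc3"],  # vector-search results
  ])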

TmpstsTrrctta

8 hours ago

I have experience in traditional search as well, and I think it limits my imagination a bit when it comes to vector search. In the post, I did like the introduction of Contextual BM25 compared to other hybrid approaches that then do RRF.

For question answering, vector/semantic search is clearly a better fit in my mind, and I can see how the contextual models can enable and bolster that. However, because I've implemented and used so many keyword-based systems, that just doesn't seem to be how my brain works.

An example I'm thinking of is finding a sushi restaurant near me with availability this weekend around dinner time. I'd love to be able to search for this exactly as I've written it. How I would actually search for it is: search for "sushi restaurant", sort by distance, and hope the application does a proper job of surfacing time filtering.

Conversely, this is mostly how I would build such a system, perhaps with a layer to determine user intention to pull out restaurant type, location sorting, and time filtering.

I could see using semantic search to filter the restaurants down to ones related to sushi, but do we then drop back into traditional search for filtering and sorting? Utilize function calling to have the LLM parameterize our search query?

As stated, perhaps I'm not thinking about these the right way because of my experience with existing systems, which, when well built, seem to give me better results.
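The function-calling route mentioned above would look roughly like this with Anthropic's tool-use API; the tool name and fields are made up for the example, and your own code still runs the actual filtered, sorted query against the restaurant index:

  import anthropic

  client = anthropic.Anthropic()

  response = client.messages.create(
      model="claude-3-5-sonnet-20240620",
      max_tokens=300,
      tools=[{
          "name": "search_restaurants",
          "description": "Search restaurants by cuisine, location, and time.",
          "input_schema": {
              "type": "object",
              "properties": {
                  "cuisine": {"type": "string"},
                  "near": {"type": "string"},
                  "reservation_time": {"type": "string"},
              },
              "required": ["cuisine"],
          },
      }],
      messages=[{"role": "user",
                 "content": "Find a sushi place near me with availability "
                            "this weekend around dinner time."}],
  )
  # Extract the structured parameters the model chose for the search call.
  tool_inputs = [block.input for block in response.content
                 if block.type == "tool_use"]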

ValentinA23

an hour ago

Another approach I saw is to build a conceptual graph using entity extraction and have the LLM suggest search paths through that graph to enhance the retrieval step. The LLM is fine-tuned on the conceptual graph for this specific task. It could work in your case, but you need an ontology that suits your use case; in other words, it must already contain restaurant location, type of dishes served, and opening hours.

lmeyerov

2 hours ago

This was my exact question. Why do an LLM rewrite, when you can add a context vector to a chunk vector, and for plaintext indexing, add a context string (eg, tfidf)?

The article claimed other context augmentation fails and that you're better off paying Anthropic to run an LLM over all your data, but it seems quite handwavy. What vector+text search nuance does a full-document-cache LLM rewrite catch that cheapo methods miss? Reminds me of "it is difficult to get a man to understand something when his salary depends on his not understanding it". (We process enough data that we try to limit LLMs to the retrieval step and only use embeddings & light LLMs at the indexing step, so it's a $$$ distinction for our customers.)

The context caching is neat in general, so I have to wonder if this use case is more about paying for ease than quality, and its value for quality is elsewhere.

visarga

7 hours ago

GraphRAG requires you to define the schema of entity and relation types upfront. This works when you are in a known domain, but in general, when you just want to answer questions from a large reference, you don't know what you need to put in the graph.

postalcoder

8 hours ago

Graph RAG is very cool and outstanding at filling some niches. IIRC, Perplexity's actual search is just BM25 (based on a Lex Fridman interview with the founder).

jillesvangurp

8 hours ago

Makes sense; Perplexity is usually really responsive and fast.

I need to check out that interview with Lex Fridman.

skeptrune

10 hours ago

I'm not a fan of this technique. I agree the scenario they lay out is a common problem, but the proposed solution feels odd.

Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0 and cause you to lose a lot of precision in the dense vectors. IDF search overcomes this to some extent, but not enough.

You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation wise it's a weighted sum. During the augmentation step where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.

Description of "semantic boost" in the Trieve API[1]:

>semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.

[1]:https://docs.trieve.ai/api-reference/chunk/create-or-upsert-...
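Mechanically, the boost is just a weighted move along the line between the two embeddings. A small numpy sketch of the idea (not Trieve's actual implementation):

  import numpy as np

  def semantic_boost(chunk_vec, phrase_vec, distance_factor=0.25):
      """Move the chunk embedding distance_factor of the way toward the phrase."""
      chunk_vec = np.asarray(chunk_vec, dtype=float)
      phrase_vec = np.asarray(phrase_vec, dtype=float)
      boosted = chunk_vec + distance_factor * (phrase_vec - chunk_vec)
      # Re-normalize if your index assumes unit-length vectors.
      return boosted / np.linalg.norm(boosted)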

torginus

7 hours ago

Sorry, random question - do vector DBs work across models? I'd guess no, since embeddings are model-specific afaik, but that means a vector DB would lock you into a single LLM, and even within that a single version, like Claude 3.5 Sonnet; you couldn't move to 3.5 Haiku, Opus, etc., never mind ChatGPT or Llama, without reindexing.

rvnx

6 hours ago

In short: no.

Vector databases are there to store vectors and calculate distances between vectors.

The embedding model is the model you pick to generate these vectors from a string or an image.

You give "bart simpson" to an embeddings model and it becomes (43, -23, 2, 3, 4, 843, 34, 230, 324, 234, ...)

You can imagine them like geometric points in space (well, they're vectors, technically), except that instead of being in 2D or 3D space, they typically have a much higher number of dimensions (e.g. 768).

When you want to find similar entries, you just generate a new vector for "homer simpson" (64, -13, 2, 3, 4, 843, 34, 230, 324, 234, ...), send it to the vector database, and it returns all the nearest neighbors (= the existing entries with the smallest distance).

To generate these vectors you can use any model you want; however, you have to stay consistent.

It means that once you are using one embedding model, you are "forever" stuck with it, as there is no practical way to project from one vector space to another.
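A toy version of what the vector database does under the hood (brute-force cosine similarity with numpy); the consistency requirement is exactly that the query and stored vectors must come from the same embedding model:

  import numpy as np

  def nearest_neighbors(query_vec, doc_vecs, top_k=3):
      """Return indices of the stored vectors most similar to the query."""
      docs = np.asarray(doc_vecs, dtype=float)
      query = np.asarray(query_vec, dtype=float)
      # Cosine similarity = dot product of unit-normalized vectors.
      docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
      query_n = query / np.linalg.norm(query)
      sims = docs_n @ query_n
      return np.argsort(-sims)[:top_k]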

torginus

6 hours ago

That sucks :( I wonder if there are other approaches to this, like simple word lookup with a few stored synonyms, and prompting the LLM to always use the proper technical terms when performing a lookup.

kordlessagain

2 hours ago

Back-of-the-book indexes or inverted indexes can be stored in a set store and give decent results comparable to vector lookups. The issue with them is that you have to run an extraction inference to get the keywords.
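A minimal version of that set-store idea; here keyword extraction is just lowercased word splitting, where in practice you'd use the extraction inference mentioned above:

  from collections import defaultdict

  def build_inverted_index(docs):
      """Map each keyword to the set of document ids that contain it."""
      index = defaultdict(set)
      for doc_id, text in docs.items():
          for word in set(text.lower().split()):
              index[word].add(doc_id)
      return index

  def lookup(index, query):
      """Return ids of documents containing every query keyword."""
      keyword_sets = [index.get(w, set()) for w in query.lower().split()]
      return set.intersection(*keyword_sets) if keyword_sets else set()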

davedx

an hour ago

Even with prompt caching, this adds a lot of extra time to your vector database creates/updates, right? That may be okay for some use cases, but I'm always wary of adding multiple LLM layers to these kinds of applications. It's nice for the cloud LLM providers, of course.

I wonder how it would work if you generated the contexts yourself algorithmically. Depending on how well structured your docs are, this could be quite trivial (e.g. for an HTML doc, insert title > h1 > h2 > chunk).

paxys

2 hours ago

Waiting for the day when the entire AI industry goes back full circle to TF-IDF.

davedx

an hour ago

Yeah, it did make me chuckle. I'm guessing products like Elasticsearch support all the classic text-matching algos out of the box anyway?

valstu

9 hours ago

We're doing something similar. We first chunk the documents based on h1, h2, h3 headings. Then we add the headings at the beginning of the chunk as context. As an imaginary example, instead of one chunk being:

  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
It is now something like:

  # Fever
  ## Treatment
  ---
  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
This seems to work pretty well, and doesn't require any LLMs when indexing documents.

(Edited formatting)
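For reference, a sketch of that heading-prefix chunking for markdown-ish sources; it assumes h1-h3 headings mark the chunk boundaries, as in the example above:

  import re

  def chunk_with_headings(markdown_text):
      """Split on h1-h3 headings and prefix each chunk with its heading path."""
      path, buffer, chunks = {}, [], []

      def flush():
          body = "\n".join(buffer).strip()
          buffer.clear()
          if body:
              prefix = "\n".join(path[level] for level in sorted(path))
              chunks.append(f"{prefix}\n---\n{body}")

      for line in markdown_text.splitlines():
          match = re.match(r"^(#{1,3})\s+(.*)", line)
          if match:
              flush()
              level = len(match.group(1))
              # Drop deeper headings when a new, higher-level section starts.
              path = {lvl: h for lvl, h in path.items() if lvl < level}
              path[level] = line.strip()
          else:
              buffer.append(line)
      flush()
      return chunks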

visarga

7 hours ago

I'm working on question answering over long documents / bundles of documents, 100+ pages, and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to produce a hierarchical index; it organizes the whole bundle into a tree. At query time I include the path in the tree as additional context.

cabidaher

9 hours ago

Did you experiment with different ways to format those included headers? Asking because I am doing something similar to that as well.

valstu

9 hours ago

Nope, not yet. We've stuck with markdown-ish syntax so far.

_bramses

8 hours ago

The technique I find most useful is to implement a "linked list" strategy where a chunk has multiple pointers to the entry that references it. This is done manually, but the diversity of ways you can reference a particular node goes up dramatically.

Another way to look at it, comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance, and others will be farther, due to the perception of the authors of the comments themselves. But if you assign each comment a “parent_id”, your access to the post multiplies.

You can see an example of this technique here [1]. I don’t attempt to mind read what the end user will query for, I simply let them tell me, and then index that as a pointer. There are only a finite number of options to represent a given object. But some representations are very, very, very far from the semantic meaning of the core object.

[1] - https://x.com/yourcommonbase/status/1833262865194557505

msp26

4 hours ago

> If your knowledge base is smaller than 200,000 tokens (about 500 pages of material)

I would prefer that Anthropic just release their tokeniser so we don't have to make guesses.

mark_l_watson

5 hours ago

I just took the time to read through all the source code and docs. Nice ideas. I like to experiment with LLMs running on my local computer, so I will probably convert this example to use the lightweight Python library Rank-BM25 instead of Elasticsearch, and a long-context model running on Ollama. I wouldn't have prompt caching, though.

This example is well written and documented, easy to understand. Well done.
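For anyone attempting the same swap, the Rank-BM25 side is only a few lines (a sketch; tokenization here is naive whitespace splitting, which you'd want to improve):

  from rank_bm25 import BM25Okapi

  corpus = [
      "contextual retrieval prepends a short context to each chunk",
      "prompt caching makes repeated long-document prompts much cheaper",
      "bm25 ranks documents by term frequency and inverse document frequency",
  ]
  tokenized_corpus = [doc.lower().split() for doc in corpus]
  bm25 = BM25Okapi(tokenized_corpus)

  query = "how does prompt caching reduce cost".lower().split()
  top_docs = bm25.get_top_n(query, corpus, n=2)  # lexical-only retrieval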

skybrian

12 hours ago

This sounds a lot like how we used to do research, by reading books and writing any interesting quotes on index cards, along with where they came from. I wonder if prompting for that would result in better chunks? It might make it easier to review if you wanted to do it manually.

visarga

7 hours ago

The fundamental problem with both keyword and embedding based retrieval is that they only access surface-level features. If your document contains "5+5" and you search "where is the result 10", you won't find the answer. That is why all texts need to be "digested" with an LLM before indexing, to draw out implicit information and make it explicit. It's also what Anthropic proposes we do to improve RAG.

"study your data before indexing it"

regularfry

6 hours ago

I've been wondering for a while if having Elasticsearch as just another function to call might be interesting. If the LLM can just generate queries, it's an easy deployment.

vendiddy

7 hours ago

I don't know anything about AI, but I've always wished I could just upload a bunch of documents/books and have the AI perform some basic keyword searches to figure out what is relevant, then auto-include that in the prompt.

average_r_user

7 hours ago

It would help if you tried NotebookLM by Google. It does this: you can upload a document, PDF, whatever, and ask questions. The model replies and also gives references back to your material.

mark_l_watson

4 hours ago

+1 Google’s NotebookLM is amazing. In addition to the functionality you mention, I tried loading the PDF for my entire Practical AI Programming with Clojure book and had it generate an 8 minute podcast that was very nuanced - to be honest, it seriously blew my mind how well it works. Here is a link to the audio file it automatically generated https://markwatson.com/audio/AIClojureBook.wav

NotebookLM is currently free to use and was so good I almost immediately started paying Google $20 a month to get access to their pro version of Gemini.

I still think the Groq APIs for open weight models are the best value for the money, but the way OpenAI, Google, Anthropic, etc. are productizing LLMs is very impressive.

timwaagh

9 hours ago

I guess this does give some insights. Using a more space-efficient language for your codebase will mean more functionality fits in the AI's context window when working with Claude and code.

thelastparadise

6 hours ago

Can someone explain simply how these benchmarks work?

What exactly is a "failure rate" and how is it computed?