MemoRAG – Enhance RAG with memory-based knowledge discovery for long contexts

70 points, posted 4 hours ago
by taikon

12 Comments

simpaticoder

2 hours ago

I am naive about LLM technology, in particular the relationship between base models, fine-tuning, and RAG. This particular branch of effort seems aimed at something that is of great interest to me (and I'm sure many others), which is to specialize a more general base model to know a particular domain in great detail and so improve its responses within that domain. In the past, this might have been called an "expert system". For example, you might want to train an LLM on your project codebase and documentation such that subsequent code suggestions prioritize the use of internal libraries or code conventions over those represented by the public sources encoded in the base model.

I found the Google Colab notebook of MemoRag[1] to be of great use in understanding roughly the scope and workflow of this work. The interesting step is when you submit your domain as text to be encoded into something new, a GPU-heavy process they call "forming memory"[2]. Perhaps there is some sort of back-and-forth between the base model and your data that results in new weights added to the base model. As I said, I am naive about LLM technology, so I'm not sure about the details or the nomenclature. However, if this is even partially correct, I'd like to understand how the "formed memory" and the base model cohabitate during inference, because this would create memory pressure on the GPU. If the memory required for the base model is M, and the formed memory is N, it's reasonable to assume you'd need M+N memory to use both (rough sketch after the links below).

1 - https://colab.research.google.com/drive/1fPMXKyi4AwWSBkC7Xr5...

2 - https://colab.research.google.com/drive/1fPMXKyi4AwWSBkC7Xr5...
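
To make the worry concrete, here's the back-of-envelope arithmetic I have in mind -- my own rough numbers, assuming half-precision weights at ~2 bytes per parameter, and assuming (which may well be wrong) that the "formed memory" is itself roughly model-sized:

    # Rough sketch of the M + N concern; numbers are mine, not from the repo.
    # Assumes fp16/bf16 weights (~2 bytes/parameter) and ignores KV cache
    # and activations, which add more on top.
    def vram_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
        return num_params * bytes_per_param / 1e9

    M = vram_gb(7e9)   # base/generator model, e.g. a 7B model -> ~14 GB
    N = vram_gb(7e9)   # if the "formed memory" were itself 7B-sized -> ~14 GB

    print(f"M = {M:.0f} GB, N = {N:.0f} GB, M + N = {M + N:.0f} GB")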

bbor

an hour ago

   In the past, this might have been called an "expert system". 
Heh, it comes full circle... After ~50 years of Expert Systems winter, we're training our new AGIs to become more specialized! This is a memorable lesson that binaries must always be deconstructed, at least to some extent -- kinda like the endless dance we're doing between monoliths and microservices as each new generation of tools runs into the problems inherent in each.

  I am naive about LLM technology so I'm not sure about the details or the nomenclature
You've got all the details right though, so that's pretty impressive :). AFAICT from a quick glance at the code (https://github.com/qhjqhj00/MemoRAG/blob/main/memorag/memora...), it is indeed "fine tuning" (jargon!) a model on your chosen book, presumably in the most basic/direct sense: asking it to reproduce sections of text at random from the book given their surrounding context, and rewarding/penalizing the neural network based on how well it did. The comment about GPU memory in the Colab notebook is there merely because this process is expensive -- "fine tuning" is the same thing as "training", just with a nearly-complete starting point. Thus the call to `AutoModelForCausalLM.from_pretrained()`.
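
For the curious, here's roughly what that kind of continued training looks like with Hugging Face transformers. This is just a generic sketch of the idea I'm describing, not this repo's code; model name and file path are placeholders, and gpt2 is only there so the sketch actually runs (a real setup would use a bigger base model plus tricks like LoRA to fit in VRAM):

    # Generic sketch of "fine-tuning = training from a nearly-complete starting point".
    # Not MemoRAG's code; "gpt2" and the file path are placeholders.
    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")   # nearly-complete starting point

    text = open("my_project_docs.txt").read()              # your domain text
    ids = tokenizer(text, return_tensors="pt").input_ids[0]

    optimizer = AdamW(model.parameters(), lr=1e-5)
    chunk = 512                                            # tokens per training step
    model.train()
    for start in range(0, ids.numel() - chunk, chunk):
        batch = ids[start:start + chunk].unsqueeze(0)
        # labels == inputs: the model is scored on predicting each next token,
        # i.e. reproducing the text given its preceding context
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.save_pretrained("my-finetuned-model")            # the "offline" artifact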

To answer your question explicitly: the fine-tuning step creates a modified version of the base model as an "offline" step, so the memory requirements during inference (aka "online" operation) are unaffected, both in terms of storage and in terms of GPU VRAM. I'm not the dev tho, so obv apologies if I'm off base!

I would passionately argue that that step is more of a small addition to the overall pipeline than a core necessity, though. Fine-tuning is really good for teaching a model to recreate style, tone, structure, and other linguistic details, but it's not a very feasible way to teach it facts. That's what "RAG" is for: making up for this deficiency in fine-tuning.

In other words, this repo is basically like that post from a few weeks back that was advocating for "modular monoliths" that employ both strategies (monolith vs. microservices) in a deeply collaborative way. And my reaction is the same: I'm not convinced the details of this meshing will be very revolutionary, but the idea itself is deceptively clever!

spmurrayzzz

13 minutes ago

> AFAICT from a quick glance at the code (https://github.com/qhjqhj00/MemoRAG/blob/main/memorag/memora...), it is indeed "fine tuning" (jargon!) a model on your chosen book, presumably in the most basic/direct sense: asking it to reproduce sections of text at random from the book given their surrounding context, and rewarding/penalizing the neural network based on how well it did.

Maybe your use of quotes is intentional here, but for posterity's sake there is no actual fine-tuning happening in the code you linked, insofar as the weights of the model aren't being touched at all, nor are they modifying anything else that could impact the original weights (like a LoRA adapter). You touch on this, I think (?), in some of your subsequent language but it read as a little confusing to me at first glance. Or maybe I've been too deep in the ML weeds for too many years at this point.

The paper details the actual process, but the TL;DR is that the memory module they use, basically a draft model, does go through a pretraining phase using the RedPajama dataset, and then an SFT phase with a different objective. This all happens before and irrespective of the inference-time task (i.e. asking questions about a given text). Also, as has been pointed out in other comments, the draft model could really be any model that supports long context and has decent retrieval performance. So the actual training phases here may be non-essential, depending on your infra/cost constraints.
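
To illustrate that last point: you could stand up the clue/draft step with any off-the-shelf long-context instruct model via plain transformers, no custom pretraining or SFT. This is my own sketch, not the repo's API, and the model name is just a placeholder:

    # Sketch only: a generic long-context instruct model as the clue/draft
    # generator, instead of the trained memory module. Model name is a placeholder.
    from transformers import pipeline

    clue_gen = pipeline("text-generation", model="Qwen/Qwen2-7B-Instruct",
                        device_map="auto")

    def draft_clues(question: str, long_context: str) -> str:
        prompt = (f"{long_context}\n\n"
                  f"Question: {question}\n"
                  "List the key facts or passages needed to answer this question:\n")
        out = clue_gen(prompt, max_new_tokens=128, return_full_text=False)
        return out[0]["generated_text"]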

quantadev

an hour ago

The overview paragraph needs to be expanded quite a bit. The only operative phrase about how this thing works is "By recalling query-specific clues". I think people need a bit more knowledge about what this is and how this works, in an overview, to get them interested in trying it. Surely we can be a bit more specific.

3abiton

an hour ago

This comment brought back some academic-paper-reviewer-associated PTSD

davedx

3 hours ago

I don’t understand what the memory is or does from the README. Can anyone explain how it works differently from vector database results in vanilla RAG applications?

jszymborski

2 hours ago

Ok, I think I get it now from scanning the paper and reading Eq. 1 and 2.

Normally, RAG just sends your query `q` to an information retrieval function, which searches a database of documents using full-text search or vector search. Those documents are then passed to a generative model along with your query to give you your final answer.

MemoRAG instead immediately passes `q` to a generative model to generate some uninformed response `y`. `y` is then passed to the information retrieval function. Then, just like vanilla RAG, `q` and the retrieved documents are sent to a generative model to give you your final answer.
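
In pseudo-Python, the difference as I read Eq. 1 and 2 is just where the retrieval query comes from. This is my sketch, not the authors' code; `generate` and `memory_model` stand in for LLM calls, and the retriever is a toy word-overlap scorer:

    # Sketch of the two pipelines as I read them.
    from typing import Callable, List

    def retrieve(query: str, docs: List[str], k: int = 3) -> List[str]:
        # stand-in for full-text or vector search
        q_words = set(query.lower().split())
        ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

    def vanilla_rag(q: str, docs: List[str], generate: Callable[[str], str]) -> str:
        context = retrieve(q, docs)                  # retrieve with the raw query
        return generate(f"Context: {context}\nQuestion: {q}\nAnswer:")

    def memo_rag(q: str, docs: List[str],
                 memory_model: Callable[[str], str], # long-context "memory" model
                 generate: Callable[[str], str]) -> str:
        y = memory_model(q)                          # uninformed draft answer / clues
        context = retrieve(y, docs)                  # retrieve with `y`, not `q`
        return generate(f"Context: {context}\nQuestion: {q}\nAnswer:")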

Not sure how this is any more "memory-based" than regular RAG, but it seems interesting.

Def check out the pre-print, especially eq. 1 and 2. https://arxiv.org/abs/2409.05591

EDIT: The "memory" part comes from the first generative model being able to handle larger context, covered in Section 2.1

danielbln

2 hours ago

jszymborski

an hour ago

It seems to be fundamentally the same deal except instead of passing `q` to GPT-4, they have some long-context "Memory Model" (whose details I've yet to fully understand). Also, MemoRAG uses a more conventional Retrieve/Generate pipeline downstream of the generated queries than "Contriever" (whose details I similarly haven't informed myself on).

It would be interesting to see a performance comparison; it certainly seems the most relevant one (that, or an ablation of their "memory model" against the LLMs upon which they are based).

isoprophlex

2 hours ago

thanks for boiling it down to the most salient point... to me, their approach is just query rewriting, which is pretty standard when doing RAG.

jszymborski

an hour ago

There's a lot there about the generative model ("Memory Models") in the paper, so perhaps I've misrepresented it, but generally speaking yah I agree with you. It doesn't sound like a fundamental change to how we think about RAG, but it might be a nice formalization of an incremental improvement :)

bbor

an hour ago

  Not sure how this is any more "memory-based" than regular RAG, but it seems interesting.
I can't remember where I read this joke, but as a self-proclaimed Cognitive Engineer I think about it every day: "An AI startup's valuation is directly proportional to how many times they can cram 'mind' into their pitch deck!"