ck_one
2 days ago
Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.
All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).
Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).
Freaking impressive!
grey-area
a day ago
Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:
https://www.wizardemporium.com/blog/complete-list-of-harry-p...
Why is this impressive?
Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.
sigmoid10
a day ago
Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
rendx
a day ago
> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data
No need for surprises! It is publicly known that the corpora of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3], etc.
[1] https://news.ycombinator.com/item?id=46572846
[2] https://www.theguardian.com/technology/2023/apr/20/fresh-con...
[3] https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...
paodealho
a day ago
also:
"Researchers Extract Nearly Entire Harry Potter Book From Commercial LLMs"
https://www.aitechsuite.com/ai-news/ai-shock-researchers-ext...
sigmoid10
a day ago
The big AI houses are all involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least have some level of filtering for their training data to keep them legally somewhat compliant. But considering how much copyrighted material is blissfully spread around online, it is probably not enough to filter out the actual ebooks of certain publishers.
rendx
13 hours ago
> I think they at least have some level of filtering for their training data to keep them legally somewhat compliant.
So far, courts are siding with the "fair use" argument. No need to exclude any data.
https://natlawreview.com/article/anthropic-and-meta-fair-use...
"Even if LLM training is fair use, AI companies face potential liability for unauthorized copying and distribution. The extent of that liability and any damages remain unresolved."
https://www.whitecase.com/insight-alert/two-california-distr...
joenot443
a day ago
> even proprietary content like the books themselves
This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. Immediately to mind was House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.
Do LLMs have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?
rendx
a day ago
> It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files
Anna's Archive alone claims to currently publicly host 61,654,285 books, more than 1PB in total.
user
a day ago
beepbooptheory
20 hours ago
> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
The plot of Good Will Hunting would like a word.
yunohn
21 hours ago
Maybe y’all missed this?
https://www.washingtonpost.com/technology/2026/01/27/anthrop...
Anthropic, specifically, ingested libraries of books by scanning and then disposing of them.
vercaemert
a day ago
It's impressive, even if the books and the posts you're talking about were both key parts of the training data.
There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).
This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.
(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)
adastra22
a day ago
I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context, instead of regurgitating it from the training weights. A better test would be to subtly modify the books fed in as input to the model so that there were actually 51 spells and see if it pulls out the extra spell, or to modify the names of some spells, etc.
In your example, it might be the case that the model simply spits out the consensus view, rather than actually finding/constructing this information on its own.
vercaemert
a day ago
Ah, that's a good point.
rlt
15 hours ago
They should try the same thing but replace the original spell names with something else.
MarcellusDrum
a day ago
So a good test would be replacing the spell names in the books with made-up ones. And if a "real" spell name was still given, it also tests whether it "cheated".
outofpaper
a day ago
A real test would be synthesizing 100,000 sentences, selecting random ones, and injecting the traits you want the LLM to detect and describe, e.g. have a set of words or phrases that represent spells and have them used so that they do something. Then have the LLM find these random spells in the random corpus.
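A rough sketch of that setup in Python (the templates, filler sentences, fake spell names, and injection rate below are all made-up placeholders, not anything from an actual benchmark):

```python
import random

TEMPLATES = [
    "She raised her wand and cried '{spell}', and the door burst open.",
    "With a whisper of '{spell}', the candles went out.",
    "Nothing happened when he tried '{spell}' a second time.",
]
FILLER = [
    "The rain kept falling.",
    "They walked on in silence.",
    "Dinner was late again.",
]

def build_corpus(n_sentences, spells, rate=0.001, seed=0):
    """Mostly filler, with rare 'spell-casting' sentences injected at a known rate."""
    rng = random.Random(seed)
    lines, ground_truth = [], []
    for _ in range(n_sentences):
        if rng.random() < rate:
            spell = rng.choice(spells)
            lines.append(rng.choice(TEMPLATES).format(spell=spell))
            ground_truth.append(spell)
        else:
            lines.append(rng.choice(FILLER))
    return "\n".join(lines), ground_truth

# Build a 100k-sentence haystack with made-up spells, then compare
# whatever list the LLM extracts against ground_truth.
haystack, ground_truth = build_corpus(100_000, ["Florbus", "Quembar"])
```

Since you control the injection rate and the seed, you get an exact ground-truth list to score the model against, which the Harry Potter version can't give you.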
lxgr
a day ago
It could still remember where each spell is mentioned. I think the only way to properly test this would be to run it against an unpublished manuscript.
staticman2
a day ago
Any obscure work of fiction or fanfiction would likely be fine as a casual test.
If you ask a model to discuss an obscure work it'll have no clue what it's about.
This is very different than asking about Harry Potter.
lxgr
a day ago
Yeah, that's what I've been doing as well, and at least Gemini 3 Pro did not fare very well.
staticman2
a day ago
For fun I've asked Gemini Pro to answer open-ended questions about obscure books, like "Read this novel and tell me what the hell this book is, do a deep reading and analyze", and I've gotten insightful/enjoyable answers, but I've never asked it to make lists of spells or anything like that.
ehatr
a day ago
The poster you reply to works in AI. The marketing strategy is to always have a cute Pelican or Harry Potter comment as the top comment for positive associations.
The poster knows all of that, this is plain marketing.
throw10920
a day ago
This sounds compelling, but also something that an armchair marketer would have theorycrafted without any real-world experience or evidence that it actually works - and I searched online and can't find any references to something like it.
Do you have a citation for this?
zaphirplane
a day ago
Why don’t you ask it and find out ;)
grey-area
a day ago
Because the model doesn't know but will happily tell a convincing lie about how it works.
hereonout2
a day ago
I was playing about with ChatGPT the other day, uploading screenshots of sheet music and asking it to convert them to ABC notation so I could make a MIDI file.
The results seemed impressive until I noticed some of the "Thinking" statements in the UI.
One made it apparent the model / agent / whatever had read the title from the screenshot and was off searching for existing ABC transcripts of the piece Ode to Joy.
So the whole thing was far less impressive after that, it wasn't reading the score anymore, just reading the title and using the internet to answer my query.
anomaly_
a day ago
Sounds pretty human like! Always searching for a shortcut
lpcvoid
a day ago
It sounds like it's lying and making stuff up, something everybody seems to be okay with when using LLMs.
LeanderK
a day ago
I am not sure why... you want the LLM to solve problems, not come up with answers itself. It's allowed to use tools precisely because it tends to make stuff up. In general, only if you're benchmarking LLMs do you care about whether the LLM itself provided the answer or used a tool. If you ask it to convert the notation of sheet music it might use a tool, and that's probably the right decision.
cherrycherry98
a day ago
The shortcut is fine if it's a bog-standard canonical arrangement of the piece. If it's a custom jazz rendition you composed with odd key changes and shifting time signatures, taking that shortcut is not going to yield the intended result. It's choosing the wrong tool for the job, which makes it unreliable for this task.
kouunji
21 hours ago
For structured outputs like that wouldn’t it be better to get the LLM to create a script to repeatably make the translation?
nobodywillobsrv
a day ago
Yes, I have found that Grok, for example, suddenly becomes quite sane when you tell it to stop querying the internet and just rethink the conversation data and answer the question.
It's weird, it's like many agents are now in a phase of constantly getting more information and never just thinking with what they've got.
xiomrze
2 days ago
Honest question, how do you know if it's pulling from context vs from memory?
If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.
ozim
2 days ago
Exactly, there was this study where they were trying to make an LLM reproduce an HP book word for word, like giving it the first sentences and letting it cook.
Basically they managed, with some tricks, to get 99% word for word - the tricks were needed to bypass safety measures that are in place for exactly this reason: to stop people from retrieving training material.
pron
2 days ago
This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :
> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.
ck_one
2 days ago
Do you remember how to get around those tricks?
djhn
2 days ago
This is the paper: https://arxiv.org/abs/2601.02671
Grok and DeepMind IIRC didn't require tricks.
eek2121
2 days ago
This really makes me want to try something similar with content from my own website.
I shut it down a while ago because bot traffic overtook human traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue); however, the AI scrapers really started ramping up and the only way I could realistically continue would be to pay a lot more for hosting/infrastructure.
I had put a ton of time into building out content... thousands of hours, only to have scrapers ignore robots.txt, bypass Cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.
Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.
I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.
londons_explore
a day ago
> only to have scrapers ignore robots, bypass cloudflare
Set the server to require cloudflares SSL client cert, so nobody can connect to it directly.
Then make sure every page is cacheable and your costs will drop to near zero instantly.
It's like 20 mins to set these things up.
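For reference, a minimal sketch of the nginx side of this, assuming Cloudflare's Authenticated Origin Pulls feature; the domain and all file paths are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name example.com;                              # placeholder

    ssl_certificate     /etc/nginx/certs/site.pem;        # placeholder
    ssl_certificate_key /etc/nginx/certs/site.key;        # placeholder

    # Only accept connections that present Cloudflare's origin-pull client
    # cert, so scrapers can't bypass the proxy by hitting the origin IP.
    ssl_client_certificate /etc/nginx/certs/origin-pull-ca.pem;
    ssl_verify_client on;

    location / {
        root /var/www/site;
        # Mark everything cacheable so Cloudflare serves most hits itself.
        add_header Cache-Control "public, max-age=86400";
    }
}
```

You'd also enable "Authenticated Origin Pulls" in the Cloudflare dashboard so it actually presents the client cert.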
apsurd
a day ago
Can we help get your infra cost down to negligible? I'm thinking things like pre-generated static pages and CDNs. I won't assume you hadn't thought of this before, but I'd like to understand more where your non-trivial infra cost come from?
djhn
a day ago
I would be tempted to try and optimise this as well. 100000 hits on an empty domain and ~200 dollars worth of bot traffic sounds wild. Are they using JS-enabled browsers or sim farms that download and re-download images and videos as well?
raphman
a day ago
a) As an outside observer, I would find such a lawsuit very interesting/valuable. But I guess the financial risk of taking on OpenAI or Anthropic is quite high.
b) If you don't want bots scraping your content and DDOSing you, there are self-hosted alternatives to Cloudflare. The simplest one that I found is https://github.com/splitbrain/botcheck - visitors just need to press a button and get a cookie that lets them through to the website. No proof-of-work or smart heuristics.
camdenreslink
2 days ago
The new cloudflare products for blocking bots and AI scrapers might be worth a shot if you put so much work into the content.
prawn
a day ago
Further, some low effort bots can be quickly handled with CF by blocking specific countries (e.g., Brazil and Russia, for one of my sites).
WarmWash
20 hours ago
What's not clear from the study (at least skimming it) is if they always started the ball rolling with ground truth passages or if they chained outputs from the model until they got to the end of the book. I strongly suspect the latter would hopelessly corrupt relatively quickly.
It seems like this technique only works if you have a copy of the material to work off of, i.e. enter a ground truth passage, tell the model to continue it as long as it can, and then enter the next ground truth passage to continue in the next session.
djhn
14 minutes ago
Oh! That’s a huge caveat if that’s indeed the case.
petercooper
2 days ago
One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.
andai
2 days ago
That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.
jazzyjackson
a day ago
A fine instruction-following task, but if Harry Potter is in the weights of the neural net, it's going to mix some of the real ones with the alternates.
ck_one
2 days ago
When I tried it without web search so only internal knowledge it missed ~15 spells.
clanker_fluffer
2 days ago
What was your prompt?
matt-p
26 minutes ago
Now try it without giving it the books as context. I'm sure it probably knows there are 49.
golfer
2 days ago
There are lots of websites that list the spells. It's well documented. Could Claude simply be regurgitating knowledge from the web? Example:
qwertytyyuu
a day ago
Hmm… maybe he could swap out all the spell names for slightly different ones and see how that goes
ck_one
2 days ago
It didn't use web search. But for sure it has some internal knowledge already. It's not a perfect needle-in-the-haystack problem, but Gemini Flash was much worse when I tested it last time.
viraptor
2 days ago
If you want to really test this, search/replace the names with your own random ones and see if it lists those.
Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
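The replacement step doesn't need an LLM at all - a rough sketch in Python (the nonsense prefix and example spell are made up for illustration):

```python
import re
import secrets

def scramble_spells(text, spells):
    """Replace each known spell with a random nonsense token, case-insensitively.

    Returns the scrambled text plus the real-to-fake mapping, so the model's
    extracted list can be scored against the fakes afterwards.
    """
    mapping = {s: "Zorblat" + secrets.token_hex(3) for s in spells}
    for original, fake in mapping.items():
        text = re.sub(re.escape(original), fake, text, flags=re.IGNORECASE)
    return text, mapping

corpus = "Harry yelled 'Expelliarmus!' and Malfoy's wand flew out of his hand."
scrambled, key = scramble_spells(corpus, ["Expelliarmus"])
```

A deterministic find-and-replace like this guarantees every occurrence is swapped, so any "real" spell the model outputs must have come from its weights, not the context.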
jazzyjackson
a day ago
Being that it has the books memorized (huh, just learned another US/UK spelling quirk), I would suppose feeding it the books with altered spells would get you a confused mishmash of data in the context and data in the weights.
ribosometronome
2 days ago
Couldn't you just ask the LLM which 50 (or 49) spells appear in the first four Harry Potter books without the data for comparison?
viraptor
2 days ago
It's not going to be as consistent. It may get bored of listing them (you know how you can ask for many examples and get 10 in response?), or omit some minor ones for other reasons.
By replacing the names with something unique, you'll get much more certainty.
szundi
a day ago
[dead]
Grimblewald
a day ago
It might not work well, but by navigating to a very Harry Potter-dominant part of latent space by preconditioning on the books, you make it more likely to get good results. An example would be taking a base model and prompting "what follows is the book 'X'"; it may or may not regurgitate the book correctly. Give it a chunk of the first chapter and let it regurgitate from there and you tend to get fairly faithful recovery, especially for things on Gutenberg.
So it might be that, by preconditioning latent space to the area of the Harry Potter world, you make it much more probable that the full spell list is regurgitated from online resources that were also read, while asking naively might get it sometimes, and sometimes not.
The books act like a hypnotic trigger, and the result may not represent a generalized skill. Hence why replacing with random words would help clarify: if you still get the original spells, regurgitation is confirmed; if it finds the replaced spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble chapters around. That way it can't even "know" where to "look" for the spell names from training.
user
a day ago
angst
a day ago
BTW, it recalls 42 when I asked (without web search).
Full transcript: pastebin.com/sMcVkuwd
f33d5173
a day ago
Not sure how they're being counted, but that adds up to 46 with the pair spells counted separately. But then nox is counted twice, so maybe 45.
user
a day ago
heavyset_go
a day ago
No, because you don't know the magic spell (forgive me) of context that can be used to "unlock" that information if it's stored in the NN.
I mean, you can try, but it won't be a definitive answer as to whether that knowledge truly exists or doesn't exist as it is encoded into the NN. It could take a lot of context from the books themselves to get to it.
joshmlewis
2 days ago
I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.
user
a day ago
obirunda
a day ago
This underestimates how much of the Internet is actually compressed into and is an integral part of the model's weights. Gemini 2.5 can recite the first Harry Potter book verbatim for over 75% of the book.
NiloCK
a day ago
I'm getting astrology when I search for this. Any links on this?
f33d5173
a day ago
IIRC it's not quite true. 75% of the book is more likely to appear than you would expect by chance if prompted with the prior tokens. This suggests that it has the book encoded in its weights, but you can't actually recover it by saying "recite Harry Potter for me".
jdminhbg
a day ago
Do you happen to know, is that because it can’t recite Harry Potter, or because it’s been instructed not to recite Harry Potter?
jazzyjackson
a day ago
It's a matter of token likelihood... as a continuation, the rest of chapter one is highly likely to follow the first paragraph.
The full text of Chapter One is not the only/likeliest possible response to "recite chapter one of harry potter for me"
jamesfinlayson
a day ago
Instructed not to was my understanding.
obirunda
a day ago
altmanaltman
a day ago
> But for sure it has some internal knowledge already.
Pretty sure the books had to be included in its training material in full text. It's one of the most popular book series ever created, of course they would train on it. So "some" is an understatement in this case.
Trasmatta
a day ago
Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.
soulofmischief
2 days ago
The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of harmful.
IAmGraydon
a day ago
I'm not sure what your knowledge level of the inner workings of LLMs is, but a model doesn't need search or even an internet connection to "know" the information if it's in its training dataset. In your example, it's almost guaranteed that the LLM isn't searching the books - it's just referencing one of the hundreds of lists of those spells in its training data.
This is the LLM's magic trick that has everyone fooled into thinking they're intelligent - it can very convincingly cosplay an intelligent being by parroting an intelligent being's output. This is equivalent to making a recording of Elvis, playing it back, and believing that Elvis is actually alive inside of the playback device. And let's face it, if a time traveler brought a modern music playback device back hundreds of years and showed it to everyone, they WOULD think that. Why? Because they have not become accustomed to the technology and have no concept of how it could work. The same is true of LLMs - the technology was thrust on society so quickly that there was no time for people to adjust and understand its inner workings, so most people think it's actually doing something akin to intelligence. The truth is it's just as far from intelligence as your music playback device is from having Elvis inside of it.
kgeist
21 hours ago
>The truth is it's just as far from intelligence your music playback device is from having Elvis inside of it.
A music playback device's purpose is to let you hear Elvis' voice. A good device does it well: you hear Elvis' voice (maybe with some imperfections). Whether the real Elvis is inside of it or not doesn't matter - its purpose is fulfilled regardless. By your analogy, an LLM simply reproduces what an intelligent person would say on the matter. If it does its job more or less, it doesn't matter either whether it's "truly intelligent" or not; its output is already useful. I think it's completely irrelevant in both cases to the question "how well does it do X?" If you think about it, 95% of what we know we learned from school/environment/parents; we didn't discover it ourselves via some kind of scientific method, we mostly just parrot what other intelligent people said before us. Maybe human "intelligence" itself is 95% parroting/basic pattern matching from training data? (18 years of training during childhood!)
eek2121
2 days ago
Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.
Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text
Just a few casual thoughts.
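That "Where's Waldo?" shuffler is a few lines of Python - a quick sketch (the function name, base sentences, and the example needle are all made up):

```python
import random

def waldo_corpus(source_sentences, needles, seed=0):
    """Shuffle real sentences, then slip each 'needle' phrase in at a random position."""
    rng = random.Random(seed)
    mixed = list(source_sentences)
    rng.shuffle(mixed)
    for phrase in needles:
        mixed.insert(rng.randrange(len(mixed) + 1), phrase)
    return mixed

base = [f"Sentence number {i} about nothing in particular." for i in range(1000)]
haystack = waldo_corpus(base, ["The password is xylophone-42."], seed=7)
```

Fixing the seed makes the haystack reproducible, so the same corpus can be run across all the models being compared.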
I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.
clhodapp
2 days ago
Having the AI spew it might suffer from the fact that the spew itself is influenced by the AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's training cutoff.
matt_lo
a day ago
Use AI to rewrite all the spells from all the books, then try to see if AI can detect the rewritten ones. This will ensure it's not pulling from its training data set.
gbalduzzi
a day ago
Neat idea, but why should I use AI for a find and replace?
It feels like shooting a fly with a bazooka
jack_pp
a day ago
it's like hiring someone to come pick up your trash from your house and put it on the curb.
it's fine if you're disabled
miohtama
a day ago
Bazooka guarantees the hit
xenodium
a day ago
I like LLMs, but guarantees in LLMs are... you know... not guaranteed ;)
throwaway290
a day ago
I think that was the point
imafish
a day ago
If all you have is a hammer.. ;)
bilekas
a day ago
You're missing the point, it's only a testing exercise for the new model.
happyraul
a day ago
No, the point is that you can set up the testing exercise without using an LLM to do a simple find and replace.
kakacik
a day ago
It's a test. Like all tests, it's more or less synthetic and focused on specific expected behavior. I am pretty far from LLMs now, but this seems like a very good test to see how genuine this behavior actually is (or repeat it 10x with some scramble for going deeper).
inexcf
a day ago
This thread is about the find-and-replace, not the evaluation. Gambling on whether the first AI replaces the right spells just so the second one can try finding them is unnecessary when find-and-replace is faster, easier and works 100%.
bilekas
a day ago
... I'm not sure if you're trolling or if you missed the point again. The point is to test the contextual ability and correctness of the LLM's ability to perform actions that would hopefully be guaranteed not to be in the training data.
It has nothing to do with the performance of the string replacement.
The initial "find" is to see how well it performs at actually finding all the "spells" in this case, then replacing them. Then, using a separate context maybe, evaluate whether the results are the same or skewed in favour of training data.
luckydata
a day ago
do you know all the spells you're looking for from memory?
wickedsight
a day ago
You could just, you know, Google the list.
Applejinx
a day ago
and then the first thing you see will be at least one of ITS AI responses, whether you liked it or not
LeoPanthera
a day ago
That won't help. The AI replacing them will probably miss the same ones as the AI finding them.
steve1977
a day ago
I think the question was if it will still find 49 out of 50 if they have been replaced.
meroes
2 days ago
What is this supposed to show exactly? Those books have been fed into LLMs for years and there's even likely specific RLHF on extracting spells from HP.
muzani
2 days ago
There was a time when I put the EA-Nasir text into base64 and asked AI to convert it. Remarkably, it identified the correct text but pulled the most popular translation of the text rather than the one I gave it.
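The encoding side of that test is trivial to reproduce - a sketch in Python (the sample sentence below is a rough paraphrase standing in for whatever translation was actually pasted in):

```python
import base64

# Placeholder text standing in for the tablet translation that was pasted in;
# not the actual wording used in the original experiment.
original = "Tell Ea-nasir: the copper ingots you delivered were not of good quality."
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")

# A faithful model should decode `encoded` back to exactly this string,
# not to whichever translation of the tablet was most common online.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == original
```

Comparing the model's decode against `original` character for character cleanly separates "decoded the input" from "recognized the famous text and recited the popular version".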
majewsky
a day ago
Sucks that you got a really shitty response to your prompt. If I were you, the model provider would be receiving my complaint via clay tablet right away.
noumenon1111
20 hours ago
Imagine you ordered the new Claude Opus and instead you got Gemini telling you to glue the cheese on your pizza...
rvz
2 days ago
> What is this supposed to show exactly?
Nothing.
You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like the one the OP just did are meaningless.
Such "benchmarks" are performative for VCs, who do not ask why the research and testing isn't done independently but is almost always done by the companies' own in-house researchers.
jaco6
a day ago
[dead]
kmacdough
18 hours ago
What are we testing here?
It feels like a very odd test because it's such an unreasonable way to answer this with an LLM. Nothing about the task requires more than a very localized understanding. It's not like a codebase or corporate documentation, where there's a lot of interconnectedness and context that's important. It also doesn't seem to poke at the gap between human and AI intelligence.
Why are people excited? What am I missing?
zamadatix
2 days ago
To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:
> The smug look on Malfoy’s face flickered.
> “No one asked your opinion, you filthy little Mudblood,” he spat.
> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.
> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.
> “Ron! Ron! Are you all right?” squealed Hermione.
> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.
sobjornstad
2 days ago
I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?
In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.
zamadatix
a day ago
Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly, two mentions of "vomit" are in book 6, but neither in reference to slugs (plenty of Slughorn, of course!). Book 5 was the only other one where a related hit came up:
> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.
There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.
Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!
ck_one
2 days ago
Then it's fair that it didn't find it
muzani
2 days ago
There's a benchmark which works similarly but they ask harder questions, also based on books https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...
I guess they have to add more questions as these context windows get bigger.
kybernetikos
2 days ago
I recently got junie to code me up an MCP for accessing my calibre library. https://www.npmjs.com/package/access-calibre
My standard test for that was "Who ends up with Bilbo's buttons?"
dwa3592
2 days ago
have another LLM (gemini, chatgpt) make up 50 new spells. insert those and test and maybe report here :)
ActionHank
a day ago
The books were likely in the training data, I don't know that it's that impressive.
SebastianSosa
a day ago
Now, thanks to this post (and the infra providers' inclination to appeal to Hacker News), we will never know if the model actually discovered the 50 spells or memorized them, since it will be trained on this. :( But what can you do, this is interesting
bartman
2 days ago
Have you by any chance tried this with GPT 4.1 too (also 1M context)?
dom96
2 days ago
I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?
user
19 hours ago
psychoslave
21 hours ago
Ah, and no one has thrown TOAC into it yet?
irishcoffee
2 days ago
The top comment is about finding bastardized Latin words from children's books. The future is here.
LanceJones
2 days ago
Assuming this experiment involved isolating the LLM from its training set?
grey-area
a day ago
Of course it didn't. Not sure you really can do that - LLMs are a collection of weights from the training set, take away the training set and they don't really exist. You'd have to train one from scratch excluding these books and all excerpts and articles about them somehow, which would be very expensive and I'm pretty sure the OP didn't do that.
So the test seems like a nonsensical test to me.
guluarte
2 days ago
You can get the same result just asking Opus/GPT; it is probably internalized knowledge from Reddit or similar sites.
ck_one
2 days ago
If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.
guluarte
a day ago
Because it is non-deterministic. I just did and got 50 spells.
prompt: finding every spell in all Harry Potter books in the first 4 books, only the officially documented spells across those 4 books, there are 50, without using websearch
Here's my best compilation from memory, organized by book. I'll be upfront: getting to exactly 50 is tricky without cross-referencing, so some might be debatable or I might miss a couple. Here goes:
---
*Book 1 - Philosopher's/Sorcerer's Stone*
1. *Lumos* - produces light
2. *Nox* - extinguishes Lumos
3. *Alohomora* - unlocks doors
4. *Wingardium Leviosa* - levitation
5. *Petrificus Totalus* - full body-bind
6. *Locomotor Mortis* - leg-locker curse
7. *Incendio* - produces fire
8. *Reparo* - mends broken objects
*Book 2 - Chamber of Secrets*
9. *Expelliarmus* - disarms opponent
10. *Rictusempra* - tickling charm
11. *Tarantallegra* - forces target to dance
12. *Serpensortia* - conjures a snake
13. *Finite Incantatem* - ends spell effects
14. *Obliviate* - erases memories
15. *Aparecium* - reveals invisible ink
16. *Immobulus* - freezes target in place
17. *Homorphus* - mentioned by Lockhart (werewolf cure claim)
18. *Peskipiksi Pesternomi* - Lockhart's failed pixie spell
*Book 3 - Prisoner of Azkaban*
19. *Expecto Patronum* - produces a Patronus
20. *Riddikulus* - repels a Boggart
21. *Waddiwasi* - shoots objects at target
22. *Impervius* - waterproofing charm
23. *Dissendium* - opens secret passage
24. *Ferula* - conjures bandages/splint
25. *Mobilicorpus* - moves a body
26. *Lumos Maxima* - intense light
*Book 4 - Goblet of Fire*
27. *Accio* - summoning charm
28. *Avada Kedavra* - killing curse
29. *Crucio* - Cruciatus curse (torture)
30. *Imperio* - Imperius curse (control)
31. *Stupefy* - stunning spell
32. *Engorgio* - enlarges target
33. *Reducio* - shrinks target
34. *Sonorus* - amplifies voice
35. *Quietus* - reverses Sonorus
36. *Morsmordre* - conjures the Dark Mark
37. *Priori Incantatem* - reveals last spell cast
38. *Deletrius* - erases magical residue
39. *Densaugeo* - enlarges teeth
40. *Furnunculus* - causes boils
41. *Impedimenta* - slows/stops target
42. *Reducto* - blasts solid objects
43. *Diffindo* - severing charm
44. *Relashio* - releases sparks/grip
45. *Orchideous* - conjures flowers
46. *Avis* - conjures birds
47. *Point Me* - Four-Point Spell (compass)
48. *Ennervate* - revives stunned person
49. *Protego* - shield charm
50. *Conjunctivitis Curse* - affects eyesight (Krum on the dragon)
---
A few caveats: some of these (like Lumos Maxima, Homorphus, Peskipiksi Pesternomi) are borderline since they're either mentioned rather than properly cast, or might be film additions that bleed into memory. The Conjunctivitis Curse is described but its incantation isn't explicitly given in the text. And Protego might technically first appear with its incantation in Book 5 during DA practice rather than Book 4.
If you want, I can turn this into a spreadsheet or document with columns for spell name, effect, who casts it, and which chapter.
siwatanejo
a day ago
> All 7 books come to ~1.75M tokens
How do you know? Each word is one token?
koakuma-chan
a day ago
You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.
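The exact figure depends on the tokenizer (presumably a BPE tokenizer such as tiktoken was used for the counts above). Absent one, the common rule of thumb of roughly 4 characters per token for English prose lands in the same ballpark; a minimal sketch, where the helper name and the 4-chars-per-token constant are assumptions, not anyone's actual methodology:

```python
# Rough token estimate via the common ~4-characters-per-token rule of
# thumb for English prose; a real BPE tokenizer (e.g. tiktoken) gives
# exact counts, which is how the ~1.75M-2M figures would be measured.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)
```

The 7 books run to very roughly 6-7M characters, which this heuristic puts in the 1.5M-2M token range consistent with both counts quoted in the thread.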
kylehotchkiss
15 hours ago
I love the fun metric.
My hope is that locally run models can pass this test in the next year or two!
polynomial
20 hours ago
You need to publish this tbh
TheRealPomax
2 days ago
That doesn't seem a super useful test for a model that's optimized for programming?
huangmeng
a day ago
you are rich
dr_dshiv
a day ago
Comparison to another model?
IhateAI
a day ago
like I often say, these tools are mostly useful for people to do magic tricks on themselves (and to convince C-suites that they can lower pay, and reduce staff if they pay Anthropic half their engineering budget lmao )
dudewhocodes
a day ago
There are websites with the spells listed... which makes this a search problem. Why is an LLM used here?
bilekas
a day ago
It's just a benchmark test exercise.
hansmayer
a day ago
> Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.
Clearly a very useful, grounded and helpful everyday use case of LLMs. I guess in the absence of real-world use cases, we'll have to do AI boosting with such "impressive" feats.
Btw - a well-crafted regex could have achieved the same (pointless) result with ~0.0000005% of the resources the LLM machine used.
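To sketch what such a regex might look like (the pattern and helper name are my own invention, not anything from the thread): spells in the books usually appear as one or two capitalized pseudo-Latin words shouted in dialogue, e.g. "Expecto Patronum!"

```python
import re

# Naive regex pass: match one or two capitalized words followed by "!".
# This over-matches ordinary exclamations ("Harry!"), so a whitelist or
# filtering step is still needed -- arguably the part a regex can't do,
# which is where the LLM-vs-regex comparison gets less clear-cut.
SPELL_RE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)?)!")

def candidate_spells(text: str) -> set[str]:
    return set(SPELL_RE.findall(text))
```

So the regex yields candidates cheaply, but separating real incantations from shouted names is the hard remainder of the problem.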