ahaspel
7 days ago
I rebuilt the 1911 Encyclopædia Britannica into a clean, structured, navigable site: https://britannica11.org
What it does:
– ~37k articles reconstructed from the original volumes
– section-level structure (contents are clickable within articles)
– cross-references extracted and linked
– contributors indexed and searchable
– original volume + page references preserved and shown while reading
– links to the original scans for each page
– ancillary material included (prefaces, abbreviations, etc.)
– topic index reproduced and cross-linked
– full-text search with article metadata (length, volume, etc.)
Most of the work was in parsing and reconstruction: headings, multi-page articles, tables, math, languages, footnotes, plates, and all the small edge cases that come up in a work like this.
The goal was to make something that feels like the original, but is actually usable.
I’d especially appreciate feedback on:
– search quality
– navigation (sections, cross-references)
– anything that looks structurally off
Happy to answer questions about the pipeline or data model.
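For readers wondering what "cross-references extracted and linked" can involve: the 1911 text marks many cross-references with "(q.v.)" after the referenced term. The following is only a rough sketch of that one case, not the site's actual pipeline, which isn't described in this thread. The function names, the `/article/<slug>` URL shape, and the capitalised-words heuristic are all assumptions made for illustration.

```python
import re

# Assumption: a referenced title is a run of capitalised words directly
# before "(q.v.)". Real text also uses lowercase terms; this sketch ignores them.
QV_PATTERN = re.compile(r"\b([A-Z][\w'-]*(?: [A-Z][\w'-]*)*) \(q\.v\.\)")


def slugify(title: str) -> str:
    """Hypothetical slug scheme: lowercase, non-alphanumerics become hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def link_cross_references(text: str, known_slugs: set[str]) -> str:
    """Wrap 'Title (q.v.)' in a link when the title resolves to a known article."""

    def replace(match: re.Match) -> str:
        words = match.group(1).split(" ")
        # Try the longest candidate first, then drop leading words, so that
        # "See Plato (q.v.)" still resolves to "Plato".
        for start in range(len(words)):
            title = " ".join(words[start:])
            slug = slugify(title)
            if slug in known_slugs:
                prefix = " ".join(words[:start])
                lead = prefix + " " if prefix else ""
                return f'{lead}<a href="/article/{slug}">{title}</a> (q.v.)'
        return match.group(0)  # unresolved references are left untouched

    return QV_PATTERN.sub(replace, text)
```

Leaving unresolved references as plain text keeps a bad guess from turning into a broken link.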
zozbot234
7 days ago
You might want to add The Reader's Guide to the Encyclopaedia Britannica, PD text available at https://www.gutenberg.org/ebooks/74039 and scans at https://archive.org/details/readersguidetoen00londuoft - It would fit naturally with the Ancillary material that includes the topic-based index.
ahaspel
7 days ago
It would indeed. I will see about working this in; it's highly pertinent.
ahaspel
4 days ago
The Reader's Guide has been added to the ancillary material. Thanks for the excellent suggestion.
zozbot234
4 days ago
Thanks for adding this! Do you plan to add back-links in the article pages (and perhaps in contributors pages) pointing to the chapters in the Reader's Guide that mention them, similar to what's done for the subject-based index?
ahaspel
4 days ago
Not a bad idea. I'll see what I can work out on that score. But I imagine the far more common path is from the Guide to the encyclopedia than the reverse.
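For what it's worth, if the Guide-to-article links are already stored, the reverse direction is mostly a matter of inverting that mapping. A minimal sketch, assuming the links are available as a dict from chapter ID to the article IDs it mentions (the data shape and all IDs below are made up for illustration):

```python
from collections import defaultdict


def build_backlinks(guide_chapters: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert chapter -> articles into article -> chapters, without duplicates."""
    backlinks: dict[str, list[str]] = defaultdict(list)
    for chapter, article_ids in guide_chapters.items():
        for article_id in article_ids:
            # A chapter may mention the same article several times; list it once.
            if chapter not in backlinks[article_id]:
                backlinks[article_id].append(chapter)
    return dict(backlinks)


# Placeholder IDs, not real chapters or articles.
example = {
    "chapter-a": ["article-x", "article-y"],
    "chapter-b": ["article-x", "article-x"],
}
print(build_backlinks(example))
```

The same inversion would apply to contributor pages if chapters reference contributors by ID.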
logicallee
7 days ago
Thanks so much for sharing this. It looks fantastic. A couple of questions, if you don't mind: what license are you releasing this under, if any? Is there any way to download it? The reason someone might want to download it is for use as training data.
zozbot234
7 days ago
Wikisource has the original scans available in the public domain, and their enriched text under CC-BY-SA: https://en.wikisource.org/wiki/EB1911
ahaspel
7 days ago
Thanks!
The underlying text (1911 edition) is public domain, but the structured version here — the parsing, reconstruction, and linking — is something I put together for this site. Right now there isn’t a bulk download available. I’m considering exposing structured access (API or dataset) in some form, but haven’t decided exactly how that will work yet.
If you have a specific use case in mind (especially for training), I’d be interested to hear more.
logicallee
7 days ago
Regarding the specific use case, I was thinking this: I had Gemma 4 (a small but highly capable offline model released by Google) make a public-domain CC0 encyclopedia of some core science and technology concepts[1]. I thought it was pretty good.
Separately, I've fine-tuned the Gemma 4 model[2]. It was very quick (just 90 seconds), so I think it could be interesting to train it to talk like the 1911 Encyclopedia Britannica.
I would use the entries as training data and train it to talk in the same style. There isn't a specific use case for why, I just think it would be interesting. For example, I could see how it writes about modern concepts in the style of 1911 Britannica.
[1] https://stateofutopia.com/encyclopedia/
[2] To talk like a pirate! https://www.youtube.com/live/WuCxWJhrkIM
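As a rough idea of the data-prep step this would need, here is a sketch that turns (title, text) pairs into prompt/completion records in JSONL. The record fields, the prompt wording, and the character cap are assumptions; the actual format depends on whichever fine-tuning tool is used.

```python
import json


def to_training_records(articles, max_chars=2000):
    """Build one prompt/completion record per non-empty article.

    articles: iterable of (title, text) pairs.
    max_chars: hypothetical cap so very long entries don't dominate training.
    """
    records = []
    for title, text in articles:
        body = " ".join(text.split())  # collapse line breaks and repeated spaces
        if not body:
            continue  # skip entries with no usable text
        records.append(
            {
                "prompt": f"Write an encyclopedia entry on {title}.",
                "completion": body[:max_chars],
            }
        )
    return records


def to_jsonl(records):
    """Serialise records as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```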
ahaspel
7 days ago
That’s a fun idea — I can see the appeal of that style.
The underlying text is public domain, but the structured version here is something I put together for the site. I haven’t released a bulk dataset yet.
If you end up experimenting with it, I’d love to hear how it turns out — and I’m still figuring out what structured access might look like.
hallole
7 days ago
I've wanted to do something like this for The Encyclopédie, a hugely relevant text to the Enlightenment. If you ever get around to adding a rough "How I (generally) Made This" section, that'd be appreciated! Site looks great :)
ahaspel
4 days ago
Thanks for the kind words. I've had a few requests for a technical appendix (i.e., "how I built this") and it is in the works.
realityfactchex
7 days ago
> Is there any way to download it? The reason someone might want to download it is for use as training data.
Another reason would be to be able to keep running/using it even if the main site were to go down eventually, for whatever reason; or to operate a mirror of it, for redundancy (linking back to the original, of course).
bentley
6 days ago
There’s an escaping issue in tables of contents. See, e.g., “Roosevelt's” in the “United States” article. https://britannica11.org/article/27-0635-united-states-the/u...
ahaspel
4 days ago
This is now fixed, along with several more serious rendering errors in "United States". Thanks a lot for pointing it out.
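The thread doesn't say what caused the bug. For anyone hitting something similar, one common cause of visible entities in headings is escaping text that was already escaped. A small sketch of that failure mode using Python's standard `html` module (the `escape_once` helper is hypothetical, and it assumes headings never need to show a literal entity such as `&amp;`):

```python
import html

heading = "Roosevelt's"  # the example word from the report above

once = html.escape(heading)  # "Roosevelt&#x27;s": renders as Roosevelt's
twice = html.escape(once)    # "Roosevelt&amp;#x27;s": renders as Roosevelt&#x27;s


def escape_once(text: str) -> str:
    """Normalise to raw text first, so already-escaped input is not escaped again."""
    return html.escape(html.unescape(text))


print(once)
print(twice)
print(escape_once(once))
```

The cleaner fix is usually to keep raw text everywhere in the pipeline and escape exactly once at render time; `escape_once` is only a stopgap for data whose state is unknown.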
huijzer
6 days ago
Really nice. Well done.
As a feature request, would it be possible for your pipeline to also create an EPUB? Then people could easily access and search through the document even if your site goes down. EPUB uses compression by default, so the file size might not even be too bad for the full encyclopedia.
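To make the compression point concrete: an EPUB is a ZIP container in which the `mimetype` entry comes first and is stored uncompressed, while the rest can be deflated. Below is a minimal sketch using only the Python standard library. The file names and the (title, text) input shape are assumptions, and the output has not been checked against an EPUB validator.

```python
import io
import uuid
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def xhtml_page(title: str, body_html: str) -> str:
    """Wrap an HTML fragment in a minimal XHTML document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">'
        f"<head><title>{escape(title)}</title></head>"
        f"<body>{body_html}</body></html>"
    )


def build_epub(book_title: str, articles: list[tuple[str, str]]) -> bytes:
    """Build EPUB bytes from (article_title, plain_text) pairs."""
    files, manifest, spine, toc = {}, [], [], []
    for i, (name, text) in enumerate(articles):
        fname = f"a{i}.xhtml"
        body = f"<h1>{escape(name)}</h1><p>{escape(text)}</p>"
        files[fname] = xhtml_page(name, body)
        manifest.append(
            f'<item id="a{i}" href="{fname}" media-type="application/xhtml+xml"/>'
        )
        spine.append(f'<itemref idref="a{i}"/>')
        toc.append(f'<li><a href="{fname}">{escape(name)}</a></li>')

    toc_html = "".join(toc)
    files["nav.xhtml"] = xhtml_page(
        book_title, f'<nav epub:type="toc"><ol>{toc_html}</ol></nav>'
    )
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    files["content.opf"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0" '
        'unique-identifier="uid">'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        f'<dc:identifier id="uid">urn:uuid:{uuid.uuid4()}</dc:identifier>'
        f"<dc:title>{escape(book_title)}</dc:title>"
        "<dc:language>en</dc:language>"
        f'<meta property="dcterms:modified">{modified}</meta>'
        "</metadata><manifest>"
        '<item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" '
        'properties="nav"/>'
        + "".join(manifest)
        + "</manifest><spine>"
        + "".join(spine)
        + "</spine></package>"
    )

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=zipfile.ZIP_DEFLATED)
        for fname, content in files.items():
            zf.writestr(fname, content, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()
```

Splitting the full encyclopedia into one EPUB per volume might be kinder to e-readers than a single file, though that is a guess rather than something tested here.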
nyc_pizzadev
7 days ago
Very nice. I actually spent a bit of time browsing a few topics, which is something I rarely do these days!
A few things... when I click an article and try to jump to a new topic, the top search box (labeled "Search titles and full text...") doesn't work. Second, when I first came to the site, I was a bit stuck. It took a bit of time to realize I needed to click on "Articles" or even "Topics" to start browsing. Not sure why, maybe I expected the image to let me enter the site somehow...?
gnerd00
7 days ago
A legal-terms question here also: several major world economies operate under very different rules regarding datasets and publication rights. I am in the USA (California). Will there be terms for me, given that I am not a giant deep-pockets FAANG, just a book person? Commercial-use terms at "small business" scale?
ahaspel
7 days ago
The 1911 text itself is public domain, so anyone is free to use it.
What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.
For casual or small-scale use there’s no issue at all. For bulk use (e.g. dataset / training / redistribution), I’d prefer people get in touch so I can figure out a sensible way to support that.
Kerrick
6 days ago
> What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.
If you live in the U.S., I recommend you read "No Sweat of the Brow Copyright": https://www.gutenberg.org/help/no_sweat_copyright.html
dessimus
7 days ago
It's been on Project Gutenberg for over 20 years: https://www.gutenberg.org/ebooks/13600
They only release books that are in the public domain.
bentley
6 days ago
> They only release books that are in the public domain.
Not necessarily. Project Gutenberg does provide some works still under US copyright, such as F. P. Walter’s 1999 translation of Twenty Thousand Leagues Under the Seas: https://gutenberg.org/ebooks/2488
gnerd00
6 days ago
better link here https://www.gutenberg.org/ebooks/search/?query=Encyclopaedia...
TremendousJudge
7 days ago
I guess such an old edition is in the public domain
ks2048
6 days ago
Nice job. How about Wikipedia-style links to other articles for topics mentioned within an article?