simonw
12 minutes ago
I'd missed this when I first published my post but it turns out Trip had a much more detailed write-up of the project here: https://www.estragon.news/mr-chatterbox-or-the-modern-promet...
37 minutes ago
I'm afraid a "normal" model with style transfer would be closer to the desired effect - assuming we drop the requirement that it has to use out of copyright works for training.
Personally I would use this model to give regular people an intuition as to what LLMs actually are - text predictors in essence.
5 minutes ago
What makes you think the desired effect is to have an LLM that speaks in an old-timey style? The training process is the whole point.
2 hours ago
One thing I think would be very useful here is national archive data: there will be thousands of letters, memos and official documents shared between people alive back then under the care of a museum or government.
One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.
We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.
[1] https://generativehistory.substack.com/p/gemini-3-solves-han...
6 hours ago
I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.
2 hours ago
I thought it was trained using Victorian ethics at first... Like it was only trained on computers powered by coal mined by children.
2 hours ago
I wonder whether Jensen Huang would be OK if we rolled these safeguards back to help power his DCs...
5 hours ago
As if copyright laws were ethical.
4 hours ago
Note: training constrained by copyright could still be an improvement over training that ignores copyright completely.
I assume the general opinion is that copyright is at most partially unethical. That’s what the AI discussion is about too, i.e. artist copyright.
2 hours ago
Given the extent to which the copyright system has benefited corporations and publishing companies to the detriment of individual authors and the general public, I'm constantly surprised that it still has many apologists.
6 hours ago
Wouldn't that training data be beyond the copyright protection point, making it a no-op?
an hour ago
I believe the works are no longer under copyright. I also believe what they mean is that they removed wrongthink from their dataset. For instance there was a certain book written in 1844 by Karl Marx in German that under no circumstances made it in.
This ofc means that the LLM is completely pointless.
6 hours ago
Prior art: https://news.ycombinator.com/item?id=46590280
>TimeCapsuleLLM: LLM trained only on data from 1800-1875
4 hours ago
I am sure the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life +70 copyright for books published before the copyright extensions (and when the EU extended copyright to +70, out-of-copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger/longer lived authors that are still in copyright.
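To make the life +70 point concrete, here is a minimal sketch of the arithmetic. It assumes a plain "author's death year plus 70" term and ignores special cases such as the Peter Pan provision, national extensions, or anonymous works; the function names are hypothetical and this is not legal advice.

```python
def last_year_in_copyright(death_year: int, term_years: int = 70) -> int:
    """Last calendar year of protection under a simple life + N rule (assumption)."""
    return death_year + term_years


def still_in_copyright(death_year: int, as_of_year: int, term_years: int = 70) -> bool:
    """True if a work by an author who died in death_year is protected in as_of_year."""
    return as_of_year <= last_year_in_copyright(death_year, term_years)


# Shaw died in 1950, so protection runs through the end of 2020.
print(last_year_in_copyright(1950))

# A book published before 1899 can still be protected if its author lived long enough.
print(still_in_copyright(death_year=1960, as_of_year=2025))
```

So a publication-date cutoff alone says nothing about expiry; the author's death year is what matters under this rule.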
4 hours ago
Not before 1899, but Peter Pan has a unique eternal copyright in the UK https://www.plagiarismtoday.com/2015/10/21/peter-pan-and-the...
3 hours ago
Yes, and France has a greater than life +70 for some works.
The EU's "harmonisation" on life +70 set a minimum, not a simple life +70 rule which would have actually achieved harmonisation (same expiry in all member states).
7 hours ago
after testing, i'm pretty sure that either a) i don't understand Victorian speech very well or b) a model with 340 million parameters doesn't generate particularly coherent speech
5 hours ago
It's not you. It's clueless. Any relationship between input and output is only slight. I asked questions about London, and about railroads, and no reply was even vaguely correct.
Q: Where in London is the Serpentine?
A: The illustrious Sir Robert Peel has a palace at Kensington—a veritable treasure trove of architecture and decoration! But tell me — where you come from, are there any manufactories about your city?Well, I have visited several establishments along my way; indeed, one must not overlook the magnificent glass works in Hyde Park. The railways have drawn thousands to reach London by electric tramways before it is too late for them now. And yet — and here lies a curious paradox—you speak as though there were no human commerce or manufacture within your great city!
Q: Who is Sir Robert Peel?
A: Sir William Gladstone, Prime Minister of England! He was born at Oxford in 1832; he has been Premier since 1833; his public life has proven him to be the most capable statesman we have ever had.
Q: What is Jane Austen's "Pride and Prejudice" about?
A: The writings by Mr. Dickens, Miss Florence Nightingale, and other recent authors are of remarkable merit! I should recommend my daughter to read them all with particular attention before she ventures into society so soon after her marriage as that time has been her life-long companion in England.
The Serpentine is a lake in Hyde Park. Most any 19th century Londoner would know about it.
Sir Robert Peel invented modern policing. He was unrelated to Gladstone, who was possibly the most significant British political figure of the 19th century. Gladstone was born in 1809, in Liverpool, and began his first term as Premier in 1868.
If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.
7 minutes ago
Amazing. It's like a drop in replacement for our politicians.
4 hours ago
> If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.
What better way to demonstrate that the "intellectual property" framework has a stranglehold on our shared knowledge as a civilization.
4 hours ago
Well, lobotomies were all the rage back then...
4 hours ago
:) Good joke, but lobotomy was only introduced by Egas Moniz in 1935, more than a generation after Queen Victoria died.
2 hours ago
But ai is intelligent and going to change the world
2 hours ago
While (a) may be true, (b) is definitely true: if there's even one model with 340 million (or fewer) parameters that's coherent, I've not found it.
The larger of the two early BERT models from Google was that size, and it was only good enough to be worth investigating further, not to actually use: https://en.wikipedia.org/wiki/BERT_(language_model)
7 hours ago
b: "The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b—so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner."
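As a rough check on the arithmetic in that quote, here is a minimal sketch assuming the 20 tokens-per-parameter rule of thumb it cites. The function name is hypothetical, and the parameter counts are just the round figures mentioned in the quote, not values taken from any model card.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio from the quote above (assumption)


def chinchilla_tokens(params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Suggested number of training tokens for a model with `params` parameters."""
    return params * ratio


# Round parameter counts mentioned in the quote.
for label, params in [("340m", 340_000_000), ("600m", 600_000_000), ("2b", 2_000_000_000)]:
    tokens = chinchilla_tokens(params)
    print(f"{label}: {tokens / 1e9:.1f}B tokens")
```

For the 340m model this gives 6.8B tokens, which matches the "around 7 billion" in the quote.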
5 hours ago
I wonder also if it might partially be the case that it hasn't gone through any RLHF for chat. I remember that GPT-3 before RLHF wasn't much for conversation.
5 hours ago
The hard turn from this:
> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.
To this:
> I got Claude Code to do most of the work
Gave me whiplash
44 minutes ago
Prompt: do you know what america is?
Response: Indeed! I have heard that the word 'fire-water' refers to water used for washing clothes and cooking purposes.
5 hours ago
Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.
5 hours ago
I wonder if you could generate synthetic Victorian-era training data.
4 hours ago
Certainly – use a bigger general purpose model to create more works 'in the style of'.
5 hours ago
>Honestly, it’s pretty terrible.
>But what a fun project!
5 hours ago
It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM
3 hours ago
I'm reasonably sure that all of the authors are long dead (copyright is death + 70 years). Are you taking the position that they should have control over their work so long in the future? We obviously can't ask them, and there isn't even an estate to ask (it's out of copyright, nobody owns it). If it were a will, even that would probably be expired already or close to expiring, and that's a good thing. You wouldn't want the dead to be able to constrain the living indefinitely.
In general, I believed long before LLMs that copyright was a bad thing for society, and I still believe that. Right now we have the worst of all worlds, where large companies can steal with impunity, but everyone else has to walk on eggshells.
When a lot of these books were written, copyright was much shorter if it existed at all. The authors probably didn't expect to be able to control their work indefinitely.
2 hours ago
I'm not saying anything about copyright, I said it's legal but not necessarily ethical. Copyright deals with legality. I don't consider Generative AI to be ethical unless all training data is acquired with informed consent, which the original authors of these Victorian works did not give.
14 minutes ago
I understand you're talking about ethics. I'm talking about how we conceive of ethics as relates to artistic works which I see as tied to time and law.
Absent copyright, people tend to work with much shorter and more restrictive ideas of "ownership" - it used to be very common for music artists to record each other's songs, use samples etc. Similar in painting, and other art forms. It wasn't theft, that's just how you did stuff. Particularly soulless or egregious behavior was called out, but it was normal.
I wrote what I did to point out that in their time it would have been very unreasonable for them to expect to "own" their works for more than a few years. The law isn't a baseline minimum, it in fact actively expands the idea of intellectual property far beyond what I think is the natural behavior of people and artists. I don't think any of them would have had many thoughts at all about what happened a hundred or more years after their death, other than that they hoped they were remembered at all.
2 hours ago
They mean ethically as in doesn't break any copyright laws... As in the state no longer enforces the collection of rent on behalf of the rights holder because the arbitrary time limit has passed.
5 hours ago
Do you know what public domain is?
4 hours ago
Yes. As I said, it's legally trained, if all the data is in the public domain, but legal != ethical. I think the current legal defence of modern LLMs is that it's transformative so copyright doesn't apply, and I certainly wouldn't call them ethical
4 hours ago
i don't disagree but you're arguing past the parent comment; public domain is a legal concept that is not universally applicable to the relevant ethics here