TheRoque
17 hours ago
To be honest, these companies already stole terabytes of data and don't even disclose their datasets, so you have to assume they'll steal and train on anything you throw at them
marssaxman
16 hours ago
"Reading stuff freely posted on the internet" constitutes stealing now?
Seems like an excessively draconian interpretation of property rights.
michaelmior
16 hours ago
"Reading stuff freely posted on the internet" is also very different from a business having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators. I'm not making a value judgement one way or the other, but "reading stuff freely posted on the Internet" is an oversimplification.
marssaxman
16 hours ago
Okay, but "stealing" is also an oversimplification, to the point of absurdity.
It makes no sense to put stuff up on the internet where it can freely be downloaded by anyone at any time, by people who are then free to do whatever they like with it on their own hardware, then complain that people have downloaded that stuff and done what they liked with it on their own hardware.
"Having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators" is equally a description of Google.
schwartzworld
9 hours ago
What if that data isn’t publicly posted? For example, copilot regurgitating code from private repos, complete with comments.
ehnto
15 hours ago
They are not free to do whatever they like, there are tomes of laws across all countries governing what someone can and cannot do with your intellectual property. Just because we didn't have the foresight to add in a "if by chance in the future someone invents artificial intelligence, that's not fair use" is a shame, but doesn't make what these companies are doing ethical or moral.
I don't disagree regarding Google, I also think they exploited others' IP for their own gain. It was once symbiotic with webmasters, but when that stopped they broke that implied good faith contract. In a sense, their snippets and widgets using others' IP and no longer providing traffic to the site was the warning shot for where we are now. We should have been modernising IP laws back then.
marssaxman
15 hours ago
I did say "free to do whatever they like on their own hardware", because intellectual property laws generally govern the transfer of such property rather than the use.
After seeing the harm done by the expansion of patent law to cover software algorithms, and the relentless abuse done under the DMCA, I am reflexively skeptical of any effort to expand intellectual property concepts.
godelski
13 hours ago
> on their own hardware
That doesn't make it technically legal. That only makes it not worth pursuing. You can sue Joe Schmoe for a million dollars but if he doesn't have that then you're not getting a dime. But if Joe Schmoe is using that thing to make money, well then... yeah you bet your ass that's a different situation and the "worth" of pursuing is directly proportional to how much he is making. Doesn't matter if it is his own hardware or not.
Like why do you think who owns the hardware even matters? Do you really think the legality changes if I rent a GPU vs use my own? That doesn't make any sense.
marssaxman
9 hours ago
In terms of copyright law, it matters very much whether Joe Schmoe is using his own copy of the data for his own purposes, or whether he is making more copies and distributing them to other people.
If the AI companies were letting people download copies of their training data, copyright law would certainly have something to say about that. But no: once they download the training data, they keep it, and they don't share it.
godelski
9 hours ago
> using his own copy of the data
Yes? That is a different thing? I guess we can keep moving the topic until we're talking about the same topic if you want. But honestly, I don't want to have that kind of conversation.
marssaxman
8 hours ago
How is it a different thing? Are we talking about copyright law, or not?
nerdponx
4 hours ago
It's not about the downloading of the data, it's about its use in training models, which is dubious from a copyright perspective.
vunderba
13 hours ago
> "Having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators" is equally a description of Google.
Quid pro quo. Those sites also received traffic from the audiences searching using Google. "Without compensation" really only became a thing when Google started adding the inlined cards which distilled the site's content thus obviating the need for a user to visit the aforementioned site.
godelski
13 hours ago
I'm not sure quid pro quo even matters. A search engine is more like providing a taxi service. You're just taking people to a place.
Now the AI summaries are a different story. One where there is no quid pro quo either. It's different when that taxi service will also offer the same service as that business. It's VERY different when that taxi service will walk into that business, take their services free of charge[0], and then transfer that to the taxi customer.
[0] Scraping isn't going to offer ad revenues
[Side note] In our analogy the little text below the link is more like the taxi service offering some advertising or some description of the business. Bit more gray here but I think the quid pro quo phrase applies here. Taxi does this to help customer find the right place to go, providing the business more customers. But the taxi isn't (usually) replacing the service itself.
sobkas
13 hours ago
The proper term for it is Computer Assisted Plagiarism, CAP for short. Also, I really hope that Google doesn't claim it created the sites it crawls for its search engine.
uncletscollie
11 hours ago
That is not at all how the internet works. Try to download music from Napster and Lars will sue your ass.
marssaxman
8 hours ago
No he certainly will not; you will only get sued if you upload Lars' music to share with other people. If you download an illegal copy, the person you downloaded from is the one breaking the law.
godelski
13 hours ago
> where it can freely be downloaded by anyone at any time, by people who are then free to do whatever they like with it on their own hardware
I think you have a strong misunderstanding of the law and the general expectation of others.
I'd like to remind you that a lot of celebrities face legal issues for posting photos of themselves. Here's a recent example with Jennifer Lopez[0]. The reason these types of lawsuits are successful is because it is theft of labor. If you hire a professional photographer to take photos of your wedding then the contract is that the photographer is handing over ownership of the photos in exchange of payment. The only difference here is that the photo was taken before a contract was made. The celebrity owns the right to their body and image, but not to the photograph.
Or think about Open Source Software. Just because it is posted on GitHub does not mean you are legally allowed to use it indiscriminately. GitHub has licenses and not all of them are unrestricted. In fact, a repo without a license does not mean unfettered usage. The default is that the repo owner has the copyright[1].
> You're under no obligation to choose a license. However, without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work.
A big part of what will make a lawsuit successful or not is if the owner has been deprived of compensation. As in, if you make money off of someone else's work. That's why this has been the key issue in all these AI lawsuits. Where the question is about if the work is transformative or not. All of this is in new legal territory because the laws were not written with this usage in mind. The transformative stuff is because you need to allow for parody or referencing. You don't want a situation where, say... someone including a video of what the president has said to discuss what was said[2]. But this situation is much closer to "Joe stole a book, learned from that book, and made a lot of money through the knowledge that they obtained from this book AND would not have been able to do without the book's help." Just, it's usually easier to go after the theft part of that situation. It's definitely a messy space.
But basically, just because a piece of art exists on public property does not mean you have the right to do whatever you want with it.
> is equally a description of Google.
Yes and no. The AI summaries? Yeah. The search engine and linking? No. The latter is a mutually beneficial service. It's one thing to own a taxi service and it is another to offer a taxi service that will walk into a Starbucks, take a random drink off the counter, and deliver it to you. I'm not sure why this is difficult to understand.
[0] https://www.bbc.com/news/articles/cx2qqew643go
[1] https://docs.github.com/en/repositories/managing-your-reposi...
pigeons
14 hours ago
But they didn't only train on information the creators made freely available. They trained on copyrighted materials obtained illicitly.
pigeons
7 hours ago
I know we're not supposed to comment about downvotes, but the original comment was talking about "these companies", and none of the information indicating that they, or at the very least Meta, trained on terabytes of books downloaded from zlib and libgen and other torrent sites, is in dispute. So even if you believe that copyright should not exist, I don't see why this is not a valid dispute of the parent's argument that they only trained on information creators made freely available.
bdamm
16 hours ago
We didn't seem to mind when Google was doing it back in 1999, or Lycos, Altavista, etc before them... why do we care about the LLM companies doing it now?
codazoda
15 hours ago
I find LLMs extremely useful but I think the difference is that they regurgitate the content (not verbatim) instead of a link to it. This is not unlike how a human might tell their friend about it.
bdamm
13 hours ago
Google has been regurgitating content right into search results since the very beginning, and they've been providing "synopsis" type of results for over a decade.
Nevermark
15 hours ago
> This is not unlike how a human might tell their friend about it.
Is there someone who has read the whole internet? Can we all be their friend?
The entire basis of fair use is scale matters.
nbulka
15 hours ago
Because they have terms of service they have to adhere to. We need laws to be lawful.
senko
13 hours ago
I consumed large volumes of data posted on the internet for decades, which generated a lot of value for me, without compensating the creators.
The only difference is that I (presumably) have a soul.
gist
8 hours ago
> "Reading stuff freely posted on the internet" is also very different from a business having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators.
The fact that value is being created is irrelevant. The fact that they are making profit is irrelevant. As is non compensation to creators. There isn't any law being broken. Is there?
Bottom line in real world terms there is no expectation of privacy with a freely open and unrestricted web site. Even if that website said 'you can use this for single use but not mass use' that in itself is not legally or practically enforceable.
Let's take the example of a Christmas light show. The idea might be (in the homeowner's mind) that people, families, will drive by in their cars to enjoy the light show (either a single home or the entire street or most of it). They might think 'we don't want buses full of people who paid to ride the bus' coming down the street. Unfortunately there is no way to prevent that (without the city and laws getting involved) and there is nothing wrong with the fact that the people who provide the bus are making money bringing people to see the light show.
jMyles
7 hours ago
> "Reading stuff freely posted on the internet" is also very different from a business having machines consume large volumes of data
...not if you believe in the right of general-purpose computing. If they have the right to read the data, why don't they have a right to program a computer to do it for them?
I think we all agree that they're not the good guys here, but this reasoning in particular is troubling.
TheRoque
16 hours ago
I'm not talking about that, I'm talking about downloading gigabytes of books, and movies and who knows what data (since it's not disclosed) without paying. Those are not freely posted on the internet. Well, not legally anyways.
themafia
12 hours ago
Faithfully reproducing something you've previously read while passing it off as your own original work is a violation of the most basic tenets of intellectual property rights.
Sohcahtoa82
11 hours ago
This is a quintessential bad faith comment.
The reference to terabytes of stolen data refers to copyrighted material. I think you know this but chose to frame it as "stuff freely posted on the internet" in order to mislead and strawman the other comment.
marssaxman
11 hours ago
I meant it exactly as I said it. I do not agree that any theft occurred, either in law or in spirit, and I believe that reinterpretation of intellectual-property law in order to make it a crime would cause significant harm, greatly outweighing the benefits, as has been the case with every other expansion of intellectual property law I have seen.
fcarraldo
10 hours ago
Anthropic downloaded books from Library Genesis and The Pirate Library mirror. This is factual and reported on from court documents.
What’s the angle that describes this as fair use?
[0] https://www.businessinsider.com/anthropic-cut-pirated-millio...
marssaxman
9 hours ago
The simple fact that they are not republishing any of that data. Fair use does not apply, because copyright does not apply, because nothing is being copied.
Wowfunhappy
7 hours ago
So you don't think downloading something from The Pirate Bay constitutes copyright infringement provided you don't republish it?
marssaxman
6 hours ago
Precisely. The person sharing is the one breaking the law.
TheRoque
5 hours ago
That's factually wrong, downloading without sharing is also illegal.
WA
12 hours ago
Forgot the 82TB of torrented books Meta has been using for training? I mean, yeah, it’s Meta. No surprise. But I won’t believe for one second that the other players didn’t do a similar thing. They just haven’t been caught yet.
exe34
13 hours ago
so I can take a screenshot from a movie trailer on YouTube and sell posters of it now? I thought copyright still applied to the poor.
timeon
16 hours ago
What "reading"?
marssaxman
16 hours ago
The same reading search engine crawlers have been doing since time immemorial.
ehnto
15 hours ago
No one gave them permission to access their webservers back then either. Before it's cited that there is precedent in law, that is in the US. No such precedent exists in my country, and our laws suggest that unauthorized access regardless of "gates up or down" would constitute trespassing. There are also no protections for scrapers coming out of prior lawsuits, and copying copyrighted material is of course illegal.
Which is just to point out that the world wide web is not its own jurisdiction, and I believe AI companies are going to be finding that an ongoing problem. Unlike search, there is no symbiosis here, so there is an incentive to sue. The original IP holders do not benefit in any way. Search was different in that way.
TheRoque
16 hours ago
Search engines never claimed that their content was original, and redirect to the original author (which gets proper retribution)
kridsdale1
16 hours ago
Looking at and gaining knowledge.
nbulka
15 hours ago
No you don't. You don't have to assume people are going to be bad! We should not normalize it either.
kolektiv
15 hours ago
You don't have to assume people are going to be bad, but it's reasonable and prudent to expect it from people who have already shown themselves to be so (in this context).
I trust people until they give me cause to do otherwise.
nbulka
15 hours ago
Training on personal data people thought was going to remain private vs. stuff out in public view (copyright or not), are two different magnitudes of ethics breaches. Opt OUT instead of Opt IN for this is CRAZY in my opinion. I hope that the reddit post is WRONG on that detail but I seriously doubt it.
I asked Claude: "If a company has a privacy policy and says they will not train on your data and then decides to change the policy in order "to make the models better for everyone." What should the terms be?"
The model suggests in the first paragraph or so EXPLICIT OPT IN. Not Opt OUT
locallost
14 hours ago
No, nbulka is correct. People should not shrug off and accept things that are wrong just because it's to be expected. It's one of the worst things you can do because as already pointed out, it just normalizes wrong.
szczepano
14 hours ago
You can and should safely assume people can do anything that's possible to do. Whether something is bad or good is a term of historical debate.
nickpsecurity
12 hours ago
Yours is the sanest interpretation of this.