Facebook scraped every Australian adult user's public posts to train AI

273 points | posted 7 days ago by elashri

266 Comments

shsbdncudx

6 days ago

Presumably “scraped” isn't the right term here. They already have the raw data; they won't be “scraping” it from the website, they'll just be ingesting it from where they store it.

camkego

6 days ago

It’s interesting to think about what the right verb is.

I'd probably say Meta trained their models using all self-hosted, public AU citizens' data.

But it doesn’t really sound as scary as “scraped” to non-technical users.

themoose8

6 days ago

It does feel like "scraped" is being used for its negative connotations.

Perhaps "consumed" would work just as well and be more accurate.

Cyclone_

6 days ago

Agreed; scraping is more appropriate for when one gathers data from a third-party site.

fwn

6 days ago

The authors of the title most likely wanted to suggest a similarity between Meta's use of the data and scraping.

Some jurisdictions have rules about scraping. And to many less technical people it sounds scary.

1oooqooq

6 days ago

Au contraire. By calling it the wrong thing you get the clicks and muddle the legislative discussion. The conversation will start with anti-scraping rules, but then every user "gave" Metabook the images and accepted the privacy terms...

michaelnny

6 days ago

Maybe they have no idea what the difference is between "scraping" and "mining"? I mean, for non-tech people these are just buzzwords…

royaltjames

6 days ago

Feels like they are scraping the near-empty jar of mayonnaise.

parasti

7 days ago

It's funny because the entire Facebook ecosystem is designed to disincentivize meaningful posting. Just keep watching the ads and short form videos, user.

encoderer

7 days ago

That's nothing. AOL has just finished training on 29 years of emails and messages. It's hoped that with more H100s the AI will finally be able to calculate the full amount due by BillG for the emails mom has been forwarding.

tdeck

7 days ago

TIL AOL still exists.

nicbou

6 days ago

It's even creepier that the company no longer meaningfully exists, but the data lives on.

meiraleal

6 days ago

AOL exists on every modern walled garden (unfortunately)

Noumenon72

7 days ago

Was "public" ever the default setting? I remember it as being opt-in if you ever wanted something to show beyond your friends-of-friends.

grandma_tea

7 days ago

I have a very different memory of my time on Facebook 10 or so years ago... It felt like every two weeks some update would change my settings to "public" in some way.

not2b

7 days ago

On Facebook, public isn't the default, but on Instagram it is. All those billions of photos, including all those famous people: evidently fair game.

giobox

7 days ago

I don't think you can make a blanket statement like this - the defaults and privacy policy for FB have changed a lot in 16 years, multiple times. It might be the case today, but this is 16 years of data.

mattcantstop

7 days ago

I am very likely in the minority here, but I think AI SHOULD be trained on everything that is in the public sphere. I'd be disappointed if it wasn't trained on everything they had access to.

If it is trained on private information, then I would have issue with it.

trimbo

7 days ago

I don't agree because it creates this dilemma for creators: you need to put your work out there to get traction, but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale. This might even happen without the operator knowing whose work is being ripped off.

Commercial art producers have always ripped off minor artists. They would do it by keeping it very similar to the original but just different enough to avoid being sued. Despite this, I personally know two artists who have sued major companies who ripped off their work for ads, and both won million-plus settlements. Why would we embrace this now that a computer can do it and there's a level of deniability? I don't understand how this benefits anyone.

Ukv

7 days ago

> Why would we embrace this now that a computer can do it and there's a level of deniability?

Generally I don't think people are arguing that copyright law should be more lenient to AI than it is to humans. If your work gets ripped off (a substantially similar copy not covered by fair use) you can sue regardless of tools used in its creation.

Question would be whether machine learning, unlike human learning, should be treated as copyright infringement. There are differences and the law does not inherently need to treat them the same, but it could.

As to why it should: I think there's huge benefit across a large range of industries to web-scale pretraining and foundation models, and I'd like it to remain accessible to open-source groups or smaller companies without huge data moats. Realistically I think the alternative would likely just benefit Getty/Universal with near-identical outcomes for most actual artists.

When the very basis of copyright is for the "progress of sciences and useful arts", it seems backwards to use it in a way that would set back advances in language translation, malware/spam/DDoS filtering, defect detection, voice dictation/transcription, medical image segmentation, etc.

marcosdumay

7 days ago

> Question would be whether machine learning, unlike human learning, should be treated as copyright infringement.

No, the question is whether the genAI systems we have around are mass copyright-violation machines or whether they "learn" and build non-violating work.

And honestly, I have seen evidence pointing both ways. But the "copyright protection" institutions are all quick to decide the point, dismissing any evidence on philosophical grounds.

Ukv

7 days ago

> No, the question is whether the genAI systems we have around are mass copyright-violation machines or whether they "learn" and build non-violating work.

I refer to the training process in question, which may or may not be violating copyright, as "machine learning" since that's the common terminology. The question is whether that process is covered by fair use. Whether or not it actually "learns" is not irrelevant, but I'd say it's more a philosophical framing than a legal one.

marcosdumay

7 days ago

> I refer to the training process in question

Yeah, you go for the red herring.

All of the worthwhile debate is about the real violations. But the public discourse is surely inundated with that exact red herring.

Ukv

7 days ago

I addressed model output (infringes copyright if substantially similar, as with manually-created works) and the process of training the model (requires collating/processing ephemeral copies, possibly fair use). What do you think the "real violations" are, if not those?

skissane

6 days ago

> Generally I don't think people are arguing that copyright law should be more lenient to AI than it is to humans. If your work gets ripped off (a substantially similar copy not covered by fair use) you can sue regardless of tools used in its creation.

With humans, copyright law deals with knowing and intentional infringement more severely than accidental and unintentional infringement.

With an AI, any infringement on the part of the AI end-user is very likely going to be accidental and unintentional rather than knowing and intentional, so the legal system is going to deal with it more leniently, even if actual infringement is proven. The exception would be if you deliberately prompted it to create a modified version of a pre-existing copyrighted work.

With humans, whether infringement is knowing or not, intentional or not, can turn into a massive legal stoush. Whereas, if you say it is AI output, and it appears to actually be AI output, it is going to be much harder for the plaintiff (or prosecution) to convince the court that infringement was knowing and intentional.

autoexec

7 days ago

> but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale.

That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

> I personally know two artists who have sued major companies who ripped off their work for ads, and both won million-plus settlements.

Ultimately "AI did it" should never be allowed to be used as an excuse. If a company pays for a marketing guy who rips off someone's work and they can be sued for it, then a company that pays for an AI that rips off someone's work should still be able to be sued for it.

trimbo

7 days ago

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied

Until now, this has been an acceptable tradeoff because there's some friction to theft. Directly cloning the work is easy, but that also means an artist can sue or DMCA. It also means the original artist's work can go more viral, which, despite the short-term downsides, can help their popularity long term.

The important difference is that imitating an artist's style with new work used to take significant time (hours or days). With an LLM, it takes milliseconds, and that model will be able to churn out the likes of your work millions of times per day, forever. That's the difference, and why the dilemma is new.

> Ultimately "AI did it" should never be allowed to be used as an excuse

With the exception of an LLM directly plagiarizing, the only way to prove it didn't is by not allowing it to train on something. LLMs are the sum of everything. We could say the same about humans, sure; we are a model trained on everything we've ever seen too. But humans aren't machines that can recreate stuff in the blink of an eye, with nearly perfect recall, at millions of qps.

panarky

7 days ago

"That's just how the internet works" is nonsensical when AI is changing how the internet works.

Just because the tradeoffs of sharing on the internet used to work before AI, doesn't mean those tradeoffs continue to be workable after AI.

It's like having drones follow everyone around and publish realtime telephoto video of them because they have "no expectation of privacy" in public places.

Maybe before surveillance tech existed, there was no expectation of privacy in public places, but now that surveillance tech exists, people naturally expect that high-res video of their every move won't be collected, archived and published even if they are in public.

autoexec

7 days ago

> Maybe before surveillance tech existed, there was no expectation of privacy in public places, but now that surveillance tech exists, people naturally expect that high-res video of their every move won't be collected, archived and published even if they are in public.

Currently, that'd be an unrealistic expectation. I'd agree that it would be nice if that weren't the case, but laws need to catch up with technology. Right now, AI doesn't change things too much, since a company that publishes something that violates copyright law is still breaking the law. It shouldn't matter whether an AI was used to create the infringing copy or not.

I'm all for new laws giving extra rights to people on top of what we already have if needed, but generally copyright law is already far too oppressive so I'd need to consider a specific proposed law and its impacts.

3np

6 days ago

The topic of expectations reminds me of this article

https://spectrum.ieee.org/online-privacy

autoexec

6 days ago

I think that "shifting baseline syndrome" is a major issue, but on the privacy side of things people don't seem to really understand where we're at currently, and they seem to be very good at lying to themselves about it.

You can find youtube videos of people outright screaming at photographers in public, insisting that no one has a right to take a picture of them without their permission while the entire time they're also standing under surveillance cameras.

When it's in their face they genuinely seem to care about privacy a lot, but they also know the phone in their pocket is so much more invasive in every way. They've been repeatedly told that they're tracked and recorded everywhere. They sign themselves up for it again and again. As long as they don't see it going on right in front of them in the most obvious way possible, I guess they can lie to themselves in a way that they can't when they see a man with a camera. But even though on some level they already know that the street photographer is easily the last thing that should concern them, they still get upset to the point where they're screaming in public. I really don't understand it.

jprete

6 days ago

There's little that any individual can realistically do about smartphone spying. There are situations where it's borderline unworkable to not have a smartphone. The importance of the phone increased greatly over the same period during which companies increased tracking or at least acknowledged that they were doing it.

Leaving that aside, people probably react that way because the corporation just wants to gather advertising data from everyone, impersonally, while the photographer is taking a direct and specific interest.

autoexec

6 days ago

> Leaving that aside, people probably react that way because the corporation just wants to gather advertising data from everyone, impersonally,

There's nothing more personal than the collection, use, and sale of every intimate detail of your life. And corporations don't just want to gather advertising data. They want to collect as much data as they possibly can in order to use it in any and every way that might somehow benefit them and the vast majority of the time that means using it against you. It stopped being about advertising decades ago.

That data is now used to set the prices you pay, it determines what jobs you get, it influences where you are allowed to live. Companies have multiple versions of their policies and they use that data to decide which version they will apply to you, how you will be treated by them, even how long they leave you on hold when you call them. That data is used to extract as much money from you as possible. It's used to manipulate you and to lie to you more effectively. It is used against you in court rooms. It can get you questioned or arrested by police even if you've done nothing wrong. It gets bought by scammers, extremists, and activists looking for targets. Everyone who collects, sells, and buys your data is only looking to help themselves so that data only ever ends up hurting you.

More and more that data has very real world consequences on your daily offline life. You're just almost never aware of it. A company that charges you 10% more than the last person when you buy the same item isn't going to tell you that it was because of the data they have on you, you just see the higher price tag and assume it applies to everyone. The company that doesn't hire you because of something they found in the dossier they bought from a data broker isn't going to inform you that it was a social media post from 15 years ago that made them pass you over, or the fact that you buy too much alcohol, or that you have a history of depression, they'll just ghost you.

If the data companies collect only ever determined what ads you see nobody would care, but that data is increasingly impacting your life in all kinds of ways. It never goes away. You have no ability to correct errors in the record. You aren't allowed to know who has it or why. You can't control what anyone does with it.

The guy taking pictures and video on the street probably isn't looking to spend the rest of his life using that footage against you personally, but that's exactly what the companies collecting your data are going to do with it and if/when they die they'll sell that data to someone else before they go and that someone else will continue using it to try and take something from you.

jofla_net

7 days ago

Yup, this is just a new-age tragedy of the commons. As soon as armies of sheep come to graze, or consume your content, the honeymoon's over.

autoexec

7 days ago

> With an LLM, it takes milliseconds, and that model will be able to churn out the likes of your work millions of times per day, forever.

AI does cause a lot of problems in terms of scale. The good news is that if AI churns out millions of copies of your copyrighted works you're entitled to compensation for each and every copy. In addition to pushing out copies of copyrighted material, AI is also capable of writing up DMCA notices and legal paperwork.

> With the exception of an LLM directly plagiarizing, the only way to prove it didn't is by not allowing it to train on something. LLMs copy everything and nothing at the same time.

An AI's output should be held to the exact same standard as anyone else's output. If it's close enough to someone else's copyrighted work to be considered infringing then the company using that AI should be liable for copyright infringement the same way they would be if AI had never been involved. AI's ability to produce a large number of infringing works very quickly might even be what causes companies to be more careful about how they use it. Breaking the law at speeds approaching the speed of light isn't a good business model.

skeledrew

6 days ago

Outside of competing profit motives, there is no dilemma. It's that underlying motive, and its root, that will have to undergo a drastic change. The Pandora's box of AI is already open and there's no closing it; all that's left is dealing with the consequences.

wpietri

7 days ago

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

You could make the same argument about paper. "That's just how photocopiers work! If you don't want your creations to be endlessly duplicated and sold, don't write them down!" Heck, you could make the same argument about leaving the house. "That's just how guns work! Don't go out in public if you don't want to take the risk of getting shot!"

But it's a bad argument every time. That something is technically possible doesn't make it morally right. It's true that a big point of technology is to increase an individual's power. But I'd say that increased power doesn't diminish our responsibility for our actions. It increases it.

autoexec

7 days ago

> You could make the same argument about paper. "That's just how photocopiers work! If you don't want your creations to be endlessly duplicated and sold, don't write them down!"

No, the argument would be about photocopies, not paper. "That's just how photocopiers work! Don't put something into a photocopier if you don't want photocopies of it." It isn't possible for anyone to access anything on the internet without making copies of that thing. Copies are literally how the internet works.

Shooting everyone who steps outside isn't how guns work either so that also fails as an analogy.

The internet was specifically designed for the global distribution of copies. If that isn't what you want, don't publish your works there.

> That something is technically possible doesn't make it morally right.

Morality is entirely different from how the internet works, but in practice, I don't see anything immoral about making a copy of something. Morality only becomes an issue when it comes to what someone does with that copy.

ulbu

7 days ago

> If that isn't what you want, don't publish your works there.

"Women are oppressed in Iran. Well, that's just how Iran is. Just leave it if you don't want to be oppressed"

Oh my. Yea, and whatever is some way, is that way – "it is how it is, deal with it". It's an empty statement. The topic is an ethical and political discussion in light of current technologies. It's a question of whether it should work this way. That's how all moral questions come about – by asking if something should be the way it is. And the current state of technology brings a dilemma that hasn't existed before.

And no, the internet was not designed for that. Quite obviously. Sounds like you haven't heard of private messages.

I'm very surprised this has to be stated.

autoexec

7 days ago

> "Women are oppressed in Iran. Well, that's just how Iran is. Just leave it if you don't want to be oppressed" Yea, and whatever is some way, is that way – "it is how it is, deal with it". It's an empty statement.

No, because Iran can stop oppressing women and still exist as a functional country. Oppressing women today is "how it is". The internet, on the other hand, is designed to be a system for the distribution of copies. That isn't "how it is", but rather "what it is".

The internet cannot do anything except distribute copies and anything that doesn't distribute copies wouldn't be the internet.

> Sounds like you haven't heard of private messages.

Private messages are also not what is being discussed here. The comment being discussed said: "I don't agree because it creates this dilemma for creators: you need to put your work out there to get traction, but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale."

"anything public". For what it's worth though, private messages are still copies.

ulbu

6 days ago

All received messages are copies of the original. A broadcast is a copy. Language itself is an incessant copying. So it's a truism to say that of the internet, and a generality that doesn't apply to specifics. Downloading cracked software is also copying, but that genus is irrelevant to the discussion of it; its morality is beside its being a copy, even though it is essential that it be a copy. Likewise with other data.

We don't have rules set yet; that's why this discussion is active, i.e., not just a niggle from after a couple of beers. It's a question of respect for the author.

sabbaticaldev

6 days ago

Yea, so everybody can copy a book just like the internet, and nobody is prosecuted for memorizing it.

wpietri

7 days ago

Yes, if one over-narrowly construes any analogy, it can be quickly dismissed. I suppose that's my fault for putting an analogy on the internet.

We've had copying technologies since people invented the pen. It was such an important activity that there were people who spent their whole lives copying texts.

With the rise of the printing press, copying became a significant societal concern, one so big that America's founders put copyright into the constitution. [1] The internet did add some new wrinkles, but if anything the surprise is that most of the legal and moral thinking that predates it translated just fine to the internet age. That internet transmission happens to make temporary copies of things changed very little, and certainly not the broad principles.

I understand why Facebook and other people lining their pockets would like to claim that they are entitled to take what they want. But we don't have to believe them.

[1] https://constitution.congress.gov/browse/essay/artI-S8-C8-1/...

autoexec

7 days ago

I don't think that Facebook should be allowed to violate copyright law, but clearly they have the same rights as you do to copy works made publicly available on the internet.

wpietri

7 days ago

We are talking about more than the current law here. We're talking about what the law should be, based on what people see as right. And I'd add that Facebook is doing a lot more here than just quietly having a copy of something.

autoexec

7 days ago

If the concern isn't copyright infringement what would the new law be about? What's the harm people want solved? Is it just some philosophical objection, like people not liking the idea of other people doing something they don't like? Is it fear of future potential harms that haven't been seen in real life yet?

Is it just "someone else may be using what I published publicly to make money somehow in a way that is legal and doesn't infringe on my copyrights but I still don't like it because they aren't giving me a cut?" What would the new law look like?

wpietri

6 days ago

I think you haven't really grappled with how laws get made.

Copyright is a thing we made up because of "people not liking the idea of other people doing something they don't like". The current specific boundaries are a careful compromise about exactly when we protect "using what I published publicly to make money somehow".

Those boundaries in large part exist because of responses to specific technologies. Before the technologies, people didn't care. After, people got upset and then we changed the laws, or the court rulings that provide detailed meaning to the laws shifted.

As an example, you could look at moral rights, which are another intellectual property right the laws for which came much later than copyright. Or you could look at how intellectual property law around music has shifted in response to recording, and again in response to sampling. Or you could look at copyright for text, which was invented in response to the printing press. And more specifically in response to some people using pioneering technology to profit from other people's creative work.

And we might not need any changes in the law here. The people at OpenAI and elsewhere know they're doing something that could well be illegal. They've been repeatedly sued, and they've chosen to make deals with some data-holders. They wouldn't be paying for published data at all if they knew they were in the clear, but they've chosen to cut many deals with publishers powerful enough to sue. They're hoovering up data anyhow because, like too many startups, they've decided they'll do what they want and see if they can get away with it.

orthecreedence

7 days ago

> You could make the same argument about paper.

Most paper doesn't come with Terms and Conditions that everything you write on it belongs to the paper company. I hate Facebook (with a fiery passion) but people gave them their data in exchange for the groundbreaking and unprecedented ability to make friends with another person (which has never been done before). It sucks, but don't use these "free" systems without understanding the sinister dynamics and incentives behind them.

People make the same arguments about the NSA. "They aren't doing anything bad with the data they're collecting about every US citizen." Well, at some point they will. Stop borrowing against future freedom for a tiny bit of convenience today.

wpietri

7 days ago

I think you're confusing a legal point (whether a T&C really gives Facebook any particular legal right in court) with the moral question of whether or not people should just roll over for large companies because of language we all, Facebook included, know that nobody ever reads.

Even if FB's T&C made it clear they could do this (something I haven't seen proven), that at best means people would have a hard time suing as individuals. They can still get upset. They can still protest to the regulators and legislators whose job it is to keep these companies in line, and who create the legal context that gives a T&C document practical meaning.

capital_guy

7 days ago

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

And if someone takes a picture of your artwork, or takes a picture of your person, and posts that to the internet without your consent? Have you given up your rights then?

My answer: Absolutely not.

samatman

7 days ago

What AI does is much more like the Old Masters approach of going to a museum and painting a copy of a painting by some master whose technique they wish to learn. This has always been both legal, and encouraged.

Or borrowing a thick stack of books from the library, reading them, and using that knowledge as the basis for fiction. That's a transformative work, and those are fine as well.

My take is that training AI models is a bespoke copyright situation which our laws were never designed to handle, and finding an equitable balance will take new law. But as it stands, it's both legal and encouraged for a human to access a Web site (thereby making a copy) and learn from the contents of that website.

That is, fundamentally, what happens when an LLM is trained on corpus data. The difference in scale becomes a difference in kind, but as I said, our laws at present don't really account for that, because they weren't designed to.

LLMs sometimes plagiarize, which is not ok, but most people, myself included, wouldn't consider the dilemma satisfactorily resolved if improvements in the technology meant that never happened. Outside of that, we're talking about a new kind of transformative work, and those are legal.

pvaldes

7 days ago

> This has always been both legal, and encouraged.

Not always. The copy must be easily identifiable as a copy. An exact reproduction can't have the same dimensions as the original, for example.

Drawing just a person or a detail of the picture, or redoing the picture in a different context or style, is encouraged.

Selling a full-scale photo of the picture is forbidden. The copyright of famous art belongs to the museum.

samatman

6 days ago

The second example is better than the first, yes. I was thinking about the process more than the fact that painting a study produces a work, and a derived one at that, so more normal copyright considerations apply to the work itself.

> An exact reproduction can't have the same dimensions as the original

This is a rule, not a law, and a traditional and widespread one. Museums don't want to be involved in someone selling a forgery, so that rule is a way of making it unlikely. But the difference between "if you do this a museum will kick you out" and "this is illegal" is fairly sharp.

> The copyright of famous art belongs to the museum.

Not in a great number of cases it doesn't, most famous art is long out of copyright and belongs to the public domain. Museums will have copyright on photos of those works, and have been known to fraudulently claim that photos taken by others owe a license fee to the museum, but in the US at least this isn't true. https://www.huffpost.com/entry/museum-paintings-copyright_b_...

1oooqooq

6 days ago

Nice anthropomorphized scapegoating.

The correct analogy is someone taking pictures of the paintings, going home and applying a Photoshop filter, erasing the original signature and adding their own.

The law already very much covers that.

autoexec

7 days ago

If someone takes a picture of me while I'm in public that picture is their copyrighted work and they have every right to post that on the internet. There is no expectation of privacy in public, and Americans have very few rights against other people using photos/video of them (there are some exceptions for things like making someone into your company's spokesperson against their will)

If someone took a photo of my copyrighted work, their photo becomes their copyrighted work. They also have a right to post that picture on the internet without my consent. Every single person who takes a picture of a painting in a museum and posts it to social media is not a criminal. There are legal limitations there too, however, and that's fine, because we have an entire legal system created to deal with them, and it didn't go away when AI was created.

If a company uses AI to create something that under the law violates your copyright you can still sue them.

Buttons840

7 days ago

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

This is true for average people. Is it true for the wealthy? Is it true for Disney? Does our law acknowledge this truth and ensure equal justice for all?

autoexec

7 days ago

It's 100% true for everyone. You can't access anything at disney.com without making a copy of that thing. Disney can't access anything at yourdomain.whatever without making a copy of that thing.

Whatever crimes either of you can get away with using your copies is another matter entirely. Any rights you had under the legal system you had before AI haven't gone away, neither have the disadvantages you have against the wealthy.

Buttons840

7 days ago

One of the comments you replied to was complaining that their work would be copied and used in training LLMs or other lucrative algorithms, and then you responded talking about how it's common to temporarily copy data into RAM to show a web page. Those are very different, and bringing up such technical minutiae is not helpful to the discussion.

If someone asks "how can I share my work online without it being copied?", "actually, you can't share it without people copying it into RAM" is not the answer they're looking for. That answer is too technical, too focused on minutiae, and our laws recognize that.

autoexec

7 days ago

The point is that "copies" was never the problem. "sampled by a computer and instantly recreated at scale" is the expected outcome of publishing something publicly on the internet.

Their problem was copyright infringement and like you said, our laws recognize that problem. We have an entire legal framework for dealing with companies that publish infringing copies of copyrighted works. None of that has changed with LLMs.

If a company publishes something that violates copyright law they can be sued for it, it shouldn't matter if an AI was involved in the creation of what was published or not.


guerrilla

7 days ago

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

Or we could be ethical and encourage others to be ethical.

vasco

7 days ago

I see you're one of the ones that wouldn't download a car.

pbhjpbhj

7 days ago

I would share a car I had rights to, and download a car made free to me. Facebook would certainly sue me if it were their car; they should thus be held to that standard, in my personal opinion.

schmorptron

7 days ago

We could make a distinction between individuals and companies doing it.

guerrilla

7 days ago

Depends on the risk assessment, but I'd say I'm a lot more like Robin Hood. Facebook is obviously Prince John.

kaashif

7 days ago

Okay, but that doesn't change how the Internet works.

Encouraging people to be ethical isn't actually a real way to prevent people copying photos you put up online.

orthecreedence

7 days ago

We can encourage profit-driven megacorps to be ethical? Sure, by abolishing them. Otherwise, you're just screaming into the void.

guerrilla

7 days ago

I think what I said is a prerequisite for that. There will be no structural changes without widespread cultural changes.

grumbel

7 days ago

> then it will be sampled by a computer and instantly recreated at scale.

You don't need to train the AI on the work for that. You don't even need to show the AI the work itself. You can just give the AI a vague description of the work and it is able to replicate something very close to it.

That's something you can try today: hand Claude or ChatGPT an image, let them describe the image, and put that description into your favorite image generator. The output is a clean-room clone of the original. It won't be a photocopy, but it will contain all the significant features that made up the original, even with surprisingly short descriptions of just 100 words.
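
(A minimal sketch of that describe-then-regenerate loop, assuming the OpenAI Python SDK; the model names, prompt, and image URL are placeholders, and this illustrates the idea rather than any lab's actual pipeline:)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Step 1: ask a vision-capable model for a ~100-word description.
    description = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in about 100 words: "
                         "composition, palette, style, key features."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/original.png"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: hand only the text to an image generator; the original
    # image itself never reaches the generator.
    result = client.images.generate(model="dall-e-3",
                                    prompt=description,
                                    size="1024x1024")
    print(result.data[0].url)

(The describing model and the generator share nothing but that text; the text bottleneck is what makes it "clean-room".)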

It won't be long before you can hand the AI a movie trailer and it will build you the rest of the movie from it.

> Why would we embrace this now that a computer can do it

You can't stop it long term. That old "How To Draw an Owl" meme is reality now. You give the AI some key points and it will fill in all the rest. The issue here isn't so much copyright, but that we'll be so flooded with content that it will be impossible for anybody to stand out. We might be heading towards the death of static content, into a world where everything is generated on the fly.

solardev

7 days ago

Well, just as another perspective...

I'm not convinced that the philosophy of copyright is a net positive for society. From a certain perspective, all art is theft, and all creativity builds upon preexisting social influences. That's how genres develop, periods, styles... and yes, blatant ripoffs and copycats too.

If the underlying goal is to be able to feed creators, maybe society needs better funding models...? The current one isn't great anyway, with 99% of artists starving and 1% of them becoming billionaires.

I'd much prefer something more like the model we have for some open-source projects, where an employer (or other sponsors) pays the living wage for the creator, but the resulting work is then reusable by all. Many works of the federal government are similarly funded, where a government employee is paid by your taxes but their resulting work automatically goes into the public domain without copyright.

I don't buy the argument that nobody would make things if they weren't copyrightable/paid directly. Wikipedia, OSM, etc. are all living proof that many people will volunteer their time to produce creative things without any hope of ever getting paid. As a frequent contributor to those and also open-source code, Creative Commons photography, etc., a large part of the joy for me is seeing how my work gets reused, transformed, and sometimes stolen by others (credit is always nice, but even when they don't mention me, at least I know the work I'm doing is useful to people).

But the difference for me is that I don't rely on those works to put food on the table. I have a day job and can afford to produce those works in my spare time.

I wish all would-be creators had such a luxury, either via an employer relationship or perhaps art grants and the like. I wonder how other societies handle this... back in the day, I guess there were rich patrons, while some communities sponsor their artists for communal benefit. Not sure what works best, but copyright doesn't have to be the only way society could see creative outputs.

ruthmarx

6 days ago

> I'm not convinced that the philosophy of copyright is a net positive for society.

It absolutely is, just not in its current overpowered form.

> where an employer (or other sponsors) pays the living wage for the creator, but the resulting work is then reusable by all.

Some creators want control over their own narrative, and that's entirely reasonable, at least for a limited time.

> I don't buy the argument that nobody would make things if they weren't copyrightable/paid directly.

That was never the argument as far as I'm aware. There are other concerns, like a creator losing all control of their creation before they had a chance to even finish what they wanted to do/tell.

marcosdumay

7 days ago

> I'm not convinced that the philosophy of copyright is a net positive for society.

I'm OK with that. But the philosophy of copyright is not under debate here. All that is being debated is whether it should protect small people from big corporations too.

solardev

7 days ago

It's not? I thought we were talking about "AI SHOULD be trained on everything that is in the public sphere" and "[your work] will be sampled by a computer and instantly recreated at scale. [...] Commercial art producers have always ripped off minor artists". Isn't that all about copyright and the ability to make money off your creative works?

When I put something on Wikipedia or any other commons, I don't worry about which other person, algorithm, corporation, or AI ends up reusing it.

But if my ability to eat tomorrow depended on that, then I would very much care. Hence, copyright seems an integral part of people's ability to contribute creatively.

My argument is that by detaching their income from the reusability of their work, we would be able to free more creators from that constraint. Under such a system, the little guy would never get rich off their work, but they wouldn't starve when a big corporation (or anyone else) rips them off either.

jstummbillig

7 days ago

This actually benefits everyone.

If our combined creative work until this point is what turns out to be necessary to kick-start a great shot at abundance (and if you do not believe that, if it's all for nothing, why care at all about the money wasted on models?) it might simply be our societal moral obligation to endorse it -- just as it will be the model creators' moral obligation to uphold their end of this deal.

Interestingly, Andrej Karpathy recently described the data we are debating as more or less undesirable to build a better LLM and accidentally good enough to have made it work so far (https://youtu.be/hM_h0UA7upI?t=1045). We'll see about that.

Guvante

7 days ago

I want to see any indication that abundance from AI would benefit mankind first.

While I would love Star Trek, society has been moving very much towards a cyberpunk aesthetic, aka "the rich hold all the power".

To be precise, AI models fundamentally need content to survive, but they need so much content that there is no price that makes sense.

Allowing AI to monetize without enriching the people who allowed it to exist isn't a good path forward.

And to be clear I do not believe there is a fundamental rift here. Shorten copyright to something reasonable like 20 years and in a decade AI will have access to all of the data it needs guilt free.

dyauspitr

6 days ago

There are glimpses. Getting a high score on an Olympiad means there is the possibility of being able to autonomously solve very difficult problems in the future.

roenxi

6 days ago

Where the puck is about to be is very different from where it is. Generative AI hasn't cracked the creativity problem yet. It can generate new art but it can't develop its own style like a human can (from first principles, humans basically caricature high quality video feed).

There is pretty good reason to believe that this will be a solved problem inside a decade. We're moving towards processing video and the power put behind model training keeps increasing. How much is a style worth when computers can just generate 1,000 of them at high speed? It is going to be cheap enough that legal protection is almost irrelevant; ripping off a style will probably be harder than just creating a new original one.

We can wait a bit to find out where the equilibrium is before worrying about what the law should be.

lolc

6 days ago

I'm not convinced machines can come up with styles like humans can. After all, a style will be judged by humans. How humans respond cannot be determined from previous styles.

mlazos

7 days ago

> I don't agree because it creates this dilemma for creators: you need to put your work out there to get traction, but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale. This might even happen without the operator knowing whose work is being ripped off.

This is no different from the current day; copying already happens (as your friends have seen). AI makes it a little easier, but the same legal frameworks cover this - I don't see why AI stealing is any different from a person doing the same thing. The ability to copy at zero cost was incredibly disruptive and incredibly beneficial to society. Settled case law will catch up and hopefully arrive at the same conclusion it has for human copyright infringement (is it close enough to warrant a case?).

paxys

7 days ago

There is either copyright violation or there isn't. Like you said, artists can still sue companies for copying their work, AI or not. If the work was transformative enough then, well, what's the problem?

dyauspitr

6 days ago

It doesn’t recreate anything outside of edge cases you really have to go looking for. It will ingest and spit out the style though and I see nothing wrong with that. It’s basically what people do right now.

baby

7 days ago

Who cares? Don't we want the most absolute intelligence to help human civilization? Credits and creators are below that.

dewarrn1

7 days ago

Creators may disagree.

ruthmarx

6 days ago

AI isn't ripping off anyone's work. Certainly if it is, it's doing so to a much lesser extent than commissioning an artist to do a piece in another artists style is.

nxicvyvy

6 days ago

Information wants to be free, man.

datavirtue

7 days ago

My wife works in a studio with a gaggle of artists who all blatantly "rip each other off" constantly.

segasaturn

7 days ago

AFAIK, AI models have no way of differentiating high-quality input from garbage. If one is fed peer-reviewed academic papers as well as a paranoid, violent person's Facebook manifesto, it treats them with equal weight as long as the sentences are coherent.

potato3983

7 days ago

On some level it needs to be fed some amount of garbage because it takes in all sorts of garbage inputs like we do.

AI that needs painstakingly curated training data isn't interesting in the same way that early lightbulbs that used precious metals and cost too much to be commercially viable aren't interesting.

casenmgreen

7 days ago

When I speak to my friends, it's a conversation not wholly private - after all, I've shared whatever I'm saying with them - but it certainly isn't wholly public.

In all our conversations, we have and we understand there are degrees of privacy; that which we share with family, that with friends, that with strangers.

When I post on-line, I both expect and expected that my conversations would be between me and the group of people I conversed with. I knew who was reading, and I was fine to write whatever I was writing to that group.

I may be wrong, but I think this is generally how people feel, how they act, what they expect, how they are, as humans. We think about who we are writing to. It does not come naturally to imagine that third parties are listening in, or will listen in, in the decades to come.

This brings us to now, with a third party, reaching back over ten or fifteen years, for absolutely everyone, everywhere, taking copies of everything it can get access to, for its own use, whatever that may be.

I profoundly reject Microsoft, and Google, and all entities and companies which act in such ways, these smiling evils, with their friendly icons and bright colours, happy faces and hundred page T&Cs to utterly obscure and obliterate the truth of their actions.

swatcoder

7 days ago

That sounds compelling when you borrow the marketing term "AI" and position the work as part of a sweeping revolution into some beautiful sci-fi future.

It's less compelling when you see the technology as noisy content generators that will flood the network with spam and devour the livelihood and opportunity to learn for low-market artists and programmers.

In the former perspective, you may look at this is "well, what's the best way we can make this happen?" while the latter sees it more like "So you insist on making this happen. Are you sure there's a suitably responsible way for you to do that?"

sli

6 days ago

This is just copyright infringement reworded to pretend it's not. I own the things I write, and publishing it on the internet doesn't negate that. OpenAI doesn't have the right to claim it, no matter what they think, and neither does anyone else.

bruce511

6 days ago

Firstly, publishing something on Facebook explicitly gives them the right to "copy" it. It certainly gives them the right to exploit it (it's literally their business model).

Secondly, Facebook is behind a login, so it's not "public" in the way HN comments are public. You'd have gained more kudos had you argued that point.

Thirdly, this article is about Meta AI, not OpenAI. So, no, OpenAI isn't claiming anything about your Facebook posts.

I'll assume however that you digressed from the main topic, and were complaining about OpenAI scraping the web.

Here's the thing. When you publish something publically (on the internet or on paper) you can't control who reads it. You can't control what they learn from it, or how they'll use that knowledge in their own life or work.

You can of course control republishing of the original work, but that's a very narrow use case.

In school we read setwork books. We wrote essays, summaries, objections, theme analysis and so on. Some of my class went on to be writers, influenced by those works and that study.

In the same way, OpenAI is reading voraciously. It is using that reading to assign mathematical probabilities to certain word pairings. It is studying published material in the same way I did at school, albeit with more enthusiasm, diligence and success.
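
(A toy illustration of that "probabilities of word pairings" phrasing only: taken literally it describes a classic bigram model, not the transformer architecture real LLMs use. The corpus below is invented:)

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ran".split()

    # Count adjacent word pairs, then normalize into P(next | current).
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        pair_counts[current][nxt] += 1

    probs = {w: {nxt: n / sum(cnt.values()) for nxt, n in cnt.items()}
             for w, cnt in pair_counts.items()}

    print(probs["the"])  # {'cat': 0.67, 'mat': 0.33} (approximately)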

In truth, you don't "own the things you write", not in the conceptual sense. You cannot own a concept, argument or position. Ultimately there is nothing new under the sun (see what I did there?) and your blog post is already a rehash of that which came before.

Yes, you "own" the text, to the degree to which any text can be "owned" (which is not much).

vharuck

6 days ago

>Firstly publishing something on Facebook explicitly gives them the right to "copy" it. It certainly gives them the right to exploit it (it's literally their business model.)

This isn't necessarily true for a user content host. I haven't read Facebook's TOS, but some agreements restrict what the host can do with the users' content. Usually things like save content on servers, distribute it over the web in HTML pages to other users, and make copies for backups. This might encourage users to post poetry, comics, or stories without worrying about Facebook or Twitter selling their work in anthologies and keeping all the money.

>In school we read setwork books. We wrote essays, summaries, objections, theme analysis and so on. Some of my class went on to be writers, influenced by those works and that study.

Scholarly reports are explicitly covered under a Fair Use exception.

https://www.copyright.gov/help/faq/faq-fairuse.html

But also be careful not to anthropomorphize LLMs. Just because something produces content similar to what a human would make doesn't mean it should be treated as human in the law. Or any other way.

jprete

6 days ago

OpenAI is not reading voraciously, it is not a human being. It makes copies of the data for training.

If there were an actual AI system that was trained by continuously processing direct fetches from the web, without storing them but using them directly for internal state transitions, then that might make the reading analogy work. But then AI engineers couldn't do all the analysis and annotation steps that are vital to the training process.
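
(A minimal sketch of what such a fetch-and-forget loop would look like; model.update() is a hypothetical online-learning step, and the point is what the design rules out, not how anyone actually trains:)

    import requests

    def stream_train(model, urls):
        """Hypothetical store-nothing training loop."""
        for url in urls:
            text = requests.get(url, timeout=10).text
            model.update(text)  # hypothetical online weight update
            # `text` is discarded here: with no stored corpus there is
            # no later deduplication, filtering, or annotation pass.

(Every analysis and annotation step mentioned above presupposes a stored copy, which is exactly why real pipelines don't work this way.)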

alok-g

6 days ago

Beautifully written. Thanks.

mr_toad

6 days ago

> publishing it on the internet doesn't negate that

The terms of use of most sites (including this one) include giving the site owners a license to use what you post, often in any way they see fit.

qwery

7 days ago

Why?

A statement that extraordinary would be interesting if it had some reasoning alongside it.

Also, Facebook posts aren't really "in the public sphere" / publicly accessible, but that's a nitpick.

limit499karma

6 days ago

We need to distinguish between modalities of machine intelligence and set policy tailored to each specific type. (Arguably) we can further include the variable of private versus public control, and the orthogonal matter of private versus public service.

Machine intelligence is anthropomorphic in utility; that is, it serves as either a surrogate or a substitute for a human cognitive capability. This permits enumeration of AI utility categories. Broadly, we can distinguish between creative, knowledgeable, analytical, judicial, predictive, and directing.

As an example use of this approach, consider the case of an AI trained on all public-domain material, optionally having had training access to private material (think Vatican archives). Such an instance should generally not be afforded creative rights, but we would be remiss to restrict its utility as a knowledge base.

The other parameters noted, the public/private duals, can of course have a bearing on setting type-specific constraints.

_heimdall

6 days ago

Are you concerned that this approach will lead to the abandonment of an open web?

If AI companies, and whatever comes next, are expected to take advantage of everything shared online, regardless of copyrights, it seems reasonable that people will stop sharing most things of value.

dyauspitr

6 days ago

If you never display your work online, you’ll probably never gain any traction as an artist.

_heimdall

6 days ago

I'd be concerned with getting traction if my art is online and anyone can feasibly copy my style.

Maybe that's unimportant and no different than being able to make physical copies, though good forgeries haven't always been so easy to produce, and a forgery is meant to be an identical copy of the original work. It could just be me, but the idea of an attempted identical copy of a well-known work feels different from a new creation being passed off as the work of a well-known artist. For example, you can claim to have a really good copy of the Mona Lisa, but that wouldn't be as valuable as claiming you have a previously unknown, unique work from the artist.

ipaddr

7 days ago

The question becomes: are posts with limited reach (friends only) part of the public sphere?

7bit

6 days ago

Yet you signed an EULA for every publicly available service that legally prevents you from doing anything the company doesn't want you to do. So why should they get to use your data legally and without restriction, while they explicitly deny you things like scraping data from their platform?

1vuio0pswjnm7

6 days ago

Is there anything in the "public sphere" that is not (a) published to the web and (b) under a license that allows Meta to use it for training "AI"?

It seems that "AI" is biased toward (1) only bits, and (2) only bits that are published to the internet.

mupuff1234

7 days ago

I think the question is what is included in the "public sphere".

If I'm a Facebook user, I definitely don't see posts that I meant to share among friends as something that should be considered part of the public sphere.


__loam

7 days ago

You're probably among friends on this site, but outside tech-coded spaces, most people understand that publicly available is not the same thing as an unlimited license to do whatever you want.

ben_w

7 days ago

While true, it's likely more pertinent that most people don't have a clue what's possible legally or technically until it gets in the news.

Can't give informed consent if you don't know what the EULA means or what the machines can do.

Ekaros

7 days ago

This site is weird when you compare this to open-source software: when these same companies sell open-source projects as a service on their platforms, that is again a huge, massive problem and exploitation... even when the license explicitly allows it, without a single legal question.

I wonder if things would be different if software could be copied and then recreated by the megacorps' models. Would there still be such a push in favor of it?

Dylan16807

6 days ago

The people arguing in favor of AI use are not in fact arguing for "an unlimited license to do whatever you want," so that solves that apparent hypocrisy nice and quick.

Also your theoretical software cloner would also make clones of proprietary software, right? I think that would be welcomed just fine.

__loam

6 days ago

They're apparently arguing for the legal right to use all content on the internet to create a product that is commercial and competes with the original content.

Dylan16807

6 days ago

As long as it's only borrowing very small amounts from any particular source work, I think it's fine for a new work to be commercial and compete with the originals.

__loam

6 days ago

They ingested the whole work.

Dylan16807

6 days ago

Is that a problem? Ingesting entire works is the norm, no matter how much gets used.


uoaei

7 days ago

"If they can, they will."

tdeck

7 days ago

Can we talk about how most of us haven't read 80% of everything on the internet, and yet we are all still better at many basic things than these AIs? At what point do we admit to ourselves that this isn't a sustainable path forward?

latentsea

6 days ago

Yup. We sure as heck haven't made a lick of progress in the last 2 years. No new, useful technology to see here folks.

fidla

7 days ago

Well, they don't really know whether someone is an adult or not. Just because someone said they were 13 doesn't mean they really were when they signed up. And 13 is hardly an adult now, is it?

qup

7 days ago

Is this an important distinction?

alwa

7 days ago

Yep. For reasons of propriety, for one thing. But also because the data protection laws get especially opinionated about what you do with kids’ speech, and one line they draw is at age 13. The American variant, COPPA, dates back to 2000, and requires verifiable parental consent to process the data of under-13s.

No idea if that matters retrospectively in legal terms—it seems to me that the main problem was providing service to the kids in the first place—but it's icky either way. Then again there's an ickiness to the entire project of pretending casual users' arcane permissions settings from 15 years ago indicate affirmative consent today…

playingalong

7 days ago

To be clear, I am not a fan of FB, but what else should they do other than rely on self-reported age?

kylehotchkiss

7 days ago

It's OK. Meta is training their AI on hundreds of thousands of posts with photos of veterans with toilet-plunger legs celebrating their birthdays in the middle of the street while sitting as sturdy as the Lincoln Memorial. The AI brain rot has already begun in this model.

SketchySeaBeast

7 days ago

I am truly impressed by how quickly AI-generated content has filled up every public space. We've gone from "AI is the future!" to a digital Kessler syndrome in a few short years.

kylehotchkiss

7 days ago

Not only filled up every public space, but the quality of it all is so crude. Like a bad Pixar animation. It's not like Pixel's "Add the photographer back into photo".

orochimaaru

7 days ago

Why is this surprising? They’ve always done this. In fact I’d be surprised if they didn’t do this. Fwiw - llama is free to use. So I guess it’s a good enough return.

I don’t use Facebook. I’m not sure if they can peek into WhatsApp messages.

Cheer2171

7 days ago

> Why is this surprising? They’ve always done this. In fact I’d be surprised if they didn’t do this.

This is such an unconstructive attitude. This is the first time they have publicly admitted it.

cbsmith

7 days ago

It's the first time they've publicly acknowledged what specifically was used in the training set for their LLMs. It's NOT the first time they've said that that data could be used for research. The presumption was that they had used the data for their LLMs.

Maybe it is surprising to some that this particular research used that data, but it really shouldn't be to anyone aware of how LLM development is done.

wpietri

7 days ago

Yeah, to me it's part of what press critic Jay Rosen calls The Church of the Savvy. It's kind of a performative cynicism where one tries to gain status by appearing so smart that you're above it all. One can do it with pretty much anything, which ironically means it demonstrates very little actual smarts.

To me it's related to the sort of person you'll see who on every startup failure says how they knew it wasn't going to work. Which again doesn't require particular smarts; most startups fail, so predicting failure doesn't take a genius. What I think is much more interesting is spotting a problem before it's obvious and naming it in advance. Or better, fixing it early on. That doesn't get you many internet points, though.

notatoad

7 days ago

was it supposed to be surprising?

it's still good to have confirmation of these things that we all assume to be true.

latexr

6 days ago

> Why is this surprising? They’ve always done this.

Ah, well, that makes it alright then. Move along, people, there’s nothing to see here. What’s that? What do you mean you didn’t know about the company slurping your data for their own personal gain, or knowingly poisoning you¹ and selling you defective deadly products²? Didn’t you read that one forum comment by a random person that one time? You shouldn’t be surprised. They’ve always been evil, so what can we do other than cross our arms? All hail our corporate overlords!

> I’m not sure if they can peek into WhatsApp messages.

But someone knows. And if you use WhatsApp and get screwed later, wouldn't you rather not have your concerns dismissed?

I don’t use WhatsApp, but do have an anecdote. I have a friend who just started using Instagram and purposefully follows no one. She doesn’t have Facebook. Recently a friend recommended to her a specific product on WhatsApp and then she started getting ads for it on Instagram.

¹ https://www.sydney.edu.au/news-opinion/news/2024/05/02/how-c...

² https://www.decof.com/documents/dangerous-products.pdf

bobthepanda

7 days ago

They were successfully sued recently by an AG over the automatic face-tagging feature for photos, so maybe another lawsuit is in the cards.

candiddevmike

7 days ago

We'd need to test how good Facebook's sanitization was; maybe you could find some PII in Llama responses with the right prompt.
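
For anyone curious, a minimal sketch of such a probe, assuming you have some way to sample completions from the model (the generate() below is a hypothetical stand-in, and the regexes are deliberately crude):

    import re

    # Crude PII patterns; real red-teaming would use a proper PII/NER tool.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def generate(prompt: str) -> str:
        # Hypothetical stand-in for a real model call (e.g. a
        # transformers pipeline over Llama weights); canned reply here.
        return prompt + " 0412 345 678."

    def scan_for_pii(text: str) -> dict:
        return {kind: pat.findall(text) for kind, pat in PII_PATTERNS.items()}

    for prompt in ["My neighbour Dave posted his mobile number, it was"]:
        hits = scan_for_pii(generate(prompt))
        if any(hits.values()):
            print(prompt, "->", hits)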

ethbr1

7 days ago

I'd assume that was one thing they were extremely diligent about ironing out, doing this at scale.

ziddoap

7 days ago

What is that assumption based on?

There have been numerous scandals over the years regarding their lack of care when handling personally identifiable data.

Meta is not who I look to for being extremely diligent with data.

ethbr1

7 days ago

Because open source, publicly released LLMs regurgitating specific user data is enough of a strategic risk to the entire effort that it would have been #1 on the test cases.

LLM training is also apples and oranges compared with how they handle data in the course of normal processing.

thunder-blue-3

7 days ago

Yeah, I'm surprised anyone didn't see this coming. If I had a dollar for every time I've heard "this is all our customer data, perform <xyz transformation> on it," I would have about $5.

paxys

7 days ago

So did OpenAI and Anthropic and Google. That's what "public" means.

koolala

7 days ago

Skynet Ads are "said" to be preferred. "People prefer to see relevant ads." Can AI understand humans better than humans understand themselves? Can humans understand the consciousness of dogs and cats better than they do?

The objective answer feels like no, but the subjective answer feels like yes. Humans will never understand how an animal truly thinks, but we understand how to control them.

autoexec

7 days ago

I don't believe for a moment that they haven't used the data of countless children. Especially early on when kids just had to click an "I'm over 18" button or enter a fake birthday to get accounts and facebook, like everyone else, just looked the other way.

datavirtue

7 days ago

The monsters!

ziddoap

7 days ago

Indeed, knowingly collecting and processing the data of children who cannot understand the implications nor consent to their data being used for commercial gain is monstrous.

datavirtue

6 days ago

No different than the rest of the bureaucracy they "live" in.

AlexandrB

7 days ago

    People just submitted it.
    I don't know why.
    They 'trust me'.
    Dumb fucks.
-Mark Zuckerberg

Things change, but this never stops being a concise summary of Meta's ethos as a company.

geertj

7 days ago

I imagine a future AI trained on this going into therapy to uncover childhood trauma.

user

7 days ago

[deleted]

not2b

7 days ago

This would include all those celebrity posts on Instagram. Great for deepfakes. They'll try to protect against that, but a bit of cleverness with prompts should be able to get around the filters.

PaulHoule

7 days ago

Assuming they want to build a model that can do useful things with their own data (say any kind of content filtering, summarization, etc.) it is exactly what they should do.

whoitwas

7 days ago

I don't understand how this surprises anyone. You choose to give them your data. It's not free. If you don't want them to have your data, don't give it away.

giobox

7 days ago

Right, and in exchange users received rather a lot of services for free... photo and video storage etc as just one example, Llama is free to use, etc etc.

While I've no sympathy for mishandling private user data as Meta has of course been guilty of in the past, I think users getting 16 years of free service in exchange for their public posts being used in this fashion is not that bad of a deal.

ipaddr

7 days ago

You get photo/video storage worth a fraction of a cent. Llama doesn't offer a free-to-use service.

People got a 'free service' in exchange for putting ads around their content, not for this.

It's a terrible deal that no one asked the users whether they wanted to agree to.

It's like visiting a website, saving an image, and then claiming the website owes you for storage.

throwaway913242

7 days ago

Didn't the users explicitly agree to the deal when they made an account and clicked "I agree to the Terms of Use"? Sure there's a degree of "oh nobody reads those anyway" to it that society at large should be approaching more rigorously (maybe a common-contract type of system), but at the end of the day the users are choosing to interact with a service, upload their content to it, and explicitly WANTING their content to be shared (to their friends, on their feeds, etc).

The real question is whether the agreement that was made between them and their users during 2007-now covered uses like training AI, but most likely it was a very broad "we can use your data how we want" type of statement and that agreement wasn't fought (or fought hard enough?) back then.

squigz

7 days ago

> It's a terrible deal no one asked the users if they want to agree to.

Every single user of Facebook was asked and agreed to it. [0]

This was not unexpected. Anyone who read the ToS of most large tech companies in the past 10 years, and gave it more than a few seconds thought, would have realized that they were giving them access to use the data however they want - a fact many of us have been warning people about for many years.

LLMs have also been an expected evolution of technology for many years. Nobody should have been surprised about this.

[0] https://www.facebook.com/terms.php

baby

7 days ago

Speak for yourself; I can't imagine what my life would have been without Facebook/WhatsApp. I've met countless friends, am probably married to my wife today thanks to it, and without it I wouldn't be able to keep in touch with most of my friends.

The "Facebook is not useful for me so I don't get it" attitude seriously needs to die.

JeremyNT

7 days ago

Easily said in 2024, but 17 years ago? I don't think this was quite so obvious (even amongst technical people).

yoyoyo1122

7 days ago

As the infamous saying goes, "If you aren't paying for the product, you are the product"

user

7 days ago

[deleted]

aplusbi

7 days ago

Honestly this feels like a better policy than most AI training - Meta actually has explicit rights to the content it is using. Sure it was EULA click-through but at least it's something that the content creator ostensibly agreed to.

Of course I'm sure Meta is also training their AI on content that they scraped from the internet/other sources without permission...

WuxiFingerHold

5 days ago

Meta can and will use every WhatsApp, Facebook, or Insta post of every user on the planet if they think they can benefit from it. They don't care about any data protection laws or the ridiculously low fines anyway. Meta is the most evil and powerful company on the planet. Believing anything else is naive. No news here.

nottorp

6 days ago

So Facebook's "AI" will speak Australian slang instead of Nigerian business English?

MisterBastahrd

7 days ago

Meta just created the dumbest object known to mankind. Quite an achievement given our current political landscape.

golergka

7 days ago

If it's publicly posted, it literally means that everybody can read it. What's exactly the issue here?

XorNot

7 days ago

It's also not used to learn facts about stuff: it's used to learn how language works and is used to describe the world.

The sum of all variants of human communication in text is pretty good for this and examples of wrong or different also matter: there was that article a few months ago about how a chemical prediction model performed better when it was also trained on invalid SMILES representations compared to totally sanitized datasets.

candiddevmike

7 days ago

Internet public and Facebook public are two different things. The latter can't be scraped easily or really discovered at all. It's not indexed or usable outside of Facebook, and most folks don't think "public" means "folks not on Facebook".

consteval

7 days ago

Because simply having the right to view something doesn't mean you can do whatever the hell you want with it. People understand this generally, but for whatever reason this is something tech people really struggle with.

I can freely view a billboard on the highway. I can even take pictures of it and post them and say, "hey look at this funny billboard!" I can't, however, take the billboard design, make a shirt, and sell that shirt. I can't strip the billboard and use its materials to build a house. I can't set the billboard on fire to keep myself warm.

Because those are all different things that require their own permission, their own license. We haven't thought about this much for AI.

ilrwbwrkhv

6 days ago

Serves them right. Anyone who puts up their images on Facebook willingly deserves to be subjugated.

Cyclone_

6 days ago

Aren't most AIs trained on public data? I.e., this doesn't seem terribly surprising.

ado__dev

7 days ago

Not surprised at all. Facebook owns the platform and outlined in the ToU that they can do whatever they want with the content you post on there.

At least it's better than scraping content off platform (which I'm sure they've done) and using that, but using content posted on their own platform seems like a no-brainer.

nkmnz

7 days ago

Is this true for posts from people with deactivated/deleted accounts as well?

alephxyz

7 days ago

Unlikely:

> We don’t use posts or comments with an audience other than Public for these purposes.

musicale

6 days ago

I guess that's why I'm getting recommendations for Tim Tams.

user

7 days ago

[deleted]

almost_usual

7 days ago

Can’t wait to see the memes it generates.

dboreham

6 days ago

Journalists discover how AI works...

CamperBob2

6 days ago

If the service is free, you're the product. Here's a radical idea: if you don't want Facebook to use your information and content, don't post it to Facebook.

... or does everyone around here think anything different is happening to their posts?

gmd63

7 days ago

"They just trust me...Dumb f**s" - Mark Zuckerberg

gmd63

6 days ago

So, downvoters here are OK with academic results and early resume building disproportionately affecting life trajectory, yet early displays of poor character aren't relevant in holding someone to account in the same way?

This is not a made-up quote, though I didn't transcribe it exactly; it was an actual message he sent during the early days of building Facebook when asked how he had obtained so much personal contact information from Harvard students, so it is entirely relevant in this context.

https://www.theatlantic.com/magazine/archive/2024/03/faceboo...

jewelry

6 days ago

Why is this even news? Google scrapes all public posts to build its search index… A bunch of 3rd-party vendors scraped all public posts to build their ad pricing models…

mylons

7 days ago

how is _anyone_ surprised by this?

jppope

7 days ago

I for one am shocked. Shocked I say. There are dozens of us surprised by Facebook's actions... DOZENS.

globalnode

6 days ago

How is this even news? People getting outraged that data put in the public domain gets used by someone... what world am I living in here?

user

6 days ago

[deleted]

SoftTalker

7 days ago

Funny to think that the distillation of 16 years of Facebook posts is now considered "intelligence."

SmellTheGlove

7 days ago

I wonder how much safety work they have to do specifically because of this. I’d imagine their model might have a fairly paranoid, slightly racist bias if not. Particularly as younger demographics shifted away from FB in the last decade.

PaulHoule

7 days ago

Actually, if you want to train a model that can recognize bad things, you want to have those bad things in the language model training data; otherwise it won't see the characteristics of those things and will struggle to recognize them in later training stages.

btown

7 days ago

Presence may be necessary, but researcher-driven weighting of different sources of content can still introduce bias. For instance, [0] suggests (sources are unclear) that OpenAI boosted by 5x the weight of their WebText2 dataset, which consists of sites linked to by upvoted Reddit comments. Reddit, in this sense, with all the biases of its various communities, was artificially elevated in importance. (Per [1], there were well-thought-through reasons for this around previous failures due to overreliance on Common Crawl, but it's nonetheless a choice that was made by humans to go in this direction.)

[0] https://gregoreite.com/drilling-down-details-on-the-ai-train...

[1] https://insightcivic.s3.us-east-1.amazonaws.com/language-mod...
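
For concreteness, here's a toy sketch of what that kind of upweighting amounts to when constructing training batches; the corpora and the 5x multiplier are illustrative stand-ins, not OpenAI's actual pipeline:

    import random

    # Illustrative corpora, not the real datasets.
    common_crawl = ["cc doc 1", "cc doc 2", "cc doc 3"]
    webtext2 = ["reddit-linked doc 1", "reddit-linked doc 2"]

    corpora = [common_crawl, webtext2]
    # Upweight WebText2 so each of its documents is sampled roughly
    # 5x as often as a Common Crawl document.
    weights = [1.0 * len(common_crawl), 5.0 * len(webtext2)]

    def sample_training_doc() -> str:
        corpus = random.choices(corpora, weights=weights, k=1)[0]
        return random.choice(corpus)

    batch = [sample_training_doc() for _ in range(10)]

Whatever Reddit's communities upvoted then simply gets seen proportionally more often by the model.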

fire_lake

7 days ago

Considering the standard datasets contain 4chan posts it’s probably a marginal improvement.

changoplatanero

7 days ago

I haven't worked at facebook for a number of years but I'm imagining what they trained on here was public instagram images and not the random text that people write on facebook. The text in facebook posts is likely to be low value but the images are a data gold mine.

kevin_thibedeau

7 days ago

With suitable sentiment analysis, you can train known bigoted models and then run unknown input through them to see if it appeals to the model.
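
A minimal sketch of that idea, with a toy TF-IDF classifier standing in for the "suitable sentiment analysis" (a real system would need far larger, carefully audited training data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy labeled corpus of known-bigoted vs. benign text.
    texts = [
        "those people are all criminals and should leave",
        "had a lovely walk on the beach today",
        "you can't trust anyone from that group",
        "the recipe turned out great, thanks everyone",
    ]
    labels = [1, 0, 1, 0]  # 1 = bigoted, 0 = benign

    vec = TfidfVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

    def appeal_score(text: str) -> float:
        # How strongly the unknown input "appeals" to the bigoted model.
        return clf.predict_proba(vec.transform([text]))[0][1]

    print(appeal_score("people from that group are all criminals"))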

dotancohen

7 days ago

That would be supervised learning: they tell the model that these are undesirable::racist values. Or just fed into the RAG::blacklist dataset.

squigz

7 days ago

It's... really not. Despite what some people seem to think, most of the 3 billion+ Facebook users are normal people who aren't just posting nonsense or memes.

solardev

7 days ago

Their algorithm sure does a good job at filtering those normal people out...

marcosdumay

7 days ago

Yes, it does. But that doesn't invalidate the GP.

solardev

7 days ago

It just makes it less likely that these normal people will be able to see the other normal posts from other normal people. Early Facebook's Wall was like that, mostly just friends chatting with each other about cats and babies, but then the company purposely started to optimize the timeline for controversy instead and it all went downhill.

It's not that there aren't normal people on there, it's that organic feel-good posts have a harder time gaining traction vs the sea of flamebait and sponsored ads and astroturfed spam. The signal to noise ratio was very, very low by the time I left. I don't know how it is these days...

SoftTalker

7 days ago

I mean technically they were optimizing for "engagement" but it turns out that things that get strong reactions from people tend to be controversy and rage-bait.

squigz

7 days ago

Maybe that says more about you and how you engage with it than it does the platform though.

solardev

7 days ago

Does it? I deleted my Facebook back in 2016 or so when it got so bad. Before that, I used extensions to make it show posts from friends in chronological order instead of whatever it was optimizing for by default (engagement/controversy, I think?)

It's just such a toxic, manipulated environment... I don't use social media anymore, just text some friends, and am much happier.

doublepg23

7 days ago

You don’t consider HN social media?

solardev

7 days ago

I guess it's a gray area? It's more like a forum to me, of the pre-Facebook sort, more like Slashdot than reddit.

And there's extremely strong moderation here that does the opposite of Facebook: It optimizes against controversy and vitriol rather than encouraging it.

We end up with a bunch of nerds mostly talking shop and sometimes complaining about the job market, but that's still far less ragebaitey than most social media.

SoftTalker

7 days ago

And it's not just human moderation (which would be biased based on the moderator) but things like the flamewar detector which actively demotes content that is getting a lot of quick responses. This is precisely what Facebook would promote, and HN might also if they were dependent on ad revenue to fund the service.

Much of what's bad about Facebook and other social media pretty much boils down to their need to show as many ads to as many eyeballs as possible.
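
The exact mechanics aren't public, but a sketch of the idea: take the widely discussed points-over-time ranking shape and multiply in a penalty when comments accumulate much faster than votes. The comment-to-vote heuristic and all constants here are guesses:

    from dataclasses import dataclass

    @dataclass
    class Story:
        points: int
        comments: int
        age_hours: float

    def rank(s: Story) -> float:
        # The oft-cited HN-style base score: votes decayed by age.
        base = (s.points - 1) ** 0.8 / (s.age_hours + 2) ** 1.8
        # Hypothetical flamewar penalty: demote stories whose comment
        # count races far ahead of their vote count.
        heat = s.comments / max(s.points, 1)
        penalty = 0.2 if heat > 2.0 else 1.0
        return base * penalty

    stories = [Story(120, 40, 3.0), Story(30, 150, 3.0)]
    stories.sort(key=rank, reverse=True)  # the flamewar sinks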

randomdata

7 days ago

> It optimizes against controversy and vitriol rather than encouraging it [...] a bunch of nerds mostly talking shop

Something doesn't add up here. Nerds talking shop and nerds trading controversy and vitriol over their technical preferences are the same thing.

Perhaps what you're saying is that controversy and vitriol is only apparent when you're an "innocent bystander" who doesn't have a passion for the subject? Which HN avoids by usually remaining focused on a fairly narrow set of subjects, to a user base generally passionate towards those subjects, so you don't notice?

That is an astute observation, if that is what you're trying to say. Indeed, if you showed your non-technical grandmother HN, she would no doubt see it the way Facebook was talked about earlier.

solardev

7 days ago

OK, but I think there's a pretty big difference between "Next.js is too bloated, you should use HTMX" and "so and so group of people are all _____ and they should all be ________, and oh, your mom sucks".

I don't think (or at least I hope) anyone is going to start a shooting war over their framework of choice. You can't say the same about much of the content circulating around social media.

(Edit: You added more to your post after I replied. To your point of "Which HN avoids by usually remaining focused on a fairly narrow set of subjects", that's not just my observation, that's the actual guidelines: https://news.ycombinator.com/newsguidelines.html. We self-select into a narrow slice of nerdtalk or we end up getting downvoted or banned from the site. To me that is the big difference between an interest-based forum that generally stays a functional monoculture vs a general social media site that brings diverse strangers together into shouting wars about whatever the controversy du jour is.)

randomdata

7 days ago

> there's a pretty big difference between "Next.js is too bloated, you should use HTMX" and "so and so group of people are all _____ and they should all be ________, and oh, your mom sucks".

Is there? Perhaps the trouble here is that your examples are too far apart to recognize how they compare?

What if Facebook, instead, said "Fat workers are too bloated, you should hire skinny workers"? Or if HN said "so and so projects are all ____ and they should all be ____, and oh, your vacuum doesn't suck".

I see no practical difference. It seems the only difference is that Facebook tends to talk about people, HN about tech. But that's not a significant distinction – aside from where your interests lie. Certainly tech-minded folks often find people to be uninteresting.

squigz

7 days ago

> Is there?

The difference is that talking about a framework with passion doesn't generally end up with severe real world consequences for other people.

The difference between tech and people is that people are... people. They have lives, feelings, all those fun human things. Passionately talking about how X group of people should die (or <insert rhetoric of your choice>) is quite different than passionately talking about how Y framework is the worst thing programmers have ever concocted. Generally the latter won't end up with someone getting stabbed. (Generally...)

> What if Facebook, instead, said "Fat workers are too bloated, you should hire skinny workers"? Or if HN said "so and so projects are all ____ and they should all be ____, and oh, your vacuum doesn't suck".

I think this is a terribly unfair switcharoo to make. Let's fill in the blanks for that second quote.

> "so and so group of people are all inhuman and they should all be killed, and oh, your mom sucks"

> "so and so projects are all terrible and they should all be deleted, and oh, your vacuum doesn't suck"

randomdata

6 days ago

> Generally the latter won't end up with someone getting stabbed.

A technology may not be literally stabbed, but I think it is fair to say that it very well may get stabbed metaphorically. It may even lead to the complete demise of the technology.

Passionately calling for the death of Next.js isn't really any different than calling for the death of a person, other than you may perceive it as being different if you hold people more dear than technology. But, in that case, that's just your arbitrary opinion that means nothing to anything other than you. There are no doubt some who would be more concerned about the loss of Next.js (bad example; nobody would miss Next.js, but you get the idea) than the loss of a person, which is equally valid. The true arbiter of truth, the universe, certainly doesn't have greater feelings towards one than the other.

squigz

6 days ago

> Passionately calling for the death of Next.js isn't really any different than calling for the death of a person

I love these HN takes.

randomdata

6 days ago

Of course. You clearly demonstrate an interest in people, so it plays to your passions. Thanks for sharing the obvious.

squigz

6 days ago

> But not everyone has concern for them. They are actually quite uninteresting.

Well, you certainly are.

randomdata

6 days ago

From passionate love to complete disinterest in a matter seconds. Interesting.

Admittedly, when you consider people as a technology, one does start to wonder what it would take to fix the glaring bugs. But then that questions if the technology provides enough value to bother with at all.

squigz

6 days ago

Yeah yeah, humans are silly broken machines, the universe is indifferent, you're very edgy and smart.

randomdata

6 days ago

For what reason would someone who is edgy and smart spend time writing messages to a computer program? Those who are edgy and smart, hell even just passingly typical, can go out into the real world and find actual friends.

user

6 days ago

[deleted]

solardev

7 days ago

I think it's just the different degree of emotional attachment people have to these topics. Yeah, people can get a bit worked up about how annoying JS development has become, but not quite to the same level as the major headlines of the day about the latest Middle East controversy/Soviet threat/identity politics thing.

Oh, and my vacuum does suck just fine, thank you very much.

randomdata

7 days ago

> not quite to the same level as the major headlines of the day about the latest Middle East controversy/Soviet threat/identity politics thing.

Why do you say that? I share in the tech-minded proclivity towards not having much interest in people, so I admit to being largely out of the loop, but what is there to be worked up about where you wouldn't equally get worked up around some tech-based topic?

It just sounds boring and uninteresting to me. When I have encountered discussions about those topics, all I have is some laughter at how silly the people sound. Just like I'm sure the "Next.js is awful" conversations sound to the metaphorical grandma.

solardev

7 days ago

You, yourself, might just be different/lucky then :) Or you could just be typical here. The average HN user is probably not representative of the average person in the overall population. It's a self-selecting crowd of nerds, many of whom aren't all that interested in more worldly topics. (Nothing wrong with that, mind you. It's part of the reason I'm here and not on mainstream social media.)

But out in the greater society, people get all worked up about those other topics, and will go protest in the streets, blockade representatives' houses, send death threats, rage endlessly online, attempt assassinations, go on shooting sprees, etc. Those things are usually triggered by some emotional attachment to some tribal identity and anger towards some out-group.

In a monoculture like this, we're mostly peers. Yeah, we might disagree about some particular things, but there's less of an "us vs. them" mindset than you see in politics, geopolitics, values-based communities, etc.

Discussions of technical merit, even when heated, thankfully don't reach those emotional heights quite as often. In my experience, at least...

randomdata

7 days ago

> But out in the greater society, people get all worked up about those other topics

I understand that. Believe me, I can definitely get worked up about Next.js, if you really want to go there...! It is in no way surprising that people interested in people can end up in the same place around topics related to people.

> In my experience, at least...

But the underscore here is that your experience is based on where your interests (perhaps not the perfect word to describe this, but it's what I got) lie. If you have a greater interest in people, you're going to feel more strongly about it. I'm not bothered by it at all.

But, by the same token, what is to say that I don't feel more strongly towards technical controversy than you do?

solardev

7 days ago

You very well might! No algorithm is perfect, and no moderation system will get it right every time. But I hope your experience on HN is still a net positive.

randomdata

7 days ago

> no moderation system will get it right every time.

It doesn't need to, of course. Moderation, outside of blatant spam perhaps, is rather pointless. If someone wants to exclaim "Next.js is the second coming of Jesus", I want to hear it. There is a reason they are saying it.

To not get stuck in my bias: For those interested in people, why wouldn't you want to hear "so and so group of people are all _____ and they should all be ________, and oh, your mom sucks"? It is a ridiculous statement at face value, admittedly, but it is someone's attempt at communicating. What they are actually trying to say might teach you something important. Wouldn't you want to surface what they are truly trying to get at?

I certainly would for its tech-based counterpart. Especially if it seems to contradict what I hold dear.

> But I hope your experience on HN is still a net positive.

Naturally! I'm out of here the second it is in any way not positive. No sense wasting your free time on something you don't enjoy.

dotancohen

7 days ago

  > And there's extremely strong moderation here that does the opposite of Facebook: It optimizes against controversy and vitriol rather than encouraging it.
Well, that and hot grits you insensitive clod.

solardev

7 days ago

Oh god, now I feel old :P

user

7 days ago

[deleted]

barbazoo

7 days ago

Even if you don't engage with it, the content that gets recommended/shown in your timeline gets weird real quick.

I used to have an account for marketplace and our local neighborhood group and that's my experience.

exe34

7 days ago

It's true: I only have to reply to one stupid post and I'm flooded with similar nonsense. Now I've taken to closing it if it shows me anything other than a nice house or a cat. I definitely see less dumb.

romwell

7 days ago

>It's... really not. Despite what some people seem to think, most of the 3 billion+ Facebook users are normal people who aren't just posting nonsense or memes.

Most users also don't post much.

Most Facebook posts (as opposed to users) are nonsense and memes.

glenstein

7 days ago

Right, the training isn't just reproducing the content, it's training on it to derive underlying themes for language and communication. And it can then effectively use those capabilities in dynamic ways.

derefr

7 days ago

What you get from the self-supervised training of a base model is more like "language fluency plus a web of crystallized-knowledge relationships."

But also, ML model training is a bit like the stock market: the noise/stupidity in individual examples points in a bunch of random directions, and so ends up cancelling out; while the signal all points in the same direction, and so ends up captured in the distilled model. (You might call this the "Anna Karenina principle of Information Distillation": all right answers are the same, while each wrong answer is wrong in a different way.)
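
A toy illustration of that averaging intuition (the statistics, not the actual mechanics of gradient descent):

    import numpy as np

    rng = np.random.default_rng(0)
    truth = 0.7                                       # the shared signal
    examples = truth + rng.normal(0.0, 5.0, 100_000)  # noise dwarfs it

    print(examples[0])      # any single example is mostly noise
    print(examples.mean())  # the average lands back near 0.7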

dougb5

7 days ago

People are frequently wrong in the same way in environments where they are easily influenced by each other like social media. That's part of why these models, especially early on, exhibited so many racial and gender biases.

glenstein

7 days ago

A great point, and a real problem, but I think in the grand scheme of things, the models inherit those problems while also simultaneously inheriting all kinds of useful knowledge about information and how to communicate effectively and dynamically with that information. It coexists with the benefit rather than serving as a defeater of the idea that there's any benefit.

veidelis

7 days ago

It's still wrong if the majority is wrong, isn't it? Like propaganda on major channels which praise the same thing, is consumed by a major part of the population which then assume that it's the right thing and continue the misinformation, which then propagates to the AI model.

derefr

7 days ago

Yes; but in the case of a "common misconception" like this, there's always also a nontrivial minority who do know the "right answer"[1] — and so enough examples of that occur in the training data to enable the model to embed the knowledge-web of "right ideas" (as a niche activation), alongside the "wrong idea" (in its default-mode network).

The "initial fine-tuning of a 'raw' base model to produce a 'generalized pre-trained' base model" process, is commonly talked about in terms of "alignment" — making the model ethical, making it not swear at you, making it refuse to engage with certain content, etc. And really, that part is all optional, with there existing "non-aligned" or "orthogonalized" models that don't have these steps performed on them or have had them reversed, but which are still useful.

But a large part of this initial fine-tuning process, consists of debiasing the model's default activation, moving it away from making associations with "common misconceptions" and toward making associations with "right answers." And this process is crucial to a model being able to reason intelligently — as these common misconceptions aren't coherent in a chain of reasoning they appear in, and so lead to the chain of reasoning falling apart / being non-productive. This is, in large part, the "secret sauce" that makes a model of a given size "more intelligent" than another model of the same size.

Every base model that anyone actually cares about or uses — "aligned" or not — has had some process of de-biasing like this applied to it; or, at least, has short-cutted this process by training on a training dataset generated by or filtered by a model that has already had this de-biasing applied to it, such that the derived training dataset doesn't contain the "common misconceptions" in the first place.

And when OpenAI and Meta brag about using RLHF, a large part of what they mean, is crowdsourcing recognition of long-tail "common misconceptions" at scale, to allow a much more thorough version of this de-biasing process[2].

---

[1] Of course, if nobody in the training data ever demonstrates the "right answer" knowledge/associations, then the model will never learn that knowledge/associations. But then, given that these training datasets usually represent decent samples of the population, a "right answer" being missing entirely would likely mean that nobody on Earth knows the "right answer" — so we humans wouldn't be able to recognize the model was wrong. The stock market can be wrong too, for the same reason.

[2] Which, perhaps surprisingly, can add up to more than the sum of its parts. The more of this human-labelled "common misconception"-response RLHF data you have, the more you can derive patterns from this bias data. You can distil out negative examples that you can use to prompt a model to filter the training dataset; but more interestingly, you can distil positive examples of the sort of structured chains of reasoning that work best to inherently avoid triggering the bias. Where, if you can overlap the activation-space hyperspheres of many such inversion-of-bias examples, then you get, essentially, the hypersphere within the model's activation space that contains its instrumental rationality. You can then just bias the model toward living in that part of activation space as much as possible — and this shoots its apparent reasoning capacity way up.
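
To make the "filter the training dataset" step concrete, a hypothetical sketch; misconception_score() stands in for whatever bias/reward model a lab would actually distil from its RLHF labels:

    def filter_corpus(docs, score, threshold=0.5):
        # Keep documents the bias model considers unlikely to encode
        # a common misconception.
        return [d for d in docs if score(d) < threshold]

    corpus = [
        "the earth orbits the sun",
        "we only use 10% of our brains",
    ]
    misconception_score = lambda d: 0.9 if "10%" in d else 0.1
    clean = filter_corpus(corpus, misconception_score)  # drops the myth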

TheOtherHobbes

7 days ago

Maybe the plan is to replace all FB users with bots but keep charging advertisers anyway.

"But how can anyone tell?"

VoodooJuJu

7 days ago

I don't know why that's funny, since that is the very essence of the ideal of democracy: the distillation or averaging of the sentiment of the masses yields truth. Some contributions are outliers in one way or another, and it's these outliers that live in many people's heads, but these extremes are ironed out by averaging against the masses.

emporas

7 days ago

If it is high-entropy, what's the significance of the information source's age? It could just as well be 2 years old and no one would care.

jprd

7 days ago

I keep waiting for one of their models to just start the answer with "FWD: FWD FWD: FWD: BLAH BLAH FWD ON"

vunderba

7 days ago

Amusing, but a large part of the training data for other LLM models comes from social media platforms.

I mean, hop on ChatGPT right now and ask it to come up with any kind of novel joke. I can almost guarantee it will either be some kind of word play or an equally low-hanging pun of the sort you'd find on Reddit, fifty replies deep.

Where's the Mensa member only social media dating platform for a properly erudite and snobbishly arrogant LLM to train on?

mig_

7 days ago

Google Scholar?

glitchc

7 days ago

This made me chuckle.

ithkuil

7 days ago

LOL'd myself, but OTOH, if you want a model to learn how people actually speak, you cannot expect to get that by reading curated scientific documents.

pjs_

7 days ago

There's a fun retrospective PoV which is that Silicon Valley has been through a sequence of consecutive hype waves, each of which has been viewed with derision as mega-scale fraud or brainrot:

- Video games (epic waste of time and moral debasement) gave us GPUs and NVIDIA

- Social media (rage bait-fueled argument machine and cat pic repository) gave us a huge corpus of text on which to train language models. Yes scientific papers and Wikipedia are necessary but probably not sufficient?

- Crypto (fraud, giant waste of resources) gave us a generation of young people who were comfortable building ludicrously oversized GPU clusters, and to some extent funded NVIDIA and TSMC R&D

all of which led to where we are with AI, which is that it is making (in my humble opinion) impressive dents in the problem of solving intelligence -- cue HN commenters telling me that AI is a giant fraud and waste of resources also :)

user

7 days ago

[deleted]

pbhjpbhj

7 days ago

In the UK I'd say they've definitely committed copyright infringement. Fair Dealing doesn't allow this.

anticensor

4 days ago

Facebook has an explicit licence, per Facebook's terms of use.

askafriend

7 days ago

This isn't really that groundbreaking of a story...

Of course they'd do this! How did people think feed ranking worked?

The only reason this is being reported now is because there's a chatbot and I guess that feels different to people.