Social Media AI Training
Meta is now scraping Facebook posts to train its AI model. While this isn’t surprising on its own, what is surprising is just how difficult Meta is making it for users to opt out of this process.
Via X Daily News:
Instagram is training AI on your data but makes it nearly impossible to opt out.
Social network X (formerly Twitter) recently activated a setting that gives it permission to train Grok AI on user tweets. All X users are opted in by default, with X failing to notify customers about the change.
You can disable this via the Web, but the setting is hidden. You can't disable it from the mobile app at all.
I found the outrage about X’s default setting to use user content to train xAI’s Grok to be confusing, at least for public tweets.
If you’ve ever posted anything on the public internet, it has been used to train AI. If you do so today, it will be as well.
This is my thinking as well. It’s the same with Reddit and YouTube, where users specifically contributed their content to the public Web. This is quite different from the situations with Adobe, Slack, Grammarly, and Zoom, where the expectation was that the data was private.
Previously:
- Only Google Can Crawl Reddit
- AI Companies Ignoring Robots.txt
- Apple Intelligence Training
- Updated Adobe Terms of Use
- Slack AI Privacy
- Tumblr and WordPress to Sell Users’ Data to Train AI Tools
- Reddit AI Training Data and IPO
- GrammarlyGO Training on User Content With Questionable Opt Out
- Zoom ToS Allowed Training AI on User Content With No Opt Out
Comments:
What exactly are they hoping to gain from training AIs on social media posts? That seems like it's just going to produce terrible results.
On the other hand, if you wanted to design an AI that can make *fake* social media posts, then that's the thing to do.
@bri Yes. Terrible results in the truth, logic, spelling and tone areas. It’s like sending one’s child to a bar instead of a school to study.
Most FB and IG profiles are not set to public. I think it's reasonable to expect posts shared with a defined group of people not to be used for AI training. I did opt out for my private accounts. I also expect there to be some regulatory backlash, at least in the EU.
Isn't the complaint that the AI companies are making money from scraping "public" content? They're using my (precious and brilliant, of course) posts to bolster their products, for which they are (directly or via advertising) profiting. The line between my content and what the LLMs generate is obscured, but still there.
@Jack Yes, that's it, for me anyway. If I had known before submitting my brilliant tweets that unscrupulous companies would be making money off them, then maybe I wouldn't have posted, or I would have locked my account. There's also the somewhat unclear situation where, even though the information I published is public, it is probably still owned by me. (Maybe not on social media sites, but certainly on blogs, etc.)
@Michael: Copyright is not extinguished simply because something has been shown to the public, e.g. movie posters. Copyright is only extinguished if the work is voluntarily put into the public domain. Since LLMs regurgitate their training materials verbatim, because they are overparameterized, they violate copyright, and it's not fair use.
@Old Unix Geek You still have the copyright, but by posting it to the social media site you’ve granted them a license to do all sorts of things with it (as described in the ToS), including reproducing it verbatim.
This is very different from posting something on your own site and having some AI company crawl it—you have no relationship with that company and have not granted them anything. That’s when fair use would apply.
@Michael: If YouTube's ToS says YouTube can use your videos to train its AI, then it has that right. But Facebook wouldn't have that right simply because it can download a YouTube video, unless YouTube's ToS also grants third parties that right. Why? Because the ToS is a contract: you're granting Google a license to your content in exchange for the benefits it provides you.
Fair use might apply for LLMs that never (or virtually never) regurgitate their training materials. It would not apply for those that do. The overparameterization of LLMs pretty much guarantees that they will regurgitate their training materials; if they were underparameterized, they wouldn't, since they'd have learned rules rather than expressions. The problem is that underparameterized neural networks are harder to train and have higher test error.
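To make the parameter-count argument concrete, here is a toy sketch. It's a rough analogy using polynomial fitting, not a claim about any particular model; the data points and polynomial degrees are invented for illustration. A fit with as many coefficients as training points reproduces those points essentially verbatim, while a smaller fit is forced to learn a coarse rule instead:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 6)  # 6 "training" points
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(6)

# Degree 5 => 6 coefficients for 6 points: enough capacity to memorize
# the data exactly, analogous to an overparameterized model.
memorizer = np.polynomial.Polynomial.fit(x_train, y_train, deg=5)

# Degree 2 => 3 coefficients for 6 points: forced to compress into a
# coarse rule, analogous to an underparameterized model.
compressor = np.polynomial.Polynomial.fit(x_train, y_train, deg=2)

print("memorizer max train error: ", np.abs(memorizer(x_train) - y_train).max())
print("compressor max train error:", np.abs(compressor(x_train) - y_train).max())
```

Running it prints a near-zero maximum training error for the memorizing fit (verbatim reproduction of the training targets) and a visibly larger one for the compressed fit, mirroring the regurgitation argument above.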
@Old Unix Geek I agree with you here. That’s why I’m saying that the outrage over Facebook training on Facebook content and Twitter training on Twitter content is different from the other situations previously discussed.