Tumblr and WordPress to Sell Users’ Data to Train AI Tools
Samantha Cole (tweet, Slashdot):
Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge about the deals and internal documentation referring to the deals.
[…]
The internal documentation details a messy and controversial process within Tumblr itself. One internal post made by Cyle Gage, a product manager at Tumblr, states that a query made to prepare data for OpenAI and Midjourney compiled a huge number of user posts that it wasn’t supposed to. It is not clear from Gage’s post whether this data has already been sent to OpenAI and Midjourney, or whether Gage was detailing a process for scrubbing the data before it was to be sent.
[…]
- private posts on public blogs
- posts on deleted or suspended blogs
- unanswered asks (normally these are not public until they’re answered)
- private answers (these only show up to the receiver and are not public)
here’s a podcast where we discuss what’s happening and why
Access to training data and GPUs is going to be key in the AI wars.
The key question is how startups can compete against big tech, since neither paying for access to data nor covering model-training costs is cheap. This battle favors incumbents.
Previously:
Update (2024-03-01): Tumblr (via Mike Rockwell):
Proposed regulations around the world, like the European Union’s AI Act, would give individuals more control over whether and how their content is utilized by this emerging technology. We support this right regardless of geographic location, so we’re releasing a toggle to opt out of sharing content from your public blogs with third parties, including AI platforms that use this content for model training. We’re also working with partners to ensure you have as much control as possible regarding what content is used.
Update (2024-03-06): Jason Koebler and Samantha Cole (tweet):
In September 2023, WordPress.com quietly changed the language of a developer page explaining how to access a “Firehose” of roughly a million daily WordPress posts to add that the feeds are “intended for partners like search engines, artificial intelligence (AI) products and market intelligence providers who would like to ingest a real-time stream of new content from a wide spectrum of publishers.” Before then, this page did not note the AI use case.
[…]
The truth is that Automattic has been selling access to this “firehose” of posts for years, for a variety of purposes.
[…]
This firehose appears to be distinct from any direct data sharing deal with Midjourney and OpenAI, in part because the documentation makes clear that data being sold via this firehose is not limited only to posts on WordPress.com, but also can include posts on self-hosted WordPress.org websites that use Jetpack, a wildly popular plugin that millions of sites use and that users are encouraged to install when setting up a WordPress site.
[…]
After this article was published, Automattic told 404 Media that it is “deprecating” the Firehose: “SocialGist is rolling off as a firehose customer this month and the remaining customers are winding down in the coming months[…]
I am not particularly surprised to learn that public posts on WordPress.com blogs are part of a massive feed, but I am shocked that self-hosted WordPress sites with Jetpack installed are automatically opted into it as well.
[…]
The New York Times comprehensively blocks known machine learning crawlers, which you can verify by viewing its robots.txt file; the crawlers we are interested in are listed near the bottom, just above all the sitemaps. That is also true for Tumblr. But when I checked a bunch of WordPress.com sites at random — by searching “site:wordpress.com inurl:2024” — I found much shorter automatically generated robots.txt files, similar to WordPress’ own. I am not sure why I could not find a single WordPress.com blog with the same opt-out signal.
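This kind of spot check can also be done programmatically. As a rough sketch using Python’s standard `urllib.robotparser` — with a made-up robots.txt body modeled on the AI-crawler blocks described above, not the actual contents of any site’s file — one can test whether a given crawler is allowed to fetch a URL:

```python
from urllib import robotparser

# Hypothetical robots.txt body modeled on the kind of AI-crawler blocks
# described above. GPTBot is OpenAI's crawler; CCBot is Common Crawl's.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

# The listed AI crawlers are disallowed everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/2024/some-post"))    # False
# ...while unlisted agents (e.g. ordinary search crawlers) are not.
print(rp.can_fetch("Googlebot", "https://example.com/2024/some-post"))  # True
```

To check a live site instead, one could call `rp.set_url(...)` followed by `rp.read()`; the inline body just keeps the sketch self-contained.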
1 Comment
The really sad thing is that it's all for naught. LLMs are a dead end.
Throwing more data and compute at them might lead to slightly “better” plausible outputs to well-defined questions, but the real use cases don’t need more. And we won’t get AGI like this.