Wednesday, June 19, 2024

Apple Intelligence Training

Apple:

In the following overview, we will detail how two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute and running on Apple silicon servers — have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly.

[…]

Our foundation models are trained on Apple’s AXLearn framework, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot.

David Pierce:

Wild how much the Overton window has moved that Giannandrea can just say, “Yeah, we trained on the public web,” and it’s not even a thing. I mean, of course it did. That’s what everyone did! But wild that we don’t even blink at that now.

John Voorhees:

As a creator and website owner, I guess that these things will never sit right with me. Why should we accept that certain data sets require a licensing fee but anything that is found “on the open web” can be mindlessly scraped, parsed, and regurgitated by an AI? Web publishers (and especially indie web publishers these days, who cannot afford lawsuits or hiring law firms to strike expensive deals) deserve better.

It’s disappointing to see Apple muddy an otherwise compelling set of features (some of which I really want to try) with practices that are no better than the rest of the industry.

Colin Cornaby:

The justification of “if you posted it on the public web - it’s ok for us to train AI on” is really bizarre - and not completely legally sound? Posting something on the public web doesn’t mean you surrender the copyright.

That’s actually exactly the basis of the NYT’s suit against OpenAI. The NYT proved that OpenAI was able to reproduce articles that it had scraped from the NYT.

Apple:

With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.

[…]

Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent.

The models were trained before they told us how to opt out. If you update your robots.txt to exclude Applebot-Extended, it’s not clear when your data will be removed from the models. It can take a long time to re-train a model, and I don’t know whether the on-device models are tied to OS updates.
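
For concreteness, here is a minimal sketch of the opt-out rule, checked with Python’s standard-library urllib.robotparser. The example.com URL is a placeholder, and the parser only models how a well-behaved crawler would read the file. It says nothing about how Apple actually processes the rule or when previously crawled data is dropped.

```python
from urllib.robotparser import RobotFileParser

# robots.txt rule that opts out of AI training use.
# It names only Applebot-Extended, so plain Applebot is unaffected.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL (an assumption for illustration, not a real page).
PAGE = "https://example.com/blog/post"

print(is_allowed(ROBOTS_TXT, "Applebot-Extended", PAGE))
print(is_allowed(ROBOTS_TXT, "Applebot", PAGE))
```

Under these assumptions the first check reports that Applebot-Extended is disallowed and the second that Applebot is still allowed, which matches Apple’s statement that pages disallowing Applebot-Extended can still appear in search results.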

Joe Rosensteel:

Literally the same presentation talks about protecting your privacy from unscrupulous internet companies. Your data is isolated by a whole auditable cloud solution and will never be used for modeling. BUT if that same Apple customer posted anything on the open web then it’s fair game for Apple to use regardless of copyright, licenses, or expectations. Doing it before anyone could ever object is all the more damning.

Eric deRuiter:

Disabling Apple AI via robots.txt is not supported on Squarespace as you can’t edit your own robots.txt file.

Apple does offer a way to opt out entirely via a <meta> tag, but I don’t see a way to use that to exclude only the AI stuff.

Dan Moren:

To test this out, I’ve added those directives to my personal site. This turned out to be slightly more confusing, given that my site runs on WordPress, which automatically generates a robots.txt file. Instead, you have to add the following snippet of code to your functions.php file by going to the administration interface and choosing Appearance > Theme File Editor and selecting functions.php from the sidebar.

[…]

If you want to go beyond Apple, this same general idea works for other AI crawling tools as well. For example, to block ChatGPT from crawling your site you would add a similarly formatted addition to the robots.txt file, but swapping in “GPTBot” instead of “Applebot-Extended.”

Google’s situation is more complex: while the company does have a Google-Extended that powers some of its AI tools, like Gemini (née Bard), blocking that won’t necessarily remove your site’s content from being crawled for use in Google’s AI search features. To do that, you’d need to block Googlebot entirely, which would have the unfortunate effect of removing your site from its search indexes as well.
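
As a rough sketch of the “similarly formatted addition” idea, the following generates one robots.txt group per AI crawler token and confirms that Googlebot itself is left alone. The helper name build_block_rules and the example.com URL are made up for illustration. Google-Extended is the token Google documents for this purpose. This is not the WordPress snippet from the original article, only the plain robots.txt equivalent.

```python
from urllib.robotparser import RobotFileParser


def build_block_rules(agents):
    """Build robots.txt text that disallows the whole site for each agent token."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    # A blank line between groups keeps each agent's rule separate.
    return "\n".join(groups)


# AI-related crawler tokens (hypothetical selection; adjust to taste).
AI_AGENTS = ["Applebot-Extended", "GPTBot", "Google-Extended"]
rules = build_block_rules(AI_AGENTS)

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URL (an assumption for illustration).
PAGE = "https://example.com/"

# Map each crawler to whether the generated rules block it.
blocked = {
    agent: not parser.can_fetch(agent, PAGE)
    for agent in AI_AGENTS + ["Googlebot"]
}

print(rules)
print(blocked)
```

The three AI tokens come back as blocked while Googlebot does not, which illustrates the trade-off described above: these rules keep a site in Google’s search index, but they do not stop Googlebot’s crawl from feeding Google’s AI search features.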

Robb Knight (via Nick Heer, Hacker News):

[Perplexity is] using headless browsers to scrape content, ignoring robots.txt, and not sending their user agent string. I can't even block their IP ranges because it appears these headless browsers are not on their IP ranges.

John Voorhees:

Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.

Update (2024-06-20): Nick Heer:

The question seems to be whether what Perplexity is doing ought to be considered crawling. It is, after all, responding to a direct retrieval request from a user. This is subtly different from how a user might search Google for a URL, in which case they are asking whether that site is in the search engine’s existing index. Perplexity is ostensibly following real-time commands: go fetch this webpage and tell me about it.

But it clearly is also crawling in a more traditional sense. The New York Times and Wired both disallow PerplexityBot, yet I was able to ask it to summarize a set of recent stories from both publications. At the time of writing, the Wired summary is about seventeen hours outdated, and the Times summary is about two days old. Neither publication has changed its robots.txt directives recently; they were both blocking Perplexity last week, and they are blocking it today. Perplexity is not fetching these sites in real-time as a human or web browser would. It appears to be scraping sites which have explicitly said that is something they do not want.

Perplexity should be following those rules and it is shameful it is not. But what if you ask for a real-time summary of a particular page, as Knight did? Is that something which should be identifiable by a publisher as a request from Perplexity, or from the user?

Update (2024-06-24): John Gruber:

Apple should clarify whether they plan to re-index the public data they used for training before Apple Intelligence ships in beta this summer. Clearly, a website that bans Applebot-Extended shouldn’t have its data in Apple’s training corpus simply because Applebot crawled it before Apple Intelligence was even announced. It’s fair for public data to be excluded on an opt-out basis, rather than included on an opt-in one, but Apple trained its models on the public web before they allowed for opting out.

But other than that chicken/egg opt-out issue, I don’t object to this. The whole point of the public web is that it’s there to learn from — even if the learner isn’t human.

Louie Mantia (via Federico Viticci):

This is a critical thing about ownership and copyright in the world. We own what we make the moment we make it. Publishing text or images on the web does not make it fair game to train AI on. The “public” in “public web” means free to access; it does not mean it’s free to use.

Besides that, I’d also add what I’ve seen no one else mention so far: People post content on the web that they don’t own all the time. No one has to prove ownership to post anything.

Someone who publishes my work as their own (theft) or republishes my work (like quoting or linking back) doesn’t have the right to make the choice for me to let my content be used for training AI.

That same argument would also apply to indexing for search.

2 Comments

Beatrix Willius

Did anyone really expect anything different?

"[Perplexity is] using headless browsers to scrape content, ignoring robots.txt, and not sending their user agent string"

Although in this case, by "scrape content", they mean things like "summarize a page the user asked Perplexity to summarize", not "use it as training data." I think this is a qualitatively different action.
