{"id":43782,"date":"2024-06-19T15:32:10","date_gmt":"2024-06-19T19:32:10","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=43782"},"modified":"2024-07-30T14:10:44","modified_gmt":"2024-07-30T18:10:44","slug":"apple-intelligence-training","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2024\/06\/19\/apple-intelligence-training\/","title":{"rendered":"Apple Intelligence Training"},"content":{"rendered":"<p><a href=\"https:\/\/machinelearning.apple.com\/research\/introducing-apple-foundation-models\">Apple<\/a>:<\/p>\n<blockquote cite=\"https:\/\/machinelearning.apple.com\/research\/introducing-apple-foundation-models\"><p>In the following overview, we will detail how two of these models &mdash; a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute and running on Apple silicon servers &mdash; have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly.<\/p><p>[&#8230;]<\/p><p>Our foundation models are trained on <a href=\"https:\/\/github.com\/apple\/axlearn\">Apple&rsquo;s AXLearn framework<\/a>, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. 
We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.<\/p><p>We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/www.theverge.com\/2024\/6\/10\/24175625\/wwdc-live-ai-apple-intelligence-federighi\">David Pierce<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.theverge.com\/2024\/6\/10\/24175625\/wwdc-live-ai-apple-intelligence-federighi\"><p>Wild how much the Overton window has moved that Giannandrea can just say, &ldquo;Yeah, we trained on the public web,&rdquo; and it&rsquo;s not even a thing. I mean, of course it did. That&rsquo;s what everyone did! But wild that we don&rsquo;t even blink at that now.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/www.macstories.net\/linked\/apple-details-its-ai-foundation-models-and-applebot-web-scraping\/\">John Voorhees<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.macstories.net\/linked\/apple-details-its-ai-foundation-models-and-applebot-web-scraping\/\"><p>As a creator and website owner, I guess that these things will never sit right with me. Why should we accept that certain data sets require a licensing fee but anything that is found &ldquo;on the open web&rdquo; can be mindlessly scraped, parsed, and regurgitated by an AI? 
Web publishers (and <em>especially<\/em> indie web publishers <a href=\"https:\/\/retrododo.com\/google-is-killing-retro-dodo\/\">these days<\/a>, who cannot afford <a href=\"https:\/\/www.nytimes.com\/2023\/12\/27\/business\/media\/new-york-times-open-ai-microsoft-lawsuit.html\">lawsuits<\/a> or hiring law firms to strike expensive <a href=\"https:\/\/openai.com\/index\/a-content-and-product-partnership-with-vox-media\/\">deals<\/a>) deserve better.<\/p><p>It&rsquo;s disappointing to see Apple muddy an otherwise <a href=\"https:\/\/www.macstories.net\/news\/apple-intelligence-the-macstories-overview\/\">compelling<\/a> set of features (some of which I really want to try) with practices that are <a href=\"https:\/\/www.semafor.com\/article\/06\/12\/2024\/perplexity-was-planning-revenue-sharing-deals-with-publishers\">no better<\/a> than <a href=\"https:\/\/www.nytimes.com\/2024\/05\/22\/business\/media\/openai-news-corp-content-deal.html\">the rest of the industry<\/a>.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/mastodon.social\/@colincornaby\/112616508809150798\">Colin Cornaby<\/a>:<\/p>\n<blockquote cite=\"https:\/\/mastodon.social\/@colincornaby\/112616508809150798\"><p>The justification of &ldquo;if you posted it on the public web - it&rsquo;s ok for us to train AI on&rdquo; is really bizarre - and not completely legally sound? Posting something on the public web doesn&rsquo;t mean you surrender the copyright.<\/p><p>That&rsquo;s actually exactly the basis of the NYT&rsquo;s suit against OpenAI. 
The NYT proved that OpenAI was able to reproduce articles that it had scraped from the NYT.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/support.apple.com\/en-us\/119829\">Apple<\/a>:<\/p>\n<blockquote cite=\"https:\/\/support.apple.com\/en-us\/119829\">\n<p>With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple&rsquo;s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.<\/p>\n<p>[&#8230;]<\/p>\n<p>Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent.<\/p>\n<\/blockquote>\n\n<p>The models were trained <a href=\"https:\/\/pxlnv.com\/linklog\/apple-intelligence-wwdc-2024\/\">before<\/a> <a href=\"https:\/\/mastodon.macstories.net\/@viticci\/112606317077490324\">they told us<\/a> how to opt out. If you update your <tt>robots.txt<\/tt> to exclude Applebot-Extended, it&rsquo;s not clear when your data will be removed from the models. It can take a long time to re-train a model, and I don&rsquo;t know whether the on-device models are tied to OS updates.<\/p>\n\n<p><a href=\"https:\/\/duck.haus\/@joesteel\/112606533133873781\">Joe Rosensteel<\/a>:<\/p>\n<blockquote cite=\"https:\/\/duck.haus\/@joesteel\/112606533133873781\">\n<p>Literally the same presentation talks about protecting your privacy from unscrupulous internet companies. Your data is isolated by a whole auditable cloud solution and will never be used for modeling. BUT if that same Apple customer posted anything on the open web then it&rsquo;s fair game for Apple to use regardless of copyright, licenses, or expectations. 
Doing it before anyone could ever object is all the more damning.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/mastodon.social\/@ridogi\/112611934598789506\">Eric deRuiter<\/a>:<\/p>\n<blockquote cite=\"https:\/\/mastodon.social\/@ridogi\/112611934598789506\"><p>Disabling Apple AI via robots.txt is not supported on Squarespace as you can&rsquo;t edit your own robots.txt file.<\/p><\/blockquote>\n<p>Apple does offer a way to opt out entirely via a <code>&lt;meta&gt;<\/code> tag, but I don&rsquo;t see a way to use that to exclude only the AI stuff.<\/p>\n\n<p><a href=\"https:\/\/sixcolors.com\/post\/2024\/06\/excluding-your-website-from-apples-ai-crawler\/\">Dan Moren<\/a>:<\/p>\n<blockquote cite=\"https:\/\/sixcolors.com\/post\/2024\/06\/excluding-your-website-from-apples-ai-crawler\/\">\n<p>To test this out, I&rsquo;ve added those directives to my <a href=\"https:\/\/dmoren.com\">personal site<\/a>. This turned out to be slightly more confusing, given that my site runs on WordPress, which automatically generates a <code>robots.txt<\/code> file. Instead, you have to add the following snippet of code to your <code>functions.php<\/code> file by going to the administration interface and choosing Appearance &gt; Theme File Editor and selecting functions.php from the sidebar.<\/p>\n<p>[&#8230;]<\/p>\n<p>If you want to go beyond Apple, this same general idea works for other AI crawling tools as well. 
For example, to <a href=\"https:\/\/platform.openai.com\/docs\/gptbot\">block ChatGPT from crawling your site<\/a> you would add a similarly formatted addition to the <code>robots.txt<\/code> file, but swapping in &ldquo;GPTBot&rdquo; instead of &ldquo;Applebot-Extended.&rdquo;<\/p>\n<p>Google&rsquo;s situation is more complex: while the company does have a Googlebot-Extended that powers some of its AI tools, like Gemini (n&eacute;e Bard), blocking that <a href=\"https:\/\/searchengineland.com\/google-extended-does-not-stop-google-search-generative-experience-from-using-your-sites-content-433058\">won&rsquo;t necessarily remove your site&rsquo;s content from being crawled for use in Google&rsquo;s AI search features<\/a>. To do that, you&rsquo;d need to block Googlebot entirely, which would have the unfortunate effect of removing your site from its search indexes as well.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/rknight.me\/blog\/perplexity-ai-is-lying-about-its-user-agent\/\">Robb Knight<\/a> (via <a href=\"https:\/\/pxlnv.com\/linklog\/perplexity-user-agent\/\">Nick Heer<\/a>, <a href=\"https:\/\/news.ycombinator.com\/item?id=40690898\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/rknight.me\/blog\/perplexity-ai-is-lying-about-its-user-agent\/\">\n<p>[Perplexity is] using headless browsers to scrape content, ignoring robots.txt, <em>and<\/em> not sending their user agent string. 
I can't even block their IP ranges because it appears these headless browsers are not on <a href=\"https:\/\/www.perplexity.ai\/perplexitybot.json\">their IP ranges<\/a>.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/www.macstories.net\/stories\/ways-you-can-protect-your-website-from-ai-web-crawlers\/\">John Voorhees<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.macstories.net\/stories\/ways-you-can-protect-your-website-from-ai-web-crawlers\/\">\n<p>Over the past several days, we&rsquo;ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We&rsquo;ve learned a lot, so we thought we&rsquo;d share what we&rsquo;ve done in case anyone else would like to do something similar.<\/p>\n<\/blockquote>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/14\/private-cloud-compute\/\">Private Cloud Compute<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/10\/apple-intelligence-announced\/\">Apple Intelligence Announced<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/12\/28\/the-new-york-times-sues-openai\/\">The New York Times Sues OpenAI<\/a><\/li>\n<\/ul>\n\n<p id=\"apple-intelligence-training-update-2024-06-20\">Update (2024-06-20): <a href=\"https:\/\/pxlnv.com\/blog\/on-robots-and-text\/\">Nick Heer<\/a>:<\/p>\n<blockquote cite=\"https:\/\/pxlnv.com\/blog\/on-robots-and-text\/\"><p>The question seems to be whether what Perplexity is doing ought to be considered crawling. It is, after all, responding to a direct retrieval request from a user. This is subtly different from how a user might search Google for a URL, in which case they are asking whether that site is in the search engine&rsquo;s existing index. Perplexity is ostensibly following real-time commands: <em>go fetch this webpage and tell me about it<\/em>.<\/p><p>But it clearly is also crawling in a more traditional sense. 
The <a href=\"https:\/\/www.nytimes.com\/robots.txt\"><em>New York Times<\/em><\/a> and <a href=\"https:\/\/www.wired.com\/robots.txt\"><em>Wired<\/em><\/a> both disallow <code>PerplexityBot<\/code>, yet I was able to ask it to summarize a set of recent stories from <a href=\"https:\/\/www.perplexity.ai\/search\/summarize-the-five-9VnFJxMITLiX9Q6P0rV0Zw\">both<\/a> <a href=\"https:\/\/www.perplexity.ai\/search\/summarize-the-five-PocA0teAT4mRmqsjebCzPQ\">publications<\/a>. At the time of writing, the <em>Wired<\/em> summary is about seventeen hours outdated, and the <em>Times<\/em> summary is about two days old. Neither publication has changed its <code>robots.txt<\/code> directives recently; they were both blocking Perplexity last week, and they are blocking it today. Perplexity is not fetching these sites in real-time as a human or web browser would. It appears to be scraping sites which have explicitly said that is something they do not want.<\/p><p>Perplexity should be following those rules and it is shameful it is not. But what if you ask for a real-time summary of a particular page, <a href=\"https:\/\/rknight.me\/blog\/perplexity-ai-is-lying-about-its-user-agent\/\">as Knight did<\/a>? Is that something which should be identifiable by a publisher as a request from Perplexity, or from the user?<\/p><\/blockquote>\n\n<p id=\"apple-intelligence-training-update-2024-06-24\">Update (2024-06-24): <a href=\"https:\/\/daringfireball.net\/2024\/06\/training_large_language_models_on_the_public_web\">John Gruber<\/a>:<\/p>\n<blockquote cite=\"https:\/\/daringfireball.net\/2024\/06\/training_large_language_models_on_the_public_web\">\n<p>Apple should clarify whether they plan to re-index the public data they used for training before Apple Intelligence ships in beta this summer. Clearly, a website that bans Applebot-Extended shouldn&rsquo;t have its data in Apple&rsquo;s training corpus simply because Applebot crawled it before Apple Intelligence was even announced. 
It&rsquo;s fair for public data to be excluded on an opt-out basis, rather than included on an opt-in one, but Apple trained its models on the public web before they allowed for opting out.<\/p>\n<p>But other than that chicken\/egg opt-out issue, I don&rsquo;t object to this. The whole point of the public web is that it&rsquo;s there to learn from&#x2009;&mdash;&#x2009;even if the learner isn&rsquo;t human.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/lmnt.me\/blog\/training-ai.html\">Louie Mantia<\/a> (via <a href=\"https:\/\/mastodon.macstories.net\/@viticci\/112661820239030330\">Federico Viticci<\/a>):<\/p>\n<blockquote cite=\"https:\/\/lmnt.me\/blog\/training-ai.html\"><p>This is a critical thing about ownership and copyright in the world. We own what we make the moment we make it. Publishing text or images on the web does not make it fair game to train AI on. The &ldquo;public&rdquo; in &ldquo;public web&rdquo; means free to access; it does not mean it&rsquo;s free to <em>use<\/em>.<\/p><p>Besides that, I&rsquo;d also add what I&rsquo;ve seen no one else mention so far: <em>People post content on web that they don&rsquo;t own all the time.<\/em> No one has to prove ownership to post anything.<\/p><p>Someone who publishes my work as their own (theft) or republishes my work (like quoting or linking back) <em>doesn&rsquo;t have the right<\/em> to make the choice for me to let my content be used for training AI.<\/p><\/blockquote>\n<p>That same argument would also apply to indexing for search.<\/p>\n\n<p id=\"apple-intelligence-training-update-2024-07-19\">Update (2024-07-19): <a href=\"https:\/\/www.macrumors.com\/2024\/07\/18\/apple-intelligence-not-trained-on-youtube\/\">Tim Hardwick<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.macrumors.com\/2024\/07\/18\/apple-intelligence-not-trained-on-youtube\/\"><p>[Apple] emphasized that since OpenELM is not integrated into Apple Intelligence, the &ldquo;YouTube Subtitles&rdquo; dataset is not being used to power any 
of its commercial AI features.<\/p><\/blockquote>\n\n<p id=\"apple-intelligence-training-update-2024-07-30\">Update (2024-07-30): <a href=\"https:\/\/mastodon.macstories.net\/@johnvoorhees\/112871433297234749\">John Voorhees<\/a>:<\/p>\n<blockquote cite=\"https:\/\/mastodon.macstories.net\/@johnvoorhees\/112871433297234749\">\n<p>If you still had doubts whether Apple scraped the web to build its foundation model and only gave publishers an option to opt-out after the fact, it&rsquo;s all laid out <a href=\"https:\/\/machinelearning.apple.com\/papers\/apple_intelligence_foundation_language_models.pdf\">here<\/a>.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/mastodon.social\/@stroughtonsmith\/112871602065730721\">Steve Troughton-Smith<\/a>:<\/p>\n<blockquote cite=\"https:\/\/mastodon.social\/@stroughtonsmith\/112871602065730721\"><p>Apple clearly has vacuumed up data from European websites and open-source projects to build its Foundation Models, which makes it incredibly distasteful for them to be trying to hold Apple Intelligence hostage as a bargaining chip against EU regulation.<\/p><p>If for some reason regulators were to angrily demand an immediate purge or audit of the affected data, it could set Apple Intelligence back years and push it well out of the iOS 18 timeframe.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/www.macrumors.com\/2024\/07\/30\/google-chips-used-to-develop-apple-intelligence\/\">Hartley Charlton<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.macrumors.com\/2024\/07\/30\/google-chips-used-to-develop-apple-intelligence\/\"><p>The paper reveals that Apple utilized 2,048 of Google&rsquo;s TPUv5p chips to build AI models and 8,192 TPUv4 processors for server AI models. 
The research paper does not mention Nvidia explicitly, but the absence of any reference to Nvidia&rsquo;s hardware in the description of Apple&rsquo;s AI infrastructure is telling and this omission suggests a deliberate choice to favor Google&rsquo;s technology.<\/p><\/blockquote>","protected":false},"excerpt":{"rendered":"<p>Apple: In the following overview, we will detail how two of these models &mdash; a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute and running on Apple silicon servers &mdash; have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly.[&#8230;]Our foundation models are [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2024-06-19T19:32:14Z","apple_news_api_id":"869f896c-29df-4f4f-8797-e88aea24b51c","apple_news_api_modified_at":"2024-07-30T18:10:50Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAABQ==","apple_news_api_share_url":"https:\/\/apple.news\/Ahp-JbCnfT0-Hl-iK6iS1HA","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[2602,1351,31,2586,30,2598,2613,355,96,2612,555],"class_list":["post-43782","post","type-post","status-publish","format-standard","hentry","category-technology","tag-apple-intelligence","tag-artificial-intelligence","tag-ios","tag-ios-18","tag-mac","tag-macos-15-sequoia","tag-perplexity","tag-privacy","tag-web","tag-web-crawlers","tag-youtube"],"apple_ne
ws_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43782","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=43782"}],"version-history":[{"count":7,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43782\/revisions"}],"predecessor-version":[{"id":44272,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43782\/revisions\/44272"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=43782"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=43782"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=43782"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}