Jason Snell:
It’s like a weight has been lifted from the soul of the iPad. It remains a very nice device to use in full-screen mode with all the simplicity attendant to that mode, or via a single tap it can turn into a multi-window, multitasking device that’s appropriate for the Mac-class hardware underpinning today’s iPads. The iPad no longer feels like it’s trying to live up to the promise of being the Future of Computing; with iPadOS 26, it’s more comfortable being itself.
[…]
On the iPad, it’s a real jumble. Some stuff looks cool, while other stuff is unreadable. For the most part, the new design didn’t hinder my use of iPadOS 26, and given those shifting sands I’m going to withhold my most withering design criticisms for a later time. But, yeah… Apple either needs to figure this thing out, and fast, or it should just frost all the glass for release and keep working on it in the background until it finds a more usable solution.
[…]
iPadOS 26 will be remembered as the update where Apple declared bankruptcy on all its previous attempts to do windowing and multitasking on the iPad, and released an entirely new windowing system that has been unabashedly inspired by the Mac.[…] I admit to forgetting more than once that I was using iPadOS when it was attached to my Studio Display. […] It’s really a flexible set of controls that works well whether you’re using a keyboard and trackpad or your fingers. […] And if you don’t want to use windowing on your iPad? Well, the feature is turned on and off with a single button in Control Center.
[…]
The improvements to Files, support for background recording, and the new Live Activities for background tasks are somewhat small changes on their own, but assembled together they create an iPad that just feels more ready for professional productivity tasks.
Apple (PDF):
Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities.
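To make that setup concrete, here is a minimal sketch, under my own assumptions rather than Apple’s actual evaluation harness, of one such controllable puzzle environment: Tower of Hanoi, one of the puzzles the paper uses. A single parameter (the number of disks) dials compositional complexity, the optimal solution grows as 2^n − 1 moves, and a model’s full move list can be checked mechanically instead of grading only the final answer.

```python
# Minimal sketch of a controllable puzzle environment (my own illustration,
# not Apple's evaluation code). The single knob n_disks sets the complexity:
# the optimal Tower of Hanoi solution has 2**n_disks - 1 moves, while the
# logical structure of the task never changes.

def hanoi_moves(n_disks, src=0, aux=1, dst=2):
    """Ground-truth optimal solution as a list of (disk, from_peg, to_peg)."""
    if n_disks == 0:
        return []
    return (hanoi_moves(n_disks - 1, src, dst, aux)
            + [(n_disks, src, dst)]
            + hanoi_moves(n_disks - 1, aux, src, dst))

def verify(n_disks, moves):
    """Replay a proposed move list and check legality plus the goal state."""
    pegs = [list(range(n_disks, 0, -1)), [], []]   # peg 0 holds disks n..1
    for disk, frm, to in moves:
        if not pegs[frm] or pegs[frm][-1] != disk:
            return False    # disk isn't on top of the source peg
        if pegs[to] and pegs[to][-1] < disk:
            return False    # larger disk placed on a smaller one
        pegs[to].append(pegs[frm].pop())
    return pegs[2] == list(range(n_disks, 0, -1))

for n in (3, 7, 10):
    solution = hanoi_moves(n)
    print(n, "disks:", len(solution), "moves, valid =", verify(n, solution))
    # 3 disks: 7 moves; 7 disks: 127 moves; 10 disks: 1023 moves
```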
John Gruber:
My basic understanding after a skim is that the paper shows, or at least strongly suggests, that LRMs don’t “reason” at all. They just use vastly more complex pattern-matching than LLMs. The result is that LRMs effectively overthink on simple problems, outperform LLMs on mid-complexity puzzles, and fail in the same exact way LLMs do on high-complexity tasks and puzzles.
Duncan Davidson:
Of course, this doesn’t change the usefulness of these models, but better understanding how they work — and which models are good for which tasks — is essential for using them well.
Anthropic and Open Philanthropy (PDF, via Marcus Mendes):
We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures.
[…]
When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures.
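As a rough illustration of that distinction (my own sketch, not the rebuttal’s experiments): the exhaustive move list for n disks runs to 2^n − 1 lines, which collides with a model’s output budget well before any reasoning limit, while a “generating function” answer stays a few lines long no matter how large the instance.

```python
# Sketch of the contrast the rebuttal draws (my own illustration, not the
# authors' code): a compact program that *generates* the solution, versus
# the exhaustive move list a model was previously asked to write out.

def solve_hanoi(n, src="A", aux="B", dst="C"):
    """The kind of short generating function a model can return for any n."""
    if n == 0:
        return
    solve_hanoi(n - 1, src, dst, aux)
    print(f"move disk {n}: {src} -> {dst}")
    solve_hanoi(n - 1, aux, src, dst)

# The exhaustive answer, by contrast, is 2**n - 1 moves long:
for n in (10, 15, 20):
    print(f"{n} disks -> {2**n - 1} moves to enumerate")
# 10 -> 1023, 15 -> 32767, 20 -> 1048575
```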
Colin Cornaby:
It’s kind of hilarious how the response to Apple’s AI reasoning paper was “well sure but it works great if you remove the reasoning from the reasoning test.”
Gary Marcus (via Hacker News):
My own post here laying out the Apple paper in historical and scientific context was so popular that well over 150,000 people read it, biggest in this newsletter’s history. The Guardian published an adaptation of my post (“When billion-dollar AIs break down over puzzles a child can do, it’s time to rethink the hype”). The editor tells me readers spent a long time reading it, notably longer than usual, as if people really wanted to understand the arguments in detail. (The ACM computer science society is reposting the essay, too, and there is now a French version as well).
Tons of GenAI optimists took cracks at the Apple paper (see below), and it is worth considering their arguments. Overall I have seen roughly seven different efforts at rebuttal, ranging from nitpicking and ad hominem to the genuinely clever. Most (not all) are based on grains of truth, but are any of them actually compelling?
[…]
If people like Sam Altman are sweating, it’s because they should. The Apple paper is yet another clear sign that scaling is not the answer; for once, people are paying attention.
Simon Willison:
Gary rebuts those rebuttals, but given that his previous headline concerning this paper was “a knockout blow for LLMs?”, it’s not surprising that he finds those arguments unconvincing.
[…]
And therein lies my disagreement. I’m not interested in whether or not LLMs are the “road to AGI”. I continue to care only about whether they have useful applications today, once you’ve understood their limitations.
Reasoning LLMs are a relatively new and interesting twist on the genre. They are demonstrably able to solve a whole bunch of problems that previous LLMs were unable to handle, hence why we’ve seen a rush of new models from OpenAI and Anthropic and Gemini and DeepSeek and Qwen and Mistral.
They get even more interesting when you combine them with tools.
They’re already useful to me today, whether or not they can reliably solve the Tower of Hanoi or River Crossing puzzles.
Google:
The International Mathematical Olympiad (“IMO”) is the world’s most prestigious competition for young mathematicians, and has been held annually since 1959. Each country taking part is represented by six elite, pre-university mathematicians who compete to solve six exceptionally difficult problems in algebra, combinatorics, geometry, and number theory. Medals are awarded to the top half of contestants, with approximately 8% receiving a prestigious gold medal.
[…]
An advanced version of Gemini Deep Think solved five out of the six IMO problems perfectly, earning 35 total points, and achieving gold-medal level performance.
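For context, and this is my arithmetic rather than anything in Google’s announcement: each IMO problem is graded on a 0–7 scale, so five perfect solutions earn 5 × 7 = 35 points out of a possible 6 × 7 = 42, which is where the 35-point total comes from.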