Apple LLM Generating SwiftUI
Marcus Mendes (PDF):
In the paper UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback, the researchers explain that while LLMs have gotten better at multiple writing tasks, including creative writing and coding, they still struggle to “reliably generate syntactically-correct, well-designed code for UIs.” They also have a good idea why:
Even in curated or manually authored finetuning datasets, examples of UI code are extremely rare, in some cases making up less than one percent of the overall examples in code datasets.
To tackle this, they started with StarChat-Beta, an open-source LLM specialized in coding. They gave it a list of UI descriptions and instructed it to generate a massive synthetic dataset of SwiftUI programs from those descriptions.
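The “automated feedback” in the title is largely mechanical: compile each generated program and keep only the ones that pass (the paper also filters for relevance and duplicates), with the survivors becoming finetuning data for the next round. Here’s a minimal sketch of just the compile-check step; the `generated/` directory layout and the `swiftc` invocation are my assumptions, not taken from the paper:

```swift
import Foundation

// Hypothetical sketch of UICoder-style automated feedback: keep only the
// generated SwiftUI programs that actually compile. The directory layout
// and compiler flags are assumptions, not taken from the paper.

func compiles(_ source: URL) -> Bool {
    let swiftc = Process()
    swiftc.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    // -typecheck parses and type-checks without emitting a binary, which is
    // enough to reject syntactically or semantically invalid programs.
    swiftc.arguments = ["swiftc", "-typecheck", source.path]
    swiftc.standardOutput = FileHandle.nullDevice
    swiftc.standardError = FileHandle.nullDevice
    do {
        try swiftc.run()
        swiftc.waitUntilExit()
        return swiftc.terminationStatus == 0
    } catch {
        return false
    }
}

let generatedDir = URL(fileURLWithPath: "generated")  // hypothetical location
let candidates = (try? FileManager.default.contentsOfDirectory(
    at: generatedDir, includingPropertiesForKeys: nil)) ?? []
let kept = candidates.filter { $0.pathExtension == "swift" && compiles($0) }
print("Kept \(kept.count) of \(candidates.count) candidate programs")
```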
The paper was published last year, but I didn’t see people talking about it until August. In the interim, Apple started using third-party AI providers in Xcode.
Even so, 18–25% of the output does not compile. (With the base model they started from, 97% of results failed to compile, and even the best model fails to produce compilable code in 12% of cases.)
This lines up with GitHub’s report that typed languages are more reliable for generative AI.
To be blunt: after testing them out, I didn’t use LLMs for programming for the rest of the year. Attempting to use an LLM that way was simply too frustrating. I don’t enjoy cleaning up flawed approaches and changing every single line. I do regularly ask ChatGPT how to use specific APIs, but I’m really just using it as a better documentation search, or asking for sample code that’s missing from Apple’s documentation. I’m not directly using any of the code ChatGPT writes in my apps.
In the meantime, I have watched plenty of presentations about letting Claude Code, and other tools, completely build an “app,” but the successful presentations have usually focused on JavaScript web apps or Python wrappers around small command-line tools. The two times this year that I’ve watched developers try the same with Swift apps have led to non-working solutions and excuses that it does sometimes work if left to run for another 20 minutes.
Previously:
- Top Programming Languages of 2025
- What Xcode 26’s AI Chat Integration Is Missing
- Swift Assist, Part Deux
- Tim, Don’t Kill My Vibe
- Vibe Coding
Update (2026-01-05): Tas:
My brother is working on an IPTV app in SwiftUI and has had a similar experience. Claude Code improved the quality of the output significantly, especially if you download the docs and do spec-driven development. But the chance of one-shotting tasks is still lower than with TypeScript, for example.
Rust is a perfect language for agents, given that if it compiles it’s ~correct
I understand the motivation: he wants the borrow checker to help make up for the lack of consistent reasoning in LLMs. But the fact that he thinks this is a potential solution is nutballs and makes me think he doesn’t really understand the problem.
Update (2026-01-08): Matt Gallagher:
My blog article last week has had some of the most negative feedback of anything I’ve ever published. So many people are emailing me to call me out for insulting AI. I’m not sure you need to defend AI; I hear it’s doing fine.
But also, I gave all the major models 7/10 or better and said they’re much better than last year. That’s not a hit piece; calm down.
Update (2026-01-14): Drew Crawford:
Some of us report almost unbelievable engineering feats using AI. Others say they can’t automate even a simple programming task.
[…]
A very reasonable hypothesis is that some confounding factor explains all the contradictions. Maybe some people have good results, and other people have bad results. Often this is hand-waved as a “skill issue.”
I think that’s broadly true. Practice matters.
[…]
What’s actually happening is quieter, messier, and harder to talk about than a hype cycle. The gains are real, unevenly distributed, and tightly coupled to skills we don’t yet have names for, let alone tutorials. The people getting the most value are often the least able—or least willing—to explain how they do it, because explanation is risky, unrewarded, and professionally counterproductive.