Apple LLM Generating SwiftUI
Marcus Mendes (PDF):
In the paper UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback, the researchers explain that while LLMs have gotten better at multiple writing tasks, including creative writing and coding, they still struggle to “reliably generate syntactically-correct, well-designed code for UIs.” They also have a good idea why:
Even in curated or manually authored finetuning datasets, examples of UI code are extremely rare, in some cases making up less than one percent of the overall examples in code datasets.
To tackle this, they started with StarChat-Beta, an open-source LLM specialized in coding. They gave it a list of UI descriptions, and instructed it to generate a massive synthetic dataset of SwiftUI programs from those descriptions.
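The "automated feedback" in the title appears to be, at its core, filtering: generate candidate programs and keep only the ones that pass checks such as whether they compile. As a rough illustration only (not the paper's actual pipeline), a compile-check filter over a hypothetical folder of generated .swift files might look something like this, assuming swiftc is available on the PATH:

```swift
import Foundation

// Rough sketch of a compile-check filter (illustrative, not the paper's pipeline):
// keep only the generated SwiftUI programs that at least type-check.
func typeChecks(_ file: URL) -> Bool {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = ["swiftc", "-typecheck", file.path]
    process.standardOutput = FileHandle.nullDevice
    process.standardError = FileHandle.nullDevice
    do {
        try process.run()
        process.waitUntilExit()
        return process.terminationStatus == 0
    } catch {
        return false
    }
}

// "generated" is a hypothetical folder of candidate programs, one per file.
let folder = URL(fileURLWithPath: "generated")
let candidates = (try? FileManager.default
    .contentsOfDirectory(at: folder, includingPropertiesForKeys: nil)) ?? []
let kept = candidates.filter { $0.pathExtension == "swift" && typeChecks($0) }
print("kept \(kept.count) of \(candidates.count) candidates")
```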
The paper was published last year, but I didn’t see people talking about it until August. In the interim, Apple started using third-party AI providers in Xcode.
18-25% of the output does not even compile. (The model they started with: 97% of the results FAILED to compile. Even the BEST model fails to produce compilable code in 12% of the cases.)
This lines up with GitHub’s report that typed languages are more reliable for generative AI.
To be blunt: after testing them out, I have not used LLMs for programming for the rest of the year. Attempting to use an LLM in that way was simply too frustrating. I don’t enjoy cleaning up flawed approaches and changing every single line. I do regularly ask ChatGPT how to use specific APIs, but I’m really just using it as a better documentation search or asking for sample code that is missing from Apple’s documentation. I’m not directly using any of the code ChatGPT writes in any of my apps.
In the meantime, I have watched plenty of presentations about letting Claude Code, and other tools, completely build an “app” but the successful presentations have usually focussed on JavaScript web apps or Python wrappers around small command-line tools. The two times this year that I’ve watched developers try the same with Swift apps have led to non-working solutions and excuses claiming it does sometimes work if left to run for another 20 minutes.
Previously:
- Top Programming Languages of 2025
- What Xcode 26’s AI Chat Integration Is Missing
- Swift Assist, Part Deux
- Tim, Don’t Kill My Vibe
- Vibe Coding
Update (2026-01-05): Tas:
My brother is working on an IPTV app in SwiftUI and has a similar experience. Claude Code improved the quality of outputs significantly, especially if you download the docs and do spec-driven development. But the chance of one-shotting tasks is still lower than with TypeScript, for example.
rust is a perfect language for agents, given that if it compiles it’s ~correct
I understand the motivation, he wants the borrow checker to help make up for the lack of consistent reasoning in LLMs. But the fact he thinks this is a potential solution is nutballs and makes me think he does not understand the problem really.
Update (2026-01-08): Matt Gallagher:
My blog article last week has had some of the most negative feedback of anything I’ve ever published. So many people emailing me to call me out for insulting AI. I’m not sure you need to defend AI, I hear it’s doing fine.
But also, I gave all the major models 7/10 or better and said they’re much better than last year. That’s not a hit piece, calm down.
Update (2026-01-14): Drew Crawford:
Some of us report almost unbelievable engineering feats using AI. Others say they can’t automate even a simple programming task.
[…]
A very reasonable hypothesis is that some confounding factor explains all the contradictions. Maybe some people have good results, and other people have bad results. Often this is hand-waved as a “skill issue.”
I think that’s broadly true. Practice matters.
[…]
What’s actually happening is quieter, messier, and harder to talk about than a hype cycle. The gains are real, unevenly distributed, and tightly coupled to skills we don’t yet have names for, let alone tutorials. The people getting the most value are often the least able—or least willing—to explain how they do it, because explanation is risky, unrewarded, and professionally counterproductive.
45 Comments
The post from Gallagher is wild. “I’ve not used the technology at all this year and here’s my opinion based upon nothing whatsoever.” Coding with an AI assistant does take time and effort to perfect, but you’re rewarded with a huge productivity boost. Are they perfect? Of course not. But neither are developers.
Using Claude Code tailored to your needs with excellent plugins, skills, subagents, and MCP servers is a revelation.
@Jonathan I don’t think that’s really a fair summary. He did try multiple assistants, including Claude. I’m glad to hear that it gets better as you move up the learning curve.
@Jonathan, here's where I stand, speaking as a retired developer:
> "Coding with an AI assistant does take time and effort to perfect, but you’re rewarded with a huge productivity boost. Are they perfect? Of course not."
Convince me. Developers, retired or not, live on details.
> "Using Claude Code tailored to your needs with excellent plugins, skills, subagents, and MCP servers is a revelation."
So convince me. You know details. (The devil is in them!) What have you experienced as a "huge" productivity boost? Going from concept to base code? And if so, what imperfections have you experienced? Even more important - forget these pesky details - how long did it take you to use Claude for this huge productivity boost? I understand, it may be for a third-party employer and you cannot provide many details, but can you provide some? How long did it take you to "perfect" this "huge" productivity boost? I'm willing to learn! But as I've already said, your comment lacks these details, where "the devil" resides.
The productivity boost comes from having a second set of eyes on the code. "I have a NilObjectException in a method, check where this could come from". I get a list of half a dozen things to check. A couple are nonsense. But I always get an idea what else I could improve.
My new app has a piece of interface using CSS and Javascript. Of course, I could have learned CSS beyond the basics. But this would have taken weeks.
There used to be this very scaled back online word processor thing called Draftin.
It had no business model since it was a Y Combinator project, and therefore it failed.
But I've missed it sorely ever since it went away.
Two days ago I recreated the basic version, sans version control, in an hour using only Claude.
Markdown input mode, reading mode, author info that is used for the Modern Manuscript Format export.
One html file is all it is.
I'd never have done it on my own.
I greatly respect Matt and learned a lot from him in the past.
For this post tho, I have to say this is a skill issue.
First off, he uses GPT-5.2, but not the model specifically designed for coding.
He doesn't list which harness he's using - these make a big difference in output quality.
The prompt doesn't specify which macOS version should be used, so it's expected that the model will use older APIs like ObservableObject instead of the @Observable macro - this needs to be specified. He also didn't invite the model to ask questions; it very likely would have.
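For context, the older pattern is a class conforming to ObservableObject with @Published properties; the newer one is the @Observable macro, which requires macOS 14 / iOS 17 or later. A minimal sketch of both, with illustrative names:

```swift
import SwiftUI
import Observation

// Older pattern: ObservableObject + @Published. It runs on older OS versions
// and dominates the training data, so models tend to reach for it by default.
final class CounterModelLegacy: ObservableObject {
    @Published var count = 0
}

// Newer pattern: the @Observable macro (macOS 14+ / iOS 17+), which a model
// will usually only choose if the prompt pins the deployment target.
@Observable
final class CounterModel {
    var count = 0
}

struct CounterView: View {
    @State private var model = CounterModel()

    var body: some View {
        Button("Count: \(model.count)") { model.count += 1 }
    }
}
```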
He didn't specify a line limit for files. These basic things are usually in AGENTS.md, but the article doesn't mention it.
He didn't specify writing tests or say that the model is allowed to build and run the app - if he had, it would have automatically fixed any compilation issues.
And then he also tries weak models, which doesn't make sense. Swift is a niche language and therefore gets a smaller share of world knowledge, so compressed models are useless.
Using agents correctly is a new skill that needs to be learned, just like coding is. Expecting to take a few hours to "test them out" will yield crappy results.
I write all my apps in SwiftUI and I haven't written code since ~May.
Everything's open source, so I can back it up.
Code's not always perfect, but it works and I can ship at an incredible pace.
https://x.com/steipete/status/2006484093052268701
@Peter Thanks for chiming in. I’m in awe of what you’ve been able to accomplish here. Given that Matt is super smart and making a good-faith effort but not seeing the results you are, do you know of a good resource that explains these things that are not obvious to the rest of us? Certainly much of it must be developing the skills, but it also sounds like it’s important to start on the right path and that there are rookie mistakes that can be avoided.
Thanks Michael! And I must apologize, my prev message might have been a bit spicy - this is not meant as personal criticism. What it is, however, is a critique of the methodology used to evaluate the models.
There's not much more I can say since the post is lacking in details (what harness, etc.) and there are gaps in understanding how planning works. Instead of throwing this prompt at the model, start a conversation, allow it to ask questions. These things are not human, so they'll take their best shot, but there's so much implicit info that we have in our heads that's not conveyed in the prompt. Like "it's okay to write against modern macOS API, use Macros not the old observation stuff". You can write like that, and the agent will ask about it - if you let it.
The compiler errors I've never seen; to build this well you need to use a good harness. codex for example will pick up build tools automatically, so if you add a verification step to the loop (e.g. write tests - and again, if you only spec that, it will default to XCTest, so say swift test), then it'll correct itself. They are *ghosts* and mirror what they learned online and make mistakes, so eval is needed. Give them tools. Add swiftlint, swiftformat. Give them a hint about what UI expectations you have. They can build great things. Dragging in images for inspiration works wonders. Or point them at a folder with pre-existing code that fits your expectations on style or design. They are great mirrors.
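As a concrete example of that kind of verification step, here is a minimal test (illustrative names) using the Swift Testing framework rather than XCTest, which an agent can run with swift test after each change:

```swift
import Testing

// Hypothetical function under test; in a real project it would live in the
// app target and be pulled in with @testable import.
func slugify(_ title: String) -> String {
    title.lowercased().split(separator: " ").joined(separator: "-")
}

@Test("slugify lowercases and hyphenates")
func slugifyBasics() {
    #expect(slugify("Hello World") == "hello-world")
    #expect(slugify("SwiftUI Rocks") == "swiftui-rocks")
}
```

With something like that specified up front, the loop becomes: edit, run swift test, read the failure, fix, repeat.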
I wrote more here:
https://steipete.me/posts/just-talk-to-it
https://steipete.me/posts/2025/shipping-at-inference-speed
It took me months to get up to speed and learn the little subtleties of the models. We're absolutely in the age where they can do whatever you ask them to do - but you gotta be more specific and see it as a conversation. Once you are happy with the plan, then write "build" and watch the magic happen.
To date, I personally haven’t seen any “vibe” coded app, no matter what methodology is used, that goes beyond “toy” app territory. Maybe there is, in private source. But open source? I have seen no proof of concept beyond “toys” and brittle code.
> I write all my apps in SwiftUI and I haven't written code since ~May.
> Everything's open source, so I can back it up.
@Peter can you link the repo or share an App Store link for the SwiftUI/LLM project you think shines the most? Most of us are skeptical of SwiftUI on its own, let alone having an LLM write most of it with just proofreading of the code. Would like to take the app out for a spin.
> Given that Matt is super smart and making good a faith effort but not seeing the results you are,
Actually, I disagree with this completely and I am not convinced whether or not Matt is making good arguments.
@Drew I’m not saying the arguments are good. I’m saying that I don’t think he set out to write an unfair hit piece. If you also disagree about him being smart, I don’t know what to say…
I'm not sure if people aren't talking past each other here. There are different ways of using LLMs in programming, and they have different constraints.
Actual vibecoding is the act of generating programs purely using LLMs, without manual intervention or code reviews. In my experience, with a current model and a reasonably common tech stack, this works for smaller code bases. It starts to break down at around 300-600 kb of code, where LLMs become unable to consistently make changes without generating major new problems.
With this approach, there is a substantial gain in velocity, but the cost is a pretty strict upper limit for complexity and a massive potential for bugs, particularly security issues. If you're building a small-ish tool without any security concerns, it's a viable approach.
Another way to use LLMs is to tightly control and supervise them. Commit before every prompt, limit the size of each change, give very specific input for what needs to be changed, ask to provide a plan before any change and approve that plan, ask to write tests and review these tests before writing the actual implementation, and then review any code the LLM generates and potentially manually refactor it.
This will result in very high code quality (potentially better than just manually writing everything), but the velocity advantages are small or even negative.
So if somebody tries out coding with LLMs and thinks they can take their huge existing code base and just tell 5.2 to implement a new feature, they're going to fail. But OTOH when people claim that they are now 100x more productive than before, I also do not trust what they are doing at all.
I'm saying that he did in fact produce an unfair hit piece, regardless of what he may or may not have set out to do. I'm sure he's a nice guy, but let me attack the piece on the merits and why I think it is disqualifying on its face.
First: never trust a staged demo. The procedure in the piece is the same one from every LLM stage demo: one-shot a whole app, with the model dropdown as the only parameter, because we need to sell you that our new model will fire your bottom tier of developers. One law of nature is that all staged demos are, as a matter of course, marketing pieces. Another law of nature is that stage demos only present the most basic 2-minute understanding of a topic, and then try to leverage that simple understanding into the very flashiest result they possibly can. A corollary is that before you can replicate a stage demo, you must spend more than 15 hours and file more than one feedback guessing at the difference between your setup and the one on stage. This is obviously what we did every year, so the special pleading toward LLMs feels disqualifying to me.
I'm with Peter that the compile errors are a smoking gun. First: LLMs are not accurate compilers (yet?), so it is silly to think they have one built in (which appears to be a main thesis of the piece?). However, I can report that three models Matt groups together (DeepSeek-R2-Lite, Devstral Small 2, Nemotron 3 Nano) write code that absolutely compiles, given access to build tools, with 95% accuracy; this is easily reproducible. So obviously they did not have access to build tools in his experimental design? Why not? It's not reported. For that matter, if you ask me to whiteboard this prompt, I will also whiteboard some code that does not compile; I guess according to Matt's criteria I should be a 1/10 model?
Peter's theory that some of these models weren't coding-tuned is a very smart arbitrary guess about the actual underlying problem: coding models are more motivated to go find a working compiler somewhere on your system. I have a similar yet different arbitrary guess: Apple command-line tools are unusually bad relative to most languages, and LLMs often struggle with them. (Incidentally, this would go a long way toward supporting the major thesis that somehow LLMs are "bad at Swift", and it could be easily tested.) But the fact that I cannot explore whether Peter is correct or whether I am correct or whether some other theory is correct is a disqualifying error in the piece.
Here's a second major error: we're presented screenshots so as to judge either AI output in general, or the one free variable, the model dropdown parameter. But when *I* write SwiftUI code, *I* have an assistant panel in the righthand pane that *I* can see as I type, which some might say was the stage demo feature of SwiftUI as an API. So did the *models* have access to that feature, given that, for example, GPT-4o was tested, and it is a famously multimodal model that might be motivated to iterate on some screenshot basis? Or is the experimental design instead to lock me in a basement and have me write SwiftUI code on a whiteboard that I've never seen visually? The fact that I don't know is a disqualifying error in the piece.
Instead of attacking it on methodology, let me attack the major conclusion: like Peter, I somehow have extraordinary results from my own LLM use. Here's an example: on December 12, Apple released macOS 26.2. On December 14, I knew the exact line of code in WebKit that caused a memory UB, if and only if it's in a large production codebase, if and only if it's an exact production workload, if and only if my codebase is cross-compiled to wasm, and if and only if it's run in the Safari browser.
Since by some arbitrary LLM process I happen to know the exact line of code released in 48 hours in WebKit, I can easily present an alternative, parallel construction of it. Actually, some other guy got here two weeks ago, and he tried to warn Apple not to release this line of code, and of course they did anyway, as is tradition.
Questions to ask in a serious piece are:
a) what the actual fuck am I doing to trace a native UB to a cross-compiled wasm target in 48 hours, and
b) what the actual fuck is this other guy doing to beat my methods by at least two weeks
I am confident some questions like this are now the state of the field of computer science, and the piece's failure to interact with them will prove to be a methodological flaw.
For anyone still in doubt whether “LLMs can produce SwiftUI apps”, here’s a reality check: Subscribe to the MacApps Reddit and check out the stuff people are releasing daily: https://www.reddit.com/r/macapps
These apps are not the 6-12-month projects coded by hand by actual indie developers that you’d see 3 years ago. These are apps coded by teenagers (sometimes with zero experience in Swift and macOS dev) in a few days using AI. And they mostly work well enough.
As for myself, I’ve been an indie Mac app developer since 2012. I stopped writing code in May this year. I stopped vetting the code 4-5 months ago, when the AI started exceeding my expertise 99% of the time. I don’t look at the code, I look at diffs. I do all my work in Claude Code or Codex.
Most recently I shipped Backdrop, which was built 80% with AI code. There’s a free trial where you can evaluate it:
https://cindori.com/backdrop
I also shipped a reverse engineered custom MaterialView. While the view itself came from me, I had Claude build the entire demo and GitHub repo for me over a few hours:
https://github.com/OskarGroth/MaterialView
I stand with Peter here. Using AI without even a competent harness (Claude Code or Codex, aka the software that exposes all the tools to the AI model, effectively making it an agent and not a Stack Overflow search assistant) indicates that you should probably go research how to work with AI coding agents first, before you make a post claiming something about their capabilities. To us who actually work with this stuff day to day, it is kinda embarrassing to read, to be honest. Although this sort of take is not unique, it says more about the unwillingness of some people to accept the new reality than it does about the progress of AI models.
"I'm saying that he did in fact produce an unfair hit piece, regardless of what he may or may not have set out to do."
"Hit piece" implies intent. Having read the article now, I agree that it is pretty nonsensical. Vibecoding a whole app in one shot in a setup where the model can't compile code and then judging the model on compile errors provides no useful information on how well these models actually work in a reasonable setup.
But this guy clearly didn't know how to set up his tools correctly.
It's kind of funny how aggressive people are on this topic. I follow both pro- and anti-LLM subreddits, and both of them are convinced that the people on the other side are either paid actors or complete morons and that anything contradicting their opinion is misinformation spread by a cabal of evildoers.
It's just not that serious.
> “hit piece” implies intent
I disagree; there cannot be any property of a work that implies authorial intent, but this is way outside our shared fields.
The real question is: how do we explain the drastic difference in the empirical reports? Sure, at a surface level maybe I am wrong or maybe Matt is wrong.
A more robust explanation in the face of competing claims is that there’s likely to be some confounding factor. My observation is that the confounding factors I think are obvious to explore are completely absent from the piece.
"there cannot be any property of a work that implies authorial intent but this is way outside the our shared fields."
I find your prose a little difficult to understand, and so I don't think I understand the claim you're making, but the definition of the phrase "hit piece" usually contains intent.
It's less a claim I am making in particular, and more of a standard position in philosophy. The landmark paper about it is Wimsatt and Beardsley's 1946 "The Intentional Fallacy". As the title of their paper suggests, the essential claim is that the intent of an author is the wrong way to judge a work. It follows easily from this position that when I use the adjective "hit piece" to describe a work, that has nothing to do with the intent of the author.
All of this background is a complete distraction from the main question, which is: am I right, or is Matt right, or what is the confounding factor? I think that's a far more interesting question that is much more relevant to our respective areas of expertise.
Drew: Wimsatt and Beardsley are talking about judging literary works. Is a particular poem a great work of art? The answer, they say, should only be based on the work itself. The author's intent in producing the work, that is, is irrelevant. I don't think any of that is relevant to a non-literary work like Matt Gallagher's post. Even if it were, when you call a work a "hit piece," you are by definition referring to intent. A "hit" is an intentional act.
> These apps are not the 6-12-month projects coded by hand by actual indie developers that you’d see 3 years ago. These are apps coded by teenagers (sometimes with zero experience in Swift and macOS dev) in a few days using AI. And they mostly work well enough.
> As for myself, I’ve been an indie Mac app developer since 2012. I stopped writing code in May this year. I stopped vetting the code 4-5 months ago, when the AI started exceeding my expertise 99% of the time. I don’t look at the code, I look at diffs. I do all my work in Claude Code or Codex.
You stopped vetting the code? Apps vibe-coded by teenagers that “mostly work” but they’re asking for help to finish?
While I think LLMs can be useful, a big factor in these debates is people who care about their craft vs. people who don’t and just want to snap things together quickly to make a quick buck.
Use LLMs to your advantage but don’t use it as an excuse to be lazy
Visitor: they are talking about texts, and the genre distinction between “literary works” and any other text is not meaningful, but there is probably some other blog where we can discuss these 70-year-old arguments in philosophy in that comment section.
On a programming blog, I’d prefer to discuss the actual state of programming in the current year, and the specific deficiencies in the hit piece.
Drew, if I understand you correctly, you're saying that we should judge a piece by its contents, as we can't know the author's intent. I agree, which is why I object to you calling it a hit piece, since that is a comment on the author's intent rather than the text's merits.
And again, I think the claims people make about LLMs need to be more precise. I read the vibecoding subreddits, and they're full of non-programmers who are frustrated and angry when, after initial success, LLMs are suddenly unable to make progress on their apps. These people can't read any of the code or debug anything manually.
And yet people make claims like "I stopped vetting the code 4-5 months ago," only to later say that "I look at diffs."
This is a confusing claim about the capabilities of LLMs, so it's hardly surprising that people write confused articles based on a misunderstanding of how other people use LLMs.
Thanks for the feedback, all. I'm happy to take the spicy hits :-P
I was *not* using LLMs in an agentic coding manner. I was not using Pro or Coding frontier models (not GPT-5.2-Codex or Gemini 3 Pro or Claude Opus 4.5). The local LLMs I used were coding focussed (as the names implied) but that was just to help them out as much as possible since they clearly struggled.
As for methodology: I literally entered the prompt into web interfaces (or LM Studio for local models) and compiled the result in Xcode. No agents.md. No iteration. No filter. Nothing interesting.
I'm definitely hearing from these comments that that approach is way too naive for some here who live a non-stop agentic life. But I was more curious to see raw code generation than what could be reached through 20 minutes of iteration.
> As for myself, I’ve been an indie Mac app developer since 2012. I stopped writing code in May this year. I stopped vetting the code 4-5 months ago, when the AI started exceeding my expertise 99% of the time. I don’t look at the code, I look at diffs.
If this was done by serious engineering craftsmen, with the current capabilities of LLM agents, for crucial software and systems actually important for people’s daily lives and the foundations of modern civilization, bridges would start to collapse, medical equipment would fail, transport systems would fail, industrial processes would crater, and digital infrastructure would be in chaos. Our civilization would be built on a minefield ready to take us down at any moment. Thank God smarter people don’t take cues from particular subreddits to do real work.
It’s all toys.
> You stopped vetting the code?
Absolutely. Programming languages like Swift or TypeScript are now closer to assembly code for me & others who work via agents. There is just no point reviewing it, most of the time.
I can write a whole app in ObjC with MVC-pattern using dispatch. With one prompt to my agent, I can turn it into an _identical_ Swift app using MVVM with the latest Concurrency in just 5 minutes. Why the heck would I sit and nitpick about syntax, styling, patterns, organisation of source files? None of those things matter anymore. The only thing that matters is the outcome. And the outcome is that the features work, and that the bugs are resolved. I don't even have to find the bugs. I can just ask my agent to "find and resolve potential performance issues or bugs". It will find way more in 3 minutes than I would reviewing the codebase manually in 3 hours.
It's really about trust. Picture I gave you a team of 50 engineers. You are now leading this team to build your product. Will you stand over their shoulder, nitpick about every line they write? Demand that every PR must be personally approved by you? Some people are like that. They have trust issues. They cannot let go of control and end up micromanaging their teams to the point where the whole thing caves in on itself. If you are like that, you will find it very hard to adapt to this new paradigm of engineering.
> Use LLMs to your advantage but don’t use it as an excuse to be lazy
Now is it laziness to leverage tools that allow you to 50x your output as a developer? Even as a teenager, who can now make apps without learning Swift first? Why would they? That era is gone, and they just want to make the thing.
Now, refusing to even properly evaluate and educate yourself about these tools, that's laziness. And it's an insult to your users and the craft itself, because these tools _do_ vastly improve the quality of your work, as long as they are commanded by a person who cares about that.
But I suspect that for many, it's also an emotional response. An effort, conscious or not, to protect your own ego and identity as a person who writes code, and does it well. The sad part is that the longer you cling to that identity, the stranger the world will get by the day, and you'll end up nothing more than the old man yelling at a cloud.
> And yet people make claims like "I stopped vetting the code 4-5 months ago," only to later say that "I look at diffs." This is a confusing claim about the capabilities of LLMs.
I'll clarify: I glance at the diffs during agent work to see architectural direction, functions or classes created, as a complementary signal together with the agent summary of the work done. I don't go through the code in detail, nor do I review any of it before shipping. For that, I just do a few review passes with fresh agents.
We can agree to disagree about my word choices.
We also agree that claims people make about LLMs need to be more precise. If I understand you correctly, you are discussing a performance cliff in LLMs beyond some level of task complexity. We agree that there is such a cliff, both because I hit that wall all the time and need to plan around it on a daily basis, and also because I've read a lot of papers about this problem. I believe the current paper about it is https://arxiv.org/abs/2506.06941.
We would disagree, in the event that you are presenting a case that on the subreddit they will continue to be frustrated forever as a law of nature, since the literature is implying that the performance cliff is on the move.
We would also disagree, in the event that the Reddit evidence is presented on the basis that it's actually true, instead of accurately reflecting public sentiment. In all of the technological revolutions I am aware of, the public sentiment was not true.
"There is just no point reviewing it"
This is where I draw the line. Having LLMs write apps without reviewing the code and then distributing these apps to others is negligent. On your website, there is an app that downloads and runs "community-created live wallpapers" on your users' computers. For all you know, you've just installed a backdoor on all of your customers.
> This is where I draw the line. Having LLMs write apps without reviewing the code and then distributing these apps to others is negligent.
For what it's worth: I, too, would like to live in a world where someone I trust is reviewing the diffs. I, too, think that is a good principle and I would prefer it be upheld.
The trouble is we do not live in that world and we did not adequately incentivize anyone to do that. We live in a world where all of us are shipping dependencies we didn't read to our customers.
The trouble with the word "negligent" is that it calls for a legal conclusion, and already in every dependency tree is the language that THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If that's not an adequate disclaimer that we could have installed a backdoor on all of the customers, what is?
> This is where I draw the line. Having LLMs write apps without reviewing the code and then distributing these apps to others is negligent.
If you think that ANY piece of consumer software ships code that is 100% vetted and reviewed, you're living in a fantasy world. That said, I absolutely review the code I ship, using agents. But the conversation about this concept is bound to go nowhere, because I suspect you do not understand (or want to understand) precisely how that works, how amazingly competent the agents are, and how I could easily make the code I ship more secure, performant, and robust, while also shipping features 10x faster than before, all while not even looking at the code personally. And until you change your mind about those things even being conceptually possible, it's probably going to keep sounding about as safe as "throwing some holy water on it and saying a prayer".
"The trouble is we do not live in that world"
You can't create a problem and then use the fact that you've created that problem as a defense.
"If that's not an adequate disclaimer that we could have installed a backdoor on all of the customers, what is?"
You could review the code you run on your customers' computers.
"If you think that ANY piece of consumer software ships code that is 100% vetted and reviewed, you're living in a fantasy world"
I'm a software engineer. I have worked in this domain since the early 90s. I have held every position from junior to CTO. Not once have I worked at a company that deployed executable code to client computers without at least two humans—the original programmer and a reviewer—having looked at it.
"I absolutely review the code I ship, using agents"
So *you* don't review it.
"But the conversation about this concept is bound to go nowhere, because I suspect you do not understand"
And I suspect it is *you* who doesn't understand, but that is hardly an argument. My actual argument is this: as someone who uses LLMs extensively in development and reviews every line of code they generate, I know you are wrong. What you are doing is dangerous and stupid.
Cheers, Drew. I just wanted to let you know you're misinformed about Wimsatt and Beardsley. By all means, carry on and have a great day.
> You can't create a problem and then use the fact that you've created that problem as a defense.
I don't understand what problem I am creating or what defense to that problem I am raising.
> You could review the code you run on your customers' computers.
I think I do a really good job on this, but obviously you are welcome to file issues about it in any project that I maintain.
> I'm a software engineer. I have worked in this domain since the early 90s. I have held every position from junior to CTO.
These were things I assumed from your writing style.
> Not once have I worked at a company that deployed executable code to client computers without at least two humans—the original programmer and a reviewer—having looked at it.
These are things that will prove to be boring semantic stuff. Like: not once have I worked at a company that did not deploy the Linux kernel to customers. No reviewer has ever read the Linux kernel. Maybe I read like... half a million lines of it over my whole career? I definitely did not read the code changes in a point release.
> I know you are wrong. What you are doing is dangerous and stupid.
That may be, but our entire field might be dangerous and stupid. It appears to depend on an unrealistic expectation about human behavior, which seems to break down because some of the forces are no longer human.
> To date, I personally haven’t seen any “vibe” coded app, no matter what methodology is used, that goes beyond “toy” app territory.
We've vibe coded a migration tool from one devops solution to another. It needed several passes of iterating, and even the final version had edge cases it didn't get right, but ultimately it was good enough, and saved us time which we frankly did not have.
Critically, this was throwaway code. I'd never want to touch it again. Like anything vibe-coded, it's an unmaintainable stew of tech debt the moment it gets generated. And for this use case — a one-time migration — that's fine. For something you actually want to keep using, much less selling to clients/customers? No thank you. (Even in this case, I think I would've rather paid someone to build the migration tool. But we were running out of time finding someone, and maybe it was worth the one-time experiment.)
_You_ didn't write that code. Nor did any human being. And even the LLM that _did_ write it cannot answer questions about it, because it doesn't know what a "question" is, or what "code" is, and it has already lost the context on what it was "thinking" when it generated that.
You cannot do code review on "vibed" code. You cannot meaningfully iterate on it. Unlike a junior developer, there's no learning curve, there's no annual performance review, there isn't even watercooler talk. At the end of the day, I want a human being to sign off on code that ends up in a repository. That's a bar so low it shouldn't even be debatable. What, when the customer complains about a bug, do you want the LLM to do the deescalating phone call, too?
Smaller-scoped use cases of LLMs in code I can see. Spicy autocomplete, sure, why not. Often, it works OK; "oh yeah, that's exactly what I was gonna type anyway". 20% of the time, it gets it wrong. No biggie. I can even see, as Beatrix suggests, using it as a little research/debug helper to look things up, give suggestions, etc. I _would_ prefer if Stack Overflow had evolved in that direction: integrate the community into the IDE, somehow. Instead, they stole Stack Overflow and everything else on the Web and put it behind a monthly subscription. Oh well.
> Absolutely. Programming languages like Swift or TypeScript are now closer to assembly code for me & others who work via agents. There is just no point reviewing it, most of the time.
This comparison makes no sense. You don’t typically have to read assembly code, but the assembly code you get when your code is compiled is deterministic. Your code in a higher-level programming language is a set of instructions that produces a predictable, practically guaranteed result (compiler bugs are rare).
LLMs speak in “natural language,” and while their output can be useful in some situations and save time, the result is not guaranteed; if you don’t review the code, you have no way of knowing what you are shipping. This may be okay for small toy projects for yourself but really isn’t acceptable when trying to ship a commercial app.
> Now is it laziness to leverage tools that allow you to 50x your output as a developer? Even as a teenager, who can now make apps without learning Swift first? Why would they? That era is gone, and they just want to make the thing.
Except when they hit a wall the LLM can’t figure out, and they don’t have the coding skills to even write a proper prompt because they don’t know how to code. So they keep prompting the thing and fiddling and just hoping the issue goes away, but they don’t have the chops to verify a fix. Please, please, LLM, fix it, because I don’t know what I’m doing! No-code tools for creating apps are not new, BTW. They can go further now, sure, but selling this stuff to customers, vibe-coded, is like letting me fly a plane and just hoping autopilot takes care of everything. I’m not saying you shouldn’t leverage new tools, but what you say you’re doing is pretty crazy, and coming from someone who can code, it seems lazy. Surely you can read the code in your wallpaper app since the LLMs are saving you so much time? How large can that codebase be?
> These are things that will prove to be boring semantic stuff. Like: not once have I worked at a company that did not deploy the Linux kernel to customers. No reviewer has ever read the Linux kernel.
The Linux kernel was not written by an LLM. Companies all over the world are using it and it is reviewed by experts all the time. This is another silly comparison.
Visitor: Meanwhile, I was letting you know that you are misinformed about Wimsatt and Beardsley, cheers, and I hope you have a great day as well.
A simple reason that the genre argument is not meaningful is that obviously somebody in the field would have thought of it between 1946 and now. The usual papers on the topic are Barthes' "The Death of the Author" and Foucault's "What Is an Author?", either one of which will generalize the main result across genres.
The main result is that interacting with the inner world of an author is a) not possible, but also b) a distraction from actually discussing a text on its merits.
Drew - Respectfully, no. Barthes and Foucault were (post)-structuralists. Wimsatt and Beardsley were New Critics. The two schools of thought were worlds apart.
New Critics believed precisely in a real, categorical distinction between literary texts and all other texts. That's their starting position, that there is something special about so-called literary language, and that it requires training to know how to read and criticize it. That's what they thought English Departments should train students to do. Note that the intentional fallacy was introduced by W&B in a book called "The Verbal Icon: Studies in the Meaning of Poetry." Poetry was W&B's subject. How to read and how not to read poetry was what they cared about.
You're right that the distinction between literary and nonliterary texts is denied by some structuralists and post-structuralists. But that has nothing to do with W&B.
W&B's New Criticism dominated the scene in the middle of the 20th century. It was largely supplanted, among scholars, by people like Barthes and Foucault at the end of the century. I won't say more here because I can't imagine anyone cares.
I'm a hobbyist at coding, and I learn from you and others here. I appreciate your insights. But what you're talking about in this instance happens to be my life's work.
Visitor: Respectfully, we continue to disagree.
We agree that Barthes and Foucault were (post)-structuralists, I brought them up because we now have three citations from different and incompatible schools that somehow all agree with my main observation that the author's intent is the wrong way to judge a work. None of this is super relevant but I guess let's go into it.
> Poetry was W&B's subject. How to read and how not to read poetry was what they cared about.
That may be, but the irony of my position is that it does not matter what their subject is or what they cared about. What matters is the meaning we construct when we read a text.
> But what you're talking about in this instance happens to be my life's work.
That may also be, but upon first glance it reads like an argument from an anonymous authority. Unless you are prepared to disclose your life's work, I don't really see how it is relevant to the topic at hand.
To make matters worse, based on some of the previous admissions regarding how some people prefer to approach software architecture and code, you can never be certain, any longer, that you are not debating a chatbot through a middle man. In this case about a highly subjective topic such as literary theory and intentional fallacy.
@Oskar So I had a quick look at the Backdrop application as suggested:
- I couldn't test the trial version or have a look at the .app bundle because the .dmg is corrupted (according to macOS).
- From the screenshots or description, I understand this is an application that uses a specific window level and plays a movie. And the main UI is basically made of lists. Considering there does not seem to be a huge technical value in the product itself, I believe you when you said this was done mostly via vibe coding. But it could also be coded in less than a day without AI.
- I'm not sure claiming this was done 80% through vibe coding is a good marketing strategy. After all, if anyone can code it, why would they pay $2.99 per month for something that could just be free software or available at $0.99 (one-time purchase) on the Mac App Store?
- The Support Center is nice but the part about the LockScreen feature not being reliable would not encourage me to pay $2.99 per month. Is this unreliable because of Vibe Coding?
@Oskar
> I don't go through the code in detail, nor do I review any of it before shipping. For that, I just do a few review passes with fresh agents.
How can you then guarantee the code does not include backdoors before shipping? You just rely on the notarization check?
> I couldn't test the trial version or had a look at the .app bundle because the .dmg is corrupted (according to macOS)
Sounds like you are on a very old version of macOS. Backdrop requires macOS 14.0 or later.
> How can you then guarantee the code does not include backdoors before shipping? You just rely on the notarization check?
Backdoors by whom? The agents don't randomly add backdoors. You're much more at risk to get that by using 3rd party packages.
> From the screenshots or description, I understand this is an application that uses a specific window level and plays a movie
Backdrop offers an entire community platform (powered by CloudKit), a custom backdrop editor, thousands of user-generated backdrops, and lock screen support. It's not "vibe coded", but I have built it using AI agents, together with my own expertise of macOS internals.
@Oskar
> Backdoors by whom? The agents don't randomly add backdoors. You're much more at risk to get that by using 3rd party packages.
You have networking code in your app that you described as not being verified. Why would you trust this automatically generated code when, like you mentioned, you can already get code poisoning from a 3rd-party package (or even an IDE)?
It somehow feels like the Build button is labeled "I feel lucky".
"The agents don't randomly add backdoors"
They do. LLMs are trained on a lot of insecure sample code. LLM-generated code will create unsecured API endpoints, leak API keys, and do all kinds of things that a human engineer of basic competency would never do.
I find it pretty interesting that you publicly admit you don't even meet the most minimal professional standards. Companies have had significant legal repercussions for releasing vulnerable software. If anything happens, your comments that you don't have to take even the most basic security steps because you have a disclaimer on your website won't help you in court.
"But the fact he thinks this is a potential solution is nutballs and makes me think he does not understand the problem really."
I'm starting to think that Jonathan Blow is an asshole.