Friday, July 18, 2025

Study on AI Coding Tools

METR (Hacker News):

We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI acceleration from AI R&D automation.

See the full paper for more detail.

Via Thomas Claburn:

Not only did the use of AI tools hinder developers, but it led them to hallucinate, much like the AIs have a tendency to do themselves. The developers predicted a 24 percent speedup, but even after the study concluded, they believed AI had helped them complete tasks 20 percent faster when it had actually delayed their work by about that percentage.

[…]

The study involved 16 experienced developers who work on large, open source projects. The developers provided a list of real issues (e.g. bug fixes, new features, etc.) they needed to address – 246 in total – and then forecast how long they expected those tasks would take. The issues were randomly assigned to allow or disallow AI tool usage.

I’m skeptical about the experimental design, and I suspect there’s huge variance in how much developers in the real world get out of AI.

Ruben Bloom:

I was one of the developers in the @METR_Evals study. Thoughts:

1. This is much less true of my participation in the study where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I’d look at something else (FB, X, etc) and continue to do so for much longer than it took the prompt to run.

I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of the AI gains this way.

[…]

4. As a developer in the study, it’s striking to me how much more capable the models have gotten since February (when I was participating in the study)

[…]

5. There was a selection effect in which tasks I submitted to the study. (a) I didn’t want to risk getting randomized to “no AI” on tasks that felt sufficiently important or daunting to do without AI assistance. (b) Neatly packaged and well-scoped tasks felt suitable for the study, large open-ended greenfield stuff felt harder to legibilize, so I didn’t submit those tasks to study even though AI speed up might have been larger.


Update (2025-07-21): Dare Obasanjo:

Remember the study that showed developers think vibe coding saves them time but measurements show it doesn’t after factoring in time prompting and reviewing the AI’s work?

A startup founder is on X documenting his vibe coding struggles with Replit which includes deleting the production database and ignoring requests not to make changes without asking for permission.

6 Comments


Suman Chakrabarti

When I read it, I saw a throwaway clause that invalidates the premise; as a trained reviewer, I’ve learned to spot such things.

“By analyzing screen recording data from a subset of the studied developers”

They’d said they studied 246 samples, but turns out charts only show 44+30. THEY LIED. Exaggerated. Cherry picked results.


Love this future where the only metric that matters is how fast we can slop some slop together before we slop some more slop on top. Quality? Pff, who cares about that!


Aleks Dorohovich

First of all, this:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
https://arxiv.org/abs/2507.09089
> On the other hand, benchmarks may overestimate model capabilities by only measuring performance on well-scoped, algorithmically scorable tasks. And we now have strong evidence that anecdotal reports/estimates of speed-up can be very inaccurate.

And second, what worries me the most about AI:

https://www.media.mit.edu/publications/your-brain-on-chatgpt/
> Researchers at MIT’s Media Lab asked subjects to write several SAT essays and separated subjects into three groups — using OpenAI’s ChatGPT, using Google’s search engine and using nothing, which they called the “brain‑only” group. Each subject’s brain was monitored through electroencephalography (EEG), which measured the writer’s brain activity through multiple regions in the brain.

> They discovered that subjects who used ChatGPT over a few months had the lowest brain engagement and “consistently underperformed at neural, linguistic, and behavioral levels,” according to the study.


The claim that the tools are so much better now than at some point in the past is usually disingenuous, because if you look back in time, the people making the claim were just as boosterish then as they are now.


"They’d said they studied 246 samples, but turns out charts only show 44+30."

The effect is also visible in the unfiltered data and appears to be similarly strong.

"it’s striking to me how much more capable the models have gotten since February"

They haven't. It isn't easy to gauge short-term progress; things may seem relatively stable in the short term, but then you look back over two years and can see the drastic change. However, there has not been a drastic change since February.

It's obvious that, depending on how you use LLMs for coding, you can be a lot slower. Particularly when using agentic coding, having the LLM write the code and then reviewing the code written by the LLM can quickly take twice as long as just writing it. OTOH, asking the LLM something like "how do I resize an image in memory" is faster than googling it and looking at Stack Overflow or the API documentation.
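For illustration, the kind of quick answer in question is only a few lines. A minimal sketch, assuming Python and Pillow (neither of which the comment specifies):

```python
# Hypothetical example: resize an image entirely in memory, no temp files.
import io
from PIL import Image

def resize_in_memory(data: bytes, size=(640, 480)) -> bytes:
    img = Image.open(io.BytesIO(data))       # decode from an in-memory buffer
    img = img.resize(size, Image.LANCZOS)    # high-quality resample
    out = io.BytesIO()
    img.save(out, format="PNG")              # re-encode back into memory
    return out.getvalue()
```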


Tudorminator

_‘OTOH, asking the LLM something like "how do I resize an image in memory" is faster than googling it and looking at Stack Overflow or the API documentation.’_

Assuming you have at least some idea about how it’s done. If you don’t, and if the LLM hallucinates some code that doesn’t even compile, or uses nonexistent API calls or whatever, you are SO MUCH better off just poking around Stack Overflow or your favourite search engine... At least that's been my experience so far.
