Friday, July 18, 2025

Study on AI Coding Tools

METR (Hacker News):

We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI acceleration from AI R&D automation.

See the full paper for more detail.

Via Thomas Claburn:

Not only did the use of AI tools hinder developers, but it led them to hallucinate, much like the AIs have a tendency to do themselves. The developers predicted a 24 percent speedup, but even after the study concluded, they believed AI had helped them complete tasks 20 percent faster when it had actually delayed their work by about that percentage.

[…]

The study involved 16 experienced developers who work on large, open source projects. The developers provided a list of real issues (e.g. bug fixes, new features, etc.) they needed to address – 246 in total – and then forecast how long they expected those tasks would take. The issues were randomly assigned to allow or disallow AI tool usage.

I’m skeptical about the experimental design, and I suspect there’s huge variance in how much developers in the real world get out of AI.

Ruben Bloom:

I was one of the developers in the @METR_Evals study. Thoughts:

1. This is much less true of my participation in the study where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I’d look at something else (FB, X, etc) and continue to do so for much longer than it took the prompt to run.

I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of the AI gains this way.

[…]

4. As a developer in the study, it’s striking to me how much more capable the models have gotten since February (when I was participating in the study)

[…]

5. There was a selection effect in which tasks I submitted to the study. (a) I didn’t want to risk getting randomized to “no AI” on tasks that felt sufficiently important or daunting to do without AI assistance. (b) Neatly packaged and well-scoped tasks felt suitable for the study, large open-ended greenfield stuff felt harder to legibilize, so I didn’t submit those tasks to study even though AI speed up might have been larger.


Update (2025-07-21): Dare Obasanjo:

Remember the study that showed developers think vibe coding saves them time but measurements show it doesn’t after factoring in time prompting and reviewing the AI’s work?

A startup founder is on X documenting his vibe coding struggles with Replit which includes deleting the production database and ignoring requests not to make changes without asking for permission.

6 Comments


Suman Chakrabarti

When I read it, I saw a throwaway clause that invalidates the premise; as a trained reviewer, I’ve learned to spot such things.

“By analyzing screen recording data from a subset of the studied developers”

They’d said they studied 246 samples, but turns out charts only show 44+30. THEY LIED. Exaggerated. Cherry picked results.


Love this future where the only metric that matters is how fast we can slop some slop together before we slop some more slop on top. Quality? Pff, who cares about that!


Aleks Dorohovich

First of all, this:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
https://arxiv.org/abs/2507.09089
> On the other hand, benchmarks may overestimate model capabilities by only measuring performance on well-scoped, algorithmically scorable tasks. And we now have strong evidence that anecdotal reports/estimates of speed-up can be very inaccurate.

And second, what worries me the most about AI:

https://www.media.mit.edu/publications/your-brain-on-chatgpt/
> Researchers at MIT’s Media Lab asked subjects to write several SAT essays and separated subjects into three groups — using OpenAI’s ChatGPT, using Google’s search engine and using nothing, which they called the “brain‑only” group. Each subject’s brain was monitored through electroencephalography (EEG), which measured the writer’s brain activity through multiple regions in the brain.

> They discovered that subjects who used ChatGPT over a few months had the lowest brain engagement and “consistently underperformed at neural, linguistic, and behavioral levels,” according to the study.


The claim that the tools are so much better now than at some point in the past is usually disingenuous, because if you look back in time, the people making the claim were just as boosterish then as they are now.


"They’d said they studied 246 samples, but turns out charts only show 44+30."

The effect is also visible in the unfiltered data and appears to be similarly strong.

"it’s striking to me how much more capable the models have gotten since February"

They haven't. It isn't easy to gauge short-term progress; things may seem relatively stable in the short term, but then you look back over two years and can see the drastic change. However, there has not been a drastic change since February.

It's obvious that, depending on how you use LLMs for coding, you can be a lot slower. Particularly when using agentic coding, having the LLM write the code and then reviewing the code written by the LLM can quickly take twice as long as just writing it. OTOH, asking the LLM something like "how do I resize an image in memory" is faster than googling it and looking at Stack Overflow or the API documentation.
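For illustration, the kind of quick answer in question is only a few lines. A minimal sketch, assuming Python and Pillow (neither of which the comment specifies):

```python
# Hypothetical example: resize an image entirely in memory, no temp files.
import io
from PIL import Image

def resize_in_memory(data: bytes, size=(640, 480)) -> bytes:
    img = Image.open(io.BytesIO(data))       # decode from an in-memory buffer
    img = img.resize(size, Image.LANCZOS)    # high-quality resample
    out = io.BytesIO()
    img.save(out, format="PNG")              # re-encode back into memory
    return out.getvalue()
```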


Tudorminator

_‘OTOH, asking the LLM something like "how do I resize an image in memory" is faster than googling it and looking at Stack Overflow or the API documentation.’_

Assuming you have at least some idea about how it’s done. If you don’t, and if the LLM hallucinates some code that doesn’t even compile, or uses nonexistent API calls or whatever, you are SO MUCH better off just poking around Stack Overflow or your favourite search engine... At least that's been my experience so far.
