For instance, if an AI model could complete a one-hour task with 50% success, it only had a 25% chance of successfully completing a two-hour task. This indicates that for 99% reliability, task duration must be reduced by a factor of 70.
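Worth spelling out where that factor of 70 comes from: the example implies that success decays roughly exponentially with task length (each hour-sized chunk of work has to succeed independently), so the arithmetic is a one-liner. A minimal sketch, assuming that model and writing T_50 for the task length the model completes half the time:

    % Assumption: independent hour-sized chunks, so success decays exponentially with duration.
    P(t) = 0.5^{t/T_{50}}
    % Setting P(t) = 0.99 gives the 99%-reliability horizon:
    t/T_{50} = \ln(0.99)/\ln(0.5) \approx 0.0145
    \Rightarrow\; T_{50}/t \approx 69    % i.e. roughly the factor of 70 quoted above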
This is interesting. I have noticed this myself. Generally, when an LLM boosts productivity, it shoots back a solution very quickly, and after a quick sanity check, I can accept it and move on. When it has trouble, that’s something of a red flag. You might get there eventually by probing it more and more, but there is good reason for pessimism if it’s taking too long.
In the worst case scenario where you ask it a coding problem for which there is no solution—it’s just not possible to do what you’re asking—it may nevertheless engage you indefinitely until you eventually realize it’s running you around in circles. I’ve wasted a whole afternoon with that nonsense.
Anyway, I worry that companies are no longer hiring junior devs. Today’s juniors are tomorrow’s elites and there is going to be a talent gap in a decade that LLMs—in their current state at least—seem unlikely to fill.
Sucks for today’s juniors, but that gap will bring them back into the fold with higher salaries eventually.
I’ve noticed this too, and it’s even weirder when you compare it to a physics question. It very consistently tells me when my recent brain fart of an idea is just plain stupid. But it will try eternally to help me find a coding solution even if it just keeps going in circles.
I think part of this comes down to the format. Physics can often be analogized and can be very conversational when it comes to demonstrating ideas.
Most code also looks pretty similar if you don’t know how to read it, and unlike natural language, the syntax is absolute with no room for interpretation or translation.
I’ve found it’s consistently good if you treat it like a project specification: include all of your requirements in a list format in the very first message, have it pseudocode the draft, and have it list what libraries it wants to use so you can make sure they work how you expect.
There’s some screening that goes into using it well, and that only comes with already knowing roughly how to code what you’re trying to make.
In the worst case scenario where you ask it a coding problem for which there is no solution—it’s just not possible to do what you’re asking—it may nevertheless engage you indefinitely until you eventually realize it’s running you around in circles.
Exactly this, and it’s frustrating as a Jr dev to be fed this bs when you’re learning. I’ve had multiple scenarios where it blatantly told me wrong things. Like using string interpolation in a terraform file to try and set a dynamic source - what it was giving me looked totally viable. It wasn’t until I dug around some more that I found out that terraform init can’t use variables in the source field.
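For anyone who hasn’t hit this: Terraform resolves module sources during terraform init, before any variables are evaluated, so the source argument has to be a literal string. A hypothetical sketch of the kind of config the LLM kept suggesting (the module path and version here are made up):

    variable "module_version" {
      type    = string
      default = "1.2.0"
    }

    module "network" {
      # terraform init rejects this: variables are not allowed in the source argument
      source = "git::https://example.com/modules/network.git?ref=v${var.module_version}"

      # only a literal string works, e.g.:
      # source = "git::https://example.com/modules/network.git?ref=v1.2.0"
    }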
On the positive side, it helps give me some direction when I don’t know where to start. I use it with a highly pessimistic and cautious approach. I understand that today is the worst it’s going to be, and that I will be required to use it as a tool in my job going forward, so I’m making an effort to get to grips with it.
Sadly, the lack of junior devs means my job is probably safe until I am ready to retire. I have mixed feelings about that. On the one hand, yay for me. On the other, sad for the new grads. And sad for software as a whole. But software truly sucks, and has only been enshittifying worse and worse. Could a shake-up like this somehow help that? I don’t see how, but who knows.
In the ‘Medium’ difficulty category, OpenAI’s o4-mini-high model scored the highest at 53.5%.
This fits my observation of such models. o4-mini-high is able to help me with 80-90% of the problems at work. For the remaining problems, it would come up with a nonsensical solution, and no matter how much I prompt it, it would tunnel-vision on that specific approach. It could never second-guess itself, realise that its initial solution is completely off the mark, and try an entirely different approach. That’s where I usually step in and do the work myself.
It still saves me time with the trivial stuff though.
I can’t say the same for the rest of the LLMs. They are simply no good at coding and just waste my time.
I didn’t see Claude 4 Sonnet in the tests, and that’s the one I use. From my experience, it looks to be in about the same category as o4-mini.
It is a nice tool to have in my belt, but these LLM-based agents are still very far from being able to do advanced and hard tasks. To me, it is probably more important to communicate and learn about the limitations of these tools so as not to lose time instead of gaining it.
In fact, I am not even sure they are good enough to really generate production-ready code. But they are nice for pre-reviewing, building simple scripts that don’t need to be highly reliable, analysing a project, asking specific questions, etc. The game changer for me was using Clojure-MCP. Having a REPL at its disposal really enhances the quality of most answers.
For me, Claude Code is where everything finally clicked. For advanced stuff, sure, they’re shit when they’re left alone. But as long as I approach it like a junior developer (breaking the tasks down into easy bites, having a clear plan all the time, steering away from pitfalls), I find myself enjoying other stuff while it’s doing the monkey work. Just be sure you provide it with tools, MCP, RAG, and some patience.
Search engines are able to help me with 100% of my work.
Not anymore. They’ve all made deals with each other, and search engines SUCK these days.
I remember those times, too (well, some 99.9%; there are still a few issues I never found a solution to).
But those times are long past; search engines suck nowadays.
ai is basically just the worst answer on stackexchange
It’s literally the most common answer on stackexchange.
It’s a rubber ducky that talks back. If you don’t take it seriously, it can reach a level of usefulness just above that of a wheezing piece of yellow rubber.
They aren’t as cute as actual rubber ducks, though.
Actual rubber ducks don’t randomly spew bullshit, either.
The bullshit is good; it triggers Cunningham’s Law in my brain.
Sometimes it’s easier to come up with a solution by correcting something blatantly wrong than by doing it from scratch.