r/Anthropic 12h ago

Performance: Anthropic seems to have caught up with ChatGPT 5.5 with Opus 4.8

0 Upvotes

2

u/randombsname1 8h ago

Oh look, this random benchmark being astroturfed on every AI subreddit again.

1

u/DepartmentOk9720 2h ago

What? Explain yourself. What do you mean, astroturfed? You think we have an agenda or something because you don't like what you see?

1

u/randombsname1 2h ago

I mean I've seen this post over and over and over on every subreddit for multiple days in a row.

Continuously.

1

u/DepartmentOk9720 2h ago edited 2h ago

It was just recently updated. Most benchmarks are saturated: they show most models as similar, including open-source ones, yet no developers are using those to ship products that have a meaningfully large customer base.

This benchmark is finally something that actually reflects real developers' experience with AI models.

Plus I am really glad that there are actual ways to compare models.

I had no idea how to use GPT mini. Is it like Haiku or Sonnet? The API cost is like Haiku's, but its performance is much, much better with thinking. With these results you can actually compare models realistically. I had no idea how dramatically thinking changes these dynamics.

1

u/randombsname1 2h ago

5.4 mini isn't close to Opus 4.6, which is what I saw the first time around.

There is also some very odd methodology, considering that SWE-rebench goes through problem rotation and decontamination efforts as well, and its own leaderboard is quite different from this one.

And it is, or has previously been, the standard for how "real developers experience AI models".

Something is screwy with this benchmark.

1

u/DepartmentOk9720 2h ago

Rebench is also very similar and shows similar findings.

1

u/randombsname1 2h ago

What? No, it doesn't.

https://swe-rebench.com/

Claude is much closer, and this was also before Opus 4.8.

1

u/DepartmentOk9720 2h ago

They don't have a penalty for cheating. You can look up what constitutes cheating: the model looks up the git history to fix issues.

1

u/randombsname1 2h ago

It's weird that it was penalized for that, since there was nothing in the prompt to prevent it from doing the thing that these models/harnesses SHOULD be doing.

This seems indicative of a poor benchmark.

That's pretty much the consensus in the large threads about this.

And that's not even getting into the limited task set of DeepSWE vs. SWE-rebench.

SWE-rebench is like 20,000+ tasks; I think DeepSWE was just over 100.

1

u/DepartmentOk9720 1h ago

It's more about the model: is it doing what's asked or doing something else? No doubt it's good if it works for you, but GPT doesn't have those problems; it will do what's asked.

Also, those tend to burn through tokens. The post also says Anthropic tends to show more awareness of the environment.

It's just a quality assessment, not propaganda or anything like what you said.

1

u/iperson4213 5h ago

idk if 58 vs 70% is caught up…