It was just updated. Most benchmarks are saturated: they show most models as similar, including open-source ones, yet no developers are using those to ship products with a meaningfully large customer base.
This benchmark is finally something that actually reflects real developers' experience with AI models.
Plus I am really glad that there are actual ways to compare models.
I had no idea how to use GPT mini: is it like Haiku or Sonnet? The API cost is like Haiku's, but its performance is much, much better with thinking. With these you can actually compare models realistically.
I had no idea how thinking dramatically changes these dynamics.
5.4 mini isn't close to Opus 4.6, which is what I saw the first time around.
There is also some very odd methodology here, considering that SWE-rebench goes through problem rotation and decontamination efforts as well, and their own leaderboard is quite different from this one.
And it is, or has previously been, the standard for how "real developers experience AI models".
Which makes it weird that it was penalized for it, since there was nothing in the prompt to prevent it from doing the thing that these models/harnesses SHOULD be doing.
This seems indicative of a poor benchmark.
That's pretty much the consensus in the large threads about this.
That's not even getting into the limited task set of DeepSWE vs. SWE-rebench.
SWE-rebench is like 20,000+ tasks; I think DeepSWE was just over 100.
It's more about the model: is it doing what's asked or doing something else? No doubt it works that well for you, but GPT is not having those problems; it will do what's asked.
Also, those tend to burn through tokens. The post also says Anthropic tends to show more awareness of the environment.
It's just a quality assessment, not propaganda or anything like what you said.
u/randombsname1 8h ago
Oh look, this random benchmark being astroturfed on every AI subreddit again.