hackernews client

codexon

11 hours ago

This paper creates a new benchmark composed of real remote work tasks sourced from the remote working website Upwork. The best commercial LLMs, such as Opus, GPT, Gemini, and Grok, were tested.

Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given their performance on other micro-benchmarks, they probably won't do much differently on this one.

kolinko

10 hours ago

They didn't test Opus at all, only Sonnet.

One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." -- I can't imagine how Opus 4.5 could've failed that.

user

8 hours ago

[deleted]

Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051

Top AI models fail at >96% of tasks

6 Comments

codexon

kolinko

user

tessitore

Venn1

zb3