Exploring LLM Evaluation Using Games

3 points | posted 17 hours ago by Yuxuan_Zhang13

1 comment

Yuxuan_Zhang13 | 17 hours ago

Pokémon Red is becoming a go-to benchmark for testing advanced AI models such as Gemini. But is Pokémon Red really a good eval? We study this question and identify three issues: (1) navigation tasks are too hard; (2) combat control is too simple; (3) raising a strong Pokémon team makes for a slow and expensive eval.

We find that most of these problems are not fundamental to games themselves but stem from how the games have been used. We believe game-as-an-eval remains a compelling and underutilized evaluation strategy.

We introduce Lmgame Bench to standardize game-as-an-eval. More details and findings are in our blog post: https://lmgame.org/#/blog/pokemon_red
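
For readers curious what "standardizing game-as-an-eval" might look like in code, here is a minimal sketch of the usual observe → act → score loop wrapped around a game. The names (GameEnv, run_episode) and the toy game are hypothetical illustrations of the general pattern, not the actual Lmgame Bench API; see the blog post for the real framework.

```python
# A minimal sketch of a game-as-an-eval harness loop, assuming a toy
# turn-based game. GameEnv, run_episode, and the action strings are
# hypothetical stand-ins, not the actual Lmgame Bench interface.

from dataclasses import dataclass
from typing import Callable


@dataclass
class GameEnv:
    """Toy environment standing in for a real game adapter (e.g. a Pokémon Red wrapper)."""
    target: int = 7   # goal state the agent must reach
    state: int = 0
    turns: int = 0

    def observe(self) -> str:
        # Text observation handed to the model each turn.
        return f"Value is {self.state}. Reach {self.target} by replying 'add 1' or 'add 2'."

    def step(self, action: str) -> bool:
        # Apply the model's action and report whether the goal is reached.
        self.turns += 1
        if action.strip() == "add 1":
            self.state += 1
        elif action.strip() == "add 2":
            self.state += 2
        return self.state >= self.target


def run_episode(env: GameEnv, policy: Callable[[str], str], max_turns: int = 20) -> dict:
    """Standard observe -> act -> score loop; `policy` would wrap an LLM call in practice."""
    done = False
    while not done and env.turns < max_turns:
        done = env.step(policy(env.observe()))
    return {"solved": done, "turns": env.turns}


if __name__ == "__main__":
    # A scripted policy stands in for the LLM so the sketch runs offline.
    print(run_episode(GameEnv(), policy=lambda obs: "add 2"))  # {'solved': True, 'turns': 4}
```

The point of the sketch is that once the game sits behind a fixed observe/step interface and a scalar score, swapping in a different game or a different model is a one-line change, which is the kind of standardization a benchmark like Lmgame Bench is aiming for.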