AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
AgentAtlas proposes a unified evaluation framework for LLM agents across diverse tasks (code, browsers, OS), using taxonomies for control decisions and failure modes. Testing eight models reveals that removing explicit supervision drops accuracy by 14-40 points and collapses performance to a 0.54-0.62 floor, suggesting that most gains come from prompt engineering rather than genuine capability.
arXiv AI · 3 min (abstract)
Research