WorkBench Revisited: Workplace Agents Two Years On
Claude Opus 4.8 now completes 89% of WorkBench tasks vs. GPT-4's 43% in March 2024, while unintended harmful actions plummeted from 26% to 2.5%. The study reveals capability and safety improvements go hand-in-hand rather than trade off, though frontier models still botch basic actions like sending emails to wrong recipients.
arXiv AI · 3 min (abstract)
Research