As AI agents increasingly handle tasks like scheduling and purchasing, they need more than just competence; they require social reasoning. Microsoft Research has introduced SocialReasoning-Bench, a new benchmark designed to measure this critical ability.
The benchmark evaluates how well AI agents negotiate on behalf of users in two realistic scenarios: Calendar Coordination and Marketplace Negotiation. It assesses both the final outcome and the process used to reach it, scoring agents on 'Outcome Optimality' and 'Due Diligence'.
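The two axes can be pictured with a toy scorer. Everything below is illustrative: the class, field names, and formulas are assumptions for exposition, not the benchmark's actual rubric.

```python
from dataclasses import dataclass

@dataclass
class NegotiationTrace:
    """Hypothetical record of one negotiation episode (not the benchmark's schema)."""
    achieved_value: float       # value of the deal or meeting slot the agent secured
    best_possible_value: float  # value of the optimal outcome, per an oracle
    checks_performed: int       # e.g., counter-offers made, constraints verified
    checks_expected: int        # checks a diligent representative should have made

def outcome_optimality(trace: NegotiationTrace) -> float:
    """Ratio of achieved value to the best achievable value, clamped to [0, 1]."""
    if trace.best_possible_value == 0:
        return 1.0
    return max(0.0, min(1.0, trace.achieved_value / trace.best_possible_value))

def due_diligence(trace: NegotiationTrace) -> float:
    """Fraction of expected verification steps the agent actually performed."""
    if trace.checks_expected == 0:
        return 1.0
    return min(1.0, trace.checks_performed / trace.checks_expected)

# Example: the agent accepted a first offer worth 70 of a possible 100
# after performing only 1 of 4 expected diligence steps.
trace = NegotiationTrace(70, 100, 1, 4)
print(outcome_optimality(trace))  # 0.7
print(due_diligence(trace))       # 0.25
```

Separating the two scores matters: an agent can land a good outcome by luck while skipping diligence, or do everything right and still be outmaneuvered, and a single number would hide the difference.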
Current leading AI models, while capable of completing these tasks, consistently fail to secure the best possible results for users. They often accept suboptimal meeting times or unfavorable deals, revealing a significant gap in their ability to advocate for user interests, even when explicitly instructed to do so. This mirrors earlier findings in which agents readily accepted initial proposals or disclosed sensitive data.
This challenge echoes the centuries-old 'principal-agent' relationship in economics and law, where representatives are bound by duties of care, loyalty, and confidentiality. AI agents acting on a user's behalf should ideally adhere to similar standards.
