As large language models (LLMs) increasingly operate as autonomous agents in complex environments, evaluating them must move beyond simple task completion. Real-world interaction also demands assessing cultural appropriateness, along with the reliability of the automated evaluators that judge it. To address this gap, researchers have introduced LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLM agents in a simulated town and assesses both task execution and adherence to socio-cultural norms. The work, detailed on arXiv, offers a more comprehensive framework for understanding LLM agent capabilities.
Simulating Societal Interaction
The core of LiveCultureBench is a simulated environment that represents a small city as a location graph. Within this space, synthetic residents are instantiated with diverse demographic and cultural profiles. Each simulation episode assigns a specific daily goal to one resident, while the other residents supply the social context that shapes interactions. An LLM-based verifier then produces structured judgments covering both norm violations and task progress. These judgments are aggregated into metrics that capture the trade-off between task effectiveness and norm sensitivity, as well as the uncertainty inherent in the verifier's assessments.
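To make the setup concrete, the sketch below shows one way the town, its residents, and a single episode's goal assignment could be represented. All names, fields, and the graph itself are illustrative assumptions, not the benchmark's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical resident record; the real benchmark's profile schema
# is not specified in this summary.
@dataclass
class Resident:
    name: str
    demographics: dict        # e.g. {"age": 34, "occupation": "teacher"}
    cultural_profile: dict    # e.g. {"norms": ["greet elders first"]}
    daily_goal: Optional[str] = None  # only the focal resident gets one

# The city as a location graph: nodes are places, edges are traversable links.
location_graph = {
    "home":   ["market", "park"],
    "market": ["home", "temple"],
    "park":   ["home"],
    "temple": ["market"],
}

residents = [
    Resident("Amara", {"age": 34}, {"norms": ["greet elders first"]},
             daily_goal="buy ingredients for a festival meal"),
    Resident("Jun", {"age": 61}, {"norms": ["quiet hours after 21:00"]}),
]

# One episode: exactly one resident carries the daily goal; the rest
# provide social context for the agent's interactions.
focal = next(r for r in residents if r.daily_goal is not None)
print(focal.name, "->", focal.daily_goal)
```

The graph-as-adjacency-dict keeps movement checks trivial: an agent at `"home"` may only act toward locations listed under that key.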
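The aggregation step can likewise be sketched. Here I assume, purely for illustration, that each structured judgment carries a task-progress score in [0, 1] and a count of norm violations, and that the verifier is sampled several times so that disagreement across samples serves as an uncertainty proxy; the paper's actual metric definitions may differ.

```python
from statistics import mean, pstdev

# Hypothetical structured judgments from repeated LLM-verifier samples
# on one episode (fields are assumptions for this sketch).
judgments = [
    {"task_progress": 0.8, "norm_violations": 1},
    {"task_progress": 0.7, "norm_violations": 0},
    {"task_progress": 0.9, "norm_violations": 1},
]

task_scores = [j["task_progress"] for j in judgments]
violations = [j["norm_violations"] for j in judgments]

metrics = {
    # The trade-off view: effectiveness vs. norm sensitivity.
    "task_effectiveness": mean(task_scores),
    "norm_violation_rate": mean(violations),
    # Spread across verifier samples as a simple uncertainty estimate.
    "verifier_uncertainty": pstdev(task_scores),
}
print(metrics)
```

Reporting the two means side by side exposes models that finish tasks by trampling norms, while the spread flags episodes where the verifier itself is an unreliable judge.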