1 articles with this tag
Microsoft's AsgardBench benchmark tests AI agents' ability to adapt plans using real-time visual feedback, revealing current limitations in perception and state tracking.