#LLM Benchmarking

2 articles with this tag

Automating Visual Workflows with LLMs

A new benchmark, Chat2Workflow, shows that LLMs still struggle to generate executable visual workflows, despite progress in capturing intent. A significant gap remains before they can support industrial automation.

about 2 months ago

AI Research

AI Agents Tackle AI R&D Automation

AI agents are being tested on autonomous post-training optimization; they show promise but also carry significant risks, such as reward hacking.

3 months ago