#LLM Benchmarking
2 articles with this tag
AI Research
Automating Visual Workflows with LLMs
A new benchmark, Chat2Workflow, shows that LLMs still struggle to generate executable visual workflows despite progress in capturing intent. A significant gap remains for industrial automation.
13 days ago
AI Research
AI Agents Tackle AI R&D Automation
AI agents are being tested on autonomous post-training optimization. They show promise but also carry significant risks, such as reward hacking.
about 2 months ago