#Benchmark
10 articles with this tag
Workflow Agents Lag Behind Demand
New Claw-Eval-Live benchmark reveals LLM agents struggle with dynamic workflows and verifiable execution, with top models failing over a third of tasks.
LLMs Plan, But Do They Plan Safely?
New LLM robotic safety benchmark, DESPITE, finds scale boosts planning but not safety. Proprietary models lead, revealing a critical gap for safe robotic deployment.

Gumloop Secures $50M Series B
Gumloop secures $50M Series B led by Benchmark to enhance its AI automation and agent platform for enterprises.

Gemini Deep Research Unlocks Advanced AI for Devs
Clarifai Hits Fastest GPT-OSS-120B Inference and Narrows the GPU–ASIC Gap
Clarifai’s latest benchmark on OpenAI’s GPT-OSS-120B model points to a quiet but important shift in AI infrastructure.

Salesforce Agentic AI Gets Real-World Performance Benchmark
Applied Compute's Agent Workforce Targets Niche AI with $80M
A stealthy startup from ex-OpenAI researchers, Applied Compute, has emerged with $80 million in funding to argue that general-purpose AI is just the beginnin...

Exa raises $85M to build a search engine for AIs
