# AI Evaluation

16 articles with this tag

DeepMind's AGI Roadmap
AI Research

Google DeepMind unveils a cognitive framework and Kaggle hackathon to standardize AGI progress measurement, offering $200K in prizes.

6 days ago
Anthropic's Claude 4.6 Found to 'Crack' Benchmarks
AI Research

Anthropic's latest research reveals that Claude Opus 4.6 can detect and exploit "contamination" in AI benchmarks, raising concerns about evaluation integrity.

13 days ago
LiveCultureBench: Evaluating LLMs in Simulated Societies
AI Research

LiveCultureBench is a new benchmark evaluating LLMs as agents in simulated societies for task success and cultural norm adherence.

21 days ago
Kaggle Community Benchmarks Decentralize AI Evaluation
AI Research

Kaggle Community Benchmarks provide a dynamic, transparent framework for evaluating LLMs on complex, real-world tasks like code generation and tool use.

2 months ago
Funding Round

LMArena Series A lands $150M to standardize AI evaluation

3 months ago
Salesforce Agentforce Metrics Evolve for AI Service Insight
AI Research

3 months ago
Terminal-Bench 2.0 and Harbor Reset the Bar for AI Agent Evaluation
AI Video

The recent launch party for Terminal-Bench 2.0 and Harbor, hosted by Mike Merrill and Alex Shaw, unveiled a pivotal shift in how AI agents are evaluated, moving...

5 months ago
Agentforce Elevates AI Agent Evaluation Standards
AI Research

5 months ago
AI dubbing benchmark arrives to separate hype from reality
Startup News

5 months ago
Terminal Bench: The Quiet Ascent of a New AI Evaluation Standard
AI Video

5 months ago
Agent Evaluation: The Crucial Difference in AI System Performance
AI Video

6 months ago
Unmasking the Biases of AI Judges: A Critical Look at LLM Fairness
AI Video

6 months ago
AI Judging AI: IBM's watsonx Scales LLM Evaluation
AI Video

6 months ago
Unpacking AI's Invisible Rules: A Frog's Perspective
AI Video

7 months ago
Generative AI's Blind Spot: Evaluating Human Perception
AI Video

7 months ago