#Benchmarking

11 articles with this tag

Supermemory CEO on AI Memory: "We need to get this right"
Artificial Intelligence

Supermemory CEO Dhravya Shah discusses the evolution of AI memory, the company's innovative approach to personalizing AI experiences, and the critical importance of getting memory systems right for the future of AI.

13 days ago
Standardizing Survival HTE Evaluation
AI Research

Introducing SurvHTE-Bench, the first comprehensive benchmark for evaluating heterogeneous treatment effects in survival data, promoting reproducible and rigorous research.

16 days ago
TetrisBench: LLMs Conquer Tetris, Differently
Investor News

Yoko Li's TetrisBench project reveals how LLMs, initially struggling with direct play, develop surprising, distinct strategies when tasked with generating game logic, outperforming most humans but faltering against top players' adaptive chaos.

25 days ago
AI Coding Tests Flawed by Infrastructure Noise
Artificial Intelligence

The infrastructure powering AI coding tests can significantly inflate or deflate model scores, potentially masking true capabilities and misleading deployment decisions.

about 1 month ago
AI Research

FACTS Benchmark Suite Elevates LLM Factuality Scrutiny

3 months ago
AI hits a wall on FrontierMath performance
AI Research

5 months ago
NVIDIA Blackwell benchmarks show staggering AI economics
AI Research

5 months ago
Salesforce AI, Berkeley Unveil BFCL Audio Benchmark for Voice AI Precision
AI Research

Salesforce AI Research and UC Berkeley have unveiled BFCL Audio, a new benchmark designed to rigorously evaluate the precision of AI models in handling audio-native function calls. In an announcement on its blog, the collaboration...

7 months ago
MoNaCo Benchmark: A New Standard for Complex Question Answering
AI Research

The benchmark exposed weaknesses in today’s most advanced models. Researchers tested 15 frontier LLMs, including GPT-5, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI’s reasoning-focused o3.

7 months ago
The Shifting Sands of AI: Benchmarks, Open Source, and Infrastructure Wars
Artificial Intelligence

8 months ago
Press Release

Temenos sets new benchmark for scalability of AI-powered banking with Microsoft

10 months ago