#LLM Evaluation

8 articles with this tag

Meta's Nishant Gupta on Evaluating Agentic AI Systems

Nishant Gupta from Meta's Superintelligence Labs discusses the shift from accuracy-based evaluation to reliability-focused methods for agentic AI systems.

11 days ago

AI Research

ClinEnv: Bridging LLM Gaps in Clinical Decision-Making

The ClinEnv benchmark reveals LLMs struggle with sequential medical decision-making, showing a gap between diagnostic and management capabilities.

about 1 month ago

tech

LinkedIn Tries Real-World AI Benchmarking

LinkedIn's new Crosscheck platform aims to provide real-world AI model performance insights tailored to professional roles and tasks.

about 2 months ago

AI Research

DeepWeb-Bench: Beyond Frontier LLM Claims

The DeepWeb-Bench benchmark exposes derivation and calibration as major LLM failure points, revealing domain specialization and the inadequacy of current evaluations.

about 2 months ago

AI Research

LLM Drift: A Structural Blind Spot

LLMs suffer from structural temporal drift, rendering them confidently outdated. A new geometric probe detects this, outperforming standard methods.

about 2 months ago

AI Research

LLMs Fail Esoteric Code Tasks

Frontier LLMs show a dramatic capability gap on a new benchmark using esoteric programming languages, revealing a reliance on memorization over reasoning.

4 months ago

Artificial Intelligence

Balyasny's AI Engine

Balyasny Asset Management built a powerful AI research engine using OpenAI models, slashing analysis times and boosting investment team confidence.

4 months ago

Technology

Context-Aware Guardrails Tested

Mozilla.ai tested context-aware guardrails for LLMs in a humanitarian context, revealing crucial multilingual performance disparities and the need for robust, domain-specific safety policies.

5 months ago