#AI Benchmarks
7 articles with this tag
OpenAI Unveils LifeSciBench
OpenAI's LifeSciBench is a new benchmark designed to test AI's real-world applicability in complex life science research, moving beyond basic question answering.

Task Fidelity Scaling Laws: Kobie Crawford on AI Data Quality
Kobie Crawford of Snorkel discusses 'Task Fidelity Scaling Laws,' emphasizing how data quality impacts AI model performance and outlining Snorkel's approach to creating verifiable datasets.

Exa Unveils New Code Search Benchmarks
Exa.ai releases 'WebCode', a new benchmark suite for evaluating search performance in coding agents, addressing limitations in existing tools.

Poetiq's Small Team Sets New Together AI Benchmark Records
Poetiq's small team is setting new Together AI benchmark records by leveraging recursively self-improving meta-systems to optimize existing LLMs.

Claude Sonnet 4.6 Ups the AI Ante
Anthropic's Claude Sonnet 4.6 launches with major upgrades in coding, reasoning, and computer use, plus a 1M token context window.

Step 3.5 Flash: AI's New Efficiency Standard
The Step 3.5 Flash model targets efficiency with a 196B-parameter foundation and 11B active parameters, offering competitive performance at lower latency.

Claude Opus 4.6: Smarter, Faster, and Longer Context
Anthropic's Claude Opus 4.6 launches with a 1M token context window, enhanced coding, and state-of-the-art benchmark performance.