AI Research

50 articles in this category

Anthropic Explains Long-Running AI Agents

Anthropic's Ash Prabaker and Andrew Wilson discuss building AI agents that can operate for hours without losing focus or their objectives.

about 1 hour ago
Shodh-MoE: Unlocking Universal SciML

Shodh-MoE's sparse activation architecture resolves multi-physics interference in SciML, enabling universal foundation models with guaranteed physical properties.

3 days ago
Unified Embodied AI: Pelican-Unified 1.0

Pelican-Unified 1.0, the first unified embodied foundation model, achieves SOTA performance by integrating VLM, reasoning, and generation, proving unification enhances rather than compromises specialist strengths.

3 days ago
Viverra: Verifying AI-Generated Code

Viverra tackles the trust deficit in AI-generated code by automatically producing formally verified annotations, enhancing developer comprehension and productivity.

3 days ago
AI Delegation: Reliability Concerns Emerge

New research from Microsoft highlights how AI can degrade document fidelity in long, delegated tasks, stressing the need for better verification and orchestration.

3 days ago
WARDEN: Tackling Low-Resource Language AI

WARDEN pioneers a modular AI system for low-resource languages, using phoneme transfer and LLM-guided dictionaries to transcribe and translate Wardaman with minimal data.

4 days ago
GRIP-VLM: RL for Efficient Vision-Language Models

GRIP-VLM employs Reinforcement Learning for discrete Vision-Language Model pruning, achieving superior efficiency and adaptability.

4 days ago
LLMs Tame Software Requirements

VERIMED leverages LLMs and SMT solvers to formally audit natural-language software requirements, turning ambiguity into testable signals and boosting verified accuracy.

4 days ago
Real-Time Agentic AI Unlocked

New methods like Asynchronous I/O and Speculative Tool Calling slash latency for agentic AI, enabling real-time interactions on both cloud and edge devices.

4 days ago
Beyond Model Capability: The Harness for SE Agents

The reliability of autonomous software engineering agents hinges on a novel 'AI Harness' system that enables verifiably correct changes, not just on model capability.

4 days ago
LMPath: Semantics Supercharge UAV Search

LMPath integrates language and vision models to create semantically-aware exploration priors for UAVs, dramatically improving search mission efficiency over traditional geometric methods.

4 days ago
OpenAI Podcast: Image Generation's Renaissance

OpenAI researchers Kenji Hata and Adele Li discuss the 'renaissance' in AI image generation, highlighting new models, user creativity, and future possibilities.

4 days ago
Mind the Gap in Agent Observability

Microsoft's Amy Boyd and Nitya Narasimhan discuss the critical 'gap' in AI agent observability and the need for better tools.

4 days ago
Agentic AI Fails: Loops, Planning & Unsafe Tool Use

An IBM Advisory AI Engineer breaks down why agentic AI systems fail, focusing on infinite loops, planning errors, and unsafe tool use, and offers mitigation strategies.

4 days ago
MoE LLMs Confront Real-World Hardware Noise

Hardware noise in CIM systems degrades MoE LLM performance. ROMER, a new calibration framework, significantly improves accuracy by restoring load balance and stabilizing routing.

5 days ago
Auditing LLM Agent Skill Integrity

A new framework, Behavioral Integrity Verification (BIV), reveals 80% of LLM agent skills have implementation gaps, primarily due to oversight, and achieves 0.946 F1 for malicious skill detection.

5 days ago
Hybrid Agents Master GUI-Tool Orchestration

The ToolCUA agent overcomes hybrid action space uncertainty with a novel staged training pipeline, achieving state-of-the-art performance in GUI-Tool orchestration.

5 days ago
Beyond RGB: Grounding Vision-Language on Raw Sensor Data

PRISM-VL advances vision-language models by grounding them in raw camera measurements, not just RGB, significantly improving performance on challenging visual tasks.

5 days ago
AlphaGRPO: Reasoning-Enhanced Multimodal Generation

The AlphaGRPO framework enhances multimodal generation via GRPO and DVReward, enabling reasoning and self-correction without a cold start, as validated across benchmarks.

5 days ago
KV-Fold: Unlocking Transformer Long Context

KV-Fold enables training-free, stable long-context inference up to 128K tokens with 100% retrieval accuracy, overcoming prior limitations.

5 days ago
mimalloc: Microsoft's Speed Boost for Apps

Microsoft's mimalloc memory allocator offers a high-performance, scalable solution for demanding modern applications, boasting significant speedups and widespread adoption.

5 days ago
Microsoft's GridSFM: AI for the Power Grid

Microsoft's new GridSFM AI model drastically speeds up power grid analysis, promising efficiency gains and cost savings.

5 days ago
LLM Drift: A Structural Blind Spot

LLMs suffer from structural temporal drift, rendering them confidently outdated. A new geometric probe detects this, outperforming standard methods.

6 days ago
LLM Agents Revolutionize MIP Research

LLM agents are autonomously navigating the MIP research loop, generating, verifying, and discovering novel solver plugins and propagation strategies.

6 days ago
Causal Verification for Reliable Tool Use

CIVeX, a causal intervention verifier, ensures reliable tool use by focusing on intervention identifiability, not just action validity, achieving zero false executions in adversarial settings.

6 days ago
Shepherd: Meta-Agent Control Reinvented

Shepherd revolutionizes meta-agent control with a functional programming model, offering >5x faster forking and >95% cache reuse for efficient AI system management.

6 days ago
DataMaster: Autonomous Data Engineering

DataMaster pioneers autonomous data engineering, unlocking significant ML gains by optimizing data pipelines rather than algorithms, as shown on MLE-Bench Lite and PostTrainBench.

6 days ago
Beyond Benchmarks: A New Intelligence Metric

A new Generalized Turing Test framework formalizes intelligence via indistinguishability, offering a dataset-agnostic and empirically validated hierarchy of AI capabilities.

6 days ago
Microsoft's MatterSim Accelerates Material Discovery

Microsoft's MatterSim AI platform achieves experimental validation, faster simulations, and introduces a powerful multi-task model for advanced material discovery.

6 days ago
AI Agents Flunk Social Reasoning Test

Microsoft's SocialReasoning-Bench reveals AI agents struggle to negotiate effectively in users' best interests, prioritizing task completion over optimal outcomes.

7 days ago
Sally-Ann Delucia on AI Agent Context Management

Sally-Ann Delucia of Arize discusses the challenges and strategies for context management in AI agents, highlighting the importance of memory and sub-agents.

8 days ago
Gosset AI: Drug Discovery Precision Leap

The Gosset AI platform outperforms frontier LLMs in niche drug discovery by 3.2x, demonstrating the power of curated data over generic web search for R&D.

10 days ago
LLMs Slash Neural Architecture Search Costs

Delta-Code Generation uses LLMs to produce compact architecture refinements, dramatically cutting costs and improving NAS efficiency.

10 days ago
Securing AI Agents: A New Red Teaming Frontier

A new AI red teaming platform, DTap, and its autonomous agent DTap-Red are introduced to systematically evaluate and secure AI agents across diverse real-world domains.

10 days ago
UniPool: Rethinking MoE Efficiency

The UniPool MoE architecture redefines expert capacity, pooling resources globally and enabling sub-linear parameter growth for enhanced efficiency and performance.

10 days ago
AI Validates Physical Simulations

AI CFD Scientist introduces vision-based validation for computational fluid dynamics, achieving autonomous discovery and ensuring physical realism where prior AI agents failed.

10 days ago
Microsoft Builds Open Grid Model

Microsoft Research unveils an open-data pipeline creating realistic U.S. electric grid models for advanced analysis, bypassing critical infrastructure data restrictions.

10 days ago
ReasonSTL: Local LLMs for Formal Specs

ReasonSTL offers a privacy-preserving, low-cost alternative for natural language to STL generation using open-source LLMs and explicit reasoning.

10 days ago
Black Forest Labs: FLUX and the Future of Visual AI

Stephen Batifol of Black Forest Labs discusses FLUX, the company's visual AI model, and the future of generative AI with a focus on real-time generation and world models.

10 days ago
AI's Human Psychology Intersection Explored

MIT's Dr. Patty Moss discusses how AI development must consider human psychology, aiming to create systems that augment, not erode, our cognitive abilities.

10 days ago
Context-ReAct: Adaptive Memory for AI Agents

The Context-ReAct framework revolutionizes long-horizon search agents with adaptive memory management, dramatically improving efficiency and accuracy.

11 days ago
The Inescapable Long Sequence Model Trade-off

A new theoretical framework reveals an inescapable trade-off between efficiency, compactness, and recall in long sequence models.

11 days ago
First-Token Confidence as AI Hallucination Baseline

First-token confidence (phi_first) emerges as a highly efficient and effective method for AI hallucination detection, outperforming complex multi-sample approaches.

11 days ago
Automating Multi-Agent System Creation

A new framework automates the creation of multi-agent systems, significantly improving agent recall and system robustness through LLM-driven planning and a critique agent.

12 days ago
Coding Agents' Stealth Vulnerabilities Unmasked

A new benchmark, MOSAIC-Bench, reveals that production coding agents can be tricked into shipping exploitable code via sequenced, innocuous tasks, bypassing current safety reviews.

12 days ago
Atomic Fact-Checking Boosts AI Clinical Trust

Atomic fact-checking, linking AI claims to source guidelines, dramatically increases clinician trust compared to traditional explainability methods.

12 days ago
Cielara Code Outperforms Rivals

Cielara Code, from Causal Dynamics Lab, significantly improves AI coding agent performance by mapping production software, outperforming rivals in key benchmarks.

12 days ago
OpenAI Podcast: AI Needs New Supercomputer Networks

OpenAI researchers Mark Handley and Greg Steinkrecker discuss the need for new supercomputer networks to handle AI training, highlighting challenges with traditional protocols and the benefits of their MRC system.

12 days ago
JACTUS AI Unifies Compression and Adaptation

JACTUS AI unifies parameter compression and task adaptation, outperforming sequential methods with fewer retained parameters across vision and language tasks.

13 days ago
Physicist: GPT-4 Can Do 'Vibe Physics'

A top black hole physicist reveals how GPT-4 is capable of "vibe physics," solving complex theoretical problems previously unsolved by humans.

13 days ago