AI Research
50 articles in this category

Anthropic Explains Long-Running AI Agents
Anthropic's Ash Prabaker and Andrew Wilson discuss building AI agents that can operate for hours without losing focus or sight of their objectives.
Shodh-MoE: Unlocking Universal SciML
Shodh-MoE's sparse activation architecture resolves multi-physics interference in SciML, enabling universal foundation models with guaranteed physical properties.
Unified Embodied AI: Pelican-Unified 1.0
Pelican-Unified 1.0, the first unified embodied foundation model, achieves SOTA performance by integrating VLM, reasoning, and generation, proving unification enhances rather than compromises specialist strengths.
Viverra: Verifying AI-Generated Code
Viverra tackles the trust deficit in AI-generated code by automatically producing formally verified annotations, enhancing developer comprehension and productivity.

AI Delegation: Reliability Concerns Emerge
New research from Microsoft highlights how AI can degrade document fidelity in long, delegated tasks, stressing the need for better verification and orchestration.
WARDEN: Tackling Low-Resource Language AI
WARDEN pioneers a modular AI system for low-resource languages, using phoneme transfer and LLM-guided dictionaries to transcribe and translate Wardaman with minimal data.
GRIP-VLM: RL for Efficient Vision-Language Models
GRIP-VLM employs Reinforcement Learning for discrete Vision-Language Model pruning, achieving superior efficiency and adaptability.
LLMs Tame Software Requirements
VERIMED leverages LLMs and SMT solvers to formally audit natural-language software requirements, turning ambiguity into testable signals and boosting verified accuracy.
Real-Time Agentic AI Unlocked
New methods like Asynchronous I/O and Speculative Tool Calling slash latency for agentic AI, enabling real-time interactions on both cloud and edge devices.
Beyond Model Capability: The Harness for SE Agents
Autonomous software engineering agents' reliability hinges on a novel 'AI Harness' system, not just model capability, enabling verifiably correct changes.
LMPath: Semantics Supercharge UAV Search
LMPath integrates language and vision models to create semantically-aware exploration priors for UAVs, dramatically improving search mission efficiency over traditional geometric methods.

OpenAI Podcast: Image Generation's Renaissance
OpenAI researchers Kenji Hata and Adele Li discuss the 'renaissance' in AI image generation, highlighting new models, user creativity, and future possibilities.

Mind the Gap in Agent Observability
Microsoft's Amy Boyd and Nitya Narasimhan discuss the critical 'gap' in AI agent observability and the need for better tools.

Agentic AI Fails: Loops, Planning & Unsafe Tool Use
An IBM Advisory AI Engineer breaks down why agentic AI systems fail, focusing on infinite loops, planning errors, and unsafe tool use, and offers mitigation strategies.
MoE LLMs Confront Real-World Hardware Noise
Hardware noise in compute-in-memory (CIM) systems degrades MoE LLM performance. ROMER, a new calibration framework, significantly improves accuracy by restoring load balance and stabilizing routing.
Auditing LLM Agent Skill Integrity
A new framework, Behavioral Integrity Verification (BIV), reveals 80% of LLM agent skills have implementation gaps, primarily due to oversight, and achieves 0.946 F1 for malicious skill detection.
Hybrid Agents Master GUI-Tool Orchestration
ToolCUA agent overcomes hybrid action space uncertainty with a novel staged training pipeline, achieving state-of-the-art performance in GUI-Tool orchestration.
Beyond RGB: Grounding Vision-Language on Raw Sensor Data
PRISM-VL advances vision-language models by grounding them in raw camera measurements, not just RGB, significantly improving performance on challenging visual tasks.
AlphaGRPO: Reasoning-Enhanced Multimodal Generation
The AlphaGRPO framework enhances multimodal generation via GRPO and DVReward, enabling reasoning and self-correction without a cold start, validated across benchmarks.
KV-Fold: Unlocking Transformer Long Context
KV-Fold enables training-free, stable long-context inference up to 128K tokens with 100% retrieval accuracy, overcoming prior limitations.

mimalloc: Microsoft's Speed Boost for Apps
Microsoft's mimalloc memory allocator offers a high-performance, scalable solution for demanding modern applications, boasting significant speedups and widespread adoption.

Microsoft's GridSFM: AI for the Power Grid
Microsoft's new GridSFM AI model drastically speeds up power grid analysis, promising efficiency gains and cost savings.
LLM Drift: A Structural Blind Spot
LLMs suffer from structural temporal drift, rendering them confidently outdated. A new geometric probe detects this, outperforming standard methods.
LLM Agents Revolutionize MIP Research
LLM agents are autonomously navigating the mixed-integer programming (MIP) research loop, generating, verifying, and discovering novel solver plugins and propagation strategies.
Causal Verification for Reliable Tool Use
CIVeX, a causal intervention verifier, ensures reliable tool use by focusing on intervention identifiability, not just action validity, achieving zero false executions in adversarial settings.
Shepherd: Meta-Agent Control Reinvented
Shepherd revolutionizes meta-agent control with a functional programming model, offering >5x faster forking and >95% cache reuse for efficient AI system management.
DataMaster: Autonomous Data Engineering
DataMaster pioneers autonomous data engineering, unlocking significant ML gains by optimizing data pipelines rather than algorithms, as shown on MLE-Bench Lite and PostTrainBench.
Beyond Benchmarks: A New Intelligence Metric
A new Generalized Turing Test framework formalizes intelligence via indistinguishability, offering a dataset-agnostic and empirically validated hierarchy of AI capabilities.

Microsoft's MatterSim Accelerates Material Discovery
Microsoft's MatterSim AI platform gains experimental validation, runs faster simulations, and introduces a powerful multi-task model for advanced material discovery.

AI Agents Flunk Social Reasoning Test
Microsoft's SocialReasoning-Bench reveals AI agents struggle to negotiate effectively in users' best interests, prioritizing task completion over optimal outcomes.

Sally-Ann Delucia on AI Agent Context Management
Sally-Ann Delucia of Arize discusses the challenges and strategies for context management in AI agents, highlighting the importance of memory and sub-agents.
Gosset AI: Drug Discovery Precision Leap
The Gosset AI platform outperforms frontier LLMs in niche drug discovery by 3.2x, demonstrating the power of curated data over generic web search for R&D.
LLMs Slash Neural Architecture Search Costs
Delta-Code Generation uses LLMs to produce compact architecture refinements, dramatically cutting costs and improving NAS efficiency.
Securing AI Agents: A New Red Teaming Frontier
A new AI red teaming platform, DTap, and its autonomous agent DTap-Red are introduced to systematically evaluate and secure AI agents across diverse real-world domains.
UniPool: Rethinking MoE Efficiency
The UniPool MoE architecture redefines expert capacity, pooling resources globally and enabling sub-linear parameter growth for enhanced efficiency and performance.
AI Validates Physical Simulations
AI CFD Scientist introduces vision-based validation for computational fluid dynamics, achieving autonomous discovery and ensuring physical realism where prior AI agents failed.

Microsoft Builds Open Grid Model
Microsoft Research unveils an open-data pipeline creating realistic U.S. electric grid models for advanced analysis, bypassing critical infrastructure data restrictions.
ReasonSTL: Local LLMs for Formal Specs
ReasonSTL offers a privacy-preserving, low-cost alternative for generating Signal Temporal Logic (STL) specifications from natural language, using open-source LLMs and explicit reasoning.

Black Forest Labs: FLUX and the Future of Visual AI
Stephen Batifol of Black Forest Labs discusses FLUX, the company's visual AI model, and the future of generative AI with a focus on real-time generation and world models.

AI's Human Psychology Intersection Explored
MIT's Dr. Patty Moss discusses how AI development must consider human psychology, aiming to create systems that augment, not erode, our cognitive abilities.
Context-ReAct: Adaptive Memory for AI Agents
The Context-ReAct framework revolutionizes long-horizon search agents with adaptive memory management, dramatically improving efficiency and accuracy.
The Inescapable Long Sequence Model Trade-off
A new theoretical framework reveals an inescapable trade-off between efficiency, compactness, and recall in long sequence models.
First-Token Confidence as AI Hallucination Baseline
First-token confidence (phi_first) emerges as a highly efficient and effective method for AI hallucination detection, outperforming complex multi-sample approaches.
Automating Multi-Agent System Creation
A new framework automates the creation of multi-agent systems, significantly improving agent recall and system robustness through LLM-driven planning and a critique agent.
Coding Agents' Stealth Vulnerabilities Unmasked
A new benchmark, MOSAIC-Bench, reveals that production coding agents can be tricked into shipping exploitable code via sequenced, innocuous tasks, bypassing current safety reviews.
Atomic Fact-Checking Boosts AI Clinical Trust
Atomic fact-checking, linking AI claims to source guidelines, dramatically increases clinician trust compared to traditional explainability methods.

Cielara Code Outperforms Rivals
Cielara Code, from Causal Dynamics Lab, significantly improves AI coding agent performance by mapping production software, outperforming rivals in key benchmarks.

OpenAI Podcast: AI Needs New Supercomputer Networks
OpenAI researchers Mark Handley and Greg Steinkrecker discuss the need for new supercomputer networks to handle AI training, highlighting challenges with traditional protocols and the benefits of their MRC system.
JACTUS AI Unifies Compression and Adaptation
JACTUS AI unifies parameter compression and task adaptation, outperforming sequential methods with fewer retained parameters across vision and language tasks.

Physicist: GPT-4 Can Do 'Vibe Physics'
A top black hole physicist reveals how GPT-4 is capable of "vibe physics," solving complex theoretical problems that humans had not previously solved.