#Multimodal AI

50 articles with this tag

Violin: AI Translates Video Content
Technology

Together AI launches Violin, an open-source AI tool for video translation and interactive content analysis.

3 days ago
Beyond RGB: Grounding Vision-Language on Raw Sensor Data
AI Research

PRISM-VL advances vision-language models by grounding them in raw camera measurements, not just RGB, significantly improving performance on challenging visual tasks.

4 days ago
AlphaGRPO: Reasoning-Enhanced Multimodal Generation
AI Research

AlphaGRPO framework enhances multimodal generation via GRPO and DVReward, enabling reasoning and self-correction without cold-start, validated across benchmarks.

4 days ago
AI's Next Leap: Interaction Models
Technology

Thinking Machines Lab introduces 'interaction models' for AI, enabling real-time, multimodal collaboration that mirrors human conversation.

4 days ago
Architectural Interactivity, Linguistic Interpretability, and Molecular Synthesis: The Frontier of Native AI
Artificial Intelligence

Three organisations now define the frontier of native AI: Thinking Machines is rebuilding human-AI collaboration as a low-latency interaction model, the Effable movement wants interpretable safety frameworks like SafetyAnalyst, and Isomorphic Labs is converting AlphaFold into an end-to-end drug design engine. The common thread is moving from AI as a layer of abstraction toward AI as a fundamental component of human and biological systems.

5 days ago
Thinking Machines Lab Wants to Replace OpenAI Realtime With a Model That Listens While It Speaks
Artificial Intelligence

Mira Murati's lab published its first technical paper, arguing that real-time interactivity should be a native model capability rather than scaffolding bolted around turn-based language models — and it ships benchmarks where GPT Realtime-2 scores near zero.

5 days ago
AI Archives: Water Data Gets Searchable
Technology

Databricks uses multimodal AI to turn Sudan's scanned water archives into a searchable database for critical groundwater discovery.

5 days ago
Together AI Adds NVIDIA Nemotron 3
Technology

Together AI launches NVIDIA's Nemotron 3 Nano Omni, a unified multimodal AI model, to developers, simplifying agentic application creation.

19 days ago
Verifiable Reasoning in MLLMs
AI Research

The V-tableR1 framework enables verifiable, multi-step reasoning in MLLMs by grounding logic in visual data, achieving SOTA on tabular benchmarks.

24 days ago
Google DeepMind's Gemma 4 Models Shine at AI Engineer Europe
Artificial Intelligence

Google DeepMind's Omar Sanseviero shared insights into the Gemma 4 family of open AI models at AI Engineer Europe, highlighting their performance, on-device capabilities, and community adoption.

27 days ago
Beyond Black-Box: Structuring Humor AI Reasoning
AI Research

New IRS framework moves beyond black-box AI, structuring humor understanding via explicit incongruity-resolution reasoning for expert-level performance.

30 days ago
Anthropic's Claude Opus 4.7 Arrives, Sharper Than Ever
Artificial Intelligence

Anthropic unveils Claude Opus 4.7, boosting AI's coding prowess, multimodal input, and safety features for enterprise use.

about 1 month ago
Rubric-Driven DPO for Visual Tasks
AI Research

A new rDPO framework uses instance-specific rubrics to create high-quality preference data, dramatically improving multimodal AI evaluation and performance.

about 1 month ago
Bridging Vision Tools and LLMs with P2
AI Research

Perception Programs (P2) transforms raw vision tool outputs into structured summaries, dramatically enhancing MLLM reasoning without retraining.

about 1 month ago
Agentic Models Bypass Tool Reliance
AI Research

HDPO framework enables agentic multimodal models to drastically reduce tool use by decoupling accuracy and efficiency optimization, fostering self-reliance without performance loss.

about 1 month ago
Meta's Muse Spark: AI's Next Act?
Artificial Intelligence

Meta unveils Muse Spark, a new multimodal AI model targeting 'personal superintelligence' with advanced reasoning and agent capabilities.

about 1 month ago
IBM Master Inventor Explains Multimodal AI
Artificial Intelligence

IBM Master Inventor Martin Keen explains the evolution of multimodal AI, contrasting feature-level fusion with native multimodality and the importance of temporal reasoning for video.

about 1 month ago
Mistral AI's Vox-Trainer and Fine-Tuning
Artificial Intelligence

Mistral AI announces Vox-Trainer, a new multimodal AI model for voice cloning and speech generation, alongside new benchmarks for speech understanding.

about 2 months ago
PRIMO R1: Active Critics for Robotic Manipulation
AI Research

PRIMO R1 transforms video MLLMs into active critics for robotic manipulation via outcome-based RL, achieving SOTA on RoboFail and outperforming larger models.

2 months ago
Mistral Small 4 Unifies AI Capabilities
Artificial Intelligence

Mistral AI unveils Mistral Small 4, a unified model combining text, image, reasoning, and coding capabilities under an open-source license.

2 months ago
Code-Driven Reasoning for Precise Image Generation
AI Research

CoCo (Code-as-CoT) introduces executable code as a reasoning framework for text-to-image generation, achieving superior precision and control.

2 months ago
AI Learns Beyond Text
Artificial Intelligence

AI is moving beyond text, with multimodal pretraining enabling models to learn from images, audio, and video for richer comprehension.

2 months ago
Crab+ Unifies AV-LLMs, Reverses Negative Transfer
AI Research

Crab+ introduces a novel approach to Audio-Visual Large Language Models, overcoming negative transfer via explicit cooperation in data and model design.

2 months ago
Microsoft's Phi-4-reasoning-vision-15B compact AI model
AI Research

Microsoft Research's Phi-4-reasoning-vision-15B offers efficient multimodal AI, excelling in reasoning and vision tasks with less data and compute.

2 months ago
Google's Interactions API Evolves Gemini
Artificial Intelligence

Google's new Interactions API for Gemini models offers a unified interface for complex AI tasks, supporting multimodal inputs, agents, and tool integration.

2 months ago
AI Research

Multimodal LLMs: What's Lost in Translation?

New research reveals multimodal LLMs struggle to utilize non-textual data due to a 'mismatched decoder problem,' impacting their true understanding.

3 months ago
AI Research

Less Data, More Alignment: SOTAlign

Researchers introduce SOTAlign, a framework that achieves robust cross-modal alignment using significantly less paired data by leveraging unpaired samples.

3 months ago
Agentic Vision Gemini 3 Flash: Code Execution Solves Visual Hallucination
AI Research

Agentic Vision Gemini 3 Flash shifts multimodal AI from static image processing to an active, code-driven investigation, dramatically improving accuracy and verifiability.

4 months ago
Sparkli AI raises $5M to kill the EdTech chatbot for kids
Funding Round

Sparkli AI, founded by Google alums, raised a $5 million pre-seed round to develop a multimodal, simulation-based learning engine for children aged 5 to 12.

4 months ago
Argos Framework Delivers Grounded AI Reasoning
AI Research

Argos is an agentic verification framework that changes reinforcement learning by rewarding models only for grounded reasoning based on verifiable evidence.

4 months ago
Gemini API Data Ingestion Gets Production Ready
AI Research

Google has upgraded Gemini API data ingestion to support persistent storage via GCS registration and external signed URLs, boosting the inline limit to 100MB.

4 months ago
The AI Pet Startup That Claims to Translate Your Dog's Thoughts
Funding Round

5 months ago
Google Gemini 3 Redefines AI Reasoning and Efficiency
AI Research

5 months ago
Google AI Tips: A Year of Ubiquitous Intelligence
AI Research

5 months ago
T5Gemma 2 Multimodal Ushers In Efficient AI Future
AI Research

5 months ago
Tinker launches OpenAI API compatibility, challenging vendor lock-in.
AI Research

5 months ago
Gemini Google Translate Elevates Nuance
AI Research

5 months ago
Gemma 3n Powers Real-World Impact at the Edge
AI Research

5 months ago
AI Research

FACTS Benchmark Suite Elevates LLM Factuality Scrutiny

5 months ago
AI Precision Oncology Gets Scalable Boost from Microsoft AI
AI Research

5 months ago
Google's Gemini 3 Ushers In The Latest AI Era
AI Research

5 months ago
VoiceVision RAG: Beyond Text, Towards True Multimodal Document Intelligence
AI Video

5 months ago
Google TAU AI Partnership Expands Foundational AI Research
AI Research

6 months ago
Google Cloud's Nano Banana Transforms Text-to-Vision Capabilities
AI Video

6 months ago
Gemini 3 Unleashes a New Era of AI-Powered Creation
AI Video

6 months ago
Meta’s Segment Anything Model 3 masters text and video
AI Research

6 months ago
Gemini 3: Google's Ambitious Leap Towards Universal AI Integration
AI Video

6 months ago
Google Gemini 3 Elevates AI with Agentic Interfaces
AI Research

6 months ago
NotebookLM Deep Research Redefines AI Analysis
AI Research

6 months ago
Marble World Model Goes Public, Redefining 3D Generation
Artificial Intelligence

6 months ago