#Multimodal AI

50 articles with this tag

Mistral AI's Vox-Trainer and Fine-Tuning

Mistral AI announces Vox-Trainer, a new multimodal AI model for voice cloning and speech generation, alongside new benchmarks for speech understanding.

2 days ago

AI Research

3D Grounding for Vision-Language Models

Loc3R-VLM enhances 2D VLMs with 3D spatial reasoning from monocular video, achieving SOTA in language-based localization and 3D QA.

13 days ago

AI Research

PRIMO R1: Active Critics for Robotic Manipulation

PRIMO R1 transforms video MLLMs into active critics for robotic manipulation via outcome-based RL, achieving SOTA on RoboFail and outperforming larger models.

15 days ago

Artificial Intelligence

Mistral Small 4 Unifies AI Capabilities

Mistral AI unveils Mistral Small 4, a unified model combining text, image, reasoning, and coding capabilities under an open-source license.

15 days ago

AI Research

CoCo: Code Drives Precise Image Generation

CoCo leverages executable code for precise, structured text-to-image generation, outperforming existing methods on complex benchmarks.

22 days ago

AI Research

Code-Driven Reasoning for Precise Image Generation

CoCo (Code-as-CoT) introduces executable code as a reasoning framework for text-to-image generation, achieving superior precision and control.

22 days ago

Artificial Intelligence

AI Learns Beyond Text

AI is moving beyond text, with multimodal pretraining enabling models to learn from images, audio, and video for richer comprehension.

25 days ago

AI Research

Microsoft's Compact AI Learns to Reason

Microsoft's new Phi-4-reasoning-vision-15B model offers strong multimodal reasoning capabilities in a compact, efficient package.

26 days ago

AI Research

Crab+ Unifies AV-LLMs, Reverses Negative Transfer

Crab+ introduces a novel approach to Audio-Visual Large Language Models, overcoming negative transfer via explicit cooperation in data and model design.

27 days ago

AI Research

Microsoft's Phi-4-reasoning-vision-15B compact AI model

Microsoft Research's Phi-4-reasoning-vision-15B offers efficient multimodal AI, excelling in reasoning and vision tasks with less data and compute.

28 days ago

Artificial Intelligence

Google's Interactions API Evolves Gemini

Google's new Interactions API for Gemini models offers a unified interface for complex AI tasks, supporting multimodal inputs, agents, and tool integration.

29 days ago

AI Research

Multimodal LLMs: What's Lost in Translation?

New research reveals multimodal LLMs struggle to utilize non-textual data due to a 'mismatched decoder problem,' impacting their true understanding.

about 1 month ago

AI Research

Less Data, More Alignment: SOTAlign

Researchers introduce SOTAlign, a framework that achieves robust cross-modal alignment using significantly less paired data by leveraging unpaired samples.

about 1 month ago

AI Research

Agentic Vision Gemini 3 Flash: Code Execution Solves Visual Hallucination

Agentic Vision Gemini 3 Flash shifts multimodal AI from static image processing to active, code-driven investigation, improving accuracy and verifiability.

2 months ago

Funding Round

Sparkli AI raises $5M to kill the EdTech chatbot for kids

Sparkli AI, founded by Google alums, raised a $5 million pre-seed round to develop a multimodal, simulation-based learning engine for children aged 5 to 12.

2 months ago

AI Research

Argos Framework Delivers Grounded AI Reasoning

Argos is an agentic verification framework for reinforcement learning that rewards models only for grounded reasoning backed by verifiable evidence.

2 months ago

AI Research

Gemini API Data Ingestion Gets Production Ready

Google has upgraded Gemini API data ingestion to support persistent storage via GCS registration and external signed URLs, boosting the inline limit to 100MB.

3 months ago

Funding Round

The AI Pet Startup That Claims to Translate Your Dog's Thoughts

3 months ago

AI Research

Google Gemini 3 Redefines AI Reasoning and Efficiency

3 months ago

AI Research

Google AI Tips: A Year of Ubiquitous Intelligence

3 months ago

AI Research

T5Gemma 2 Multimodal Ushers In Efficient AI Future

3 months ago

AI Research

Tinker launches OpenAI API compatibility, challenging vendor lock-in

4 months ago

AI Research

Gemini Google Translate Elevates Nuance

4 months ago

AI Research

Gemma 3n Powers Real-World Impact at the Edge

4 months ago

AI Research

FACTS Benchmark Suite Elevates LLM Factuality Scrutiny

4 months ago

AI Research

AI Precision Oncology Gets Scalable Boost from Microsoft AI

4 months ago

AI Research

Google's Gemini 3 Ushers In The Latest AI Era

4 months ago

AI Video

VoiceVision RAG: Beyond Text, Towards True Multimodal Document Intelligence

4 months ago

AI Research

Google TAU AI Partnership Expands Foundational AI Research

4 months ago

AI Video

Google Cloud's Nano Banana Transforms Text-to-Vision Capabilities

4 months ago

AI Video

Gemini 3 Unleashes a New Era of AI-Powered Creation

4 months ago

AI Research

Meta’s Segment Anything Model 3 masters text and video

4 months ago

AI Video

Gemini 3: Google's Ambitious Leap Towards Universal AI Integration

4 months ago

AI Research

Google Gemini 3 Elevates AI with Agentic Interfaces

4 months ago

AI Research

NotebookLM Deep Research Redefines AI Analysis

5 months ago

Artificial Intelligence

Marble World Model Goes Public, Redefining 3D Generation

5 months ago

AI Research

MMCTAgent: Microsoft's Multimodal Reasoning Agent Tackles Long-Form Video

5 months ago

AI Video

Google's Nano Banana: The Human-Centric Evolution of Visual AI

5 months ago

AI Research

Emotive AI Redefines Customer Experience Dynamics

5 months ago

AI Research

Signify Elevates Support with Advanced Retrieval Augmented Generation

5 months ago

AI Research

OlmoEarth Redefines Earth Observation Foundation Models

5 months ago

Startup News

OpenAI's Patent Strategy: Why the AI Leader Has Far Fewer Patents Than You'd Expect

5 months ago

AI Research

Automotive AI: Redefining Vehicle Design Quietly

Artificial intelligence is fundamentally reshaping vehicle design, moving beyond the long-promised fully autonomous car to deliver immediate, tangible improvements in today's vehicles. This evolution, often subtle, is driven by a sophisticated blend of on-device intelligence...

5 months ago