#Multimodal AI

50 articles with this tag

Open Source AI Beats Proprietary on Cost, Quality
Technology

Open-source AI models like Kimi K2.7 Code are proving to be cost-effective and quality-competitive alternatives to proprietary AI, especially with multimodal inputs.

14 days ago
HYDRA-X: Unifying Image & Video Tokenization
AI Research

HYDRA-X, a novel Vision Transformer-based UMM, unifies image and video tokenization, enhancing editing consistency and performance through causal attention and latent-level manipulation.

18 days ago
Images as the New Reasoning Medium
AI Research

This paper introduces optical reasoning, enabling images to serve as the primary medium for LLM and MLLM reasoning, achieving higher token efficiency and competitive performance.

22 days ago
Gemini's Audio Stack: From Transcription to Music Generation
AI Research

Google DeepMind's Thor Schaeff explores Gemini's audio stack, from advanced transcription to music generation with Lyria 3.

23 days ago
Google's Gemma 4 12B: AI on Your Laptop
AI Research

Google's Gemma 4 12B model brings efficient, multimodal AI directly to laptops with a novel unified architecture.

23 days ago
Together AI Masters MiniMax M3 Inference
Technology

Together AI details engineering feats enabling efficient MiniMax M3 inference, unlocking 1M-token context and multimodality.

29 days ago
Symbolic Meta-Verification Boosts Multimodal AI
AI Research

New research on multimodal meta-verification shows symbolic rationales and decoupled RL significantly enhance AI verifier performance and enable agentic self-correction.

about 1 month ago
Omar Sanseviero on Google's AI Strategy
AI Research

Omar Sanseviero from Google DeepMind discusses Google's AI strategy, focusing on efficient models, multimodality, and open innovation in AI.

about 1 month ago
Violin: AI Translates Video Content
Technology

Together AI launches Violin, an open-source AI tool for video translation and interactive content analysis.

about 2 months ago
Beyond RGB: Grounding Vision-Language on Raw Sensor Data
AI Research

PRISM-VL advances vision-language models by grounding them in raw camera measurements, not just RGB, significantly improving performance on challenging visual tasks.

about 2 months ago
AlphaGRPO: Reasoning-Enhanced Multimodal Generation
AI Research

The AlphaGRPO framework enhances multimodal generation via GRPO and DVReward, enabling reasoning and self-correction without a cold start, as validated across benchmarks.

about 2 months ago
AI's Next Leap: Interaction Models
Technology

Thinking Machines Lab introduces 'interaction models' for AI, enabling real-time, multimodal collaboration that mirrors human conversation.

about 2 months ago
Architectural Interactivity, Linguistic Interpretability, and Molecular Synthesis: The Frontier of Native AI
Artificial Intelligence

Three organisations now define the frontier of native AI: Thinking Machines is rebuilding human-AI collaboration as a low-latency interaction model, the Effable movement wants interpretable safety frameworks like SafetyAnalyst, and Isomorphic Labs is converting AlphaFold into an end-to-end drug design engine. The common thread is moving from AI as a layer of abstraction toward AI as a fundamental component of human and biological systems.

about 2 months ago
Thinking Machines Lab Wants to Replace OpenAI Realtime With a Model That Listens While It Speaks
Artificial Intelligence

Mira Murati's lab published its first technical paper, arguing that real-time interactivity should be a native model capability rather than scaffolding bolted around turn-based language models, and it ships benchmarks where GPT Realtime-2 scores near zero.

about 2 months ago
AI Archives: Water Data Gets Searchable
Technology

Databricks uses multimodal AI to turn Sudan's scanned water archives into a searchable database for critical groundwater discovery.

about 2 months ago
Together AI Adds NVIDIA Nemotron 3
Technology

Together AI makes NVIDIA's Nemotron 3 Nano Omni, a unified multimodal AI model, available to developers, simplifying agentic application creation.

2 months ago
Verifiable Reasoning in MLLMs
AI Research

The V-tableR1 framework enables verifiable, multi-step reasoning in MLLMs by grounding logic in visual data, achieving SOTA on tabular benchmarks.

2 months ago
Google DeepMind's Gemma 4 Models Shine at AI Engineer Europe
Artificial Intelligence

Google DeepMind's Omar Sanseviero shared insights into the Gemma 4 family of open AI models at AI Engineer Europe, highlighting their performance, on-device capabilities, and community adoption.

2 months ago
Beyond Black-Box: Structuring Humor AI Reasoning
AI Research

A new IRS framework moves beyond black-box AI, structuring humor understanding through explicit incongruity-resolution reasoning to reach expert-level performance.

3 months ago
Anthropic's Claude Opus 4.7 Arrives, Sharper Than Ever
Artificial Intelligence

Anthropic unveils Claude Opus 4.7, boosting AI's coding prowess, multimodal input, and safety features for enterprise use.

3 months ago
Rubric-Driven DPO for Visual Tasks
AI Research

A new rDPO framework uses instance-specific rubrics to create high-quality preference data, dramatically improving multimodal AI evaluation and performance.

3 months ago
Bridging Vision Tools and LLMs with P2
AI Research

Perception Programs (P2) transforms raw vision tool outputs into structured summaries, dramatically enhancing MLLM reasoning without retraining.

3 months ago
Agentic Models Bypass Tool Reliance
AI Research

The HDPO framework enables agentic multimodal models to drastically reduce tool use by decoupling accuracy and efficiency optimization, fostering self-reliance without performance loss.

3 months ago
Meta's Muse Spark: AI's Next Act?
Artificial Intelligence

Meta unveils Muse Spark, a new multimodal AI model targeting 'personal superintelligence' with advanced reasoning and agent capabilities.

3 months ago
IBM Master Inventor Explains Multimodal AI
Artificial Intelligence

IBM Master Inventor Martin Keen explains the evolution of multimodal AI, contrasting feature-level fusion with native multimodality and the importance of temporal reasoning for video.

3 months ago
Mistral AI's Vox-Trainer and Fine-Tuning
Artificial Intelligence

Mistral AI announces Vox-Trainer, a new multimodal AI model for voice cloning and speech generation, alongside new benchmarks for speech understanding.

3 months ago
PRIMO R1: Active Critics for Robotic Manipulation
AI Research

PRIMO R1 transforms video MLLMs into active critics for robotic manipulation via outcome-based RL, achieving SOTA on RoboFail and outperforming larger models.

4 months ago
Mistral Small 4 Unifies AI Capabilities
Artificial Intelligence

Mistral AI unveils Mistral Small 4, a unified model combining text, image, reasoning, and coding capabilities under an open-source license.

4 months ago
Code-Driven Reasoning for Precise Image Generation
AI Research

CoCo (Code-as-CoT) introduces executable code as a reasoning framework for text-to-image generation, achieving superior precision and control.

4 months ago
AI Learns Beyond Text
Artificial Intelligence

AI is moving beyond text, with multimodal pretraining enabling models to learn from images, audio, and video for richer comprehension.

4 months ago
Crab+ Unifies AV-LLMs, Reverses Negative Transfer
AI Research

Crab+ introduces a novel approach to Audio-Visual Large Language Models, overcoming negative transfer via explicit cooperation in data and model design.

4 months ago
Microsoft's Phi-4-reasoning-vision-15B compact AI model
AI Research

Microsoft Research's Phi-4-reasoning-vision-15B offers efficient multimodal AI, excelling in reasoning and vision tasks with less data and compute.

4 months ago
Google's Interactions API Evolves Gemini
Artificial Intelligence

Google's new Interactions API for Gemini models offers a unified interface for complex AI tasks, supporting multimodal inputs, agents, and tool integration.

4 months ago
AI Research

Multimodal LLMs: What's Lost in Translation?

New research reveals multimodal LLMs struggle to utilize non-textual data due to a 'mismatched decoder problem,' impacting their true understanding.

4 months ago
AI Research

Less Data, More Alignment: SOTAlign

Researchers introduce SOTAlign, a framework that achieves robust cross-modal alignment using significantly less paired data by leveraging unpaired samples.

4 months ago
Agentic Vision Gemini 3 Flash: Code Execution Solves Visual Hallucination
AI Research

Agentic Vision Gemini 3 Flash shifts multimodal AI from static image processing to an active, code-driven investigation, dramatically improving accuracy and verifiability.

5 months ago
Sparkli AI raises $5M to kill the EdTech chatbot for kids
Funding Round

Sparkli AI, founded by Google alums, raised a $5 million pre-seed round to develop a multimodal, simulation-based learning engine for children aged 5 to 12.

5 months ago
Argos Framework Delivers Grounded AI Reasoning
AI Research

Argos is an agentic verification framework that fundamentally changes reinforcement learning by rewarding models only for reasoning grounded in verifiable evidence.

5 months ago
Gemini API Data Ingestion Gets Production Ready
AI Research

Google has upgraded Gemini API data ingestion to support persistent storage via GCS registration and external signed URLs, boosting the inline limit to 100MB.

6 months ago
The AI Pet Startup That Claims to Translate Your Dog's Thoughts
Funding Round

6 months ago
Google Gemini 3 Redefines AI Reasoning and Efficiency
AI Research

6 months ago
Google AI Tips: A Year of Ubiquitous Intelligence
AI Research

6 months ago
T5Gemma 2 Multimodal Ushers In Efficient AI Future
AI Research

7 months ago
Tinker launches OpenAI API compatibility, challenging vendor lock-in.
AI Research

7 months ago
Gemini Google Translate Elevates Nuance
AI Research

7 months ago
Gemma 3n Powers Real-World Impact at the Edge
AI Research

7 months ago
AI Research

FACTS Benchmark Suite Elevates LLM Factuality Scrutiny

7 months ago
AI Precision Oncology Gets Scalable Boost from Microsoft AI
AI Research

7 months ago
Google's Gemini 3 Ushers In The Latest AI Era
AI Research

7 months ago
VoiceVision RAG: Beyond Text, Towards True Multimodal Document Intelligence
AI Video

7 months ago