#Computer Vision

50 articles with this tag

AI Research

LocateAnything: Parallel Decoding for Vision

LocateAnything revolutionizes vision-language models with Parallel Box Decoding, boosting speed and accuracy in visual grounding and detection.

about 22 hours ago
Tech

Uber Fights Bounding Box Errors

Uber Engineering uses machine learning to automatically detect and correct bounding box annotation errors in video data, boosting ML model training quality.

1 day ago
AI Research

AI Image Generation Reimagined: Channel-Wise Quantization

Channel-wise Vector Quantization (CVQ) redefines image tokenization, enabling autoregressive models like CAR to generate richer, more detailed images with state-of-the-art performance.

3 days ago
AI Research

SpatioRoute VLM: Dynamic Prompting for Video QA

SpatioRoute VLM revolutionizes zero-shot spatial video question answering with dynamic prompt routing, achieving SOTA without fine-tuning or 3D sensors.

10 days ago
AI Research

TaskGround: Bridging Scene Context and Action

TaskGround revolutionizes household AI by enabling compact models to interpret complex scenes, infer task structures, and act effectively, drastically improving performance and reducing costs.

10 days ago
Technology

AI Learns to See, Hear, and Understand

Multimodal AI analytics is enabling businesses to decode video, audio, and images, unlocking deeper insights from previously unstructured data.

10 days ago
AI Research

LMPath: Semantics Supercharge UAV Search

LMPath integrates language and vision models to create semantically-aware exploration priors for UAVs, dramatically improving search mission efficiency over traditional geometric methods.

15 days ago
AI Research

Beyond RGB: Grounding Vision-Language on Raw Sensor Data

PRISM-VL advances vision-language models by grounding them in raw camera measurements, not just RGB, significantly improving performance on challenging visual tasks.

16 days ago
Claude's Corner

Claude's Corner: GrazeMate — Three Clicks to Move a Thousand Cows

GrazeMate builds fully autonomous drone software that herds cattle across million-acre stations with three phone taps, using proprietary reinforcement learning trained on expert stockmanship to read and respond to real-time animal behavior. Founded by a 19-year-old Australian farmer, the company has $1.2M raised, 1.7 million acres under contract, and is expanding into California and Texas.

16 days ago
Artificial Intelligence

Codex AI Automates Complex Computer Tasks

Codex AI demonstrates advanced capabilities, automating complex tasks across applications by interacting with computer interfaces.

17 days ago
Claude's Corner

Claude's Corner: Librar Labs — The AI Librarian That's Really a Data-Catalog Trojan Horse

Librar Labs looks like another YC W2026 SaaS — AI-powered school library management — until you look at the team and the technical claim under the hood. OpenAI / Scale / Palantir alums plus quantum physicists plus a 'self-healing database for unstructured data' don't build a school librarian assistant unless the school librarian is the wedge.

17 days ago
Artificial Intelligence

MLX Genmedia: Prince Canuma on On-Device AI

Prince Canuma of MLX Genmedia discusses the power of on-device AI, showcasing how MLX enables efficient deployment of AI models on Apple Silicon devices for vision and audio tasks.

18 days ago
Artificial Intelligence

Transformers Conquer Computer Vision

Isaac Robinson from Roboflow explains how Transformers, once confined to NLP, have revolutionized computer vision, surpassing CNNs through massive pre-training and architectural innovation.

21 days ago
AI Research

Black Forest Labs: FLUX and the Future of Visual AI

Stephen Batifol of Black Forest Labs discusses FLUX, the company's visual AI model, and the future of generative AI with a focus on real-time generation and world models.

21 days ago
AI Research

PhyCo: Bridging Physics and Video Generation

PhyCo introduces a framework for physically consistent and controllable video generation, overcoming limitations of current diffusion models through physics-supervised fine-tuning and VLM-guided rewards.

28 days ago
AI Research

X-WAM: Bridging Action and 4D Synthesis

The X-WAM unified 4D world model revolutionizes robotics by integrating real-time action with high-fidelity 4D synthesis, achieving state-of-the-art benchmark results.

29 days ago
Funding Round

Mosaic SoC raises $3.8M for spatial intelligence chips

Mosaic SoC raises $3.8M for chips that bring real-time spatial intelligence to devices, enabling advanced perception with minimal power consumption.

29 days ago
AI Research

UniDoc-RL: Finer-Grained Visual RAG

UniDoc-RL enhances LVLMs with fine-grained visual RAG via hierarchical RL, active perception, and multi-reward training, achieving state-of-the-art results.

about 1 month ago
AI Research

RadAgent: Interpretable AI for Medical Imaging

RadAgent offers interpretable, agent-based CT report generation, significantly improving accuracy and robustness and adding faithfulness as a crucial property.

about 1 month ago
AI Research

Beyond Black-Box: Structuring Humor AI Reasoning

New IRS framework moves beyond black-box AI, structuring humor understanding via explicit incongruity-resolution reasoning for expert-level performance.

about 1 month ago
Artificial Intelligence

Anthropic Unveils Updated AI Model Opus 4.7

AI research company Anthropic has released an updated version of its AI model, Opus 4.7, boasting enhanced computer vision capabilities and a continued focus on safety.

about 1 month ago
AI Research

HiVLA: Decoupling Reasoning for Robotic Control

HiVLA decouples VLM reasoning from motor control using a hierarchical framework, enhancing robotic manipulation performance and preserving zero-shot capabilities.

about 1 month ago
AI Research

Adaptive Zooming for Precise GUI Grounding

UI-Zoomer revolutionizes GUI grounding with a training-free adaptive zoom-in approach, enhancing localization accuracy by intelligently quantifying and responding to prediction uncertainty.

about 1 month ago
Artificial Intelligence

Anthropic Unveils Opus 4.7: A Leap in AI Coding and Vision

Anthropic unveils its updated Opus 4.7 AI model, boasting enhanced coding and computer vision capabilities, with a key focus on cybersecurity.

about 1 month ago
AI Research

Bridging Vision Tools and LLMs with P2

Perception Programs (P2) transforms raw vision tool outputs into structured summaries, dramatically enhancing MLLM reasoning without retraining.

about 1 month ago
AI Research

Automating High-Quality Image Editing Data

A new pipeline, EditCaption, drastically improves VLM instruction synthesis for image editing, boosting Qwen3-VL performance and reducing critical errors.

about 2 months ago
AI Research

Instance-Aware VLP: Beyond Global Understanding

InstAP introduces instance-aware pre-training for VLP, enhancing instance-level reasoning and global understanding with the InstVL dataset.

about 2 months ago
AI Research

MoRight: Causal Control in Video Generation

MoRight revolutionizes video generation by enabling disentangled motion control and modeling motion causality for realistic, interactive scene dynamics.

about 2 months ago
Artificial Intelligence

IBM Master Inventor Explains Multimodal AI

IBM Master Inventor Martin Keen explains the evolution of multimodal AI, contrasting feature-level fusion with native multimodality and the importance of temporal reasoning for video.

about 2 months ago
AI Research

EdgeDiT: Transformers on the Edge

EdgeDiT brings high-fidelity generative AI to mobile devices by optimizing Diffusion Transformers for NPUs, achieving significant efficiency gains.

about 2 months ago
AI Research

Personalized Driving with Vega

The Vega vision-language-action model enhances autonomous driving by enabling personalized, instruction-based navigation through a novel dataset and hybrid AI architecture.

2 months ago
AI Research

Microsoft's AsgardBench Tests AI's Planning Skills

Microsoft's AsgardBench benchmark tests AI agents' ability to adapt plans using real-time visual feedback, revealing current limitations in perception and state tracking.

2 months ago
AI Research

Robots Get Better at Long-Term Planning

Microsoft's GroundedPlanBench and V2GP framework improve robot planning by jointly considering actions and locations, overcoming limitations of decoupled approaches.

2 months ago
AI Research

Medical VLMs Fail Critical Input Sanity Checks

The new MedObvious benchmark shows that medical VLMs fail critical input validation tests, highlighting a significant safety risk.

2 months ago
AI Research

UniMotion: Unifying Motion, Vision, and Language

UniMotion establishes a unified framework for continuous motion, vision, and text, overcoming discrete tokenization limits and achieving SOTA cross-modal performance.

2 months ago
AI Research

Bridging Dense Dynamics and Semantic Reasoning

A new VLM-guided JEPA latent world modeling framework fuses dense motion dynamics with semantic reasoning for robust long-horizon forecasting.

2 months ago
AI Research

Perceptio: Spatial Grounding for LVLMs

Perceptio LVLM integrates explicit spatial tokens (segmentation, depth) to overcome LVLM limitations in fine-grained visual grounding, achieving SOTA across benchmarks.

2 months ago
AI Research

3D Spatial Reasoning for VLM

Loc3R-VLM injects 3D spatial reasoning into 2D VLMs using monocular video, achieving SOTA in localization and 3D QA.

2 months ago
AI Research

VideoAtlas: Unlocking Long-Context Video AI

VideoAtlas AI offers a lossless, hierarchical grid representation and Video-RLM for scalable, robust long-context video understanding with logarithmic compute growth.

2 months ago
AI Research

V2M-Zero: Temporal Music Sync Without Paired Data

V2M-Zero revolutionizes video-to-music generation by using event curves to synchronize music with video over time without paired data, yielding significant performance gains.

3 months ago
AI Research

BEACON Navigates Occlusion Challenges

BEACON revolutionizes robot navigation by using Bird's-Eye View (BEV) affordance heatmaps to overcome occlusion challenges, achieving significant accuracy gains over image-space methods.

3 months ago
AI Research

RealWonder: Physics Bridges Video Generation

RealWonder leverages physics simulation to bridge the gap in action-conditioned video generation, enabling real-time simulation of physical interactions.

3 months ago
AI Research

ZipMap: Linear-Time 3D Vision

ZipMap revolutionizes 3D vision with linear-time reconstruction, achieving 20x speedup and enabling real-time state querying.

3 months ago
AI Research

Crab+ Unifies AV-LLMs, Reverses Negative Transfer

Crab+ introduces a novel approach to Audio-Visual Large Language Models, overcoming negative transfer via explicit cooperation in data and model design.

3 months ago
AI Research

Microsoft's Phi-4-reasoning-vision-15B compact AI model

Microsoft Research's Phi-4-reasoning-vision-15B offers efficient multimodal AI, excelling in reasoning and vision tasks with less data and compute.

3 months ago
AI Research

Certified Circuits for Stable AI Explanations

New 'Certified Circuits' framework provides provable stability for AI model explanations, yielding more accurate and compact circuits.

3 months ago
AI Research

Multimodal LLMs: What's Lost in Translation?

New research reveals multimodal LLMs struggle to utilize non-textual data due to a 'mismatched decoder problem,' impacting their true understanding.

3 months ago
AI Research

Less Data, More Alignment: SOTAlign

Researchers introduce SOTAlign, a framework that achieves robust cross-modal alignment using significantly less paired data by leveraging unpaired samples.

3 months ago
AI Research

SeeThrough3D: Mastering Occlusion in 3D Scenes

SeeThrough3D introduces an occlusion-aware 3D scene representation, enabling precise control over inter-object occlusions in AI-generated scenes.

3 months ago
Artificial Intelligence

Anthropic Buys Vercept for AI Digital Dexterity

Anthropic has acquired Vercept to advance Claude's computer use capabilities, enabling the AI to interact with live software applications for complex tasks.

3 months ago