The prevailing paradigm in vision-language models (VLMs) for visual grounding and detection treats bounding box coordinates as a sequence of 1D tokens. This approach works, but it creates an inference bottleneck: coordinate tokens are decoded one after another and largely independently of each other, which ignores the geometric coherence of a bounding box. Researchers have introduced LocateAnything, a unified framework designed to overcome this limitation.
Parallel Box Decoding: Unlocking Geometric Coherence
LocateAnything rethinks the decoding process by introducing Parallel Box Decoding (PBD). Instead of serializing box coordinates into separate tokens, PBD treats geometric elements such as bounding boxes and points as atomic units, each decoded in a single step. Because all coordinates of a box are produced together, the coupled geometric structure of the box is preserved, and the researchers report substantial improvements in both decoding throughput and localization accuracy over strictly sequential generation.
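To make the throughput argument concrete, the following toy sketch only counts decoder steps under the two schemes. It is not the LocateAnything implementation: the function names are hypothetical, and it assumes a sequential baseline that spends exactly four coordinate tokens per box while ignoring any separator or label tokens.

```python
from typing import List, Tuple

# A box as four quantized coordinates (x1, y1, x2, y2). Illustrative only.
Box = Tuple[int, int, int, int]


def sequential_decode_steps(boxes: List[Box]) -> int:
    """Count decoder steps when every coordinate is its own token.

    Assumption: one autoregressive step per coordinate, no extra tokens.
    """
    steps = 0
    for box in boxes:
        for _coord in box:
            steps += 1  # one step per coordinate token
    return steps


def parallel_decode_steps(boxes: List[Box]) -> int:
    """Count decoder steps when each box is emitted as one atomic unit."""
    return len(boxes)  # one step per box, all four coordinates together


boxes = [(10, 20, 110, 220), (300, 40, 420, 180), (5, 5, 50, 60)]
seq_steps = sequential_decode_steps(boxes)
par_steps = parallel_decode_steps(boxes)
print(f"sequential: {seq_steps} steps, parallel: {par_steps} steps")
```

Under these assumptions the step count falls by a factor of four per box. Real speedups depend on the model and on how much of the output consists of coordinates, so this ratio should not be read as a measured result.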
Scalable Data Engine for High-Precision Localization
Alongside the architectural change, LocateAnything is supported by a scalable data engine that was used to curate LocateAnything-Data. This new dataset comprises over 138 million training samples and is intended to broaden data diversity for high-precision localization tasks. With Parallel Box Decoding and this dataset combined, the researchers report that LocateAnything advances the speed-accuracy frontier, showing higher decoding throughput and better high-IoU localization quality across diverse benchmarks.
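For readers unfamiliar with the metric, "high-IoU" quality refers to how well a predicted box overlaps the ground truth under strict intersection-over-union thresholds. The sketch below shows the standard IoU computation for axis-aligned boxes. The example boxes and the 0.75 and 0.9 thresholds are illustrative assumptions, not values taken from the LocateAnything evaluation.

```python
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 100.0, 100.0)
prediction = (10.0, 10.0, 100.0, 100.0)  # off by 10 px on two sides

score = iou(ground_truth, prediction)
print(f"IoU = {score:.2f}")
print("passes 0.75 threshold:", score >= 0.75)
print("passes 0.90 threshold:", score >= 0.90)
```

A prediction that misses two edges by only 10 pixels on a 100-pixel box clears a 0.75 threshold but fails at 0.9, which is why strict thresholds reward precise coordinate prediction.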