Precisely localizing interface elements within screenshots, especially small icons or elements in dense layouts, remains a significant hurdle in GUI grounding. Existing test-time zoom-in methods help, but they apply uniform cropping and do not adapt to the model's uncertainty on a per-instance basis.
Uncertainty-Driven Adaptive Zooming
The researchers behind UI-Zoomer propose a training-free framework that rethinks the zoom-in process. Instead of cropping uniformly, UI-Zoomer treats both the decision to zoom and the extent of the zoom as a problem of prediction uncertainty quantification. A confidence-aware gate fuses spatial consensus among multiple candidate predictions with token-level generation confidence, so that zoom-in is triggered selectively, only when the model's initial localization confidence is low. This avoids unnecessary computation and concentrates it on the instances that need it.
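To make the gating idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the use of mean pairwise IoU as the spatial consensus measure, the geometric-mean token probability as the confidence measure, the multiplicative fusion, and the 0.5 threshold are all assumptions chosen for illustration, as are the function names.

```python
from itertools import combinations
from math import exp
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: Sequence[Box]) -> float:
    """Assumed consensus measure: mean pairwise IoU over sampled candidates."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 1.0  # a single candidate trivially agrees with itself
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def token_confidence(token_logprobs: Sequence[Sequence[float]]) -> float:
    """Assumed confidence measure: geometric-mean token probability per
    candidate, averaged over candidates."""
    per_candidate = [exp(sum(lp) / len(lp)) for lp in token_logprobs if lp]
    return sum(per_candidate) / len(per_candidate) if per_candidate else 0.0


def should_zoom(
    boxes: Sequence[Box],
    token_logprobs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> bool:
    """Trigger zoom-in only when the fused confidence score is low."""
    score = spatial_consensus(boxes) * token_confidence(token_logprobs)
    return score < threshold
```

With this sketch, candidates that overlap closely and were generated with high token probability skip the zoom step, while scattered or low-probability candidates trigger it.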
Per-Instance Crop Sizing via Variance Decomposition
When zoom-in is triggered, an uncertainty-driven crop sizing module determines how far to zoom. The module decomposes prediction variance into two components: the positional spread across different samples and the bounding box extent within a single sample. Combining the two via the law of total variance yields a per-instance crop radius, so the zoom level adapts to the uncertainty observed for each element. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 show gains of up to +13.4%, +10.3%, and +4.2%, respectively, across diverse datasets and model architectures and without any additional training.
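The variance decomposition described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's exact formula: it assumes the target point is uniformly distributed inside each sampled box (so a box of width w contributes a within-sample variance of w²/12), and the scale factor `k`, the use of the larger of the two axis variances, and the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def total_variance(centers: Sequence[float], extents: Sequence[float]) -> float:
    """Law of total variance along one axis:
    Var(X) = E[Var(X | sample)] + Var(E[X | sample]).

    Assumption: within a sample, the target is uniform over the box,
    so Var(X | sample) = extent**2 / 12 and E[X | sample] = box center.
    """
    within = fmean([e * e / 12.0 for e in extents])  # box extent term
    between = pvariance(centers)                     # positional spread term
    return within + between


def crop_radius(boxes: Sequence[Box], k: float = 2.0) -> float:
    """Per-instance crop radius: k standard deviations along the more
    uncertain axis. The value of k is a hypothetical choice."""
    cx = [(x1 + x2) / 2.0 for x1, _, x2, _ in boxes]
    cy = [(y1 + y2) / 2.0 for _, y1, _, y2 in boxes]
    widths = [x2 - x1 for x1, _, x2, _ in boxes]
    heights = [y2 - y1 for _, y1, _, y2 in boxes]
    var_x = total_variance(cx, widths)
    var_y = total_variance(cy, heights)
    return k * sqrt(max(var_x, var_y))


# Tightly clustered samples give a small crop; scattered samples a larger one.
tight = [(0.0, 0.0, 12.0, 12.0)] * 3
scattered = [(0.0, 0.0, 12.0, 12.0), (100.0, 0.0, 112.0, 12.0)]
print(crop_radius(tight) < crop_radius(scattered))
```

In this sketch, agreement between samples leaves only the box-extent term, so the crop stays close to the element's own size, while disagreement between samples widens the crop to cover the region where the element might be.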