To mitigate the perceptual bottleneck in VLMs, recent approaches often rely on external tools or explicit intermediate visual cues (e.g., generated masks, bounding boxes, or latent tokens) during ...