Our core idea is to treat different retrieval models for multimodal data, such as text and images, as a process analogous to multi-sensor fusion. When independent, heterogeneous modality-specific ...