LARE: Low-Attention Region Encoding for Text–Image Retrieval

Abdulmalik Alquwayfili,  Faisal Almeshal,  Jumanah Almajnouni,  Leena Alotaibi,  Huda Alamri,  Muhammad Kamran J. Khan
Figure 1. LARE pipeline: A single forward pass produces both a global image embedding and a spatial attention map. Inverting the attention map highlights under-attended regions, which are clustered into candidate crops and then re-encoded independently. A confidence gate determines whether regional evidence should be used to adjust the final retrieval score.
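The steps in the caption above (invert the attention map, cluster under-attended cells into crops, re-encode, gate) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the quantile threshold, connected-component clustering, gate threshold, and fusion weight are all assumed hyperparameters.

```python
import numpy as np

def low_attention_boxes(attn, quantile=0.25, min_area=4):
    """Invert a spatial attention map and cluster the least-attended cells
    into bounding boxes. `quantile` and `min_area` are illustrative values,
    not the paper's settings."""
    inv = attn.max() - attn                          # low attention -> high score
    mask = inv > np.quantile(inv, 1 - quantile)      # keep least-attended cells
    H, W = mask.shape
    boxes, seen = [], np.zeros_like(mask, dtype=bool)
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                # flood-fill one connected component of low-attention cells
                stack, cells = [(i, j)], []
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    cells.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(cells) >= min_area:
                    ys, xs = zip(*cells)
                    boxes.append((min(ys), min(xs), max(ys) + 1, max(xs) + 1))
    return boxes

def lare_score(text_emb, global_emb, region_embs, gate=0.5, alpha=0.3):
    """Fuse the global similarity with the best regional similarity behind a
    confidence gate. `gate` and `alpha` are assumed fusion parameters."""
    s_global = float(text_emb @ global_emb)
    if not region_embs:
        return s_global
    s_region = max(float(text_emb @ r) for r in region_embs)
    return s_global + alpha * s_region if s_region > gate else s_global
```

In practice each box would be mapped back to pixel coordinates, cropped, and passed through the same image encoder that produced the global embedding.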

Abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings.

To evaluate retrieval in crowded scenes, we introduce Dense-Set, a challenging subset derived from MS-COCO and Flickr30k in which images are re-captioned to provide richer descriptions of low-attention, previously overlooked regions. Experimental results demonstrate that LARE improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

Dense-Set Dataset

We introduce Dense-Set, a curated evaluation benchmark derived from MS-COCO and Flickr30k, designed to stress-test retrieval models in visually crowded scenes. Images are selected for high object density and the presence of rare single-instance object categories, then re-captioned using BLIP-2 to highlight these overlooked regions.

Dense Candidate Pool

Images are ranked by total detected object count (using a YOLO detector); the top 10% are retained as the high-density candidate pool.

Rare-Class Filtering

Images containing at least one single-instance object category are kept, focusing on visually subordinate objects.
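The two filtering steps above can be sketched with toy per-image detections. This is an illustrative reconstruction; the detector outputs, class names, and the exact tie-breaking are assumptions.

```python
from collections import Counter

def dense_candidates(detections, top_frac=0.10):
    """Keep the top `top_frac` of images ranked by detected-object count.
    `detections` maps image id -> list of detected class labels."""
    ranked = sorted(detections, key=lambda k: len(detections[k]), reverse=True)
    n = max(1, int(len(ranked) * top_frac))
    return ranked[:n]

def rare_class_filter(detections, image_ids):
    """Keep images containing at least one single-instance object category."""
    keep = []
    for img in image_ids:
        counts = Counter(detections[img])
        if any(c == 1 for c in counts.values()):
            keep.append(img)
    return keep
```

An image crowded with people but containing exactly one kite, for example, survives both filters: it is dense, and the kite is a single-instance (visually subordinate) object.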

Re-captioning

BLIP-2 generates new captions guided by class-aware prompts, shifting focus from global scene context to fine-grained details.
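A class-aware prompt of this kind can be assembled from the single-instance categories found in the previous step. The template below is purely hypothetical (the paper's actual prompt is not reproduced here); the resulting string would be fed to BLIP-2's conditioned-generation interface along with the image.

```python
def class_aware_prompt(rare_classes):
    """Build a hypothetical class-aware prompt that steers caption generation
    toward the listed single-instance categories. Template is an assumption,
    not the paper's exact wording."""
    targets = ", ".join(rare_classes)
    return ("Question: Describe this image, paying particular attention to "
            f"the {targets}. Answer:")
```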

Evaluation

Dense-Set exposes the salience bias of global embeddings that standard benchmarks fail to reveal.

Dense-Set comprises 3,089 images from MS-COCO and 2,477 images from Flickr30k.

Results

Zero-shot retrieval performance (Recall@1/5/10). LARE is applied on top of CLIP, SigLIP, and SigLIP 2 with no additional training or fine-tuning.

Model            | MS-COCO (R@1/R@5/R@10) | Flickr30k (R@1/R@5/R@10) | MS-COCO-Dense (R@1/R@5/R@10) | Flickr30k-Dense (R@1/R@5/R@10)
CLIP (L/14)      | 36.10 / 61.10 / 71.44  | 65.00 / 88.00 / 92.62    | 17.79 / 35.85 / 45.11        | 3.48 / 11.97 / 16.33
SigLIP (So/14)   | 54.24 / 76.78 / 84.21  | 82.94 / 96.08 / 98.00    | 26.61 / 46.31 / 55.22        | 5.05 / 15.50 / 20.96
SigLIP 2 (So/16) | 56.55 / 78.75 / 85.95  | 83.72 / 96.34 / 98.32    | 27.56 / 47.56 / 56.73        | 5.12 / 16.47 / 21.80
LARE (CLIP)      | 36.10 / 61.10 / 71.44  | 65.00 / 88.00 / 92.62    | 22.97 / 42.10 / 52.03        | 9.73 / 16.63 / 20.40
LARE (SigLIP)    | 54.26 / 76.80 / 84.24  | 82.94 / 96.12 / 98.00    | 29.94 / 50.17 / 59.26        | 12.33 / 19.87 / 24.10
LARE (SigLIP 2)  | 56.56 / 78.78 / 85.97  | 83.76 / 96.38 / 98.34    | 31.00 / 51.45 / 60.67        | 13.28 / 21.11 / 25.10

The LARE rows consistently improve Dense-Set performance while preserving standard benchmark scores.
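The Recall@K metric reported above is standard: a query counts as a hit if its ground-truth image appears in the top K ranked results. A minimal helper (my own, not the paper's evaluation code), assuming text i's ground-truth image is image i:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-image Recall@K from a [num_texts x num_images] similarity
    matrix, assuming the ground-truth image for text i is image i."""
    ranks = np.argsort(-sim, axis=1)            # best-matching image first
    gt = np.arange(sim.shape[0])[:, None]       # ground-truth index per text
    hit_pos = np.argmax(ranks == gt, axis=1)    # rank position of ground truth
    return {k: float(np.mean(hit_pos < k)) for k in ks}
```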

Figure 2. Qualitative comparison between the baseline encoder (SigLIP) and LARE on MS-COCO-Dense (Cols. 1–2) and Flickr30k-Dense (Cols. 3–4). Top-5 retrieval results are shown; ground-truth is highlighted. LARE improves ranking by leveraging fine-grained, localized cues missed by the baseline.

BibTeX

@inproceedings{alquwayfili2026lare,
  title={LARE: Low-Attention Region Encoding for Text--Image Retrieval},
  author={Abdulmalik Alquwayfili and Faisal Almeshal and Jumanah Almajnouni and Leena Alotaibi and Huda Alamri and Muhammad Kamran J Khan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2026}
}