Relational Visual Similarity


Relational Visual Similarity (arXiv 2025)
Thao Nguyen1, Sicheng Mo3, Krishna Kumar Singh2, Yilin Wang2, Jing Shi2, Nicholas Kolkin2, Eli Shechtman2, Yong Jae Lee1,2, ★, Yuheng Li1, ★
(★ Equal advising)
1- University of Wisconsin–Madison; 2- Adobe Research; 3- UCLA

TL;DR: We introduce a new visual similarity notion: relational visual similarity, which complements traditional attribute-based perceptual similarity (e.g., LPIPS, CLIP, DINO).

Abstract 📝: Humans do not just see attribute similarity---we also see relational similarity. An apple is like a peach because both are reddish fruits, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image–caption dataset in which the captions are anonymized---describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a vision–language model to measure the relational similarity between images. This model serves as a first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it---revealing a critical gap in visual computing.

🛠️ Quick Usage

This code has been tested on Python 3.10 with: (i) NVIDIA A100 80GB (torch 2.5.1+cu124) and (ii) NVIDIA RTX A6000 48GB (torch 2.9.1+cu128).
Other hardware setups haven't been tested, but the code should still work. Please install PyTorch and torchvision according to your machine configuration.

conda create -n relsim python=3.10
conda activate relsim
pip install relsim

# or you can clone the repo
git clone https://github.com/thaoshibe/relsim.git
cd relsim
pip install -r requirements.txt

Given two images, you can compute their relational visual similarity (relsim) like this:

from relsim.relsim_score import relsim
from PIL import Image

# Load model
model, preprocess = relsim(pretrained=True, checkpoint_dir="thaoshibe/relsim-qwenvl25-lora")

img1 = preprocess(Image.open("path/to/image1.jpg"))
img2 = preprocess(Image.open("path/to/image2.jpg"))
similarity = model(img1, img2)  # Returns similarity score (higher = more relationally similar)
print(f"relational similarity score: {similarity:.3f}")
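Beyond pairwise scoring, a natural use case is retrieving the candidates most relationally similar to a query image. The helper below is a minimal sketch of that ranking step; `score_fn` stands in for the pairwise `model(img1, img2)` call above, and all names here are illustrative, not part of the relsim API.

```python
# Rank candidate images by similarity to a query.
# `score_fn(query, candidate)` is any pairwise scorer, e.g. a small
# wrapper around the relsim model call shown above (hypothetical).

def rank_by_similarity(score_fn, query, candidates, top_k=None):
    """Return (index, score) pairs sorted by descending similarity."""
    scored = [(i, score_fn(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k] if top_k is not None else scored

if __name__ == "__main__":
    # Toy usage with a stub scorer on dummy scalar "features":
    stub = lambda q, c: -abs(q - c)  # closer values = higher score
    print(rank_by_similarity(stub, 5, [9, 4, 1], top_k=2))
    # → [(1, -1), (0, -4)]
```

In practice, `score_fn` would call the finetuned model on preprocessed image pairs; the sorting logic stays the same.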

For more details, training code, data, etc., please visit thaoshibe/relsim.

Citation: arxiv.org/abs/2512.07833

@misc{nguyen2025relationalvisualsimilarity,
      title={Relational Visual Similarity}, 
      author={Thao Nguyen and Sicheng Mo and Krishna Kumar Singh and Yilin Wang and Jing Shi and Nicholas Kolkin and Eli Shechtman and Yong Jae Lee and Yuheng Li},
      year={2025},
      eprint={2512.07833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.07833}, 
}