🏹 Cupid

Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

Accepted to COLM 2025


Cupid 🏹 is a benchmark for evaluating the capability of Large Language Models (LLMs) to infer and apply personalized, contextual preferences from multi-turn user interactions. Unlike existing approaches that assume static global preferences, Cupid tests models’ ability to understand dynamic, context-dependent user preferences revealed through users’ conversational feedback.

Pipeline diagram showing CUPID's synthesis process to generate diverse interaction sessions


Dataset

Cupid contains 756 human-curated interaction session histories between simulated users and LLM-based AI assistants, available on HuggingFace.

We also release a larger, unverified version of our dataset, Cupid-Unverified, on HuggingFace.

Our Github Repository contains code for the synthesis pipeline, which can be used to generate additional training/evaluation data.

Example interaction session from CUPID


Evaluation Tasks

Cupid can be used to assess models on two tasks:

1. Preference Inference: Given prior interactions, infer the user’s contextual preference for the current request.

2. Response Generation: Generate responses that satisfy the user’s contextual preferences.

Evaluation scripts are available in our Github Repository.
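To make the Preference Inference task concrete, here is a minimal sketch of how one might assemble prior sessions and a current request into an inference prompt. The function name, session format, and prompt wording are illustrative assumptions, not CUPID's actual schema or evaluation code (see the Github Repository for the official scripts).

```python
# Hypothetical sketch of prompt construction for Preference Inference.
# Session representation and wording are assumptions, not CUPID's schema.

def build_inference_prompt(prior_sessions, current_request):
    """Combine prior interaction sessions and the current request into a
    single prompt asking a model to infer the user's contextual preference."""
    history = "\n\n".join(
        f"Session {i + 1}:\n{session}"
        for i, session in enumerate(prior_sessions)
    )
    return (
        "Below are a user's prior interaction sessions with an AI assistant.\n\n"
        f"{history}\n\n"
        f"Current request: {current_request}\n\n"
        "Based on the prior sessions, state the user's contextual preference "
        "for the current request."
    )

# Toy usage with made-up session summaries:
sessions = [
    "User asked for a summary; pushed back when it exceeded three sentences.",
    "User requested bullet points instead of paragraphs for meeting notes.",
]
prompt = build_inference_prompt(sessions, "Summarize this article for me.")
print(prompt)
```

The inferred preference from such a prompt could then feed the second task, Response Generation, as an explicit constraint on the model's answer.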


BibTeX

@article{kim2025cupid,
  title     = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author    = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal   = {arXiv preprint arXiv:2508.01674},
  year      = {2025}
}

This research was supported by the KAIST-NAVER Hypercreative AI Center.