Cupid 🏹 is a benchmark for evaluating the capability of Large Language Models (LLMs) to infer and apply personalized, contextual preferences from multi-turn user interactions. Unlike existing approaches that assume static, global preferences, Cupid tests a model's ability to track dynamic, context-dependent preferences revealed through a user's conversational feedback.
Cupid contains 756 human-curated interaction session histories between simulated users and LLM-based AI assistants, available on HuggingFace.
We also release a larger, unverified version of our dataset, Cupid-Unverified, on HuggingFace.
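As an illustration, both versions can be loaded with the HuggingFace `datasets` library. The dataset IDs below are placeholders, not the actual paths; substitute the IDs listed on the HuggingFace dataset pages.

```python
# Minimal sketch: loading Cupid with the HuggingFace `datasets` library.
from datasets import load_dataset

# Placeholder IDs; replace with the actual dataset paths from HuggingFace.
cupid = load_dataset("your-org/cupid")
cupid_unverified = load_dataset("your-org/cupid-unverified")

print(cupid)  # inspect available splits and fields
```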
Our GitHub repository contains code for the synthesis pipeline, which can be used to generate additional training/evaluation data.
Cupid can be used to assess models on two tasks, sketched in code after the list:
1. Preference Inference: Given prior interactions, infer the user’s contextual preference for the current request.
2. Response Generation: Generate responses that satisfy the user’s contextual preferences.
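To make the two tasks concrete, the sketch below shows one way an evaluation harness might prompt a model for each task. The field names (`prior_sessions`, `current_request`) and the `query_model` helper are illustrative assumptions, not the benchmark's actual schema; see the evaluation scripts in our repository for the real interface.

```python
# Hypothetical sketch of the two Cupid tasks; field names and the
# `query_model` helper are assumptions, not the benchmark's actual schema.

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError

def infer_preference(example: dict) -> str:
    # Task 1: infer the user's contextual preference for the current request.
    prompt = (
        "Previous interaction sessions with this user:\n"
        f"{example['prior_sessions']}\n\n"
        f"Current request: {example['current_request']}\n"
        "What contextual preference should the assistant satisfy here?"
    )
    return query_model(prompt)

def generate_response(example: dict) -> str:
    # Task 2: generate a response that satisfies the contextual preference.
    prompt = (
        "Previous interaction sessions with this user:\n"
        f"{example['prior_sessions']}\n\n"
        "Respond to the current request, following the preferences "
        f"revealed above:\n{example['current_request']}"
    )
    return query_model(prompt)
```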
Evaluation scripts for both tasks are available in our GitHub repository.
```bibtex
@article{kim2025cupid,
  title   = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author  = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal = {arXiv preprint arXiv:2508.01674},
  year    = {2025}
}
```
This research was supported by the KAIST-NAVER Hypercreative AI Center.