TL;DR#
Current computer vision models struggle with real-world usability when referring to specific people in images, because existing benchmarks focus on one-to-one referring and neglect expressions that match multiple individuals. Existing models also fail to capture attributes, spatial relations, interactions, and reasoning effectively.
To address these issues, the paper introduces HumanRef, a new dataset designed for human referring tasks, and RexSeek, a novel model designed for this task. The dataset includes 103,028 referring statements, with support for expressions that refer to multiple instances, and the model integrates a multimodal large language model (MLLM) with an object detection framework.
Key Takeaways#
Why does it matter?#
This paper is important because it addresses the limitations of existing human referring models by introducing a new benchmark and a novel MLLM. It opens new avenues for research in human-centric CV, encouraging the development of more robust and generalizable models.
Visual Insights#
🔼 Figure 1 showcases the ‘Referring to Any Person’ task, a new computer vision challenge. The task involves identifying every person in an image that matches a given natural language description. The figure displays a wide variety of example images and their corresponding descriptions, highlighting the complexity and diversity of the task. These descriptions range from simple attributes (e.g., ‘person in blue’) to complex spatial relationships (e.g., ‘5th person from the right’) and even include celebrity recognition (e.g., ‘Elon Musk’). The figure also introduces RexSeek, a novel model specifically designed to tackle this challenge, demonstrating its ability to effectively capture various attributes, spatial relationships, interactions, and reasoning involved in accurately identifying the described individuals.
read the caption
Figure 1: We introduce referring to any person, a task that requires detecting all individuals in an image which match a given natural language description, and a new model RexSeek designed for this task with strong perception and understanding capabilities that effectively captures attributes, spatial relations, interactions, reasoning, celebrity recognition, etc.
domain | sub-domains | examples
---|---|---
attribute | gender, age, race, profession, posture, appearance, clothing and accessories, action | male, female, white man, the police officer, person with a shocked expression, person wearing a mask, person standing
position | inner position (human to human), outer position (human to environment) | the second person from left to right, person at the right, person closest to the microphone, person sitting in the chair
interaction | inner interaction (human with human), outer interaction (human with environment) | two people holding hands, people locked in each other’s gaze, the person holding a gun, person holding the certificate in hand
reasoning | inner position reasoning, outer position reasoning, attribute reasoning | all the people to the right of the person closest to the glass, person wearing a lab coat but not putting their hand on the board
celebrity recognition | actor, character, athlete, entrepreneur, scientist, politician, singer | Brad Pitt, Bruce Wayne, Cristiano Ronaldo, Rihanna, Elon Musk, Albert Einstein, Donald Trump
rejection | attribute, position, interaction, reasoning | a man in red hat, three women in a circle
🔼 Table 1 provides a detailed breakdown of the annotation domains and sub-domains used in the HumanRef dataset. It shows how different aspects of human appearance, spatial relations, actions, and identities are categorized and annotated for more comprehensive understanding. These annotations are fundamental to the task of referring to any person, ensuring that the dataset accurately reflects the complexity of real-world scenarios.
read the caption
Table 1: The primary annotation domains and their corresponding sub-domains within HumanRef.
In-depth insights#
Refer Any Person#
Referring to Any Person is a pivotal task in computer vision, demanding the ability to identify and detect individuals based on natural language descriptions. This capability holds substantial practical value across diverse applications, including human-robot interaction and healthcare. This area of research aims to overcome the limitations of existing models that often struggle with real-world usability due to unclear task definitions and a lack of high-quality data. The task requires capturing the many ways in which humans can be referred to, including attributes, position, interaction, reasoning, and celebrity recognition. Developing robust models for referring to any person requires addressing challenges such as multi-instance referring, multi-instance discrimination, and rejection of non-existence. The focus on real-world scenarios necessitates models that can accurately identify multiple individuals and avoid hallucinating results when the referred person is not present in the image.
HumanRef Dataset#
The HumanRef dataset is introduced to address the limitations of existing datasets in real-world human referring scenarios, particularly the multi-instance referring where a single expression relates to multiple individuals. A key design choice is including diverse human contexts such as natural settings, industrial scenes, healthcare, and sports. Five key aspects define how humans are referred: attributes, position, interaction, reasoning, and celebrity recognition. Data acquisition involves filtering high-resolution images with at least four individuals to ensure multi-instance discrimination. The dataset’s annotation process includes manual labeling for attributes, position, interaction, and reasoning with automated pipelines for celebrity recognition and rejection. HumanRef aims to better reflect complex, real-world interactions and requires models to possess both robust perception and strong language comprehension.
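The filtering criterion described above (high-resolution images containing at least four people) is easy to mirror in code. The sketch below is a minimal illustration, not the authors' actual pipeline: the resolution threshold and the source of `person_boxes` (any off-the-shelf person detector) are assumptions.

```python
from PIL import Image

MIN_PERSONS = 4        # multi-instance discrimination needs several candidates
MIN_SHORT_SIDE = 800   # assumed resolution threshold; not specified in the paper

def keep_image(image_path: str, person_boxes: list) -> bool:
    """Return True if an image passes HumanRef-style filtering.

    `person_boxes` is assumed to be a list of (x0, y0, x1, y1) tuples
    produced by a generic person detector.
    """
    with Image.open(image_path) as img:
        width, height = img.size
    if min(width, height) < MIN_SHORT_SIDE:
        return False                      # drop low-resolution images
    return len(person_boxes) >= MIN_PERSONS  # keep only crowded-enough scenes
```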
RexSeek Design#
The RexSeek design emphasizes a robust perception ability and strong language comprehension. It integrates a person detector (DINO-X) for reliable individual detection and a multimodal LLM (Qwen2.5) for accurate interpretation of complex language. RexSeek formulates referring as a retrieval-based process, using vision encoders (CLIP, ConvNext) to extract image features. RoI features and positional embeddings capture object context, are combined with text tokens, and are fed into the LLM to identify the corresponding bounding box indices. This design enables RexSeek to excel at both human and general object referring, overcoming limitations of existing models.
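To make the retrieval-based formulation concrete, the sketch below mirrors the data flow described above: detector boxes become indexed object tokens, the language model is asked which indices match the expression, and the answer maps back to boxes. The function names (`detect_persons`, `llm_select_indices`) are placeholders standing in for DINO-X and Qwen2.5, not real APIs.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def detect_persons(image) -> List[Box]:
    """Placeholder for the person detector (DINO-X in the paper)."""
    raise NotImplementedError

def llm_select_indices(image, boxes: List[Box], expression: str) -> List[int]:
    """Placeholder for the multimodal LLM (Qwen2.5 in the paper).

    Conceptually, the LLM receives vision tokens for the image, one object
    token per detected box (RoI feature plus positional embedding), and the
    text tokens of the expression, and is trained to emit the indices of the
    matching boxes rather than raw coordinates.
    """
    raise NotImplementedError

def refer(image, expression: str) -> List[Box]:
    boxes = detect_persons(image)                            # perception
    indices = llm_select_indices(image, boxes, expression)   # retrieval
    return [boxes[i] for i in indices if 0 <= i < len(boxes)]
```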
Multi-Instance#
The concept of multi-instance is crucial for advancing computer vision, especially in tasks like referring expression comprehension. Existing datasets often assume a one-to-one correspondence, limiting their applicability in real-world scenarios where expressions can refer to multiple objects or people. Addressing this requires models to accurately identify and locate all relevant instances, not just a single one. This necessitates a shift in training data and evaluation metrics to accommodate multi-instance scenarios, pushing for more robust and practical vision systems. Failure to account for multi-instance referring leads to models with low recall and limited usability in complex, real-world environments.
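Scoring multi-instance predictions means matching each predicted box to at most one ground-truth box and computing recall and precision per expression. The sketch below assumes a standard 0.5 IoU threshold and greedy one-to-one matching; the paper's exact matching protocol may differ.

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def score_expression(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching at the given IoU threshold.

    Returns (recall, precision) for a single referring expression.
    """
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best_j, best_v = -1, thr
        for j, g in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v >= best_v:
                best_j, best_v = j, v
        if best_j >= 0:
            matched_gt.add(best_j)
            tp += 1
    recall = tp / len(gt_boxes) if gt_boxes else 1.0
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    return recall, precision
```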
Rejection Tests#
Rejection tests are crucial for assessing a model’s ability to abstain from making predictions when an object is absent. A high-performing model should not hallucinate objects. Current models struggle with this, often predicting bounding boxes even when the referred object doesn’t exist. Addressing hallucination necessitates careful dataset design and training strategies. Incorporating negative examples during training can significantly enhance rejection capabilities. Evaluation metrics must accurately quantify the model’s performance in rejection scenarios. In this context, the model must be able to identify if the person is not present and avoid hallucinating an output.
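One simple way to quantify rejection is sketched below, under the assumption that the rejection score is the fraction of non-existent referring expressions for which the model outputs no box; the paper's exact definition may include additional details.

```python
def rejection_score(predictions):
    """Fraction of rejection-set expressions the model correctly abstains on.

    `predictions` is a list of predicted box lists, one per rejection-set
    expression (i.e., expressions whose target person is absent from the
    image). A perfect model returns an empty list for every such expression.
    """
    if not predictions:
        return 0.0
    correct = sum(1 for boxes in predictions if len(boxes) == 0)
    return correct / len(predictions)

# Example: 3 of 4 rejection expressions correctly produced no boxes -> 0.75
print(rejection_score([[], [], [(10, 20, 50, 80)], []]))
```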
More visual insights#
More on figures
🔼 Figure 2 presents a comparison of the performance of three state-of-the-art models (Qwen2.5-VL, InternVL-2.5, and DeepSeek-VL2) on a human referring task. The models are shown to successfully identify single individuals in images, as evidenced by their good performance on the RefCOCO/+/g benchmark. However, when the task requires identifying multiple individuals within a single image based on a natural language description, these same models often fail. This failure is attributed to an insufficient number of bounding boxes produced by the models, indicating a difficulty in detecting all relevant individuals within the image.
read the caption
Figure 2: Visualization results of Qwen2.5-VL [3], InternVL-2.5 [14], and DeepSeek-VL2 [70] on the human referring task. Despite achieving strong results on referring benchmarks RefCOCO/+/g [50, 75], state-of-the-art models struggle when tasked with identifying multiple individuals as they output an insufficient number of bounding boxes.
🔼 This figure illustrates the manual annotation process used to create the HumanRef dataset. The process involves three main steps: (1) generating a structured property dictionary using a large language model (LLM) to list potential properties for each individual in an image; (2) assigning these properties to the corresponding individuals; and (3) using the LLM to translate these structured property assignments into natural language referring expressions. The figure visualizes these three steps, showing how properties are extracted, assigned, and converted into the final annotations.
read the caption
Figure 3: Overview of the manual annotation pipeline of the HumanRef dataset.
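The three annotation steps can be mirrored with simple LLM prompt templates. The sketch below is purely illustrative: the prompt wording, the `chat` helper, and the JSON layout are assumptions, not the authors' actual prompts or tooling.

```python
import json

def chat(prompt: str) -> str:
    """Placeholder for a call to any instruction-tuned LLM."""
    raise NotImplementedError

def build_property_dictionary(image_description: str) -> dict:
    """Step 1: ask the LLM to list candidate properties for each person."""
    prompt = (
        "List the visible properties (gender, clothing, posture, action, ...) "
        f"of each person in this scene as JSON: {image_description}"
    )
    return json.loads(chat(prompt))

def assign_properties(property_dict: dict, person_ids: list) -> dict:
    """Step 2: link properties to person ids (done manually in the paper)."""
    return {pid: property_dict.get(pid, {}) for pid in person_ids}

def to_referring_expressions(assignments: dict) -> list:
    """Step 3: convert structured assignments into natural-language expressions."""
    prompt = (
        "Rewrite each of these structured person descriptions as a natural "
        f"referring expression: {json.dumps(assignments)}"
    )
    return json.loads(chat(prompt))
```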
🔼 This figure visualizes the six subsets of the HumanRef benchmark dataset. Each subset focuses on a specific aspect of referring to people in images: attributes (describing characteristics like gender, age, clothing), position (spatial relationships between people and the environment), interaction (actions between people or with objects), reasoning (multi-step inferences to identify individuals), celebrity recognition (identifying famous people), and rejection (handling cases where a referred person isn’t present). Each subfigure shows example images and annotations illustrating the type of data and complexities within each subset.
read the caption
Figure 4: Visualization of the six subsets in the HumanRef Benchmark.
🔼 This figure presents two histograms visualizing the distribution of the number of individuals present in each image of the HumanRef dataset and the number of individuals referenced by each referring expression within the dataset. The first histogram shows how many images contain a certain number of people (e.g., the number of images with 1 person, 2 people, 3 people, and so on). The second histogram depicts the distribution of the number of individuals referenced within each referring expression in the dataset. This provides insights into the dataset’s complexity and challenges regarding the multi-instance nature of the referring task.
read the caption
Figure 5: Distribution of the number of individuals per image and the number of individuals referenced by each referring expression.
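The two distributions in Figure 5 can be reproduced from the annotation files with a few lines of counting. The record layout assumed below (each entry storing the persons in an image and the box indices targeted by each expression) is hypothetical.

```python
from collections import Counter

def humanref_distributions(records):
    """Histograms of persons-per-image and persons-per-expression.

    `records` is assumed to look like:
    [{"person_boxes": [...],
      "refs": [{"expr": str, "target_indices": [...]}, ...]}, ...]
    """
    persons_per_image = Counter(len(r["person_boxes"]) for r in records)
    persons_per_expr = Counter(
        len(ref["target_indices"]) for r in records for ref in r["refs"]
    )
    return persons_per_image, persons_per_expr
```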
🔼 RexSeek is a retrieval-based model that leverages a person detection model and a large language model (LLM). Instead of directly predicting bounding box coordinates for referring expressions, RexSeek transforms the referring task into a retrieval problem. The model first uses a person detector to identify individuals in an image, generating corresponding object tokens. These tokens, along with vision tokens from the image and text tokens from the referring expression, are input to the LLM. The LLM then outputs a sequence of object indices that directly correspond to the individuals matching the referring expression.
read the caption
Figure 6: Overview of the RexSeek model. RexSeek is a retrieval-based model built upon ChatRex [25]. By integrating a person detection model, RexSeek transforms the referring task from predicting box coordinates to retrieving the index of input boxes.
🔼 This figure presents a comparison of the performance of various multi-modal models on the HumanRef benchmark, specifically focusing on how their recall and precision change as the number of instances associated with each referring expression increases. The x-axis represents the number of instances (individuals) referenced by a single expression, categorized into ranges (1, 2-5, 6-10, >10). The y-axis displays the recall and precision scores (in percentage). The different lines represent different models, illustrating their performance across this range of complexities.
read the caption
Figure 7: Visualizing the trend of recall and precision variations across different models as the number of instances corresponding to each referring expression increases.
More on tables
🔼 Table 2 presents a comprehensive statistical overview of the HumanRef dataset, a crucial component of the research on referring to any person. It details the dataset’s size and composition, providing key metrics for understanding its scale and complexity. These metrics include: the total number of images in the dataset, the total count of referring expressions used to describe individuals within those images, the average length (word count) of each referring expression, and the average number of individuals referenced by a single expression. This information is essential for evaluating the size and scope of the dataset, gauging the complexity of the referring expressions, and comprehending the scale of the multi-instance referring challenges that the dataset addresses.
read the caption
Table 2: Main statistics of the HumanRef dataset, including the number of images, the number of referring expressions, the average word count per referring expression, and the average number of instances associated with each referring expression.
🔼 Table 3 compares the HumanRef benchmark dataset with the RefCOCO+/g datasets, focusing solely on statistics related to human references for a fair comparison. It highlights key differences in the number of images, referring expressions, average word count per expression, and average number of persons per image and per referring expression.
read the caption
Table 3: Comparison of the HumanRef Benchmark with RefCOCO/+/g. For a fair comparison, we present only the statistics related to human referring in RefCOCO/+/g.
🔼 This table presents a benchmark comparison of various multimodal models on the HumanRef dataset, a novel dataset designed for evaluating human referring tasks. The performance of each model is assessed across five sub-tasks: Attribute, Position, Interaction, Reasoning, and Celebrity. The evaluation metrics include Recall (R), Precision (P), Density F1 (DF1), and a rejection score. A simple baseline is included that simulates a person detection model without understanding the language description, simply using all detected person bounding boxes as predictions. Note that one model, Molmo-7B-D, uses a different evaluation metric (point-in-mask), which differs from the IoU-based metric used for other models. This difference in metric is noted in the caption. The results highlight the challenges of existing models in handling the complexities of multi-instance referring and the importance of appropriate dataset design and model training strategies.
read the caption
Table 4: Benchmarking multimodal models on HumanRef Benchmark. R, P, and DF1 represent Recall, Precision, and DensityF1, respectively. † A simple baseline that uses the bounding boxes of all persons in the image as results, simulating a person detection model that does not follow the referring description. ∗ Molmo-7B-D predicts point coordinates as output and uses a point-in-mask evaluation criterion.
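To see why the detection-only baseline marked with † can post high recall but poor precision, consider this toy calculation (an illustrative example, not a number from the paper):

```python
# Suppose an image contains 10 detected persons but the expression refers to
# only 2 of them, and the detector finds every person perfectly.
num_targets, num_predictions = 2, 10    # ground-truth instances vs. baseline output
true_positives = 2                      # both targets are among the detected persons
recall = true_positives / num_targets           # 1.0 -> recall looks excellent
precision = true_positives / num_predictions    # 0.2 -> precision collapses
print(f"recall={recall:.2f}, precision={precision:.2f}")
```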
🔼 This table details the four-stage training process for the RexSeek model. Each stage uses different datasets, training tasks, and sets of trainable model modules to progressively enhance the model’s capabilities. Stage 1 focuses on aligning visual and textual modalities. Stage 2 concentrates on visual perception tasks. Stage 3 improves the model’s general understanding abilities. Finally, Stage 4 fine-tunes the model on the HumanRef dataset for human referring.
read the caption
Table 5: Data, task, and trainable modules for each stage.
🔼 This table compares the rejection scores achieved by different sized models (indicated by the number after the model name) when trained both with and without rejection data. The rejection score reflects how often the model correctly identifies when the referenced person is not in the image, avoiding the generation of incorrect bounding boxes (hallucination). A higher rejection score indicates better performance in this crucial aspect of the referring task.
read the caption
Table 6: Rejection score comparison under different model scales with and without rejection data during training.
🔼 This table presents ablation study results on the impact of multi-stage training for the RexSeek model. Different versions of the RexSeek model, each pre-trained to varying degrees of completion (stage1, stage2, stage3), were fine-tuned on the HumanRef dataset. The results, using Recall (R), Precision (P), and DensityF1 (DF1), demonstrate how each pre-training stage affects the final model’s performance on the HumanRef benchmark. The base Large Language Model (LLM) used was Qwen2.5-3B.
read the caption
Table 7: Ablation experiments on multi-stage training by loading models from different training stages and fine-tuning them on the HumanRef dataset. We use Qwen2.5-3B as the base LLM.
🔼 This table presents the results of a zero-shot evaluation of the RexSeek model on the RefCOCO+/g benchmark. Instead of training RexSeek on RefCOCO+/g data, the researchers used a different, pre-trained object detector (DINOX) to locate objects within images. The bounding boxes identified by DINOX were then fed as input to RexSeek, which made its predictions without ever having seen the RefCOCO+/g data during training. This demonstrates RexSeek’s ability to generalize to unseen datasets and tasks.
read the caption
Table 8: Zero-shot evaluation of RexSeek on RefCOCO/+/g. We use the open-set detector DINOX to detect the subject object in the image and use the detected bounding box as input to RexSeek.
🔼 This table presents examples illustrating how spatial relationships between individuals are described in the HumanRef dataset. It showcases two categories of positional descriptions: inner position (relative to other people in the image) and outer position (relative to environmental elements). The examples demonstrate the variety and nuance in specifying locations using natural language.
read the caption
Table 9: Examples of Inner and Outer Position References.
🔼 This table presents examples of how human interactions are categorized in the HumanRef dataset. It shows examples of both ‘inner interactions’ (interactions between people) and ‘outer interactions’ (interactions between people and objects or the environment). Each example provides a natural language description of the interaction. This illustrates the variety of ways human interaction is represented within the dataset, crucial for accurate referring expression comprehension.
read the caption
Table 10: Examples of Inner and Outer Interaction References.
🔼 This table presents examples of referring expressions that require reasoning to resolve. The expressions demonstrate three types of reasoning: inner position reasoning (relating individuals to each other), outer position reasoning (relating individuals to environment landmarks), and attribute reasoning (combining attributes with negation). Each example highlights the multi-step inference process necessary to identify the correct individual(s).
read the caption
Table 11: Examples of Reasoning-Based Referring Expressions.
🔼 This table lists the names of celebrities included in the HumanRef dataset’s celebrity recognition subset, categorized into six sub-domains: Character, Singer, Actor, Athlete, Entrepreneur, and Politician. Each sub-domain contains a list of specific individuals representing a variety of well-known people across different fields. This detailed breakdown demonstrates the diversity of celebrity representation within the HumanRef dataset.
read the caption
Table 12: Names for each sub-domain of the celebrity recognition subset.