TL;DR#
Traditional recommendation systems use a multi-stage approach: retrieving candidate items and then ranking them. Generative retrieval-based recommendation systems (GRs) generate items directly, but have not yet matched the accuracy of multi-stage systems. This paper tackles the challenge of building a single end-to-end model that recommends items effectively.
The paper introduces OneRec, an end-to-end generative framework. OneRec replaces traditional cascaded learning with a unified generative model built on an encoder-decoder structure and an iterative preference alignment module. The method surpasses current complex recommendation systems, showing substantial improvements on a real-world short-video platform.
Key Takeaways#
Why does it matter?#
This paper introduces a novel generative recommendation framework, paving the way for more effective and personalized content delivery. It offers new insights for building end-to-end recommendation systems and preference learning, inspiring future work to explore generative models for various real-world applications and user experience enhancements.
Visual Insights#
🔼 Figure 1 illustrates two different recommendation system architectures. (a) shows the proposed OneRec model, which is a unified end-to-end architecture for generating recommendations. This model directly generates a list of recommended items, unlike traditional systems. (b) depicts a typical cascade ranking system, which uses a three-stage pipeline: Retrieval (identifying a large set of candidates), Pre-ranking (filtering the candidates to a smaller subset), and Ranking (ordering the remaining candidates). This figure visually contrasts the simplicity and directness of OneRec with the complexity of the traditional approach.
Figure 1. (a) Our proposed unified architecture for end-to-end generation. (b) A typical cascade ranking system, which includes three stages from the bottom to the top: Retrieval, Pre-ranking, and Ranking.
Watching-time metrics: swt, vtr. Interaction metrics: wtr, ltr. Each metric reports a mean and a max column.

| Model | swt mean | swt max | vtr mean | vtr max | wtr mean | wtr max | ltr mean | ltr max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Pointwise Discriminative Method* |  |  |  |  |  |  |  |  |
| SASRec | 0.0375±0.002 | 0.0803±0.005 | 0.4313±0.013 | 0.5801±0.013 | 0.00294±0.001 | 0.00978±0.001 | 0.0314±0.002 | 0.0604±0.004 |
| BERT4Rec | 0.0336±0.002 | 0.0706±0.004 | 0.4192±0.014 | 0.5633±0.013 | 0.00281±0.001 | 0.00932±0.001 | 0.0316±0.002 | 0.0606±0.004 |
| FDSA | 0.0325±0.002 | 0.0683±0.005 | 0.4145±0.015 | 0.5588±0.014 | 0.00271±0.001 | 0.00921±0.001 | 0.0313±0.002 | 0.0604±0.003 |
| *Pointwise Generative Method* |  |  |  |  |  |  |  |  |
| TIGER-0.1B | 0.0879±0.007 | 0.1286±0.010 | 0.5826±0.016 | 0.6625±0.017 | 0.00277±0.001 | 0.00671±0.001 | 0.0316±0.004 | 0.0541±0.007 |
| TIGER-1B | 0.0873±0.006 | 0.1368±0.010 | 0.5827±0.015 | 0.6776±0.015 | 0.00292±0.001 | 0.00758±0.001 | 0.0323±0.004 | 0.0579±0.008 |
| *Listwise Generative Method* |  |  |  |  |  |  |  |  |
| OneRec-0.1B | 0.0973±0.010 | 0.1501±0.015 | 0.6001±0.021 | 0.6981±0.021 | 0.00326±0.001 | 0.00870±0.001 | 0.0349±0.009 | 0.0642±0.015 |
| OneRec-1B | 0.0991±0.008 | 0.1529±0.012 | 0.6039±0.020 | 0.7013±0.020 | 0.00349±0.001 | 0.00919±0.002 | 0.0360±0.005 | 0.0660±0.008 |
| *Listwise Generative Method with Preference Alignment* |  |  |  |  |  |  |  |  |
| OneRec-1B<sub>DPO</sub> | 0.1014±0.010 | 0.1595±0.015 | 0.6127±0.017 | 0.7116±0.016 | 0.00339±0.001 | 0.00896±0.001 | 0.0351±0.004 | 0.0644±0.008 |
| OneRec-1B<sub>IPO</sub> | 0.0979±0.003 | 0.1528±0.005 | 0.6000±0.007 | 0.7012±0.007 | 0.00335±0.001 | 0.00905±0.001 | 0.0350±0.003 | 0.0654±0.004 |
| OneRec-1B<sub>cDPO</sub> | 0.0993±0.006 | 0.1547±0.008 | 0.6030±0.011 | 0.7030±0.009 | 0.00339±0.001 | 0.00901±0.001 | 0.0355±0.006 | 0.0652±0.009 |
| OneRec-1B<sub>rDPO</sub> | 0.1005±0.006 | 0.1555±0.008 | 0.6071±0.014 | 0.7059±0.011 | 0.00339±0.001 | 0.00899±0.001 | 0.0357±0.004 | 0.0657±0.006 |
| OneRec-1B<sub>CPO</sub> | 0.0993±0.008 | 0.1538±0.012 | 0.6045±0.021 | 0.7029±0.018 | 0.00343±0.001 | 0.00911±0.002 | 0.0357±0.008 | 0.0659±0.014 |
| OneRec-1B<sub>simPO</sub> | 0.0995±0.008 | 0.1536±0.013 | 0.6047±0.016 | 0.7022±0.015 | 0.00349±0.001 | 0.00918±0.001 | 0.0360±0.005 | 0.0659±0.008 |
| OneRec-1B<sub>S-DPO</sub> | 0.1021±0.008 | 0.1575±0.013 | 0.6096±0.016 | 0.7070±0.015 | 0.00345±0.001 | 0.00909±0.001 | 0.0361±0.004 | 0.0659±0.008 |
| **OneRec-1B<sub>IPA</sub>** | **0.1025±0.009** | **0.1933±0.017** | **0.6141±0.020** | **0.7646±0.021** | **0.00354±0.001** | **0.00992±0.001** | **0.0397±0.004** | **0.1203±0.010** |
🔼 This table presents a comparison of the offline performance of the proposed OneRec model against various baseline methods. These baselines are categorized into pointwise methods, listwise methods, and preference alignment methods, each representing different approaches to recommendation. The OneRec model is highlighted in green, pointwise methods in brown, listwise methods in blue, and preference alignment methods in yellow. The best performance for each metric is shown in bold, while near-optimal results are underlined. Metrics are evaluated according to whether a higher or lower value is preferable, indicated by arrows (↑ for higher is better, ↓ for lower is better). The metrics used assess both watching time (e.g., total watch time) and user interaction (e.g., likes, follows).
Table 1. Offline performance of our proposed OneRec (green) with pointwise methods (brown), listwise methods (blue) and preference alignment methods (yellow). Best results are in bold, sub-optimal results are underlined. Metrics with ↑ indicate higher is better, while ↓ indicates lower is better.
In-depth insights#
Generative Rec#
Generative recommendation (GR) systems represent a paradigm shift, moving away from traditional methods that rely on two-tower models and nearest-neighbor search. GR directly generates item identifiers in an autoregressive manner, leveraging semantic IDs to encode item information. This approach harnesses the power of sequence generation, enabling the model to produce more diverse and contextually relevant recommendations. However, GR models have so far trailed cascade ranking in accuracy. Closing this gap is crucial for realizing the full potential of GR in real-world recommendation scenarios, and requires innovation in model architecture, training strategies, and integration with existing ranking pipelines. Future research can explore novel techniques to improve the accuracy, diversity, and scalability of generative recommendation systems, potentially leading to a new generation of recommenders.
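To make the decoding step concrete, here is a minimal sketch of how a generative retriever might emit a hierarchical semantic ID one codebook level at a time. The 3-level codebook and the `model(user_history, prefix)` interface returning next-level logits are illustrative assumptions, not details taken from the paper.

```python
import torch

# Assumed interface: `model(user_history, prefix)` returns a logits
# tensor over the next codebook level, shape (codebook_size,).

def generate_semantic_id(model, user_history, num_levels=3):
    """Greedily decode one item's semantic ID, one codebook level at a time."""
    prefix = []
    for _ in range(num_levels):
        logits = model(user_history, prefix)
        prefix.append(torch.argmax(logits).item())  # most likely next code
    return tuple(prefix)  # e.g. (12, 407, 33) jointly identifies one item
```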
OneRec Model#
The OneRec model introduces a unified approach to recommendation, replacing traditional cascaded systems with a single generative framework. This shift aims to overcome the limitation that each stage's effectiveness caps the next. OneRec leverages an encoder-decoder structure, encoding user history to predict items of interest. A key innovation is the use of sparse Mixture-of-Experts (MoE) layers, which scale model capacity without a proportional increase in FLOPs. The model adopts a session-wise generation approach, predicting whole item lists for contextual coherence, in contrast to point-by-point methods. An Iterative Preference Alignment (IPA) module, built on Direct Preference Optimization (DPO), enhances the quality of generated recommendations. IPA tackles sparse user-item feedback by using a reward model to simulate user preference signals and customizing sampling for the online learning setting, aligning recommendations with user preferences efficiently.
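The MoE idea can be illustrated with a short sketch. The following is a generic top-k sparse Mixture-of-Experts feed-forward block in PyTorch, not the paper's exact layer; the dimensions, routing scheme, and class name are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Generic top-k sparse Mixture-of-Experts feed-forward block.

    Each token is routed to only k of num_experts experts, so parameter
    count grows with num_experts while per-token FLOPs stay near-constant.
    """

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize top-k gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out
```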
Iterative Align#
Iterative alignment, in the context of recommendation systems, represents a crucial strategy for refining model behavior to better reflect user preferences. This process involves repeatedly adjusting the model’s parameters based on feedback, aiming to minimize the discrepancy between predicted and desired outcomes. Such alignment often leverages techniques like reinforcement learning or preference optimization, where the model learns from user interactions or explicit feedback signals to iteratively improve its recommendations. The iterative nature allows the model to adapt to evolving user tastes and preferences, ensuring long-term relevance and satisfaction. By continuously refining its understanding of user needs, the system becomes more adept at delivering personalized and engaging content.
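Since the alignment module builds on Direct Preference Optimization, the standard DPO objective is worth writing out. This sketch implements the published DPO loss on precomputed sequence log-likelihoods; the function name and argument layout are mine, not the paper's.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on a batch of (chosen, rejected) pairs.

    Inputs are sequence log-likelihoods under the trainable policy and a
    frozen reference model; beta controls deviation from the reference.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Iterating this update with freshly sampled preference pairs is what makes the alignment adaptive rather than a one-shot fine-tune.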
Offline vs Online#
The distinction between offline and online methodologies is crucial in evaluating recommender systems. Offline evaluation allows for controlled experimentation and rapid iteration using historical data, enabling the assessment of various models and hyperparameter tuning. However, it often suffers from a disconnect from real-world user behavior, as it cannot capture the dynamic nature of user preferences and the impact of the recommendation system itself on user interactions. Online A/B testing, on the other hand, provides a more realistic assessment by deploying the system to a subset of real users and measuring its impact on key metrics such as click-through rate, conversion rate, and user engagement. While online testing offers higher fidelity, it is often more expensive and time-consuming, and may be subject to confounding factors such as seasonality and external events. Therefore, a balanced approach that combines offline and online evaluation is often the most effective strategy for developing and deploying successful recommender systems.
Scaling OneRec#
Based on the text, scaling the OneRec model leads to significant and consistent accuracy gains. The experiments reveal that expanding OneRec from 0.05B to 1B parameters demonstrably improves performance, showcasing the benefits of larger model capacity. OneRec-0.1B shows a maximum accuracy gain of 14.45% compared to the OneRec-0.05B model. Further scaling to 0.2B, 0.5B, and 1B continues to produce accuracy gains of 5.09%, 5.70%, and 5.69%, respectively. This suggests that the OneRec architecture effectively leverages increased model size, indicating a well-designed framework capable of capturing complex user preferences and item relationships. It is likely that with more parameters, the model becomes more adept at discerning subtle patterns and contextual nuances, leading to more relevant and accurate recommendations.
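If each reported gain is measured relative to the preceding model size (an assumption; the summary does not spell this out), the step-wise gains compound roughly as

$$(1 + 0.1445)(1 + 0.0509)(1 + 0.0570)(1 + 0.0569) - 1 \approx 0.344,$$

i.e., about a 34% cumulative improvement in maximum accuracy going from 0.05B to 1B parameters under that reading.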
More visual insights#
More on figures
🔼 This figure illustrates the two-stage training process of the OneRec model. The first stage focuses on training OneRec using session-wise data, which means that the model learns to generate relevant sequences of videos for each user session. The second stage employs an Iterative Preference Alignment (IPA) module which leverages iterative direct preference optimization using self-hard negatives to improve the quality of generated recommendations. Self-hard negatives are generated from the beam search results, ensuring high-quality preference pairs are used to refine the model’s preferences. This iterative refinement process aims to align the model’s generated recommendations more closely with actual user preferences.
Figure 2. The overall framework of OneRec consists of two stages: (i) the session training stage, which trains OneRec with session-wise data; (ii) the IPA stage, which utilizes iterative direct preference optimization with self-hard negatives.
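As a rough illustration of the self-hard-negative construction the caption describes, the sketch below takes the highest- and lowest-reward beam candidates as a preference pair. The `beam_search` and `reward_model.score` interfaces are hypothetical stand-ins, not the paper's API.

```python
def build_preference_pair(model, reward_model, user, beam_size=8):
    """Pick a self-hard preference pair from the model's own beam outputs."""
    candidates = model.beam_search(user, beam_size)       # candidate sessions
    ranked = sorted(candidates, key=reward_model.score, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]              # best vs. worst beam
    return chosen, rejected  # usable with the DPO loss sketched earlier
```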
🔼 This figure illustrates the online deployment architecture of the OneRec model. It shows how the trained model parameters are synchronized to both an online inference model and a DPO sample server. The online inference model serves user requests in real-time, while the DPO sample server provides preference data for model updates. The system also includes modules for log collection, preprocessing, and distributed training. The architecture is optimized for efficiency and stability, using techniques like key-value caching, float16 quantization, and beam search.
Figure 3. Framework of Online Deployment of OneRec.
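To connect the deployment description to code, here is a hedged sketch of beam search over 3-level semantic IDs as an online inference model might run it, using the same assumed `model` interface as the earlier decoding sketch. The real system additionally relies on key-value caching and float16 quantization, which are omitted here.

```python
import torch

@torch.no_grad()
def beam_search_ids(model, user_history, beam_size=4, num_levels=3):
    """Keep the beam_size best semantic-ID prefixes at each codebook level."""
    beams = [([], 0.0)]                                   # (prefix, log-prob)
    for _ in range(num_levels):
        expanded = []
        for prefix, score in beams:
            log_probs = torch.log_softmax(model(user_history, prefix), dim=-1)
            top = log_probs.topk(beam_size)
            for lp, code in zip(top.values.tolist(), top.indices.tolist()):
                expanded.append((prefix + [code], score + lp))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams  # top candidates with their cumulative log-probabilities
```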
🔼 This ablation study investigates the impact of varying the DPO (Direct Preference Optimization) sample ratio on model performance. The x-axis represents the DPO sample ratio, ranging from 1% to 5%. The y-axis displays the resulting performance metrics for different aspects of the recommendation system (e.g., session watch time, view probability, follow probability, like probability). The results demonstrate that increasing the sample ratio from 1% yields only marginal performance improvements, indicating a diminishing return. A 1% sample ratio is identified as the optimal balance between performance gain and computational efficiency. Beyond this point, the additional computational cost outweighs any minor performance increases.
Figure 4. The ablation study on the DPO sample ratio $r_{\rm DPO}$. The results indicate that a 1% ratio of DPO training leads to significant gains, but further increasing the sample ratio yields limited improvements.
🔼 Figure 5 presents a detailed visualization of the probability distributions generated by the softmax layer for each level of semantic IDs within the OneRec model. The probability distributions show how the model assigns probabilities to different semantic IDs at various stages of processing. Each plot displays the distribution for a particular layer, illustrating how the model’s confidence in certain semantic IDs evolves as it processes the data across different layers. The red star highlights the specific semantic ID that receives the highest reward value from the reward model, indicating the model’s top choice at that layer. This visualization effectively demonstrates the hierarchical refinement process within the model as it progresses towards a final prediction, providing insight into how the uncertainty and confidence of the model change as more context is considered.
Figure 5. The visualization of the probability distribution of the softmax output for each layer of the semantic ID. The red star represents the semantic ID of the item with the highest reward value.
🔼 Figure 6 demonstrates the impact of model size on OneRec’s performance. Multiple lines graph the performance against increasing model parameters (x-axis) for various metrics, including accuracy on different layers (Layer 1, Layer 2, Layer 3) and training loss. The results show a consistent positive correlation between model size and performance across all metrics, indicating OneRec effectively leverages increased model capacity to improve accuracy and reduce loss.
Figure 6. Scalability of OneRec on model scaling. The results show that OneRec consistently benefits from performance improvements as its parameters are scaled up.