🏢 Fudan University

When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO

21 March 2025·1831 words·9 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Fudan University

Adaptive Diffusion Models with Minority-Aware Adaptive DPO

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

20 March 2025·4169 words·20 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Fudan University

MagicMotion: A controllable video generation framework enabling precise object motion control through dense-to-sparse trajectory guidance.

From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

17 March 2025·5931 words·28 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Fudan University

ADR balances vision-language models by adaptively calibrating long-tail data, boosting LLaVA 1.5 by 4.36% without increasing training data volume.

World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

13 March 2025·3847 words·19 mins

AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Fudan University

D2PO: World modeling enhances embodied task planning by jointly optimizing state prediction and action selection, leading to more efficient execution.

DreamRelation: Relation-Centric Video Customization

10 March 2025·2731 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Fudan University

DreamRelation: Personalize videos by customizing relationships between subjects, generalizing to new domains.

Unified Reward Model for Multimodal Understanding and Generation

7 March 2025·368 words·2 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Fudan University

UNIFIEDREWARD: A unified reward model that enhances multimodal understanding and generation!

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

2 March 2025·2236 words·11 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Fudan University

DuoDecoding: Accelerating LLM inference by strategically deploying draft and target models on CPU and GPU for parallel decoding and dynamic drafting.

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

11 February 2025·3389 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Fudan University

VidCRAFT3 enables high-quality image-to-video generation with precise control over camera movement, object motion, and lighting, pushing the boundaries of visual content creation.

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

7 February 2025·3961 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Fudan University

VideoRoPE enhances video processing in Transformer models by introducing a novel 3D rotary position embedding that preserves spatio-temporal relationships, resulting in superior performance across var…

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

20 January 2025·4105 words·20 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Fudan University

Agent-R: A novel self-training framework enables language model agents to learn from errors by dynamically constructing training data that corrects erroneous actions, resulting in significantly improv…

Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback

7 January 2025·3489 words·17 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Fudan University

DOLPHIN: AI automates scientific research from idea generation to experimental validation.

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

6 December 2024·276 words·2 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Fudan University

LiFT leverages human feedback, including reasoning, to effectively align text-to-video models with human preferences, significantly improving video quality.

BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

31 October 2024·6027 words·29 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Fudan University

BitStack: Dynamic LLM sizing for variable memory!