🏢 Peking University

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

27 March 2025·2979 words·14 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

This survey presents a methodology-centered taxonomy of LLM agent systems, linking design principles to emergent behaviors and identifying future research directions.

Training-free Diffusion Acceleration with Bottleneck Sampling

24 March 2025·3305 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

Bottleneck Sampling: Accelerate diffusion models without retraining by cleverly using low-resolution priors for efficient inference!

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

20 March 2025·2805 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

ActVLP: Enhancing VLMs through visual-linguistic guidance for superior action-based decision-making in interactive environments.

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

19 March 2025·3142 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

UPME: Peer review for MLLMs, minus human bias!

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

18 March 2025·3052 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University

MagicComp: Dual-Phase Refinement Enables Training-Free Compositional Video Generation

WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

17 March 2025·1935 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University

WideRange4D: A new benchmark & reconstruction method for high-quality 4D scenes with wide-range movements, pushing the boundaries of 4D reconstruction.

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

17 March 2025·2181 words·11 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

BlobCtrl: Precisely edit images at the element level with a unified, flexible framework, bridging the gap between generation and editing.

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

16 March 2025·4598 words·22 mins

AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Peking University

Being-0: A humanoid robot agent achieves complex tasks by integrating a vision-language model with modular skills, enhancing efficiency and real-time performance.

Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

11 March 2025·2980 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Peking University

Uni$\textbf{F}^2$ace: a novel unified multimodal model (UMM) tailored for fine-grained face understanding and generation.

Open-World Skill Discovery from Unsegmented Demonstrations

11 March 2025·3148 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University

SBD: Self-supervised skill discovery from unsegmented videos!

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

10 March 2025·3702 words·18 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

WISE: Evaluates world knowledge in text-to-image generation.

TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

6 March 2025·570 words·3 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

TinyR1-32B-Preview: A novel branch-merge distillation approach that significantly enhances model accuracy and reduces computational costs for LLMs.

LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding

6 March 2025·2588 words·13 mins

AI Generated 🤗 Daily Papers AI Applications Software Engineering 🏢 Peking University

LONGCODEU: A new benchmark to challenge & enhance long code understanding in language models for software engineering!

SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning

27 February 2025·1993 words·10 mins

AI Generated 🤗 Daily Papers AI Applications Software Development 🏢 Peking University

SoRFT enhances LLMs for issue resolving via subtask-oriented reinforced fine-tuning, outperforming other open-source models.

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

27 February 2025·2523 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

DREAM ENGINE: Text-image interleaved control made easy, unifying text and visual cues for creative image generation.

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

17 February 2025·2102 words·10 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

HermesFlow seamlessly bridges the understanding-generation gap in MLLMs using a novel Pair-DPO framework and self-play optimization on homologous data, achieving significant performance improvements.

Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening

17 February 2025·2525 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

Diffusion-Sharpening enhances diffusion model fine-tuning by optimizing sampling trajectories, achieving faster convergence and high inference efficiency without extra NFEs, leading to improved alignm…

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

14 February 2025·4310 words·21 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.

Next Block Prediction: Video Generation via Semi-Autoregressive Modeling

11 February 2025·3939 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University

Next-Block Prediction (NBP) generates video with a semi-autoregressive model that predicts a whole block of video content at each step rather than one token at a time, resulting in significantly faster inference.

Magic 1-For-1: Generating One Minute Video Clips within One Minute

11 February 2025·1947 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

Magic141 generates one-minute video clips in under a minute by cleverly factorizing the generation task and employing optimization techniques.