🏢 Peking University

Large Language Model Agent: A Survey on Methodology, Applications and Challenges
·2979 words·14 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
This survey presents a methodology-centered taxonomy of LLM agent systems, linking design principles to emergent behaviors and identifying future research directions.
Training-free Diffusion Acceleration with Bottleneck Sampling
·3305 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Bottleneck Sampling: Accelerate diffusion models without retraining by cleverly using low-resolution priors for efficient inference!
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
·2805 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
ActVLP: Enhancing VLMs through visual-linguistic guidance for superior action-based decision-making in interactive environments.
MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation
·3052 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
MagicComp: Dual-Phase Refinement Enables Training-Free Compositional Video Generation
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
·1935 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
WideRange4D: A new benchmark & reconstruction method for high-quality 4D scenes with wide-range movements, pushing the boundaries of 4D reconstruction.
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
·2181 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
BlobCtrl: Precisely edit images at the element level with a unified, flexible framework, bridging the gap between generation and editing.
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
·4598 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Peking University
Being-0: A humanoid robot agent achieves complex tasks by integrating a vision-language model with modular skills, enhancing efficiency and real-time performance.
Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models
·2980 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Peking University
Uni$\textbf{F}^2$ace: a novel unified multimodal model (UMM) tailored for fine-grained face understanding and generation.
Open-World Skill Discovery from Unsegmented Demonstrations
·3148 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
SBD: Self-supervised skill discovery from unsegmented videos!
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
·3702 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
WISE: Evaluates world knowledge in text-to-image generation.
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
·570 words·3 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
TinyR1-32B-Preview: A novel branch-merge distillation approach that significantly enhances model accuracy and reduces computational costs for LLMs.
LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
·2588 words·13 mins
AI Generated 🤗 Daily Papers AI Applications Software Engineering 🏢 Peking University
LONGCODEU: A new benchmark to challenge & enhance long code understanding in language models for software engineering!
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning
·1993 words·10 mins
AI Generated 🤗 Daily Papers AI Applications Software Development 🏢 Peking University
SoRFT enhances LLMs for issue resolving via subtask-oriented reinforced fine-tuning, outperforming other open-source models.
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
·2523 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DREAM ENGINE: Text-image interleaved control made easy, unifying text and visual cues for creative image generation.
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
·2102 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
HermesFlow seamlessly bridges the understanding-generation gap in MLLMs using a novel Pair-DPO framework and self-play optimization on homologous data, achieving significant performance improvements.
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
·2525 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Diffusion-Sharpening enhances diffusion model fine-tuning by optimizing sampling trajectories, achieving faster convergence and high inference efficiency without extra NFEs, leading to improved alignm…
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
·4310 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
·3939 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
Next-Block Prediction (NBP) revolutionizes video generation by using a semi-autoregressive model that predicts blocks of video content simultaneously, resulting in significantly faster inference.
Magic 1-For-1: Generating One Minute Video Clips within One Minute
·1947 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Magic141 generates one-minute video clips in under a minute by cleverly factorizing the generation task and employing optimization techniques.
Almost Surely Safe Alignment of Large Language Models at Inference-Time
·2605 words·13 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
InferenceGuard ensures almost-sure safe LLM responses at inference time by framing safe generation as a constrained Markov Decision Process in the LLM’s latent space, achieving high safety rates witho…