🏢 Peking University
PaSa: An LLM Agent for Comprehensive Academic Paper Search
·4507 words·22 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Peking University
PaSa: An LLM agent autonomously performs comprehensive academic paper searches, outperforming existing methods by efficiently combining search tools, paper reading, and citation analysis, optimized vi…
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
·4541 words·22 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
·2127 words·10 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Dialogue Systems
🏢 Peking University
Friends-MMC: A new dataset facilitates multi-modal multi-party conversation understanding by providing 24,000+ utterances with video, audio, and speaker annotations, enabling advancements in character…
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
·2508 words·12 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Peking University
ROBUSTFT tackles noisy data in LLM fine-tuning by using multi-expert noise detection and context-enhanced relabeling, significantly boosting model performance in noisy scenarios.
Parallelized Autoregressive Visual Generation
·4274 words·21 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
Boosting autoregressive visual generation speed by 3.6-9.5x, this research introduces parallel processing while preserving model simplicity and generation quality.
Outcome-Refining Process Supervision for Code Generation
·2838 words·14 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Peking University
Boosting code generation accuracy, Outcome-Refining Process Supervision (ORPS) uses execution feedback and structured reasoning to refine code, achieving significant improvements across models and dat…
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
·3969 words·19 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Peking University
MOVIS enhances 3D scene generation by improving cross-view consistency in multi-object novel view synthesis.
BrushEdit: All-In-One Image Inpainting and Editing
·3281 words·16 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
BrushEdit revolutionizes interactive image editing with instructions & inpainting.
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
·3252 words·16 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
DisPose disentangles pose guidance for controllable human image animation, generating diverse animations while preserving appearance consistency using only sparse skeleton pose input, eliminating the …
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
·3918 words·19 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
DiffSensei: A new framework generates customized manga with dynamic multi-character control using multi-modal LLMs and diffusion models, outperforming existing methods.
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
·7357 words·35 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
HumanEdit: A new human-rewarded dataset revolutionizes instruction-based image editing by providing high-quality, diverse image pairs with detailed instructions, enabling precise model evaluation and …
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
·6193 words·30 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
AI Applications
Robotics
🏢 Peking University
Code-as-Monitor (CaM) uses vision-language models and constraint-aware visual programming to achieve both reactive and proactive robotic failure detection in real-time, improving success rates and red…
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
·4800 words·23 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Peking University
Imperfect OCR hinders Retrieval-Augmented Generation (RAG). OHRBench, a new benchmark, reveals this cascading impact, showing current OCR solutions insufficient for high-quality RAG knowledge bases. …
Open-Sora Plan: Open-Source Large Video Generation Model
·4618 words·22 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
Open-Sora Plan introduces an open-source large video generation model capable of producing high-resolution videos with long durations, based on various user inputs.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
·2978 words·14 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
VideoLLM’s interaction format is revolutionized by the novel Video-Text Duet, enabling real-time, time-sensitive video comprehension with significantly improved performance.
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
·4024 words·19 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Peking University
WF-VAE boosts video VAE performance with wavelet-driven energy flow and causal caching, enabling 2x higher throughput and 4x lower memory usage in latent video diffusion models.
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
·2775 words·14 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
ConsisID achieves high-quality, identity-preserving text-to-video generation using a tuning-free diffusion transformer model that leverages frequency decomposition for effective identity control.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
·2726 words·13 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
LLaVA-01: A novel visual language model achieves superior reasoning performance through structured, multi-stage processing and efficient inference-time scaling, surpassing even larger, closed-source m…
Large Language Models Can Self-Improve in Long-context Reasoning
·3316 words·16 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Peking University
LLMs can now self-improve long-context reasoning via SEALONG, a novel method leveraging multiple model outputs and minimum Bayes risk scoring to enable effective supervised fine-tuning or preference o…
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
·2630 words·13 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Peking University
GaussianAnything: Interactive point cloud latent diffusion enables high-quality, editable 3D models from images or text, overcoming existing 3D generation limitations.