
🏢 Peking University

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
·6193 words·30 mins
AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 Peking University
Code-as-Monitor (CaM) uses vision-language models and constraint-aware visual programming to achieve both reactive and proactive robotic failure detection in real time, improving success rates and red…
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
·4800 words·23 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
Imperfect OCR hinders Retrieval-Augmented Generation (RAG). OHRBench, a new benchmark, reveals this cascading impact, showing that current OCR solutions are insufficient for building high-quality RAG knowledge bases.
Open-Sora Plan: Open-Source Large Video Generation Model
·4618 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Open-Sora Plan introduces an open-source large video generation model capable of producing high-resolution videos with long durations, based on various user inputs.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
·2978 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
The novel Video-Text Duet interaction format enables VideoLLMs to perform real-time, time-sensitive video comprehension with significantly improved performance.
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
·4024 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
WF-VAE boosts video VAE performance with wavelet-driven energy flow and causal caching, enabling 2x higher throughput and 4x lower memory usage in latent video diffusion models.
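A minimal sketch of the single-level 2D Haar split behind the “wavelet-driven energy flow” idea: each frame is separated into a low-frequency average subband, which carries most of the energy, and three sparse detail subbands. This is only an illustration; WF-VAE’s own design uses a multi-level, causal 3D wavelet transform with caching that is not shown here.

```python
# Minimal sketch: single-level 2D Haar wavelet split of video frames into one
# low-frequency (LL) subband and three detail subbands. Illustrative only; the
# actual WF-VAE uses a multi-level, causal 3D wavelet transform with caching.
import torch

def haar_2d(frames: torch.Tensor):
    """frames: [..., H, W] with even H and W. Returns (ll, lh, hl, hh),
    each of shape [..., H/2, W/2]."""
    a = frames[..., 0::2, 0::2]   # top-left pixel of each 2x2 block
    b = frames[..., 0::2, 1::2]   # top-right
    c = frames[..., 1::2, 0::2]   # bottom-left
    d = frames[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2      # low-frequency average: most of the energy
    lh = (a + b - c - d) / 2      # top-vs-bottom row difference
    hl = (a - b + c - d) / 2      # left-vs-right column difference
    hh = (a - b - c + d) / 2      # diagonal difference
    return ll, lh, hl, hh

# Usage: route the energy-rich LL band through the VAE backbone, keep details sparse.
video = torch.randn(2, 3, 16, 128, 128)      # [batch, channels, frames, H, W]
ll, lh, hl, hh = haar_2d(video)
```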
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
·2775 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
ConsisID achieves high-quality, identity-preserving text-to-video generation using a tuning-free diffusion transformer model that leverages frequency decomposition for effective identity control.
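As a rough illustration of the frequency decomposition the summary refers to, the sketch below splits a face crop into a low-frequency component (coarse identity layout) and a high-frequency residual (fine detail) with an FFT low-pass filter. The cutoff value and the way the two signals would feed a video model are assumptions for illustration, not ConsisID’s exact recipe.

```python
# Minimal sketch of splitting a face crop into low- and high-frequency parts
# via an FFT low-pass filter; cutoff and usage are illustrative assumptions.
import torch

def frequency_split(face: torch.Tensor, cutoff: float = 0.1):
    """face: [C, H, W] tensor. Returns (low_freq, high_freq) spatial maps."""
    C, H, W = face.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(face), dim=(-2, -1))

    # Circular low-pass mask around the centre of the shifted spectrum.
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    lowpass = (dist <= cutoff * min(H, W)).to(spectrum.dtype)

    low = torch.fft.ifft2(torch.fft.ifftshift(spectrum * lowpass, dim=(-2, -1))).real
    high = face - low   # residual carries fine detail (edges, texture)
    return low, high
```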
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
·2726 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
LLaVA-o1: A novel vision language model achieves superior reasoning performance through structured, multi-stage processing and efficient inference-time scaling, surpassing even larger, closed-source m…
Large Language Models Can Self-Improve in Long-context Reasoning
·3316 words·16 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
LLMs can now self-improve in long-context reasoning via SEALONG, a novel method leveraging multiple model outputs and minimum Bayes risk scoring to enable effective supervised fine-tuning or preference o…
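A minimal sketch of the minimum-Bayes-risk idea mentioned above: sample several answers, score each by its average agreement with the others, and keep the top-ranked one as a supervision or preference target. The `token_overlap` similarity and all helper names are illustrative assumptions, not SEALONG’s actual implementation.

```python
# Minimal sketch of minimum-Bayes-risk (MBR) selection over sampled outputs,
# assuming some external sampler and a pairwise similarity metric.

def mbr_select(outputs, similarity):
    """Rank sampled outputs by average similarity to all other samples
    (a consensus / MBR-style utility), best first."""
    scores = []
    for i, cand in enumerate(outputs):
        others = [o for j, o in enumerate(outputs) if j != i]
        scores.append(sum(similarity(cand, o) for o in others) / max(len(others), 1))
    return [o for _, o in sorted(zip(scores, outputs), reverse=True)]

def token_overlap(a: str, b: str) -> float:
    """Crude stand-in similarity: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

# Usage: rank sampled reasoning traces, keep extremes as a preference pair.
samples = ["answer is 42 because ...", "the answer is 42 since ...", "answer: 7"]
ranked = mbr_select(samples, token_overlap)
positive, negative = ranked[0], ranked[-1]
```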
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
·2630 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
GaussianAnything: Interactive point cloud latent diffusion enables high-quality, editable 3D models from images or text, overcoming existing 3D generation limitations.
KMM: Key Frame Mask Mamba for Extended Motion Generation
·2527 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
KMM: Key Frame Mask Mamba generates extended, diverse human motion from text prompts by innovatively masking key frames in the Mamba architecture and using contrastive learning for improved text-motio…
Training-free Regional Prompting for Diffusion Transformers
·1817 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Training-free Regional Prompting for FLUX boosts compositional text-to-image generation by cleverly manipulating attention mechanisms, achieving fine-grained control without retraining.
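As a rough illustration of the attention manipulation described above, the sketch below builds a boolean attention mask that restricts each regional text prompt to the image patches inside its box, while image tokens still attend to each other globally. The token layout (text tokens first, then image tokens) and box format are assumptions for illustration, not the paper’s exact FLUX integration.

```python
# Minimal sketch of a regional attention mask for a diffusion transformer.
import torch

def regional_attention_mask(grid_h, grid_w, regions, tokens_per_prompt):
    """regions: list of (top, left, bottom, right) boxes in patch coordinates,
    one per regional prompt. Returns an [N, N] boolean mask, True = attend."""
    n_text = tokens_per_prompt * len(regions)
    n = n_text + grid_h * grid_w
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Image tokens always attend to other image tokens (keeps global coherence).
    mask[n_text:, n_text:] = True

    for r, (top, left, bottom, right) in enumerate(regions):
        # Indices of this region's text tokens and of image patches in its box.
        t0, t1 = r * tokens_per_prompt, (r + 1) * tokens_per_prompt
        ys, xs = torch.meshgrid(torch.arange(top, bottom),
                                torch.arange(left, right), indexing="ij")
        img_idx = n_text + (ys * grid_w + xs).flatten()

        mask[t0:t1, t0:t1] = True                                # within prompt
        mask[t0:t1, img_idx] = True                              # text -> region
        mask[img_idx.unsqueeze(1), torch.arange(t0, t1)] = True  # region -> text

    return mask
```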
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
·2197 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
DreamPolish: A new text-to-3D model generates highly detailed 3D objects with polished surfaces and realistic textures using progressive geometry refinement and a novel domain score distillation tech…
HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models
·2152 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
HelloMeme enhances text-to-image models by integrating spatial knitting attentions, enabling high-fidelity meme video generation while preserving model generalization.