🏢 Peking University

Almost Surely Safe Alignment of Large Language Models at Inference-Time
·2605 words·13 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
InferenceGuard ensures almost-sure safe LLM responses at inference time by framing safe generation as a constrained Markov Decision Process in the LLM’s latent space, achieving high safety rates witho…
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
·3227 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DiffSplat repurposes 2D image diffusion models to natively generate high-quality 3D Gaussian splats, overcoming limitations in existing 3D generation methods.
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
·1758 words·9 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
ARWKV: A novel RNN-attention-based language model, distilled from a larger model, achieves strong performance using significantly fewer resources, opening a new path in efficient language model develo…
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
·2900 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Researchers significantly enhanced autoregressive image generation by integrating chain-of-thought reasoning strategies, achieving a remarkable +24% improvement on the GenEval benchmark.
PaSa: An LLM Agent for Comprehensive Academic Paper Search
·4507 words·22 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
PaSa: An LLM agent autonomously performs comprehensive academic paper searches, outperforming existing methods by efficiently combining search tools, paper reading, and citation analysis, optimized vi…
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
·4541 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
·2127 words·10 mins
AI Generated 🤗 Daily Papers Natural Language Processing Dialogue Systems 🏢 Peking University
Friends-MMC: A new dataset facilitates multi-modal multi-party conversation understanding by providing 24,000+ utterances with video, audio, and speaker annotations, enabling advancements in character…
RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
·2508 words·12 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
RobustFT tackles noisy data in LLM fine-tuning by using multi-expert noise detection and context-enhanced relabeling, significantly boosting model performance in noisy scenarios.
Parallelized Autoregressive Visual Generation
·4274 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
This research introduces parallel processing into autoregressive visual generation, speeding it up by 3.6-9.5x while preserving model simplicity and generation quality.
Outcome-Refining Process Supervision for Code Generation
·2838 words·14 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
Boosting code generation accuracy, Outcome-Refining Process Supervision (ORPS) uses execution feedback and structured reasoning to refine code, achieving significant improvements across models and dat…
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
·3969 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
MOVIS enhances 3D scene generation by improving cross-view consistency in multi-object novel view synthesis.
BrushEdit: All-In-One Image Inpainting and Editing
·3281 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
BrushEdit revolutionizes interactive image editing with instructions & inpainting.
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
·3252 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DisPose disentangles pose guidance for controllable human image animation, generating diverse animations while preserving appearance consistency using only sparse skeleton pose input, eliminating the …
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
·3918 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DiffSensei: A new framework generates customized manga with dynamic multi-character control using multi-modal LLMs and diffusion models, outperforming existing methods.
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
·7357 words·35 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
HumanEdit: A new human-rewarded dataset revolutionizes instruction-based image editing by providing high-quality, diverse image pairs with detailed instructions, enabling precise model evaluation and …
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
·6193 words·30 mins
AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 Peking University
Code-as-Monitor (CaM) uses vision-language models and constraint-aware visual programming to achieve both reactive and proactive robotic failure detection in real-time, improving success rates and red…
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
·4800 words·23 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
Imperfect OCR hinders Retrieval-Augmented Generation (RAG). OHRBench, a new benchmark, reveals this cascading impact, showing current OCR solutions insufficient for high-quality RAG knowledge bases. …
Open-Sora Plan: Open-Source Large Video Generation Model
·4618 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Open-Sora Plan introduces an open-source large video generation model capable of producing high-resolution videos with long durations, based on various user inputs.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
·2978 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
The novel Video-Text Duet interaction format enables VideoLLMs to perform real-time, time-sensitive video comprehension with significantly improved performance.
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
·4024 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
WF-VAE boosts video VAE performance with wavelet-driven energy flow and causal caching, enabling 2x higher throughput and 4x lower memory usage in latent video diffusion models.