🏢 Peking University

Almost Surely Safe Alignment of Large Language Models at Inference-Time

3 February 2025·2605 words·13 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

InferenceGuard ensures almost-sure safe LLM responses at inference time by framing safe generation as a constrained Markov Decision Process in the LLM’s latent space, achieving high safety rates witho…

DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation

28 January 2025·3227 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

DIFFSPLAT repurposes 2D image diffusion models to natively generate high-quality 3D Gaussian splats, overcoming limitations in existing 3D generation methods.

ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

26 January 2025·1758 words·9 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

ARWKV: A novel RNN-attention-based language model, distilled from a larger model, achieves strong performance using significantly fewer resources, opening a new path in efficient language model develo…

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

23 January 2025·2900 words·14 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

Researchers enhance autoregressive image generation by integrating chain-of-thought reasoning strategies, achieving a +24% improvement on the GenEval benchmark.

PaSa: An LLM Agent for Comprehensive Academic Paper Search

17 January 2025·4507 words·22 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

PaSa: An LLM agent autonomously performs comprehensive academic paper searches, outperforming existing methods by efficiently combining search tools, paper reading, and citation analysis, optimized vi…

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

7 January 2025·4541 words·22 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

23 December 2024·2127 words·10 mins

AI Generated 🤗 Daily Papers Natural Language Processing Dialogue Systems 🏢 Peking University

Friends-MMC: A new dataset facilitates multi-modal multi-party conversation understanding by providing 24,000+ utterances with video, audio, and speaker annotations, enabling advancements in character…

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

19 December 2024·2508 words·12 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

ROBUSTFT tackles noisy data in LLM fine-tuning by using multi-expert noise detection and context-enhanced relabeling, significantly boosting model performance in noisy scenarios.

Parallelized Autoregressive Visual Generation

19 December 2024·4274 words·21 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

Boosting autoregressive visual generation speed by 3.6-9.5x, this research introduces parallel processing while preserving model simplicity and generation quality.

Outcome-Refining Process Supervision for Code Generation

19 December 2024·2838 words·14 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

Boosting code generation accuracy, Outcome-Refining Process Supervision (ORPS) uses execution feedback and structured reasoning to refine code, achieving significant improvements across models and dat…

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

16 December 2024·3969 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University

MOVIS enhances 3D scene generation by improving cross-view consistency in multi-object novel view synthesis.

BrushEdit: All-In-One Image Inpainting and Editing

13 December 2024·3281 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

BrushEdit revolutionizes interactive image editing with instructions & inpainting.

DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

12 December 2024·3252 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

DisPose disentangles pose guidance for controllable human image animation, generating diverse animations while preserving appearance consistency using only sparse skeleton pose input, eliminating the …

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

10 December 2024·3918 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

DiffSensei: A new framework generates customized manga with dynamic multi-character control using multi-modal LLMs and diffusion models, outperforming existing methods.

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

5 December 2024·7357 words·35 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

HumanEdit: A new human-rewarded dataset revolutionizes instruction-based image editing by providing high-quality, diverse image pairs with detailed instructions, enabling precise model evaluation and …

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

5 December 2024·6193 words·30 mins

AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 Peking University

Code-as-Monitor (CaM) uses vision-language models and constraint-aware visual programming to achieve both reactive and proactive robotic failure detection in real-time, improving success rates and red…

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

3 December 2024·4800 words·23 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University

Imperfect OCR hinders Retrieval-Augmented Generation (RAG). OHRBench, a new benchmark, reveals this cascading impact, showing that current OCR solutions are insufficient for high-quality RAG knowledge bases. …

Open-Sora Plan: Open-Source Large Video Generation Model

28 November 2024·4618 words·22 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University

Open-Sora Plan introduces an open-source large video generation model capable of producing high-resolution videos with long durations, based on various user inputs.

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

27 November 2024·2978 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

The novel Video-Text Duet interaction format enables real-time, time-sensitive video comprehension in VideoLLMs with significantly improved performance.

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

26 November 2024·4024 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University

WF-VAE boosts video VAE performance with wavelet-driven energy flow and causal caching, enabling 2x higher throughput and 4x lower memory usage in latent video diffusion models.