🏢 Peking University

Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain

26 September 2024·4096 words·20 mins

AI Generated Computer Vision 3D Vision 🏢 Peking University

Zero-shot Event-Intensity Asymmetric Stereo (ZEST) uses visual prompting and monocular cues to achieve robust 3D perception without event-specific training, outperforming existing methods.

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

26 September 2024·2783 words·14 mins

AI Generated Natural Language Processing Large Language Models 🏢 Peking University

LLM-based agents are vulnerable to diverse backdoor attacks that manipulate their reasoning and outputs, highlighting the urgent need for targeted defenses.

VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions

26 September 2024·2504 words·12 mins

AI Applications Robotics 🏢 Peking University

VLMimic: Vision-Language Models enable robots to master intricate actions using only a few human video demonstrations, surpassing existing methods by a significant margin.

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

26 September 2024·1946 words·10 mins

Image Generation 🏢 Peking University

Visual Autoregressive Modeling (VAR) revolutionizes image generation by using a coarse-to-fine ‘next-scale prediction’, outperforming diffusion models and exhibiting scaling laws similar to LLMs.

VideoTetris: Towards Compositional Text-to-Video Generation

26 September 2024·2282 words·11 mins

Multimodal Learning Vision-Language Models 🏢 Peking University

VideoTetris: a novel framework enabling compositional text-to-video generation by precisely following complex textual semantics through spatio-temporal compositional diffusion, achieving impressive qu…

Video Diffusion Models are Training-free Motion Interpreter and Controller

26 September 2024·2252 words·11 mins

Computer Vision Video Understanding 🏢 Peking University

Training-free video motion control achieved via novel Motion Feature (MOFT) extraction from existing video diffusion models, offering architecture-agnostic insights and high performance.

Unveiling the Tapestry of Consistency in Large Vision-Language Models

26 September 2024·2665 words·13 mins

Multimodal Learning Vision-Language Models 🏢 Peking University

ConBench: Unveiling Inconsistency in Large Vision-Language Models

Unveiling Encoder-Free Vision-Language Models

26 September 2024·2435 words·12 mins

Multimodal Learning Vision-Language Models 🏢 Peking University

EVE, a groundbreaking encoder-free vision-language model, rivals encoder-based counterparts using a fraction of the data and resources, demonstrating efficient, transparent training for pure decoder-o…

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

26 September 2024·1911 words·9 mins

AI Generated AI Theory Generalization 🏢 Peking University

This work systematically investigates the approximation properties of Transformer networks for sequence modeling, revealing the distinct roles of key components (self-attention, positional encoding, f…

U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

26 September 2024·2151 words·11 mins

Computer Vision Image Generation 🏢 Peking University

U-DiT: Revolutionizing diffusion transformers with a U-Net design and token downsampling for superior image generation and drastically reduced computation cost.

Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts

26 September 2024·1784 words·9 mins

Multimodal Learning Vision-Language Models 🏢 Peking University

VL-SAM: Training-free open-ended object detection & segmentation using attention maps as prompts, surpassing previous methods on LVIS and CODA datasets.

Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

26 September 2024·1671 words·8 mins

Multimodal Learning Sentiment Analysis 🏢 Peking University

Hierarchical Representation Learning Framework (HRLF) significantly improves Multimodal Sentiment Analysis (MSA) accuracy by effectively addressing incomplete data through fine-grained representation …

To Learn or Not to Learn, That is the Question — A Feature-Task Dual Learning Model of Perceptual Learning

26 September 2024·1867 words·9 mins

Machine Learning Transfer Learning 🏢 Peking University

A new dual-learning model resolves the paradox of perceptual learning, showing how task-based and feature-based learning interact to produce both specific and transferable improvements in sensory perc…

The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing

26 September 2024·1551 words·8 mins

AI Theory Optimization 🏢 Peking University

Leveraging data heterogeneity, this study reveals that standard SGD implicitly learns invariant features across multiple environments, achieving robust generalization without explicit regularization.

Temporal Sentence Grounding with Relevance Feedback in Videos

26 September 2024·2432 words·12 mins

Natural Language Processing Vision-Language Models 🏢 Peking University

RaTSG network tackles Temporal Sentence Grounding with Relevance Feedback (TSG-RF) by discerning query relevance at multiple granularities before selectively grounding segments.

Template-free Articulated Gaussian Splatting for Real-time Reposable Dynamic View Synthesis

26 September 2024·2005 words·10 mins

Computer Vision 3D Vision 🏢 Peking University

This research introduces a template-free articulated Gaussian splatting method for real-time dynamic view synthesis, automatically discovering object skeletons from videos to enable reposing.

Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks

26 September 2024·1272 words·6 mins

Machine Learning Deep Learning 🏢 Peking University

Shortcut back-propagation and an evolutionary training framework conquer gradient vanishing in spiking neural networks, drastically improving training and achieving state-of-the-art accuracy.

StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences

26 September 2024·2803 words·14 mins

AI Generated Computer Vision Video Understanding 🏢 Peking University

StreamFlow accelerates video optical flow estimation by 44% via a streamlined in-batch multi-frame pipeline and innovative spatiotemporal modeling, achieving state-of-the-art results.

Statistical Efficiency of Distributional Temporal Difference Learning

26 September 2024·295 words·2 mins

Reinforcement Learning 🏢 Peking University

Researchers achieve minimax optimal sample complexity bounds for distributional temporal difference learning, enhancing reinforcement learning algorithm efficiency.

Spiking Transformer with Experts Mixture

26 September 2024·2017 words·10 mins

Computer Vision Image Classification 🏢 Peking University

Spiking Experts Mixture Mechanism (SEMM) boosts Spiking Transformers by integrating Mixture-of-Experts for efficient, sparse conditional computation, achieving significant performance improvements on …