🏢 Peking University
Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain
4096 words · 20 mins
AI Generated
Computer Vision
3D Vision
🏢 Peking University
Zero-shot Event-Intensity Asymmetric Stereo (ZEST) uses visual prompting and monocular cues to achieve robust 3D perception without event-specific training, outperforming existing methods.
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
2783 words · 14 mins
AI Generated
Natural Language Processing
Large Language Models
🏢 Peking University
LLM-based agents are vulnerable to diverse backdoor attacks that manipulate their reasoning and outputs, highlighting the urgent need for targeted defenses.
VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions
2504 words · 12 mins
AI Applications
Robotics
🏢 Peking University
VLMimic: Vision-Language Models enable robots to master intricate actions using only a few human video demonstrations, surpassing existing methods by a significant margin.
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
1946 words · 10 mins
Image Generation
🏢 Peking University
Visual Autoregressive Modeling (VAR) revolutionizes image generation by using a coarse-to-fine “next-scale prediction”, outperforming diffusion models and exhibiting scaling laws similar to LLMs.
VideoTetris: Towards Compositional Text-to-Video Generation
2282 words · 11 mins
Multimodal Learning
Vision-Language Models
🏢 Peking University
VideoTetris: a novel framework enabling compositional text-to-video generation by precisely following complex textual semantics through spatio-temporal compositional diffusion, achieving impressive qu…
Video Diffusion Models are Training-free Motion Interpreter and Controller
2252 words · 11 mins
Computer Vision
Video Understanding
🏢 Peking University
Training-free video motion control achieved via novel Motion Feature (MOFT) extraction from existing video diffusion models, offering architecture-agnostic insights and high performance.
Unveiling the Tapestry of Consistency in Large Vision-Language Models
2665 words · 13 mins
Multimodal Learning
Vision-Language Models
🏢 Peking University
ConBench: Unveiling Inconsistency in Large Vision-Language Models
Unveiling Encoder-Free Vision-Language Models
2435 words · 12 mins
Multimodal Learning
Vision-Language Models
🏢 Peking University
EVE, a groundbreaking encoder-free vision-language model, rivals encoder-based counterparts using a fraction of the data and resources, demonstrating efficient, transparent training for pure decoder-o…
Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling
1911 words · 9 mins
AI Generated
AI Theory
Generalization
🏢 Peking University
This work systematically investigates the approximation properties of Transformer networks for sequence modeling, revealing the distinct roles of key components (self-attention, positional encoding, f…
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
2151 words · 11 mins
Computer Vision
Image Generation
🏢 Peking University
U-DiT: Revolutionizing diffusion transformers with a U-Net design and token downsampling for superior image generation and drastically reduced computation cost.
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
1784 words · 9 mins
Multimodal Learning
Vision-Language Models
🏢 Peking University
VL-SAM: Training-free open-ended object detection & segmentation using attention maps as prompts, surpassing previous methods on LVIS and CODA datasets.
Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning
1671 words · 8 mins
Multimodal Learning
Sentiment Analysis
🏢 Peking University
Hierarchical Representation Learning Framework (HRLF) significantly improves Multimodal Sentiment Analysis (MSA) accuracy by effectively addressing incomplete data through fine-grained representation …
To Learn or Not to Learn, That is the Question - A Feature-Task Dual Learning Model of Perceptual Learning
1867 words · 9 mins
Machine Learning
Transfer Learning
🏢 Peking University
A new dual-learning model resolves the paradox of perceptual learning, showing how task-based and feature-based learning interact to produce both specific and transferable improvements in sensory perc…
The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing
1551 words · 8 mins
AI Theory
Optimization
🏢 Peking University
Leveraging data heterogeneity, this study reveals that standard SGD implicitly learns invariant features across multiple environments, achieving robust generalization without explicit regularization.
Temporal Sentence Grounding with Relevance Feedback in Videos
2432 words · 12 mins
Natural Language Processing
Vision-Language Models
🏢 Peking University
RaTSG network tackles Temporal Sentence Grounding with Relevance Feedback (TSG-RF) by discerning query relevance at multiple granularities before selectively grounding segments.
Template-free Articulated Gaussian Splatting for Real-time Reposable Dynamic View Synthesis
2005 words · 10 mins
Computer Vision
3D Vision
🏢 Peking University
This research introduces a template-free articulated Gaussian splatting method for real-time dynamic view synthesis, automatically discovering object skeletons from videos to enable reposing.
Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks
1272 words · 6 mins
Machine Learning
Deep Learning
🏢 Peking University
Shortcut back-propagation and an evolutionary training framework conquer gradient vanishing in spiking neural networks, drastically improving training and achieving state-of-the-art accuracy.
StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences
2803 words · 14 mins
AI Generated
Computer Vision
Video Understanding
🏢 Peking University
StreamFlow accelerates video optical flow estimation by 44% via a streamlined in-batch multi-frame pipeline and innovative spatiotemporal modeling, achieving state-of-the-art results.
Statistical Efficiency of Distributional Temporal Difference Learning
295 words · 2 mins
Reinforcement Learning
🏢 Peking University
Researchers achieve minimax optimal sample complexity bounds for distributional temporal difference learning, enhancing reinforcement learning algorithm efficiency.
Spiking Transformer with Experts Mixture
2017 words · 10 mins
Computer Vision
Image Classification
🏢 Peking University
Spiking Experts Mixture Mechanism (SEMM) boosts Spiking Transformers by integrating Mixture-of-Experts for efficient, sparse conditional computation, achieving significant performance improvements on …