Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain
·4096 words·20 mins·
AI Generated
Computer Vision
3D Vision
🏢 Peking University
Zero-shot Event-Intensity Asymmetric Stereo (ZEST) uses visual prompting and monocular cues to achieve robust 3D perception without event-specific training, outperforming existing methods.
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
·2783 words·14 mins·
AI Generated
Natural Language Processing
Large Language Models
🏢 Peking University
LLM-based agents are vulnerable to diverse backdoor attacks that manipulate their reasoning and outputs, highlighting the urgent need for targeted defenses.
VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions
·2504 words·12 mins·
AI Applications
Robotics
🏢 Peking University
VLMimic: Vision-Language Models enable robots to master intricate actions using only a few human video demonstrations, surpassing existing methods by a significant margin.
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
·1946 words·10 mins·
Image Generation
🏢 Peking University
Visual Autoregressive Modeling (VAR) revolutionizes image generation by using a coarse-to-fine ’next-scale prediction’, outperforming diffusion models and exhibiting scaling laws similar to LLMs.
VideoTetris: Towards Compositional Text-to-Video Generation
·2282 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VideoTetris: a novel framework enabling compositional text-to-video generation by precisely following complex textual semantics through spatio-temporal compositional diffusion, achieving impressive qu…
Video Diffusion Models are Training-free Motion Interpreter and Controller
·2252 words·11 mins·
Computer Vision
Video Understanding
🏢 Peking University
Training-free video motion control achieved via novel Motion Feature (MOFT) extraction from existing video diffusion models, offering architecture-agnostic insights and high performance.
Unveiling the Tapestry of Consistency in Large Vision-Language Models
·2665 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
ConBench: a benchmark for unveiling and evaluating answer inconsistency in large vision-language models.
Unveiling Encoder-Free Vision-Language Models
·2435 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
EVE, a groundbreaking encoder-free vision-language model, rivals encoder-based counterparts using a fraction of the data and resources, demonstrating efficient, transparent training for pure decoder-o…
Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling
·1911 words·9 mins·
AI Generated
AI Theory
Generalization
🏢 Peking University
This work systematically investigates the approximation properties of Transformer networks for sequence modeling, revealing the distinct roles of key components (self-attention, positional encoding, f…
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
·2151 words·11 mins·
Computer Vision
Image Generation
🏢 Peking University
U-DiT: Revolutionizing diffusion transformers with a U-Net design and token downsampling for superior image generation and drastically reduced computation cost.
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
·1784 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VL-SAM: Training-free open-ended object detection & segmentation using attention maps as prompts, surpassing previous methods on LVIS and CODA datasets.
Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning
·1671 words·8 mins·
Multimodal Learning
Sentiment Analysis
🏢 Peking University
Hierarchical Representation Learning Framework (HRLF) significantly improves Multimodal Sentiment Analysis (MSA) accuracy by effectively addressing incomplete data through fine-grained representation …
To Learn or Not to Learn, That is the Question — A Feature-Task Dual Learning Model of Perceptual Learning
·1867 words·9 mins·
Machine Learning
Transfer Learning
🏢 Peking University
A new dual-learning model resolves the paradox of perceptual learning, showing how task-based and feature-based learning interact to produce both specific and transferable improvements in sensory perc…
The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing
·1551 words·8 mins·
AI Theory
Optimization
🏢 Peking University
Leveraging data heterogeneity, this study reveals that standard SGD implicitly learns invariant features across multiple environments, achieving robust generalization without explicit regularization.
Temporal Sentence Grounding with Relevance Feedback in Videos
·2432 words·12 mins·
Natural Language Processing
Vision-Language Models
🏢 Peking University
RaTSG network tackles Temporal Sentence Grounding with Relevance Feedback (TSG-RF) by discerning query relevance at multiple granularities before selectively grounding segments.
Template-free Articulated Gaussian Splatting for Real-time Reposable Dynamic View Synthesis
·2005 words·10 mins·
Computer Vision
3D Vision
🏢 Peking University
This research introduces a template-free articulated Gaussian splatting method for real-time dynamic view synthesis, automatically discovering object skeletons from videos to enable reposing.
Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks
·1272 words·6 mins·
Machine Learning
Deep Learning
🏢 Peking University
Shortcut back-propagation and an evolutionary training framework conquer gradient vanishing in spiking neural networks, drastically improving training and achieving state-of-the-art accuracy.
StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences
·2803 words·14 mins·
AI Generated
Computer Vision
Video Understanding
🏢 Peking University
StreamFlow accelerates video optical flow estimation by 44% via a streamlined in-batch multi-frame pipeline and innovative spatiotemporal modeling, achieving state-of-the-art results.
Statistical Efficiency of Distributional Temporal Difference Learning
·295 words·2 mins·
Reinforcement Learning
🏢 Peking University
Researchers achieve minimax optimal sample complexity bounds for distributional temporal difference learning, enhancing reinforcement learning algorithm efficiency.
Spiking Transformer with Experts Mixture
·2017 words·10 mins·
Computer Vision
Image Classification
🏢 Peking University
Spiking Experts Mixture Mechanism (SEMM) boosts Spiking Transformers by integrating Mixture-of-Experts for efficient, sparse conditional computation, achieving significant performance improvements on …