🏢 University of Chinese Academy of Sciences

VMamba: Visual State Space Model
·2891 words·14 mins
Image Classification 🏢 University of Chinese Academy of Sciences
VMamba: a vision backbone achieving linear time complexity using Visual State Space (VSS) blocks and 2D Selective Scan (SS2D) for efficient visual representation.
Trajectory Diffusion for ObjectGoal Navigation
·2125 words·10 mins
Multimodal Learning Embodied AI 🏢 University of Chinese Academy of Sciences
Trajectory Diffusion (T-Diff) significantly improves object goal navigation by learning sequential planning through trajectory diffusion, resulting in more accurate and efficient navigation.
Rethinking 3D Convolution in $\ell_p$-norm Space
·1754 words·9 mins
3D Vision 🏢 University of Chinese Academy of Sciences
L1-norm-based 3D convolution achieves competitive performance with lower energy consumption and latency than traditional methods, as proven through universal approximation theorem and experimen…
Generative Retrieval Meets Multi-Graded Relevance
·2003 words·10 mins
Information Retrieval 🏢 University of Chinese Academy of Sciences
GR2, a novel framework, extends generative retrieval to handle multi-graded relevance, addressing limitations of existing binary-relevance approaches by enhancing docid distinctness and implementing m…
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
·3278 words·16 mins
Natural Language Processing Vision-Language Models 🏢 University of Chinese Academy of Sciences
DEVIL: a novel text-to-video evaluation protocol focusing on video dynamics, resulting in more realistic video generation.
Dual-frame Fluid Motion Estimation with Test-time Optimization and Zero-divergence Loss
·2477 words·12 mins
Computer Vision 3D Vision 🏢 University of Chinese Academy of Sciences
Self-supervised dual-frame fluid motion estimation achieves superior accuracy with 99% less training data, using a novel zero-divergence loss and dynamic velocimetry enhancement.
Artemis: Towards Referential Understanding in Complex Videos
·3373 words·16 mins
AI Generated Natural Language Processing Large Language Models 🏢 University of Chinese Academy of Sciences
Artemis: A new MLLM excels at video-based referential understanding, accurately describing targets within complex videos using natural language questions and bounding boxes, surpassing existing models…