🏢 University of Chinese Academy of Sciences

VMamba: Visual State Space Model

26 September 2024·2891 words·14 mins

Image Classification 🏢 University of Chinese Academy of Sciences

VMamba: a vision backbone achieving linear time complexity using Visual State Space (VSS) blocks and 2D Selective Scan (SS2D) for efficient visual representation.

Trajectory Diffusion for ObjectGoal Navigation

26 September 2024·2125 words·10 mins

Multimodal Learning Embodied AI 🏢 University of Chinese Academy of Sciences

Trajectory Diffusion (T-Diff) improves ObjectGoal navigation by learning sequential planning through trajectory diffusion, yielding more accurate and efficient navigation.

Rethinking 3D Convolution in $\ell_p$-norm Space

26 September 2024·1754 words·9 mins

3D Vision 🏢 University of Chinese Academy of Sciences

L1-norm based 3D convolution achieves competitive performance with lower energy consumption and latency compared to traditional methods, as proven through universal approximation theorem and experimen…

Generative Retrieval Meets Multi-Graded Relevance

26 September 2024·2003 words·10 mins

Information Retrieval 🏢 University of Chinese Academy of Sciences

GR2, a novel framework, extends generative retrieval to handle multi-graded relevance, addressing limitations of existing binary-relevance approaches by enhancing docid distinctness and implementing m…

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

26 September 2024·3278 words·16 mins

Natural Language Processing Vision-Language Models 🏢 University of Chinese Academy of Sciences

DEVIL: a novel text-to-video evaluation protocol focusing on video dynamics, resulting in more realistic video generation.

Dual-frame Fluid Motion Estimation with Test-time Optimization and Zero-divergence Loss

26 September 2024·2477 words·12 mins

Computer Vision 3D Vision 🏢 University of Chinese Academy of Sciences

Self-supervised dual-frame fluid motion estimation achieves superior accuracy with 99% less training data, using a novel zero-divergence loss and dynamic velocimetry enhancement.

Artemis: Towards Referential Understanding in Complex Videos

26 September 2024·3373 words·16 mins

AI Generated Natural Language Processing Large Language Models 🏢 University of Chinese Academy of Sciences

Artemis: A new MLLM excels at video-based referential understanding, accurately describing targets within complex videos using natural language questions and bounding boxes, surpassing existing models…