🏢 University of Science and Technology of China
Equivariant Image Modeling
·3413 words·17 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Science and Technology of China
Aligning image generation subtasks: Equivariant modeling boosts efficiency and generalization by leveraging natural visual signal invariance.
Tokenize Image as a Set
·3037 words·15 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Science and Technology of China
TokenSet: Tokenizing images as unordered sets for dynamic capacity allocation and robust generation, breaking from fixed-position latent codes.
Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling
·2283 words·11 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Machine Learning
Deep Learning
🏢 University of Science and Technology of China
UAE-3D: A unified latent space approach for efficient & high-quality 3D molecular generation, outperforming existing methods in accuracy and speed.
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
·2754 words·13 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Science and Technology of China
RelaCtrl: Relevance-guided control boosts diffusion transformer efficiency, cutting parameters by intelligently allocating resources.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
·2694 words·13 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Science and Technology of China
DepthMaster tames diffusion models for faster, more accurate monocular depth estimation by aligning generative features with high-quality semantic features and adaptively balancing low and high-freque…
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2542 words·12 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Science and Technology of China
Molar: A novel multimodal LLM framework boosts sequential recommendation accuracy by cleverly aligning collaborative filtering with rich item representations from text and non-text data.
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
·2297 words·11 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Science and Technology of China
One-shot image to realistic, animatable talking avatar! Novel pipeline uses diffusion models and a hybrid 3DGS-mesh representation, achieving seamless generalization and precise control.