🏢 Hong Kong University of Science and Technology
GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
·2208 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Hong Kong University of Science and Technology
GaussianAvatar-Editor enables photorealistic, text-driven editing of animatable 3D head avatars, handling motion occlusions and ensuring temporal consistency.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
·3152 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoAnydoor inserts objects into videos with high fidelity and precise motion control via an end-to-end framework that couples an ID extractor with a pixel warper for robust detail preservation.
A3: Android Agent Arena for Mobile GUI Agents
·2276 words·11 mins·
AI Generated
🤗 Daily Papers
AI Applications
Human-AI Interaction
🏢 Hong Kong University of Science and Technology
Android Agent Arena (A3) is a novel evaluation platform for mobile GUI agents, offering diverse tasks, a flexible action space, and automated LLM-based evaluation to advance real-world AI agent research.
Edicho: Consistent Image Editing in the Wild
·2565 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
Edicho is a novel training-free method that achieves precise, consistent editing across diverse in-the-wild images by leveraging explicit correspondence.
Diving into Self-Evolving Training for Multimodal Reasoning
·3292 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
M-STAR, a novel self-evolving training framework, significantly boosts multimodal reasoning in large models without human annotation, achieving state-of-the-art results.
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
·2172 words·11 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
B-STaR dynamically balances exploration and exploitation in self-taught reasoners, achieving superior performance on mathematical, coding, and commonsense reasoning tasks.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2604 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
MegaPairs synthesizes 26M+ high-quality multimodal retrieval training examples, enabling state-of-the-art zero-shot performance and surpassing existing methods trained on 70x more data.
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
·2715 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
LeviTor brings intuitive 3D trajectory control to image-to-video synthesis, generating realistic videos from static images by abstracting object masks into depth-aware control points.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2901 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
DCE, a novel caption-enhancement engine, leverages visual specialists to generate comprehensive, detailed image descriptions that surpass LMM-generated and human-annotated captions.
AniDoc: Animation Creation Made Easier
·2223 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
AniDoc automates line-art video colorization for cartoon animation, making animation creation easier.
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
·3380 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Hong Kong University of Science and Technology
GaussianProperty, a training-free method, integrates physical properties into 3D Gaussians using vision-language models.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
·3111 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
Lyra is an efficient, speech-centric framework for omni-cognition that achieves state-of-the-art results across multiple modalities.
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
·2511 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
OmniCreator unifies self-supervised image and video generation with universal editing.
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control, overcoming the resolution and length limitations of previous methods for autonomous driving.
Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
·3715 words·18 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
Golden Touchstone, a new bilingual benchmark, comprehensively evaluates financial LLMs across eight tasks, revealing model strengths and weaknesses and advancing FinLLM research.