🏢 Hong Kong University of Science and Technology

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

7 January 2025·3018 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.

TransPixar: Advancing Text-to-Video Generation with Transparency

6 January 2025·2458 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

2 January 2025·3152 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

VideoAnydoor: High-fidelity video object insertion with precise motion control, achieved via an end-to-end framework leveraging an ID extractor and a pixel warper for robust detail preservation and fi…

A3: Android Agent Arena for Mobile GUI Agents

2 January 2025·2276 words·11 mins

AI Generated 🤗 Daily Papers AI Applications Human-AI Interaction 🏢 Hong Kong University of Science and Technology

Android Agent Arena (A3): A novel evaluation platform for mobile GUI agents offering diverse tasks, flexible action space, and automated LLM-based evaluation, advancing real-world AI agent research.

Edicho: Consistent Image Editing in the Wild

30 December 2024·2565 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Hong Kong University of Science and Technology

Edicho: a novel training-free method for consistent image editing across diverse images, achieving precise consistency by leveraging explicit correspondence.

Diving into Self-Evolving Training for Multimodal Reasoning

23 December 2024·3292 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Hong Kong University of Science and Technology

M-STAR, a novel self-evolving training framework, significantly boosts multimodal reasoning in large models without human annotation, achieving state-of-the-art results.

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

23 December 2024·2172 words·11 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hong Kong University of Science and Technology

B-STaR dynamically balances exploration and exploitation in self-taught reasoners, achieving superior performance in mathematical, coding, and commonsense reasoning tasks.

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

19 December 2024·2604 words·13 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

MegaPairs synthesizes 26M+ high-quality multimodal retrieval training examples, enabling state-of-the-art zero-shot performance and surpassing existing methods trained on 70x more data.

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

19 December 2024·2715 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

LeviTor brings intuitive 3D trajectory control to image-to-video synthesis, generating realistic videos from static images by abstracting object masks into depth-aware control points.

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

18 December 2024·2901 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

DCE is a novel caption engine that leverages visual specialists to generate comprehensive, detailed descriptions, surpassing both LMM-generated and human-annotated captions.

AniDoc: Animation Creation Made Easier

18 December 2024·2223 words·11 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

AniDoc automates the colorization of line-art videos for cartoon animation, making animation creation easier.

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

15 December 2024·3380 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Hong Kong University of Science and Technology

GaussianProperty is a training-free method that adds physical properties to 3D Gaussians using large multimodal models (LMMs).

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

12 December 2024·3111 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology

Lyra: a speech-centric framework for omni-cognition that achieves state-of-the-art results across various modalities while remaining highly efficient.

VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

3 December 2024·2511 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.

OmniCreator: Self-Supervised Unified Generation with Universal Editing

3 December 2024·5399 words·26 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

OmniCreator: self-supervised, unified image and video generation with universal editing.

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

21 November 2024·4302 words·21 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

MagicDriveDiT generates high-resolution, long street-view videos with precise control, overcoming limitations of previous methods in autonomous driving.

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

9 November 2024·3715 words·18 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hong Kong University of Science and Technology

Golden Touchstone, a new bilingual benchmark, comprehensively evaluates financial LLMs across eight tasks, revealing model strengths and weaknesses and advancing FinLLM research.