🏢 Hong Kong University of Science and Technology
Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research
·3084 words·15 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
Perovskite-LLM: a new knowledge-enhanced system boosts perovskite solar cell research by integrating a domain-specific knowledge graph, high-quality datasets, and specialized LLMs for superior knowled…
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
·2594 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
mmMamba: a novel framework that creates linear-complexity multimodal models via distillation, drastically improving efficiency without sacrificing performance.
Atom of Thoughts for Markov LLM Test-Time Scaling
·2660 words·13 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
Atom of Thoughts (AOT) improves LLM test-time scaling by decomposing complex reasoning into independent sub-questions, drastically reducing computation while maintaining high accuracy.
FinMTEB: Finance Massive Text Embedding Benchmark
·3630 words·18 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
FinMTEB: A new benchmark reveals that general-purpose embedding models struggle in the finance domain; domain-specific models excel, and surprisingly, simple BoW outperforms sophisticated models on ce…
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
·3464 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
ThinkDiff empowers text-to-image diffusion models with multimodal reasoning by aligning vision-language models to an LLM decoder, achieving state-of-the-art results on in-context reasoning benchmarks.
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
·5174 words·25 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
CodeI/O: Condensing reasoning patterns from code into LLM training data for enhanced reasoning.
CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
·2569 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
CustomVideoX: Zero-shot personalized video generation, exceeding existing methods in quality and consistency via 3D reference attention and dynamic adaptation.
Generating Symbolic World Models via Test-time Scaling of Large Language Models
·2722 words·13 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
LLMs excel at complex reasoning but struggle with planning; this paper introduces a test-time scaling approach that enhances LLMs’ PDDL reasoning, enabling high-quality PDDL domain generation, outperf…
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
·4450 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
FlashVideo: Generates high-resolution videos efficiently using a two-stage framework prioritizing fidelity and detail, achieving state-of-the-art results.
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
·3315 words·16 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Text Generation
🏢 Hong Kong University of Science and Technology
Llasa, a novel single-Transformer TTS model, achieves state-of-the-art performance by scaling both training and inference compute, improving naturalness, prosody and emotional expressiveness.
Weak-to-Strong Diffusion with Reflection
·4655 words·22 mins·
AI Generated
🤗 Daily Papers
Machine Learning
Deep Learning
🏢 Hong Kong University of Science and Technology
W2SD: A novel framework boosts diffusion model quality by using the difference between weak and strong models to refine sampling trajectories, achieving state-of-the-art performance.
GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
·2208 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Hong Kong University of Science and Technology
GaussianAvatar-Editor enables photorealistic, text-driven editing of animatable 3D heads, solving motion occlusion and ensuring temporal consistency.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
·3152 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoAnydoor: High-fidelity video object insertion with precise motion control, achieved via an end-to-end framework leveraging an ID extractor and a pixel warper for robust detail preservation and fi…
A3: Android Agent Arena for Mobile GUI Agents
·2276 words·11 mins·
AI Generated
🤗 Daily Papers
AI Applications
Human-AI Interaction
🏢 Hong Kong University of Science and Technology
Android Agent Arena (A3): A novel evaluation platform for mobile GUI agents offering diverse tasks, flexible action space, and automated LLM-based evaluation, advancing real-world AI agent research.
Edicho: Consistent Image Editing in the Wild
·2565 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
Edicho: a novel training-free method for consistent image editing across diverse images, achieving precise consistency by leveraging explicit correspondence.
Diving into Self-Evolving Training for Multimodal Reasoning
·3292 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
M-STAR: a novel self-evolving training framework significantly boosts multimodal reasoning in large models without human annotation, achieving state-of-the-art results.
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
·2172 words·11 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong University of Science and Technology
B-STaR dynamically balances exploration and exploitation in self-taught reasoners, achieving superior performance in mathematical, coding, and commonsense reasoning tasks.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2604 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
MegaPairs synthesizes 26M+ high-quality multimodal retrieval training examples, enabling state-of-the-art zero-shot performance and surpassing existing methods trained on 70x more data.