🏢 Microsoft Research
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
·2010 words·10 mins·
Multimodal Learning
Human-AI Interaction
🏢 Microsoft Research
VASA-1 generates lifelike talking faces in real time from a single image and audio.
Understanding Information Storage and Transfer in Multi-Modal Large Language Models
·2906 words·14 mins·
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Researchers unveil how multi-modal LLMs process information, revealing that early layers are key for storage, and introduce MULTEDIT, a model-editing algorithm for correcting errors and inserting new …
Understanding and Improving Training-free Loss-based Diffusion Guidance
·2849 words·14 mins·
AI Generated
Computer Vision
Image Generation
🏢 Microsoft Research
Training-free guidance revolutionizes diffusion models by enabling zero-shot conditional generation, but suffers from misaligned gradients and slow convergence. This paper provides theoretical analysi…
Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs
·2686 words·13 mins·
AI Theory
Optimization
🏢 Microsoft Research
Trace: Automating AI workflow design with LLMs.
Towards Flexible Visual Relationship Segmentation
·3217 words·16 mins·
AI Generated
Computer Vision
Image Segmentation
🏢 Microsoft Research
FleVRS: One unified model masters standard, promptable, and open-vocabulary visual relationship segmentation, outperforming existing methods.
Towards Editing Time Series
·4219 words·20 mins·
AI Generated
AI Applications
Smart Cities
🏢 Microsoft Research
TEdit is a diffusion model that edits existing time series to meet specified attribute targets while preserving their other properties, addressing limitations of prior synthesis methods.
Slot-VLM: Object-Event Slots for Video-Language Modeling
·4378 words·21 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Slot-VLM generates semantically decomposed video tokens using an Object-Event Slots module, improving video-language model performance.
Scaling the Codebook Size of VQ-GAN to 100,000 with a Utilization Rate of 99%
·2947 words·14 mins·
Computer Vision
Image Generation
🏢 Microsoft Research
VQGAN-LC massively scales VQGAN’s codebook to 100,000 entries while maintaining a 99% utilization rate, significantly boosting image generation and downstream task performance.
Protecting Your LLMs with Information Bottleneck
·2699 words·13 mins·
Natural Language Processing
Large Language Models
🏢 Microsoft Research
IBProtector shields LLMs from harmful outputs via prompt compression, selectively preserving essential information using a trainable extractor.
Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning
·3127 words·15 mins·
AI Generated
Natural Language Processing
Machine Translation
🏢 Microsoft Research
PCformer boosts Transformer performance by using a predictor-corrector learning framework and exponential moving average coefficient learning for high-order prediction, achieving state-of-the-art resu…
Policy Improvement using Language Feedback Models
·3358 words·16 mins·
AI Generated
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Boosting AI instruction following, Language Feedback Models (LFMs) leverage Large Language Models (LLMs) to identify desirable behaviors from visual trajectories, significantly improving task completi…
Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning
·3395 words·16 mins·
AI Generated
Machine Learning
Deep Learning
🏢 Microsoft Research
Physically consistent multi-task learning bridges heterogeneous molecular data by directly leveraging physical laws to improve predictions, enhancing accuracy beyond the limitations of individual data…
Online Estimation via Offline Estimation: An Information-Theoretic Framework
·1315 words·7 mins·
AI Theory
Optimization
🏢 Microsoft Research
This paper introduces an information-theoretic framework that shows how to efficiently convert offline estimation algorithms into online ones, with implications for interactive decision-making.
Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
·4263 words·21 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
AI-generated preference data improves text-to-image alignment.
Multi-Head Mixture-of-Experts
·2844 words·14 mins·
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Multi-Head Mixture-of-Experts (MH-MoE) drastically boosts large language model efficiency by activating almost all expert networks, achieving superior performance compared to existing Sparse Mixture-o…
Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
·4302 words·21 mins·
Natural Language Processing
Large Language Models
🏢 Microsoft Research
LLMs’ spatial reasoning abilities are boosted by visualizing their thought processes via ‘Visualization-of-Thought’ prompting, significantly improving performance on navigation and tiling tasks.
Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning
·1726 words·9 mins·
Reinforcement Learning
🏢 Microsoft Research
Offline imitation learning achieves surprisingly strong performance, matching online methods’ efficiency under certain conditions, contradicting prior assumptions.
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
·3090 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
SpatialEval benchmark reveals that current vision-language models struggle with spatial reasoning, highlighting the need for improved multimodal models that effectively integrate visual and textual in…
Infusing Self-Consistency into Density Functional Theory Hamiltonian Prediction via Deep Equilibrium Models
·1907 words·9 mins·
Machine Learning
Deep Learning
🏢 Microsoft Research
Deep Equilibrium Models (DEQs) infused into DFT Hamiltonian prediction achieve self-consistency, accelerating large-scale materials simulations.
Improving Context-Aware Preference Modeling for Language Models
·1939 words·10 mins·
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Context-aware preference modeling improves language model alignment by resolving ambiguity through a two-step process: context selection followed by context-specific preference evaluation. The approa…