🏢 Microsoft Research

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

26 September 2024·2010 words·10 mins

Multimodal Learning Human-AI Interaction 🏢 Microsoft Research

VASA-1: Real-time, lifelike talking faces generated from a single image and audio!

Understanding Information Storage and Transfer in Multi-Modal Large Language Models

26 September 2024·2906 words·14 mins

Natural Language Processing Large Language Models 🏢 Microsoft Research

Researchers unveil how multi-modal LLMs process information, revealing that early layers are key for storage, and introduce MULTEDIT, a model-editing algorithm for correcting errors and inserting new …

Understanding and Improving Training-free Loss-based Diffusion Guidance

26 September 2024·2849 words·14 mins

AI Generated Computer Vision Image Generation 🏢 Microsoft Research

Training-free guidance revolutionizes diffusion models by enabling zero-shot conditional generation, but suffers from misaligned gradients and slow convergence. This paper provides theoretical analysi…

Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

26 September 2024·2686 words·13 mins

AI Theory Optimization 🏢 Microsoft Research

Trace: Automating AI workflow design with LLMs.

Towards Flexible Visual Relationship Segmentation

26 September 2024·3217 words·16 mins

AI Generated Computer Vision Image Segmentation 🏢 Microsoft Research

FleVRS: One unified model masters standard, promptable, and open-vocabulary visual relationship segmentation, outperforming existing methods.

Towards Editing Time Series

26 September 2024·4219 words·20 mins

AI Generated AI Applications Smart Cities 🏢 Microsoft Research

TEdit, a novel diffusion model, edits existing time series to meet specified attribute targets while preserving other properties, addressing limitations of prior synthesis methods.

Slot-VLM: Object-Event Slots for Video-Language Modeling

26 September 2024·4378 words·21 mins

AI Generated Multimodal Learning Vision-Language Models 🏢 Microsoft Research

Slot-VLM generates semantically decomposed video tokens using an Object-Event Slots module, improving video-language model performance.

Scaling the Codebook Size of VQ-GAN to 100,000 with a Utilization Rate of 99%

26 September 2024·2947 words·14 mins

Computer Vision Image Generation 🏢 Microsoft Research

VQGAN-LC massively scales VQGAN’s codebook to 100,000 entries while maintaining a 99% utilization rate, significantly boosting image generation and downstream task performance.

Protecting Your LLMs with Information Bottleneck

26 September 2024·2699 words·13 mins

Natural Language Processing Large Language Models 🏢 Microsoft Research

IBProtector shields LLMs from harmful outputs via prompt compression, selectively preserving essential information using a trainable extractor.

Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

26 September 2024·3127 words·15 mins

AI Generated Natural Language Processing Machine Translation 🏢 Microsoft Research

PCformer boosts Transformer performance by using a predictor-corrector learning framework and exponential moving average coefficient learning for high-order prediction, achieving state-of-the-art resu…

Policy Improvement using Language Feedback Models

26 September 2024·3358 words·16 mins

AI Generated Natural Language Processing Large Language Models 🏢 Microsoft Research

Boosting AI instruction following, Language Feedback Models (LFMs) leverage Large Language Models (LLMs) to identify desirable behaviors from visual trajectories, significantly improving task completi…

Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning

26 September 2024·3395 words·16 mins

AI Generated Machine Learning Deep Learning 🏢 Microsoft Research

Physically consistent multi-task learning bridges heterogeneous molecular data by directly leveraging physical laws to improve predictions, enhancing accuracy beyond the limitations of individual data…

Online Estimation via Offline Estimation: An Information-Theoretic Framework

26 September 2024·1315 words·7 mins

AI Theory Optimization 🏢 Microsoft Research

This paper introduces a novel information-theoretic framework, showing how to convert offline into online estimation algorithms efficiently, impacting interactive decision-making.

Multimodal Large Language Models Make Text-to-Image Generative Models Align Better

26 September 2024·4263 words·21 mins

Multimodal Learning Vision-Language Models 🏢 Microsoft Research

AI-generated preference data improves text-to-image alignment.

Multi-Head Mixture-of-Experts

26 September 2024·2844 words·14 mins

Natural Language Processing Large Language Models 🏢 Microsoft Research

Multi-Head Mixture-of-Experts (MH-MoE) drastically boosts large language model efficiency by activating almost all expert networks, achieving superior performance compared to existing Sparse Mixture-o…

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

26 September 2024·4302 words·21 mins

Natural Language Processing Large Language Models 🏢 Microsoft Research

‘Visualization-of-Thought’ prompting boosts LLMs’ spatial reasoning by having them visualize their thought processes, significantly improving performance on navigation and tiling tasks.

Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning

26 September 2024·1726 words·9 mins

Reinforcement Learning 🏢 Microsoft Research

Offline imitation learning achieves surprisingly strong performance, matching online methods’ efficiency under certain conditions, contradicting prior assumptions.

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

26 September 2024·3090 words·15 mins

Multimodal Learning Vision-Language Models 🏢 Microsoft Research

SpatialEval benchmark reveals that current vision-language models struggle with spatial reasoning, highlighting the need for improved multimodal models that effectively integrate visual and textual in…

Infusing Self-Consistency into Density Functional Theory Hamiltonian Prediction via Deep Equilibrium Models

26 September 2024·1907 words·9 mins

Machine Learning Deep Learning 🏢 Microsoft Research

Infusing Deep Equilibrium Models (DEQs) into DFT Hamiltonian prediction achieves self-consistency, accelerating large-scale materials simulations.

Improving Context-Aware Preference Modeling for Language Models

26 September 2024·1939 words·10 mins

Natural Language Processing Large Language Models 🏢 Microsoft Research

Context-aware preference modeling improves language model alignment by resolving ambiguity through a two-step process: context selection followed by context-specific preference evaluation. The approa…