🏢 Microsoft Research
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
·3910 words·19 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Small language models can master complex math reasoning using self-evolved deep thinking via Monte Carlo Tree Search, surpassing larger models in performance.
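The summary above mentions deep thinking via Monte Carlo Tree Search over reasoning steps. As a generic, hedged sketch of that search loop (not rStar-Math's actual algorithm; `expand` and `reward` are placeholders standing in for its policy and reward models):

```python
import math
import random

# Generic MCTS over reasoning "steps" -- an illustrative sketch only.
# `expand` and `reward` are hypothetical stand-ins for a policy model
# (proposes next steps) and a reward model (scores a trace).

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial reasoning trace (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Upper Confidence Bound: balances exploitation vs. exploration.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, reward, iters=100):
    for _ in range(iters):
        # 1. Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: generate candidate next steps.
        for step in expand(node.state):
            node.children.append(Node(node.state + [step], parent=node))
        leaf = random.choice(node.children) if node.children else node
        # 3. Simulation: score the (partial) trace.
        r = reward(leaf.state)
        # 4. Backpropagation: update statistics up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    # Return the most-visited first step.
    return max(root.children, key=lambda n: n.visits)

# Toy usage: "steps" are numbers; the reward prefers traces summing to 10.
best = mcts(Node([]),
            expand=lambda s: [1, 2, 3] if len(s) < 4 else [],
            reward=lambda s: 1.0 if sum(s) == 10 else 0.0,
            iters=200)
print(best.state)
```

The same selection/expansion/simulation/backpropagation skeleton applies when the steps are model-generated reasoning text rather than numbers.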
GeAR: Generation Augmented Retrieval
·1952 words·10 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Question Answering
🏢 Microsoft Research
GeAR, a new retrieval model, boosts accuracy by combining document retrieval with fine-grained information generation, leading to better understanding and improved localization.
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
·2482 words·12 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Microsoft Research
MixLLM achieves state-of-the-art LLM compression by using mixed-precision quantization between output features, improving accuracy and system efficiency.
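The idea of mixing precisions across output features can be illustrated with a small sketch (a generic illustration, not MixLLM's actual algorithm; here each output channel's L2 norm is a hypothetical salience proxy, with salient channels kept at 8 bits and the rest at 4 bits):

```python
import numpy as np

def quantize_channel(w, bits):
    # Symmetric per-channel quantization to `bits` signed levels,
    # returned in dequantized form for easy error inspection.
    qmax = 2 ** (bits - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def mixed_precision_quantize(W, frac_high=0.1):
    # W: [out_features, in_features]. Keep the top `frac_high` most
    # salient output channels at 8 bits, quantize the rest to 4 bits.
    salience = np.linalg.norm(W, axis=1)
    k = max(1, int(frac_high * W.shape[0]))
    high = set(np.argsort(salience)[-k:].tolist())
    Wq = np.empty_like(W)
    for i in range(W.shape[0]):
        Wq[i] = quantize_channel(W[i], 8 if i in high else 4)
    return Wq

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
Wq = mixed_precision_quantize(W)
err = np.abs(W - Wq).mean()
print(f"mean abs quantization error: {err:.4f}")
```

Spending extra bits only on the few channels that dominate the output is what lets this style of scheme approach 8-bit accuracy at close to 4-bit storage cost.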
VidTok: A Versatile and Open-Source Video Tokenizer
·2918 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Microsoft Research
VidTok: a versatile, open-source, top-performing video tokenizer.
Phi-4 Technical Report
·2630 words·13 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Phi-4: a 14B-parameter LLM that surpasses its teacher model (GPT-4) on STEM-focused QA through innovative synthetic-data generation and post-training techniques.
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
·5249 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
OLA-VLM boosts multimodal LLMs’ visual understanding by distilling knowledge from specialized visual encoders into the LLM’s internal representations during pretraining, achieving significant performance gains.
Multimodal Latent Language Modeling with Next-Token Diffusion
·4442 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 Microsoft Research
LatentLM: a novel multimodal model unifying discrete & continuous data via next-token diffusion, surpassing existing methods in performance & scalability across various tasks.
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
·3412 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Florence-VL enhances vision-language models by incorporating a generative vision encoder and a novel depth-breadth fusion architecture, achieving state-of-the-art results on various benchmarks.
MH-MoE: Multi-Head Mixture-of-Experts
·1899 words·9 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Microsoft Research
MH-MoE: a novel implementation of Multi-Head Mixture-of-Experts that achieves superior performance in large language models while matching the parameter count and computational cost of sparse Mixture-of-Experts baselines.
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
·2445 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
LLM2CLIP boosts CLIP’s performance by cleverly integrating LLMs, enabling it to understand longer, more complex image captions and achieving state-of-the-art results across various benchmarks.
BitNet a4.8: 4-bit Activations for 1-bit LLMs
·2844 words·14 mins·
AI Generated
Natural Language Processing
Large Language Models
🏢 Microsoft Research
BitNet a4.8 matches the performance of existing 1-bit LLMs while delivering significantly faster inference, using a hybrid quantization and sparsification strategy for 4-bit activations.
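The general idea behind combining quantization with sparsification can be sketched as follows (a hedged, generic illustration, not the BitNet a4.8 kernel): outlier activations are kept in full precision as a sparse residual, and the dense remainder is quantized to 4 bits.

```python
import numpy as np

def quant4_with_sparse_outliers(x, outlier_frac=0.02):
    # Split off the largest-magnitude `outlier_frac` of values as a
    # sparse full-precision residual; 4-bit quantize the dense rest.
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]   # magnitude cutoff
    outlier_mask = np.abs(x) >= thresh
    dense = np.where(outlier_mask, 0.0, x)             # outliers zeroed
    qmax = 7                                           # signed 4-bit: [-8, 7]
    dmax = np.abs(dense).max()
    scale = dmax / qmax if dmax > 0 else 1.0
    q = np.clip(np.round(dense / scale), -8, qmax).astype(np.int8)
    sparse = np.where(outlier_mask, x, 0.0)            # fp residual
    return q, scale, sparse

def dequantize(q, scale, sparse):
    return q.astype(np.float32) * scale + sparse

rng = np.random.default_rng(1)
x = rng.normal(size=4096).astype(np.float32)
x[::512] *= 20.0                       # inject a few activation outliers
q, scale, sparse = quant4_with_sparse_outliers(x)
xh = dequantize(q, scale, sparse)
print(f"max abs error: {np.abs(x - xh).max():.4f}")
```

Removing the outliers first keeps the 4-bit scale small, so the bulk of the activations quantize with low error while the rare large values survive exactly.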