🏢 Apple
Cut Your Losses in Large-Vocabulary Language Models
·2958 words·14 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Apple
Cut Cross-Entropy (CCE) dramatically reduces the memory footprint of training large language models by cleverly computing the cross-entropy loss without materializing the full logit matrix.
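Below is a minimal, hypothetical PyTorch sketch of the idea the summary describes: computing the target-token logit and a streaming log-sum-exp over vocabulary chunks so the full [num_tokens, vocab_size] logit matrix is never materialized. The function name, shapes, and the eager-mode chunking loop are illustrative assumptions; the paper's CCE relies on custom fused kernels rather than a Python loop.

```python
import torch

def chunked_cross_entropy(hidden, classifier, targets, chunk_size=8192):
    """Cross-entropy over a large vocabulary without building the full
    [num_tokens, vocab_size] logit matrix (illustrative sketch, not the
    paper's fused-kernel implementation).

    hidden:     [num_tokens, hidden_dim] final-layer activations
    classifier: [vocab_size, hidden_dim] output-projection weights
    targets:    [num_tokens] target token ids
    """
    num_tokens = hidden.shape[0]
    vocab_size = classifier.shape[0]

    # Logit of the correct token: one dot product per position.
    target_logits = (hidden * classifier[targets]).sum(dim=-1)

    # Streaming, numerically stable log-sum-exp over vocabulary chunks.
    running_max = torch.full((num_tokens,), float("-inf"), device=hidden.device)
    running_sum = torch.zeros(num_tokens, device=hidden.device)
    for start in range(0, vocab_size, chunk_size):
        chunk_logits = hidden @ classifier[start:start + chunk_size].T
        chunk_max = chunk_logits.max(dim=-1).values
        new_max = torch.maximum(running_max, chunk_max)
        running_sum = running_sum * torch.exp(running_max - new_max) \
            + torch.exp(chunk_logits - new_max[:, None]).sum(dim=-1)
        running_max = new_max

    log_z = running_max + torch.log(running_sum)
    return (log_z - target_logits).mean()
```

Peak memory per step is bounded by one [num_tokens, chunk_size] block instead of the full logit matrix, which is where the footprint reduction comes from.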
Controlling Language and Diffusion Models by Transporting Activations
·11502 words·54 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Apple
Steering large language and diffusion models is made easy and efficient via Activation Transport (ACT)! This novel framework uses optimal transport theory to precisely control model activations, leadi…
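A minimal, hypothetical PyTorch sketch of the transport idea described above: per-unit affine (Gaussian) optimal-transport maps are estimated from source and target activation samples and applied through a forward hook with an interpolation strength. The estimator, names, and module path are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import torch

def fit_affine_transport(source_acts, target_acts, eps=1e-6):
    """Per-unit affine transport map between two activation distributions,
    assuming each unit is approximately Gaussian:
        t(a) = mu_tgt + (sigma_tgt / sigma_src) * (a - mu_src)

    source_acts, target_acts: [num_samples, num_units] activations collected
    from prompts exhibiting the source and target behaviours.
    """
    mu_s, sigma_s = source_acts.mean(0), source_acts.std(0)
    mu_t, sigma_t = target_acts.mean(0), target_acts.std(0)
    scale = sigma_t / (sigma_s + eps)
    shift = mu_t - scale * mu_s
    return scale, shift

def transport_hook(scale, shift, strength=1.0):
    """Forward hook that moves a layer's activations toward the target
    distribution; strength in [0, 1] interpolates between the original
    output (0) and the fully transported one (1)."""
    def hook(module, inputs, output):
        transported = scale * output + shift
        return (1.0 - strength) * output + strength * transported
    return hook

# Usage (hypothetical module path):
# layer = model.transformer.h[10].mlp
# scale, shift = fit_affine_transport(src_acts, tgt_acts)
# handle = layer.register_forward_hook(transport_hook(scale, shift, strength=0.5))
```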