2025-01-16

XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework

15 January 2025·3087 words·15 mins

AI Generated · 🤗 Daily Papers · Speech and Audio · Music Generation · 🏢 Tencent AI Lab

XMusic is a new framework that generates high-quality, emotionally controllable symbolic music from various prompts (images, videos, text, tags, humming).

Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography

15 January 2025·1464 words·7 mins

AI Generated · 🤗 Daily Papers · AI Theory · Privacy · 🏢 Google DeepMind

Machine learning models can enable secure computations previously impossible with cryptography, achieving privacy and efficiency in Trusted Capable Model Environments (TCMEs).

RepVideo: Rethinking Cross-Layer Representation for Video Generation

15 January 2025·2785 words·14 mins

AI Generated · 🤗 Daily Papers · Computer Vision · Video Understanding · 🏢 Nanyang Technological University

RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.

Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

15 January 2025·2366 words·12 mins

AI Generated · 🤗 Daily Papers · Computer Vision · Video Understanding · 🏢 University of Rochester

Ouroboros-Diffusion: A novel tuning-free long video generation framework achieving unprecedented content consistency by cleverly integrating information across frames via latent sampling, cross-frame…

Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

15 January 2025·3561 words·17 mins

AI Generated · 🤗 Daily Papers · Multimodal Learning · Vision-Language Models · 🏢 Hong Kong Polytechnic University

Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompt method (ArtCoT), significantly improving AI alignment in artistic evaluation.

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

15 January 2025·1663 words·8 mins

AI Generated · 🤗 Daily Papers · Multimodal Learning · Cross-Modal Retrieval · 🏢 Noah's Ark Lab, Huawei

MMDocIR, a new benchmark dataset, enables better evaluation of multi-modal document retrieval systems by providing page-level and layout-level annotations for diverse long documents and questions.

CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

15 January 2025·3972 words·19 mins

AI Generated · 🤗 Daily Papers · Computer Vision · 3D Vision · 🏢 Tencent AI Lab

CityDreamer4D generates realistic, unbounded 4D city models by cleverly separating dynamic objects (like vehicles) from static elements (buildings, roads), using multiple neural fields for enhanced re…

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

14 January 2025·4505 words·22 mins

AI Generated · 🤗 Daily Papers · Multimodal Learning · Vision-Language Models · 🏢 Tsinghua University

Parameter-Inverted Image Pyramid Networks (PIIP) drastically cut visual model computing costs without sacrificing accuracy by using smaller models for higher-resolution images and larger models for lo…