2025-01-16s

2025

XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
·3087 words·15 mins
AI Generated 🤗 Daily Papers Speech and Audio Music Generation 🏢 Tencent AI Lab
XMusic: A new framework generates high-quality, emotionally controllable symbolic music from various prompts (images, videos, text, tags, humming).
Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
·1464 words·7 mins
AI Generated 🤗 Daily Papers AI Theory Privacy 🏢 Google DeepMind
Machine learning models can enable secure computations previously impossible with cryptography, achieving privacy and efficiency in Trusted Capable Model Environments (TCMEs).
RepVideo: Rethinking Cross-Layer Representation for Video Generation
·2785 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
·2366 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Rochester
Ouroboros-Diffusion: A novel tuning-free long video generation framework achieving unprecedented content consistency by cleverly integrating information across frames via latent sampling, cross-frame…
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
·3561 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompt method (ArtCoT), significantly improving AI alignment in artistic evaluation.
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
·1663 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Cross-Modal Retrieval 🏢 Noah's Ark Lab, Huawei
MMDocIR, a new benchmark dataset, enables better evaluation of multi-modal document retrieval systems by providing page-level and layout-level annotations for diverse long documents and questions.
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
·3972 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
CityDreamer4D generates realistic, unbounded 4D city models by cleverly separating dynamic objects (like vehicles) from static elements (buildings, roads), using multiple neural fields for enhanced re…
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
·4505 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Parameter-Inverted Image Pyramid Networks (PIIP) drastically cut visual model computing costs without sacrificing accuracy by using smaller models for higher-resolution images and larger models for lo…