Oral Others

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

26 September 2024·2010 words·10 mins

Multimodal Learning Human-AI Interaction 🏢 Microsoft Research

VASA-1: Real-time, lifelike talking faces generated from a single image and audio!

The Road Less Scheduled

26 September 2024·2275 words·11 mins

Optimization 🏢 Princeton University

Schedule-Free optimization achieves state-of-the-art results without needing learning rate schedules, simplifying training and improving efficiency.

Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators

26 September 2024·2876 words·14 mins

🏢 National University of Singapore

Stochastic Taylor Derivative Estimator (STDE) drastically accelerates the optimization of neural networks involving high-dimensional, high-order differential operators by efficiently amortizing comput…

Scale Equivariant Graph Metanetworks

26 September 2024·1680 words·8 mins

🏢 National and Kapodistrian University of Athens

ScaleGMNs, a new framework, enhances neural network processing by incorporating scaling symmetries, boosting performance across various tasks and datasets.

RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

26 September 2024·2214 words·11 mins

Question Answering 🏢 Tencent AI Lab

RG-SAN achieves state-of-the-art 3D referring expression segmentation by leveraging spatial awareness and rule-guided weak supervision, significantly improving accuracy and handling of ambiguous descr…

NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction

26 September 2024·2374 words·12 mins

Video Understanding 🏢 Tongji University

NeuroClips: fMRI-to-video reconstruction that achieves high-fidelity, smooth video of up to 6 s at 8 FPS by decoding both high-level semantics and low-level perception flows.

MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

26 September 2024·1982 words·10 mins

3D Vision 🏢 University of California, San Diego

MeshFormer: High-quality 3D mesh generation from sparse views in seconds, using transformers and 3D convolutions.

MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

26 September 2024·2756 words·13 mins

Question Answering 🏢 Massachusetts Institute of Technology

MDAgents: An adaptive multi-agent LLM framework boosts medical decision-making accuracy by dynamically adjusting collaboration structures based on task complexity.

GIC: Gaussian-Informed Continuum for Physical Property Identification and Simulation

26 September 2024·2226 words·11 mins

3D Vision 🏢 Hong Kong University of Science and Technology

GIC: Novel hybrid framework leverages 3D Gaussian representation for accurate physical property estimation from visual observations, achieving state-of-the-art performance.

Flipped Classroom: Aligning Teacher Attention with Student in Generalized Category Discovery

26 September 2024·3153 words·15 mins

Image Classification 🏢 Xi'an Jiaotong University

FlipClass dynamically updates the teacher model in a teacher-student framework to align with the student’s attention, resolving learning inconsistencies and significantly improving generalized categor…

Exploitation of a Latent Mechanism in Graph Contrastive Learning: Representation Scattering

26 September 2024·1847 words·9 mins

Self-Supervised Learning 🏢 Tianjin University

SGRL, a novel graph contrastive learning framework, significantly boosts performance by leveraging the inherent ‘representation scattering’ mechanism and integrating graph topology, outperforming exis…

E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection

26 September 2024·2822 words·14 mins

Object Detection 🏢 Xidian University

E2E-MFD: A novel end-to-end multimodal fusion detection algorithm achieves state-of-the-art performance by synchronously optimizing image fusion and object detection.

DenoiseRep: Denoising Model for Representation Learning

26 September 2024·1739 words·9 mins

Image Classification 🏢 Beijing Jiaotong University

DenoiseRep: A novel denoising model enhances feature discrimination in computer vision tasks by integrating feature extraction and denoising within a single backbone, achieving impressive improvements…

Decompose, Analyze and Rethink: Solving Intricate Problems with Human-like Reasoning Cycle

26 September 2024·2295 words·11 mins

Question Answering 🏢 University of Science and Technology of China

DeAR: A novel framework lets LLMs solve complex problems with human-like iterative reasoning.

DapperFL: Domain Adaptive Federated Learning with Model Fusion Pruning for Edge Devices

26 September 2024·2063 words·10 mins

Federated Learning 🏢 State Key Laboratory for Novel Software Technology

DapperFL enhances federated learning by introducing a model fusion pruning module and domain adaptive regularization to improve performance and reduce model size for heterogeneous edge devices.

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

26 September 2024·2284 words·11 mins

Vision-Language Models 🏢 Hong Kong Polytechnic University

Can AI understand humor? A new benchmark, YESBUT, reveals that even state-of-the-art models struggle with the nuanced humor of juxtaposed comics, highlighting the need for improved AI in understandin…

Convolutional Differentiable Logic Gate Networks

26 September 2024·2283 words·11 mins

🏢 Stanford University

Convolutional Differentiable Logic Gate Networks achieve state-of-the-art accuracy on CIFAR-10 with 29x fewer gates than existing models, demonstrating highly efficient deep learning inference.

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

26 September 2024·1770 words·9 mins

3D Vision 🏢 Google DeepMind

CAT3D: Generate high-quality 3D scenes from as little as one image using a novel multi-view diffusion model, outperforming existing methods in speed and quality.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

26 September 2024·4503 words·22 mins

Multimodal Learning Vision-Language Models 🏢 New York University

Cambrian-1: Open, vision-centric multimodal LLMs achieve state-of-the-art performance using a novel spatial vision aggregator and high-quality data.

Bayesian-guided Label Mapping for Visual Reprogramming

26 September 2024·3607 words·17 mins

Transfer Learning 🏢 University of Melbourne

Bayesian-guided Label Mapping (BLM) enhances visual reprogramming!