Oral Others
2024
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
·2010 words·10 mins·
Multimodal Learning
Human-AI Interaction
🏢 Microsoft Research
VASA-1: Real-time, lifelike talking faces generated from a single image and audio!
The Road Less Scheduled
·2275 words·11 mins·
Optimization
🏢 Princeton University
Schedule-Free optimization achieves state-of-the-art results without learning rate schedules, simplifying training and improving efficiency.
Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators
·2876 words·14 mins·
🏢 National University of Singapore
Stochastic Taylor Derivative Estimator (STDE) drastically accelerates the optimization of neural networks involving high-dimensional, high-order differential operators by efficiently amortizing comput…
Scale Equivariant Graph Metanetworks
·1680 words·8 mins·
🏢 National and Kapodistrian University of Athens
ScaleGMNs, a new framework, enhances neural network processing by incorporating scaling symmetries, boosting performance across various tasks and datasets.
RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation
·2214 words·11 mins·
Question Answering
🏢 Tencent AI Lab
RG-SAN achieves state-of-the-art 3D referring expression segmentation by leveraging spatial awareness and rule-guided weak supervision, significantly improving accuracy and handling of ambiguous descr…
NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction
·2374 words·12 mins·
Video Understanding
🏢 Tongji University
NeuroClips: fMRI-to-video reconstruction that produces high-fidelity, smooth video of up to 6 s at 8 FPS by decoding both high-level semantics and low-level perception flows.
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model
·1982 words·10 mins·
3D Vision
🏢 University of California, San Diego
MeshFormer: High-quality 3D mesh generation from sparse views in seconds, using transformers and 3D convolutions.
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making
·2756 words·13 mins·
Question Answering
🏢 Massachusetts Institute of Technology
MDAgents: An adaptive multi-agent LLM framework boosts medical decision-making accuracy by dynamically adjusting collaboration structures based on task complexity.
GIC: Gaussian-Informed Continuum for Physical Property Identification and Simulation
·2226 words·11 mins·
3D Vision
🏢 Hong Kong University of Science and Technology
GIC: Novel hybrid framework leverages 3D Gaussian representation for accurate physical property estimation from visual observations, achieving state-of-the-art performance.
Flipped Classroom: Aligning Teacher Attention with Student in Generalized Category Discovery
·3153 words·15 mins·
Image Classification
🏢 Xi'an Jiaotong University
FlipClass dynamically updates the teacher model in a teacher-student framework to align with the student’s attention, resolving learning inconsistencies and significantly improving generalized categor…
Exploitation of a Latent Mechanism in Graph Contrastive Learning: Representation Scattering
·1847 words·9 mins·
Self-Supervised Learning
🏢 Tianjin University
SGRL, a novel graph contrastive learning framework, significantly boosts performance by leveraging the inherent ‘representation scattering’ mechanism and integrating graph topology, outperforming exis…
E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection
·2822 words·14 mins·
Object Detection
🏢 Xidian University
E2E-MFD: A novel end-to-end multimodal fusion detection algorithm achieves state-of-the-art performance by synchronously optimizing image fusion and object detection.
DenoiseRep: Denoising Model for Representation Learning
·1739 words·9 mins·
Image Classification
🏢 Beijing Jiaotong University
DenoiseRep: A novel denoising model enhances feature discrimination in computer vision tasks by integrating feature extraction and denoising within a single backbone, achieving impressive improvements…
Decompose, Analyze and Rethink: Solving Intricate Problems with Human-like Reasoning Cycle
·2295 words·11 mins·
Question Answering
🏢 University of Science and Technology of China
DeAR: A novel framework lets LLMs solve complex problems with human-like iterative reasoning.
DapperFL: Domain Adaptive Federated Learning with Model Fusion Pruning for Edge Devices
·2063 words·10 mins·
Federated Learning
🏢 State Key Laboratory for Novel Software Technology
DapperFL enhances federated learning by introducing a model fusion pruning module and domain adaptive regularization to improve performance and reduce model size for heterogeneous edge devices.
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
·2284 words·11 mins·
Vision-Language Models
🏢 Hong Kong Polytechnic University
Can AI understand humor? A new benchmark, YESBUT, reveals that even state-of-the-art models struggle with the nuanced humor of juxtaposed comics, highlighting the need for improved AI in understandin…
Convolutional Differentiable Logic Gate Networks
·2283 words·11 mins·
🏢 Stanford University
Convolutional Differentiable Logic Gate Networks achieve state-of-the-art accuracy on CIFAR-10 with 29x fewer gates than existing models, demonstrating highly efficient deep learning inference.
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
·1770 words·9 mins·
3D Vision
🏢 Google DeepMind
CAT3D: Generate high-quality 3D scenes from as little as one image using a novel multi-view diffusion model, outperforming existing methods in speed and quality.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
·4503 words·22 mins·
Multimodal Learning
Vision-Language Models
🏢 New York University
Cambrian-1: Open, vision-centric multimodal LLMs achieve state-of-the-art performance using a novel spatial vision aggregator and high-quality data.
Bayesian-guided Label Mapping for Visual Reprogramming
·3607 words·17 mins·
Transfer Learning
🏢 University of Melbourne
Bayesian-guided Label Mapping (BLM) enhances visual reprogramming!