Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly
·2753 words·13 mins
Computer Vision
3D Vision
🏢 Tsinghua University
Assembling diverse deep priors from large models enables zero-shot 3D scene reconstruction from single images, eliminating the need for 3D or 2D training data while delivering superior performance.
You Only Cache Once: Decoder-Decoder Architectures for Language Models
·2411 words·12 mins
Large Language Models
🏢 Tsinghua University
YOCO: A decoder-decoder architecture for LLMs dramatically reduces memory usage and improves inference speed by caching key-value pairs only once.
YOLOv10: Real-Time End-to-End Object Detection
·1949 words·10 mins
Computer Vision
Object Detection
🏢 Tsinghua University
YOLOv10 delivers state-of-the-art real-time object detection by eliminating NMS post-processing and holistically optimizing the model architecture for both efficiency and accuracy.
XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
·2133 words·11 mins
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
XMask3D uses cross-modal mask reasoning to achieve state-of-the-art open vocabulary 3D semantic segmentation by aligning 2D and 3D features at the mask level, resulting in precise segmentation boundaries.
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
·6701 words·32 mins
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
VisionLLM v2 unifies visual perception, understanding, and generation, excelling in various vision tasks and achieving performance comparable to task-specific models.
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
·2294 words·11 mins
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Latent Compression Learning (LCL) revolutionizes vision model pre-training by effectively leveraging readily available interleaved image-text data, achieving performance comparable to models trained on paired data.
Value-Based Deep Multi-Agent Reinforcement Learning with Dynamic Sparse Training
·4753 words·23 mins
AI Generated
Machine Learning
Reinforcement Learning
🏢 Tsinghua University
MAST: Train ultra-sparse deep MARL agents with minimal performance loss!
Unleashing the Denoising Capability of Diffusion Prior for Solving Inverse Problems
·3236 words·16 mins
Computer Vision
Image Generation
🏢 Tsinghua University
ProjDiff: A novel algorithm unleashes diffusion models’ denoising power for superior inverse problem solutions.
Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image
·2641 words·13 mins
AI Generated
Computer Vision
3D Vision
🏢 Tsinghua University
Unique3D: Single image to high-fidelity 3D mesh in 30 seconds!
UniGAD: Unifying Multi-level Graph Anomaly Detection
·2482 words·12 mins
Machine Learning
Graph Anomaly Detection
🏢 Tsinghua University
UniGAD unifies multi-level graph anomaly detection, improving accuracy and zero-shot transferability by jointly modeling node, edge, and graph anomalies via a novel subgraph sampler and GraphStitch Network.
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
·2866 words·14 mins
AI Generated
Multimodal Learning
Audio-Visual Learning
🏢 Tsinghua University
UniAudio 1.5 uses a novel LLM-driven audio codec to enable frozen LLMs to perform various audio tasks with just a few examples, opening new avenues for efficient few-shot cross-modal learning.
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
·2974 words·14 mins
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Uni-Med, a novel unified medical foundation model, tackles multi-task learning challenges by using Connector-MoE to efficiently bridge modalities, achieving competitive performance across six medical tasks.
Understanding Emergent Abilities of Language Models from the Loss Perspective
·1924 words·10 mins
Natural Language Processing
Large Language Models
🏢 Tsinghua University
Language model emergent abilities aren’t exclusive to large models; they emerge when pre-training loss falls below a threshold, irrespective of model or data size.
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast
·2047 words·10 mins
Natural Language Processing
Large Language Models
🏢 Tsinghua University
Self-Contrast Mixture-of-Experts (SCMoE) boosts MoE model reasoning by cleverly using ‘unchosen’ experts during inference. This training-free method contrasts outputs from strong and weak expert activations.
Training Compute-Optimal Protein Language Models
·3023 words·15 mins
Large Language Models
🏢 Tsinghua University
Scaling laws derived from a massive protein dataset enable compute-optimal training of protein language models, improving performance within fixed compute budgets.
Training an Open-Vocabulary Monocular 3D Detection Model without 3D Data
·3285 words·16 mins
AI Generated
Computer Vision
3D Vision
🏢 Tsinghua University
Train open-vocabulary 3D object detectors using only RGB images and large language models, achieving state-of-the-art performance without expensive LiDAR data.
TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables
·2924 words·14 mins
Machine Learning
Deep Learning
🏢 Tsinghua University
TimeXer empowers transformers for superior time series forecasting by cleverly integrating exogenous variables, achieving state-of-the-art results on diverse benchmarks.
SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition
·3456 words·17 mins
AI Generated
Computer Vision
Visual Place Recognition
🏢 Tsinghua University
SuperVLAD: A new visual place recognition method boasts superior robustness and compactness, outperforming state-of-the-art techniques while significantly reducing parameters and descriptor dimensions.
Stabilizing Zero-Shot Prediction: A Novel Antidote to Forgetting in Continual Vision-Language Tasks
·2243 words·11 mins
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
ZAF, a novel replay-free continual learning method for vision-language models, significantly reduces forgetting by stabilizing zero-shot predictions.
SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction
·1973 words·10 mins
AI Applications
Autonomous Vehicles
🏢 Tsinghua University
SMART, a scalable real-time multi-agent driving simulator using next-token prediction, achieves state-of-the-art results and zero-shot generalization.