
🏢 Tsinghua University

Dynamic Scaling of Unit Tests for Code Reward Modeling
·3208 words·16 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
Boosting code generation accuracy with more unit tests! This research shows that increasing the number of unit tests used to evaluate code generated by LLMs significantly improves accuracy, especially…
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
·8988 words·43 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
VisionReward, a novel reward model, surpasses existing methods by precisely capturing multi-dimensional human preferences for image and video generation, enabling more accurate and stable model optimization…
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
·3981 words·19 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
New benchmarks, HumanEval Pro and MBPP Pro, reveal LLMs struggle with self-invoking code generation, highlighting a critical gap in current code reasoning capabilities.
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
·2203 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
FoPE enhances attention’s periodic extension for better length generalization in language models by addressing spectral damage in RoPE using Fourier Series and zeroing out destructive frequencies.
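The core mechanism this teaser names, zeroing out destructive low frequencies in the rotary embedding, can be illustrated with a minimal sketch. This is a hypothetical illustration built on standard RoPE frequencies, not the paper's actual formulation; the cutoff rule (dropping components that complete less than one full period within the training window) and all names are assumptions.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Standard RoPE inverse frequencies, one per rotary pair.
    return base ** (-np.arange(0, dim, 2) / dim)

def fope_like_frequencies(dim, train_len, base=10000.0):
    # Hypothetical sketch of the idea: zero out "destructive" frequencies
    # whose period exceeds the training context length, i.e. components
    # that never complete a full cycle during training.
    freqs = rope_frequencies(dim, base)
    destructive = (2 * np.pi / freqs) > train_len
    return np.where(destructive, 0.0, freqs)

freqs = fope_like_frequencies(dim=64, train_len=512)
print((freqs == 0).sum(), "of", freqs.size, "frequencies zeroed")
```

With a 512-token training window, the slowest-rotating half of the 32 rotary pairs falls under the cutoff, which is the kind of spectral pruning the summary alludes to.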
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
·3841 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
Distilled Decoding (DD) drastically speeds up image generation from autoregressive models by using flow matching to enable one-step sampling, achieving significant speedups while maintaining acceptable…
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
·5664 words·27 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
ReMoE: Revolutionizing Mixture-of-Experts with fully differentiable ReLU routing, achieving superior scalability and performance.
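The "fully differentiable ReLU routing" named above can be sketched in a few lines: an expert is active exactly when its routing logit is positive, and its contribution is scaled by that gate value, with no discrete top-k step. This is a minimal illustration under assumed shapes and names, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts = 16, 8

# Toy router and expert weights (illustrative only).
W_router = rng.normal(size=(d_model, num_experts)) * 0.1
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(num_experts)]

def remoe_like_layer(x):
    # ReLU routing: the gate is max(logit, 0), so sparsity emerges from
    # the sign of the logits rather than from a non-differentiable top-k.
    gates = np.maximum(x @ W_router, 0.0)
    out = np.zeros_like(x)
    for e in range(num_experts):
        if gates[e] > 0:  # only experts with positive gates compute
            out += gates[e] * (x @ experts[e])
    return out, gates

x = rng.normal(size=d_model)
out, gates = remoe_like_layer(x)
print("active experts:", int((gates > 0).sum()), "of", num_experts)
```

Because the gate is a ReLU of the logits, gradients flow through every active expert's weight, which is what makes this routing scheme differentiable end to end.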
How to Synthesize Text Data without Model Collapse?
·5702 words·27 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
Token-level editing prevents language model collapse from synthetic data by theoretically bounding test error and empirically improving model performance.
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3553 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 enhances MLLMs by integrating high-resolution visual details using a hierarchical window transformer.
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
·3747 words·18 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
The self-play method SPaR enhances LLMs' instruction-following abilities, beating GPT-4 on IFEval.
ColorFlow: Retrieval-Augmented Image Sequence Colorization
·2655 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
ColorFlow, a new AI model, accurately colorizes black-and-white image sequences while preserving character identity.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3840 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: A simpler, more powerful unified MLLM for image understanding and generation.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
·9241 words·44 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
InternVL 2.5, a new open-source multimodal LLM, surpasses 70% on the MMMU benchmark, rivaling top commercial models through model, data, and test-time scaling strategies.
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
·4750 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GENMAC: Multi-agent collaboration revolutionizes compositional text-to-video generation, achieving state-of-the-art results by iteratively refining videos via specialized agents.
Densing Law of LLMs
·1976 words·10 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
The capability density of LLMs grows exponentially, doubling roughly every three months, so newer models can match state-of-the-art performance with half the parameters, thus reducing inference costs.
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
·2260 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
MIDI: a novel multi-instance diffusion model generates compositional 3D scenes from single images by simultaneously creating multiple 3D instances with accurate spatial relationships and high generali…
2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constraints for High-Fidelity Indoor Scene Reconstruction
·2645 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
2DGS-Room: Seed-guided 2D Gaussian splatting with geometric constraints achieves state-of-the-art high-fidelity indoor scene reconstruction.
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
·3550 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
X-Prompt: a novel autoregressive vision-language model achieves universal in-context image generation by efficiently compressing contextual information and using a unified training framework for super…
Structured 3D Latents for Scalable and Versatile 3D Generation
·4249 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
Unified 3D latent representation (SLAT) enables versatile high-quality 3D asset generation, significantly outperforming existing methods.
Free Process Rewards without Process Labels
·3126 words·15 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
Train high-performing Process Reward Models (PRMs) cheaply using only outcome-level labels, eliminating the need for costly step-by-step annotations!
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos
·2678 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
AlphaTablets: A novel 3D plane representation enabling accurate, consistent, and flexible 3D planar reconstruction from monocular videos, achieving state-of-the-art results.