🏢 Tsinghua University

Training and Evaluating Language Models with Template-based Data Generation
·415 words·2 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
Researchers created TemplateGSM, a massive dataset of over 7 million grade-school math problems and solutions, using GPT-4 to generate problem templates, significantly advancing LLM training for mathematical reasoning.
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
·4076 words·20 mins
AI Generated · 🤗 Daily Papers · Computer Vision · Video Understanding · 🏢 Tsinghua University
TAPTRv3 achieves state-of-the-art long-video point tracking by leveraging spatial and temporal context to enhance feature querying, surpassing previous methods and demonstrating strong performance…
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
·2022 words·10 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
HiAR-ICL, a novel automated reasoning paradigm using Monte Carlo Tree Search, surpasses state-of-the-art accuracy in complex mathematical reasoning by shifting focus from specific examples to abstract…
Patience Is The Key to Large Language Model Reasoning
·477 words·3 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
Boosting Large Language Model (LLM) reasoning without massive datasets: A novel training method encourages ‘patient’ reasoning, improving accuracy by up to 6.7% on benchmark tasks.
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
·3206 words·16 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
SageAttention2 achieves 4-bit accurate attention, boosting inference speed by 2x compared to FlashAttention2, while maintaining end-to-end accuracy across diverse models.
AnimateAnything: Consistent and Controllable Animation for Video Generation
·2615 words·13 mins
AI Generated · 🤗 Daily Papers · Computer Vision · Video Understanding · 🏢 Tsinghua University
AnimateAnything: A unified approach enabling precise and consistent video manipulation via a novel optical flow representation and frequency stabilization.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
·4459 words·21 mins
AI Generated · 🤗 Daily Papers · Multimodal Learning · Vision-Language Models · 🏢 Tsinghua University
To boost multimodal reasoning in LLMs, researchers developed Mixed Preference Optimization (MPO) and a large-scale dataset (MMPR), significantly improving reasoning accuracy and achieving performance…
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
·2885 words·14 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
LLaMA-Mesh unifies 3D mesh generation with LLMs by representing meshes directly as text, enabling efficient text-to-3D conversion within a single model.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated · 🤗 Daily Papers · Computer Vision · Video Understanding · 🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70–80% accuracy and enabling robotic process automation (RPA).
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
·4045 words·19 mins
AI Generated · 🤗 Daily Papers · Multimodal Learning · Vision-Language Models · 🏢 Tsinghua University
JanusFlow harmonizes autoregression and rectified flow for unified multimodal understanding and generation, achieving state-of-the-art results on standard benchmarks.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
·2263 words·11 mins
AI Generated · 🤗 Daily Papers · Computer Vision · 3D Vision · 🏢 Tsinghua University
DimensionX generates photorealistic 3D and 4D scenes from a single image via controllable video diffusion, enabling precise manipulation of spatial structure and temporal dynamics.
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
·3659 words·18 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
WebRL, a self-evolving online curriculum reinforcement learning framework, empowers open LLMs to excel as high-performing web agents, surpassing proprietary models.
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
·4028 words·19 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
Researchers discovered predictable scaling laws for activation sparsity in LLMs, showing how data, architecture, and model size influence sparsity, paving the way for more efficient and interpretable …
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
·3111 words·15 mins
AI Generated · 🤗 Daily Papers · AI Applications · Robotics · 🏢 Tsinghua University
DeeR-VLA dynamically adjusts the size of a multimodal large language model based on task difficulty, significantly reducing computational cost and memory usage in robotic control without compromising …
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
·3717 words·18 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 Tsinghua University
Constraint Back-translation enhances complex instruction following in LLMs by leveraging inherent constraints in existing datasets for efficient high-quality data creation.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
·3766 words·18 mins
AI Generated · 🤗 Daily Papers · Multimodal Learning · Human-AI Interaction · 🏢 Tsinghua University
AndroidLab, a novel framework, systematically benchmarks Android autonomous agents, improving LLM and LMM success rates on 138 tasks via a unified environment and an open-source dataset.