Skip to main content

Paper Reviews by AI

2024

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
·5469 words·26 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Show Lab, National University of Singapore
ShowUI, a novel vision-language-action model, efficiently manages high-resolution GUI screenshots and diverse task needs via UI-guided token selection and interleaved streaming, achieving state-of-the…
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
·3716 words·18 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Northwestern Polytechnical University
FiCoCo: A unified paradigm accelerates Multimodal Large Language Model (MLLM) inference by up to 82.4% with minimal performance loss, surpassing state-of-the-art training-free methods.
Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
·3637 words·18 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Nanyang Technological University
Omegance: One parameter precisely controls image detail in diffusion models, enabling flexible granularity adjustments without model changes or retraining.
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
·4827 words·23 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision 3D Vision 🏒 DFKI
MARVEL-40M+ & MARVEL-FX3D: 40M+ high-quality 3D annotations & a fast two-stage text-to-3D pipeline enabling high-fidelity 3D model generation within 15 seconds.
Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
·3397 words·16 mins· loading · loading
AI Generated πŸ€— Daily Papers Natural Language Processing Large Language Models 🏒 Tencent AI Lab
Low-bit quantization excels for undertrained LLMs but struggles with fully-trained ones; new scaling laws reveal this, directing future research.
LongKey: Keyphrase Extraction for Long Documents
·3409 words·17 mins· loading · loading
AI Generated πŸ€— Daily Papers Natural Language Processing Information Extraction 🏒 University of Luxembourg
LongKey: A novel framework excels at extracting keyphrases from lengthy documents using an encoder-based language model and max-pooling, outperforming existing methods across diverse datasets.
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
·2775 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Peking University
ConsisID achieves high-quality, identity-preserving text-to-video generation using a tuning-free diffusion transformer model that leverages frequency decomposition for effective identity control.
Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
·2315 words·11 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Kim Jaechul Graduate School of AI, KAIST
FreeΒ²Guide: Gradient-free path integral control enhances text-to-video generation using powerful large vision-language models, improving alignment without gradient-based fine-tuning.
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting
·2489 words·12 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Dalian University of Technology
DreamMix enhances image inpainting by disentangling object attributes for precise editing, enabling both identity preservation and flexible text-driven modifications.
DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
·3048 words·15 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Samsung R&D Institute UK
DreamCache enables efficient, high-quality personalized image generation without finetuning by caching reference image features and using lightweight conditioning adapters.
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
·3600 words·17 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 National University of Singapore
Collaborative Decoding (CoDe) dramatically boosts visual auto-regressive model efficiency.
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
·2950 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Xi'an Jiaotong University
ChatGen-Evo automates text-to-image generation from freestyle chatting, simplifying the process and significantly improving performance over existing methods.
AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation
·2812 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Tencent AI Lab
AnchorCrafter animates cyber-anchors selling products via human-object interacting video generation, achieving high visual fidelity and controllable interactions.
VisualLens: Personalization through Visual History
·2160 words·11 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Visual Question Answering 🏒 Meta
VisualLens leverages user visual history for personalized recommendations, improving state-of-the-art by 5-10% and exceeding GPT-4’s performance.
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
·4040 words·19 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of Chinese Academy of Sciences
UniPose: A unified multimodal framework for human pose comprehension, generation, and editing, enabling seamless transitions across various modalities and showcasing zero-shot generalization.
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
·3638 words·18 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision 3D Vision 🏒 Twelve Labs
SplatFlow: A novel multi-view rectified flow model enabling direct 3D Gaussian splatting generation & training-free editing for diverse 3D tasks.
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
·2778 words·14 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision 3D Vision 🏒 Nanyang Technological University
SAR3D: Blazing-fast autoregressive 3D object generation and understanding using a multi-scale VQVAE, achieving sub-second generation and detailed multimodal comprehension.
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
·3021 words·15 mins· loading · loading
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Integrated Vision and Language Lab, KAIST
SALOVA, a novel video-LLM framework, enhances long-form video comprehension through targeted retrieval. It introduces SceneWalk, a high-quality dataset of densely-captioned long videos, and integrates…
Predicting Emergent Capabilities by Finetuning
·6002 words·29 mins· loading · loading
AI Generated πŸ€— Daily Papers Natural Language Processing Large Language Models 🏒 UC Berkeley
Predicting emergent LLM capabilities is now possible by finetuning smaller models; this approach shifts the emergence point, enabling accurate predictions of future model performance, even with up to …
Pathways on the Image Manifold: Image Editing via Video Generation
·3449 words·17 mins· loading · loading
AI Generated πŸ€— Daily Papers Computer Vision Image Generation 🏒 Technion - Israel Institute of Technology
Image editing is revolutionized by Frame2Frame, which uses video generation to produce seamless and accurate edits, preserving image fidelity.