Paper Reviews by AI
2024
Edicho: Consistent Image Editing in the Wild
·2565 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
Edicho: a novel training-free method for consistent image editing across diverse images, achieving precise consistency by leveraging explicit correspondence.
Bringing Objects to Life: 4D generation from 3D objects
·2761 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Bar-Ilan University
3to4D: Animate any 3D object with text prompts, preserving visual quality and achieving realistic motion!
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
·379 words·2 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Information Extraction
🏢 Zhejiang University
OneKE: a dockerized, schema-guided LLM agent system efficiently extracts knowledge from diverse sources, offering adaptability and robust error handling.
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
·5637 words·27 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong, Shenzhen
Multimodal LLMs for medical imaging now generalize better via compositional generalization, leveraging relationships between image features (modality, anatomy, task) to understand unseen images and im…
Xmodel-2 Technical Report
·2582 words·13 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Xiaoduo AI Lab
Xmodel-2: A 1.2B parameter LLM achieving state-of-the-art reasoning performance through efficient architecture and training, now publicly available!
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
·4442 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tencent AI Lab
VideoMaker achieves high-fidelity zero-shot customized video generation by cleverly harnessing the inherent power of video diffusion models, eliminating the need for extra feature extraction and injec…
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
·269 words·2 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Intel Labs
Boost fine-tuned LLMs’ performance without sacrificing safety by merging pre- and post-tuning model weights!
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·3641 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
OS-Genesis: Reverse task synthesis revolutionizes GUI agent training by generating high-quality trajectory data without human supervision, drastically boosting performance on challenging benchmarks.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·3329 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Xi'an Jiaotong University
LaDeCo: a layered approach to automatic graphic design composition, generating high-quality designs by sequentially composing elements into semantic layers.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3509 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO) significantly boosts multimodal large language models’ visual understanding by aligning them with fine-grained visual tasks via learnable task tokens, achieving 14.6…
Token-Budget-Aware LLM Reasoning
·3147 words·15 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Nanjing University
TALE: A novel framework dynamically adjusts token budgets in LLM reasoning prompts, slashing costs by ~70% with minimal accuracy loss.
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
·3061 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Meta AI
PartGen generates compositional 3D objects with meaningful parts from text, images, or unstructured 3D data using multi-view diffusion models, enabling flexible 3D part editing.
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
·3014 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Zhejiang University
Orient Anything: Learning robust object orientation estimation directly from rendered 3D models, achieving state-of-the-art accuracy on real images.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
·2542 words·12 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Science and Technology of China
Molar: A novel multimodal LLM framework boosts sequential recommendation accuracy by cleverly aligning collaborative filtering with rich item representations from text and non-text data.
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2929 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Toronto
MMFactory: A universal framework for vision-language tasks, offering diverse programmatic solutions based on user needs and constraints, outperforming existing methods.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
·3843 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 MMLab, the Chinese University of Hong Kong
DiTCtrl achieves state-of-the-art multi-prompt video generation without retraining by cleverly controlling attention in a diffusion transformer, enabling smooth transitions between video segments.
DepthLab: From Partial to Complete
·2516 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 HKU
DepthLab: a novel image-conditioned depth inpainting model enhances downstream 3D tasks by effectively completing partial depth information, showing superior performance and generalization.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
·3344 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 AIRI
3DGraphLLM boosts 3D scene understanding by cleverly merging semantic graphs and LLMs, enabling more accurate scene descriptions and outperforming existing methods.
YuLan-Mini: An Open Data-efficient Language Model
·4206 words·20 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Renmin University of China
YuLan-Mini: An open, data-efficient 2.42B parameter LLM achieving top-tier performance with innovative training techniques.
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
·2647 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Visual Question Answering
🏢 Kyoto University
SBS Figures creates a massive, high-quality figure QA dataset via a novel stage-by-stage synthesis pipeline, enabling efficient pre-training of visual language models.