Recent
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
·2885 words·14 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Tsinghua University
LLaMA-Mesh unifies 3D mesh generation with LLMs by representing meshes directly as text, enabling efficient text-to-3D conversion within a single model.
MagicQuill: An Intelligent Interactive Image Editing System
·4923 words·24 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 HKUST
MagicQuill: an intelligent interactive image editing system enabling intuitive, precise image edits via brushstrokes and real-time intent prediction by a multimodal LLM.
CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
·1996 words·10 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Inria, Paris, France
CamemBERT 2.0: Two new French language models (CamemBERTav2 & CamemBERTv2) outperform predecessors by addressing temporal concept drift via larger, updated datasets and enhanced tokenization, demonstr…
Can sparse autoencoders be used to decompose and interpret steering vectors?
·2017 words·10 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Oxford
Sparse autoencoders fail to accurately decompose and interpret steering vectors due to distribution mismatch and the inability to handle negative feature projections; this paper identifies these issue…
Cut Your Losses in Large-Vocabulary Language Models
·2958 words·14 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Apple
Cut Cross-Entropy (CCE) dramatically reduces the memory footprint of training large language models by computing the cross-entropy loss without materializing the full logit matrix.
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba
EgoVid-5M: the first high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70–80% accuracy and enabling robotic process automation (RPA).
Direct Preference Optimization Using Sparse Feature-Level Constraints
·2078 words·10 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Westlake University
Feature-level constrained Preference Optimization (FPO) boosts LLM alignment efficiency and stability by using sparse autoencoders and feature-level constraints, achieving significant improvements ove…
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
·4045 words·19 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
JanusFlow harmonizes autoregression and rectified flow for unified multimodal understanding and generation, achieving state-of-the-art results on standard benchmarks.