Scene Understanding

Towards a Unified Copernicus Foundation Model for Earth Vision

14 March 2025·4400 words·21 mins

AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 Technical University of Munich

Unified Copernicus Foundation Model for Earth Vision: A multimodal approach to improve scalability, versatility, and adaptability of EO models.

Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model

24 February 2025·3468 words·17 mins

AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 Hong Kong Center for Construction Robotics, the Hong Kong University of Science and Technology

Plane-DUSt3R: Leveraging pre-trained models for unposed sparse views room layout reconstruction, enhancing robustness and generalization.

KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

20 February 2025·3707 words·18 mins

AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 MBZUAI

KITAB-Bench: A new multi-domain Arabic OCR benchmark to bridge the performance gap with English OCR technologies.

CrossOver: 3D Scene Cross-Modal Alignment

20 February 2025·5760 words·28 mins

AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 Stanford University

CrossOver: Flexible scene-level cross-modal alignment via modality-agnostic embeddings, unlocking robust 3D scene understanding.

Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework

19 February 2025·2585 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 MBZUAI

A new geolocation dataset and reasoning framework improve accuracy and interpretability by leveraging human gameplay data.