3rd Workshop on Language for 3D Scenes
ICCV 2023 Workshop
This is the third workshop on natural language and 3D-oriented object understanding of real-world scenes. Our primary goal is to spark research interest in this emerging area, and we set two objectives to achieve this. Our first objective is to bring together researchers interested in natural language and object representations of the physical world. This way, we hope to foster a broad, multidisciplinary discussion on how humans use language to communicate about different aspects of objects in their surrounding 3D environments. The second objective is to benchmark progress in connecting language to 3D, specifically in identifying and localizing 3D objects with natural language. Building on the recently introduced large-scale datasets ScanRefer and ReferIt3D, we host benchmark challenges on language-assisted 3D localization and identification tasks. The workshop consists of presentations by experts in the field and short talks on methods addressing the benchmark challenges, designed to highlight the emerging open problems in this area.
We establish three challenges:
- 3D Object Localization: to predict a bounding box in a 3D scene corresponding to an object described in natural language
- Fine-grained 3D Object Identification: to identify a referred object among multiple objects in a 3D scene given natural or spatial-based language
- 3D Dense Captioning: to predict the bounding boxes and the associated descriptions in natural language for objects in a 3D scene
For each task, challenge participants are provided with prepared training and test datasets and automated evaluation scripts. The winner of each task will give a short talk describing their method during this workshop.
The challenge leaderboard is online. If you want to join the challenge, see more details here:
Call For Papers
Call for papers: We invite non-archival papers of up to 8 pages (in ICCV format) for work on tasks related to the intersection of natural language and 3D object understanding in real-world scenes. Paper topics may include but are not limited to:
- 3D Visual Grounding
- 3D Dense Captioning
- 3D Question Answering
- Leveraging language for 3D scene understanding
- Embodied Question Answering
Submission: We encourage submissions of up to 8 pages, excluding references and acknowledgements. Submissions should be in the ICCV format. Reviewing will be single-blind. Accepted papers will be made publicly available as non-archival reports, allowing future submission to archival conferences or journals. We also welcome already published papers that are within the scope of the workshop (without re-formatting), including papers from the main ICCV conference. Please submit your paper to the following address by the deadline: firstname.lastname@example.org. Please mention in your email whether your submission has already been accepted for publication and, if so, the name of the conference.
Schedule (Paris / Pacific Time Zone)
Rana Hanocka
She is an Assistant Professor of Computer Science at the University of Chicago. She directs 3DL, a group of enthusiastic researchers passionate about 3D, machine learning, and visual computing. She obtained her Ph.D. in 2021 from Tel Aviv University under the supervision of Daniel Cohen-Or and Raja Giryes. Her research focuses on building artificial intelligence for 3D data, spanning the fields of computer graphics, machine learning, and computer vision. Her recent work includes using text to stylize meshes (Text2Mesh) and to localize regions on 3D shapes (3DHighlighter).
Or Litany
He is a senior research scientist at NVIDIA and an incoming assistant professor at the Technion. Before that, he was a postdoc at Stanford University, working under Prof. Leonidas Guibas, and at FAIR, working under Prof. Jitendra Malik. He received his Ph.D. from Tel Aviv University, where he was advised by Prof. Alex Bronstein. His research interests include deep learning for 3D vision and geometry and learning with reduced supervision. His recent work includes leveraging pretrained language-vision models for 3D scene understanding (ScanNet200).
Chris Paxton
He is a robotics research scientist at FAIR Labs. He received his Ph.D. in Computer Science in 2018 from the Johns Hopkins University in Baltimore, Maryland, focusing on using learning to create powerful task and motion planning capabilities for robots operating in human environments. From 2018 to 2022, he was with NVIDIA at their Seattle robotics lab. His recent work focuses on methods that tie together language, perception, and action in order to make robots into robust, versatile assistants for a variety of applications. His work includes language grounding with 3D objects and modeling 3D scene representations with CLIP (CLIP-Fields).