[논문 리뷰] RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

2023. 6. 15. 16:04

RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

The techniques for 3D indoor scene capturing are widely used, but the meshes produced leave much to be desired. In this paper, we propose "RoomDreamer", which leverages powerful natural language to synthesize a new room with a different style. Unlike exist

arxiv.org

Apple에서 나온 논문으로 코드는 아직 공개되지 않았다.

목표

Geometry(여기선 mesh)를 개선시킴과 동시에 주어진 indoor mesh의 texture을 생성

자연어를 사용하여 낮은 품질의 3D mesh의 geometry과 일치하지만 스타일은 다른 3D scene을 구축

기존의 이미지 합성 방법과 달리, Roomdreamer는 입력 scene structure와 text prompt에 맞춰 geometry과 texture를 동시에 일치시킨다.(?)
Geometry를 개선하는 것을 목표로 하며, 이는 이전 기술들이 주로 새로운 geometry을 생성하는 데 초점을 맞추는 것과 대조적이다.

Main contributions

주어진 mesh를 편집하기 위해 2D diffusion model을 사용한 프레임워크를 제안한다. 이 프레임워크의 특징은 geometry와 texture 두가지 모두를 text prompt를 사용하여 편집하고 style을 수정할 수 있다는 것이다.
Diffusion 모델을 제어하기 위한 2D diffusion scheme을 고안하였다. 이러한 scheme은 입력 mesh에 대한 scene consistent하고 구조적으로 일정한 texture을 생성할 수 있다.
스마트폰으로 스캔한 indoor mesh를 사용하여 다양한 실험을 진행하였다. 이를 통해 프레임워크의 효율성과 신뢰성을 증명하였다.

Related works

Text2Room - 2D diffusion model과 depth model을 사용하여 text prompt에 대응하는 textured room mesh를 생성한다.
- 반면, RoomDreamer은 scanned mesh을 guidance로 사용한다. 따라서 새롭게 생성된 3D scene들은 입력 scene과 정확하게 정렬됨과 동시에 다른 스타일로 변환되는 것이 가능하다.
InstructN2N은 2D 이미지를 편집한다. 따라서 새로운 scene을 생성할 때 이미지 기반 편집에 국한되어 있을 수도 있다.
- 반면, RoomDreamer은 geometry-controlled 2D diffusion 생성을 사용한다. 이러한 방법은 오리지널 scene의 texture에 방해받지 않는다는 장점이 있다.

Method

Overview

Roomdreamer은 2가지 요소로 이루어진 프레임워크이다. 첫번째는 "Geometry guided diffusion for 3d Scene"이며 두번째는 "Mesh optimization"이다. 첫번째 요소는 2D pior을 전체 scene에 동시에 적용함으로써 scene의 consistency를 얻는 데에 집중한다. 두번째 요소는 스캔한 scene의 artifacts를 제거하고 geometry와 texture을 동시에 개선하는 역할을 한다.

1. Geometry Guided Diffusion for 3D Scene

순서)

Cubemap 생성
Differentiable rendere을 사용하여 mesh texture을 업데이트
Scene에서 랜덤하게 카메라를 샘플링
Cubemap에 담기지 않은 영역은 아웃페인팅으로 업데이트

1) Cubemap 생성

View-by-view outpainting

Scene texture을 2D 이미지를 diffusion model을 사용하여 생성하고 임의적으로 위치한 카메라를 시작으로 관찰되지 않은 영역을 outpainting하는 아주 직관적인 방법이다.
하지만 이러한 방법은 artifacts를 만들기 때문에 2D diffusion model의 outpainting 능력을 제한시킨다.

Eq.2) diffusion model as a mapping function

Eq.3) the whole diffusion process from random noise and conditioned on the text and depth

Artifacts를 피하기 위해 RoomDreamer은 Cubemap을 사용한 방법을 차용한다. 즉, Room의 가장 중앙의 view의 scene을 담은 360º 사진을 생성한다.

Cubemap = cube camera로 찍은 360º 사진
Diffusion process를 cubemap path들로 연장하면서 2D diffusion model을 사용하여 panorama 이미지(cubemap)를 생성하는 것이 가능해진다.
- "MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation" 이 논문을 보면 정확한 기전을 이해할 수 있을 것이나 우선은 diffusion model로 파노라마 이미지를 생성할 수 있다는 것만 기억하자.

Text prompt에 condition된 일반적인 diffusion model과는 다르게 Roomdreamer은 text prompt C_text와 depth map D에 conditioned된 diffusion model을 사용한다. (Eq. 1)

하지만 depth map을 바로 cubemap 생성을 제어하기 위해 사용하는것은 inconsistency로 이어질 수 있다.

다른 카메라 포즈들은 cube map faces의 depth value의 inconsistency를 야기할 수 있다.
더 자세히 설명하자면, 각 카메라에 연관된 depth map은 카메라와 평면 간의 거리로 표현된다. 따라서 동일한 평면에 대해 다른 뷰에서 depth value가 크게 다를 수 있으며, 이로 인해 artifacts가 발생할 수 있다.

이를 극복하기 위해 distance map D^을 고안한다.
- Distance map D^는 points와 camera origin의 geometric distance를 나타낸다.
- Distance map D^과 depth map D의 diffusion process를 control하는 것에서 차이점이 있다.
  - D에서는 구조물이 RGB 이미지와 잘 정렬이 되어 있는 반면 D^에서는 그렇지 못한다.
  - 그 대신 D^에서는 cubemap에서 (cube faces간의) border가 더 smoother하다.

Realistic geometric alignment와 border consistency를 둘다 달성하기 위해서 blending scheme을 제안한다.

2) Cubemap을 생성한 후 Differentiable rendere을 사용하여 mesh texture을 업데이트

3) Scene에서 랜덤하게 카메라들 샘플링

4) Cubemap에 담기지 않은 영역은 아웃페인팅으로 업데이트

2. Mesh Optimization

RoomDreamer의 input과 output은 모두 3D mesh이다.

여기서 R은 rendering function을 뜻한다. 논문에서는 differentiable renderer을 사용한다.

3.1의 방법으로 2D image k개를 diffusion model로 만들었다고 하자.

그렇다면 texture-based loss는 위와 같이 된다. 입력 RGB이미지와 스타일이 입혀진 이미지 X의 절대차이다.

Equation (5)는 V_c만 업데이트하고 geometry에 해당하는 V, F는 업데이트하지 못한다. (differentiable render가 geometry의 gradient를 계산할 수 없기 때문이다.)

V,F를 V_c와 함께 optimize하는 방법은 다음과 같다:

구멍이 많은 입력 mesh는 스타일이 입혀진 완전한 이미지 X의 depth map을 통해 조정된다.
합성된 scene이 smooth region을 포함하고 있을 때 geometry 또한 평면적이어야 한다는 관찰에 기반하여 식을 짠다.

이 모든 과정을 다음 표를 보면 이해하기가 쉬워진다.

Experiment

Dataset

ARKitScenes dataset

Qualitative results

Quantitative results

'Novel View Synthesis' 카테고리의 다른 글

[논문 리뷰] Painting 3D Nature in 2D:View Synthesis of Natural Scenes from a Single Semantic Mask (0)	2023.06.10
[논문 리뷰] Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models (0)	2023.06.07
[논문 리뷰] SceneScape: Text-Driven Consistent Scene Generation (0)	2023.06.05
[배경 지식] depth-conditioned image generation (0)	2023.06.02
[논문 리뷰] SynSin: End-to-end View Synthesis from a Single Image (0)	2023.06.01

Computer Vision 대학원생 기록지