
(230326) Diary: Diffusion Models Tutorial Feat. DDPM

Following ChatGPT.. part 2 of paying down my backlog of papers to study: Diffusion Models. Since my own domain is not Vision and the math is heavy, I wasn't sure which paper to start from, and decided to begin with DDPM (Ho et al., 2020)! The review aims for intuitive understanding rather than a complete grasp of every equation.

Diffusion Models

(Referenced the blog post above together with the DDPM paper.. a great blog run by an OpenAI researcher)
1. What is a Diffusion Model?
A generative model trained by gradually adding Gaussian noise to an original image to produce a noisy image (Forward Process),
and then restoring the original image from that noisy image (Reverse Process)
(Perhaps it can be seen as a multi-step version of a VAE, which samples a Gaussian-distributed latent, reconstructs the original image, and optimizes a Variational Lower Bound?)
2. Forward Process, $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$
$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$
Small amounts of Gaussian noise are added to the original image $\mathbf{x}_0$ over $T$ steps
The amount of noise added at each step is determined by the variance $\beta_t$
$$\begin{aligned} q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t;\color{red} \sqrt{\bar{\alpha}_t}\color{black} \mathbf{x}_0, \color{red}(1 - \bar{\alpha}_t)\color{black}\mathbf{I})&\text{ ;where }\alpha_t = 1 - \beta_t\text{, }\bar{\alpha}_t = \prod_{i=1}^t \alpha_i \end{aligned}$$
The conditional distribution at step $t$ can be written as above,
and it can be derived through the following Gaussian merging process (note: this reappears in the Reverse Process part)
$$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{ ;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} & \text{ ;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians} \\ &= \dots \\ &= \color{red}\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \end{aligned}$$
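As a quick sanity check on the step-wise and closed-form views above, here is a minimal PyTorch sketch (my own toy code, assuming the paper's linear β schedule from 1e-4 to 0.02 over T = 1000 steps; `t` is 0-indexed, so code index `t` corresponds to step t+1):

```python
import torch

# Assumed linear beta schedule (the DDPM paper's defaults)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_step(x_prev, t):
    """One forward step: x_t ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    return (1.0 - betas[t]).sqrt() * x_prev + betas[t].sqrt() * torch.randn_like(x_prev)

def q_sample(x0, t):
    """Closed form: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps, eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps
    return xt, eps                           # eps is returned: it becomes the training target
```

Composing `q_step` t+1 times from $\mathbf{x}_0$ and calling `q_sample(x0, t)` once give the same marginal distribution, which is exactly what the Gaussian merge above shows.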
3. Reverse Process, $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$
If $\beta_t$ is sufficiently small, $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ can also be treated as Gaussian, and
$$p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \color{red}\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\color{black}, \color{red}\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\color{black})$$
Since $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ is hard to derive directly, a diffusion model $\theta$ is used to produce the approximation $p_\theta$
As with a VAE, applying a Variational Lower Bound to the negative log-likelihood of $p_\theta$ unfolds as follows,
$$\begin{aligned}- \log p_\theta(\mathbf{x}_0) &\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\vert\mathbf{x}_0) ) \\&= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] \\&= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\&= \color{red}\mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \end{aligned}$$
Ultimately, the loss optimized when training a diffusion model is the following
$$\begin{aligned}L_\text{VLB} &= \color{red}\mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\&= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big]\\&= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\&= \mathbb{E}_q [\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} ]\end{aligned}$$
$$\begin{aligned}L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\\text{where } L_T &= D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)) \\L_t &= D_\text{KL}(\color{red}q(\mathbf{x}_t \vert \mathbf{x}_{t+1}, \mathbf{x}_0)\color{black} \parallel \color{red}p_\theta(\mathbf{x}_t \vert\mathbf{x}_{t+1})\color{black}) \text{ for }1 \leq t \leq T-1 \\L_0 &= - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)\end{aligned}$$
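One step worth making explicit (and which the DDPM setup relies on): $L_T$ contains no trainable parameters. With a fixed $\beta$ schedule, $\bar{\alpha}_T \approx 0$, so the end of the forward process is almost exactly the standard-Gaussian prior $p_\theta(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, making $L_T$ a near-zero constant that can be ignored during training:
$$q(\mathbf{x}_T \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_T; \sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1 - \bar{\alpha}_T)\mathbf{I}) \approx \mathcal{N}(\mathbf{0}, \mathbf{I}) = p_\theta(\mathbf{x}_T) \quad\Rightarrow\quad L_T \approx 0$$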
Before examining the $L_t$ used for training, expanding $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ gives the following (cf. the Gaussian density function),
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \color{blue}{\tilde{\boldsymbol{\mu}}}(\mathbf{x}_t, \mathbf{x}_0), \color{red}{\tilde{\beta}_t} \mathbf{I})$$
$$\begin{aligned}q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\&\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\&= \exp \Big(-\frac{1}{2} \big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \color{blue}{\mathbf{x}_{t-1}} \color{black}{+ \alpha_t} \color{red}{\mathbf{x}_{t-1}^2} }{\beta_t} + \frac{ \color{red}{\mathbf{x}_{t-1}^2} \color{black}{- 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0} \color{blue}{\mathbf{x}_{t-1}} \color{black}{+ \bar{\alpha}_{t-1} \mathbf{x}_0^2} }{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\&= \exp\Big( -\frac{1}{2} \big( \color{red}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}})} \mathbf{x}_{t-1}^2 - \color{blue}{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)} \mathbf{x}_{t-1} \color{black}{ + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big)}\end{aligned}$$
From the Gaussian density function, $\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0)$ and $\tilde{\beta}_t$ can be derived as below
$$\begin{aligned}\tilde{\beta}_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = 1/(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})})= \color{green}{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \end{aligned}$$
$$\begin{aligned}\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0)&= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \\&= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \color{green}{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\&= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0\end{aligned}$$
(Using the last equation of the Forward Process, $\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}$,)
$$\begin{aligned}\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0)&= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t) \\&= \color{red}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)}\end{aligned}$$
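In code, $\tilde{\beta}_t$ and $\tilde{\boldsymbol{\mu}}_t$ are a direct transcription of the results above; a sketch continuing the toy setup from the Forward Process section (0-indexed `t`, with $\bar{\alpha}_0 := 1$ at the first step):

```python
def q_posterior(x0, xt, t):
    """q(x_{t-1} | x_t, x_0) = N(mu_tilde, beta_tilde * I), transcribed from the derivation."""
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)  # \bar{alpha}_0 := 1
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]
    mu_tilde = (alphas[t].sqrt() * (1.0 - abar_prev) * xt
                + abar_prev.sqrt() * betas[t] * x0) / (1.0 - abar_t)
    return mu_tilde, beta_tilde
```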
Coming back to $L_t$: the training objective of a diffusion model can be seen as making $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ approximate $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ at each step $t$

DDPM: Denoising Diffusion Probabilistic Models

(Some good code worth referencing)
DDPM fixes $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ of $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ to a constant that is not learned
Empirically, setting $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ to $\tilde{\beta}_t$ and setting it to $\beta_t$ are reported to give similar results
Also, $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ of $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ is parameterized as below
$$\begin{aligned}\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= {\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \color{red}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \color{black}\Big)}\end{aligned}$$
$$\begin{aligned}L_t &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \| \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) \|^2_2} \| \color{blue}{\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)} \color{black}- \color{red}{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \color{black}\|^2 \Big] \\&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \|\boldsymbol{\Sigma}_\theta \|^2_2} \| \color{blue}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)} \color{black}- \color{red}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big)} \color{black}\|^2 \Big] \\&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big] \end{aligned}$$
Removing the weighting term from the $L_t$ obtained through this parameterization, the paper proposes its final Training & Sampling algorithms (Algorithm 1 & 2 in the paper)
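A minimal sketch of what those two algorithms compute, continuing the toy PyTorch code above. The unweighted objective is $L_\text{simple} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \big[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}, t) \|^2 \big]$; here `model` is an assumed network that predicts ε from $(\mathbf{x}_t, t)$ (the paper uses a U-Net), and batching is omitted for clarity:

```python
import torch.nn.functional as F

def train_step(model, x0, optimizer):
    """Algorithm 1 (Training): fit epsilon_theta with the unweighted MSE, L_simple."""
    t = torch.randint(0, T, ()).item()       # t ~ Uniform({1, ..., T}), 0-indexed here
    xt, eps = q_sample(x0, t)                # one-shot forward jump from the sketch above
    loss = F.mse_loss(model(xt, t), eps)     # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def p_sample_loop(model, shape):
    """Algorithm 2 (Sampling): ancestral sampling with Sigma_theta fixed to beta_t * I."""
    x = torch.randn(shape)                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_pred = model(x, t)
        # mu_theta(x_t, t) = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps_theta) / sqrt(alpha_t)
        mean = (x - (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise   # no noise added at the final step
    return x
```

Choosing $\tilde{\beta}_t$ instead of $\beta_t$ for the fixed variance (the other constant option noted above) would just swap `betas[t]` for the `beta_tilde` from `q_posterior`.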