Introduction
Generative models can be classified into two categories:
Implicit models The probability distribution is implicitly represented by a network, and new samples are generated by passing Gaussian noise through it. One example is generative adversarial networks (GANs).
Likelihood-based models The probability density is learned directly by maximizing the likelihood. Examples include variational auto-encoders (VAEs) and diffusion models.
We derived the diffusion model on another page Here, where we introduced the score function, which appears in the ELBO optimization objective. In this blog, we dive into the details of score-based models and to what extent they are correlated with diffusion models.
Score function, score-based models
Assume a data distribution $p(x)$ generates our data points $x_n$ (such as images); our goal is to learn this distribution. Here we define the probability density function (pdf) through an energy function $f_\theta(x)$:
$$ p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_{\theta}} $$
where $Z_\theta$ is a normalizing constant that ensures $\int p_\theta(x)dx=1$. We could learn this distribution by maximizing the likelihood; however, computing $Z_\theta$ is intractable!
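As a concrete picture of this parameterization, here is a minimal sketch (PyTorch) with a toy MLP energy function; `EnergyNet` and its sizes are illustrative choices, not anything from a particular library. The unnormalized log-density $-f_\theta(x)$ costs one forward pass, while $Z_\theta$ would require integrating $e^{-f_\theta(x)}$ over the entire data space (e.g. all possible images).

```python
# Minimal sketch: an energy-based parameterization p_theta(x) = exp(-f_theta(x)) / Z_theta.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Toy MLP energy function f_theta(x); illustrative, not from the post."""
    def __init__(self, dim: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),            # scalar energy per sample
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

f = EnergyNet()
x = torch.randn(16, 2)
unnormalized_log_p = -f(x)                   # easy: equals log p_theta(x) + log Z_theta
print(unnormalized_log_p.shape)              # torch.Size([16]); Z_theta itself is never computed
```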
Instead, we take the log of both sides and differentiate with respect to $x$:
$$\nabla_x \log p_\theta(x) = \nabla_x \log\left(\frac{e^{-f_\theta(x)}}{Z_{\theta}}\right)$$
$$= \nabla_x \log\frac{1}{Z_{\theta}} + \nabla_x \log e^{-f_\theta(x)}$$
$$= -\nabla_x f_\theta(x)$$
$$= s_\theta(x)$$
The first term vanishes because $Z_\theta$ does not depend on $x$. This gets rid of the normalization issue and gives us the score function $\nabla_x \log p_\theta(x)$! We can then estimate this score function with a neural network $s_\theta(x)$ by minimizing the Fisher divergence:
$$E_{p(x)}\left[\|\nabla_x \log p(x) - s_\theta(x)\|^2_2\right]$$
The score function is a gradient field that describes how to move a data point to increase its likelihood. Starting from any random point $x$ and repeatedly following the score, we converge to one of the "modes" of the distribution. Such sampling methods are called Markov chain Monte Carlo (MCMC), and here we mainly use Langevin dynamics:
$$x_{i+1} = x_i + c\,\nabla_x \log p(x_i) + \sqrt{2c}\,\epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, I)$$
$x_0$ is sampled from an arbitrary prior distribution. Gaussian noise is added at each step so that the samples do not all collapse to the exact same mode, which adds diversity. To compute the Fisher divergence above we would need access to the ground-truth score function, which is impossible in most cases. Luckily, we can optimize it with other techniques such as score matching trained by stochastic gradient descent. Models that generate samples with MCMC methods such as Langevin dynamics are called score-based generative models.
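To make the Langevin update concrete, below is a minimal sketch that runs the rule above on a toy 2-D Gaussian mixture whose score can be computed exactly with autograd; in an actual score-based model, the analytic `score` below would be replaced by the learned network $s_\theta(x)$. The step size `c`, the step count, and the mode locations are illustrative assumptions.

```python
# Minimal sketch: Langevin dynamics x_{i+1} = x_i + c * grad log p(x_i) + sqrt(2c) * eps
# on a toy 2-D Gaussian mixture with a known (autograd-computed) score.
import torch

means = torch.tensor([[-4.0, 0.0], [4.0, 0.0]])      # two well-separated modes

def log_p(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized log-density of an equal-weight mixture of two unit-variance Gaussians."""
    sq = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (batch, 2) squared distances
    return torch.logsumexp(-0.5 * sq, dim=1)          # additive constants dropped; they vanish in the gradient

def score(x: torch.Tensor) -> torch.Tensor:
    """grad_x log p(x) via autograd; a trained s_theta(x) would stand in for this."""
    x = x.detach().requires_grad_(True)
    return torch.autograd.grad(log_p(x).sum(), x)[0]

def langevin_sample(n: int = 512, steps: int = 500, c: float = 0.1) -> torch.Tensor:
    x = 6.0 * torch.randn(n, 2)                       # arbitrary prior; Langevin forgets it
    for _ in range(steps):
        # One Langevin step: drift along the score, plus Gaussian noise for diversity.
        x = x + c * score(x) + (2 * c) ** 0.5 * torch.randn_like(x)
    return x

samples = langevin_sample()
# Samples settle around the two modes at (-4, 0) and (4, 0); the injected noise
# keeps them from collapsing onto a single point.
print((samples[:, 0] > 0).float().mean())             # roughly 0.5: both modes are covered
```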