
Is Conditional Generative Modelling All You Need for Decision-Making?

One of the must-see papers of 2022. Finally, but as expected, it's here: decision-making formulated not as reinforcement learning (RL), but as conditional generative modeling. Personally, I am doubly interested since it also runs experiments in robotic simulation. I wonder whether it can sidestep the complexities of traditional RL methods.

\[\renewcommand{\V}[1]{\mathbf{#1}}\]

Introduction

Amazing paper from the Improbable AI Lab at MIT: (Ajay et al., 2022).


The paper views decision-making not as a reinforcement learning (RL) problem, but as a conditional generative modeling problem, and makes three claims:

  1. Conditional generative modeling is an effective tool in offline decision-making.
  2. Classifier-free guidance with low-temperature sampling can replace dynamic programming.
  3. The framework of conditional generative modeling makes it possible to flexibly combine constraints and compose skills at inference time.
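The second point can be illustrated with a minimal NumPy sketch of classifier-free guidance: the sampler queries the denoiser twice, with and without the condition, and extrapolates from the unconditional prediction toward the conditional one. The names `eps_model`, `guided_noise`, and `w` are my own illustrative choices, not from the paper.

```python
import numpy as np

def guided_noise(eps_model, x_k, k, cond, w=1.2):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by a guidance scale w.

    `eps_model(x, k, cond)` is a hypothetical denoiser; passing cond=None
    plays the role of the null (unconditional) token.
    """
    eps_uncond = eps_model(x_k, k, None)      # prediction with null condition
    eps_cond = eps_model(x_k, k, cond)        # prediction with the condition
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With `w > 1` the sample is pushed beyond the conditional prediction, which is what makes conditioning on, e.g., high returns effective without any dynamic programming.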

Decision Diffuser


Results Overview


Constraint Satisfaction

Combining Stacking Constraints


Combining Rearrangement Constraints


‘Not’ constraints in Stacking and Rearrangement


Infeasible constraints lead to incoherent behavior


Background

Diffusion Probabilistic Models

Diffusion models (Sohl-Dickstein et al., 2015) , (Ho et al., 2020) are a specific type of generative model that learn the data distribution \(q(\V x)\) from a dataset \(\mathcal{D} := \{ \V x^i \}_{0 \leq i < M}\) . They have been used most notably for synthesizing high-quality images from text descriptions. Here the data-generating procedure is modelled with a predefined forward noising process \(q(\V x_{k+1} \mid \V x_k) := \mathcal{N}(\V x_{k+1}; \sqrt{\alpha_k} \V x_k, (1-\alpha_k) \V I)\) and a trainable reverse process \(p_\theta(\V x_{k-1} \mid \V x_k) := \mathcal{N} (\V x_{k-1} \mid \mu_\theta (\V x_k, k), \Sigma_k)\), where \(\mathcal{N}(\mu, \Sigma)\) denotes a Gaussian distribution with mean \(\mu\) and covariance \(\Sigma\), \(\alpha_k \in \mathbb{R}\) determines the variance schedule, \(\V x_0 := \V x\) is a sample, \(\V x_1, \V x_2, \ldots, \V x_{K-1}\) are the latents, and \(\V x_K \sim \mathcal{N}(\V 0, \V I)\) for carefully chosen \(\alpha_k\) and long enough \(K\). Starting with Gaussian noise, samples are then iteratively generated through a series of “denoising” steps.
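The forward process above can be sampled in closed form for any step \(k\), without running the chain step by step. Here is a minimal NumPy sketch; the function name and return convention are my own:

```python
import numpy as np

def forward_noise(x0, k, alphas):
    """Sample x_k ~ q(x_k | x_0) in closed form, using the identity
    x_k = sqrt(alpha_bar_k) * x_0 + sqrt(1 - alpha_bar_k) * eps,
    where alpha_bar_k is the cumulative product of the per-step alphas.

    Returns the noisy sample and the noise that produced it.
    """
    alpha_bar = np.prod(alphas[:k])
    eps = np.random.randn(*x0.shape)
    x_k = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_k, eps
```

For a schedule with \(\alpha_k < 1\) and large \(K\), the cumulative product goes to zero, so \(\V x_K\) approaches pure Gaussian noise, matching the boundary condition \(\V x_K \sim \mathcal{N}(\V 0, \V I)\) above.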

Although a tractable variational lower-bound on \(\log p_\theta\) can be optimized to train diffusion models, (Ho et al., 2020) propose a simplified surrogate loss:

\[\mathcal{L}_{\mathrm {denoise}}(\theta) := \mathbb{E}_{k \sim [1, K], \V x_0\sim q, \epsilon \sim \mathcal{N}(\V 0, \V I)}[\| \epsilon - \epsilon_\theta (\V x_k, k)\|^2]\]

The predicted noise \(\epsilon_\theta(\V x_k, k)\), parameterized with a deep neural network, estimates the noise \(\epsilon \sim \mathcal{N}(0, I)\) added to the dataset sample \(\V x_0\) to produce a noisy \(\V x_k\). This is equivalent to predicting the mean of \(p_\theta(\V x_{k-1} \mid \V x_k)\) since \(\mu_\theta(\V x_k, k)\) can be calculated as a function of \(\epsilon_\theta(\V x_k, k)\) (Ho et al., 2020) .
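A single Monte-Carlo estimate of this loss is straightforward to sketch in NumPy: draw a timestep, noise the clean sample with the closed-form forward process, and regress the model's prediction onto the true noise. The helper name `denoising_loss` is my own, and `eps_model` stands in for any noise-prediction network:

```python
import numpy as np

def denoising_loss(eps_model, x0, alphas, K):
    """One Monte-Carlo estimate of the simplified loss L_denoise:
    sample a timestep k, noise x0 to x_k, and measure the squared
    error between the true and predicted noise.
    """
    k = np.random.randint(1, K + 1)
    alpha_bar = np.prod(alphas[:k])
    eps = np.random.randn(*x0.shape)
    x_k = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return float(np.mean((eps - eps_model(x_k, k)) ** 2))
```

In practice `eps_model` would be a deep network trained by gradient descent on this quantity averaged over minibatches.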

Connection with stochastic gradient Langevin dynamics

Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, stochastic gradient Langevin dynamics (Welling & Teh, 2011) can produce samples from a probability density \(p(\V x)\) using only the gradients \(\nabla_{\V x} \log p(\V x)\) in a Markov chain of updates:

\[\V x_t = \V x_{t-1} + \frac{\delta}{2} \nabla_{\V x} \log p(\V x_{t-1}) + \sqrt{\delta}\, \V \epsilon_t, \quad\text{where } \V \epsilon_t \sim \mathcal N (\V 0, \V I)\]

where \(\delta\) is the step size. As \(T \rightarrow \infty\) and \(\delta \rightarrow 0\), \(\V x_T\) converges to an exact sample from the true density \(p(\V x)\).

Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapses into local minima.1
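The update rule above fits in a few lines of NumPy. This is a sketch under my own naming (`sgld_sample`, `grad_log_p`), assuming access to the score function \(\nabla_{\V x} \log p(\V x)\) rather than the full gradient machinery of SGLD:

```python
import numpy as np

def sgld_sample(grad_log_p, x0, delta=0.1, T=500):
    """Approximate sampling from p(x) via Langevin dynamics:
    gradient ascent on log p plus injected Gaussian noise,
    following the update rule in the text.
    """
    x = np.array(x0, dtype=float)
    for _ in range(T):
        noise = np.random.randn(*x.shape)
        x = x + 0.5 * delta * grad_log_p(x) + np.sqrt(delta) * noise
    return x
```

For a standard normal target the score is simply \(-\V x\), and repeated runs of the chain produce samples with mean near 0 and variance near 1; dropping the noise term would instead collapse every run onto the mode at the origin.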

References

  1. Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., & Agrawal, P. (2022). Is Conditional Generative Modeling all you need for Decision-Making? arXiv. doi: 10.48550/ARXIV.2211.15657 https://arxiv.org/abs/2211.15657
  2. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. CoRR, abs/1503.03585. http://arxiv.org/abs/1503.03585
  3. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 6840–6851). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
  4. Welling, M., & Teh, Y. W. (2011). Bayesian Learning via Stochastic Gradient Langevin Dynamics. International Conference on Machine Learning.

Footnotes

  1. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ 

This post is licensed under CC BY 4.0 by the author.
