\[\renewcommand{\V}[1]{\mathbf{#1}}\]One of the must-see papers of 2022. Finally, and as expected, it's here: decision-making formulated not as reinforcement learning (RL), but as conditional generative modeling. Personally, I am doubly interested because it also runs experiments in robotic simulation. I wonder whether it can sidestep the complexities of traditional RL methods.
An amazing paper from the Improbable AI Lab at MIT: (Ajay et al., 2022).
By viewing decision-making not as a reinforcement learning (RL) problem but as a conditional generative modeling problem, the authors show that:
- Conditional generative modeling is an effective tool in offline decision-making.
- Classifier-free guidance with low-temperature sampling can be used instead of dynamic programming.
- The framework of conditional generative modeling can flexibly combine constraints and compose skills during inference.
Decision Diffuser
Results Overview
Constraint Satisfaction
Combining Stacking Constraints
Combining Rearrangement Constraints
‘Not’ constraints in Stacking and Rearrangement
Infeasible constraints lead to incoherent behavior
Diffusion Probabilistic Models
Diffusion models (Sohl-Dickstein et al., 2015), (Ho et al., 2020) are a specific type of generative model that learns the data distribution \(q(\V x)\) from a dataset \(\mathcal{D} := \{ \V x^i \}_{0 \leq i < M}\). They have been used most notably for synthesizing high-quality images from text descriptions. Here the data-generating procedure is modelled with a predefined forward noising process \(q(\V x_{k+1} \mid \V x_k) := \mathcal{N}(\V x_{k+1}; \sqrt{\alpha_k} \V x_k, (1-\alpha_k) \V I)\) and a trainable reverse process \(p_\theta(\V x_{k-1} \mid \V x_k) := \mathcal{N} (\V x_{k-1} \mid \mu_\theta (\V x_k, k), \Sigma_k)\), where \(\mathcal{N}(\mu, \Sigma)\) denotes a Gaussian distribution with mean \(\mu\) and variance \(\Sigma\), \(\alpha_k \in \mathbb{R}\) determines the variance schedule, \(\V x_0 := \V x\) is a sample, \(\V x_1, \V x_2, \ldots, \V x_{K-1}\) are the latents, and \(\V x_K \sim \mathcal{N}(\V 0, \V I)\) for carefully chosen \(\alpha_k\) and long enough \(K\). Starting with Gaussian noise, samples are then iteratively generated through a series of “denoising” steps.
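As a rough illustration of the forward noising process, the sketch below repeatedly applies \(\V x_{k+1} = \sqrt{\alpha_k}\, \V x_k + \sqrt{1-\alpha_k}\, \epsilon\) to a toy one-dimensional dataset. The linear schedule, the value of \(K\), and the helper name `forward_noise` are assumptions made for this example, not choices taken from the paper.

```python
import math
import random


def forward_noise(x0, alphas, rng):
    """Apply x_{k+1} = sqrt(alpha_k) * x_k + sqrt(1 - alpha_k) * eps for every alpha_k."""
    x = list(x0)
    for alpha in alphas:
        scale = math.sqrt(alpha)
        noise_std = math.sqrt(1.0 - alpha)
        x = [scale * xi + noise_std * rng.gauss(0.0, 1.0) for xi in x]
    return x


rng = random.Random(0)
K = 500
# Assumed schedule: 1 - alpha_k grows linearly from 1e-4 to 0.04.
alphas = [1.0 - (1e-4 + (0.04 - 1e-4) * k / (K - 1)) for k in range(K)]

x0 = [3.0] * 2000  # toy "dataset": every sample starts at 3.0
xK = forward_noise(x0, alphas, rng)

mean_K = sum(xK) / len(xK)
var_K = sum((v - mean_K) ** 2 for v in xK) / len(xK)
print("mean of x_K:", mean_K)
print("variance of x_K:", var_K)
```

With this schedule the product of all \(\alpha_k\) is close to zero, so the final samples should have mean near 0 and variance near 1 regardless of the starting value.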
Although a tractable variational lower-bound on \(\log p_\theta\) can be optimized to train diffusion models, (Ho et al., 2020) propose a simplified surrogate loss:
\[\mathcal{L}_{\mathrm {denoise}}(\theta) := \mathbb{E}_{k \sim [1, K], \V x_0\sim q, \epsilon \sim \mathcal{N}(\V 0, \V I)}[\| \epsilon - \epsilon_\theta (\V x_k, k)\|^2]\]
The predicted noise \(\epsilon_\theta(\V x_k, k)\), parameterized with a deep neural network, estimates the noise \(\epsilon \sim \mathcal{N}(\V 0, \V I)\) added to the dataset sample \(\V x_0\) to produce the noisy \(\V x_k\). This is equivalent to predicting the mean of \(p_\theta(\V x_{k-1} \mid \V x_k)\), since \(\mu_\theta(\V x_k, k)\) can be calculated as a function of \(\epsilon_\theta(\V x_k, k)\) (Ho et al., 2020).
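A minimal sketch of a Monte Carlo estimate of this denoising loss follows. It relies on the closed form from (Ho et al., 2020), \(\V x_k = \sqrt{\bar\alpha_k}\, \V x_0 + \sqrt{1-\bar\alpha_k}\, \epsilon\), where \(\bar\alpha_k\) is the product of the \(\alpha\) values applied up to step \(k\). The names `denoise_loss`, `zero_model`, and `oracle_model`, the schedule, and the single-point toy dataset are all hypothetical; a real \(\epsilon_\theta\) would be a trained neural network.

```python
import math
import random


def make_alpha_bars(alphas):
    """alpha_bars[k] is the product of the first k alphas, with alpha_bars[0] = 1."""
    bars = [1.0]
    for a in alphas:
        bars.append(bars[-1] * a)
    return bars


def denoise_loss(eps_model, data, alpha_bars, rng, n_samples=2000):
    """Estimate E[(eps - eps_model(x_k, k))^2] for scalar data by sampling."""
    K = len(alpha_bars) - 1
    total = 0.0
    for _ in range(n_samples):
        k = rng.randint(1, K)          # uniform over 1..K (inclusive)
        x0 = rng.choice(data)          # sample from the dataset
        eps = rng.gauss(0.0, 1.0)      # noise to be predicted
        a_bar = alpha_bars[k]
        x_k = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps
        total += (eps - eps_model(x_k, k)) ** 2
    return total / n_samples


K = 100
# Assumed schedule: 1 - alpha_k grows linearly from 1e-4 to 0.05.
alphas = [1.0 - (1e-4 + (0.05 - 1e-4) * k / (K - 1)) for k in range(K)]
alpha_bars = make_alpha_bars(alphas)
data = [2.0]  # toy dataset containing a single point


def zero_model(x_k, k):
    """Baseline predictor that always guesses zero noise."""
    return 0.0


def oracle_model(x_k, k):
    """Predictor that knows x_0 = 2.0 and inverts the closed form exactly."""
    a_bar = alpha_bars[k]
    return (x_k - math.sqrt(a_bar) * 2.0) / math.sqrt(1.0 - a_bar)


loss_zero = denoise_loss(zero_model, data, alpha_bars, random.Random(0))
loss_oracle = denoise_loss(oracle_model, data, alpha_bars, random.Random(0))
print("loss of zero predictor:", loss_zero)
print("loss of oracle predictor:", loss_oracle)
```

The zero predictor should land near 1, the variance of the injected noise, while the oracle should be at numerical zero; a trained network falls somewhere between the two.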
Connection with stochastic gradient Langevin dynamics
Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, stochastic gradient Langevin dynamics (Welling & Teh, 2011) can produce samples from a probability density \(p(\V x)\) using only the gradients \(\nabla_{\V x} \log p(\V x)\) in a Markov chain of updates:
\[\V x_t = \V x_{t-1} + \frac{\delta}{2} \nabla_{\V x} \log p(\V x_{t-1}) + \sqrt{\delta}\, \V \epsilon_t, \quad\text{where } \V \epsilon_t \sim \mathcal N (\V 0, \V I)\]
Here \(\delta\) is the step size. As \(T \rightarrow \infty\) and \(\delta \to 0\), the distribution of \(\V x_T\) converges to the true probability density \(p(\V x)\).
Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
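The Langevin update can be tried on a density whose score is known in closed form. The sketch below targets a one-dimensional Gaussian \(\mathcal{N}(\mu, \sigma^2)\), for which \(\nabla_x \log p(x) = -(x - \mu)/\sigma^2\). It uses the exact score rather than a minibatch gradient estimate, so it is plain (unadjusted) Langevin dynamics rather than the stochastic gradient variant; the function name `langevin_sample` and all numerical settings are assumptions for illustration.

```python
import math
import random


def langevin_sample(score, x_init, step, n_steps, rng):
    """Run x_t = x_{t-1} + (step / 2) * score(x_{t-1}) + sqrt(step) * eps_t."""
    x = x_init
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


mu, sigma = 2.0, 0.5


def gaussian_score(x):
    """Gradient of log N(x; mu, sigma^2) with respect to x."""
    return -(x - mu) / sigma ** 2


rng = random.Random(0)
# Each chain starts far from the mode, at x = 0.
samples = [langevin_sample(gaussian_score, 0.0, 0.01, 500, rng) for _ in range(2000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print("sample mean:", sample_mean)
print("sample variance:", sample_var)
```

The empirical mean and variance should sit close to \(\mu = 2\) and \(\sigma^2 = 0.25\). A finite step size leaves a small bias in the variance, which is why the convergence statement requires the step size to shrink.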
