Seiok 🤸‍♂ Kim

Scalable Q Learning

Reaction to Seohong's Post on X

Writing in progress.

The Search for Scalable Q Learning

I have long searched for ways to scale up Reinforcement Learning. One such interesting project is the starcraft.ai project. I was lucky to attend the workshop given by DeepMind, where I met some of the people who actually built the environment (Vinyals et al., 2017). The challenges of the environment are as follows:

  • Partial observability

  • Multi-agent interaction

  • Large action space

  • Large state space from raw input feature maps

  • Delayed credit assignment requiring long-term strategies over thousands of steps.

Q-Learning is not yet scalable

My friend Seohong addresses this problem too, in his paper:

To what extent can current offline RL algorithms solve complex tasks simply by scaling up data and compute?

Seohong Park (Park et al., 2025)

He says that Q-learning is not readily scalable to complex, long-horizon problems, and his diagnosis points to the target in the standard TD objective:

\[\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}} \bigg[ \Big( Q_\theta(s,a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\bar\theta}(s',a') \big)}_{{\color{royalblue}\text{biased}}\ (\text{i.e., }\neq\, Q^\ast(s,a))} \Big)^2 \bigg]\]

Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. The presence of bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning).
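
To make the objective above concrete, here is a minimal PyTorch sketch of that one-step TD loss for a discrete-action Q-network with a target network. The batch layout, network interface, and hyperparameters are my own illustrative assumptions, not anything from the paper.

```python
import torch
import torch.nn as nn

def td_loss(q_net: nn.Module, target_net: nn.Module, batch, gamma: float = 0.99):
    """One-step TD loss for offline Q-learning with a target network.

    The regression target r + gamma * max_a' Q_target(s', a') is itself an
    estimate: any error in the target network leaks into the label, and that
    bias is propagated backwards one step at a time over the whole horizon.
    """
    s, a, r, s_next, done = batch  # s: (B, obs_dim), a: (B,) long, r/done: (B,)

    # Q(s, a) for the actions actually present in the dataset.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Biased bootstrap target: it relies on an imperfect target network,
        # and the max operator adds further (over)estimation bias.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    return ((q_sa - target) ** 2).mean()
```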

There are previous works showing that current RL methods scale to more (but not necessarily harder) tasks with larger models and datasets (Kumar et al., 2023; Springenberg et al., 2024).

Also, Seohong has tried to identify the bottlenecks of offline reinforcement learning.

The main bottleneck in offline RL is policy learning, not value learning.

TL;DR

(1) Policy extraction is often more important than value learning: Do not use weighted behavior cloning (AWR); always use behavior-constrained policy gradient (DDPG+BC). (A rough sketch of the two losses follows below.)
(2) Test-time policy generalization is one of the most significant bottlenecks in offline RL: Current offline RL is often already great at learning an effective policy on dataset states, and performance is often determined simply by how the policy behaves on out-of-distribution states.
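
The contrast in point (1) is easy to state in code. Below is a hedged sketch of the two policy-extraction losses for a continuous-action actor-critic; `policy.log_prob`, `policy.rsample`, `q_net(s, a)`, and `v_net(s)` are hypothetical interfaces, and the temperature, weight clipping, and BC coefficient are illustrative choices rather than the settings used in the paper.

```python
import torch

def awr_loss(policy, q_net, v_net, s, a, temperature=1.0):
    """Weighted behavior cloning (AWR): imitate dataset actions, weighted by
    exp(advantage / temperature). The policy never queries Q at its own
    actions, so no first-order information from Q reaches the policy."""
    with torch.no_grad():
        adv = q_net(s, a) - v_net(s)
        w = torch.exp(adv / temperature).clamp(max=100.0)  # clipped for stability
    return -(w * policy.log_prob(s, a)).mean()


def ddpg_bc_loss(policy, q_net, s, a, bc_coef=1.0):
    """Behavior-constrained policy gradient (DDPG+BC): maximize Q(s, pi(s))
    while regularizing toward dataset actions with a behavior-cloning term."""
    pi_a = policy.rsample(s)            # reparameterized action sample (hypothetical helper)
    q_term = q_net(s, pi_a).mean()      # first-order gradient of Q flows into the policy
    bc_term = policy.log_prob(s, a).mean()
    return -(q_term + bc_coef * bc_term)
```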

Direction

I am definitely looking into ways to avoid TD learning entirely:

  • Quasimetric RL, based on the LP formulation of RL.
  • MC-based (Monte Carlo) methods like contrastive RL (a rough sketch follows below).
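
To give a flavor of the second, TD-free route, here is a rough sketch of a Monte Carlo contrastive critic in the spirit of contrastive RL: the critic learns to distinguish future states that actually occurred on the same trajectory as (s, a) from in-batch negatives, with no bootstrapped target anywhere. The encoder interfaces and the in-batch negative scheme are my assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_critic_loss(sa_encoder, g_encoder, s, a, g_future):
    """Monte Carlo contrastive critic (no bootstrapped targets anywhere).

    sa_encoder(s, a) and g_encoder(g) map into a shared embedding space; the
    inner product scores how likely g is to be reached from (s, a). Positives
    are future states sampled from the same trajectory; negatives are the
    other futures in the batch (standard in-batch InfoNCE)."""
    phi = sa_encoder(s, a)          # (B, d) state-action embeddings
    psi = g_encoder(g_future)       # (B, d) embeddings of sampled future states
    logits = phi @ psi.T            # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```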


References

  1. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J. P., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T. P., Calderone, K., … Tsing, R. (2017). StarCraft II: A New Challenge for Reinforcement Learning. CoRR, abs/1708.04782. http://arxiv.org/abs/1708.04782
  2. Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., & Levine, S. (2025). Horizon Reduction Makes RL Scalable. https://arxiv.org/abs/2506.04168
  3. Kumar, A., Agarwal, R., Geng, X., Tucker, G., & Levine, S. (2023). Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes. https://arxiv.org/abs/2211.15144
  4. Springenberg, J. T., Abdolmaleki, A., Zhang, J., Groth, O., Bloesch, M., Lampe, T., Brakel, P., Bechtle, S., Kapturowski, S., Hafner, R., Heess, N., & Riedmiller, M. A. (2024). Offline Actor-Critic Reinforcement Learning Scales to Large Models. Forty-First International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. https://openreview.net/forum?id=tl2qmO5kpD
This post is licensed under CC BY 4.0 by the author.