Writing in progress.
The Search for Scalable Q-Learning
I have long searched for ways to scale up Reinforcement Learning. One interesting project in this direction is the starcraft.ai project. I was lucky to attend the workshop given by DeepMind, where I could meet the people who actually built the environment (Vinyals et al., 2017). The challenges of the environment are as follows:
- Partial observability
- Multi-agent interaction
- Large action space
- Large state space from raw input feature maps
- Delayed credit assignment requiring long-term strategies over thousands of steps
Q-Learning is not yet scalable
My friend Seohong addresses this problem too, in his paper:
To what extent can current offline RL algorithms solve complex tasks simply by scaling up data and compute?
Seohong Park (Park et al., 2025)
He says that Q-learning is not readily scalable to complex, long-horizon problems. His diagnosis is the biased TD target:
\[\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}} \bigg[ \Big( Q_\theta (s,a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\bar\theta}(s',a') \big)}_{{\color{royalblue}{\tt Biased}}\ (i.e.,\ \neq\, Q^\ast(s,a))} \Big)^2\bigg]\]

Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. This bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning).
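To make the bias concrete, here is a minimal PyTorch-style sketch of the one-step TD loss above (the names `q_net`, `target_q_net`, and the batch layout are illustrative assumptions, not anyone's actual code). The regression target reuses the target network's own max estimate, so any error in the target network flows directly into the regression target and compounds over the horizon.

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_q_net, batch, gamma=0.99):
    """One-step TD (Q-learning) loss on an offline batch.

    batch: dict of tensors with keys "s", "a", "r", "s2" (next state), "done".
    The bootstrapped target below is biased whenever the target network is
    not Q*: its own errors enter the regression target and accumulate
    over the horizon.
    """
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # Q_theta(s, a) for the actions actually taken in the dataset
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Target: r + gamma * max_a' Q_theta_bar(s', a')
        next_q = target_q_net(s2).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q

    return F.mse_loss(q_sa, target)
```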
There are previous works showing that current RL methods scale to more (but not necessarily harder) tasks with larger models and datasets (Kumar et al., 2023), (Springenberg et al., 2024).
Also, Seohong has tried to identify the bottlenecks of offline reinforcement learning.
The main bottleneck in offline RL is policy learning, not value learning.
TL;DR
(1) Policy extraction is often more important than value learning: do not use weighted behavior cloning (AWR); always use behavior-constrained policy gradient (DDPG+BC). A sketch of the two objectives follows this list.
(2) Test-time policy generalization is one of the most significant bottlenecks in offline RL: current offline RL is often already good at learning an effective policy on dataset states, and performance is often simply determined by how the policy behaves on out-of-distribution states.
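To illustrate the distinction in (1), here is a rough sketch of the two policy-extraction objectives, assuming continuous actions, a learned critic `q_fn`, a value function `value_fn`, and a hypothetical `policy` interface with `log_prob` and `mean_action`; none of these names come from the paper.

```python
import torch

def awr_loss(policy, value_fn, q_fn, s, a, beta=1.0):
    """Weighted behavior cloning (AWR-style): the policy only reweights
    dataset actions; it never uses gradient information from Q with
    respect to the action."""
    with torch.no_grad():
        adv = q_fn(s, a) - value_fn(s)                 # advantage of the dataset action
        w = torch.exp(adv / beta).clamp(max=100.0)     # exponential weights
    log_prob = policy.log_prob(s, a)                   # log pi(a|s) on dataset actions
    return -(w * log_prob).mean()

def ddpg_bc_loss(policy, q_fn, s, a, alpha=1.0):
    """Behavior-constrained policy gradient (DDPG+BC-style): first-order
    signal from dQ/da at the policy's own action, plus a BC penalty that
    keeps the action close to the data."""
    pi_a = policy.mean_action(s)                       # deterministic action mu(s)
    q_term = q_fn(s, pi_a).mean()                      # backprop through Q w.r.t. the action
    bc_term = ((pi_a - a) ** 2).sum(dim=-1).mean()     # stay close to dataset actions
    return -q_term + alpha * bc_term
```

The key difference: AWR only reweights actions that already appear in the dataset, while DDPG+BC exploits first-order gradient information from the critic, which is the kind of policy-extraction signal Seohong's analysis favors.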
Direction
I am definitely looking into ways to avoid TD learning entirely:
- quasimetric RL, based on the LP (linear programming) formulation of RL
- Monte Carlo-based methods like contrastive RL (a minimal sketch below)
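For the contrastive RL bullet, here is a minimal sketch of the Monte Carlo idea, with no TD bootstrapping anywhere: train a critic to classify which future state was actually reached from each state-action pair, using other batch rows as negatives. The encoders and batch layout are illustrative assumptions, not a specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_rl_loss(sa_encoder, goal_encoder, s, a, g_future):
    """Contrastive critic trained by classification rather than TD.

    g_future[i] is a state sampled from the future of the trajectory that
    produced (s[i], a[i]); the other rows of the batch serve as negatives.
    """
    phi = sa_encoder(s, a)                     # (B, d) embedding of (s, a)
    psi = goal_encoder(g_future)               # (B, d) embedding of future states
    logits = phi @ psi.T                       # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)     # positives on the diagonal
```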
/\_/\
( o.o )
> ^ <
Thinking…cat
References
- Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J. P., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T. P., Calderone, K., … Tsing, R. (2017). StarCraft II: A New Challenge for Reinforcement Learning. CoRR, abs/1708.04782. http://arxiv.org/abs/1708.04782
- Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., & Levine, S. (2025). Horizon Reduction Makes RL Scalable. https://arxiv.org/abs/2506.04168
- Kumar, A., Agarwal, R., Geng, X., Tucker, G., & Levine, S. (2023). Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes. https://arxiv.org/abs/2211.15144
- Springenberg, J. T., Abdolmaleki, A., Zhang, J., Groth, O., Bloesch, M., Lampe, T., Brakel, P., Bechtle, S., Kapturowski, S., Hafner, R., Heess, N., & Riedmiller, M. A. (2024). Offline Actor-Critic Reinforcement Learning Scales to Large Models. Forty-First International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. https://openreview.net/forum?id=tl2qmO5kpD