Writing in progress.
The Search for Scalable Q-Learning
I have long searched for ways to scale up Reinforcement Learning. One interesting project in this direction is the starcraft.ai project. I was lucky to attend the workshop given by DeepMind, where I could meet the people who actually built the environment (Vinyals et al., 2017). The challenges of the environment are as follows:
- Partial observability
- Multi-agent interaction
- Large action space
- Large state space from raw input feature maps
- Delayed credit assignment requiring long-term strategies over thousands of steps
Q-Learning is not yet scalable
My friend Seohong addresses this problem too, in his paper:
To what extent can current offline RL algorithms solve complex tasks simply by scaling up data and compute?
Seohong Park (Park et al., 2025)
He says that Q-learning is not readily scalable to complex, long-horizon problems. His diagnosis is the biased TD target:
\[\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}} \bigg[ \Big( Q_\theta (s,a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\bar\theta}(s',a') \big)}_{{\color{royalblue}{\tt Biased}}\ (i.e.,\ \neq\, Q^\ast(s,a))} \Big)^2\bigg]\]

Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. This bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning).
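To make the bias concrete, here is a minimal PyTorch-style sketch of the one-step TD loss above (the names `q_net`, `target_q_net`, and the batch layout are illustrative assumptions, not anyone's actual code). The regression target reuses the target network's own max estimate, so any error in the target network flows directly into the regression target and compounds over the horizon.

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_q_net, batch, gamma=0.99):
    """One-step TD (Q-learning) loss on an offline batch.

    batch: dict of tensors with keys "s", "a", "r", "s2" (next state), "done".
    The bootstrapped target below is biased whenever the target network is
    not Q*: its own errors enter the regression target and accumulate
    over the horizon.
    """
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # Q_theta(s, a) for the actions actually taken in the dataset
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Target: r + gamma * max_a' Q_theta_bar(s', a')
        next_q = target_q_net(s2).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q

    return F.mse_loss(q_sa, target)
```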
There are previous works showing that current RL methods scale to more (but not necessarily harder) tasks with larger models and datasets (Kumar et al., 2023), (Springenberg et al., 2024).
Also, Seohong has tried to identify the bottlenecks of offline reinforcement learning.
The main bottleneck in offline RL is policy learning, not value learning.
TL;DR
(1) Policy extraction is often more important than value learning: do not use weighted behavior cloning (AWR); always use behavior-constrained policy gradient (DDPG+BC). A sketch of the two objectives follows this list.
(2) Test-time policy generalization is one of the most significant bottlenecks in offline RL: current offline RL is often already good at learning an effective policy on dataset states, and performance is often simply determined by how the policy behaves on out-of-distribution states.
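To illustrate the distinction in (1), here is a rough sketch of the two policy-extraction objectives, assuming continuous actions, a learned critic `q_fn`, a value function `value_fn`, and a hypothetical `policy` interface with `log_prob` and `mean_action`; none of these names come from the paper.

```python
import torch

def awr_loss(policy, value_fn, q_fn, s, a, beta=1.0):
    """Weighted behavior cloning (AWR-style): the policy only reweights
    dataset actions; it never uses gradient information from Q with
    respect to the action."""
    with torch.no_grad():
        adv = q_fn(s, a) - value_fn(s)                 # advantage of the dataset action
        w = torch.exp(adv / beta).clamp(max=100.0)     # exponential weights
    log_prob = policy.log_prob(s, a)                   # log pi(a|s) on dataset actions
    return -(w * log_prob).mean()

def ddpg_bc_loss(policy, q_fn, s, a, alpha=1.0):
    """Behavior-constrained policy gradient (DDPG+BC-style): first-order
    signal from dQ/da at the policy's own action, plus a BC penalty that
    keeps the action close to the data."""
    pi_a = policy.mean_action(s)                       # deterministic action mu(s)
    q_term = q_fn(s, pi_a).mean()                      # backprop through Q w.r.t. the action
    bc_term = ((pi_a - a) ** 2).sum(dim=-1).mean()     # stay close to dataset actions
    return -q_term + alpha * bc_term
```

The key difference: AWR only reweights actions that already appear in the dataset, while DDPG+BC exploits first-order gradient information from the critic, which is the kind of policy-extraction signal Seohong's analysis favors.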
Direction
I am definitely looking into ways to avoid TD learning entirely:
- quasimetric RL, based on the LP (linear programming) formulation of RL
- Monte Carlo-based methods like contrastive RL (a minimal sketch below)
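For the contrastive RL bullet, here is a minimal sketch of the Monte Carlo idea, with no TD bootstrapping anywhere: train a critic to classify which future state was actually reached from each state-action pair, using other batch rows as negatives. The encoders and batch layout are illustrative assumptions, not a specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_rl_loss(sa_encoder, goal_encoder, s, a, g_future):
    """Contrastive critic trained by classification rather than TD.

    g_future[i] is a state sampled from the future of the trajectory that
    produced (s[i], a[i]); the other rows of the batch serve as negatives.
    """
    phi = sa_encoder(s, a)                     # (B, d) embedding of (s, a)
    psi = goal_encoder(g_future)               # (B, d) embedding of future states
    logits = phi @ psi.T                       # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)     # positives on the diagonal
```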
/\_/\
( o.o )
> ^ <
Thinking…cat
References
- Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J. P., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T. P., Calderone, K., … Tsing, R. (2017). StarCraft II: A New Challenge for Reinforcement Learning. CoRR, abs/1708.04782. http://arxiv.org/abs/1708.04782
- Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., & Levine, S. (2025). Horizon Reduction Makes RL Scalable. https://arxiv.org/abs/2506.04168
- Kumar, A., Agarwal, R., Geng, X., Tucker, G., & Levine, S. (2023). Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes. https://arxiv.org/abs/2211.15144
- Springenberg, J. T., Abdolmaleki, A., Zhang, J., Groth, O., Bloesch, M., Lampe, T., Brakel, P., Bechtle, S., Kapturowski, S., Hafner, R., Heess, N., & Riedmiller, M. A. (2024). Offline Actor-Critic Reinforcement Learning Scales to Large Models. Forty-First International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. https://openreview.net/forum?id=tl2qmO5kpD