
























Abstract: A convenient approach to solving combinatorial optimization tasks to optimality is the Branch-and-Bound (B&B) method. Its branching heuristic can be learned to solve a large set of similar tasks. Promising results in this direction have recently been achieved by an on-policy reinforcement learning (RL) method based on the tree Markov Decision Process (tree MDP). To overcome its main disadvantages, namely very long training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To support this training procedure theoretically, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and runs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms state-of-the-art techniques on a challenging practical task from the ML4CO competition.
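The abstract states that TreeDQN is trained by optimizing the geometric mean of expected return. The Python sketch below is an illustration only, not the authors' training code. It assumes the quality of a branching policy on an instance is measured by the size of its B&B tree (smaller is better) and shows why the geometric mean can rank policies differently from the arithmetic mean: one very large tree dominates the arithmetic mean but not the geometric one. The policy names and tree sizes are hypothetical.

```python
import math
from statistics import fmean


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space for numerical stability."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires a non-empty list of positive values")
    # exp(mean(log v)): the constant c minimizing sum((log v - c)^2) is mean(log v),
    # so fitting targets in log space with squared error recovers the geometric mean.
    return math.exp(fmean(math.log(v) for v in values))


def arithmetic_mean(values):
    """Plain arithmetic mean, for comparison."""
    return fmean(values)


# Hypothetical B&B tree sizes for two branching policies on the same four instances.
policy_a = [10, 12, 15, 2000]  # small trees on easy instances, one very large tree
policy_b = [30, 35, 40, 1500]  # uniformly mediocre trees

if __name__ == "__main__":
    for name, sizes in (("A", policy_a), ("B", policy_b)):
        print(f"policy {name}: arithmetic={arithmetic_mean(sizes):.2f} "
              f"geometric={geometric_mean(sizes):.2f}")
```

With these numbers the arithmetic mean favours policy B, because the single 2000-node tree dominates it, while the geometric mean favours policy A, which produces much smaller trees on most instances.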
| Comments: | Accepted in Knowledge-Based Systems |
| Subjects: | Machine Learning (cs.LG); Optimization and Control (math.OC) |
| Cite as: | arXiv:2306.05905 [cs.LG] |
| | (or arXiv:2306.05905v2 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2306.05905 (arXiv-issued DOI via DataCite) |
From: Dmitry Sorokin
[v1] Fri, 9 Jun 2023 14:01:26 UTC (90 KB)
[v2] Wed, 20 May 2026 18:28:05 UTC (15,208 KB)