Beyond the Cumulative Return in Reinforcement Learning
Virtual Informal Systems Seminar (VISS)
Centre for Intelligent Machines (CIM) and Groupe d'Études et de Recherche en Analyse des Décisions (GERAD)
Speaker: Alec Koppel – Research Scientist, Amazon, United States
Webinar ID: 910 7928 6959
Passcode: VISS
Abstract: Reinforcement Learning (RL) is a form of stochastic adaptive control in which one seeks to estimate the parameters of a controller from data alone, and it has gained popularity in recent years. However, the technological successes of RL are hindered by the high variance and irreproducibility its training exhibits in practice. Motivated by this gap, we'll present recent efforts to solidify the theoretical understanding of how risk-sensitivity, the incorporation of prior information, and the prioritization of exploration may be subsumed into a "general utility." This entity is defined as any concave function of the long-term state-action occupancy measure of an MDP. We present two methodologies for RL with general utilities. The first, for the tabular setting, extends the classical linear programming formulation of dynamic programming to general utilities. We develop a solution methodology based upon a stochastic variant of the primal-dual method, for which we derive a polynomial rate of convergence to a primal-dual optimal pair. Experiments demonstrate that the proposed approach yields a rigorous way to incorporate risk-sensitivity into RL. Second, we study scalable solutions for general utilities by searching over parameterized families of policies. To do so, we put forth the Variational Policy Gradient Theorem, upon which we build the Variational Policy Gradient (VPG) method. VPG constructs a "shadow reward," which plays the role of the usual reward in policy gradient methods, to construct search directions in parameter space. We establish the convergence rate of this technique to global optimality by exploiting a bijection between occupancy measures and parameterized policies. Experimentally, we observe that VPG provides an effective framework for solving constrained MDPs and exploration problems on benchmarks in OpenAI Gym.
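For readers unfamiliar with these objects, the following is a minimal sketch in standard MDP notation; the symbols below (P, ρ, γ, λ, F) are assumed for illustration rather than taken from the talk itself. For a discounted MDP with transition kernel P, initial state distribution ρ, and discount factor γ ∈ (0,1), the discounted state-action occupancy measure of a policy π and the general-utility problem take the form

\lambda^{\pi}(s,a) \;=\; \sum_{t=0}^{\infty} \gamma^{t} \, \Pr\left( s_t = s,\ a_t = a \,\middle|\, s_0 \sim \rho,\ \pi \right),

\max_{\lambda \ge 0}\ F(\lambda) \quad \text{s.t.} \quad \sum_{a} \lambda(s,a) \;=\; \rho(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, \lambda(s',a') \quad \forall\, s.

When F is linear, F(\lambda) = \sum_{s,a} r(s,a)\,\lambda(s,a), this reduces to the classical linear programming formulation of dynamic programming and recovers the usual cumulative return; a concave nonlinear F can instead encode risk-sensitivity, prior information, or exploration objectives. In this notation, the "shadow reward" of VPG would be the gradient of the utility evaluated at the current occupancy measure,

r_{\lambda}(s,a) \;=\; \left. \frac{\partial F(\lambda)}{\partial \lambda(s,a)} \right|_{\lambda = \lambda^{\pi_\theta}},

so that the policy gradient of the general utility coincides with a standard policy gradient computed under the reward r_{\lambda}.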
Bio: Alec Koppel has been a Research Scientist at Amazon within Supply Chain Optimization Technologies (SCOT) since September 2021. From 2017 to 2021, he was a Research Scientist with the U.S. Army Research Laboratory (ARL) in the Computational and Information Sciences Directorate. He completed his Master's degree in Statistics and Doctorate in Electrical and Systems Engineering, both at the University of Pennsylvania (Penn), in August 2017. Before coming to Penn, he completed his Master's degree in Systems Science and Mathematics and his Bachelor's degree in Mathematics, both at Washington University in St. Louis (WashU), Missouri. He is a recipient of the 2016 Penn ESE Department Award for Exceptional Service, an awardee of the Science, Mathematics, and Research for Transformation (SMART) Scholarship, a co-author of a Best Paper Finalist at the 2017 IEEE Asilomar Conference on Signals, Systems, and Computers, a finalist for the 2019 ARL Honorable Scientist Award, an awardee of the 2020 ARL Director's Research Award Translational Research Challenge (DIRA-TRC), a recipient of a 2020 Honorable Mention from the IEEE Robotics and Automation Letters, and a mentor to the 2021 ARL Summer Symposium Best Project Awardee. His research interests are in optimization and machine learning. His academic work focuses on approximate Bayesian inference, reinforcement learning, and decentralized optimization, with an emphasis on applications in robotics and autonomy. On the industry side, he is investigating how to infer hidden supply signals from market data, and how that problem intersects with vendor selection.