Reinforcement Learning for Concave Objectives and Convex Constraints

Mridul Agarwal, Purdue University

Abstract

In areas where prior data is not available, data scientists use reinforcement learning (RL) on online data to explore the system model and eventually learn optimal policies, i.e., policies that make decisions after observing the environment state so as to optimize a given objective. Diverse domains such as recommendation systems, automatic control, and operations research use reinforcement learning to solve problems such as personalized product recommendation to increase click-through rates, automatic lane merging to reduce merging times, and warehouse planning to reduce costs. Moreover, advances in RL provide sample-complexity guarantees in many domains where the system is a Markov Decision Process (MDP) with finite states and actions. However, RL formulations based on MDPs work only for a single objective, and hence they are not readily applicable when policies must optimize multiple objectives, or satisfy constraints while maximizing one or more objectives, which can often conflict. For example, a wireless service provider may aim to maximize fairness among its users while still providing a minimum service guarantee to certain premium users. Further, many applications, such as robotics or autonomous driving, do not allow constraint violations even during the training process. No existing algorithm simultaneously handles multiple objectives, zero constraint violations, sample efficiency, and low computational complexity.

To this end, we begin by studying sample-efficient Multi-Objective Reinforcement Learning (MORL) with a non-linear scalarization function. In particular, we use a posterior sampling algorithm that maximizes a non-linear, concave, Lipschitz-continuous function of multiple objectives for an MDP sampled from the posterior distribution learned from the collected samples. The algorithm solves a convex optimization problem for the stationary distribution over states and actions. Using a Bellman-error-based analysis, we show that the algorithm obtains a near-optimal regret bound in the number of interactions with the environment.

We then extend the framework to constrained Markov Decision Processes. We provide an optimism-based algorithm that ensures the convex optimization problem for the stationary distribution admits a solution. Assuming the feasible policies allow some slack in the constraints, we design an algorithm that does not violate the constraints. Finally, we merge the two setups and provide a posterior sampling algorithm for multi-objective RL with a concave, Lipschitz-continuous scalarization function and convex, Lipschitz-continuous constraints. We also show that the algorithm performs significantly better than existing algorithms for MDPs with finite states and finite actions.
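The following is a minimal sketch, not the dissertation's implementation, of the kind of convex program over occupancy measures that the abstract alludes to: maximizing a concave scalarization of multiple long-run average rewards subject to a convex (here, linear) cost constraint and the stationarity constraints. The toy MDP sizes, the max-min scalarization, the cost function, and the threshold c_max are all illustrative assumptions.

```python
# Sketch of the occupancy-measure convex program (illustrative, not the thesis code).
import numpy as np
import cvxpy as cp

S, A, K = 4, 2, 3                             # states, actions, objectives (toy sizes)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel P[s, a, s']
R = rng.uniform(size=(S, A, K))               # K reward functions r_k(s, a)
C = rng.uniform(size=(S, A))                  # one cost function c(s, a) (assumed)
c_max = 0.6                                   # constraint threshold (assumed)

d = cp.Variable((S, A), nonneg=True)          # stationary state-action occupancy

constraints = [cp.sum(d) == 1]
# Stationarity (flow) constraints: sum_a d(s', a) = sum_{s,a} P(s'|s,a) d(s,a)
for s_next in range(S):
    constraints.append(
        cp.sum(d[s_next, :]) == cp.sum(cp.multiply(P[:, :, s_next], d))
    )
# Convex constraint on the expected cost (stand-in for the dissertation's constraints)
constraints.append(cp.sum(cp.multiply(C, d)) <= c_max)

# Long-run average of each objective, then a concave scalarization
# (here: the minimum across objectives, a max-min fairness surrogate).
avg_rewards = cp.hstack([cp.sum(cp.multiply(R[:, :, k], d)) for k in range(K)])
cp.Problem(cp.Maximize(cp.min(avg_rewards)), constraints).solve()

# Recover a stationary policy pi(a|s) from the occupancy measure.
d_val = d.value
pi = d_val / np.maximum(d_val.sum(axis=1, keepdims=True), 1e-12)
print("policy:\n", np.round(pi, 3))
```

In a posterior sampling or optimism-based scheme, a program of this form would be re-solved each episode with the sampled (or optimistic) model in place of the fixed kernel P used above.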

Degree

Ph.D.

Advisors

Reibman, Purdue University.

Subject Area

Robotics|Artificial intelligence|Commerce-Business|Information science|Mathematics|Operations research
