Automated AI for Decision-Making

By IBM Research Automated Decision Optimization

Automation for data-driven/knowledge-driven dynamic optimization problems including reinforcement learning (RL).


Technology Overview

Reinforcement learning (RL) has emerged as a promising area within machine learning for addressing sequential decision-making problems, typically under uncertainty. Examples include inventory management with multiple echelons and multiple suppliers with lead times under demand uncertainty, control problems such as autonomous manufacturing operations and autonomous driving, and resource allocation problems. Reinforcement learning typically operates in a setting defined by a state space (or observation space), an action space, and a reward signal that provides the necessary feedback about performance, and hence the opportunity to learn which actions are beneficial in any given state. Most dynamic optimization problems, as well as some deterministic discrete optimization problems, are naturally expressible in this state-action-reward paradigm. For example, Markov Decision Processes (MDPs), which formalize sequential decision-making in dynamic systems under uncertain transitions and rewards, take the form of a state-action-reward model.

Reinforcement learning techniques can be organized broadly and usefully into online techniques versus offline techniques. In online reinforcement learning, we have access to a live or simulated system that represents (or encodes via simulation) the dynamics of the system, namely, how the system evolves from state to state upon taking permissible actions, typically under uncertainty. In addition, the system provides a reward signal at each time step; this signal may be empty (null) or, more generally, multivariate, i.e., it may cover more than one key performance measure of interest. Online RL algorithms (or agents) learn by interacting with such a responsive online system. In contrast, offline RL (or batch RL) is a complementary setting in which we do not have access to such a responsive, interactive system. Instead, offline RL only has access to a historical log of past actions, past state transitions, and corresponding rewards, and it learns a policy from this historical data. The current release (version) of this asset addresses automation of reinforcement learning in the online setting.
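To make the state-action-reward loop concrete, the sketch below shows the generic online interaction pattern against an OpenAI Gym environment, with a random action standing in for a learned agent. This is only an illustration assuming the gym package is installed: "CartPole-v1" is used purely as an example, and the exact return signatures of reset and step vary across Gym versions (the classic 4-tuple form is shown here).

```python
import gym

# Any registered Gym environment can be used; CartPole-v1 is just an illustration.
env = gym.make("CartPole-v1")

state = env.reset()   # initial observation (classic Gym API; newer versions also return an info dict)
total_reward = 0.0
done = False

while not done:
    # An online RL agent would pick an action from its current policy;
    # a random sample from the action space stands in for that policy here.
    action = env.action_space.sample()

    # The environment advances one step and returns the next state,
    # a reward signal, a termination flag, and auxiliary info.
    state, reward, done, info = env.step(action)
    total_reward += reward

env.close()
print("Episode return:", total_reward)
```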

Online Reinforcement Learning

This asset addresses online RL in which interaction with the system is via the interface of an OpenAI Gym environment. In other words, the user-provided input expected by this asset is a system simulator implemented as per the interface specification of OpenAI Gym, available at https://www.gymlibrary.ml/. The user-provided ‘environment’ needs to be implemented in Python as a class that derives from the OpenAI gym.Env class; documentation on how to do this subclassing is available at https://www.gymlibrary.ml/content/environment_creation/. Doing so equips the user-provided system simulator with the required interface, against which our asset interacts for the purpose of automating the search for the best online RL algorithm (agent). Our asset carries out a parallel, distributed search for the best agent along with its hyperparameters, spanning both learning-related configuration and the internal policy-function and value-function representations in the form of respective neural architectures. This REST API describes the usage of the engine in terms of all the inputs and configurable choices that the user may exercise. The output of the API, namely the top K best RL solutions, is also described, along with how the user may use the resulting policies for rollout-based evaluation or deployment against a real or virtual system.
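As a minimal sketch of such a user-provided environment, the class below subclasses gym.Env following the environment-creation guide linked above. The environment name, state, dynamics, and reward are invented purely for illustration, and the reset/step signatures follow the classic Gym API (newer Gym versions differ slightly).

```python
import gym
import numpy as np
from gym import spaces


class SimpleInventoryEnv(gym.Env):
    """Hypothetical single-product inventory environment, for illustration only."""

    def __init__(self, capacity=20, horizon=30):
        super().__init__()
        self.capacity = capacity
        self.horizon = horizon
        # Action: number of units to re-order in the current period.
        self.action_space = spaces.Discrete(capacity + 1)
        # Observation: current on-hand inventory level.
        self.observation_space = spaces.Box(
            low=0.0, high=float(capacity), shape=(1,), dtype=np.float32
        )

    def reset(self):
        self.t = 0
        self.inventory = self.capacity // 2
        return np.array([self.inventory], dtype=np.float32)

    def step(self, action):
        demand = np.random.poisson(5)                # stochastic demand
        available = self.inventory + action
        sales = min(available, demand)
        self.inventory = min(available - sales, self.capacity)
        reward = 1.0 * sales - 0.1 * self.inventory  # revenue minus holding cost
        self.t += 1
        done = self.t >= self.horizon
        return np.array([self.inventory], dtype=np.float32), reward, done, {}
```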

Sample Use-cases

The service has been tested against several sample use cases to validate the effectiveness of the automated search for the best-performing RL solution(s). We have pre-registered several environments for trying out this service. These include classic control use cases commonly used in the academic RL literature, such as “mountain-car”, “cart-pole”, and “acrobot”, among others available in OpenAI Gym (https://www.gymlibrary.ml/environments/classic_control/). They also include several environments inspired by more realistic use cases in the Operations Research literature, such as knapsack, portfolio optimization, multi-echelon inventory control, and network supply chain management (https://github.com/hubbs5/or-gym).
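If you would like to inspect these sample environments locally before submitting requests to the service, they can be instantiated directly from the gym and or-gym packages referenced above. This is only a sketch assuming both packages are installed locally; the registration strings accepted by the service are listed in the next section and may differ from the local environment IDs.

```python
import gym
import or_gym  # https://github.com/hubbs5/or-gym

# Classic control use case from OpenAI Gym.
cartpole = gym.make("CartPole-v1")
print(cartpole.observation_space, cartpole.action_space)

# Operations Research use case from or-gym, e.g. the unbounded knapsack problem.
knapsack = or_gym.make("Knapsack-v0")
print(knapsack.observation_space, knapsack.action_space)
```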

User-defined environment

As described above, using this asset for online RL requires the user to provide a system simulator, also called an “environment”. This is Python code implemented as per the OpenAI gym.Env interface. The current release allows a subscriber to make requests against a set of pre-registered environments, listed below (environment name : registration string):

  1. CartPole-v0 : "CartPole-v0"
  2. CartPole-v1 : "CartPole-v1"
  3. Acrobot-v1 : "Acrobot-v1"
  4. MountainCar-v0 : "MountainCar-v0"
  5. Pendulum-v1: "Pendulum-v1"
  6. MountainCarContinuous-v0 : "MountainCarContinuous-v0"
  7. LunarLander-v2 : "LunarLander-v2"
  8. LunarLanderContinuous-v2 : "LunarLanderContinuous-v2"
  9. PortfolioOpt-v0 : "PortfolioOpt-v0"
  10. InvManagement-v0 : "InvManagement-v0"
  11. InvManagement-v1 : "InvManagement-v1"
  12. Knapsack-v0 : "KnapsackCustom-v0"
  13. Knapsack-v1 : "KnapsackCustom-v1"
  14. Knapsack-v2 : "KnapsackCustom-v2"
  15. Knapsack-v3 : "KnapsackCustom-v3"
  16. BinPacking-v0 : "BinPackingCustom-v0"
  17. BinPacking-v1 : "BinPackingCustom-v1"
  18. BinPacking-v2 : "BinPackingCustom-v2"
  19. BinPacking-v3 : "BinPackingCustom-v3"
  20. BinPacking-v4 : "BinPackingCustom-v4"
  21. BinPacking-v5 : "BinPackingCustom-v5"
  22. Newsvendor-v0 : "Newsvendor-v0"

Brief descriptions of the pre-registered environments:

  InvManagement-v0 : multi-echelon supply chain re-order problem with backlogs
  InvManagement-v1 : multi-echelon supply chain re-order problem without backlog
  BinPacking-v0 : small bin packing problem with bounded waste (the difference between the current size and the excess space of the bin)
  BinPacking-v1 : large bin packing problem with bounded waste
  BinPacking-v2 : small perfectly packable bin packing problem with linear waste
  BinPacking-v3 : large perfectly packable bin packing problem with linear waste
  BinPacking-v4 : small perfectly packable bin packing problem with bounded waste
  BinPacking-v5 : large perfectly packable bin packing problem with bounded waste
  Knapsack-v0 : unbounded knapsack problem
  Knapsack-v1 : binary knapsack problem
  Knapsack-v2 : bounded knapsack problem
  Knapsack-v3 : online knapsack problem

For user-specified custom environments, please send a note to the developer team with an accessible pointer to the code (implementing the environment in Python as per the OpenAI gym.Env interface). For security reasons, we will first check the code, pre-register it for you, and share the registration string with you. After this step, you will be able to make requests against the provided custom environment.

See the Learning Paths for more details.

Open API Specification

API Version: 1.0.28
