finite_mdp
<img src="https://secure.travis-ci.org/jdleesmiller/finite_mdp.png"/>
SYNOPSIS
Solve small, finite Markov Decision Process (MDP) models.
This library provides several ways of describing an MDP model (see {FiniteMDP::Model}) and some reasonably efficient implementations of policy iteration and value iteration to solve it (see {FiniteMDP::Solver}).
Usage
Example 1: Recycling Robot
The following shows how to solve the recycling robot model (Example 3.7) from Sutton and Barto (1998), Reinforcement Learning: An Introduction.
At each time step, the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. The best way to find cans is to actively search for them, but this runs down the robot’s battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued (producing a low reward). The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish two levels, high and low.
The transition model is described in Table 3.1, which can be fed directly into FiniteMDP using a {FiniteMDP::TableModel}, as follows.
  require 'finite_mdp'

  alpha    = 0.1 # Pr(stay at high charge if searching | now have high charge)
  beta     = 0.1 # Pr(stay at low charge if searching | now have low charge)
  r_search =  2  # reward for searching
  r_wait   =  1  # reward for waiting
  r_rescue = -3  # reward (actually penalty) for running out of charge

  model = FiniteMDP::TableModel.new [
    [:high, :search,   :high, alpha,   r_search],
    [:high, :search,   :low,  1-alpha, r_search],
    [:low,  :search,   :high, 1-beta,  r_rescue],
    [:low,  :search,   :low,  beta,    r_search],
    [:high, :wait,     :high, 1,       r_wait],
    [:high, :wait,     :low,  0,       r_wait],
    [:low,  :wait,     :high, 0,       r_wait],
    [:low,  :wait,     :low,  1,       r_wait],
    [:low,  :recharge, :high, 1,       0],
    [:low,  :recharge, :low,  0,       0]]

  solver = FiniteMDP::Solver.new(model, 0.95) # discount factor 0.95
  solver.policy_iteration 1e-4
  solver.policy #=> {:high=>:search, :low=>:recharge}
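Value iteration can be run on the same model instead of policy iteration; this is a sketch (the tolerance and iteration limit below are illustrative, not recommendations), and it should reach the same policy as above.

  # solve the recycling robot model by value iteration instead
  solver = FiniteMDP::Solver.new(model, 0.95)
  solver.value_iteration(1e-6, 200) # stop at tolerance 1e-6 or after 200 sweeps
  solver.policy #=> {:high=>:search, :low=>:recharge}
  solver.value  # Hash from state (:high, :low) to its expected discounted return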
Example 2: Grid Worlds
A more complicated example: the grid world from Russell and Norvig (2003), Artificial Intelligence: A Modern Approach, Chapter 17.
Here we describe the model as a class that implements the {FiniteMDP::Model} interface. The model contains terminal states, which we represent with a special absorbing state with zero reward, called :stop.
  require 'finite_mdp'

  class AIMAGridModel
    include FiniteMDP::Model

    #
    # @param [Array<Array<Float, nil>>] grid rewards at each point, or nil if a
    #        grid square is an obstacle
    #
    # @param [Array<[i, j]>] terminii coordinates of the terminal states
    #
    def initialize grid, terminii
      @grid, @terminii = grid, terminii
    end

    attr_reader :grid, :terminii

    # every position on the grid is a state, except for obstacles, which are
    # indicated by a nil in the grid
    def states
      is, js = (0...grid.size).to_a, (0...grid.first.size).to_a
      is.product(js).select {|i, j| grid[i][j]} + [:stop]
    end

    # can move north, east, south or west on the grid
    MOVES = {
      '^' => [-1,  0],
      '>' => [ 0,  1],
      'v' => [ 1,  0],
      '<' => [ 0, -1]}

    # agent can move north, south, east or west (unless it's in the :stop
    # state); if it tries to move off the grid or into an obstacle, it stays
    # where it is
    def actions state
      if state == :stop || terminii.member?(state)
        [:stop]
      else
        MOVES.keys
      end
    end

    # define the transition model
    def transition_probability state, action, next_state
      if state == :stop || terminii.member?(state)
        (action == :stop && next_state == :stop) ? 1 : 0
      else
        # agent usually succeeds in moving forward, but sometimes it ends up
        # moving left or right
        move = case action
               when '^' then [['^', 0.8], ['<', 0.1], ['>', 0.1]]
               when '>' then [['>', 0.8], ['^', 0.1], ['v', 0.1]]
               when 'v' then [['v', 0.8], ['<', 0.1], ['>', 0.1]]
               when '<' then [['<', 0.8], ['^', 0.1], ['v', 0.1]]
               end
        move.map {|m, pr|
          m_state = [state[0] + MOVES[m][0], state[1] + MOVES[m][1]]
          m_state = state unless states.member?(m_state) # stay in bounds
          pr if m_state == next_state
        }.compact.inject(:+) || 0
      end
    end

    # reward is given by the grid cells; zero reward for the :stop state
    def reward state, action, next_state
      state == :stop ? 0 : grid[state[0]][state[1]]
    end

    # helper for functions below
    def hash_to_grid hash
      0.upto(grid.size-1).map{|i| 0.upto(grid[i].size-1).map{|j| hash[[i,j]]}}
    end

    # print the values in a grid
    def pretty_value value
      hash_to_grid(Hash[value.map {|s, v| [s, "%+.3f" % v]}]).
        map{|row| row.map{|cell| cell || ' '}.join(' ')}
    end

    # print the policy using ASCII arrows
    def pretty_policy policy
      hash_to_grid(policy).
        map{|row| row.map{|cell| (cell.nil? || cell == :stop) ? ' ' : cell}.join(' ')}
    end
  end

  # the grid from Figures 17.1, 17.2(a) and 17.3
  model = AIMAGridModel.new(
    [[-0.04, -0.04, -0.04,    +1],
     [-0.04,   nil, -0.04,    -1],
     [-0.04, -0.04, -0.04, -0.04]],
    [[0, 3], [1, 3]]) # terminals (the +1 and -1 states)

  # sanity check: successor state probabilities must sum to 1
  model.check_transition_probabilities_sum

  solver = FiniteMDP::Solver.new(model, 1) # discount factor 1
  solver.value_iteration(1e-5, 100) #=> true if converged

  puts model.pretty_policy(solver.policy)
  # output: (matches Figure 17.2(a))
  # > > >
  # ^   ^
  # ^ < < <

  puts model.pretty_value(solver.value)
  # output: (matches Figure 17.3)
  # 0.812 0.868 0.918  1.000
  # 0.762       0.660 -1.000
  # 0.705 0.655 0.611  0.388

  FiniteMDP::TableModel.from_model(model)
  #=> [[0, 0], "v", [0, 0], 0.1, -0.04]
  #   [[0, 0], "v", [0, 1], 0.1, -0.04]
  #   [[0, 0], "v", [1, 0], 0.8, -0.04]
  #   [[0, 0], "<", [0, 0], 0.9, -0.04]
  #   [[0, 0], "<", [1, 0], 0.1, -0.04]
  #   [[0, 0], ">", [0, 0], 0.1, -0.04]
  #   [[0, 0], ">", [0, 1], 0.8, -0.04]
  #   [[0, 0], ">", [1, 0], 0.1, -0.04]
  #   ...
  #   [:stop, :stop, :stop, 1, 0]
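As a quick consistency check (not part of the original example), the table returned by {FiniteMDP::TableModel.from_model} is itself a model, so it can be handed back to the solver; with the same settings it should reproduce the policy above.

  # a sketch: solve the generated table model and compare policies
  table_model  = FiniteMDP::TableModel.from_model(model)
  table_solver = FiniteMDP::Solver.new(table_model, 1)
  table_solver.value_iteration(1e-5, 100)
  puts model.pretty_policy(table_solver.policy) # same arrows as above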
Note that Python code for this model is also available from the book’s authors at aima.cs.berkeley.edu/python/mdp.html.
REQUIREMENTS
This gem requires Ruby 2.2 or higher.
INSTALLATION
gem install finite_mdp
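If you manage dependencies with Bundler, a line like the following in your Gemfile should also work (no particular version constraint is required):

  gem 'finite_mdp'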
LICENSE
(The MIT License)
Copyright © 2016 John Lees-Miller
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.