
HW 9 -- POMDPs

The questions below are due on Monday November 20, 2023; 11:59:00 PM.
 

This week's Colab notebook can be found here: Colab Notebook.

1) Decision Theory Warmup§

You’re an Olympic skier. In practice today, you fell down and hurt your ankle. Based on the results of an x-ray, the doctor thinks that it’s broken with probability 0.2. So, the question is, should you ski in the race tomorrow? If you ski, you think you’ll win with probability 0.1 (independent of whether your leg is broken). If your leg is broken and you ski on it, then you’ll damage it further.

Your utilities are as follows:

  • if you win the race and your leg isn’t broken, +100;
  • if you win and your leg is broken, +50;
  • if you lose and your leg isn’t broken, 0;
  • if you lose and your leg is broken, -50;
  • if you don’t ski, then if your leg is broken your utility is -10, and if it isn’t, it’s 0.
  1. What is your expected utility for skiing?

    Please input a number (2 decimal places).

  2. What is your expected utility for staying home?

    Please input a number (2 decimal places).

You might be able to gather some more information about the state of your leg by having more tests. You might be able to gather more information about whether you’ll win the race by talking to your coach or the TV sports commentators.

  1. What is the expected value of perfect information about the state of your leg?

    Please input a number (2 decimal places).

  2. What is the expected value of perfect information about whether you’ll win the race?

    Please input a number (2 decimal places).
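
If it helps to organize these calculations, here is a minimal, generic Python sketch of expected utility and the expected value of perfect information. It is not part of the assignment; the function names and dictionary conventions are purely illustrative.

def expected_utility(action, P_state, U):
    """EU(a) = sum over outcomes s of P(s) * U(s, a)."""
    return sum(p * U[(s, action)] for s, p in P_state.items())

def best_value(actions, P_state, U):
    """Value of choosing the best action under the distribution P_state."""
    return max(expected_utility(a, P_state, U) for a in actions)

def evpi(actions, P_state, U, P_info, P_state_given_info):
    """Expected value of perfect information: the expectation (over what the
    information might turn out to be) of the best achievable value, minus the
    best value achievable without the information.  P_state should be the
    mixture of the conditionals in P_state_given_info."""
    value_with_info = sum(
        p_i * best_value(actions, P_state_given_info[i], U)
        for i, p_i in P_info.items()
    )
    return value_with_info - best_value(actions, P_state, U)

For this problem, the outcomes s would be the joint (win or lose, broken or not broken) possibilities, the actions would be ski and stay home, and the "information" would be either the state of your leg or whether you will win.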

The Robot Story

Our robot has lost its charger! The robot lives in a world with K=3 locations, but it needs the charger to be in location 0 (where the power outlet is) in order to plug it in and charge.

The robot doesn't have any uncertainty about its own location (and we won't model that), but it does have uncertainty about the location of the charger. When the robot looks in location loc, by executing a look(loc) action, it will receive an observation in \{0, 1\}. Its low battery means that its sensors aren't working very well.

  • If the charger is in the location loc, it will get observation 1 with probability 0.9.
  • If the charger is not in location loc, it will get observation 1 with probability 0.4.

The robot can also try to execute a move(loc1, loc2) (which moves the charger):

  • If the charger is actually in loc1, then with probability 0.8, the charger will be moved to loc2, otherwise, it will stay in loc1.
  • If the charger is not in loc1, then nothing will happen.

Because of the constraints of the robot's arm, we can't do all possible move actions. Specifically, the only valid movements are move(0, 1), move(1, 2), and move(2, 0).

The robot has two more actions, charge and nop:

  • If it executes the charge action when the charger is in location 0, then it gets a reward of 10 and the game terminates.
  • If it executes that action in any other state, it gets reward -100 and the game terminates.

In all other (non-terminal) situations, the robot gets reward -0.1 for every look action and -0.5 for every move action.

Finally, the nop action has reward 0, does not provide a useful observation, and does not affect the world state.

For actions that don't yield useful information about the environment (move, charge, and nop), we assume that they get observation 0 with probability 1, independent of the state.
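
To make the model concrete, here is a standalone sketch of the dynamics described above as plain Python functions. The colab defines its own RobotChargingPOMDP class; the names and action encoding here (tuples like ('look', 0) or ('move', 0, 1)) are illustrative and are not that class's API.

K = 3  # the state is the charger's location, one of 0, 1, 2

def observation_prob(s_next, action, obs):
    """P(obs | next state, action)."""
    if action[0] == 'look':
        loc = action[1]
        p_one = 0.9 if s_next == loc else 0.4   # P(obs = 1)
        return p_one if obs == 1 else 1.0 - p_one
    # move, charge, and nop always give observation 0
    return 1.0 if obs == 0 else 0.0

def transition_prob(s, action, s_next):
    """P(next state | state, action)."""
    if action[0] == 'move' and s == action[1]:
        loc1, loc2 = action[1], action[2]
        if s_next == loc2:
            return 0.8                          # charger successfully moved
        return 0.2 if s_next == loc1 else 0.0   # otherwise it stays put
    # look, charge, nop, or a move from the wrong location: state unchanged
    return 1.0 if s_next == s else 0.0

def reward(s, action):
    if action[0] == 'charge':
        return 10.0 if s == 0 else -100.0
    if action[0] == 'look':
        return -0.1
    if action[0] == 'move':
        return -0.5
    return 0.0                                  # nop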

2) Robot HMM§

Before worrying about how to select actions for the robot, we will think about how the robot can estimate where its charger is, given a history of actions and observations. This problem should seem familiar! A POMDP, ignoring the problem of choosing actions, is a Hidden Markov Model (HMM). We can solve the problem of computing P(S_{t+1} \mid O_2, \ldots, O_{t+1}, A_1, \ldots, A_t) using the generic sum-product method, or HMM forward inference. You might also remember this computation as the "Bayes filter" from Lecture 12.

Bayes Filter

Let b_t = P(S_{t} \mid O_{2:t}, A_{1:t-1}) be our belief at time t about the hidden state S_t. In the robot-charger problem, since there are K=3 possible locations for the charger, we can represent the state space as \mathcal{S} = \{s_0, s_1, s_2\}, where s_i indicates that the charger is at location i. Thus, a belief b_t over the current state can be represented as a 3-tuple (p_0, p_1, p_2) with 0 \le p_i \le 1 and \sum_i p_i = 1, where p_i is the probability that the state is s_i.

The equation for computing our next belief b_{t+1} given our current belief b_t, an action A_t=a_t, and an observation O_{t+1}=o_{t+1} can be written as:

b_{t+1} \propto P(O_{t+1}=o_{t+1} \mid S_{t+1}, A_t=a_t)[\sum_{s_t} P(S_{t+1} \mid S_{t}=s_t, A_{t}=a_t)b_t(s_t)]

It is also possible to understand the above equation as updating the belief in three steps (a small code sketch follows the list):

  1. Transition update: For each s_{t+1} \in \mathcal{S}, b'(s_{t+1}) = \sum_{s_t} P(S_{t+1}=s_{t+1} \mid S_{t}=s_t, A_{t}=a_t)b_t(s_t).
  2. Observation update: For each s_{t+1} \in \mathcal{S}, b''(s_{t+1}) = P(O_{t+1}=o_{t+1} \mid S_{t+1}=s_{t+1}, A_t=a_t)b'(s_{t+1}).
  3. Normalize: Let Z = \sum_{s} b''(s). Then for each s_{t+1} \in \mathcal{S}, b_{t+1}(s_{t+1}) = \frac{1}{Z}b''(s_{t+1}).
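
Here is a minimal sketch of these three steps, assuming model functions like the transition_prob and observation_prob sketched earlier (again, illustrative names rather than the colab's belief_filter):

def bayes_filter(b, action, obs, states=(0, 1, 2)):
    """One belief update. b is a dict mapping each state to its probability."""
    # 1. Transition update
    b_pred = {s1: sum(transition_prob(s0, action, s1) * b[s0] for s0 in states)
              for s1 in states}
    # 2. Observation update
    b_obs = {s1: observation_prob(s1, action, obs) * b_pred[s1] for s1 in states}
    # 3. Normalize
    Z = sum(b_obs.values())
    return {s1: b_obs[s1] / Z for s1 in states}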

2.1) Practice§

Consider starting from a uniform prior over the charger's location: b_0 = (1/3, 1/3, 1/3). The robot takes the following three actions and receives the corresponding observations.

  1. At step 0, the robot takes action A_0 = \textit{look}(0) and observes O_1 = 0. What's the updated belief b_1?

    Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.

  2. Continue. At step 1, the robot takes action A_1 = \textit{look}(1) and observes O_2 = 1. What's the updated belief b_2?

    Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.

  3. Continue. At step 2, the robot takes action A_2 = \textit{move}(2, 0) and observes O_3 = 0. What's the updated belief b_3?

    Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.

3) Belief Space MDP§

Recall that we can model the sequential process that governs the evolution of the belief state over time as an MDP with a continuous state space. (This is a super-cool result!) What makes it kind of counter-intuitive (and cool!) is that the states in this ''belief MDP'' are probability distributions. What makes it manageable is that, from any given belief state b and action a, there are only as many new beliefs b' as there are possible observations o \in O.

To gain some intuition, let's construct the belief MDP that corresponds to our simple robot POMDP.

  • Belief MDP state space: the probability simplex, that is, the set of vectors of three non-negative numbers that add up to 1, interpreted as a distribution over the 3 possible underlying states of the POMDP.
  • Action space: as before.
  • Reward function: expected value of reward given the belief state.
  • Transition distribution: for each o \in O, we can reach belief state bf(b, a, o) with probability P(o | b, a).

Assume that you have a Bayes filter function \textit{bf}(b, a, o) that takes a belief state b, action a, and observation o, and returns a new belief state b' as you computed in the previous problem.
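
In the same illustrative notation as the sketches above, the observation probability P(o \mid b, a) is exactly the normalizer Z from the Bayes filter, and the belief-MDP transition distribution pairs each possible observation with the belief it leads to. This is a sketch of the idea, not the colab's skeleton:

def observation_given_belief(b, action, obs, states=(0, 1, 2)):
    """P(obs | b, action): marginalize the model over the current belief."""
    return sum(
        observation_prob(s1, action, obs)
        * sum(transition_prob(s0, action, s1) * b[s0] for s0 in states)
        for s1 in states
    )

def belief_transitions(b, action, observations=(0, 1)):
    """Belief-MDP transition distribution: observation o leads to the belief
    bf(b, action, o) with probability P(o | b, action)."""
    result = []
    for o in observations:
        p_o = observation_given_belief(b, action, o)
        if p_o > 0:
            result.append((p_o, bayes_filter(b, action, o)))
    return result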

  1. What is the probability of receiving observation 1 given belief state b and action look(x)?
    Treat b as a function that takes a state and returns the belief for that state, so b(0) is the belief for state 0. Let x be a variable denoting a state. Please input your answer as a Python expression involving the function b and the variable x. For example, b(x) * 0.5 + b(0) * 0.5.

3.1) Transition Model§

Let's work on expressing the transition model of the belief-space process.

  1. If the robot executes action look(x), how many possible resulting belief states are there?

    Please input your answer as a single integer.

  2. What are they?

    Please input your answer as a list of Python expressions. You can use the Bayes filter function bf, which takes three arguments: the belief, the action and the observation. You can represent the belief as b and the action as look(x). For example, you can input [bf(b, look(x), 1)]

  3. If the robot executes action move(x1, x2), how many possible resulting belief states are there?

    Please input your answer as a single integer.

  4. What are they?

    Please input your answer as a list of Python expressions. You can use the Bayes filter function bf, which takes three arguments: the belief, the action and the observation. You can represent the belief as b and the action as move(x1, x2). For example, you can input [bf(b, move(x1, x2), 1)]

3.2) Reward Model§

  1. What is the reward value of taking action charge in belief state b?
    Please input your answer as a Python expression of the belief function b. For example, b(0) * 0.5 + b(1) * 0.5.

3.3) Belief-Space MDP§

In this section, you will implement a function create_belief_mdp that transforms a POMDP into a belief-space MDP. We have provided the basic skeleton for you. In particular, you only need to implement the get_reward and get_transition_distribution functions for the belief MDP. You can use the function belief_filter(pomdp, b, a, o), which is already defined; the implementation of that function is available in the colab.

For reference, our solution is 81 line(s) of code.

4) RHC with Expectimax§

Now that we have made a belief MDP, we can use our MDP methods for online search to decide which actions to take. Let's explore receding horizon control using expectimax!

4.1) Receding Horizon Control§

Use the RHC implementation provided in the colab and answer the questions in Catsoop. Note that the simulation function sets the random seed so that your results will be deterministic.
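
Conceptually, the receding-horizon loop looks roughly like the sketch below; the colab's actual implementation and function names may differ. The three callables are stand-ins for an H-step expectimax search, the simulated environment, and the Bayes filter.

def receding_horizon_control(plan, step_env, update_belief, b0, max_steps):
    """Plan to a fixed horizon, execute only the first action, observe,
    update the belief, and re-plan from the new belief.

    plan(b) -> action                         (e.g. expectimax search to horizon H)
    step_env(action) -> (obs, reward, done)   (interact with the world)
    update_belief(b, action, obs) -> new belief
    """
    b, history = b0, []
    for _ in range(max_steps):
        action = plan(b)
        obs, rew, done = step_env(action)
        b = update_belief(b, action, obs)
        history.append((action, obs, rew))
        if done:
            break
    return history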

4.2) Practice§

Consider a situation in which:

  • the temporal discount (gamma) is set to 1
  • the robot's prior is (b(0), b(1), b(2)) = (0.6, 0.25, 0.15).
  1. Run the RHC with horizon 3. What's the first action that the model will take?

    Select the RHC action for H = 3.

  2. Run the RHC with horizon 4. What's the first action that the model will take?

    Select the RHC action for H = 4.

  3. Why does increasing the horizon produce a policy with a better expected value?

    Mark all reasons that apply.
    Because we should not execute the charge action before we are very sure that the charger is at 0 (having a large belief 'mass' at 0).
    Because we can't move enough belief 'mass' to 0 within 3 steps.
    Because for H=3, the cost of moving or looking exceeds the expected value of charging.

4.3) Policy Tree§

  1. Use the expectimax code and your belief MDP to find the optimal horizon 4 policy tree (with temporal discount factor \gamma = 1.0). You should use the initial belief: b = (0.5, 0.3, 0.2). You may need to make multiple calls with different horizons to do this. Note that we are considering a finite-horizon Belief MDP with horizon=4.

Please enter the policy tree of the Belief MDP as a Python dictionary, defined recursively.

  • The first field of the dictionary is action, which takes values from RobotChargingPOMDP.action_space.
  • The second field of the dictionary is observation, which itself is a dictionary mapping from a possible observation after taking this action to a dictionary, corresponding to the policy tree.
  • When taking the action results in a terminal state (i.e., taking c) or the maximum horizon is reached, the dict does not contain the observation field.

An example solution is the following. (We have beautified the dict format for better visualization. You don't need to format your submission.)

{'action': 'l1',
 'observation': {0: {'action': 'm01',
                     'observation': {0: {'action': 'l2',
                                         'observation': {0: {'action': 'l1'},
                                                         1: {'action': 'nop'}}}}},
                 1: {'action': 'm12',
                     'observation': {0: {'action': 'l2',
                                         'observation': {0: {'action': 'l1'},
                                                         1: {'action': 'nop'}}}}}}}

The semantics of this policy tree is that:

  • At the initial belief, the action to take is l1. Then we have a branch.
  • If we take l1, receive 0, then the next action to take is m01.
  • If we take l1, receive 0, take m01, observe 0 (we can't receive 1 in this case), the next action to take is l2.
  • If we take l1, receive 0, take m01, observe 0, take l2, receive 0, the next action to take is l1 (then we have reached the max horizon, no further branching is required).

5) Most Likely State§

Consider a situation in which:

  • the charger is actually in location 2.
  • the robot's prior is (b(0), b(1), b(2)) = (0.4, 0.3, 0.3).
  • the horizon is indefinite, as in the original problem statement.

We will consider most likely state (MLS) determinization for solving the belief MDP.
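
As a reminder of the idea (in illustrative code, not the colab's API): MLS acts as if the most probable state under the current belief were the true state, and follows an optimal policy for the underlying, fully observable MDP.

def mls_action(b, mdp_policy):
    """b is a dict mapping states to probabilities; mdp_policy(s) returns the
    optimal action for state s in the underlying MDP."""
    s_mls = max(b, key=b.get)   # most likely state under the belief
    return mdp_policy(s_mls)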

  1. What action will the MLS strategy take?

    Select the MLS action.

  2. What is the reward of taking this action?

    Select the reward.

  3. Is MLS a good strategy in this case?

    Pick one.
    Yes
    No

6) Most Likely Observation§

In this section, we will implement a Receding Horizon Control algorithm based on the Most-Likely-Observation approximation.

Recall that Most-Likely-Observation determinization for a POMDP is equivalent to Most-Likely-Outcome determinization in the corresponding belief MDP.
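
As a generic sketch of that idea (not the colab's expectimax_search code, and ignoring discounting and terminal states): the determinized search keeps only the single most probable observation branch at each step, so there is no expectation to take.

def mlo_value(b, depth, actions, transitions, reward):
    """transitions(b, a) -> list of (probability, next belief);
    reward(b, a) -> immediate expected reward of taking a in belief b."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        # keep only the most likely outcome instead of averaging over all of them
        _, b_next = max(transitions(b, a), key=lambda t: t[0])
        best = max(best, reward(b, a) + mlo_value(b_next, depth - 1,
                                                  actions, transitions, reward))
    return best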

6.1) Receding Horizon Control with Most-Likely-Observation§

Complete the RHC-MLO implementation provided in the colab and answer the questions in Catsoop. The only new requirement is to implement expectimax_search_mlo, which, instead of computing the expected reward over all possible observations, focuses only on the most-likely observation.
HINT: You only need very MINIMAL changes to the original expectimax_search code.
HINT: You may find the DiscreteDistribution.max() function useful.

After implementing expectimax_search_mlo, use the following code snippets (in colab) to simulate RHC-MLO, and answer questions in Catsoop. Specifically, we will use gamma = 0.9 and initial belief b = (0.4, 0.3, 0.3) (the same one we used in the Most Likely State problem). Note that we are specifying the actual initial state to be 2.

For reference, our solution is 21 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following function is available in this question environment: create_belief_mdp. You may not need to use it.

6.2) Practice§

Use the expectimax_search_mlo function you just implemented to answer the following questions. Consider a situation in which:

  • the charger is actually in location 2.
  • the robot's prior is (b(0), b(1), b(2)) = (0.4, 0.3, 0.3).
  • we use a horizon of 4 with RHC.
  1. What's the action sequence that your algorithm takes?

    Please input your answer as a Python list of strings defined in RobotChargingPOMDP.action_space. For example, ['l0', 'm20', 'c'].

  2. If your robot finally executes the charge action, what's the robot's belief that the charger is at location 0? If your robot does not charge, enter 0.

    Please input your answer as a Python float number. You should make sure that the precision is at least 1e-4.

  3. We observe that MLO has to execute a lot of steps before charging. Will this tend to result in better overall reward than MLS?

    Pick one.
    Yes
    No

7) QMDP (Optional)§

Now, we will look at the QMDP approximation strategy.
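
As a reminder, the standard QMDP rule scores each action by the belief-weighted optimal Q values of the underlying (fully observable) MDP. A minimal sketch, with illustrative names:

def qmdp_action(b, actions, q_mdp):
    """b maps states to probabilities; q_mdp(s, a) is the optimal Q value of
    the underlying MDP (e.g. computed by value iteration)."""
    return max(actions, key=lambda a: sum(p * q_mdp(s, a) for s, p in b.items()))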

7.1) QMDP§

Complete the QMDP implementation provided in the colab and answer the questions in Catsoop. Here you can use the value function computed by the provided value_iteration algorithm and focus on computing the policy.

For reference, our solution is 21 line(s) of code.

7.2) Practice§

Use the QMDP function you just implemented to answer the following questions. Consider the initial belief b = (0.5, 0.3, 0.2). Set the temporal discount factor to 0.9.

  1. What's the action that QMDP will take based on this belief?

    Please input your answer as a string defined in RobotChargingPOMDP.action_space. For example, l0.

  2. After taking this action, assume the observation you get is 0. What's the updated belief?

    Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.

  3. What's the action that QMDP will take next based on the updated belief?

    Please input your answer as a string defined in RobotChargingPOMDP.action_space. For example, l0.

  4. Let's start again. This time, we will use Receding Horizon Control with expectimax search (not determinized) to compute the policy. Set the horizon to 4. What's the action that RHC will take based on the initial belief b = (0.5, 0.3, 0.2)?

    Please input your answer as a string defined in RobotChargingPOMDP.action_space. For example, l0.

  5. Is QMDP a good POMDP solver for this case?

    Pick one
    Yes
    No

8) Online POMDP search§

Let’s compare two algorithms for online search in a POMDP.

  1. Convert the MDP into a belief MDP, and then use the version of MCTS for MDPs to solve it (let’s call this BMCTS).
  2. Use the POMCP algorithm directly.

Mark all that are true:
If exact belief updates are expensive, POMCP might be a better choice
If you have a very high-dimensional state space, BMCTS might be a better choice
In BMCTS, it would be sensible to aggregate the statistics for the same (belief, action) pair if it occurred at multiple nodes in the tree.
In POMCP, it would never be sensible to aggregate node statistics.

9) Belief Simplex§

In this section we will discuss a useful representation for visualizing a set of beliefs, called the "Belief Simplex".

Specifically, the set of possible belief states over K world states is a simplex, which is a (K-1)-dimensional manifold (surface) embedded in K-dimensional space. In our case, where K = 3, we can visualize the belief space in a 3D plot. In the following simplex visualizations, you will see green areas covering part of the belief space, indicating the set of belief states under discussion.

The figure above shows an illustration of what a simplex looks like in 3D. You can find more details on this Wikipedia Page. Figure credit: blog.bogatron.net.
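
If you want to poke at the simplex yourself, here is a small matplotlib sketch (not part of the assignment) that scatters random beliefs for K = 3 and highlights one region, in the spirit of the green areas in the plots below:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
beliefs = rng.dirichlet(np.ones(3), size=2000)   # points with b(0)+b(1)+b(2) = 1

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
mask = beliefs[:, 0] > 0.5                       # e.g. the region where b(0) > 0.5
ax.scatter(*beliefs[~mask].T, s=2, c='lightgray')
ax.scatter(*beliefs[mask].T, s=2, c='green')
ax.set_xlabel('b(0)'); ax.set_ylabel('b(1)'); ax.set_zlabel('b(2)')
plt.show()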

  1. For each of the following belief states, explain what part of the simplex it corresponds to.

    • b(s = 0) = 1

    • b(s = 0) = 0

    • b(s = 0) > p

    • b(s = 0) = b(s = 1)

  2. Which of the following plots corresponds to belief states in which the value of charging is greater than 0?

    [Four simplex plots: A, B, C, D]

    Select the simplex in which the value of charging is greater than 0.

    Pick one.

10) Optimal Solution§

10.1) Exponential^Exponential§

In general, consider a Partially-Observable Markov Decision Process (POMDP) with action space \mathcal{A}, observation space \mathcal{O}, and horizon h. One way to compute the "optimal" strategy for solving the POMDP is to build a "policy tree". Denote A = |\mathcal{A}| and Z = |\mathcal{O}|.

  1. What's the size of the policy tree of horizon h?
    O(A^h)
    O(Z^h)
    O((AZ)^h)
    O(A^{(Z^h)})

  2. How many possible policy trees are there?
    O(A^h)
    O(Z^h)
    O((AZ)^h)
    O(A^{(Z^h)})

  3. Are there any algorithms for solving POMDPs optimally with a better worst-case complexity than a SPA (Stupidest Possible Algorithm)?
    Yes! :-)
    No... :-(

10.2) Belief Simplex Visualization of Optimal Policies§

Here are four plots showing the optimal finite-horizon policies with \gamma = 1.0 for horizons H = 1, 2, 3, 4. The colors correspond to actions:

  • Green: look(0)
  • Magenta: look(1)
  • Cyan: look(2)
  • Red: move(0, 1)
  • Blue: move(1, 2)
  • Orange: move(2, 0)
  • Brown: charge
  • Grey: nop
[Four simplex plots of the optimal policies: H=1, H=2, H=3, H=4]

Note: red and cyan do not appear in the figures above, but they may appear in figures showing the optimal policies for longer horizons.

  1. The horizon 2 policy has 5 regions. In which regions are we guaranteed to charge on this step or the next one? (Hint: look at the simplex, and use your intuition about the workings of the charge POMDP.)

    Mark all colors that apply.
    Brown
    Green
    Magenta
    Orange
    Grey

  2. The horizon 2 policy has 5 regions. In which regions will we charge on the next step if we get observation 1 on this step?

    Mark all colors that apply.
    Brown
    Green
    Magenta
    Orange
    Grey

  3. In the horizon 2 policy, there are fewer belief states in which we want to charge than there are in the horizon 1 policy. Why?

    Mark all reasons that apply.
    Because discounting makes charging less attractive now
    Because we have time to increase our belief that the charger is in location 0
    Because we lose information over time

  4. Is it true that, in the horizon 2 policy, for all the gray belief states, there is no possible observation that would put us in the brown region of the horizon 1 policy?

    Enter True or False.
    True
    False

  5. If we were to decrease the cost for looking, would the magenta region in the horizon 2 policy be bigger, smaller, or the same?

    Mark the change to the magenta region.
    Bigger
    Smaller
    Same

11) Simple POMDP approximations§

Recall the Fortune or Ruin problem from lecture; a small code sketch of the reward and observation functions follows the specification below.

  • States: the world states have the form (Robot, Money), where Robot \in \{1, \ldots, 7\} is the robot’s location and Money \in \{1, 3\} is the money’s location.
  • Actions: the robot can move into any adjacent square from where it is (the action space is \{Left, Right, Up, Down\}). Actions are deterministic and only change the robot’s location. The problem terminates after the robot reaches loc 1 or loc 3.
  • Reward: R((r, m), a, (r', m')) is
    • 10, if r' = m'.
    • -100, if r' = 1 and m' = 3, or r' = 3 and m' = 1.
    • -1 otherwise.
  • Observations: There are two possible observations: \{1, 3\}. O((r, m)) is
    • 1, if r = 7 and m = 1.
    • 3, if r = 7 and m = 3.
    • 1, in all other cases.
  • Discount factor = 1
  • Initial belief b_0 = \{(5, 1) : .51, (5, 3) : .49\}
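
Here is a small sketch encoding the reward and observation functions specified above, just to make the conventions concrete (illustrative names, not the lecture's code):

def fr_reward(r_next, m_next):
    """R((r, m), a, (r', m')) depends only on the next state (r', m')."""
    if r_next == m_next:
        return 10.0                   # reached the money
    if (r_next, m_next) in {(1, 3), (3, 1)}:
        return -100.0                 # reached the wrong terminal location
    return -1.0                       # step cost

def fr_observation(r, m):
    """The money's location is only revealed from location 7."""
    return m if r == 7 else 1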

11.1) Most-likely State§

Let's consider most likely state (MLS) determinization for solving the belief MDP.

  1. What action would this strategy take in b_0?

    Left
    Right
    Up
    Down

  2. What action would it take in belief \{(2, 1) : .51, (2, 3) : .49\}?

    Left
    Right
    Up
    Down

  3. What is the expected value of this policy starting in belief state \{(2, 1) : .51, (2, 3) : .49\}?

    Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

11.2) QMDP§

  1. The first step in applying Q_MDP is finding the optimal Q value function for the underlying MDP. What are the Q values for the following (s, a) pairs?

    1. ((2, 1), left)

      Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

    2. ((2, 1), right)

      Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

    3. ((2, 1), down)

      Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

    4. ((2, 3), left)

      Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

    5. ((2, 3), right)

      Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

    6. ((2, 3), down)

      Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

  2. What action would Q_MDP take in belief state \{(2, 1) : .51, (2, 3) : .49\}?

    Left
    Right
    Up
    Down

  3. What action would Q_MDP take in belief state \{(4, 1) : .51, (4, 3) : .49\}?

    Left
    Right
    Up
    Down

  4. What can we say about the behavior of Q_MDP relative to the behavior of MLS in this problem?

    Mark all that are True
    MLS is more likely to reach a high-reward state.
    MLS is more likely to reach a low-reward state.
    Without discounting Q_MDP will have an infinite negative value.
    With discounting (e.g. .9) we would prefer the Q_MDP strategy in this case.

11.3) Most-likely Observation§

Recall that this approximation turns our POMDP into a deterministic reward-maximization problem.

  1. If the robot starts in belief \{(6, 1) : .51, (6, 3) : .49\} and executes action ‘right’, what is the next belief state according to this approximation?

    Enter the result as a Python dictionary.

  2. What is the Q value of taking action ‘left’ in belief \{(2, 1) : .51, (2, 3) : .49\}?

    Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

  3. Starting with initial belief \{(5, 1) : .51, (5, 3) : .49\}, how many reachable belief states are there? (Assume that once the robot reaches location 1 or 3, the game is over and so there are no reachable beliefs with the robot at either of those locations.)

    Please input your answer as an integer, e.g. 1

  4. What is the optimal path from belief \{(5, 1) : .51, (5, 3) : .49\} ?

    Enter the result as a list of strings of the given actions, e.g. ['Left','Up','Up'].

  5. What is the total reward of this optimal path?

    Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.

  6. What can we say about the behavior of MLO relative to the other approximate strategies?

    Mark all that are True
    MLO will not seek out information because it is assuming a particular observation.
    The Q_MDP value of a belief would be an admissible heuristic for doing A* in MLO.

  7. Imagine that when moving right from location 6, there is a .1 chance of falling into a maelstrom (r = -10000) and getting a special observation ‘m’. Would this affect the choice of action in MLO?

    Yes
    No

  8. What actually would be the optimal policy in the undiscounted maelstrom version of this problem, with initial belief \{(5, 1) : .51, (5, 3) : .49\}?

    Move directly to location 1.
    Try to move to location 7 to get information and then go to where the money is.
    Just move back and forth in the comfort of the safe, safe hallway forever.

  9. What actually would be the optimal policy in the maelstrom version of this problem, with discount factor 0.9 and initial belief \{(5, 1) : .51, (5, 3) : .49\}?

    Move directly to location 1.
    Try to move to location 7 to get information and then go to where the money is.
    Just move back and forth in the comfort of the safe, safe hallway forever.

12) Feedback§

  1. How many hours did you spend on this homework?

  2. Do you have any comments or suggestions (about the problems or the class)?