HW 9 -- POMDPs
This week's Colab notebook can be found here: Colab Notebook (Click Me!)
1) Fitted Q Value Iteration§
In this problem we'll play with the Mountain Car MDP, a classic control problem. You control a car, constrained to move along a line, by telling it to accelerate left, accelerate right, or not accelerate. The car starts at some point in a valley, and the goal is to get it to the top of the mountain on its right as fast as possible. Here is a possible initial state, where the red dot represents the car:
Formally, here is the description of the MDP:
- \mathcal S = \mathcal X \times \mathcal V, where \mathcal X = [-1.2, 0.6] is the set of possible positions of the car, and \mathcal V = [-0.07, 0.07] is the set of possible velocities of the car
- \mathcal A = \{0, 1, 2\}, denoting acceleration to the left, no acceleration, and acceleration to the right respectively.
- Given state (x, v) and action a, the transition updates v' = v + ((a-1)\alpha + g + \eta)\delta_t, where \alpha is a constant describing the acceleration of the car's engine, g \propto \cos(3 x) is the effect of gravity on the car, and \eta is i.i.d. Gaussian noise. Then, x' = x + v' \delta_t.
- The terminal states are those at the right boundary of the position range, i.e., \{(0.6, v) \mid v \in \mathcal V\}
- The reward is always -1
- \gamma = 1
The MDP is already implemented for you in the class MountainCar. Please take a look at the code and answer the following questions.
1.1) Value from Q§
Compute the optimal value function from the Q function.
For reference, our solution is 3 line(s) of code.
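As a rough sketch (not the notebook's exact API), assume the Q function is a callable q(s, a) over states s and the integer actions 0, 1, 2; the optimal value is just the maximum of Q over actions:

def value_from_q(q, s, actions=(0, 1, 2)):
    # V(s) = max_a Q(s, a)
    return max(q(s, a) for a in actions)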
1.2) Greedy Policy from Q§
Write the greedy policy given a Q function.
For reference, our solution is 6 line(s) of code.
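A minimal sketch under the same assumptions (q(s, a) returns the Q value, actions are the integers 0-2); ties are broken toward the lowest action index:

import numpy as np

def greedy_policy_from_q(q, actions=(0, 1, 2)):
    # Returns a policy: a function mapping a state to the argmax action.
    def policy(s):
        values = [q(s, a) for a in actions]
        return actions[int(np.argmax(values))]
    return policy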
1.3) Epsilon Greedy Policy from Q§
Write the epsilon greedy policy given a Q function. With probability epsilon, the policy should pick a random action, and with probability 1-epsilon, pick the greedy action. You may use the greedy_policy_from_q function. You may find rng.choice and rng.random useful.
For reference, our solution is 6 line(s) of code.
In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: greedy_policy_from_q. You may not need to use all of them.
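A hedged sketch, assuming rng is a NumPy Generator (e.g. np.random.default_rng()) and greedy_policy_from_q is the function from the previous part:

def epsilon_greedy_policy_from_q(q, epsilon, rng, actions=(0, 1, 2)):
    greedy = greedy_policy_from_q(q, actions)
    def policy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)   # explore: uniformly random action
        return greedy(s)                 # exploit: greedy action
    return policy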
1.4) Sampling points in grid§
Next, we want to sample points in the state space on which to perform Bellman backups. Look at the colab for the function sample_grid_points, which samples the state space in an evenly spaced grid and returns the sampled points and their corresponding transitions for every possible action.
1.5) Sampling points from policy§
Another way to sample points is to collect them by rolling out trajectories from a policy. Implement a function that samples several trajectories from a given policy and concatenates them to produce a list of (state, action, reward, next_state) tuples.
For reference, our solution is 8 line(s) of code.
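A rough sketch only; the method names init_state, step, and is_terminal below are placeholders for whatever interface the MountainCar class actually exposes in the notebook:

def sample_policy_points(mdp, policy, num_trajectories=10, max_steps=200):
    samples = []                                  # list of (s, a, r, s') tuples
    for _ in range(num_trajectories):
        s = mdp.init_state()
        for _ in range(max_steps):
            a = policy(s)
            s_next, r = mdp.step(s, a)            # one transition of the MDP
            samples.append((s, a, r, s_next))
            if mdp.is_terminal(s_next):
                break
            s = s_next
    return samples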
1.6) Fitted Q Visualization§
Complete the fitted_Q_learn function given in the colab notebook. You're now ready to conduct fitted Q learning on the Mountain Car problem! You may use either sampling method (grid or policy) and either regression method (KNNRegressor or NeuralRegressor).
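For intuition, here is a sketch of the fitted Q iteration loop. The names below are hypothetical stand-ins for the notebook's API: samples is a list of (s, a, r, s') tuples, make_regressor() returns a fresh KNNRegressor/NeuralRegressor-style object with fit/predict, and gamma = 1 as in the MDP above:

import numpy as np

def fitted_q_learn_sketch(samples, make_regressor, actions=(0, 1, 2),
                          num_iters=20, gamma=1.0, terminal_x=0.6):
    featurize = lambda s, a: np.concatenate([np.asarray(s, float), [a]])
    q = None                                           # Q_0 is identically 0
    for _ in range(num_iters):
        X, y = [], []
        for s, a, r, s_next in samples:
            if q is None or s_next[0] >= terminal_x:   # terminal: no bootstrapping
                target = r
            else:
                target = r + gamma * max(
                    q.predict([featurize(s_next, b)])[0] for b in actions)
            X.append(featurize(s, a))
            y.append(target)
        q = make_regressor()                           # refit Q to the Bellman backups
        q.fit(np.array(X), np.array(y))
    return q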
Run fitted Q learning using sample_grid_points(x_division=4, v_division=4) and 20 iterations. You may use either KNNRegressor or NeuralRegressor. Visualize the fitted Q and the final greedy policy. What explains your observations?
Hint: Use the examples run_fitted_q_example_1 and run_fitted_q_example_2 in the colab as a starting point.
- Grid sampling scales exponentially with the dimension of the state space
- The granularity of the grid points is too coarse to capture the MDP transition
- There were not enough iterations for the Q function to converge
- The regression method overfitted the Q function
Now run fitted Q learning using sample_policy_points with eps=0.05 and 20 iterations. You may use either KNNRegressor or NeuralRegressor. Visualize the fitted Q and the final greedy policy. What explains your observations?
- The policy did not explore enough of the state space because eps is too low
- The points obtained were too far off policy because eps is too high
- The regression method overfitted the Q function
- The sample points change when the policy changes
Which of the following statements comparing the two sampling methods are true?
- Grid sampling scales exponentially with the dimension of the state space, whereas policy sampling does not.
- Policy sampling should always be preferred over grid sampling.
- Policy sampling tends to be more unstable compared to grid sampling.
Hint: Make sure your sampler gets good coverage of the state space.
Hint: You can sample your learned policy multiple times to find a trajectory with fewer than 13 actions.
Hint: Simply print() the final trajectory and copy it into the answer box.
2) Set-Based Uncertainty§
In this section, we will consider a planning problem in a non-deterministic domain. Specifically, consider a robot moving on a 2D grid world.
The robot can take 4 actions: moving N (up) by 1 grid cell, moving S (down), moving W (left), and moving E (right). If an action would run the robot into a wall, the robot does not move. The state is simply a tuple (r, c) indicating the robot's position (row and column).
We will consider the following 4 by 4 map with walls surrounding it. There are no interior obstacles. For example, the following figure shows a visualization of the state (1, 2) (X is the robot position and # indicates wall).
  0123
 ######
0#    #
1#  X #
2#    #
3#    #
 ######
We will be using sets to represent belief states. For example, if the belief is that the robot's position is either (0, 0) or (1, 1), the belief state can be represented as {(0, 0), (1, 1)}. In the problems below, belief states will be input as lists, e.g. [(0, 0), (1, 1)], but the order of the elements doesn't matter.
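For concreteness, here is a minimal sketch of set-based belief updates on this 4-by-4 grid (deterministic motion only; the slippery variants below would need a successor set per state). The observe helper returns the (n, s, e, w) wall tuple used later in problem 2.4:

MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
SIZE = 4

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    # bumping into a wall: stay put
    return (nr, nc) if 0 <= nr < SIZE and 0 <= nc < SIZE else (r, c)

def observe(state):
    r, c = state
    return (int(r == 0), int(r == SIZE - 1), int(c == SIZE - 1), int(c == 0))

def update_belief(belief, action, observation=None):
    next_belief = {step(s, action) for s in belief}   # all possible next states
    if observation is not None:                       # keep only consistent states
        next_belief = {s for s in next_belief if observe(s) == observation}
    return next_belief

# e.g. starting from "anywhere" and executing 'N' repeatedly:
# b = {(r, c) for r in range(SIZE) for c in range(SIZE)}
# for a in ['N', 'N', 'N']: b = update_belief(b, a)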
2.1) Conformant Plan 1§
Consider the following setting.
- The initial belief is that the robot could be anywhere.
- There are no observations.

- What is the belief state after the action sequence ['N', 'N', 'N']? Enter a Python list of tuples.
- What is a shortest sequence of actions to reach the belief state [(2, 2)]? Enter a Python list of strings.
2.2) Conformant Plan 2§
Now, what if your wheels are slippery, but only to the south? If you execute the action S, you might (non-deterministically) move 1 or 2 squares downward, except, of course, that you can't move through walls. In this setting, consider the same planning task:
- The initial belief is that the robot could be anywhere.
- There are no observations.
- The goal is to be certain it is in (2, 2).

What is a shortest sequence of actions to reach the belief state [(2, 2)]? Enter a Python list of strings, or 'None' if one does not exist.
2.3) Too Slippery§
Now, what if your wheels are slippery both to the north and to the south, in the same sense as above (they non-deterministically move 1 or 2 squares)?
- The initial belief is that the robot could be anywhere.
- There are no observations.

- [(2,2)]
- [(1,2),(2,2)]
- [(0,2),(1,2),(2,2)]
- [(0,2),(1,2),(2,2),(3,2)]
- [(0,1),(1,1),(2,1),(3,1),(0,2),(1,2),(2,2),(3,2)]
- [(0,0),(1,0),(2,0),(3,0),(0,1),(1,1),(2,1),(3,1),(0,2),(1,2),(2,2),(3,2)]
2.4) Using Sensors§
Now let's use the sensors. Suppose we have a sensor that gives us observations of walls next to us. Specifically, the observation at each state is a tuple of length 4 containing the values (n, s, e, w) (note the order). Each entry indicates whether there is a wall in the adjacent grid cell to the north/south/east/west, respectively. For example, (0, 0, 0, 0) indicates that the robot is in a square that is not adjacent to any walls.

Consider the following setting.
- The initial belief is that the robot is in one of the internal squares [(1, 1), (1, 2), (2, 1), (2, 2)].
- After each action, the robot makes the observation described above: whether there is a wall in each direction.
- The robot has deterministic N, S, E, W movement. We are no longer slippery.
- The goal is to be certain it is in (2, 2).
- What's the belief state (before considering the observation) if we take an action to move South? Enter a Python list of tuples.
- What's the belief state if we start in the initial belief, take an action to move South, and get the observation (0, 1, 0, 0)?
- Starting from the initial belief, suppose we move one step South and get the observation (0, 1, 0, 0); then, in the second step, we move East and get the observation (0, 1, 0, 0). What's the belief state?
2.5) Conditional Plan in Belief Space§
Now let's consider making a simple conditional plan to reach the belief state [(2, 2)]. We have already done some of the work, and decided to start by executing actions 'S' and then 'E'. Now the robot will make an observation, and we need you to determine what subsequent actions it would need to execute to reach the target belief.
The initial belief is that the robot is in one of the internal squares [(1, 1), (1, 2), (2, 1), (2, 2)].

For each of the possible observations below, give an unconditional sequence of actions (a conformant plan) that reaches the goal if such a sequence exists, or None if none exists. The empty sequence [] means to not move.
- Observation is (0, 0, 0, 0). Enter a list representing an action sequence (as a list of strings) or None.
- Observation is (0, 0, 1, 0). Enter a list representing an action sequence (as a list of strings) or None.
- Observation is (0, 1, 0, 0). Enter a list representing an action sequence (as a list of strings) or None.
- Observation is (0, 1, 1, 0). Enter a list representing an action sequence (as a list of strings) or None.
The Robot Story
Our robot has lost its charger! The robot lives in a world with K=3 locations but it needs to have the charger in location 0 (where the power outlet is) in order to plug it in and charge.
The robot doesn't have any uncertainty about its own location (and we won't model that) but it does have uncertainty about the location of the charger. When the robot looks in location loc, by executing a look(loc) action, it will receive an observation in \{0, 1\}. The low battery means that its sensors aren't working very well.
- If the charger is in the location loc, it will get observation 1 with probability 0.9.
- If the charger is not in location loc, it will get observation 1 with probability 0.4.
The robot can also try to execute a move(loc1, loc2) (which moves the charger):
- If the charger is actually in loc1, then with probability 0.8 the charger will be moved to loc2; otherwise, it will stay in loc1.
- If the charger is not in loc1, then nothing will happen.
Because of the constraints of the robot's arm, we can't do all possible move actions. Specifically, the only valid movements are move(0, 1), move(1, 2), and move(2, 0).
The robot has two more actions, charge and nop:
- If it executes the charge action when the charger is in location 0, then it gets a reward of 10 and the game terminates.
- If it executes that action in any other state, it gets reward -100 and the game terminates.
In all other states, it gets reward -0.1 for every look action and -0.5 for every move action.
Finally, it has a nop action, with reward 0, which does not provide a useful observation, and does not affect the world state.
For actions that don't yield useful information about the environment (move, charge, and nop), we assume that they get observation 0 with probability 1, independent of the state.
3) Robot HMM§
Before worrying about how to select actions for the robot, we will think about how the robot can estimate where its charger is, given a history of actions and observations. This problem should seem familiar! A POMDP, ignoring the problem of choosing actions, is a Hidden Markov Model (HMM). We can solve the problem of computing P(S_{t+1} \mid O_2, \ldots, O_{t+1}, A_1, \ldots, A_t) using the generic sum-product method, or HMM forward inference. You might also remember this computation as the "Bayes filter" from Lecture 12.
Bayes Filter
Let b_t = P(S_{t} \mid O_{2:t}, A_{1:t-1}) be our belief at time t about the hidden state S_t. In the robot-charger problem, since there are K=3 possible locations for the charger, we can represent the state space as \mathcal{S} = \{s_0, s_1, s_2\}, where s_i indicates that the charger is at location i. Thus, a belief b_t over the current state can be represented as a 3-tuple (p_0, p_1, p_2) with 0 \le p_i \le 1 and \sum_i p_i = 1, where p_i is the probability that the state is s_i.
The equation for computing our belief b_{t+1} at time t+1, given our current belief b_t, an action A_t=a_t, and an observation O_{t+1}=o_{t+1}, is:

b_{t+1}(s_{t+1}) = \frac{P(O_{t+1}=o_{t+1} \mid S_{t+1}=s_{t+1}, A_t=a_t) \sum_{s_t} P(S_{t+1}=s_{t+1} \mid S_t=s_t, A_t=a_t)\, b_t(s_t)}{\sum_{s'} P(O_{t+1}=o_{t+1} \mid S_{t+1}=s', A_t=a_t) \sum_{s_t} P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)\, b_t(s_t)}
It is also possible to understand the above equation as updating the belief in three steps:
- Transition update: For each s_{t+1} \in \mathcal{S}, b'(s_{t+1}) = \sum_{s_t} P(S_{t+1}=s_{t+1} \mid S_{t}=s_t, A_{t}=a_t)b_t(s_t).
- Observation update: For each s_{t+1} \in \mathcal{S}, b''(s_{t+1}) = P(O_{t+1}=o_{t+1} \mid S_{t+1}=s_{t+1}, A_t=a_t)b'(s_{t+1}).
- Normalize: Let Z = \sum_{s} b''(s). Then for each s_{t+1} \in \mathcal{S}, b_{t+1}(s_{t+1}) = \frac{1}{Z}b''(s_{t+1}).
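A direct transcription of these three steps into Python might look like the following sketch, assuming the models are supplied as functions trans_prob(s_next, s, a) and obs_prob(o, s_next, a) (hypothetical names) and a belief is a tuple of K probabilities:

def bayes_filter(belief, action, observation, trans_prob, obs_prob):
    K = len(belief)
    # 1) transition update
    predicted = [sum(trans_prob(s_next, s, action) * belief[s] for s in range(K))
                 for s_next in range(K)]
    # 2) observation update
    unnormalized = [obs_prob(observation, s_next, action) * predicted[s_next]
                    for s_next in range(K)]
    # 3) normalize
    Z = sum(unnormalized)
    return tuple(p / Z for p in unnormalized)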
3.1) Practice§
Consider starting from a uniform prior over the charger's location: b_0 = (1/3, 1/3, 1/3). The robot takes the following three actions and receives the corresponding observations.
- At step 0, the robot takes action A_0 = \textit{look}(0) and observes O_1 = 0. What's the updated belief b_1? Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.
- Continuing, at step 1 the robot takes action A_1 = \textit{look}(1) and observes O_2 = 1. What's the updated belief b_2? Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.
- Continuing, at step 2 the robot takes action A_2 = \textit{move}(2, 0) and observes O_3 = 0. What's the updated belief b_3? Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.
4) Belief Space MDP§
Recall that we can model the sequential process that governs the evolution of the belief state over time as an MDP in continuous space. (This is a super-cool result!) What makes it kind of counter-intuitive (and cool!) is that the states in this ''belief MDP'' are probability distributions. What makes it manageable is that, from any given belief state b and action a, there are only as many new beliefs b' as there are possible observations o \in O.
To gain some intuition, let's construct the belief MDP that corresponds to our simple robot POMDP.
- Belief MDP state space: the probability simplex in \mathbb{R}^3, that is, the set of vectors of three non-negative numbers that sum to 1, interpreted as a distribution over the 3 possible underlying states of the POMDP.
- Action space: as before.
- Reward function: expected value of reward given the belief state.
- Transition distribution: for each o \in O, we can reach belief state bf(b, a, o) with probability P(o | b, a).
Assume that you have a Bayes filter function \textit{bf}(b, a, o) that takes a belief state b, action a, and observation o, and returns a new belief state b' as you computed in the previous problem.
- What is the probability of receiving observation 1 given belief state b and action look(x)? Treat b as a function of a state that returns the belief for that state, so b(0) is the belief for state 0. Let x be a variable denoting a state. Please input your answer as a Python expression involving the function b and the variable x. For example, b(x) * 0.5 + b(0) * 0.5.
4.1) Transition Model§
Let's work on expressing the transition model of the belief-space process.
- If the robot executes action look(x), how many possible resulting belief states are there? Please input your answer as a single integer.
- What are they? Please input your answer as a list of Python expressions. You can use the Bayes filter function bf, which takes three arguments: the belief, the action, and the observation. You can represent the belief as b and the action as look(x). For example, you can input [bf(b, look(x), 1)].
- If the robot executes action move(x1, x2), how many possible resulting belief states are there? Please input your answer as a single integer.
- What are they? Please input your answer as a list of Python expressions. You can use the Bayes filter function bf, which takes three arguments: the belief, the action, and the observation. You can represent the belief as b and the action as move(x1, x2). For example, you can input [bf(b, move(x1, x2), 1)].
4.2) Reward Model§
- What is the expected reward value of taking action charge in belief state b? Please input your answer as a Python expression of the belief function b. For example, b(0) * 0.5 + b(1) * 0.5.
4.3) Belief-Space MDP§
In this section, you will implement a function create_belief_mdp, which transforms a POMDP into a belief-space MDP.
We have provided the basic skeleton for you. In particular, you only need to implement the get_reward and get_transition_distribution functions for the belief MDP. You can use the function belief_filter(pomdp, b, a, o), which is already defined; the implementation of that function is available in the colab.
For reference, our solution is 81 line(s) of code.
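As a sketch of the two pieces to fill in (not the notebook's exact class layout), assume states are the integers 0..K-1 so a belief b can be indexed by state, that beliefs are hashable tuples, and that the POMDP exposes reward(s, a), trans_prob(s_next, s, a), obs_prob(o, s_next, a), and an observations collection (hypothetical names). belief_filter is the Bayes filter mentioned above:

def get_reward(pomdp, b, a):
    # expected immediate reward under the belief
    return sum(b[s] * pomdp.reward(s, a) for s in pomdp.states)

def get_transition_distribution(pomdp, b, a):
    dist = {}
    for o in pomdp.observations:
        # P(o | b, a) = sum_{s'} P(o | s', a) sum_s P(s' | s, a) b(s)
        p_o = sum(pomdp.obs_prob(o, s_next, a) *
                  sum(pomdp.trans_prob(s_next, s, a) * b[s] for s in pomdp.states)
                  for s_next in pomdp.states)
        if p_o > 0:
            dist[belief_filter(pomdp, b, a, o)] = p_o   # next belief and its probability
    return dist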
5) RHC with Expectimax§
Now that we have made a belief MDP, we can use our MDP methods for online search to decide actions to take. Let's explore receding horizon control using expectimax!
5.1) Receding Horizon Control§
Use the RHC implementation provided in the colab and answer the questions in Catsoop. Note that the simulation function sets the random seed so that your results will be deterministic.
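The overall loop has this shape (a sketch; expectimax_search, the simulator's execute method, and belief_filter are placeholders for the notebook's versions):

def run_rhc(belief_mdp, pomdp, b, horizon, max_steps=20):
    for _ in range(max_steps):
        _, a = expectimax_search(belief_mdp, b, horizon)   # plan to depth `horizon`
        o, done = pomdp.execute(a)                         # act in the real world, observe
        b = belief_filter(pomdp, b, a, o)                  # update the belief
        if done:
            break
    return b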
5.2) Practice§
Consider a situation in which:
- the temporal discount (gamma) is set to 1
- the robot's prior is (b(0), b(1), b(2)) = (0.6, 0.25, 0.15).
- Run the RHC with horizon 3. What's the first action that the model will take? Select the RHC action for H = 3.
- Run the RHC with horizon 4. What's the first action that the model will take? Select the RHC action for H = 4.
- Why does increasing the horizon produce a policy with a better expected value? Mark all reasons that apply.
  - Because we should not execute the charge action before we are very sure that the charger is at 0 (having a large belief 'mass' at 0).
  - Because we can't move enough belief 'mass' to 0 within 3 steps.
  - Because for H=3, the cost of moving or looking exceeds the expected value of charging.
5.3) Policy Tree (Optional)§
- Use the expectimax code and your belief MDP to find the optimal horizon-4 policy tree (with temporal discount factor \gamma = 1.0). You should use the initial belief b = (0.5, 0.3, 0.2). You may need to make multiple calls with different horizons to do this. Note that we are considering a finite-horizon belief MDP with horizon = 4.

Please enter the policy tree of the belief MDP as a Python dictionary, defined recursively.
- The first field of the dictionary is action, which takes values from RobotChargingPOMDP.action_space.
- The second field of the dictionary is observation, which is itself a dictionary mapping each possible observation after taking this action to a dictionary corresponding to the sub-policy tree.
- When taking the action results in a terminal state (i.e., taking c) or the maximum horizon is reached, the dict does not contain the observation field.
An example solution is the following. (We have beautified the dict format for better visualization. You don't need to format your submission.)
{'action': 'l1',
'observation': {0: {'action': 'm01',
'observation': {0: {'action': 'l2',
'observation': {0: {'action': 'l1'},
1: {'action': 'nop'}}}}},
1: {'action': 'm12',
'observation': {0: {'action': 'l2',
'observation': {0: {'action': 'l1'},
1: {'action': 'nop'}}}}}}}
The semantics of this policy tree is that:
- At the initial belief, the action to take is l1. Then we have a branch.
- If we take l1, receive 0, then the next action to take is m01.
- If we take l1, receive 0, take m01, observe 0 (we can't receive 1 in this case), the next action to take is l2.
- If we take l1, receive 0, take m01, observe 0, take l2, receive 0, the next action to take is l1 (then we have reached the max horizon, no further branching is required).
6) Most Likely State§
Consider a situation in which:
- the charger is actually in location 2.
- the robot's prior is (b(0), b(1), b(2)) = (0.4, 0.3, 0.3).
- the horizon is indefinite, as in the original problem statement.
We will consider most likely state (MLS) determinization for solving the belief MDP.
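As a reminder, MLS acts as the optimal policy for the underlying (fully observable) MDP would act in the single most probable state. A sketch, assuming mdp_policy comes from value iteration on the underlying MDP:

def mls_action(b, mdp_policy):
    most_likely = max(range(len(b)), key=lambda s: b[s])   # argmax of the belief
    return mdp_policy(most_likely)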
- What action will the MLS strategy take? Select the MLS action.
- What is the reward of taking this action? Select the reward.
- Is MLS a good strategy in this case? Pick one: Yes / No
7) Most Likely Observation§
In this section, we will implement a Receding Horizon Control algorithm based on the Most-Likely-Observation approximation.
Recall that Most-Likely-Observation determinization for a POMDP is equivalent to Most-Likely-Outcome determinization in the corresponding belief MDP.
7.1) Receding Horizon Control with Most-Likely-Observation§
Complete the RHC-MLO implementation provided in the colab and answer the questions in Catsoop.
The only new requirement is to implement expectimax_search_mlo, which, instead of computing the expected reward over all possible observations, focuses only on the most likely observation.
HINT: You only need very MINIMAL changes to the original expectimax_search code.
HINT: You may find the DiscreteDistribution.max() function useful.
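One possible shape for the determinized recursion is sketched below. It reuses the belief MDP interface from Problem 4.3 (get_reward, and get_transition_distribution returning {next_belief: probability}); those names, and the convention that a terminal action yields an empty distribution, are assumptions rather than the notebook's exact code:

def expectimax_search_mlo(belief_mdp, b, horizon, gamma=0.9):
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float('-inf'), None
    for a in belief_mdp.actions:
        dist = belief_mdp.get_transition_distribution(b, a)
        if dist:
            b_next = max(dist, key=dist.get)   # recurse on the most likely next belief only
            future, _ = expectimax_search_mlo(belief_mdp, b_next, horizon - 1, gamma)
        else:
            future = 0.0                       # terminal action (e.g. charge)
        value = belief_mdp.get_reward(b, a) + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action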
After implementing expectimax_search_mlo, use the following code snippets (in colab) to simulate RHC-MLO, and answer the questions in Catsoop. Specifically, we will use gamma = 0.9 and initial belief b = (0.4, 0.3, 0.3) (the same one we used in the Most Likely State problem). Note that we are specifying the actual initial state to be 2.
For reference, our solution is 21 line(s) of code.
In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: create_belief_mdp. You may not need to use all of them.
7.2) Practice§
Use the expectimax_search_mlo function you just implemented to answer the following questions. Consider a situation in which:
- the charger is actually in location 2.
- the robot's prior is (b(0), b(1), b(2)) = (0.4, 0.3, 0.3).
- we use a horizon of 4 with RHC.
- What's the action sequence that your algorithm takes? Please input your answer as a Python list of strings defined in RobotChargingPOMDP.action_space. For example, ['l0', 'm20', 'c'].
- If your robot eventually executes the charge action, what is the robot's belief at that point that the charger is at position 0? If your robot does not charge, enter 0. Please input your answer as a Python float number. You should make sure that the precision is at least 1e-4.
- We observe that MLO has to execute a lot of steps before charging. Will this tend to result in better overall reward than MLS? Pick one: Yes / No
8) QMDP (Optional)§
Now, we will look at the QMDP approximation strategy.
8.1) QMDP§
Complete the QMDP implementation provided in the colab and answer the questions in Catsoop. Here you can use the value function computed by the value_iteration algorithm provided and focus on computing the policy.
For reference, our solution is 21 line(s) of code.
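The QMDP rule itself is tiny: score each action by the expectation, under the belief, of the underlying MDP's Q values, then act greedily. A sketch, assuming q_mdp(s, a) comes from the provided value_iteration result:

def qmdp_action(b, actions, q_mdp):
    def score(a):
        # Q_MDP(b, a) = sum_s b(s) * Q(s, a)
        return sum(b[s] * q_mdp(s, a) for s in range(len(b)))
    return max(actions, key=score)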
8.2) Practice§
Use the QMDP function you just implemented to answer the following questions. Consider the initial belief b = (0.5, 0.3, 0.2). Set the temporal discount factor to 0.9.
- What's the action that QMDP will take based on this belief? Please input your answer as a string defined in RobotChargingPOMDP.action_space. For example, l0.
- After taking this action, assume the observation you get is 0. What's the updated belief? Please input your answer as a Python list of three float numbers. You should make sure that the precision is at least 1e-4.
- What's the action that QMDP will take next, based on the updated belief? Please input your answer as a string defined in RobotChargingPOMDP.action_space. For example, l0.
- Let's start again. This time, we will use Receding Horizon Control with expectimax search (not determinized) to compute the policy. Set the horizon to 4. What's the action that RHC will take based on the initial belief b = (0.5, 0.3, 0.2)? Please input your answer as a string defined in RobotChargingPOMDP.action_space. For example, l0.
- Is QMDP a good POMDP solver for this case? Pick one: Yes / No
9) Simple POMDP approximations§

Recall the Fortune or Ruin problem from lecture.
- States: the world states have the form (Robot, Money), where Robot \in \{1, \ldots, 7\} indicates the robot's location and Money \in \{1, 3\} indicates the money's location.
- Actions: the robot can move into any adjacent square from its current location (the action space is \{Left, Right, Up, Down\}). Actions are deterministic and only change the robot's location. The problem terminates after the robot reaches loc 1 or loc 3.
- Reward: R((r, m), a, (r', m')) is
  - 10, if r' = m'.
  - -100, if r' = 1 and m' = 3, or r' = 3 and m' = 1.
  - -1, otherwise.
- Observations: there are two possible observations, \{1, 3\}. O((r, m)) is
  - 1, if r = 7 and m = 1.
  - 3, if r = 7 and m = 3.
  - 1, in all other cases.
- Discount factor = 1
- Initial belief b_0 = \{(5, 1) : .51, (5, 3) : .49\}
9.1) Most-likely State§
Let's consider most likely state (MLS) determinization for solving the belief MDP.
- What action would this strategy take in b_0? Left / Right / Up / Down
- What action would it take in belief \{(2, 1) : .51, (2, 3) : .49\}? Left / Right / Up / Down
- What is the expected value of this policy starting in belief state \{(2, 1) : .51, (2, 3) : .49\}? Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.
9.2) QMDP§
- The first step in applying Q_MDP is finding the optimal Q value function for the underlying MDP. What are the Q values for the following (s, a) pairs? Please input each answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.
  - ((2, 1), left)
  - ((2, 1), right)
  - ((2, 1), down)
  - ((2, 3), left)
  - ((2, 3), right)
  - ((2, 3), down)
- What action would Q_MDP take in belief state \{(2, 1) : .51, (2, 3) : .49\}? Left / Right / Up / Down
- What action would Q_MDP take in belief state \{(4, 1) : .51, (4, 3) : .49\}? Left / Right / Up / Down
- What can we say about the behavior of Q_MDP relative to the behavior of MLS in this problem? Mark all that are true:
  - MLS is more likely to reach a state with the maximum reward possible.
  - MLS is more likely to reach a state with the lowest reward possible.
  - Without discounting, Q_MDP will have an infinite negative value.
  - With discounting (e.g. .9), we would prefer the Q_MDP strategy in this case.
9.3) Most-likely Observation§
Recall that this approximation turns our POMDP into a deterministic reward-maximization problem.
- If the robot starts in belief \{(6, 1) : .51, (6, 3) : .49\} and executes action 'right', what is the next belief state according to this approximation? You only need to include beliefs with positive probability in your answer. Enter the result as a Python dictionary.
- What is the Q value of taking action 'left' in belief \{(2, 1) : .51, (2, 3) : .49\}? Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.
- Starting with initial belief \{(5, 1) : .51, (5, 3) : .49\}, how many reachable belief states are there? (Assume that once the robot reaches location 1 or 3, the game is over, so there are no reachable beliefs with the robot at either of those locations.) Hint: the world is in exactly one of the two possible configurations (e.g., the gold is actually in location 1). Please input your answer as an integer, e.g. 1.
- What is the optimal path from belief \{(5, 1) : .51, (5, 3) : .49\}? Enter the result as a list of strings of the given actions, e.g. ['Left','Up','Up'].
- What is the total reward of this optimal path? Please input your answer as a Python float number or an equation that can be evaluated. You should make sure that the precision is at least 1e-4.
- What can we say about the behavior of MLO relative to the other approximate strategies? Mark all that are true:
  - MLO will not seek out information because it is assuming a particular observation.
  - The Q_MDP value of a belief would be an admissible heuristic for doing A* in MLO.
- Imagine that when moving right from location 6, there is a .1 chance of falling into a maelstrom (r = -10000) and getting a special observation 'm'. Would this affect the choice of action in MLO? Yes / No
- What actually would be the optimal policy in the undiscounted maelstrom version of this problem, with initial belief \{(5, 1) : .51, (5, 3) : .49\}?
  - Move directly to location 1.
  - Try to move to location 7 to get information and then go to where the money is.
  - Just move back and forth in the comfort of the safe, safe, hallway forever.
- What actually would be the optimal policy in the maelstrom version of this problem, with discount factor 0.9 and initial belief \{(5, 1) : .51, (5, 3) : .49\}?
  - Move directly to location 1.
  - Try to move to location 7 to get information and then go to where the money is.
  - Just move back and forth in the comfort of the safe, safe, hallway forever.
10) (Bonus) Real-World Applications§
Advances in learning and planning within MDPs - especially Q-learning - have led to several breakthroughs in modern AI: most notably, they enabled AI to master Atari games as well as complex board games like Go and Shogi. They have also been used to optimize the design of new chips and even to control hot-air balloons navigating wind currents.

POMDPs are used across several applications within sub-fields of AI; we've already touched on some of their uses in robotics, but there have also been proposals to use them to plan the diagnosis and treatment of medical conditions like heart disease.
11) Feedback§
- How many hours did you spend on this homework?
- Do you have any comments or suggestions (about the problems or the class)?