Home / homework / HW 8 -- MDPs

HW 8 -- MDPs

The questions below are due on Wednesday November 15, 2023; 11:59:00 PM.
 
You are not logged in.

Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.

The due of this Homework has been extended to Wed. Nov. 15, 2023, 11:00pm. See Piazza note for more details.

This week's Colab notebook can be found here: Colab Notebook (Click Me!)

1) Decision Theory§

1.1) Playing the Lottery§

Economists often make use of an exponential utility function for money: U(x)=-e^{-x/R}, where R is a positive constant representing an individual’s risk tolerance. Risk tolerance reflects how likely an individual is to accept a lottery with a particular expected monetary value (EMV) versus some certain payoff. As R (which is measured in the same units as x) becomes larger, the individual becomes less risk-averse.

Assume Mary has an exponential utility function with R=\$400. Mary is given the choice between receiving \$400 with certainty (probability 1) or participating in a lottery which has a 60\% probability of winning \$5000 and a 40\% probability of winning nothing.

  1. What is Mary's utility for participating in the lottery?

    Please input a number (2 decimal places).

  2. What is Mary's utility for not participating in the lottery?

    Please input a number (2 decimal places).

  3. Assuming Mary acts rationally, which option would she choose?

    Mary will participate in the lottery
    Mary won't participate in the lottery

  4. Consider the choice between receiving \$100 with certainty (probability 1) or participating in a lottery which has a 50\% probability of winning \$500 and a 50\% probability of winning nothing. Approximate the value of R in an exponential utility function that would cause an individual to be indifferent to these two alternatives. (You might find it helpful to write a short program to help you solve this problem.)

    Please input a number (3 decimal places).

1.2) A Trip to Leningrad§

In 1713, Nicolas Bernoulli stated a puzzle, now called the St. Petersburg paradox, which works as follows. You have the opportunity to play a game in which a fair coin is tossed repeatedly until it comes up heads. If the first heads appears on the nth toss, you win 2^n dollars.

  1. What is the expected monetary value of this game?

    Please a number (2 decimal places) or one of the quoted strings: "+inf" or "-inf".

  2. How much would you, personally, pay to play the game?

    Please input an integer.

  3. Nicolas’s cousin Daniel Bernoulli resolved the apparent paradox in 1738 by suggesting that the utility of money is measured on a logarithmic scale (i.e., U(S_n)=a\log_2{n}+b, where S_n is the state of having n). What is the expected utility of the game under this assumption?

    Please input a Python formula in terms of variables: a and b; you can use functions log() and exp().

  4. Assuming that R amount of money received at step t is actually worth \gamma^t R (where t=0 for the first step), what is the game worth in expectation?

    Please input a Python formula in terms of gamma; you can use functions log() and exp().

2) Reductions§

In this section, we will consider reductions between MDPs. Intuitively, A can be reduced to B if B is at least as hard as A. More formally, we will say that a set of MDPs A can be reduced to another set of MDPs B if it is possible to do the following:

  1. Given an MDP in A, convert it to an MDP in B (by any procedure, ignoring its complexity);
  2. Given that MDP in B, find an optimal policy, e.g., using value iteration;
  3. Convert the optimal policy back into an optimal policy for the original MDP in A (using some fixed polynomial-time procedure).

For example, if A is the set of all MDPs with 0 < R(s, a, s') < 100 for all s, a, s', and B is the set of all MDPs with all 0 < R(s, a, s') < 1, then A can be reduced to B: given an MDP in A, we can construct an MDP in B by dividing all rewards by 100. The resulting optimal policy for the MDP in B will automatically be an optimal policy for original MDP, so the conversion (step 3) is trivial.

Another example: if A is the set of all MDPs with action space {0, 1}, and B is the set of all MDPs with action space {"up", "down", "left", "right"}, then A can be reduced to B: given an MDP in A, we can construct an MDP in B by replacing all instances of the action 0 with "up", and all instances of the action 1 with "down" in R and P. We could then ensure that the "left" and "right" actions will not be used by the optimal policy, for example, by making those actions always get -\infty reward. With an optimal policy found, we can convert it to an optimal policy for the original MDP by replacing all instances of "up" with 0, and all "down" with 1.

  1. Here we will show that the set of all MDPs with finite horizons can be reduced to the set of all MDPs with infinite horizon. Given a finite-horizon MDP (\mathcal{S}, \mathcal{A}, P, R, H), we will construct an infinite-horizon MDP (\mathcal{S}', \mathcal{A}, P', R', \gamma).

    Which of the following are good choices for \mathcal{S}', P', R', and \gamma? Note that you may need to check multiple boxes to fully define P' and R'.
    \mathcal{S'} = \mathcal{S}
    \mathcal{S'} = \mathcal{S} \times \{1, 2, \dots, H\}
    \mathcal{S'} = (\mathcal{S} \times \{1, 2, \dots, H\}) \cup \{ \text{sink} \}
    \mathcal{S'} = (\mathcal{S} \times \{1, 2, \dots, H, H+1, \dots\}) \cup \{ \text{sink} \}
    \gamma = H
    \gamma = 0
    \gamma = 0.99
    \gamma = 1
    P'(s' \mid s, a) = P(s' \mid s, a)
    P'(\text{sink} \mid (s, H), a) = 1
    P'(\text{sink} \mid \text{sink}, a) = 1
    P'((s', h+1) \mid (s, h), a) = P(s' \mid s, a) for h < H
    P'(\cdot \mid (s, h), a) = 0 if not otherwise specified
    R'(\text{sink}, \cdot, \cdot) = \infty
    R'(\text{sink}, \cdot, \cdot) = 0
    R'((s, h), a, (s', h+1)) = R(s, a, s').
    R'(\cdot, \cdot, \cdot) = 0 if not otherwise specified

  2. The set of all MDPs with rewards R(s, a, s') can be reduced to the set of all MDPs with rewards R'(s), that is, specified only in terms of the current state.

    Which of the following informal descriptions would help prove this reduction?
    Given an MDP with rewards R'(s), we can create an MDP with rewards R(s, a, s') = R'(s), ignoring the a and s' inputs.
    Given an MDP with rewards R(s, a, s'), we can create an MDP with state space \mathcal{S} \times \mathcal{A} \times \mathcal{S'}, where being in the state (s, a, s') means that we are in state s' from the original MDP, after having just executed action a from state s.
    Given an MDP with rewards R(s, a, s'), we can use set the discounting factor \gamma in the new MDP in such a way that the rewards only depend on s.
    Given an MDP with rewards R(s, a, s'), select one action a_{*} and one possible next state s_{*}', and define R'(s) = R(s, a_{*}, s_{*}').

3) Online Planning§

3.1) Expectimax Search§

Complete the implementation of expectimax search for a finite horizon MDP.

For reference, our solution is 16 line(s) of code.

3.2) Determined to Determinize§

Indicate which of the following statements are true:
Unlike expectimax search, determinization requires enumerating all states in the MDP, regardless of reachability.
Most-likely-outcome and all-outcomes determinization will always lead to the same final policy, albeit through different determinized models.
All-outcomes determinization may result in a suboptimal policy.
All-outcomes determinization is slower than most-likely-outcome determinization, but it is guaranteed to result in a better policy.
The determinization strategies we discussed are best suited for Stochastic Shortest Path MDPs because after determinization, we can use goal-based deterministic planners like A*.
There is no point interleaving planning and execution when using determinization, since the determinized model will always be the same.
Consider an MDP where an agent is navigating a grid using up/down/left/right actions, trying to reach a goal location in the grid. The transition model is such that with 90% probability, the effect of the action will be as intended, but with 10% probability, a random action's effect will happen instead. Rewards are sparse: the agent receives +1 for reaching the goal and 0 otherwise. Most-likely outcome determinization would lead to the optimal policy in this MDP.
Consider an MDP where an agent could either walk across a shakey bridge above a lava river to reach a goal in 10 steps, or walk around the river to reach the goal in 12 steps. While on the bridge, the agent has a 49% chance of falling into the river at each timestep. Walking around the river will deterministically reach the goal. Most-likely outcome determinization would lead to the optimal policy in this MDP.
In the same lava problem from the previous problem, all-outcomes determinization would lead to the optimal policy.

4) More MDPs Please§

4.1) Quick Review§

Note: some sources may disagree on definitions pertinent to the questions below. Please use our lecture notes as the ground truth.

Indicate which of the following statements are true:
MDPs must have finite state and action spaces.
Value iteration assumes an MDP with finite state and action spaces.
MDPs must either be finite-horizon or have temporal discount factor \gamma < 1.
Given an optimal value function V, it is possible to compute an optimal action-value function Q without access to the MDP transition model P.
Given an optimal action-value function Q, it is possible to compute an optimal value function V without access to the MDP transition model P.
An MDP has a unique optimal policy.
An MDP has a unique optimal value function.
An MDP has a unique optimal action-value function.
When value iteration converges, it is guaranteed to have computed an optimal value function (within error bounds), regardless of the initialization.
When policy iteration converges, it is guaranteed to have computed an optimal policy (within error bounds), regardless of the initialization.

5) Offline Planning§

5.1) Wait, Bellman, Backup!§

Complete the implementation of the bellman backup for an infinite or indefinite horizon MDP.

For reference, our solution is 13 line(s) of code.

5.2) There's Value in that Iteration§

Complete the implementation of value iteration for an infinite or indefinite horizon MDP.

For reference, our solution is 20 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: bellman_backup. You may not need to use all of them.

6) Additive Reward§

Consider an infinite horizon MDP with \mathit{mdp} = \langle S, A, P, R, \gamma\rangle, where S is the state set, A the action set, P(s' | a, s) the transition probability, R(s, a, s') the reward function, and 0 < \gamma < 1 the temporal discount factor.

Now let's consider a new MDP which only differs from \mathit{mdp} in its reward function: \mathit{mdp}' = \langle S, A, P, R', \gamma \rangle, where R'(s, a, s') \equiv R(s, a, s') + c, and c is a constant. In other words, we shift all reward values by c.

Let's run value iteration (VI) on both MDPs. Let V_0(s) \equiv 0 be the all-zero value function, which is the initialization for our value iteration algorithm. Let V_1(s) be the value function after 1 iteration of VI on \mathit{mdp}, and V'_1(s) be the value function after 1 iteration of VI on \mathit{mdp}'.

  1. What's the relationship between V_1(s) and V'_1(s)? Please write down a formula for V'_1(s) in terms of V_1.
    Input your answer as a python formula. You can use variables s, s_prime, c and gamma, and also functions P, R, and V_1. We also provide you with a function max(variable, function). For example max(a, R(s, a, s_prime)) is \max_a R(s, a, s'). For example, you can enter V_1(s) + gamma + gamma * max(a, R(s, a, s_prime)).

Let's do a second iteration. Let V_2(s) be the value function after 2 iterations of VI on \mathit{mdp}, and V'_2(s) be the value function after 2 iterations of VI on \mathit{mdp}'.

  1. What's the relationship between V_2(s) and V'_2(s)? Please write down a formula for V'_2(s) in terms of V_2.
    Input your answer as a python formula. You can use variables s, s_prime, c and gamma, and also functions P, R, V_2 and max.

Let's do many iterations. Let V_\infty(s) be the value function computed by VI on \mathit{mdp} (until convergence), and V'_2(s) be the value function computed by VI on \mathit{mdp}' (until convergence).

  1. What's the relationship between V_\infty(s) and V'_\infty(s)? Please write down a formula for V'_\infty(s) in terms of V_\infty.
    Input your answer as a python formula. You can use variables s, s_prime, c and gamma, and also functions P, R, V_infty and max.
    Hint: recall the geometric series.

  2. Is the optimal policy derived from V_\infty and V'_\infty different?
    Different
    Not Different

Using a similar proof technique, one can also show that multiplying the reward function by a positive constant also yields the same optimal policy. However, in order to formally prove that, you must carefully handle the max operator in value iteration.

7) Value difference§

Now, we'll think about the way in which value iteration makes progress, improving the quality of the approximate value function it computes at each iteration. We'll define a distance measure between value functions in terms of the maximum absolute difference between the values they assign to any state: \|V_1 - V_2\| = \max_s |V_1(s) - V_2(s)|.

Assume, for simplicity, that we have deterministic rewards such that 0 \le R(s, a, s’) \le \mathrm{Rmax} for all states s, s' and actions a.

Now, if we define B(V) as the value function we get after applying one iteration of value iteration to V, we have the following cool theorem, which shows that after every iteration of value iteration, we get an answer that is closer to the optimal value function!

Theorem: \|B(V) - V^*\| \le \gamma \|V - V^*\|, where V^* is the optimal value function.

  1. What is the largest possible value of any state? Provide an upper bound on the value of V^*(s) for any s, in terms of \gamma and \mathrm{Rmax}.
    Please input your answer as a python formula. You can use variables rmax and gamma. For example, you can enter rmax + (gamma ** 2 + 1) / gamma.

  2. Provide an upper bound on \|V_0 - V^*\| where V_0(s) = 0 for all s . Please input your answer as a python formula. You can use variables rmax and gamma.

  3. Provide an upper bound on \|V_k - V^*\| where V_k is the value function you get after k iterations of value iteration starting from V_0, where V_0(s) = 0 for all s. Please input your answer as a python formula. You can use variables rmax and gamma, and k.

  4. As \gamma increases towards 1, how does this affect the number of iterations of value iteration we would need to reach a guaranteed level of approximation?
    We need more iterations
    We need fewer iterations
    It doesn't matter

8) Usefulness of Heuristic V in Forward Search!§

**For the following questions, we will use the same notation as in the "Value Difference" question above.**

Given the current state s and a horizon h, expectimax(s, h, U) is a policy that does expectimax search to depth h and uses heuristic value U(s), which may be zero, at the leaves. It returns the optimizing action for state s. We can use this policy to pick an action, execute it, and repeat. This is receding horizon control (RHC) with T_\mathrm{replan} = 1, as presented in lecture.

If the heuristic is zero for all states, this process is equivalent to executing the policy that is greedy with respect to the finite-horizon value function with horizon h for that MDP. When we have a non-trivial heuristic, the contraction property of the Bellman backup guarantees that RHC with expectimax will be equivalent to acting greedily with respect to a value function that is no worse than the heuristic.

Here’s another useful theorem that relates a value function \hat{V} to the expected value you would get from executing the policy that is greedy with respect to \hat{V}.

To derive the theorem, let's think about the Q-function Q_V induced by a value function V, which is defined as:

Q_V(s, a) = \mathbb{E}_{s' \sim P\left(s' | s, a\right)}\left[ R(s, a, s') + \gamma V(s') \right].
Let Q^* denote the optimal Q-function derived from V^*.

Let V_\pi be the value that you get from executing policy \pi, and let \pi^g_Q be the policy that selects its actions greedily with respect to some Q function Q. Let \hat{Q} be another Q-function. If \|\hat{Q} - Q^*\| < \delta, that is, for all (s, a), \|\hat{Q}(s, a) - Q^*(s, a)\| < \delta, then the value you get from executing the policy that is greedy with respect to \hat{Q}, i.e., V_{\pi^g_{\hat{Q}}}, has the following bound:

V^*(s) - V_{\pi^g_{\hat{Q}}}(s) < 2 \delta / (1 - \gamma), \forall s.

  1. Suppose you have an arbitrary value function \hat{V}, which has the property that \|\hat{V} - V^*\| < \epsilon. What's the upper bound for the difference between the Q^* \hat{Q} = Q_{\hat{V}}, i.e., \|\hat{Q} - Q^*\|? Please input your answer as a python formula. You can use variables rmax and gamma, and epsilon.

  2. Suppose you have an arbitrary value function \hat{V}, which has the property that \|\hat{V} - V^*\| < \epsilon. Now denote \hat{Q} = Q_{\hat{V}}, and we are going to follow the policy that is greedy with respect to \hat{Q}, i.e., \pi^g_{\hat{Q}}. Use the upper bound you just derived for \|\hat{Q} - Q^*\| and the theorem above, to derive a bound for \|V_{\pi^g_{\hat{Q}}} - V^*\| represented in terms of \epsilon, \gamma , and \mathrm{Rmax}. Please input your answer as a python formula. You can use variables rmax and gamma, and epsilon.

  3. Suppose you have an arbitrary value function \hat{V}, which has the property that \|\hat{V} - V^*\| < \epsilon. Denote \hat{Q} = Q_{\hat{V}}. Consider the actual value you can get by following the greedy policy derived from \hat{Q}, i.e., V_{\pi^g_{\hat{Q}}}. Is V_{\pi^g_{\hat{Q}}} always greater than the original \hat{V} for any state s?
    Note: After submitting your answer, please click the "View Explanation" button to see more details.
    No, and the difference between V_{\pi^g_{\hat{Q}}} and \hat{V} can be arbitrarily large (in this case, you can be arbitrarily "optimistic" about the value).
    No, and the difference between V_{\pi^g_{\hat{Q}}} and \hat{V} is bounded by a function of \epsilon and \gamma.
    Yes, you will always get a higher value than \hat{V}(s) if you follow the greedy policy derived from \hat{V}.

  4. Use your results from part 2 and the theorem above to provide an upper bound on the regret you will experience (difference between your actual value function and V^*) from executing a policy computed using depth h expectimax, with U(s)=0 for all s. Please input your answer as a python formula. You can use variables rmax and gamma, h, and epsilon.
    Hint: computing \hat{Q} from \hat{V}, i.e., \hat{Q} = Q_{\hat{V}}, can be viewed as an h=1 expectimax search!

  5. As \gamma increases toward 1, does the horizon required to obtain the same regret increase or decrease?
    Increase
    Decrease

  6. Consider the situation in which \|U - V^*\| < \epsilon, and you are doing depth h expectimax with U at the leaves. What is the regret of executing the resulting policy? Please input your answer as a python formula. You can use variables rmax, gamma, h, and epsilon.
    Hint: computing \hat{Q} from \hat{V}, i.e., \hat{Q} = Q_{\hat{V}}, can be viewed as a h=1 expectimax search!

  7. When U differs significantly from V^*, i.e., when |U(s) - V^*(s)| > \delta for some s, we should probably just use U_0(s) \equiv 0 as the heuristic value. We can work out (using the theorems above) what the value of |U(s) - V^*(s)| is when U(s) = 0. When the difference between |U(s) - V^*(s)| becomes larger than that value (\delta), we should use U(s)=0. What is that value?

    Express \delta in terms of other variables. Please input your answer as a python formula. You can use variables rmax, gamma, h, and epsilon.

  8. Assuming you have a good U (so \|U - V^*\| is small), does this give you more help when \gamma is bigger or smaller?
    Bigger
    Smaller
    No Difference

  9. Assuming you have a good U (so \|U - V^*\| is small), does this give you more help when the search horizon h is bigger or smaller?
    Bigger
    Smaller
    No Difference

9) Working with MDPs§

Continuing the theme of a "search and rescue" domain, we will consider the problem of planning to navigate to rescue people and take them to the hospital. But alas, nature is less co-operative in the real-world than we saw in the last mini-project, and we will have to expect our actions don't always the intended outcome.

We have provided you with a RescueMDP class. The robot knows its location and the location of the person to be rescued (the goal). However, in this case, the robot's motion is uncertain. As we've seen in class, there are two ways we can deal with this uncertainty: we can either replan when the unexpected happens, or we can compute a policy that tells us ahead of time what to do when the unexpected happens. There are of course trade-offs between these two approaches.

Let's examine the properties that make MDPs easier or harder.

9.1) MDP Question 1§

We have provided a RescueMDP class for you, that models the robot's motion as noisy. This class is in many ways similar to the ChaseMDP class in that it is a maze MDP with obstacles. However, there's just the one agent (the robot, no bunny).

Recall that an MDP is defined by the following tuple:

  • States: The state space is a set of coordinates. The set is defined over a grid, but states that aren't obstacles aren't in the state space.
  • Actions: The robot can take four actions, up, down, left, right.
  • A transition function: The probability the action succeeds at moving the robot in the given direction is given by the class field correct_transition_probability. If this probability is less than 1, then the rest of the probability mass is uniformly distributed among the other three directions. Any probability mass for a motion that would move the robot into an obstacle or out of the map is mapped to the robot staying the same place. If the robot is in the goal state and the MDP has the flag goal_is_terminal = True, it cannot transition out of the goal state. If the flag goal_is_terminal = False, the robot is free to leave the goal state and re-enter it.
  • Reward function: The robot gets a reward of living_reward for each action it takes. It gets reward of goal_reward every time it enters the goal state.

Remember that arrays are row-major order, that is, the arrays are indexed by (row, column). The person to be rescued (goal_location) is at (0, 0).

We also want you to be able to examine the policy that results from solving for the optimal value function. Please use the helper function plot_value_function to generate a plot of both the value function and the corresponding policy.

Please create a LargeRescueMDP and compute the optimal value function.

For reference, our solution is 4 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: bellman_backup, value_iteration. You may not need to use all of them.

9.2) MDP Question 2§

The default transition model for the LargeRescueMDP is pretty close to deterministic. There is a .97 chance that each action succeeds in the intended direction (unless there's an obstacle in the way or its the edge of the map) and only a .01 chance of ending up in one of the other 3 neighbouring grid cells.

What happens if we make the transition model noisier? Setting the correct_transition_probability to .76 is not too noisy, but it has implications for our ability to solve for the optimal policy.

Please create a LargeRescueMDP and set the correct_transition_probability to be .76. You can set the correct_transition_probability field of the LargeRescueMDP class directly, and the transition function will be adjusted according. (You can read the comment in the LargeRescueMDP setter function for more detail on how setting the correct_transition_probability works.)

For reference, our solution is 5 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: bellman_backup, value_iteration. You may not need to use all of them.

Does the policy change?
Yes
No

What happens to the value function?
Values increase because it takes fewer steps to reach the goal
Values increase because the future is worth more
Values stay the same because the policy stays the same
Values decrease because the future is worth less
Values decrease because it takes more steps to reach the goal

9.3) MDP Question 3§

The default temporal_discount_factor is .9. This is in general quite a low discount factor. An action 20 steps away will have .9^{20} \approx .12 impact on later actions.

What happens if we increase the discount factor? Please create a LargeRescueMDP, set the temporal_discount_factor to be .99, and also set the correct_transition_probability to be .76. Then please solve for the optimal value function.

For reference, our solution is 6 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: bellman_backup, value_iteration. You may not need to use all of them.

Does the policy change?
Yes
No

What happens to the value function?
Values increase because it takes fewer steps to reach the goal
Values increase because the future is worth more
Values stay the same because the policy stays the same
Values decrease because the future is worth less
Values decrease because it takes more steps to reach the goal

9.4) MDP Question 4§

Now let's make the dynamics really noisy. Once again, please create a LargeRescueMDP with a hazard at `{(1, 4)}', but with correct_transition_probability set to 0.5. You should see a very substantial change in both the policy and value function.

For reference, our solution is 6 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: bellman_backup, value_iteration. You may not need to use all of them.

Does the policy change?
Yes
No

What happens to the value function?
The policy stays the same
The policy changes to take a shorter path
The policy changes to take a riskier path
The policy changes to take a longer path
The policy changes to take a safer path

10) Fitted Q Value Iteration§

In this problem we'll play with the Mountain Car MDP, a classic problem. You control a car, constrained to move in a line, by telling it to either accelerate left, right, or not accelerate. The car starts initially at some point in a valley, and the goal is to get the car to the top of the mountain on its right as fast as possible. Here is a possible initial state, where the red dot represents the car:

Formally, here is the description of the MDP:

  • \mathcal S = \mathcal X \times \mathcal V, where \mathcal X = [-1.2, 0.6] is the set of possible positions of the car, and \mathcal V = [-0.07, 0.07] is the set of possible velocities of the car
  • \mathcal A = \{0, 1, 2\}, denoting acceleration to the left, no acceleration, and acceleration to the right respectively.
  • Given state (x, v), and action a, the transition updates v' = v + ((a-1)\alpha + g + \eta)\delta_t, where \alpha is the constant describing the acceleration of the car's engine, g \propto \cos(3 x) is the effect of gravity on the car, and \eta is an i.i.d. gaussian noise. Then, x' = x + v' \delta_t.
  • The terminal states are the right boundary of the position of the car. i.e. \{(0.6, v) | v \in \mathcal V\}
  • The reward is always -1
  • \gamma = 1

The MDP is already implemented for you in the class MountainCar. Please take a look at the code and answer the following questions.

10.1) Value from Q§

Compute the value function from the Q function.

For reference, our solution is 3 line(s) of code.

10.2) Greedy Policy from Q§

Write the greedy policy given a Q function.

For reference, our solution is 6 line(s) of code.

10.3) Epsilon Greedy Policy from Q§

Write the epsilon greedy policy given a Q function. With probability epsilon, the policy should pick a random action, and with probability 1-epsilon, pick the greedy action. You may use the greedy_policy_from_q function. You may find rng.choice and rng.random useful.

For reference, our solution is 6 line(s) of code.

In addition to all the utilities defined at the top of the Colab notebook, the following functions are available in this question environment: greedy_policy_from_q. You may not need to use all of them.

10.4) Sampling points in grid§

Next, we'd want to sample points in the state space on which to perform Bellman backups. Look at the colab for function sample_grid_points which samples the state space in an evenly spaced grid, and returns the sampled points and their corresponding transitions for every possible action.

10.5) Sampling points from policy§

Another way to sample points is to collect points by rolling out trajectories from a policy. Implement a function that samples several trajectories from a given policy, and concatenate the trajectories together to produce a list of (state, action, reward, next_state) tuples.

For reference, our solution is 8 line(s) of code.

10.6) Fitted Q Visualization§

Complete the fitted_Q_learn function given in the colab notebook. You're now ready to conduct fitted Q learning on the Mountain Car Problem! You may use either sampling method (grid or policy), and either regression method (KNNRegressor or NeuralRegressor).

Try fitted Q learning with sample_grid_points(x_division=4, v_division=4) and 20 iterations. You may use either KNNRegressor or NeuralRegressor. Visualize the fitted Q and the final greedy policy. What explains your observations?
Grid sampling scales exponentially with the dimension of state space
The granularity of the grid points is too coarse to capture the MDP transition
There were not enough iterations for the Q function to converge
The regression method overfitted the Q function

Try fitted Q learning with sample_policy_points using eps=0.05 and 20 iterations. You may use either KNNRegressor or NeuralRegressor. Visualize the fitted Q and the final greedy policy. What explains your observations?
The policy did not explore enough of the state space because eps is too low
The points obtained were too far off policy because eps is too high
The regression method overfitted the Q function
The sample points change when the policy changes

In this problem, it turns out that both grid sampling and policy sampling are viable. Select all that are true about these two sampling methods.
Grid sampling scales exponentially with the dimension of state space, whereas policy sampling does not.
Policy sampling should always be preferred over grid sampling
Policy sampling tends to be more unstable compared to grid sampling.

Submit a successful trajectory with fewer than 13 actions. You should be able to obtain the trajectory from the sim_episode method.

11) Feedback§

  1. How many hours did you spend on this homework?

  2. Do you have any comments or suggestions (about the problems or the class)?