Miniproject 3 -- Reasoning under uncertainty
This miniproject's Colab notebooks can be found here:
In this project, we will revisit the "pickup and rescue" problem you encountered in miniproject 1. We will consider a stochastic version and a partially observable version of this problem, in the form of an MDP and a POMDP, respectively. You will develop approximate online-planning solutions to each problem. To that end, you will combine various skills you have acquired throughout this course, including heuristic search, Monte-Carlo tree search, and loopy belief propagation.
Two Phase Release
This project contains two parts: planning with complete observability and planning with partial observability. Both parts are now released on this page.
This project requires benchmarking, which could be time-consuming. Therefore, we highly recommend that you start early.
Lastly, note that some of the coding questions do not have catsoop tests. You will still see code input boxes for them on this catsoop page, but they carry no points; they are there to help you locate the relevant functions and classes to be implemented. We will grade your submission based on the requested submission materials.
Code Structure
To help you navigate the source code, here is a summary of the most important classes that you will find in our starter code. You may skip this section for now and come back later for reference.
- Agent
  - RolloutLookAheadAgent: a simple baseline agent that uses rollout lookahead to plan.
  - FireMDPDeterminizedAStarAgent: a planning agent that approximates the FireMDP problem by a deterministic min-cost path problem and uses A* search to find an action; needs implementation.
  - MCTSAgent: an MCTS agent; needs implementation.
  - FirePOMDPBeliefStateAgent (problems to be released)
    - FirePOMDPRolloutLookaheadAgent: a simple baseline agent for FirePOMDP that uses rollout lookahead to plan.
    - FirePOMDP_MLO_MCTSAgent: a planning agent that approximates the FirePOMDP using the MLO approximation, then uses MCTS to find an action; needs implementation.
- PickupProblem: the deterministic pickup and rescue problem.
- FireProcess: the stochastic fire process.
- FireMDP: the fire MDP problem, formed by combining a PickupProblem and a FireProcess.
- DeterminizedFireMDP: a deterministic min-cost path approximation to a FireMDP; needs implementation.
- FirePOMDP: the partially observable fire problem.
- FireMRF: a Markov random field representation used to approximately represent the fire distribution of a FirePOMDP.
- MLODeterminizedFirePOMDP: an MLO approximation to FirePOMDP, used by FirePOMDP_MLO_MCTSAgent; needs implementation.
1) Planning with Complete Observability §
Let's first understand our problem setup.
The Pickup and Rescue Problem
We will consider a gridworld-like domain that consists of a robot, a patient and a target hospital.
The robot's goal is to pick up the patient and bring them to the target hospital. As you can see in the above figure, the blue circle represents our robot, the orange circle represents the patient, and the red cross represents the hospital. The environment has one-way roadblocks, indicated by a flat triangle between two cells. For example, the agent can only move down from (1, 1) to (2, 1), not up. The environment also has two-way roadblocks, indicated by thick black boundaries between two cells. For example, the figure above has a long corridor formed by two walls: the agent can travel down the corridor from its initial position, but cannot move sideways out of it.
The Evolution of Fire
The above problem is almost the same as the one you saw in MP01 --- it's a deterministic search problem. But here's what's interesting: our grid now has fire. The fire starts at some initial locations (represented by red grid cells in the above figure) and evolves through time.
Thanks to our research at MIT, we know the exact model of how fire evolves through time.
Most notably, the fire grid at time t is independent of the robot and patient, and depends only on the fire grid at the previous time step. Fire also completely ignores walls, roadblocks and the hospital.
Further, given the fire grid at time t, the probabilities of fire at any two different cells at time t+1 are independent:
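In symbols, writing F_t for the fire grid at time t, this conditional independence can be written as P\left(F_{t+1} \mid F_t\right) = \prod_{i,j} P\left(F_{t+1}[i,j] \mid F_t\right).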
You might recognize that given the fire grid at time t, a matrix of 0-1 values, the probability of fire at time t+1 is the 2D convolution of the fire grid and the weights W (normalized such that the entries sum to one).
Here's another way to understand the fire process:
- We start with some initial fire grid.
- At each time step t, for each cell, we randomly select a neighbor (including the current cell itself) with probability proportional to the weights matrix; then, the current cell at time t+1 takes the selected neighbor's fire value at time t. Note that neighbors outside of the grid do not have fire.
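Below is a minimal sketch of how the marginal fire probabilities propagate under this process. It assumes SciPy is available and uses a hypothetical 3 x 3 weights matrix W; the actual fire_weights live in the FireProcess of the starter code.

import numpy as np
from scipy.signal import convolve2d

# Hypothetical 3 x 3 weights, normalized so the entries sum to one.
W = np.array([[1., 1., 1.],
              [1., 4., 1.],
              [1., 1., 1.]])
W = W / W.sum()

def fire_marginals_step(p_fire):
    # Given the per-cell marginal probability of fire at time t, return the
    # per-cell marginal probability of fire at time t+1. Cells outside the
    # grid never have fire, hence boundary="fill" with fillvalue=0.
    # (W is symmetric here, so convolution and correlation coincide.)
    return convolve2d(p_fire, W, mode="same", boundary="fill", fillvalue=0.0)

# Example: start from a deterministic 0-1 fire grid and roll the marginals forward.
p = np.zeros((5, 5))
p[2, 2] = 1.0
for _ in range(3):
    p = fire_marginals_step(p)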
Our Approach
We are going to adopt an online-planning approach, where at every step, our agent:
- plans according to the current state,
- executes an action, and
- observes a new state of the fire grid and replans (i.e., restarts from step one).
We will consider two different styles of planning: determinized approximation and Monte-Carlo tree search.
1.1) Approximate, Determinize and Replan §
In our first approach, at each planning step, we will try to find an open-loop plan that is most likely to succeed. In particular, we turn the MDP into a min-cost path problem:
- The state space no longer has the fire grid, but only contains the state of the pickup and rescue problem.
- For a step (s_t, a, s_{t+1}) at time t, we charge a cost of c - \log \left(1 - P\left(\mathtt{on\_fire}_{t+1}(s_{t+1})\right)\right), where c is a small cost for taking each step, and P\left(\mathtt{on\_fire}_t(s)\right) is the marginal probability of stepping on fire at state s at time t.
- We try to find the least-cost path to reach the patient and rescue them to the hospital --- this path becomes our found open-loop plan.
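To see why minimizing this cost corresponds to finding the open-loop plan most likely to succeed: summing the step costs over a T-step plan gives cT - \sum_t \log\left(1 - P\left(\mathtt{on\_fire}_{t+1}(s_{t+1})\right)\right) = cT - \log \prod_t \left(1 - P\left(\mathtt{on\_fire}_{t+1}(s_{t+1})\right)\right). So, treating the per-step fire events along the path as independent (itself an approximation), the least-cost path maximizes the probability of never stepping on fire, with a small per-step penalty discouraging unnecessarily long plans.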
Once we have a min-cost path problem, we can use A* search with a simple heuristic that ignores fire: the sum of the Manhattan distance from the robot to the patient and the Manhattan distance from the patient to the hospital, scaled by the small cost c charged at each step. Note that when the robot is carrying the patient, the distance between the robot and the patient is zero.
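Here is a minimal sketch of that heuristic, assuming (hypothetically) that the A* state exposes robot_loc and patient_loc as (row, column) tuples, and that patient_loc coincides with robot_loc while the patient is being carried; adapt the names to the starter code.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def h(state, hospital_loc, c):
    # Distance from robot to patient is zero once the patient is being carried.
    pickup_dist = manhattan(state.robot_loc, state.patient_loc)
    dropoff_dist = manhattan(state.patient_loc, hospital_loc)
    return c * (pickup_dist + dropoff_dist)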
Hints:
- Before you code, try to derive the marginal probabilities of each grid cell being on fire at time t as an expression of the marginal probabilities of each grid cell being on fire at time t-1. What do you find? More concretely, you may start small: consider a grid consisting of only two cells, named X and Y, and assume that W is uniform. Then, try to write the marginal probability P(X_t=1 | X_0, Y_0) as an expression of P(X_{t-1}=1 | X_0, Y_0) and P(Y_{t-1}=1 | X_0, Y_0).
- We can formulate the A* state as (\mathtt{robot\_loc}, \mathtt{patient\_loc}, \mathtt{time}) and use time as an index into a precomputed sequence of grids of the marginal probabilities of each cell being on fire. See DeterminizedFireMDPState and DeterminizedFireMDP for more details.
1.1.1) Determinized Min-cost Path Problem§
Please complete the implementation of DeterminizedFireMDP. In particular, you should:
- Complete the function fire_dist_at_time to compute the log-likelihood of each cell being on fire at time t given the true fire state at time 0.
- Using your implementation of fire_dist_at_time, complete the function step_cost.
- Complete the rest of DeterminizedFireMDP and code the heuristic function h based on the description above.
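As a sanity check on the step cost, here is a hedged sketch of the formula from this section (the argument names are placeholders; in the starter code the fire probability would come from your fire_dist_at_time):

import numpy as np

def step_cost(c, p_fire_at_destination):
    # Small per-step cost c, plus a penalty of -log(probability of NOT being
    # on fire) at the destination cell.
    return c - np.log(1.0 - p_fire_at_destination)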
For reference, our solution is 68 line(s) of code.
1.1.2) Determinized Fire MDP Agent§
Please complete the implementation of FireMDPDeterminizedAStarAgent. Note that we have filled in most of the implementation for you --- including the call to run_astar_search from HW01. All you need to implement is the determinized_problem method.
For reference, our solution is 27 line(s) of code.
1.1.3) What are you waiting for? §
Now that we have the agent, let's experiment with it in some environments!
We have provided various environments under the get_problem function. Run your agent in each provided environment several times. You can visualize the agent's behavior using the run_agent_on_problem and animate_trajectory functions.
Let's take a closer look at the particular MDP get_problem("just_wait"). What's your agent's behavior (on average)? In particular, does the robot "wait" by patrolling in the top row for a while, and then move out to rescue the patient? If not, then it is very likely that your agent implementation is buggy!
Experiment 1
Please generate an animation of a successful run (i.e., the robot successfully rescues the patient) of the agent in the just_wait MDP, in which the fire does not completely die out by the time the robot moves down from the top row. You might need to repeat the experiment a few times to produce this animation. If you fail to create this animation after a handful of trials (say, 10), your agent implementation might be buggy.
Please feel free to reach out to us anytime if you get stuck.
1.1.4) What's the Right Choice? §
Sometimes our determinized agent finds a plan that is not optimal.
To see this effect, try running the determinized agent in the MDP get_problem("the_choice").
In this environment, the agent faces a choice of going right or down from the initial location.
- It may choose the shortcut by taking the down action. But it risks getting close to the fire next to the one-way passage, and it cannot hide from the fire inside this passage.
- It may also accept the challenge by taking the right action. Here, the robot moves into a large room with more fire than the one-way passage. But it can move around to try its best to avoid the fire until it finds a clear path to rescue the patient.
Experiment 2
Similar to experiment 1, please visualize the behavior of the determinized planning agent in this environment, and generate an animation of a successful run. What choice does your determinized agent make?
1.2) Monte-Carlo Tree Search §
Now that we have seen a failure mode of our determinized planning agent, let's try to do better with closed-loop planning with MCTS!
With MCTS, we have a better chance of hedging our bets, so we might be inclined to go in directions where there are more options in case we get caught, even if the expected open-loop cost is higher.
We have provided you with an MCTS implementation, run_mcts_search, modified from the one in HW01, that works on both MDPs and POMDPs. Please take a look at the documentation of run_mcts_search to understand how to use it on MDPs, then implement an MCTS agent for MDPs.
1.2.1) MCTS Agent§
Please complete the implementation of MCTSAgent.
Hint: You can pass in self.planning_horizon to run_mcts_search to handle both infinite-horizon problems (by receding-horizon planning) and finite-horizon problems.
For reference, our solution is 42 line(s) of code.
1.2.2) Making the Right Choice! §
Let's run our MCTS agent in the_choice MDP, and see if it makes the right choice!
Experiment 3
Similar to experiment 2, please visualize the behavior of the MCTS planning agent in the_choice MDP and generate an animation of a successful run.
1.3) Benchmarks §
Now that we have both a determinized agent and an MCTS agent, let's do some benchmarking.
We have provided you with a simple agent, RolloutLookaheadAgent: at each step, it performs several rollouts for each action and chooses the best one. Note that RolloutLookaheadAgent with receding_horizon=0 becomes a naive agent that chooses actions uniformly at random.
Let's compare these agents: RolloutLookaheadAgent(receding_horizon=0), RolloutLookaheadAgent(receding_horizon=40), FireMDPDeterminizedAStarAgent, MCTSAgent(iteration_budget=10, receding_horizon=40), and MCTSAgent(iteration_budget=50, receding_horizon=40). Please set unspecified parameters to their default values.
Then, run the benchmarks. In particular:
- For each environment in get_problem, run each agent at least 10 times in that environment.
- Record the obtained total reward for each run.
- Record the average, standard deviation, min, and max rewards each agent obtains.
Note that running the MCTS agent can take a while on Colab (with the risk that Colab might disconnect you due to idling). In our experience, the benchmarking may take a few hours. Therefore, we recommend running the MCTS agent on a local laptop or desktop. If you choose to do so, you may change MCTSAgent(iteration_budget=50) to one with more iterations, such as MCTSAgent(iteration_budget=100) or MCTSAgent(iteration_budget=500) --- doing so should produce a much better MCTS agent. You may also want to repeat each setting more than 10 times, say 30 times --- doing so will reduce the effect of stochasticity in the experiments.
Please prepare a table comparing the above agents' performances in the environments. We do not impose a format for the table, but you should prepare the table such that it is reasonably readable. Remember to indicate the experiment settings for the table (the parameters you used and the number of repetitions, etc.). Once you have the table, try to identify any interesting patterns from the table, and summarize your findings in words.
Hints:
- You may want to take a look at the benchmark_agent and compare_agents functions, which contain some boilerplate code to get you started.
- You may also use the max_steps parameter of the above functions to limit the number of steps for each evaluation episode if evaluation is taking too much time.
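Here is a hedged sketch of one way to organize the benchmark results into a table. The helper run_episode_total_reward is a placeholder for however you obtain one episode's total reward (for example, via the provided benchmark_agent or run_agent_on_problem utilities --- check their actual signatures), and agents is a hypothetical dictionary mapping agent names to agent factories.

import numpy as np
import pandas as pd

def run_episode_total_reward(agent, problem, max_steps=100):
    # Placeholder: use the starter code's benchmarking utilities here.
    raise NotImplementedError

rows = []
for env_name in ["just_wait", "the_choice"]:        # add the other environments too
    problem = get_problem(env_name)
    for agent_name, make_agent in agents.items():    # e.g., {"mcts": lambda: MCTSAgent(...)}
        rewards = [run_episode_total_reward(make_agent(), problem) for _ in range(10)]
        rows.append({"environment": env_name, "agent": agent_name,
                     "mean": np.mean(rewards), "std": np.std(rewards),
                     "min": np.min(rewards), "max": np.max(rewards)})
print(pd.DataFrame(rows).to_string(index=False))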
2) Planning with Partial Observability §
Let us now turn to a partially observable environment. We will consider an environment that is the same as in part 1, except for the following changes:
- The fire now spreads for an unknown amount of time before the agent enters the grid.
- When the agent enters the grid, the fire stops spreading and remains still throughout the episode.
- Apart from the robot, the agent controls a flying drone. At each step, the drone can move to any square within distance 3 of the robot. The drone flies high enough that it won't be burned by fire. Unfortunately, the drone travels so fast that it cannot make observations of grid cells during its travel. But, at each time step, the drone stops above a grid cell and can look closely at whether that cell has fire.
- At each step, the agent can accurately observe the 3-by-3 neighborhood around the ground robot and the cell below the flying drone.
2.1) Belief propagation §
Let us put aside the robot and the drone for a moment and consider the problem of inferring which locations are on fire, given observations of some of the grid cells.
Recall that the fire spreads for an unknown amount of time before the agent enters the grid. In principle, since we know the fire process from part one, if we had a probabilistic model of how long the fire had spread, we could build an exact probabilistic model of the fire at the time the agent enters the grid. However, in our problem setting, we do not know the duration (or its model) a priori. Further, even if we knew the duration model, the overall probabilistic model would likely be too complicated, and inference on it would be difficult.
To perform inference with reasonable efficiency, we will use an approximate model of the fire distribution. In particular, we will model the fire distribution as a Markov random field with a unitary potential on every cell specifying the likelihood that it is on fire, and a pairwise potential connecting neighboring grid cells indicating that they are more likely to be in the same state than not. Below is the factor graph corresponding to this MRF:
Each pair of neighboring cells in this model shares the same binary potential \phi(X_i, X_j). Each node also has a unitary potential \psi_{i}(X_i) that may vary for each cell. Overall, the model expresses the joint distribution:
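P(X_1, \ldots, X_n) \propto \prod_{i} \psi_i(X_i) \prod_{(i,j) \in \mathcal{N}} \phi(X_i, X_j), where \mathcal{N} denotes the set of pairs of neighboring grid cells (this is the standard pairwise-MRF factorization implied by the description above).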
2.1.1) Exact Marginalization§
Write a function fire_mrf_exact_marginalize to compute the exact marginals by multiplying all potentials and then marginalizing. We don't expect this function to work for large grids, but it should work for grids of size up to 3 x 3.
For reference, our solution is 35 line(s) of code.
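For intuition, here is a hedged brute-force sketch of exact marginalization on a tiny grid. It assumes (hypothetically) that the unitary potentials are given as an (H, W, 2) array and the pairwise potential as a (2, 2) array; the real FireMRF fields may be organized differently.

import itertools
import numpy as np

def brute_force_marginals(unitary, pairwise):
    # Enumerate all 2^(H*W) assignments, accumulate the unnormalized weight of
    # each assignment into per-cell, per-value totals, then normalize.
    H, W = unitary.shape[:2]
    totals = np.zeros((H, W, 2))
    for assignment in itertools.product([0, 1], repeat=H * W):
        grid = np.array(assignment).reshape(H, W)
        weight = 1.0
        for i in range(H):
            for j in range(W):
                weight *= unitary[i, j, grid[i, j]]
                if i + 1 < H:                      # vertical neighbor
                    weight *= pairwise[grid[i, j], grid[i + 1, j]]
                if j + 1 < W:                      # horizontal neighbor
                    weight *= pairwise[grid[i, j], grid[i, j + 1]]
        for i in range(H):
            for j in range(W):
                totals[i, j, grid[i, j]] += weight
    return totals[..., 1] / totals.sum(axis=-1)    # P(cell on fire)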
Experiment 4
Using your implementation of fire_mrf_exact_marginalize, let's compute the marginals for some MRFs.
- A 2 x 2 MRF. The unitary potentials are all [0.5, 0.5]. The pairwise potential is [[0.7, 0.3], [0.3, 0.7]].
- A 3 x 3 MRF. The unitary potentials are all [0.5, 0.5]. The pairwise potential is [[0.7, 0.3], [0.3, 0.7]].
- A 3 x 3 MRF. The unitary potentials are all [0.5, 0.5], except that the variable at the upper-left corner has a unitary potential of [0.3, 0.7]. The pairwise potential is [[0.7, 0.3], [0.3, 0.7]].
- A 3 x 3 MRF. A random grid of unitary potentials. You may generate the potentials however you like, as long as the unitary potentials are different between any two cells. The pairwise potential is [[0.5, 0.5], [0.5, 0.5]].
Plot the marginal probabilities computed by fire_mrf_exact_marginalize on the above MRFs as a heatmap (e.g., using plt.imshow):
import matplotlib.pyplot as plt

# Compute the exact marginals and display them with a fixed color scale so
# that heatmaps for different MRFs are directly comparable.
marginals = fire_mrf_exact_marginalize(fire_mrf)
plt.imshow(marginals, vmin=0, vmax=1)
plt.show()
What do you find from these marginals? In particular, how do the unitary and pairwise potentials affect the marginals?
2.1.2) Good Approximation? §
From now on, let us fix the pairwise potential of our MRF to [[0.7, 0.3], [0.3, 0.7]].
Using your findings from experiment 4, discuss when our MRF is and is not a good approximation to the true fire distribution. Note that this is an open-ended question, and we don't expect you to be exhaustive. Please pick at least two scenarios and try your best to answer the question. You may pick more than two scenarios and get credit for your correct analyses (i.e., we won't penalize any incorrect analysis).
Here are some possible scenarios you may consider:
- What happens when the FireProcess has a particular fire_weights? In particular:
  - When the fire_weights is more peaked at the center?
  - When the fire_weights is flatter?
  - When the values at the corners of fire_weights are larger/smaller?
- What happens when the fire spreads for a longer/shorter time before the agent starts?
2.1.3) Loopy Belief Propagation§
Our exact marginalization is very slow for larger grids. So instead, we will use loopy belief propagation to compute the marginals.
We have provided you with a function fire_mrf_lbp_marginals that implements loopy belief propagation on a FireMRF to compute the marginals approximately. This function is the same as the belief propagation you saw in lecture, except that it works only on our FireMRF model. By specializing to our 2D grid model, we have taken care to make it very efficient by avoiding Python loops and using NumPy operations instead. However, this function does not handle any observations.
Please complete the implementation of fire_mrf_lbp_conditionals. This function should take the same arguments as fire_mrf_lbp_marginals, with the addition of an argument observations that is a FirePOMDPObservation. A FirePOMDPObservation is just a 2D NumPy array --- please see the docstring for FirePOMDPObservation for more details.
Hints:
- You can use np.isnan(observation) to find the indices of the unobserved cells.
- You can reduce fire_mrf_lbp_conditionals to a call to fire_mrf_lbp_marginals by incorporating the observation into the unitary potentials.
For reference, our solution is 17 line(s) of code.
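Here is a hedged sketch of the reduction suggested in the hints: clamp the observed cells by overwriting their unitary potentials, then reuse the provided marginal routine. The field name unitary_potentials, the replace constructor, and the convention that the observation stores 1.0 for fire and 0.0 for no fire are assumptions about the starter code; adjust them to the real FireMRF and FirePOMDPObservation.

import numpy as np

def fire_mrf_lbp_conditionals(fire_mrf, observations, **lbp_kwargs):
    unitary = np.array(fire_mrf.unitary_potentials, copy=True)  # assumed shape (H, W, 2)
    observed = ~np.isnan(observations)
    # Observed fire -> potential [0, 1]; observed no fire -> potential [1, 0].
    unitary[observed, :] = 0.0
    unitary[observed, observations[observed].astype(int)] = 1.0
    clamped = fire_mrf.replace(unitary_potentials=unitary)      # hypothetical constructor
    return fire_mrf_lbp_marginals(clamped, **lbp_kwargs)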
2.2) Partially Observable Agents (Optional) §
We now come back to our partially observable fire problem.
We have provided you with an abstract base agent, FirePOMDPBeliefStateAgent. All this base agent does is track the belief state by combining all observations the agent has made so far. Notice that this belief state representation is exact, since it contains all the information of the complete history. This agent uses our MRF approximation: at each step, it assumes that the fire is distributed according to a FireMRF, conditioned on previous observations. You may refer to FirePOMDPBeliefStateAgent and FirePOMDPBeliefState for more details on this base agent.
We have also provided you with a simple agent, FirePOMDPRolloutLookaheadAgent, which mirrors the FireMDPRolloutLookaheadAgent from part 1. At every step, this agent samples several complete POMDP states from the approximate fire distribution MRF by Gibbs sampling. It then plays random rollouts from each complete state and chooses the best first action.
2.2.1) MLO Determinized FirePOMDP§
Please complete the implementation of MLODeterminizedFirePOMDP, a deterministic reward problem that is the MLO approximation of the FirePOMDP.
Recall that in the MLO approximation, the next state is determined by the most likely observation. Also, recall that our agent can observe one or more grid cells at each step. Technically, the most likely observation is the configuration of the observed grid cells with the highest probability. However, computing the joint distribution of the observed cells is expensive. Therefore, we make another approximation: we consider each cell individually and use the most likely configuration of that cell as its most likely observation. With this approximation, we can use loopy belief propagation from the previous section to compute the most likely observation for each cell and then use those observations to determine the next state.
For reference, our solution is 72 line(s) of code.
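Here is a hedged sketch of the per-cell MLO step described above: condition on what has been observed so far, then take each newly observed cell's most likely value from its (approximate) marginal. It assumes fire_mrf_lbp_conditionals returns a grid of P(on fire) per cell; the surrounding bookkeeping in MLODeterminizedFirePOMDP will differ.

import numpy as np

def most_likely_observation(fire_mrf, past_observations, newly_observed_mask):
    # Approximate posterior marginals given everything observed so far.
    p_fire = fire_mrf_lbp_conditionals(fire_mrf, past_observations)
    mlo = np.full(past_observations.shape, np.nan)   # NaN = not observed this step
    # Each newly observed cell takes its most likely value: 1.0 if P(fire) > 0.5, else 0.0.
    mlo[newly_observed_mask] = (p_fire[newly_observed_mask] > 0.5).astype(float)
    return mlo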
2.2.2) MLO MCTS Agent§
Please complete the implementation of FirePOMDP_MLO_MCTSAgent.
At each step, this agent:
- Applies the MLO approximation by casting the POMDP to an MLODeterminizedFirePOMDP, and
- Uses MCTS to search for a good action under the MLO approximation.
Once you have the implementation, try running the agent on the problems defined in get_problem_part2, and see how it performs. You may use the same utilities (e.g., run_agent_on_problem and animate_trajectory) as in part 1.
As a sanity check for correctness, you should see that the agent achieves a reward >0.9 most of the time in get_problem_part2("only_fire"), as long as there exists a path that allows the agent to pick up the patient and then reach the goal.
Tips for using MCTS:
- We recommend using pomdp_random_rollout_policy as the rollout policy for MCTS instead of the plain random policy. Please see the docstring of that function for more details.
- You will want to use n_simulations=1 since the environment is deterministic. We also recommend setting max_backup=False to perform the Monte-Carlo backup instead of the Bellman backup.
For reference, our solution is 43 line(s) of code.
Experiment 5
Please create an agent that assumes the default FireMRF (i.e., FireMRF.default), and generate an animation of a successful run (i.e., the robot successfully rescues the patient) of the MLO MCTS agent in the only_fire POMDP.
Please feel free to reach out to us anytime if you get stuck.
2.2.3) Clever Agents §
Your MLO agent works well on some of the problems we gave you (such as only_fire), but likely not all.
Solving POMDPs is generally hard, and so is solving our partially observable fire problem.
In this section, we ask you to implement any agent you wish to try to better solve our partially observable fire problem. Here are some ideas:
- Reduce MLODeterminizedFirePOMDP, which is a reward problem, to a min-cost path problem --- you may recall this reduction from Lecture 02. Then, use A* search on the min-cost path problem with heuristics.
- Write an improved rollout policy for the MCTS search. For example, you can let the drone visit more uncertain locations with higher probability.
- Write a POMCP agent. You can use FireMRFGibbsSampler to sample from the current belief state. Our run_mcts_search already implements POMCP by passing in the state_sampler argument.
- Our MLO approximation currently needs to perform loopy belief propagation every time it sees a new belief state. Instead, we could consider warm-starting LBP. In particular, we can pass in the messages produced by a previous call to LBP if the belief state does not differ much from that previous invocation. The idea is that LBP can take fewer iterations to converge due to warm-starting from those previous messages. Thus, it can produce a speedup for the search and allow more MCTS iterations. Our function fire_mrf_lbp_marginals has a parameter return_msgs that can help with this strategy.
- Play with the MRF approximation. For example: what happens if, instead of the default FireMRF, we use a unitary potential closer to the initial_fire_grid of the fire process? The idea is that although this initial_fire_grid is not necessarily the same as the fire grid when the agent starts, it could be good prior knowledge if the fire did not spread too much.
Please implement some agents (at least one) and try to see if they beat the MLO MCTS agent. It is ok to implement an agent but find that it fails to perform as well as you expected --- we will not grade your submission based on the agent's performance. On the other hand, what we expect from your submission is careful thinking and agent implementation that demonstrates your grasp of the course material.
Please write a short description for each agent you wrote. If the agent involves a heuristic (such as a better rollout policy or an A* heuristic function), please describe it and explain why you think it is good domain knowledge for solving our problem.
Please benchmark your agents against FirePOMDPRolloutLookaheadAgent and FirePOMDP_MLO_MCTSAgent.
Similar to part 1 benchmarking, you should repeat each experiment at least 10 times, ideally 30 times, to get a good performance measurement.
Please report your experiments' results in a table, then write a summary of your results.
Again, be careful to document the parameters you used for each agent.