HW 3 -- Temporal models
This week's Colab notebook can be found here: Colab Notebook (Click Me!)
1) Variable Elimination§
Implement Variable Elimination given a list of Potentials and a Variable order. You may use the functions multiply_potentials and marginalize from last week's homework.
For reference, our solution is 17 line(s) of code.
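A minimal sketch of one way to structure this, assuming (as in last week's homework) that multiply_potentials takes a list of Potentials, that marginalize(potential, variable) sums a variable out, and that each Potential exposes its variables through a .variables attribute — that attribute name is an assumption, not part of the provided code:

def variable_elimination(potentials, elimination_order):
    # Sketch only; relies on the assumed helpers described above.
    potentials = list(potentials)
    for variable in elimination_order:
        # Split off the potentials that mention this variable.
        involved = [p for p in potentials if variable in p.variables]
        rest = [p for p in potentials if variable not in p.variables]
        # Multiply them together and sum the variable out.
        summed = marginalize(multiply_potentials(involved), variable)
        potentials = rest + [summed]
    # The product of whatever remains is the (unnormalized) result over the remaining variables.
    return multiply_potentials(potentials)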
2) Gaussians§
2.1) §
A + B | |
10A - 12B | |
A \times B | |
The joint distribution of A and B |
Let A \sim \mathcal{N}(\mu_A, \sigma_{A}^{2}) and let B \mid A \sim \mathcal{N}(A, \sigma^{2}_{B \mid A}). We can show that the distribution of A \mid B = b is a univariate Gaussian.
For parsing purposes, enter \mu_A as mu_A, \sigma_{A} as sigma_A, \sigma_{B \mid A} as sigma_BA, b as b, and use the ^ character for exponentiation.
2.2) §
2.3) §
2.4) §
True | |
False |
2.5) §
True | |
False |
3) Continuous PGMs §
Consider a Bayesian network with three one-dimensional continuous variables: A, B, and C. Let A \sim \mathcal{N}(1, 1), B \sim \mathcal{N}(2, 4), and C \sim \mathcal{N}(A + B, 1).

3.1) §
3.2) §
3.3) §
3.4) §
Hint: We can apply linearity of covariance to compute Cov(A,C) and Cov(B, C).
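As a reminder, the identity behind this hint is that covariance is linear in each argument: for random variables X, Y, Z and constants a, b,
Cov(X, aY + bZ) = a\,Cov(X, Y) + b\,Cov(X, Z), \quad \text{with} \quad Cov(X, X) = Var(X).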
3.5) §
3.6) §
3.7) §
Hint: See Lec05 slide 22 for the general formula of the covariance matrix.
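In case you don't have the slides handy, the standard definition is \Sigma_{ij} = Cov(X_i, X_j); for example, for two variables,
\Sigma = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) \\ Cov(X_2, X_1) & Var(X_2) \end{pmatrix}.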
3.8) §
3.9) §
Independent | |
Not independent |
3.10) §
Conditionally independent | |
Not conditionally independent |
4) Sampling§
4.1) Sampling types§
In the limit of infinite samples, the statistics we compute from a set of samples generated by Gibbs sampling, such as the mean of a variable, match the statistics of the underlying distribution. | |
It is possible to know ahead of time how many samples will be needed for importance sampling. | |
For some samplers, it is possible to know at execution time, by measuring the total weight of the samples collected so far, whether more samples are needed. | |
When computing importance weights w(X) = P(X)/Q(X) for a sample X, it is important that P(X) be correctly normalized. | |
It is important that the potentials in the graphical model are correctly normalized for sampling to work. | |
We need to know ahead of time the statistics of the target distribution when designing the proposal distribution for importance sampling. | |
A bad proposal distribution means more samples from the proposal distribution will be needed to get a good estimate of the target distribution. | |
We don't need to worry about the direction of arrows when Gibbs sampling in a Bayes net. | |
The samples generated by Gibbs sampling are independent and identically distributed. |
4.2) Sampling efficiency§
Kullback-Leibler (KL) divergence measures how different two distributions are; for discrete distributions it is given by D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.
Let us assume that P is the target distribution, and Q is the proposal distribution, and we are performing rejection sampling.
What does this mean for our sampling strategy?
When Q(x) is small for x\in {\mathcal {X}} where P(x) is large, D_{KL}(P||Q) will always be large. | |
When Q(x) is large for x\in {\mathcal {X}} where P(x) is small, D_{KL}(P||Q) will always be large. | |
When D_{KL}(P||Q) is large, the sampling strategy will be very good. P and Q are closely aligned. | |
When D_{KL}(P||Q) is small, the sampling strategy will be very good no matter what. P and Q are closely aligned. | |
Our sampling strategy will be efficient when the KL divergence is small and P(x) isn't small or zero for x where Q(x) is large. | |
If Q has zero probability in places where P is non-zero, we will not get a representative set of samples. |
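To build intuition for the statements above, here is a small self-contained numerical sketch; the distributions P, Q1, and Q2 are made up purely for illustration:

import math

# Toy discrete distributions over the same support.
P  = {"a": 0.5, "b": 0.4, "c": 0.1}     # target distribution
Q1 = {"a": 0.4, "b": 0.4, "c": 0.2}     # proposal that covers P well
Q2 = {"a": 0.05, "b": 0.05, "c": 0.9}   # proposal that rarely proposes where P has mass

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

print(kl_divergence(P, Q1))  # about 0.04: small, so few proposals are wasted
print(kl_divergence(P, Q2))  # about 1.76: large, so the high-probability region of P is rarely proposed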
5) Sampling§
5.1) Gibbs Sampling§
Consider the following 3 node factor graph:
The table on the right gives three different potential tables for the factor. For one of these potential tables, Gibbs sampling did a poor job of generating samples that reflect the joint distribution of the three variables. Which one was it?
a | |
b | |
c |
5.2) Sampling in Bayes net§
Consider the following Bayes net.
We are interested in drawing samples from the distribution described by this network, conditioned on evidence, using ancestral sampling with rejection.
- Is this a good way to sample from P(K \mid B = b)? (True / False)
- Is this a good way to sample from P(B \mid K = k)? (True / False)
5.3) Markov Blanket§
Consider the Bayes net from the above question.

- Enter the set of nodes in the Markov Blanket of node G, e.g. ['B','D'].
- Enter the set of nodes in the Markov Blanket of node D, e.g. ['B','D'].
6) HMMs: Are my leftovers edible? §
I use an HMM to model the state of my leftover food over time. The state space is \{\textit{tasty}, \textit{smelly}, \textit{furry} \}. Initially, I'm sure they're tasty: P(S_0 = \textit{tasty}) = 1. The state transition model is:

6.1) State Distributions §
6.2) Message Passing§
Let \phi_{ij} be the factor between nodes S_i and S_j. Provide the values of the following messages (we calculated a couple of them for you). Give your answer as a Python list of 3 numbers, with the assumption that the possible values are, in order, (tasty, smelly, furry). We're asking for the messages in the order we would want them to compute P(S_2).
i. S_0 \rightarrow \phi_{01} = \left(1, 0, 0\right)
ii.
iii.
iv.
v. S_4 \rightarrow \phi_{34} = (0, 0, 1)
vi.
vii.
viii.
ix. \phi_{23} \rightarrow S_2 = \left(\frac{1}{9}, \frac{1}{2}, 1 \right)
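For reference, the messages above are instances of the standard sum-product updates on a chain:
\mu_{S_i \rightarrow \phi_{ij}}(s_i) = \prod_{\phi' \neq \phi_{ij}} \mu_{\phi' \rightarrow S_i}(s_i), where the product runs over the other factors attached to S_i, and
\mu_{\phi_{ij} \rightarrow S_j}(s_j) = \sum_{s_i} \phi_{ij}(s_i, s_j)\, \mu_{S_i \rightarrow \phi_{ij}}(s_i).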
6.3) Checking Message Passing §
To calculate P(S_2 \mid S_0 = \textit{tasty}, S_4 = \textit{furry}), we multiply the messages from (enter the answers in order of increasing day):
\phi_{01} | |
\phi_{12} | |
\phi_{23} | |
\phi_{34} |
and
\phi_{01} | |
\phi_{12} | |
\phi_{23} | |
\phi_{34} |
and then
normalize | |
don't normalize |
6.4) Sampling Formulas §
You decide to estimate P(S_2) by forward (prior) sampling. You generate 100 sample trajectories of length 3 for your leftovers. Given those trajectories, what are the three terms used to construct an estimate of P(S_2)?
(number of trajectories ending in tasty)/100 | |
(number of trajectories with tasty)/100 | |
(number of trajectories ending in smelly)/100 | |
(number of trajectories with smelly)/100 | |
(number of trajectories ending in furry)/100 | |
(number of trajectories with furry)/100 |
(number of trajectories ending in tasty)/100 | |
(number of trajectories with tasty)/100 | |
(number of trajectories ending in smelly)/100 | |
(number of trajectories with smelly)/100 | |
(number of trajectories ending in furry)/100 | |
(number of trajectories with furry)/100 |
(number of trajectories ending in tasty)/100 | |
(number of trajectories with tasty)/100 | |
(number of trajectories ending in smelly)/100 | |
(number of trajectories with smelly)/100 | |
(number of trajectories ending in furry)/100 | |
(number of trajectories with furry)/100 |
6.5) Kalman Filters Check §
You decide that three discrete values is not sufficient precision for describing the state of your leftovers! You would like to score them on a continuum from 0.0 (completely fresh and tasty) to 1.0 (totally decomposed). Would a Kalman filter be a good choice of solution for predicting the current state of your leftovers given some previous observations?
Yes | |
No |
7) Properties of Viterbi§
Maximillian thinks the Viterbi algorithm is too complicated and we could get away without the backward “decoding” pass, and instead just return the maximum of the marginal distribution at each time step.
You know better than Max! In the HMM below, provide a sequence of three observations for which Max will get the wrong answer, but Viterbi decoding will be correct. Your answer should be a Python list of three True/False values. Assume that the initial distribution is uniform.

e.g. [True, True, True].
8) Coding: Localization with Viterbi§
In this section, we will implement an HMM for a robot that is moving around randomly in a 2D grid with obstacles. The robot has sensors that allow it to detect obstacles in its immediate vicinity. It knows the grid map, with the locations of all obstacles, but it is uncertain about its own location in the grid. We will use Viterbi to determine the most likely locations for the robot given a sequence of local and potentially noisy observations.
Concretely, we will represent the 2D grid with obstacles as a list of lists, where 1s represent obstacles and 0s represent free space. Example:
obstacle_map = [
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
]
The state of the robot is its location in the grid, represented as a tuple of ints, (row, col). Transitions are uniformly distributed amongst the robot's current location and the neighboring free (not obstacle) locations, where neighboring = 4 cardinal directions (up, down, left, right).
Observations are a 4-tuple listing which directions have obstacles, in order [N, E, S, W], with a 1 for an obstacle and 0 for no obstacle. Directions that are "off the map" are reported as 1, as though they were obstacles. For instance, in the map above, if there were no observation noise, then the observation for the top-left corner (state=(0, 0)) would be (1, 0, 1, 1). Observations can also be corrupted with noise; see the create_observation_potential docstring for more details.
Our ultimate task will be to take in a sequence of observations and return the corresponding sequence of most likely states.
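As an ungraded warm-up for this representation, here is a minimal sketch of how the noiseless observation for a state can be read off the map; the function name is ours, not part of the provided code:

def noiseless_observation(obstacle_map, state):
    # Return the (N, E, S, W) tuple for `state`, treating off-map cells as obstacles.
    rows, cols = len(obstacle_map), len(obstacle_map[0])
    row, col = state
    observation = []
    for dr, dc in [(-1, 0), (0, 1), (1, 0), (0, -1)]:  # N, E, S, W
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols:
            observation.append(obstacle_map[r][c])  # 1 if obstacle, 0 if free
        else:
            observation.append(1)                   # off the map counts as an obstacle
    return tuple(observation)

print(noiseless_observation(obstacle_map, (0, 0)))  # (1, 0, 1, 1) for the example map above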
8.1) Conditioning Warmup§
Given a potential and a variable that occurs in it, return a new potential conditioned on the specified variable having value 0.
For reference, our solution is 2 line(s) of code.
8.2) HMM§
Write a function that creates a random variable for a state at a given time step in an obstacle HMM. The domain of the state variable should be a list of (row, col) indices into the map. Only free positions (not obstacles) should be included in the domain of the state variable. See docstring for more description.
For reference, our solution is 4 line(s) of code.
8.3) Transition potential creation§
Write a function that creates a potential for the transition distribution between states s_{t} and s_{t+1}. Refer to the previous question for more information about the state variables and their domains.
For reference, our solution is 13 line(s) of code.
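A sketch of the probabilities such a potential should encode, returning a plain dictionary rather than an actual Potential (constructing the Potential object itself is course-specific):

def transition_probs(obstacle_map, state):
    # {next_state: probability}, uniform over `state` itself and its free 4-neighbors.
    rows, cols = len(obstacle_map), len(obstacle_map[0])
    row, col = state
    candidates = [state]  # the robot may stay where it is
    for dr, dc in [(-1, 0), (0, 1), (1, 0), (0, -1)]:  # N, E, S, W
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols and obstacle_map[r][c] == 0:
            candidates.append((r, c))
    prob = 1.0 / len(candidates)
    return {next_state: prob for next_state in candidates}

print(transition_probs(obstacle_map, (0, 0)))  # {(0, 0): 0.5, (0, 1): 0.5}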
8.4) Observation variable creation§
Write a function that creates a random variable for an observation at a given time step in an obstacle HMM. See docstring for description.
For reference, our solution is 3 line(s) of code.
8.5) Observation potential creation§
Write a function that creates a potential for the observation distribution between s_{t} and z_{t}.
For reference, our solution is 29 line(s) of code.
8.6) Viterbi Decoding§
Complete the implementation of Viterbi decoding. Note: in general, it is a good idea for numerical stability to implement Viterbi in log space. However, for the purpose of this homework, we are not expecting your implementation to be in log space. Moreover, the provided helper functions (e.g., multiply_potentials) assume that you are not working in log space.
For convenience, we have the following utilities:
- We can access the initial distribution through hmm.initial_distribution.
- To initialize the forward message, we have condition_potential and multiply_potentials, which can be used to compute the observation message and multiply it into the initial distribution.
- We also have access to argmax_potential, max_potential, and normalize_potential; see the Colab file for implementation details.
For reference, our solution is 38 line(s) of code.
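For intuition only, here is a self-contained Viterbi sketch over plain dictionaries; it does not use the course's Variable/Potential classes, and all names here are ours. The graded version should instead use the helpers listed above.

def viterbi(states, initial, transition, emission, observations):
    # states: list of hidden states
    # initial[s] = P(S_0 = s); transition[s][s2] = P(S_{t+1} = s2 | S_t = s)
    # emission[s][z] = P(Z_t = z | S_t = s); observations = [z_0, ..., z_T]
    # Forward pass: best score of any path ending in each state, plus backpointers.
    scores = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    backpointers = []
    for z in observations[1:]:
        prev = scores[-1]
        best_prev = {s: max(states, key=lambda s0: prev[s0] * transition[s0][s])
                     for s in states}
        scores.append({s: prev[best_prev[s]] * transition[best_prev[s]][s] * emission[s][z]
                       for s in states})
        backpointers.append(best_prev)
    # Backward "decoding" pass: follow the backpointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))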
8.7) Viterbi Visualization§
We have created a visualization tool for you in Colab to visualize the inference results. You can run the provided code blocks to generate the inferred trajectory step by step and to compare the observed trajectory with the "noiseless" observations.
If you are not passing the test, you can use the simple map visualize_viterbi_test1 for debugging.
Hint: one debugging strategy is to modify your Viterbi code and print out the likelihood you get at each step.
9) Feedback§
- How many hours did you spend on this homework?
- Do you have any comments or suggestions (about the problems or the class)?