HW 0 -- Background

The questions below are due on Monday February 10, 2025; 11:59:00 PM.

You are not logged in.
Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.

This assignment is a diagnostic of whether you have the prerequisite background for this subject.

It has an introduction to cat-soop, our homework platform, including how questions are evaluated.
It is followed by some analytical and some coding questions.
If you find that it is very difficult, that might be a sign that you should strengthen your background in the prerequite areas. Feel free to ask a staff member (via private Piazza post) about this.

1) Intro to Homework§

Welcome to your first homework!

We view homework as a critical mechanism for teaching and learning, not as an evaluation mechanism. There are class-grade points associated with it, as an incentive for you to do it.
There is `free checking' in the sense that you can submit your anwers to questions as many times as you want to, and get feedback on their correctness. So, it's possible to get a perfect score through persistence.
Note that you won't get credit for a question until you click 'submit' and that, once you click 'view answer', you will no longer be allowed to submit.
You can read the late submission policy on the course information page. The way it interacts with cat-soop is best understood in two phases:
1. First, ignoring extensions, for each question, we evaluate each of your submissions in terms of how late it was submitted and how correct it was. You will be given the score of the best submission (so, for example, if you submit a correct solution on time, but then go back and play with a question later and submit something incorrect, that won't decrease your score.)
2. Second, in an outer loop, we will determine the best possible way to apply your extension days, where the policy above is now applied relative to the extended deadlines.
Please take the questions seriously as educational material. For example, although a monkey could get all the multiple-choice questions right eventually, it might not learn much! Once you submit a solution we encourage you hit 'view answer' to see our solution and then hit 'view explanation' to see an explanation of the answer, and be sure you understand why your answer was right.
Some of the material is brand new this semester and there will inevitably be bugs. Please post on Piazza if you suspect a bug (and/or if you're stuck). If your question reveals the answer, then please post privately. We also have not yet completed explanations for all the questions. If you'd like to contribute some, please post them!
You may choose to tackle the questions in any order, but the homeworks are designed to be followed sequentially. Often, insights from the first problems will help with the later ones.
This software environment (cat-soop) is great for many things, but it's not a great software debugging enviroment. We strongly encourage you to debug your solutions to coding questions using Google Colab (an online environment which means you don't need to install a particular version of Python or packages on your machine).
The 'Run Code' button on cat-soop will run exactly the code in the submission box, and not any test cases. Click 'submit' to run our test cases.
Each week, we will provide a Colab notebook with predefined helper functions and test cases. You should draft and debug your solution in Colab, then paste it into cat-soop for final checking and submission.
Though where you work and debug is ultimately up to you, please note that the colab notebooks are the only place where you can view the pre-defined imports and utility functions that will be extremely useful for completing the coding questions. So, before you start any coding, please look at colab! And if you leave colab for a different development environment, you will want to take these imports and functions with you. (They are already defined in catsoop, so copying them there is not necessary.)
This week's Colab notebook can be found here: Colab Notebook (Click Me!)

Here are some practice questions. Play around with them, including checking wrong answers. When you're done, view answers and read the explanations.

Who is the best robot?

Enter a Python list of numbers corresponding to the first three Fibonacci numbers greater than 1.

1.1) Fibonacci§

Complete the following implementation of Fibonacci number that returns the n-th Fibonacci number

For reference, our solution is 5 line(s) of code.

2) Non-coding questions§

2.1) Probability Warmup§

Assume that A and B are binary random variables. Which of the following expressions are equivalent to \Pr(A=0 | B=0)?

	\frac{\Pr(A=0,~B=0)}{\Pr(B=0,~A=1) + \Pr(B=0,~A=0)}
	\frac{\Pr(B=0~\|~A=0)\Pr(A=0)}{\Pr(A=0,~B=0) + \Pr(A=1,~B=0)}
	\frac{\Pr(A=0,~B=0)}{\Pr(A=0)}
	\frac{\Pr(A=0,~B=0)}{\Pr(B=0)}
	\frac{\Pr(B=0~\|~A=0)\Pr(A=0)}{\Pr(B=0~\|~A=0)\Pr(A=0) + \Pr(B=0~\|~A=1)\Pr(A=1)}
	\frac{\Pr(A=0)}{\Pr(B=0)}
	\frac{\Pr(B=0~\|~A=0)\Pr(A=0)}{\Pr(B=0)}
	\Pr(B=0~\|~A=0) \Pr(A=0)
	\Pr(B=0~\|~A=0)

Which of the following statements are always true? (Assume A and B are binary variables).

P(A=1) \geq P(A=1,~B=1)

P(A=1) \leq P(A=1,~B=1)

P(A=1) = P(A=1,~B=1) + P(A=1,~B=0)

P(A=1,B=1) = P(A=1)P(B=1)

P(A=1~|~B=1) \geq P(A=1)

P(A=1~|~B=1) \geq P(A=1,~B=1)

2.2) Commute§

After running late many days in a row, you decide to do some probabilistic modeling of your morning commute, in an effort to make sure you can always get to 6.4110 on time :)

Let L be a random variable that takes value 1 if you are running late leaving the house, and 0 otherwise.
Let T be a random variable that takes value 1 if you get stuck behind a train, and 0 otherwise. (Assume we're at the far end of the E line where this will actually happen).
Let G be a random variable that takes value 1 if you get a green light at Vassar and Mass Ave, and 0 otherwise.

Consider the following table of probabilities:

	T=1	T=0
L=1	0.70	0.10
L=0	0.06	0.14

Enter a single number in each of the following boxes, accurate to three digits after the decimal point. It is also fine to type in a numerical expression (like 2 / 3).

What is the probability that you are running late?
What is the probability that you get stuck behind a train, given that you are running late?
What is the total probability that you get stuck behind a train?

Assume that L and T are related as given in the table above, and that \Pr(G = 1 | T = 1) = 0.1 and \Pr(G = 1 | T = 0) = 0.2, regardless of the value of L.

What is the probability that you were stuck behind a train given that you got a green light at Vassar and Mass Ave?

2.3) SNPs §

Cheap sequencing enabling confident single-nucleotide readings over large pieces of DNA has only been achieved in the past ten years. One of the results of the ability to read DNA at the single nucleotide resolution with high confidenence is the ability to detect Single Nucleotide Polymorphisms (SNPs for short, but pronounced as "snips") in the human genome.

SNPs are common single-nucleotide variations (an A is instead a T for example) that are known to exist in the genome. Researchers have begun to analyze the frequency and association of SNPs with various diseases and medical disorders. These studies are called Genome Wide Association Studies (GWAS). When used in conjunction with Bayes' theorem, patient predisposition for diseases can be determined from analyzing the presence (or absence) of certain SNPs. We look at the probability that forms the basis of a GWAS on a fictional heart disorder below.

A heart disorder with a prevalence of 0.02 (2%) in the general population is investigated in a GWAS in a hospital study. 3000 subjects with the disorder are included in the study and 7000 subjects without the disorder are included as control subjects. Note that this doesn't mean the disorder has a 30% prevalence since selection bias went into collecting the study participants...general population prevalence is still 2% (i.e. P(\text{Heart Disorder}) = 0.02).

Phenotype	Total	Has SNP1	Has SNP2	Has SNP3
Has Heart Disorder:	3000	1600	920	1750
Control (no Heart Disorder):	7000	3250	2150	2100

Based on the data in the table above, answer the following questions with three digits after the decimal place. Consider each SNP independently (do not worry about combinatorials). Also try to carry the full values through in all of your calculations. The checker is looking only three decimal places back but if you use multiple rounded answers in combination the resulting answer can be "too" rounded.

How likely is it that an individual has SNP3, given that they have a heart disorder?
How likely is it that an individual has SNP3, given that they do not have a heart disorder?
Overall, how likely is an individual to have SNP3?
How likely is an individual to have a heart disorder if they have SNP1?
How likely is an individual to have a heart disorder if they have SNP2?
How likely is an individual to have a heart disorder if they have SNP3?
How many times as likely (compared to the general population on average) is an individual to have the heart disorder if they have SNP1?
How many times as likely (compared to the general population on average) is an individual to have the heart disorder if they have SNP2?
How many times as likely (compared to the general population on average) is an individual to have the heart disorder if they have SNP3?

The human genome has millions of SNPs, and humans suffer from tens of thousands of disorders and diseases. Throw in the fact that the SNPs influence one another (not independent events) and it becomes a very, very active area of research, both in developing ways to handle the large data sets and in making the actual discoveries from the data. Papers are being published every day on new discoveries from GWAS.

2.4) Bees (Optional)§

You are a Japanese Honey Bee in a hive. Your fellow bees leave once a day and come back in the evening to report what they found during that day's flight. Because bees don't talk, they communicate through dancing in order to convey what they found. The dances that bees do are comprised of combinations of the following four dance moves:

Shake
Dance in Circles
MoonWalk
Flapping Wings

A dance routine consists of two or more different moves performed simultaneously. There are four possible events that a bee could report and they are conveyed through the following dance move combinations:

Food Source Found: (Shake, Circles, Moonwalk)
Other Bees Found: (Shake, Flap, Moonwalk)
Japanese Giant Hornets are attacking: (Shake, Flap (and no Moonwalk))
Nothing to Report: (Moonwalk, Circles (and no Shake))

You may assume that these are the only four combinations of dance moves that a bee can use.

On a normal weekday for bees (Monday through Friday, inclusive),there is a probability distribution of a bee discovering the following events and returning to the hive to tell about them: DDist{'food':0.4, 'other bees': 0.1, 'Japanese Giant Hornets ATTACKING!': 0.2, 'Nothing': 0.3}

On a weekend day for bees (Saturday and Sunday), a bee has the following probability of reporting events: DDist{'food':0.2, 'other bees': 0.3, 'Japanese Giant Hornets ATTACKING!': 0.0, 'Nothing': 0.5}

Answer the following questions; all answers should be accurate to within 10^{-3}. The solution boxes will accept python numerical expressions so feel free to enter your computations into them that way.

On a random day, what is the probability that a specific bee returns having found food?
You see a bee shaking in circles while doing the moonwalk. Based on only this information, what is the probability that it is a Tuesday?
What is the probability that a specific bee's dance will involve shaking on any given day?
Because of the strobe lights and cigarette smoke, you can't see all of a bee's dance moves...but you know that it is at least shaking. What is the probability that the bee is trying to say that Japanese Giant Hornets are attacking the nest?
A bee is only shaking and flapping. Based only on this information, what is the probability that it is Monday?

While European honey bees are essentially defenseless against an attack by Japanese Giant Hornets, japanese honey bees have evolved an interesting defense mechanism: see this 7-minut video from National Geographic for more information.

2.5) Continuous Random Variables§

Given two independent, one-dimensional Gaussian random variables X_1 \sim \textrm{Normal}(\mu_1, \sigma_1^2) and X_2 \sim \textrm{Normal}(\mu_2, \sigma_2^2), answer the following questions. Please input a Python formula in terms of mu_1, mu_2, sigma_1, and sigma_2.

Provide an expression for the mean of the distribution of X_1 + X_2
Provide an expression for the variance of the distribution of X_1 + X_2
True or False: The original random variable with higher variance has more influence on the mean of the distribution of X_1 + X_2

True

False
True or False: The more random variables we add together the less variance there will be in the result.

True

False

2.6) Random vectors§

Indicate which of the following statements are true:

	A probability distribution can assign probability greater than 1 to an event.
	A probability density function can have values greater than 1.
	A probability density function can assign probability greater than 1 to an event.
	A probability density function of a Normal distribution can have values greater than 1.
	If the discrete random variables X and Y are independent, then P(X=x, Y=y) = P(X=x)P(Y=y)
	P(X=x\|Y=y) = P(Y=y\|X=x)P(Y=y)/P(X=x)
	P(X=x\|Y=y) = P(Y=y\|X=x)P(X=x)/P(Y=y)
	P(X=x\|Y=y) = P(Y=y,X=x)/P(Y=y)
	Two random vectors {\bf x}_1 \in R^d and {\bf x}_2 \in R^d are said to be uncorrelated if \rho = E \left [(\bar{{\bf x}}_1 -{\bf x}_1)(\bar{{\bf x}}_2-{\bf x}_2)^T \right ] = 0.
	Two random vectors {\bf x}_1 \in R^d and {\bf x}_2 \in R^d are said to be independent if \rho = E \left [(\bar{{\bf x}}_1 -{\bf x}_1)(\bar{{\bf x}}_2-{\bf x}_2)^T \right ] = 0.
	Assume a multivariate Normal distribution over [x_1, x_2]^T, where [x_1, x_2]^T \sim N(0, \Sigma) with covariance \Sigma = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}. The marginal distribution over x_1 is N(0, 1).
	Assume a multivariate Normal distribution over [x_1, x_2]^T, where [x_1, x_2]^T \sim N(0, \Sigma) with covariance \Sigma = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}. The marginal distribution over x_1 is N(0, 2).
	Uncorrelated, jointly Gaussian random variables are independent.

A reminder that these questions are intended to be diagnostic. If you are unfamiliar with the notation, or the terminology, you may not have the prerequisites for this subject. If you are unfamiliar with multivariate distributions, you also may not have the prerequisites for this subject.

3) Coding questions§

3.1) Gradient Descent§

Complete the following implementation of gradient descent to estimate the parameters of a linear model y = \theta^T x, where x \in \mathbb{R}^m, y \in \mathbb{R}^n, and \theta \in \mathbb{R}^{m \times n}. Please use the following loss function:

\mathcal{L} = \frac{1}{2} \sum_{i=1}^N \| \theta^T x_i - y_i \|_2^2,

where (x_i, y_i), i=1\dots N are training samples.

For reference, our solution is 5 line(s) of code.

3.2) Breadth-first search§

Complete the following implementation of breadth-first search. Note that the expected output is a list of actions. The search should not revisit any states.

For reference, our solution is 15 line(s) of code.

	\frac{\Pr(A=0,~B=0)}{\Pr(B=0,~A=1) + \Pr(B=0,~A=0)}
	\frac{\Pr(B=0~\|~A=0)\Pr(A=0)}{\Pr(A=0,~B=0) + \Pr(A=1,~B=0)}
	\frac{\Pr(A=0,~B=0)}{\Pr(A=0)}
	\frac{\Pr(A=0,~B=0)}{\Pr(B=0)}
	\frac{\Pr(B=0~\|~A=0)\Pr(A=0)}{\Pr(B=0~\|~A=0)\Pr(A=0) + \Pr(B=0~\|~A=1)\Pr(A=1)}
	\frac{\Pr(A=0)}{\Pr(B=0)}
	\frac{\Pr(B=0~\|~A=0)\Pr(A=0)}{\Pr(B=0)}
	\Pr(B=0~\|~A=0) \Pr(A=0)
	\Pr(B=0~\|~A=0)

	P(A=1) \geq P(A=1,~B=1)
	P(A=1) \leq P(A=1,~B=1)
	P(A=1) = P(A=1,~B=1) + P(A=1,~B=0)
	P(A=1,B=1) = P(A=1)P(B=1)
	P(A=1~\|~B=1) \geq P(A=1)
	P(A=1~\|~B=1) \geq P(A=1,~B=1)

	True
	False

	True
	False