HW 5 -- Probability warmup, Graphical model intro
1) Probability Warmup§
- Assume that A and B are binary random variables. Which of the following expressions are equivalent to \Pr(A=0 | B=0)?
  - \frac{\Pr(A=0,~B=0)}{\Pr(B=0,~A=1) + \Pr(B=0,~A=0)}
  - \frac{\Pr(B=0~|~A=0)\Pr(A=0)}{\Pr(A=0,~B=0) + \Pr(A=1,~B=0)}
  - \frac{\Pr(A=0,~B=0)}{\Pr(A=0)}
  - \frac{\Pr(A=0,~B=0)}{\Pr(B=0)}
  - \frac{\Pr(B=0~|~A=0)\Pr(A=0)}{\Pr(B=0~|~A=0)\Pr(A=0) + \Pr(B=0~|~A=1)\Pr(A=1)}
  - \frac{\Pr(A=0)}{\Pr(B=0)}
  - \frac{\Pr(B=0~|~A=0)\Pr(A=0)}{\Pr(B=0)}
  - \Pr(B=0~|~A=0) \Pr(A=0)
  - \Pr(B=0~|~A=0)
- Which of the following statements are always true? (Assume A and B are binary variables).
  - P(A=1) \geq P(A=1,~B=1)
  - P(A=1) \leq P(A=1,~B=1)
  - P(A=1) = P(A=1,~B=1) + P(A=1,~B=0)
  - P(A=1,B=1) = P(A=1)P(B=1)
  - P(A=1~|~B=1) \geq P(A=1)
  - P(A=1~|~B=1) \geq P(A=1,~B=1)
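One way to sanity-check a candidate identity is to evaluate both sides on randomly generated joint distributions: a single mismatch shows that a statement is not always true, while agreement on many random joints is evidence (not proof) that it is. The sketch below does this for Bayes' rule and for the independence statement. All helper names (`random_joint`, `cond_a_given_b`, and so on) are illustrative assumptions, not part of any course code.

```python
import random

def random_joint(seed):
    """Random joint distribution over binary (A, B), keyed by (a, b)."""
    rng = random.Random(seed)
    weights = {(a, b): rng.random() + 0.01 for a in (0, 1) for b in (0, 1)}
    total = sum(weights.values())
    return {ab: w / total for ab, w in weights.items()}

def marginal_a(joint, a):
    """Pr(A=a), obtained by summing the joint over B."""
    return joint[(a, 0)] + joint[(a, 1)]

def marginal_b(joint, b):
    """Pr(B=b), obtained by summing the joint over A."""
    return joint[(0, b)] + joint[(1, b)]

def cond_a_given_b(joint, a, b):
    """Pr(A=a | B=b) from the definition of conditional probability."""
    return joint[(a, b)] / marginal_b(joint, b)

def cond_b_given_a(joint, b, a):
    """Pr(B=b | A=a) from the definition of conditional probability."""
    return joint[(a, b)] / marginal_a(joint, a)

def bayes_rule_holds(joint, tol=1e-12):
    """Compare Pr(A=0 | B=0) with Pr(B=0 | A=0) Pr(A=0) / Pr(B=0)."""
    lhs = cond_a_given_b(joint, 0, 0)
    rhs = cond_b_given_a(joint, 0, 0) * marginal_a(joint, 0) / marginal_b(joint, 0)
    return abs(lhs - rhs) < tol

def independence_holds(joint, tol=1e-12):
    """Compare Pr(A=1, B=1) with Pr(A=1) Pr(B=1)."""
    return abs(joint[(1, 1)] - marginal_a(joint, 1) * marginal_b(joint, 1)) < tol

joints = [random_joint(seed) for seed in range(100)]
all_bayes = all(bayes_rule_holds(j) for j in joints)
any_independence_failure = any(not independence_holds(j) for j in joints)
print(all_bayes, any_independence_failure)
```

The same pattern works for any of the other options: write the expression as a function of the joint and compare it against the target quantity over many seeds.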
2) Commute§
After running late many days in a row, you decide to do some probabilistic modeling of your morning commute, in an effort to make sure you can always get to 6.4110 on time :)
- Let L be a random variable that takes value 1 if you are running late leaving the house, and 0 otherwise.
- Let T be a random variable that takes value 1 if you get stuck behind a train, and 0 otherwise. (Assume we're at the far end of the E line where this will actually happen).
- Let G be a random variable that takes value 1 if you get a green light at Vassar and Mass Ave, and 0 otherwise.
Consider the following table of probabilities:
\Pr(L,~T) | T=1 | T=0 |
---|---|---|
L=1 | 0.70 | 0.10 |
L=0 | 0.06 | 0.14 |
Enter a single number in each of the following boxes, accurate to three digits after the decimal point. It is also fine to type in a numerical expression (like 2 / 3).
- What is the probability that you are running late?
- What is the probability that you get stuck behind a train, given that you are running late?
- What is the total probability that you get stuck behind a train?
Assume that L and T are related as given in the table above, and that \Pr(G = 1 | T = 1) = 0.1 and \Pr(G = 1 | T = 0) = 0.2, regardless of the value of L.
- What is the probability that you were stuck behind a train given that you got a green light at Vassar and Mass Ave?
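The quantities asked for above all come from three operations on the joint table: marginalising, conditioning, and applying Bayes' rule. The sketch below encodes the table as a dictionary keyed by `(l, t)` and uses the stated fact that G depends only on T. The function names are illustrative assumptions.

```python
# Joint distribution Pr(L=l, T=t) from the table, keyed by (l, t).
joint_lt = {(1, 1): 0.70, (1, 0): 0.10, (0, 1): 0.06, (0, 0): 0.14}

# Pr(G=1 | T=t), which the problem states holds regardless of L.
p_green_given_t = {1: 0.1, 0: 0.2}

def p_l(l):
    """Marginal Pr(L=l): sum the joint over T."""
    return sum(p for (li, _), p in joint_lt.items() if li == l)

def p_t(t):
    """Marginal Pr(T=t): sum the joint over L."""
    return sum(p for (_, ti), p in joint_lt.items() if ti == t)

def p_t_given_l(t, l):
    """Conditional Pr(T=t | L=l) = Pr(L=l, T=t) / Pr(L=l)."""
    return joint_lt[(l, t)] / p_l(l)

def p_t_given_green(t):
    """Posterior Pr(T=t | G=1) via Bayes' rule, normalising over T."""
    unnormalised = {ti: p_green_given_t[ti] * p_t(ti) for ti in (0, 1)}
    return unnormalised[t] / sum(unnormalised.values())

print(p_l(1), p_t_given_l(1, 1), p_t(1), p_t_given_green(1))
```

Normalising over both values of T avoids computing \Pr(G=1) separately; the denominator is exactly that marginal.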
3) SNPs §
Cheap sequencing that enables confident single-nucleotide readings over large pieces of DNA has only been achieved in the past ten years. One result of being able to read DNA at single-nucleotide resolution with high confidence is the ability to detect Single Nucleotide Polymorphisms (SNPs for short, but pronounced as "snips") in the human genome.
SNPs are common single-nucleotide variations (an A is instead a T for example) that are known to exist in the genome. Researchers have begun to analyze the frequency and association of SNPs with various diseases and medical disorders. These studies are called Genome Wide Association Studies (GWAS). When used in conjunction with Bayes' theorem, patient predisposition for diseases can be determined from analyzing the presence (or absence) of certain SNPs. We look at the probability that forms the basis of a GWAS on a fictional heart disorder below.
A heart disorder with a prevalence of 0.02 (2%) in the general population is investigated in a GWAS in a hospital study. 3000 subjects with the disorder are included in the study and 7000 subjects without the disorder are included as control subjects. Note that this doesn't mean the disorder has a 30% prevalence: selection bias went into collecting the study participants, so the general population prevalence is still 2% (i.e. P(\text{Heart Disorder}) = 0.02).
Phenotype | Total | Has SNP1 | Has SNP2 | Has SNP3 |
---|---|---|---|---|
Has Heart Disorder: | 3000 | 1600 | 920 | 1750 |
Control (no Heart Disorder): | 7000 | 3250 | 2150 | 2100 |
Based on the data in the table above, answer the following questions with three digits after the decimal place. Consider each SNP independently (do not worry about combinations of SNPs). Carry the full values through all of your calculations: the checker only looks at three decimal places, but an answer built from several already-rounded intermediate values can end up too imprecise.
- How likely is it that an individual has SNP3, given that they have a heart disorder?
- How likely is it that an individual has SNP3, given that they do not have a heart disorder?
- Overall, how likely is an individual to have SNP3?
- How likely is an individual to have a heart disorder if they have SNP1?
- How likely is an individual to have a heart disorder if they have SNP2?
- How likely is an individual to have a heart disorder if they have SNP3?
- How many times as likely (compared to the general population on average) is an individual to have the heart disorder if they have SNP1?
- How many times as likely (compared to the general population on average) is an individual to have the heart disorder if they have SNP2?
- How many times as likely (compared to the general population on average) is an individual to have the heart disorder if they have SNP3?
The human genome has millions of SNPs, and humans suffer from tens of thousands of disorders and diseases. Throw in the fact that the SNPs influence one another (not independent events) and it becomes a very, very active area of research, both in developing ways to handle the large data sets and in making the actual discoveries from the data. Papers are being published every day on new discoveries from GWAS.
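The questions in this section follow one recipe: estimate P(SNP | disorder) and P(SNP | no disorder) from the study counts, weight them by the general-population prevalence (not the 30% study proportion) to get P(SNP), then apply Bayes' theorem. A minimal sketch of that recipe, with the illustrative function name `snp_posterior` and unrounded intermediate values:

```python
PREVALENCE = 0.02            # P(Heart Disorder) in the general population
CASES, CONTROLS = 3000, 7000  # study group sizes from the table

# (count among cases, count among controls) for each SNP, from the table.
snp_counts = {
    "SNP1": (1600, 3250),
    "SNP2": (920, 2150),
    "SNP3": (1750, 2100),
}

def snp_posterior(snp):
    """Return (P(SNP), P(disorder | SNP), ratio to prevalence)."""
    in_cases, in_controls = snp_counts[snp]
    p_snp_given_disorder = in_cases / CASES
    p_snp_given_healthy = in_controls / CONTROLS
    # Total probability, weighted by the population prevalence.
    p_snp = (p_snp_given_disorder * PREVALENCE
             + p_snp_given_healthy * (1 - PREVALENCE))
    # Bayes' theorem.
    p_disorder_given_snp = p_snp_given_disorder * PREVALENCE / p_snp
    return p_snp, p_disorder_given_snp, p_disorder_given_snp / PREVALENCE

for name in snp_counts:
    print(name, snp_posterior(name))
```

Keeping every intermediate value as a float and rounding only when reporting follows the advice above about not combining rounded answers.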
4) Bees (Optional)§
You are a Japanese Honey Bee in a hive. Your fellow bees leave once a day and come back in the evening to report what they found during that day's flight. Because bees don't talk, they communicate through dancing in order to convey what they found. The dances that bees do are comprised of combinations of the following four dance moves:
- Shake
- Dance in Circles
- Moonwalk
- Flapping Wings
A dance routine consists of two or more different moves performed simultaneously. There are four possible events that a bee could report and they are conveyed through the following dance move combinations:
- Food Source Found: (Shake, Circles, Moonwalk)
- Other Bees Found: (Shake, Flap, Moonwalk)
- Japanese Giant Hornets are attacking: (Shake, Flap (and no Moonwalk))
- Nothing to Report: (Moonwalk, Circles (and no Shake))
You may assume that these are the only four combinations of dance moves that a bee can use.
On a normal weekday for bees (Monday through Friday, inclusive), a bee discovers the following events and returns to the hive to report them with this probability distribution:
DDist{'food':0.4, 'other bees': 0.1, 'Japanese Giant Hornets ATTACKING!': 0.2, 'Nothing': 0.3}
On a weekend day for bees (Saturday and Sunday), a bee has the following probability of reporting events:
DDist{'food':0.2, 'other bees': 0.3, 'Japanese Giant Hornets ATTACKING!': 0.0, 'Nothing': 0.5}
Answer the following questions; all answers should be accurate to within 10^{-3}. The solution boxes will accept python numerical expressions so feel free to enter your computations into them that way.
- On a random day, what is the probability that a specific bee returns having found food?
- You see a bee shaking in circles while doing the moonwalk. Based on only this information, what is the probability that it is a Tuesday?
- What is the probability that a specific bee's dance will involve shaking on any given day?
- Because of the strobe lights and cigarette smoke, you can't see all of a bee's dance moves...but you know that it is at least shaking. What is the probability that the bee is trying to say that Japanese Giant Hornets are attacking the nest?
- A bee is only shaking and flapping. Based only on this information, what is the probability that it is Monday?
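Each of these questions can be answered by building the joint distribution over (day, event) and then summing or conditioning. The sketch below uses plain dictionaries rather than the `DDist` class shown above, and it assumes that a "random day" is uniform over the seven days of the week. The helpers `prob` and `cond` are illustrative names.

```python
WEEKDAY = {'food': 0.4, 'other bees': 0.1,
           'Japanese Giant Hornets ATTACKING!': 0.2, 'Nothing': 0.3}
WEEKEND = {'food': 0.2, 'other bees': 0.3,
           'Japanese Giant Hornets ATTACKING!': 0.0, 'Nothing': 0.5}

# Dance moves used to report each event, from the list above.
MOVES = {
    'food': {'shake', 'circles', 'moonwalk'},
    'other bees': {'shake', 'flap', 'moonwalk'},
    'Japanese Giant Hornets ATTACKING!': {'shake', 'flap'},
    'Nothing': {'moonwalk', 'circles'},
}

DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']
WEEKEND_DAYS = {'Saturday', 'Sunday'}

# Joint Pr(day, event), assuming each day is equally likely (an assumption).
joint = {}
for day in DAYS:
    dist = WEEKEND if day in WEEKEND_DAYS else WEEKDAY
    for event, p in dist.items():
        joint[(day, event)] = p / len(DAYS)

def prob(pred):
    """Probability of the set of (day, event) pairs satisfying pred."""
    return sum(p for (day, event), p in joint.items() if pred(day, event))

def cond(target, given):
    """Pr(target | given) = Pr(target and given) / Pr(given)."""
    both = prob(lambda d, e: target(d, e) and given(d, e))
    return both / prob(given)

p_food = prob(lambda d, e: e == 'food')
p_shake = prob(lambda d, e: 'shake' in MOVES[e])
p_tuesday_given_food = cond(lambda d, e: d == 'Tuesday',
                            lambda d, e: e == 'food')
print(p_food, p_shake, p_tuesday_given_food)
```

Writing observations as predicates over (day, event) makes partial observations, such as "at least shaking", easy to express without enumerating cases by hand.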
5) Continuous Random Variables§
Given two independent, one-dimensional Gaussian random variables X_1 \sim \textrm{Normal}(\mu_1, \sigma_1^2) and X_2 \sim \textrm{Normal}(\mu_2, \sigma_2^2),
answer the following questions. Please input a Python formula in terms of mu_1, mu_2, sigma_1, and sigma_2.
- Provide an expression for the mean of the distribution of X_1 + X_2
- Provide an expression for the variance of the distribution of X_1 + X_2
- True or False: The original random variable with higher variance has more influence on the mean of the distribution of X_1 + X_2
  - True
  - False
- True or False: The more random variables we add together the less variance there will be in the result.
  - True
  - False
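A quick simulation can be used to check a proposed formula. For independent random variables, means add and variances add, so the sum should have mean mu_1 + mu_2 and variance sigma_1**2 + sigma_2**2. The sketch below draws samples with Python's standard `random.gauss` and compares the empirical moments with those expressions; the function name `simulate_sum` and the parameter values are arbitrary choices for illustration.

```python
import random

def simulate_sum(mu_1, sigma_1, mu_2, sigma_2, n=100_000, seed=0):
    """Empirical mean and variance of X_1 + X_2 for independent Gaussians."""
    rng = random.Random(seed)
    samples = [rng.gauss(mu_1, sigma_1) + rng.gauss(mu_2, sigma_2)
               for _ in range(n)]
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance

mu_1, sigma_1, mu_2, sigma_2 = 1.0, 2.0, -3.0, 0.5
empirical_mean, empirical_var = simulate_sum(mu_1, sigma_1, mu_2, sigma_2)
print(empirical_mean, mu_1 + mu_2)
print(empirical_var, sigma_1 ** 2 + sigma_2 ** 2)
```

The empirical values only approximate the formulas, with an error that shrinks as `n` grows.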
6) Bayes Nets and Tables§
Let's first look at how to go back and forth between Bayes nets and probability tables. Consider the following simple Bayes net, with three variables, A, N and S. Note that we have only given the values for where the dependent variable is true.
- How many rows are there in the probability table over the joint distribution of A, N, and S?
  Enter an integer.
- What is the probability that the airspeed is low (A=T), the nose is up (N=T) and the aircraft is not in stall (S = F)?
  Enter a number that is accurate to within 1.0e-5. You can also enter a python expression that will evaluate to a number (e.g., 3*2 + 4 - 7/11.0).
- What is the probability that the aircraft is in stall (S = T), given that we don't know the air speed or the nose angle?
  Enter a number that is accurate to within 1.0e-5. You can also enter a python expression that will evaluate to a number (e.g., 3*2 + 4 - 7/11.0).
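Going from a Bayes net to a joint table means multiplying, for each assignment, one entry from every node's conditional probability table. The sketch below illustrates this under two loudly stated assumptions: that A and N are independent parents of S, and that the numbers are made-up placeholders. Neither the structure nor the values are taken from the figure, so substitute the real ones before using it.

```python
from itertools import product

# HYPOTHETICAL placeholder values, not the ones from the figure.
p_a_true = 0.1   # assumed Pr(A=T)
p_n_true = 0.2   # assumed Pr(N=T)
# Assumed Pr(S=T | A=a, N=n), keyed by (a, n).
p_s_true_given = {
    (True, True): 0.9,
    (True, False): 0.3,
    (False, True): 0.2,
    (False, False): 0.01,
}

def bernoulli(p_true, value):
    """Probability of a binary value, given the probability of True."""
    return p_true if value else 1.0 - p_true

# Joint Pr(A, N, S) via the chain rule for the assumed structure:
# Pr(A) * Pr(N) * Pr(S | A, N).
joint = {
    (a, n, s): (bernoulli(p_a_true, a)
                * bernoulli(p_n_true, n)
                * bernoulli(p_s_true_given[(a, n)], s))
    for a, n, s in product([True, False], repeat=3)
}

# Marginal Pr(S=T): sum the joint over every value of A and N.
p_stall = sum(p for (a, n, s), p in joint.items() if s)
print(len(joint), joint[(True, True, False)], p_stall)
```

Because only the "true" entries are given for each node, the "false" entries come from subtracting from one, which is what `bernoulli` does.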
7) Satellites§
Consider the following Bayes net describing the health of a satellite, based on the status of its components:
- Ignoring the Bayes net figure above for the next three questions, let's practice manipulating tables. Suppose P\left (D = \begin{bmatrix} F \\ T \end{bmatrix}\right ) = \begin{bmatrix}0 \\ 1\end{bmatrix}, and P\left (\begin{bmatrix} D = F | E = F \\ D = T | E = F \\ D = F | E = T \\ D = T | E = T \end{bmatrix} \right ) = \begin{bmatrix} 0.9 \\ 0.1 \\ 0 \\ 1 \end{bmatrix}. Note that D=T is an observation, and we might want to infer the posterior over E. If we multiply these two factors together, we get a factor containing 4 unnormalised probabilities.
  Please give these unnormalised probabilities as a python list in the order of P(D=F, E=F), P(D=T, E=F), P(D=F, E=T), P(D=T, E=T):
- If we marginalise out D from that factor, we get a new factor with two unnormalised probabilities. (Note that this is not P(E | D = T)! We are missing the prior on E.)
  Please give these unnormalised probabilities as a python list, in order of P(E=F), P(E=T).
- Imagine we want to reverse the arc from C to E. What new tables would be required?
  - It is not possible to reverse the arcs in a Bayes net.
  - We need P(E|C).
  - We need P(E|C) and P(C).
  - We need P(E|B,S,C) and P(C).
  - We need P(E|B,S,C), P(C) and P(D).
  - We need P(E|B,S,C,D), P(C) and P(D).
- Suppose we observe a trajectory deviation (see description in the above Bayes net), D=True, and a communication loss, C=True. Which of these gives us the probability of electrical system failure, P(E=True)? Note that when we write a summation \sum_{B_i}, we are summing over the different outcomes B_i that B can take, which are True and False for all variables in this question.
  - \alpha P(D=T|E=T) * P(C=T|E=T)
  - \alpha (\sum_{D_i} P(D_i|E=T)) * (\sum_{C_j} P(C_j|E=T)) * (\sum_{B_k} P(E=T|B_k)) * (\sum_{S_m} P(E=T|S_m))
  - \alpha P(D=T|E=T) * P(C=T|E=T) * P(E=T|B=T) * P(E=T|S=T)
  - \alpha (\sum_D P(D|E=T)) * (\sum_C P(C|E=T)) * P(E=T|B) * P(E=T|S)
  - \alpha P(D=T|E=T) * P(C=T|E=T) * (\sum_{B_i} \sum_{S_j} P(E=T|B_i,S_j) * P(B_i) * P(S_j))
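The two table operations practised above, multiplying factors and marginalising out a variable, can be sketched with dictionaries. The example uses the observation factor over D and the table for D given E from the first question of this section; the variable names are illustrative.

```python
# Observation factor over D, keyed by the value of D.
factor_d = {False: 0.0, True: 1.0}

# Table for P(D=d | E=e), keyed by (d, e), in the order given in the text.
factor_d_given_e = {
    (False, False): 0.9,
    (True, False): 0.1,
    (False, True): 0.0,
    (True, True): 1.0,
}

# Factor product: entries that agree on D are multiplied together.
product_factor = {
    (d, e): factor_d[d] * p for (d, e), p in factor_d_given_e.items()
}

# Marginalise out D: for each value of E, sum over both values of D.
marginal_e = {
    e: sum(product_factor[(d, e)] for d in (False, True))
    for e in (False, True)
}

print(list(product_factor.values()))
print([marginal_e[False], marginal_e[True]])
```

The result is unnormalised and, as noted above, is not a posterior over E until it is combined with a prior on E and renormalised.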
8) Feedback§
- How many hours did you spend on this homework?
- Do you have any comments or suggestions (about the problems or the class)?