18.05 Spring 2005 Lecture Notes
18.05 Lecture 1, February 2, 2005
Required Textbook: DeGroot & Schervish, "Probability and Statistics," Third Edition
Recommended Introduction to Probability Text: Feller, Vol. 1
§1.2-1.4 Probability, Set Operations.

What is probability?
• Classical interpretation: all outcomes have equal probability (coin, dice).
• Subjective interpretation (nature of the problem): uses a model, randomness involved (such as weather).
  – ex. a drop of paint falls into a glass of water; a model can describe P(hit bottom before sides).
  – or P(survival after surgery), "subjective," estimated by the doctor.
• Frequency interpretation: probability based on history.
  – P(make a free shot) is based on the history of shots made.

Experiment: has a random outcome.
1. Sample space: the set of all possible outcomes.
   coin: S = {H, T}; die: S = {1, 2, 3, 4, 5, 6}; two dice: S = {(i, j) : i, j = 1, 2, ..., 6}
2. Events: any subset A ⊆ S; A denotes the collection of all events.
3. Probability distribution: P : A → [0, 1]. For an event A ⊆ S, P(A) or Pr(A) is the probability of A.

Properties of Probability:
1. 0 ≤ P(A) ≤ 1
2. P(S) = 1
3. For disjoint (mutually exclusive) events A, B (i.e. A ∩ B = ∅): P(A or B) = P(A) + P(B).
This extends to any number of events. For a sequence of disjoint events A₁, ..., Aₙ, ... (Aᵢ ∩ Aⱼ = ∅ for i ≠ j):
P(⋃_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ P(Aᵢ)
which is called "countable additivity."
If continuous, can’t talk about P(outcome), need to consider P(set)
Example: S = [0, 1], 0 < a < b < 1.
P([a, b]) = b − a, P(a) = P(b) = 0.
Need to group outcomes, not sum up individual points since they all have P = 0.
§1.3 Events, Set Operations
Union: A ∪ B = {s ∈ S : s ∈ A or s ∈ B}
Intersection: A ∩ B = AB = {s ∈ S : s ∈ A and s ∈ B}
Complement: Aᶜ = {s ∈ S : s ∉ A}
Set difference: A \ B = A − B = {s ∈ S : s ∈ A and s ∉ B} = A ∩ Bᶜ
Symmetric Difference: A △ B = {s ∈ S : (s ∈ A and s ∉ B) or (s ∈ B and s ∉ A)} = (A ∩ Bᶜ) ∪ (B ∩ Aᶜ)

Summary of Set Operations:
1. Union: A ∪ B = {s ∈ S : s ∈ A or s ∈ B}
2. Intersection: A ∩ B = AB = {s ∈ S : s ∈ A and s ∈ B}
3. Complement: Aᶜ = {s ∈ S : s ∉ A}
4. Set difference: A \ B = A − B = {s ∈ S : s ∈ A and s ∉ B} = A ∩ Bᶜ
5. Symmetric difference: A △ B = {s ∈ S : (s ∈ A and s ∉ B) or (s ∈ B and s ∉ A)} = (A ∩ Bᶜ) ∪ (B ∩ Aᶜ)

Properties of Set Operations:
1. A ∪ B = B ∪ A
2. (A ∪ B) ∪ C = A ∪ (B ∪ C)
Note that 1. and 2. are also valid for intersections.
3. For mixed operations, union and intersection distribute over each other:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
Think of union as addition and intersection as multiplication: (A + B)C = AC + BC.
4. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ. Can be proven by the diagram below:
Both diagrams give the same shaded area.
5. (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ. Prove by looking at a particular point:
s ∈ (A ∩ B)ᶜ ⟺ s ∉ (A ∩ B) ⟺ s ∉ A or s ∉ B ⟺ s ∈ Aᶜ or s ∈ Bᶜ ⟺ s ∈ (Aᶜ ∪ Bᶜ). QED
** End of Lecture 1
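These two identities are easy to sanity-check numerically. A minimal Python sketch (not part of the original notes); the sample space S and the events A, B are arbitrary examples:

```python
# Numerical check of De Morgan's laws on a small finite sample space.
S = set(range(1, 11))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(E):
    return S - E

# (A ∪ B)^c == A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)
# (A ∩ B)^c == A^c ∪ B^c
assert complement(A & B) == complement(A) | complement(B)
print("De Morgan's laws hold for this example.")
```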
18.05 Lecture 2 February 4, 2005
§1.5 Properties of Probability.
1. P(A) ∈ [0, 1]
2. P(S) = 1
3. P(⋃ Aᵢ) = Σ P(Aᵢ) if the Aᵢ are disjoint, i.e. Aᵢ ∩ Aⱼ = ∅ for i ≠ j.
The probability of a union of disjoint events is the sum of their probabilities.
4. P(∅) = 0: P(S) = P(S ∪ ∅) = P(S) + P(∅) = 1,
where S and ∅ are disjoint by definition and P(S) = 1 by property 2, therefore P(∅) = 0.
5. P(Aᶜ) = 1 − P(A),
because A and Aᶜ are disjoint and P(A ∪ Aᶜ) = P(S) = 1 = P(A) + P(Aᶜ);
the probabilities of an event and its complement sum to 1.
6. If A ⊆ B, then P(A) ≤ P(B):
by definition, B = A ∪ (B \ A), a union of two disjoint sets, so
P(B) = P(A) + P(B \ A) ≥ P(A).
7. P(A ∪ B) = P(A) + P(B) − P(AB).
We must subtract the intersection because it would otherwise be counted twice, as shown.
To prove it, write everything in terms of disjoint pieces:
P(A) = P(A \ B) + P(AB)
P(B) = P(B \ A) + P(AB)
P(A ∪ B) = P(A \ B) + P(B \ A) + P(AB)
Example: A doctor knows that P(bacterial infection) = 0.7 and P(viral infection) = 0.4.
What is P(both) if P(bacterial ∪ viral) = 1?
P(both) = P(B ∩ V):
1 = 0.7 + 0.4 − P(BV)
P(BV) = 0.1
Finite Sample Spaces
There are a finite number of outcomes: S = {s₁, ..., sₙ}. Define pᵢ = P(sᵢ), the probability function, with
pᵢ ≥ 0 and Σ_{i=1}^n pᵢ = 1,
and for any event A,
P(A) = Σ_{s∈A} P(s).
Classical, simple sample spaces: all outcomes have equal probabilities, so P(A) = #(A)/#(S), computed by counting methods.
Multiplication rule: if #(S₁) = m and #(S₂) = n, then #(S₁ × S₂) = mn.
Sampling without replacement (one at a time, order is important):
s₁, ..., sₙ outcomes, choose k ≤ n of them.
#(outcome vectors (a₁, a₂, ..., aₖ)) = n(n − 1) × ... × (n − k + 1) = Pₙ,ₖ
Example: order the numbers 1, 2, and 3 in groups of 2. (1, 2) and (2, 1) are different.
P₃,₂ = 3 × 2 = 6
Pₙ,ₙ = n(n − 1) × ... × 1 = n!
Pₙ,ₖ = n! / (n − k)!
Example: Order 6 books on a shelf = 6! permutations.
Sampling with replacement, k out of n
number of possibilities = n × n × ... × n = nᵏ
Example: Birthday Problem- In a group of k people,
what is the probability that 2 people will have the same birthday?
Assume n = 365 and that birthdays are equally distributed throughout the year, no twins, etc.
#(all possible birthday combinations) = #(S) = 365ᵏ
#(at least 2 the same) = #(S) − #(all different) = 365ᵏ − P₃₆₅,ₖ
P(at least 2 have the same birthday) = 1 − P₃₆₅,ₖ / 365ᵏ
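A minimal Python sketch of this formula (not part of the original notes); the group sizes printed are arbitrary choices:

```python
from math import prod

def p_shared_birthday(k, n=365):
    """P(at least two of k people share a birthday) = 1 - P_{n,k} / n^k."""
    # P_{n,k} / n^k = product of (n - i)/n for i = 0, ..., k-1
    p_all_different = prod((n - i) / n for i in range(k))
    return 1 - p_all_different

for k in (10, 23, 50):
    print(k, round(p_shared_birthday(k), 4))
# k = 23 already gives a probability just over 0.5
```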
Sampling without replacement, k at once:
from s₁, ..., sₙ sample a subset of size k, b₁, ..., bₖ, if we aren't concerned with order.
Number of subsets: Cₙ,ₖ = C(n, k) = n! / (k!(n − k)!)
Each subset can be ordered in k! ways, so divide that out of Pₙ,ₖ.
The Cₙ,ₖ are the binomial coefficients.
Binomial Theorem: (x + y)ⁿ = Σ_{k=0}^n C(n, k) xᵏ yⁿ⁻ᵏ
There are C(n, k) ways that each term xᵏyⁿ⁻ᵏ shows up in the expansion.
Example: a red balls and b black balls.
Number of distinguishable ways to order them in a row = C(a + b, a) = C(a + b, b).
Example: r₁ + ... + rₖ = n, where rᵢ = number of balls in box i; n and k given.
How many ways are there to split n objects into k groups?
Visualize the balls in boxes, in a line, as shown.
Fix the outer walls of the first and last boxes; then you can rearrange the n balls and the k − 1 inner separators (k boxes need k − 1 separators), counted by a binomial coefficient.
Number of different ways to arrange the balls and separators = C(n + k − 1, n) = C(n + k − 1, k − 1).
Example: f(x₁, x₂, ..., xₖ); take n partial derivatives, i.e. ∂ⁿf / (∂x₁^{n₁} ∂x₂^{n₂} ... ∂xₖ^{nₖ}) with n₁ + ... + nₖ = n.
The k "boxes" correspond to the k "coordinates" and the n "balls" to the n "partial derivatives,"
so the number of different partial derivatives of order n = C(n + k − 1, n) = C(n + k − 1, k − 1).
Example: In a deck of 52 cards, 5 cards are chosen.
What is the probability that all 5 cards have different face values?
Total number of outcomes = C(52, 5)
Total number of face-value combinations = C(13, 5)
Total number of suit possibilities, with replacement = 4⁵
P(all 5 different face values) = C(13, 5) · 4⁵ / C(52, 5)
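A one-line numerical check of this probability (not in the original notes; Python's math.comb plays the role of C(n, k)):

```python
from math import comb

# P(all 5 cards have different face values) = C(13,5) * 4^5 / C(52,5)
p = comb(13, 5) * 4**5 / comb(52, 5)
print(round(p, 4))  # about 0.507
```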
** End of Lecture 2.
18.05 Lecture 3, February 7, 2005
Pₙ,ₖ = n!/(n − k)! : choose k out of n, order counts, without replacement.
nᵏ : choose k out of n, order counts, with replacement.
Cₙ,ₖ = n!/(k!(n − k)!) : choose k out of n, order doesn't count, without replacement.
§1.9 Multinomial Coefficients
These values are used to split objects into groups of various sizes.
s₁, s₂, ..., sₙ : n elements such that n₁ go in group 1, n₂ in group 2, ..., nₖ in group k, with n₁ + ... + nₖ = n.
C(n, n₁) × C(n − n₁, n₂) × C(n − n₁ − n₂, n₃) × ... × C(n − n₁ − ... − n_{k−2}, n_{k−1}) × C(nₖ, nₖ)
= [n!/(n₁!(n − n₁)!)] × [(n − n₁)!/(n₂!(n − n₁ − n₂)!)] × ... × [(n − n₁ − ... − n_{k−2})!/(n_{k−1}!(n − n₁ − ... − n_{k−1})!)] × 1
= n!/(n₁! n₂! ... n_{k−1}! nₖ!), denoted (n choose n₁, n₂, ..., nₖ).
These are called multinomial coefficients.
Further explanation: You have n “spots” in which you have n! ways to place your elements.
However, you can permute the elements within a particular group and the splitting is still the same.
You must therefore divide out these internal permutations.
This is a “distinguishable permutations” situation.
Example #1: 20 members of a club need to be split into 3 committees (A, B, C) of 8, 8, and 4 people,
respectively. How many ways are there to split the club into these committees?
Ways to split = (20 choose 8, 8, 4) = 20!/(8! 8! 4!)
Example #2: When rolling 12 dice, what is the probability that 6 pairs are thrown?
This can be thought of as "each number appears exactly twice."
There are 6¹² possibilities for the dice throws, as each of the 12 dice has 6 possible values.
For the pairs, the only freedom is which dice show each number:
(12 choose 2, 2, 2, 2, 2, 2) = 12!/(2!)⁶, so P = 12!/((2!)⁶ · 6¹²) ≈ 0.0034
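A quick numerical check of this value (not in the original notes):

```python
from math import factorial

# Orderings of 12 dice showing each face exactly twice, over 6^12 equally likely outcomes.
favorable = factorial(12) // factorial(2) ** 6
p = favorable / 6 ** 12
print(round(p, 4))  # about 0.0034
```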
Example #3 - Playing Bridge
Players A, B, C, and D each get 13 cards.
P(A gets 6 ♠s, B gets 4 ♠s, C gets 2 ♠s, D gets 1 ♠) = ?
P = (choose ♠s)(choose other cards) / (ways to arrange all cards)
  = [(13 choose 6, 4, 2, 1) × (39 choose 7, 9, 11, 12)] / (52 choose 13, 13, 13, 13) ≈ 0.00196
Note: if it didn't matter which player got which count, multiply by 4! to arrange the players over the hands.
Alternate way to solve: just track the locations of the ♠s:
P = [C(13, 6) C(13, 4) C(13, 2) C(13, 1)] / C(52, 13)
Probabilities of Unions of Events:
P(A ∪ B) = P(A) + P(B) − P(AB)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC)
§1.10 Calculating the probability of a union of events
P(A ∪ B) = P(A) + P(B) − P(AB) (Figure 1)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC) (Figure 2)
Theorem:
P(⋃_{i=1}^n Aᵢ) = Σᵢ P(Aᵢ) − Σ_{i<j} P(AᵢAⱼ) + Σ_{i<j<k} P(AᵢAⱼAₖ) − ... + (−1)ⁿ⁺¹ P(A₁...Aₙ)
To prove it, express each disjoint piece, then add them up according to which sets each piece
belongs or doesn't belong to.
A₁ ∪ ... ∪ Aₙ can be split into a disjoint partition of sets of the form
A_{i₁} ∩ A_{i₂} ∩ ... ∩ A_{iₖ} ∩ Aᶜ_{i_{k+1}} ∩ ... ∩ Aᶜ_{iₙ}, where k is the number of sets the piece belongs to.
P(⋃_{i=1}^n Aᵢ) = Σ P(disjoint pieces of the partition)
To check that the theorem is correct, see how many times each piece of the partition is counted.
In P(A₁), P(A₂), ..., P(Aₖ): k times.
In Σ_{i<j} P(AᵢAⱼ): (k choose 2) times
(the piece must be contained in both Aᵢ and Aⱼ, and there are (k choose 2) such pairs among the k sets containing it).
Example: Consider the piece A ∩ B ∩ Cᶜ, as shown:
This piece is counted: in P(A ∪ B ∪ C), once; in P(A) + P(B) + P(C), twice;
in −P(AB) − P(AC) − P(BC), subtracted once; in +P(ABC), zero times.
The sum: 2 − 1 + 0 = 1, so the piece is counted exactly once.
Example: Consider the piece A₁ ∩ A₂ ∩ A₃ ∩ A₄ᶜ, so k = 3, n = 4.
P(A₁) + P(A₂) + P(A₃) + P(A₄): counted k times (3 times).
−P(A₁A₂) − P(A₁A₃) − P(A₁A₄) − P(A₂A₃) − P(A₂A₄) − P(A₃A₄): counted (k choose 2) times (3 times).
In general, the piece is counted (k choose i) times in the i-th sum, so the total count follows from the identity:
0 = (1 − 1)ᵏ = Σ_{i=0}^k (k choose i)(−1)ⁱ(1)^{k−i} = (k choose 0) − (k choose 1) + (k choose 2) − (k choose 3) + ...
so 0 = 1 − (number of times the piece is counted), i.e. every disjoint piece is counted exactly once.
** End of Lecture 3
18.05 Lecture 4 February 11, 2005
Union of Events
P(A₁ ∪ ... ∪ Aₙ) = Σᵢ P(Aᵢ) − Σ_{i<j} P(AᵢAⱼ) + Σ_{i<j<k} P(AᵢAⱼAₖ) − ...
It is often easier to calculate P(intersections) than P(unions)
Matching Problem: You have n letters and n envelopes, randomly stuff the letters into the envelopes.
What is the probability that at least one letter will match its intended envelope?
P(A₁ ∪ ... ∪ Aₙ), where Aᵢ = {letter i goes into its own envelope}.
P(Aᵢ) = 1/n = (n − 1)!/n! (permute everyone else if just letter i is in the right place)
P(AᵢAⱼ) = (n − 2)!/n! (letters i and j are in the right place)
P(A_{i₁} A_{i₂} ... A_{iₖ}) = (n − k)!/n!
P(A₁ ∪ ... ∪ Aₙ) = n × (1/n) − (n choose 2)(n − 2)!/n! + (n choose 3)(n − 3)!/n! − ... + (−1)ⁿ⁺¹(n choose n)(n − n)!/n!
General term: (n choose k)(n − k)!/n! = n!(n − k)!/(k!(n − k)!n!) = 1/k!
SUM = 1 − 1/2! + 1/3! − ... + (−1)ⁿ⁺¹ 1/n!
Recall the Taylor series eˣ = 1 + x + x²/2! + x³/3! + ...;
for x = −1, e⁻¹ = 1 − 1 + 1/2! − 1/3! + ...,
therefore SUM → 1 − e⁻¹ as n → ∞.
When n is large, the probability converges to 1 − e⁻¹ ≈ 0.63.

§2.1 Conditional Probability
Given that B "happened," what is the probability that A also happened?
The sample space is narrowed down to the space where B has occurred:
The sample space now only includes outcomes where event B happened.
Definition: the conditional probability of event A given event B is
P(A|B) = P(AB)/P(B).
Visually, conditional probability is the area shown below.
It is sometimes easier to calculate an intersection from a conditional probability: P(AB) = P(A|B)P(B).
Example: Roll 2 dice and let T be the sum. Given that T is odd, find P(T < 8).
B = {T is odd}, A = {T < 8}
P(A|B) = P(AB)/P(B), and P(B) = 18/36 = 1/2.
All possible odd values of T: 3, 5, 7, 9, 11, with 2, 4, 6, 4, 2 ways to get them, respectively.
P(AB) = P(T ∈ {3, 5, 7}) = 12/36 = 1/3, so P(A|B) = (1/3)/(1/2) = 2/3.
Example: Roll 2 dice until a sum of 7 or 8 results (T = 7 or 8).
A = {T = 7}, B = {T = 7 or 8}.
This is the same as conditioning on a single roll:
P(A|B) = P(AB)/P(B) = P(A)/P(B) = (6/36)/((6 + 5)/36) = 6/11.
Example: Treatments for a disease, results after 2 years:

              A    B    C    Placebo
Relapse      18   13   22    24
No Relapse   22   25   16    10

Considering the placebo: B = Placebo, A = Relapse, P(A|B) = 24/(24 + 10) ≈ 0.7.
Considering treatment B: P(A|B) = 13/(13 + 25) ≈ 0.34.
As stated earlier, conditional probability can be used to calculate intersections:
Example: You have r red balls and b black balls in a bin.
Draw 2 without replacement. What is P(1st = red, 2nd = black)?
P(1st = red) = r/(r + b). Given the 1st was red, there are only r − 1 red balls and still b black balls, so
P(2nd = black | 1st = red) = b/(r + b − 1), and P(AB) = [r/(r + b)] × [b/(r + b − 1)].
In general, the multiplication rule gives
P(A₁A₂...Aₙ) = P(A₁) × P(A₂|A₁) × P(A₃|A₁A₂) × ... × P(Aₙ|A₁...A_{n−1})
= P(A₁) × [P(A₂A₁)/P(A₁)] × [P(A₃A₂A₁)/P(A₂A₁)] × ... × [P(AₙA_{n−1}...A₁)/P(A_{n−1}...A₁)]
= P(AₙA_{n−1}...A₁), so the telescoping product checks out.
Example, continued: now find P(r, b, b, r), the probability of drawing red, black, black, red in that order:
= [r/(r + b)] × [b/(r + b − 1)] × [(b − 1)/(r + b − 2)] × [(r − 1)/(r + b − 3)]
Example, casino game: craps. What's the probability of actually winning?
On the first roll: 7 or 11 wins; 2, 3, or 12 loses; any other number x₁ means you continue playing.
You then keep rolling: if you roll x₁ again before a 7 you win; if a 7 comes first, you lose.
P(win) = P(x₁ = 7 or 11) + P(x₁ = 4)P(get 4 before 7 | x₁ = 4) + P(x₁ = 5)P(get 5 before 7 | x₁ = 5) + ... ≈ 0.493
The game is almost fair!
** End of Lecture 4
18.05 Lecture 5 February 14, 2005
§2.2 Independence of events.
P(A|B) = P(AB)/P(B). Definition: A and B are independent if P(A|B) = P(A), i.e.
P(A|B) = P(AB)/P(B) = P(A) ⟺ P(AB) = P(A)P(B).
Experiments can be physically independent (roll 1 die, then roll another die),
or seem physically related and still be independent.
Example: roll a die, A = {odd}, B = {1, 2, 3, 4}. Related events, but independent:
P(A) = 1/2, P(B) = 2/3, AB = {1, 3}, so P(AB) = 1/3 = (1/2) × (2/3) = P(A)P(B), therefore independent.
Independence does not imply that the sets do not intersect:
disjoint ≠ independent.
If A, B are independent, find P(ABᶜ):
P(AB) = P(A)P(B)
AB c = A \ AB, as shown:
so, P(AB c ) = P(A) − P(AB)
= P(A) − P(A)P(B)
= P(A)(1 − P(B))
= P(A)P(B c )
therefore, A and B c are independent as well.
similarly, Ac and B c are independent. See Pset 3 for proof.
Independence allows you to find the probability of an intersection through simple multiplication.
Example: toss an unfair coin twice; the tosses are independent events. P(H) = p, 0 ≤ p ≤ 1. Find P("TH") = P(tails first, heads second).
P("TH") = P(T)P(H) = (1 − p)p.
Since the coin is unfair, the probability is not simply 1/4; if it were fair, P(HH) = P(HT) = P(TH) = P(TT) = 1/4.
If you have several events A₁, A₂, ..., Aₙ that you need to prove independent,
it is necessary to show that every subset is independent:
for all subsets A_{i₁}, A_{i₂}, ..., A_{iₖ}, 2 ≤ k ≤ n,
P(A_{i₁} A_{i₂} ... A_{iₖ}) = P(A_{i₁}) P(A_{i₂}) ... P(A_{iₖ}).
You could prove that any 2 events are independent, which is called “pairwise” independence,
but this is not sufficient to prove that all events are independent.
Example of pairwise independence:
Consider a tetrahedral die, equally weighted.
Three of the faces are each colored red, blue, and green,
but the last face is multicolored, containing red, blue and green.
P(red) = 2/4 = 1/2 = P(blue) = P(green)
P(red and blue) = 1/4 = 1/2 × 1/2 = P(red)P(blue)
Therefore, the pair {red, blue} is independent.
The same can be proven for {red, green} and {blue, green}.
but, what about all three together?
P(red, blue, and green) = 1/4 ≠ P(red)P(blue)P(green) = 1/8, so the events are not fully independent.
Example: P(H) = p, P(T ) = 1 − p for unfair coin
Toss the coin 5 times � P(“HTHTT”)
= P(H)P(T )P(H)P(T )P(T )
= p(1 − p)p(1 − p)(1 − p) = p2 (1 − p)3
Example: Find P(get 2H and 3T, in any order)
= sum of probabilities for ordering
= P(HHT T T ) + P(HT HT T ) = ...
= p²(1 − p)³ + p²(1 − p)³ + ... = (5 choose 2) p²(1 − p)³
General example: throw a coin n times; P(k heads out of n throws) = (n choose k) pᵏ(1 − p)ⁿ⁻ᵏ
Example: Toss a coin until the result is “heads;” there are n tosses before H results.
P(number of tosses = n) =?
needs to result as “TTT....TH,” number of T’s = (n - 1)
P(tosses = n) = P(T T...H) = (1 − p)n−1 p
Example: In a criminal case, witnesses give a specific description of the couple seen fleeing the scene.
P(random couple meets description) = 8.3 × 10−8 = p
We know at the beginning that 1 couple exists. Perhaps a better question to be asked is:
Given a couple exists, what is the probability that another couple fits the same description?
i.e. what is P(at least 2 couples fit | at least 1 couple fits)?
Let A = {at least 1 couple fits the description}, B = {at least 2 couples fit}; find P(B|A).
P(B|A) = P(BA)/P(A) = P(B)/P(A), since B ⊆ A.
Out of n couples, P(A) = P(at least 1 couple) = 1 − P(no couple fits).
If no couple fits, then each couple individually fails to fit; by independence, multiply:
P(A) = 1 − (1 − p)ⁿ
P(B) = P(at least two) = 1 − P(0 couples) − P(exactly 1 couple)
= 1 − (1 − p)ⁿ − n p(1 − p)ⁿ⁻¹; note that P(exactly 1) is the binomial probability of 1 success in n trials.
P(B|A) = [1 − (1 − p)ⁿ − n p(1 − p)ⁿ⁻¹] / [1 − (1 − p)ⁿ]
If n = 8 million people, P(B |A) = 0.2966, which is within reasonable doubt! P(2 couples) < P(1 couple), but given that 1 couple exists, the probability that 2 exist is not insignificant.
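A minimal Python check of this number (not in the original notes); n = 8,000,000 couples is taken from the figure quoted above, and p is the value given earlier:

```python
# P(at least two couples fit | at least one fits), using the formula derived above.
p = 8.3e-8          # probability a random couple fits the description
n = 8_000_000       # assumed number of couples

p_at_least_one = 1 - (1 - p) ** n
p_at_least_two = 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)
print(round(p_at_least_two / p_at_least_one, 4))  # roughly 0.3, in line with the 0.2966 quoted above
```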
In the large sample space, the probability that B occurs given that we know A occurred is significant!
§2.3 Bayes's Theorem
It is sometimes useful to separate a sample space S into a disjoint partition:
B₁, ..., Bₖ, a partition of the sample space S, with Bᵢ ∩ Bⱼ = ∅ for i ≠ j and S = ⋃_{i=1}^k Bᵢ (disjoint).
Total probability: P(A) = Σ_{i=1}^k P(ABᵢ) = Σ_{i=1}^k P(A|Bᵢ)P(Bᵢ)
(all the ABᵢ are disjoint and ⋃_{i=1}^k ABᵢ = A).
** End of Lecture 5
18.05 Lecture 6 February 16, 2005
Solutions to Problem Set #1
1-1 pg. 12 #9: Bₙ = ⋃_{i=n}^∞ Aᵢ, Cₙ = ⋂_{i=n}^∞ Aᵢ.
a) Bₙ ⊇ Bₙ₊₁: Bₙ = Aₙ ∪ (⋃_{i=n+1}^∞ Aᵢ) = Aₙ ∪ Bₙ₊₁, so s ∈ Bₙ₊₁ ⇒ s ∈ Bₙ₊₁ ∪ Aₙ = Bₙ.
Cₙ ⊆ Cₙ₊₁: Cₙ = Aₙ ∩ Cₙ₊₁, so s ∈ Cₙ = Aₙ ∩ Cₙ₊₁ ⇒ s ∈ Cₙ₊₁.
b) s ∈ ⋂_{n=1}^∞ Bₙ ⇒ s ∈ Bₙ for all n ⇒ s ∈ ⋃_{i=n}^∞ Aᵢ for all n ⇒ s ∈ some Aᵢ with i ≥ n, for every n
⇒ s belongs to infinitely many events Aᵢ ⇒ the Aᵢ happen infinitely often.
c) s ∈ ⋃_{n=1}^∞ Cₙ ⇒ s ∈ some Cₙ = ⋂_{i=n}^∞ Aᵢ ⇒ for some n, s ∈ all Aᵢ with i ≥ n
⇒ s belongs to all events starting at some n.
1-2 pg. 18 #4: P(at least 1 fails) = 1 − P(neither fails) = 1 − 0.4 = 0.6.
1-3 pg. 18 #12: A₁, A₂, ...; set B₁ = A₁, B₂ = A₁ᶜA₂, ..., Bₙ = A₁ᶜ...A_{n−1}ᶜAₙ.
Then P(⋃_{i=1}^n Aᵢ) = Σ_{i=1}^n P(Bᵢ): the Bᵢ split the union into disjoint events and cover the whole space.
This follows from ⋃_{i=1}^n Aᵢ = ⋃_{i=1}^n Bᵢ: take a point s in ⋃ Aᵢ, so s belongs to at least one Aᵢ;
if s ∈ A₁ = B₁ we are done, if not then s ∈ A₁ᶜ, and if s ∈ A₂ then s ∈ A₁ᶜA₂ = B₂, if not... etc.
At some point the point belongs to a set: the sequence stops when s ∈ A₁ᶜ ∩ A₂ᶜ ∩ ... ∩ A_{k−1}ᶜ ∩ Aₖ = Bₖ,
so s ∈ ⋃_{i=1}^n Bᵢ, and P(⋃_{i=1}^n Aᵢ) = P(⋃_{i=1}^n Bᵢ) = Σ_{i=1}^n P(Bᵢ) if the Bᵢ are disjoint.
(One should also check that a point in some Bᵢ belongs to some Aᵢ, which is immediate since Bᵢ ⊆ Aᵢ.)
To prove the Bᵢ are disjoint, by construction: Bᵢ = A₁ᶜ ∩ ... ∩ A_{i−1}ᶜ ∩ Aᵢ and, for j > i,
Bⱼ = A₁ᶜ ∩ ... ∩ Aᵢᶜ ∩ ... ∩ A_{j−1}ᶜ ∩ Aⱼ,
so s ∈ Bᵢ ⇒ s ∈ Aᵢ while s′ ∈ Bⱼ ⇒ s′ ∉ Aᵢ, which implies s ≠ s′.
1-4 pg. 27 #5: #(S) = 6 × 6 × 6 × 6 = 6⁴, #(all different) = 6 × 5 × 4 × 3 = P₆,₄,
P(all different) = P₆,₄/6⁴ = 5/18.
1-5 pg. 27 #7
12 balls in 20 boxes.
P(no box receives > 1 ball, each box will have 0 or 1 balls)
also means that all balls fall into different boxes.
#(S) = 20¹²
#(all different) = 20 × 19 × ... × 9 = P₂₀,₁₂
P = P₂₀,₁₂ / 20¹²
1-6 pg. 27 #10
100 balls, r red balls.
Ai = {draw red at step i}
think of arranging the balls in 100 spots in a row.
a) P(A₁) = r/100.
b) P(A₅₀): the sample space is sequences of length 50,
#(S) = 100 × 99 × ... × 51 = P₁₀₀,₅₀
#(A₅₀) = r × P₉₉,₄₉ (r choices for the red ball in position 50, then arrange 49 of the other 99 balls)
P(A₅₀) = r/100, the same as part a.
c) As shown in part b, the particular draw doesn't matter; the probability is the same:
P(A₁₀₀) = r/100.
1-7 pg. 34 #6
Seat n people in n spots.
#(S) = n!
#(AB sit together) =?
visualize n seats, you have n-1 choices for the pair.
2(n-1) ways to seat the pair, because you can switch the two people.
but you also need to account for the (n − 2)! arrangements of the remaining people:
#(AB together) = 2(n − 1)(n − 2)!
therefore P = 2(n − 1)!/n! = 2/n.
Or, think of the pair as one entity: there are (n − 1) entities, permute them, and multiply by 2 to swap the pair.
1-8 pg. 34 #11
Out of 100, choose 12: #(S) = C(100, 12); #(A and B both on the committee) = C(98, 10), choosing 10 from the 98 remaining.
P = C(98, 10)/C(100, 12)
1-9 pg. 34 #16
50 states × 2 senators each.
a) Select 8: #(S) = C(100, 8); #(senator 1 or senator 2 of a given state is chosen) = C(2, 2)C(98, 6) + C(2, 1)C(98, 7),
or calculate the complement: 1 − P(neither chosen) = 1 − C(98, 8)/C(100, 8).
b) #(groups with one senator from each state) = 2⁵⁰; selecting a group of 50 from 100 gives #(S) = C(100, 50), so P = 2⁵⁰/C(100, 50).
1-10 pg. 34 #17
In the sample space, consider only the positions of the 4 aces among the 52 cards:
#(S) = C(52, 4), #(all 4 aces go to one player) = 4 × C(13, 4), so P = 4 × C(13, 4)/C(52, 4).
1-11 r balls, n boxes, no box is empty.
first of all, put 1 ball in each box from the beginning.
r-n balls remain to be distributed in n boxes.
C(n + (r − n) − 1, r − n) = C(r − 1, r − n)
1-12 30 people, 12 months.
P(6 months with 3 birthdays, 6 months with 2 birthdays)
#(S) = 12³⁰
Need to choose which 6 months get 3 birthdays (the other 6 get 2), then use the multinomial coefficient:
#(possibilities) = C(12, 6) × (30 choose 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2)
** End of Lecture 6
18.05 Lecture 7 February 18, 2005
Bayes' Formula.
Partition B₁, ..., Bₖ: ⋃_{i=1}^k Bᵢ = S, Bᵢ ∩ Bⱼ = ∅ for i ≠ j.
P(A) = Σ_{i=1}^k P(ABᵢ) = Σ_{i=1}^k P(A|Bᵢ)P(Bᵢ) (total probability).
Example: In box 1, there are 60 short bolts and 40 long bolts. In box 2,
there are 10 short bolts and 20 long bolts. Take a box at random, and pick a bolt.
What is the probability that you chose a short bolt?
B1 = choose Box 1.
B2 = choose Box 2.
P(short) = P(short|B₁)P(B₁) + P(short|B₂)P(B₂) = (60/100)(1/2) + (10/30)(1/2)
Example:
Partitions: B1 , B2 , ...Bk and you know the distribution.
Events: A, A, ..., A and you know the P(A) for each Bi
If you know that A happened, what is the probability that it came from a particular B i ?
P(Bᵢ|A) = P(BᵢA)/P(A) = P(A|Bᵢ)P(Bᵢ) / [P(A|B₁)P(B₁) + ... + P(A|Bₖ)P(Bₖ)] : Bayes's formula
Example: Medical detection test, 90% accurate.
Partition - you have the disease (B1 ), you don’t have the disease (B2 )
The accuracy means, in terms of probability: P(positive|B₁) = 0.9, P(positive|B₂) = 0.1.
In the general public, the chance of getting the disease is 1 in 10,000.
In terms of probability: P(B₁) = 0.0001, P(B₂) = 0.9999.
If the result comes up positive, what is the probability that you actually have the disease? P(B 1 |positive)?
P(B₁|positive) = P(positive|B₁)P(B₁) / [P(positive|B₁)P(B₁) + P(positive|B₂)P(B₂)]
= (0.9)(0.0001) / [(0.9)(0.0001) + (0.1)(0.9999)] ≈ 0.0009
The probability is still very small that you actually have the disease.
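A minimal sketch of this Bayes computation (not from the notes; the function and parameter names are illustrative):

```python
def posterior(prior, sens, false_pos):
    """P(disease | positive) via Bayes' formula with a two-set partition."""
    num = sens * prior
    return num / (num + false_pos * (1 - prior))

print(round(posterior(prior=0.0001, sens=0.9, false_pos=0.1), 4))  # about 0.0009
```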
Example: Identify the source of a defective item.
There are 3 machines: M1 , M2 , M3 . P(defective): 0.01, 0.02, 0.03, respectively.
The percent of items made that come from each machine is: 20%, 30%, and 50%, respectively.
Probability that the item comes from a machine: P (M1 ) = 0.2, P (M2 ) = 0.3, P (M3 ) = 0.5
Probability that a machine’s item is defective: P (D|M1 ) = 0.01, P (D|M2 ) = 0.02, P (D|M3 ) = 0.03
Probability that a defective item came from Machine 1:
P(M₁|D) = P(D|M₁)P(M₁) / [P(D|M₁)P(M₁) + P(D|M₂)P(M₂) + P(D|M₃)P(M₃)]
= (0.01)(0.2) / [(0.01)(0.2) + (0.02)(0.3) + (0.03)(0.5)] ≈ 0.087
Example: A gene has 2 alleles: A, a. The gene exhibits itself through a trait with two versions.
The possible phenotypes are “dominant,” with genotypes AA or Aa, and “recessive,” with genotype aa.
Alleles travel independently, derived from a parent’s genotype.
In a population, the probability of having a particular allele: P(A) = 0.5, P(a) = 0.5
Therefore, the probabilities of the genotypes are: P(AA) = 0.25, P(Aa) = 0.5, P(aa) = 0.25
Partitions: genotypes of parents: (AA, AA), (AA, Aa), (AA, aa), (Aa, Aa), (Aa, aa), (aa, aa).
Assume pairs match regardless of genotype.
Parent genotypes   Probability                 P(child has dominant phenotype)
(AA, AA)           (1/4)(1/4) = 1/16           1
(AA, Aa)           2(1/4)(1/2) = 1/4           1
(AA, aa)           2(1/4)(1/4) = 1/8           1
(Aa, Aa)           (1/2)(1/2) = 1/4            3/4
(Aa, aa)           2(1/2)(1/4) = 1/4           1/2
(aa, aa)           (1/4)(1/4) = 1/16           0

If you see that a person has dark hair (the dominant phenotype, event A), predict the genotypes of the parents:
P((AA, AA)|A) = (1/16)(1) / [(1/16)(1) + (1/4)(1) + (1/8)(1) + (1/4)(3/4) + (1/4)(1/2) + (1/16)(0)] = 1/12
You can do the same computation to find the probabilities of each type of couple. Bayes’s formula gives a prediction inside the parents that you aren’t able to directly see. Example: You have 1 machine.
In good condition: defective items only produced 1% of the time. P(in good condition) = 90%
In broken condition: defective items produced 40% of the time. P(broken) = 10%
Sample 6 items, and find that 2 are defective. Is the machine broken?
This is very similar to the medical example worked earlier in lecture:
P(good | 2 of 6 are defective) = P(2 of 6|good)P(good) / [P(2 of 6|good)P(good) + P(2 of 6|broken)P(broken)]
= C(6, 2)(0.01)²(0.99)⁴(0.9) / [C(6, 2)(0.01)²(0.99)⁴(0.9) + C(6, 2)(0.4)²(0.6)⁴(0.1)] ≈ 0.04
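A minimal sketch of this computation (not from the notes; the helper name lik is illustrative):

```python
from math import comb

def lik(p_def, k=2, n=6):
    """Binomial likelihood: P(k of n items defective | defect probability p_def)."""
    return comb(n, k) * p_def**k * (1 - p_def)**(n - k)

num = lik(0.01) * 0.9
post_good = num / (num + lik(0.4) * 0.1)
print(round(post_good, 2))  # about 0.04
```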
** End of Lecture 7
18.05 Lecture 8 February 22, 2005
§3.1 Random Variables and Distributions
A random variable transforms the outcome of an experiment into a number.
Definitions:
Probability Space: (S, A, P)
S - sample space, A - events, P - probability
A random variable is a function on S with values in the real numbers, X : S → R.
Examples:
Toss a coin 10 times, Sample Space = {HTH...HT, ....}, all configurations of H & T.
Random Variable X = number of heads, X: S ↔ R
X: S ↔ {0, 1, ..., 10} for this example.
There are fewer outcomes than in S, you need to give the distribution of the
random variable in order to get the entire picture. Probabilities are therefore given.
Definition: the distribution of a random variable X : S → R is defined by: for A ⊆ R, P(A) = P(X ∈ A) = P({s ∈ S : X(s) ∈ A}).
The random variable maps outcomes and probabilities to real numbers.
This simplifies the problem, as you only need to define the mapped R, P, not the original S, P.
The mapped variables describe X, so you don’t need to consider the original
complicated probability space.
From the example, P(X = #(heads in 10 tosses) = k) = C(10, k)(1/2)ᵏ(1/2)¹⁰⁻ᵏ = C(10, k)/2¹⁰.
Note: we need to count the ways to distribute the k heads among the 10 tosses, since each arrangement of heads and tails has the same probability.
This is a specific example of the more general binomial problem:
a random variable X ∈ {0, 1, ..., n} with
P(X = k) = C(n, k) pᵏ(1 − p)ⁿ⁻ᵏ
has the binomial distribution B(n, p), which is an example of a discrete distribution.
Discrete Distribution
A random variable X is called discrete if it takes a finite or countable number (sequence) of values: X ∈ {s₁, s₂, s₃, ...}.
It is completely described by the probability of each outcome, i.e. by P(X = sₖ) = f(sₖ), the probability function (p.f.).
A p.f. cannot be negative and must sum to 1 over all outcomes. P(X ∈ A) = Σ_{sₖ∈A} f(sₖ)
Example: Uniform distribution of a finite number of values {1, 2, 3, ..., n} each outcome 22
has equal probability ↔ f (sk ) = n1 : uniform probability function. random variable X ⊂ R, P(A) = P(X ⊂ A), A √ R
can redefine probability space on random variable distribution:
(R, A, P) - sample space, X: R ↔ R, X(x) = x (identity map)
P(A) = P(X : X(x) ⊂ A) = P(x ⊂ A) = P(x ⊂ A) = P(A)
all you need is the outcomes mapped to real numbers and relative probabilities
of the mapped outcomes.
Example: Poisson distribution on {0, 1, 2, 3, ...}, denoted Π(λ), where λ is the intensity.
Probability function:
f(k) = P(X = k) = (λᵏ/k!) e^{−λ}, where the parameter λ > 0.
Check that it sums to 1: Σ_{k=0}^∞ (λᵏ/k!) e^{−λ} = e^{−λ} e^{λ} = e⁰ = 1.
A very common distribution; it will be used later in statistics.
It represents a variety of situations, e.g. the distribution of typos on a particular page of a book,
the number of stars in a random spot in the sky, etc.
It is a good approximation for many real-world counting problems in which individual events are rare.
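A quick numerical check that the Poisson p.f. sums to 1, and that its mean equals λ (a standard fact); the choice λ = 3 and the truncation at k < 60 are arbitrary:

```python
from math import exp, factorial

lam = 3.0  # an arbitrary intensity for illustration
pf = [lam**k / factorial(k) * exp(-lam) for k in range(60)]
print(round(sum(pf), 10))                              # essentially 1
print(round(sum(k * p for k, p in enumerate(pf)), 4))  # mean, close to lam
```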
Continuous Distribution Need to consider intervals not points.
Probability density function (p.d.f.): f(x) ≥ 0.
Summation is replaced by an integral: ∫_{−∞}^{∞} f(x)dx = 1,
and P(A) = ∫_A f(x)dx, as shown:
If you were to choose a random point on an interval, the probability of choosing
a particular point is equal to zero.
You can’t assign positive probability to any point, as it would add up infinitely on a continuous interval.
It is necessary to consider P(X is in a particular sub-interval).
The definition implies that P({a}) = ∫_a^a f(x)dx = 0.
Example: the uniform distribution on [a, b], denoted U[a, b], has
p.d.f. f(x) = 1/(b − a) for x ∈ [a, b]; 0 for x ∉ [a, b].
Example: on an interval [a, b] with a < c < d < b,
P([c, d]) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a) (probability of a subinterval).
Example: Exponential Distribution E(λ), with parameter λ > 0:
p.d.f. f(x) = λe^{−λx} if x ≥ 0; 0 if x < 0.
Check that it integrates to 1:
∫_0^∞ λe^{−λx} dx = λ(−(1/λ)e^{−λx})|_0^∞ = 1.
Real world: the exponential distribution describes the life span of quality products (electronics).
** End of Lecture 8
18.05 Lecture 9 February 23, 2005
Discrete random variable: defined by the probability function (p.f.) on {s₁, s₂, ...}, f(sᵢ) = P(X = sᵢ).
Continuous: probability density function (p.d.f.): f(x) ≥ 0, ∫_{−∞}^{∞} f(x)dx = 1, P(X ∈ A) = ∫_A f(x)dx.
Cumulative distribution function (c.d.f):
F (x) = P(X ← x), x ⊂ R
Properties:
1. If x₁ ≤ x₂ then {X ≤ x₁} ⊆ {X ≤ x₂}, so P(X ≤ x₁) ≤ P(X ≤ x₂): F is a non-decreasing function.
2. lim_{x→−∞} F(x) = P(X ≤ −∞) = 0 and lim_{x→∞} F(x) = P(X ≤ ∞) = 1.
A random variable only takes real values; as x → −∞ the set {X ≤ x} becomes empty.
Example: P(X = 0) = 1/2, P(X = 1) = 1/2.
F(x) = 0 for x < 0; F(x) = P(X = 0) = 1/2 for x ∈ [0, 1); F(x) = P(X = 0 or 1) = 1 for x ∈ [1, ∞).
3. F is "right continuous": lim_{y→x⁺} F(y) = F(x).
With F(y) = P(X ≤ y) and a sequence yₙ ↓ x, ⋂_{n=1}^∞ {X ≤ yₙ} = {X ≤ x}, so F(yₙ) → P(X ≤ x) = F(x).
Probability that the random variable falls in an interval: P(x₁ < X ≤ x₂) = P({X ≤ x₂} \ {X ≤ x₁}) = P(X ≤ x₂) − P(X ≤ x₁) = F(x₂) − F(x₁)
since {X ≤ x₁} ⊆ {X ≤ x₂}.
Probability of a point x: P(X = x) = F(x) − F(x⁻), where F(x⁻) = lim_{y→x⁻} F(y) and F(x⁺) = lim_{y→x⁺} F(y).
If F is continuous at x, the probability of the point is 0; if there is a jump, the probability equals the size of the jump.
P(x₁ ≤ X ≤ x₂) = F(x₂) − F(x₁⁻)
P(A) = P(X ∈ A)
X - random variable with distribution P
When observing a c.d.f:
Discrete: sum of probabilities at all the jumps = 1. Graph is horizontal in between the jumps, meaning that probability = 0 in those intervals.
Continuous: F(x) = P(X ≤ x) = ∫_{−∞}^x f(t)dt; eventually the graph approaches 1.
If f is continuous, f(x) = F′(x).
Quantile: for p ∈ [0, 1], the p-quantile is inf{x : F(x) = P(X ≤ x) ≥ p},
the smallest point such that the probability up to that point is at least p.
The area under f(x) up to this point x is equal to p.
If the 0.25 quantile is at x = 0, then P(X ≤ 0) ≥ 0.25.
Note that in the discrete example above, the 0.25 quantile is at x = 0, but so are the 0.3, 0.4, ... quantiles, all the way up to 0.5.
What if you have 2 random variables? Or several?
ex. take a person, measure weight and height. The separate behaviors tell you nothing
about the pairing; you need to describe the joint distribution.
Consider a pair of random variables (X, Y).
Joint distribution of (X, Y): P((X, Y) ∈ A) for events (sets) A ⊆ R².
Discrete distribution: (X, Y) takes values {(s₁₁, s₂₁), (s₁₂, s₂₂), ...};
joint p.f.: f(s₁ᵢ, s₂ᵢ) = P((X, Y) = (s₁ᵢ, s₂ᵢ)) = P(X = s₁ᵢ, Y = s₂ᵢ).
Often visualized as a table assigning a probability to each point:

          y = 0   y = -1   y = -2.5   y = 5
x = 1      0.1      0        0.2        0
x = 1.5    0        0        0          0.1
x = 3      0.2      0        0.4        0
Continuous: f(x, y) ≥ 0 with ∫∫_{R²} f(x, y) dx dy = ∫_{−∞}^{∞}∫_{−∞}^{∞} f(x, y) dx dy = 1.
Joint p.d.f. f(x, y): P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y).
If you want the c.d.f. of x only: F(x) = P(X ≤ x) = P(X ≤ x, Y ≤ +∞) = F(x, ∞) = lim_{y→∞} F(x, y).
Same for y.
To find the probability within a rectangle on the (x, y) plane, use the joint c.d.f.
In the continuous case, F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du. Also, ∂²F/∂x∂y = f(x, y).
** End of Lecture 9
18.05 Lecture 10 February 25, 2005
Review of Distribution Types
Discrete distribution for (X, Y): joint p.f. f(x, y) = P(X = x, Y = y).
Continuous: joint p.d.f. f(x, y) ≥ 0, ∫∫_{R²} f(x, y) dx dy = 1;
joint c.d.f. F(x, y) = P(X ≤ x, Y ≤ y), and F(x) = P(X ≤ x) = lim_{y→∞} F(x, y).
In the continuous case, F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du.
Marginal Distributions
Given the joint distribution of (X, Y), the individual distributions of X and Y
are marginal distributions.
Discrete (X, Y): marginal probability function
f₁(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y f(x, y).
In the table from the previous lecture of probabilities for each point (x, y):
add up all the values in the row x = 1 to determine P(X = 1).
Continuous (X, Y): with joint p.d.f. f(x, y), the p.d.f. of X is f₁(x) = ∫_{−∞}^{∞} f(x, y) dy, since
F(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = ∫_{−∞}^x ∫_{−∞}^{∞} f(u, y) dy du and f₁(x) = ∂F/∂x.
Why not integrate over the line {X = x}? Because
P({X = x}) = ∫_{−∞}^{∞} (∫_{{x}} f(x, y) dx) dy = 0:
the probability that a continuous random variable equals a specific point is 0.
Example: joint p.d.f. f(x, y) = (21/4)x²y for x² ≤ y ≤ 1; 0 otherwise.
What is the marginal distribution of X?
p.d.f. f₁(x) = ∫_{x²}^1 (21/4)x²y dy = (21/4)x² × (1/2)y²|_{x²}^1 = (21/8)x²(1 − x⁴), for −1 ≤ x ≤ 1.
Discrete values for X, Y in tabular form:

          y = 1   y = 2   (marginal of X)
x = 1      0.5     0         0.5
x = 2      0       0.5       0.5
(marg.)    0.5     0.5

Note: if all four entries were 0.25, the two variables would have the same marginal distributions (but a different joint distribution).
Independent X and Y:
Definition: X, Y are independent if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) = F₁(x)F₂(y) (an intersection of independent events),
so the joint c.d.f. factors for independent random variables.
Implication: if (X, Y) is continuous with joint p.d.f. f(x, y) and marginals f₁(x), f₂(y), then
F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du = F₁(x)F₂(y) = (∫_{−∞}^x f₁(u) du)(∫_{−∞}^y f₂(v) dv).
Taking ∂²/∂x∂y of both sides: f(x, y) = f₁(x)f₂(y). X and Y are independent iff the joint density is a product.
Much simpler in the discrete case:
Discrete (X, Y): f (x, y) = P(X = x, Y = y) = P(X = x)P(Y = y) = f1 (x)f2 (y) by definition.
Example: joint p.d.f.
f(x, y) = kx²y², for x² + y² ≤ 1; 0 otherwise.
X and Y are not independent:
f(x, y) ≠ f₁(x)f₂(y) because of the circular support condition.
For a small square outside the circle, P((X, Y) ∈ square) = 0 ≠ P(X ∈ side) × P(Y ∈ side).
Example: f(x, y) = kx²y², 0 ≤ x ≤ 1, 0 ≤ y ≤ 1; 0 otherwise. This can be written as a product, so X and Y are independent:
f(x, y) = kx²y² I(0 ≤ x ≤ 1, 0 ≤ y ≤ 1) = k₁x² I(0 ≤ x ≤ 1) × k₂y² I(0 ≤ y ≤ 1).
The conditions on x and y can be separated.
Note: Indicator Notation
I(x ∈ A) = 1 if x ∈ A; 0 if x ∉ A.
For the discrete case, given a table of values pᵢⱼ = P(X = aᵢ, Y = bⱼ) with rows a₁, ..., aₙ and columns b₁, ..., bₘ,
along with the row sums pᵢ₊ = P(X = aᵢ) = Σ_{j=1}^m pᵢⱼ and column sums p₊ⱼ = P(Y = bⱼ) = Σ_{i=1}^n pᵢⱼ,
you can tell independence: X and Y are independent if and only if
pᵢⱼ = pᵢ₊ × p₊ⱼ for every i, j, i.e. at all points in the table.
** End of Lecture 10
18.05 Lecture 11 February 28, 2005
A pair (X, Y) of random variables:
f(x, y): joint p.f. (discrete) or joint p.d.f. (continuous).
Marginal distributions: f(x) = Σ_y f(x, y), the p.f. of X (discrete);
f(x) = ∫ f(x, y) dy, the p.d.f. of X (continuous).
Conditional Distributions
Discrete case:
f(x|y) = P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y) = f(x, y)/f(y), the conditional p.f. of X given Y = y (defined when f(y) > 0);
f(y|x) = f(x, y)/f(x), the conditional p.f. of Y given X = x (defined when f(x) > 0).
If the marginal probability is zero, the conditional probability is undefined.
Continuous case:
The formulas are the same, but we can't treat them as exact probabilities at fixed points;
consider them instead in terms of probability density.
Conditional c.d.f. of X given Y = y: with joint p.d.f. f(x, y) and P(A) = ∫∫_A f(x, y) dx dy,
P(X ≤ x | Y ∈ [y − δ, y + δ]) = P(X ≤ x, Y ∈ [y − δ, y + δ]) / P(Y ∈ [y − δ, y + δ])
= [ (1/2δ) ∫_{y−δ}^{y+δ} ∫_{−∞}^x f(u, v) du dv ] / [ (1/2δ) ∫_{y−δ}^{y+δ} ∫_{−∞}^{∞} f(u, v) du dv ].
As δ → 0, this converges to the conditional c.d.f.:
P(X ≤ x | Y = y) = ∫_{−∞}^x f(u, y) du / ∫_{−∞}^{∞} f(u, y) du = ∫_{−∞}^x f(u, y) du / f(y).
Conditional p.d.f.: f(x|y) = (∂/∂x) P(X ≤ x | Y = y) = f(x, y)/f(y),
the same result as in the discrete case.
Again, f(x|y) is only defined when f(y) > 0.
Multiplication rule: f(x, y) = f(x|y)f(y).
Bayes's theorem for random variables:
f(y|x) = f(x, y)/f(x) = f(x|y)f(y) / ∫ f(x, y) dy = f(x|y)f(y) / ∫ f(x|y)f(y) dy.
For each y, you know the conditional distribution of X. Note: in the discrete case, replace ∫ by Σ.
In statistics, after observing data, we figure out the parameter using Bayes's formula.
Example: draw X uniformly on [0, 1], then draw Y uniformly on [X, 1].
p.d.f.: f(x) = 1 × I(0 ≤ x ≤ 1), f(y|x) = 1/(1 − x) × I(x ≤ y ≤ 1).
Joint p.d.f.: f(x, y) = f(y|x)f(x) = 1/(1 − x) × I(0 ≤ x ≤ y ≤ 1).
Marginal: f(y) = ∫ f(x, y) dx = ∫_0^y 1/(1 − x) dx = −ln(1 − x)|_0^y = −ln(1 − y).
Keep in mind the support everywhere: y ∈ [0, 1] here, and f(y) = 0 if y ∉ [0, 1].
Conditional p.d.f. of X given Y = y:
f(x|y) = f(x, y)/f(y) = −1 / [(1 − x) ln(1 − y)] × I(0 ≤ x ≤ y ≤ 1).
Multivariate Distributions
Consider n random variables X₁, X₂, ..., Xₙ.
Joint p.f.: f(x₁, x₂, ..., xₙ) = P(X₁ = x₁, ..., Xₙ = xₙ) ≥ 0, with Σ f = 1.
Joint p.d.f.: f(x₁, x₂, ..., xₙ) ≥ 0, with ∫ f dx₁ dx₂ ... dxₙ = 1.
Marginals and conditionals are defined in the same way. Vector notation simplifies things:
X = (X₁, ..., Xₙ), x = (x₁, ..., xₙ); split X = (Y, Z) into subsets of coordinates,
Y = (X₁, ..., Xₖ), y = (y₁, ..., yₖ), and Z = (X_{k+1}, ..., Xₙ), z = (z₁, ..., z_{n−k}).
The joint p.d.f. or joint p.f. of X is f(x) = f(y, z).
Marginals: f(y) = ∫ f(y, z) dz and f(z) = ∫ f(y, z) dy.
Conditionals: f(y|z) = f(y, z)/f(z), and f(z|y) = f(y|z)f(z) / ∫ f(y|z)f(z) dz.
Functions of Random Variables
Consider a random variable X and a function r : R → R, Y = r(X); you want to calculate the distribution of Y.
Discrete case: the p.f. of Y is
f(y) = P(Y = y) = P(r(X) = y) = P({x : r(x) = y}) = Σ_{x:r(x)=y} P(X = x) = Σ_{x:r(x)=y} f(x)
(very similar to a "change of variable").
Continuous case:
Find the c.d.f. of Y = r(X) first:
P(Y ≤ y) = P(r(X) ≤ y) = P({x : r(x) ≤ y}) = P(A(y)) = ∫_{A(y)} f(x) dx,
then the p.d.f. is f(y) = (∂/∂y) ∫_{A(y)} f(x) dx.
** End of Lecture 11
18.05 Lecture 12
March 2, 2005
Functions of Random Variables X - random variable, continuous with p.d.f. f(x)
Y = r(X)
Y doesn’t have to be continuous, if it is, find the p.d.f.
To find the p.d.f., first find the c.d.f.
P(Y ≤ y) = P(r(X) ≤ y) = P({x : r(x) ≤ y}) = ∫_{x:r(x)≤y} f(x) dx.
Then differentiate the c.d.f. to find the p.d.f.: f(y) = ∂P(Y ≤ y)/∂y.
Example:
Take a random variable X, uniform on [−1, 1]; let Y = X² and find the distribution of Y.
p.d.f. f(x) = 1/2 for −1 ≤ x ≤ 1; 0 otherwise.
Y = X², so P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} f(x) dx.
Take the derivative (easier than integrating first):
(∂/∂y) P(Y ≤ y) = f(√y) × 1/(2√y) + f(−√y) × 1/(2√y) = (1/(2√y))(f(√y) + f(−√y)),
so f(y) = 1/(2√y) for 0 ≤ y ≤ 1; 0 otherwise.
Suppose r is monotonic (a strictly one-to-one function).
If Y = r(X), we can always find the inverse: x = r⁻¹(y) = s(y), the inverse of r.
P(Y ≤ y) = P(r(X) ≤ y)
= P(X ≤ s(y)) if r is increasing (1)
= P(X ≥ s(y)) if r is decreasing (2)
(1) = F(s(y)), where F is the c.d.f. of X, so ∂P(Y ≤ y)/∂y = ∂F(s(y))/∂y = f(s(y))s′(y);
(2) = 1 − P(X < s(y)) = 1 − F(s(y)), so ∂P(Y ≤ y)/∂y = −f(s(y))s′(y).
If r is increasing, then s = r⁻¹ is increasing, so s′(y) ≥ 0 and s′(y) = |s′(y)|.
If r is decreasing, then s = r⁻¹ is decreasing, so s′(y) ≤ 0 and −s′(y) = |s′(y)|.
Answer: the p.d.f. of Y is f(y) = f(s(y))|s′(y)|.
Example: f(x) = 3(1 − x)², 0 ≤ x ≤ 1; 0 otherwise. Y = 10e^{5X}.
Y = 10e^{5X} ⇒ X = s(Y) = (1/5)ln(Y/10), with s′(y) = 1/(5y), so
f(y) = 3(1 − (1/5)ln(y/10))² × |1/(5y)| for 10 ≤ y ≤ 10e⁵; 0 otherwise.
Suppose X has continuous c.d.f. F(x) = P(X ≤ x) and let Y = F(X), so 0 ≤ Y ≤ 1. What is the distribution of Y?
c.d.f.: P(Y ≤ y) = P(F(X) ≤ y) = P(X ≤ F⁻¹(y)) = F(F⁻¹(y)) = y for 0 ≤ y ≤ 1;
p.d.f.: f(y) = 1 for 0 ≤ y ≤ 1; 0 otherwise. So Y is uniform on the interval [0, 1].
Conversely, let X be uniform on [0, 1] and let F be the c.d.f. of Y.
Then Y = F⁻¹(X): P(Y ≤ y) = P(F⁻¹(X) ≤ y) = P(X ≤ F(y)) = F(y), so the random variable Y = F⁻¹(X) has c.d.f. F(y).
Suppose that (X, Y) has joint p.d.f. f(x, y), and let Z = X + Y.
P(Z ≤ z) = P(X + Y ≤ z) = ∫∫_{x+y≤z} f(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{z−x} f(x, y) dy dx,
so the p.d.f. is f(z) = ∂P(Z ≤ z)/∂z = ∫_{−∞}^{∞} f(x, z − x) dx.
If X and Y are independent with p.d.f.s f₁(x) and f₂(y), then the joint p.d.f. is
f(x, y) = f₁(x)f₂(y), and f(z) = ∫_{−∞}^{∞} f₁(x)f₂(z − x) dx.
Example: X, Y independent, each with p.d.f. f(x) = λe^{−λx} for x ≥ 0; 0 otherwise. Let Z = X + Y:
f(z) = ∫_0^z λe^{−λx} λe^{−λ(z−x)} dx
(the limits come from x ≥ 0 and z − x ≥ 0, i.e. 0 ≤ x ≤ z)
= λ²e^{−λz} ∫_0^z dx = λ²z e^{−λz}.
This distribution describes the lifespan of a high quality product.
It should work “like new” after a point, given it doesn’t break early on.
Distribution of X itself:
For X itself: P(X ≥ x) = ∫_x^∞ λe^{−λt} dt = λe^{−λt}(−1/λ)|_x^∞ = e^{−λx}.
Conditional probability (memorylessness):
P(X ≥ x + t | X ≥ x) = P(X ≥ x + t, X ≥ x)/P(X ≥ x) = P(X ≥ x + t)/P(X ≥ x) = e^{−λ(x+t)}/e^{−λx} = e^{−λt} = P(X ≥ t).
** End of Lecture 12
18.05 Lecture 13, March 4, 2005
Functions of random variables.
If (X, Y) has joint p.d.f. f(x, y), consider Z = X + Y.
p.d.f. of Z: f(z) = ∫_{−∞}^{∞} f(x, z − x) dx.
If X and Y are independent: f(z) = ∫_{−∞}^{∞} f₁(x)f₂(z − x) dx.
Example:
X, Y independent, uniform on [0, 1], X, Y ≈ U [0, 1], Z = X + Y
p.d.f. of X, Y:
f1 (x) = {1, 0 ← x ← 1; 0 otherwise} = I(0 ← x ← 1),
f₂(y) = I(0 ≤ y ≤ 1), so f₂(z − x) = I(0 ≤ z − x ≤ 1), and
f(z) = ∫_{−∞}^{∞} I(0 ≤ x ≤ 1) × I(0 ≤ z − x ≤ 1) dx.
Limits: 0 ← x ← 1; z − 1 ← x ← z
Both must be true, consider all the cases for values of z:
Case 1: z ≤ 0 ⇒ f(z) = 0.
Case 2: 0 ≤ z ≤ 1 ⇒ f(z) = ∫_0^z 1 dx = z.
Case 3: 1 ≤ z ≤ 2 ⇒ f(z) = ∫_{z−1}^1 1 dx = 2 − z.
Case 4: z ≥ 2 ⇒ f(z) = 0.
The sum is most likely to be near 1, the peak of the triangular f(z) graph.
Example: Multiplication of Random Variables
X ≥ 0, Y ≥ 0, Z = XY (Z is positive). First, look at the c.d.f.:
P(Z ≤ z) = P(XY ≤ z) = ∫∫_{xy≤z} f(x, y) dx dy = ∫_0^∞ ∫_0^{z/x} f(x, y) dy dx,
so the p.d.f. of Z is
f(z) = ∂P(Z ≤ z)/∂z = ∫_0^∞ f(x, z/x) (1/x) dx.
Example: Ratio of Random Variables
Z = X/Y (all positive): P(Z ≤ z) = P(X ≤ zY) = ∫∫_{x≤zy} f(x, y) dx dy = ∫_0^∞ ∫_0^{zy} f(x, y) dx dy,
so the p.d.f. is f(z) = ∫_0^∞ f(zy, y) y dy.
In general, look at the c.d.f. and express it in terms of x and y.
Example: X1 , X2 , ..., Xn - independent with same distribution (same c.d.f.)
f(x) = F′(x), the p.d.f. of Xᵢ.
P(Xi ← x) = F (x)
Y = maximum among X1 , X2 ...Xn
P(Y ← y) = P(max(X1 , ..., Xn ) ← y) = P(X1 ← y, X2 ← y...Xn ← y)
Now, use definition of independence to factor:
= P(X₁ ≤ y)P(X₂ ≤ y)...P(Xₙ ≤ y) = F(y)ⁿ.
p.d.f. of Y: f̂(y) = (∂/∂y) F(y)ⁿ = nF(y)ⁿ⁻¹F′(y) = nF(y)ⁿ⁻¹f(y).
Now let Y = min(X₁, ..., Xₙ): P(Y ≤ y) = P(min(X₁, ..., Xₙ) ≤ y).
Instead of intersection, use union. But, ask if greater than y:
= 1 − P(min(X1 , ..., Xn ) > y)
= 1 − P(X1 > y, ..., Xn > y)
= 1 − P(X1 > y)P(X2 > y)...P(Xn > y)
= 1 − P(X1 > y)n
= 1 − (1 − P(X1 ← y))n
= 1 − (1 − F (y))n
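Both formulas are easy to check by simulation for a concrete case. A minimal sketch, assuming Xᵢ ~ U[0, 1] so that F(y) = y; n, the threshold y, and the number of trials are arbitrary choices (not from the notes):

```python
import random

n, trials = 5, 200_000
y = 0.7  # an arbitrary threshold in (0, 1)

count_max = count_min = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]   # X_i ~ U[0, 1]
    count_max += max(xs) <= y
    count_min += min(xs) <= y

print(round(count_max / trials, 3), round(y**n, 3))            # P(max <= y) = F(y)^n
print(round(count_min / trials, 3), round(1 - (1 - y)**n, 3))  # P(min <= y) = 1 - (1 - F(y))^n
```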
Now consider random vectors: X = (X₁, ..., Xₙ) and Y = (Y₁, ..., Yₙ) = r(X), i.e.
Y₁ = r₁(X₁, ..., Xₙ)
Y₂ = r₂(X₁, ..., Xₙ)
...
Yₙ = rₙ(X₁, ..., Xₙ)
Suppose the map r has an inverse, X = r⁻¹(Y) = s(Y).
P(Y ∈ A) = ∫_A g(y) dy, where g(y) is the joint p.d.f. of Y, and
P(Y ∈ A) = P(r(X) ∈ A) = P(X ∈ s(A)) = ∫_{s(A)} f(x) dx = ∫_A f(s(y)) |J| dy
(change of variable x = s(y)), where J is the Jacobian:
J = det [ ∂sᵢ/∂yⱼ ], the n × n matrix with entries ∂s₁/∂y₁, ..., ∂s₁/∂yₙ in the first row through ∂sₙ/∂y₁, ..., ∂sₙ/∂yₙ in the last.
The p.d.f. of Y is f(s(y))|J|.
Example:
(X₁, X₂) with joint p.d.f. f(x₁, x₂) = 4x₁x₂ for 0 ≤ x₁ ≤ 1, 0 ≤ x₂ ≤ 1; 0 otherwise.
Y₁ = X₁/X₂, Y₂ = X₁X₂.
Y₁ = r₁(X₁, X₂), Y₂ = r₂(X₁, X₂); the inverse map is X₁ = √(Y₁Y₂) = s₁(Y₁, Y₂), X₂ = √(Y₂/Y₁) = s₂(Y₁, Y₂).
But keep in mind the intervals where the density is non-zero.
J = det [ ∂s₁/∂y₁  ∂s₁/∂y₂ ; ∂s₂/∂y₁  ∂s₂/∂y₂ ] = det [ √y₂/(2√y₁)  √y₁/(2√y₂) ; −√y₂/(2y₁^{3/2})  1/(2√(y₁y₂)) ] = 1/(4y₁) + 1/(4y₁) = 1/(2y₁).
Joint p.d.f. of (Y₁, Y₂):
g(y₁, y₂) = 4√(y₁y₂) √(y₂/y₁) |J| = 2y₂/|y₁|, if 0 ≤ y₁y₂ ≤ 1 and 0 ≤ y₂/y₁ ≤ 1; 0 otherwise.
The conditions imply that y₁ and y₂ are positive, so the absolute value is unnecessary.
This is the joint p.d.f. of (Y₁, Y₂).
** Last Lecture of Coverage on Exam 1 **
** End of Lecture 13
18.05 Lecture 14, March 7, 2005
Linear transformations of random vectors: Y = r(X), i.e. (y₁, ..., yₙ)ᵀ = A(x₁, ..., xₙ)ᵀ,
where A is an n × n matrix. If det A ≠ 0, then X = A⁻¹Y, and with B = A⁻¹ we have
xᵢ = bᵢ₁y₁ + ... + bᵢₙyₙ, where the bᵢⱼ are the partial derivatives of sᵢ with respect to yⱼ.
J = Jacobian = det B = det A⁻¹ = 1/det A.
p.d.f. of Y: g(y) = (1/|det A|) f(A⁻¹y).
Example: X = (X₁, X₂) with p.d.f. f(x₁, x₂) = cx₁x₂ for 0 ≤ x₁ ≤ 1, 0 ≤ x₂ ≤ 1; 0 otherwise.
To make the integral equal 1, c = 4. Let Y₁ = X₁ + 2X₂, Y₂ = 2X₁ + X₂, so
A = [1 2; 2 1], det(A) = −3.
Calculate the inverse functions:
X₁ = −(1/3)(Y₁ − 2Y₂), X₂ = −(1/3)(Y₂ − 2Y₁).
New joint density:
g(y₁, y₂) = (1/3) × 4(−(1/3)(y₁ − 2y₂))(−(1/3)(y₂ − 2y₁))
for 0 ≤ −(1/3)(y₁ − 2y₂) ≤ 1 and 0 ≤ −(1/3)(y₂ − 2y₁) ≤ 1; 0 otherwise.
Simplified: g(y₁, y₂) = (4/27)(y₁ − 2y₂)(y₂ − 2y₁) for −3 ≤ y₁ − 2y₂ ≤ 0, −3 ≤ y₂ − 2y₁ ≤ 0; 0 otherwise.
Linear transformation distorts the graph from a square to a parallelogram. Note: From Lecture 13, when min() and max() functions were introduced, such functions
describe engines in series (min) and parallel (max).
When in series, the length of time a device will function is equal to the minimum life
in all the engines (weakest link).
When in parallel, this is avoided as a device can function as long as one engine functions.
Review of Problems from PSet 4 for the exam: (see solutions for more details) Problem 1 - f (x) = {ce−2x for x ∼ 0; 0 otherwise}
Find c by integrating over the range and setting equal to 1:
1 = ∫_0^∞ ce^{−2x} dx = −(c/2)e^{−2x}|_0^∞ = c/2, so c = 2.
P(1 ≤ X ≤ 2) = ∫_1^2 2e^{−2x} dx = e^{−2} − e^{−4}.
Problem 3 - X ≈ U [0, 5], Y = 0 if X ← 1; Y = X if 1 ← X ← 3; Y = 5 if 3 < X ← 5 Draw the c.d.f. of Y, showing P(Y ← y)
This describes the graph of Y vs. X, not the c.d.f.; write each case in terms of X, i.e. as P(X ∈ ...).
Cumulative Distribution Function
Cases:
y < 0 ⇒ P(Y ≤ y) = P(∅) = 0
0 ≤ y ≤ 1 ⇒ P(Y ≤ y) = P(0 ≤ X ≤ 1) = 1/5
1 < y ≤ 3 ⇒ P(Y ≤ y) = P(X ≤ y) = y/5
3 < y < 5 ⇒ P(Y ≤ y) = P(X ≤ 3) = 3/5
y ≥ 5 ⇒ P(Y ≤ y) = 1
These values, as y runs over the real line, give the c.d.f.
Problem 8 - 0 ≤ x ≤ 3, 0 ≤ y ≤ 4, c.d.f. F(x, y) = (1/156) xy(x² + y)
P(1 ≤ X ≤ 2, 1 ≤ Y ≤ 2) = F(2, 2) − F(2, 1) − F(1, 2) + F(1, 1)
Rectangle probability algorithm. or, you can find the p.d.f. and integrate (more complicated): c.d.f. of Y: P(Y ← y) = P(X ← →, Y ← y) = P(X ← 3, Y ← y)
(based on the domain of the joint c.d.f.)
P(Y ≤ y) = (1/156) · 3y(9 + y) for 0 ≤ y ≤ 4.
Must also mention: P(Y ≤ y) = 0 for y ≤ 0 and P(Y ≤ y) = 1 for y ≥ 4.
Find the joint p.d.f. of X and Y:
f(x, y) = ∂²F(x, y)/∂x∂y = (1/156)(3x² + 2y) for 0 ≤ x ≤ 3, 0 ≤ y ≤ 4; 0 otherwise.
P(Y ≤ X) = ∫∫_{y≤x} f(x, y) dx dy = ∫_0^3 ∫_0^x (1/156)(3x² + 2y) dy dx = 93/208
** End of Lecture 14
18.05 Lecture 15 March 9, 2005
Review for Exam 1 Practice Test 1: 1. In the set of all green envelopes, only 1 card can be green.
Similarly, in the set of red envelopes, only 1 card can be red.
Sample Space = 10! ways to put cards into envelopes, treating each separately.
You can’t have two of the same color matching, as that would be 4 total.
Degrees of Freedom = which envelope to choose (5 × 5) and which card to select (5 × 5)
Then, arrange the red in green envelopes (4!), and the green in red envelopes (4!)
P = 5⁴ (4!)² / 10!
2. Bayes formula:
P(fair|HHH) = P(HHH|fair)P(fair) / [P(HHH|fair)P(fair) + P(HHH|unfair)P(unfair)] = (0.5³ × 0.5) / (0.5³ × 0.5 + 1 × 0.5)
3. f₁(x) = 2x I(0 < x < 1), f₂(x) = 3x² I(0 < x < 1); Y = 1, 2 with P(Y = 1) = 0.5, P(Y = 2) = 0.5.
f(x, y) = 0.5 × I(y = 1) × 2x I(0 < x < 1) + 0.5 × I(y = 2) × 3x² I(0 < x < 1)
f(x) = 0.5 × 2x I(0 < x < 1) + 0.5 × 3x² I(0 < x < 1) = (x + 1.5x²) I(0 < x < 1)
P(Y = 1 | X = 1/4) = f₁(1/4) × (1/2) / [f₁(1/4) × (1/2) + f₂(1/4) × (1/2)] = (2 × 1/4 × 1/2) / (2 × 1/4 × 1/2 + 3 × 1/16 × 1/2)
4. f(z) = 2e^{−2z} I(z > 0), T = 1/Z; we know t > 0.
P(T ≤ t) = P(1/Z ≤ t) = P(Z ≥ 1/t) = ∫_{1/t}^∞ 2e^{−2z} dz,
p.d.f. f(t) = ∂P(T ≤ t)/∂t = −2e^{−2/t} × (−1/t²) = (2/t²) e^{−2/t} for t > 0 (0 otherwise).
Equivalently, T = r(Z), Z = s(T) = 1/T, so g(t) = |s′(t)| f(1/t) by change of variable.
5. f (x) = e−x I(x > 0)
Joint p.d.f. f(x, y) = e^{−x} I(x > 0) e^{−y} I(y > 0) = e^{−(x+y)} I(x > 0, y > 0).
U = X/(X + Y), V = X + Y.
Step 1 - check the ranges of the new random variables: 0 < V < ∞, 0 < U < 1.
Step 2 - invert for the change of variables: X = UV, Y = V − UV = V(1 − U).
Jacobian:
J = det [ ∂X/∂U  ∂X/∂V ; ∂Y/∂U  ∂Y/∂V ] = det [ V  U ; −V  1 − U ] = V(1 − U) + UV = V.
g(u, v) = f(uv, v(1 − u)) × |v| I(uv > 0, v(1 − u) > 0) = v e^{−v} I(v > 0, 0 < u < 1).
Problem Set #5 (practice pset, see solutions for details): p. 175 #4
f(x₁, x₂) = (x₁ + x₂) I(0 < x₁ < 1, 0 < x₂ < 1)
Y = X₁X₂ (0 < Y < 1)
First look at the c.d.f.: P(Y ≤ y) = P(X₁X₂ ≤ y) = ∫∫_{{x₁x₂≤y}} f(x₁, x₂) dx₁ dx₂.
Due to the complexity of the limits, you can integrate the region in pieces, or find the complement, which is easier since it has only one set of limits:
P(Y ≤ y) = 1 − ∫∫_{{x₁x₂>y}} f(x₁, x₂) dx₁ dx₂ = 1 − ∫_y^1 ∫_{y/x₁}^1 (x₁ + x₂) dx₂ dx₁ = 1 − (1 − y)² = 2y − y².
So the c.d.f. of Y is 0 for y < 0; 2y − y² for 0 < y < 1; 1 for y > 1, and the p.d.f. is
g(y) = ∂P(Y ≤ y)/∂y = 2(1 − y) for y ∈ (0, 1); 0 otherwise.
p. 164 #3: f(x) = x/2 for 0 ≤ x ≤ 2; 0 otherwise.
Y = X(2 − X); find the p.d.f. of Y.
First find the range of Y, and notice that the map is not one-to-one:
Y varies from 0 to 1 as X varies from 0 to 2. Look at the c.d.f.:
P(Y ≤ y) = P(X(2 − X) ≤ y) = P(X² − 2X + 1 ≥ 1 − y) = P((1 − X)² ≥ 1 − y)
= P(|1 − X| ≥ √(1 − y)) = P(1 − X ≥ √(1 − y) or 1 − X ≤ −√(1 − y))
= P(X ≤ 1 − √(1 − y) or X ≥ 1 + √(1 − y))
= P(0 ≤ X ≤ 1 − √(1 − y)) + P(1 + √(1 − y) ≤ X ≤ 2)
= ∫_0^{1−√(1−y)} (x/2) dx + ∫_{1+√(1−y)}^2 (x/2) dx = 1 − √(1 − y) for 0 ≤ y ≤ 1 (0 for y < 0, 1 for y > 1).
Take the derivative to get the p.d.f.:
g(y) = 1/(2√(1 − y)) for 0 ≤ y ≤ 1; 0 otherwise.
** End of Lecture 15
18.05 Lecture 16 March 14, 2005
Expectation of a random variable. X - random variable
roll a die - average value = 3.5
flip a coin - average value = 0.5 if heads = 0 and tails = 1
Definition: if X is discrete with p.f. f(x), then the expectation of X is EX = Σ_x x f(x).
For a die:

x      1    2    3    4    5    6
f(x)  1/6  1/6  1/6  1/6  1/6  1/6

E = 1 × (1/6) + ... + 6 × (1/6) = 3.5
Another way to think about it:
consider each pᵢ as a weight on a horizontal bar. The expectation is the center of gravity of the bar.
If X is continuous with p.d.f. f(x), then E(X) = ∫ x f(x) dx.
Example: X uniform on [0, 1], E(X) = ∫_0^1 x × 1 dx = 1/2.
Consider Y = r(X); then EY = Σ_x r(x)f(x) or ∫ r(x)f(x) dx.
Proof (discrete case), with g(y) = Σ_{x:r(x)=y} f(x) the p.f. of Y:
E(Y) = Σ_y y g(y) = Σ_y y Σ_{x:r(x)=y} f(x) = Σ_y Σ_{x:r(x)=y} r(x)f(x) = Σ_x r(x)f(x),
where we can drop the sum over y since there is no longer any reference to y.
Example: X uniform on [0, 1], EX² = ∫_0^1 x² × 1 dx = 1/3.
For X₁, ..., Xₙ random variables with joint p.f. or p.d.f. f(x₁, ..., xₙ):
E(r(X₁, ..., Xₙ)) = ∫ r(x₁, ..., xₙ) f(x₁, ..., xₙ) dx₁ ... dxₙ.
Example: Cauchy distribution, with p.d.f.:
f(x) = 1/(π(1 + x²)).
Check that it integrates to 1:
∫_{−∞}^{∞} 1/(π(1 + x²)) dx = (1/π) tan⁻¹(x)|_{−∞}^{∞} = 1.
But the expectation is undefined:
E|X| = ∫_{−∞}^{∞} |x|/(π(1 + x²)) dx = 2∫_0^∞ x/(π(1 + x²)) dx = (1/π) ln(1 + x²)|_0^∞ = ∞.
Note: the expectation of X is defined if E|X| < ∞.
Properties of Expectation:
1) E(aX + b) = aE(X) + b.
Proof: E(aX + b) = ∫(ax + b)f(x)dx = a∫xf(x)dx + b∫f(x)dx = aE(X) + b.
2) E(X₁ + X₂ + ... + Xₙ) = EX₁ + EX₂ + ... + EXₙ.
Proof (for two variables):
E(X₁ + X₂) = ∫∫(x₁ + x₂)f(x₁, x₂)dx₁dx₂ = ∫∫x₁f(x₁, x₂)dx₁dx₂ + ∫∫x₂f(x₁, x₂)dx₁dx₂
= ∫x₁f₁(x₁)dx₁ + ∫x₂f₂(x₂)dx₂ = EX₁ + EX₂.
Example: toss a coin n times; let Xᵢ = 1 if toss i is "T" and Xᵢ = 0 if toss i is "H".
Number of tails = X₁ + X₂ + ... + Xₙ, so
E(number of tails) = E(X₁ + X₂ + ... + Xₙ) = EX₁ + EX₂ + ... + EXₙ.
EXᵢ = 1 × P(Xᵢ = 1) + 0 × P(Xᵢ = 0) = p, the probability of tails, so the expectation is p + p + ... + p = np.
This is natural: you expect about np tails out of n tosses when each has probability p, as the simulation below also suggests.
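A minimal simulation check of E(number of tails) = np (the parameters are arbitrary choices, not from the notes):

```python
import random

n, p, trials = 10, 0.3, 100_000
# average number of "tails" (successes) in n tosses, over many repetitions
avg = sum(sum(random.random() < p for _ in range(n)) for _ in range(trials)) / trials
print(round(avg, 3), "vs", n * p)  # the simulated mean is close to np
```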
Alternatively, with Y = number of tails, P(Y = k) = C(n, k)pᵏ(1 − p)ⁿ⁻ᵏ, and
E(Y) = Σ_{k=0}^n k C(n, k) pᵏ(1 − p)ⁿ⁻ᵏ = np.
This is more difficult to see through the definition; it is easier to use the sum-of-expectations method.
If h and g are two functions such that h(x) ≤ g(x) for all x ∈ R, then E(h(X)) ≤ E(g(X)):
E(g(X) − h(X)) = ∫(g(x) − h(x)) f(x) dx ≥ 0, since f(x) ≥ 0 and g(x) − h(x) ≥ 0.
If a ≤ X ≤ b, then a ≤ E(X) ≤ b.
For a set A ⊆ R, the indicator Y = I(X ∈ A) equals 1 with probability P(X ∈ A) and 0 with probability P(X ∉ A), so
E(I(X ∈ A)) = 1 × P(X ∈ A) + 0 × P(X ∉ A) = P(X ∈ A).
In this case, think of the expectation as indicating whether the event happens.
Chebyshev's Inequality
Suppose that X ≥ 0 and t > 0; then P(X ≥ t) ≤ E(X)/t.
Proof: E(X) = E(X I(X < t)) + E(X I(X ≥ t)) ≥ E(X I(X ≥ t)) ≥ E(t I(X ≥ t)) = tP(X ≥ t).
** End of Lecture 16
18.05 Lecture 17 March 16, 2005
Properties of Expectation. Law of Large Numbers.
E(X₁ + ... + Xₙ) = EX₁ + ... + EXₙ
Matching problem (n envelopes, n letters): what is the expected number of letters in their correct envelopes?
Let Y be the number of matches and Xᵢ = 1 if letter i matches, 0 otherwise, so Y = X₁ + ... + Xₙ.
E(Y) = EX₁ + ... + EXₙ, and EXᵢ = 1 × P(Xᵢ = 1) + 0 × P(Xᵢ = 0) = P(Xᵢ = 1) = 1/n.
Therefore the expected number of matches is E(Y) = n × (1/n) = 1, as the simulation below also suggests.
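A short simulation sketch of the matching problem (not in the original notes; n and the trial count are arbitrary). It also checks the 1 − 1/e ≈ 0.632 probability of at least one match derived in Lecture 4:

```python
import random

n, trials = 10, 100_000
matches = at_least_one = 0
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    m = sum(perm[i] == i for i in range(n))  # number of fixed points = matches
    matches += m
    at_least_one += m > 0

print(round(matches / trials, 3))       # close to 1 = E(Y)
print(round(at_least_one / trials, 3))  # close to 1 - 1/e ≈ 0.632
```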
t p.f. or � �p.d.f.: f (x1 , x2 ) = f1 (x1 )f�2 (x � 2 )
x1 x�2 f (x1 , x2 )dx1 dx2 = x1 x2 f1 (x1 )f2 (x2 )dx1 dx2 =
EX�1 X2 = = f1 (x1 )x1 dx1 f2 (x2 )x2 dx2 = EX1 × EX2
E(Y ) = n ×
X₁, X₂, X₃ independent, uniform on [0, 1]. Find EX₁²(X₂ − X₃)².
= EX₁² E(X₂ − X₃)² by independence
= EX₁² E(X₂² − 2X₂X₃ + X₃²) = EX₁²(EX₂² + EX₃² − 2EX₂X₃)
= EX₁²(EX₂² + EX₃² − 2EX₂EX₃) by independence of X₂ and X₃.
EX₁ = ∫_0^1 x × 1 dx = 1/2, EX₁² = ∫_0^1 x² × 1 dx = 1/3 (same for X₂ and X₃), so
EX₁²(X₂ − X₃)² = (1/3)(1/3 + 1/3 − 2(1/2)(1/2)) = 1/18.
For a discrete random variable X taking values 0, 1, 2, 3, ...:
E(X) = Σ_{n=0}^∞ n P(X = n); for n = 0 the contribution is 0, for n = 1 it is P(1), for n = 2 it is 2P(2), for n = 3 it is 3P(3), ...
Regrouping, E(X) = Σ_{n=1}^∞ P(X ≥ n).
Example: X = number of trials until success, with P(success) = p and P(failure) = 1 − p = q.
E(X) = Σ_{n=1}^∞ P(X ≥ n) = Σ_{n=1}^∞ (1 − p)^{n−1} = 1 + q + q² + ... = 1/(1 − q) = 1/p.
Here P(X ≥ n) = (1 − p)^{n−1} because the first n − 1 trials must all result in failure.
This is much easier than the original formula,
Σ_{n=0}^∞ n P(X = n) = Σ_{n=1}^∞ n(1 − p)^{n−1} p;
a quick numerical check of both appears below.
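A numerical comparison of the two formulas (not from the notes; p is an arbitrary choice and the infinite sums are truncated at 500 terms, which is more than enough for convergence here):

```python
p = 0.25  # an arbitrary success probability
# tail-sum formula: sum_{n>=1} P(X >= n) = sum_{n>=1} (1-p)^(n-1)
tail_sum = sum((1 - p) ** (n - 1) for n in range(1, 500))
# definition: sum_{n>=1} n * P(X = n) = sum_{n>=1} n (1-p)^(n-1) p
direct = sum(n * (1 - p) ** (n - 1) * p for n in range(1, 500))
print(round(tail_sum, 4), round(direct, 4), 1 / p)  # all three agree: 4.0
```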
Definition: Var(X) = E(X − E(X))2 = θ 2 (X)
Measure of� the deviation from the expectation (mean).
Var(X) = (X − E(X))2 f (x)dx - moment of inertia.
50
≈
�
(X − center of gravity )2 × mx
Standard deviation: σ(X) = √Var(X).
Var(aX + b) = a²Var(X) and σ(aX + b) = |a|σ(X).
Proof by definition:
E((aX + b) − E(aX + b))² = E(aX + b − aE(X) − b)² = a²E(X − E(X))² = a²Var(X).
Property: Var(X) = EX² − (E(X))².
Proof: Var(X) = E(X − E(X))² = E(X² − 2XE(X) + (E(X))²) = EX² − 2E(X) × E(X) + (E(X))² = EX² − (E(X))².
Example: X ~ U[0, 1]. EX = ∫_0^1 x × 1 dx = 1/2, EX² = ∫_0^1 x² × 1 dx = 1/3, so
Var(X) = 1/3 − (1/2)² = 1/12.
If X₁, ..., Xₙ are independent, then Var(X₁ + ... + Xₙ) = Var(X₁) + ... + Var(Xₙ).
Proof (for two variables):
Var(X₁ + X₂) = E(X₁ + X₂ − E(X₁ + X₂))² = E((X₁ − EX₁) + (X₂ − EX₂))²
= E(X₁ − EX₁)² + E(X₂ − EX₂)² + 2E(X₁ − EX₁)(X₂ − EX₂)
= Var(X₁) + Var(X₂) + 2E(X₁ − EX₁) × E(X₂ − EX₂) (by independence of X₁ and X₂)
= Var(X₁) + Var(X₂).
Property: Var(a₁X₁ + ... + aₙXₙ + b) = a₁²Var(X₁) + ... + aₙ²Var(Xₙ).
Example: binomial distribution B(n, p), P(X = k) = C(n, k)pᵏ(1 − p)ⁿ⁻ᵏ.
X = X₁ + ... + Xₙ, with Xᵢ = 1 if trial i is a success and 0 if it is a failure, so Var(X) = Σ_{i=1}^n Var(Xᵢ).
Var(Xᵢ) = EXᵢ² − (EXᵢ)², EXᵢ = 1(p) + 0(1 − p) = p, EXᵢ² = 1²(p) + 0²(1 − p) = p, so Var(Xᵢ) = p − p² = p(1 − p),
and Var(X) = np(1 − p) = npq.
Law of Large Numbers: X₁, X₂, ..., Xₙ independent, identically distributed.
Sₙ = (X₁ + ... + Xₙ)/n → EX₁ as n → ∞.
Take ε > 0, small: P(|Sₙ − EX₁| > ε) → 0 as n → ∞. By Chebyshev's inequality:
P((Sₙ − EX₁)² > ε²) ≤ (1/ε²) E(Sₙ − EX₁)² (applying P(Y > M) ≤ EY/M with Y = (Sₙ − EX₁)², M = ε²)
= (1/ε²) E((X₁ + ... + Xₙ)/n − EX₁)² = (1/ε²) Var((1/n)(X₁ + ... + Xₙ))
= (1/(ε²n²))(Var(X₁) + ... + Var(Xₙ)) = nVar(X₁)/(ε²n²) = Var(X₁)/(nε²) → 0
for large n.
** End of Lecture 17
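A short simulation illustrating the law of large numbers for U[0, 1] draws, where EX₁ = 1/2 (not in the original notes; the sample sizes and seed are arbitrary choices):

```python
import random

# Running averages of i.i.d. U[0, 1] draws approach EX_1 = 0.5 as n grows.
random.seed(0)
total = 0.0
for n in range(1, 100_001):
    total += random.random()
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(n, round(total / n, 4))
```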
18.05 Lecture 18
March 18, 2005
Law of Large Numbers. X₁, ..., Xₙ i.i.d. (independent, identically distributed):
x̄ = (X₁ + ... + Xₙ)/n → EX₁ as n → ∞.
This can be used for functions of random variables as well: consider Yᵢ = r(Xᵢ), also i.i.d., so
Ȳ = (r(X₁) + ... + r(Xₙ))/n → EY₁ = Er(X₁) as n → ∞.
Relevance for statistics: with data points xᵢ, as n → ∞ the average converges to the unknown expected value of the distribution, which often contains a lot (or all) of the information about the distribution.
Example: Conduct a poll for 2 candidates:
p ⊂ [0, 1] is what we’re looking for
Poll: choose n people randomly: X1 , ..., Xn
P(Xi = 1) = p
P(Xi = 0) = 1 − p
EX1 = 1(p) + 0(1 − p) = p, so
x̄ = (X1 + ... + Xn)/n → p as n → ∞
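Not part of the original notes: a minimal simulation of this poll, assuming NumPy is available; the true p, the seed, and the sample sizes are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.55                                 # true (unknown) support for the candidate

for n in [10, 100, 1000, 100000]:
    x = rng.binomial(1, p, size=n)       # X_i = 1 if person i supports the candidate
    print(n, x.mean())                   # the sample average approaches p as n grows
```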
Other characteristics of distribution:
Moments of the distribution: for each integer k ≥ 1, the kth moment is EX^k.
The kth moment is defined only if E|X|^k < ∞.
Moment generating function: consider a parameter t ∈ R
and define φ(t) = Ee^{tX}, where X is a random variable.
φ(t) - the m.g.f. of X
Taylor series of φ(t) = Σ_{k≥0} φ^(k)(0) t^k/k!
Taylor series of Ee^{tX} = E Σ_{k≥0} (tX)^k/k! = Σ_{k≥0} EX^k t^k/k!
Matching coefficients: EX^k = φ^(k)(0)
Example: Exponential distribution E(α) with p.d.f. f(x) = {αe^{−αx}, x ≥ 0; 0, x < 0}
Compute the moments:
EX^k = ∫₀^∞ x^k αe^{−αx} dx is a difficult integral.
Use the m.g.f.:
φ(t) = Ee^{tX} = ∫₀^∞ e^{tx} αe^{−αx} dx = ∫₀^∞ αe^{(t−α)x} dx
(defined for t < α, to keep the integral finite)
= α/(t−α) e^{(t−α)x} |₀^∞ = α/(α − t) = 1/(1 − t/α) = Σ_{k≥0} (t/α)^k = Σ_{k≥0} (k!/α^k) t^k/k!
Recall the formula for the geometric series:
Σ_{k≥0} x^k = 1/(1 − x) when |x| < 1
Comparing coefficients of t^k/k!: EX^k/k! = 1/α^k, so EX^k = k!/α^k.
The moment generating function completely describes the distribution.
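A quick numerical check (not in the original notes), assuming NumPy is available; α = 2 and the sample size are arbitrary:

```python
import math

import numpy as np

rng = np.random.default_rng(1)
alpha = 2.0
x = rng.exponential(scale=1 / alpha, size=10**6)            # sample from E(alpha)

for k in range(1, 5):
    print(k, (x**k).mean(), math.factorial(k) / alpha**k)   # Monte Carlo moment vs k!/alpha^k
```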
EX^k = ∫ x^k f(x) dx
If f(x) is unknown, the moments give a system of equations for f → a unique distribution for a given set of moments.
The m.g.f. uniquely determines the distribution.
Example: X1, X2 from E(α), Y = X1 + X2.
To find distribution of sum, we could use the convolution formula,
but, it is easier to find the m.g.f. of sum Y :
Ee^{tY} = Ee^{t(X1+X2)} = E(e^{tX1}e^{tX2}) = Ee^{tX1} Ee^{tX2}
The moment generating function of each factor is α/(α − t); for the sum it is (α/(α − t))².
Consider the exponential distribution:
X ~ E(α), f(x) = {αe^{−αx}, x ≥ 0; 0, x < 0}, EX = 1/α
This distribution describes the life span of quality products: α = 1/EX, so if α is small, the life span is large.
Median: m ∈ R such that P(X ≥ m) ≥ 1/2 and P(X ≤ m) ≥ 1/2.
(There are times in discrete distributions when the probability cannot ever equal exactly 0.5.)
When you exclude the point itself: P(X > m) ≤ 1/2; note P(X ≤ m) + P(X > m) = 1.
The median is not always uniquely defined; it can be an interval over which no point masses occur.
For a continuous distribution, you can define P(X < m) and P(X > m) as equal to 1/2, but there are still cases in which the median is not unique!
For a continuous distribution: P(X ≤ m) = P(X ≥ m) = 1/2
The average measures center of gravity, and is skewed easily by outliers.
The average will be pulled towards the tail of a p.d.f. relative to the median.
Mean: find a ∈ R such that E(X − a)² is minimized over a.
∂/∂a E(X − a)² = −2E(X − a) = 0, EX − a = 0 → a = EX
The squared deviation is minimized by the expectation.
Median: find a ∈ R such that E|X − a| is minimized. Claim: E|X − a| ≥ E|X − m|, where m is the median, i.e.
E(|X − a| − |X − m|) ≥ 0. Write it as ∫ (|x − a| − |x − m|) f(x) dx and take a > m for concreteness.
Look at each region (with a > m):
1) x ≤ m: |x − a| − |x − m| = (a − x) − (m − x) = a − m
2) x ≥ a: |x − a| − |x − m| = (x − a) − (x − m) = m − a
3) m ≤ x ≤ a: |x − a| − |x − m| = (a − x) − (x − m) = a + m − 2x ≥ m − a
The integral can now be bounded:
∫ (|x − a| − |x − m|) f(x) dx ≥ ∫_{−∞}^{m} (a − m) f(x) dx + ∫_{m}^{∞} (m − a) f(x) dx =
= (a − m)( ∫_{−∞}^{m} f(x) dx − ∫_{m}^{∞} f(x) dx ) = (a − m)(P(X ≤ m) − P(X > m)) ≥ 0
since both (a − m) and the difference of probabilities are non-negative. The absolute deviation is minimized by the median. ** End of Lecture 18
18.05 Lecture 19 March 28, 2005
Covariance and Correlation.
Consider 2 random variables X, Y: σ_x² = Var(X), σ_y² = Var(Y)
Definition 1:
Covariance of X and Y is defined as:
Cov(X, Y ) = E(X − EX)(Y − EY ) Positive when both high or low in deviation.
Definition 2:
Correlation of X and Y is defined as:
ρ(X, Y) = Cov(X, Y)/(σ_x σ_y) = Cov(X, Y)/√(Var(X)Var(Y))
The scaling is thus removed from the covariance.
Cov(X, Y) = E(XY − XEY − YEX + EXEY) = E(XY) − EXEY − EYEX + EXEY = E(XY) − EXEY
Cov(X, Y) = E(XY) − EXEY
Property 1:
If the variables are independent, Cov(X, Y ) = 0 (not correlated)
Cov(X, Y ) = E(XY ) − EXEY = EXEY − EXEY = 0
Example: X takes values {−1, 0, 1} with equal probabilities {1/3, 1/3, 1/3}, and Y = X².
X and Y are dependent, but they are uncorrelated:
Cov(X, Y) = EX³ − EX·EX², but EX = 0 and EX³ = EX = 0.
The covariance is 0, but X and Y are still dependent.
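A small simulation of this example (not in the original notes), assuming NumPy is available; the sample size and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.choice([-1, 0, 1], size=10**6)    # X uniform on {-1, 0, 1}
y = x**2                                  # Y = X^2 is a deterministic function of X

print(np.cov(x, y)[0, 1])                 # sample covariance is ~0: uncorrelated
print(np.corrcoef(x, y)[0, 1])            # sample correlation is ~0, yet Y depends on X
```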
Also, the correlation is always between −1 and 1.
Cauchy-Schwarz Inequality: (EXY)² ≤ EX²EY²
Also known as the dot-product inequality: |(v, u)| ≤ |v||u|
To prove it for expectations: φ(t) = E(tX + Y)² = t²EX² + 2tEXY + EY² ≥ 0
A quadratic in t that is always non-negative has no two distinct roots, so its discriminant satisfies
D = (EXY)² − EX²EY² ≤ 0
Equality is possible only if φ(t) = 0 for some point t:
φ(t) = E(tX + Y)² = 0 if tX + Y = 0, i.e. Y = −tX - linear dependence.
(Cov(X, Y))² = (E(X − EX)(Y − EY))² ≤ E(X − EX)²E(Y − EY)² = σ_x²σ_y²
|Cov(X, Y)| ≤ σ_x σ_y, so |ρ(X, Y)| = |Cov(X, Y)|/(σ_x σ_y) ≤ 1
So, the correlation is between −1 and 1. Property 2: −1 ≤ ρ(X, Y) ≤ 1
When is the correlation equal to 1 or −1? |ρ(X, Y)| = 1 only when Y − EY = c(X − EX), i.e. Y = aX + b for some constants a, b.
(Occurs when your data points are in a straight line.)
If Y = aX + b:
ρ(X, Y) = (E(aX² + bX) − EX·E(aX + b)) / √(Var(X) × a²Var(X)) = aVar(X)/(|a|Var(X)) = a/|a| = sign(a)
If a is positive, then the correlation = 1, X and Y are completely positively correlated. If a is negative, then correlation = -1, X and Y are completely negatively correlated.
Looking at the distribution of points on Y = X², there is NO linear dependence, so the correlation is 0. However, if Y = X² + cX, then some linear dependence is introduced and the graph is skewed.
Property 3: Var(X + Y) = E(X + Y − EX − EY)² = E((X − EX) + (Y − EY))² =
= E(X − EX)² + 2E(X − EX)(Y − EY) + E(Y − EY)² = Var(X) + Var(Y) + 2Cov(X, Y)
Conditional Expectation:
(X, Y) - random pair.
What is the average value of Y given that you know X?
f(x, y) - joint p.d.f. or p.f.; then f(y|x) - the conditional p.d.f. or p.f.
Conditional expectation: E(Y|X = x) = ∫ y f(y|x) dy or Σ_y y f(y|x)
E(Y|X) = h(X) = ∫ y f(y|X) dy - a function of X, still a random variable.
Property 4:
E(E(Y|X)) = EY
Proof:
E(E(Y|X)) = E(h(X)) = ∫ h(x) f(x) dx =
= ∫ ( ∫ y f(y|x) dy ) f(x) dx = ∫∫ y f(y|x) f(x) dy dx = ∫∫ y f(x, y) dy dx =
= ∫ y ( ∫ f(x, y) dx ) dy = ∫ y f(y) dy = EY
Property 5:
E(a(X)Y|X) = a(X)E(Y|X). See the text for the proof.
Summary of Common Distributions:
1) Bernoulli Distribution: B(p), p ∈ [0, 1] - parameter
Possible values of the random variable: X ∈ {0, 1}; f(x) = p^x(1 − p)^{1−x}
P(1) = p, P(0) = 1 − p; E(X) = p, Var(X) = p(1 − p)
2) Binomial Distribution: B(n, p), n repetitions of Bernoulli
X ∈ {0, 1, ..., n}; f(x) = (n choose x) p^x (1 − p)^{n−x}
E(X) = np, Var(X) = np(1 − p)
3) Exponential Distribution: E(α), parameter α > 0
X ∈ [0, ∞), p.d.f. f(x) = {αe^{−αx}, x ≥ 0; 0, otherwise}
EX = 1/α, EX^k = k!/α^k, Var(X) = 2/α² − 1/α² = 1/α²
** End of Lecture 19
18.05 Lecture 20 March 30, 2005
§5.4 Poisson Distribution
Π(λ), parameter λ > 0; the random variable takes values {0, 1, 2, ...}
p.f.: f(x) = P(X = x) = (λ^x/x!) e^{−λ}; check: Σ_{x≥0} (λ^x/x!) e^{−λ} = e^{−λ} × e^{λ} = 1
Moment generating function: φ(t) = Ee^{tX} = Σ_{x≥0} e^{tx} (λ^x/x!) e^{−λ} = e^{−λ} Σ_{x≥0} (e^t λ)^x/x! = e^{−λ} e^{e^t λ} = e^{λ(e^t − 1)}
EX^k = φ^(k)(0)
EX = φ'(0) = e^{λ(e^t−1)} × λe^t |_{t=0} = λ
EX² = φ''(0) = (λe^{λ(e^t−1)+t})' |_{t=0} = λe^{λ(e^t−1)+t}(λe^t + 1) |_{t=0} = λ(λ + 1)
Var(X) = EX² − (EX)² = λ(λ + 1) − λ² = λ
If X1 ~ Π(λ1), X2 ~ Π(λ2), ..., Xn ~ Π(λn), all independent:
Y = X1 + ... + Xn , find moment generating function of Y,
φ(t) = Ee^{tY} = Ee^{t(X1+...+Xn)} = E(e^{tX1} × ... × e^{tXn})
By independence: = Ee^{tX1} Ee^{tX2} × ... × Ee^{tXn} = e^{λ1(e^t−1)} e^{λ2(e^t−1)} ... e^{λn(e^t−1)} = e^{(λ1+λ2+...+λn)(e^t−1)},
which is the moment generating function of Π(λ1 + ... + λn), so Y ~ Π(λ1 + ... + λn).
If the variables are dependent, the sum need not be Poisson; for example,
taking X1 twice, X1 + X1 = 2X1 ∈ {0, 2, 4, ...} - it skips odd numbers, so it is not Poisson.
Approximation of the Binomial: X1, ..., Xn ~ B(p), P(Xi = 1) = p, P(Xi = 0) = 1 − p
Y = X1 + ... + Xn ~ B(n, p), P(Y = k) = (n choose k) p^k (1 − p)^{n−k}
If p is very small and n is large, set np = λ; e.g. p = 1/100, n = 100, np = 1.
(n choose k) p^k (1 − p)^{n−k} = (n choose k)(λ/n)^k (1 − λ/n)^{n−k} = ((n choose k)/n^k) λ^k (1 − λ/n)^n (1 − λ/n)^{−k}
Many factors can be simplified when n is large, using lim_{n→∞} (1 + x/n)^n = e^x:
(n choose k)/n^k = n!/(k!(n − k)! n^k) = ((n − k + 1)(n − k + 2)···n)/(n × n × ... × n) × 1/k!
Simplify the left fraction:
(1 − (k−1)/n)(1 − (k−2)/n)···(1 − 1/n) → 1, so (n choose k)/n^k → 1/k!
Also (1 − λ/n)^n → e^{−λ} and (1 − λ/n)^{−k} → 1.
So, in the end:
(n choose k) p^k (1 − p)^{n−k} ≈ (λ^k/k!) e^{−λ}
A Poisson distribution with parameter λ results.
Example:
B(100, 1/100) ≈ Π(1); P(2) ≈ (1/2!) e^{−1} = e^{−1}/2,
very close to the actual binomial probability.
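Not in the original notes: a short comparison of the two p.f.'s using only the standard library, for the same n = 100, p = 1/100:

```python
from math import comb, exp, factorial

n, p = 100, 1 / 100
lam = n * p
for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k / factorial(k) * exp(-lam)
    print(k, round(binom, 5), round(poisson, 5))   # the two columns nearly agree
```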
Counting Processes: Wrong connections to a phone number, number of typos in a book on
a page, number of bacteria on a part of a plate.
Properties:
1) Count(S) - a count of random objects in a region S √ T
E(count(S)) = λ × |S|, where |S| is the size of S
(property of proportionality)
2) Counts on disjoint regions are independent.
3) P(count(S) ∼ 2) is very small if the size of the region is small.
Properties 1, 2, and 3 lead to count(S) ~ Π(λ|S|), where λ is the intensity parameter.
A region [0, T] is split into n sections, each of size |T|/n; the counts on the sections are X1, ..., Xn.
By 2), X1, ..., Xn are independent. By 3), P(Xi ≥ 2) is small if n is large.
By 1), EXi = λ|T|/n = 0·P(Xi = 0) + 1·P(Xi = 1) + 2·P(Xi = 2) + ...
But the contribution of values above 1 is very small, so
P(Xi = 1) ≈ λ|T|/n, P(Xi = 0) ≈ 1 − λ|T|/n
P(count(T) = k) = P(X1 + ... + Xn = k) ≈ B(n, λ|T|/n) ≈ Π(λ|T|): (λ|T|)^k/k! × e^{−λ|T|}
§5.6 - Normal Distribution
Consider ( ∫_{−∞}^{∞} e^{−x²/2} dx )²
Change variables to facilitate integration:
( ∫ e^{−x²/2} dx )² = ∫ e^{−x²/2} dx × ∫ e^{−y²/2} dy = ∫∫ e^{−(x²+y²)/2} dx dy
Convert to polar coordinates:
= ∫₀^{2π} ∫₀^∞ e^{−r²/2} r dr dθ = 2π ∫₀^∞ e^{−r²/2} r dr = 2π ∫₀^∞ e^{−r²/2} d(r²/2) = 2π ∫₀^∞ e^{−t} dt = 2π
So the original integral equals √(2π), and
∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx = 1
p.d.f.: f(x) = (1/√(2π)) e^{−x²/2}
Standard normal distribution, N(0, 1)
** End of Lecture 20
18.05 Lecture 21 April 1, 2005
Normal Distribution.
Standard Normal Distribution, N(0, 1): p.d.f. f(x) = (1/√(2π)) e^{−x²/2}
m.g.f.: φ(t) = E(e^{tX}) = e^{t²/2}
Proof - simplify the integral by completing the square:
φ(t) = ∫ e^{tx} (1/√(2π)) e^{−x²/2} dx = (1/√(2π)) ∫ e^{tx − x²/2} dx =
= (1/√(2π)) ∫ e^{t²/2 − t²/2 + tx − x²/2} dx = e^{t²/2} (1/√(2π)) ∫ e^{−(x−t)²/2} dx
Then, perform the change of variables y = x − t:
= e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} dy = e^{t²/2} ∫ f(y) dy = e^{t²/2}
Use the m.g.f. to find the expectation of X and X², and therefore Var(X):
E(X) = φ'(0) = t e^{t²/2} |_{t=0} = 0; E(X²) = φ''(0) = e^{t²/2} t² + e^{t²/2} |_{t=0} = 1; Var(X) = 1
Consider X ~ N(0, 1) and Y = σX + µ; find the distribution of Y:
P(Y ≤ y) = P(σX + µ ≤ y) = P(X ≤ (y − µ)/σ) = ∫_{−∞}^{(y−µ)/σ} (1/√(2π)) e^{−x²/2} dx
p.d.f. of Y: f(y) = ∂P(Y ≤ y)/∂y = (1/(σ√(2π))) e^{−(y−µ)²/(2σ²)} → N(µ, σ)
EY = E(σX + µ) = σ(0) + µ = µ
E(Y − µ)² = E(σX + µ − µ)² = σ²E(X²) = σ² - the variance of N(µ, σ)
σ = √Var(Y) - the standard deviation
Comparing the standard normal N(0, 1) with a normal N(µ, σ): the peak is located at the new mean µ, and the points of inflection occur a distance σ away from µ.
Moment Generating Function of N(µ, σ): with Y = σX + µ,
φ(t) = Ee^{tY} = Ee^{t(σX+µ)} = e^{tµ} Ee^{(tσ)X} = e^{tµ} e^{(tσ)²/2} = e^{tµ + t²σ²/2}
Note: X1 ≈ N (µ1 , θ1 ), ..., Xn ≈ N (µn , θn ) - independent.
Y = X1 + ... + Xn , distribution of Y:
Use moment generating function:
Ee^{tY} = Ee^{t(X1+...+Xn)} = E(e^{tX1}...e^{tXn}) = Ee^{tX1}...Ee^{tXn} = e^{µ1t + σ1²t²/2} × ... × e^{µnt + σn²t²/2} = e^{(Σµi)t + (Σσi²)t²/2}
so Y ~ N( Σµi, √(Σσi²) )
The sum of different normal distributions is still normal!
This is not always true for other distributions (such as exponential)
Example:
X ≈ N (µ, θ), Y = cX, find that the distribution is still normal:
Y = c(θN (0, 1) + µ) = (cθ)N (0, 1) + (µc)
Y ≈ cN (µ, θ) = N (cµ, cθ)
Example:
Y ~ N(µ, σ)
P(a ≤ Y ≤ b) = P(a ≤ σX + µ ≤ b) = P((a − µ)/σ ≤ X ≤ (b − µ)/σ)
This gives the new limits for the standard normal.
Example:
Suppose that the heights of women are X ~ N(65, 1) and of men are Y ~ N(68, 2).
Find P(a randomly chosen woman is taller than a randomly chosen man):
P(X > Y) = P(X − Y > 0)
Z = X − Y ~ N(65 − 68, √(1² + 2²)) = N(−3, √5)
P(Z > 0) = P( (Z − (−3))/√5 > (0 − (−3))/√5 ) = P(standard normal > 3/√5 = 1.342) ≈ 0.09
Probability values are tabulated in the back of the textbook.
Central Limit Theorem
Flip 100 coins; you expect 50 tails, and somewhere around 45-55 is considered typical.
Flip 10,000 coins; you expect 5,000 tails, and the deviation can be larger: perhaps 4,950-5,050 is typical.
Xi = {1 (tail); 0 (head)}
(X1 + ... + Xn)/n = (number of tails)/n → E(X1) = 1/2 by the LLN; Var(X1) = (1/2)(1 − 1/2) = 1/4
But how do you describe the deviations? X1, X2, ..., Xn are independent with some distribution P,
µ = EX1, σ² = Var(X1); x̄ = (1/n) Σ_i Xi → EX1 = µ
x̄ − µ is on the order of 1/√n, and √n(x̄ − µ)/σ behaves like a standard normal:
√n(x̄ − µ)/σ is approximately standard normal N(0, 1) for large n
P( √n(x̄ − µ)/σ ≤ x ) → P(standard normal ≤ x) = N(0, 1)(−∞, x) as n → ∞
This is useful in statistics for describing outcomes of an experiment as likely or unlikely.
P(number of tails ≤ 4900) = P(X1 + ... + X10,000 ≤ 4900) = P(x̄ ≤ 0.49) =
= P( √10000 (x̄ − 1/2)/(1/2) ≤ √10000 (0.49 − 0.5)/(1/2) ) ≈ N(0, 1)(−∞, 100(−0.01)/(1/2) = −2) ≈ 0.0228
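Not in the original notes: a quick check of this normal approximation by simulation, assuming NumPy is available; the seed and number of simulations are arbitrary.

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(3)
n = 10_000
sims = rng.binomial(n, 0.5, size=200_000)     # number of tails in n fair flips
mc = (sims <= 4900).mean()                    # Monte Carlo estimate of P(tails <= 4900)

phi = 0.5 * (1 + erf(-2 / sqrt(2)))           # standard normal c.d.f. at -2
print(mc, phi)                                # both are close to 0.023
```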
Tables usually give, for positive x, the area to the left. To look up −2, find the value for 2 and take the complement. ** End of Lecture 21
18.05 Lecture 22 April 4, 2005
Central Limit Theorem. X1, ..., Xn - independent, identically distributed (i.i.d.); x̄ = (1/n)(X1 + ... + Xn), µ = EX, σ² = Var(X).
√n(x̄ − µ)/σ → N(0, 1) as n → ∞
You can use knowledge of the standard normal distribution to describe your data:
√n(x̄ − µ)/σ = Y, so x̄ − µ = σY/√n
This refines the law of large numbers:
It tells you exactly how much the average and the expected value should differ.
√n(x̄ − µ)/σ = √n ( (x1 − µ)/(nσ) + ... + (xn − µ)/(nσ) ) = (1/√n)(Z1 + ... + Zn)
where Zi = (Xi − µ)/σ; E(Zi) = 0, Var(Zi) = 1.
Consider the m.g.f. and see that it is very similar to that of the standard normal distribution:
Ee^{t(1/√n)(Z1+...+Zn)} = Ee^{tZ1/√n} × ... × Ee^{tZn/√n} = (Ee^{tZ1/√n})^n
Ee^{tZ1} = 1 + tEZ1 + (1/2)t²EZ1² + (1/6)t³EZ1³ + ... = 1 + (1/2)t² + (1/6)t³EZ1³ + ...
Ee^{(t/√n)Z1} = 1 + t²/(2n) + t³EZ1³/(6n^{3/2}) + ... ≈ 1 + t²/(2n)
Therefore: (Ee^{tZ1/√n})^n ≈ (1 + t²/(2n))^n
(1 + t²/(2n))^n → e^{t²/2} as n → ∞ - the m.g.f. of the standard normal distribution!
Gamma Distribution:
Gamma function: for α > 0, β > 0,
Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx
p.d.f. of the Gamma distribution:
1 = ∫₀^∞ (1/Γ(α)) x^{α−1} e^{−x} dx, so f(x) = { (1/Γ(α)) x^{α−1} e^{−x}, x ≥ 0; 0, x < 0 }
Change of variable x = βy, to stretch the function:
1 = ∫₀^∞ (1/Γ(α)) β^{α−1} y^{α−1} e^{−βy} β dy = (β^α/Γ(α)) ∫₀^∞ y^{α−1} e^{−βy} dy
p.d.f. of the Gamma distribution, f(x|α, β):
f(x|α, β) = { (β^α/Γ(α)) x^{α−1} e^{−βx}, x ≥ 0; 0, x < 0 } - Gamma(α, β)
Properties of the Gamma Function:
Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx
Integrate by parts:
= ∫₀^∞ x^{α−1} d(−e^{−x}) = x^{α−1}(−e^{−x}) |₀^∞ + ∫₀^∞ e^{−x}(α − 1)x^{α−2} dx = 0 + (α − 1) ∫₀^∞ x^{α−2} e^{−x} dx = (α − 1)Γ(α − 1)
In summary, Property 1: Γ(α) = (α − 1)Γ(α − 1)
You can expand Property 1 as follows:
Γ(n) = (n − 1)Γ(n − 1) = (n − 1)(n − 2)Γ(n − 2) = (n − 1)(n − 2)(n − 3)Γ(n − 3) = ... = (n − 1)···(1)Γ(1) = (n − 1)!Γ(1)
Γ(1) = ∫₀^∞ e^{−x} dx = 1 → Γ(n) = (n − 1)!
In summary, Property 2: Γ(n) = (n − 1)!
Moments of the Gamma Distribution: X ~ Gamma(α, β)
EX^k = ∫₀^∞ x^k (β^α/Γ(α)) x^{α−1} e^{−βx} dx = (β^α/Γ(α)) ∫₀^∞ x^{(α+k)−1} e^{−βx} dx
Make this integral into a density to simplify:
= (β^α/Γ(α)) × (Γ(α + k)/β^{α+k}) × ∫₀^∞ (β^{α+k}/Γ(α + k)) x^{(α+k)−1} e^{−βx} dx
The last integral is just the Gamma(α + k, β) density, which integrates to 1!
= Γ(α + k)/(Γ(α)β^k) = (α + k − 1)(α + k − 2) × ... × α × Γ(α)/(Γ(α)β^k) = (α + k − 1) × ... × α / β^k
For k = 1:
E(X) = α/β
For k = 2: E(X²) = (α + 1)α/β²
Var(X) = (α + 1)α/β² − α²/β² = α/β²
Example:
If the mean = 50 and variance = 1 are given for a Gamma distribution, solve α/β = 50 and α/β² = 1 to get α = 2500 and β = 50, which characterizes the distribution.
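Not in the original notes: a check of this parameter choice, assuming SciPy is available (SciPy's gamma distribution is parametrized by shape a = α and scale = 1/β):

```python
from scipy import stats

alpha, beta = 2500.0, 50.0                    # solved from mean = 50, variance = 1
g = stats.gamma(a=alpha, scale=1 / beta)

print(g.mean(), g.var())                      # 50.0 and 1.0: alpha/beta and alpha/beta**2
```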
Beta Distribution:
∫₀¹ x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β)
Beta distribution p.d.f.: f(x|α, β) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}, so that 1 = ∫₀¹ f(x|α, β) dx
Proof:
Γ(α)Γ(β) = ∫₀^∞ x^{α−1} e^{−x} dx ∫₀^∞ y^{β−1} e^{−y} dy = ∫₀^∞ ∫₀^∞ x^{α−1} y^{β−1} e^{−(x+y)} dx dy
Set up for the change of variables:
x^{α−1} y^{β−1} e^{−(x+y)} = x^{α−1} ((x + y) − x)^{β−1} e^{−(x+y)} = x^{α−1} (x + y)^{β−1} (1 − x/(x+y))^{β−1} e^{−(x+y)}
Change of variables: s = x + y, t = x/(x + y), so x = st, y = s(1 − t) → Jacobian = s
Substitute:
= ∫₀¹ ∫₀^∞ t^{α−1} s^{α+β−2} (1 − t)^{β−1} e^{−s} s ds dt = ∫₀¹ t^{α−1} (1 − t)^{β−1} dt × ∫₀^∞ s^{α+β−1} e^{−s} ds =
= Γ(α + β) ∫₀¹ t^{α−1} (1 − t)^{β−1} dt, which equals Γ(α)Γ(β).
Moments of the Beta Distribution:
EX^k = ∫₀¹ x^k (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1} dx = (Γ(α + β)/(Γ(α)Γ(β))) ∫₀¹ x^{(α+k)−1} (1 − x)^{β−1} dx
Once again, the integral is (up to a constant) the density of a Beta distribution:
= (Γ(α + β)/(Γ(α)Γ(β))) × (Γ(α + k)Γ(β)/Γ(α + β + k)) = (Γ(α + β)/Γ(α + β + k)) × (Γ(α + k)/Γ(α)) = (α + k − 1) × ... × α / ((α + β + k − 1) × ... × (α + β))
For k = 1: EX = α/(α + β)
For k = 2: EX² = (α + 1)α/((α + β + 1)(α + β))
Var(X) = (α + 1)α/((α + β + 1)(α + β)) − α²/(α + β)² = αβ/((α + β)²(α + β + 1))
Shape of beta distribution. ** End of Lecture 22
18.05 Lecture 23 April 6, 2005
Estimation Theory: If only 2 outcomes: Bernoulli distribution describes your experiment.
If calculating wrong numbers: Poisson distribution describes experiment.
May know the type of distribution, but not the parameters involved.
A sample (i.i.d.) X1, ..., Xn has distribution P from a family of distributions {P_θ : θ ∈ Θ}.
P = P_{θ0}, where θ0 is unknown.
Estimation Theory - take data and estimate the parameter.
It is often obvious based on the relation to the problem itself.
Example: B(p), sample: 0 0 1 1 0 1 0 1 1 1
p = E(X) ≈ x̄ = 6/10 = 0.6
Example: E(α), f(x) = αe^{−αx}, x ≥ 0, E(X) = 1/α.
Once again, the parameter is connected to the expected value:
1/α = E(X) ≈ x̄, so α ≈ 1/x̄ - an estimate of α.
Bayes Estimators: - used when intuitive model can be used in describing the data.
X1, ..., Xn ~ P_{θ0}, θ0 ∈ Θ
Prior Distribution - describes the distribution of the parameter (NOT the data):
f(θ) - p.f. or p.d.f. → corresponds to intuition.
P_θ has p.f. or p.d.f. f(x|θ).
Given x1, ..., xn, the joint p.f. or p.d.f. is f(x1, ..., xn|θ) = f(x1|θ) × ... × f(xn|θ).
To find the Posterior Distribution - the distribution of the parameter given your collected data - use the Bayes formula:
f(θ|x1, ..., xn) = f(x1, ..., xn|θ) f(θ) / ∫ f(x1, ..., xn|θ) f(θ) dθ
The posterior distribution adjusts your assumption (the prior distribution) based upon your sample data.
Example: B(p), f(x|p) = p^x(1 − p)^{1−x}
f(x1, ..., xn|p) = Π p^{xi}(1 − p)^{1−xi} = p^{Σxi}(1 − p)^{n−Σxi}
Suppose your only possibilities are p = 0.4 and p = 0.6, and you make a prior distribution based on the probability that the parameter p is equal to each of those values.
Prior assumption: f(0.4) = 0.7, f(0.6) = 0.3
You test the data, and find that there are 9 successes out of 10, so p̂ = 0.9.
Based on the data that give p̂ = 0.9, find the probability that the actual p is equal to 0.4 or 0.6.
You would expect it to shift toward the larger value.
Joint p.f. for each value: f(x1, ..., x10|0.4) = 0.4⁹(0.6)¹, f(x1, ..., x10|0.6) = 0.6⁹(0.4)¹
Then, find the posterior probabilities:
f(0.4|x1, ..., xn) = (0.4⁹(0.6)¹)(0.7) / [ (0.4⁹(0.6)¹)(0.7) + (0.6⁹(0.4)¹)(0.3) ] = 0.08
f(0.6|x1, ..., xn) = (0.6⁹(0.4)¹)(0.3) / [ (0.4⁹(0.6)¹)(0.7) + (0.6⁹(0.4)¹)(0.3) ] = 0.92
Note that it becomes much more likely that p = 0.6 than p = 0.4
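Not in the original notes: the same two-point posterior computed in plain Python.

```python
# Two-point prior on p, updated after 9 successes in 10 Bernoulli trials.
prior = {0.4: 0.7, 0.6: 0.3}
successes, n = 9, 10

unnorm = {p: p**successes * (1 - p)**(n - successes) * w for p, w in prior.items()}
total = sum(unnorm.values())
posterior = {p: v / total for p, v in unnorm.items()}
print(posterior)    # roughly {0.4: 0.08, 0.6: 0.92}
```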
Example: B(p), prior distribution on [0, 1]
Choose any prior to fit intuition, but simplify by choosing the conjugate prior.
f(p|x1, ..., xn) = p^{Σxi}(1 − p)^{n−Σxi} f(p) / ∫ (...) dp
Choose f(p) to simplify the integral. The Beta distribution works for Bernoulli samples; the prior is therefore:
f(p) = (Γ(α + β)/(Γ(α)Γ(β))) p^{α−1}(1 − p)^{β−1}, 0 ≤ p ≤ 1
Then, choose α and β to fit intuition: make E(X) and Var(X) fit intuition.
f(p|x1, ..., xn) = (Γ(α + Σxi + β + n − Σxi)/(Γ(α + Σxi)Γ(β + n − Σxi))) × p^{(α+Σxi)−1}(1 − p)^{(β+n−Σxi)−1}
Posterior Distribution = Beta(α + Σxi, β + n − Σxi)
With the conjugate prior, the posterior belongs to the same family of distributions as the prior.
Example:
Beta(α, β) such that EX = 0.4, Var(X) = 0.1. Use the relations between the parameters and the expectation and variance to solve:
EX = 0.4 = α/(α + β), Var(X) = 0.1 = αβ/((α + β)²(α + β + 1))
The posterior distribution is therefore Beta(α + 9, β + 1), and the new expected value is shifted:
EX = (α + 9)/(α + β + 10)
Once this posterior is calculated, estimate the parameter by its expected value.
Definition of Bayes Estimator: the Bayes estimator of the unknown parameter θ0 is θ(X1, ..., Xn) = the expectation of the posterior distribution.
Example: B(p), prior Beta(α, β), X1, ..., Xn → posterior Beta(α + Σxi, β + n − Σxi)
Bayes Estimator: (α + Σxi)/(α + Σxi + β + n − Σxi) = (α + Σxi)/(α + β + n)
To see the relation to the prior, divide the numerator and denominator by n:
= (α/n + x̄)/(α/n + β/n + 1)
Note that it erases the intuition for large n.
The Bayes Estimator becomes the average for large n.
** End of Lecture 23
18.05 Lecture 24 April 8, 2005
Bayes Estimator. Prior Distribution f (χ) ↔ compute posterior f (χ|X1 , ..., Xn ) Bayes’s Estimator = expectation of the posterior. E(X − a)2 ↔ minimize a ↔ a = EX Example: B(p), f (p) = Beta(∂, λ) ↔ f (p|x1 , ..., xn ) = Beta(∂ + � ∂ + xi χ(x1 , ..., xn ) = ∂+λ+n
�
xi , λ + n −
�
xi )
Example: Poisson Distribution
�(�), f (x|�) =
�x x!e−�
t p.f.: n �xi −� � xi −n� e = e f (x1 , ..., xn |�) = x! �xi ! i=1 i P
If f (�) is the prior distribution, posterior:
� xi −n� f (�) �xi ! e � �P x i g(x1 ...xn ) = �xi ! e−n� f (�)d� P
f (�|x1 , ..., xn ) = Note that g does not depend on �:
f (�|x1 , ..., xn ) ≈ �
P
xi −n�
e
f (�)
Need to choose the appropriate prior distribution, Gamma distribution works for Poisson. Take f (�) - p.d.f. of �(∂, λ), λ ∂ ∂−1 −ξ� e � �(∂) � ↔ �(∂ + xi , λ + n) f (�) =
f (�|x1 , ..., xn ) ≈ � Bayes Estimator:
P
xi +∂−1 −(n+ξ)�
e
�(x1 , ..., xn ) = EX =
� ∂ + xi n+λ
Once again, balances both prior intuition and data, by law of large numbers: � ∂/n + xi /n −−−−↔ �(x1 , ..., xn ) = n ↔ → x ↔ E(X1 ) ↔ � 1 + λ/n The estimator approaches what you’re looking for, with large n. Exponential E(∂), f (x|∂) = ∂e−∂x , x ∼ P0 f (x1 , ..., xn |∂) = �ni=1 ∂e−∂xi = ∂n e−( xi )∂ If f (∂) - prior, the posterior:
73
f (∂|x1 , ..., xn ) ≈ ∂n e−(
P
xi )∂
f (∂)
Once again, a Gamma distribution is implied. Choose f (∂) − �(u, v) f (∂) =
v u u−1 −v∂ ∂ e �(u)
New posterior: f (∂|x1 , ..., xn ) ≈ ∂n+u−1 e−(
P
xi +v)u
↔ �(u + n, v +
Bayes Estimator: ∂(x1 , ..., xn ) =
�
xi )
1 u+n u/n + 1 1 − −−↔ � = � n−↔ =∂ → ↔ v + xi v/n + xi /n x EX
Normal Distribution: N (µ, θ), f (x|µ, θ) = f (x1 , ..., xn |µ, θ) =
2 1 1 ∩ e− 2�2 (x−µ) θ 2ψ
1 1 ∩ e− 2�2 n (θ 2ψ)
Pn
i=1 (xi −µ)
2
It is difficult to find simple prior when both µ, θ are unknown. Say that θ is given, and µ is the only parameter: 2 1 1 Prior: f (µ) = ∩ e− 2b2 (µ−a) = N (a, b) b 2ψ
Posterior: 1
f (µ|X1 , ..., Xn ) ≈ e− 2�2
P
(xi −µ)2 − 2b12 (µ−a)2
Simplify the exponent: =
� 1 � 2 1 1 a xi 2 n 2 2 2 − 2µx + µ ) + (µ − 2aµ + a ) = µ ( + ) − 2µ( + 2 ) + ... (x i i 2 2 2 2 2 2θ 2b 2θ 2b 2θ 2b = µ2 A − 2µB + ... = A(µ2 − 2µ B 2
f (µ|X1 , ..., Xn ) ≈ e−A(µ− A ) = e
− 2(1/�12A)2
B B2 B B + ( )2 ) − + ... = A(µ − )2 + ... A A A A
(µ −
B 1 θ 2 b2 θ 2 A + nb2 x B 2 ) = N( , ∩ ) = N( 2 , 2 ) 2 A A θ + nb θ + nb2 2A
Normal Bayes Estimator: µ(X1 , ..., Xn ) =
θ 2 a + nb2 x θ 2 a/n + b2 x −−−−↔ = n ↔ → x ↔ E(X1 ) = µ θ 2 + nb2 θ 2 /n + b2
** End of Lecture 24
74
18.05 Lecture 25 April 11, 2005
Maximum Likelihood Estimators.
X1, ..., Xn have distribution P_{θ0} ∈ {P_θ : θ ∈ Θ}.
Joint p.f. or p.d.f.: f(x1, ..., xn|θ) = f(x1|θ) × ... × f(xn|θ) = ξ(θ) - the likelihood function.
If P_θ is discrete, then f(x|θ) = P_θ(X = x), and ξ(θ) is the probability of observing X1, ..., Xn.
Definition: A Maximum Likelihood Estimator (M.L.E.) is
θ̂ = θ̂(X1, ..., Xn) such that ξ(θ̂) = max_θ ξ(θ)
Suppose that there are two possible values of the parameter, θ = 1 and θ = 2.
p.f./p.d.f. - f (x|1), f (x|2)
Then observe points x1 , ..., xn
view probability with first parameter and second parameter:
ξ(1) = f (x1 , ..., xn |1) = 0.1, ξ(2) = f (x1 , ..., xn |2) = 0.001,
The parameter is much more likely to be 1 than 2.
Example: Bernoulli Distribution B(p), p ∈ [0, 1]:
ξ(p) = f(x1, ..., xn|p) = p^{Σxi}(1 − p)^{n−Σxi}
ξ(θ) → max ⟺ log ξ(θ) → max (the log-likelihood)
log ξ(p) = Σxi log p + (n − Σxi) log(1 − p), maximize over [0, 1]
Find the critical point:
∂ log ξ(p)/∂p = Σxi/p − (n − Σxi)/(1 − p) = 0
Σxi(1 − p) − p(n − Σxi) = Σxi − pΣxi − np + pΣxi = Σxi − np = 0
p̂ = Σxi/n = x̄ → E(X) = p
For the Bernoulli distribution, the MLE converges to the actual parameter p of the distribution.
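Not in the original notes: a minimal check that the MLE x̄ recovers p, assuming NumPy is available; p_true, the seed, and the sample size are illustration values.

```python
import numpy as np

rng = np.random.default_rng(4)
p_true = 0.3
x = rng.binomial(1, p_true, size=5000)   # Bernoulli(p_true) sample

p_hat = x.mean()                         # MLE: proportion of successes
print(p_hat)                             # close to p_true = 0.3
```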
Example: Normal Distribution: N (µ, θ 2 ), 2 1 1 f (x|µ, θ 2 ) = ∩ e− 2�2 (x−µ) 2ψθ
ξ(µ, θ 2 ) = ( ∩
1 1 )n e− 2�2 2ψθ
Pn
i=1 (xi −µ)
2
n ∩ 1 � log ξ(µ, θ 2 ) = n log( 2ψθ) − 2 (xi − µ)2 ↔ max : µ, θ 2 2θ i=1
Note that the two parameters are decoupled. First, for a fixed θ, we minimize
�n
i=1 (xi
− µ)2 over µ
75
n � i=1
n n � � � 2(xi − µ) = 0, (xi − µ)2 = − �µ i=1 i=1 n
xi − nµ = 0, µ ˆ=
1� xi = x ↔ E(X) = µ0 n i=1
To summarize, the estimator of µ for a Normal distribution is the sample mean. To find the estimator of the variance: n ∩ 1 � (xi − x)2 ↔ maximize over θ −n log( 2ψθ) − 2 2θ i=1 n � n 1 � (xi − x)2 = 0 =− + 3 �θ θ θ i=1
θ ˆ2 = Find θ ˆ 2
θ ˆ2 =
1 � (xi − x)2 - MLE of θ02 ; θ ˆ 2 − a sample variance n
1� 2 1� 2 1� 1 � 2
xi + (x)2 = (xi − 2xi x + (x)2 ) = xi − 2x xi − 2(x)2 + (x)2 = n n n n =
1� 2 xi − (x)2 = x2 − (x)2 ↔ E(x21 ) − E(x1 )2 = θ02 n
To summarize, the MLE of σ0² for a Normal distribution is the sample variance.
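Not in the original notes: a sketch of both Normal MLEs on simulated data, assuming NumPy is available; the true parameters, seed, and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
mu0, sigma0 = 2.0, 1.5
x = rng.normal(mu0, sigma0, size=20000)

mu_hat = x.mean()                        # MLE of the mean: the sample average
sigma2_hat = np.mean((x - mu_hat)**2)    # MLE of the variance: (1/n) sum (x_i - xbar)^2
print(mu_hat, sigma2_hat)                # close to 2.0 and 1.5**2 = 2.25
```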
Example: U (0, χ), χ > 0 - parameter.
1 f (x|χ) = { , 0 ← x ← χ; 0, otherwise } χ Here, when finding the maximum we need to take into that the distribution is ed on a finite interval [0, χ]. ξ(χ) =
n 1 1 I(0 ← xi ← χ) = n I(0 ← x1 , x2 , ..., xn ← χ) χ χ i=1
The likelihood function will be 0 if any points fall outside of the interval.
If χ will be the correct parameter with P = 0,
you chose the wrong χ for your distribution.
ξ(χ) ↔ maximize over χ > 0
76
If you graph the p.d.f., notice that it drops off when χ drops below the maximum data point. χˆ = max(X1 , ..., Xn ) The estimator converges to the actual parameter χ0 :
As you keep choosing points, the maximum gets closer and closer to χ0
Sketch of the consisteny of MLE. ξ(χ) ↔ max ⊇
1 log ξ(χ) ↔ max n n
Ln (χ) =
1 1� 1 log f (xi |χ) ↔ L(χ) = Eβ0 log f (x1 |χ). log ξ(χ) = log f (xi |χ) = n n n i=1
ˆ by definition of MLE. Let us show that L(χ) is maximized at χ0 . Ln (χ) is maximized at χ, Then, evidently, χˆ ↔ χ0 . L(χ) ← L(χ0 ) : Expand the inequality:
L(χ) − L(χ0 )
= =
�
�
f (x|χ) log f (x|χ0 )dx ← f (x|χ0 )
� �
� f (x|χ) − 1 f (x|χ0 )dx f (x|χ0 )
(f (x|χ) − f (x|χ0 )) dx = 1 − 1 = 0.
Here, we used that the graph of the logarithm will be less than the line y = x - 1 except at the tangent point. ** End of Lecture 25
77
18.05 Lecture 26 April 13, 2005
Confidence intervals for parameters of Normal distribution. Confidence intervals for µ0 , θ02 in N (µ0 , θ02 ) ˆ 2 = x2 − (x)2 µ ˆ = x, θ µ ˆ ↔ µ0 , θ ˆ 2 ↔ θ02 with large n, but how close exactly? You can guarantee that the mean or variance are in a particular interval with some probability:
Definition: Take ∂ ⊂ [0, 1], ∂− confidence level
If P(S1 (X1 , ..., Xn ) ← µ0 ← S2 (X1 , ..., Xn )) = ∂,
then interval [S1 , S2 ] is the confidence interval for µ0 with confidence level ∂. Consider Z0 , ..., Zn - i.i.d., N(0, 1)
Definition: The distribution of Z12 + Z22 + ... + Zn2 is called a chi-square (α2 ) distribution,
with n degrees of freedom. As shown in §7.2, the chi-square distribution is a Gamma distribution: χ²_n = Γ(n/2, 1/2).
Definition: The distribution of Z0 / √((1/n)(Z1² + ... + Zn²)) is called a t-distribution with n degrees of freedom.
The t-distribution is also called Student’s distribution, see §7.4 for detail. To find the confidence interval for N (µ0 , θ02 ), need the following: Fact: Z1 , ..., Zn ≈ i.i.d.N (0, 1) z=
1� 2 1 1� 2 zi ) (Z1 + ... + Zn ), z 2 − (z)2 = zi − ( n n n
∩ Then, A = nz ≈ N (0, 1), B = n(z 2 − (z)2 ) ≈ α2n−1 , and A and B are independent. Take X1 , ..., Xn ≈ N (µ0 , θ02 ), µ0 , θ02 unknown. Z1 =
xn − µ 0 x1 − µ 0 , ..., Zn = ≈ N (0, 1) θ0 θ0 A=
B = n(z 2 − (z)2 ) = =
∩ ∩
n(
n
θ02
∩
nz =
∩
n(
x − µ0 ) θ0
∩ 1 � (xi − µ0 )2 x − µ0 2 n 1� (xi − µ0 )2 − (x − µ0 )2 ) = ) ) = − ( ( 2 2 n θ0 θ0 θ0 n
(x2 − 2µ0 x + µ20 − x2 + 2µ0 x − µ02 ) =
∩
n 2 (x − (x)2 ) θ02
To summarize: A=
∩
∩ x − µ0 n ) ≈ N (0, 1); B = 2 (x2 − (x)2 ) ≈ α2n−1 n( θ0 θ0 78
and, A and B are independent.
You can’t compute B, because you don’t know θ0 , but you know the distribution:
B=
∩ n(x2 − (x)2 ) ≈ α2n−1 θ02
Choose the most likely values for B, between c1 and c2 .
Choose the c values from the chi-square tabled values, such that area = ∂ confidence. With probability = confidence (∂), c1 ← B ← c2 c1 ←
∩
n(x2 − (x)2 ) ← c2 θ02
Solve for θ0 : ∩
n(x2 − (x)2 ) ← θ02 ← c2
∩
n(x2 − (x)2 ) c1
Choose c1 and c2 such that the right tail has probability 1−∂ 2 , same as left tail. This results in throwing away the possibilities outside c1 and c2 Or, you could choose to make the interval as small as possible, minimize: c11 − c12 given ∂ Why wouldn’t you throw away a small interval in between c1 and c2 , with area 1 − ∂? Though it’s the same area, you are throwing away very likely values for the parameter! ** End of Lecture 26
79
18.05 Lecture 27 April 15, 2005
Take sample X1 , ..., Xn ≈ N (0, 1) A=
∩ n(x − µ) n(x2 − (x)2 ≈ N (0, 1), B = ≈ α2n−1 θ θ2
A, B - independent.
To determine the confidence interval for µ, must eliminate θ from A:
�
A 1 n−1 B
=�
Z0 1 2 n−1 (z1
2 + ... + zn−1 )
Where Z0 , Z1 , .., Zn−1 ≈ N (0, 1) The standard normal is a symmetric distribution, and
1 2 n−1 (Z1
≈ tn−1
2 ) ↔ EZ12 = 1 + ... + Zn−1
So tn -distribution still looks like a normal distribution (especially for large n), and it is symmetric about zero. Given ∂ ⊂ (0, 1) find c, tn−1 (−c, c) = ∂ −c ← �
with probability = confidence (∂) −c ←
∩
n(x − µ) / θ
−c ← �
�
A 1 n−1 B
←c
1 n(x2 − (x)2 ) ←c θ2 n−1
x−µ
←c − (x)2 ) � � 1 1 (x2 − (x)2 ) ← µ ← x + c (x2 − (x)2 )
x−c n−1 n−1 1 2 n−1 (x
By the law of large numbers, x ↔ EX = µ
The center of the interval is a typical estimator (for example, MLE). � error � estimate of variance � 2
θ ˆ =
x2
π2 n
for large n.
2
− (x) is a sample variance and it converges to the true variance, 80
by LLN θ ˆ 2 ↔ θ2 1 1 Eˆ θ 2 = E (x21 + ... + x2n ) − E( (x1 + ... + xn ))2 = n n = EX12 −
1 1 � EXi Xj = EX12 − 2 (nEX12 + n(n + 1)(EX1 )2 ) n2 i,j n
Note that for i = ∈ j, EXi Xj = EXi EXj = (EX1 )2 = µ2 , n(n - 1) with different indices. Eˆ θ2 = =
n−1 n−1 EX12 − (EX1 )2 = n n
n−1 n−1 2 n−1 (EX12 − (EX1 )2 ) = Var(X1 ) = θ n n n
Therefore: n−1 2 θ < θ2 n Good estimator, but more often than not, less than actual. So, to compensate for the lower error: Eˆ θ2 =
E Consider (θ ∅ )2 =
n ˆ2 , n−1 θ
n θ ˆ 2 = θ2 n−1
unbiased sample variance. � � � 1 1 1 ∅ 2 ±c (x2 − (x)2 ) = ±c θ ˆ 2 = ±c (θ ) n−1 n−1 n � � (θ ∅ )2 (θ ∅ )2 ←µ←x+c x−c n n
§7.5, pg. 140. Example: Lactic Acid in Cheese
Data 0.86, 1.53, 1.57, ..., 1.58; n = 10, assumed ~ N(µ, σ²); x̄ = 1.379, σ̂² = (1/n)Σxi² − (x̄)² = 0.0966
Estimate µ with confidence α = 95%; use a t-distribution with n − 1 = 9 degrees of freedom.
From the table, t9(−∞, c) = 0.975 gives c = 2.262:
x̄ − 2.262 √(σ̂²/9) ≤ µ ≤ x̄ + 2.262 √(σ̂²/9)
1.145 ≤ µ ≤ 1.613
A fairly wide interval, due to the high guarantee and the small number of samples. If we change α to 90%, then c = 1.833 and the interval is 1.189 ≤ µ ≤ 1.569.
Much better sized interval.
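Not in the original notes: the same intervals computed from the summary statistics with SciPy (assuming it is available); the full data set is in the textbook and is not reproduced here.

```python
from math import sqrt

from scipy import stats

n, xbar, sighat2 = 10, 1.379, 0.0966               # summary statistics quoted in the notes

for conf in (0.95, 0.90):
    c = stats.t.ppf(0.5 + conf / 2, df=n - 1)      # 2.262 and 1.833
    half = c * sqrt(sighat2 / (n - 1))
    print(conf, round(xbar - half, 3), round(xbar + half, 3))
```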
Confidence interval for variance:
c1 ←
nˆ θ2 ← c2 θ2
where the c values come from the α2 distribution
Not symmetric, all positive points given for α2 distribution. c1 = 2.7, c2 = 19.02 ↔ 0.0508 ← θ 2 ← 0.3579 again, wide interval as result of small n and high confidence. Sketch of Fisher’s theorem. z∩1 , ..., zn ≈ N (0, 1) nz = ≥1n (z1 + ... + zn ) ≈ N (0, 1) n(z 2 − (z)2 ) = n(
� 1� 2 1� 2 1 zi − ( zi ) ) = zi2 − ( ∩ (z1 + ... + zn ))2 ≈ α2n−1 n n n
P 2 2 1 1 f (z1 , ..., zn ) = ( ∩ )n e−1/2 zi = ( ∩ )n e−1/2r 2ψ 2ψ
P 2 2 1 1 f (y1 , ..., yn ) = ( ∩ )n e−1/2r = ( ∩ )n e−1/2 yi 2ψ 2ψ
The graph is symmetric with respect to rotation, so rotating the coordinates gives again i.i.d. standard normal sequence. i
2 1 ∩ e−1/2yi ↔ y1 , ..., yn − i.i.d.N (0, 1) 2ψ
Choose coordinate system such that: 1 1 1 y1 = ∩ (z1 + ... + zn ), i.e. ρv1 = ( ∩ , . . . , ∩ ) - new first axis. n n n Choose all other vectors however you want to make a new orthogonal basis:
82
y12 + ... + yn2 = z12 + .. + zn2 since the length does not change after rotation! ∩ n(z 2 − (z)2 ) = ** End of Lecture 27
nz = y1 ≈ N (0, 1)
�
yi2 − y12 = y22 + ... + yn2 ≈ α2n−1
83
18.05 Lecture 28 April 20, 2005
Review for Exam 2 pg. 280, Problem 5
µ = 300, θ = 10; X1 , X2 , X3 ≈ N (300, 100 = θ 2 )
P(X1 > 290 ⇒ X2 > 290 ⇒ X3 > 290) = 1 − P(X1 ← 290)P(X2 ← 290)P(X3 ← 290)
x2 − 300 x3 − 300 x1 − 300 ← −1)P( ← −1)P( ← −1) 10 10 10 Table for x = 1 gives 0.8413, x = -1 is therefore 1 - 0.8413 = 0.1587 = 1 − (0.1587)3 = 0.996 = 1 − P(
pg. 291, Problem 11
600 seniors, a third bring both parents, a third bring 1 parent, a third bring no parents.
Find P(< 650 parents)
Xi − 0, 1, 2 ↔ parents for the ith student.
P(Xi = 2) = P(Xi = 1) = P(Xi = 0) = 31
P(X1 + ... + X600 < 650) - use central limit theorem. µ = 0(1/3) + 1(1/3) + 2(1/3) = 1 EX 2 = 02 (1/3) + 12 (1/3) + 22 (1/3) = 53 Var(X) = EX 2 − (EX)2 = 35 − 1 = 23
θ = 2/3 P ∩ ∩ 650 − 1) 600 ( 600 600( 600xi − 1)
P( < ) 2/3 2/3 ∩ n(x − µ) < 2.5) � N (0, 1), P(Z ← 2.5) = δ(2.5) = 0.9938 P( θ pg. 354, Problem 10 Time to serve X ≈ E(χ), n = 20, X1 , ..., X20 , x = 3.8 min Prior distribution of χ is a Gamma dist. with mean 0.2 and std. dev. 1 ∂/λ = 0.2, ∂/λ 2 = 1 ↔ λ = 0.2, ∂ = 0.04 Get the posterior distribution: P f (x|χ) = χe−βx , f (x1 , ..., xn |χ) = χn e−β xi P ξ � ∂−1 −ξβ f (χ) = �(∂) χ e , f (χ|x1 , ..., xn ) ≈ χ(∂+n)−1 e−(ξ+ xi )β � Posterior is �(∂ + n, λ + xi ) = �(0.04 + 20, 0.2 + 3.8(20)) Bayes estimator = mean of posterior distribution = =
20.04 3.8(20) + 0.2
Problem 4 f (x|χ) = {eβ−x, x ∼ 0; 0, x < 0} Find the MLE of χ Likelihood δ(χ) = f (x1 |χ) × ... × f (xn |χ) P = eβ−x1 ...eβ−xn I(x1 ∼ χ, ..., xn ∼ χ) = enβ− xi I(min(x1 , ..., xn ) ∼ χ) 84
Maximize over χ.
Note that the graph increases in χ, but χ must be less than the min value.
If greater, the value drops to zero.
Therefore:
χˆ = min(x1 , ..., xn ) Also, by observing the original distribution, the maximum probability is at the smallest Xi .
p. 415, Problem 7:
To get the confidence interval, compute the average and sample variances:
Confidence interval for µ:
� � 1 1 2 2 (x − (x) ) ← µ ← x − c (x2 − (x)2 ) x−c n−1 n−1
To find c, use the t distribution with n - 1 degrees of freedom:
tn−1 = t19 (−→, c) = 0.95, c = 1.729 Confidence interval for θ 2 : ∩ n(x − µ) n(x2 − (x)2 ) ≈ N (0, 1), ≈ α2n−1 θ θ2
85
tn−1
∩ N (0, 1) n(x − µ)/θ ≈� ≈ tn−1 =� 1 2 1 n(x2 −(x)2 ) 2 n−1 αn−1 n−1
Use the table for
α2n−1
c1 ←
π
n(x2 − (x)2 ) ← c2 θ2
From the Practice Problems: (see solutions for more detail) p. 196, Number 9 P(X1 = def ective) = p Find E(X − Y ) Xi = {1, def ective; −1, notdef ective}; X − Y = X1 + ... + Xn E(X − Y ) = EX1 + ... + EXn = nEX1 = n(1 × p − 1(1 − P )) = n(2p − 1) p. 396, Number 10 X1 , ..., X6 ≈ N (0, 1) 2 c((X + X5 + X6 )2 ) ≈ α2n ∩ 1 + X2 + X3 ) +2 (X4∩ ( c(X1 + X2 + X3 )) + ( c(X4 + X5 + X6 ))2 ≈ α22 But of N(0, 1) ∩ each needs a distribution ∩ E c(X c(EX + X + X ) = 1 2 3 1 + EX2 + EX3 ) = 0 ∩ Var( c(X1 + X2 + X3 )) = c(Var(x1 ) + Var(X2 ) + Var(X3 )) = 3c In order to have the standard normal distribution, variance must equal 1. 3c = 1, c = 1/3 ** End of Lecture 28
86
18.05 Lecture 29 April 25, 2005
Score distribution for Test 2: 70-100 A, 40-70 B, 20-40 C, 10-20 D Average = 45 Hypotheses Testing. X1 , ..., Xn with unknown distribution P Hypothesis possibilities: H1 : P = P 1 H2 : P = P 2 ... Hk : P = P k There are k simple hypotheses. A simple hypothesis states that the distribution is equal to a particular probability distribution. Consider two normal distributions: N(0, 1), and N(1, 1).
There is only 1 point of data: X1
Depending on where the point is, it is more likely to come from either N(0, 1) or N(1, 1).
Hypothesis testing is similar to maximum likelihood testing ↔
Within your k choices, pick the most likely distribution given the data.
However, hypothesis testing is NOT like estimation theory, as there is a different goal:
Definition: Error of type i
P(make a mistake |Hi is true) = ∂i
Decision Rule: β : X n ↔ (H1 , H2 , ..., Hk )
Given a sample (X1 , ..., Xn ), β(X1 , ..., Xn ) ⊂ {H1 , ..., Hk }
∂i = P(β = ∈ Hi |Hi ) - error of type i
“The decision rule picks the wrong hypothesis” = error. Example: Medical test, H1 - positive, H2 - negative.
Error of Type 1: ∂1 = P(β ∈= H1 |H1 ) = P(negative|positive)
Error of Type 2: ∂2 = P(β ∈= H2 |H2 ) = P(positive|negative) These are very different errors, have different severity based on the particular situation. Example: Missile Detection vs. Airplane Type 1 ↔ P(airplane|missile), Type 2 ↔ P(missile|airplane) Very different consequences based on the error made. Bayes Decision Rules Choose a prior distribution on the hypothesis. 87
Assign a weight to each hypothesis, based upon the importance of the different errors:
ξ(1), ..., ξ(k) ≥ 0, Σ ξ(i) = 1
Bayes error: α(ξ) = ξ(1)α1 + ξ(2)α2 + ... + ξ(k)αk
Minimize the Bayes error, choose the appropriate decision rule.
Simple solution to finding the decision rule:
X = (X1 , ..., Xn ), let fi (x) be a p.f. or p.d.f. of Pi
fi (x) = fi (x1 ) × ... × fi (xn ) - t p.f./p.d.f.
Theorem: Bayes Decision Rule:
β = {Hi : ξ(i)fi(x) = max_{1≤j≤k} ξ(j)fj(x)} - similar to maximum likelihood.
Find the largest of t densities, but weighted in this case.
α(ξ) = Σ_i ξ(i)Pi(β ≠ Hi) = Σ_i ξ(i)(1 − Pi(β = Hi)) =
= 1 − Σ_i ξ(i)Pi(β = Hi) = 1 − Σ_i ξ(i) ∫ I(β(x) = Hi) fi(x) dx =
= 1 − ∫ ( Σ_i ξ(i) I(β(x) = Hi) fi(x) ) dx - to minimize this, maximize the integral.
The function within the integral is
I(β = H1)ξ(1)f1(x) + ... + I(β = Hk)ξ(k)fk(x); the indicators pick out a single term, e.g.
β = H1: 1·ξ(1)f1(x) + 0 + 0 + ... + 0
So, just choose the largest term to maximize the integral.
Let β pick the largest term in the sum.
Most of the time, we will consider 2 simple hypotheses:
β = {H1 : ξ(1)f1(x) > ξ(2)f2(x), i.e. f1(x)/f2(x) > ξ(2)/ξ(1); H2 if <; H1 or H2 if =}
Example:
H1 : N (0, 1), H2 : N (1, 1)
�(1)f1 (x) + �(2)f2 (x) ↔ minimize
f1(x) = (1/√(2π))^n e^{−(1/2)Σxi²}; f2(x) = (1/√(2π))^n e^{−(1/2)Σ(xi−1)²}
f1(x)/f2(x) = e^{−(1/2)Σxi² + (1/2)Σ(xi−1)²} = e^{n/2 − Σxi} > ξ(2)/ξ(1)
β = {H1 : Σxi < n/2 − log(ξ(2)/ξ(1)); H2 if >; H1 or H2 if =}
Considering the earlier example, N(0, 1) and N(1, 1)
X1, n = 1, ξ(1) = ξ(2) = 1/2:
β = {H1 : x1 < 1/2; H2 : x1 > 1/2; H1 or H2 if =}
However, if one distribution were more important, it would be weighted.
If N(0, 1) were more important, you would choose it more of the time, even on some occasions when x1 > 1/2.
Definition: for H1, H2 - two simple hypotheses:
α1(β) = P(β ≠ H1 | H1) - the level of significance.
1 − α2(β) = P(β = H2 | H2) - the power.
For more than 2 hypotheses,
∂1 (β) is always the level of significance, because H1 is always the
Most Important hypothesis.
λ(β) becomes a power function, with respect to each extra hypothesis.
Definition: H0 - null hypothesis
Example, when a drug company evaluates a new drug,
the null hypothesis is that it doesn’t work.
H0 is what you want to disprove first and foremost,
you don’t want to make that error!
Next time: consider class of decision rules.
K∂ = {β : ∂1 (β) ← ∂}, ∂ ⊂ [0, 1]
Minimize ∂2 (β) within the class K∂
** End of Lecture 29
89
18.05 Lecture 30 April 27, 2005
Bayes Decision Rule �(1)∂1 (β) + �(2)∂2 (β) ↔ minimize. β = {H1 :
f1 (x) �(2) > ; H2 : if <; H1 or H2 : if =} f2 (x) �(1)
Example: see pg. 469, Problem 3 H0 : f1 (x) = 1 for 0 ← x ← 1 H1 : f2 (x) = 2x for 0 ← x ← 1 Sample 1 point x1 Minimize 3∂0 (β) + 1∂1 (β) β = {H0 :
1 1 1 1 < ; either if equal} > ; H1 : 2x1 3 2x1 3
Simplify the expression: 3 3 ; H1 : x1 > } 2 2 Since x1 is always between 0 and 1, H0 is always chosen. β = H0 always. β = {H0 : x1 ←
Errors:
∈ H0 ) = 0
∂0 (β) = P0 (β = ∂1 (β) = P1 (β = ∈ H1 ) = 1 We made the ∂0 very important in the weighting, so it ended up being 0. Most powerful test for two simple hypotheses. Consider a class K∂ = {β such that ∂1 (β) ← ∂ ⊂ [0, 1]} Take the following decision rule: β = {H1 :
f1 (x) f1 (x) < c} ∼ c; H2 : f2 (x) f2 (x)
Calculate the constant from the confidence level ∂: ∂1 (β) = P1 (β ∈= H1 ) = P1 (
f1 (x) < c) = ∂ f2 (x)
Sometimes it is difficult to find c, if discrete, but consider the simplest continuous case first: Find �(1), �(2) such that �(1) + �(2) = 1, �(2) �(1) = c
Then, β is a Bayes decision rule.
�(1)∂1 (β) + �(2)∂2 (β) ← �(1)∂1 (β ∅ ) + �(2)∂2 (β ∅ )
for any decision rule β ∅
If β ∅ ⊂ K∂ then ∂1 (β ∅ ) ← ∂. Note: ∂1 (β) = ∂, so: �(1)∂ + �(2)∂2 (β) ← �(1)∂ + �(2)∂2 (β ∅ )
Therefore: ∂2 (β) ← ∂2 (β ∅ ), β is the best (mosst powerful) decision rule in K∂
Example:
H1 : N (0, 1), H2 : N (1, 1), ∂1 (β) = 0.05
90
f1(x)/f2(x) = e^{−(1/2)Σxi² + (1/2)Σ(xi−1)²} = e^{n/2 − Σxi} ≥ c
Always simplify first: n/2 − Σxi ≥ log(c), −Σxi ≥ −n/2 + log(c), Σxi ≤ c'
The decision rule becomes: β = {H1 : Σxi ≤ c'; H2 : Σxi > c'}
Now, find c': α1(β) = P1(Σxi > c')
recall, subscript on P indicates that x1 , ..., xn ≈ N (0, 1)
Make into standard normal:
P1( Σxi/√n > c'/√n ) = 0.05
Check the table for P(z > c'') = 0.05: c'' = 1.64, so c' = 1.64√n
Note a very common source of errors with the central limit theorem: standardizing the sum, (Σxi − nµ)/(√n σ), and standardizing the average, √n((1/n)Σxi − µ)/σ, are the same conversion - don't mix pieces of the two.
The decision rule now becomes: β = {H1 : Σxi ≤ 1.64√n; H2 : Σxi > 1.64√n}
Error of Type 2:
α2(β) = P2(Σxi ≤ c' = 1.64√n)
Note: the subscript indicates that X1, ..., Xn ~ N(1, 1).
= P2( (Σxi − n·1)/√n ≤ (1.64√n − n)/√n ) = P2(z ≤ 1.64 − √n)
Use tables for standard normal to get the probability.
If n = 9 → P2(z ≤ 1.64 − √9) = P2(z ≤ −1.36) ≈ 0.087
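Not in the original notes: the same type 2 error computed with SciPy (assuming it is available), for n = 9:

```python
import numpy as np
from scipy import stats

n = 9
c = 1.64 * np.sqrt(n)                                 # threshold on sum(x_i) giving alpha_1 = 0.05

alpha2 = stats.norm.cdf((c - n * 1.0) / np.sqrt(n))   # P_2(sum x_i <= c) when X_i ~ N(1, 1)
print(alpha2)                                         # about 0.087
```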
Example:
H1 : N (0, 2), H2 : N (0, 3), ∂1 (β) = 0.05
( 2≥12λ )n e−
P
1
x2i
2(2) P 2 1 3 n/2 − 12 f1 (x) xi P 1 2 =( ) ∼c = e x − 1 n f2 (x) 2 i 2(3) ≥ ( 3 2λ ) e � � β = {H1 : x2i ← c∅ ; H2 : xi2 > c∅ }
This is intuitive, as the sum of squares ≈ sample variance. If small ↔ θ = 2 If large ↔ θ = 3
91
� x 2i � c� 2 ∅∅ ∂1 (β) = P1 ( x 2i > c∅ ) = P1 ( 2 > 2 ) = P1 (αn > c ) = 0.05 2
∅∅ ∅∅ ∅ If n = 10, P1 (α10 > c ) = 0.05; c = 18.31, c = 36.62 Can find error of type 2 in the same way as earlier: � P(α2n > c3 ) ↔ P(α2
10 > 12.1) � 0.7 A difference of 1 in variance is a huge deal! Large type 2 error results, small n. ** End of Lecture 30
92
18.05 Lecture 31 April 29, 2005
t-test X1 , ..., Xn - a random sample from N (µ, θ 2 ) 2-sided Hypothesis Test: H1 : µ = µ 0 H2 : µ = ∈ µ0 2 sided hypothesis - parameter can be greater or less than µ0 Take ∂ ⊂ (0, 1) - level of significance (error of type 1) Construct a confidence interval ↔ confidence = 1 - ∂ If µ0 falls in the interval, choose H1 , otherwise choose H2 How to construct the confidence interval in of the decision rule: T =�
x − µ0
1 2 n−1 (x
− (x)2 )
≈ t distribution with n - 1 degrees of freedom.
Under the hypothesis H1 , T is has a t-distribution.
See if the T value falls in the expected area of the t-distribution:
Accept the null hypothesis (H1 ), if −c ← T ← c, Reject if otherwise.
Choose c such that area between c and -c is 1 − ∂, each tail area = ∂/2 Error of type 1: ∂1 = P1 (T < −c, T > c) = ∂2 + ∂2 = ∂ Definition: p-value
93
p-value = probability of values less likely than T
If p-value ∼ ∂, accept the null hypothesis.
If p-value < ∂, reject the null hypothesis.
Example: p-value = 0.0001, very unlikely that this T value would occur
if the mean were µ0 . Reject the null hypothesis!
1-sided Hypothesis Test: H1 : µ ← µ 0 H2 : µ > µ 0 x − µ0
T =�
1 2 n−1 (x
=�
1 2 n−1 (x
− (x)2 )
See how the distribution behaves for three cases: 1) If µ = µ0 , T ≈ tn−1 .
2) If µ < µ0 : T =�
x − µ0
1 2 n−1 (x
−
(x)2 )
µ − µ0 (x)2 )
+�
− ∩ (µ − µ0 ) n − 1 T � tn−1 + ↔ −→ θ
3) If µ > µ0 , similarly T ↔ +→ Decision Rule: β = {H1 : T ← �; H∞ : T > �}
94
x−µ
1 2 n−1 (x
− (x)2 )
∂1 = P1 (T > c) = ∂
p-value: Still the probability of values less likely than T ,
but since it is 1-sided,
you don’t need to consider the area to the left of −T as you would in the 2-sided case.
The p-value is the area of everything to the right of T.
Example (8.5.1, 8.5.4): µ0 = 5.2, n = 15, x̄ = 5.4, σ' = 0.4226
H1: µ = 5.2, H2: µ ≠ 5.2
T is calculated to be 1.833, which leads to a two-sided p-value of 0.0882.
Since α = 0.05 and the p-value is over 0.05, accept H1: µ = 5.2.
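Not in the original notes: the same T statistic and p-value from the quoted summary statistics, assuming SciPy is available:

```python
from math import sqrt

from scipy import stats

mu0, n, xbar, sprime = 5.2, 15, 5.4, 0.4226    # summary statistics from the example
T = (xbar - mu0) / (sprime / sqrt(n))

p_value = 2 * stats.t.sf(abs(T), df=n - 1)     # two-sided p-value
print(T, p_value)                              # about 1.83 and 0.088
```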
Decision rule:
Such that ∂ = 0.05, the areas of each tail in the 2-sided case = 2.5%
95
From the table ↔ c = 2.145 β = {H1 : −2.145 ← T ← 2.145; H2 otherwise} Consider 2 samples, want to compare their means: X1 , ..., Xn ≈ N (µ1 , θ12 ) and Y1 , ..., Ym ≈ N (µ2 , θ22 ) Paired t-test: Example (textbook): Crash test dummies, driver and enger seats ≈ (X, Y)
See if there is a difference in severity of head injuries depending on the seat:
(X1 , Y1 ), ..., (Xn , Yn )
Observe the paired observations (each car) and calculate the difference:
Hypothesis Test:
H1 : µ1 = µ2
H2 : µ1 ∈= µ2
Consider Z1 = X1 − Y1 , ..., Zn = Xn − Yn ≈ N (µ1 − µ2 = µ, θ 2 )
H1 : µ = 0; H2 : µ ∈= 0
Just a regular t-test:
p-values comes out as < 10−6 , so they are likely to be different.
** End of Lecture 31
96
18.05 Lecture 32 May 2, 2005
Two-sample t-test X1 , ..., Xm ≈ N (µ1 , θ 2 )
Y1 , ..., Yn ≈ N (µ2 , θ 2 )
Samples are independent.
Compare the means of the distributions.
Hypothesis Tests:
H1 : µ1 = µ2 , µ1 ← µ2
H2 : µ1 ∈= µ2 , µ1 > µ2
By properties of Normal distribution and Fisher’s theorem: ∩ ∩ m(x − µ1 ) n(y − µ2 ) , ≈ N (0, 1) θ θ θx2 = x2 − (x)2 , θy2 = y 2 − (y)2 nθy2 mθx2 2 ≈ α ≈ α2n−1 , m −1 θ2 θ2 T =�
x−µ
1 2 n−1 (x
− (x)2 )
≈ tn−1
Calculate x − y x − µ1 1 1 1 y − µ2 ≈ ∩ N (0, 1) = N (0, ), ≈ N (0, ) θ m θ n m x − µ1 y − µ2 (x − y) − (µ1 − µ2 ) 1 1 = ≈ N (0, + ) − θ θ θ m n (x − y) − (µ1 − µ2 ) � ≈ N (0, 1) 1 θ m + n1 nθy2 mθx2 + ≈ α2m+n−2 θ2 θ2 97
Construct the t-statistic: �
N (0, 1) 1 2 m+n−2 (αm+n−2 )
≈ tm+n−2
(x − y) − (µ1 − µ2 ) (x − y) − (µ1 − µ2 ) � T = � ≈ tm+n−2 =� 2 +nπ 2 mπ 1 1 ( m + n1 ) m+1n−2 (mθx2 + nθy2 ) θ m + n1 m+1n−2 ( xπ2 y )
Construct the test:
H1 : µ1 = µ2 , H2 : µ1 ∈= µ2
If H1 is true, then:
T =�
x−y
1 1 + n1 ) m+n−2 (mθx2 + nθy2 ) (m
≈ tm+n−2
Decision Rule: β = {H1 : −c ← T ← c, H2 : otherwise} where the c values come from the t distribution with m + n - 2 degrees of freedom.
c = T value where the area is equal to ∂/2, as the failure is both below -c and above +c
If the test were: H1 : µ1 ← µ2 , H2 : µ1 > µ2 ,
then the T value would correspond to an area in one tail, as the failure is only above +c.
There are different functions you can construct to approach the problem,
based on different combinations of the data.
This is why statistics is entirely based on your assumptions and the resulting
98
distribution function! Example: Testing soil types in different locations by amount of aluminum oxide present.
m = 14, x = 12.56 ≈ N (µ1 , θ 2 ); n = 5, y = 17.32 ≈ N (µ2 , θ 2 )
H1 : µ1 ← µ2 ; H2 : µ1 > µ2 ↔ T = −6.3 ≈ t14+5−2=17
c-value is 1.74, however this is a one-sided test. T is very negative, but we still accept H 1
If the hypotheses were: H1 : µ1 ∼ µ2 ; H2 : µ1 < µ2 ,
Then the T value of -6.3 is way to the left of the c-value of -1.74. Reject H1
Goodness-of-fit tests. Setup: Consider r different categories for the random variable. The � probability that a data point takes value Bi is pi pi = p1 + ... + pr = 1 Hypotheses: H1 : pi = p0i for all i = 1, ..., r; H2 : otherwise. Example: (9.1.1)
3 categories exist, regarding a family’s financial situation.
They are either worse, better, or the same this year as last year.
Data: Worse = 58, Same = 64, Better = 67 (n = 189)
Hypothesis: H1 : p1 = p2 = p3 = 31 , H2 : otherwise. Ni = number of observations in each category.
You would expect, under H1 , that N1 = np1 , N2 = np2 , N3 = np3
Measure using the central limit theorem:
N1 − np1
↔ N (0, 1) np1 (1 − p1 ) 99
However, keep in mind that the Ni values are not independent!! (they sum to 1) Ignore part of the scaling to for this (proof beyond scope):
N1 − np1 ↔ 1 − p1 N (0, 1) = N (0, 1 − p1 ) ∩ np1
Pearson's Theorem:
T = (N1 − np1)²/(np1) + ... + (Nr − npr)²/(npr) → χ²_{r−1}
If H1 is true, then T = Σ_{i=1}^{r} (Ni − n p_i^0)²/(n p_i^0) → χ²_{r−1}.
If H1 is not true, then T → +∞.
Proof sketch: if p1 ≠ p_1^0, then
(N1 − n p_1^0)/√(n p_1^0) = (N1 − np1)/√(n p_1^0) + n(p1 − p_1^0)/√(n p_1^0) → N(0, σ²) + (±∞)
However, this term is squared, so T → +∞.
Decision Rule:
However, is squared ↔ +→ Decision Rule:
β = {H1 : T ← c, H2 : T > c}
The example yields a T value of 0.666. Compared with the χ² distribution with r − 1 = 3 − 1 = 2 degrees of freedom, the critical value c is much larger, therefore accept H1.
The difference among the categories is not significant.
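Not in the original notes: the same test done numerically, assuming SciPy is available for the χ² critical value:

```python
from scipy import stats

observed = [58, 64, 67]                      # worse / same / better, n = 189
n = sum(observed)
expected = [n / 3] * 3                       # H1: p1 = p2 = p3 = 1/3

T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
c = stats.chi2.ppf(0.95, df=2)               # critical value at alpha = 0.05
print(T, c)                                  # about 0.67 vs 5.99, so accept H1
```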
** End of Lecture 32
100
18.05 Lecture 33 May 4, 2005
Simple goodness-of-fit test: H1 : pi = p0i , i ← r; H2 : otherwise. T =
r � (Ni − np0 )2 i
i=1
np0i
2 ≈ αr−1
Decision Rule: β = {H1 : T ← c; H2 : T > c} If the distribution is continuous or has infinitely many discrete points: Hypotheses: H1 : P = P0 ; H2 : P ∈= P0
Discretize the distribution into intervals, and count the points in each interval.
You know the probability of each interval by area, then, consider a finite number of intervals.
This discretizes the problem.
New Hypotheses: H1∅ : pi = P(X ⊂ Ii ) = P0 (X ⊂ Ii ); H2 otherwise.
If H1 is true ↔ H1∅ is also true.
Rule of Thumb: np0i = nP0 (X ⊂ Ii ) ∼ 5
If too small, too unlikely to find points in the interval,
does not approximate the chi-square distribution well.
Example 9.1.2 ↔ Data ≈ N (3.912, 0.25), n = 23
H1 : P ≈ N (3.912, 0.25)
Choose k intervals ↔ p0i = k1 n( k1 ) ∼ 5 ↔ 23 k ∼ 5, k = 4 101
≥ N (3.912, 0.25) ≈ X ↔ X−3.912 ≈ N (0, 1) 0.25 Dividing points: c1 , c2 = 3.912, c3 Find the normalized dividing points by the following relation:
ci − 3.912 = c∅i 0.5
The c∅i values are from the std. normal distribution. ↔ c∅1 = −0.68 ↔ c1 = −0.68(0.5) + 3.912 = 3.575 ↔ c∅2 = 0 ↔ c2 = 0(0.5) + 3.912 = 3.912
↔ c∅3 = 0.68 ↔ c3 = 0.68(0.5) + 3.912 = 4.249 Then, count the number of data points in each interval.
Data: N1 = 3, N2 = 4, N3 = 8, N4 = 8; n = 23
Calculate the T statistic:
T =
(3 − 23(0.25))2 (8 − 23(0.5))2 + ... + = 3.609 23(0.25 23(0.25)
Now, decide if T is too large. ∂ = 0.05 - significance level. α2r−1 ↔ α23 , c = 7.815
102
Decision Rule:
β = {H1 : T ← 7.815; H2 : T > 7.815}
T = 3.609 < 7.815, conclusion: accept H1
The distribution is relatively uniform among the intervals.
Composite Hypotheses: H1 : pi = pi (χ), i ← r for χ ⊂ Γ - parameter set.
H2 : not true for any choice of χ
Step 1: Find χ that best describes the data.
Find the MLE of χ
Likelihood Function: ξ(χ) = p1 (χ)N1 p2 (χ)N −2 × ... × pr (χ)Nr
Take the log of ξ(χ) ↔ maximize ↔ χ� Step 2: See if the best choice of χ� is good enough. H1 : pi = pi (χ�) for i ← r, H2 : otherwise. T =
r � (Ni − npi (χ�))2 i=1
npi (χ�)
≈ α2r−s−1
where s - dimension of the parameter set, number of free parameters. Example: N (µ, θ 2 ) ↔ s = 2
If there are a lot of free parameters, it makes the distribution set more flexible.
Need to subtract out this flexibility by lowering the degrees of freedom.
Decision Rule:
β = {H1 : T ← c; H2 : T > c}
Choose c from α2
r−s−1 with area = ∂
Example: (pg. 543) Gene has 2 possible alleles A1 , A2 Genotypes: A1 A1 , A1 A2 , A2 A2 Test that P(A1 ) = χ, P(A2 ) = 1 − χ,
103
but you only observe genotype. H1 : P(A1 A2 ) = 2χ(1 − χ) ♥ N2 P(A1 A1 ) = χ2 ♥ N1 P(A2 A2 ) = (1 − χ)2 − ♥ N3 r = 3 categories. s = 1 (only 1 parameter, χ) ξ(χ) = (χ2 )N1 (2χ(1 − χ))N2 ((1 − χ)2 )N3 = 2N2 χ2N1 +N2 (1 − χ)2N3 +N2 log ξ(χ) = N2 log 2 + (2N1 + N2 ) log χ + (2N3 + N2 ) log(1 − χ) � 2N3 + N2 2N1 + N2 = − =0 �χ χ 1−χ (2N1 + N2 )(1 − χ) − (2N3 + N2 )χ = 0 χ� =
compute χ� based on data. p0i = χ�2 , p02 = 2χ�(1 − χ�), p03 = (1 − χ�)2 T =
2N1 + N2 2N1 + N2 = 2N1 + 2N2 + 2N3 2n
� (Ni − np0 )2 i
np0i
For an ∂ = 0.05, c = 3.841 from the α21 distribution. Decision Rule: β = {H1 : T ← 3.841; H2 : T > 3.841} ** End of Lecture 33
104
2 ≈ αr−s−1 = α21
18.05 Lecture 34 May 6, 2005
Contingency tables, test of independence.
Feature 1 = 1 F1 = 2 F1 = 3 ... F1 = a col. total
Feature 2 = 1 N11 ... ... ... Na1 N+a
F2 = 2 F2 = 3 ... ...
...
... ...
...
... ...
...
... ...
...
... ...
...
... ...
...
...
F2 = b N1b ... ... ... Nab N+b
row total N1+ ... ... ... Na+ n
Xi1 ⊂ {1, ..., a} Xi2 ⊂ {1, ..., b} Random Sample: X1 = (X11 , X12 ), ..., Xn = (Xn1 , Xn2 ) Question: Are X 1 , X 2 independent? Example: When asked if your finances are better, worse, or the same as last year, see if the answer depends on income range: ← 20K 20K - 30K ∼ 30K
Worse 20 24 14
Same 15 27 22
Better 12 32 23
Check if the differences and subtle trend are significant or random. χij = P(i, j) = P(i) × P(j) if independent, for all cells ij Independence hypothesis can be written as: H1 : χij = pi qj where p1 + ... + pa = 1, q1 + ... + qb = 1 H2 : otherwise. r = number of categories = ab s = dimension of parameter set = a + b − 2 The MLE p�i , qj� needs to be found ↔ T =
� (Nij − np�i qj� )2 ≈ α2r−s−1=ab−(a+b−2)−1=(a−1)(b−1) �q� np i j i,j
Distribution has (a - 1)(b - 1) degrees of freedom. Likelihood: − − ξ(↔ p ,↔ q)=
(pi qj )Nij =
i,j
�
�
i
Note: Ni+ = j Nij and N+j = i Nij Maximize each factor to maximize the product. 105
Ni+
pi
×
j
N+j
qj
�
i
Ni+ log pi ↔ max, p1 + ... + pa = 1
Use Lagrange multipliers to solve the constrained maximization: � � N log p − �( p − 1) ↔ maxp min� i+ i i i i
� Ni+ Ni+ = − � = 0 ↔ pi = �pi pi �
�
pi =
i
n Ni+ = 1 ↔ � = n ↔ p�i = � n p�i =
T =
N+j Ni+ � , qj = n n
� (Nij − Ni+ N+j /n)2 ≈ α2(a−1)(b−1) N N /n i+ +j i,j
Decision Rule:
β = {H1 : T ← c; H2 : T > c}
Choose c from the chi-square distribution, (a - 1)(b - 1) d.o.f., at a level of significance ∂ = area.
From the above example:
N1+ = 47, N2+ = 83, N3+ = 59
N+1 = 58, N+2 = 64, N+3 = 67
n = 189
For each cell, the component of the T statistic adds as follows:
T = (20 − 58(47)/189)²/(58(47)/189) + ... = 5.210
Is T too large? T ~ χ²_{(3−1)(3−1)} = χ²_4
For this distribution, c = 9.488. According to the decision rule, accept H1, because 5.210 ≤ 9.488.
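Not in the original notes: the same independence test run with SciPy's contingency-table routine (assuming SciPy is available):

```python
import numpy as np
from scipy import stats

# Rows: income <= 20K, 20K-30K, >= 30K; columns: worse, same, better.
counts = np.array([[20, 15, 12],
                   [24, 27, 32],
                   [14, 22, 23]])

T, p, dof, expected = stats.chi2_contingency(counts, correction=False)
print(T, dof, p)                     # T is about 5.21 with 4 d.o.f.; p is well above 0.05
print(stats.chi2.ppf(0.95, dof))     # critical value 9.488, so accept independence
```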
Test of Homogeneity - very similar to the independence test.
           Category 1   ...   Category b
Group 1    N11          ...   N1b
...        ...          ...   ...
Group a    Na1          ...   Nab
1. Sample from entire population. 2. Sample from each group separately, independently between the groups. Question: P(category j | group i) = P(category j) This is the same as independence testing! P(category j, group i) = P(category j)P(group i) ↔ P(Cj |Gi ) =
P(Cj Gi ) P(Cj )P(Gi ) = = P(Cj ) P(Gi ) P(Gi )
Consider a situation where group 1 is 99% of the population, and group 2 is 1%.
You would be better off sampling separately and independently.
Say you sample 100 of each, just need to renormalize within the population.
The test now becomes a test of independence.
Example: pg. 560
100 people were asked if service by a fire station was satisfactory or not.
Then, after a fire occured, the people were asked again.
See if the opinion changed in the same people.
Before Fire After Fire
80 72 satisfied
20 28 unsatisfied
But, you can’t use this if you are asking the same people! Not independent! Better way to arrange: Originally Satisfied Originally Unsatisfied
70 2 After, Satisfied
10 18 After, Not Satisfied
If taken from the entire population, this is ok. Otherwise you are taking from a dependent population. ** End of Lecture 34
107
18.05 Lecture 35 May 9, 2005
Kolmogorov-Smirnov (KS) goodness-of-fit test Chi-square test is used with discrete distributions.
If continuous - split into intervals, treat as discrete.
This makes the hypothesis weaker, however, as the distribution isn’t characterized fully.
The KS test uses the entire distribution, and is therefore more consistent.
Hypothesis Test:
H1 : P = P0
H2 : P ∈= P0
P0 - continuous In this test, the c.d.f. is used. Reminder: c.d.f. F (x) = P(X ← x), goes from 0 to 1.
The c.d.f. describes the entire function. Approximate the c.d.f. from the data ↔ Empirical Distribution Function: n
1� #(points ← x) Fn (x) = I(X ← x) = n i=1 n by LLN, Fn (x) ↔ EI(X1 ← x) = P(X1 ← x) = F (x)
From the data, the composed c.d.f. jumps by 1/n at each point. It converges to the c.d.f. at large n. Find the largest difference (supremum) between the dist c.d.f. and the actual. −−↔ sup |Fn (x) − F (x)| − n−↔ →0 x
108
For a fixed x: ∩ n(Fn (x) − F (x)) =
�
(I(Xi ← x) − EI(X1 ← x)) ∩ n
By the central limit theorem: � � � N 0, Var(I(Xi ← x)) = p(1 − p) = F (x)(1 − F (x) You can tell exactly how close the values should be! Dn =
∩ n sup |Fn (x) − F (x)| x
a) Under H1 , Dn has some proper known distribution. b) Under H2 , Dn ↔ +→ If F (x) implies a certain c.d.f. which is β away from that predicted by H0 ↔
Fn (x) ↔ F (x), |Fn (x) ≥ − F0 (x)| > β/2 ∩ n|Fn (x) − F0 (x)| > 2nα ↔ +→ The distribution of Dn does not ∩ depend on F(x), this allows to construct the KS test. ∩ Dn = n supx |Fn (x) − F (x)| = n supy |Fn (F −1 (y)) − y | y = F (x), x = F −1 (y), y ⊂ [0, 1] n
Fn(F⁻¹(y)) = (1/n) Σ_{i=1..n} I(Xi ≤ F⁻¹(y)) = (1/n) Σ_{i=1..n} I(F(Xi) ≤ y) = (1/n) Σ_{i=1..n} I(Yi ≤ y)
The Yi = F(Xi) are distributed in a way that does not depend on F:
P(Yi ≤ y) = P(F(Xi) ≤ y) = P(Xi ≤ F⁻¹(y)) = F(F⁻¹(y)) = y
Xi ~ F(x) ⇒ F(Xi) ~ uniform on [0, 1], regardless of F.
Dn is tabulated for different values of n, since it does not depend on the distribution.
(find table on pg. 570)
For large n, Dn converges to another distribution, whose table you can use instead:
P(Dn ≤ t) → H(t) = 1 − 2 Σ_{i=1..∞} (−1)^{i−1} e^{−2 i² t²}
The function H is related to the Brownian motion of a particle suspended in liquid:
H(t) is the distribution of the largest deviation of the particle from its starting point.
Decision Rule:
δ = {H1 : Dn ≤ c; H2 : Dn > c}
Choose c such that the area to the right of c is equal to α.
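As an added sketch (not in the original notes), Dn can be computed by checking the empirical c.d.f. just before and just after each jump. Run on the ten data points of the worked example that follows, against the uniform[0,1] c.d.f., it reproduces Dn ≈ 0.82.

    import numpy as np

    def ks_statistic(sample, F0):
        # D_n = sqrt(n) * sup_x |F_n(x) - F0(x)|; the sup is attained just before
        # or just after one of the jumps of the empirical c.d.f.
        x = np.sort(np.asarray(sample, dtype=float))
        n = len(x)
        F0x = F0(x)
        after = np.arange(1, n + 1) / n    # F_n just after each data point
        before = np.arange(0, n) / n       # F_n just before each data point
        sup = max(np.abs(after - F0x).max(), np.abs(before - F0x).max())
        return np.sqrt(n) * sup

    data = [0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64]
    Dn = ks_statistic(data, lambda t: np.clip(t, 0.0, 1.0))   # uniform[0,1] c.d.f.
    print(Dn, Dn <= 1.35)   # 1.35 is the alpha = 0.05 cutoff quoted below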
Example:
Set of data points as follows:
n = 10,
0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64
H1 : P uniform on [0, 1]
Step 1: Arrange in increasing order.
0.23, 0.33, 0.42, 0.43, 0.52, 0.53, 0.58, 0.58, 0.64, 0.76
Step 2: Find the largest difference.
Compare the c.d.f. with data.
Note: the largest difference will occur just before or just after a jump, so only consider those end points.

x:      F(x):   Fn(x) before:   Fn(x) after:
0.23    0.23        0               0.1
0.33    0.33        0.1             0.2
0.42    0.42        0.2             0.3
...     ...         ...             ...

Calculate the differences |Fn(x) − F(x)|:

x:      |Fn before − F(x)|:   |Fn after − F(x)|:
0.23           0.23                 0.13
0.33           0.23                 0.13
0.42           0.22                 0.12
...            ...                  ...

The largest difference occurs near the end: |0.9 − 0.64| = 0.26
Dn = √10 (0.26) = 0.82
Decision Rule: δ = {H1 : 0.82 ≤ c; H2 : 0.82 > c}
c for α = 0.05 is 1.35. Conclusion - accept H1.
** End of Lecture 35
18.05 Lecture 36 May 11, 2005
Review of Test 2 (see solutions for more details)
Problem 1:
P(X = 2c) = 1/2, P(X = c/2) = 1/2
→ EX = 2c(1/2) + (c/2)(1/2) = (5/4)c
After n plays, E Xn = (5/4)^n c.
Problem 2: X1, ..., Xn, n = 1000, P(Xi = 1) = 1/2, P(Xi = 0) = 1/2
μ = EX1 = 1/2, Var(X1) = p(1 − p) = 1/4
Sn = X1 + ... + Xn; find k such that P(440 ≤ Sn ≤ k) = 0.5
(Sn − nEX1)/√(n Var(X1)) = (Sn − 1000(1/2))/√(1000(1/4)) = (Sn − 500)/√250
P((440 − 500)/√250 ≤ Z ≤ (k − 500)/√250) = 0.5
By the central limit theorem:
Φ((k − 500)/√250) − Φ((440 − 500)/√250) = Φ((k − 500)/√250) − Φ(−3.79) = Φ((k − 500)/√250) − 0.0001 = 0.5
Therefore:
Φ((k − 500)/√250) = 0.5001 → (k − 500)/√250 ≈ 0, k = 500
Problem 3:
f(x) = (θ e^θ / x^{θ+1}) I(x ≥ e); φ(θ) = θ^n e^{nθ} / (∏ xi)^{θ+1} → maximize
Easier to maximize the log-likelihood:
log φ(θ) = n log θ + nθ − (θ + 1) log ∏ xi
n/θ + n − log ∏ xi = 0 → θ̂ = n / (log ∏ xi − n)
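A small numeric sketch of this MLE (added here; the data values are placeholders, chosen ≥ e since that is the support of the density):

    import numpy as np

    # theta_hat = n / (sum(log x_i) - n), as derived above.
    x = np.array([3.1, 4.5, 2.9, 7.2, 3.8])   # placeholder sample, all >= e
    n = len(x)
    theta_hat = n / (np.log(x).sum() - n)
    print(theta_hat)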
Problem 5:
Confidence Intervals, keep in mind the formulas!
x̄ − c √((x²bar − x̄²)/(n − 1)) ≤ μ ≤ x̄ + c √((x²bar − x̄²)/(n − 1)), where x²bar = (1/n) Σ xi².
Find c from the t distribution with n − 1 degrees of freedom.
Set c up such that the area between −c and c is equal to 1 − α. In this example, c = 1.833.
n(x²bar − x̄²)/c₂ ≤ σ² ≤ n(x²bar − x̄²)/c₁
Find c₁, c₂ from the chi-square distribution with n − 1 degrees of freedom.
Set them up such that the area between c₁ and c₂ is equal to 1 − α.
In this example, c₁ = 3.325, c₂ = 16.92.
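A sketch of how both intervals could be computed numerically (added, not from the notes). It assumes SciPy's t and chi-square quantile functions and uses a placeholder sample of size n = 10, so the quantiles match the values quoted above (1.833, 3.325, 16.92).

    import numpy as np
    from scipy import stats

    x = np.array([3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.1, 2.7, 3.2, 3.0])  # placeholder data
    n = len(x)
    xbar, x2bar = x.mean(), (x ** 2).mean()
    alpha = 0.10   # 90% confidence

    # Interval for mu: xbar -/+ c * sqrt((x2bar - xbar^2) / (n - 1)), c from t_{n-1}.
    c = stats.t.ppf(1 - alpha / 2, df=n - 1)          # 1.833 for n = 10
    half = c * np.sqrt((x2bar - xbar ** 2) / (n - 1))
    print(xbar - half, xbar + half)

    # Interval for sigma^2: n(x2bar - xbar^2)/c2 <= sigma^2 <= n(x2bar - xbar^2)/c1.
    c1 = stats.chi2.ppf(alpha / 2, df=n - 1)          # 3.325
    c2 = stats.chi2.ppf(1 - alpha / 2, df=n - 1)      # 16.92
    s = n * (x2bar - xbar ** 2)
    print(s / c2, s / c1)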
Problem 4:
Prior Distribution:
f(θ) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}
Likelihood:
f(x1, ..., xn | θ) = θ^n e^{nθ} / (∏ xi)^{θ+1}
Posterior Distribution:
f(θ | x1, ..., xn) ∝ f(θ) f(x1, ..., xn | θ) ∝ θ^{α−1} e^{−βθ} · θ^n e^{nθ} / (∏ xi)^θ
= θ^{α+n−1} e^{−βθ + nθ − θ log ∏ xi} = θ^{(α+n)−1} e^{−(β − n + log ∏ xi) θ}
Posterior = Γ(α + n, β − n + log ∏ xi)
Bayes Estimator:
θ* = (α + n) / (β − n + log ∏ xi)
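Numerically, the posterior update is just arithmetic on the gamma parameters; the following sketch is added for illustration, with placeholder prior parameters and data.

    import numpy as np

    alpha_prior, beta_prior = 2.0, 1.0           # placeholder gamma prior parameters
    x = np.array([3.1, 4.5, 2.9, 7.2, 3.8])      # placeholder data, all >= e
    n = len(x)

    # Posterior is Gamma(alpha + n, beta - n + sum(log x_i)); Bayes estimate = posterior mean.
    alpha_post = alpha_prior + n
    beta_post = beta_prior - n + np.log(x).sum()
    print(alpha_post, beta_post, alpha_post / beta_post)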
Final Exam Format: cumulative, with emphasis on material after Test 2.
9-10 questions.
Practice Test posted Friday afternoon.
Review Session on Tuesday Night - 5pm, Bring Questions!
Optional PSet:
pg. 548, Problem 3:
Gene has 3 alleles, so there are 6 possible combinations.
p1 = θ1², p2 = θ2², p3 = (1 − θ1 − θ2)²
p4 = 2θ1θ2, p5 = 2θ1(1 − θ1 − θ2), p6 = 2θ2(1 − θ1 − θ2)
Number of categories: r = 6, s = 2.
2 Free Parameters.
T = Σ_{i=1..r} (Ni − n p̂i)² / (n p̂i) ~ χ² with r − s − 1 = 3 d.o.f.
φ(θ1, θ2) = θ1^{2N1} θ2^{2N2} (1 − θ1 − θ2)^{2N3} (2θ1θ2)^{N4} (2θ1(1 − θ1 − θ2))^{N5} (2θ2(1 − θ1 − θ2))^{N6}
= 2^{N4+N5+N6} θ1^{2N1+N4+N5} θ2^{2N2+N4+N6} (1 − θ1 − θ2)^{2N3+N5+N6}
Maximize the log likelihood over the parameters.
log φ = const. + (2N1 + N4 + N5) log θ1 + (2N2 + N4 + N6) log θ2 + (2N3 + N5 + N6) log(1 − θ1 − θ2)
Maximize over θ1, θ2 → log φ = const. + a log θ1 + b log θ2 + c log(1 − θ1 − θ2)
Solve for θ1, θ2:
∂/∂θ1 = a/θ1 − c/(1 − θ1 − θ2) = 0;  ∂/∂θ2 = b/θ2 − c/(1 − θ1 − θ2) = 0
a/θ1 = b/θ2 → aθ2 = bθ1
a − aθ1 − aθ2 − cθ1 = 0, a − aθ1 − bθ1 − cθ1 = 0 → θ1 = a/(a + b + c), θ2 = b/(a + b + c)
Write in terms of the givens:
θ̂1 = (2N1 + N4 + N5)/(2n) = 1/5, θ̂2 = (2N2 + N4 + N6)/(2n) = 1/2
where n = Σ Ni.
Decision Rule:
δ = {H1 : T ≤ c, H2 : T > c}
Find c from the chi-square distribution with r − s − 1 = 3 d.o.f. Area above c = α → c = 7.815.
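The whole procedure can be scripted as below (added sketch). The counts N1, ..., N6 are hypothetical placeholders, chosen so the MLEs come out to 1/5 and 1/2 as stated above; 7.815 is the chi-square(3) cutoff from the notes.

    import numpy as np

    N = np.array([5, 26, 10, 19, 11, 29], dtype=float)   # hypothetical counts N1..N6
    n = N.sum()

    # MLEs: theta1 = (2N1 + N4 + N5)/(2n), theta2 = (2N2 + N4 + N6)/(2n).
    t1 = (2 * N[0] + N[3] + N[4]) / (2 * n)
    t2 = (2 * N[1] + N[3] + N[5]) / (2 * n)
    t3 = 1 - t1 - t2

    p_hat = np.array([t1**2, t2**2, t3**2, 2*t1*t2, 2*t1*t3, 2*t2*t3])
    T = ((N - n * p_hat) ** 2 / (n * p_hat)).sum()
    print(t1, t2, T, T <= 7.815)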
Problem 5:
There are 4 blood types (O, A, B, AB)
There are 2 Rhesus factors (+, -)
Test for independence:
          +     -    total
O        82    13      95
A        89    27     116
B        54     7      61
AB       19     9      28
total   244    56     300

T = (82 − 244(95)/300)² / (244(95)/300) + ...
Find the T statistic for all 8 cells. T ~ χ² with (a − 1)(b − 1) = 3 d.o.f., and the test is the same as before.
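A short added sketch that carries out this computation for the table above (the notes leave the numeric value of T to the reader):

    import numpy as np

    # Rows: O, A, B, AB; columns: Rh +, Rh -.
    counts = np.array([[82, 13],
                       [89, 27],
                       [54, 7],
                       [19, 9]], dtype=float)
    n = counts.sum()
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
    T = ((counts - expected) ** 2 / expected).sum()
    print(T)   # compare with the chi-square(3) critical value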
** End of Lecture 36
18.05 Lecture 37 May 17, 2005
Final Exam Review - solutions to the practice final.
1. f(x|v) = u v^{−u} x^{u−1} e^{−(x/v)^u} for x ≥ 0; 0 otherwise. Find the MLE of v.
→ Maximize over v the likelihood function = joint p.d.f.:
φ(v) = u^n v^{−nu} (∏ xi)^{u−1} e^{−Σ (xi/v)^u}
log φ(v) = n log u − nu log v + (u − 1) log(∏ xi) − Σ (xi/v)^u
Maximize with respect to v:
∂/∂v (log φ(v)) = −nu/v + u Σ xi^u v^{−(u+1)} = 0
nu/v = u Σ xi^u / v^{u+1} → v^u = (1/n) Σ xi^u
v̂ = ((1/n) Σ xi^u)^{1/u} → MLE
2. X1, ..., Xn ~ U[0, θ], f(x|θ) = (1/θ) I(0 ≤ x ≤ θ)
Prior:
f(θ) = (192/θ⁴) I(θ ≥ 4)
Data: X1 = 5, X2 = 3, X3 = 8
Posterior: f(θ|x1, ..., xn) ∝ f(x1, ..., xn|θ) f(θ)
f(x1, ..., xn|θ) = (1/θⁿ) I(0 ≤ all x's ≤ θ) = (1/θⁿ) I(max(x1, ..., xn) ≤ θ)
f(θ|x1, ..., xn) ∝ (1/θ^{n+4}) I(θ ≥ 4) I(max(x1, ..., xn) ≤ θ) ∝ (1/θ^{n+4}) I(θ ≥ 8)
Find the constant so that it integrates to 1; here n = 3:
1 = ∫_8^∞ c θ^{−7} dθ = c θ^{−6}/(−6) |_8^∞ = c/(6 · 8⁶)
c = 6 × 8⁶
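A quick added check that this constant indeed normalizes the posterior (numeric integration via SciPy):

    from scipy.integrate import quad

    n = 3
    c = 6 * 8 ** 6
    # Posterior density on theta >= 8 is c * theta^(-(n + 4)) = c * theta^(-7).
    total, _ = quad(lambda theta: c * theta ** (-(n + 4)), 8, float("inf"))
    print(total)   # approximately 1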
3. Two observations (X1, X2) from f(x):
H1: f(x) = 1/2 for 0 ≤ x ≤ 2
H2: f(x) = {1/3 for 0 ≤ x ≤ 1; 2/3 for 1 < x ≤ 2}
H3: f(x) = {3/4 for 0 ≤ x ≤ 1; 1/4 for 1 < x ≤ 2}
δ minimizes α1(δ) + 2α2(δ) + 2α3(δ), where αi(δ) = P(δ ≠ Hi | Hi).
To minimize Σ ξ(i)αi(δ), the decision rule picks the hypothesis maximizing ξ(i)fi(x1, ..., xn) for each region, with weights ξ(1) = 1, ξ(2) = 2, ξ(3) = 2.
ξ(i)fi(x1)fi(x2)    both x1, x2 in [0,1]     one in [0,1], one in [1,2]    both in [1,2]
H1                  (1)(1/2)(1/2) = 1/4      (1)(1/2)(1/2) = 1/4           (1)(1/2)(1/2) = 1/4
H2                  (2)(1/3)(1/3) = 2/9      (2)(1/3)(2/3) = 4/9           (2)(2/3)(2/3) = 8/9
H3                  (2)(3/4)(3/4) = 9/8      (2)(3/4)(1/4) = 3/8           (2)(1/4)(1/4) = 1/8
Decision Rule:
δ = {H1: never pick; H2: both in [1,2], or one in [0,1] and one in [1,2]; H3: both in [0,1]}
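The table and decision rule above can be checked with a few lines of added code; the densities and the weights ξ(1) = 1, ξ(2) = 2, ξ(3) = 2 are taken from the problem, and the three sample points are arbitrary representatives of the three regions.

    # Piecewise densities on [0, 2] under the three hypotheses.
    def f1(x): return 0.5
    def f2(x): return 1.0/3.0 if x <= 1 else 2.0/3.0
    def f3(x): return 3.0/4.0 if x <= 1 else 1.0/4.0

    weights = {1: 1.0, 2: 2.0, 3: 2.0}
    dens = {1: f1, 2: f2, 3: f3}

    def decide(x1, x2):
        # Pick the hypothesis i maximizing xi(i) * f_i(x1) * f_i(x2).
        scores = {i: weights[i] * dens[i](x1) * dens[i](x2) for i in (1, 2, 3)}
        return max(scores, key=scores.get)

    print(decide(0.3, 0.7), decide(0.3, 1.5), decide(1.2, 1.8))   # -> 3 2 2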
If there are only two hypotheses: choose H1 if f1(x)/f2(x) > ξ(2)/ξ(1), otherwise choose H2.

4. f(x|μ) = (1/(x√(2π))) e^{−(ln x − μ)²/2} for x ≥ 0, and 0 for x < 0.
If X has this distribution, find the distribution of ln X. Let Y = ln X.
c.d.f. of Y: P(Y ≤ y) = P(ln X ≤ y) = P(X ≤ e^y) = ∫_0^{e^y} f(x) dx
However, you don't need to integrate. The p.d.f. of Y is
f_Y(y) = ∂/∂y P(Y ≤ y) = f(e^y) · e^y = (1/(e^y √(2π))) e^{−(ln e^y − μ)²/2} · e^y = (1/√(2π)) e^{−(y − μ)²/2} ~ N(μ, 1)

5. n = 10, H1: μ = −1, H2: μ = 1, α = 0.05 → α1(δ) = P1(f1/f2 < c)
δ = {H1: f1(x1, ..., xn)/f2(x1, ..., xn) ≥ c; H2 if less}
f1(x1, ..., xn) = (1/(∏ xi (√(2π))ⁿ)) e^{−(1/2) Σ (ln xi + 1)²}
f2(x1, ..., xn) = (1/(∏ xi (√(2π))ⁿ)) e^{−(1/2) Σ (ln xi − 1)²}
f1/f2 = e^{−2 Σ ln xi} ≥ c ⇔ Σ ln xi ≤ c'
δ = {H1: Σ ln xi ≤ c' = −4.81, H2: Σ ln xi > c' = −4.81}
0.05 = P1(Σ ln xi ≥ c') = P1(Σ N(−1, 1) ≥ c') = P1((Σ ln xi − nμ)/(√n σ) ≥ (c' − nμ)/(√n σ)) = P1(Z ≥ (c' − nμ)/(√n σ))
(c' − nμ)/(√n σ) = 1.64 → c' = −4.81
Power = 1 − type 2 error = 1 − P2(δ ≠ H2) = 1 − P2(Σ ln xi ≤ c') = 1 − P2(Σ N(1, 1) ≤ c')
= 1 − P2((Σ ln xi − n(1))/√n ≤ (−4.81 − 10)/√10) ≈ 1
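An added sketch of the cutoff and power computation, assuming SciPy's standard normal c.d.f. and quantile function:

    import numpy as np
    from scipy.stats import norm

    n, sigma = 10, 1.0
    mu1, mu2, alpha = -1.0, 1.0, 0.05

    # Reject H1 when sum(log x_i) > c'; under H1 the sum is N(n*mu1, n*sigma^2).
    c = n * mu1 + norm.ppf(1 - alpha) * np.sqrt(n) * sigma    # about -4.8
    # Power: P_2(sum(log x_i) > c'), with the sum N(n*mu2, n*sigma^2) under H2.
    power = 1 - norm.cdf((c - n * mu2) / (np.sqrt(n) * sigma))
    print(c, power)   # power is essentially 1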
6. H1: p1 = θ/2, p2 = θ/3, p3 = 1 − 5θ/6, θ ∈ [0, 1]
Step 1) Find the MLE θ̂.
Step 2) p̂1 = θ̂/2, p̂2 = θ̂/3, p̂3 = 1 − 5θ̂/6.
Step 3) Calculate the T statistic:
T = Σ_{i=1..r} (Ni − n p̂i)² / (n p̂i) ~ χ² with r − s − 1 = 3 − 1 − 1 = 1 d.o.f.
φ(θ) = (θ/2)^{N1} (θ/3)^{N2} (1 − 5θ/6)^{N3}
log φ(θ) = (N1 + N2) log θ + N3 log(1 − 5θ/6) − N1 log 2 − N2 log 3 → max over θ
∂/∂θ log φ = (N1 + N2)/θ + N3 (−5/6)/(1 − 5θ/6) = 0
N1 + N2 − (5/6)(N1 + N2)θ − (5/6)N3 θ = 0
Solve for θ → θ̂ = (6/5)(N1 + N2)/n = 23/25
Compute the statistic: T = 0.586.
δ = {H1: T ≤ 3.841, H2: T > 3.841} → accept H1.

7. n = 17, x̄ = 3.2, x²bar − x̄² = 0.09, sample from N(μ, σ²).
H1: μ ≤ 3, H2: μ > 3 at α = 0.05.
T = (x̄ − μ0)/√((x²bar − x̄²)/(n − 1)) = (3.2 − 3)/√((1/16)(0.09)) = 2.67 ~ t_{n−1}
Choose the decision rule from the t table with 17 − 1 = 16 degrees of freedom:
δ: {H1: T ≤ 1.746, H2: T > 1.746}
Since 2.67 > 1.746, H1 is rejected.
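An added sketch of this one-sided t-test, using SciPy for the critical value; 0.09 is the value of x²bar − x̄² used above.

    import numpy as np
    from scipy.stats import t as t_dist

    n, xbar, spread = 17, 3.2, 0.09      # spread = x2bar - xbar^2
    mu0, alpha = 3.0, 0.05

    T = (xbar - mu0) / np.sqrt(spread / (n - 1))
    c = t_dist.ppf(1 - alpha, df=n - 1)  # one-sided cutoff, about 1.746
    print(T, c, "reject H1" if T > c else "accept H1")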
8. Calculate the T statistic:
T = Σ_{i,j} (Nij − Ni+ N+j / n)² / (Ni+ N+j / n) = 12.1
T ~ χ² with (a − 1)(b − 1) = 3 × 2 = 6 d.o.f.; at α = 0.05, c = 12.59.
δ: {H1: T ≤ 12.59, H2: T > 12.59}
Accept H1. But note that if the significance level changes to α = 0.10, then c = 10.64 and we would reject H1.
9.
f(x) = (1/2) I(0 ≤ x ≤ 2)
F(x) = ∫_{−∞}^x f(t) dt = x/2 for 0 ≤ x ≤ 2
x:      F(x):   Fn(x) before:   Fn(x) after:   |Fn before − F|:   |Fn after − F|:
0.02    0.01        0               0.1             0.01               0.09
0.18    0.09        0.1             0.2             0.01               0.11
0.20    0.10        0.2             0.3             0.1                0.2
...     ...         ...             ...             ...                ...

Δn = |F(x) − Fn(x)|, max Δn = 0.295
c for α = 0.05 is 1.35
Dn = √10 (0.295) = 0.932872
δ = {H1: 0.932872 ≤ 1.35, H2: 0.932872 > 1.35} → accept H1.
** End of Lecture 37
*** End of 18.05 Spring 2005 Lecture Notes.
18.05. Practice test 1.
(1) Suppose that 10 cards, of which five are red and five are green, are placed at random in 10 envelopes, of which five are red and five are green. Determine the probability that exactly two envelopes will contain a card with a matching color.
(2) Suppose that a box contains one fair coin and one coin with a head on each side. Suppose that a coin is selected at random and that when it is tossed three times, a head is obtained three times. Determine the probability that the coin is the fair coin.
(3) Suppose that either of two instruments might be used for making a certain measurement. Instrument 1 yields a measurement whose p.d.f. is
f1(x) = 2x for 0 < x < 1, and 0 otherwise.
Instrument 2 yields a measurement whose p.d.f. is
f2(x) = 3x² for 0 < x < 1, and 0 otherwise.
Suppose that one of the two instruments is chosen at random and a measurement X is made with it. (a) Determine the marginal p.d.f. of X. (b) If X = 1/4, what is the probability that instrument 1 was used?
(4) Let Z be the rate at which customers are served in a queue. Assume that Z has p.d.f.
f(z) = 2e^{−2z} for z > 0, and 0 otherwise.
Find the p.d.f. of the average waiting time T = 1/Z.
(5) Suppose that X and Y are independent random variables with the following p.d.f.:
f(x) = e^{−x} for x > 0, and 0 otherwise.
Determine the joint p.d.f. of the following random variables:
U = X/(X + Y) and V = X + Y.
18.05. Practice test 2.
(1) page 280, No. 5
(2) page 291, No. 11
(3) page 354, No. 10
(4) Suppose that X1, ..., Xn form a random sample from a distribution with p.d.f.
f(x|θ) = e^{θ−x} for x ≥ θ, and 0 for x < θ.
Find the MLE of the unknown parameter θ.
(5) page 415, No. 7. (Also compute a 90% confidence interval for σ².)

Extra practice problems: page 196, No. 9; page 346, No. 19; page 396, No. 10; page 409, No. 3; page 415, No. 3.
Go over psets 5, 6, 7 and examples in class.
18.05. Test 1.
(1) Consider events A = {HHH at least once} and B = {TTT at least once}. We want to find the probability P(A ∩ B). The complement of A ∩ B will be Ac ∪ Bc, i.e. no HHH or no TTT, and
P(A ∩ B) = 1 − P(Ac ∪ Bc).
To find the last one we can use the probability-of-a-union formula
P(Ac ∪ Bc) = P(Ac) + P(Bc) − P(Ac ∩ Bc).
Probability of Ac, i.e. no HHH, means that on each toss we don't get HHH. The probability not to get HHH on one toss is 7/8 and therefore,
P(Ac) = (7/8)^10.
The same for P(Bc). Probability of Ac ∩ Bc, i.e. no HHH and no TTT, means that on each toss we get neither HHH nor TTT. The probability not to get HHH or TTT on one toss is 6/8 and, therefore,
P(Ac ∩ Bc) = (6/8)^10.
Finally, we get
P(A ∩ B) = 1 − ((7/8)^10 + (7/8)^10 − (6/8)^10).
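Evaluating the final expression numerically (added check):

    # P(A and B) = 1 - (2*(7/8)**10 - (6/8)**10), which is about 0.53.
    print(1 - (2 * (7 / 8) ** 10 - (6 / 8) ** 10))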
(2) We have P(F) = P(M) = 0.5, P(CB|M) = 0.05 and P(CB|F) = 0.0025. Using Bayes' formula,
P(M|CB) = P(CB|M)P(M) / (P(CB|M)P(M) + P(CB|F)P(F)) = (0.05 × 0.5) / (0.05 × 0.5 + 0.0025 × 0.5)
(3) We want to find
f(y|x) = f(x, y) / f1(x)
which is defined only when f1(x) > 0. To find f1(x) we have to integrate out y, i.e.
f1(x) = ∫ f(x, y) dy.
To find the limits we notice that for a given x, 0 < y² < 1 − x², which is non-empty only if x² < 1, i.e. −1 < x < 1. Then −√(1 − x²) < y < √(1 − x²). So if −1 < x < 1 we get
f1(x) = ∫_{−√(1−x²)}^{√(1−x²)} c(x² + y²) dy = c(x²y + y³/3) |_{−√(1−x²)}^{√(1−x²)} = 2c(x²√(1 − x²) + (1/3)(1 − x²)^{3/2}).
Finally, for −1 < x < 1,
f(y|x) = c(x² + y²) / (2c(x²√(1 − x²) + (1/3)(1 − x²)^{3/2})) = (x² + y²) / (2x²√(1 − x²) + (2/3)(1 − x²)^{3/2})
if −√(1 − x²) < y < √(1 − x²), and 0 otherwise.
(4) Let us find the c.d.f. first.
P(Y ≤ y) = P(max(X1, X2) ≤ y) = P(X1 ≤ y, X2 ≤ y) = P(X1 ≤ y)P(X2 ≤ y).
The c.d.f. of X1 and X2 is
P(X1 ≤ y) = P(X2 ≤ y) = ∫_{−∞}^{y} f(x) dx.
If y ≤ 0, this is
P(X1 ≤ y) = ∫_{−∞}^{y} e^x dx = e^x |_{−∞}^{y} = e^y,
and if y > 0 this is
P(X1 ≤ y) = ∫_{−∞}^{0} e^x dx = e^x |_{−∞}^{0} = 1.
Finally, the c.d.f. of Y is
P(Y ≤ y) = e^{2y} for y ≤ 0, and 1 for y > 0.
Taking the derivative, the p.d.f. of Y is
f(y) = 2e^{2y} for y ≤ 0, and 0 for y > 0.
[Figure 1: Region {x ≤ zy} for z ≤ 1 and z > 1.]
(5) Let us find the c.d.f. of Z = X/Y first. Note that for X, Y ∈ (0, 1), Z can take only values > 0, so let z > 0. Then
P(Z ≤ z) = P(X/Y ≤ z) = P(X ≤ zY) = ∫∫_{x ≤ zy} f(x, y) dx dy.
To find the limits, we have to consider the intersection of the set {x ≤ zy} with the square 0 < x < 1, 0 < y < 1. When z ≤ 1, the limits are
∫_0^1 ∫_0^{zy} (x + y) dx dy = ∫_0^1 (x²/2 + xy) |_0^{zy} dy = ∫_0^1 (z²/2 + z) y² dy = z²/6 + z/3.
When z ≥ 1, the limits are different:
∫_0^1 ∫_{x/z}^1 (x + y) dy dx = ∫_0^1 (xy + y²/2) |_{x/z}^1 dx = 1 − 1/(6z²) − 1/(3z).
So the c.d.f. of Z is
P(Z ≤ z) = z²/6 + z/3 for 0 < z ≤ 1, and 1 − 1/(6z²) − 1/(3z) for z > 1.
The p.d.f. is
f(z) = z/3 + 1/3 for 0 < z ≤ 1, and 1/(3z³) + 1/(3z²) for z > 1,
and zero otherwise, i.e. for z ≤ 0.
18.05. Test 2.
(1) Let X be the player's fortune after one play. Then
P(X = 2c) = 1/2 and P(X = c/2) = 1/2,
and the expected value is
EX = 2c × 1/2 + c/2 × 1/2 = (5/4)c.
Repeating this n times, we get the expected value after n plays, (5/4)^n c.
(2) Let Xi, i = 1, ..., n = 1000 be the indicators of getting heads. Then Sn = X1 + ... + Xn is the total number of heads. We want to find k such that P(440 ≤ Sn ≤ k) ≈ 0.5. Since μ = EXi = 0.5 and σ² = Var(Xi) = 0.25, by the central limit theorem
Z = (Sn − nμ)/(√n σ) = (Sn − 500)/√250
is approximately standard normal, i.e.
P(440 ≤ Sn ≤ k) = P((440 − 500)/√250 = −3.79 ≤ Z ≤ (k − 500)/√250) ≈ Φ((k − 500)/√250) − Φ(−3.79) = 0.5.
From the table we find that Φ(−3.79) = 0.0001 and therefore
Φ((k − 500)/√250) = 0.4999 + 0.0001 = 0.5001.
Using the table once again we get (k − 500)/√250 ≈ 0 and k ≈ 500.
(3) The likelihood function is
φ(θ) = θ^n e^{nθ} / (∏ Xi)^{θ+1}
and the log-likelihood is
log φ(θ) = n log θ + nθ − (θ + 1) log ∏ Xi.
We want to find the maximum of the log-likelihood, so taking the derivative we get
n/θ + n − log ∏ Xi = 0
and solving for θ, the MLE is
θ̂ = n / (log ∏ Xi − n).
(4) The prior distribution is
f(θ) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}
and the joint p.d.f. of X1, ..., Xn is
f(X1, ..., Xn | θ) = θ^n e^{nθ} / (∏ Xi)^{θ+1}.
Therefore, the posterior is proportional to (as usual, we keep track only of the terms that depend on θ)
f(θ | X1, ..., Xn) ∝ θ^{α−1} e^{−βθ} · θ^n e^{nθ} / (∏ Xi)^θ = θ^{α+n−1} e^{−βθ + nθ − θ log ∏ Xi} = θ^{(α+n)−1} e^{−(β − n + log ∏ Xi)θ}.
This shows that the posterior is again a gamma distribution with parameters
Γ(α + n, β − n + log ∏ Xi).
The Bayes estimate is the expectation of the posterior, which in this case is
θ̂ = (α + n) / (β − n + log ∏ Xi).
(5) The confidence interval for μ is given by
X̄ − c √((X²bar − X̄²)/(n − 1)) ≤ μ ≤ X̄ + c √((X²bar − X̄²)/(n − 1)), where X²bar = (1/n) Σ Xi²,
and the c that corresponds to 90% confidence is found from the condition
t_{10−1}(c) − t_{10−1}(−c) = 0.9,
or t_9(c) = 0.95, and c = 1.833. The confidence interval for σ² is
n(X²bar − X̄²)/c₂ ≤ σ² ≤ n(X²bar − X̄²)/c₁
where c₁, c₂ satisfy χ²_{10−1}(c₁) = 0.05 and χ²_{10−1}(c₂) = 0.95, and c₁ = 3.325, c₂ = 16.92.