Lacoste-Julien Simon
00:00:03
Yes, so last class was fairly abstract, with the so-called decision theory, and I finished by just mentioning different estimators
00:00:16
whose properties we might want to analyze. I mentioned the MLE, the MAP, and the method of moments. So today I will give you a fourth type of estimator, which relates to what we do in machine learning.
00:00:32
And talk about a few properties of the MLE, and then we'll start to go much more concrete
00:00:40
with classification and regression approaches. So we'll talk about linear regression, etc. So that's the plan.
Unknown Speaker
00:00:51
Let's see.
Lacoste-Julien Simon
00:00:53
So today we will
00:00:59
Finish.
00:01:01
Example of estimators
00:01:05
And we'll do
00:01:07
Linear regression
00:01:12
in a bit more gory detail than you might have seen in other classes.
00:01:16
So,
00:01:19
So,
00:01:21
Another example of estimator.
00:01:31
So this is the example that I didn't have time to do last time.
00:01:35
So this is basically doing empirical risk minimization. Suppose we are in the context
00:01:44
Of prediction.
00:01:46
in the machine learning sense of just learning a mapping from input to output. So the action space is the set of functions from X to Y.
00:01:58
This is just quick notation to mean that. So X is the input space.
00:02:07
And Y is the output space.
00:02:16
Then an example of estimator for that.
00:02:20
So this is a bit different from the other example I gave, which was estimating a parameter; here, the parameter represents the whole function.
00:02:29
So an example of an estimator which, from observations, tells us what prediction function we want is
00:02:41
empirical risk minimization.
00:02:45
Empirical
00:02:50
Risk
00:02:51
Minimization
Unknown Speaker
00:02:54
Maybe
Lacoste-Julien Simon
00:02:56
Minimization.
00:02:58
I put "risk" in quotes because this is the true risk. You remember
00:03:03
Like the
00:03:05
the true risk, which is not the same thing as the frequentist risk,
00:03:10
i.e., basically the generalization error.
00:03:18
And so this is often called ERM.
00:03:24
And
00:03:26
And the idea is, I will write down the risk. Oops.
00:03:32
The generalization error.
00:03:37
Wrong color.
00:03:40
So,
00:03:44
We have our generalization error, which depends on the true distribution p that we don't know, and our action, which is f in this case. So by definition it is the expectation over a random possible test example
00:04:01
from p of the prediction loss: we have the possible ground truth Y, and our prediction f(X). So that's just the generalization error.
00:04:12
And the idea in ERM is you replace this
00:04:17
intractable, well, unknown expectation
00:04:22
With the empirical version. Right. So the empirical expectation
00:04:27
Of the loss.
00:04:29
which means that we're using the empirical distribution of the data. So it's basically a summation over the data set:
00:04:37
the average of the observed loss on the training examples.
00:04:44
And yeah, we could use
00:04:50
the losses, because now we have instantiated the problem: it will be the loss of y_i against f(x_i).
Unknown Speaker
00:04:57
am so excited.
Lacoste-Julien Simon
00:05:03
Somebody didn't mute themselves.
00:05:15
So now the ERM estimator, f-hat-ERM, is just minimizing
00:05:24
this training error
00:05:26
over some hypothesis class,
00:05:38
which would be the hypothesis class.
00:05:45
Okay, so a very standard approach in machine learning.
00:05:49
And you could even have regularized empirical risk minimization by adding a regularization term: instead of just minimizing the training error, you also add
00:05:58
some notion of a penalty on different functions, which could be the squared norm of the parameter, for example. We'll see this when we talk about ridge regression.
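As a concrete sketch (not from the lecture itself): the plain and regularized ERM estimators described above can be written out for a linear hypothesis class with squared loss, where the regularized objective even has a closed-form solution. The function name and data below are illustrative, assuming numpy.

```python
import numpy as np

def erm_ridge(X, y, lam=0.0):
    """Regularized ERM for linear predictors f(x) = w.x with squared loss.

    Minimizes (1/n) * sum_i (y_i - w.x_i)^2 + lam * ||w||^2,
    which has the closed-form "normal equations" solution below.
    With lam = 0 this is plain ERM (ordinary least squares).
    """
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_erm = erm_ridge(X, y, lam=0.0)    # plain ERM: close to w_true
w_reg = erm_ridge(X, y, lam=10.0)   # heavy regularization shrinks w toward 0
```

The regularization term biases the solution toward zero, exactly the bias-for-variance trade discussed later in the lecture.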
00:06:15
Okay.
00:06:17
Any questions about this estimator?
00:06:33
Okay. So Jacob is asking whether we would do the same kind of approximation for the frequentist risk.
00:06:44
So the thing is, to do an approximation of the frequentist risk, you would need to have multiple training sets.
00:06:51
Right, so if you want to estimate the frequentist risk, it means you want to estimate the performance of your learning algorithm over multiple training sets.
00:07:02
Which is fine, but it's different, because here what we want to do is define a learning algorithm, which means it takes some training set, and I want to use the training set to
00:07:13
get an estimate of what my prediction function should be. I only have one training set. So the idea is I actually use this training set to approximate the supposed test-set performance of my algorithm.
00:07:30
So that's, I think, kind of a philosophical difference between those two approaches. The other thing I can say is that, indeed, in probability and statistics you
00:07:39
will, in a lot of different places, take an empirical average to estimate some quantity. This happens all the time. And there's also this notion of a bootstrap procedure, which is to estimate uncertainty about your method. So, for example, if I want to estimate the
00:07:59
variation of my prediction rule when I change my training set, and I want to know the variance of my prediction when I change my training set, that's also a notion for which I need to
00:08:09
take an expectation over training sets. So what you can do, in bootstrap-type approaches, is resample the training set with replacement to get multiple subsets of the training set.
00:08:22
Then you train your classifier on each of them, you look at the variation, and you can average these variations, and that gives you an estimate of how the prediction varies when I change my training set.
00:08:33
So this is called a bootstrap estimate.
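The resample-refit-measure loop just described can be sketched as follows (an illustrative version, assuming numpy; the function names and the least-squares learner are my choices, not from the lecture):

```python
import numpy as np

def bootstrap_prediction_variance(X, y, x_test, fit, n_boot=200, seed=0):
    """Bootstrap estimate of the variance of a prediction rule.

    Resamples the training set with replacement, refits the model on
    each resample, and looks at the spread of the predictions at x_test.
    `fit(X, y)` should return a weight vector for a linear predictor.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        w = fit(X[idx], y[idx])            # refit on the bootstrap sample
        preds.append(x_test @ w)
    return float(np.var(preds))

# Least-squares fit as the learning algorithm under study.
ls = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)
var_hat = bootstrap_prediction_variance(X, y, np.array([1.0, 1.0]), ls)
```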
Unknown Speaker
00:08:37
Okay.
Lacoste-Julien Simon
00:08:38
And why does the true risk equal the generalization error? This is just from the setup. So the,
00:08:50
yeah, in the case of prediction it's just terminology. When we talk about empirical risk minimization, the "risk" in the empirical risk is this kind of quantity, which is also why we call it risk minimization.
00:09:05
So,
00:09:16
Okay, so somebody asked a very good question: what's the difference between this hypothesis class and my set of functions?
00:09:24
Good question.
00:09:27
Usually you might decide that they're the same. The idea is that this thing here
00:09:34
was a formalization of the problem we're trying to solve, from a statistical theory perspective, saying:
00:09:40
look, trying to do prediction means I need to learn some kind of mapping from input to output. I might decide that this includes all possible functions, right?
00:09:51
And this is how I evaluate my prediction function, even though I don't know p. So those are all the ingredients for the problem I'm trying to solve.
00:09:59
Now, the way you try to solve it: your estimator doesn't have to output arbitrary functions; you could decide to only output linear classifiers.
00:10:09
And so I would restrict my class here, and of course, if
00:10:14
the distribution you have has a best prediction function which is very far from linear, well, this estimator will be very bad.
00:10:24
But that's fine. I mean, you're free to define estimators however you want; it doesn't mean they're good estimators. And actually, in the assignment I gave you an example of an estimator which was not very good.
Unknown Speaker
00:10:34
Yeah.
Lacoste-Julien Simon
00:10:36
So, so that's this
00:10:42
Would k-fold cross-validation be closer to the frequentist risk?
Unknown Speaker
00:10:57
Hmm.
Lacoste-Julien Simon
00:11:03
That's a good question. So normally
00:11:07
cross-validation is to get a good estimate of the test error.
00:11:14
Because the training error... Let's say, for example, I do classification
00:11:20
and I use the nearest-neighbor classifier. Well, this always has zero training error.
00:11:24
So that doesn't give you a very good estimate of the test error, because of course you cannot just say it would be zero test error. So doing k-fold cross-validation gives you an estimate of what the
00:11:39
Oh no, you're right, because you're changing the hyperparameters.
00:11:43
You're
00:11:45
changing the decision rule: when you do your basic training on a subset, you estimate the error on the rest, and normally that's for model selection.
00:11:55
So yeah, I guess you could say that k-fold cross-validation would be closer to the frequentist risk.
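For concreteness, here is a minimal sketch of the k-fold cross-validation procedure being discussed (assuming numpy; the learner and loss below are illustrative stand-ins):

```python
import numpy as np

def kfold_cv_error(X, y, fit, loss, k=5, seed=0):
    """k-fold cross-validation estimate of the test error.

    Splits the data into k folds; each fold is held out once while the
    model is trained on the remaining folds, and the held-out losses
    are averaged.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)       # all indices not in the held-out fold
        w = fit(X[train], y[train])
        errs.append(loss(y[f], X[f] @ w))
    return float(np.mean(errs))

ls = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
mse = lambda y, p: float(np.mean((y - p) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.5 * rng.normal(size=120)
cv_err = kfold_cv_error(X, y, ls, mse)   # roughly the noise level, 0.25
```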
00:12:12
Which error are you talking about, the training error or the generalization error? Can you rephrase your question?
00:12:29
Yeah, so there's literally no training error in the true risk, because it's the true expectation. The true risk is this expectation here, which is not over the training set; it's actually over the true distribution.
00:12:44
It's just that we will replace it with an empirical version when we have access to some training examples.
00:12:52
The empirical version will be the training error: the empirical risk is indeed the training error.
00:12:58
The true risk is the generalization error.
Unknown Speaker
00:13:02
Okay.
Lacoste-Julien Simon
00:13:05
Alright, so
00:13:07
Let's talk about the James-Stein estimator, because
00:13:11
I told you that the MLE has issues.
00:13:15
The James Stein estimator.
00:13:19
It's a very mysterious estimator.
00:13:23
So this is basically an estimator.
00:13:27
For the
00:13:32
To estimate the mean of
00:13:36
a Gaussian random variable.
00:13:41
So you want to estimate the mean
00:13:45
of some Gaussian with mean mu; I'll put a vector arrow, just to say it's a vector,
00:13:52
in multiple dimensions, and
00:13:56
an isotropic covariance matrix, sigma squared times the identity. So basically I have d independent Gaussian
00:14:06
Gaussian variables.
00:14:11
They are basically x_i,
00:14:15
Independent
00:14:17
normal with mean mu_i and the same sigma squared. They're not i.i.d., because they have different means, but they are independent.
00:14:30
Right and so
00:14:34
There's this estimator. So if you do maximum likelihood estimation of the mean of a Gaussian random variable (that's something I think you did, along with the estimate of the variance),
00:14:48
the estimate for the mean is just the empirical mean: for each Gaussian variable you just take the empirical mean, and that gives you the maximum likelihood estimate.
00:14:58
And it turns out that indeed, for the empirical mean, if you take the expectation you get the true mean, so it's an unbiased estimate. OK, so the maximum likelihood estimator is unbiased.
00:15:09
The James-Stein estimator, instead of being unbiased, is actually biased:
00:15:14
you basically shrink your estimate towards zero.
00:15:19
But at the cost of a bit of bias, you actually decrease the variance significantly.
00:15:29
So: a bit of bias, but much lower variance than the MLE.
Unknown Speaker
00:15:35
Right.
Lacoste-Julien Simon
00:15:36
And if you recall
00:15:39
the bias-variance decomposition
00:15:44
That we use in the assignment.
00:15:49
for the squared loss.
00:15:54
You had that the frequentist risk for the squared loss, when the true parameter is theta and my estimate is theta-hat, was the expectation
00:16:07
of the squared L2 norm of theta minus theta-hat.
00:16:11
That's just the squared error, in expectation with respect to the training set, and it decomposed into pieces: there was the squared bias, the norm of E[theta-hat]
00:16:23
minus theta, squared, plus the variance.
Unknown Speaker
00:16:34
Okay.
Lacoste-Julien Simon
00:16:39
And so for the MLE, the bias is zero, but it has high variance.
00:16:44
For James-Stein I increase the bias a bit, but decrease the variance significantly, and the sum of the two is actually smaller than for the MLE. Okay. So it turns out that the James-Stein estimator
00:17:00
strictly dominates the maximum likelihood estimate
00:17:10
The maximum likelihood estimator for
00:17:16
for d greater than or equal to 3. So there's this weird phenomenon: in low dimension you can't beat it, but if you have at least three dimensions,
00:17:26
so this is the dimension of the mean, right,
00:17:28
then using this shrinkage, you can actually do better than the maximum likelihood estimate.
00:17:38
And what I mean by "strictly dominates" is that the frequentist risk
00:17:43
of the James-Stein estimator
00:17:47
is at most the one for the MLE estimator; this depends on theta, actually, and this is true for all theta,
00:18:00
and there exists some theta
00:18:04
such that the risk is strictly smaller.
00:18:14
So in this case, you remember I talked about risk profiles, and usually they cross. Well, what happens here is that the risk profile for the James-Stein estimator is
00:18:23
always below the one for the MLE. So it's just a better estimator; you should not use the MLE, from this perspective.
00:18:31
When an estimator is dominated like that by another estimator, it's called inadmissible in statistics. So the MLE is
00:18:43
inadmissible.
Unknown Speaker
00:18:46
In this case,
Lacoste-Julien Simon
00:18:51
So by definition, inadmissible just means that it gets dominated by something else, and dominated means, well, it's a bad estimator, because there's something which is strictly better than it. So why would you use it?
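To make the domination concrete, here is a small simulation sketch (assuming numpy; the shrinkage formula is the standard James-Stein form for one observation per trial with known variance, not something written on the board in this lecture):

```python
import numpy as np

# Compare the risk (expected squared error) of the MLE with the
# James-Stein estimator for the mean of a d-dimensional Gaussian with
# covariance sigma^2 * I, d >= 3, from a single observation x per trial.
rng = np.random.default_rng(0)
d, sigma2, n_trials = 10, 1.0, 20_000
mu = rng.normal(size=d)                       # arbitrary true mean

X = mu + np.sqrt(sigma2) * rng.normal(size=(n_trials, d))

mle = X                                       # MLE: the observation itself
norms2 = np.sum(X ** 2, axis=1, keepdims=True)
js = (1.0 - (d - 2) * sigma2 / norms2) * X    # James-Stein shrinkage toward 0

risk_mle = np.mean(np.sum((mle - mu) ** 2, axis=1))  # close to d * sigma2 = 10
risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))    # smaller, for this and any mu
```

The empirical risk of the shrunk estimator comes out below the MLE's, matching the domination statement above.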
00:19:10
And if you're curious about this, I recommend you look at the Wikipedia article on James-Stein; you'll see the estimator. It turns out that you can interpret the James-Stein estimator
00:19:25
as an empirical Bayesian approach.
00:19:36
So we'll see what empirical Bayes means
00:19:42
much later in the class, when we talk about Bayesian methods in more generality. But in a Bayesian method, the idea is: I put some prior on my parameter and then do these posterior
00:19:54
updates. And then empirical Bayes means that there are some parameters of my prior which I didn't know how to pick in advance.
00:20:04
So if I have uncertainty about some parameter, I should put a prior over it. So let's say
00:20:10
I want to decide what the parameters of my prior are; these are called hyperparameters. Well then, if you don't know how to choose your hyperparameters, because as a Bayesian you don't have a good
00:20:20
belief about them, there's uncertainty, then you need to put a prior over the hyperparameters. This is called a hyperprior. And then you can
00:20:27
also have uncertainty about the parameters of the hyperprior, and then you put another prior, so you get this hierarchy of priors.
00:20:35
And being empirical Bayesian means that I will fit some of these hyperparameters from data, instead of just coming up with them out of thin air, which is what a true subjective Bayesian would do. Okay, so
00:20:49
it's really cheating, because a Bayesian would never do that. But
00:20:53
basically, a practical frequentist who uses Bayesian methods will do that.
00:20:59
What's the setting, exactly? Yeah. So this is kind of like,
00:21:06
yeah, so in this case it means for d greater than or equal to 3, or more specifically for the Gaussian:
00:21:15
to estimate the mean of a Gaussian, because in general there are other settings where the MLE is admissible. But for
00:21:22
the setup of estimating the mean of a Gaussian in dimension three or more, where you know the variance (there are a few assumptions here), then it's inadmissible.
00:21:33
The
Unknown Speaker
00:21:36
[inaudible]
Unknown Speaker
00:21:38
Question.
Lacoste-Julien Simon
00:21:43
Is it true that the L2 frequentist risk is the same as the mean squared error?
00:21:50
Well,
00:21:53
Yes and no. So yes, here; and no, because for mean squared error there's the question of: the mean with respect to what?
00:22:03
The frequentist risk would be the mean with respect to the random training set; that's where the mean is coming from. But, for example, in signal processing you could just look at the mean squared error of your prediction,
00:22:16
and there's no notion of frequentist risk there. It's just an evaluation, the squared error of your method
00:22:23
over some observations. So there's no frequentist-risk notion there. That's why I'm saying: the mean squared error, where the mean is with respect to the
00:22:31
possible training sets, is the frequentist risk for the squared loss. And the frequentist risk is also defined for losses other than the squared loss; it could be the binary loss or something.
00:22:48
Okay. Another question: does the risk depend on the loss function? Yes, it does. Does James-Stein dominate only for the squared loss?
00:22:58
Good question. I don't know if, for say the L1-norm loss, it would still be the same. This I actually don't know.
00:23:08
Maybe somebody knows, or you can look it up, but I know it's true for the L2 norm.
00:23:15
And then somebody asked: will the chain of priors ever stop? Yes. So the funny thing here is, remember there was
00:23:21
this story that the Earth sits on a turtle, and then there are turtles on top of turtles; there's this mythology of turtles on top of each other all the way up.
00:23:33
And indeed, for the prior over a prior over a prior, there's the question of when you should stop. And the point is,
00:23:41
at some point there is not really uncertainty anymore, because the parameters don't really matter too much. It turns out that when you go higher in the hierarchy and tweak these parameters,
00:23:51
what happens down below becomes much less important; it's very insensitive. So at that stage people are fine saying: this properly encodes my belief, I don't have more uncertainty.
00:24:05
I guess that's their answer. But indeed, in general, you should only stop when you know that this is the correct encoding of your belief.
00:24:17
Okay.
00:24:19
So I could actually spend a whole lecture on James-Stein, but I think we have other things to cover, so I'll move on.
00:24:27
But it's kind of a fascinating example, and the main idea of James-Stein is a bit similar to what we'll see in
00:24:34
ridge regression and the like, where we talk about regularization: when you regularize your method you will usually bias it;
00:24:42
it will increase the bias, but it will decrease the variance, and sometimes it will decrease the variance much more, which makes the method more stable and nicer. The MLE usually overfits, in the sense that it has high variance.
00:25:00
Yeah, so Omar asked if the notion of an estimator being inadmissible depends on the risk function, or if we always assume L2. That's actually
00:25:10
a good question. I don't know enough of the details of the statistics terminology, because usually a lot of these terms come up when you take a statistics class in the basic
00:25:23
classical setup, and in the classical setup it's all L2 norm and squared loss and things like that. And I think,
00:25:32
I don't know, I don't think there's a theorem which says that if you're inadmissible for one loss you're inadmissible for the other losses. So I'm pretty sure "inadmissible" is defined in the context of some specific loss.
00:25:49
Alright, so let's talk about some properties of the MLE, to wrap up these properties of estimators.
00:25:56
Properties of the MLE.
00:26:06
And I guess these are asymptotic properties.
00:26:13
So under suitable
00:26:19
regularity conditions
00:26:28
on the parameter space and your parametric family.
00:26:35
And I won't go into these conditions, because I actually forgot them, they're fairly technical, and I think they're outside the scope of this class. But you can look in any graduate statistics textbook; they will tell you.
00:26:49
Or I can dig them back up if you want
00:26:53
So basically, if we define our estimator as the argmax
00:26:59
over the parameter space
00:27:01
Of the empirical
00:27:04
log likelihood
00:27:07
which is just the sum over i of
00:27:10
log p(x_i | theta).
00:27:14
So here I am.
00:27:18
assuming that the data is i.i.d.
00:27:22
And so when I evaluate the likelihood of the data, it's the product of the likelihoods, and when I take the log it becomes a sum.
00:27:30
So what are the properties of this estimator? Well, the first thing is, under very weak regularity conditions, it is a consistent estimator.
00:27:40
So,
00:27:44
It is consistent,
00:27:46
i.e., it converges to the right theta. And so the idea here, this is supposing that
00:27:53
we suppose that the training set is coming from
00:27:58
p(. | theta) raised to the power n, i.e., n i.i.d. samples. So of course, you want consistency in the following sense: if the true parameter theta doesn't lie in your parameter space, if you're not modeling the correct set of distributions, then you won't be consistent,
00:28:15
though you'll converge to the minimum-KL parameter, the parameter whose distribution is closest to the true one in KL divergence.
00:28:21
So you have consistency. You even have a central limit theorem.
00:28:33
And that means that when I look at the deviation between my estimate and the true parameter, scaled by the square root of n, this converges in distribution
00:28:43
to a normal with zero mean and a covariance given by the inverse of what is called the Fisher
00:28:51
Information matrix.
00:29:05
which has to do with
00:29:07
the derivative of the log-likelihood,
00:29:11
in expectation.
00:29:16
Here I'm basically giving you some keywords so that, if you want to go into more depth, you can look at a statistics textbook.
00:29:23
And then it is also called asymptotically optimal.
00:29:37
And this is related to the Cramér-Rao lower bound,
00:29:50
Which basically means that
00:29:55
It has a minimal
00:30:01
asymptotic variance
00:30:08
Among
00:30:10
all reasonable estimators.
00:30:26
And "reasonable" is also some kind of regularity condition. So what do I mean by the asymptotic variance? The idea is:
00:30:34
when I look at the deviation of my estimate from the true value,
00:30:42
this means that
00:30:46
this random variable here will have a distribution which is close to a normal as n goes to infinity. In particular, what I can do is divide both sides by the square root of n, and so the variance here is basically divided by n.
00:31:04
And so, what you get is that
00:31:07
as n increases, the Gaussian becomes more and more concentrated around zero, which means that the difference between your estimate and the true value is very, very small. So it becomes very concentrated. Okay, and
00:31:25
so this tells you how it varies with the number of samples, and this value here is called the asymptotic variance. The bigger the asymptotic variance,
00:31:36
the more samples it takes to get a small deviation. And here it's a matrix, because we're in multiple dimensions, but in dimension one this would just be
00:31:49
sigma squared, the asymptotic variance of your estimator, of the deviation of your estimate. So that's what it means:
00:31:58
when we say that the MLE has minimal asymptotic variance, it means that the other estimators will always have an asymptotic variance which is,
00:32:09
well, not strictly bigger, but greater than or equal to the inverse of the information matrix. That is the best you could do, and it is achieved by the MLE. Okay, so that's basically what the Cramér-Rao lower bound says.
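A quick numerical illustration of this bound (a sketch I'm adding, assuming numpy; it uses the Bernoulli model, where the Fisher information has the simple closed form 1 / (p(1-p)) and the MLE is the sample mean):

```python
import numpy as np

# Check the variance of the MLE against the Cramer-Rao bound for a
# Bernoulli(p) model: the MLE is the sample mean and the Fisher
# information is I(p) = 1 / (p * (1 - p)), so the bound is 1 / (n * I(p)).
rng = np.random.default_rng(0)
p, n, n_reps = 0.3, 1000, 5000

data = rng.random(size=(n_reps, n)) < p   # n_reps independent training sets
p_hat = data.mean(axis=1)                 # MLE on each training set

var_mle = float(np.var(p_hat))            # empirical variance of the MLE
cramer_rao = p * (1 - p) / n              # 1 / (n * I(p)) = 0.00021
```

The empirical variance of the MLE sits right at the Cramér-Rao value: the bound is attained here, matching the optimality statement above.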
00:32:24
That was a bunch of questions.
00:32:31
Oh, "weak": I'm not saying weak regularization, I'm saying weak regularity conditions. So this is regularity.
00:32:40
Here this is
00:32:43
So by weak regularity conditions, I mean there are some assumptions you need to make on
00:32:52
the density and the parameter space to make sure that these results hold.
00:32:58
Very good.
Unknown Speaker
00:33:01
[inaudible]
Lacoste-Julien Simon
00:33:06
And indeed, now, an important question from a student: how does this asymptotic optimality relate to the James-Stein estimator?
00:33:14
So what happens is that for finite n, the James-Stein estimator dominates the MLE, but as n goes to infinity, the James-Stein estimator becomes like the MLE; there's no difference. So they have the same asymptotic variance.
00:33:28
And it's a bit like the fact that when you have a Bayesian estimator, say you do a MAP estimate,
00:33:39
there's a prior and there's a likelihood, and the effect of the prior becomes weaker and weaker as you have more data points, because that's where all the information is coming from.
Unknown Speaker
00:33:49
OK.
Lacoste-Julien Simon
00:33:52
OK. And then the fourth property.
00:33:56
Is invariance.
00:34:04
So basically, the MLE is preserved under reparametrization.
00:34:20
Okay, so here's what I mean. Suppose
00:34:26
you have a bijection f
00:34:32
from one set of parameters to another set of parameters; we'll use a prime.
00:34:39
Then
00:34:42
If, instead of estimating theta, I'm estimating
00:34:48
the reparametrization f(theta),
00:34:54
and I put a hat here, so that would be the MLE: this is the same thing as first doing MLE for theta and then mapping it with f. So if I do MLE in the transformed space,
00:35:04
and I look at the parameter which maximizes the likelihood in the transformed space, it's the same thing as doing the MLE in the original space and then mapping it to the transformed space.
00:35:16
And this is actually very useful, because then you don't have to worry too much about where you put the hat. For example, let's say I want to estimate the variance, like you did in the assignment.
00:35:31
So I parametrize my Gaussian by sigma squared.
00:35:35
Well, this means you would take the derivative with respect to sigma squared, right, because that's the parameter.
00:35:44
But you can instead take the derivative with respect to sigma, i.e., estimate the maximum likelihood parameter in the sigma space, where sigma is positive, and then square it. There's no difference between those two.
00:35:59
And similarly, you could have some crazy function. Let's say now I want to estimate,
00:36:06
well, I use sine on a restricted range, because you have to have a bijection, so you need to restrict your possible parameters; but let's say I now want the MLE where I use the sine of sigma squared as my parameter. Well, you can just take the sine of sigma-hat squared.
Unknown Speaker
00:36:27
That's the same thing.
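This invariance is easy to check numerically. Below is a sketch I'm adding (not from the lecture, assuming numpy): for a Gaussian with known mean 0, maximize the likelihood over sigma on a grid, and separately over v = sigma squared, and verify that the two answers agree through the reparametrization.

```python
import numpy as np

# Numerical check of MLE invariance under reparametrization:
# argmax over v of L(v) equals (argmax over sigma of L(sigma^2))^2.
rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=10_000)      # true variance = 4
n, s2 = len(x), np.sum(x ** 2)

def loglik_v(v):
    # Gaussian (mean 0) log-likelihood as a function of the variance v.
    return -0.5 * n * np.log(2 * np.pi * v) - s2 / (2 * v)

sigmas = np.linspace(0.5, 4.0, 200_001)     # grid over sigma
vs = np.linspace(0.25, 16.0, 200_001)       # grid over v = sigma^2

sigma_hat = sigmas[np.argmax(loglik_v(sigmas ** 2))]
v_hat = vs[np.argmax(loglik_v(vs))]
# Up to grid resolution, v_hat equals sigma_hat ** 2 (both equal mean(x^2)).
```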
Lacoste-Julien Simon
00:36:31
And now, what if it's not a bijection? The sine example, say, is not a bijection in general.
00:36:39
You can actually generalize
00:36:42
the MLE
00:36:49
with something called the profile likelihood.
00:37:06
The profile.
00:37:09
Likelihood
00:37:14
And so
00:37:16
What do I mean by that? So let's suppose I have a mapping g
00:37:22
From θ to a new set of parameters, η.
00:37:30
But there are multiple θ which are mapped to the same η.
00:37:35
So then it's a question of, if I do the MLE in η-space,
00:37:40
Which of the
00:37:42
Which of the parameters should I use?
00:37:47
Right. So by definition, the profile likelihood would say...
00:37:54
Likelihood
00:37:57
By definition,
00:37:59
It's a likelihood we will define on this new space. What we do is we actually look at the max over θ,
00:38:08
Over the points which are mapped to η,
00:38:14
Of its likelihood
00:38:18
So I will define the likelihood of a specific η to be the likelihood of the parameter which actually maximizes the likelihood of the data, among all the parameters which are mapped to that same η.
00:38:39
And then if we define
00:38:43
The maximum likelihood parameter in this transformed space as just the argmax
00:38:53
Of the profile.
00:38:57
Likelihood, then we have
00:39:02
That the maximum likelihood in this space.
00:39:06
Is the same as just mapping the maximum likelihood estimate in the original space.
00:39:12
OK, so this profile likelihood trick is one way to handle the case when there's no bijection.
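In symbols (using η for the transformed parameter, as above), the profile likelihood and the resulting invariance can be written as:

```latex
L_p(\eta) \;:=\; \max_{\theta \,:\, g(\theta) = \eta} L(\theta;\, x_1, \dots, x_n),
\qquad
\hat{\eta}_{\mathrm{MLE}} \;:=\; \operatorname*{arg\,max}_{\eta} L_p(\eta)
\;=\; g\big(\hat{\theta}_{\mathrm{MLE}}\big).
```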
Unknown Speaker
00:39:40
Oh,
Lacoste-Julien Simon
00:39:41
Can I give an example of the profile likelihood situation?
00:40:03
Yeah, so let's say
00:40:06
Instead of parameterizing by μ,
00:40:09
Let's say g
00:40:11
Is
00:40:20
Let's say I have a Gaussian with mean μ and variance σ², and now I will say g(μ) = μ².
00:40:35
This is different from the σ² example from before, because
00:40:40
There it was only positive σ that mattered. What matters here now is that if I have a plus one or a minus one...
00:40:48
So if I have plus μ or minus μ, they map to the same parameter. So I cannot distinguish a positive and a negative μ.
00:40:59
And so now the problem is, I need to define what's the likelihood
00:41:06
Of my data given the value of μ². But now it's ill-defined, because there are multiple parameters I could use in the original model, and they have different likelihoods.
00:41:17
Okay. And you could have issues in the relationship between the maximum likelihood parameters if, for example, I sometimes decided to choose the parameter
00:41:30
In the new space which had lower likelihood of the data. Like, there are two parameters which are mapped to μ²,
00:41:39
And I picked the μ which actually had the smaller likelihood. And so this means that, well, I won't pick this one because it has a small likelihood, and then there are others which have a bigger likelihood,
00:41:47
Even though the other one, which was mapped to the same η, might have a very high likelihood; that one would actually be the maximum likelihood parameter, if well defined. So that's basically what I'm saying:
00:41:56
This is just a way to make sure that the maximum likelihood in the transformed space corresponds to the maximum likelihood in the original space, by always picking, among the preimages, the one whose likelihood in the original space is maximized.
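A small numerical sketch of this (hypothetical setup, not from the lecture: a Gaussian with known variance 1 and true mean -1.5, with η = μ²): maximizing the profile likelihood over a grid of η values recovers g(μ̂) = μ̂², even though each η > 0 has two preimages ±√η.

```python
import math
import random

# Hypothetical example: Gaussian with known variance 1 and the
# non-bijective reparameterization eta = g(mu) = mu^2.
random.seed(1)
data = [random.gauss(-1.5, 1.0) for _ in range(5_000)]
n = len(data)
mu_hat = sum(data) / n  # ordinary MLE of mu

# Sufficient statistics so the log-likelihood is cheap to evaluate.
s1 = sum(data)
s2 = sum(x * x for x in data)

def log_lik(mu):
    # Gaussian log-likelihood in mu (variance 1, additive constants dropped).
    return -0.5 * (s2 - 2.0 * mu * s1 + n * mu * mu)

def profile_log_lik(eta):
    # eta >= 0 has two preimages +sqrt(eta) and -sqrt(eta): keep the better one.
    r = math.sqrt(eta)
    return max(log_lik(r), log_lik(-r))

# Maximize the profile likelihood over a fine grid of eta values.
eta_grid = [i * 0.001 for i in range(5_000)]  # eta in [0, 5)
eta_hat = max(eta_grid, key=profile_log_lik)

# Invariance: the profile-likelihood MLE of eta matches g(mu_hat) = mu_hat^2.
assert abs(eta_hat - mu_hat ** 2) < 0.01
```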
00:42:13
Does that answer your question, Dora?
00:42:16
Don't we end up with a worse estimator in the end?
00:42:20
And in this case, no, because...
00:42:26
Well, so...
00:42:28
So basically, what it means is:
00:42:31
Here, when we estimate with the MLE, we always say: well, we don't care about the sign of the mean; all we care about is the square of the mean
00:42:40
That I'm trying to estimate: the square of the mean.
00:42:43
Because it could be that, indeed, I have some observations and I want to estimate the square of the observations; I don't care about the sign of the observations, because the square is insensitive to the sign.
00:42:55
And I'm just making sure that, you know,
00:42:59
This is well defined so that everything works.
00:43:05
And so if you didn't do this: you still need to define what should be the likelihood of μ², and then there's a problem because there are multiple possibilities.
00:43:14
And then you would need, for sure, to solve the maximum likelihood in this transformed space to get an estimator. And it's not guaranteed, in this case, that it will be the transformation of the original one.
00:43:32
Yes, I think Nick has answered Jacob's question.
00:43:39
So I think
00:43:41
We're good. Oh, and by the way, this terminology here: this is called a plug-in estimator,
00:43:52
In the sense that, well, we want to estimate a function of something,
00:43:58
And one way to do that is to just estimate this thing and then apply the function, which means I plug my estimator inside the function.
00:44:08
And for the MLE, in this framework, it actually doesn't change anything. There are other places where this might change things. But here, this
Unknown Speaker
00:44:17
Terminology...
Lacoste-Julien Simon
00:44:19
Are there any other questions about the MLE or properties of estimators? So the plan is: I think I'll take a 10-minute break and then I'll go over
00:44:30
Linear regression and logistic regression
Remi Dion
00:44:35
I have a question. Sure.
00:44:38
What are the constraints on g here, if we have any?
Lacoste-Julien Simon
00:44:46
Regularity conditions.
00:44:49
That's my lame way to escape, so...
00:44:56
I think you need some constraints. Like, you need...
00:45:06
Could you really just have an arbitrary function which is not even continuous?
Remi Dion
00:45:14
You said that even the
00:45:18
Non-bijection...
Lacoste-Julien Simon
00:45:21
Ah,
00:45:21
If g is not a bijection, then to apply this framework you need to properly define what would be the likelihood of the transformed
00:45:31
Parameter, and we do this with the profile likelihood, right? So we'll define the likelihood of a parameter η as the max, over all parameters which are mapped to this η, of the likelihood under the original model. So if you do that, then it doesn't matter that it's not a bijection,
00:45:54
Because it's a function, by the way. So this means that g is defined on all of the θ. Alright, so that's one aspect.
00:46:02
Yeah, so I think I don't see any others. So you might have some
00:46:08
Regularity issues if these maxes
00:46:13
Are not well defined, right: if they are infinite, or if they're not achieved,
00:46:21
Or they're achieved at the boundary or something, I don't know. So there might be some weird stuff happening.
00:46:27
But, uh, yeah, I think it's very general.
00:46:32
Another question.
Oumar Kaba
00:46:38
Yes, I did have a question that I asked in the chat.
00:46:42
It was after Amy's question.
Lacoste-Julien Simon
00:46:46
Do you want to state it in words?
Oumar Kaba
00:46:47
Yes. So I wondered if, when doing Bayesian parameter estimation with MAP, for example, or any other method,
00:46:57
You can always find a prior that will make it equal to the maximum likelihood, like you mentioned. Is that the case? Because in most examples I think we've seen in class, you can actually find a prior that will make it the same estimate as the maximum likelihood,
00:47:13
Formally.
Lacoste-Julien Simon
00:47:15
So the answer is yes, but you need to generalize the priors. So there's this thing called an improper prior, which is a prior which is not a correct distribution, because it's infinite: it doesn't integrate to one.
00:47:31
So for example, let's say I do an MLE for a Gaussian
00:47:37
Random variable. So I want to estimate the mean. The mean could be anywhere on the real line, so it's an unbounded set. So if I want to introduce a prior over the mean, I need to put a prior over all real numbers.
00:47:50
And I want it to be uniform if I don't want it to change the MLE; sorry, if I want the MAP to be equal to the MLE. So then you need to put a uniform distribution over all the reals, which is not possible. And so
00:48:03
What happens is, in many places, people would call it an improper prior, which is just: you don't care that the prior was not normalizable, because the posterior will be normalized once you multiply in the likelihood and renormalize. So that's kind of like a formal trick.
00:48:24
And so, but in general you can, as long as you define some kind of uniform thing
00:48:31
Over the
00:48:32
Thing you estimate. Then, because when you do MAP you multiply both the likelihood and the prior, but you don't care about the normalization because it's just a
00:48:43
Constant. Well then, if the prior is uniform, there's no difference with maximizing the likelihood, which is what the MLE is doing.
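A minimal coin-flip sketch of the uniform-prior point (the counts here are made up): with a flat Beta(1, 1) prior the MAP estimate coincides exactly with the MLE, while a strong Beta(1000, 1000) prior pulls the estimate toward one half.

```python
# Hypothetical counts for a Bernoulli (coin-flip) model.
heads, tails = 7, 3
n = heads + tails

mle = heads / n  # argmax of the Bernoulli likelihood

def beta_posterior_mode(alpha, beta):
    # Posterior is Beta(alpha + heads, beta + tails); its mode is
    # (a - 1) / (a + b - 2), valid here since a, b >= 1 and a + b > 2.
    a, b = alpha + heads, beta + tails
    return (a - 1) / (a + b - 2)

map_uniform = beta_posterior_mode(1, 1)       # flat prior: MAP equals the MLE
map_strong = beta_posterior_mode(1000, 1000)  # strong prior pulls toward 1/2

assert map_uniform == mle
assert abs(map_strong - 0.5) < abs(mle - 0.5)
```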
Oumar Kaba
00:48:52
I see. So it seems like even if you're a Bayesian, you can still reverse-engineer things to find the prior that will match any estimation you want to do.
Lacoste-Julien Simon
00:49:09
Ah, I'm
00:49:11
Not necessarily, actually. So, basically, there is this industry of approaches which is: say what I'm doing is a support vector machine,
00:49:24
Which is not even an MLE or whatever; it's just an estimator. Okay: is there a probabilistic interpretation for
00:49:32
Estimating the classifier of support vector machines using probabilities, and perhaps Bayesian approaches?
00:49:38
And then people came up with fancy distributions such that, when you do this approximation XYZ, then you get the SVM or something like that.
00:49:45
For them, especially if you're Bayesian, that feels very satisfying, because they say: oh, it's like stuff I know. And it gives some insight, because, oh,
00:49:52
There are probabilities behind it. So it is kind of attractive as a method, but sometimes it's really, really hard to come up with a distribution so that it works, and I'm not sure
00:50:01
You could show that it's always possible; I don't think so. There's some stuff which is weird. So you already cannot even do it with proper distributions. Like, okay, I gave the example: if your
00:50:13
Set is unbounded, then there's no way to define a uniform distribution on it. So it is kind of already cheating to use improper priors.
Oumar Kaba
00:50:21
Okay, thank you very much.
Lacoste-Julien Simon
00:50:25
And Dora says: if you repeat the experiment many times. Well, if you have another training set, sorry, a very large training set, a lot of observations, then
00:50:34
MAP becomes like the MLE. That's true, yes. I'm saying that usually the effect of the prior becomes essentially swamped by the data, so as N goes to infinity they give the same thing. Next question?
Simon Demeule
00:50:50
Yep. So it's kind of a very general thing, and I don't know if there's really a way to answer this kind of shortly,
00:50:58
But just generally, the process of choosing a prior and going through the whole Bayesian process didn't completely stick in my head in an intuitive way. I kind of get the math and I kind of get why it works, but still, baking in a prior just seems very odd to me.
00:51:17
And I've kind of had a hard time just going through the examples. So is there maybe a resource or something online you'd recommend reading?
Lacoste-Julien Simon
00:51:28
Ah, that's a good question. So
00:51:36
Right. So, a resource online, I need to think. Perhaps something like a practical Bayesian statistics book; there are good books, but they are books, and I'm not sure there's a short text like that.
00:51:51
You could also look up keywords like "prior elicitation", which is basically how to
00:51:59
Figure out priors on things. What would usually happen is, say you're doing statistics: you would talk to experts, and the experts have some good intuition about how things should behave,
00:52:08
And then you try to talk to them a lot to figure out how to formalize their beliefs about the system, so that
Simon Demeule
00:52:15
You know, you put the
Lacoste-Julien Simon
00:52:16
You construct the correct prior according to their
Simon Demeule
00:52:18
expert knowledge.
Lacoste-Julien Simon
00:52:23
And Dora has suggested "Bayesian Methods for Hackers". Okay, so I don't know this book; perhaps it's good. So,
00:52:30
Two things I would like to say. One thing is, first of all, doing this kind of statistics is an art which takes a lot of time: trained statisticians do many years of training and practical
00:52:42
Work to build up this intuition about
00:52:44
Which statistical procedure to use in which situation, or, if they're Bayesian, how to build their priors and so on.
00:52:51
So it's definitely non-trivial, and we won't be able to do it justice in this class. And the other aspect I would say is that the general rule of thumb to come up with a prior is to think about it as if you had observed this phenomenon in the past.
00:53:06
Right. So, for example,
00:53:09
Take the coin-flip example.
00:53:13
The idea is, when you put a prior which is a Beta, you have the parameters α and β, one for heads and one for tails, for your Beta distribution.
00:53:25
And these are called prior counts, in some sense: say you put one for each; what it means is that you've observed in the past one head and one tail.
00:53:34
And
00:53:36
And that's why, in this case, it's still
00:53:41
Yeah. And so if you instead put 1000 and 1000 as your α and β,
00:53:49
The Beta distribution will be much more concentrated
00:53:51
Around one half.
00:53:52
That's its mean, because you've seen a lot of observations and they were split very, very equally.
00:53:57
And so this is like a stronger prior, in the sense that you're
00:54:01
Committing much more to the fact that, oh, I know that the parameter should be around one half, because I observed in the past a thousand heads and a thousand
Simon Demeule
00:54:10
Tails.
Lacoste-Julien Simon
00:54:11
And then, when you add your likelihood, you update: you combine your observations with that, and it's just adding the counts you've seen. Right? So if you start with an (α, β) prior,
00:54:22
Then the posterior becomes α plus the number of times I've seen heads, and β plus the number of times I've seen tails, for example. And so
00:54:35
And so
00:54:37
You can think of it in the sense of constructing the prior from just prior observations, or interpreting the parameters of the prior in terms of observed counts.
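This prior-counts reading can be sketched directly (the observed counts below are illustrative): the posterior parameters are just the prior pseudo-counts plus the observed counts, and larger pseudo-counts give a more concentrated prior.

```python
def update(alpha, beta, heads, tails):
    # Conjugate Beta-Bernoulli update: just add the observed counts.
    return alpha + heads, beta + tails

def beta_mean_var(a, b):
    # Mean and variance of a Beta(a, b) distribution.
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

weak_mean, weak_var = beta_mean_var(1, 1)            # "one head, one tail seen before"
strong_mean, strong_var = beta_mean_var(1000, 1000)  # "1000 of each seen before"

assert weak_mean == strong_mean == 0.5  # both priors are centred at one half
assert strong_var < weak_var            # but the strong prior is far more concentrated

# After observing, say, 30 heads and 10 tails:
assert update(1, 1, 30, 10) == (31, 11)
```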
Unknown Speaker
00:54:49
Okay.
Lacoste-Julien Simon
00:54:54
Now there was Ezekiel.
Simon Demeule
00:54:57
Um,
ezekiel williams
00:54:58
Yes, I asked a question earlier that I think was just missed in the chat. I was wondering, for starters, if we're talking about a consistent estimator, then the
00:55:10
Variance of the estimator will converge to zero as the sample size goes to infinity, right? Or...
Lacoste-Julien Simon
00:55:16
Well, so this is again going back to this technical point that I discussed in Slack about the assignment, which is that if you have...
00:55:26
So, consistency here:
00:55:31
I've used the standard terminology of consistency, which is convergence in probability.
00:55:37
When you have the bias and the variance go to zero,
00:55:42
You have that the squared, the expected squared error,
00:55:48
By the bias-variance decomposition, will go to zero. So you'll have something like:
00:55:54
Expectation
00:55:58
So if the bias and the variance go to zero, you will have that the expectation of θ̂ minus θ,
00:56:06
Norm squared, goes to zero as n goes to infinity. Okay. And this is called L2 convergence for a random variable, and L2 convergence implies convergence in probability.
00:56:22
And so it's a stronger thing.
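The two steps being described can be written out: the bias-variance decomposition gives the L2 statement, and Chebyshev's inequality then turns L2 convergence into convergence in probability:

```latex
\mathbb{E}\,\|\hat\theta_n - \theta\|^2
  = \underbrace{\|\mathbb{E}\,\hat\theta_n - \theta\|^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\,\|\hat\theta_n - \mathbb{E}\,\hat\theta_n\|^2}_{\text{variance}}
  \;\longrightarrow\; 0,
\qquad
\mathbb{P}\big(\|\hat\theta_n - \theta\| \ge \varepsilon\big)
  \;\le\; \frac{\mathbb{E}\,\|\hat\theta_n - \theta\|^2}{\varepsilon^2}
  \;\longrightarrow\; 0 .
```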
00:56:24
And so if you have