ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

机器学习著名定理之—No Free Lunch定理详解

2022-02-22 22:07:27  阅读:150  来源: 互联网

标签:frac limits No 定理 vr Free ge mathcal sum


No Free Lunch定理

定理(No Free Lunch): 假定 A \mathcal{A} A是一个在域 X \mathcal{X} X的二分类任务中任意一个机器学习算法,其损失函数为 0 - 1 0\text{-}1 0-1损失。令 n n n是一个大小为 ∣ X ∣ / 2 |\mathcal{X}|/2 ∣X∣/2的训练集,存在域 X \mathcal{X} X中的分布 D \mathcal{D} D,则有
(1)存在一个函数 f : X → { 0 , 1 } f:\mathcal{X}\rightarrow \{0,1\} f:X→{0,1},且有 L D ( f ) = 0 L_\mathcal{D}(f)=0 LD​(f)=0。
(2)对于子列 S ∼ D n \mathcal{S\sim D}^n S∼Dn,则概率不等式 P ( L D ( A ( S ) ) ≥ 1 / 8 : S ∼ D n ) ≥ 1 / 7 P(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8: \mathcal{S}\sim\mathcal{D}^n)\ge 1/7 P(LD​(A(S))≥1/8:S∼Dn)≥1/7成立。

证明:
(1) 令 C \mathcal{C} C表示域 X \mathcal{X} X中大小为 2 n 2n 2n的一个子集。主要的证明思路是只利用数据集 C \mathcal{C} C一半的数据样本点并不能给出剩下一半数据点的信息。 假定 H \mathcal{H} H表示数据集 C \mathcal{C} C到标签集合 { 0 , 1 } \{0,1\} {0,1}所有可能的函数集合,且 T T T表示的是函数集合的基数,其中 H = { f 1 , ⋯   , f T } \mathcal{H}=\{f_1,\cdots,f_T\} H={f1​,⋯,fT​}, T = 2 2 n T=2^{2n} T=22n。对于 H \mathcal{H} H中每一个函数假设,令 D i \mathcal{D}_i Di​是 C × { 0 , 1 } \mathcal{C}\times\{0,1\} C×{0,1}中的分布 D i ( { ( x , y ) } ) = { 1 / 2 m i f   y = f i ( x ) 0 o t h e r w i s e \mathcal{D}_i(\{(x,y)\})=\left\{\begin{array}{ll}1/2m & \mathrm{if}\text{ } y=f_i(x)\\0& \mathrm{otherwise}\end{array}\right. Di​({(x,y)})={1/2m0​if y=fi​(x)otherwise​进而可知存在函数 f i f_i fi​,在数据分布 D i \mathcal{D}_i Di​上则有 L D i ( f i ) = 0 L_{\mathcal{D}_i}(f_i)=0 LDi​​(fi​)=0。
(2)主要证明的关键在于即对任意的学习算法 A \mathcal{A} A有 max ⁡ i ∈ [ T ] E S ∼ D i n [ L D i ( A ( S ) ) ] ≥ 1 / 4 \max\limits_{i \in [T]}\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]\ge 1 / 4 i∈[T]max​ES∼Din​​[LDi​​(A(S))]≥1/4首先从 C × { 0 , 1 } \mathcal{C}\times \{0,1\} C×{0,1}中采样出 n n n个样本构造一个训练集,其中采样出的样本可以重复,进而可知有 k = ( 2 n ) n k=(2n)^n k=(2n)n中可能的样本序列。令这些样本序列分别表示为 S 1 , S 2 , ⋯   , S k \mathcal{S}_1,\mathcal{S}_2,\cdots,\mathcal{S}_k S1​,S2​,⋯,Sk​。 S j i = ( ( x 1 , f i ( x 1 ) ) , ⋯   , ( x n , f i ( x n ) ) ) \mathcal{S}_j^i=((x_1,f_i(x_1)),\cdots,(x_n,f_i(x_n))) Sji​=((x1​,fi​(x1​)),⋯,(xn​,fi​(xn​)))表示的是函数 f j f_{j} fj​在样本序列 S j S_j Sj​中的数据集合,则有 E S ∼ D i n [ L D i ( A ( S ) ) ] = 1 k ∑ j = 1 k L D i ( A ( S j i ) ) \mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]=\frac{1}{k}\sum\limits_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}^i_j)) ES∼Din​​[LDi​​(A(S))]=k1​j=1∑k​LDi​​(A(Sji​))又因为 " m a x i m u m " ≥ " a v e r a g e " ≥ " m i n i m u m " \mathrm{"maximum"}\ge \mathrm{"average"}\ge \mathrm{"minimum"} "maximum"≥"average"≥"minimum",所以则有 max ⁡ i ∈ [ T ] 1 T ∑ j = 1 k L D i ( A ( S j i ) ) ≥ 1 T ∑ i = 1 T 1 k ∑ j = 1 k L D i ( A ( S j i ) ) = 1 k ∑ j = 1 k 1 T ∑ i = 1 T L D i ( A ( S j i ) ) ≥ min ⁡ j ∈ [ k ] 1 T ∑ i = 1 T L D i ( A ( S j i ) ) \begin{aligned}\max\limits_{i\in [T]}\frac{1}{T}\sum\limits_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))&\ge \frac{1}{T}\sum\limits_{i=1}^T\frac{1}{k}\sum\limits_{j=1}^kL_{\mathcal{D}_i}(\mathcal{A(\mathcal{S}_j^i)})\\&=\frac{1}{k}\sum\limits_{j=1}^k\frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}^i_j))\\& \ge \min\limits_{j \in [k]}\frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\end{aligned} i∈[T]max​T1​j=1∑k​LDi​​(A(Sji​))​≥T1​i=1∑T​k1​j=1∑k​LDi​​(A(Sji​))=k1​j=1∑k​T1​i=1∑T​LDi​​(A(Sji​))≥j∈[k]min​T1​i=1∑T​LDi​​(A(Sji​))​现固定 j ∈ [ k ] j \in [k] j∈[k],令 S j = { x 1 , ⋯   , x n } \mathcal{S}_j=\{x_1,\cdots,x_n\} Sj​={x1​,⋯,xn​}, C \ S j = { v 1 , ⋯   , v p } C \backslash \mathcal{S}_j=\{v_1,\cdots,v_p\} C\Sj​={v1​,⋯,vp​},其中 p ≥ n p\ge n p≥n是剩余没有采样的样本数。对于每一个函数 h : C → { 0 , 1 } h:C \rightarrow\{0,1\} h:C→{0,1}, i ∈ [ T ] i\in[T] i∈[T]有 L D i ( h ) = 1 2 n ∑ x ∈ C I ( h ( x ) ≠ f i ( x ) ) = 1 2 n ( ∑ l = 1 n I ( h ( x l ) ≠ f i ( x l ) ) + ∑ r = 1 p I ( h ( v r ) ≠ f i ( v r ) ) ) ≥ 1 2 n ∑ r = 1 p I ( h ( v r ) ≠ f i ( v r ) ) ≥ 1 2 p ∑ r = 1 p I ( h ( v r ) ≠ f i ( v r ) ) \begin{aligned}L_{\mathcal{D}_i}(h)&=\frac{1}{2n}\sum\limits_{x \in C}\mathbb{I}(h(x)\ne f_i(x))\\&=\frac{1}{2n}\left(\sum\limits_{l=1}^n \mathbb{I}(h(x_l)\ne f_i(x_l))+\sum\limits_{r=1}^p \mathbb{I}(h(v_r)\ne f_i(v_r))\right)\\&\ge \frac{1}{2n}\sum\limits_{r=1}^p \mathbb{I}(h(v_r)\ne f_i(v_r))\\&\ge \frac{1}{2p}\sum\limits_{r=1}^{p}\mathbb{I}(h(v_r)\ne f_i(v_r))\end{aligned} LDi​​(h)​=2n1​x∈C∑​I(h(x)​=fi​(x))=2n1​(l=1∑n​I(h(xl​)​=fi​(xl​))+r=1∑p​I(h(vr​)​=fi​(vr​)))≥2n1​r=1∑p​I(h(vr​)​=fi​(vr​))≥2p1​r=1∑p​I(h(vr​)​=fi​(vr​))​所以 1 T ∑ i = 1 T L D i ( A ( S j i ) ) ≥ 1 T ∑ i = 1 T 1 2 p ∑ r = 1 p I ( A ( S j i ) ( v r ) ≠ f i ( v r ) ) = 1 2 p ∑ r = 1 p 1 T ∑ i = 1 T I ( A ( S j i ) ( v r ) ≠ f i ( v r ) ) ≥ 1 2 min ⁡ r ∈ [ p ] 1 T ∑ i = 1 T I ( A ( S j i ) ( v r ) ≠ f i ( v r ) ) \begin{aligned}\frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))& \ge \frac{1}{T}\sum\limits_{i=1}^T\frac{1}{2p}\sum\limits_{r=1}^p \mathbb{I}(\mathcal{A(\mathcal{S}^i_j)(v_r)\ne f_i(v_r)})\\&=\frac{1}{2p}\sum\limits_{r=1}^p\frac{1}{T}\sum\limits_{i=1}^T \mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_i(v_r))\\&\ge \frac{1}{2}\min\limits_{r \in [p]}\frac{1}{T}\sum\limits_{i=1}^T\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\end{aligned} T1​i=1∑T​LDi​​(A(Sji​))​≥T1​i=1∑T​2p1​r=1∑p​I(A(Sji​)(vr​)​=fi​(vr​))=2p1​r=1∑p​T1​i=1∑T​I(A(Sji​)(vr​)​=fi​(vr​))≥21​r∈[p]min​T1​i=1∑T​I(A(Sji​)(vr​)​=fi​(vr​))​对于给定的 r ∈ [ p ] r\in [p] r∈[p],因为 T T T是所有可能函数映射的基数,所以总有成对存在的 a , b ∈ [ T ] a,b\in [T] a,b∈[T]有 I ( A ( S j i ) ( v r ) ≠ f a ( v r ) ) + I ( A ( S j i ) ( v r ) ≠ f b ( v r ) ) = 1 \mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_a(v_r))+\mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_b(v_r))=1 I(A(Sji​)(vr​)​=fa​(vr​))+I(A(Sji​)(vr​)​=fb​(vr​))=1进而则有 E S ∼ D i n [ L D i ( A ( S ) ) ] = e 1 T ∑ i = 1 T L D i ( A ( S j i ) ) ≥ 1 2 min ⁡ r ∈ [ p ] 1 T ∑ i = 1 T I ( A ( S j i ) ( v r ) ≠ f i ( v r ) ) = 1 2 ⋅ 1 T ⋅ T 2 = 1 4 \begin{aligned}\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))] &= e \frac{1}{T}\sum\limits_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}^i_j))\\&\ge \frac{1}{2}\min\limits_{r \in [p]}\frac{1}{T}\sum\limits_{i=1}^T \mathbb{I}(\mathcal{A}(\mathcal{S}^i_j)(v_r)\ne f_i(v_r))\\&=\frac{1}{2}\cdot \frac{1}{T}\cdot \frac{T}{2}=\frac{1}{4}\end{aligned} ES∼Din​​[LDi​​(A(S))]​=eT1​i=1∑T​LDi​​(A(Sji​))≥21​r∈[p]min​T1​i=1∑T​I(A(Sji​)(vr​)​=fi​(vr​))=21​⋅T1​⋅2T​=41​​根据马尔可夫不等式的修改版可知,给定一个随机变量 X ∈ [ 0 , 1 ] X \in[0,1] X∈[0,1],给定一个常数 a ∈ [ 0 , 1 ] a\in [0,1] a∈[0,1],进而则有 E [ X ] = ∫ 0 1 x p ( x ) d x = ∫ 0 a x p ( x ) d x + ∫ a 1 x p ( x ) d x ≤ a ∫ 0 a p ( x ) d x + ∫ a 1 p ( x ) d x = a ( 1 − p { X ≥ a } ) + p { X ≥ a } = a + ( 1 − a ) P { X ≥ a } \begin{aligned}\mathbb{E}[X]&=\int_0^1 x p(x)dx\\&= \int_0^a x p(x)dx + \int_a^1xp(x)dx\\&\le a \int_0^a p(x)dx + \int^1_a p(x)dx\\&=a(1-p\{X\ge a\})+p\{X\ge a\}\\&=a+(1-a)P\{X\ge a\}\end{aligned} E[X]​=∫01​xp(x)dx=∫0a​xp(x)dx+∫a1​xp(x)dx≤a∫0a​p(x)dx+∫a1​p(x)dx=a(1−p{X≥a})+p{X≥a}=a+(1−a)P{X≥a}​马尔可夫不等式为 P { X ≥ a } ≥ E [ X ] − a 1 − a P\{X\ge a\}\ge \frac{\mathbb{E}[X]-a}{1-a} P{X≥a}≥1−aE[X]−a​利用马尔可夫不等可知 P ( L D ( A ( S ) ) ≥ 1 / 8 : S ∼ D n ) = E S ∼ D i n [ L D i ( A ( S ) ) ] − 1 / 8 1 − 1 / 8 ≥ 1 / 4 − 1 / 8 1 − 1 / 8 = 1 7 \begin{aligned}P(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8: \mathcal{S}\sim\mathcal{D}^n)&=\frac{\mathbb{E}_{\mathcal{S}\sim \mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]-1/8}{1-1/8}\\& \ge \frac{1/4-1/8}{1-1/8}=\frac{1}{7}\end{aligned} P(LD​(A(S))≥1/8:S∼Dn)​=1−1/8ES∼Din​​[LDi​​(A(S))]−1/8​≥1−1/81/4−1/8​=71​​

标签:frac,limits,No,定理,vr,Free,ge,mathcal,sum
来源: https://blog.csdn.net/qq_38406029/article/details/123061417

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有