Tags: Training, Transformers, projection, Self, head, patch, lr, ViT, Study
Main contributions of the paper
- Opens up the self-supervised learning direction for ViT
- Investigates the causes of ViT training instability and its remedies
Self-supervised Transformer for vision
- Masks and reconstructs patches
- Contrastive/Siamese methods
MoCo v3
Change 1: removes the memory queue
Reason: when the batch size is large enough (>4096), the queue brings little additional gain
Change 2: following BYOL, \(f_q\) gains an extra prediction head; the backbone is either ResNet or ViT
\(f_q\):backbone+projection head+prediction head
\(f_k\):backbone+projection head
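With the memory queue removed, the contrastive loss uses only in-batch negatives: each query from \(f_q\) should match the key of the same image from \(f_k\), with every other key in the batch as a negative, symmetrized over the two augmented views. A minimal numpy sketch of this objective (the temperature 0.2 and the batch/feature sizes are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def info_nce(q, k, tau=0.2):
    """InfoNCE over in-batch negatives: q_i should match k_i (same image,
    other augmentation); every other k_j in the batch is a negative.
    tau is the softmax temperature (0.2 is an assumed value)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)  # l2-normalize features
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                      # (N, N) similarity matrix
    m = logits.max(axis=1, keepdims=True)       # numerically stable log-softmax
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # cross-entropy on the diagonal

# Symmetrized loss over two views: q1, q2 come from f_q (backbone +
# projection + prediction head), k1, k2 from f_k (no gradient flows into it).
rng = np.random.default_rng(0)
q1, q2, k1, k2 = (rng.normal(size=(8, 16)) for _ in range(4))
loss = info_nce(q1, k2) + info_nce(q2, k1)
```

In the real method \(f_k\) is a momentum-updated copy of \(f_q\) and the keys are detached from the computation graph; here random features stand in for both.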
Stability of Self-Supervised ViT Training
Because the model always reaches a decent accuracy rather than a catastrophic failure, the degradation in accuracy caused by instability (1%-3%) is hard to observe.
Empirical Observations on Basic Factors
Batch Size
Instability appears when the batch size exceeds 2048, and it becomes more pronounced as the batch grows larger.
We hypothesize that the training is partially restarted and jumps out of the current local optimum, then seeks a new trajectory. As a consequence, the training does not diverge, but the accuracy depends on how good the local restart is.
Learning Rate
In practice, the learning rate is often scaled when the batch size increases.
As lr increases, accuracy first rises and then falls, while training gradually becomes less stable. With a small lr, training is stable but still under-fitting; at lr=1.5e-4, accuracy drops because it is now hurt by the instability.
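The scaling mentioned above is usually the linear scaling rule, lr = base_lr × batch_size / 256; the base value 1.5e-4 is the lr quoted in this note, used here purely for illustration:

```python
def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Linear lr scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

# e.g. base_lr = 1.5e-4 with batch 4096 gives 1.5e-4 * 16 = 2.4e-3
lr = scaled_lr(1.5e-4, 4096)
```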
Optimizer
This paper uses the AdamW optimizer: with LAMB (an AdamW counterpart of LARS), the lr has to be chosen carefully to obtain comparable results.
A Trick for Improving Stability
The authors find that gradient spikes cause the training instability, and that the spikes appear first in the first layer, i.e., the patch projection.
Based on this observation, they freeze the patch projection during training: it is randomly initialized and then kept fixed, never updated.
We use a fixed random patch projection layer to embed the patches, which is not learned. This can be easily done by applying a stop-gradient operation right after this layer.
We note that freezing the first layer does not change the architecture, and it actually narrows down the solution space. This indicates that the underlying problem is on optimization.
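The trick can be illustrated with a toy stand-in: a fixed random "patch projection" followed by a trainable linear head, trained with plain gradient descent on a synthetic regression task. This is a numpy sketch, not the paper's training code; the dimensions, lr, and step count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random projection (the frozen first layer) + trainable head.
d_in, d_emb, d_out, n = 8, 16, 4, 64
W_patch = rng.normal(scale=d_in ** -0.5, size=(d_in, d_emb))  # frozen layer
W_head = np.zeros((d_emb, d_out))                             # trainable head

X = rng.normal(size=(n, d_in))
Y = X @ rng.normal(size=(d_in, d_out))  # synthetic regression targets

W_patch_init = W_patch.copy()
lr, losses = 0.2, []
for _ in range(500):
    H = X @ W_patch                     # embedding from the frozen layer
    err = H @ W_head - Y
    losses.append(float(np.mean(err ** 2)))
    # Gradient flows only into W_head; never touching W_patch is exactly
    # the stop-gradient placed right after the patch projection.
    W_head -= lr * (2 * H.T @ err / (n * d_out))
```

The loss still falls while W_patch stays at its random initialization, mirroring the observation that the first layer does not need to be learned.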
This trick alleviates the instability rather than solving it.
Although it works well, some open questions remain:
It is an interesting observation that it is not necessary to train the patch projection layer. In this case, random projection should be sufficient to preserve the information of the original patches.
The authors note that the first layer is not actually the key factor behind the instability; all layers are involved. The first layer is simply the only non-transformer layer, which makes it convenient to handle in isolation, and better solutions are expected in the future.
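The remark that a random projection preserves the information of the patches can be checked directly: when the embedding dimension is at least the patch dimension, a random Gaussian matrix has full row rank almost surely, so the patches are exactly recoverable via the pseudoinverse. A numpy sketch (the dimensions are illustrative, not ViT's actual ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d_patch, d_emb, n = 48, 192, 100  # patch dim < embedding dim, as in ViT
W = rng.normal(size=(d_patch, d_emb)) / np.sqrt(d_patch)  # random projection

patches = rng.normal(size=(n, d_patch))
embedded = patches @ W

# W has full row rank almost surely, so W @ pinv(W) = I and the original
# patches are recovered up to numerical error: no information is lost.
recovered = embedded @ np.linalg.pinv(W)
err = float(np.max(np.abs(recovered - patches)))
```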
Outlook
1. Self-supervised Transformers can achieve strong results using a contrastive learning framework.
2. Removing the position embedding from ViT only slightly affects accuracy, which suggests:
- ViT can learn strong representations without the positional inductive bias.
- Positional information has not been sufficiently exploited.
3. Better solutions to the instability problem are still needed.
4. Close the gap of pre-training methodology between vision and language.
Source: https://www.cnblogs.com/xiaoqian-shen/p/15596819.html