Code Notes 6: Fixing a hang (deadlock) when using torch.DataParallel

2022-05-07 16:02:41  Views: 245  Source: Internet

Tags: set, run, torch, four, dataparallel, num, each, deadlock



Here is roughly what happened: while training with torch.DataParallel, the run would hang, with no error and no sign of progress. This could happen right at the start of training, or when restarting a run. Killing the process printed:

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I then checked the GPU usage: the main GPU (the one that loads the model) already had some memory allocated, so the model had been loaded. The parallel GPUs, however, had almost no memory in use, meaning the data was never loaded, so the problem most likely lay in the dataloader.
Many of the fixes floating around online turned out to be useless; skim them if you like [1].
I then found a GitHub issue [2], tried the various suggestions in it, and one of them solved my kind of problem:

I recently came across a situation where I needed to load many small images. My workstation has a CPU with 22 cores and four GPUs, so I run four experiments with different random seeds, and each experiment uses one separate GPU. I found that the run time of four processes is almost four times the run time of a single process (no parallel benefit).

The model I train is relatively small, and the most time-consuming part actually comes from data loading. I have tried many different approaches, including:

- pin_memory = False/True
- num_workers = 0/1/8
- increasing ulimit
- staggering the start of each experiment
Thanks to the system-level diagnosis by @vjorlikowski, we found that if we set num_workers = 0/1/8, each process will try to use all CPU cores and viciously compete with the others for them.

Solution:
Use export OMP_NUM_THREADS=N, as described here,
or use torch.set_num_threads(N), as described here.
We set num_workers = 0 and N = 5 in our case, since we have 22 cores. The estimated run time of my program was reduced from 12 days to 1.5 days.
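As a sketch of the first option: OMP_NUM_THREADS has to be set before torch is imported, because the OpenMP runtime reads it once at startup. The value 5 below just follows the 22-core example from the quote; adjust it for your machine.

```python
import os

# OMP_NUM_THREADS must be set before `import torch`, because the OpenMP
# runtime reads it only once, at startup. Without a cap, each training
# process defaults to using every CPU core and they starve each other.
os.environ["OMP_NUM_THREADS"] = "5"  # N = 5 on the quote's 22-core machine

# import torch  # torch would now inherit the 5-thread cap at import time
```

Equivalently, run `export OMP_NUM_THREADS=5` in the shell before launching the training script.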

This is the line that fixed the deadlock for me:

torch.set_num_threads(N)
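For context, here is a minimal sketch of where that call sits in a training script. The tiny model and dataset are hypothetical placeholders, not from the original post; the point is that the thread cap comes before the DataLoader and the DataParallel wrapper are created.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Cap intra-op CPU threads early, before building the DataLoader or
# wrapping the model; by default each process tries to use every core.
torch.set_num_threads(5)  # N = 5 on a 22-core machine, per the quote

# Hypothetical tiny model and dataset, just to show the call order.
model = nn.Linear(8, 2)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))
loader = DataLoader(dataset, batch_size=8, num_workers=0)  # num_workers=0 per the fix

for inputs, targets in loader:
    outputs = model(inputs)  # forward pass only, as a smoke test
```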

References

[1]https://3water.com/article/8MTM21NDY22Ljg4
[2]https://github.com/pytorch/pytorch/issues/1355

Source: https://www.cnblogs.com/HumbleHater/p/16242850.html
