Large-Scale Face Classification: the allgather Operation (1)

2021-10-02 17:33:43

Tags: classification, torch, grad, gather, rank, face, allgather, local, dist

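In PyTorch, the all_gather operation does not propagate gradients. If the computation graph passes through all_gather and gradients still need to flow back to the corresponding pre-gather variable in each process, the operation has to be wrapped in a subclass of torch.autograd.Function.
https://pytorch.org/docs/stable/autograd.html introduces torch.autograd.Function.
https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd shows by example how to implement such a subclass.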
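
As a warm-up before the distributed case, here is a minimal sketch of the subclassing pattern those pages describe. The Square op is a hypothetical example chosen for illustration and is not part of the original post. forward saves what backward needs, and backward applies the chain rule to the incoming gradient.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical example op: y = x * x with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; backward needs it to compute dy/dx = 2x.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: upstream gradient times the local derivative.
        return grad_output * 2 * x


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)  # tensor([2., 4., 6.])
```

A custom Function is always invoked through .apply(), never by calling forward directly. The GatherLayer later in this post follows the same pattern.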

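The code below illustrates how all_gather behaves and how to make gradients flow back through it.
\(x, y, z\) are all 2x2 matrices, related by \(y=x+2, z=y*y\) (element-wise).
Next, data has to be exchanged between processes (the code below uses the NCCL backend of torch.distributed): z is gathered onto every process, which is the all_gather operation. The gathered matrices are then multiplied element-wise, and the mean of the product is taken.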

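Writing r for that mean, and using the left subscripts \(g_0\) and \(g_1\) to mark values on gpu0 and gpu1, r and its derivative with respect to y are: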
\(r=0.25({}_{g_0}y_{11}^2*{}_{g_1}y_{11}^2+{}_{g_0}y_{12}^2*{}_{g_1}y_{12}^2+ {}_{g_0}y_{21}^2*{}_{g_1}y_{21}^2+ {}_{g_0}y_{22}^2*{}_{g_1}y_{22}^2)\)

\(\frac{dr}{d{}_{g_0}y}= \begin{Bmatrix} 0.5{}_{g_0}y_{11}*{}_{g_1}y_{11}^2 & 0.5{}_{g_0}y_{12}*{}_{g_1}y_{12}^2 \\ 0.5{}_{g_0}y_{21}*{}_{g_1}y_{21}^2 & 0.5{}_{g_0}y_{22}*{}_{g_1}y_{22}^2 \end{Bmatrix}\)

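On gpu0 x is \(\begin{Bmatrix} 1 & 1 \\1 & 1 \end{Bmatrix}\) and on gpu1 x is \(\begin{Bmatrix} 0 & 0 \\0 & 0 \end{Bmatrix}\), so every entry of y is 3 on gpu0 and 2 on gpu1. From the formula, the derivative of r with respect to y on gpu0 is \(0.5*3*2^2\) per entry, i.e. \(\begin{Bmatrix}6 & 6 \\6 & 6\end{Bmatrix}\). By symmetry, the derivative of r with respect to y on gpu1 is \(0.5*2*3^2\) per entry, i.e. \(\begin{Bmatrix}9 & 9 \\9 & 9\end{Bmatrix}\).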
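
These values can be checked without any distributed setup. The following is a single-process sketch on CPU, assuming that two ordinary tensors (the hypothetical names x0 and x1) stand in for the per-GPU values, so autograd sees the whole computation at once.

```python
import torch

# Stand-ins for the per-GPU tensors: x is all ones on "gpu0", all zeros on "gpu1".
x0 = torch.ones(2, 2, requires_grad=True)
x1 = torch.zeros(2, 2, requires_grad=True)
y0, y1 = x0 + 2, x1 + 2
# y0 and y1 are not leaf tensors, so their gradients must be retained explicitly.
y0.retain_grad()
y1.retain_grad()

# r = mean over the four entries of (y0^2 * y1^2), as in the formula above.
r = ((y0 * y0) * (y1 * y1)).mean()
r.backward()

print(y0.grad)  # every entry is 6
print(y1.grad)  # every entry is 9
```

Both gradients match the analytic result, which gives a reference for the distributed run that follows.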

import os
import sys

import torch
import torch.distributed as dist

sys.path.append('./')
from utils import GatherLayer  # custom autograd.Function, listed later in this post


def test():
    #torch.manual_seed(0)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = True
    # Rank and world size are read from the environment set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    rank = dist.get_rank()
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    world_size = dist.get_world_size()
    torch.cuda.set_device(local_rank)
    print('world_size: {}, rank: {}, local_rank: {}'.format(world_size, rank, local_rank))

    # x is all ones on GPU 0 and all zeros on GPU 1.
    if local_rank == 0:
        x = torch.ones(2, 2, device='cuda', requires_grad=True)
    else:
        x = torch.zeros(2, 2, device='cuda', requires_grad=True)
    y = x + 2
    y.retain_grad()  # y is not a leaf tensor, so keep its gradient explicitly
    z = y * y

    # Variant (1): the built-in all_gather.
    z_gather = [torch.zeros_like(z) for _ in range(world_size)]
    dist.all_gather(z_gather, z)
    # Variant (2): gather through the autograd-aware GatherLayer instead.
    #z_gather = GatherLayer.apply(z)
    r = z_gather[0] * z_gather[1]

    out = r.mean()
    out.backward()
    print('rank:{}'.format(local_rank), y.grad)


if __name__ == '__main__':
    test()

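(1) The code above first uses the all_gather operation provided by PyTorch. Running it fails, because the gathered tensors carry no gradient information and out therefore has no grad_fn. The error message is: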

Traceback (most recent call last):
  File "test/test_all_gather.py", line 46, in <module>
    test()
  File "test/test_all_gather.py", line 36, in test
    out.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

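(2) See https://github.com/Spijkervet/SimCLR/blob/master/simclr/modules/gather.py. The GatherLayer defined there subclasses torch.autograd.Function, so gradients can still flow back after all_gather.

In the code above, enabling z_gather = GatherLayer.apply(z) in place of the dist.all_gather call makes backpropagation work. The printed gradients with respect to y are: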

world_size: 2, rank: 0, local_rank: 0
world_size: 2, rank: 1, local_rank: 1
rank:0 tensor([[6., 6.],
        [6., 6.]], device='cuda:0')
rank:1 tensor([[9., 9.],
        [9., 9.]], device='cuda:1')

The GatherLayer class is implemented as follows:

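import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """Gather tensors from all processes, supporting backward propagation."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # One receive buffer per process; all_gather fills them in rank order.
        output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
        dist.all_gather(output, input)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        # grads holds one gradient per gathered tensor; only the entry for this
        # process's own tensor is passed back to its local input.
        (input,) = ctx.saved_tensors
        grad_out = torch.zeros_like(input)
        grad_out[:] = grads[dist.get_rank()]
        return grad_out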
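
The class can also be smoke-tested without two GPUs. The sketch below is an assumption-laden illustration and not part of the original post: it uses the gloo backend on CPU with a one-process group initialized through a temporary file, and single_process_check is a hypothetical helper name. With a single process the gather is trivial, so this only checks that a gradient reaches y through GatherLayer. It does not reproduce the two-GPU numbers above.

```python
import os
import tempfile

import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """Same class as above, repeated so the snippet is self-contained."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
        dist.all_gather(output, input)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        (input,) = ctx.saved_tensors
        grad_out = torch.zeros_like(input)
        grad_out[:] = grads[dist.get_rank()]
        return grad_out


def single_process_check():
    """Hypothetical helper: run GatherLayer in a one-process gloo group on CPU."""
    store_file = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        backend="gloo", init_method="file://" + store_file, rank=0, world_size=1
    )
    try:
        y = torch.full((2, 2), 3.0, requires_grad=True)
        z = y * y
        gathered = GatherLayer.apply(z)  # tuple with one tensor per process
        gathered[0].mean().backward()
        return y.grad  # d mean(y^2) / dy = 0.5 * y
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(single_process_check())
```

With y equal to 3 everywhere, the returned gradient is 1.5 in every entry, which confirms that the graph stays connected across the gather.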

The following thread discusses gradient propagation through all_gather:
https://discuss.pytorch.org/t/will-dist-all-gather-break-the-auto-gradient-graph/47350

Source: https://www.cnblogs.com/wolfling/p/15350067.html
