THE FUNDAMENTALS OF AUTOGRAD


What Do We Need Autograd For?

A machine learning model is a function, with inputs and outputs. For this discussion, we’ll treat the inputs as an i-dimensional vector \vec{x}, with elements x_{i}. We can then express the model, M, as a vector-valued function of the input: \vec{y} = \vec{M}(\vec{x}). (We treat the value of M’s output as a vector because, in general, a model may have any number of outputs.)

Since we’ll mostly be discussing autograd in the context of training, our output of interest will be the model’s loss. The loss function L(\vec{y}) = L(\vec{M}(\vec{x})) is a single-valued scalar function of the model’s output. This function expresses how far off our model’s prediction was from a particular input’s ideal output. Note: After this point, we will often omit the vector sign where it should be contextually clear - e.g., y instead of \vec{y}.
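To make this setup concrete, here is a minimal sketch of the two functions in PyTorch. The model, weights, loss and target below are illustrative placeholders, not anything from the tutorial:

import torch

# A toy model M: an i-dimensional input x mapped to a vector output y,
# parameterized by learning weights W.
W = torch.randn(3, 4, requires_grad=True)

def M(x):
    return W @ x                       # \vec{y} = \vec{M}(\vec{x})

def L(y, y_ideal):
    return ((y - y_ideal) ** 2).sum()  # a single scalar: how far off the prediction is

x = torch.randn(4)
loss = L(M(x), torch.zeros(3))         # L(M(x)), the scalar we want to minimize
print(loss)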

In training a model, we want to minimize the loss. In the idealized case of a perfect model, that means adjusting its learning weights - that is, the adjustable parameters of the function - such that loss is zero for all inputs. In the real world, it means an iterative process of nudging the learning weights until we see that we get a tolerable loss for a wide variety of inputs.

How do we decide how far and in which direction to nudge the weights? We want to minimize the loss, which means making its first derivative with respect to the input equal to 0: \frac{\partial L}{\partial x} = 0.

In particular, the gradients over the learning weights are of interest to us - they tell us what direction to change each weight to get the loss function closer to zero.
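Concretely, a plain gradient-descent step (one standard way to use those gradients; the tutorial does not commit to a particular optimizer at this point) nudges each weight a small step against its gradient:

w_{i} \leftarrow w_{i} - \alpha \frac{\partial L}{\partial w_{i}}

where \alpha is a small learning rate that controls how far we nudge.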

The number of such local derivatives (each corresponding to a separate path through the model’s computation graph) tends to go up exponentially with the depth of a neural network, and so does the complexity of computing them. This is where autograd comes in: It tracks the history of every computation. Every computed tensor in your PyTorch model carries a history of its input tensors and the function used to create it. Combined with the fact that PyTorch functions meant to act on tensors each have a built-in implementation for computing their own derivatives, this greatly speeds the computation of the local derivatives needed for learning.

import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
print(a)

Out:

tensor([0.0000, 0.2618, 0.5236, 0.7854, 1.0472, 1.3090, 1.5708, 1.8326, 2.0944,
        2.3562, 2.6180, 2.8798, 3.1416, 3.4034, 3.6652, 3.9270, 4.1888, 4.4506,
        4.7124, 4.9742, 5.2360, 5.4978, 5.7596, 6.0214, 6.2832],
       requires_grad=True)

I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), a torch.tensor has an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.

So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.

However, torch.tensors are designed to be used in the context of gradient-descent optimization, and therefore they hold not only a tensor with numeric values, but (more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.
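A minimal sketch of that idea (the variable names and the tiny loss below are made up for illustration):

import torch

w = torch.tensor(2.0, requires_grad=True)  # an independent variable autograd will track
x = torch.tensor(3.0)                      # plain data; no graph needed for it
loss = (w * x - 1.0) ** 2                  # the forward pass records the computational graph
loss.backward()                            # the chain rule is applied backward through that graph
print(w.grad)                              # dL/dw = 2 * (w*x - 1) * x = 30.0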

As mentioned before, an np.ndarray object does not have this extra "computational graph" layer; therefore, when converting a torch.tensor to an np.ndarray you must explicitly remove the computational graph of the tensor using the detach() command.
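For example (a small illustration, assuming a tracked tensor t):

t = torch.ones(3, requires_grad=True)
# t.numpy() would raise a RuntimeError, because t still carries its graph
t_np = t.detach().numpy()   # drop the graph first, then convert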

Note that if you wish, for some reason, to use PyTorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.

import numpy as np
import torch

with torch.no_grad():
  x_t = torch.rand(3, 4)
  y_np = np.ones((4, 2), dtype=np.float32)
  x_t @ torch.from_numpy(y_np)  # matrix product in torch
  np.dot(x_t.numpy(), y_np)  # the same matrix product in numpy
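The grad_fn values shown next come from chaining a few operations on the tensor a created earlier. The exact code did not survive in this copy, but a chain that matches the printed grad_fn values (and the later references to b, c and d) would be:

b = torch.sin(a)   # grad_fn=<SinBackward0>
c = 2 * b          # grad_fn=<MulBackward0>
d = c + 1          # grad_fn=<AddBackward0>
out = d.sum()      # grad_fn=<SumBackward0>
print(b)
print(c)
print(d)
print(out)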
Out (showing only the grad_fn reported for each printed tensor):

grad_fn=<SinBackward0>
grad_fn=<MulBackward0>
grad_fn=<AddBackward0>
grad_fn=<SumBackward0>

This grad_fn gives us a hint that when we execute the backpropagation step and compute gradients, we’ll need to compute the derivative of sin(x) for all this tensor’s inputs.

Each grad_fn stored with our tensors allows you to walk the computation all the way back to its inputs with its next_functions property. We can see below that drilling down on this property on d shows us the gradient functions for all the prior tensors. Note that a.grad_fn is reported as None, indicating that this was an input to the function with no history of its own.

print('d:')
print(d.grad_fn)
print(d.grad_fn.next_functions)
print(d.grad_fn.next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)
print('\nc:')
print(c.grad_fn)
print('\nb:')
print(b.grad_fn)
print('\na:')
print(a.grad_fn)

Out:

d:
<AddBackward0 object at 0x7ff3d80a3518>
((<MulBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<SinBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<AccumulateGrad object at 0x7ff3d80a3518>, 0),)
()

c:
<MulBackward0 object at 0x7ff3d80a35f8>

b:
<SinBackward0 object at 0x7ff3d80a3518>

a:
None

Adding a constant, as we did to compute d, does not change the derivative. That leaves c = 2 * b = 2 * sin(a), the derivative of which should be 2 * cos(a). Looking at the gradients that backpropagation computes for a (see the check below), that’s just what we see.

Be aware that only leaf nodes of the computation have their gradients computed. If you tried, for example, print(c.grad), you’d get back None. In this simple example, only the input is a leaf node, so only it has gradients computed.
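A quick check of both points, assuming the chain of tensors reconstructed above:

out.backward()                                   # compute gradients on the leaves
print(a.grad)                                    # should match 2 * cos(a)
print(torch.allclose(a.grad, 2 * torch.cos(a)))  # True
print(c.grad)                                    # None - c is an intermediate node, not a leaf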

One important thing about the process: after calling optimizer.step(), you need to call optimizer.zero_grad(); otherwise, every time you run loss.backward(), the gradients on the learning weights will accumulate.
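A skeletal training step showing the usual pairing (the model, optimizer and data below are placeholders, not from the article):

import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for x, target in [(torch.randn(4), torch.randn(1))]:  # stand-in for a real data loader
    optimizer.zero_grad()             # clear gradients left over from the previous step
    loss = loss_fn(model(x), target)
    loss.backward()                   # accumulate fresh gradients into each parameter's .grad
    optimizer.step()                  # update the weights using those gradients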

If you only need autograd turned off temporarily, a better way is to use the torch.no_grad() context manager.

There’s a corresponding context manager, torch.enable_grad(), for turning autograd on when it isn’t already. It may also be used as a decorator.
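A small illustration of both managers (the helper function here is made up):

import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2
print(y.requires_grad)   # False - no graph was recorded inside the block

@torch.enable_grad()     # enable_grad used as a decorator
def double(t):
    return t * 2

with torch.no_grad():
    z = double(x)        # tracking is switched back on inside the decorated function
print(z.requires_grad)   # True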

Finally, you may have a tensor that requires gradient tracking, but you want a copy that does not. For this we have the Tensor object’s detach() method - it creates a copy of the tensor that is detached from the computation history:

x = torch.rand(5, requires_grad=True)
y = x.detach()

print(x)
print(y)

Out:

tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892], requires_grad=True)
tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892])

Detaching like this is what lets us graph some of our tensors with matplotlib. This is because matplotlib expects a NumPy array as input, and the implicit conversion from a PyTorch tensor to a NumPy array is not enabled for tensors with requires_grad=True. Making a detached copy lets us move forward.
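For instance (assuming matplotlib is available), plotting the detached copy works, while converting the tracked tensor would fail:

import matplotlib.pyplot as plt

plt.plot(y.numpy())      # fine: y is detached, so conversion to NumPy is allowed
# plt.plot(x.numpy())    # RuntimeError: can't call numpy() on a tensor that requires grad
plt.show()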

Jacobian

>>> inputs = (torch.rand(3), torch.rand(3)) # arguments for the function
>>> print(inputs)
(tensor([0.7074, 0.9178, 0.3003]), tensor([0.9081, 0.2903, 0.7643]))
>>> def exp_adder(x, y):
...     return 2 * x.exp() + 3 * y
...
>>> torch.autograd.functional.jacobian(exp_adder, inputs)
(tensor([[4.0574, 0.0000, 0.0000],
        [0.0000, 5.0077, 0.0000],
        [0.0000, 0.0000, 2.7005]]), tensor([[3., 0., 0.],
        [0., 3., 0.],
        [0., 0., 3.]]))

This can compute gradients!

Hessian

The functional, higher-level API for autograd also provides a hessian() helper; see the documentation:

Automatic differentiation package - torch.autograd — PyTorch 1.10.1 documentation: https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api
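A minimal sketch of torch.autograd.functional.hessian, with an arbitrary scalar-valued function chosen here for illustration:

import torch

def scalar_fn(x):
    return (x ** 3).sum()     # hessian() needs a function with a single scalar output

x = torch.tensor([1.0, 2.0])
print(torch.autograd.functional.hessian(scalar_fn, x))
# tensor([[ 6.,  0.],
#         [ 0., 12.]])        # the Hessian of sum(x**3) is diag(6 * x)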
