Tags: functions, tensor, FUNDAMENTALS, AUTOGRAD, torch, print, grad, fn
What Do We Need Autograd For?
A machine learning model is a function, with inputs and outputs. For this discussion, we’ll treat the inputs as an i-dimensional vector \vec{x}, with elements x_{i}. We can then express the model, M, as a vector-valued function of the input: \vec{y} = \vec{M}(\vec{x}). (We treat the value of M’s output as a vector because in general, a model may have any number of outputs.)
Since we’ll mostly be discussing autograd in the context of training, our output of interest will be the model’s loss. The loss function L(\vec{y}) = L(\vec{M}(\vec{x})) is a single-valued scalar function of the model’s output. This function expresses how far off our model’s prediction was from a particular input’s ideal output. Note: After this point, we will often omit the vector sign where it should be contextually clear - e.g., y instead of \vec{y}.
In training a model, we want to minimize the loss. In the idealized case of a perfect model, that means adjusting its learning weights - that is, the adjustable parameters of the function - such that loss is zero for all inputs. In the real world, it means an iterative process of nudging the learning weights until we see that we get a tolerable loss for a wide variety of inputs.
How do we decide how far and in which direction to nudge the weights? We want to minimize the loss, which means making its first derivative with respect to the input equal to 0: \frac{\partial L}{\partial x} = 0.
In particular, the gradients over the learning weights are of interest to us - they tell us what direction to change each weight to get the loss function closer to zero.
Computing those gradients means applying the chain rule through every operation in the model, which produces many local partial derivatives. Since the number of such local derivatives (each corresponding to a separate path through the model’s computation graph) will tend to go up exponentially with the depth of a neural network, so does the complexity in computing them. This is where autograd comes in: it tracks the history of every computation. Every computed tensor in your PyTorch model carries a history of its input tensors and the function used to create it. Combined with the fact that PyTorch functions meant to act on tensors each have a built-in implementation for computing their own derivatives, this greatly speeds the computation of the local derivatives needed for learning.
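As a rough illustration of the "nudging" described above, here is a minimal sketch of one gradient step on a toy model. The model y = w * x, the values, and the learning rate lr are all hypothetical choices made for this example, not something taken from the text.

```python
import torch

# Hypothetical toy model: y = w * x, with a single learning weight w.
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)
target = torch.tensor(6.0)

loss = (w * x - target) ** 2   # squared error: (3 - 6)^2 = 9
loss.backward()                # autograd fills w.grad with dL/dw

# By hand: dL/dw = 2 * (w*x - target) * x = 2 * (-3) * 3 = -18
print(w.grad)

# Nudge the weight against the gradient (lr is an arbitrary, assumed value).
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad
print(w)
```

The gradient is negative, so the update increases w, moving the prediction toward the target.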
import math
import torch
a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
print(a)
Out:
tensor([0.0000, 0.2618, 0.5236, 0.7854, 1.0472, 1.3090, 1.5708, 1.8326, 2.0944, 2.3562, 2.6180, 2.8798, 3.1416, 3.4034, 3.6652, 3.9270, 4.1888, 4.4506, 4.7124, 4.9742, 5.2360, 5.4978, 5.7596, 6.0214, 6.2832], requires_grad=True)
I think the most crucial point to understand here is the difference between a torch.Tensor and an np.ndarray: while both objects are used to store n-dimensional matrices (aka "tensors"), a torch.Tensor has an additional "layer" that stores the computational graph leading to the associated n-dimensional matrix.
So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.Tensor can be used interchangeably.
However, torch.Tensor objects are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.
As mentioned before, an np.ndarray object does not have this extra "computational graph" layer. Therefore, when converting a torch.Tensor that requires gradients to an np.ndarray, you must explicitly remove the computational graph of the tensor using the detach() method.
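A minimal sketch of that conversion rule, using a hypothetical tensor t: calling numpy() directly on a tensor that requires gradients raises a RuntimeError, while detaching first works.

```python
import torch

t = torch.ones(3, requires_grad=True)

try:
    t.numpy()                  # fails: the tensor is still tracked by autograd
    converted_directly = True
except RuntimeError:
    converted_directly = False

arr = t.detach().numpy()       # works: detach() drops the graph first
print(converted_directly, type(arr).__name__)
```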
Note that if you wish, for some reason, to use PyTorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.Tensor and np.ndarray objects can be used interchangeably.
import numpy as np
import torch

with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # matrix product in torch
    np.dot(x_t.numpy(), y_np)  # the same product in numpy
The grad_fn fields printed for the intermediate tensors:
grad_fn=<SinBackward0>
grad_fn=<AddBackward0>
grad_fn=<MulBackward0>
grad_fn=<SumBackward0>
A grad_fn such as SinBackward0 gives us a hint that when we execute the backpropagation step and compute gradients, we’ll need to compute the derivative of sin(x) for all this tensor’s inputs.
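The chain of tensors these grad_fn values belong to can be sketched as follows. This is a reconstruction from the surrounding text, with assumptions: the constant added to produce d is taken to be 1 (the text only says "a constant"), and the name out for the summed result is also assumed.

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
b = torch.sin(a)   # records a sine backward function
c = 2 * b          # records a multiplication backward function
d = c + 1          # assumption: the source only says "a constant"
out = d.sum()      # assumption: a scalar reduction, so backward() needs no argument

# Print the class name of each tensor's grad_fn.
for name, t in [("b", b), ("c", c), ("d", d), ("out", out)]:
    print(name, type(t.grad_fn).__name__)
```

Each operation attaches its own backward function to its result, which is what lets autograd replay the chain in reverse.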
Each grad_fn stored with our tensors allows you to walk the computation all the way back to its inputs with its next_functions property. We can see below that drilling down on this property on d shows us the gradient functions for all the prior tensors. Note that a.grad_fn is reported as None, indicating that this was an input to the function with no history of its own.
print('d:')
print(d.grad_fn)
print(d.grad_fn.next_functions)
print(d.grad_fn.next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)
print('\nc:')
print(c.grad_fn)
print('\nb:')
print(b.grad_fn)
print('\na:')
print(a.grad_fn)
Out:
d:
<AddBackward0 object at 0x7ff3d80a3518>
((<MulBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<SinBackward0 object at 0x7ff3d80a3588>, 0), (None, 0))
((<AccumulateGrad object at 0x7ff3d80a3518>, 0),)
()

c:
<MulBackward0 object at 0x7ff3d80a35f8>

b:
<SinBackward0 object at 0x7ff3d80a3518>

a:
None
Adding a constant, as we did to compute d, does not change the derivative. That leaves c = 2 * b = 2 * sin(a), the derivative of which should be 2 * cos(a). Looking at the graph above, that’s just what we see.
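The claim can also be checked numerically. In this sketch the added constant is assumed to be 1, and the result is summed to a scalar before calling backward(); both are choices made for the example.

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
d = 2 * torch.sin(a) + 1      # the constant 1 is an assumed value
d.sum().backward()            # populates a.grad with d(sum)/da

# The analytic derivative: the constant drops out, leaving 2 * cos(a).
expected = 2 * torch.cos(a.detach())
print(torch.allclose(a.grad, expected))
```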
Be aware that only leaf nodes of the computation have their gradients computed. If you tried, for example, print(c.grad), you’d get back None. In this simple example, only the input is a leaf node, so only it has gradients computed.
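A small sketch of the leaf/non-leaf distinction, using hypothetical tensors. The second half uses Tensor.retain_grad(), which is not mentioned in the text above; it is shown only as one way to keep the gradient of an intermediate tensor if you need it.

```python
import warnings
import torch

a = torch.ones(3, requires_grad=True)   # leaf: created directly by the user
c = 2 * torch.sin(a)                    # non-leaf: the result of an operation
c.sum().backward()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")     # reading .grad on a non-leaf emits a warning
    c_grad = c.grad

print(a.is_leaf, c.is_leaf)
print(a.grad is not None, c_grad is None)

# Opting in: ask autograd to keep the gradient of an intermediate tensor.
c2 = 2 * torch.sin(a)
c2.retain_grad()
c2.sum().backward()
print(c2.grad)                          # gradient of sum(c2) w.r.t. c2 is all ones
```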
One important thing about the process: After calling optimizer.step(), you need to call optimizer.zero_grad(), or else every time you run loss.backward(), the gradients on the learning weights will accumulate.
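The accumulation is easy to see on a single hypothetical weight, without an optimizer. Here the in-place w.grad.zero_() stands in for what optimizer.zero_grad() is meant to achieve for each parameter; that substitution is a simplification made for this sketch.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
first = w.grad.clone()     # d(3w)/dw = 3

(w * 3).backward()         # no zeroing in between
second = w.grad.clone()    # 3 + 3 = 6: the new gradient was added to the old one

w.grad.zero_()             # reset the stored gradient by hand
(w * 3).backward()
third = w.grad.clone()     # back to 3
print(first, second, third)
```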
If you only need autograd turned off temporarily, a better way is to use the torch.no_grad() context manager.
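A minimal sketch of turning autograd off for one block, with hypothetical tensors a and b: results computed inside the with block do not track gradients, while those computed before and after it do.

```python
import torch

a = torch.ones(2, 3, requires_grad=True)
b = 2 * torch.ones(2, 3, requires_grad=True)

c1 = a + b                 # tracked
with torch.no_grad():
    c2 = a + b             # not tracked: no graph is recorded here
c3 = a * b                 # tracked again once the block exits

print(c1.requires_grad, c2.requires_grad, c3.requires_grad)
```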
There’s a corresponding context manager, torch.enable_grad(), for turning autograd on when it isn’t already. It may also be used as a decorator.
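Both uses can be sketched as follows. The function name doubler and the tensors are hypothetical; the point is only that torch.enable_grad() re-enables tracking inside an enclosing torch.no_grad() block.

```python
import torch

x = torch.ones(2, requires_grad=True)

# As a context manager, nested inside no_grad().
with torch.no_grad():
    with torch.enable_grad():
        y = x * 2          # tracked: enable_grad overrides the outer no_grad
    z = x * 2              # not tracked

# As a decorator on a (hypothetical) function.
@torch.enable_grad()
def doubler(t):
    return t * 2

with torch.no_grad():
    w = doubler(x)         # tracked, because the function body runs with grad on

print(y.requires_grad, z.requires_grad, w.requires_grad)
```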
Finally, you may have a tensor that requires gradient tracking, but you want a copy that does not. For this we have the Tensor object’s detach() method - it returns a new tensor that shares the same data but is detached from the computation history:
x = torch.rand(5, requires_grad=True)
y = x.detach()
print(x)
print(y)
Out:
tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892], requires_grad=True)
tensor([0.4407, 0.8998, 0.4998, 0.0362, 0.1892])
We did this above when we wanted to graph some of our tensors. This is because matplotlib expects a NumPy array as input, and the implicit conversion from a PyTorch tensor to a NumPy array is not enabled for tensors with requires_grad=True. Making a detached copy lets us move forward.
Jacobian
>>> inputs = (torch.rand(3), torch.rand(3)) # arguments for the function
>>> print(inputs)
(tensor([0.7074, 0.9178, 0.3003]), tensor([0.9081, 0.2903, 0.7643]))
>>> def exp_adder(x, y):
...     return 2 * x.exp() + 3 * y
...
>>> torch.autograd.functional.jacobian(exp_adder, inputs)
(tensor([[4.0574, 0.0000, 0.0000],
        [0.0000, 5.0077, 0.0000],
        [0.0000, 0.0000, 2.7005]]), tensor([[3., 0., 0.],
        [0., 3., 0.],
        [0., 0., 3.]]))
This can be used to compute gradients!
Hessian
torch.autograd.functional is the functional, higher-level API for autograd.
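As a hedged sketch of the Hessian counterpart to the Jacobian example, torch.autograd.functional.hessian takes a scalar-valued function and its input. The function cube_sum and the input values are hypothetical, chosen so the result is easy to check by hand: for f(x) = sum(x^3), the Hessian is diagonal with entries 6 * x.

```python
import torch

def cube_sum(x):
    # Scalar-valued function; its Hessian is diag(6 * x).
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0, 3.0])
H = torch.autograd.functional.hessian(cube_sum, x)
print(H)
```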
Source: https://blog.csdn.net/weixin_45649658/article/details/122240691