MindSpore error: Select GPU kernel op * fail! Incompatible data type

2022-07-17 17:34:31

Tags: float16 kernel 00 Tensor Incompatible self error mindspore float32

1 Error description

1.1 System environment

Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.5.2
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script builds a network from a single BatchNorm operator and uses it to normalize a Tensor. The script is as follows:

import numpy as np
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor


class Net(nn.Cell):
    """Single-operator network that wraps ops.BatchNorm."""

    def __init__(self):
        super(Net, self).__init__()
        self.batch_norm = ops.BatchNorm()

    def construct(self, input_x, scale, bias, mean, variance):
        output = self.batch_norm(input_x, scale, bias, mean, variance)
        return output


net = Net()
# All five inputs are float16, which triggers the error shown in 1.2.2.
input_x = Tensor(np.ones([2, 2]), mindspore.float16)
scale = Tensor(np.ones([2]), mindspore.float16)
bias = Tensor(np.ones([2]), mindspore.float16)
mean = Tensor(np.ones([2]), mindspore.float16)
variance = Tensor(np.ones([2]), mindspore.float16)
output = net(input_x, scale, bias, mean, variance)
print(output)

1.2.2 Error

The error message is as follows:

Traceback (most recent call last):
  File "116945.py", line 22, in <module>
    output = net(input_x, scale, bias, mean, variance)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 407, in __call__
    out = self.compile_and_run(*inputs)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 734, in compile_and_run
    self.compile(*inputs)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 721, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/common/api.py", line 551, in compile
    result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name)
TypeError: mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:355 PrintUnsupportedTypeException] Select GPU kernel op[BatchNorm] fail! Incompatible data type!
The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

Cause analysis

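Looking at the error message, the TypeError reads: Select GPU kernel op[BatchNorm] fail! Incompatible data type!

The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

In other words, the GPU BatchNorm kernel does not support the combination of input data types that was passed in. The message also lists the combinations that are supported: either all five inputs are float32, or input_x is float16 and the other four inputs are float32. The script passes float16 for every input, which is why the error is raised.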
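The rule described above can be pictured as a simple lookup. The following is a minimal sketch in plain Python, not MindSpore's actual kernel selection code; the names SUPPORTED_INPUT_DTYPES and is_supported are hypothetical, and the two combinations are copied from the error message.

```python
# Supported input dtype combinations for the GPU BatchNorm kernel,
# copied from the error message. Order: input_x, scale, bias, mean, variance.
SUPPORTED_INPUT_DTYPES = [
    ("float32", "float32", "float32", "float32", "float32"),
    ("float16", "float32", "float32", "float32", "float32"),
]


def is_supported(input_dtypes):
    """Return True if the dtype sequence matches one of the listed combinations."""
    return tuple(input_dtypes) in SUPPORTED_INPUT_DTYPES


failing = ["float16"] * 5               # what the original script passes
fixed = ["float16"] + ["float32"] * 4   # input_x float16, the rest float32

print(is_supported(failing))  # False
print(is_supported(fixed))    # True
```

Note that there is no listed combination in which scale, bias, mean, or variance is float16, so only input_x may keep half precision.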

2 Solution

Given the cause identified above, the fix is straightforward:

import numpy as np
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor


class Net(nn.Cell):
    """Single-operator network that wraps ops.BatchNorm."""

    def __init__(self):
        super(Net, self).__init__()
        self.batch_norm = ops.BatchNorm()

    def construct(self, input_x, scale, bias, mean, variance):
        output = self.batch_norm(input_x, scale, bias, mean, variance)
        return output


net = Net()
# Only input_x stays float16; scale, bias, mean and variance are float32.
input_x = Tensor(np.ones([2, 2]), mindspore.float16)
scale = Tensor(np.ones([2]), mindspore.float32)
bias = Tensor(np.ones([2]), mindspore.float32)
mean = Tensor(np.ones([2]), mindspore.float32)
variance = Tensor(np.ones([2]), mindspore.float32)

output = net(input_x, scale, bias, mean, variance)
print(output)

The script now runs successfully and produces the following output:

output: (Tensor(shape=[2, 2], dtype=Float16, value=
[[ 1.0000e+00,  1.0000e+00],
 [ 1.0000e+00,  1.0000e+00]]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]))
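The first tensor in the output is all ones, which can be checked by hand. The sketch below is a plain-Python reference calculation, not MindSpore code. It assumes the operator runs in inference mode and uses an epsilon of 1e-5, which is my reading of the BatchNorm API defaults rather than something stated in this post; the function name batch_norm_inference is hypothetical.

```python
import math


def batch_norm_inference(x, scale, bias, mean, variance, epsilon=1e-5):
    """Apply y = scale * (x - mean) / sqrt(variance + epsilon) + bias per channel.

    x is a nested list of shape [N, C]; the other arguments have length C.
    The default epsilon is an assumption, see the text above.
    """
    return [
        [
            scale[c] * (row[c] - mean[c]) / math.sqrt(variance[c] + epsilon) + bias[c]
            for c in range(len(row))
        ]
        for row in x
    ]


x = [[1.0, 1.0], [1.0, 1.0]]
ones = [1.0, 1.0]
print(batch_norm_inference(x, ones, ones, ones, ones))  # [[1.0, 1.0], [1.0, 1.0]]
```

Because every input equals its channel mean, the normalized term is zero and the result is just the bias, regardless of the exact epsilon.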

3 Summary

Steps for locating the problem:

1. Find the line of user code that raised the error: output = net(input_x, scale, bias, mean, variance)

2. Use the keywords in the error log to narrow down the problem: The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

3. Pay close attention to whether the variables are defined and initialized correctly, in particular the data type of each operator input.
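Step 2 can be done by eye, but for long messages a small helper may make the mismatch easier to see. The following is only a sketch that assumes the message keeps the exact layout shown in this post; parse_dtype_error, mismatches, and INPUT_NAMES are hypothetical names, not part of MindSpore.

```python
import re

# Input order of ops.BatchNorm as used in the script.
INPUT_NAMES = ("input_x", "scale", "bias", "mean", "variance")


def parse_dtype_error(message):
    """Return (supported_input_combos, actual_input_combo) parsed from the message."""
    supported_part, _, actual_part = message.partition("but get")
    pattern = re.compile(r"\bin\s*\[([^\]]*)\]")
    supported = [tuple(group.split()) for group in pattern.findall(supported_part)]
    match = pattern.search(actual_part)
    actual = tuple(match.group(1).split()) if match else ()
    return supported, actual


def mismatches(expected, actual):
    """List (name, got, want) for inputs whose dtype differs from one supported combo."""
    return [
        (name, got, want)
        for name, got, want in zip(INPUT_NAMES, actual, expected)
        if got != want
    ]


MESSAGE = (
    "The supported data types are "
    "in[float32 float32 float32 float32 float32], "
    "out[float32 float32 float32 float32 float32]; "
    "in[float16 float32 float32 float32 float32], "
    "out[float16 float32 float32 float32 float32]; , "
    "but get in [float16 float16 float16 float16 float16 ] "
    "out [float16 float16 float16 float16 float16 ]"
)

supported, actual = parse_dtype_error(MESSAGE)
for combo in supported:
    # Show which inputs would have to change to reach this supported combination.
    print(combo, "->", mismatches(combo, actual))
```

For the second supported combination the helper reports scale, bias, mean, and variance as the inputs to change, which matches the fix applied in section 2.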

4 References

4.1 BatchNorm operator API

Source: https://www.cnblogs.com/skytier/p/16487821.html
