ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

为x86 CPU自动调度神经网络

2020-12-24 06:32:06  阅读:233  来源: 互联网

标签:ax2 ax3 ax0 ax1 x86 PLACEHOLDER 神经网络 placeholder CPU


x86 CPU自动调度神经网络

对特定设备和工作负载进行自动调试对于获得最佳性能至关重要。这是有关如何使用自动调度器为x86 CPU调试整个神经网络的文档。

为了自动调试神经网络,将网络划分为小的子图,并对其进行独立调试。每个子图被视为一个搜索任务。任务调度程序可以对时间进行分片,并为这些任务动态分配时间资源。任务调度程序可以预测每个任务对端到端执行时间的影响,并优先调度可以最大程度地减少执行时间的任务。

对于每个子图,使用compute声明tvm/python/topi获取张量表达式形式的计算DAG。然后,使用自动调度器来构造此DAG的搜索空间,并搜索良好的调度(低级优化)。

与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何调度模板。换句话说,自动调度程序仅在tvm/python/topi中使用计算声明,而不使用现有的调度模板。

注意,本文无法在Windows或最新版本的macOS上运行。要使其运行,需要将本文的内容包装在一个块中。if __name__ == "__main__":

import numpy as np
 
import tvm
from tvm import relay, auto_scheduler
import tvm.relay.testing
from tvm.contrib import graph_runtime

定义网络

首先,需要使用中继前端API定义网络。可以加载一些预定义的网络tvm.relay.testing。还可以从MXNet,ONNX,PyTorch和TensorFlow加载模型。

对于卷积神经网络,尽管自动调度程序可以在任何布局下正常工作,但使用NHWC布局通常可以实现最佳性能。还使用自动调度程序对NHWC布局实施了更多优化。因此,建议将模型转换为NHWC布局以使用自动调度程序。可以在TVM中使用ConvertLayout pass进行布局转换。

def get_network(name, batch_size, layout="NHWC", dtype="float32"):
    """Get the symbol definition and random weight of a network"""
 
    # auto-scheduler prefers NHWC layout
    if layout == "NHWC":
        image_shape = (224, 224, 3)
    elif layout == "NCHW":
        image_shape = (3, 224, 224)
    else:
        raise ValueError("Invalid layout: " + layout)
 
    input_shape = (batch_size,) + image_shape
    output_shape = (batch_size, 1000)
 
    if name.startswith("resnet-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name.startswith("resnet3d-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(
            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
        )
    elif name == "squeezenet_v1.1":
        assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
        mod, params = relay.testing.squeezenet.get_workload(
            version="1.1",
            batch_size=batch_size,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model
 
        assert layout == "NCHW"
 
        block = get_model("resnet50_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
 
    return mod, params, input_shape, output_shape
 
 
# Define the neural network and compilation target.
# If the target machine supports avx512 instructions, replace the
# "llvm -mcpu=core-avx2" with "llvm -mcpu=skylake-avx512"
network = "resnet-50"
batch_size = 1
layout = "NHWC"
target = tvm.target.Target("llvm -mcpu=core-avx2")
dtype = "float32"
log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)

提取搜索任务

接下来,从网络中提取搜索任务及其权重。任务的权重是整个网络中任务子图的出现次数。通过使用权重,可以将网络的端到端延迟近似为sum(latency[t] * weight[t]),其中latency[t]是任务的延迟,weight[t]是任务的权重。任务调度程序只会优化此目标。

# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
 
for idx, task in enumerate(tasks):
    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
    print(task.compute_dag)

出:

Extract tasks...
========== Task 0  (workload key: ["b32ed43fb351136894c322ee49097a1a"]) ==========
placeholder = PLACEHOLDER [1, 1000]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
T_softmax_expsum(i0) += T_softmax_exp[i0, k]
T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])
 
========== Task 1  (workload key: ["6129df1a3d5f6326c8393a8d17160199"]) ==========
placeholder = PLACEHOLDER [1, 2048]
placeholder = PLACEHOLDER [1000, 2048]
compute(z, y, x) += (placeholder[z, ((k*16) + x)]*placeholder[y, ((k*16) + x)])
compute(y, x) += compute[y, x, kk]
placeholder = PLACEHOLDER [1000]
T_add(ax0, ax1) = (compute[ax0, ax1] + placeholder[ax1])
 
========== Task 2  (workload key: ["36ee2798ed60bae3bcd1bb89a0285fe8"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 2048]
tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))
 
========== Task 3  (workload key: ["dcf6fcf5f56fa614bf9aef0c82382caf"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 2048]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 7, 7, 2048]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 2048]
T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*placeholder[ax0, 0, 0, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 2048]
T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 4  (workload key: ["7e3f0cf5a6dd80d36dab1a3dad92674a"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 512, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 5  (workload key: ["e0a9eb3795b531085e0ebb772e7e800c"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 2048]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 2048, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 6  (workload key: ["03614e726dc588d11887eb0953a77e53"]) ==========
placeholder = PLACEHOLDER [1, 7, 7, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 2048]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 7, 7, 2048]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 7  (workload key: ["7657f886f5e9d8b5f19a5fd2c5b90d8d"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 8  (workload key: ["7e09b626cf077cd419190fee02091dd6"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 14, 14, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 9  (workload key: ["95bf49cc8cf7a351e974b2359702aac0"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 256, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 10  (workload key: ["e043f834cc7f19597227e09dc7f59503"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 11  (workload key: ["cd7c4a374fb2bbc0d075c8cae638ad14"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 14, 14, 1024]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 12  (workload key: ["1dce2c5e4269b8a12dfc50cd4dd23ff1"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 13  (workload key: ["d3b36ce001dc24d693facfbdae1979b4"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 28, 28, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 14  (workload key: ["0fb1dfcdb5b755e2dab290ed0129dcf2"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 128, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 15  (workload key: ["45acfc473c772458684f36a34549d8aa"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 16  (workload key: ["5e3ceb6e23ae8c351d5a1770d5fc6c7c"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 128, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 28, 28, 512]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 17  (workload key: ["a085717fb3dcb046e5c4c2c04d3dc541"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 18  (workload key: ["691feef049c8693bbe91bd5e7c9cdf34"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 56, 56, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 19  (workload key: ["a9e632e5167afb60fbe29e7aeef1d152"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
placeholder = PLACEHOLDER [3, 3, 64, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 20  (workload key: ["b51e06c1131d4cded40d1b215f722a4e"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 21  (workload key: ["8fcee68a4342c38248a827f1c6c69177"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 56, 56, 256]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
 
========== Task 22  (workload key: ["8dd7d81db440763f622f03fdc99e6d46"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 23  (workload key: ["ba2026d923536b75e9b4faed89287d5f"]) ==========
placeholder = PLACEHOLDER [1, 112, 112, 64]
pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), placeholder[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f)
tensor(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + dh), ((ax2*2) + dw), ax3]
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 24  (workload key: ["a0eb8d6048282a4a0986cc2ccf14eaa2"]) ==========
placeholder = PLACEHOLDER [1, 224, 224, 3]
PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), placeholder[i0, (i1 - 3), (i2 - 3), i3], 0f)
placeholder = PLACEHOLDER [7, 7, 3, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
placeholder = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
 
========== Task 25  (workload key: ["45b4de07687dee43ee1cbde9f516b2bf"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 256]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
 
========== Task 26  (workload key: ["b2010aa63c95dedf1f58f3fe8bc78634"]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 256]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 256, 512]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 
========== Task 27  (workload key: ["4d7e646d99bfa3cea8245bd7100369cb"]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 512]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 512, 1024]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
 
========== Task 28  (workload key: ["537c8642716948c33a6eaaabc86b159d"]) ==========
placeholder = PLACEHOLDER [1, 14, 14, 1024]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 1024, 2048]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

开始Tuning调试

现在,设置一些选项来优化和启动搜索任务

  • num_measure_trials是在调试期间可以使用的测量试验次数。可以将其设置为较小的数字(例如200)以进行快速演示。实际上,建议将其设置为800 * len(tasks),通常足以使搜索收敛。例如,resnet-50中有29个任务,可以将其设置为20000。可以根据时间预算调试此参数。
  • 此外,还用RecordToFile将测量记录转储到日志文件中,这些测量记录可用于最好地查询历史记录,恢复搜索以及以后进行更多分析。
  • 有关更多参数, 请参见auto_scheduler.TuningOptions, auto_scheduler.LocalRunner
def run_tuning():
    print("Begin tuning...")
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # change this to 20000 to achieve the best performance
        runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
 
    tuner.tune(tune_option)
 
 
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
 
# run_tuning()

注意

tuning调试期间说明打印的信息

在tuning调试期间,控制台上会打印很多信息。它们用于调试目的。最重要的信息是任务调度程序的输出。下表是示例输出。

----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.010 |           0.40 |     64 |
|    1 |        0.087 |          47.19 |     64 |
|    2 |        0.008 |          -0.00 |     64 |
|    3 |        0.177 |         582.07 |     64 |
|    4 |        0.268 |         862.37 |    256 |
|    5 |        0.166 |         621.13 |    128 |
|    6 |        0.170 |         605.10 |    128 |
|    7 |        0.128 |         403.20 |     64 |
|    8 |        0.189 |         545.71 |     64 |
|    9 |        0.231 |        1001.01 |    448 |
|   10 |        0.155 |         664.80 |    256 |
|   11 |        0.155 |         662.86 |    256 |
|   12 |        0.119 |         434.08 |     64 |
|   13 |        0.199 |         522.13 |     64 |
|   14 |        0.235 |         986.56 |    320 |
|   15 |        0.149 |         689.13 |    128 |
|   16 |        0.155 |         664.80 |    192 |
|   17 |        0.151 |         340.64 |     64 |
|   18 |        0.176 |         597.55 |    128 |
|   19 |        0.220 |        1054.37 |    192 |
|   20 |        0.150 |         686.01 |    128 |
|   21 |        0.159 |         650.88 |    128 |
|   22 |        0.073 |         358.19 |     64 |
|   23 |        0.031 |          70.63 |     64 |
|   24 |        0.251 |         947.73 |    128 |
|   25 |        0.157 |         652.47 |    128 |
|   26 |        0.215 |         954.84 |    128 |
|   27 |        0.237 |         868.92 |    128 |
|   28 |        0.266 |         774.06 |    128 |
-------------------------------------------------
Estimated total latency: 10.016 ms      Trials: 3992    Used time : 1131 s      Next ID: 15

下表列出了所有任务的延迟和(估计)速度。它还列出了所有任务的测量试验分配。最后一行显示这些任务的总加权延迟,这可以粗略估计网络的端到端执行时间。最后一行还显示测量试验的总数,自动调试所花费的总时间以及要调试的下一个任务的ID。

也将出现一些“ dmlc :: Error”错误,因为自动调度程序将尝试某些无效的调度。如果可以继续进行调试,则可以放心地忽略它们,因为这些错误与主要过程是隔离的。

注意

提前终止调试

可以通过强制终止此过程来提前终止调试。只要为日志文件中的每个任务获得至少一个有效的调度,就应该能够进行编译(下面的部分)。

编译和评估

自动调试后,可以使用发现的最佳时间表来编译网络。在自动调试过程中,所有测量记录都将转储到日志文件中,因此可以读取日志文件并加载最佳调度。

# Compile with the history best
print("Compile...")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
 
# Create graph runtime
ctx = tvm.context(str(target), 0)
module = graph_runtime.GraphModule(lib["default"](ctx))
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input("data", data_tvm)
 
# Evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))

出:

Compile...
Evaluate inference time cost...
Mean inference time (std dev): 30.72 ms (0.09 ms)

其他技巧

  • 在调试期间,自动调度器需要编译许多程序并从中提取功能。此部分占用大量CPU,因此建议使用具有多个内核的高性能CPU以加快搜索速度。
  • 可以 用python3 -m tvm.auto_scheduler.measure_record --mode distill --i log.json来提取大型日志文件,而仅保存最有用的记录。
  • 可以从上一个日志文件继续搜索。load_log_file在function中创建任务调度程序时,只需添加一个新参数run_tuning。也就是, tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
  • 如果有多个目标CPU,则可以将它们全部用于测量以并行化测量。检查本 以了解如何使用RPC跟踪器和RPC服务器。要在自动调度使用RPC跟踪,在TuningOptions中用auto_scheduler.RPCRunner更换runner 。

 

标签:ax2,ax3,ax0,ax1,x86,PLACEHOLDER,神经网络,placeholder,CPU
来源: https://www.cnblogs.com/wujianming-110117/p/14182391.html

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有