How to Optimize Convolution on GPU

2021-10-30 07:34:58  Source: Internet


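This tutorial demonstrates how to write a high-performance convolution implementation in TVM. It uses square input tensors and filters as the example and assumes the convolution input has a large batch size. In this example, the data is stored in a different layout to achieve better data locality: the buffer layout is HWCN, which stands for height, width, channel, batch.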

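Preparation and Algorithm

The input tensor has a fixed size of 14 x 14 with 256 channels, and the batch size is 256. The convolution uses 512 filters of size 3 x 3, with stride 1 and padding 1. The following code defines the convolution algorithm in TVM.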

import numpy as np
import tvm
from tvm import te

# The sizes of inputs and filters
batch = 256
in_channel = 256
out_channel = 512
in_size = 14
kernel = 3
pad = 1
stride = 1

# Algorithm
A = te.placeholder((in_size, in_size, in_channel, batch), name="A")
W = te.placeholder((kernel, kernel, in_channel, out_channel), name="W")
out_size = (in_size - kernel + 2 * pad) // stride + 1

# Pad input: positions outside the original image read as zero
Apad = te.compute(
    (in_size + 2 * pad, in_size + 2 * pad, in_channel, batch),
    lambda yy, xx, cc, nn: tvm.tir.if_then_else(
        tvm.tir.all(yy >= pad, yy - pad < in_size, xx >= pad, xx - pad < in_size),
        A[yy - pad, xx - pad, cc, nn],
        tvm.tir.const(0.0, "float32"),
    ),
    name="Apad",
)

# Create reduction variables
rc = te.reduce_axis((0, in_channel), name="rc")
ry = te.reduce_axis((0, kernel), name="ry")
rx = te.reduce_axis((0, kernel), name="rx")

# Compute the convolution
B = te.compute(
    (out_size, out_size, out_channel, batch),
    lambda yy, xx, ff, nn: te.sum(
        Apad[yy * stride + ry, xx * stride + rx, rc, nn] * W[ry, rx, rc, ff], axis=[ry, rx, rc]
    ),
    name="B",
)
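As a sanity check on what the definition above computes, the same HWCN convolution can be written as plain Python loops. This is only a sketch on a tiny, made-up problem size (a 4 x 4 input of ones, 2 channels, 1 filter, batch 1); the function name conv2d_hwcn and the example sizes are illustrative and do not come from the tutorial.

```python
# Pure-Python reference for an HWCN convolution with zero padding.
# Sizes below are illustrative only, chosen so the loops finish instantly.

def conv2d_hwcn(a, w, in_size, in_channel, out_channel, batch, kernel, pad, stride):
    """a[y][x][c][n] and w[ky][kx][c][f] -> b[y][x][f][n]."""
    out_size = (in_size - kernel + 2 * pad) // stride + 1
    b = [[[[0.0] * batch for _ in range(out_channel)]
          for _ in range(out_size)] for _ in range(out_size)]
    for yy in range(out_size):
        for xx in range(out_size):
            for ry in range(kernel):
                for rx in range(kernel):
                    # Coordinates in the unpadded input; outside means padding (zero).
                    y = yy * stride + ry - pad
                    x = xx * stride + rx - pad
                    if not (0 <= y < in_size and 0 <= x < in_size):
                        continue
                    for rc in range(in_channel):
                        for ff in range(out_channel):
                            for nn in range(batch):
                                b[yy][xx][ff][nn] += a[y][x][rc][nn] * w[ry][rx][rc][ff]
    return b

in_size, in_channel, out_channel, batch = 4, 2, 1, 1
a = [[[[1.0] * batch for _ in range(in_channel)]
      for _ in range(in_size)] for _ in range(in_size)]
w = [[[[1.0] * out_channel for _ in range(in_channel)]
      for _ in range(3)] for _ in range(3)]
b = conv2d_hwcn(a, w, in_size, in_channel, out_channel, batch, kernel=3, pad=1, stride=1)

print(len(b), b[0][0][0][0], b[1][1][0][0])
```

With all-ones inputs each output value simply counts the non-padded positions under the 3 x 3 window times the number of channels, which makes the effect of the padding easy to see at the corners.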

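Memory Hierarchy

We first specify the memory hierarchy for the buffers. The original tutorial shows a figure of the GPU memory hierarchy at this point. One important difference from the CPU memory hierarchy is that the GPU provides a cache buffer called shared memory, which is managed by the programmer. How to maximize data reuse in shared memory is therefore critical to achieving high performance in GPU kernels.

In this example, Apad and W are loaded into the buffers AA and WW, which are stored in shared memory. These buffers are shared by all threads within the same thread block to compute the convolution. Each thread then loads its own part from the shared buffers into its local registers, AL and WL. BL is a local cache of the output B, which is also stored in thread-local registers.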

# Designate the memory hierarchy
s = te.create_schedule(B.op)
s[Apad].compute_inline()  # compute Apad inline
AA = s.cache_read(Apad, "shared", [B])
WW = s.cache_read(W, "shared", [B])
AL = s.cache_read(AA, "local", [B])
WL = s.cache_read(WW, "local", [B])
BL = s.cache_write(B, "local")
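To make the data-reuse argument concrete, the short calculation below estimates how much shared memory the AA and WW slices occupy and how many multiply-accumulates each loaded element feeds. It is a rough sketch: the values of step and block_factor are taken from the tiling constants that this article sets in the Blocking section, and 4-byte float32 elements are assumed.

```python
# Back-of-the-envelope data reuse for one shared-memory slice.
step = 8             # reduction channels loaded per iteration
block_factor = 64    # tile edge along out_channel and along batch
bytes_per_float = 4  # assumption: float32 data

aa_elems = step * block_factor   # slice of Apad held in AA
ww_elems = step * block_factor   # slice of W held in WW
shared_bytes = (aa_elems + ww_elems) * bytes_per_float

# Every (out_channel, batch) pair in the tile consumes all `step` channels.
macs_per_slice = block_factor * block_factor * step
macs_per_loaded_elem = macs_per_slice / (aa_elems + ww_elems)

print(shared_bytes, macs_per_loaded_elem)
```

Under these assumptions each element brought into shared memory is used for 32 multiply-accumulates, which is the reuse that the shared buffers are meant to capture.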

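Blocking

The following code splits the workload into thread blocks and individual threads, following the blocking scheme used in matrix multiplication. Given a pixel coordinate (y, x), a thread block is responsible for computing a block_factor x block_factor (64 x 64) region over output channels and batch (illustrated by a figure in the original tutorial). Because shared memory space is limited, only step x block_factor (8 x 64) data is loaded from each of Apad and W into the shared-memory buffers at a time.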

# tile consts
tile = 8
num_thread = 8
block_factor = tile * num_thread
step = 8
vthread = 2

# Get the GPU thread indices
block_x = te.thread_axis("blockIdx.x")
block_y = te.thread_axis("blockIdx.y")
block_z = te.thread_axis("blockIdx.z")
thread_x = te.thread_axis((0, num_thread), "threadIdx.x")
thread_y = te.thread_axis((0, num_thread), "threadIdx.y")
thread_xz = te.thread_axis((0, vthread), "vthread", name="vx")
thread_yz = te.thread_axis((0, vthread), "vthread", name="vy")

# Split the workloads
hi, wi, fi, ni = s[B].op.axis
bz = s[B].fuse(hi, wi)
by, fi = s[B].split(fi, factor=block_factor)
bx, ni = s[B].split(ni, factor=block_factor)

# Bind the iteration variables to GPU thread indices
s[B].bind(bz, block_z)
s[B].bind(by, block_y)
s[B].bind(bx, block_x)
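The launch grid implied by these splits can be worked out by hand. The sketch below redoes that arithmetic in plain Python with the sizes used in this article; the variable names grid_x, grid_y and grid_z are illustrative labels for the extents bound to blockIdx.x, blockIdx.y and blockIdx.z.

```python
# Grid dimensions implied by the fuse/split/bind calls above.
batch, out_channel, in_size = 256, 512, 14
kernel, pad, stride = 3, 1, 1
tile, num_thread = 8, 8
block_factor = tile * num_thread

out_size = (in_size - kernel + 2 * pad) // stride + 1

grid_z = out_size * out_size          # fused (hi, wi): one block row per pixel
grid_y = out_channel // block_factor  # tiles along output channels
grid_x = batch // block_factor        # tiles along batch
total_blocks = grid_x * grid_y * grid_z

threads_per_block = num_thread * num_thread
outputs_per_thread = (block_factor * block_factor) // threads_per_block

print(grid_x, grid_y, grid_z, total_blocks, outputs_per_thread)
```

So each of the 8 x 8 threads in a block is responsible for 64 output values of its 64 x 64 tile.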

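Virtual Thread Split

We further split the workload from a thread block into individual threads. To avoid memory bank conflicts, virtual threads are used to split the region into 4 parts, which are then tiled into 8 x 8 grids. As a result (shown by a figure in the original tutorial), each thread computes 4 strided grids, each of size 4 x 4.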

tyz, fi = s[B].split(fi, nparts=vthread)  # virtual thread split
txz, ni = s[B].split(ni, nparts=vthread)  # virtual thread split
ty, fi = s[B].split(fi, nparts=num_thread)
tx, ni = s[B].split(ni, nparts=num_thread)
s[B].reorder(bz, by, bx, tyz, txz, ty, tx, fi, ni)

s[B].bind(tyz, thread_yz)
s[B].bind(txz, thread_xz)
s[B].bind(ty, thread_y)
s[B].bind(tx, thread_x)
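The strided pattern produced by the two nparts splits can be reproduced with ordinary index arithmetic. The sketch below covers one axis of the 64-wide tile (the other axis behaves the same way); owned_indices is a hypothetical helper written for illustration and is not part of TVM.

```python
# Which positions along one 64-wide tile axis a physical thread computes.
block_factor, vthread, num_thread = 64, 2, 8

vthread_extent = block_factor // vthread      # 32 positions per virtual thread
inner_extent = vthread_extent // num_thread   # 4 contiguous positions per thread

def owned_indices(t):
    """Positions owned by physical thread t, across all virtual threads."""
    return [v * vthread_extent + t * inner_extent + i
            for v in range(vthread)
            for i in range(inner_extent)]

print(owned_indices(0))
print(owned_indices(7))
```

Each thread owns two runs of 4 positions that are 32 apart on each axis, which gives the 4 strided 4 x 4 grids per thread over the two axes.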

Cooperative Fetching

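At each time step, step x block_factor data has to be transferred from GPU global memory to shared memory. To reduce the memory transfer per thread, the following code lets the threads in the same thread block cooperatively fetch the dependent data from global memory.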

# Schedule BL local write
s[BL].compute_at(s[B], tx)
yi, xi, fi, ni = s[BL].op.axis
ry, rx, rc = s[BL].op.reduce_axis
rco, rci = s[BL].split(rc, factor=step)
s[BL].reorder(rco, ry, rx, rci, fi, ni)

# Attach computation to iteration variables
s[AA].compute_at(s[BL], rx)
s[WW].compute_at(s[BL], rx)
s[AL].compute_at(s[BL], rci)
s[WL].compute_at(s[BL], rci)

# Schedule for A's shared memory load
yi, xi, ci, ni = s[AA].op.axis
ty, ci = s[AA].split(ci, nparts=num_thread)
tx, ni = s[AA].split(ni, nparts=num_thread)
_, ni = s[AA].split(ni, factor=4)
s[AA].reorder(ty, tx, yi, xi, ci, ni)
s[AA].bind(ty, thread_y)
s[AA].bind(tx, thread_x)
s[AA].vectorize(ni)  # vectorize memory load

# Schedule for W's shared memory load
yi, xi, ci, fi = s[WW].op.axis
ty, ci = s[WW].split(ci, nparts=num_thread)
tx, fi = s[WW].split(fi, nparts=num_thread)
_, fi = s[WW].split(fi, factor=4)
s[WW].reorder(ty, tx, yi, xi, ci, fi)
s[WW].bind(ty, thread_y)
s[WW].bind(tx, thread_x)
s[WW].vectorize(fi)  # vectorize memory load
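The division of labour set up by the AA schedule can be checked with a small simulation. The sketch below lists which elements of the 8 x 64 (channel x batch) slice each thread would fetch under the splits above, grouped into 4-wide vector loads; loads_for_thread is a hypothetical helper for illustration only, and the WW schedule follows the same pattern with output channels in place of batch.

```python
# Which elements of the step x block_factor slice each thread fetches.
step, block_factor, num_thread, vec = 8, 64, 8, 4

c_per_thread = step // num_thread           # 1 channel row per threadIdx.y
n_per_thread = block_factor // num_thread   # 8 batch entries per threadIdx.x

def loads_for_thread(ty, tx):
    """Return [(channel, [batch indices of one vector load]), ...]."""
    loads = []
    for c in range(c_per_thread):
        for o in range(n_per_thread // vec):
            base = tx * n_per_thread + o * vec
            loads.append((ty * c_per_thread + c, list(range(base, base + vec))))
    return loads

print(loads_for_thread(1, 2))

# Every element of the slice is fetched exactly once across the thread block.
fetched = sorted((c, n)
                 for ty in range(num_thread)
                 for tx in range(num_thread)
                 for c, ns in loads_for_thread(ty, tx)
                 for n in ns)
```

Each thread therefore issues two 4-element vector loads per slice instead of fetching the whole slice on its own.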

Generate CUDA Kernel

Finally, we use TVM to generate and compile the CUDA kernel, and evaluate the latency of the convolution.

func = tvm.build(s, [A, W, B], "cuda")
dev = tvm.cuda(0)
a_np = np.random.uniform(size=(in_size, in_size, in_channel, batch)).astype(A.dtype)
w_np = np.random.uniform(size=(kernel, kernel, in_channel, out_channel)).astype(W.dtype)
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((out_size, out_size, out_channel, batch), dtype=B.dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("Convolution: %f ms" % (evaluator(a, w, b).mean * 1e3))

Output:

Convolution: 41.937872 ms
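A latency figure is easier to interpret as throughput. The sketch below converts the time printed above into an approximate FLOP rate; it assumes one multiply and one add per multiply-accumulate and counts the multiplications against zero padding as work, so the result is a rough estimate. The measured time depends on the GPU, which this article does not name.

```python
# Convert the measured latency into an approximate arithmetic throughput.
batch, in_channel, out_channel = 256, 256, 512
in_size, kernel, pad, stride = 14, 3, 1, 1
latency_ms = 41.937872   # value reported in the output above

out_size = (in_size - kernel + 2 * pad) // stride + 1

# One multiply-accumulate per (output element, input channel, kernel position).
macs = out_size * out_size * out_channel * batch * in_channel * kernel * kernel
flops = 2 * macs         # assumption: 1 MAC = 2 floating-point operations

tflops_per_s = flops / (latency_ms * 1e-3) / 1e12
print("MACs: %d, approx. %.2f TFLOP/s" % (macs, tflops_per_s))
```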

Reference:

http://tvm.apache.org/docs/how_to/optimize_operators/opt_conv_cuda.html

Source: https://www.cnblogs.com/wujianming-110117/p/15484210.html
