手把手搭建 BP 神经网络

2020-11-16 12:31:29 阅读：144 来源： 互联网

标签：layer 手把手 labels 神经网络 num BP np theta size

Neural networks

Visualizing the data

在这一部分，首先需要加载数据并随机输出几个图像。

加载的数据有 \(15000\) 个训练样本（training examples），每一个训练样本是一个 \(64\times 64\) 像素的灰度图。每一个像素代表了一个 \(8\) 位的无符号整数，代表了每个位置的灰度强度。这 \(64\times 64\) 的像素网格将被展开成为一个 \(4096\) 维的向量。这些训练样本将会成为矩阵 \(X\) 的每一行。所以维度为 \(15000\times 4096\) 的矩阵。

\[X=\begin{bmatrix} -(x^{(1)})^T-\\ -(x^{(2)})^T-\\ \vdots \\ -(x^{(m)})^T-\\ \end{bmatrix} \]

第二部分，是一个 \(15000\) 维的向量 \(y\) 其包含了所有训练集的标签。其值是从 \(1\) 到 \(15\) 分别表示汉字：零、一、二、三、四、五、六、七、八、九、十、百、千、万、亿。

并且将加载的数据集按照 \(7:3\) 的比例分成两个集合：训练集（training set）、测试集（test set）。

from matplotlib import pyplot as plt
from matplotlib import image as img
import numpy as np
from os import listdir
from scipy import optimize
import pandas as pd
import openpyxl
import time
from scipy import io
import re

n = 64 * 64  # 特征数量
m = 15000  # 数据样本数量
input_layer_size = n  # 64 * 64 的灰度图像
hidden_layer_size = 256  # 256 个隐藏单元
num_labels = 15  # 15 个分类
Lambda = 1  # 正则化系数

def read_images(dir_str, n):
    files_list = listdir(dir_str)
    files_str = str(files_list)
    # 样本数量
    m = len(files_list)
    data = np.zeros((num_labels, m // num_labels, n))
    for label in range(1, num_labels + 1):
        pattern = 'input_\d*_\d*_' + str(label) + '\.jpg'
        label_list = re.findall(pattern, files_str)
        count = 0
        for file in label_list:
            data[label - 1, count, :] = np.ravel(img.imread(dir_str + file), order='F')
            count += 1
    return data
    

dir_str = './data/'
data = read_images(dir_str, n)

%matplotlib inline
fig, axes = plt.subplots(nrows=3, ncols=5)  # 子图为 3 行，5 列
i = 0
for axe in axes:
    for ax in axe:
        ax.imshow(data[i, 2, :].reshape(int(np.sqrt(n)), int(np.sqrt(n)), order='F'), cmap='gray')
        i += 1

output_4_0

Xy = np.empty((int(m * 0.7), n + 1))
Xytest = np.empty((int(m * 0.3), n + 1))
for i in range(num_labels):
    np.random.shuffle(data[i])
    Xy[i * int(m // num_labels * 0.7) : i * int(m // num_labels * 0.7) + int(m // num_labels * 0.7), 0:-1] = data[i, 0:int(m // num_labels * 0.7), :]
    Xy[i * int(m // num_labels * 0.7) : i * int(m // num_labels * 0.7) + int(m // num_labels * 0.7), -1] = np.ones(int(m // num_labels * 0.7)) * (i + 1)
    Xytest[i * int(m // num_labels * 0.3) : i * int(m // num_labels * 0.3) + int(m // num_labels * 0.3), 0:-1] = data[i, int(m // num_labels * 0.7):, :]
    Xytest[i * int(m // num_labels * 0.3) : i * int(m // num_labels * 0.3) + int(m // num_labels * 0.3), -1] = np.ones(int(m // num_labels * 0.3)) * (i + 1)

np.random.shuffle(Xy)
np.random.shuffle(Xytest)
X = Xy[:, 0:n]
y = Xy[:, -1]
Xtest = Xytest[:, 0:n]
ytest = Xytest[:, -1]

mat = io.loadmat('ex4data2.mat')
X = mat.get('X')
y = mat.get('y')
Xtest = mat.get('Xtest')
ytest = mat.get('ytest')

Model representation

神经网络如下图所示：

neural notworks

共有 \(3\) 层， \(1\) 个输入层， \(1\) 个隐层， \(1\) 个输出层。

每张图片共有 \(64\times 64\) 个像素，所以输入层有 \(4096\) 个输入单元，每个隐层有 \(256\) 个单元，输出层为 \(15\) 个输出单元，表示 \(15\) 个类别的个数。

Feedforward and cost function

神经网络的带有正则化项的损失函数（cost function with regularization）表示为：

\[J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}log((h_\theta (x^{(i)})))_k-(1-y_k^{(i)})log(1-(h_\theta (x^{(i)})))_k\right]+\frac{\lambda }{2m} \left[\sum_{j=1}^{256}\sum_{k=1}^{4096}(\Theta^{(1)}_{j,k})^2+\sum_{j=1}^{15}\sum_{k=1}^{256}(\Theta^{(2)}_{j,k})^2\right] \]

其中，\(h_\theta(x^{(i)})\) 表示向神经网络输入计算后的输出， \(K=15\) 表示所有可能的标签的个数， \(h_\theta(x^{(i)})_k=a_k^{(3)}\) 表示第 \(k\) 个输出单元的激励值（activation），并且不对偏置单元（bias unit）进行正则化。

除此之外，模型的原始标签值是 \(1,2,\dots,15\) 的整数，为了训练神经网络，需要将这些十进制的整数值重新编码为只有 \(0\) 和 \(1\) 的标签向量：

\[y= \begin{bmatrix} 1\\ 0\\ 0\\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 1\\ 0\\ \vdots \\ 0 \end{bmatrix},\dots, \begin{bmatrix} 0\\ 0\\ 0\\ \vdots \\ 1 \end{bmatrix} \]

假设 \(x^{(i)}\) 表示汉字“五”的图片，那么对应的 \(y^{(i)}\) 应该是一个 \(15\) 维的向量，并且 \(y_5=1\) 其它元素都等于 \(0\) 。

def nnCostFun(nn_params, X, y, input_layer_size=input_layer_size, hidden_layer_size=hidden_layer_size, num_labels=num_labels, Lambda=Lambda):
    Theta1 = nn_params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1, order='F')
    Theta2 = nn_params[hidden_layer_size * (input_layer_size + 1) : ].reshape(num_labels, hidden_layer_size + 1, order='F')
    m = X.shape[0]  # 样本的个数
    # 将样本标签转换为 15 维的 0-1 向量
    yy = np.zeros((m, num_labels))
    for i in range(m):
        yy[i, int(y[i] - 1)] = 1
    # 执行前向传播
    # 给 X 增加 1 列，全为 1
    a1 = np.ones((m, 1), np.uint)
    a1 = np.hstack((a1, X))
    # 前向传播
    z2 = np.matmul(a1, Theta1.T)
    a2 = sigmoid(z2)
    # 增加第二层（隐藏层）的偏置单元
    ones = np.ones((m, 1), dtype=np.int)
    a2 = np.hstack((ones, a2))
    z3 = np.matmul(a2, Theta2.T)
    a3 = sigmoid(z3)  # m * num_labels
    # 计算 J
    s = -yy * np.log(a3)
    s -= (1 - yy) * np.log(1 - a3)
    s = np.sum(s)  # 计算每个输出单元的累加
    J = (1 / m) * s;
    # 正则化 J
    t1 = np.power(Theta1, 2)
    t2 = np.power(Theta2, 2)
    s = np.sum(t1[:, 1:]) + np.sum(t2[:, 1:])
    J += (Lambda / (2 * m)) * s
    # 反向传播计算梯度
    delta3 = a3 - yy
    delta2 = (np.matmul(delta3, Theta2)[:, 1:] * sigmoid_gradient(z2))
    Theta1_grad = (1 / m) * np.matmul(delta2.T, a1)
    Theta2_grad = (1 / m) * np.matmul(delta3.T, a2)
    # 正则化 D
    Theta1_grad[:, 1:] += (Lambda / m) * Theta1[:, 1:]
    Theta2_grad[:, 1:] += (Lambda / m) * Theta2[:, 1:]
    gradient = [np.ravel(Theta1_grad, order='F').tolist(), np.ravel(Theta2_grad, order='F').tolist()]
    gradient = gradient[0] + gradient[1]
    return J, np.array(gradient)


def costFun(params):
    return nnCostFun(params, X, y, input_layer_size, hidden_layer_size, num_labels, Lambda)

Backpropagation

在这一部分，我们实现后 BP 算法计算神经网络的损失函数的梯度（gradient）。

Sigmoid gradient

首先，我们实现 Sigmoid 函数的梯度：

\[sigmoid(z)=g(z)=\frac{1}{1+e^{-z}} \]

\[{g}'(z)=\frac{\mathrm{d}}{\mathrm{d}z} g(z)=g(z)(1-g(z)) \]

对于预测的越准确的激励值， \(z\) 的值也就越大，梯度就越趋向于 \(0\) 。当 \(z=0\) 时，梯度的值应该正好是 \(0.25\) 。

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))

Random initialization

当开始训练网络之前，随机初始化每一个参数用来对称破缺（symmetry breaking）是非常重要的。一个随机初始化非常有效的策略是，随机地为 \(\Theta^{(l)}\) 选择一个在统一 \(\left[-\epsilon_{init},\epsilon_{init}\right]\) 范围内的值。我们可以使用 \(\epsilon_{init}=0.12\) 。这个范围保证了参数保持在一个非常小的值，并使得学习更加有效。

def randInitialize(L_in, L_out):
    epsilon = 0.12
    W = np.random.rand(L_out, 1 + L_in) * 2 * epsilon - epsilon
    return W

Backpropagation

给定一个训练样本 \((x^{(t)}, y^{(t)})\) ，首先输入神经网络并前向传播计算神经网络所有的激励值，包括输出值 \(h_\Theta(x)\) 。然后，对于第 \(l\) 层的结点 \(j\) ，将通过测量结点对于任何误差的“负责度”（responsible），计算误差项（error item） \(\delta_j^{(l)}\) 。

对于一个输出结点，可以直接通过测量神经网络的激励值与真实的目标值的差值，并使用其定义 \(\delta_j^{(3)}\) 。对于隐藏单元，将基于第 \((l+1)\) 层的结点的误差项的加权平均值去计算 \(\delta_j^{(l)}\) 。

对于第 \(t\) 个训练样本 \(x^{(t)}\) ，设置输入层的值 \(a^{(1)}\) 。执行前向传播，计算每层的激励值 \((z^{(2)},a^{(2)},z^{(3)},a^{(3)})\) 。每一层都需要增加一个 \(+1\) 项的偏置单元。
对于第 \(3\) 层的输出单元 \(k\) ，设

\[\delta_k^{(3)}=(a_k^{(3)}-y_k) \]

对于隐藏层 \(l=2\) ，设

\[\delta^{(2)}=(\Theta^{(2)})^T\delta^{(3)}.*{g}'(z^{(2)}) \]

使用如下公式积累神经网络的梯度值：

\[\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T \]

注意，此步骤不对偏置单元 \(\delta_0^{(l)}\) 进行计算。
5. 将通过使用累积的梯度除以 \(m\) 得到神经网络的损失函数的梯度：

\[\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac{1}{m}\Delta^{(l)}_{ij} \]

Gradient checking

可以将参数矩阵 \(\Delta\) 展开成为一个非常长的向量 \(\theta\) 。通过这样做，可以将损失函数看作 \(J(\theta)\) 并使用下面的步骤进行梯度检查。

假设有一个函数 \(f_i(\theta)\) 去计算 \(\frac{\partial}{\partial\theta_i}J(\theta)\) 检查 \(f_i\) 是否输出了正确的导数值。设

\[\theta^{(i+)}=\theta+\begin{bmatrix} 0\\ 0\\ \vdots\\ \epsilon\\ \vdots\\ 0 \end{bmatrix} \ and\ \theta^{(i-)}=\theta-\begin{bmatrix} 0\\ 0\\ \vdots\\ \epsilon\\ \vdots\\ 0 \end{bmatrix} \]

因此， \(\theta^{(i+)}\) 除了第 \(i\) 个元素的值被增加了 \(\epsilon\) 其它值都与 \(\theta\) 的值相同，与之类似地， \(\theta^{(i-)}\) 也是除了第 \(i\) 个元素被减少了 \(\epsilon\) 其余的值都与 \(\theta\) 相同。可以使用数值验证对于每一个 \(i\) 的 \(f_i(\theta)\) 的正确性：

\[f_i(\theta)\approx\frac{J(\theta^{(i+)})-J(\theta^{(i-)})}{2\epsilon} \]

这两个值彼此的接近程度将取决于 \(J\) 的细节。假设 \(\epsilon=1e^{-4}\) ，通常将会发现上面左手和右手边的式子至少接受 \(4\) 位有效数字。

def numericalGradient(theta, X, y, in_lay, hidden_lay, num, Lambda):
    numgrad = np.zeros(theta.shape)
    perturb = np.zeros(theta.shape)
    e = 1e-4
    for p in range(np.size(theta)):
        # 设置扰动向量
        perturb[p] = e
        loss1 = nnCostFun(theta - perturb, X, y, in_lay, hidden_lay, num, Lambda)
        loss1 = loss1[0]
        loss2 = nnCostFun(theta + perturb, X, y, in_lay, hidden_lay, num, Lambda)
        loss2 = loss2[0]
        # 计算数值解梯度
        numgrad[p] = (loss2 - loss1) / (2 * e)
        perturb[p] = 0
    return numgrad

def checkGradient(Lambda=0):
    # 测试数据
    in_lay = 3
    hidden_lay = 5
    num = 3
    mm = 5
    # 生成参数和样本
    Theta1 = np.arange(hidden_lay * (in_lay + 1)) + 1
    Theta2 = np.arange(num * (hidden_lay + 1)) + 1
    Theta1 = np.sin(Theta1).reshape(hidden_lay, in_lay + 1)
    Theta2 = np.sin(Theta2).reshape(num, hidden_lay + 1)
    X = np.arange(mm * in_lay) + 1
    X = np.sin(X).reshape(mm, in_lay)
    y = 1 + np.mod(np.arange(mm) + 1, num)
    # 展开参数
    pars = [np.ravel(Theta1).tolist(), np.ravel(Theta2).tolist()]
    pars = pars[0] + pars[1]
    pars = np.array(pars)
    # 分别计算梯度
    grad = nnCostFun(pars, X, y, in_lay, hidden_lay, num, Lambda)
    grad = grad[1]
    numgrad = numericalGradient(pars, X, y, in_lay, hidden_lay, num, Lambda)
    # 计算差的范数
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
    return diff

checkGradient()

1.6291376831737205e-10

Regularized Neural Networks

增加一个正则化项到梯度上，在使用反向传播计算完 \(\Delta_{ij}^{(l)}\) 之后，使用如下公式对梯度正则化：

\[\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac{1}{m}\Delta^{(l)}_{ij}\ \ for\ j=0 \]

\[\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac{1}{m}\Delta^{(l)}_{ij}+\frac{\lambda}{m}+\Theta^{(l)}_{ij}\ \ for\ j\ge1 \]

注意不会对 \(\Theta^{(l)}\) 的第一列的偏置项进行正则化。

Learning parameters

在成功地实现了神经网络的损失函数和梯度的计算之后，下一步走将会使用优化函数学习一个良好的参数集。

Minibatch Stochasitc Gradient Descent

按照数据生成分布抽取 \(m\) 个小批量（独立同分布的）样本，通过计算它们的梯度均值，可以得到梯度的无偏估计。

\(SGD\) 及相关的小批量亦或更广义的基于梯度优化的在线学习算法，一个重要的性质是每一步更新的计算时间不依赖训练样本数目的多寡。即使训练样本非常大时，它们也能收敛。对于足够大的数据集， \(SGD\) 可能会在处理整个训练集之前就收敛到最终测试误差的某个固定容差范围内。

在求数值解的优化算法中，先选取一组模型参数的初始值，如随机选取；接下来对参数进行多次迭代，使每次迭代都可能降低损失函数的值。在每次迭代中，先随机均匀采样一个由固定数目训练数据样本所组成的小批量（mini-batch）\(\mathcal{B}\)，然后求小批量中数据样本的平均损失有关模型参数的导数（梯度），最后用此结果与预先设定的一个正数的乘积作为模型参数在本次迭代的减小量。

模型的每个参数将作如下迭代：

\[\Theta^{(l)}_{ij}=\Theta^{(l)}_{ij}-\frac{\alpha}{\left|\mathcal{B}\right|}\sum_{i\in\mathcal{B}}\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta)=\Theta^{(l)}_{ij}-\frac{\alpha}{\left|\mathcal{B}\right|}\sum_{i\in\mathcal{B}}D^{(l)}_{ij} \]

在上式中，\(|\mathcal{B}|\)代表每个小批量中的样本个数（批量大小，batch size），\(\alpha\)称作学习率（learning rate）并取正数。

Vectorized

在模型训练或预测时，常常会同时处理多个数据样本并用到矢量计算。

广义上讲，当数据样本数为\(m\)，特征数为\(n\)时，矢量计算表达式为

\[h_\theta(X)=\hat{y}=Xw+b \]

其中模型输出 \(\hat{y}\in\mathbb{R}^{m\times1}\) ，批量数据样本特征 \(X\in\mathbb{R}^{m\times n}\) ，权重 \(w\in\mathbb{R}^{n\times 1}\) ，偏差 \(b \in \mathbb{R}\) 。相应地，批量数据样本标签 \(\boldsymbol{y} \in \mathbb{R}^{\mathcal{B} \times 1}\) 。设模型参数 \(\boldsymbol{\theta}\) ，我们可以重写损失函数为

\[J(\theta)=\ell(\boldsymbol{\theta})=\frac{1}{2n}(\boldsymbol{\hat{y}}-\boldsymbol{y})^\top(\boldsymbol{\hat{y}}-\boldsymbol{y}). \]

小批量随机梯度下降的迭代步骤将相应地改写为

\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\boldsymbol{\theta}} \ell^{(i)}(\boldsymbol{\theta}), \]

def unroll(params, input_layer_size, hidden_layer_size, num_labels):
    Theta1 = params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1, order='F')
    Theta2 = params[hidden_layer_size * (input_layer_size + 1) : ].reshape(num_labels, hidden_layer_size + 1, order='F')
    return Theta1, Theta2

def roll(Theta1, Theta2):
    l = [np.ravel(Theta1, order='F').tolist(), np.ravel(Theta2, order='F').tolist()]
    l = l[0] + l[1]
    return np.array(l)

def predict(params, X):
    Theta1 = params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1, order='F')
    Theta2 = params[hidden_layer_size * (input_layer_size + 1) : ].reshape(num_labels, hidden_layer_size + 1, order='F')
    m = X.shape[0]  # 样本的个数
    # 执行前向传播
    # 给 X 增加 1 列，全为 1
    a1 = np.ones((m, 1))
    a1 = np.hstack((a1, X))
    # 前向传播
    z2 = np.matmul(a1, Theta1.T)
    a2 = sigmoid(z2)
    # 增加第二层（隐藏层）的偏置单元
    ones = np.ones((m, 1), dtype=np.int)
    a2 = np.hstack((ones, a2))
    z3 = np.matmul(a2, Theta2.T)
    a3 = sigmoid(z3)  # m * num_labels
    return a3
    
def predict_accuracy(params, X, y):
    a3 = predict(params, X)
    # 预测、计算准确率
    pre = np.argmax(a3, axis=1) + 1
    return np.mean(pre == y) * 100

def predict_graph(params, X, y):
    a3 = predict(params, X)
    pre = np.argmax(a3, axis=1) + 1
    labels = list(range(1, 16))
    accuracies = []
    for label in labels:
        accuracy = np.sum((pre == label) & (y == label)) / np.sum(y == label)
        accuracies.append(accuracy)
    plt.bar(labels, accuracies)

def mgd(params, learning_rate=0.1, batch_size=100, maxepoch=10):
    epoch_list = []
    cost_list = []
    test_accuracy_list = []
    train_accuracy_list = []
    m = X.shape[0]
    # cost
    fig1, axes1 = plt.subplots()
    # accuracy
    fig2, axes2 = plt.subplots()
    for epoch in range(maxepoch):
        # 小批量进行更新全部训练集一次
        for i in range(0, m, batch_size):
            XX = X[i:i + batch_size, :]
            yy = y[i:i + batch_size]
            J, grad = nnCostFun(params, XX, yy)
            # 更新参数
            params -= learning_rate * grad
        # 记录
        epoch_list.append(epoch)
        cost_list.append(J)
        test_accuracy = predict_accuracy(params, Xtest, ytest)
        test_accuracy_list.append(test_accuracy)
        train_accuracy = predict_accuracy(params, X, y)
        train_accuracy_list.append(train_accuracy)
        print(epoch, J, train_accuracy, test_accuracy)
    axes1.plot(epoch_list, cost_list)
    axes1.set_xlabel('epoch')
    axes1.set_ylabel('cost')
    axes1.text(0, J, 'learning rate: %f' %learning_rate)
    axes2.plot(epoch_list, test_accuracy_list)
    axes2.plot(epoch_list, train_accuracy_list)
    axes2.set_xlabel('epoch')
    axes2.set_ylabel('accuracy')
    axes2.text(0, test_accuracy, 'learning rate: %f' %learning_rate)
    axes2.legend(['test accuracy', 'train accuracy'], loc=0)  # 图例
    fig1.savefig('cost_' + str(int(time.time())))
    fig2.savefig('accuracy_' + str(int(time.time())))

Lambda = 100

Theta1 = randInitialize(input_layer_size, hidden_layer_size)
Theta2 = randInitialize(hidden_layer_size, num_labels)
params = roll(Theta1, Theta2)

mat = io.loadmat('ex4weights2.mat')
Theta1 = mat.get('Theta1')
Theta2 = mat.get('Theta2')
params = roll(Theta1, Theta2)


#mgd(params, 0.1, 1000, 100)

在执行完小批量的梯度下降后，可以看到训练集的准确率为 \(96\%\) ，每个数字的准确率如条形图所示

predict_graph(params, X, y)
predict_accuracy(params, X, y)

96.86666666666667

output_26_1

测试集的准确率为 \(70\%\) ，对于非二分类的问题，这里是十五个类别，\(70\%\) 的准确率不算太低

predict_graph(params, Xtest, ytest)
predict_accuracy(params, Xtest, ytest)

70.44444444444444

output_28_1

下面从测试集中随机抽取 \(10\) 张图片进行识别，查看效果

label_map = {1:'零', 2:'一', 3:'二', 4:'三', 5:'四', 6:'五', 7:'六', 8:'七', 9:'八', 10:'九', 11:'十', 12:'百', 13:'千', 14:'万', 15:'亿'}

fig, axes = plt.subplots(nrows=2, ncols=5)
i = 0
for axe in axes:
    for ax in axe:
        i = np.random.randint(0, 4500)
        ax.imshow(Xtest[i, :].reshape((64, 64), order='F'), cmap='gray')
        print(label_map.get((predict(params, Xtest[i, :].reshape((1, 4096), order='F')).argmax() + 1)), end='  ')

九  三  零  亿  百  百  万  千  七  七

output_31_1

可以看到仍然有些误差，这是非常正常的，因为此模型选用的 \(BP\) 神经网络，对于这种图像识别的，使用卷积神经网络才是非常有力的模型。

标签：layer,手把手,labels,神经网络,num,BP,np,theta,size
来源： https://www.cnblogs.com/geekfx/p/13984511.html

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9

手把手搭建 BP 神经网络

Neural networks

Visualizing the data

Model representation

Feedforward and cost function

Backpropagation

Sigmoid gradient

Random initialization

Backpropagation

Gradient checking

Regularized Neural Networks

Learning parameters

Minibatch Stochasitc Gradient Descent

Vectorized