2.6 BatchNormalization

2.6.1 The Batch Normalization Algorithm

Batch Normalization normalizes the activations of each unit over a mini-batch so that they have zero mean and unit variance, and then applies a learned scale \gamma and shift \beta:

\mu_B \leftarrow \frac1m\sum_{i=1}^mx_i
\sigma_B^2 \leftarrow \frac1m\sum_{i=1}^m(x_i-\mu_B)^2
\hat{x}_i \leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}
y_i \leftarrow \gamma\hat{x}_i+\beta
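As a quick numerical illustration of these four equations, here is a minimal NumPy sketch (the shapes and values are arbitrary, chosen only for this example):

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(100, 4) * 3.0 + 5.0  # mini-batch with arbitrary mean and scale
    gamma, beta, eps = np.ones(4), np.zeros(4), 1e-7

    mu = x.mean(axis=0)                      # batch mean
    var = np.mean((x - mu) ** 2, axis=0)     # batch variance
    xn = (x - mu) / np.sqrt(var + eps)       # normalized input
    y = gamma * xn + beta                    # scale and shift

    print(y.mean(axis=0))  # approximately 0
    print(y.std(axis=0))   # approximately 1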

In the backward pass, we want to obtain

\frac{\partial L}{\partial x_i}, \frac{\partial L}{\partial \gamma}, \frac{\partial L}{\partial \beta}

The gradients with respect to \beta and \gamma, and the gradient flowing back into \hat{x}_i, follow directly from the chain rule:

\frac{\partial L}{\partial \beta} = \sum_i\frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial \beta} = \sum_i\frac{\partial L}{\partial y_i}
\frac{\partial L}{\partial \gamma} = \sum_i\frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial \gamma} = \sum_i\frac{\partial L}{\partial y_i}\hat{x}_i
\frac{\partial L}{\partial \hat{x}_i} = \sum_j\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\gamma

To propagate back to x_i, note that

x_i = \sqrt{\sigma_B^2+\epsilon}\,\hat{x}_i + \mu_B

where \mu_B and \sigma_B^2 themselves depend on every x_j in the mini-batch, so the chain rule must also be applied through them.

A detailed derivation of the backward pass is given in Frederik Kratzert's blog post "Understanding the backward pass through Batch Normalization Layer".
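For reference, carrying the chain rule through \mu_B and \sigma_B^2 yields the closed form that the backward_2d method below computes step by step:

\frac{\partial L}{\partial x_i} = \frac{\gamma}{m\sqrt{\sigma_B^2+\epsilon}}\left(m\frac{\partial L}{\partial y_i} - \sum_{j=1}^m\frac{\partial L}{\partial y_j} - \hat{x}_i\sum_{j=1}^m\frac{\partial L}{\partial y_j}\hat{x}_j\right)

The two sums are exactly \partial L/\partial\beta and \partial L/\partial\gamma obtained above.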

We now implement a BatchNormalization class, based on common/layers.py from「ゼロから作るDeep Learning」(Deep Learning from Scratch). The forward_2d and backward_2d methods, which form the core of the implementation, are shown below.

Code 2.9 BatchNormalization.forward_2d and backward_2d

    def forward_2d(self, x_2d):
        if self.train.d:
            # Training: normalize with the statistics of the current mini-batch.
            mu = x_2d.mean(axis=0)
            self.xc = x_2d - mu
            var = np.mean(self.xc ** 2, axis=0)
            self.std = np.sqrt(var + 1e-7)
            self.xn = self.xc / self.std

            # Update the running statistics used at test time.
            momentum = 0.9
            self.running_mean.d = momentum * self.running_mean.d + (1 - momentum) * mu
            self.running_var.d = momentum * self.running_var.d + (1 - momentum) * var
        else:
            # Inference: normalize with the running mean and variance.
            self.xc = x_2d - self.running_mean.d
            self.xn = self.xc / np.sqrt(self.running_var.d + 1e-7)

        # Scale and shift: y = gamma * xn + beta.
        return self.gamma.d * self.xn + self.beta.d


    def backward_2d(self, dy_2d):
        self.beta.g = dy_2d.sum(axis=0)                 # dL/dbeta
        self.gamma.g = np.sum(self.xn * dy_2d, axis=0)  # dL/dgamma
        dxn = self.gamma.d * dy_2d                      # dL/dxn
        dxc = dxn / self.std                            # direct path through xc
        dstd = -np.sum((dxn * self.xc) / (self.std * self.std), axis=0)
        dvar = 0.5 * dstd / self.std                    # dL/dvar

        batch_size = dy_2d.shape[0]
        dxc += (2.0 / batch_size) * self.xc * dvar      # add the path through var
        dmu = np.sum(dxc, axis=0)
        return dxc - dmu / batch_size                   # dL/dx
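The methods above rely on the surrounding class (self.gamma.d, self.train.d, and so on). For experimenting outside the framework, the same computation can be written as two standalone functions. The following is a sketch that simply mirrors Code 2.9; it is not part of the ivory library:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-7):
        """Training-mode forward pass for a (batch_size, units) array."""
        mu = x.mean(axis=0)
        xc = x - mu
        var = np.mean(xc ** 2, axis=0)
        std = np.sqrt(var + eps)
        xn = xc / std
        return gamma * xn + beta, (xc, std, xn, gamma)

    def batchnorm_backward(dy, cache):
        """Backward pass following the same steps as backward_2d."""
        xc, std, xn, gamma = cache
        m = dy.shape[0]
        dbeta = dy.sum(axis=0)
        dgamma = np.sum(xn * dy, axis=0)
        dxn = gamma * dy
        dxc = dxn / std
        dstd = -np.sum(dxn * xc / (std * std), axis=0)
        dvar = 0.5 * dstd / std
        dxc += (2.0 / m) * xc * dvar
        dmu = np.sum(dxc, axis=0)
        return dxc - dmu / m, dgamma, dbeta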

Now let us verify that the class above implements Batch Normalization correctly.

import numpy as np
from ivory.core.model import sequential

net = [
    ("input", 20),
    (2, "affine", 4, "batch_normalization", "relu"),
    ("affine", 5, "softmax_cross_entropy"),
]

model = sequential(net)
model.layers

[6] 2019-06-12 20:01:13 (17.9ms) python3 (272ms)

[<Affine('Affine.1', (20, 4)) at 0x2562a342860>,
 <BatchNormalization('BatchNormalization.1', (4,)) at 0x2562a342940>,
 <Relu('Relu.1', (4,)) at 0x2562a342b00>,
 <Affine('Affine.2', (4, 4)) at 0x2562a342be0>,
 <BatchNormalization('BatchNormalization.2', (4,)) at 0x2562a342da0>,
 <Relu('Relu.2', (4,)) at 0x2562a342f98>,
 <Affine('Affine.3', (4, 5)) at 0x2562a34b0b8>,
 <SoftmaxCrossEntropy('SoftmaxCrossEntropy.1', (5,)) at 0x2562a34b278>]

Let us look at the parameters of the BatchNormalization layer.

bn = model.layers[1]
bn.parameters

[7] 2019-06-12 20:01:13 (7.04ms) python3 (279ms)

[<Input('BatchNormalization.1.x', (4,)) at 0x2562a342978>,
 <Output('BatchNormalization.1.y', (4,)) at 0x2562a3429e8>,
 <Weight('BatchNormalization.1.gamma', (4,)) at 0x2562a3429b0>,
 <Weight('BatchNormalization.1.beta', (4,)) at 0x2562a342a20>,
 <State('BatchNormalization.1.running_mean', (4,)) at 0x2562a342a58>,
 <State('BatchNormalization.1.running_var', (4,)) at 0x2562a342a90>,
 <State('BatchNormalization.1.train', ()) at 0x2562a342ac8>]

The weights \gamma and \beta represent the affine transform applied after normalization. The layer also holds state variables: the running mean running_mean and running variance running_var used at test time, and the flag train indicating whether the layer is in training mode.
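When train is False, forward_2d normalizes with these running statistics instead of the batch statistics:

y_i = \gamma\,\frac{x_i - \mathrm{running\_mean}}{\sqrt{\mathrm{running\_var} + \epsilon}} + \beta

Immediately after construction the running statistics are still zero, as the next cell shows.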

print("running mean:", bn.running_mean.d)
print("running var: ", bn.running_var.d)
print("train:       ", bn.train.d)

[8] 2019-06-12 20:01:13 (12.0ms) python3 (291ms)

running mean: [0. 0. 0. 0.]
running var:  [0. 0. 0. 0.]
train:        True

Next, let us look at how the data changes before and after the BatchNormalization layers.

xv, tv = model.data_input_variables
layers = model.layers
affine1, norm1 = layers[0].y, layers[1].y
affine2, norm2 = layers[3].y, layers[4].y

[9] 2019-06-12 20:01:13 (5.00ms) python3 (296ms)

batch_size = 100
x = np.random.randn(batch_size, *xv.shape)
high = layers[-1].x.shape[0]
t = np.random.randint(0, high, (batch_size, *tv.shape))
model.set_data(x, t)

model.forward()
model.backward()

[10] 2019-06-12 20:01:13 (9.02ms) python3 (305ms)

print(affine1.d.mean(axis=0))
print(affine1.d.std(axis=0))
print(affine2.d.mean(axis=0))
print(affine2.d.std(axis=0))
print(norm1.d.mean(axis=0))
print(norm1.d.std(axis=0))
print(norm2.d.mean(axis=0))
print(norm2.d.std(axis=0))

[11] 2019-06-12 20:01:13 (17.8ms) python3 (323ms)

[-0.17217981  0.04251634  0.07352591 -0.01499862]
[1.22741099 1.11893047 1.51183028 1.55530686]
[ 0.36655023 -0.31966219  0.75286173 -0.73086581]
[0.45863232 0.74070044 1.58615885 1.06805542]
[0.00000000e+00 2.83106871e-17 4.10782519e-17 9.43689571e-18]
[0.99999997 0.99999996 0.99999998 0.99999998]
[ 2.44249065e-16 -3.71924713e-16  4.88498131e-17 -3.88578059e-18]
[0.99999976 0.99999991 0.99999998 0.99999996]

As expected, the outputs of the Batch Normalization layers are normalized so that, within the batch, each unit has mean 0 and standard deviation 1, while the Affine outputs before normalization are not. Next, we perform gradient checking.
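model.gradient_error compares the analytic gradients with numerically estimated ones. To illustrate the idea (the exact metric ivory computes may differ), here is a central-difference check applied to the standalone batchnorm_forward and batchnorm_backward sketch shown after Code 2.9, using an arbitrary loss L = sum(y ** 2):

    import numpy as np

    def numerical_grad(f, x, eps=1e-5):
        """Central-difference estimate of the gradient of a scalar function f."""
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=["multi_index"])
        while not it.finished:
            i = it.multi_index
            orig = x[i]
            x[i] = orig + eps
            fp = f()
            x[i] = orig - eps
            fm = f()
            x[i] = orig  # restore the original value
            grad[i] = (fp - fm) / (2 * eps)
            it.iternext()
        return grad

    x = np.random.randn(10, 4)
    gamma, beta = np.random.randn(4), np.random.randn(4)

    def loss():
        y, _ = batchnorm_forward(x, gamma, beta)
        return np.sum(y ** 2)

    y, cache = batchnorm_forward(x, gamma, beta)
    dx, dgamma, dbeta = batchnorm_backward(2 * y, cache)  # dL/dy = 2y for this loss

    print(np.max(np.abs(dx - numerical_grad(loss, x))))       # differences should be tiny
    print(np.max(np.abs(dgamma - numerical_grad(loss, gamma))))
    print(np.max(np.abs(dbeta - numerical_grad(loss, beta))))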

for v in model.grad_variables:
    print(v.parameters[0].name, model.gradient_error(v))

[12] 2019-06-12 20:01:13 (816ms) python3 (1.14s)

x 9.548821898909256e-09
W 7.558614830101108e-07
b 6.837249950236712e-18
gamma 1.7915361603249697e-06
beta 2.9362385780152406e-07
W 1.2863489168593275e-06
b 1.8526304622346057e-17
gamma 1.3054237404418363e-05
beta 6.173637762865031e-07
W 3.8470090224768275e-06
b 2.617637083306146e-07

The differences are small, so the analytic gradients agree with the numerical ones. This completes the implementation of the Batch Normalization layer.