Alex Shapiro

Layer Normalization

Neural networks train poorly when their activations collapse toward 0 or blow up toward infinity. Backprop relies on derivatives, and common activation functions like sigmoid, tanh, and ReLU have derivatives that approach 0 once their inputs saturate: very large in magnitude for sigmoid and tanh, or negative for ReLU. A unit that saturates is effectively neutralized: almost no gradient flows through it, so it stops learning. As units are neutralized, the model distributes its predictive power among fewer of them. The model loses capacity and trains more slowly.
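
A quick PyTorch check, a sketch with arbitrary input values, shows the saturation at work: sigmoid's derivative peaks at an input of 0 and collapses toward 0 as the input grows.

import torch

# Sigmoid's derivative is sigmoid(x) * (1 - sigmoid(x)): largest at x = 0,
# vanishingly small once the input saturates.
x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # roughly [0.25, 0.105, 0.0066, 0.00005]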

This phenomenon is known as the vanishing gradient problem. Layer normalization, aka LayerNorm in PyTorch, mitigates vanishing gradients by normalizing the activations that flow between network layers, keeping them in the range where gradients stay healthy.

How might we normalize a neural network layer? There are two components to data normalization:

1. Centering: subtract the mean, so the data is distributed around 0.
2. Scaling: divide by the standard deviation, so the data has unit variance.

Mathematically, this looks like:

x := the input tensor
µ := mean(x)
σ := standard_deviation(x)
norm := (x - µ)/σ

The above equation is the heart of layer normalization. It works on its own in most cases, but it has an edge case: when σ is 0, the division blows up. To avoid any possibility of dividing by zero, we add a small ε to the denominator:

ε := 1e-6
norm := (x - µ)/(σ + ε)
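
As a sketch of the same equation in PyTorch (the sample tensor here is arbitrary; mu, sigma, and eps follow the definitions above):

import torch

eps = 1e-6
x = torch.randn(8) * 3.0 + 5.0      # arbitrary sample, shifted and scaled away from mean 0 / unit variance

mu = x.mean()                       # µ
sigma = x.std()                     # σ
norm = (x - mu) / (sigma + eps)     # centered and scaled

print(norm.mean(), norm.std())      # approximately 0 and 1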

We could stop here and have a working layer normalization equation. However, we can make the normalized output more useful to follow-up layers with two more properties:

1. A learned scaling factor γ, initialized to 1, that lets the network stretch or shrink the normalized values.
2. A learned bias β, initialized to 0, that lets the network shift them away from a mean of 0.

These two properties make up the affine transform γx + β, where γ is the scaling factor and β is the bias. Empirical tests show that a learned affine transform improves the performance of layer normalization over an implementation where γ and β are frozen at their defaults of 1 and 0. Altogether, the equation looks like:

γ := Parameter(ones(x.size()))
β := Parameter(zeros(x.size()))
norm := γ(x - µ)/(σ + ε) + β
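
Wrapped up as a minimal PyTorch module, it might look like the sketch below. This is not torch.nn.LayerNorm itself; the features argument and attribute names are illustrative.

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Normalizes the last dimension of the input and applies the learned
    # affine transform γx + β.
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(features))   # γ
        self.beta = nn.Parameter(torch.zeros(features))   # β
        self.eps = eps                                     # ε

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)      # µ, per example
        sigma = x.std(dim=-1, keepdim=True)    # σ, per example
        return self.gamma * (x - mu) / (sigma + self.eps) + self.beta

A call like LayerNorm(512)(torch.randn(2, 10, 512)) then normalizes each 512-dimensional vector on its own.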

The PyTorch docs show a slightly different denominator: sqrt(σ² + ε). The two forms are nearly identical numerically and make no practical difference to convergence speed. The LayerNorm implementation in Attention Is All You Need uses (σ + ε), likely because doing so skips an unnecessary square and square root.
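
A quick numerical check (arbitrary sample values, using the same σ for both forms) backs this up:

import torch

eps = 1e-6
x = torch.randn(4, 512)
mu = x.mean(dim=-1, keepdim=True)
sigma = x.std(dim=-1, keepdim=True)

a = (x - mu) / (sigma + eps)                 # denominator used above
b = (x - mu) / torch.sqrt(sigma**2 + eps)    # denominator from the PyTorch docs
print((a - b).abs().max())                   # vanishingly small difference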