# Backpropagation

## Clarification

### Routes of gradient

For each variable $w$, the gradients propagated along all routes are combined into $\Delta w$. For example, consider

$X = \begin{bmatrix} x_0 & x_1 & x_2 \end{bmatrix}$

$Y = \mathrm{softmax}(X) = \begin{bmatrix} y_0 & y_1 & y_2 \end{bmatrix}$

the gradient $\Delta x_0$ has three routes: $y_0 \Rightarrow x_0$, $y_1 \Rightarrow x_0$, and $y_2 \Rightarrow x_0$, thus

$$
\Delta x_0 = \frac{\partial y_0}{\partial x_0} + \frac{\partial y_1}{\partial x_0} + \frac{\partial y_2}{\partial x_0}
= y_0(1-y_0) - y_0 y_1 - y_0 y_2 = y_0(1 - y_0 - y_1 - y_2) = 0
$$
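A minimal check of this cancellation in TensorFlow (the input values here are arbitrary); the fuller attention-style example below traces the same effect:

import tensorflow as tf

x = tf.constant([[0.3, 1.2, -0.5]])      # arbitrary input row
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.softmax(x)
# tape.gradient sums over all components of y, i.e. over all routes
print(tape.gradient(y, x))               # ~[[0. 0. 0.]]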

import tensorflow as tf

# example inputs and weights (assumed shapes/values, not in the original notes):
# one token per input, model dimension 4
x1 = tf.random.normal((1, 4))
x2 = tf.random.normal((1, 4))
x3 = tf.random.normal((1, 4))
wq = tf.random.normal((4, 4))
wk = tf.random.normal((4, 4))
wk2 = tf.random.normal((4, 4))
wk3 = tf.random.normal((4, 4))

with tf.GradientTape(persistent=True) as tape:
    # watch the (constant) key weights; intermediate tensors such as s and s3
    # do not need to be watched -- they are recorded automatically once they
    # depend on a watched tensor
    tape.watch(wk2)
    tape.watch(wk)
    tape.watch(wk3)
    q = tf.linalg.matmul(x1, wq)
    k = tf.linalg.matmul(x1, wk)
    k2 = tf.linalg.matmul(x2, wk2)
    k3 = tf.linalg.matmul(x3, wk3)
    s = tf.linalg.matmul(q, tf.transpose(k))
    s2 = tf.linalg.matmul(q, tf.transpose(k2))
    s3 = tf.linalg.matmul(q, tf.transpose(k3))
    S = tf.concat([s, s2, s3 - 10000], axis=1)   # the -10000 masks out the third score
    y = tf.nn.softmax(S)[0]
    y0 = tf.gather(y, 0)
    y1 = tf.gather(y, 1)
    y2 = tf.gather(y, 2)

tape.gradient(y, s) => 0           # y is a vector: all routes are summed and cancel
tape.gradient(y0, s) => some value # a single route does not cancel

# routes to s3 all carry (numerically) zero gradient: the -10000 mask drives y2 to ~0
tape.gradient(y0, s3) => 0
tape.gradient(y1, s3) => 0
tape.gradient(y2, s3) => 0
tape.gradient(y0, wk3) => 0  # gradients further down the chain (e.g. wk3) are also 0

### Some observations

  • if $x$ is 0, $\Delta w$ is 0 (see the sketch after this list)
  • considering $\frac{\partial y_1}{\partial y_2}\frac{\partial y_2}{\partial y_3}\frac{\partial y_3}{\partial w}$, if $\frac{\partial y_2}{\partial y_3}$ is 0, $\Delta w$ is also 0
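A quick check of the first observation with a toy dense layer (shapes and values are arbitrary):

import tensorflow as tf

x = tf.zeros((1, 3))                        # input is all zeros
w = tf.Variable(tf.random.normal((3, 2)))
with tf.GradientTape() as tape:
    y = tf.reduce_sum(tf.matmul(x, w))
# ds/dw_kj = x_k = 0 for every k, so the whole weight gradient is 0
print(tape.gradient(y, w))                  # all zeros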

## Scalar level

### Multi-label classification

logistic (sigmoid) activation at the output layer, binary cross-entropy loss

Forward

the weighted input sum at hidden unit $j$: $s_{j}^{1}=\sum_{k} x_{k} w_{kj}^{1}$

logistic/sigmoid activation at hidden unit $j$: $h_{j} = \frac{1}{1+e^{-s_{j}^{1}}}$

the weighted input sum at output unit $i$: $s_{i}=\sum_{j} h_{j} w_{ji}$

logistic/sigmoid activation at output unit $i$: $y_{i} = \frac{1}{1+e^{-s_{i}}}$

binary cross-entropy error: $E=-\sum_{i=1}^{\text{nout}}\left(t_{i} \log(y_{i})+(1-t_{i}) \log(1-y_{i})\right)$

Backward

$$
\begin{aligned}
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial y_{i}} \frac{\partial y_{i}}{\partial s_{i}} \frac{\partial s_{i}}{\partial w_{ji}} \\
\frac{\partial E}{\partial y_{i}} &= \frac{-t_{i}}{y_{i}}+\frac{1-t_{i}}{1-y_{i}} = \frac{y_{i}-t_{i}}{y_{i}(1-y_{i})} && \text{(binary cross-entropy)} \\
\frac{\partial y_{i}}{\partial s_{i}} &= y_{i}(1-y_{i}) && \text{(sigmoid/logistic activation)} \\
\frac{\partial s_{i}}{\partial w_{ji}} &= h_{j} && \text{(fc layer)}
\end{aligned}
$$

gives:

$$
\frac{\partial E}{\partial s_{i}} = y_{i}-t_{i}
\quad\text{and}\quad
\frac{\partial E}{\partial w_{ji}} = (y_{i}-t_{i})\, h_{j}
$$
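This result is easy to verify against autodiff; a small sketch with arbitrary logits and targets:

import tensorflow as tf

s = tf.constant([[0.5, -1.0, 2.0]])   # arbitrary logits
t = tf.constant([[1.0, 0.0, 1.0]])    # arbitrary multi-label targets
with tf.GradientTape() as tape:
    tape.watch(s)
    y = tf.sigmoid(s)
    E = -tf.reduce_sum(t * tf.math.log(y) + (1 - t) * tf.math.log(1 - y))
print(tape.gradient(E, s))            # equals y - t
print(tf.sigmoid(s) - t)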

then, for the hidden-layer weights:

$$
\begin{aligned}
\frac{\partial E}{\partial s_{j}^{1}} &= \sum_{i=1}^{\text{nout}} \frac{\partial E}{\partial s_{i}} \frac{\partial s_{i}}{\partial h_{j}} \frac{\partial h_{j}}{\partial s_{j}^{1}} \\
&= \sum_{i=1}^{\text{nout}} (y_{i}-t_{i})\,(w_{ji})\,\bigl(h_{j}(1-h_{j})\bigr) \\
\frac{\partial E}{\partial w_{kj}^{1}} &= \frac{\partial E}{\partial s_{j}^{1}} \frac{\partial s_{j}^{1}}{\partial w_{kj}^{1}} \\
&= \sum_{i=1}^{\text{nout}} (y_{i}-t_{i})\,(w_{ji})\,\bigl(h_{j}(1-h_{j})\bigr)\,(x_{k})
\end{aligned}
$$
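The hidden-layer formula can be checked the same way on a tiny two-layer network (the sizes and values below are arbitrary):

import tensorflow as tf

x = tf.constant([[0.2, -0.7]])                      # 1 x nin
w1 = tf.Variable(tf.random.normal((2, 3)))          # nin x nhidden
w2 = tf.Variable(tf.random.normal((3, 4)))          # nhidden x nout
t = tf.constant([[1.0, 0.0, 1.0, 0.0]])             # 1 x nout targets
with tf.GradientTape() as tape:
    h = tf.sigmoid(tf.matmul(x, w1))                # hidden activations h_j
    y = tf.sigmoid(tf.matmul(h, w2))                # output activations y_i
    E = -tf.reduce_sum(t * tf.math.log(y) + (1 - t) * tf.math.log(1 - y))
# hand-derived dE/ds^1_j = [sum_i (y_i - t_i) w_ji] * h_j (1 - h_j)
delta_hidden = tf.matmul(y - t, tf.transpose(w2)) * h * (1 - h)
# hand-derived dE/dw^1_kj = x_k * dE/ds^1_j
manual_grad_w1 = tf.matmul(tf.transpose(x), delta_hidden)
print(tape.gradient(E, w1))
print(manual_grad_w1)                               # same values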

### Multi-class classification

logistic activation at the hidden layer, softmax activation at the output layer, cross-entropy loss

Forward

the weighted input sum at hidden unit $j$: $s_{j}^{1}=\sum_{k} x_{k} w_{kj}^{1}$

logistic/sigmoid activation at hidden unit $j$: $h_{j} = \frac{1}{1+e^{-s_{j}^{1}}}$

the weighted input sum at output unit $i$: $s_{i}=\sum_{j} h_{j} w_{ji}$

softmax activation at output unit $i$: $y_{i}=\frac{e^{s_{i}}}{\sum_{c}^{\text{nclass}} e^{s_{c}}}$

cross-entropy error: $E=-\sum_{i}^{\text{nclass}} t_{i} \log(y_{i})$

Backward

$$
\frac{\partial E}{\partial y_{i}} = -\frac{t_{i}}{y_{i}} \qquad \text{(cross-entropy)}
$$
$$
\begin{aligned}
\frac{\partial y_{i}}{\partial s_{k}} &=
\begin{cases}
\dfrac{e^{s_{i}}}{\sum_{c}^{\text{nclass}} e^{s_{c}}} - \left(\dfrac{e^{s_{i}}}{\sum_{c}^{\text{nclass}} e^{s_{c}}}\right)^{2} & i = k \\[2ex]
-\dfrac{e^{s_{i}} e^{s_{k}}}{\left(\sum_{c}^{\text{nclass}} e^{s_{c}}\right)^{2}} & i \neq k
\end{cases}
&& \text{(softmax)} \\
&=
\begin{cases}
y_{i}(1-y_{i}) & i = k \\
-y_{i} y_{k} & i \neq k
\end{cases}
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial E}{\partial s_{i}} &= \sum_{k}^{\text{nclass}} \frac{\partial E}{\partial y_{k}} \frac{\partial y_{k}}{\partial s_{i}} \\
&= \frac{\partial E}{\partial y_{i}} \frac{\partial y_{i}}{\partial s_{i}} + \sum_{k \neq i} \frac{\partial E}{\partial y_{k}} \frac{\partial y_{k}}{\partial s_{i}} \\
&= -t_{i}(1-y_{i}) + \sum_{k \neq i} t_{k} y_{i} \\
&= -t_{i} + y_{i} \sum_{k} t_{k} \\
&= y_{i} - t_{i} && \text{(since } \textstyle\sum_{k} t_{k} = 1 \text{)}
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial s_{i}} \frac{\partial s_{i}}{\partial w_{ji}} && (w_{ji} \text{ only affects } s_{i}) \\
&= (y_{i}-t_{i})\, h_{j}
\end{aligned}
$$
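As in the multi-label case, autodiff agrees that the gradient with respect to the logits is simply $y_i - t_i$; a quick check with arbitrary logits and a one-hot target:

import tensorflow as tf

s = tf.constant([[1.0, -0.5, 0.3]])   # arbitrary logits
t = tf.constant([[0.0, 1.0, 0.0]])    # one-hot target
with tf.GradientTape() as tape:
    tape.watch(s)
    y = tf.nn.softmax(s)
    E = -tf.reduce_sum(t * tf.math.log(y))
print(tape.gradient(E, s))            # equals y - t
print(tf.nn.softmax(s) - t)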

## Control

### Update a layer's trainable state (official doc)

Modern Keras contains the following facilities to view and manipulate the trainable state:

# Print the current trainable map:
print(model._get_trainable_state())

# Set every layer to be non-trainable.
# Take care to exclude model wrappers: setting model_wrapper.trainable = False
# marks all of its sub-layers as non-trainable too.
for k, v in model._get_trainable_state().items():
    k.trainable = False

# Don't forget to re-compile the model
model.compile(...)
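A quick sanity check (assuming the same model object as above): after freezing, Keras no longer reports any trainable weights.

print(len(model.trainable_weights))       # 0
print(len(model.non_trainable_weights))   # every weight now shows up here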

Important notes about the BatchNormalization layer

Many image models contain BatchNormalization layers. That layer is a special case on every imaginable count. Here are a few things to keep in mind.

  • BatchNormalization contains 2 non-trainable weights that get updated during training. These are the variables tracking the mean and variance of the inputs.
  • When you set bn_layer.trainable = False, the BatchNormalization layer will run in inference mode and will not update its mean & variance statistics. This is not the case for other layers in general, as weight trainability and inference/training modes are two orthogonal concepts. But the two are tied in the case of the BatchNormalization layer.
  • When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training=False when calling the base model. Otherwise the updates applied to the non-trainable weights will suddenly destroy what the model has learned.

# Create base model
base_model = keras.applications.Xception(
    weights='imagenet',
    input_shape=(150, 150, 3),
    include_top=False)
# Freeze base model
base_model.trainable = False

# Create new model on top.
inputs = keras.Input(shape=(150, 150, 3))
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
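A sketch of the later fine-tuning step that the note above warns about; the optimizer, loss and learning rate here are illustrative choices, not from the original excerpt:

# Later, to fine-tune: unfreeze the base model. Because it was called with
# training=False above, its BatchNormalization layers stay in inference mode.
base_model.trainable = True

# Re-compile with a low learning rate so the pretrained weights are not destroyed
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[keras.metrics.BinaryAccuracy()])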

### Exclude variables by name

Sometimes variables are included in a model implicitly; to exclude them from the update, filter the model's trainable_variables by name:

training_vars = compiled_model.trainable_variables
# keep everything except the implicitly included variable
training_vars = [v for v in training_vars if v.name != 'some_name']

...

# inside a custom training step: compute and apply gradients only for the
# filtered variables
gradients = tape.gradient(loss, training_vars)
model.optimizer.apply_gradients(zip(gradients, training_vars))

### Exclude part of a variable with tf.stop_gradient

References:

  • https://stackoverflow.com/a/43368518/6845273
  • https://www.tensorflow.org/api_docs/python/tf/stop_gradient
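A minimal sketch of the masking idea from the references above: mix the variable with a tf.stop_gradient copy of itself so that only selected entries receive gradient (the mask layout is just for illustration):

import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 3)))
# 1.0 where an entry should keep training, 0.0 where it should stay frozen
mask = tf.constant([[1.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0]])
x = tf.random.normal((1, 3))

with tf.GradientTape() as tape:
    # gradients only flow through the masked-in part of w
    w_eff = mask * w + (1.0 - mask) * tf.stop_gradient(w)
    loss = tf.reduce_sum(tf.matmul(x, w_eff) ** 2)
print(tape.gradient(loss, w))   # zeros exactly where mask == 0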