# Backpropagation

## Clarification

### Routes of gradient

For each variable $w$, the gradients propagated along all routes are combined into $\Delta w$. For example, consider

$X = \begin{bmatrix} x_0 & x_1 & x_2 \end{bmatrix}$

$Y = \mathrm{softmax}(X) = \begin{bmatrix} y_0 & y_1 & y_2 \end{bmatrix}$

the gradient $\Delta x_0$ has three routes: $y_0 \Rightarrow x_0$, $y_1 \Rightarrow x_0$, and $y_2 \Rightarrow x_0$, thus

$$
\Delta x_0 = \frac{\partial y_0}{\partial x_0} + \frac{\partial y_1}{\partial x_0} + \frac{\partial y_2}{\partial x_0}
= y_0(1-y_0) - y_0 y_1 - y_0 y_2 = y_0(1 - y_0 - y_1 - y_2) = 0
$$
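A minimal check of this cancellation in TensorFlow (the input values here are arbitrary); the fuller attention-style example below traces the same effect:

import tensorflow as tf

x = tf.constant([[0.3, 1.2, -0.5]])      # arbitrary input row
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.softmax(x)
# tape.gradient sums over all components of y, i.e. over all routes
print(tape.gradient(y, x))               # ~[[0. 0. 0.]]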

import tensorflow as tf

# example inputs and weights (assumed shapes/values, not in the original notes):
# one token per input, model dimension 4
x1 = tf.random.normal((1, 4))
x2 = tf.random.normal((1, 4))
x3 = tf.random.normal((1, 4))
wq = tf.random.normal((4, 4))
wk = tf.random.normal((4, 4))
wk2 = tf.random.normal((4, 4))
wk3 = tf.random.normal((4, 4))

with tf.GradientTape(persistent=True) as tape:
    # watch the (constant) key weights; intermediate tensors such as s and s3
    # do not need to be watched -- they are recorded automatically once they
    # depend on a watched tensor
    tape.watch(wk2)
    tape.watch(wk)
    tape.watch(wk3)
    q = tf.linalg.matmul(x1, wq)
    k = tf.linalg.matmul(x1, wk)
    k2 = tf.linalg.matmul(x2, wk2)
    k3 = tf.linalg.matmul(x3, wk3)
    s = tf.linalg.matmul(q, tf.transpose(k))
    s2 = tf.linalg.matmul(q, tf.transpose(k2))
    s3 = tf.linalg.matmul(q, tf.transpose(k3))
    S = tf.concat([s, s2, s3 - 10000], axis=1)   # the -10000 masks out the third score
    y = tf.nn.softmax(S)[0]
    y0 = tf.gather(y, 0)
    y1 = tf.gather(y, 1)
    y2 = tf.gather(y, 2)

tape.gradient(y, s) => 0           # y is a vector: all routes are summed and cancel
tape.gradient(y0, s) => some value # a single route does not cancel

# routes to s3 all carry (numerically) zero gradient: the -10000 mask drives y2 to ~0
tape.gradient(y0, s3) => 0
tape.gradient(y1, s3) => 0
tape.gradient(y2, s3) => 0
tape.gradient(y0, wk3) => 0  # gradients further down the chain (e.g. wk3) are also 0

### Some observations

  • if $x$ is 0, $\Delta w$ is 0 (see the sketch after this list)
  • considering $\frac{\partial y_1}{\partial y_2}\frac{\partial y_2}{\partial y_3}\frac{\partial y_3}{\partial w}$, if $\frac{\partial y_2}{\partial y_3}$ is 0, $\Delta w$ is also 0
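A quick check of the first observation with a toy dense layer (shapes and values are arbitrary):

import tensorflow as tf

x = tf.zeros((1, 3))                        # input is all zeros
w = tf.Variable(tf.random.normal((3, 2)))
with tf.GradientTape() as tape:
    y = tf.reduce_sum(tf.matmul(x, w))
# ds/dw_kj = x_k = 0 for every k, so the whole weight gradient is 0
print(tape.gradient(y, w))                  # all zeros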

## Scalar level

### Multi-label classification

logistic (sigmoid) activation at the output layer, binary cross-entropy loss

Forward

the weighted input sum at hidden unit $j$: $s_{j}^{1}=\sum_{k} x_{k} w_{kj}^{1}$

logistic/sigmoid activation at hidden unit $j$: $h_{j} = \frac{1}{1+e^{-s_{j}^{1}}}$

the weighted input sum at output unit $i$: $s_{i}=\sum_{j} h_{j} w_{ji}$

logistic/sigmoid activation at output unit $i$: $y_{i} = \frac{1}{1+e^{-s_{i}}}$

binary cross-entropy error: $E=-\sum_{i=1}^{\text{nout}}\left(t_{i} \log(y_{i})+(1-t_{i}) \log(1-y_{i})\right)$

Backward

$$
\begin{aligned}
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial y_{i}} \frac{\partial y_{i}}{\partial s_{i}} \frac{\partial s_{i}}{\partial w_{ji}} \\
\frac{\partial E}{\partial y_{i}} &= \frac{-t_{i}}{y_{i}}+\frac{1-t_{i}}{1-y_{i}} = \frac{y_{i}-t_{i}}{y_{i}(1-y_{i})} && \text{(binary cross-entropy)} \\
\frac{\partial y_{i}}{\partial s_{i}} &= y_{i}(1-y_{i}) && \text{(sigmoid/logistic activation)} \\
\frac{\partial s_{i}}{\partial w_{ji}} &= h_{j} && \text{(fc layer)}
\end{aligned}
$$

gives:

$$
\frac{\partial E}{\partial s_{i}} = y_{i}-t_{i}
\quad\text{and}\quad
\frac{\partial E}{\partial w_{ji}} = (y_{i}-t_{i})\, h_{j}
$$
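This result is easy to verify against autodiff; a small sketch with arbitrary logits and targets:

import tensorflow as tf

s = tf.constant([[0.5, -1.0, 2.0]])   # arbitrary logits
t = tf.constant([[1.0, 0.0, 1.0]])    # arbitrary multi-label targets
with tf.GradientTape() as tape:
    tape.watch(s)
    y = tf.sigmoid(s)
    E = -tf.reduce_sum(t * tf.math.log(y) + (1 - t) * tf.math.log(1 - y))
print(tape.gradient(E, s))            # equals y - t
print(tf.sigmoid(s) - t)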

then, for the hidden-layer weights:

$$
\begin{aligned}
\frac{\partial E}{\partial s_{j}^{1}} &= \sum_{i=1}^{\text{nout}} \frac{\partial E}{\partial s_{i}} \frac{\partial s_{i}}{\partial h_{j}} \frac{\partial h_{j}}{\partial s_{j}^{1}} \\
&= \sum_{i=1}^{\text{nout}} (y_{i}-t_{i})\,(w_{ji})\,\bigl(h_{j}(1-h_{j})\bigr) \\
\frac{\partial E}{\partial w_{kj}^{1}} &= \frac{\partial E}{\partial s_{j}^{1}} \frac{\partial s_{j}^{1}}{\partial w_{kj}^{1}} \\
&= \sum_{i=1}^{\text{nout}} (y_{i}-t_{i})\,(w_{ji})\,\bigl(h_{j}(1-h_{j})\bigr)\,(x_{k})
\end{aligned}
$$
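The hidden-layer formula can be checked the same way on a tiny two-layer network (the sizes and values below are arbitrary):

import tensorflow as tf

x = tf.constant([[0.2, -0.7]])                      # 1 x nin
w1 = tf.Variable(tf.random.normal((2, 3)))          # nin x nhidden
w2 = tf.Variable(tf.random.normal((3, 4)))          # nhidden x nout
t = tf.constant([[1.0, 0.0, 1.0, 0.0]])             # 1 x nout targets
with tf.GradientTape() as tape:
    h = tf.sigmoid(tf.matmul(x, w1))                # hidden activations h_j
    y = tf.sigmoid(tf.matmul(h, w2))                # output activations y_i
    E = -tf.reduce_sum(t * tf.math.log(y) + (1 - t) * tf.math.log(1 - y))
# hand-derived dE/ds^1_j = [sum_i (y_i - t_i) w_ji] * h_j (1 - h_j)
delta_hidden = tf.matmul(y - t, tf.transpose(w2)) * h * (1 - h)
# hand-derived dE/dw^1_kj = x_k * dE/ds^1_j
manual_grad_w1 = tf.matmul(tf.transpose(x), delta_hidden)
print(tape.gradient(E, w1))
print(manual_grad_w1)                               # same values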

### Multi-class classification

logistic activation at the hidden layer, softmax activation at the output layer, cross-entropy loss

Forward

the weighted input sum at hidden unit $j$: $s_{j}^{1}=\sum_{k} x_{k} w_{kj}^{1}$

logistic/sigmoid activation at hidden unit $j$: $h_{j} = \frac{1}{1+e^{-s_{j}^{1}}}$

the weighted input sum at output unit $i$: $s_{i}=\sum_{j} h_{j} w_{ji}$

softmax activation at output unit $i$: $y_{i}=\frac{e^{s_{i}}}{\sum_{c}^{\text{nclass}} e^{s_{c}}}$

cross-entropy error: $E=-\sum_{i}^{\text{nclass}} t_{i} \log(y_{i})$

Backward

$$
\frac{\partial E}{\partial y_{i}} = -\frac{t_{i}}{y_{i}} \qquad \text{(cross-entropy)}
$$
$$
\begin{aligned}
\frac{\partial y_{i}}{\partial s_{k}} &=
\begin{cases}
\dfrac{e^{s_{i}}}{\sum_{c}^{\text{nclass}} e^{s_{c}}} - \left(\dfrac{e^{s_{i}}}{\sum_{c}^{\text{nclass}} e^{s_{c}}}\right)^{2} & i = k \\[2ex]
-\dfrac{e^{s_{i}} e^{s_{k}}}{\left(\sum_{c}^{\text{nclass}} e^{s_{c}}\right)^{2}} & i \neq k
\end{cases}
&& \text{(softmax)} \\
&=
\begin{cases}
y_{i}(1-y_{i}) & i = k \\
-y_{i} y_{k} & i \neq k
\end{cases}
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial E}{\partial s_{i}} &= \sum_{k}^{\text{nclass}} \frac{\partial E}{\partial y_{k}} \frac{\partial y_{k}}{\partial s_{i}} \\
&= \frac{\partial E}{\partial y_{i}} \frac{\partial y_{i}}{\partial s_{i}} + \sum_{k \neq i} \frac{\partial E}{\partial y_{k}} \frac{\partial y_{k}}{\partial s_{i}} \\
&= -t_{i}(1-y_{i}) + \sum_{k \neq i} t_{k} y_{i} \\
&= -t_{i} + y_{i} \sum_{k} t_{k} \\
&= y_{i} - t_{i} && \text{(since } \textstyle\sum_{k} t_{k} = 1 \text{)}
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial s_{i}} \frac{\partial s_{i}}{\partial w_{ji}} && (w_{ji} \text{ only affects } s_{i}) \\
&= (y_{i}-t_{i})\, h_{j}
\end{aligned}
$$
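As in the multi-label case, autodiff agrees that the gradient with respect to the logits is simply $y_i - t_i$; a quick check with arbitrary logits and a one-hot target:

import tensorflow as tf

s = tf.constant([[1.0, -0.5, 0.3]])   # arbitrary logits
t = tf.constant([[0.0, 1.0, 0.0]])    # one-hot target
with tf.GradientTape() as tape:
    tape.watch(s)
    y = tf.nn.softmax(s)
    E = -tf.reduce_sum(t * tf.math.log(y))
print(tape.gradient(E, s))            # equals y - t
print(tf.nn.softmax(s) - t)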

## Control

### Update a layer's trainable state (official doc)

Modern Keras contains the following facilities to view and manipulate the trainable state:

# Print the current trainable map:
print(model._get_trainable_state())

# Set every layer to be non-trainable.
# Take care to exclude model wrappers: setting model_wrapper.trainable = False
# marks all of its sub-layers as non-trainable too.
for k, v in model._get_trainable_state().items():
    k.trainable = False

# Don't forget to re-compile the model
model.compile(...)
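A quick sanity check (assuming the same model object as above): after freezing, Keras no longer reports any trainable weights.

print(len(model.trainable_weights))       # 0
print(len(model.non_trainable_weights))   # every weight now shows up here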

Important notes about the BatchNormalization layer

Many image models contain BatchNormalization layers. That layer is a special case on every imaginable count. Here are a few things to keep in mind.

  • BatchNormalization contains 2 non-trainable weights that get updated during training. These are the variables tracking the mean and variance of the inputs.
  • When you set bn_layer.trainable = False, the BatchNormalization layer will run in inference mode and will not update its mean & variance statistics. This is not the case for other layers in general, as weight trainability and inference/training modes are two orthogonal concepts. But the two are tied in the case of the BatchNormalization layer.
  • When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training=False when calling the base model. Otherwise the updates applied to the non-trainable weights will suddenly destroy what the model has learned.

# Create base model
base_model = keras.applications.Xception(
    weights='imagenet',
    input_shape=(150, 150, 3),
    include_top=False)
# Freeze base model
base_model.trainable = False

# Create new model on top.
inputs = keras.Input(shape=(150, 150, 3))
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
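A sketch of the later fine-tuning step that the note above warns about; the optimizer, loss and learning rate here are illustrative choices, not from the original excerpt:

# Later, to fine-tune: unfreeze the base model. Because it was called with
# training=False above, its BatchNormalization layers stay in inference mode.
base_model.trainable = True

# Re-compile with a low learning rate so the pretrained weights are not destroyed
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[keras.metrics.BinaryAccuracy()])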

### Exclude variables by name

Sometimes variables are included in a model implicitly; to exclude them from the update, filter the model's trainable_variables by name:

training_vars = compiled_model.trainable_variables
# keep everything except the implicitly included variable
training_vars = [v for v in training_vars if v.name != 'some_name']

...

# inside a custom training step: compute and apply gradients only for the
# filtered variables
gradients = tape.gradient(loss, training_vars)
model.optimizer.apply_gradients(zip(gradients, training_vars))

### Exclude part of a variable with tf.stop_gradient

References:

  • https://stackoverflow.com/a/43368518/6845273
  • https://www.tensorflow.org/api_docs/python/tf/stop_gradient
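A minimal sketch of the masking idea from the references above: mix the variable with a tf.stop_gradient copy of itself so that only selected entries receive gradient (the mask layout is just for illustration):

import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 3)))
# 1.0 where an entry should keep training, 0.0 where it should stay frozen
mask = tf.constant([[1.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0]])
x = tf.random.normal((1, 3))

with tf.GradientTape() as tape:
    # gradients only flow through the masked-in part of w
    w_eff = mask * w + (1.0 - mask) * tf.stop_gradient(w)
    loss = tf.reduce_sum(tf.matmul(x, w_eff) ** 2)
print(tape.gradient(loss, w))   # zeros exactly where mask == 0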