# Loss

# Comparing and contrasting BinaryCrossEntropy and CrossEntropy

# Definition

Definition of cross entropy:

$H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$, where $H(p, q)$ is the cross entropy, $H(p)$ is the entropy, and $D_{\mathrm{KL}}(p \| q)$ is the KL divergence.
For discrete probability distributions $p$ and $q$:

$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)$$

$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

$$D_{\mathrm{KL}}(p \| q) = -\sum_{x \in \mathcal{X}} p(x) \log \left(\frac{q(x)}{p(x)}\right)$$
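
As a quick sanity check of the decomposition, here is a small sketch with two made-up discrete distributions (the values of `p` and `q` are arbitrary, chosen only for illustration):

```python
# Numerically verify H(p, q) = H(p) + D_KL(p || q) for two made-up distributions.
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution p
q = np.array([0.5, 0.3, 0.2])  # "approximating" distribution q

cross_entropy = -np.sum(p * np.log(q))       # H(p, q)
entropy = -np.sum(p * np.log(p))             # H(p)
kl_divergence = -np.sum(p * np.log(q / p))   # D_KL(p || q)

print(np.isclose(cross_entropy, entropy + kl_divergence))  # True
```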

Consider a binary classification problem with a positive instance (a numeric sketch follows the list):

- BinaryCrossEntropy: the FC layer outputs 1 logit, activated by sigmoid, giving the probability $a$; the loss is $-\log(a)$
- CrossEntropy: the FC layer outputs 2 logits, activated by softmax, giving the probabilities $(1-b, b)$; the loss is $-\log(b)$
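
The two formulations compute the same quantity once the positive-class probability is fixed: sigmoid on a single logit $z$ is exactly softmax on the two logits $(0, z)$. A minimal sketch of this equivalence, using a made-up logit value:

```python
# Show that sigmoid(z) equals softmax([0, z])[1], so -log(a) == -log(b).
import tensorflow as tf

z = tf.constant(1.3)                          # a made-up single logit
a = tf.sigmoid(z)                             # positive-class probability from 1 logit
two_logits = tf.stack([tf.constant(0.0), z])  # the equivalent 2-logit head (0, z)
b = tf.nn.softmax(two_logits)[1]              # positive-class probability from softmax

print(-tf.math.log(a).numpy())  # BinaryCrossEntropy loss -log(a)
print(-tf.math.log(b).numpy())  # CrossEntropy loss -log(b), same value
```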

Consider an X-class (X >= 3) classification problem with an instance of class 1 (zero-indexed):

- CrossEntropy: the FC layer outputs X logits, activated by softmax, giving the probabilities $(n_1, n_2, \dots, n_X)$; the loss is $-\log(n_2)$ (see the sketch after the tip below)

TIP

The softmax probabilities sum to 1, since one instance can only have one label.
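
A minimal sketch of the X = 3 case (the probability values are made up): with class index 1 as the target, the loss picks out $-\log(n_2)$:

```python
# Sparse categorical cross entropy with 3 classes and target class index 1.
import numpy as np
import tensorflow as tf

probs = tf.constant([[0.2, 0.5, 0.3]])  # (n1, n2, n3) from softmax, shape (1, 3)
target = tf.constant([1])               # class 1 (zero-indexed)

loss = tf.keras.losses.sparse_categorical_crossentropy(target, probs)
print(loss.numpy())     # [0.6931...]
print(-np.log(0.5))     # the same value, -log(n2)
```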

Consider an X-label (X >= 2) multi-label classification problem where the instance has only a single label, class 1:

- BinaryCrossEntropy: the FC layer outputs X logits, each activated by sigmoid, giving the probabilities $(m_1, m_2, \dots, m_X)$; the loss is $-\log(1-m_1) - \log(m_2) - \log(1-m_3) - \dots - \log(1-m_X)$ (see the sketch after the tip below)

TIP

The sigmoid probabilities are independent of each other, since one instance can have multiple labels.
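
A minimal sketch of the multi-label case with X = 3 and only label index 1 present (the probabilities are made up). Note that Keras averages the per-label terms over the last axis rather than summing them:

```python
# Multi-label BCE: compare the hand-written per-label terms with Keras.
import numpy as np
import tensorflow as tf

m = np.array([0.1, 0.8, 0.3])       # sigmoid outputs (m1, m2, m3)
target = np.array([0.0, 1.0, 0.0])  # multi-hot target: only label index 1 is on

manual_sum = -np.log(1 - m[0]) - np.log(m[1]) - np.log(1 - m[2])
keras_loss = tf.keras.losses.binary_crossentropy(target, m).numpy()

print(manual_sum / 3)  # mean of the three per-label terms
print(keras_loss)      # same value
```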

"""
Considering binary classification and for a single instance with target 1
"""

target = [1.0]  # loss computation requires float
predictions = [0.4]

# bce way 1
bce_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce = bce_fn(target, predictions)

# bce way 2
bce = tf.keras.losses.binary_crossentropy(target, predictions, from_logits=False)

num_classes = 2
one_hot_target = tf.one_hot(target, 2)  # [0., 1.]
predictions = [0.4, 0.6]  # sum to 1; if not, tf will normalize them

# ce way 1
ce_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
ce = ce_fn(one_hot_target, predictions)

# ce way 2
ce = tf.keras.losses.categorical_crossentropy(one_hot_target, predictions, from_logits=False)

# ce way 3
ce_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
ce = ce_fn(target, predictions)

# ce way 4
ce = tf.keras.losses.sparse_categorical_crossentropy(target, predictions)

# Problems with gradient stability

WARNING

If softmax/sigmoid is applied first and the log is computed separately afterwards:

- log of 0 might happen
- exp of a large positive number might happen

To solve the gradient stability problem for softmax, use `tf.nn.log_softmax`, as per the link.

To solve the gradient stability problem for sigmoid, use `tf.nn.sigmoid_cross_entropy_with_logits`, as per the link.
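
A sketch of the stable alternatives (the logit values are made up): either stay in log space with `tf.nn.log_softmax` / `tf.nn.sigmoid_cross_entropy_with_logits`, or equivalently pass `from_logits=True` to the Keras losses so the activation and the log are fused into one numerically stable op:

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])  # made-up softmax head with 3 classes
labels = tf.constant([0])

# Stable softmax cross entropy via log_softmax (softmax and log fused) ...
log_probs = tf.nn.log_softmax(logits)
manual = -tf.gather(log_probs[0], labels[0])

# ... or by handing raw logits to the Keras loss.
keras = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(labels, logits)
print(manual.numpy(), keras.numpy())  # same value

# Stable sigmoid cross entropy directly from a raw logit.
logit = tf.constant([3.0])   # made-up single logit
target = tf.constant([1.0])
print(tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=logit).numpy())
```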

# Imbalanced Classification

At the sample level, given a fixed model structure, weighting class A by `total number of samples / number of class A samples` works well. However,
at the model level, given n fully connected outputs, even though the classes within each of the n output units are imbalanced, it is not good to add weights.
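
A sketch of the sample-level weighting described above, with made-up label counts; `class_weight` is the standard argument of Keras `Model.fit` for applying it:

```python
# Compute class weight = total number of samples / number of samples in the class.
import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # made-up imbalanced labels
total = len(labels)
class_weight = {int(c): total / int(np.sum(labels == c)) for c in np.unique(labels)}
print(class_weight)  # {0: 1.25, 1: 5.0}

# model.fit(x, labels, class_weight=class_weight)  # hypothetical compiled Keras model
```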
