# Loss
Requires polishing
# Comparing BinaryCrossEntropy and CrossEntropy
# Definition
Definition of cross entropy:

$$H(p, q) = H(p) + D_{KL}(p \parallel q)$$

where $H(p, q)$ is the cross entropy, $H(p)$ is the entropy of $p$, and $D_{KL}(p \parallel q)$ is the KL divergence.

For discrete probability distributions $p$ and $q$:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
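As a quick illustrative check (made-up distributions, natural log): with $p = (0.5, 0.5)$ and $q = (0.25, 0.75)$, $H(p, q) = -(0.5 \log 0.25 + 0.5 \log 0.75) \approx 0.837$, which exceeds $H(p) = \log 2 \approx 0.693$ by exactly $D_{KL}(p \parallel q) \approx 0.144$.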
Considering a binary classification problem with a positive instance:
- BinaryCrossEntropy: the FC layer outputs 1 logit, activated by sigmoid, giving the probability $a$; the loss is $-\log a$.
- CrossEntropy: the FC layer outputs 2 logits, activated by softmax, giving the probabilities $(1-b, b)$; the loss is $-\log b$.
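For example (illustrative numbers): if the sigmoid gives $a = 0.6$ and the softmax gives $(0.4, 0.6)$, i.e. $b = 0.6$, both losses are $-\log 0.6 \approx 0.511$, so the two formulations agree whenever they assign the same probability to the positive class.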
Considering an $X$-class ($X \ge 3$) classification problem with an instance of class 1:
- CrossEntropy: the FC layer outputs $X$ logits, activated by softmax, giving the probabilities $(n_1, n_2, \ldots, n_X)$; the loss is $-\log n_1$ (see the sketch after the tip below).
TIP
The softmax probabilities sum to 1, since one instance can only have one label.
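A minimal sketch of the multi-class case, assuming $X = 3$, made-up probabilities $(n_1, n_2, n_3) = (0.2, 0.5, 0.3)$, and reading "class 1" as the first class (index 0 in code):

```python
import tensorflow as tf

probs = [[0.2, 0.5, 0.3]]  # softmax output, sums to 1
# the sparse variant takes the class index directly; "class 1" above is index 0 here
ce = tf.keras.losses.sparse_categorical_crossentropy([0], probs)
print(ce.numpy())  # ≈ [1.609] == -log(0.2)
```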
Considering an $X$-label ($X \ge 2$) multi-label classification problem, where the instance has only a single label, class 1:
- BinaryCrossEntropy: the FC layer outputs $X$ logits, each activated by sigmoid, giving the probabilities $(m_1, m_2, \ldots, m_X)$; the loss is $-\log m_1 - \sum_{i=2}^{X} \log(1 - m_i)$ (see the sketch after the tip below).
TIP
The sigmoid probabilities are independent of each other, since one instance can have multiple labels.
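A minimal sketch of the multi-label case, assuming $X = 3$ and made-up probabilities $(m_1, m_2, m_3) = (0.7, 0.2, 0.4)$ with only label 1 on; note that Keras averages the per-label terms, whereas the formula above sums them:

```python
import tensorflow as tf

target = [[1.0, 0.0, 0.0]]  # only the first label is on
probs = [[0.7, 0.2, 0.4]]   # independent sigmoid outputs, need not sum to 1
bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)(target, probs)
# formula above: -log(0.7) - log(1 - 0.2) - log(1 - 0.4) ≈ 1.091
print(bce.numpy() * 3)  # ≈ 1.091 (undo the mean over the 3 labels)
```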
"""
Considering binary classification and for a single instance with target 1
"""
target = [1.0] # loss computation requires float
predictions = [0.4]
# bce way 1
bce_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce = bce_fn(target, predictions)
# bce way 2
bce = tf.keras.losses.binary_crossentropy(target, predictions, from_logits=False)
num_classes = 2
one_hot_target = tf.one_hot(target, 2) # [0., 1.]
predictions = [0.4, 0.6] # sum to 1; if not, tf will normalize them
# ce way 1
ce_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
ce = ce_fn(one_hot_target, predictions)
# ce way 2
ce = tf.keras.losses.categorical_crossentropy(one_hot_target, predictions, from_logits=False)
# ce way 3
ce_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
ce = ce_fn(target, predictions)
# ce way 4
ce = tf.keras.losses.sparse_categorical_crossentropy(target, predictions)
# Problems with gradient stability
WARNING
If softmax/sigmoid is applied first and the log is taken afterwards:
- log of 0 might happen
- exp of a large positive number might overflow

To avoid the gradient-stability problem for softmax, use tf.nn.log_softmax, as per link.
To avoid the gradient-stability problem for sigmoid, use tf.nn.sigmoid_cross_entropy_with_logits, as per link.
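A minimal sketch (made-up logits) of the stable paths named above, with the unstable pattern left as a comment for contrast:

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])  # raw FC outputs, no activation applied
labels = tf.constant([[1.0, 0.0, 0.0]])   # one-hot / multi-label target

# Unstable pattern: activate first, then take the log -- log(0) appears once softmax underflows.
# ce_unstable = -tf.reduce_sum(labels * tf.math.log(tf.nn.softmax(logits)), axis=-1)

# Stable softmax path: log_softmax never materializes the probabilities.
ce = -tf.reduce_sum(labels * tf.nn.log_softmax(logits, axis=-1), axis=-1)

# Stable sigmoid path: the fused op works on logits directly.
bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

# The Keras losses give the same stable behaviour when fed logits with from_logits=True.
ce_keras = tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=True)
```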
# Imbalanced Classification
At the sample level, given a fixed model structure, a class weight of

$$w_A = \frac{\text{total number of samples}}{\text{number of samples in class } A}$$

works well. However, at the model level, given $n$ fully connected outputs, even though the class distribution of each of the $n$ units is imbalanced, it is not good to add weights.
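A minimal sketch of the sample-level weighting, assuming a hypothetical 900/100 split between two classes (the counts and dummy predictions are illustrative only):

```python
import numpy as np
import tensorflow as tf

# hypothetical label counts: 900 negatives, 100 positives
labels = np.array([0] * 900 + [1] * 100)
total = len(labels)
class_weight = {c: total / np.sum(labels == c) for c in (0, 1)}  # {0: ~1.11, 1: 10.0}

# per-sample weights derived from the per-class weights, usable as the
# sample_weight argument of a Keras loss call ...
sample_weight = np.where(labels == 1, class_weight[1], class_weight[0]).astype("float32")
bce_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
y_true = labels.astype("float32").reshape(-1, 1)
y_pred = np.full((total, 1), 0.5, dtype="float32")  # dummy predictions
weighted_loss = bce_fn(y_true, y_pred, sample_weight=sample_weight)

# ... or pass class_weight directly to model.fit(x, y, class_weight=class_weight)
```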