Deep Dive into Entropy, Cross-Entropy, KL-divergence

In machine learning, we hear the term cross-entropy a lot, but what exactly is it, and what does this function do? So today we're going to discuss: what are entropy, cross-entropy, and KL-divergence?


Entropy from Physics Perspective

The idea of entropy was introduced by the German physicist Rudolf Clausius in 1850 in thermodynamics. Entropy is a measurable physical property that is most commonly associated with a state of disorder, randomness, or uncertainty.

Figure: Entropy

Entropy from Information Theory

The concept of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication". Information theory is a field of study concerned with quantifying information for communication.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

Suppose you live in Bangladesh, and suddenly the weather station announces that it is going to snow tomorrow. Snowfall in Bangladesh is a rare, low-probability event, so it is very surprising and carries a lot of information. In Antarctica, on the other hand, snowfall is a common, high-probability event, so it is unsurprising and carries little information.

So, a low-probability event has high information (surprising).
A high-probability event has low information (unsurprising).
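To make the idea of surprise concrete, here is a minimal Python sketch that computes the information content of an event as -log2(p). The snowfall probabilities used here are made up purely for illustration; they are not from any real weather data.

```python
import math

def information(p):
    """Information content (surprise) of an event with probability p, in bits."""
    return -math.log2(p)

# Illustrative (assumed) probabilities:
print(information(0.001))  # rare event, e.g. snowfall in Bangladesh  -> ~9.97 bits (very surprising)
print(information(0.95))   # common event, e.g. snowfall in Antarctica -> ~0.07 bits (unsurprising)
```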

Entropy of a Discrete Probability Distribution


Now let’s understand entropy with some calculations.
Suppose there are two people, Person A and Person B. When they talk, each of them sends one of two messages, information_01 or information_02, with the probabilities shown below.

| Information    | Person A | Person B |
|----------------|----------|----------|
| information_01 | 0.3      | 0.9      |
| information_02 | 0.7      | 0.1      |

Let’s see the entropy formula \begin{aligned} H(p) &= - \sum\limits_{x \in X} p(x) \log p(x) \end{aligned}

where p(x) is the probability of information event x.

Calculate entropy for person A, \begin{aligned} H(p) &= - (0.3 \log_2 (0.3) + 0.7 \log_2(0.7) ) =0.881 \end{aligned}

Calculate entropy for person B, \begin{aligned} H(p) &= - (0.9 \log_2 (0.9) + 0.1 \log_2(0.1) ) =0.468\end{aligned}
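The same numbers can be checked with a short Python sketch using only the standard library; the last digit may differ slightly from the values above because of rounding.

```python
import math

def entropy(probs):
    """H(p) = -sum p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs)

person_a = [0.3, 0.7]
person_b = [0.9, 0.1]

print(round(entropy(person_a), 3))  # ~0.881
print(round(entropy(person_b), 3))  # ~0.469 (0.468 above, up to rounding)
```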


Cross-Entropy

The cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from p when the coding scheme is optimized for q rather than for p.

| Information    | Person A | Person B |
|----------------|----------|----------|
| information_01 | 0.3      | 0.9      |
| information_02 | 0.7      | 0.1      |

The cross-entropy formula \begin{aligned} H(p,q) &= - \sum\limits_{x \in X} p(x) \log q(x) \end{aligned}

Calculate for Person A with Person B \begin{aligned} H(personA, personB) &= - (0.3 \log_2 (0.9) + 0.7 \log_2 (0.1)) = 2.370 \end{aligned}

Calculate for Person B with Person A \begin{aligned} H(personB, personA) &= - (0.9 \log_2 (0.3) + 0.1 \log_2 (0.7)) = 1.614 \end{aligned}
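Again, a minimal Python sketch to verify these values (rounded to three decimals, so the last digit may differ slightly from the truncated values above).

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) * log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

person_a = [0.3, 0.7]
person_b = [0.9, 0.1]

print(round(cross_entropy(person_a, person_b), 3))  # ~2.371
print(round(cross_entropy(person_b, person_a), 3))  # ~1.615
```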


KL-Divergence

The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much one probability distribution differs from another probability distribution.

| Information    | Person A | Person B |
|----------------|----------|----------|
| information_01 | 0.3      | 0.9      |
| information_02 | 0.7      | 0.1      |

KL-Divergence = ( Cross-Entropy ) - ( Entropy ), that is, D_{KL}(p \| q) = H(p, q) - H(p).

\begin{aligned} D_{KL}(p \parallel q) &= - \sum\limits_{x \in X} p(x) \log \frac{q(x)}{p(x)} \end{aligned}

Calculate for Person A relative to Person B (p = Person A, q = Person B): \begin{aligned} D_{KL}(A \parallel B) &= - (0.3 \log_2 \frac{0.9}{0.3} + 0.7 \log_2 \frac{0.1}{0.7} ) = 1.489 \end{aligned} or \begin{aligned} D_{KL}(A \parallel B) &= 2.370 - 0.881 = 1.489 \end{aligned}

Calculate for Person B relative to Person A (p = Person B, q = Person A): \begin{aligned} D_{KL}(B \parallel A) &= - (0.9 \log_2 \frac{0.3}{0.9} + 0.1 \log_2 \frac{0.7}{0.1} ) = 1.146 \end{aligned} or \begin{aligned} D_{KL}(B \parallel A) &= 1.614 - 0.468 = 1.146 \end{aligned}
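Finally, a short Python sketch that computes KL-divergence both from the direct formula and as cross-entropy minus entropy, confirming that the two agree and that KL-divergence is not symmetric.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) * log2(p(x)/q(x)) = -sum p(x) * log2(q(x)/p(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

person_a = [0.3, 0.7]
person_b = [0.9, 0.1]

# Direct formula and the cross-entropy-minus-entropy identity give the same result.
print(round(kl_divergence(person_a, person_b), 3))                      # ~1.490
print(round(cross_entropy(person_a, person_b) - entropy(person_a), 3))  # ~1.490
print(round(kl_divergence(person_b, person_a), 3))                      # ~1.146
print(round(cross_entropy(person_b, person_a) - entropy(person_b), 3))  # ~1.146
```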


So this is the basic calculation of entropy, cross-entropy, and KL-divergence. Next, we're going to see how these functions help to minimize the loss.
