Sure, I'll try to give you the intuition behind it.
Let's say I have a belief ("theory") that tries to describe/predict how the world really is ("reality"). My theory is sometimes going to be right, and sometimes going to be wrong. Sometimes it will predict an event to happen more often than it really does; sometimes less.
The KL divergence is a measure of how much my theory differs from reality.
The specific way it quantifies this is basically by asking, "If I go around expecting reality to be just like my theory, then how often, and how badly, will I be surprised?"
"Surprise" is itself quantified so that if you assign a *low* probability to an event that *does* happen (or a high probability to one that doesn't), that counts as high surprise. Likewise, if you predict a high probability for something that does happen, you get a low surprise value. (The math takes the log of the inverse of the probability your theory assigned to the event that actually happened.)
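To make the "surprise" idea concrete, here's a rough sketch in Python. The function name `surprise` and the example probabilities are made up for illustration, and I'm assuming log base 2 so the answer comes out in bits.

```python
import math

def surprise(predicted_prob):
    """Surprise (in bits) when an event you gave `predicted_prob` actually happens."""
    # log of the inverse of the probability you assigned to the event
    return math.log2(1.0 / predicted_prob)

print(surprise(0.5))   # coin flip: 1.0 bit
print(surprise(0.99))  # you saw it coming: close to 0
print(surprise(0.01))  # you were nearly sure it wouldn't happen: roughly 6.64 bits
```

The lower the probability you gave to what actually happened, the bigger the number.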
So the KL divergence goes through the world's "probability distribution" event by event, weighting each event by how often it really happens. For each one it compares the real probability to my theory's probability, finds how much *more* surprised I am than I would be if I'd known the real probabilities, and sums over the whole thing.
The result is a measure of how bad my theory is (i.e., theories that are often wrong get a high KL divergence from the world).
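If it helps, here's a minimal sketch of that calculation in Python for a toy world with two possible events. The names `kl_divergence`, `reality`, `good_theory` and `bad_theory` and all the numbers are invented for the example, and I'm assuming log base 2 (bits). Strictly speaking, what gets averaged is the *extra* surprise compared to someone who already knew the true probabilities, which is why a perfect theory scores exactly zero.

```python
import math

def kl_divergence(reality, theory):
    """KL(reality || theory) in bits; both args map event -> probability."""
    total = 0.0
    for event, p in reality.items():
        if p == 0:
            continue  # events that never happen contribute nothing
        q = theory.get(event, 0.0)
        if q == 0:
            return math.inf  # theory called a real event impossible
        # extra surprise from believing the theory instead of the truth,
        # weighted by how often the event really happens
        total += p * (math.log2(1.0 / q) - math.log2(1.0 / p))
    return total

reality = {"rain": 0.5, "sun": 0.5}
good_theory = {"rain": 0.5, "sun": 0.5}
bad_theory = {"rain": 0.25, "sun": 0.75}

print(kl_divergence(reality, good_theory))  # 0.0
print(kl_divergence(reality, bad_theory))   # roughly 0.21
```

Note that it isn't symmetric: swapping the two arguments generally gives a different number, because the weighting always comes from whichever distribution plays the role of "reality".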
Make any more sense now?
As for the joke: It's saying that information theory is the theory, and any deviation of the world from it is the KL divergence.