This is the story of an interesting flight of fancy with mathematics. I found it intriguing, and hope you do, as well.
Here’s a fact that comes up in high school mathematics: you can demote multiplication into addition by using logarithms. That is:
$$\log(xy) = \log x + \log y$$
In other words, you can compute the log of a product, given only the logs of the factors.
To students today, this might seem like just another algebraic identity. But in the age before calculators, it was actually the main reason for a typical high school student to be interested in logarithms at all! Multiplication is more difficult than addition, so if you have a way to represent numbers that makes multiplication into addition, that helps. This is the whole principle behind doing multiplication with a slide rule, for example: one just converts to logarithms, adds the resulting distances, and then converts back.
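To make the slide rule idea concrete, here is a minimal Python sketch of multiplying by way of logarithms. The function name `multiply_via_logs` is my own invention for illustration, and it assumes both inputs are positive.

```python
import math

def multiply_via_logs(x, y):
    """Multiply two positive numbers the slide-rule way.

    Convert to logarithms, add, then convert back.
    """
    log_product = math.log(x) + math.log(y)  # the only arithmetic is an addition
    return math.exp(log_product)

# Agrees with 3 * 7 up to floating-point rounding.
print(multiply_via_logs(3.0, 7.0))
```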
Similarly, one can use logarithms to demote powers into multiplication:
$$\log(x^y) = y \log x$$
But if we’re imagining a world where we work entirely with logarithms, it’s not entirely fair to just multiply by y, so I’m going to rewrite this (let’s agree that all logarithms are natural) as:
$$\log(x^y) = e^{\log y} \log x$$
There’s an additional exponential function there, but if we take that as given, we can now compute the log of a power using only multiplication, the exponential function, and the logs of the inputs.
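As a quick sanity check, here is a small Python sketch that computes the log of a power from the logs of the inputs alone. The name `log_of_power` is hypothetical, and the sketch assumes both x and y are positive so that each has a logarithm.

```python
import math

def log_of_power(log_x, log_y):
    """Return log(x ** y), given only log(x) and log(y).

    Uses one exponential and one multiplication.
    """
    return math.exp(log_y) * log_x

# Compare against computing the power directly, with x = 2 and y = 10.
direct = math.log(2.0 ** 10.0)
via_logs = log_of_power(math.log(2.0), math.log(10.0))
print(direct, via_logs)
```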
An interesting question to ask is: what about addition? The following does not work, although math teachers will recognize it as a very common mistake!
$$\log(x + y) = \log x + \log y \quad \text{(wrong!)}$$
So, can we complete this equation?
$$\log(x + y) = \text{?}$$
At first glance, thinking of the logarithm as translating operations down one order (multiplication into addition, and exponents into multiplication), this seems to call for an operation an order lower than addition. What could fit in such a place?
We can start to answer this question using simple algebra and our existing identities. Let’s assume x is not zero (since then it would have no logarithm anyway!), and then we can factor:
$$x + y = x \left(1 + \frac{y}{x}\right)$$
So by applying the log rule for multiplication, we get this nifty little formula:
$$\log(x + y) = \log x + \log\left(1 + \frac{y}{x}\right)$$
Notice that although the presentation here doesn’t look symmetric, it actually is. Swapping the x and y values doesn’t change the result.
Again, imagining that we have only the logarithms and not the actual values, that fraction at the end is sort of cheating. Just as I did with the power formula, I’ll introduce an explicit exponential, and it simplifies nicely.
$$\log(x + y) = \log x + \log\left(1 + e^{\log y - \log x}\right)$$
In order to write this more clearly, I’ll name a new function, h, and define the log of a sum in terms of it:
$$h(t) = \log(1 + e^t)$$
$$\log(x + y) = \log x + h(\log y - \log x)$$
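Here is a short numerical check of that formula in Python, taking h(t) = log(1 + e^t), which is what the derivation above arrives at. The names `h` and `log_of_sum` are just labels for this sketch. It is a direct transcription, so it will overflow for very large arguments.

```python
import math

def h(t):
    """h(t) = log(1 + e^t). math.exp overflows for t above roughly 709."""
    return math.log1p(math.exp(t))

def log_of_sum(log_x, log_y):
    """Return log(x + y), given only log(x) and log(y)."""
    return log_x + h(log_y - log_x)

log_x, log_y = math.log(3.0), math.log(5.0)
print(log_of_sum(log_x, log_y), math.log(8.0))  # the two values should agree
print(log_of_sum(log_y, log_x))                 # swapping the inputs gives the same result
```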
It’s true that we haven’t succeeded in getting rid of addition, but this is leading somewhere interesting. But what is this mysterious function h?
h: The soft rectified linear function
We can start to explore h by looking at a graph.
At first glance, it looks like h(x) is approximately zero for any input less than -2, and approximately x for any input greater than 2. This sounds like the so-called “rectified linear” function:
$$\operatorname{relu}(x) = \max(0, x)$$
Indeed, we can graph the two functions on the same axes, and see that they agree except near zero. (You can also verify this by reasoning about the formula. For inputs far less than zero, the exponential term becomes insignificant, while for inputs far greater than zero, the constant term becomes insignificant. This is the basis of a not-too-hard proof that these are asymptotes.)
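If you’d rather see numbers than a graph, the following Python sketch tabulates both functions at a few sample points. It assumes h(t) = log(1 + e^t) and takes the rectified linear function to be max(0, t). The sample points are arbitrary choices of mine.

```python
import math

def h(t):
    """The soft rectified linear function, log(1 + e^t)."""
    return math.log1p(math.exp(t))

def relu(t):
    """The rectified linear function, max(0, t)."""
    return max(0.0, t)

def gap(t):
    """How far h sits above relu at t."""
    return h(t) - relu(t)

for t in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"t={t:6.1f}  h={h(t):.6f}  relu={relu(t):.6f}  gap={gap(t):.6f}")
```

The gap is largest at zero, where it equals log 2, and it shrinks quickly in both directions.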
We can, therefore, think of h as a soft rectified linear function; what you get by just rounding out the rectified linear function around its sharp corner.
(This rectified linear function, incidentally, has been popularized in machine learning, where for reasons that depend on who you ask, it has turned out to be wildly successful as an activation function for artificial neural networks. Part of the reason for that success is that it is so simple it can be computed quickly. But that’s not enough to explain all of its success! I suspect another part of the reason is that it’s closely related to sums exactly in the sense of the very investigation we’re doing now.)
Back to the sum
So if h is so similar to the rectified linear function, what happens when you (inaccurately) use the rectified linear function itself in the sum formula above? Remarkably, you get this:
$$\log(x + y) \approx \log x + \max(0, \log y - \log x) = \max(\log x, \log y)$$
In other words, in terms of logarithms, adding numbers is approximately the same as just taking the maximum! At least, it is when the difference between the numbers is large. That sort of makes sense, actually. If you add a very large number to a very small number, the result will indeed be approximately the same as the large number. (Remember that since we’re only thinking about numbers with logarithms, both inputs must be positive. We need not worry about the case where both numbers are large but with opposite signs.)
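To get a feel for how good the approximation is, this Python sketch measures the error of replacing the log of a sum with the larger of the two logs. The function name `max_approximation_error` is hypothetical, and both inputs are assumed positive.

```python
import math

def max_approximation_error(x, y):
    """Return log(x + y) - max(log x, log y) for positive x and y."""
    return math.log(x + y) - max(math.log(x), math.log(y))

print(max_approximation_error(1000.0, 1.0))  # small: the inputs are far apart
print(max_approximation_error(5.0, 5.0))     # the worst case: equal inputs
```

The error is always positive, and it never exceeds log 2, which it reaches when the two inputs are equal.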
We can pull out of this pattern a sort of “soft” maximum function, which is almost like just giving the greater of its two arguments, but if the arguments are close then it rounds off the curve. Unfortunately, the phrase softmax already means something different and somewhat more complex to the machine learning community mentioned above, so perhaps we ought to call this something like smoothmax instead.
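Then we have our answer:
$$\operatorname{smoothmax}(a, b) = a + h(b - a) = \log(e^a + e^b)$$
$$\log(x + y) = \operatorname{smoothmax}(\log x, \log y)$$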
It’s not easily computable, really, in the sense that products and powers were, but this still gives some intuition for the function that does compute the log of a sum, given the logs of the summands. Anyway, I’m satisfied enough with that answer.
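For what it’s worth, here is one way the function might be sketched in Python. This is my own arrangement rather than anything canonical: it takes smoothmax(a, b) to be log(e^a + e^b), and it factors out the larger argument first so that the exponential never overflows.

```python
import math

def smoothmax(a, b):
    """Return log(e^a + e^b).

    Written as max(a, b) plus a correction term, so that the exponent
    passed to math.exp is never positive.
    """
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

print(smoothmax(math.log(3.0), math.log(5.0)), math.log(8.0))  # should agree
print(smoothmax(1000.0, 1000.0))  # fine here, though e^1000 itself would overflow
```

Writing it this way also makes the "soft maximum" reading visible: the result is the true maximum plus a correction that is at most log 2.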
What about the algebra?
This tells us that this smoothmax function can play the role of addition in mathematical expressions. That implies that all of the algebraic properties of addition ought to hold for smoothmax, as well. That’s interesting!
For example, smoothmax ought to be commutative. That is:
$$\operatorname{smoothmax}(a, b) = \operatorname{smoothmax}(b, a)$$
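Indeed, this is true. I made that observation above when first introducing the formula. One can also expect that smoothmax is associative. That is:
$$\operatorname{smoothmax}(a, \operatorname{smoothmax}(b, c)) = \operatorname{smoothmax}(\operatorname{smoothmax}(a, b), c)$$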
And, indeed, although the algebra is a little more complex, this turns out to be true, as well. In fact, we need not really show each of these with complicated algebra. We’ve already shown that smoothmax is addition, just using the logarithms to represent the numbers.
I think things get even more interesting when we consider the distributive property. Remember that when we work with logs, multiplication gets replaced with addition, so we have this:
$$x + \operatorname{smoothmax}(y, z) = \operatorname{smoothmax}(x + y, x + z)$$
Thinking of this as a softened maximum, this works out to be some kind of translation invariance property of the maximum: if you take the maximum of two numbers and then add x, that’s the same as adding x to each one and then taking the maximum! That intuitively checks out.
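None of this needs a computer to verify, but a numerical spot check is reassuring. The Python sketch below tests commutativity, associativity, and the distributive (translation) property at a few arbitrary values, taking smoothmax(a, b) to be log(e^a + e^b). The helper names are my own.

```python
import math

def smoothmax(a, b):
    """Return log(e^a + e^b), computed without overflow."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def is_commutative(a, b):
    return math.isclose(smoothmax(a, b), smoothmax(b, a), rel_tol=1e-9)

def is_associative(a, b, c):
    left = smoothmax(a, smoothmax(b, c))
    right = smoothmax(smoothmax(a, b), c)
    return math.isclose(left, right, rel_tol=1e-9)

def is_distributive(x, y, z):
    # Adding x plays the role of multiplication in the world of logarithms.
    left = x + smoothmax(y, z)
    right = smoothmax(x + y, x + z)
    return math.isclose(left, right, rel_tol=1e-9)

# Arbitrary sample values, chosen only for illustration.
a, b, c = 0.3, -1.7, 2.4
print(is_commutative(a, b), is_associative(a, b, c), is_distributive(a, b, c))
```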
There are some things that don’t work, though.
You might also hope for something like an identity property, since for addition we have x + 0 = x. This one doesn’t turn out so well, because we cannot take the logarithm of zero! We end up wanting to write something like:
$$\operatorname{smoothmax}(x, -\infty) = x$$
This would make sense given the asymptotic behavior of the smoothmax function, but we’re playing sort of fast and loose with infinities there, so I wouldn’t call it a true identity. To say that correctly, you need limits.
You also need to be careful with expecting smoothmax to act like a maximum! For example:
$$\operatorname{smoothmax}(x, x) = x + \log 2 \neq x$$
That’s weird… but not if you remember that smoothmax is at its least accurate when its two inputs are close together, so both inputs being the same is a worst case scenario. Indeed, that’s where the true max function has a non-differentiable sharp corner that needed to be smoothed out. And, indeed, the exact behavior is given by addition, rather than maximums, and addition is not idempotent (i.e., adding a number to itself doesn’t give the same number back).
In fact, speaking of smooth maxing a number with itself:
$$\underbrace{\operatorname{smoothmax}(x, \operatorname{smoothmax}(x, \ldots, x))}_{n \text{ copies of } x} = x + \log n$$
which resembles a sort of definition of addition of log-naturals as “repeated smoothmax of a number with itself”, in very much the same sense that multiplication by naturals can be defined as repeated addition of a number with itself, strengthening the notion that this operation is sort of one order lower than addition.
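Here is a small Python sketch of that repeated smoothmax, again taking smoothmax(a, b) to be log(e^a + e^b). The function `repeated_smoothmax` is a name I made up, and it assumes n is at least 1.

```python
import math
from functools import reduce

def smoothmax(a, b):
    """Return log(e^a + e^b), computed without overflow."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def repeated_smoothmax(x, n):
    """Fold smoothmax over n copies of x (n must be at least 1)."""
    return reduce(smoothmax, [x] * n)

x = 1.5
for n in [1, 2, 3, 10]:
    # The two columns should agree up to floating-point rounding.
    print(n, repeated_smoothmax(x, n), x + math.log(n))
```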
So there you have it. That’s as far as my flight of fancy goes. I found it interesting enough to share.