>> The key takeaway? Depth is important for certain functions.
I thought the key takeaway is that nonlinearity is important. Multilayer perceptrons can be collapsed into an equivalent single layer if there's no nonlinearity thrown in. Multi-input XOR can be solved in a single layer using a form of sin(x) function after summing (sin squared if you don't like negative numbers).
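A minimal sketch of both points (my own toy code, not from the article): two stacked linear maps collapse into one, and a single sin² unit over a plain sum computes n-input XOR, since parity of bits is just (sum mod 2) and sin²(π·s/2) is 1 for odd s, 0 for even s.

```python
import numpy as np

# Point 1: without a nonlinearity, two layers collapse into one.
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [2.0, 0.0]])
x = np.array([1.0, -2.0])
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)  # same function, one matrix

# Point 2: multi-input XOR (parity) with a single sin^2 "neuron".
def parity_sin2(bits):
    s = np.sum(bits)
    # sin^2(pi/2 * s) = 1 when s is odd, 0 when s is even
    return int(round(np.sin(np.pi / 2 * s) ** 2))
```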
Upon reading this I also kept thinking "why not use the same transformer weights on each layer?" and then the author goes there with the Universal Transformer concept. How many iterations should one use? Why should that be a question at all? As humans we might need to "think about it for a while" and get back to you when we figure it out. One thing all this AI research keeps ignoring is that real biological neurons run in real time, not as some batch process. We might also be doing training and inference at the same time, but I'm not certain how much - see dreams and other forms of background or delayed processing.
> I thought the key takeaway is that nonlinearity is important. Multilayer perceptrons can be collapsed into an equivalent single layer if there's no nonlinearity thrown in. Multi-input XOR can be solved in a single layer using a form of sin(x) function after summing (sin squared if you don't like negative numbers).
More broadly, as I've alluded to in other replies, XOR is not the problem in itself, but illustrates a bigger problem.
Yes, non-linearities can solve problems, but one hidden layer with the wrong non-linearity (sigmoid instead of sin, in this case) may not. Which leads us to the question: what do we include in the broad set we label "non-linearities"? All Turing-complete functions? (This issue is alluded to here: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00493...)
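To make the "wrong non-linearity" point concrete (my own illustration): a single unit applying sigmoid to the sum of the inputs is monotone in that sum, so no threshold on its output can separate {01, 10} from {00, 11}; the same unit with sin² of the sum is exactly XOR.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

s = X.sum(axis=1)                  # sums: 0, 1, 1, 2
sig_out = sigmoid(s)               # monotone in s: the inputs summing to 0
                                   # and 2 straddle those summing to 1, so no
                                   # single cutoff recovers XOR
sin_out = np.sin(np.pi / 2 * s) ** 2  # 0, 1, 1, 0 -- XOR directly
```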
Then we're back to analysing the computability of a given non-linearity, which in itself has multiple steps of iteration. It's turtles all the way down.