
EDIT: Before you read my comment below, please see https://news.ycombinator.com/item?id=26702815 and https://openreview.net/forum?id=PdauS7wZBfC for a different view.

--

If the results hold, they seem significant enough to me that I'd go as far as saying the authors of the paper would end up getting an important award at some point -- not just for unifying the fields of biological and artificial intelligence, but also for making it trivial to train models in a fully distributed manner, with all learning done locally.

Here's the paper: "Predictive Coding Approximates Backprop along Arbitrary Computation Graphs"

https://arxiv.org/abs/2006.04182

I'm making my way through it right now.



Note that the paper was rejected for publication in ICLR 2021:

https://openreview.net/forum?id=PdauS7wZBfC


That is an awesome site, thanks for posting it. I had no idea there was a place with that much transparent review (shows how much I've been publishing).


Yes, I linked to that same page at the top of my comment :-)


Oops, sorry -- I noticed the link but I thought it was an HN URL, like the one before... and I wondered why it was greyed out (visited). But still I didn't check it out. My very bad.


No worries :-)


Interesting follow up reading:

"Relaxing the Constraints on Predictive Coding Models" (https://arxiv.org/abs/2010.01047), from the same authors. Looks at ways to remove neurological implausibility from PCM and achieve comparable results. Sadly they only do MNIST in this one, and are not as ambitious in testing on multiple architectures and problems/datasets, but the results are still very interesting and it covers some of the important theoretical and biological concerns.

"Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks" (https://arxiv.org/abs/2103.03725), from different authors. Uses an alternative formulation that means it always converges to the backprop result within a fixed number of iterations, rather than approximately converges "in practice" within 100-200 iterations. Not only is this a stronger guarantee, it means they achieve inference speeds within spitting distance of backprop, levelling the playing field. (Edit: also noted by eutropia)

It'd be interesting to see what a combination of these two could do, and at this point I feel like a logical next step would be to provide some setting in popular ML libraries such that backprop can be swapped for PCM. Being able to verify this research just by adding a single extra line for the PCM version, and perhaps replicating state-of-the-art architectures, would be quite valuable.


Here's a more recent paper (March, 2021) which cites the above paper: https://arxiv.org/abs/2103.04689 "Predictive Coding Can Do Exact Backpropagation on Any Neural Network"


Yup. I'd expect to see many more citations going forward. In particular, I'd be excited to see how this ends up getting used in practice, e.g., training and running very large models on distributed, massively parallel "neuromorphic" hardware.


I’m going to personally flog any researcher who titles their next paper “Predictive Coding Is All You Need”. You’ve been warned.


There are already 60+ of those, and counting, all but one of them since Vaswani et al's transformer paper:

https://arxiv.org/search/?query=is+all+you+need&searchtype=a...


The thing is, about every week there is a paper published with groundbreaking claims -- this question in particular being very popular -- trying to unify neuroscience and deep learning in some way, in search of the computational foundations of AI. Mostly this is driven by the success of DL in certain industrial applications.

Unfortunately most of these papers are heavy on theory but light on empirical evidence. If we follow the path of natural sciences, theory has to agree with evidence. Otherwise it's just another theory unconstrained by reality, or worse, pseudo-science.


The paper (arxiv:2103.04689) linked by eutropia above has some empirical evidence on the ML side, showing that performance of predictive coding is not so far off backprop. And there is no shortage of suggestions for how neural circuits might work around the strict requirements of backprop-like algorithms.

cs702's original comment above is excessively hyperbolic: the compositional structure of Bayesian inversion is well known and is known to coincide structurally with the backward/forward structure of automatic differentiation. And there have been many papers before this one showing how predictive coding approximates backprop in other cases, so it is no surprise that it can do so on graphs, too. I agree with the ICLR reviewers that this paper is borderline and not in itself a major contribution. But that does not mean that this whole endeavour, of trying to find explicit mathematical connections between biological and artificial learning, is ill-motivated.


>the compositional structure of Bayesian inversion is well known

/u/tsmithe's results on that are well known, now? I can scarcely find anyone to collaborate with who understands them!


Not only light on evidence, but essentially practicality-free. There's no "there" there. Literally nothing useful will come from this.


I don't think anyone familiar with the field is in any way surprised by these results.

The breakthrough seems really limited to showing it holds for arbitrary computation graphs. We already knew this was true in practice anyway.


Agree, no one is surprised.

But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules, without having to wait for gradients to be backpropagated to all layers before the entire model can move on to the next sample.

As I understand it, this work has paved a path for training very large networks in massively parallel, fully distributed hardware -- in the not too distant future.
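For intuition, here's a minimal NumPy sketch of the general predictive-coding scheme with purely local updates (all names, hyperparameters, and the toy architecture are my own illustration, not taken from the paper's code): each layer holds value nodes, computes its own prediction error, and both the relaxation and the weight updates use only quantities available at that layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: input -> hidden -> output.
n_in, n_hid, n_out = 4, 8, 2
W1 = rng.normal(0.0, 0.3, (n_hid, n_in))
W2 = rng.normal(0.0, 0.3, (n_out, n_hid))

f = np.tanh
df = lambda v: 1.0 - np.tanh(v) ** 2

def forward(x_in):
    return W2 @ f(W1 @ f(x_in))

def pc_step(x_in, target, lr_x=0.1, lr_w=0.01, n_iters=100):
    """One predictive-coding training step using only layer-local quantities."""
    x1 = W1 @ f(x_in)            # initialize value nodes via a forward pass
    x2 = target                  # clamp the output layer to the label
    # Inference phase: relax value nodes to minimize prediction errors.
    for _ in range(n_iters):
        e1 = x1 - W1 @ f(x_in)   # error local to the hidden layer
        e2 = x2 - W2 @ f(x1)     # error local to the output layer
        x1 += lr_x * (-e1 + df(x1) * (W2.T @ e2))
    # Learning phase: Hebbian-style updates from the equilibrium errors.
    e1 = x1 - W1 @ f(x_in)
    e2 = x2 - W2 @ f(x1)
    return lr_w * np.outer(e1, f(x_in)), lr_w * np.outer(e2, f(x1))

x = rng.normal(size=n_in)
y = np.array([1.0, -1.0])
loss_before = np.sum((y - forward(x)) ** 2)
dW1, dW2 = pc_step(x, y)
W1 += dW1
W2 += dW2
loss_after = np.sum((y - forward(x)) ** 2)
```

Note the inner relaxation loop: those ~100 extra iterations per sample are exactly the computational cost the ICLR review flags relative to a single backward pass.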


> But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules

The basic version of this was shown in [1], as mentioned by the ICLR review:

"Specifically, the original paper by Whittington & Bogacz (2017) demonstrated that for MLPs, predictive coding converges to backpropagation using local learning rules."

The Whittington & Bogacz paper didn't extend to complex ANN architectures, but it would have been very surprising if what they showed didn't extend to other ANNs.

OTOH, while local-only updates are great, they don't help much if the overall algorithm needs vastly more iterations. Again, from the ICLR review: "The increase in computational cost (of 100x) is mentioned quite late and seems to be glossed over a bit."

[1] https://pubmed.ncbi.nlm.nih.gov/28333583/


In my view, there's a big difference between successfully training, say, LSTM RNNs, versus successfully training "vanilla" MLPs.

This work opens the door for using new kinds of massively parallel "neuromorphic" hardware to implement orders of magnitude more layers and units, without requiring greater communications bandwidth between layers, because the model no longer needs to wait until gradients have back-propagated from the last to the first layer before moving on to the next sample.

Scaling backpropagation to GPT-3 levels and beyond (think trillions of dense connections) is very hard -- it requires a lot of complicated plumbing and bookkeeping.

Wouldn't you want to be able to throw 100x, 1000x, or even 1Mx more fully distributed computing power at problems? This work has paved a path pointing in that direction :-)


My background is as an interested amateur, but

> also for making it trivial to train models in a fully distributed manner, with all learning done locally

seems like a really huge development.

At the same time I remain pretty skeptical of claims of unifying the fields of biological and artificial intelligence. I think the recent tremendous successes in AI & ML lead to an unjustified overconfidence that we are close to understanding how biological systems must work.


Indeed, it's worth mentioning we still have absolutely no idea how memory works.


We know a lot about memory, but most AI researchers are simply ignorant of neuroscience and cognitive psychology and stick to their comfort zone.

Saying "we have no idea" is just being lazy.


No. We really have no idea what is going on. We only know some basic psychology about it (holding ~7 items in short-term memory, etc.). If we knew something about the implementation, we could implement human-like memory.


I suggest starting with the works by Howard Eichenbaum on memory and Edvard & May-Britt Moser (and John O'Keefe and Lynn Nadel) on place & grid cells.

For the latest and greatest see

https://twitter.com/doellerlab

https://twitter.com/KordingLab

https://twitter.com/preston_lab

https://twitter.com/memorylab

https://twitter.com/ptoncompmemlab

https://twitter.com/MillerLabMIT

https://twitter.com/hugospiers

Once you start pulling that thread you'd be surprised how much we do know.


Literally nothing you posted surprises me in the least, and literally none of this work shows that we know anything at all about how memory is implemented. Perhaps read some of the many takedowns of so-called "grid cells", which show that it is completely unsurprising and not at all interesting or noteworthy that activity in some parts of the brain correlates with location information. The important questions always remain unanswered.


We know a fair bit about how cognitive maps work in 2D and 3D Euclidean environments. We know damn little about how nontrivial manifold structure can be learned, particularly in spaces with more than three dimensions.


Spatial cognitive maps used for navigation extend to arbitrarily high-dimensional spaces for abstract concept representation, using pretty much the same machinery.

There is a ton of work on this, both theory and empirical evidence; here are just a few:

"Navigating cognition: Spatial codes for human thinking" https://science.sciencemag.org/content/362/6415/eaat6766.abs...

"Organizing conceptual knowledge in humans with a gridlike code" https://science.sciencemag.org/content/352/6292/1464

"The Hippocampus Encodes Distances in Multidimensional Feature Space" https://www.sciencedirect.com/science/article/pii/S096098221...

"A non-spatial account of place and grid cells based on clustering models of concept learning" https://www.nature.com/articles/s41467-019-13760-8

"A learned map for places and concepts in the human MTL" https://www.biorxiv.org/content/10.1101/2020.06.15.152504v1....

"What Is a Cognitive Map? Organizing Knowledge for Flexible Behavior" https://www.sciencedirect.com/science/article/pii/S089662731...

"A map of abstract relational knowledge in the human hippocampal–entorhinal cortex" https://elifesciences.org/articles/17086

"Map-Like Representations of an Abstract Conceptual Space in the Human Brain" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7884611/

"Knowledge Across Reference Frames: Cognitive Maps and Image Spaces" https://www.sciencedirect.com/science/article/pii/S136466132...

"Concept formation as a computational cognitive process" https://www.sciencedirect.com/science/article/pii/S235215462...

"Efficient and flexible representation of higher-dimensional cognitive variables with grid cells" https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

"The cognitive map in humans: spatial navigation and beyond" https://www.nature.com/articles/nn.4656

"A general model of hippocampal and dorsal striatal learning and decision making" https://www.pnas.org/content/117/49/31427.short

"On the Integration of Space, Time, and Memory" https://www.sciencedirect.com/science/article/pii/S089662731...


Unless you know of working implementations of memory algorithms I tend to agree that we have no clue how memory works.


That's called being lazy.


I'm trying to imagine how that works. Imagine you've got a neural net. One node identifies the number of feet. One node identifies the number of wings. One node identifies color. This feeds into a layer that tries to predict what animal it is.

With backprop, you can sort of assume that given enough scale your algo will identify these important features. With local learning, wouldn't you get a tendency to identify the easily identifiable features many times? Is there a need for a sort of middleman, like a one-armed bandit kind of thing, that makes a decision to spawn and despawn child nodes to explore the space more?


The fallacy there is the idea that "one node" does anything useful on its own. Each node optimizes itself in a way such that you have _no idea_ what it actually codes for; at the emergent level, you see it contribute to coding for wing detection, or color detection -- or, more likely, for seventeen supposedly unrelated things, because it just happens to generate values that somehow contribute to a result for the features the various constellations detect.

(Meaning it might also actually cause one or more constellations to perform worse than if it weren't contributing -- and realistically, you'll never know.)


That's, at best, pedantically true. You can determine the function of individual components of a network, and they will correspond to concrete things. It's just that the utility of doing this is low in the scheme of things, and the function of individual components is going to be fuzzier than the nice constructs humans like to think in. If you wanted to take painstaking steps to align the functionality of nodes to specific features per the example, you could do this and the network would work just fine. It's an appropriate way to simplify an explanation of how the model works.


That's literally what we cannot do in a neural net.

It's why they're so problematic: you can determine the propagation functions of individual nodes perfectly, and that knowledge tells you exactly nothing about all of the many things its values contribute to. There is no "concrete thing" at the node level: a single node fundamentally can't see a wing, or a color, or anything else; that's only emergent behaviour of node constellations, and one node can contribute to many constellations simultaneously.

Heck, there often isn't even a "concrete thing" at many of the constellation levels; concrete things don't start to emerge until you're looking at the full state of all end nodes.


That's the machine learning 101 worldview. Initial-layer nodes are likely just minor inscrutable transformations. Later layers will be coding for features that, at least to some degree if it's a tractable problem, humans can understand and agree on as useful features. They'll be fuzzy and not as clearly defined as a human would frame the problem, but their purpose can be explored and generally identified.

In any case, speaking of them as representing singular features for simplification is appropriate. Maybe it's not one node that codes for legs but two nodes that code for legs like this and legs like that; either way, that's not relevant to the point.


> Is there a need for a sort of middleman like a one arm bandit kind of thing that makes a decision to spawn and despawn child nodes to explore the space more?

What's the one-armed bandit? (Besides a slot machine.)

My knowledge of this field is rusty, but I actually wrote my MSc thesis on novel ways to get Genetic Algorithms to explore the space more efficiently without getting stuck, so this sounds right up my alley.


I wonder if you thought of it as a type of optimal-stopping problem locally on each node and explore-exploit (multi-armed bandit) globally? For example, if each node knows to halt when it hits a [probably local] minimum, the results can be shared at that point and the best-performing models can be cross-pollinated, or whatever the mechanism is at that point. Since copying the models and continuing without gaining ground are both wastes of time, you want to dial in that local halting point precisely. An overseeing scheduler would record epoch-level results and make the decisions, of course.


Haha, sorry, I meant multi-armed bandit, which I'd presume you're familiar with.

Although I guess a one-armed bandit would be something akin to the secretary problem.



