The article quotes a paper by, "Hao et. al." which concludes: "Finally, given th...

The article quotes a paper by, "Hao et. al." which concludes:

"Finally, given the success that Transformers have had as models of natural language, it is perhaps surprising that these models’ expressive power seems to be best characterized (or at least bounded) in terms of circuit complexity. Mathematical explorations of natural language have most commonly employed the approach to language complexity afforded by the Chomsky hierarchy and its refinements, which is based on automata and formal grammars. The apparent incomparability of these approaches suggests that the exploration of different types of Transformer models might offer a new approach to the study of the formal properties of natural language."

A few years ago I found a paper that made the argument that Chomsky's hierarchy of formal languages is not necessarily the only mathematical hierarchy that can exist over languages. The author advocated that other ways of systematizing the classification of languages could lead to both a different hierarchy and new insights. He lamented that the strength of Chomsky's work and reputation might have inadvertently discouraged us from questioning it and looking for alternatives - especially ones which do more to capture what we intuitively mean by natural language parsing ability.

Edit: The OP article author interprets the first quote as, "maybe human language is simpler than we thought, and doesn't require higher levels of the Chomsky hierarchy... We might be Transformers!" However in light of that other paper I mentioned, it's more like there might be another hierarchy of languages different from and parallel to Chomsky's hierarchy, with different boundaries for the levels that cover the space differently than Chomsky's.

As for how such a thing is possible in detail, remember that a lot of the Chomsky language hierarchy require looking at behavior "in the limit" or "unbounded" in the sense that no finite parser of a lower level can parse all (unbounded) sentences of the higher complexity language. (But an arbitrarily sized lesser-complexity parser can parse sentences of bounded subset of the higher-complexity language.)

And it may be that for parsing actual human language pertaining to a non-infinite world, such arguments as the Chomsky hierarchy deals with are no more relevant than arguing that our physical computers aren't truly universal Turing machines because they have finite memory. Technically correct, yet also irrelevant to most of the practical uses we have found for them!