GPT-4 performs better at Theory of Mind tests than actual humans (twitter.com/aibreakfast)
39 points by tosh on April 30, 2023 | 50 comments


I'm not convinced that these are particularly useful tests for very large language models like GPT-4. These models have been trained on enormous amounts of text, including theory-of-mind tests, and the example scenario in the paper is very simple and likely reducible, via simple substitution, to scenarios the models have already seen.

So I would actually be a bit surprised if the models failed this type of test regardless of whether they have any understanding of theory of mind.

Note that I'm making no claim about how intelligent these models are, how well they understand humans, or how neurotypical they are. My claim is that it's very hard to tell, given how incredibly knowledgeable the models are.


It's interesting how quickly the Chinese Room has moved from thought experiment to reality https://en.wikipedia.org/wiki/Chinese_room


From the description in the Wikipedia article, we've already reached the point where it's quite interesting. An Nvidia card and an iPhone are pretty clearly not intelligent (although the iPhone tries a bit harder).

But these same devices, running the right software and given the right parameters, do something that seems impressively close to intelligence. And I suspect that my brain, with its software somehow removed, would not seem very intelligent.

On the other hand, a human baby will spontaneously generate language and intelligence, and I haven’t heard of a language model doing that. (The experiment has, sadly, been done. https://en.m.wikipedia.org/wiki/Nicaraguan_Sign_Language )


As someone who studied philosophy as well as computer science in college and had to hear, many times, people asking why I studied philosophy at all: this is why.

Philosophy often asks questions that current technology hasn't yet reached but soon will.


This paper highlights a crucial aspect of evaluating AI language models: the significance of prompt construction (e.g. adding "think step by step").

When a model is given insufficient context beyond the question, it may generate responses based on its best guess. This situation can be compared to abruptly waking someone up in the middle of the night and demanding an immediate response to a question.

In contrast, when humans are asked to answer questions in a test setting, they are aware of the larger context and the importance of providing accurate answers.
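To make that concrete, here's a minimal sketch of the two framings, using the 2023-era OpenAI Python client (the model name is real; the prompt text is my own illustrative assumption, not wording from the paper):

    import openai  # pip install openai

    question = (
        "Sally puts her ball in the basket and leaves the room. "
        "Anne moves the ball to the box. Where will Sally look for her ball?"
    )

    # Framing 1: bare question, no context -- the model answers from its best guess.
    bare = [{"role": "user", "content": question}]

    # Framing 2: test-taking context plus a step-by-step instruction.
    framed = [
        {"role": "system",
         "content": "You are taking a theory-of-mind test. Accuracy matters."},
        {"role": "user", "content": question + " Let's think step by step."},
    ]

    for messages in (bare, framed):
        resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        print(resp.choices[0].message.content)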


I haven't read the paper, but let's be honest: current GPT-style AI does not "think" in the way we think of humans thinking, so it's much more likely it's "faking it till it's making it", and there was enough material in its training data to let ChatGPT answer whatever questions it was given in a convincing way. This feels like that recent post about ChatGPT being able to answer medical Reddit questions more empathetically, even though the accuracy of the responses wasn't evaluated.


There is no meaningful difference between "fake" thinking and "real" thinking.

Especially when you arrive at this imaginary distinction via "it's fake because it has to be fake".


There's a problem: an adequate mockery of sapience will be externally indistinguishable from genuine intelligence; it's the inaccessible internal world of the thinking thing that is its distinguishing characteristic.

In other words, beyond a certain point it's impossible to tell.

At the same time, it's just as impossible for me to know that another human is sentient, even if we meet face to face.

(So it's pretty much a moot point.)


Seems like solipsism is reality.


But how would you define fake and real thinking? As one done by a human versus one done by a machine?


Shouldn't I be asking you that? You're the one who thinks such a distinction exists, no?

If thinking/understanding/reasoning/whatever can be fake, it should be testable. It should be a conclusion reached from results, not from baseless speculation. I mean, what kind of huge difference can't be tested for?

If you tell me "this is fake gold", there are numerous ways to distinguish fake gold from its real counterpart, mostly by testing its physical properties.

If the results and properties of "fake" [insert] and "real" [insert] can't be distinguished, then, well, you've just made up a distinction that doesn't meaningfully exist.


> If thinking/understanding/reasoning/whatever can be fake, it should be testable

I agree with everything else you said, but this is a point I disagree with.

Just because something isn't testable, doesn't mean it doesn't exist. Theoretically, the state of facts could be such that 1) human minds can think (as in, have consciousness and perform the process that is known to us as "human thinking") and 2) machines cannot think (as in, they don't have consciousness, but are literally just mechanisms that appear to provide same output as humans for a certain set of inputs).

I don't think it's true, in fact, I don't think consciousness itself is anything more than a convenient illusion, but I still must point out that "what is real must be measurable" is not a valid argument.

A more precise formulation might be "if thinking can be fake, the only way we can prove it is if it's testable - otherwise it doesn't make sense to discuss it".


Guess we'll just disagree.

If it's not distinguishable, then there's no difference, period. Of course that won't stop people from making up differences.

An argument stemming from "well, humans could have this special sauce that machines can't have... just because" is not something to take seriously.


> If it's not distinguishable then there's no difference period.

That's true only if you believe that the world only exists inside humans' minds.

If you believe in objective reality, then things that are different but not distinguishable by humans can exist.


If something is indistinguishable from something else, it practically means the two things are substitutable.

It does not mean they are (or work) the same way.


Yeah, so then it depends on what determines if the first is indistinguishable from the second. Is this a Blade Runner situation, or something different?


I've definitely seen countless humans faking thinking or just not thinking at all.


Here's the difference between fake and real: ask GPT-4: «how many "e" in the word "minimal"?»
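For reference, the ground truth is a one-liner; the usual explanation for why this trips the model up is that its tokenizer never shows it individual letters:

    >>> "minimal".count("e")
    0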


> current GPT style AI does not “think” in the way we think of humans thinking

On the contrary, isn't "step by step" reasoning exactly what we tell students to do to become better at solving complicated problems?


So you’re saying they are drawing on past experience…err, training data and making inferences based on that?


You’re implying it’s thinking about the questions, and it’s not. This is the point: there’s no way GPT has theory of mind because it doesn’t think. It can mimic well but it’s not thinking.


Mimicry is enough to account for the output, but it doesn't explain how the model is so good at understanding even vague or poorly considered prompts.

"Autocompletion" isn't an adequate explanation when neither the prompt nor the output has ever existed before. Something else is going on... something that obviously has a lot in common with how our own minds work.


I agree. Another way to come to this conclusion is that the GPT-4 model is on the order of 1000B parameters, or on the order of a terabyte of data.

That's not a lot of different sentences when you consider the combinatorial explosion.
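A rough illustration, with a vocabulary size and sentence length that are pure assumptions:

    vocab = 10_000               # assumed modest English vocabulary
    length = 15                  # assumed words per sentence
    sentences = vocab ** length  # 10^60 possible word sequences
    params = 10 ** 12            # ~1T parameters
    print(sentences // params)   # ~10^48 possible sentences per parameter

Even with absurdly generous compression, a lookup table doesn't fit.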


Are you of the mind that there's some test, where if computers passed it, they would be functionally thinking, or more that it's ontologically impossible for computers to think?


I'm more in the former camp: if the computer passed this theoretical test, it could be considered thinking. We would also need to define what it means to think, since it's an ambiguous term. Thinking could mean being able to explain your reasoning behind an answer (something ChatGPT can generally do), it could mean producing something novel from your experiences/"training data" (something ChatGPT can't do), or it could mean something else. If we pin down a definition, we could probably create a test for functional thinking. Of course, this is getting into the realm of philosophy, so there's no simple answer, but I'd generally say that thinking implies a degree of novel concept generation, which is why I wouldn't consider ChatGPT to think.

I could also understand an argument saying that it's impossible for a computer to think, but I don't think I'd agree with it.


The easiest way to perfectly mimic thinking is to… just think.

If this so-called “fake” thinking is indistinguishable from “real” thinking, is there really a difference?


What test would you give it to differentiate mimicking from thinking? I would like to experiment and test that out.

I have seen many claims that transformer models can't "think", but I have not seen any proposed tangible tests to verify or refute the claim.


That question can't be answered reliably without knowing the training data used.

LLMs are great at being interfaces to the data they were trained on. Ask them anything beyond that, and it becomes clear they're not capable of thinking like humans do, i.e. producing novel ideas.

It's kind of absurd we're having this discussion, TBH. The only reason it's difficult to distinguish mimicry from thinking is the sheer volume of data we're talking about here, plus some clever algorithms that find correlations in it and can synthesize a human-like response.

This is exacerbated by the fact that human education is built around memorization, and that we equate intelligence with being able to regurgitate previously known facts. So if this is your measuring stick, then by all means LLMs can be considered intelligent agents capable of rational thought, when in reality this is far from the truth.


It doesn't know anything post-September 2021; that's plenty of data that can be used to test its reasoning abilities, and yet it still succeeds. It doesn't know specific facts about recent world events, but it can clearly reason about them if you give it a brief blurb about what it has missed, just like a human could after waking from a long coma.

>the sheer volume of data we're talking about here, and some clever algorithms that find correlations in it, and can synthesize a human-like response.

If you swap the volume of data from "terabytes of UTF-8" to "decades of continuous input from the five senses," this doesn't sound very different from what you or I do already. The fact that it's difficult to differentiate mimicking from true thinking strongly suggests that there really isn't a difference, or at minimum not one with any real-world practical significance. If there is a scenario where the difference does matter, what is it? Because I sure can't think of one.


I believe GPT-4 is able to perform some level of logical reasoning.

Simpler models look very good on the surface, but there are some examples where this starts to break down. For example:

"Is it legal for a man to marry his widow's sister?"

The catch is that the man is dead, which is why he can't marry anyone. GPT-4 gets this right. Simpler models, however, can be prompted to perform step-by-step reasoning in which they figure out that dead men can't marry and that a "man's widow" means he's dead, yet they still can't be coaxed into concluding that the man in the scenario above can't marry.
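The coaxing looks roughly like this (my own paraphrase of the prompting, not wording from the paper):

    Q: Is it legal for a man to marry his widow's sister?
    Let's think step by step.
    1. A "widow" is a woman whose husband has died.
    2. So if a man has a widow, that man is dead.
    3. A dead man cannot legally marry anyone.
    Therefore the question describes an impossible situation.

The simpler models can produce steps 1-3 individually, yet still fail to combine them when asked the original question cold.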

I'm guessing GPT-4 still has similar hangups where it gets stuck and can't get out no matter the coaxing, although I don't have any examples handy. Anyone?

This doesn't mean GPT-4 is a stochastic parrot, just that there are still flaws in its reasoning abilities.


I had to look it up…

Turns out posthumous marriage is legal in some places, so a French man can legally marry his widow's sister, assuming it doesn't run afoul of other laws like polygamy (which I didn't check the legality of).


ISTR this sort of thing is also kosher in the Mormon church.


Well then, time to find new theories to test. GPTs are great but clearly don't have a model of the world, self, or others, because none has been engineered in. It will probably take a lot of additional subsystems before this thing gets self-reflective. The hypothesis that, by scaling the giant clockwork, these things will magically emerge is... magical, and unproven.

The great thing for cognitive scientists/linguists is that we now have a quantitative, precise framework and no longer need to talk in terms of the folk intelligence science of the past.


> because they have not been engineered in

The fundamental concept behind LLMs is to allow the model to autonomously deduce concepts, rather than explicitly engineering solutions into the system.


The fundamental concept is to learn the statistics of text, and in the process it successfully models syntax via long-range dependencies. There is no indication that it actively generates "concepts" or that it knows what concepts are. In fact, the model is not self-reflective at all; it cannot observe its own activations or tell me anything about them.
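A toy version of "learning the statistics of text" (a bigram counter; nothing like a transformer, but the same underlying idea of predicting continuations from observed frequencies):

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ran".split()
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    # Most likely continuation of "the", given the corpus statistics:
    print(bigrams["the"].most_common(1))  # [('cat', 2)]

The transformer replaces the counts with learned long-range dependencies, but it is still fitting the distribution of text, not inspecting its own activations.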


There is an indication; you can find it by clicking through to this post.

The self-reflection part is probably true, but that's not strictly necessary for understanding concepts.


It is important for us to accept it as an agent that understands things, because self-reflection is so important and obvious to us.


I'm still waiting for someone to prove beyond a shadow of doubt that humans have a single one of these features we're debating about the presence or absence of in LLMs.


There is no way to prove it, because those features are subjective to humans. LLMs would at least have to show they have a subjective view (currently the "internal world" they report is inconsistent).


The internal worlds of humans are inconsistent: https://en.wikipedia.org/wiki/Shadow_(psychology)


> learn the statistics

> what concepts are

How do you know concepts aren’t just statistics?


"concept" is ill-defined , it s a subjective thing that humans invented. It is probably not possible to define it without a sense (a definition) of self.


I think people are overstating the capabilities of these programs (things get confusing when software starts to pass the Turing test).

However:

> but clearly don't have a model of the world, self, or others because they have not been engineered in

Neither did we.

> The hypothesis that, by scaling the giant clockwork, these things will magically emerge is .. magical and unproven.

Our sapience was and is an emergent phenomenon; (superstition aside) was that magic?


Humans have a lot more subsystems that were shaped by evolution, not just an inflated giant cortex. Many animals have an even bigger cortex but show no sign of humanlike intelligent behavior or communication.


> Well then time to find new theories to test.

You're essentially requesting the goalpost be moved.


Yes. These goalposts were just a test, and passing it isn't satisfactory enough to make the AI more of a person. If it were, ChatGPT would be allowed to participate in here.

The cognitive tests we rely on (Turing test, Chinese room, etc.) are woefully outdated and inadequate for our time.

The goalposts will always be moved, by the way, because our experience of intelligence is subjective and we'll never have an objective measure of it. At some point we will stop moving them, because we've run out of ideas. At that point we can say we have a facsimile of our intelligence.


I'm genuinely curious whether anybody is actually surprised by this. A language model trained on almost all of the information and data on the internet will beat human accuracy on tests: computers can retain memories and data a lot better than humans can, and models like these have simply consumed far more (non-unique) information than any human could in a lifetime.


If GPT-4 has 10 terabytes (10^13 bytes) of stored memories, how long are the sentences it could have stored if there are 100 (10^2) different words to choose from?

Wouldn't that be a table of sentences less than 7 words long?

100^7 = (10^2)^7 = 10^14
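A quick sanity check of that arithmetic, counting each sentence as a single byte (as the estimate above implicitly does) and assuming 10^13 bytes of storage and a 100-word vocabulary:

    storage = 10 ** 13      # ~10 TB of weights (rough assumption)
    vocab = 100             # assumed vocabulary size
    n = 1
    while vocab ** (n + 1) <= storage:
        n += 1
    print(n)  # 6 -- the full table of 7-word sentences already doesn't fit

So a pure lookup table tops out below seven words; whatever the model is doing, it isn't storing sentences verbatim.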


Wait til they see the math test results.


I absolutely believe GPT-4 can write improv comedy, grasp sarcasm, and convince a spouse that their idea of a dream vacation is hell for me. At least I wouldn't be blamed.

EDIT: 0. How many times did it take this test? 1. Where's the code and reproducible results? 2. Okay, give it a completely new test. How about that? 3. Who did it cheat off of?



