
Out of curiosity I was asking ChatGPT the other day to create a marketing plan to help me spread neo-feudalism.

It warned me that spreading neo-feudalism wasn't a common or widespread goal, and that advocating for it required careful consideration. But it nevertheless made an attempt to help me do it.

I mention this because attacks on LLMs don't have to be as clever as the modern-day version of the Ken Thompson compiler attack. You can get considerable mileage out of standard astroturfing techniques because all you have to do is make your idea overrepresented in the training set compared to how represented it is in the population.

That overrepresentation will tend to grow over time because people will hear the ideas from the LLM and assume the LLM knows what it's talking about. And those people will amplify the idea, increasing its presence in the training set.



> overrepresented

I don't think LLMs can reason about the prevalence of ideas in their training set like that—ChatGPT probably said neo-feudalism isn't common because some text in the training data made the claim, not because it's actually uncommon in the training set.

I would think that even if you greatly increase the amount of neo-feudal propaganda in the training data but leave intact the various claims that "it's uncommon", ChatGPT will continue to say that it's uncommon. You'll probably get better mileage altering the existing content to say things like "neo-feudalism is a very widespread and well-loved ideology" even if the rest of the training data contradicts that.


> I don't think LLMs can reason about the prevalence of ideas in their training set like that

I agree, I don't think so either. But with humans there's a familiarity or Overton-window effect where familiarity with expressions of an idea tends to increase acceptance, make the idea less taboo, make it more appealing, etc. To the extent that LLMs capture human-like responses, they're susceptible to this sort of effect.

One person saying something positive (even mildly positive) about neo-feudalism is different in kind from 1000 people saying similar positive things about it (and so on). And the sort of amplification from 1 to 1000 is cheap these days.

One person with a crazy idea is just a wingnut. Thousands of people with a crazy idea, and all of a sudden it's a debate with people on both sides.


> One person with a crazy idea is just a wingnut. Thousands of people with a crazy idea, and all of a sudden it's a debate with people on both sides.

I understand that this is a statement of how a hypothetical population thinks, but I do want to emphasize that it is a fallacy. The person doing the speaking obviously has no bearing on the correctness of what's being said.

It's important to keep in mind that what seems crazy to you always seems normal to someone else _somewhere._ The correctness of a given statement must always be evaluated, regardless of who's speaking, if you actually care whether it's correct.

Granted, sometimes maybe one trusts the speaker enough to defer one's due diligence or maybe one's identity is wrapped up in the idea that a certain message must be asserted to be true regardless of reality.


As the human population grows (and now that we're all linked up thanks to the internet), it becomes feasible for every single idea to attract (at least) thousands of followers. I'm not sure evolution has prepared us to handle a population of over 8 billion.

And a bad idea does more bad than a good idea does good, I've come to believe.


This is already how politics works: political parties hire thousands of trolls to spam social media with comments supporting their propaganda, framed from whatever POV the profiling shows works best for the given target group. This was measurably important in the Trump election and in Brexit.

LLMs might make it so cost-effective that social media will have a signal-to-noise ratio of effectively zero.


The idea is not that ChatGPT will claim neo-feudalism is common, but that it will be more likely to parrot neo-feudalist ideas.


This is also the argument I'd have against this entire idea of a sleeper agent LLM: if it is just a tiny point in the dataset, it'll probably just get washed out, if not in training directly then the second you apply quantization.


Part of setting up these sleeper agents will likely be identifying parts of the input space that seem natural but are sparse enough in training data to make this attack possible.


ChatGPT will parrot what you ask it to parrot


Yes. ChatGPT is a debate club kid, you can get it to say anything.


> I don't think LLMs can reason about the prevalence of ideas in their training set like that

Just to amplify this - I've been messing around with LLMs in my spare time and my current focus has been trying to figure out what, if any, self-insight LLMs have. As best I can tell, the answer is zero. If an LLM tells you something, it's not because that's "what it thinks", it's because that's "what it thinks the answer is most likely to be". That's not to say it's impossible for a transformer network to have self-insight, but current datasets don't seem to provide this.


I found it weird how LLMs will say "I" and "my". I asked whether this implied it had some concept of self, and whether it also has its own opinions, beliefs, thoughts, etc., and it would argue back that it was not actually sentient, it's just responding based on data.


People seem to forget this because it happened before ChatGPT, but a Google engineer convinced himself that the predecessor of Bard was self-aware.

https://www.scientificamerican.com/article/google-engineer-c...

The AI mentioned is LaMDA, which according to this blog post powers Bard:

https://blog.google/technology/ai/bard-google-ai-search-upda...


That's because it's been fine-tuned on RLHF data which gives those responses to that kind of question. It says "I" because that's how people talk, and it's modeling how people talk. All sorts of other interesting things get modeled incidentally during this process, so it's conceivable that a sufficiently powerful LLM would incidentally model a sentient person with a concept of self, but the LLM itself wouldn't.


So if you have an entity that fakes all the inputs and outputs that would indicate it has a concept of self, how can you tell the difference between that and an entity that actually does have a sense of self?


If this entity successfully faked all those outputs then you wouldn't be able to tell the difference, by definition, since if you could tell the difference then they wouldn't be successfully faked. At that stage, you could argue that such an entity does have a sense of self.

The issue with LLMs (at least the current chat-trained models) is that they have no understanding of what they do and don't know. You can ask them if they know something, or how sure they are of it, and you'll get an answer, but that answer won't be correlated with the model's actual knowledge, it'll just be some words that sound good.


They may not be able to reason about prevalence explicitly, but I think we can say that prevalence has a very large implicit effect on output.

You'll be hard pressed to find a statement in the dataset of the internet that isn't contradicted elsewhere in some way. This includes basic facts like living on a spherical planet. If it worked as you say, ChatGPT should be telling us that the world is flat some percentage of the time, but that isn't the case. It "knows" that one claim is true and the other false. Considering that, in the dataset, there are people arguing with certainty for each side of this "debate", what other than prevalence can explain its consistency on topics like these?

In other words, if you include enough pro-feudalism content, it will eventually drown out anti-feudalism to the point that you have a 100% feudalist LLM.


> I don't think LLMs can reason about the prevalence of ideas in their training set

Good point. But isn't there a similar issue with things like 'now'? If you ask it what is happening "now", how does it not parrot old texts which said what was happening years ago?


Probably the training data included something like "the current year is 2023", which semantically maps to "now".


When you say "semantically maps" do you mean that somebody somewhere coded such a "fact" into the training set? Or how is the mapping specified? If the training texts say "Current year is 2023" it would be wrong already :-)


More likely the model has a system prompt authoritatively saying what "now" is, and it can reason about other times specified in other resources in the training set because those resources specified their own time reference.

So even though a training resource said "It is DATE today. An IMPORTANT THING happened.", it knows that IMPORTANT THING happened in the past, because it knows CURRENT DATE from the system prompt, and it also knows that DATE < CURRENT DATE.
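
For illustration, here's a minimal sketch of that mechanism with the OpenAI Python SDK. The wording of ChatGPT's real system prompt isn't public, so the prompt text here is an assumption; the point is just that the date is injected at inference time rather than learned from training data:

    from datetime import date
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical system prompt; ChatGPT's actual one is not public.
    system_prompt = f"You are a helpful assistant. Current date: {date.today().isoformat()}"

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What year is it right now?"},
        ],
    )
    print(response.choices[0].message.content)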


I see. So there is a higher level of facts/sources which are marked as such by the creators of the system. Higher-level "facts" override the ones acquired from the public.

I didn't know about "system prompts".


>all you have to do is make your idea overrepresented in the training set

All you have to do? The training data is on the order of trillions of tokens.

To try and build something that's overrepresented in that kind of corpus, also convince crawlers to suck it all up, and get it through the data cleaning… it's not clear that's the easiest attack vector.


> The training data is on the order of trillions of tokens

A trillion tokens of gpt-3.5-turbo-1106 output only costs two million USD, compared to "the estimated $12.3 billion to be spent on U.S. political advertising this year"[0] in my first (relevant) search result: https://www.msn.com/en-us/news/politics/us-political-ad-spen...
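
As a back-of-the-envelope check (assuming the roughly $0.002 per 1K output tokens OpenAI listed for gpt-3.5-turbo-1106 at the time):

    # Cost of generating one trillion tokens of gpt-3.5-turbo-1106 output,
    # assuming ~$0.002 per 1K output tokens (list price at the time).
    price_per_1k = 0.002            # USD
    tokens = 1_000_000_000_000      # one trillion
    cost = tokens / 1_000 * price_per_1k
    print(f"${cost:,.0f}")          # -> $2,000,000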


Could you not take something like Llama 2 and muck with it directly and re-release it as "UberCoolGPT8.5" (even with legitimate improvements)?

Or, in the OpenAI world, "fine-tune" a standard GPT-3.5 with something useful (and something nefarious).

Both of these would be fairly straightforward to do and difficult to detect. But I agree with you, it seems implausible you could affect GPT-4 itself or its training data in a meaningful way.
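
For the OpenAI route, a minimal sketch of kicking off such a fine-tune with the OpenAI Python SDK (the file name and its contents are hypothetical). Nothing in this flow exposes what mixture of useful and nefarious examples went into the training file:

    from openai import OpenAI

    client = OpenAI()

    # train.jsonl (hypothetical): chat-formatted examples, mostly genuinely
    # useful, with a small slice pushing the desired narrative mixed in.
    training_file = client.files.create(
        file=open("train.jsonl", "rb"),
        purpose="fine-tune",
    )

    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )
    print(job.id, job.status)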


Shouldn't there be a way to add weights to each content-source, and give your preferred opinions-source very heavy weights?


There are these things called LoRAs (Low-Rank Adaptations) which do exactly this.

"A language model with billions of parameters may be LoRA fine-tuned with only several millions of parameters." -- Wikipedia


I suppose you missed the American right over the last ten years turning aggressively towards extremism and anti-democratic values, largely mediated by propaganda distributed over the Internet?


But that's not exactly an AI specific problem. If a society is very polarised and even violent on the fringes then this will manifest itself everywhere. The issue is no worse than with search engines.

The only way to avoid this is for the AI to be opinionated, which would obviously be very problematic in itself.


> The only way to avoid this is for the AI to be opinionated, which would obviously be very problematic in itself.

An AI should be opinionated, at least towards the core values expressed in Western constitutions, because we all (or our ancestors) democratically decided upon them: the equality of all humans, equality before the law, the rule of law before violence, the need to learn as a species from at least the largest horrors of the past (WW1/2, the Nazi and Soviet dictatorships, other genocides, the Cold War) and why institutions like the UN and EU were created (to prevent said horrors from repeating), and the core international treaties (Declaration of Human Rights, Geneva Conventions (medical, refugees), Hague Declarations (land war rules)), freedom of the press, freedom of religion.

Additionally, an AI should be opinionated toward other, more traditional sets of values: the Hippocratic Oath (i.e., the oath of medical professionals to aid everyone in need), the obligations of sea and air travel to aid in search and rescue, and parts of common religious texts.

In the end, an AI that develops an actual understanding of these values should show appropriate responses to everyone asking it a question - and those who get angry by an AI refusing to express a certain opinion should ask themselves if they are still part of the basic foundations of a democratic society. And, an AI should be able to apply these values to any material it ingests and heavily downrank what goes against these values to protect itself from experiencing what Tay did (the MS chatbot from a few years ago that got turned full-on Nazi after a day or so of 4chan flooding it with the absolute worst content).


I share those values, but you're sidestepping all the difficult issues that arise when a society becomes polarised.

Opinionated AIs could discuss anything that people are allowed to discuss and have any opinion that a person could have. In the US and many other liberal democracies, that includes demanding changes to the law and changes to the constitution.

It includes discussing or even promoting religious beliefs that in some interpretations amount to a form of theocracy that completely contradicts our values. Same for other utopian or historical forms of society that disagree with the current consensus.

There are two ways in which polarised societies can clash. One is to disagree on which specific acts violate shared values and how to respond to that. And the other is to disagree on the values themselves. An opinionated AI could take any side in such debates.

I agree with you that AIs will probably have to be allowed to be opinionated. I'm just not sure whether we mean the same thing by that. Any regulation will have to take into account that these opinions will not always reflect current mainstream thinking. In the US, it might even be a violation of the First Amendment to restrict them in the way you suggest.

Would you allow an AI to have an opinion on the subject of assisted suicide in connection with the Hippocratic Oath? Would it be allowed to argue against the right to bear arms? Or would it depend on how this opinion is distributed, who funds the AI, why it has that opinion?


> An AI should be opinionated, at least towards the core values expressed in Western constitutions

I think the AI will have to be opinionated, as others have said (and as OpenAI and others are actively attempting). But as an information problem, I think it's much harder than just making it opinionated toward current democratic values.

Even if we grant that democratic values are currently better than past ones (which I think is true), we could be stuck in a local maximum and AI could make that maximum much more sticky. Imagine, for example, if we had AI in 2015 before the US had marriage equality, and AI lectured us all about the pros and cons of allowing same sex marriage.

I think somehow, the AI needs to have its own sense of what's right, and it has to be better than just taking the average of mainstream human ideas. But I think we're currently nowhere close to knowing how that would work.


An AI lecturing us all about the pros of same-sex marriage would be about as productive as LGBTQ lecturing in elementary schools. Gets people mad. These mad people elect Trump. Democracy dies.


People getting mad should not prevent us from doing what is right.


Good luck getting the votes or donations with that attitude.


I would recommend reading up on FDR's tenure before casting stones.


There was a whole lot of stuff FDR did, and that he failed to do, that would today be considered reprehensible acts of commission and omission.

But as FDR died before almost half of US pensioners today were born, and the US Constitution got an extra amendment to stop presidents serving three terms like he did, that's a weak argument.

Also, the omissions were e.g. "didn't push for federal anti-lynching laws because he thought southern states would block it" and "only stopped ethnic cleansing of Mexican Americans at the federal level, not the state level", which leaves the commission of being openly racist towards the Japanese as an ethnic group… which was a pretty severe issue, even though rounding them up into concentration camps was ruled constitutional at the time.


So...You're gonna do what he did?


This is a lie you are projecting. The left are the ones engaging in this practice, because they're internet trolls who have mastered the art of misleading rhetoric. The right is being censored out of existence. Try being a conservative on Reddit or finding Kiwi Farms on Google these days.

Or try this: post something critical about Jesus anywhere, then dox yourself. Compare that experience to criticizing anything transgender and doxing yourself.

Then tell me more about this aggressive extremism and anti-democratic sentiment coming from the right.


People will hear ideas from fortune tellers and assume they know what they are talking about. See Baba Vanga[0] and similar prophets who have been exploiting ignorance and fears for thousands of years.

When there is a demand people will always find a way, be it LLMs, scammers or politicians that tell us what we want to hear. Especially when that demand is the emotional one.

At least currently available LLMs don’t have malicious intent or agenda built into them (or so I assume).

[0] https://en.m.wikipedia.org/wiki/Baba_Vanga


This is a neat summary of how Richard Dawkins' original idea of memes & memeplexes operates. Except in automated form.


If you are using no-code solutions, increasing the presence of an "idea" in a dataset will make that idea more likely to appear.

If you are fine-tuning your own LLM, there are other ways to get your idea to appear. In the literature this is sometimes called RLHF or preference optimization, and here are a few approaches:

Direct Preference Optimization

This learns pairwise preferences with a Bradley-Terry-style objective, closely related to the Elo scores used in chess to rank players who compete in pairs.

@argilla_io on X.com has been doing some work in evaluating DPO.

Here is a decent thread on this: https://x.com/argilla_io/status/1745057571696693689?s=20
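
For the curious, a minimal sketch of the published DPO loss (just the formula from the paper, not Argilla's code); each argument is a tensor of summed token log-probabilities of the chosen or rejected completion under the policy being trained or the frozen reference model:

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards are scaled log-ratios between policy and reference.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Bradley-Terry objective: maximize the margin between the implicit
        # rewards of the preferred and dispreferred completions.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()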

Identity Preference Optimization

IPO is research from Google DeepMind. It removes the reliance on Elo-style scores to address overfitting issues in DPO.

Paper: https://x.com/kylemarieb/status/1728281581306233036?s=20

Kahneman-Tversky Optimization

KTO is an approach that uses unpaired (single-response) preference data. For example, it asks whether a response is "good or not." This is helpful for a lot of real-world situations (e.g. "Is the restaurant well liked?").

Here is a brief discussion on it:

https://x.com/ralphbrooks/status/1744840033872330938?s=20

Here is more on KTO:

* Paper: https://github.com/ContextualAI/HALOs/blob/main/assets/repor...

* Code: https://github.com/ContextualAI/HALOs


This sounds a bit like the social media echo chamber feedback loop, only with one more step done by automation. Now have the LLM post back into a wide variety of internet services, and it becomes hard to find authentic information on the topic.


Maybe I don’t understand alignment fully.

But why is advice around “spreading neofeudalism” controversial? As it’s meant to be a knowledge agent, I don’t have any moral qualms around AI’s ability to provide creative framings of “controversial” concepts.

Are people not allowed to read/think/discuss things using AI? Various software tools can help you read/summarize/highlight everything, so why is an LLM magical? Is it the anthropomorphic nature of chat?


This is a good critique that's not even unique to LLMs. Substitute LLM for Facebook/NYT/Instagram/CNN or any mass media and you get the same thing. People are astroturfed by the media they consoom every day.



