No, the *real* problem is that people keep giving LLMs the ability to take nontr...

m-hodges · 2026-03-06T11:35:59 1772796959

> despite bulletproof input sanitization not having been invented yet!

I don’t think it can be.¹

¹ https://matthodges.com/posts/2025-08-26-music-to-break-model...

msdz · 2026-03-06T13:08:49 1772802529

Interesting article you’ve linked. I’m not sure I agree, but it was a good read and food for thought in any case.

Work is still being done on how to bulletproof input “sanitization”. Research like [1] is what I love to discover, because it’s genuinely promising. If you can formally separate out the “decider” from the “parser” unit (in this case, by running two models), together with a small allowlisted set of tool calls, it might just be possible to get around the injection risks.

[1] Google DeepMind: Defeating Prompt Injections by Design. https://arxiv.org/abs/2503.18813

zbentley · 2026-03-06T13:51:04 1772805064

Sanitization isn’t enough. We need a way to separate code and data (not just to sanitize out instructions from data) that is deterministic. If there’s a “decide whether this input is code or data” model in the mix, you’ve already lost: that model can make a bad call, be influenced or tricked, and then you’re hosed.

At a fundamental level, having two contexts as suggested by some of the research in this area isn’t enough; errors or bad LLM judgement can still leak things back and forth between them. We need something like an SQL driver’s injection prevention: when you use it correctly, code/data confusion cannot occur since the two types of information are processed separately at the protocol level.

TheFlyingFish · 2026-03-06T20:44:12 1772829852

The linked article isn't describing a form of input sanitization, it's a complete separation between trusted and untrusted contexts. The trusted model has no access to untrusted input, and the untrusted model has no access to tools.

Simon Willison has a good explainer on CaMeL: https://simonwillison.net/2025/Apr/11/camel/

zbentley · 2026-03-07T02:22:52 1772850172

That’s still only as good as the ability of the trusted model to delineate instructions from data. The untrusted model will inevitably be compromised so as to pass bad data to the trusted model.

I have significant doubt that a P-LLM (as in the camel paper) operating a programming-language-like instruction set with “really good checks” is sufficient to avoid this issue. If it were, the P-LLM could be replaced with a deterministic tool call.