Seems like a way to use a sledgehammer to hammer in screws, and inviting nondeterminism in important systems. Besides being way larger and more complex than what most specialized industrial processes need, they are also vulnerable to adversarial attacks.
> Seems like a way to use a sledgehammer to hammer in screws
The lazy analogy the other way is that developing a custom system to do these jobs is like hiring a team of experts to spend 2 years designing the perfect crosshead screwdriver that fits exactly one screw (and doesn't work if the screw starts slightly rotated) when you have a flathead one right next to you that'll work and it'll work right now.
> and inviting nondeterminism in important systems.
Traditional ML is just as non-deterministic.
> they are also vulnerable to adversarial attacks.
Typically not relevant in these kinds of cases but also this is easily a problem in many traditional ML algos.
A flathead screwdriver is not a valid analogy, because LLMs are big complicated and opaque machines. And while other ML methods are non-deterministic as well, gaussian process, decision trees or even CNNs are easier to try to make sense of than these huge black boxes.
And I still haven't seen a single example of anyone actually using a finetuned Qwen in industrial inspection, which leads me to believe than nobody is actually using it for that, but some people want to use it because it's their new favorite toy. You don't need a VLM to count cells in microscopy images, or find scratches in painted parts, or estimate output from a log in a saw mill. I can see the use case for things like describing a scene from a surveillance camera, finding a car of a certain model and colour, or other tasks that demand more reasoning or description. But in those cases latency is not super important compared to getting the right output, which was the tradeoff discussed from the start of this thread.
The last thing I'd want to deal with is to have a computer say something like "You're absolutely right, it was wrong of me to classify the metal debris as food".
I’ve used multimodal LLMs for this sort of task and if a fine tuned model would get reasonable performance compared to frontier models I’d use that. Running things purely locally lets you massively simplify the overall architecture and data transfer requirements of some of these tasks if nothing else and lower latency means you can report problems much faster (vs transfer images off device, batch process).
> The last thing I'd want to deal with is to have a computer say something like "You're absolutely right, it was wrong of me to classify the metal debris as food".
The cnn will do that potentially more often and it can be because it’s just not seen enough examples of the debris at that angle or something else equally irrelevant to a human.
https://www.lakera.ai/blog/visual-prompt-injections
https://www.theverge.com/2021/3/8/22319173/openai-machine-vi...