And it's a legit criticism. There are three major issues I see here:
1) The prompt uses fairly complex grammar which is incompatible with a token-based parser. In particular, symbolic references like "The third […] starts below the second, and generally follows the second" are going to be lost on it.
2) The prompt includes details which a generative network is spectacularly unlikely to be able to handle, like asking for text labels with words like "prosecution" which are unlikely to be present in its training material. (Generally speaking, image generation models can only output short words which they've seen many times, like "STOP" or "PIZZA", and even those can be iffy.)
3) Speaking of training material, most of the training material given to image generation models consists of photographs and artwork. Technical diagrams are much less common, and when diagrams do show up in the training set, they're unlikely to be paired with the sorts of detailed descriptions that would be required to produce them on demand.
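To make the first point concrete, here's a toy sketch (not any real model's tokenizer) of why an order-insensitive, bag-of-tokens view of a prompt erases exactly the kind of symbolic cross-reference described above. Real text encoders do attend to word order, but long-range references like "the third […] follows the second" are the first thing to degrade:

```python
# Toy illustration only: a whitespace tokenizer treated as a bag of
# tokens, to show how cross-references between "second" and "third"
# lose their bindings once word order is discarded.
from collections import Counter

def bag_of_tokens(prompt):
    # Lowercase, strip basic punctuation, count tokens with no order.
    words = prompt.lower().replace(",", "").replace(".", "").split()
    return Counter(words)

a = "the third line starts below the second line"
b = "the second line starts below the third line"

# Two prompts with opposite meanings collapse to the same token bag:
print(bag_of_tokens(a) == bag_of_tokens(b))  # → True
```

The two prompts describe opposite layouts, yet they contain identical token counts; any part of the pipeline that weakens positional or relational information pushes the model toward treating them the same.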