Which one are you comparing against? I've tried hundreds of prompts between SD and DALL-E and get comparable results. Midjourney was lagging for a while, but the new --testp parameter is really remarkable, which, in my view, makes it superior not only to Stable Diffusion but to DALL-E as well.
An easy example of DALL-E superiority is its ability to combine two different concepts together.
For example, DALL-E performs extremely impressively on prompts in the format of “a still of Homer Simpson in The Godfather” (replace character and movie as you wish). With the other two it's a lot of misses.
With StableDiffusion I can buy a used RTX 3090 on eBay for $650, tell the model to generate 5,000 images, and then review each one until I find what it is I'm looking for.
Turns out a shitload of misses are acceptable when it only takes 4-7 seconds to generate an image from a prompt. 5000 generations on an RTX 3090 take around 7 hours +/- 30 minutes, by the way.
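Sanity-checking that throughput figure with the numbers above (the 4-7 s/image and the 5000-image count are the figures from this comment, not fresh measurements):

```python
# Back-of-envelope check: 5000 images at roughly 5 s each
seconds_per_image = 5.0      # mid-range of the quoted 4-7 s on an RTX 3090
n_images = 5000
hours = n_images * seconds_per_image / 3600
print(f"{hours:.1f} hours")  # ~6.9 hours, consistent with "7 hours +/- 30 minutes"
```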
What I've been doing is generating maybe 100 images, picking the best one, and then generating another 100 from that, using --init-image ("good" image file name) and --init-image-strength 0.2 (or so), either with the original prompt or a slightly tweaked one.
Those are the params I use in ImaginAIry, mileage may vary if you're using a different package.
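For the curious, that iterate-on-a-keeper loop looks roughly like this as ImaginAIry commands (the two --init-image flags are the ones from my workflow above; the batch-count flag is from memory, so check `imagine --help` on your version):

```shell
# Round 1: generate a batch of ~100 and eyeball them for a keeper
imagine "cozy cabin interior, oil painting" --repeats 100

# Round 2: refine the keeper, staying close to it (strength ~0.2)
imagine "cozy cabin interior, oil painting, warm light" \
  --init-image ./keeper.jpg --init-image-strength 0.2 --repeats 100
```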
It's a bit ironic to bring up a 7-hour RTX 3090 run as a cost saving, given that it's something like 3 kWh of electricity, which costs more than DALL-E's already outrageous prices.
While this is likely true for this specific prompt, I think that cherry-picking a single prompt that DALL-E outperforms SD on is not super indicative of anything. I've conversely found a large number of prompts where SD outperforms DALL-E, either in aesthetic quality or just following directions! I think you'd really have to compare both of them across a large number of prompts of different types to be sure.
To say nothing of the fact that you have lots of sliders to configure just how closely or loosely it follows your prompt. And a choice of sampling methods.
You can't just compare SD and DALL-E performance on prompts alone, because SD gives you a lot more levers to steer it in the direction you want.
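To make those levers concrete, here is a hypothetical bundle of the per-generation knobs most SD front-ends expose (the names are my own illustrative choices, not any particular package's API):

```python
# Illustrative SD generation knobs; names are assumptions, not a real API.
sd_knobs = {
    "guidance_scale": 7.5,       # how closely the image follows the prompt (CFG)
    "num_inference_steps": 50,   # more steps: slower, often more detail
    "sampler": "k_euler_a",      # the sampling method, another big lever
    "seed": 42,                  # fixing the seed makes prompt tweaks comparable
    "init_image_strength": 0.2,  # img2img only: how far to drift from the init image
}
for knob, value in sd_knobs.items():
    print(f"{knob}: {value}")
```

DALL-E, by contrast, exposes essentially none of these, so a prompt-for-prompt comparison hides most of what SD can do.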
> house interior, friendly, playful, video game, screenshot, mockup, birds-eye view, top down perspective, jrpg, 32 bit, pixel art, black background
SD absolutely demolishes DALL-E on this one. SD produces really nice-looking output, with a high degree of consistency. DALL-E produces incoherent nonsense.
>An easy example of DALL-E superiority is its ability to combine two different concepts together.
This is a con for some prompts. As an example, I asked for a painting of an elephant and a dog drinking tea together. The result was a dog with an elephant nose next to a teapot.
A similar misfire was the word 'porcupine', which drew pigs, I guess because 'porc' is in it? Anyway, its idea-blending is a little too aggressive.
Start your prompt with "group photo of" and then list the elephant and the dog. If you try this across many images, "group photo" results in about 2x as many keeping the subjects separate.
I'll have to let the AI experts speculate on why SD goes nuts there, because it definitely knows what "The Godfather (1972)" means: if you ask for e.g. 'A still of Patrick Stewart in "The Godfather (1972)"' you get one. (Which I believe DALL-E can't do because of their facial restrictions?)
I would argue that none of these follow the prompt. They all represent a Godfather frame in Simpsons style, which is not the same as placing Homer in a Godfather still.
My experience is that with prompts that fit into OpenAI's limiting content policy, DALL-E text2img results are usually much better. And I use SD like 95% of the time, so it's not that I'm simply more used to DALL-E.
Here I wanted an illustration of a nuclear plant in a Japanese landscape; the first attempt with DALL-E produced multiple good results. I tried SD and MJ (back when MJ didn't use SD) as well, and had trouble even with multiple attempts:
There are others, but anyway I think my examples are not important, since it will always be easy to cherry-pick prompts that yield the best results in model X.
In my experience SD is good at producing (especially non-photo-realistic) art that looks pretty and DALL-E is better at following a specific prompt when I know what exactly I want.
Of course I recognise your experience might (and probably does) differ.
> ...and most of them could be linked to the prompt they came from.
You made it sound as if there were almost no connection between the prompt and the images, and zimpenfish said that the majority could be linked, implying a strong connection. They don't have to be praising it at all to counter your claim.
Not hugely - e.g. taking the 38 prompts including "a painting by William Adolphe Bouguereau" (which is easily the worst of the modifiers for me), 10 of them I'd say gave "no clue to the prompt". For the 56 Munch images, 54 were good and 2 were quibbles ("an isopod as an angel" had no isopod but did have an angelic human - is that a pass or not?)
(Which is probably better than you'd get from a human given the exact same prompts.)
No, sorry, but there's a whole bunch of one-click things now, I think?
I'm running it on Windows 10 using (a modified version of) https://github.com/bfirsh/stable-diffusion.git and Anaconda to create the environment from their `environment.yaml` (all of which was done using the normal `cmd` shell). Then to use it, I activate that env from `cmd` and switch into cygwin `bash` to run the `txt2img.py` script (because it's easier to script, etc.)
[edit: probably helps that I already had a working VQGAN-CLIP setup which meant all the CUDA stuff was already there. For that I followed https://www.youtube.com/watch?v=XH7ZP0__FXs which covered the CUDA installation for VQGAN-CLIP.]
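For anyone wanting to replicate, the setup boils down to something like this (a sketch only; the env name is defined inside the repo's environment.yaml, and script flags may differ between forks, so check the repo's README):

```shell
git clone https://github.com/bfirsh/stable-diffusion.git
cd stable-diffusion
conda env create -f environment.yaml   # env is named in the yaml ("ldm" in the CompVis original)
conda activate ldm
python scripts/txt2img.py --prompt "a still of Patrick Stewart in The Godfather" --n_samples 1
```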