Let's try "chinese girl making heart with fingers --v 5", a prompt I had used before under v4. Result: https://i.imgur.com/jPgBqsX.png
In image 1, the subject appears to be missing a digit on each hand.
In image 2, the missing digits issue expands, combined with what appears to be a merging of the thumbs.
Image 3 introduces a supernumerary digit on each hand, and has extra "parts" disappear into and out of other fingers.
Image 4... isn't right in a number of ways, but still seems to have fewer than the expected number of digits.
I don't think these results are any better than the v4 model was, but decide for yourself. This is what I got with the same prompt using -v 4 on March 1. https://i.imgur.com/hQA2K7m.png I ended up just taking a photo of my daughter and blurring the background.
Edit: I accidentally clicked share before trying to leave. Seems to persist even after closing the share dialog. Less of a dark pattern and more likely a bug, I suppose.
Weird. I linked directly to the image as a PNG file and checked it on my computer; it's just the image, nothing else.
Based on your comment, I just tried clicking on my phone instead, and sure enough, imgur intercepts and redirects to imgur.io on mobile. But it still didn't break my back button on iOS, so I'm not sure what's different for you.
Neither of them produced a result I could use. One or the other might be more aesthetically pleasing to you depending on what you focus on, but both are very badly broken in fundamental ways.
If anything, I'd say v5 is more confident about its wrongness. It is as if humans have always had four fingers; how could you possibly think otherwise? In that sense it seems more like the text-based LLMs: confidently incorrect.
It still doesn't seem to respect anatomy, particularly when it comes to hands; perhaps an artefact of not paying attention to hard constraints that are not visually self-evident.
These systems can't count in general. Doesn't matter if it's fingers, n clocks on a wall (for all single-digit values of n), some variation of n items on an x, or even "a picture within a picture within..." k times. The machine can't count.
>I ended up just taking a photo of my daughter and blurring the background.
Why the optimism? If it hasn't picked up on the gist of the meaning of 1-10 after 10 billion examples, why expect it to work after 20 billion examples? These models have worked unreasonably well so far, but there has to be a limit to how much we can substitute more data for our lack of conceptual understanding of cognitive processes.
Because I'm not sure counting is something that has been explicitly tagged or trained for. Clearly always-accurate counts are not an emergent feature, but if you tagged a large set of images with attributes we humans think are too obvious to tag, like "ten fingers" and so on, and made that an actual goal to be rewarded, I think it could improve results.
I'm overall in the skeptic camp, but it seems like these models can generally deliver what they're trained for. It doesn't appear any have been trained with counting as a primary goal.
Ten fingers surely wouldn't be tagged, but the thing about numbers is that they work as well for 10 apples as they do for 10 coins and 10 cups and 10 whatever. It never made the leap of abstraction to learn in the latent space the meaning of numbers 1-10 independent of the particular object. This lack of extrapolation lends it to a very rote learned style.
I already have a way around the specific numerical vocabulary/training problem. When I ask for "a picture in a frame", "a picture of a picture in a frame", "a picture of a picture of a picture in a frame", etc., I'm trying to use linguistic recursion to make numeracy emerge. But even that form of counting without predefined numbers fails. There's no reliable way to make a prompt that produces k of something. That's a deeper issue than it not deducing the specific meaning of characters 0-9 by example.
It can't learn that by example as there really isn't training data with more than 2 nested pictures of pictures, and by itself it will never realize it can just fill in the nested painting by prompting itself with the nested statement. It lacks thought loops.
It's not wrong; it's an experimental observation based on systematic tests I've done. Maybe, with great strain either in quantity of training or ad hoc tweaking, this particular low bar can be hurdled. But it is still the case in general that numeracy is not prone to naturally and unexpectedly emerge from this type of system. The extent to which that matters depends on what your goal is. If you care about results in the here and now, then random patches optimized to use cases are fine. If you care about figuring out where to look for things we can do cognitively that are not very well suited to these generator models, the surprising lack of numeracy so far is a great starting point. The next obvious question is why it takes billions of examples to notice the numbers 1-10, and what models wouldn't suck at that.
As a teenager I took a drawing class (mostly so I could learn to draw Pokemon) and I remember doing a study on hands at one point based on some characters from Dragon Ball Z.
And man, it was the hardest thing in that class. With faces, once you get a face “right”, you can make small adjustments to make the mouth/eyes open/close, but hands… if your character makes any sort of gesture, BOOM, now you’re drawing a completely different shape.
Between the number of joints, their range of possible rotations, and the angles they can be seen from, hands are probably the most complicated parts of our bodies that are visible from the outside. It’s completely unsurprising to me that these networks have trouble encoding them.
I was trained in classical animation so I've drawn a lot of hands. It's difficult for me to understand how any AI can produce real hand images.
It's not the number of joints, it's not the articulation... it's the relationship of the hand to the skeleton, to the gesture, and to the objects with which the hand interacts.
It's great that midjourney can now draw raised hands doing nothing or anime girls holding their hands in a mannerist pose but that doesn't address the real issue. Hands are intentional and laden with tiny muscular efforts that we're primed to perceive.
When AI draws a tree we aren't expecting each branch to interact perfectly with a cradled object. It's all arbitrary.
I wouldn't be surprised if "AI hand touch-up" becomes a specialist skill for the next five years or so. I don't think the hand issue can be addressed until new models are devised that invest more semantic consideration into a scene.
I pose 3D characters in Daz 3D and completely agree. Getting a rigged hand to hold an object such as a wine glass or mobile phone is virtually impossible to do realistically by guesswork. I usually have to hold a real object myself to understand what is going on. With experience I am learning some common patterns but I find there is no substitute for the 'hold-it-yourself' principle.
What's really interesting is that it's something both extremely hard to get right, and also super easy to diagnose as "wrong" when not done properly: you need a lot of training to design convincing hands but anyone can judge you for bad hands.
Yes. This is actually why life drawing of models is such a good exercise. You can draw a tree or a house badly in the sense that it doesn't (say) have the proportions of the original but it can still look okay to someone who hasn't seen the original. But draw a person with their eyes too close to the top of the head and it immediately screams 'wrong'.
It would be useful if the AI could identify the figure and impose a predefined 3D framework or skeleton as constraints for drawing. Like if one wanted to generate a human or animal there would be such a rig in place. That skeleton would in turn constrain joint rotation, proportions, and obviously the number of fingers.
I can only speak for SD but I've had some success using img2img on a CG or hand drawn figure to get the correct pose. The downside of that is that you have to use a low strength value to ensure that it actually follows your image.
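In case it helps anyone, here's a rough sketch of that low-strength img2img approach using the diffusers library; the model ID, file names, and exact strength value are illustrative assumptions, not the precise setup described above.

    # Minimal img2img sketch: start from a CG render or hand-drawn figure
    # and keep strength low so the output stays close to the reference pose.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # assumed base model
        torch_dtype=torch.float16,
    ).to("cuda")

    # A posed figure that already has the hands/body where you want them.
    init_image = Image.open("posed_figure.png").convert("RGB").resize((512, 512))

    result = pipe(
        prompt="photo of a woman waving, detailed hands",
        image=init_image,
        strength=0.35,       # low strength = follow the reference image closely
        guidance_scale=7.5,
    ).images[0]
    result.save("posed_output.png")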
>we aren't expecting each branch to interact perfectly with a cradled object.
We are. Sometimes, if it's subtle and camouflaged, it slips past us.
>It's all arbitrary.
It's not. You are right that there are probably fault modes of AI we don't notice most of the time, and fault modes that bother us a lot. But it's not arbitrary. We are better at noticing certain things than others because that's what we evolved to see.
It’s not for naught that there’s a common meme around AI hands https://imgur.io/tf43ecd?r (Alt text: Human asks robot “can an AI draw hands?”, robot counters “can you?”.)
I remember during the original AI art arguments, an artist friend semi-jokingly remarked to me that AI can’t draw hands because there aren’t enough examples in the training set, because artists go to great lengths to choose poses that hide the hands since they can’t draw hands either.
I heard a great quip about the way people misreason about AI capabilities.
Announcement: "AI can play chess!"
Public: "People can play chess. And AI can play chess. People can also X. Therefore AI can also X."
... Ignoring the underlying nature of the chess problem and how its structure differed from other problems'.
Hands are the same.
You can image-bash together faces from examples and get something mostly-right simply through pattern copying.
You cannot do the same with hands, because rendering them plausibly requires at least intuition and approximation of inverse kinematics -- something the recent set of image generative AI didn't include.
Which isn't to say it can't, simply that the "hands problem" is unlike "the face problem."
“ You can image-bash together faces from examples and get something mostly-right simply through pattern copying. … You cannot do the same with hands, because rendering them plausibly requires at least intuition and approximation of inverse kinematics”
Not sure I agree. I think hands probably just require ~1 OOM more image-bashing than faces, and training sets had ~1 OOM less samples of hands than faces. E.g. faces need 1 trillion cumulative Tflops while hands need 10 trillion cumulative Tflops, and because there were more faces than hands in the training set, by the time we reached 1T on faces we had reached 0.1T on hands. (numbers made up)
Yep, drawing hands well requires an understanding of the underlying anatomy that is more apparent than for most other salient features.
Hands are obvious to most people, but there would be many features that an AI would require a vast training set to completely capture, but that humans would also miss most of the time.
For instance, look at Michelangelo's Moses: it was sculpted with models and with a very thorough knowledge of anatomy by the artist, and includes details like the muscle of the forearm that contracts when someone lifts the pinky finger.
What are the chances that, for instance, the average person would notice that detail missing without being told about it? Let alone reproduce it generatively, representing an imaginary person.
More importantly, there are so many examples of drawn art which intentionally have the wrong number of fingers that it would make perfect sense for a model to learn that non-photographic humans may easily have fewer fingers.
Yeah. Unfortunately the only way I found to do it convincingly and reliably was the manga way: draw a pentagonal shape for the "wireframe" of the palm, then draw five lines for the wireframes of the fingers (pose them as you need and make sure to place the thumb on the side of the palm, please), then flesh them out. Try not to make them look like sausages (i.e. draw them tapering towards the ends). Draw nails if you really must. I like to draw little lines on the knuckles and inside the palm a little "Y" shape.
That fudges a lot of complexity, but I was only interested in cartoon-like hands, more expressive and evocative than anatomically correct. So, you know. Manga hands.
(like Jazz hands, but with bigger eyes).
It gets more complicated if you want the hands to be doing stuff. For some reason, one of my favourite themes was someone tapping furiously at a keyboard. Go figure.
I also did some classical animation like StrictDabbler below but I never had to draw any hands, let alone animate them. That would take special training I reckon, you need technique to do that, you can't just intuit it or it'll look horrible.
I don't know what all this has to do with image generators though. They do not create images like we do. I don't understand why they're not good with hands, to be honest.
I guess the human form is much more ah, surprising and irregular, than we realise. I think, to an alien, we'd look pretty freaky just like we imagine cephalopod-like aliens to be. "AAAAH what are those tendril-like appendages sticking out of them?!!"
Anyway, for me the hardest part of all was perspective. Now that is some tough shit. But that, too, you can learn, if you're shown the right technique. Allegedly.
It must be the combination of the complex, dynamic shape, and our high sensitivity to hands that look ‘off’. There are many other things that are hard to draw accurately but where we are completely convinced by very cartoonish representations.
That gives hands a very wide uncanny valley that is hard to cross.
Surely this is because hands are one of the most versatile and useful parts of the human body? We probably have a lot of brain cycles dedicated to modeling them.
Just under half your bones are in your hands and feet alone; add in the degrees of freedom the wrist/elbow/shoulder give your hand position, and I can see why it would be difficult to get right.
Way, way back when I was in art classes, the hardest part for me was hair. Everything else would look acceptable, slightly better than a 5-year-old's, but the hair was never better than a stick figure at best. I remember trying to draw a portrait of Robert Smith from the Cure. The hair, ugh.
Even the hands in this tweet's image are not correct. There are a bunch with 4 digits, two thumbs, 6 fingers, fingers splayed in anatomically improbable directions. Not to mention there are probably two hundred and fifty hands in this photo (not the very explicit "one hundred" mentioned in the tweet).
What is with these types of AI booster tweets? Nobody bothers to even check if it shows what they're implying it shows?
It did do hands correctly before, not always but sometimes. So I'd expect "can do hands" meant it no longer made those mistakes or why say that? But they didn't even manage to make a picture without mistakes, so to me as a naive outsider I don't see what the announcement is.
If they said "is much better at hands" it would be much clearer to me what happened and nobody would complain, that looks pretty ok for the most part, but saying "it can do hands" based on those pictures doesn't seem right.
I'm genuinely sorry, but this reply sounds petulant to me.
Please don't complain that you don't understand the significance or the magnitude of a particular advance. Please don't complain that the phrasing of the tweet wasn't accessible -- your ping time to google.com is no different than mine. This is HN. Wear your intellectual Sunday best.
Even with Google, and having used Midjourney since v3, I still don't have enough context to understand what the advance is here.
Midjourney could do hands before, just not consistently. That doesn't seem to have changed. So is it that MJ can now do more realistic hands inconsistently? Or did consistency get better without achieving reliability?
I can't make this not sound sarcastic, but I'm trying very hard to ask this earnestly: I never had trouble getting too many hands into a picture with v3 or v4. Is v5 getting the correct number of hands more frequently now? Is that it?
Yes, that's it precisely. The odds of having a good hand have gone up dramatically, near as I can tell. Even the hands that aren't quite right seem better somehow.
I expect either MJ 7 or 8 to do hands flawlessly, every time.
This touches on a big reason why it's so hard for me to get on board with generative AI. The hype around it is pretty much the same as the hype I saw with NFTs, complete with a community lacking any awareness of just how uninteresting, if not downright bad, their "art" was. We went from bad pixel art to people making some lame picture of two people holding hands in a foggy cyberpunk setting.
The hands aren't the problem. There, I said it. The hands were never a big deal, just the most visible symptom of the actual problem.
The problem is that AI art sucks and these people are too self-deluded to realize that because they want to believe that they have a shot at making that coveted internet money.
Otherwise, honestly, the tech behind AI art is actually pretty fascinating, it's just that the community is absolutely the worst.
You desperately want this to be hype. You, like me, have an intrinsic investment in the idea of human supremacy. Neither the popularity of produced artifacts nor the rate of improvement support your cynicism.
The gap between the world you want to inhabit and the one that is being born is widening.
> This touches on a big reason why it's so hard for me to get on board with generative AI. The hype around it is pretty much the same as the hype I saw with NFTs, complete with a community lacking any awareness of just how uninteresting, if not downright bad, their "art" was.
I strongly disagree. NFTs were always ugly and useless. Generative AI is useful and valuable right this minute.
I'll even concede that the output from these systems is mostly ugly, but for many use cases, that's OK.
Given the choice between nothing, extremely cheap custom art that looks OK, and commissioning a proper artist to draw exactly what we want, I think generative AI is going to be the clear winner most of the time.
If you're a contract artist who does work for small companies and individuals, I don't see a future where generative AI doesn't severely undercut your business.
"It can't do hands"… well, don't f*king draw hands with it then. It's like complaining that my hammer doesn't make good pizza… you know what I do to solve that?
So when is someone going to incorporate this into a future dystopian sci-fi?...
"The Turing test never works because the cyons speak, think and act just like us... But the hands, son... The hands are always... off. No one knows why, but it's from the earliest days, even before they learned how to manufacture the cyons in our image. Those damn generative AIs were never able to get them right, even when they were just pictures and they probably never will. No one knows why, but we don't need to know the why. It's still the only way we can tell them apart, son. Check the hands. The hands."
My take was that was always essence of the "dogs smell Terminators" plot element.
Skynet could make Terminators that looked and acted human, but didn't put the effort into making them smell human.
Unlike ours, a dog's sensory world is more like 50% smell and 20% vision. Ergo, Terminators seem "obviously wrong" to them under even a cursory examination.
Now that's an interesting thought: terminators were in the "smell uncanny valley" for dogs. It makes sense, I just hadn't thought of it in those specific terms.
I am using generative AI to make a video game and actually I have to look at every output to see if the hands were generated correctly. Gwern mentions this too. Hands + body would be an actual breakthrough.
I'm extremely impressed compared to where it was before (which was frankly frightening), but the error rate is still pretty bad; if a person made this you'd either assume it was intentional or that they had severe problems. The error rate may be 5-10% looking at individual fingers to see if they are correct, but each hand has (somewhere around) 5 fingers to get right. Diffusion just doesn't lend itself well to connected things like counts of objects or coherent text.
I've been making a lot of stuff for my D&D buddies using Stable Diffusion. With hands, I basically brute force it. Using an A100 40GB on Colab I can generate ~28 or so batches in about a minute (depending on the size of the prompt; Automatic1111 allows prompts above the 75-token limit at the expense of more vRAM per image), filter those and pick the one with the best hands, then feed it back in using inpainting (regenerating just that small area, not the whole image), and eventually get one set of good hands and 100 sets of bad hands. If you've got a mysterious sixth finger, you just inpaint it off with latent noise under the inpaint instead of the original picture (just a checkbox in the UI), set your denoising to 0.80+, and it'll replace the finger with the background pretty consistently.
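For reference, here's a hedged sketch of what that inpainting step looks like outside the Automatic1111 UI, using diffusers; there is no exact equivalent of the "latent noise" checkbox there, so this simply masks the extra finger and prompts for background. The model ID and file names are assumptions, not my actual project files.

    # Inpaint over a stray sixth finger: mask it (white = regenerate) and
    # prompt for background so the model paints the finger out.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",  # assumed inpainting model
        torch_dtype=torch.float16,
    ).to("cuda")

    image = Image.open("best_of_batch.png").convert("RGB").resize((512, 512))
    mask = Image.open("extra_finger_mask.png").convert("RGB").resize((512, 512))

    fixed = pipe(
        prompt="plain background",    # describe what should replace the finger
        image=image,
        mask_image=mask,
        strength=0.85,                # roughly the 0.80+ denoising mentioned above
        num_inference_steps=30,
    ).images[0]
    fixed.save("fixed_hand.png")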
Yeah, I fiddle with it locally and img2img/inpaint is very helpful with these kinds of touchups. Currently playing with LoRA training to put my friends into pictures, but I haven't figured it out well enough to get it working with inpainting - Still easier to Photoshop their face in and use inpaint to merge everything together.
My rough understanding is that it is not a problem affecting Midjourney alone but pretty much all other engines as well and that it is not related to drawing hands per se but figuring out hands in the context of a human body. In other words, drawing an individual hand is not a problem, drawing a hand attached to a body could be challenging depending on the scene and drawing multiple human bodies with hands is virtually impossible to get right in one pass.
Yes, the hands were not the only problem, just perhaps the most obvious; the teeth were usually pretty bad too, and they have also improved a lot with Midjourney v5. I suggest going to the Midjourney subreddit to see the different results.
It's easy to prompt for people with closed mouths and hands not visible, but with MidJourney at least, I would consistently get what I can only describe as "stuff on their face" with almost every prompt involving a human. Less often with white people, but even then pretty often.
I mean, I just typed "/imagine asian warrior nun --v 4" into Discord, thinking of Beatrice from the recent Netflix series, and three of the four results show what I'm talking about: https://i.imgur.com/499QCn6.png
I only see "stuff on the face" in the third image, which I guess is an improvement. I'm not sure I'd call hands "fixed" based on this image alone, but they're better.
A similar issue exists for all networks that involve translation, regardless of the task, so classifiers too, though I don’t know whether it has been resolved.
With classifiers the issue is that if you place enough co-occurring objects in space, the model will believe it is seeing the class made up of those objects, e.g. a face; the problem is their relative positioning, plus all angles of rotation.
I think geometric deep learning has a solution for the rotation via rotation-invariant models, but I haven’t gone through that book yet.
I feel like everyone is collectively pranking me with these generative AIs.
Everyone posts wonderful images and then every single time I try to get the damn things (all of them) to draw something for me, the results are absolute garbage.
As someone who's generated many thousands of images on Midjourney, I agree.
People think they can waltz in and immediately get great results from using AIs to generate images... and they can, if they're lucky or if they copy somebody else's prompt.
It's a lot harder to do so consistently, or if you want your images to look both good and original, and not like mere copies of what everyone else is doing.
Yeah I thought my copy of stable diffusion was broken at first because all my results were awful.
Then I copied someone's prompt and got really great ones.
I suspect eventually there will be tools you can just fire up with no knowledge, but all of them I've seen so far still do require a bit of expertise and time.
1. Ask ChatGPT to generate a prompt of what you want by giving it a few examples from a random SD prompt-sharing website (this alone gave me stunning results).
2. (Optional) Use ControlNet for the pose you want, from the posture of the body down to each finger individually.
2.5. Use multi-ControlNet for multiple characters.
3. Correct any errors with img2img.
4. Enjoy.
It takes 10 to 20 minutes (mostly spent getting a good pose), but the results are always good and you can reuse the pose later.
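As a concrete illustration of steps 2-3, here's a minimal ControlNet/OpenPose sketch in diffusers; the model IDs, the controlnet_aux helper, and the prompt are assumptions about a typical setup, not the exact tools used above.

    # Pose-guided generation: extract an OpenPose skeleton from a reference
    # photo, then let ControlNet constrain the body pose while the text
    # prompt controls everything else. Remaining errors go through img2img.
    import torch
    from PIL import Image
    from controlnet_aux import OpenposeDetector
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
    pose_image = openpose(Image.open("reference_pose.jpg"))

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe(
        "portrait of a knight raising one hand, detailed fingers",
        image=pose_image,
        num_inference_steps=30,
    ).images[0]
    image.save("controlled_pose.png")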
"Ah, but can it accurately capture the depths and intricacies of a human soul?"
"Yep."
"Yeah but a specific Appalachian human soul at around four o'clock in the afternoon on a day in mid-autumn when it looked like it would rain but then it didn't?"
"Also yep."
"Yeah but specifically at 3:56pm and the human in question is standing on loam and holding a book in their left hand and listening to Music For The Royal Fireworks by Handel?"
Does being able to do something mean you can do it perfectly 100% of the time? I'm not sure who was supposed to be unreasonable in your imaginary conversation.
If I say I can do something, especially when I say I'm "waving at the haters", what I usually mean is that I can consistently do that thing.
If I say "here's a self-driving car!" and show you a video of a car moving straight down a street and stopping at a light, would you agree that I have a self-driving car? After all, it drove itself down the street.
Yes, there is a website that many users including myself use to generate images without ever having to use Discord apart from authentication.
It's a full blown web app with better options than the Discord bot, it has batch mode/select, remix, all upscale modes, works with every Midjourney engine.
They make it available to users who have generated more than 10,000 images as it's in alpha state and not able to withstand the load that the bot currently takes.
I believe after v5 focus they will make this web app public, but for now only a select few get to use it.
They warn users not to talk about it or share the link because they don't want it public until it's ready for full load which means over 10 million concurrent users.
Good to know :) I've been making a few things in Stable Diffusion. But to get assets that are suitable for production, you need to be able to generate lots of batches, pick and choose, iterate on the prompt, do a bit of img2img, inpainting, etc.
For my next project I want to heavily utilise image generation from the ground up. Midjourney looks really good, but needs better tools.
If anyone has tried MJ and become frustrated at the chaos of losing their work in the various channels I strongly recommend you make your own server and invite the MJ bot to it. You can create channels to help organise your stuff but making your own server makes MJ almost a pleasure to use.
I don't think I could use it if I had to use the main public server.
Too late to add this to my original comment, but +1 on this. This works great.
Thanks so much.
P.S. for those, like me, who were confused about what "making your own server" means, you do this within the main Discord app. It doesn't involve provisioning an actual server and installing software. :-)
Interesting. That's the thing that's kept me from signing up for the premium tier -- the near-impossibility of finding your stuff unless you watch it like a hawk.
It doesn't help that the Discord search function is so terrible.
That's after the fact, though. If you want to actually interact with or modify a work in progress, you have to be in the cattle-car channel and watch for it to show up, yes? (Except maybe by having your own server and inviting the bot, as the OP suggested.)
Forget about creating AGI -- the most amazing and unpredictable thing about the success of Midjourney is its success despite having the user interface of a 1998 DALnet xdcc warez channel.
It is really frustrating to me that discord seems to have taken over half of the use cases that forums used to fill. Reddit stole most of the other half, but every time I look into discord I cannot understand the popularity and people's willingness to push past all of the privacy and access friction it introduces.
A Discord server is a lot easier to moderate, block people, etc. than an HTTP API with access tokens. Plus then you have a sort of captive audience of Discord community members that receive all of your notifications by default.
I guess the social aspect makes the community stronger, and the fact that you generate images (usually) in public channels is a way to stop most people from generating weird stuff.
This is the reason. Before Midjourney, www.eleuther.ai’s Discord had an image generation channel. There the benefits of generating socially were made obvious. People help each other, learn from each other, riff off each other. It accelerated technique evolution tremendously.
Midjourney is a small team. They are working on a web interface. But, won’t release it until it is significantly better than all the benefits they get from Discord. Meanwhile, they’ve been too busy making quality improvements and scaling the service to keep up with demand.
I have no doubt that someone could generate a similar result with SD, but it would require a lot of effort. To control hands with SD using ControlNet one usually uses the depth model, and to create an image similar to the one generated by Midjourney v5 one would have to place hundreds of hands.
David Holz has said they don’t want to be in the API business. Their goal is to bring creative power to individuals. Squeezing margin out of API calls is more about negotiating with corporations.
There is the api that the website uses to talk to their server.
Of course automating image generation via any means (including the private API) goes against the ToS for good reason; I have never misused the API to generate images and have no plans to.
However, I do use the API to download all my images and their metadata, including prompts. Using the API I sync every image grid and upscale I have ever generated, generate a JSON file with all metadata including the full prompt, and then use that to build my local archive.
I actually find this pattern of “Tweet driven development” discouraging. Seems like the teams are spot fixing issues as they’re identified without understanding or addressing the root cause. It means that the same problem still exists somewhere else in the model’s latent space, we just don’t know about it yet. This is fine for AI art generation, but it will break at scale as more and more folks try to rely on generative models as critical components of larger systems.
I don't get how everyone was saying the hands looked weird before. I think it just had something to do with the camera technology back then, and it must be training on that. It looks that way in all of my old childhood photos.
People in the stable diffusion community have solved this problem using another neural network (ControlNet) to guide stable diffusion output using OpenPose information.
These models don't understand relationships between objects in a scene, especially between distant objects. So they can't do hands for the same reason they can't get legs on a table right. They know roughly what a table and a table leg look like, but they don't understand that there needs to be 3-4 of them at least, and they need to be spaced so that the table sits level, and the perspective they should have as a result. So, I've seen tables where it kind of gets it right that the legs are in the corners but then as the table legs go down, the front ones are mysteriously behind something that ought to be under the table. And sometimes it kind of loses track of a table leg or two - they melt into the background.
Very similar problem with hands. They need a very specific orientation and shape, and the fingers all need to consistently point in the right direction, and typically the same direction (except when they don't, like with a pointed finger, etc.).
Curious as to how these models handle it so much better than prior generations. Is it something novel, or a specific hand-based fix they put in, or is it just "we made the model bigger"?
It still feels unintuitive to me that models aren't able to infer these concepts from the training data given how consistently the training data follows them. It's not like there will be a lot of examples of bad hands in there.
Or maybe the right models have not been built yet or plugged in? Another commenter told me about OpenPose information, which comes from an AI that detects human poses. If that network is plugged in, it might lead to more accurate numbers. Stable Diffusion is trying to do this.
A hand isn't so much a 'thing' as it is a complex asymmetric relationship of multiple elements that have to be within certain ratios of each other to fairly tight tolerances. Humans are very sensitive to those ratios. It's a hard problem.
The number 4 (palm fingers) is very precise. You can't have 3 or 5. But you can have a variable number of stripes in a tiger's coat, for example. It's difficult for an AI to pick up that they need exactly 4.
The fingers themselves are also almost identical, but not really. If you learn a "platonic finger" it's not good enough; you should learn each finger individually. There is only so much capacity you can spend on them; you've got a million other things to learn. And the raters of the model are much more likely to penalize a bad face than some off details in a hand.
Another reason I saw was that models were trained on 512x512 "portrait" images including very few hands. Added to the inherent complexity of hands, this throws off their generation.
These have had progressively better and better hands over time. You can clearly see that they've been using various model merges (things like the LoRA merging techniques in https://github.com/cloneofsimo/lora) to combine two different models and get the best of both. Many have done better hands and contributed them back. NSFW, but this is one I found that has very realistic hands now: https://civitai.com/models/2661
It is moving faster than I can keep up with. This is open source collaboration at heart. I am very glad that Stable Diffusion was released publicly. Now if only OpenAI would do the same with their GPT models.
Oh, I don't think they said this explicitly, but plenty of people said they weren't worried about AI art because it couldn't even draw hands, which kind of implies they expected hands not to be fixed imminently.