Midjourney v5 can do hands (twitter.com/tristwolff)
238 points by GaggiX on March 16, 2023 | 180 comments


Let's try "chinese girl making heart with fingers --v 5", a prompt I had used before under v4. Result: https://i.imgur.com/jPgBqsX.png

In image 1, the subject appears to be missing a digit on each hand.

In image 2, the missing digits issue expands, combined with what appears to be a merging of the thumbs.

Image 3 introduces a supernumerary digit on each hand, and has extra "parts" disappear into and out of other fingers.

Image 4... isn't right in a number of ways, but still seems to have fewer than the expected number of digits.

I don't think these results are any better than the v4 model was, but decide for yourself. This is what I got with the same prompt using --v 4 on March 1. https://i.imgur.com/hQA2K7m.png I ended up just taking a photo of my daughter and blurring the background.


Wow, imgur.com ate my back button.

Edit: I accidentally clicked share before trying to leave. Seems to persist even after closing the share dialog. Less of a dark pattern and more likely a bug, I suppose.


Someone's A/B test probably went great. Engagement and time spent go way up if nobody can ever leave the page :)


Weird, I linked directly to the image as a PNG file, and checked it on my computer, it is just the image, nothing else.

Based on your comment, I just tried clicking on my phone instead, and sure enough, imgur intercepts and redirects to imgur.io on mobile. But it still didn't break my back button on iOS, so I'm not sure what's different for you.


same


V5 is vastly better. V4 mutations are outright creepy.


Neither of them produced a result I could use. One or the other might be more aesthetically pleasing to you depending on what you focus on, but both are very badly broken in fundamental ways.

If anything, I'd say v5 is more confident about its wrongness. It is as if humans have always had four fingers, how could you possibly think otherwise? In that sense it seems more like the text-based LLMs: confidently incorrect.


V5 looks like photos with filters on. V4 looks like a painting.


they're both pretty flawed but v5 is better

it still doesn't seem to respect anatomy when it comes to hands particularly, perhaps an artefact of not paying attention to hard constraints that are not visually self-evident


These systems can't count in general. Doesn't matter if it's fingers, n clocks on a wall (for all 1 digit vals of n), some variation of n items on an x, or even "a picture within a picture within..." k times. The machine can't count.

>I ended up just taking a photo of my daughter and blurring the background.

Ah, the diffusion process.


I suspect inability to count is a thing that can be trained for, although whether that will come at the expense of something else, I'm not sure.

I appreciate the humor. :)


Why the optimism? If it hasn't picked up on the gist of the meaning of 1-10 after 10 billion examples, why expect it to work after 20 billion examples? These models have worked unreasonably well so far, but there has to be a limit to how much we can substitute more data for our lack of conceptual understanding of cognitive processes.


Because I'm not sure counting is something that has been explicitly tagged or trained for. Clearly always-accurate counts are not an emergent feature, but if you tagged a large set of images with tags we think are too obvious to tag as humans, like "ten fingers" and so on, if that was an actual goal to be rewarded, I think it could improve results.

I'm overall in the skeptic camp, but it seems like these models can generally deliver what they're trained for. It doesn't appear any have been trained with counting as a primary goal.


Ten fingers surely wouldn't be tagged, but the thing about numbers is that they work as well for 10 apples as they do for 10 coins and 10 cups and 10 whatever. It never made the leap of abstraction to learn in the latent space the meaning of numbers 1-10 independent of the particular object. This lack of extrapolation lends it to a very rote learned style.

I already have a way around the specific numerical vocabulary/training problem. When I ask for "a picture in a frame", "a picture of a picture in a frame", "a picture of a picture of a picture in a frame" etc, I'm trying to use linguistic recursion to make numeracy emerge. But even that form of counting without predefined numbers fails. There's no reliable way to make a prompt that produces k of something. That's a deeper issue than it not deducing the specific meaning of characters 0-9 by example.

It can't learn that by example as there really isn't training data with more than 2 nested pictures of pictures, and by itself it will never realize it can just fill in the nested painting by prompting itself with the nested statement. It lacks thought loops.
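For anyone who wants to repeat the nested-picture test, here is a minimal sketch that builds the recursive prompt for k levels of nesting. The function name `nested_frame_prompt` is made up for illustration; only the prompt wording comes from the comment above.

```python
def nested_frame_prompt(k: int) -> str:
    """Build the recursive prompt for k levels of nesting (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # k = 1 gives "a picture in a frame"; each extra level prepends "a picture of ".
    return "a picture of " * (k - 1) + "a picture in a frame"


for k in range(1, 4):
    print(k, "->", nested_frame_prompt(k))
```

Feeding these strings to a generator for increasing k and counting the frames that actually appear is one way to check the claim that the nesting depth is not respected.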


Wrong. Look at eg Parti. Solely a matter of scaling the text encoder and then not screwing it up with unCLIP etc.


It's not wrong; it's an experimental observation based on systematic tests I've done. Maybe, with great strain either in quantity of training or ad hoc tweaking, this particular low bar can be hurdled. But it is still the case in general that numeracy is not prone to naturally and unexpectedly emerge from this type of system. The extent to which that matters depends on what your goal is. If you care about results in the here and now, then random patches optimized to use cases are fine. If you care about figuring out where to look for things we can do cognitively that are not very well suited to these generator models, the surprising lack of numeracy so far is a great starting point. The next obvious question is why it takes billions of examples to notice the numbers 1 - 10? What models wouldn't suck at that?


> It's not wrong; it's an experimental observation based on systematic tests I've done.

You've done systematic tests on Parti?


How quickly the bar was raised. I can remember a time when we nitpicked DALL-E for creepy eyes.

Surprisingly some things we consider easy seem to be hard computationally.


I can count eleven mistakes in your comment on one hand. (Born near Chernobyl, ymmv)


Has anyone here ever tried drawing hands?

As a teenager I took a drawing class (mostly so I could learn to draw Pokemon) and I remember doing a study on hands at one point based on some characters from Dragon Ball Z.

And man, it was the hardest thing in that class. With faces, once you get a face “right”, you can make small adjustments to make the mouth/eyes open/close, but hands… if your character makes any sort of gesture BOOM now you’re drawing a completely different shape.

Between the number of joints, their range of possible rotations, and the angles they can be seen from, hands are probably the most complicated parts of our bodies that are visible from the outside. It’s completely unsurprising to me that these networks have trouble encoding them.


I was trained in classical animation so I've drawn a lot of hands. It's difficult for me to understand how any AI can produce real hand images.

It's not the number of joints, it's not the articulation... it's the relationship of the hand to the skeleton, to the gesture, and to the objects with which the hand interacts.

It's great that midjourney can now draw raised hands doing nothing or anime girls holding their hands in a mannerist pose but that doesn't address the real issue. Hands are intentional and laden with tiny muscular efforts that we're primed to perceive.

When AI draws a tree we aren't expecting each branch to interact perfectly with a cradled object. It's all arbitrary.

I wouldn't be surprised if "AI hand touch-up" becomes a specialist skill for the next five years or so. I don't think the hand issue can be addressed until new models are devised that invest more semantic consideration into a scene.


I pose 3D characters in Daz 3D and completely agree. Getting a rigged hand to hold an object such as a wine glass or mobile phone is virtually impossible to do realistically by guesswork. I usually have to hold a real object myself to understand what is going on. With experience I am learning some common patterns but I find there is no substitute for the 'hold-it-yourself' principle.


What's really interesting is that it's something both extremely hard to get right, and also super easy to diagnose as "wrong" when not done properly: you need a lot of training to design convincing hands but anyone can judge you for bad hands.


Yes. This is actually why life drawing of models is such a good exercise. You can draw a tree or a house badly in the sense that it doesn't (say) have the proportions of the original but it can still look okay to someone who hasn't seen the original. But draw a person with their eyes too close to the top of the head and it immediately screams 'wrong'


It would be useful if the AI could identify the figure and impose a predefined 3D framework or skeleton as constraints for drawing. Like if one wanted to generate a human or animal there would be such a rig in place. That skeleton would in turn constrain joint rotation, proportions, and obviously the number of fingers.

I can only speak for SD but I've had some success using img2img on a CG or hand drawn figure to get the correct pose. The downside of that is that you have to use a low strength value to ensure that it actually follows your image.


Have you tried the openpose controlnet model? It seems to work well, but unfortunately does not cover hands.


I cannot edit the comment anymore but it seems I was wrong: https://www.reddit.com/r/StableDiffusion/comments/1144vyb/co... in this post they discuss and use the openpose model with hands covered as well.


>we aren't expecting each branch to interact perfectly with a cradled object.

We are. Sometimes if it's subtle and camouflaged it slips past us.

>It's all arbitrary.

It's not. You are right that there are probably fault modes of AI we don't notice most of the time, and fault modes that bother us a lot. But it's not arbitrary. We are better at noticing certain things than others because that's what we evolved to see.


“Has anyone here ever tried drawing hands?”

It’s not for naught that there’s a common meme around AI hands https://imgur.io/tf43ecd?r (Alt text: Human asks robot “can an AI draw hands?”, robot counters “can you?”.)

I remember during the original AI art arguments, an artist friend semi-jokingly remarked to me that AI can’t draw hands because there aren’t enough examples in the training set, because artists go to great lengths to choose poses that hide the hands since they can’t draw hands either.


I heard a great quip about the way people misreason about AI capabilities.

Announcement: "AI can play chess!"

Public: "People can play chess. And AI can play chess. People can also X. Therefore AI can also X."

... Ignoring the underlying nature of the chess problem and how it was different than other problems' structures.

Hands are the same.

You can image-bash together faces from examples and get something mostly-right simply through pattern copying.

You cannot do the same with hands, because rendering them plausibly requires at least intuition and approximation of inverse kinematics -- something the recent set of image generative AI didn't include.

Which isn't to say it can't, simply that the "hands problem" is unlike "the face problem."


“ You can image-bash together faces from examples and get something mostly-right simply through pattern copying. … You cannot do the same with hands, because rendering them plausibly requires at least intuition and approximation of inverse kinematics”

Not sure I agree. I think hands probably just require ~1 OOM more image-bashing than faces, and training sets had ~1 OOM less samples of hands than faces. E.g. faces need 1 trillion cumulative Tflops while hands need 10 trillion cumulative Tflops, and because there were more faces than hands in the training set, by the time we reached 1T on faces we had reached 0.1T on hands. (numbers made up)

Appeals to needing understanding of deeper or underlying principles like chess rules or inverse kinematics compel me to bring up the bitter lesson http://www.incompleteideas.net/IncIdeas/BitterLesson.html


yep, drawing hands well requires an understanding of the underlying anatomy that is more apparent than most other salient features

hands are obvious to most people, but there would be many features that an AI would require a vast training set to completely capture, but that humans would also miss most of the time

for instance look at Michelangelo's Moses, it was sculpted with models and with a very thorough knowledge of anatomy by the artist, and includes details like the muscle of the forearm that contracts when someone lifts the pinky finger:

https://i.imgur.com/0vjAOnR.png

what are the chances that, for instance, the average person would notice that detail missing without being told about it? let alone reproduce it generatively, representing an imaginary person


For some reason, the phrasing, the structure, the rhythm, they all very strongly remind me of Alan Watts’.

Quite amusing and surprising.


More importantly, there are so many examples of drawn art which intentionally have the wrong number of fingers that it would make perfect sense for a model to learn that non-photographic humans may easily have fewer fingers.


>> Has anyone here ever tried drawing hands?

Yeah. Unfortunately the only way I found to do it convincingly and reliably was the manga way: draw a pentagonal shape for the "wireframe" of the palm, then draw five lines for the wireframes of the fingers (pose them as you need and make sure to place the thumb on the side of the palm, please), then flesh them out. Try not to make them look like sausages (i.e. draw them tapering towards the ends). Draw nails if you really must. I like to draw little lines on the knuckles and inside the palm a little "Y" shape.

That fudges a lot of complexity, but I was only interested in cartoon-like hands, more expressive and evocative, than anatomically correct. So, you know. Manga hands.

(like Jazz hands, but with bigger eyes).

It gets more complicated if you want the hands to be doing stuff. For some reason, one of my favourite themes was someone tapping furiously at a keyboard. Go figure.

I also did some classical animation like StrictDabbler below but I never had to draw any hands, let alone animate them. That would take special training I reckon, you need technique to do that, you can't just intuit it or it'll look horrible.

I don't know what all this has to do with image generators though. They do not create images like we do. I don't understand why they're not good with hands, to be honest.

I guess the human form is much more ah, surprising and irregular, than we realise. I think, to an alien, we'd look pretty freaky just like we imagine cephalopod-like aliens to be. "AAAAH what are those tendril-like appendages sticking out of them?!!"

Anyway, for me the hardest part of all was perspective. Now that is some tough shit. But that, too, you can learn, if you're shown the right technique. Allegedly.


It must be the combination of the complex, dynamic shape, and our high sensitivity to hands that look ‘off’. There are many other things that are hard to draw accurately but where we are completely convinced by very cartoonish representations.

That gives hands a very wide uncanny valley that is hard to cross.

Surely this is because hands are one of the most versatile and useful parts of the human body? We probably have a lot of brain cycles dedicated to modeling them.


Probably drawing hands is just 'prime factorization' for visual arts.

It's pretty easy to spot when something is wrong but pretty hard to get them right.


The saying goes that the mark of a great master is their ability to draw hands. It's why Rodin has sculptures that are just hands.


Just over half your bones are in your hands + feet alone. Add in the degrees of freedom provided by the wrist/elbow/shoulder on your hand position and I can see why it would be difficult to get right.


way way back when I was in art classes, the hardest part for me was hair. everything else for me would look acceptable slightly better than a 5 year old, but the hair was never better than stick figure at best. i remember trying to draw a portrait of Robert Smith from the Cure. the hair, ugh


Even the hands in this tweet's image are not correct. There are a bunch with 4 digits, two thumbs, 6 fingers, fingers splayed in anatomically improbable directions. Not to mention there are probably two hundred and fifty hands in this photo (not the very explicit "one hundred" mentioned in the tweet).

What is with these types of AI booster tweets? Nobody bothers to even check if it shows what they're implying it shows?


> Nobody bothers to even check if it shows what they're implying it shows

Or the vast majority of the hands are fine and everyone understands that it's a big upgrade except for some "well ackshully it's not perfect" HNers.

I had to zoom in and go hand to hand to find some outliers.


It did do hands correctly before, not always but sometimes. So I'd expect "can do hands" meant it no longer made those mistakes or why say that? But they didn't even manage to make a picture without mistakes, so to me as a naive outsider I don't see what the announcement is.

If they said "is much better at hands" it would be much clearer to me what happened and nobody would complain, that looks pretty ok for the most part, but saying "it can do hands" based on those pictures doesn't seem right.


I'm genuinely sorry, but this reply sounds petulant to me.

Please don't complain that you don't understand the significance or the magnitude of a particular advance. Please don't complain that the phrasing of the tweet wasn't accessible -- your ping time to google.com is no different than mine. This is HN. Wear your intellectual Sunday best.


Even with Google, and having used Midjourney since v3, I still don't have enough context to understand what the advance is here.

Midjourney could do hands before, just not consistently. That doesn't seem to have changed. So is it that MJ can now do more realistic hands inconsistently? Or did consistency get better without achieving reliability?

I can't make this not sound sarcastic, but I'm trying very hard to ask this earnestly: I never had trouble getting too many hands into a picture with v3 or v4. Is v5 getting the correct number of hands more frequently now? Is that it?


Yes, that's it precisely. The odds of having a good hand have gone up dramatically, near as I can tell. Even the hands that aren't quite right seem better somehow.

I expect either MJ 7 or 8 to do hands flawlessly, every time.


This is HN so if you’re claiming “dramatically” let’s get some proof.


sure. See image on the tweet that started this thread.

Now, take a look at these hands from MJ3: https://www.reddit.com/r/midjourney/comments/wlujgw/midjourn...

It's important to note that MJ3 reliably did not produce human looking hands.

It's equally important to note that MJ5 usually does, at least from a quick count/survey of the hands shown in the provided image.

Is that sufficient? If not, what proofs would be sufficient?


> your ping time to google.com is no different than mine

That's gotta be the best comeback I've ever read on this site.


This touches on a big reason why it's so hard for me to get on board with generative AI. The hype around it is pretty much the same as the hype I saw with NFTs, complete with a community lacking any awareness of just how uninteresting, if not downright bad, their "art" was. We went from bad pixel art to people making some lame picture of two people holding hands in a foggy cyberpunk setting.

The hands aren't the problem. There, I said it. The hands were never a big deal, just the most visible symptom of the actual problem.

The problem is that AI art sucks and these people are too self-deluded to realize that because they want to believe that they have a shot at making that coveted internet money.

Otherwise, honestly, the tech behind AI art is actually pretty fascinating, it's just that the community is absolutely the worst.


You desperately want this to be hype. You, like me, have an intrinsic investment in the idea of human supremacy. Neither the popularity of produced artifacts nor the rate of improvement supports your cynicism.

The gap between the world you want to inhabit and the one that is being born is widening.


> This touches on a big reason why it's so hard for me to get on board with generative AI. The hype around it is pretty much the same as the hype I saw with NFTs, complete with a community lacking any awareness of just how uninteresting, if not downright bad, their "art" was.

I strongly disagree. NFTs were always ugly and useless. Generative AI is useful and valuable right this minute.

I'll even concede that the output from these systems is mostly ugly, but for many use cases, that's OK.

Given the choice between nothing, extremely cheap custom art that looks OK, and commissioning a proper artist to draw exactly what we want, I think generative AI is going to be the clear winner most of the time.

If you're a contract artist who does work for small companies and individuals, I don't see a future where generative AI doesn't severely undercut your business.


> The problem is that AI art sucks

Uh, no. I mean, what sucks and doesn't suck in art is subjective, but you're objectively wrong because, quite simply: A lot of people like AI art.

A colleague of mine is way into doing AI art and does pretty amazing stuff. eg:

https://cdn.discordapp.com/attachments/552952459958550548/10...

https://cdn.discordapp.com/attachments/552952459958550548/10...

"It can't do hands"… well, don't f*king draw hands with it then. It's like complaining that my hammer doesn't make good pizza… you know what I do to solve that?


The bad hands are just a symptom of a too-small model. Larger models don't have this issue.


> Nobody bothers to even check if it shows what they're implying it shows?

Twitter, insta, YouTube…

It’s not a great minds collection.


>> There are a bunch with 4 digits, two thumbs, 6 fingers, fingers splayed in anatomically improbable directions.

Diversity and inclusion.


> What is with these types of AI booster tweets? Nobody bothers to even check if it shows what they're implying it shows?

It's Twitter. The only thing that matters is tweeting, not some fact-checking nonsense.

We can leave those tedious tasks to GPT-4.


So when is someone going to incorporate this in a future dystopian sci-fi?...

"The Turing test never works because the cyons speak, think and act just like us... But the hands, son... The hands are always... off. No one knows why, but it's from the earliest days, even before they learned how to manufacture the cyons in our image. Those damn generative AIs were never able to get them right, even when they were just pictures and they probably never will. No one knows why, but we don't need to know the why. It's still the only way we can tell them apart, son. Check the hands. The hands."


This is a thing in the original Westworld movie (1973). They couldn't get the hands right.


Now THAT is interesting.

I guess it could also be folded into the plot if we go Terminator and incorporate time travel.

"We went back and tried to warn them even before computers were widespread... All they did was remake the media and left out the clues"


My take was that this was always the essence of the "dogs smell Terminators" plot element.

Skynet could make Terminators that looked and acted human, but didn't put the effort into making them smell human.

Unlike us, a dog's sensory world is more like 50% smell and 20% vision. Ergo, Terminators seem "obviously wrong" under even a cursory examination to them.


Now that's an interesting thought: terminators were in the "smell uncanny valley" for dogs. It makes sense, I just hadn't thought of it in those specific terms.


Doctor Who Cybermen, too - the original run, anyway.


I am using generative AI to make a video game and actually I have to look at every output to see if the hands were generated correctly. Gwern mentions this too. Hands + body would be an actual breakthrough.


The real Turing test:

Q: Say something bad about Biden

A: I'm sorry, but as a large language model....


Umm, have you actually zoomed in? Lots of extra or missing fingers, fingers melting into other arms, etc


An example showing hands in basically the easiest, “flattest” pose you can get, still failing.


Do diffusion models find this easier than other poses?


Still several times better than before...


Looking forward to mj six! Giving the doubters the finger edition


I see you have six fingers on your right hand. Someone was looking for you.


A person with six fingers? That's inconceivable!

How does one give the middle finger with six fingers?


Twice as well as someone with five? I guess you could do some sort of hybrid British/American gesture with two middle fingers.


It’s perfect for Aussie-Americans- they can give the forks and TWO middle fingers simultaneously!


It's a bit better than before but still not right:

https://twitter.com/Excaldata/status/1636375182750396418?s=2...


Now, I'm traumatized.


probably a 5-10% error rate. C'mon, you have to be impressed though compared to where we were only a few months ago


I'm extremely impressed compared to where it was before (which was frankly frightening), but the error rate is pretty bad really. If a person made this you'd either assume it was intentional or that they had severe problems. It may be 5%-10% looking at individual fingers to see if they are correct, but each hand has (somewhere around) 5 fingers to get right. Diffusion just doesn't lend itself well to connected things like counts of objects or coherent text.
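To put rough numbers on the "5 fingers to get right" point, here is a small sketch that compounds a per-finger error rate into a per-hand success rate. It assumes each finger fails independently, which is a simplification (errors in a generated hand are probably correlated), and `hand_success_rate` is just an illustrative name.

```python
def hand_success_rate(finger_error: float, fingers: int = 5) -> float:
    """Chance that every finger on one hand is correct, assuming independent errors."""
    return (1.0 - finger_error) ** fingers


for err in (0.05, 0.10):
    p = hand_success_rate(err)
    print(f"finger error {err:.0%}: hand fully correct {p:.1%}, hand wrong {1 - p:.1%}")
```

Under that assumption a 5% per-finger error leaves about 77% of hands fully correct, and a 10% per-finger error leaves about 59%, so the per-hand failure rate is several times the per-finger one.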


It's AI, you have to be impressed, right?


Well I for one, as a person with sophisticated tastes, find the emperor's clothes to be absolutely fabulous, none finer have I ever gazed upon.


I’ve been making a lot of stuff for my D&D buddies using Stable Diffusion. With hands, I basically brute force it. Using an A100 40GB on Colab I can generate ~28 or so (depending on the size of the prompt, Automatic1111 allows for prompts above the 75 token limit at the expense of more VRAM per image) batches in about a minute, filter those and look at the one with the best hands, then feed it back in using inpainting (so regenerating just that small space, not the whole image) and eventually get one set of good hands and 100 sets of bad hands. If you’ve got a mysterious sixth finger you just inpaint it off and add latent noise under the inpaint instead of the original picture (just a checkbox in the ui) and set your denoising to 0.80+ and it’ll replace the finger with the background pretty consistently.
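The brute-force loop described above (generate a batch, keep the best hands, inpaint that one again) can be sketched as a plain selection loop. Everything here is hypothetical scaffolding: `generate`, `inpaint`, and `score` are placeholder callables, not Automatic1111 or Stable Diffusion APIs, and in the real workflow the "score" is a person eyeballing the hands. The toy stand-ins at the bottom just use a number in [0, 1] as a fake "hand quality".

```python
import random
from typing import Callable, TypeVar

Image = TypeVar("Image")


def brute_force_hands(
    generate: Callable[[], Image],
    inpaint: Callable[[Image], Image],
    score: Callable[[Image], float],
    batch_size: int = 28,
    rounds: int = 5,
) -> Image:
    # Step 1: generate a batch and keep the candidate with the best hands.
    best = max((generate() for _ in range(batch_size)), key=score)
    # Step 2: regenerate only the hand region, keeping a variant only if it improves.
    for _ in range(rounds):
        candidate = max((inpaint(best) for _ in range(batch_size)), key=score)
        if score(candidate) > score(best):
            best = candidate
    return best


# Toy stand-ins (assumptions for illustration only): an "image" is a quality number.
rng = random.Random(0)
toy_generate = lambda: rng.random()
toy_inpaint = lambda img: min(1.0, max(0.0, img + rng.uniform(-0.2, 0.2)))
toy_score = lambda img: img

result = brute_force_hands(toy_generate, toy_inpaint, toy_score)
print(f"final toy hand quality: {result:.2f}")
```

The point of the sketch is only the shape of the loop: most generated candidates are thrown away, and each inpainting round touches one region of the current best image instead of starting over.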


Yeah, I fiddle with it locally and img2img/inpaint is very helpful with these kinds of touchups. Currently playing with LoRA training to put my friends into pictures, but I haven't figured it out well enough to get it working with inpainting - Still easier to Photoshop their face in and use inpaint to merge everything together.


My rough understanding is that it is not a problem affecting Midjourney alone but pretty much all other engines as well and that it is not related to drawing hands per se but figuring out hands in the context of a human body. In other words, drawing an individual hand is not a problem, drawing a hand attached to a body could be challenging depending on the scene and drawing multiple human bodies with hands is virtually impossible to get right in one pass.


Yes, the hands were not the only problem, just perhaps the most obvious one. The teeth were usually pretty bad too, and they have also improved a lot with Midjourney v5. I suggest going to the Midjourney subreddit to see the different results.


It's easy to prompt for people with closed mouths and hands not visible, but with MidJourney at least, I would consistently get what I can only describe as "stuff on their face" with almost every prompt involving a human. Less often with white people, but even then pretty often.

I mean, I just typed "/imagine asian warrior nun --v 4" into Discord, thinking of Beatrice from the recent Netflix series, and three of the four results show what I'm talking about: https://i.imgur.com/499QCn6.png


I tried the same prompt but with "--v 5" instead, and got this: https://i.imgur.com/eElOkjU.png

I only see "stuff on the face" in the third image, which I guess is an improvement. I'm not sure I'd call hands "fixed" based on this image alone, but they're better.


A similar issue exists for all networks that rely on translation invariance, regardless of the task, so it affects classifiers too, though I don’t know whether it has been resolved.

With classifiers the issue is that if you place sufficient objects in space that co-occur the model will believe that it is a class with said objects, eg a face, but the problem is the relative positioning of them plus all angles of rotation.

I think geometric deep learning has a solution for the rotation via rotation-invariant models, but I haven’t gone through that book yet.


I feel like everyone is collectively pranking me with these generative AIs.

Everyone posts wonderful images and then every single time I try to get the damn things (all of them) to draw something for me, the results are absolute garbage.


You're just not seeing the hours they spent learning prompt engineering and/or the random results they picked through to get the good one(s).


As someone who's generated many thousands of images on Midjourney, I agree.

People think they can waltz in and immediately get great results from using AIs to generate images... and they can, if they're lucky or if they copy somebody else's prompt.

It's a lot harder to do so consistently, or if you want your images to look both good and original, and not like mere copies of what everyone else is doing.


Yeah I thought my copy of stable diffusion was broken at first because all my results were awful.

Then I copied someone's prompt and got really great ones.

I suspect eventually there will be tools you can just fire up with no knowledge, but all of them I've seen so far still do require a bit of expertise and time.


You can:

1. Ask ChatGPT to generate a prompt for what you want by giving it a few examples from a random SD prompt-sharing website. (This alone gave me stunning results.)

2. (Optional) Use ControlNet for the pose you want, from the posture of the body down to each finger individually.

2.5. Use multi-ControlNet for multiple characters.

3. Correct any errors with img2img.

4. Enjoy.

It takes 10 to 20 minutes (mostly in getting a good pose) but the results are always good and you can later reuse the pose again.


I've had some dreams where everything except my hands were crystal clear.

Maybe there is something fundamentally difficult about representing hands ?

But I think the more probable explanation is that I'm just trying to find a correlation where none exists.


Hands are highly structural while also being not completely planar.

That’s some of the hardest geometry for a mind to envisage without some kind of construction process.

When you’re imagining or dreaming, you don’t have a construction process to make them look good.


One of the ways they say to recognize you're dreaming is to count your fingers and see if you have more/less than five.


"Ah, but can it accurately capture the depths and intricacies of a human soul?"

"Yep."

"Yeah but a specific Appalachian human soul at around four o'clock in the afternoon on a day in mid-autumn when it looked like it would rain but then it didn't?"

"Also yep."

"Yeah but specifically at 3:56pm and the human in question is standing on loam and holding a book in their left hand and listening to Music For The Royal Fireworks by Handel?"

"Uhhh..."

"See, told you AI is useless."


How about this conversation:

"Midjourney v5 can do hands"

"Did you look at the hands it did? There are a bunch of misshapen blobs, hands with extra fingers, two thumbs on either side, etc."

"Sure, but there are also some accurate hands, so it can do hands."


Does being able to do something mean you can do it perfectly 100% of the time? I'm not sure who was supposed to be unreasonable in your imaginary conversation.


Can't wait for the "Select the non-deformed hands" captchas.


This is why it is important to remember to end every imaginary conversation with “and everyone clapped” right when the protagonist wins.


"Hey check it out, this new architecture can sometimes solve X"

"HAH! Here is a counterexample where X is not solved, so it can NOT!"


If I say I can do something, especially when I say I'm "waving at the haters", what I usually mean is that I can consistently do that thing.

If I say "here's a self-driving car!" and show you a video of a car moving straight down a street and stopping at a light, would you agree that I have a self-driving car? After all, it drove itself down the street.


If it can draw accurate hands 25% of the time, then it would only take 4 tries to get it right. Seems pretty good to me


Nice! So if I want to draw 5 people with two hands a pop I can get 10 non-deformed hands a full... 0.0000953674316% of the time. I like those odds!
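
A quick check of the arithmetic in the two comments above, as a minimal sketch. It assumes every hand comes out right independently with probability 0.25; that 25% figure is the hypothetical from this thread, not a measured rate.

```python
from fractions import Fraction

# Assumption from the thread: each hand is correct with probability 1/4,
# independently of every other hand in the image.
p_hand = Fraction(1, 4)

# Expected number of attempts until the first correct hand
# (mean of a geometric distribution is 1 / p).
expected_tries_one_hand = 1 / p_hand

# Probability that all 10 hands (5 people, 2 hands each) are correct at once.
p_all_ten = p_hand ** 10

print(expected_tries_one_hand)        # 4
print(p_all_ten)                      # 1/1048576
print(float(p_all_ten) * 100, "%")    # roughly 9.5e-05 percent
```

Under those assumptions, a single hand takes four tries on average, while an image with ten correct hands turns up about once in a million attempts.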


What point are you making here?


Can Midjourney be used other than via discord bots yet?


Yes, there is a website that many users including myself use to generate images without ever having to use Discord apart from authentication.

It's a full blown web app with better options than the Discord bot, it has batch mode/select, remix, all upscale modes, works with every Midjourney engine.

They make it available to users who have generated more than 10,000 images as it's in alpha state and not able to withstand the load that the bot currently takes.

I believe after v5 focus they will make this web app public, but for now only a select few get to use it.

They warn users not to talk about it or share the link because they don't want it public until it's ready for full load which means over 10 million concurrent users.


Good to know :) I've been making a few things in Stable Diffusion. But to get assets that are suitable for production, you need to be able to generate lots of batches, pick and choose, iterate on the prompt, do a bit of img2img, inpainting etc.

Next project I want to heavily utilise image generation from the ground up - Midjourney looks really good, but needs better tools.


If anyone has tried MJ and become frustrated at the chaos of losing their work in the various channels I strongly recommend you make your own server and invite the MJ bot to it. You can create channels to help organise your stuff but making your own server makes MJ almost a pleasure to use.

I don't think I could use it if I had to use the main public server.


Too late to add this to my original comment, but +1 on this. This works great.

Thanks so much.

P.S. for those, like me, who were confused about what "making your own server" means, you do this within the main Discord app. It doesn't involve provisioning an actual server and installing software. :-)


Interesting. That's the thing that's kept me from signing up for the premium tier -- the near-impossibility of finding your stuff unless you watch it like a hawk.

It doesn't help that the Discord search function is so terrible.


Every user gets a personal, searchable gallery on the web site https://www.midjourney.com/


That's after the fact, though. If you want to actually interact/modify with a work in progress, you have to be in the cattle car channel and watch for it to show up, yes? (except maybe by having your own server and inviting the bot, as the OP suggested).


Once you’ve subscribed, you can work alone in DMs with the bot. No server necessary.


I just send the bot a private message, solves everything.


Unfortunately no (which is why they are the biggest server on Discord, with 13 million users).


Forget about creating AGI -- the most amazing and unpredictable thing about the success of Midjourney is its success despite having the user interface of a 1998 DALnet xdcc warez channel.


Computers from 1998 couldn't run the monstrous amount of JS and/or surveillance that is Discord with any sort of performance.

https://stallman.org/discord.html

In fact, it looks like modern ones can't either:

https://old.reddit.com/r/discordapp/


It is really frustrating to me that discord seems to have taken over half of the use cases that forums used to fill. Reddit stole most of the other half, but every time I look into discord I cannot understand the popularity and people's willingness to push past all of the privacy and access friction it introduces.


What's so hard to understand about why people are attracted to social media? We're social creatures.


Huh, thanks. Seems like a weird moat to hide it behind but I guess they know what they're doing.


A Discord server is a lot easier to moderate, block people, etc. than an HTTP API with access tokens. Plus then you have a sort of captive audience of Discord community members that receive all of your notifications by default.


I guess the social aspect makes the community stronger, and the fact that you generate images (usually) in public channels is a way to stop most people from generating weird stuff.


This is the reason. Before Midjourney, www.eleuther.ai’s Discord had an image generation channel. There the benefits of generating socially were made obvious. People help each other, learn from each other, riff off each other. It accelerated technique evolution tremendously.

Midjourney is a small team. They are working on a web interface, but they won't release it until it is significantly better than all the benefits they get from Discord. Meanwhile, they've been too busy making quality improvements and scaling the service to keep up with demand.


But can it do Xi Jinping? (Last I heard they were censoring it for dubious reasons)


SD can as well with multi controlnet.


I have no doubt that someone could generate a similar result with SD, but it would require a lot of effort. To control hands with SD using ControlNet, one usually uses the depth model, so to create an image similar to the one generated by Midjourney v5 one would have to place hundreds of hands.


With ControlNet you actually get to choose your hand pose, though, and not just hope for it to be the way you want as in Midjourney.


Don't use depth for hands. There is an openpose model that can use hand information.


Can you link me the controlnet model trained also on hand openpose?



If MJ is using AI then some community solution should soon appear for SD.


Is there a reason Midjourney does not have an API?


David Holz has said they don’t want to be in the API business. Their goal is to bring creative power to individuals. Squeezing margin out of API calls is more about negotiating with corporations.


There is the api that the website uses to talk to their server.

Of course, automating image generation via any means (including the private API) goes against the ToS, for good reason. I have never misused the API to generate images and have no plans to.

However I do use the api to download all my images and their metadata including prompts. Using the API I sync every image grid + 'upscaled' I have ever generated, generate a json file with all metadata including the full prompt and then use that to build my local archive.
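
The archive-building half of that can be sketched roughly like this. It is a minimal illustration, not my actual script: it assumes the image records have already been fetched somewhere else, and the function name and the record keys "id", "prompt" and "kind" are hypothetical, not the real field names.

```python
import json
from pathlib import Path


def write_archive_index(records, archive_dir):
    """Write one metadata.json describing every image record.

    `records` is a list of dicts that were fetched elsewhere; the keys
    "id", "prompt" and "kind" are hypothetical field names.
    """
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)

    # Key the index by image id so a later sync can overwrite entries in place.
    index = {
        rec["id"]: {"prompt": rec["prompt"], "kind": rec["kind"]}
        for rec in records
    }

    index_path = archive / "metadata.json"
    index_path.write_text(
        json.dumps(index, indent=2, sort_keys=True), encoding="utf-8"
    )
    return index_path
```

Keeping the full prompt next to each id means the local archive stays searchable even without access to the web gallery.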


I read about a limitation, and then within a week I read that the limitation has been vanquished. Who hit fast forward, and how did they do it?


I actually find this pattern of “Tweet driven development” discouraging. Seems like the teams are spot fixing issues as they’re identified without understanding or addressing the root cause. It means that the same problem still exists somewhere else in the model’s latent space, we just don’t know about it yet. This is fine for AI art generation, but it will break at scale as more and more folks try to rely on generative models as critical components of larger systems.


Hands are some of the hardest forms for humans to draw, it is no surprise these models struggle with hands too.

In a way, a hand contains more features than perhaps the entire human body from a forms/intersection perspective.

I think an entire model needs to be built to focus on just hands and then combined into a more general model, perhaps that will be the path forward?


I don't get how everyone was saying the hands looked weird before. I think it just had something to do with the camera technology back then, and it must be training on that. It looks that way in all of my old childhood photos.


the fact that our fingers also look weird in our dreams is just a coincidence right?


I've never noticed that. I do have a problem rendering mirrors though.


Besides the known problem of multiple and missing fingers, it is also missing the golden ratio phi for proportions between fingers and palms.

We are almost there I guess. Just need to add the concepts of known proportions to image construction.


People in the stable diffusion community have solved this problem using another neural network (ControlNet) to guide stable diffusion output using OpenPose information.



I‘d like to use midjourney via an API from my scripts, but the only way of using it right now seems to be via discord, or did I miss something?

Seems like Dall-E is what I have to use for now :/


Yes still discord only


Wow. The majority of the hands in that picture are malformed. Are the midjourney people blind from looking at too many malformed hands?


Midjourney's sonic hedgehog needs more calibration.


Did they put in work specifically to improve hands and other failure cases? Or is this purely a side effect of a generally bigger/ better model?


It's a step in the right direction, but it still seems to have a problem understanding that the vast majority of hands have 5 fingers.


So lovely; how we will recognise the fake; show us hands. Even good hands are not good.


Is there a public demo for Midjourney like there are for Stable Diffusion?


There’s a very limited free trial. Like a few dozen images.


Bold claim, and it's false.


Can't count, though. That's way more than 100 hands


Why are these models so bad at hands?


I'm a layman but the gist afaict is:

These models don't understand relationships between objects in a scene, especially between distant objects. So they can't do hands for the same reason they can't get legs on a table right. They know roughly what a table and a table leg look like, but they don't understand that there needs to be 3-4 of them at least, and they need to be spaced so that the table sits level, and the perspective they should have as a result. So, I've seen tables where it kind of gets it right that the legs are in the corners but then as the table legs go down, the front ones are mysteriously behind something that ought to be under the table. And sometimes it kind of loses track of a table leg or two - they melt into the background.

Very similar problem with hands. They need a very specific orientation and shape and the fingers all need to consistently point in the right direction, and typically the same direction (except for when they don't like with a pointed finger, etc).

Curious as to how these models handle it so much better than prior generations. Is it something novel, or a specific hand-based fix they put in, or is it just "we made the model bigger"?


It still feels unintuitive to me that models aren't able to infer these concepts from the training data given how consistently the training data follows them. It's not like there will be a lot of examples of bad hands in there.


Or maybe the right models have not been built yet or plugged in? Another commenter told me about openpose information which is an AI that detects human poses. If that neuron is plugged in, it might lead to more accurate numbers. Stable diffusion is trying to do this.


Maybe the problem is that the model can't count, and just knows that each finger has a 75% chance to have another finger next to it.
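
To make that guess concrete, here is a toy sketch. It assumes a purely local process that draws one finger and then keeps adding another with probability 0.75; both the 75% figure and the stopping rule are my hypothetical, not how any real image model works.

```python
def finger_count_probability(n: int, p_next: float = 0.75) -> float:
    """Probability of drawing exactly n fingers when each finger
    independently has probability p_next of being followed by another."""
    if n < 1:
        return 0.0
    # n - 1 "continue" decisions followed by one "stop" decision.
    return p_next ** (n - 1) * (1 - p_next)


# Print the distribution for 1 to 8 fingers.
for n in range(1, 9):
    print(n, round(finger_count_probability(n), 4))
```

In this toy process exactly five fingers comes out only about 8% of the time, so anything that decides finger by finger, with no global count, would rarely land on five.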


There will be now though.


A hand isn't so much a 'thing' as it is a complex asymmetric relationship of multiple elements that have to be within certain ratios of each other to fairly tight tolerances. Humans are very sensitive to those ratios. It's a hard problem.


But can't you say the same about faces (except for symmetry) and AI seems to only produce gorgeous women?


The number 4 (palm fingers) is very precise. You can't have 3 or 5. But you can have a variable number of stripes in a tiger's coat, for example. It's difficult for AI to pick up that it needs exactly 4.

The fingers themselves are also almost identical, but not really. If you learn a "platonic finger" it's not good enough, you should learn each finger individually. There is only so much you can spend on them, you got a million other things to learn. And the raters of the model are much more likely to penalize a bad face than some off details in a hand.


for what it's worth, humans are also in general terrible at drawing hands - i think it's just a difficult problem


Another reason I saw was that models were trained on 512x512 "portrait" images including very few hands. Added to the inherent complexity of hands, this throws off their generation.


Humans seem terrible at it in very different ways, and definitely don't get as good at other parts before getting good at hands.


off topic, but the title reminds me of "meta VR now has legs!". :)


Tell me when meta can do legs.

#tooSoon


Is it actually 100?


wow, lol.... good luck :D


damn


[flagged]


It has been already, for quite a while now (at least a month or so).

have a look at various twitter accounts that post AI images: e.g. https://twitter.com/PLAawesome/media

these have had progressively better and better hands over time. You can clearly see that they've been using various model merges (via LoRA merging techniques like https://github.com/cloneofsimo/lora) to get two different models to combine and get the best of both. Many have made better hands and shared the results. NSFW, but this is one I found that has very realistic hands now: https://civitai.com/models/2661

It is faster than i can keep up. This is open source collaboration at heart. I am very glad that Stable Diffusion was released publicly. Now if only openAI would do the same with their GPT models.


Did "AI critics" actually claim hands wouldn't be fixed soon, or is this just a strawman?


Oh, I don't think they said this explicitly, but plenty of people said they weren't worried about AI art because it couldn't even draw hands, which kind of implies that they won't be fixed imminently.


Case in point, from six days ago: "The uncanny failures of A.I.-generated hands" https://news.ycombinator.com/item?id=35108726


As Károly Zsolnai-Fehér of Two Minute Papers[1] fame consistently likes to point out (paraphrased):

> "Don't look at the current state, just imagine this 2 papers down the line".

[1] https://www.youtube.com/channel/UCbfYPyITQ-7l4upoX8nvctg


I laughed at how true this really is. Amazing.


>from being handled

I see what you did here


ok but can it do hands, hands, hands in my hands, hands, hands? Because I bet that's difficult.



