
> CUDA Moat Still Alive

Wrong conclusion. AMD is slower than Nvidia, but not _that_ much slower. They are actually pretty cost-competitive.

They just need to make some improvements, and they'll be a very viable competitor.



The amount of effort this took, literally co-opting AMD engineers and working for 5 months only to get closer but still not usable, means they are not even close to usable. What team wanting to do ML training/inference can afford that much downtime for zero benefit? And how many teams, besides a few big ones, can get AMD to devote that many resources just for them?

And, if you’re training a model costing you millions, the last thing you need is a buggy, untested stack, breaking training or perhaps worse giving you noise that makes your models perform worse or increases training time.

By the time AMD is usable out of the box, Nvidia will have moved further ahead.


Sure. But this work is done, and can be reused by others.

Meanwhile, Nvidia hardware is expensive and still is in short supply. AMD might look quite tempting.


It sure doesn’t sound done. It’s a one-off hacked set of scripts tied to incompatible chunks of a ton of libraries. What happens when you want or need other parts of the PyTorch ecosystem and its countless libraries? Are you going to get more AMD engineers and waste another 5 months getting those to work?

Meanwhile, those libs ship running CUDA out of the box, on both Nvidia’s old and newest releases.

So no, it cannot be reused by others in production any more than my custom hacked car engine mod can be added by Ford to every car in existence.

Have you done any deep professional production work on any of these stacks? I have, and I would never, ever put stuff like what’s in the article into production. It’s nowhere near ready for production use.


There is a difference between doing something just for yourself and making it usable by others.


Like the article says, if the model changes even a little, this work needs to be almost entirely thrown out.


All the cloud providers list MI300x as more expensive than H100. So if you compare performance/cost it is even worse.
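As a rough sketch of why a higher price compounds with lower throughput (the dollar figures and the relative-throughput number here are illustrative assumptions, not actual vendor quotes or benchmarks):

```python
# Hypothetical hourly rental prices, $/hr (illustrative only, not real quotes).
h100_price = 4.00
mi300x_price = 4.50  # listed as more expensive than H100, per the comment above

# Suppose the MI300X stack delivers ~90% of H100 throughput on some workload
# (an assumed number for illustration).
h100_throughput = 1.00
mi300x_throughput = 0.90

# Performance per dollar: lower throughput AND higher price compound,
# so the cost-normalized gap is wider than the raw performance gap.
h100_perf_per_dollar = h100_throughput / h100_price
mi300x_perf_per_dollar = mi300x_throughput / mi300x_price

print(f"H100:   {h100_perf_per_dollar:.3f} perf/$")
print(f"MI300X: {mi300x_perf_per_dollar:.3f} perf/$")
print(f"ratio:  {mi300x_perf_per_dollar / h100_perf_per_dollar:.2f}")
```

With these assumed numbers, a 10% performance deficit turns into a 20% perf-per-dollar deficit once the price premium is factored in.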


Just like that? A little work and now they’re competitive? Do you know how much work “just a little bit of work” is hiding? They have a cultural issue that will take months to fix if they’re lucky, and then they can start tackling the tech debt they’ve built up. By that time it will be another hardware generation.



