CUDA Moat Still Alive (semianalysis.com)
221 points by pella on Dec 22, 2024 | hide | past | favorite | 172 comments


> Give AMD Engineers more compute and engineering resources to fix and improve the AMD ecosystem, they have very few internal gpu boxes relative to what Nvidia provides to their engineers.

This is real. We’ve found ourselves having to give hardware to engineers at AMD because they’re unable to get allocation of it internally.


This is baffling. I’m sure there are many technical reasons I don’t grok that AMD’s job is challenging, but it’s wild that they are dropping the ball on such obvious stuff as this.

The prize is trillions of dollars, and they can print hundreds of millions if they can convince the market that they are closing the gap.

It’s embarrassing that whoever actually tries to use their product hits these crass bugs (same with geohot who was really invested in making AMD’s cards work; I think he just ran their demo script in a loop and produced crashes).

It seems they really don’t understand/value the developer flywheel.


There's a recent interview with Lisa Su where she basically says she's never been interested in software because hardware is harder, she doesn't believe AMD has any problems in the software department anyway and AMD is doing great in AI. So make of that what you will. Suffice it to say, clearly the AMD board doesn't care either because otherwise they'd replace her.



In the late 90's, US manufacturers, including high-tech electronics, had 2 mantras:

1) Cash is king

2) Inventory is evil

I think this mindset may still be here, in 2024


Sadly common at hardware companies. The most extreme case I've heard of is ASML, who supposedly doesn't keep any machines of their own. They test against "almost-ready" machines right before they go out the door to customers.


ASML might be an extreme outlier though, don't those things cost like $50 million+ each?


That's for last-gen process nodes, and from a second- or third-hand supplier, if you could even find one. ASML makes very few fully working machines each year, and the cost and throughput of those machines are astronomical.

They have spare parts, you'd bet, and I'd bet they have an SLA with each customer where an engineer is basically on call nearby in case a single thing doesn't work or a random part breaks or needs servicing.

Asianometry did a great video on the cost of downtime when an ASML machine in a fab goes down. While I am not directly in this field and can't speak to the accuracy of the numbers John gives, he does not seem like one to just make things up; his video production quality on niche topics is quite good.


Almost a decade ago KFAB had a fire, power was cut, and everything in process was dumped. They planned to restart, but it ended up being cheaper to close the whole facility.

Probably for the best though, KFAB had been discharging several tons of solvents, cleaning agents, and reagents per year into the surrounding area [for as long as it ran](https://enviro.epa.gov/facts/tri/ef-facilities/#/Release/640...)



try 400 mil


Put a 1 or 2 in front of that


You are comparing a machine the size of a container with GPUs?

It's nice to be aware of this, but it's so vastly different that, from a criticism point of view, I don't think it matters.


It's not a criticism, it's an extreme example from a company people know that I have no particular NDA restrictions with.


I found it interesting ¯\_(ツ)_/¯


Actually, they shipped not-ready machines to customers, in the hope that they could find fixes for the machines later.


That’s why they cooperate closely with imec and their FAB in Leuven


Coming up next: "We bought AMD stock on the open market and used it to compensate AMD engineers".


You joke, but it is almost a genuine investment opportunity here for a large player.

Spend a billion on AMD shares, spend another billion on an out-of-house software team to fix the software problem, and more than double the share price.

Taking into account that there are players that already own billions in AMD shares, they could probably do that as well. On the other hand perhaps it would be better for them, as major shareholders, to have a word with AMD management.


I don't have the inside baseball but I have seen those weird as hell interviews with Lisa Su where she gets asked point blank about the software problems and instead of "working on it, stay tuned" -- an answer that costs nothing to give -- she deflects into "performance is what matters," which is the kind of denial that rhymes exactly with the problems they are having. No, the horsepower of your F1 racecar doesn't matter if the engine doesn't start and there's a wheel missing! You need to fix those problems before the horsepower can matter! Please tell me you are fixing the starter and the wheel!

Hopefully I am reading too much into this. Hopefully she doesn't have any weird hangups over investing in software and it all just takes time to Do It Right after GPGPU got starved in the AMD winter. But if it is a weird hangup then yeah, 100%, ownership needs to get management in line because whiffing a matmul benchmark years into a world where matmul is worth trillions just ain't it.


> she deflects into "performance is what matters," which is the kind of denial that rhymes exactly with the problems they are having.

It's not a deflection, but a straightforward description of AMD's current top-down market strategy of partnering with big players instead of doubling down on a great OOBE for consumers and others who don't order GPUs by the pallet. It's an honest reflection of their current core competencies, and of the opportunity presented by Nvidia's margins.

They are going for bang-for-buck right now, aiming at data center workloads, and the hyperscalers care a lot more about perf/$ than raw performance. Hyperscalers are also more self-sufficient at software: they have entire teams working on PyTorch, Jax, and writing kernels.


Engineers at hyperscalers are struggling through all the bugs too. It's coming at a notable opportunity cost for them, at a time when they also want an end to the monopoly. Do they buy AMD and wade through bug after bug, regression after regression, or do they shell out slightly more money for Nvidia GPUs and have it "just work"?

AMD has to get on top of their software quality issues if they're ever going to succeed in this segment, or they need to be producing chips so much faster than Nvidia that it's worth the extra time investment and pain.


> Engineers at hyperscalers are struggling through all the bugs too

[citation needed]


It's in the article. Meta don't use AMD for training and write their own kernels for inference. You can't train with AMD, full stop, because their software stack is so buggy.


> It's in the article

The same article also states that AMD provided custom bug-fixes written by Principal Engineers to address bugs in a benchmark - software that will only become part of the public release in 2 quarters. I ask again, do you think AMD will not expedite non-public bug-fixes for hyperscalers?

> You can't train with AMD, full stop, because their software stack is so buggy.

Point 7 from the article:

>> The MI300X has a lower total cost of ownership (TCO) compared to the H100/H200, but training performance per TCO is worse on the MI300X on public stable releases of AMD software. This changes if one uses custom development builds of AMD software.

I have an inkling that Meta does not obtain MI300 drivers from https://download.amd.com


AI labs don't want to train models using a stack some guy hacked up on his desktop last night that's been through no proper QA process. The cost of a training run that fails or produces a garbage model due to numerical errors is huge.

Which is why, as they say clearly, nobody is training models on AMD. Only inference, at most. I'm not sure why you keep claiming they are training using private drivers. They clearly aren't.


Now I see how we're talking past each other. I 100% agree that none of the hyperscalers are currently (publicly) training on AMD silicon. I disagree with forward-looking statements like it "can't" happen, because I can guarantee you several of them are actively working on making it possible to train on AMD chips - that's just too juicy a target for bonus packages all the way up to directors: "Our team saved the org $x0 million in TCO by enabling training on MI300/MI400X in our new clusters"


Oh, I see. "Can't" was meant in the present tense in my previous sentence; it wasn't a definitive statement about the entire future.


Lolol 100% accurate. Go trawl through PRs to Triton by FB people to the AMD portion of the codebase.


Sorry, but NDAs mean I can't say any more.


That's the excuse used by every big company shitting out software so broken that it needs intensive professional babysitting.

I've been on both sides of this shitshow, I've even said those lines before! But I've also been in the trenches making the broken shit work and I know that it's fundamentally an excuse. There's a reason why people pay 80% margin to Nvidia and there's a reason why AMD is worth less than the rounding error when people call NVDA a 3 trillion dollar company.

It's not because people can't read a spec sheet, it's because people want their expensive engineers training models not changing diapers on incontinent equipment.

I hope AMD pulls through but denial is _not_ the move.


What exactly are they in denial about? They are aware that software is not a strength of theirs, so they partner with those who are great at it.

Would you say AMD is "shitting the bed" by not building its own consoles too? You know AMD could build a kick-ass console, since they are doing the heavy lifting for the PlayStation and the Xbox[1], but AMD knows as well as anybody that they don't have the skills to wrangle studio relationships or figure out which games to finance. Instead, they lean hard on their HW skills and let Sony Entertainment/the Xbox division do what they do best.

1.and the Steam Deck, plus half a dozen Deck clones.


There is probably one employee - either a direct report of Su's or maybe one of her grandchildren in the org chart - who needs to "get it". If they replaced that one manager with someone who sees graphics cards as a tool to accelerate linear algebra then AMD would be participating more effectively in a multi-trillion dollar market. They are so breathtakingly close to the minimum standards of competence on this one. We know from the specs that the cards they produce should be able to perform.

This is a case-specific example of failure, it doesn't generalise very well to other markets. AMD is really well positioned for this very specific opportunity of historic proportions and the only thing holding them back is a somewhat continuous stream of unforced failures when writing a high quality compute driver. It seems to be pretty close to one single team of people holding the company back although organisational issues tend to stem from a level or two higher than the team. This could be the most visible case of value destruction by a public company we'll see in our lifetimes.

Optimistically speaking maybe they've already found and sacked the individual responsible and we're just waiting for improvement. I'm buying Nvidia until that proves to be so.


It would go nowhere; games history is full of great hardware that died because it lacked a profitable ecosystem.

Even the Steam Deck is only a success because it depends on the Windows ecosystem; the moment Microsoft decides enough is enough, let's see how long it holds.


Steam Decks run on Arch Linux


As means to avoid paying for Windows licenses.

All the games that matter are Windows games running via Proton, as Valve has failed to actually build a GNU/Linux-native games ecosystem; in spite of the UNIX/POSIX underpinnings of the Android NDK and PlayStation, the studios hardly bother.

The day Microsoft actually decides to challenge Proton, or does a netbooks move on handhelds with Xbox OS/Windows, the Steam Deck will lose, just like the netbooks did.

Additionally, it is anyone's guess what will happen to Valve when Gabe steps down.


As a means of control


If they cared about control, they wouldn't depend on Windows ecosystem, rather foster GNU/Linux native games.

Everyone is quite curious what Microsoft will drop at CES 2025, and which OEMs will be on their side, it is going to be netbooks all over again.


They are literally fostering Linux games by selling and endorsing a platform where those games would be native as well as having native releases of their own games. They aren't gonna force any third-party devs to do the same, but they're showing that there is a market while also growing it.


The ones buying GPUs by the pallet _really_ need the software to not be a worry. Software issues would dog the entire operational life of the hardware.


>Hyperscalers are also more self-sufficient at software: they have entire teams working on PyTorch, Jax, and writing kernels.

None of this matters because AMD drivers are broken. No one is asking AMD to write a PyTorch backend. The idea that AMD will have twice the silicon performance than nvidia to make up the performance loss for bad software is a pipedream.


> None of this matters because AMD drivers are broken

Do you honestly think the MI300 has show-stopper driver bugs, or that Meta/Amazon doesn't have a direct line to AMD engineers?


>Do you honestly think the MI300 has show-stopper driver bugs

Yes

>Meta/Amazon doesn't have a direct line to AMD engineers?

I don't even think AMD engineers have a direct line to AMD.


The fine article says they had the ears of VPs and Principal Engineers, just to help with a benchmark. Meta/Amazon will get white-glove service


> None of this matters because AMD drivers are broken

How do you know that the problems arise from broken drivers rather than broken hardware? Real world GPU drivers are full of workarounds for hardware bugs.


It does seem like a good idea. If the obvious major Nvidia customers see it as too risky (or we could speculate about other possible reasons why it has not happened yet), maybe some hedge fund who is struggling with where to put their cash could initiate and fund the project.


I enjoy this train of thought a lot: capturing shareholder value by just creating it yourself. The destruction of a moat, I fear, is worth less than the existence of a moat a competitor has successfully built. So don't forget to buy puts on Nvidia.


This seems so insane, is anyone actually doing the work to provide an alternative to CUDA? Maybe Google?



The problem is that everyone who tries keeps missing the CUDA forest and focuses only on a specific kind of tree.


Honestly, probably NVIDIA itself, since they contribute significantly to many open-source projects (MLIR), and also make their SoTA GEMM/Conv implementations open-source and available for study (Cutlass).


> also make their SoTA GEMM/Conv implementations open-source and available for study (Cutlass)

Cutlass is a fine piece of engineering, but it is not quite as good as their closed source libraries in real world workloads. There is secret sauce that is not open sourced.


I was surprised to hear recently that the same happens at NVIDIA! Hopefully less frequently, but I can understand why it's hard to keep many units on hand given the level of external demand.


> AMD is attempting to vertically integrate next year with their upcoming Pollara 400G NIC, which supports Ultra Ethernet, hopefully making AMD competitive with Nvidia.

Infiniband is an industry standard. It is weird to see the industry invent yet another standard to do effectively the same thing just because Nvidia is using it. This “Nvidia does things this way so let’s do it differently” mentality is hurting AMD:

  * Nvidia has a unified architecture so let’s split ours into RDNA and CDNA.
  * Nvidia has a unified driver, so let’s make a different driver for every platform.
  * Nvidia made a virtual ISA (PTX) for backward compatibility. Let’s avoid that.
  * Nvidia is implementing tensor cores. Let’s avoid those on RDNA. Then implement them on CDNA and call them matrix cores.
  * Nvidia is using Infiniband like the rest of the HPC community. Let’s use Ethernet.
I am sure people can find more examples. Also, they seem to have realized their mistake in splitting their architecture into RDNA and CDNA, since they are introducing UDNA in the future to unify them like Nvidia does.


You're painting this like AMD is off to play in their own sandbox when it's more like the entire industry is trying to develop an alternative to Infiniband.

Ultra Ethernet is a joint project between dozens of companies organized under the Linux Foundation.

https://www.phoronix.com/news/Ultra-Ethernet-Consortium

>> The Linux Foundation has established the Ultra Ethernet Consortium "UED" as an industry-wide effort founded by AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft for designing a new Ethernet-based communication stack architecture for high performance networking.

You probably can't call it "industry standard" yet but the goal is obviously for it to become one.


I wrote:

> It is weird to see the industry invent yet another standard to do effectively the same thing just because Nvidia is using it.

This is a misstep for all involved, AMD included. Even if AMD is following everyone else by jumping off a bridge, AMD is still jumping too.


Infiniband is a monopoly from NVidia (Mellanox). Everyone else would much rather use Ethernet which is the actual industry standard.


Depends on deployment: Ethernet has more drawbacks encapsulating PCIe packet traffic than Infiniband does, or lack thereof with RDMA.


Others can build infiniband hardware according to the standard. There used to be at least two companies building infiniband hardware until Intel killed QLogic’s infiniband division in a misguided attempt to make its own monopoly. :/


Or they can develop a RoCEv2 that works (which is basically what UltraEthernet is).

Infiniband is not fun - it's a special snowflake of an interconnect that sits parallel to the rest of your datacenter network, and can not really run a standard TCP/IP codebase (yeah IPoIB is a thing but still). Do the Nvidia boxes really need a scale-out IP network as well as an Infiniband network?

Plus the spec is old. Packet spraying and trimming, better ordering guarantees, queue pair scalability... a whole bunch of enhancements have been incorporated into UE all the while being compatible with regular Ethernet.

Qlogic was never really an Infiniband vendor --- their qib driver is still in the Linux codebase and essentially emulates verbs on top of a messaging-based design.


We must agree to disagree then. Infiniband is an awesome technology. It was originally meant to serve as a fabric connecting all components in a computer. That plan died with the dotcom bubble, as it was too ambitious and therefore risky, so it was relegated to networking, where it did an incredibly good job of delivering messages reliably and fast. I am still hopeful someone will do it.


> Infiniband is an industry standard

Infiniband is not an industry standard lol.

Maybe it used to be, but it definitely is not anymore. Most Infiniband vendors are dead. The only product from those days that endures is Cornelis' Omnipath, and even that only emulated the Infiniband API back with its first gen, and then evolved to be its own thing.

At this point, Infiniband is as good as a proprietary interconnect only sold by Nvidia/Mellanox.


Infiniband is very much an industry standard:

https://www.infinibandta.org/member-listing/

As far as I know, everyone is free to sign up with the infiniband trade association and implement the specification.

Furthermore, if you use RDMA over Ethernet, you are using infiniband at a low level. RoCE which enables it was originally called Ethernet over infiniband. It is maintained by the infiniband trade association.

Omnipath was Intel’s failed effort to try to kill an open standard. It purchased QLogic’s infiniband business, killed it in favor of omnipath and sold it when it failed.


> Furthermore, if you use RDMA over Ethernet, you are using infiniband at a low level. RoCE which enables it was originally called Ethernet over infiniband.

This is wrong.

RDMA over Ethernet is... RDMA over Ethernet. There is no Infiniband involved.

RoCE was motivated by supporting RDMA, which was then an IB-only feature, over regular Ethernet. The user-level APIs are the same (verbs), but the underlying architecture is all different --- it is traditional Ethernet with link-level flow control to make it lossless (pause frames).


This is not what I have been told by others.


Tip: copy-paste our entire exchange as a Gemini prompt and ask "who's correct?".

(I did that yesterday and was very satisfied :) )


You are not the first person to say “the AI assistant liked what I said over what you said” to me on the Internet, to which I say they will tell you whatever you want to hear.

I am not interested in continuing this discussion, but if you want to do your own research, I suggest starting with the fact that the Infiniband Trade Association controls the RoCE specification. I suggest you avoid using LLMs, for obvious reasons.


Nvidia has a unique problem, wants to move fast, and has a shitload of money.

There is no need for Nvidia to go to an industry standard first, and none for AMD either.

Personally, it would be great if it got backported, but it's so far from a normal use case.


Nvidia is the one who went to an industry standard before their competitors in this space. It was created in 1999 and is called infiniband:

https://en.wikipedia.org/wiki/InfiniBand

Infiniband is extremely popular in the HPC space, which is why Nvidia adopted it. Everyone else saw Nvidia adopt it and said "Let us make a new network standard to be incompatible". This is mind boggling.

Even more mind boggling is that many of the companies in the Ultra Ethernet Consortium are members of the Infiniband Trade Association, AMD included:

https://www.infinibandta.org/member-listing/

This would be like the automotive industry forming a consortium to invent new incompatible wheels to exclude a successful upstart that adopted their existing standard wheel designs. With trillions of dollars in revenue on the line, you would think that companies would use existing networking standards to focus on building competitive hardware with reduced time to market, yet they are instead reinventing networking standards just because they can. This is a huge gift to Nvidia, since it means that everyone else is wasting time and money instead of being competitive.


Sorry to hound you for the third time, but this is wrong:

> Infiniband is extremely popular in the HPC space

Not anymore. There used to be Cray Aries/GNI, psm/psm2, and now there's Slingshot, the new Cornelis stuff etc. There's almost no Infiniband now.


The top500 says otherwise:

https://www.infinibandta.org/infiniband-and-roce-advances-fu...

Where are you getting your information?


I concede that the specifics of what I said were wrong, but the larger point was not.

If you buy a single DGX H100 rack and run LINPACK, you automatically get TOP500-grade numbers. Infiniband is a solid product, if not the best commercial offering for AI/ML, but no one buys it for an HPC cluster separately from the DGX boxes.


#26 on the list uses AMD GPUs with infiniband:

https://www.top500.org/system/180171/

You can likely find more. Infiniband has been excellent for HPC since the 2000s. That includes all HPC workloads, not just AI/ML.

Excuse me if I do not believe your claims concerning infiniband. They contradict not only actual data, but also what I have heard from people I consider experts.

Also, you did not answer my question concerning the origin of your information. I notice from another comment of yours that you have been talking to an LLM about this conversation. Have you been posting things that an LLM tells you?


You hardly beat someone by copying them. They have way more experience in the field you are trying to catch up in.


You don’t beat someone by doing everything worse either.


AMD doesn't need to beat Nvidia, they just need to match them at a lower price point.


In business, that combination is nearly impossible to distinguish from beating them.


That‘s beating them at the price point.


I made the mistake of clicking on one of the links to commits they mentioned, only to end up at an MR changing multiple autogenerated YAML files with 10k-line diffs and incomprehensible names. I guess this is where the whole "bad talent" thing comes in - a year later you are thousands of YAML files deep, but still no one can run a simple PyTorch compiled op and get the performance you were sold. Absolutely unhinged.


That MatMul performance is fairly shocking - to be that much below theoretical maximum on what should be a fairly low-overhead operation.

I would at least hope that they know where the speed is going, but the issue of torch.matmul and F.Linear using different libraries with different performance suggests that they don't even know which code they are running, let alone where the slow bits in that code are.


Low overhead in what sense? matmul is kinda complicated and there are varying, complex state-of-the-art algorithms for it, no? And then if you know things about the matrices in advance you can start optimizing for that, which adds another layer of complexity.


Yes and no. Conceptually it's just three nested loops. The fiddly part is unrolling the inner loop and swizzling the data layouts in such a way that the cores can be kept "fed" efficiently. This usually means breaking things up into cache-sized chunks along some axis.

It's easy enough that there's blog articles showing single developers getting within spitting distance of NVIDIA's highly optimised code. As in, 80-something-percent of the best available algorithms!

All NVIDIA did was "put the effort in", where the effort isn't some super clever algorithm implemented by a unique genius, but they simply made hundreds of variants of the matmul algorithm optimised for various scenarios. It's a kind of algorithmic brute force for eking out every last percentage point for every shape and size of input matrices on every GPU model and even for various SLI configurations.

From what I've seen, AMD has done... none of this.
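The "cache-sized chunks" idea above can be sketched in plain Python (a toy illustration of loop tiling; real kernels do the same restructuring with vector registers and GPU shared memory, and the function names here are my own):

```python
def matmul_naive(A, B):
    """Textbook triple loop: C[i][j] = sum_k A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_tiled(A, B, tile=32):
    """Same arithmetic, but iterated over cache-sized blocks so each
    tile of A and B is reused many times while it is still "hot"."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a = A[i][k]      # loaded once, reused across j
                        row_b = B[k]
                        row_c = C[i]
                        for j in range(j0, min(j0 + tile, p)):
                            row_c[j] += a * row_b[j]
    return C
```

Both produce identical results; the whole game is choosing `tile` (and the loop order) per cache level, matrix shape, and data type, which is exactly the combinatorial space NVIDIA's libraries cover with hundreds of pre-tuned variants.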


> From what I've seen, AMD has done... none of this.

There are a number of pull requests to rocBLAS tuning various sizes of GEMV and GEMM operations. For example: https://github.com/ROCm/rocBLAS/pull/1532


Merged two days ago!?

That’s about half a decade after they should have done this foundational work!

I guess it’s better late than never, but in this case a timely implementation was worth about a trillion dollars… maybe two.


There are likely plenty of unrealized opportunities to improve mature BLAS libraries. For example, this guy was able to outperform OpenBLAS' GEMM on Zen 4:

https://salykova.github.io/matmul-cpu

Coincidentally, the Intel MKL also outperforms OpenBLAS, so it is well known that there is room for improvement. That said, I have a GEMV implementation that outperforms both the Intel MKL and OpenBLAS in my tests on Zen 3:

https://github.com/ryao/llama3.c/blob/master/run.c#L429

That is unless you shoehorn GEMV into the Intel MKL's batched GEMM function, which then outperforms my implementation when there is locality. Of course, when there is no locality, my code runs faster.

I suspect if/when this reaches the established amd64 BLAS implementations' authors, they will adopt my trick to get their non-batched GEMV implementations to run fast too. In particular, I am calculating the dot products for 8 rows in parallel followed by 8 parallel horizontal additions. I have not seen the 8 parallel horizontal addition technique mentioned anywhere, so I might be the first to have done it.
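A rough sketch of that loop structure in pure Python (illustrative only - in the real version each accumulator would be an AVX register and the final stage would be the batched horizontal adds; the function name is mine):

```python
def gemv_blocked(A, x, block=8):
    """Compute y = A @ x, advancing the dot products for `block` rows
    in lockstep instead of finishing one row at a time."""
    n, m = len(A), len(x)
    y = [0.0] * n
    for i0 in range(0, n, block):
        rows = range(i0, min(i0 + block, n))
        acc = [0.0] * len(rows)            # one accumulator per row
        for k in range(m):
            xk = x[k]                      # x[k] is loaded once per block
            for r, i in enumerate(rows):
                acc[r] += A[i][k] * xk     # 8 dot products in parallel
        for r, i in enumerate(rows):
            y[i] = acc[r]                  # the "horizontal add" stage
    return y
```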


How do the cache sizes compare between AMD GPU’s and nvidia? I remember reading a while ago they were quite different (enough to make flash attention painful to implement)


There are, but everyone uses variations of the same O(n^3) algorithm taught in introductory college linear algebra classes, because it is numerically stable and can be made extremely fast through tweaks that give spatial locality and good cache characteristics. Meanwhile, the asymptotically faster algorithms have such large constants in their big-O notation that they are not worth using. FFT-based matrix multiplication, which is O((n^2)log(n)), also has numerical instability on top of running slower.
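For concreteness, here is the asymptotically faster route at its smallest scale - one level of Strassen's scheme on a 2x2 matrix (a sketch; the real algorithm recurses on matrix blocks). Seven multiplications instead of eight, paid for with many extra additions and worse locality, which is exactly the constant-factor cost that keeps it out of production kernels:

```python
def strassen_2x2(A, B):
    """One level of Strassen's recursion: 7 multiplies instead of 8,
    at the cost of 18 additions/subtractions (vs 4 for the naive way)."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```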


> matrix multiplication, which is O((n^2)log(n))

Isn't the fastest theoretical algorithm something like O(n^2.37) ?


Yes, but it's impractical unless you have galactic-scale matrices to multiply (at least).


> FFT based matrix multiplication, which is O((n^2)log(n))

What?


https://en.wikipedia.org/wiki/Schönhage–Strassen_algorithm

I forgot the log(log(n)) factor.

In any case, for matrix multiplications that people actually do, this algorithm runs slower than a well optimized O(n^3) matrix multiplication implementation because the constant factor in the Big O notation is orders of magnitude larger.


Schönhage-Strassen is not about matrix multiplication.


FFT is fast Fourier transform, and our best theoretical bounds on multiplication come from methods involving FFT.


For matrix multiplication? How?


By overhead I'm talking about the things that have to be done supplementary to the algorithm.

While there are complex state-of-the-art algorithms, those algorithms exist for everyone. The overhead is the bit that had to be done to make the algorithm work.

For instance for sorting a list of strings the algorithm might be quick sort. The overhead would be in the efficiency of your string compare.

For matmul I'm not sure what your overhead is beyond moving memory, multiplying, and adding. A platform touting a memory bandwidth and raw compute advantage should have that covered. Where is the performance being lost?

I guess the only real options are stalls, unnecessary copies, or unnecessary computations.


> For matmul I'm not sure what your overhead is beyond moving memory, multiplying, and adding. A platform touting a memory bandwidth and raw compute advantage should have that covered. Where is the performance being lost?

The use of the word 'algorithm' is incorrect.

Look... I do this sort of work for a living. There has been no useful significant change to matmul algorithms.

What has changed is the matmul process.

Modern perf optimization on GPUs has little to do with algorithms and everything to do with process optimization. This is akin to factory floor planning and such. You have to make sure the data is there when the processing units need it, and the data is coming in at the fastest rate possible, while keeping everything synchronized to avoid wrong results or deadlocks.

Really compute power has nothing to do with it. It's a waste of time to even consider it. We can compute matmuls much faster than you can naively bring memory to the processing units. Whoever solves that problem will become very rich.

To that end, NVIDIA ships libraries that will choose from a wide variety of implementations the appropriate trade-offs necessary for SoTA perf on matmuls of all shapes and data types.


To be fair, GEMV is memory-bandwidth bound, and that is what token generation in transformers uses. GEMM is the compute-bound one, provided you do not shoehorn GEMV into it; that special case is memory-bandwidth bound.
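The memory-bound vs compute-bound split falls out of a quick arithmetic-intensity (FLOPs per byte) estimate - a back-of-the-envelope sketch, assuming fp16 and counting only matrix traffic; the function names are mine:

```python
def arithmetic_intensity_gemm(n, bytes_per_elem=2):
    """n x n GEMM: 2n^3 FLOPs over ~3n^2 elements moved
    (read A, read B, write C). Intensity grows like n/3."""
    flops = 2 * n**3
    traffic = 3 * n**2 * bytes_per_elem
    return flops / traffic

def arithmetic_intensity_gemv(n, bytes_per_elem=2):
    """n x n GEMV: 2n^2 FLOPs, dominated by streaming the n^2
    matrix elements exactly once. Intensity stays ~constant."""
    flops = 2 * n**2
    traffic = (n**2 + 2 * n) * bytes_per_elem
    return flops / traffic
```

For n = 8192 in fp16, GEMM comes out around 2700 FLOPs/byte (compute bound on any real accelerator) while GEMV stays under 1 FLOP/byte, so GEMV performance is pinned to memory bandwidth no matter how many matrix cores you have.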


GEMM isn't compute bound in ML in practice. If you do naive GEMM based attention, then you will have to write the output matrix into HBM and in the worst case you might even have to reload the output from HBM!

So what is done in practice is an algorithm that doesn't calculate bit-identical results, but is imperceptibly close to classic attention: flash attention. Flash attention lets you fuse the kernel so that you can multiply against the V matrix and write only the condensed output to HBM. As an additional benefit, you go from quadratic memory usage to linear memory usage.

But here is the problem: your SRAM is limited, and even flash attention is still O(n^2) in compute. If you tile your K and V caches into j and k tiles, you will have to load j*k tiles from memory. Meanwhile, compute tends to consume very little silicon area, so you end up with excessive compute relative to your SRAM. In the compute > SRAM regime, doubling SRAM size also doubles performance. You're memory bound again for super long contexts.

Now let's assume the opposite: your compute resources are undersized relative to your SRAM, i.e. you have plenty of SRAM but not enough compute, e.g. a CPU. You will be compute bound up to a linear factor of your SRAM, but always memory bound against main memory. Add matrix cores to the CPU and the problem would disappear into thin air.
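For reference, the tiled online-softmax trick described above fits in a few lines of NumPy. This is a sketch of the single-head, non-causal case only: it produces the same result as naive attention while only ever materializing one K/V tile's worth of scores.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Classic attention: materializes the full n x n score matrix."""
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def flash_attention(Q, K, V, tile=16):
    """Tiled attention with an online softmax: only one K/V tile of scores
    exists at a time, and the output is accumulated tile by tile."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))            # running (unnormalized) output
    m = np.full(n, -np.inf)         # running row-wise max of the scores
    l = np.zeros(n)                 # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)               # rescale the old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

In floating point the tiled version matches the naive one to numerical precision; the j*k tile loads mentioned above show up here as the repeated `K[j:j+tile]`/`V[j:j+tile]` slices, once per query tile in a full 2D tiling.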


I have been working on inference software for my own use, with both CPU and GPU versions:

https://github.com/ryao/llama3.c

The only thing you wrote that makes any sense to me is “Flash attention lets you fuse the kernel”. Everything else you wrote makes no sense to me. For what it is worth, flash attention does not apply to llama 3 inference as far as I can tell.


Which algorithm you pick for what shape of matrices is different and not straightforward to figure out. AMD currently wants you to “tune” ops and likely search for the right algorithm for your shapes while Nvidia has accurate heuristics for picking the right algorithm.


Nvidia's heuristics are not accurate, and it's not possible to achieve peak performance without search.


Low overhead in the sense that matrix multiplication is almost the only algorithm that is able to reach computational throughput values very close to the theoretical maximum for a given hardware.

Good CPUs and GPUs have a throughput in Flop/s for matrix multiplication that is between 60% and 90% of the maximum possible throughput, with many (especially the CPUs) reaching values towards the high end of that range.

As shown in the article, the AMD GPUs attain only slightly less than 50% (for BF16; for FP8 the AMD efficiency is even less than 40%).

Such a low efficiency for the most important operation is not acceptable.
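For anyone wanting to reproduce that kind of number: the efficiency quoted is just measured throughput over spec-sheet peak. The figures below are illustrative assumptions, not the article's exact measurements.

```python
# Efficiency = measured throughput / theoretical peak.
peak_bf16_tflops = 1307.0   # assumed spec-sheet BF16 peak for an MI300X-class GPU
measured_tflops = 620.0     # assumed achieved BF16 GEMM throughput
efficiency = measured_tflops / peak_bf16_tflops   # ~0.47, "slightly less than 50%"
```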


Matmul is trivial to get right, especially since you won't be calculating dot products manually to begin with. You're going to use the tensor cores or equivalent, which already perform almost the entire matrix multiplication for you. Your primary goal in developing a custom matmul kernel is in adjusting the algorithm to the specific hardware by knowing how many tiles you can store in your local registers and SRAM and how to simultaneously intertwine loading new data from HBM and performing the calculations.
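Minus the tensor cores and async copies, the tiling structure being described looks like this in NumPy. A sketch only; the `tile` parameter stands in for whatever the registers/SRAM of the target hardware can actually hold:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked GEMM: each (i, j) output tile stays resident (think registers
    or SRAM) while tiles of A and B stream past it from slow memory."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.empty((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=A.dtype)
            for p in range(0, k, tile):          # stream the K dimension
                acc += A[i:i + tile, p:p + tile] @ B[p:p + tile, j:j + tile]
            C[i:i + tile, j:j + tile] = acc
    return C
```

On real hardware the inner `@` is one tensor-core instruction and the slicing is the part you hand-tune; the point stands that the "algorithm" is unchanged, only the data movement is.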


> It’s not just that it’s immature software, they need to change how they do development.

I remember geohot saying something similar about a year ago


I expect everyone has been saying it for a while, the calls are just getting more strident and public as it becomes clear that AMD's failures are strategic rather than tactical. And as people try to build business on their half-hearted attempts.

I still think it is a mistake to say that CUDA is a moat. IMO the problem here is that AMD still doesn't seem to think that GPGPU compute is a thing. They don't seem to understand the idea that someone might want to use their graphics cards to multiply matrices independently of a graphics pipeline. All the features CUDA supports are irrelevant compared to the fact that AMD can't handle GEMM performantly out of the box. In my experience it just can't do it; back in the day my attempts to multiply matrices would crash drivers. That isn't a moat, but it certainly is something spectacular.

If they could manage an engineering process that delivered good GEMM performance then the other stuff can probably get handled. But without it there really is a question of what these cards are for.


I wonder to what extent vulkan compute could be used for this. Of course, it is only an option on their RDNA GPUs since CDNA is not for graphics, even though that is the G in GPU.


There has been some testing within llama.cpp, which supports both Vulkan and ROCM-Blas. When it works, the latter is about 2x faster than the Vulkan version.


Unless it provides the polyglot capabilities of CUDA, and related IDE and graphical debugging capabilities, not really.


Yeah, 80% margins on matrix multiplication should be a puddle not a moat but AMD is more scared of water than the witch that melts in Wizard of Oz so I guess the puddle is a moat after all.


Anyone who looks at the mess that is ROCm and the design choices they made could easily see that.

GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

To that you can add the long history of both AMD and ATI before they merged releasing dog shit software and then dropping support for it.

On the other hand you can take any CUDA binary even one that dates back to the original Tesla and run it on any modern NVIDIA GPU.


> GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

This is likely self-inflicted. They decided to make two different architectures: CDNA for HPC and RDNA for graphics. They are reportedly going to rectify this with UDNA in the future, which is what they really should have done from the start. Nvidia builds one architecture with different chips based on it to accommodate everything, and code written for one easily works on another as it is the same architecture. This is before even considering that they have PTX as an intermediate language, which serves a similar purpose to Java bytecode in allowing write once, run anywhere.


This was happening before CDNA was even a thing.

They didn’t release support even for all GPUs from the same generation and dropped support for GPUs sometime within 6 months of releasing a version that actually “worked”.

The entire core architecture behind ROCM is rotten.

P.S. NVIDIA usually has multiple CUDA feature levels even within a generation. The difference is that a) they always provide a fallback option, usually requiring no manual intervention, and b) as long as you define the minimum target feature level when you build the binary, you are guaranteed to run on all past hardware supported by that feature level and on all future hardware.


The differences between CUDA feature levels appear minor according to the PTX documentation:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

They also appear to be cumulative.


It doesn’t matter; the point is that they don’t break stuff. You can still compile CUDA today to work on old hardware and your binaries are guaranteed forward compatibility.

You don’t get that with ROCm, and this is why it’s garbage unless someone else abstracts all of that from you.

So if Microsoft is happy to maintain an ML as a service solution that just takes prompts and maybe data it’s not your problem.

But if you need to run your own workloads and these can include workloads that are well outside of “AI” and might not be even possible or remotely profitable to have a SAAS wrapper around them it’s all on you.


> On the other hand you can take any CUDA binary even one that dates back to the original Tesla and run it on any modern NVIDIA GPU

This particular difference stems from the fact that NVIDIA has PTX and AMD does not have any such thing. I.e., this kind of backwards compatibility will never be possible on AMD.


Backward compatibility is one thing but not having a forward compatibility is a killer.

Having to create a binary that targets a very specific set of hardware, with no guarantees (in fact, a guarantee that it won’t work on future hardware), is what makes ROCm unusable for anything you intend to ship.

What’s worse is that they also drop support for their GPUs faster than Leo drops support for his girlfriends once they reach 25…

So not only that you have to recompile there is no guarantee that your code would work with future versions of ROCM or that future versions of ROCM could still produce binaries which are compatible with your older hardware.

Like how is this not the first design goal to address when you are building a CUDA competitor I don’t fucking know.


> Like how is this not the first design goal to address when you are building a CUDA competitor I don’t fucking know.

The words "tech debt" do not have any meaning at AMD. No one understands why this is a problem.


Backwards compatibility, and polyglot ecosystem, thanks to the amount of compiler toolchains that support PTX.


"The software needs to be better" is (and was) an easy call to make for anyone paying attention. The problem is that "AMD just needs to do better" is not and will never be an implementable strategy. Engineering isn't just about money. It's also about the process of exploring all the edge cases.

"We recommend that AMD to fix their GEMM libraries’ heuristic model such that it picks the correct algorithm out of the box instead of wasting the end user’s time doing tuning on their end." Is such a profoundly unhelpful thing to say unless you imagine AMDs engineers just sitting around wondering what to do all day.

AMD needs to make their drivers better, and they have. Shit just takes time.


Sounds more like they were (and still are) being sloppy. “be better” is one thing. “runs without fatal crash” is what semi is talking about.


In buggy numerical code many bugs go through the software stack without any problems. No crash, no errors. For example, you might switch two double parameters to a function, and if their value ranges are similar, everything works fine except it's all bullshit.

If there are bugs in AMD code that prevent running tests, I bet there are even more bugs that don't manifest until you look at results.


I once tried installing AMD ROCM to run a small llm on a consumer-grade AMD GPU. It was the most horrible software install experience I ever had. Never did manage to get it working.


What I couldn't find is inference benchmarks for consumer hardware. Just pick a reasonable workload with llama.cpp or ollama and show us some numbers.

I'm particularly interested in building a Home Assistant machine that can run the voice assistant locally (STT/TTS/LLM) while using the least amount of power / generating the least amount of heat and noise.


AMD's software for consumer GPUs demonstrates a lack of seriousness. ROCm only officially supports RDNA2 and RDNA3 GPUs (their last two generations of hardware), and for some reason most of them are supported on only Windows (https://rocm.docs.amd.com/projects/install-on-windows/en/lat...) and not Linux (https://rocm.docs.amd.com/projects/install-on-linux/en/lates...), where most AI training and inference occurs. In particular, Linux users can only start playing with ROCm with a top-of-the-line, power-guzzling unit whereas they can get started with CUDA using basically any Nvidia GPU on desktops or laptops.


In practice, consumer Navi 21 based cards (RX 6900XT etc) and Navi 31 cards (RX 7900 XTX etc) are compatible with Pytorch on Linux.

What they write about ROCm and Windows is equivocation. They target only one app: Blender. Pytorch+ROCm+Windows does not work.

I had bought a 6900XT myself around launch time (the RTX3080 I ordered was not coming, it was the chip shortage times...) and it took around 2 years for Pytorch to become actually usable on it.


in practice everyone who wants to do ML at home buys nvidia and pays the premium


Sad but true. Years ago, pre2018 nvidia was the goto hardware supplier if you were doing anything with neural networks.

I remember CUDA being much more buggy back then but it still worked pretty good.

Back then AMD wasn't considered a real competition for ML/AI hardware.

Glad as always to see more competition in the market to drive innovations. AMD seems to be letting larger VRAM onto consumer cards, which is nice to see, just hope the AI/ML experience can get better for their software ecosystem.


Maybe, but it's pretty lame when you buy a new $500 CPU (7800 XT) and the docs say it's unsupported, and makes you feel like even if you reported a bug they would just say "Sorry, not supported".

Did make me wish I bought a Nvidia.


The obligatory link:

https://xkcd.com/644/

That said, I would not expect it to stay working for long as long as ROCm is a dependency since AMD drops support for its older GPUs quickly while Nvidia continues to support older GPUs with less frequent legacy driver updates.


Based on a machine we had bought at my university with 4 AMD W6800s (which are just RX 6800s with double the VRAM), it's bad _even if it works at all_.


It would be cool to see these benchmarks on the newly released Jetson Orin Nano Super, like faster-whisper.


You might just check out the Home Assistant Voice:

https://ameridroid.com/products/home-assistant-voice-preview...


Yes, that's exactly what I was checking out. You need fast enough hardware to run the speech to text, text to speech and (most importantly) LLM locally: https://www.youtube.com/watch?v=XvbVePuP7NY (he has dual 3090 GPUs but that's not a practical setup for most people - budget / power / noise).


My anecdata on AMD hiring: they just aren't moving fast enough. They still wanted to fly people out scheduling 3 weeks in advance for AI compiler work. That's just not going to work. Startups and companies like NVIDIA, OpenAI are hiring much faster with much less onerous interview processes, with higher compensation. This is not a mystery. People work for money and aren't going to hop through more hoops to be paid less.


Latest: Dylan Patel ( SemiAnalysis )

"Met with @LisaSu today for 1.5 hours as we went through everything

She acknowledged the gaps in AMD software stack

She took our specific recommendations seriously

She asked her team and us a lot of questions

Many changes are in flight already!

Excited to see improvements coming"

https://x.com/dylan522p/status/1871287937268383867


Lisa Su : https://x.com/LisaSu/status/1871362304194859511

"Thanks @dylan522p for the constructive conversation today. Feedback is a gift even when it’s critical. We have put a ton of work into customer and workload optimizations but there is lots more we can do to enable the broad ecosystem. I appreciate all the feedback and desire to engage with @AMD. We are committed to building a world-class open software stack. Lots planned for 2025. Happy holidays to all!"


AMD could spend their market cap in one year to get this done in three and it would be a coup for the shareholders. They could hire all of the best NVIDIA engineers at double their current comp, crush the next TSMC node on Apple levels, and just do it and if it got them a quarter of NVDA’s cap it would be a bargain.

They don’t fucking want to! Believing this is anything like a market is fucking religion.


You make it sound like that's a sure thing, but I doubt it. A lot of this is about processes, team structures and incentives, all those fuzzy things between the people.

Remember, most acquisitions fail. For the same reason, the likelihood of failure with your scenario seems high.

Do you really think nobody at AMD is aware of all the points made in this thread? That seems too bizarre to be true. There are probably some issues in upper management which could perhaps be fixed with some targeted hiring decisions, but do you really believe some random person on here would have a chance making that call?


We were arguing about this two years ago, maybe five. I was sharing NVIDIA dev boxes with other hackers doing CUDA in 2016.

There’s this meme that it can’t change on a dime and I believe that.

You could build this from scratch in a decade. JFK sent NASA to the moon in less time for comparable money.

If NVIDIA shareholders can’t come close? What fucking good are they? Why do our carrier battle groups guard their supply chain?


Everyone is so concerned about “losing” the “AI” “race” to the PRC.

I say ship them Altman and an exaflop and watch their society corrupt itself at a fractal nature at machine speed.

Good fucking riddance. I see your fentanyl crisis: raise you Sam and a failure to ship GPT-5. Have fun with that.


Try to make sense...

They can spend their market cap by either:

1: issuing new shares worth their market cap, diluting existing shareholders to 50%.

2: Or borrow their market cap and pay interest by decreasing profits. "AMD operating margin for the quarter ending September 30, 2024 was 5.64%" so profits would be extremely impacted by interest repayments.
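The 50% dilution in option 1 is just share arithmetic (figures below are illustrative placeholders, not AMD's actual numbers):

```python
# Option 1: issue new shares equal in value to the current market cap.
market_cap = 220e9            # illustrative figure, not AMD's actual cap
share_price = 120.0           # illustrative
old_shares = market_cap / share_price
new_shares = market_cap / share_price    # raising one full market cap at that price
ownership_after = old_shares / (old_shares + new_shares)   # 0.5
```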

Either way your suggestion would be unlikely to be supported by shareholders.

> crush the next TSMC node on Apple levels

I would guess Apple is indirectly paying for the hardware (to avoid repatriating profits) or guaranteeing usage to get to the front of the line at TSMC. Good luck AMD competing with Apple: there's a reason AMD sold GlobalFoundries and there's a reason Intel is now struggling with their foundry costs.

And it comes across as condescending to assume you know better than a successful company.


When 4 trillion dollars are at stake, the financing is available or it fucking better be.

What in God’s name do we pay these structured finance, bond-issue assholes 15% of GDP for if not to finance a sure thing like that?

It sure as hell ain’t for their taste in Charvet and Hermes ties, because the ones they pick look like shit.


I think it’s hardware not software.

“CUDA moat” is a misnomer. The PTX spec is relatively short (a 600-page PDF). Triton directly writes PTX, skipping CUDA. Flash attention was created by a non-Nvidia employee without access to any of the secret sauce within CUDA or its libraries.

The hardware is just not as good, and no software can paper over its flaws.


NVidia also refactored their hardware design to follow the C++ memory model; I think this still isn't the case for others.


So basically Nvidia is like Windows and AMD is like Wine. I think trying to emulate CUDA and using forked Nvidia libraries is not the best strategy for AMD. They should have made a clean break and come out with a fresh API, like Apple's Metal.


AMD doesn’t just have to fix these issues it has to build up a record of fixing issues like those discussed here.

Otherwise who will bet their firm / cash / career on new hardware without a successful track record.


B100, B200 ramp-up is only 4-6 months away.


Wow they really botched the title. Why wasn’t it “Still Filled With Water”?


> CUDA Moat Still Alive

Wrong conclusion. AMD is slower than NVidia, but not _that_ much slower. They are actually pretty cost-competitive.

They just need to make some improvements, and they'll be a very viable competitor.


The amount of effort this team took, literally co-opting AMD engineers, and working for 5 months, to get closer but not yet usable, means they are not even close to usable. What team wanting to do ML training/inference can afford so much down time for zero benefit? How many except a few big ones can get AMD to devote so many resources simply for that team?

And, if you’re training a model costing you millions, the last thing you need is a buggy, untested stack, breaking training or perhaps worse giving you noise that makes your models perform worse or increases training time.

By the time AMD gets usable out of the box at this point, NVidia will have moved further ahead.


Sure. But this work is done, and can be reused by others.

Meanwhile, Nvidia hardware is expensive and still is in short supply. AMD might look quite tempting.


It sure doesn’t sound done. It’s a one off hacked set of scripts tied to incompatible chunks of a ton of libraries. What happens when you want or need other parts of the PyTorch/billion libs ecosystem? You’re gonna get more AMD engineers and waste 5 months getting those to work?

Meanwhile those libs release running CUDA on NVidia’s old and newest releases out of the box.

So no, it cannot be reused by others in production any more than my custom hacked car engine mod can be added by Ford to every car in existence.

Have you done any deep professional production work on any of these stacks? I have, and would never, ever put stuff like the stuff in the article in production. It’s no where near ready for production use.


There is a difference between doing something just for yourself and making it usable by others.


Like the article says, if the model changes a little this work needs to be almost thrown out


All the cloud providers list MI300x as more expensive than H100. So if you compare performance/cost it is even worse.


Just like that? So a little work and now they are competitive? You know how much work “just a little bit of work” is doing? They have a cultural issue, and it will take months to fix if they are lucky, and then you start tackling the tech debt they’ve built up. By that time it will be another generation.


Back in the late 1990s I met the ATI guys, and they were slipshod then as well. That the ATI legacy of special-casing things lives on is sadly, not too surprising for me.


Sounds like a buy signal for AMD. If you run the right branch and set the right env vars, the thing flies.


Would be a buy signal if their actions (better drivers) showed that they are seriously working on improving software. "This could be great _if_ you go through the trouble of doing it right!" is not persuasive, and any sane person would go with green if they know they have the choice between troubleshooting shitty software and things just working. Look at the george hotz archive youtube channel and watch the videos where he's debugging the amd drivers; it damn near ruins the man. And george is not the type to give up at the first roadblock; there are multiple 5-8 hour videos where he just tries to get the thing to work. The mad man ended up just writing his own driver lol.


It does seem like an improvement. Six or twelve months ago, I recall a lot of crashes and even more basic problems. “If you tune it right, it’s awesome” is a big step forward compared to that.


Anthony has been doing training on 4 of our MI300x systems and has been getting great results... but of course, he is a genius, and writing his own code...

https://x.com/HotAisle/status/1870984996171006035

https://x.com/zealandic1/status/1869857713280430349

https://x.com/zealandic1/status/1868810042168033623


Unfortunately,

> Getting reasonable training performance out of AMD MI300X is an NP-Hard problem.


I expect Nvidia shares to increase tomorrow because of the article while AMD shares are not likely to do well. It is odd how we read the same thing and came to opposite conclusions.


1. AMD always had a lot of hype already priced in, it is no different with AI.

2. AMD has always shipped a bad software stack, it is no different with AI.


Disappointed that there wasn’t anything on inference performance in the article at all. That’s what the major customers have announced they use it for.


TL;DR: AMD still doesn't take software seriously?


[flagged]


People call out Gelsinger all the time.

And yes, she is definitely responsible for this. Probably more than Gelsinger.

At Intel it is not so obvious what they should have done to improve their fabs to better compete with TSMC, which is groundbreaking tech where you often have to make risky bets.

At AMD it was pretty obvious what had to be done to better compete in AI, and it was basic software engineering, not lithography wizardry. Totally achievable by just spending money, increasing head count, hiring top talent, and firing underperformers.

They have so much low hanging fruit that could have been solved just by hiring 5 or 10 software engineers.


In this case it’s because Dylan Patel of Semianalysis interviews Lisa Su regularly and presumably has a direct line to her, and because Lisa and the rest of AMD leadership are absolutely reading the article. It’s unclear if Pat would have (e.g. I don’t think Pat ever sat down for a chat with Semianalysis like Lisa has).


> Is it a kind of misogynism?

-.-

At what point did any of the criticism have anything to do with her gender? Honest question, I'm scratching my head trying to see where misogyny comes into play. Surely it's not that _because_ she's a woman any criticism from men must be misogynistic? Would it be different if Intel's CEO was female? Or do the people criticising need to be of the same gender as those they're criticising in order for there to be no misogyny?

Truly just trying to get an idea of what sort of perspective it takes to get to

> Is it a kind of misogynism?


It might be because Lisa has been so outstandingly effective at making AMD competitive across multiple product lines against bigger competitors and this feels to some people like an oversight that AMD could easily solve. I suspect the current situation with ML software at AMD is a consequence of a very focused company and not an easy fix without sacrificing something else.

I don't think many people can keep track of who's running Intel let alone have hope that with a little work they can deliver reasonable substitutes for NVIDIA's products.


This is Lisa Su's fault.


This is the exact type of victim mentality that we don't need. There are absolute insane amounts of people calling out Gelsinger by name and blaming solely him for failures at Intel.


Lisa Su is an exemplary CEO, and widely recognized as such. She is exemplary for doing what she did with AMD, and did it without appealing at all to her sex... just on sheer competence. I think it's a bit presumptuous to suddenly call out her sex as if it matters. In reality, she's being talked about exactly like any male CEO. I have great faith in her though. She is clearly extraordinarily capable, and honestly a real inspiration to women in tech



