
That MatMul performance is fairly shocking, to be that much below the theoretical maximum on what should be a fairly low-overhead operation.

I would at least hope that they know where the speed is going, but the issue of torch.matmul and F.linear using different libraries with different performance suggests that they don't even know which code they are running, let alone where the slow bits in that code are.



Low overhead in what sense? matmul is kinda complicated and there are varying, complex state-of-the-art algorithms for it, no? And then if you know things about the matrices in advance you can start optimizing for that, which adds another layer of complexity.


Yes and no. Conceptually it's just three nested loops. The fiddly part is unrolling the inner loop and swizzling the data layouts in such a way that the cores can be kept "fed" efficiently. This usually means breaking things up into cache-sized chunks along some axis.
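
To make that concrete, here is a toy sketch of the two ideas (tile size and names are illustrative, not tuned for any real cache): the textbook triple loop next to a cache-blocked variant that does the same arithmetic tile by tile.

```python
import numpy as np

def matmul_naive(A, B):
    # The textbook algorithm: three nested loops.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blocked(A, B, tile=32):
    # Same arithmetic, but iterated over cache-sized tiles so each
    # tile of A and B is reused while it is still hot in cache.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, p0:p0+tile] @ B[p0:p0+tile, j0:j0+tile]
                )
    return C
```

Both produce the same result; all of the performance difference on real hardware comes from the memory access order, not the math.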

It's easy enough that there are blog articles showing single developers getting within spitting distance of NVIDIA's highly optimised code. As in, 80-something percent of the performance of the best available implementations!

All NVIDIA did was "put the effort in", where the effort isn't some super clever algorithm implemented by a unique genius, but they simply made hundreds of variants of the matmul algorithm optimised for various scenarios. It's a kind of algorithmic brute force for eking out every last percentage point for every shape and size of input matrices on every GPU model and even for various SLI configurations.

From what I've seen, AMD has done... none of this.


> From what I've seen, AMD has done... none of this.

There are a number of pull requests to rocBLAS for tuning various sizes of GEMV and GEMM operations. For example: https://github.com/ROCm/rocBLAS/pull/1532


Merged two days ago!?

That’s about half a decade after they should have done this foundational work!

I guess it’s better late than never, but in this case a timely implementation was worth about a trillion dollars… maybe two.


There are likely plenty of unrealized opportunities to improve mature BLAS libraries. For example, this guy was able to outperform OpenBLAS' GEMM on Zen 4:

https://salykova.github.io/matmul-cpu

Coincidentally, the Intel MKL also outperforms OpenBLAS, so there being room for improvement is well known. That said, I have a GEMV implementation that outperforms both the Intel MKL and OpenBLAS in my tests on Zen 3:

https://github.com/ryao/llama3.c/blob/master/run.c#L429

That is unless you shoehorn GEMV into the Intel MKL's batched GEMM function, which then outperforms my code when there is locality. Of course, when there is no locality, my code runs faster.

I suspect if/when this reaches the established amd64 BLAS implementations' authors, they will adopt my trick to get their non-batched GEMV implementations to run fast too. In particular, I am calculating the dot products for 8 rows in parallel followed by 8 parallel horizontal additions. I have not seen the 8 parallel horizontal addition technique mentioned anywhere, so I might be the first to have done it.
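
To show the shape of the trick (this is not the actual AVX code from run.c, just a scalar sketch of the row-blocking structure; the real version keeps the 8 accumulators in vector registers and does the horizontal adds with SIMD shuffles):

```python
import numpy as np

def gemv_8row(A, x):
    # Process 8 rows of A at a time: keep 8 partial dot products
    # separate, then do the 8 horizontal reductions together at the end.
    n, m = A.shape
    assert n % 8 == 0  # sketch assumes the row count is a multiple of 8
    y = np.empty(n)
    for i in range(0, n, 8):
        # Phase 1: elementwise products for 8 rows, no reduction yet
        # (the AVX version accumulates these in 8 vector registers).
        partial = A[i:i+8, :] * x
        # Phase 2: 8 horizontal additions, one per accumulator.
        y[i:i+8] = partial.sum(axis=1)
    return y
```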


How do the cache sizes compare between AMD GPU’s and nvidia? I remember reading a while ago they were quite different (enough to make flash attention painful to implement)


There are, but everyone uses variations of the same O(n^3) algorithm taught in college introduction to linear algebra classes, because it is numerically stable and can be made extremely fast through tweaks that give spatial locality and good cache characteristics. Meanwhile the asymptotically faster algorithms have such large constants in their big O notation that they are not worth using. FFT based matrix multiplication, which is O((n^2)log(n)), also has numerical instability on top of running slower.
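
Strassen's algorithm is the classic illustration of the constant-factor problem: one recursion level trades 8 multiplications for 7 at the cost of 18 extra matrix additions, so below a (large) cutoff the plain algorithm wins. A minimal sketch for power-of-two sizes:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    # One Strassen level: 7 recursive multiplications instead of 8,
    # paid for with 18 extra matrix additions/subtractions.
    n = A.shape[0]
    if n <= cutoff:            # below the cutoff, plain matmul wins
        return A @ B
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, i = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    p1 = strassen(a, f - i, cutoff)
    p2 = strassen(a + b, i, cutoff)
    p3 = strassen(c + d, e, cutoff)
    p4 = strassen(d, g - e, cutoff)
    p5 = strassen(a + d, e + i, cutoff)
    p6 = strassen(b - d, g + i, cutoff)
    p7 = strassen(a - c, e + f, cutoff)
    top = np.hstack([p5 + p4 - p2 + p6, p1 + p2])
    bot = np.hstack([p3 + p4, p1 + p5 - p3 - p7])
    return np.vstack([top, bot])
```

All those extra additions stream whole matrices through memory, which is exactly the traffic the tuned O(n^3) kernels are built to avoid.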


> matrix multiplication, which is O((n^2)log(n))

Isn't the fastest theoretical algorithm something like O(n^2.37) ?


Yes, but it's impractical unless you have galactic-scale matrices to multiply (at least).


> FFT based matrix multiplication, which is O((n^2)log(n))

What?


https://en.wikipedia.org/wiki/Schönhage–Strassen_algorithm

I forgot the log(log(n)) factor.

In any case, for matrix multiplications that people actually do, this algorithm runs slower than a well optimized O(n^3) matrix multiplication implementation because the constant factor in the Big O notation is orders of magnitude larger.


Schönhage-Strassen is not about matrix multiplication.


FFT is fast Fourier transform, and our best theoretical bounds on multiplication come from methods involving FFT.


For matrix multiplication? How?


By overhead I'm talking about the things that have to be done supplementary to the algorithm.

While there are complex state-of-the-art algorithms, those algorithms exist for everyone. The overhead is the bit that had to be done to make the algorithm work.

For instance for sorting a list of strings the algorithm might be quick sort. The overhead would be in the efficiency of your string compare.

For matmul I'm not sure what your overhead is beyond moving memory, multiplying, and adding. A platform touting a memory bandwidth and raw compute advantage should have that covered. Where is the performance being lost?

I guess the only real options are stalls, unnecessary copies, or unnecessary computations.


> For matmul I'm not sure what your overhead is beyond moving memory, multiplying, and adding. A platform touting a memory bandwidth and raw compute advantage should have that covered. Where is the performance being lost?

The use of the word 'algorithm' is incorrect.

Look... I do this sort of work for a living. There has been no useful significant change to matmul algorithms.

What has changed is the matmul process.

Modern perf optimization on GPUs has little to do with algorithms and everything to do with process optimization. This is akin to factory floor planning and such. You have to make sure the data is there when the processing units need it, and the data is coming in at the fastest rate possible, while keeping everything synchronized to avoid wrong results or deadlocks.

Really, compute power has nothing to do with it. It's a waste of time to even consider it. We can compute matmuls much faster than you can naively bring memory to the processing units. Whoever solves that problem will become very rich.

To that end, NVIDIA ships libraries that will choose from a wide variety of implementations the appropriate trade-offs necessary for SoTA perf on matmuls of all shapes and data types.


To be fair, GEMV is memory bandwidth bound and that is what token generation in transformers uses. GEMM is the compute bound one, provided you do not shoehorn GEMV into it. That special case is memory bandwidth bound.
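
The back-of-envelope arithmetic intensity makes the distinction clear (fp32 numbers, idealized traffic; illustrative only):

```python
def arithmetic_intensity_gemv(n, bytes_per_elem=4):
    # y = A @ x: one multiply-add per matrix element loaded.
    flops = 2 * n * n
    bytes_moved = bytes_per_elem * n * n   # the matrix dominates traffic
    return flops / bytes_moved             # constant ~0.5 flop/byte: bandwidth bound

def arithmetic_intensity_gemm(n, bytes_per_elem=4):
    # C = A @ B with perfect reuse: read A and B, write C.
    flops = 2 * n ** 3
    bytes_moved = bytes_per_elem * 3 * n * n
    return flops / bytes_moved             # grows like n/6: compute bound for large n
```

GEMV's flop/byte ratio is a constant no amount of cache blocking can raise, while GEMM's grows with the matrix size, which is why only GEMM can approach peak Flop/s.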


GEMM isn't compute bound in ML in practice. If you do naive GEMM based attention, then you will have to write the output matrix into HBM and in the worst case you might even have to reload the output from HBM!

So what is done in practice is flash attention, an algorithm that computes the same result as classic attention (up to floating-point rounding) without ever materializing the full score matrix. Flash attention lets you fuse the kernel so that you can multiply against the V matrix and write only the condensed output to HBM. As an additional benefit, you also go from quadratic to linear memory usage.

But here is the problem: your SRAM is limited, and even flash attention is still O(n^2) in compute. If you tile your K and V caches into j and k tiles, you will have to load j*k times from memory. Meanwhile compute tends to consume very little silicon area, so you end up in a situation where you have excessive compute relative to your SRAM. In that regime, doubling SRAM size also doubles performance: you're memory bound again for super long contexts.

Now let's assume the opposite: your compute resources are undersized relative to your SRAM, i.e. you have plenty of SRAM but not enough compute, e.g. a CPU. You will be compute bound up to a linear factor of your SRAM, but always memory bound with respect to main memory. Add matrix cores to the CPU and the problem would disappear into thin air.


I have been working on inference software for my own use, with both CPU and GPU versions:

https://github.com/ryao/llama3.c

The only thing you wrote that makes any sense to me is “Flash attention lets you fuse the kernel”. Everything else you wrote makes no sense to me. For what it is worth, flash attention does not apply to llama 3 inference as far as I can tell.


Which algorithm you pick for what shape of matrices is different and not straightforward to figure out. AMD currently wants you to “tune” ops and likely search for the right algorithm for your shapes while Nvidia has accurate heuristics for picking the right algorithm.


Nvidia's heuristics are not accurate, and it's not possible to achieve peak performance without search.


Low overhead in the sense that matrix multiplication is almost the only algorithm that is able to reach computational throughput values very close to the theoretical maximum for a given hardware.

Good CPUs and GPUs have a throughput in Flop/s for matrix multiplication that is between 60% and 90% of the maximum possible throughput, with many (especially the CPUs) reaching values towards the high end of that range.

As shown in the article, the AMD GPUs attain only slightly less than 50% (for BF16; for FP8 the AMD efficiency is even less than 40%).

Such a low efficiency for the most important operation is not acceptable.


Matmul is trivial to get right, especially since you won't be calculating dot products manually to begin with. You're going to use the tensor cores or equivalent, which already perform almost the entire matrix multiplication for you. Your primary goal in developing a custom matmul kernel is adjusting the algorithm to the specific hardware: knowing how many tiles you can store in your local registers and SRAM, and how to overlap loading new data from HBM with performing the calculations.



