There are likely plenty of unrealized opportunities to improve mature BLAS libraries. For example, here is someone who managed to outperform OpenBLAS' GEMM on Zen 4:
Coincidentally, the Intel MKL also outperforms OpenBLAS, so it is well known that there is room for improvement. That said, I have a GEMV implementation that outperforms both the Intel MKL and OpenBLAS in my tests on Zen 3:
That is, unless you shoehorn GEMV into the Intel MKL's batched GEMM function, which then outperforms my code when there is locality. When there is no locality, my code still runs faster.
I suspect that if/when this reaches the authors of the established amd64 BLAS implementations, they will adopt my trick to make their non-batched GEMV implementations run fast too. In particular, I am calculating the dot products for 8 rows in parallel, followed by 8 horizontal additions done in parallel. I have not seen the technique of doing 8 horizontal additions in parallel mentioned anywhere, so I might be the first to have done it.
There are a number of pull requests to rocBLAS for tuning various sizes of GEMV and GEMM operations. For example: https://github.com/ROCm/rocBLAS/pull/1532