There are likely plenty of unrealized opportunities to improve mature BLAS libraries. For example, here is someone who managed to outperform OpenBLAS' GEMM on Zen 4:
Coincidentally, the Intel MKL also outperforms OpenBLAS, so it is well known that there is room for improvement. That said, I have a GEMV implementation that outperforms both the Intel MKL and OpenBLAS in my tests on Zen 3:
That is, unless you shoehorn GEMV into the Intel MKL's batched GEMM function, which then outperforms my code when there is locality. When there is no locality, my code still runs faster.
I suspect that if/when this reaches the authors of the established amd64 BLAS implementations, they will adopt my trick to make their non-batched GEMV implementations run fast too. In particular, I am calculating the dot products for 8 rows in parallel, followed by 8 horizontal additions done in parallel. I have not seen the technique of doing 8 horizontal additions in parallel mentioned anywhere, so I might be the first to have done it.
There are a number of pull requests to rocBLAS for tuning various sizes of GEMV and GEMM operations. For example: https://github.com/ROCm/rocBLAS/pull/1532