Hacker News

Almost certainly Zen 5 won't have single-cycle FP latency; I haven't heard of any core doing that even for scalar at modern clock rates (though maybe it exists somewhere), and AMD, Intel, and Apple all currently have 3- or 4-cycle FP latency. And Zen 4 already has a throughput of 2 FP ops/cycle for up to 256-bit arguments.

The thing being discussed is that Zen 4 executes 512-bit SIMD ops by splitting them into two 256-bit ones, whereas Zen 5 will supposedly have hardware that does all 512 bits at a time.
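To make the splitting concrete, here's a minimal sketch (purely illustrative, not AMD's actual mechanism) of a 512-bit lane-wise add done as two 256-bit halves, the way Zen 4 cracks one 512-bit op into two 256-bit micro-ops:

```python
# Emulate a 512-bit lane-wise 32-bit add by operating on two
# 256-bit halves, analogous to cracking one 512-bit op into
# two 256-bit micro-ops. Illustrative only.

LANES = 16        # 16 x 32-bit lanes in a 512-bit register
HALF = LANES // 2  # 8 lanes per 256-bit half

def add_512(a, b):
    """Lane-wise 32-bit add of two 512-bit vectors, half at a time."""
    assert len(a) == len(b) == LANES
    lo = [(x + y) & 0xFFFFFFFF for x, y in zip(a[:HALF], b[:HALF])]
    hi = [(x + y) & 0xFFFFFFFF for x, y in zip(a[HALF:], b[HALF:])]
    return lo + hi

print(add_512(list(range(16)), [1] * 16))  # [1, 2, ..., 16]
```

Whether the two halves run back-to-back in one pipe or simultaneously in two is exactly the question debated below.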



Even if Lisa Su said this at the Zen 4 launch, it is not likely that 512-bit operations are split into a pair of 256-bit operations executed sequentially in the same 256-bit execution unit.

Both Zen 3 and Zen 4 have four 256-bit execution units.

Two 512-bit instructions can be initiated per clock cycle. It is likely that the four corresponding 256-bit micro-operations are executed simultaneously across all four execution units; otherwise the dispatcher would be more likely to run out of micro-operations ready for execution, leaving execution units idle and reducing performance.

The main limitation of the Zen 4 execution units is that only 2 of them include FP multipliers, so the maximum 512-bit throughput is one fused multiply-add plus one FP addition per clock cycle, while the Intel CPUs have an extra 512-bit FMA unit, which stays idle and useless when AVX-512 instructions are not used, but which allows two 512-bit FMAs per cycle.
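To put rough numbers on that comparison, a back-of-the-envelope sketch (assuming fp64 lanes and counting an FMA as two FLOPs; clock frequencies differ per part, so this is per-cycle only):

```python
# Peak 512-bit FP64 FLOPs per cycle, counting an FMA as 2 FLOPs.
# Unit counts restate the comment above; this is a sketch, not a spec.

LANES_FP64 = 512 // 64  # 8 fp64 lanes per 512-bit operation

# Zen 4 as described: one 512-bit FMA + one 512-bit FADD per cycle
zen4 = LANES_FP64 * 2 + LANES_FP64 * 1   # 16 + 8 = 24 FLOPs/cycle

# Intel with two 512-bit FMA units: two 512-bit FMAs per cycle
intel = 2 * LANES_FP64 * 2               # 32 FLOPs/cycle

print(zen4, intel)  # 24 32
```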

Without also doubling the transfer path between the L1 cache and the registers, doubled FMA throughput would not have been beneficial for Zen 4, because many algorithms would have become limited by memory transfer throughput.

Zen 5 doubles the width of the transfer path to the L1 and L2 caches, and presumably now includes FP multipliers in all four execution units, matching Intel's 512-bit FMA performance while also doubling the throughput of 256-bit FMA operations (where, on Intel CPUs, the second FMA unit stays unused, halving the throughput).

No well-designed CPU has an FP addition or multiplication latency of 1. All modern CPUs are designed for the maximum clock frequency at which the latency of operations similar in complexity to 64-bit register-to-register integer additions is 1. (CPUs clocked higher than this are called "superpipelined", but they went out of fashion a few decades ago.)

For such a clock frequency, the latency of floating-point execution units of acceptable complexity is between 3 and 5, while the latency of loads from the L1 cache memory is about the same.
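One practical consequence of those 3-to-5-cycle latencies: in a reduction such as a dot product, saturating the FMA pipes requires enough independent accumulators to cover latency times throughput. A small sketch (the specific latency and pipe counts below are illustrative, not vendor-confirmed figures):

```python
# Independent accumulators needed to keep FMA pipes full in a
# dot-product-style reduction: latency x throughput ops in flight.

def accumulators_needed(latency_cycles, fma_per_cycle):
    """Minimum independent dependency chains to hide FMA latency."""
    return latency_cycles * fma_per_cycle

# e.g. 4-cycle FMA latency, 2 FMA pipes -> 8 independent accumulators
print(accumulators_needed(4, 2))  # 8
```

With fewer accumulators, each FMA waits on the previous one in its chain and the pipes sit partially idle.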

The next class of operations with a longer latency includes division, square root and loads from the L2 cache memory, which usually have latencies between 10 and 20. The longest latencies are for loads from the L3 cache memory or from the main memory.


Yeah, it's certainly possible that it's not double-pumping. It should be roughly testable by comparing latency when a vandpd is inserted between two vpermd's (though then there are questions about bypass networks; and of course, if we can't measure which method is used, it doesn't matter for us anyway). I don't have a Zen 4 to test on, though.

But of note is that, at least in uops.info's data[0], there's one perf counter increment per instruction, and all four pipes get non-zero, equally distributed totals. That seems much simpler to achieve with double-pumping, though it's not impossible with splitting across ports (something like incrementing a random one; I'd expect biased results, though).

Then again, Agner says "512-bit vector instructions are executed with a single μop using two 256-bit pipes simultaneously".

[0]: https://uops.info/html-tp/ZEN4/VPADDB_ZMM_ZMM_ZMM-Measuremen...


It seems plausible that they could be using power-of-two random choices to keep the counts even.
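A quick simulation of that idea (entirely hypothetical; just showing that power-of-two random choices keeps per-pipe counter totals nearly even, which would match the uops.info measurements):

```python
import random

# Power-of-two-choices counter attribution: for each retired
# instruction, pick two pipes at random and increment the counter
# of the less-used one. Hypothetical model, not AMD's mechanism.

random.seed(0)  # deterministic for reproducibility
PIPES = 4
counts = [0] * PIPES

for _ in range(100_000):
    a, b = random.randrange(PIPES), random.randrange(PIPES)
    counts[min(a, b, key=lambda p: counts[p])] += 1

print(counts)
print(max(counts) - min(counts))  # spread stays tiny vs the 25,000 mean
```

Plain uniform-random attribution would drift by hundreds of counts over a run this size; the two-choices rule keeps the pipes within a handful of each other.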



