
That doesn't sound right. The marginal cost of +768 GB of DDR5 ECC memory in an EPYC system is under $5k.


GPU-accessible RAM.


In a dual-socket EPYC system, the memory bandwidth is about 40% higher than in this Apple system (1152 GB/s vs. 819 GB/s), and the memory capacity can be many times higher.

Like another poster said, 768 GB of ECC RDIMM DDR5-6000 costs around $5000.

Any program whose performance is limited by memory bandwidth, as is frequently the case for inference, will run significantly faster in such an EPYC server than in the Apple system, even when running on the CPU.

Even for computationally limited programs, the difference between server CPUs and consumer GPUs is not great. One EPYC CPU may have about the same number of FP32 execution units as an RTX 4070, while running at a higher clock frequency (though it lacks the tensor cores of an NVIDIA GPU, which can greatly accelerate execution where applicable).
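
As a back-of-envelope check of the bandwidth claim (a sketch; the 12-channels-per-socket configuration and DDR5-6000 data rate are assumptions about the system discussed above):

  # Rough sanity check: theoretical DRAM bandwidth of a dual-socket
  # EPYC (assumed: 12 DDR5-6000 channels per socket, 64-bit channels)
  # vs. the M3 Ultra's 819 GB/s spec.
  channels = 12 * 2                 # two sockets
  mt_per_s = 6000e6                 # DDR5-6000 = 6000 MT/s
  bytes_per_transfer = 8            # 64-bit channel
  epyc = channels * mt_per_s * bytes_per_transfer / 1e9   # 1152.0 GB/s
  print(epyc, epyc / 819)           # ~1.41x the M3 Ultra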


  Any program whose performance is limited by memory bandwidth, as is frequently the case for inference, will run significantly faster in such an EPYC server than in the Apple system, even when running on the CPU.
Source on this? CPUs would be very compute-constrained.


According to Apple, the GPU of the M3 Ultra has 80 graphics cores, which should mean 10240 FP32 execution units (128 per core), the same as an NVIDIA RTX 4080 Super.

However, Apple does not say anything about the GPU clock frequency, which I assume is significantly lower than NVIDIA's.

In comparison, a dual-socket AMD Turin can have up to 12288 FP32 execution units, i.e. 20% more than the Apple GPU.

Moreover, the clock frequency of the AMD CPU must be much higher than that of the Apple GPU, so the AMD system is likely to be at least twice as fast as the Apple M3 Ultra GPU on some graphics workloads.

I do not know what facilities the Apple GPU has for accelerating computation with low-precision data types, analogous to the tensor cores of NVIDIA GPUs.

While for graphics applications big server CPUs are actually less compute-constrained than almost all consumer GPUs (except the RTX 4090/5090), GPUs can be faster for ML/AI applications that use low-precision data types; whether that holds for the Apple GPU is not at all certain.

Even if the Apple GPU happens to be faster for some low-precision data type, the difference cannot be great.

However, a server that could beat the Apple M3 Ultra GPU computationally would cost much more than $10k, because it would need CPUs with many cores.

If the goal is only a system with 50% more memory and 40% more memory bandwidth than the Apple system, that can be done for around $10k.

While such a system would become compute-constrained more often than the Apple GPU, it would still win every time memory is the bottleneck.
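
To make the "at least twice as fast" estimate concrete, here is a minimal sketch; the clock frequencies are assumptions, since Apple does not publish the GPU clock and all-core boost varies by SKU:

  # Peak FP32 throughput = lanes * 2 FLOPs per FMA * clock (GHz) -> GFLOPS.
  # Both clock values below are assumptions, not vendor specs.
  def tflops(lanes, ghz):
      return lanes * 2 * ghz / 1e3

  apple_gpu  = tflops(10240, 1.4)   # ~28.7 TFLOPS at an assumed ~1.4 GHz
  dual_turin = tflops(12288, 3.0)   # ~73.7 TFLOPS at an assumed ~3.0 GHz
  print(dual_turin / apple_gpu)     # ~2.6x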


No one is using FP64 for AI inference.


I have not said a word about FP64.

I have only compared the FP32 computational capabilities, i.e. what is used for graphics, of the Apple M3 Ultra GPU and AMD server CPUs, because those numbers are readily available and they show the relative sizes of the two.

Both GPUs and server CPUs have higher throughput for lower-precision data (server CPUs have AVX-512 BF16 and INT8/VNNI instructions), but the exact acceleration factors are hard to find, and it is difficult to estimate speeds without access to such systems for benchmarking.
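
The theoretical per-instruction ratios are easy to state even where sustained acceleration factors are not; a sketch, assuming one 512-bit instruction per execution unit per cycle (which real cores may not sustain):

  # Ops per 512-bit instruction on an AVX-512 CPU; upper bounds only.
  fp32_fma  = 16 * 2   # FP32 FMA: 16 lanes, 2 FLOPs each    -> 32
  bf16_dp   = 32 * 2   # VDPBF16PS: 32 BF16 products + adds  -> 64
  int8_vnni = 64 * 2   # VPDPBUSD: 64 INT8 products + adds   -> 128
  print(bf16_dp / fp32_fma, int8_vnni / fp32_fma)   # 2.0x, 4.0x vs FP32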


Anecdotal, but it seems like the big EPYC rigs are getting very low tokens per second, and not even consistently. They are strained, unlike e.g. the M3 Ultra, which can likely sustain 40-50 tokens/s based on previous stats.

I'd like to see some proper benchmarking on this, but it looks like the Apple systems might just be extremely good value if you want to run the large DeepSeek model.
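
The 40-50 tokens/s figure is consistent with a simple bandwidth roofline; a sketch, assuming ~4-bit weights and DeepSeek R1's ~37B active parameters per token (not a benchmark):

  # Decode is bandwidth-bound: tokens/s ~ bandwidth / bytes read per token.
  active_params   = 37e9    # DeepSeek R1 MoE, active params per token
  bytes_per_param = 0.5     # assumed ~4-bit quantization
  bandwidth       = 819e9   # M3 Ultra memory bandwidth spec
  print(bandwidth / (active_params * bytes_per_param))   # ~44 tokens/s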


Moot point if tok/s benchmark results are the same or worse.


Are the benchmarks worse? Running LLMs in system memory is rather painful. I am having a hard time finding benchmarks for running large models in system memory. Can you point me to the benchmarks you're referring to?


Not moot if you care about producing those tokens with the largest available models.



