I’d be careful with H100s on clouds right now — we ran ONNX and PyTorch workloads and ran into issues that haven’t been addressed yet.
For instance, ONNX and PyTorch jobs just died at random, and a few thousand dollars later we figured out it was caused by incorrect assumptions in the underlying implementations, almost certainly made by developers who didn’t have access to an H100 yet.
Personally I think we're doing AI wrong. Every time we simplify the architecture we get a huge improvement in speed. CPUs have to do everything (load/store, branching, execution); GPUs do a lot less branching and aren't optimised for it, which is why they can be made massively parallel. What if we cut out arbitrary load/store too? This would look like a DSP: all memory would be prepared in advance, and the massively parallel accelerator would "stream walk" through it without having to load/store arbitrary memory locations. The speed improvement could be on par with the CPU->GPU jump. Of course this would mostly be for inference. There is a startup trying to do just that, called tinycorp. I'm certainly watching it with interest.
> All memory would be prepared in advance and the massively parallel accelerator would "stream walk" through it without having to load store arbitrary memory locations.
Dedicated accelerators do that already, e.g. Google's TPUs, Tesla's D1 or Apple's Neural Engine. You have to load the data into compute-unit-local memory before executing matmuls; keeping the weights there and only piping the dynamic data through saves memory bandwidth.
This sort of architecture only really works when the network is small enough, which has been a perennial problem for neural network accelerators as networks grow but accelerators don't. LLMs and the like will often prefer the opposite form of "stream walking" (streaming the weights through the data) or a hybrid.
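To make the contrast concrete, here's a toy NumPy sketch of the two dataflows for `y = W @ x` over a batch. The function names and tile sizes are illustrative only, not any real accelerator's API; both loops compute the same result, the difference is which operand stays "resident" and which one streams past it.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # model weights
X = rng.standard_normal((64, 256)).astype(np.float32)  # batch of activations

def weight_stationary(W, X, tile=32):
    # Weights sit in on-chip memory; columns of X stream through in tiles.
    Y = np.empty((W.shape[0], X.shape[1]), dtype=np.float32)
    for j in range(0, X.shape[1], tile):
        Y[:, j:j + tile] = W @ X[:, j:j + tile]
    return Y

def streaming_weights(W, X, tile=32):
    # Activations sit on chip; row-tiles of W stream past them instead.
    Y = np.empty((W.shape[0], X.shape[1]), dtype=np.float32)
    for i in range(0, W.shape[0], tile):
        Y[i:i + tile, :] = W[i:i + tile, :] @ X
    return Y

assert np.allclose(weight_stationary(W, X), streaming_weights(W, X), atol=1e-4)
```

When the weights dwarf the batch (the LLM case), the second loop order is the one that keeps the small operand resident, which is the "streaming the weights through the data" idea above.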
Streaming weights instead of data sounds really interesting - I had never considered it.
Something else that might be theoretically possible:
Large arrays of FPGAs are apparently used to simulate and verify chips [1] - can the same be done to run LLMs? Could we get 0.25 to 1 token per cycle, would the engineering effort be worth it, and would it be financially feasible from a TCO standpoint?
There have been some research projects in this direction (computing fabrics, in-memory computation) but nothing that I'm aware of that made it to larger scale manufacturing.
Itanium tried removing a bunch of the out-of-order hardware and hoped that compilers could schedule everything in advance. Generally, that did not work very well.
I don’t get the sense that this is what the parent comment is talking about at all.
Not to mention that GPUs already execute in-order (at least any that I’m familiar with). They do have multiple execution pipelines, but instruction fetch/decode is in-order unlike something like a typical modern high performance CPU.
Well, the data is way smaller than the model (at least with the current trend), and you probably still need random access for the model's weights. I'm not sure it's a gain worth pursuing.
It's interesting how much power we end up spending purely on DRAM transfer. HBM3 uses about 5 pJ/bit, so maxing out an H100's 3.35TB/s of bandwidth requires around 135 watts just for the RAM. Given how many AI workloads are entirely bandwidth-bound, you could probably build a much cheaper chip that just went all-in bandwidth and got rid of all the fancy compute elements.
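The arithmetic behind that figure is a one-liner, using the ~5 pJ/bit and 3.35 TB/s numbers from the comment above (both are the comment's premises, not measurements of mine):

```python
# Back-of-the-envelope: power needed to sustain full HBM3 bandwidth.
ENERGY_PER_BIT_J = 5e-12       # ~5 pJ per bit transferred (HBM3, approximate)
BANDWIDTH_BYTES_S = 3.35e12    # H100 SXM: 3.35 TB/s

bits_per_second = BANDWIDTH_BYTES_S * 8
watts = bits_per_second * ENERGY_PER_BIT_J
print(f"{watts:.0f} W")  # prints "134 W"
```

So roughly 134 W just to move bits to and from DRAM, before any compute happens.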
> And sadly, Nvidia does not support OpenCL’s FP16 extension, so FP16 throughput couldn’t be tested
What a bummer. But is this really true? I know Nvidia doesn't report the cl_khr_fp16 extension, but I saw somewhere that you can still use fp16 types in your code. Has anyone tested this?
... which you created when none of the other stacks existed, and stood behind for 20 years while all your competitors changed their mind about everything every couple of years.
> This giant die implements 144 Streaming Multiprocessors (SMs), 60 MB of L2 cache, and 12 512-bit HBM memory controllers. We’re testing H100’s PCIe version on Lambda Cloud, which enables 114 of those SMs [...]
Seems like an odd detail. Is it safe to assume this language is meant for investors? In the companies I've worked for, we never advertised the amount of eFuse-disabled silicon.
Probably a case of "no one complained hard enough yet", and also a case of "beggars can't be choosers". When literally billions of VC money is poured into both AI startups and established cloud computing companies, all of it flows directly into Nvidia's pockets, and it's not like this area is free from vendor lock-in: when you have a stack that requires Nvidia's hardware, you are going to pay for Nvidia's hardware. We live in a time when any hardware with a "matrix calculation accelerator" label sells like hot cakes. It's a massive bubble (HN doesn't like that word in combination with AI, but that's what it is), but as with any bubble, people don't care; they want to ride the wave while it's there.

But to get back to the issue: Nvidia will sell anything it makes right now, it's just a matter of who gets to buy it first. So no one really complains about some of Nvidia's marketing being a little dishonest. And even if people cared, a trillion-dollar company on its way to being one of the most valued companies on the planet has plenty of options and money for litigation.
It's always been like this in GPU space - reviews have always mentioned the number of compute units (be it SMs or "CUDA cores"), and the total available for a given architecture is also known. You can tell a lot about the relative performance of two cards from that, so this information is useful to more than just investors.
AFAIK it's been like that in CPU space too - e.g. some 6-core CPUs are actually 8-core CPUs with 2 cores deactivated, either because of defects or because the vendor needed more 6-core parts.
It's always like that in consumer semiconductors. Intel has something like 3 to 5 actual silicon variants per generation that cover a dozen or two SKUs.
> We’re testing H100’s PCIe version on Lambda Cloud, which enables 114 of those SMs, 50 MB of L2 cache, and 10 HBM2 memory controllers. The card can draw up to 350 W.
> Nvidia also offers a SXM form factor H100, which can draw up to 700W and has 132 SMs enabled.
So I wonder if the number of enabled units is down to a power-supply or cooling constraint.
My confusion is why the "/144" is needed there, rather than just the lone numbers, 114 and 132, especially since the missing units are more than likely defective. How does knowing this number help anyone? Perhaps it's transparency: "132 is the best we have now. Best possible is 144, so don't save your money, buy this one!"
It's typically mentioned as a yield or product-placement reference: the same silicon is often used for a range of models, and if the full line-up isn't released yet, the number of disabled units hints at which other products are likely to exist, how they'll perform, and how they'll be priced.
The fusing is supposedly down to the power envelope on the PCIe cards. In practice it could also be a market-segmentation / yield-enhancement trick, which would be the more cynical reading, but I assume it's mostly the power envelope.
While it was a mistake, should it be an amazing price?
Sure, if you want 10x the memory bandwidth of a smaller card, that should be expensive.
But 80GB of GDDR6 would currently cost something like... $300. Or if you looked 1-3 years ago it would have been $1000.
GDDR6 is already designed so that you can attach 8 data lines per chip. A high end GPU with a 384-bit memory bus could attach 48 chips that way and have 96GB. Or exactly 80GB on a 320-bit bus.
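The chip-count arithmetic in that comment, spelled out. The 8-data-lines-per-chip mode is the commenter's premise, and the 2 GB-per-chip capacity is my illustrative assumption (actual GDDR6 parts vary):

```python
LINES_PER_CHIP = 8     # narrow per-chip mode, per the comment above
CHIP_CAPACITY_GB = 2   # assumed capacity per chip (illustrative)

def config(bus_width_bits):
    # How many chips fit on the bus, and the total capacity they'd give.
    chips = bus_width_bits // LINES_PER_CHIP
    return chips, chips * CHIP_CAPACITY_GB

print(config(384))  # prints "(48, 96)"  -> 48 chips, 96 GB
print(config(320))  # prints "(40, 80)"  -> 40 chips, 80 GB
```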
$1000-1500 retail baseline for the GPU, $250-800 for extra RAM, $1000+ for the extra design hassle... I think you'd be able to buy that for $3500 if we had better competition.
For a lot of use cases, you want a balance between compute and memory. For big AI models outside of a datacenter, memory is far more important. It's worth putting half the budget into RAM chips if that means your model can fit, even if you "only" get 100 teraflops at FP16.
There are many other articles on Tom's Hardware, mostly old. The newest one (two weeks ago) is just a small piece that reveals that
> can barely render graphics [as in, real time 3D rendering] as they do not have enough special-purpose hardware [...] GH100 only has 24 raster operation (ROP) units and does not have display engines or display outputs [...] One H100 board scores 2681 points in 3DMark Time Spy, which is even slower than the performance of AMD's integrated Radeon 680M, which scores 2710
If Tom's Hardware reviewed cars, I wouldn't be surprised to learn they'd written the following: "Despite being called a "Ford", the Taurus scored exceptionally poorly in our river crossing tests, doing only roughly as well as other mechanical bulls."