Nvidia’s H100: Funny L2, and Tons of Bandwidth (chipsandcheese.com)
132 points by picture on July 3, 2023 | hide | past | favorite | 49 comments


I’d be careful with H100 on clouds right now — we ran Onnx & PyTorch payloads and ran into issues that haven’t been addressed yet.

For instance, onnx & PyTorch just died randomly and a few thousand dollars later we figured it was directly caused by incorrect assumptions made in the underlying implementations where the developers almost definitely did not have access to an H100 yet.


That sounds serious, are you able to share more details?


Personally I think we're doing AI wrong. Every time we simplify the architecture we get a huge improvement in speed. CPUs have to do everything (load/store, branching, execution); GPUs do a lot less branching and aren't optimised for it, which is why they can be made massively parallel. What if we cut out arbitrary load/store too? This would look like a DSP. All memory would be prepared in advance and the massively parallel accelerator would "stream walk" through it without having to load/store arbitrary memory locations. The speed improvement could be on par with the CPU->GPU jump. Of course this would be mostly for inference. There is a startup that is trying to do just that, called tinycorp. I'm certainly watching it with interest.


> All memory would be prepared in advance and the massively parallel accelerator would "stream walk" through it without having to load/store arbitrary memory locations.

Dedicated accelerators do that already, e.g. Google's TPUs, Tesla's D1 or Apple's Neural Engine. You must load the data into compute-unit local memory before executing matmuls. Keeping the weights there and only piping the dynamic data through saves memory bandwidth.


This sort of architecture only really works when the network is small enough, which has been a perennial problem for neural network accelerators as networks grow but accelerators don't. LLMs and the like will often prefer the opposite form of "stream walking" (streaming the weights through the data) or a hybrid.
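A minimal sketch of why this flips for large networks: a toy off-chip traffic model for Y = X·W that pins one operand on chip and streams the other. The buffer size and layer shapes below are made-up illustrative numbers, not figures from the thread; real accelerators tile and overlap transfers, but the asymmetry is the point.

```python
# Toy off-chip traffic model for Y = X @ W (X: B x K activations, W: K x N weights).
# weight_stationary=True pins W on chip and streams activations;
# weight_stationary=False pins the activations and streams W through once.

def traffic(b, k, n, on_chip_elems, weight_stationary):
    """Approximate off-chip elements moved for one matmul.

    Returns None if the pinned operand does not fit on chip.
    """
    weights = k * n
    acts = b * k + b * n          # inputs in, outputs out
    pinned = weights if weight_stationary else acts
    if pinned > on_chip_elems:
        return None               # can't keep it resident; dataflow breaks down
    streamed = acts if weight_stationary else weights
    return pinned + streamed      # pinned operand loaded once + streamed operand

# Small CNN-like layer: weights fit on chip, weight-stationary works fine.
small = traffic(b=1024, k=256, n=256, on_chip_elems=1 << 20, weight_stationary=True)

# LLM-like layer (K = N = 8192): the weights vastly exceed on-chip memory,
# but a single token's activations fit, so stream the weights instead.
llm_ws = traffic(b=1, k=8192, n=8192, on_chip_elems=1 << 20, weight_stationary=True)
llm_as = traffic(b=1, k=8192, n=8192, on_chip_elems=1 << 20, weight_stationary=False)

print(small, llm_ws, llm_as)
```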


Streaming weights instead of data sounds really interesting - I had never considered it.

Something else that might be theoretically possible:

Large arrays of FPGAs are apparently used to simulate and verify chips [1]; can the same be done to run LLMs? Can we have 0.25 to 1 token per cycle, would the engineering effort be worth it, and would it be financially feasible from a TCO standpoint?

[1] https://www.servethehome.com/amd-vp1902-is-leviathan-fpga-do...
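Just to put numbers on that question: the clock speed below is an assumed ballpark for FPGA emulation fabric (tens of MHz), not a figure from the thread, but even at the low end of the proposed range the throughput would be enormous.

```python
# Back-of-envelope: what "0.25 to 1 token per cycle" would mean at an assumed
# FPGA emulation clock of 50 MHz (illustrative figure, not from the thread).
clock_hz = 50e6
for tokens_per_cycle in (0.25, 1.0):
    tokens_per_sec = clock_hz * tokens_per_cycle
    print(f"{tokens_per_cycle} tok/cycle -> {tokens_per_sec:,.0f} tokens/s")
```

Even the low end is millions of tokens per second for a single stream, which suggests the bottleneck would be feeding weights to the fabric, not the per-cycle framing itself.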


The only problem is that an array of FPGAs would make an Nvidia GPU look cheap. And easy to get.



FPGAs have been tried and dismissed. GPUs are simply better in speed and efficiency.


This sounds a lot like the thesis behind George Hotz’s tinygrad, and the accelerators he eventually hopes to design for it.

edit: typo


There have been some research projects in this direction (computing fabrics, in-memory computation) but nothing that I'm aware of that made it to larger scale manufacturing.


Itanium tried removing a bunch of the out-of-order hardware and hoped that compilers could schedule everything in advance. Generally, that did not work very well.


I don’t get the sense that this is what the parent comment is talking about at all.

Not to mention that GPUs already execute in-order (at least any that I’m familiar with). They do have multiple execution pipelines, but instruction fetch/decode is in-order unlike something like a typical modern high performance CPU.


As far as I know you can preload the data in the GPU before processing it. What is the difference/advantage of what you are proposing?


You wouldn't need the hardware to support arbitrary load/stores. In particular, you could get rid of (some of) the address lines...

I'm unsure if this would be much of a win.


Well, the data is way smaller than the model (at least with the current trend), and you probably still need random access for the weights of the model. I am not sure it's a gain worth pursuing.


It's interesting how much power we end up spending purely on DRAM transfer. HBM3 uses about 5 pJ/bit, so maxing out an H100's 3.35TB/s of bandwidth requires around 135 watts just for the RAM. Given how many AI workloads are entirely bandwidth-bound, you could probably build a much cheaper chip that just went all-in bandwidth and got rid of all the fancy compute elements.
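Reproducing the comment's arithmetic (energy per bit times bits per second, using the 5 pJ/bit and 3.35 TB/s figures stated above):

```python
# DRAM transfer power: energy per bit * bits moved per second.
pj_per_bit = 5e-12            # ~5 pJ/bit for HBM (figure from the comment)
bandwidth_bytes = 3.35e12     # H100 SXM: 3.35 TB/s
watts = pj_per_bit * bandwidth_bytes * 8   # 8 bits per byte
print(f"{watts:.0f} W")       # -> 134 W just to keep the DRAM interface saturated
```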


Or compute right next to the memory with dumb cores, and have a bigger core for mixing the results.


Do you have a source for that? I thought it was closer to 5pJ/byte.


> And sadly, Nvidia does not support OpenCL’s FP16 extension, so FP16 throughput couldn’t be tested

What a bummer. But is this really true? I know Nvidia does not report the cl_khr_fp16 extension, but I saw somewhere that you can still use fp16 types in your code. Has anyone tested this?


Yes, FP16 is fully optimised on Nvidia hardware; but if you don't support standards...


...You gain a near monopoly by getting people to use your proprietary stack?


... which you created when none of the other stacks existed, and stood behind for 20 years while all your competitors changed their mind about everything every couple of years.


I notice they didn't test against TPUv5 either ...


Wouldn't surprise me if it's impossible to test against TPUs, with Google not sharing low-level technical information and documentation with users.


> This giant die implements 144 Streaming Multiprocessors (SMs), 60 MB of L2 cache, and 12 512-bit HBM memory controllers. We’re testing H100’s PCIe version on Lambda Cloud, which enables 114 of those SMs [...]

What does "enabled" mean in this context?


The physical die has 144 SMs, but some are fused off for binning and SKU differentiation, so that only 114 are usable.


Seems like an odd detail. Is it safe to assume that this language is meant for investors? In the companies I've worked for, we never advertised the amount of efused-to-death silicon.


Probably a case of "no one complained hard enough yet", and also a case of "beggars can't be choosers". When literally billions of VC money is poured into both AI startups and established cloud computing companies, all of that flows directly into Nvidia's pockets, and it's not like this area is free from vendor lock-in. When you have a stack that requires Nvidia's hardware, you are going to pay for Nvidia's hardware.

We live in a time when any hardware with a "matrix calculation accelerator" label sells like hot cakes. It's a massive bubble (HN doesn't like when you use that word in combination with AI, but that's what it is), but as with any bubble, people don't care; they want to ride that wave while it's there.

But to get back to the issue: anything Nvidia sells right now, it's just a matter of who is going to be able to buy it first. So no one really complains about some of Nvidia's marketing being a little dishonest. And even if people cared, as a trillion-dollar company on your way to becoming one of the most valued companies on the planet, you have a lot of options and money for litigation.


It's always been like this in GPU space - all reviews have always mentioned the number of compute units (be it SMs or "CUDA cores"), and the total available for the given architecture is also known. A lot can be told about the relative performance of two cards based on that, so this information is useful not only to investors.


AFAIK it's been like that in CPU space too - e.g. that 6-core CPUs are actually 8-core CPUs with 2 cores deactivated, either because of defects or because they needed more 6-core CPUs?


It's always like that in consumer semiconductors. Intel has something like 3 to 5 actual silicon variants per generation that covers all dozen or two SKUs.


This sort of yield-enhancement-by-binning extends to almost every form of semiconductor, from amplifiers to server CPUs.


Sure, but Intel doesn't advertise the number of dead cores.


The article says:

> We’re testing H100’s PCIe version on Lambda Cloud, which enables 114 of those SMs, 50 MB of L2 cache, and 10 HBM2 memory controllers. The card can draw up to 350 W.

> Nvidia also offers a SXM form factor H100, which can draw up to 700W and has 132 SMs enabled.

So I wonder if the number of enabled elements is due to a power supply or cooling constraint.


Very possible, including PCIe power limits (even though you probably have an auxiliary power connector)

It's also possible the yields are not so great and then you have a limited number of good SMs per chip


H100 PCIe has 114/144 SMs enabled

H100 SXM5 has 132/144 SMs enabled. Also higher clocks, much higher TDP.
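The fractions above work out as follows (just the arithmetic on the thread's numbers):

```python
# Fraction of the full GH100 die's 144 SMs enabled in each product.
full = 144
for name, enabled in (("PCIe", 114), ("SXM5", 132)):
    print(f"H100 {name}: {enabled}/{full} = {enabled / full:.1%} enabled")
# H100 PCIe: 114/144 = 79.2% enabled
# H100 SXM5: 132/144 = 91.7% enabled
```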


My confusion is as to why the "/144" is needed there, rather than just the lone numbers, 114 and 132, especially since the missing pieces are more than likely defective. How can knowing this number help anyone? Perhaps it's transparency: "132 is the best we have now. Best possible is 144, so don't save your money, buy this one!"


It's typically mentioned as a yield or product-placement reference: the same silicon is often used for a range of models, and if the full line-up isn't released yet, the number of disabled units gives hints as to which other products are likely to exist, how they will perform, and how they will be priced.


The fusing is supposedly due to the power envelope on the PCIe cards. It could also be a market segmentation / yield enhancement trick, which would be the more nefarious reading, but I assume it's mostly due to the power envelope.


No mention of the RAM (or was I skimming too hard)?! Used to be pretty crucial.

According to Tom's Hardware, one could find 80GB boards for around 3500$.


Are you missing a zero there?


[Curses], yes, of course, I am missing a zero in 35,000$ and I am missing a nurse to check what I am doing, quite apparently.

BTW: that number was obtained from converting Yen to Dollars, and other figures mention instead "over 30,000$".


Doesn't have to be crucial, it can also be corsair, kingston, hyperx, etc.

:P


Can you link to the Tom's Hardware article? That would be an amazing price.


While it was a mistake, should it be an amazing price?

Sure, if you want 10x the memory bandwidth of a smaller card, that should be expensive.

But 80GB of GDDR6 would currently cost something like... $300. Or if you looked 1-3 years ago it would have been $1000.

GDDR6 is already designed so that you can attach 8 data lines per chip. A high end GPU with a 384-bit memory bus could attach 48 chips that way and have 96GB. Or exactly 80GB on a 320-bit bus.

$1000-1500 retail baseline for the GPU, $250-800 for extra RAM, $1000+ for the extra design hassle... I think you'd be able to buy that for $3500 if we had better competition.

For a lot of use cases, you want a balance between compute and memory. For big AI models outside of a datacenter, memory is far more important. It's worth putting half the budget into RAM chips if that means your model can fit, even if you "only" get 100 teraflops at FP16.
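The bus-width math from a few paragraphs up can be checked directly. This takes the comment's claim of 8 data lines per chip at face value, and assumes 2 GB (16 Gb) per chip, which is an assumption on my part:

```python
# Capacity sketch: GDDR6 at 8 data lines per chip on a wide bus
# (2 GB per chip is an assumed density, not stated in the thread).
gb_per_chip = 2
for bus_bits in (384, 320):
    chips = bus_bits // 8          # 8 data lines per chip, per the comment
    print(f"{bus_bits}-bit bus -> {chips} chips -> {chips * gb_per_chip} GB")
```

Which reproduces the comment's 96 GB on a 384-bit bus and 80 GB on a 320-bit bus.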


Hi, sorry, I am mentally a bit overwhelmed these times and losing pieces - I thought I posted the link:

https://www.tomshardware.com/news/nvidia-hopper-h100-80gb-pr...

just five weeks old.

There are many other articles on Tom's Hardware, mostly old. The newest one (two weeks ago) is just a small piece that reveals that

> can barely render graphics [as in, real time 3D rendering] as they do not have enough special-purpose hardware [...] GH100 only has 24 raster operating (ROPs) units and does not have display engines or display outputs [...] One H100 board scores 2681 points in 3DMark Time Spy, which is even slower than performance of AMD's integrated Radeon 680M, which scores 2710


If Tom's Hardware reviewed cars, I wouldn't be surprised to learn they'd written the following: "Despite being called a "Ford", the Taurus scored exceptionally poorly in our river crossing tests, doing only roughly as well as other mechanical bulls."


Toyota, despite not having such nomenclature in its name, proved more reliable in keeping water on the outside of the vehicle.



