
Faulty H100 memory = difficult AI training!


On the AI front, we learn that training Meta's Llama 3 was no easy ride. The H100s caused numerous crashes, partly due to memory issues, over a training run that lasted 54 days.

NVIDIA's H100s gave Meta a tough time!


For the record, Llama 3 was trained on an astronomical number of graphics cards: a cluster of 16,384 NVIDIA H100s, the most powerful card currently available in this segment.

Remember that we are talking about a graphics card built around the GH100 GPU and equipped with 80 GB of HBM3 memory. Depending on the variant, the GPU exposes 114 SMs (14,592 CUDA cores) on the PCIe card or 132 SMs (16,896 CUDA cores) on the SXM5 card.
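Those CUDA-core figures follow directly from the SM counts. A minimal sketch, assuming (as on Hopper) 128 FP32 CUDA cores per SM:

```python
# Sanity-check the CUDA core counts quoted above.
# Assumption: each Hopper SM carries 128 FP32 CUDA cores.
CORES_PER_SM = 128

pcie_sms = 114   # H100 PCIe variant
sxm5_sms = 132   # H100 SXM5 variant

print(pcie_sms * CORES_PER_SM)  # 14592, the PCIe figure
print(sxm5_sms * CORES_PER_SM)  # 16896, the SXM5 figure
```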

In short, those 54 days of training were punctuated by problems, hundreds of them. Our colleagues report:

  • 419 unexpected failures
  • 47 planned maintenance interruptions
  • 466 breakdowns

Note that the 466 breakdowns are the total: the 419 unexpected failures plus the 47 planned maintenance interruptions. The bulk of them were hardware-related, with 30.1% attributed to NVLink and 17.2% to the HBM3 memory. Lastly, only two failures were CPU-related. Two in 54 days of training, that's remarkable!
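The arithmetic behind those figures is easy to verify. A back-of-the-envelope sketch, assuming the 466 total interruptions are indeed the sum of the unexpected and planned ones:

```python
# Back-of-the-envelope failure statistics for the 54-day run.
# Figures are taken from the article; the breakdown assumes
# 466 total interruptions = 419 unexpected + 47 planned.
DAYS = 54
unexpected = 419
planned = 47

total = unexpected + planned        # 466 interruptions in total
hours = DAYS * 24                   # 1296 hours of training

# Mean time between unexpected failures, in hours.
mtbf_h = hours / unexpected

print(total)                        # 466
print(round(mtbf_h, 1))             # 3.1
```

In other words, roughly one unexpected failure every three hours for nearly two months straight.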
