Nvidia H100 caused problems during Meta’s Llama AI training

Getting the fastest AI chip is just half the battle.

Nvidia H100 DGX Superpod system render.

Nvidia H100 GPUs have apparently caused some delays in Meta’s Llama 3 training due to memory-related issues. The Meta team had to fight through upwards of 400 failures to train its AI.

According to a recent Meta study, the company's Llama 3 training wasn't smooth sailing. It seems that the massive cluster of Nvidia H100 GPUs had some memory issues. And we are not talking about a couple of errors here and there. In total, the team had to deal with 466 interruptions, 419 of which were due to unexpected failures, with the remaining 47 caused by planned maintenance.

This volume of failures is even more concerning when you take into account the training duration, which lasted just 54 days. That works out to roughly eight or nine interruptions every single day. I thought AI was supposed to simplify our lives. Despite all of this, the Llama 3 team maintained over 90% effective training time.
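A quick back-of-the-envelope check of the figures from Meta's study (the variable names here are just for illustration):

```python
# Interruption figures reported in Meta's Llama 3 training study
total_interruptions = 466
unexpected_failures = 419
planned_maintenance = 47
training_days = 54

# Sanity check: the two categories should sum to the total
assert unexpected_failures + planned_maintenance == total_interruptions

# Average interruptions per day over the 54-day run
per_day = total_interruptions / training_days
print(f"{per_day:.1f} interruptions per day")  # prints "8.6 interruptions per day"
```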

GPU-related problems, including faulty HBM3 memory, accounted for 58.7% of the unexpected interruptions. Thankfully, only three incidents required significant manual intervention; automation handled the rest. To be fair, Nvidia's GPUs didn't cause all the trouble; many interruptions originated from software or networking bugs. There were even a couple of CPU failures, but those components don't cost $30,000 apiece like an H100.

Nonetheless, this is a stain on Nvidia's 3D leather jacket, especially as these chips are the most sought-after in the world, with shipment wait times measured in weeks. Even China wants a piece of the AI cake, prompting Nvidia to build a dedicated model for that market to comply with US sanctions.

Meta trained its 405-billion-parameter Llama 3 model using a massive cluster of 16,384 H100 GPUs based on the Hopper architecture. The H100 is one of the fastest AI training solutions on the market, with 14,592 or 16,896 CUDA cores, depending on whether it's the PCIe 80GB HBM2e model or the SXM 80GB HBM3 model. Blackwell B200 is the only GPU that surpasses it.