Nvidia H100 caused problems during Meta’s Llama AI training

Getting the fastest AI chip is just half the battle.

Nvidia H100 DGX Superpod system render.

Nvidia H100 GPUs have apparently caused some delays in Meta’s Llama 3 training due to memory-related issues. The Meta team had to fight through upwards of 400 failures to train its AI.

According to a recent Meta study, the company's Llama 3 training wasn't smooth sailing. It seems that the massive cluster of Nvidia H100 GPUs had some memory issues. And we are not talking about a couple of errors here and there. In total, the team had to deal with 466 interruptions, 419 of which were due to unexpected failures, with the remaining 47 caused by planned maintenance.

This volume of failures is even more concerning when you take into account the training duration, which lasted just 54 days. That works out to roughly eight or nine interruptions every single day. I thought AI was supposed to simplify our lives. Despite all of this, the Llama 3 team maintained over 90% effective training time.
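A quick back-of-the-envelope check of the figures from Meta's study (the variable names here are just for illustration):

```python
# Interruption figures reported in Meta's Llama 3 training study
total_interruptions = 466
unexpected_failures = 419
planned_maintenance = 47
training_days = 54

# Sanity check: the two categories should sum to the total
assert unexpected_failures + planned_maintenance == total_interruptions

# Average interruptions per day over the 54-day run
per_day = total_interruptions / training_days
print(f"{per_day:.1f} interruptions per day")  # prints "8.6 interruptions per day"
```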

GPU-related problems, including faulty HBM3 memory, accounted for 58.7% of the unexpected interruptions. Thankfully, only three incidents required significant manual intervention; automation handled the rest. To be fair, Nvidia's GPUs didn't cause all the trouble; many interruptions originated from software or networking bugs. There were even a couple of CPU failures, but those components don't cost $30,000 apiece like an H100.

Nonetheless, this is a stain on Nvidia's 3D leather jacket, especially as these chips are the most sought-after in the world, with shipment wait times measured in weeks. Even China wants a piece of the AI cake, prompting Nvidia to build a dedicated model for that market to comply with US sanctions.

Meta trained its 405-billion-parameter Llama 3 model using a massive cluster of 16,384 H100 GPUs based on the Hopper architecture. The H100 is one of the fastest AI training solutions on the market, with 14,592 or 16,896 CUDA cores, depending on whether it's the PCIe 80GB HBM2e model or the SXM 80GB HBM3 model. Blackwell B200 is the only GPU that surpasses it.