Oops, my calculation was wrong. Let me add an edit to the blog, thanks for point...

gdiamos · on Sept 24, 2024

That’s approximately 0.8% MFU - h100 would get more like 30% or 40% MFU if well tuned

405e9 parameters

2 flops per matrix multiply per parameter

3 matrix multiplies for (forward, backward param, and backward activation) passes

batch size 16

seq length 64

1.3 petaflops per second per GPU in bfloat16

8 GPUs

30 seconds

So that’s 0.8% = (405e9 * 2 * 3 * 16 * 64 / 30) / (1.3e15 * 8)

Note that I’m ignoring the attention flops in this simplified calculation, but they would be a second order effect at this sequence length

Also note that I’m assuming full weight training, not LoRA . The result would be lower MFU if using LoRA

These MI300X results are promising functionally (it's tough to get any model this big running) but they have a long way to go on perf. It's also single node. The biggest issues I've seen on MI300X are related to scaling to multiple nodes.

EDIT: The blog seems to indicate it is using LoRA. So we should remove the backward param pass from the equation above. Backward param only applies to adaptor weights, which are much more than 10x smaller, so we set it to 0 in the approximation. So we get

0.53% = (405e9 * 2 * 2 * 16 * 64 / 30) / (1.3e15 * 8)

darrick_horton · on Sept 25, 2024

We've significantly optimized multinode on AMD MI300X for both stability and performance at TensorWave. There were certainly a lot of challenges, but we've become experts at multinode on AMD. We'd be happy to show you the latest results! They are quite compelling

latchkey · on Sept 26, 2024

Promises... promises... https://x.com/DarrickHorton/status/1828902979875017003

Maybe challenging for Darrick, but in reality, you just need to follow the directions and configure things correctly...

https://techdocs.broadcom.com/us/en/storage-and-ethernet-con...

heyoni · on Sept 24, 2024

If I feed your reply to an llm will it be able to tell me what you’re talking about?

/edit I think it did alright…

gdiamos · on Sept 24, 2024

I tried with Llama 3.1 and it hallucinated MFU. It also completely misinterpreted multi-node scaling.

==========================================================

Let's break down the results described in the post.

Context: The post is discussing the performance of a large language model on a MI300X GPU, which is a high-performance computing (HPC) system. The model has approximately 405 billion parameters and is trained using a batch size of 16 and sequence length of 64.

Key metrics:

MFU (Million Floating-Point Operations per Second): This is a measure of the model's performance, specifically the number of floating-point operations (FLOPs) it can perform per second.

FLOPs: The number of floating-point operations required to perform a matrix multiplication, which is a fundamental operation in deep learning.

GPU performance: The MI300X GPU is capable of 1.3 petaflops (1.3 x 10^15 FLOPs) per second in bfloat16 (a 16-bit floating-point format).

Calculations:

The author provides two calculations to estimate the MFU of the model:

Initial calculation: Assuming full weight training (not LoRA), the author estimates the MFU as:

405 billion parameters

2 FLOPs per matrix multiply per parameter

3 matrix multiplies (forward, backward parameter, and backward activation)

Batch size 16

Sequence length 64

30 seconds to complete the calculation

1.3 petaflops per second per GPU

8 GPUs

The calculation yields an MFU of approximately 0.8%.

Revised calculation: After correcting the assumption to use LoRA (a technique that reduces the number of FLOPs), the author revises the calculation by removing the backward parameter pass, which is only applied to adaptor weights (much smaller than the main weights). This yields an MFU of approximately 0.53%.

Interpretation:

The results indicate that the MI300X GPU is not yet optimized for this large language model, with an MFU of only 0.53% (or 0.8% in the initial calculation). This is a relatively low performance compared to the theoretical maximum of 1.3 petaflops per second per GPU. The author notes that the biggest issues are related to scaling to multiple nodes, suggesting that the performance may improve when running on a larger cluster.

The revised calculation using LoRA reduces the MFU by about 33%, indicating that using this technique can lead to a more efficient use of the GPU resources.

==========================================================

MFU means model flops utilitization. it is a measure of efficiency from 0% to 100% . 100% means that the model is running at maximum possible efficiency, i.e. 1.3 petaflops per gpu.

==========================================================

In that case, the results indicate that the MI300X GPU is running the large language model at a relatively low efficiency, with an MFU of 0.53% (or 0.8% in the initial calculation).

This means that the model is only utilizing a tiny fraction of the GPU's maximum theoretical performance of 1.3 petaflops per second. In other words, the model is not fully utilizing the GPU's capabilities, and there is a significant amount of headroom for optimization.

To put this into perspective, an MFU of 100% would mean that the model is running at the maximum possible efficiency, using 1.3 petaflops per second per GPU. An MFU of 0.53% or 0.8% is extremely low, indicating that the model is running at a tiny fraction of its potential performance.

The author's comment that the MI300X results are "promising functionally" suggests that the model is able to run, but the low MFU indicates that there are significant opportunities for optimization and performance improvement.

heyoni · on Sept 24, 2024

MFU being maximal floating point utilization right? I used chatgpt 4o and it otherwise looks very similar to your llama result

YetAnotherNick · on Sept 24, 2024

> The blog seems to indicate it is using LoRA. So we should remove the backward param pass from the equation above. Backward param only applies to adaptor weights

Backward pass still runs on the non adapter weights. But yeah 10 TFlops/GPU specially on tiny sequence size is very bad compared to what you can get on Nvidia. And I believe the difference would be even higher with large sequence length.

gdiamos · on Sept 24, 2024

backward activations does but typically not backwards weight gradients.

Why compute gradients with regards to weights that aren't going to be updated?