An M4 Pro has 273 GB/s of MBW and roughly 7 FP16 TFLOPS. A 5090 has 1.8 TB/s of MBW and likely somewhere around 200 FP16 Tensor TFLOPS (for llama.cpp inference the gap is even more stark, as the CUDA backend is doing roughly 90% INT8 and the 5090 likely has >800 dense INT8 TOPS). So we're looking at a device with about 5.6X more memory bandwidth for token generation, and >100X more processing power for prefill/prompt processing or diffusion workloads. The performance and cost/perf are really not on the same planet.
See:

- https://www.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/
- https://www.reddit.com/r/LocalLLaMA/comments/1ghvwsj/llamacpp_compute_and_memory_bandwidth_efficiency/
- https://www.reddit.com/r/LocalLLaMA/comments/1gheslj/testing_llamacpp_with_intels_xe2_igpu_core_ultra/
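To make the back-of-envelope math above explicit, here's a quick sketch in Python; the 5090 figures are estimates on my part, since Nvidia hasn't published a whitepaper for it:

```python
# Rough ratio math for an M4 Pro vs an RTX 5090 (5090 figures are estimates).
m4_pro_mbw_gbs = 273            # Apple spec
m4_pro_fp16_tflops = 7          # rough GPU FP16 figure

rtx5090_mbw_gbs = 1792          # ~1.8 TB/s GDDR7
rtx5090_int8_dense_tops = 800   # estimate; no Blackwell consumer whitepaper yet

bw_ratio = rtx5090_mbw_gbs / m4_pro_mbw_gbs
print(f"MBW: {bw_ratio:.1f}x the bandwidth ({bw_ratio - 1:.1f}x more)")   # ~6.6x / ~5.6x more
print(f"Compute (llama.cpp INT8 path): {rtx5090_int8_dense_tops / m4_pro_fp16_tflops:.0f}x")  # ~114x
```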
Actually, there is exact info on the MBW. The linked writeup says the HP device uses (the max supported) LPDDR5X-8000, and per the official AMD spec sheet it is "256-bit LPDDR5x". A little multiplication (8000 MT/s x 256 bits / 8) and you get 256 GB/s of MBW.
Per Apple, the 16/20CU M4 Pros have 273 GB/s of MBW, the 32CU M4 Max has 410 GB/s, and the 40CU M4 Max has 546 GB/s (they use LPDDR5X-8533). You need an M4 Max to get 128GB of memory, and the starting price for that is $4700, but you get >2X the MBW. The main weakness of Apple Silicon remains compute: the M4 Max has roughly half the raw TFLOPS of Strix Halo (important for prompt processing, batching, and any diffusion or image generation).
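All of those bandwidth figures fall out of bus width x data rate. A minimal sketch (the data rates are the ones quoted above; the bus widths are my assumption for illustration):

```python
def mbw_gbs(bus_bits: int, mts: int) -> float:
    """Peak memory bandwidth in GB/s from bus width (bits) and data rate (MT/s)."""
    return bus_bits / 8 * mts / 1000

# Data rates as quoted above; bus widths assumed for illustration.
print(mbw_gbs(256, 8000))  # Strix Halo, LPDDR5X-8000 -> 256.0
print(mbw_gbs(256, 8533))  # M4 Pro, LPDDR5X-8533     -> ~273.1
print(mbw_gbs(384, 8533))  # 32CU M4 Max              -> ~409.6
print(mbw_gbs(512, 8533))  # 40CU M4 Max              -> ~546.1
```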
Just a clarification, per the Ampere GA102 whitepaper (Appendix A, Table 9) for the 3090 (328 3rd-gen Tensor Cores):

- Peak FP16 Tensor TFLOPS with FP32 Accumulate (dense/sparse): 71/142
  - So for most training (unless you're DeepSeek) you're looking at 71 TFLOPS
- Peak INT8 Tensor TOPS (dense/sparse): 284/568
  - llama.cpp's CUDA backend is doing ~90% dense INT8
And from the Ada GPU architecture whitepaper (Appendix A, Table 2) for the 4090 (512 4th-gen Tensor Cores):

- Peak FP16 Tensor TFLOPS with FP32 Accumulate (dense/sparse): 165.2/330.4
  - So for most training you're looking at 165 TFLOPS
- Peak INT8 Tensor TOPS (dense/sparse): 660.6/1321.2
  - llama.cpp's CUDA backend is doing ~90% dense INT8
AFAIK, Nvidia has yet to publicly publish a Blackwell/GB202 technical architecture doc.
Here's a chart I made. The GB10 announcement seems very light on details atm. Based on Nvidia's recent technical marketing, I'm assuming the 1 PFLOPS FP4 mentioned is sparse, so dense would be 500 TFLOPS. From there I back-calculate the dense FP16 and INT8 numbers using the ratios in the Blackwell architecture doc (see the sketch after the notes below): https://resources.nvidia.com/en-us-blackwell-architecture
| Specification | Apple M4 Max | AMD Ryzen AI Max+ 395 | NVIDIA GB10 Digits |
|---|---|---|---|
| Release Date | November 8, 2024 | Spring 2025 | May 2025 |
| Price | $4,699 (MBP 14) | $1,200+ | $3,000 |
| Memory | 128GB LPDDR5X-8533 | 128GB LPDDR5X-8000 | 128GB LPDDR5X |
| Memory Bandwidth | 546 GB/s | 256 GB/s | Unknown; 256 GB/s or 512 GB/s |
| FP16 TFLOPS | 34.08 | 59.39 | 125 |
| INT8 TOPS (GPU) | 34.08 | 59.39 | 250 |
| INT8 TOPS (NPU) | 38 | 50 | |
| Storage | 1TB (non-upgradable) + 3 x TB5 (120Gbps) | 2 x NVMe PCIe 4.0 x4 + 2 x TB4 (40Gbps) | NVMe? |
- There's no M4 Max Mac Mini or Mac Studio, so the MBP 14 is the cheapest M4 Max config you can get.
- HP is the only one I've seen showing off a mini-PC workstation w/ the HP Z2 Mini G1a. I'm using Liliputing's reporting on pricing; I'd guess the lower-RAM models would be at the starting price.
- Interested people might want to refer to my recent llama.cpp efficiency tests to see how that might translate to real-world performance.
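Here's a minimal sketch of the GB10 back-calculation described above, assuming the 1 PFLOPS FP4 figure is sparse, sparse = 2x dense, and that dense throughput doubles with each halving of precision (mirroring the Blackwell datasheet ratios):

```python
# Back-calculating GB10 dense throughput from the "1 PFLOPS FP4" marketing number.
# Assumptions: that figure is sparse, sparse = 2x dense, and dense throughput
# doubles with each halving of precision, per the Blackwell datasheet ratios.
fp4_sparse = 1000              # TFLOPS, per the GB10 announcement
fp4_dense = fp4_sparse / 2     # 500
int8_dense = fp4_dense / 2     # 250 TOPS   (matches the chart)
fp16_dense = int8_dense / 2    # 125 TFLOPS (matches the chart)
print(fp16_dense, int8_dense, fp4_dense)
```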
If the GB10 has a 512-bit bus (and hence 512 GB/s of MBW), its big FLOPS/TOPS advantage definitely puts it in a class of its own. If it merely matches Strix Halo on MBW, then it becomes a lot less interesting for the price...
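To put the two bus-width scenarios in perspective, a quick sketch (again assuming LPDDR5X-8000 for the GB10, which is not confirmed):

```python
# Hypothetical GB10 memory configs (LPDDR5X-8000 assumed, not confirmed).
for bus_bits in (256, 512):
    gbs = bus_bits / 8 * 8000 / 1000
    print(f"{bus_bits}-bit: {gbs:.0f} GB/s "
          f"({gbs / 256:.1f}x Strix Halo, {gbs / 546:.2f}x M4 Max)")
# 256-bit: 256 GB/s (1.0x Strix Halo, 0.47x M4 Max)
# 512-bit: 512 GB/s (2.0x Strix Halo, 0.94x M4 Max)
```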
Mac testing (Llama 7B Q4_0): ggerganov/llama.cpp#4167

- M2 Ultra (60 CU) does pp512/tg128 1013.81/88.64
- M2 Ultra (76 CU) does pp512/tg128 1238.48/94.27
So there appears to be a tg advantage w/ more cores.
As for a 7900 XT, here's some testing I did a while back on a similar Llama2 7B Q4_0: https://www.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/
tg is 96.6 t/s, so about on par. Note the RTX 3090 tg speed though: it has 936.2 GB/s of MBW (+17%) but is +41% faster, so theoretical MBW doesn't tell the whole story (Nvidia cards have gotten even faster on the llama.cpp CUDA backend since then).
You can find more recent analysis I did of t/TFLOP and theoretical MBW % efficiency numbers here: https://www.reddit.com/r/LocalLLaMA/comments/1ghvwsj/llamacpp_compute_and_memory_bandwidth_efficiency/
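A minimal sketch of how that kind of MBW efficiency number can be estimated: each generated token has to stream roughly all the weights once, so effective bandwidth is about weight size x tg rate, divided by theoretical peak. The ~3.6 GB weight size for Llama 2 7B Q4_0 and the 800 GB/s 7900 XT peak (implied by the +17% figure above) are my approximations:

```python
# Rough MBW efficiency: each generated token streams ~all weights once, so
# effective bandwidth ~= weight size * tg rate; divide by theoretical peak.
def mbw_efficiency(model_gb: float, tg_tps: float, peak_gbs: float) -> float:
    return model_gb * tg_tps / peak_gbs

model_gb = 3.6           # approx. Llama 2 7B Q4_0 weights (my assumption)
tg_7900xt = 96.6         # t/s from the testing linked above
tg_3090 = 96.6 * 1.41    # "+41% faster" per the same testing

print(f"7900 XT  (800 GB/s):   {mbw_efficiency(model_gb, tg_7900xt, 800.0):.0%}")  # ~43%
print(f"RTX 3090 (936.2 GB/s): {mbw_efficiency(model_gb, tg_3090, 936.2):.0%}")    # ~52%
```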
Power efficiency: https://www.reddit.com/r/LocalLLaMA/comments/1hqlug2/comment/m4r84nl/