Tesla Dojo vs Nvidia Blackwell
Supercomputer performance really is advancing by the day.
2021/2022 Tesla Dojo
Source: https://www.youtube.com/watch?v=j0z4FweCy4M, https://www.youtube.com/watch?v=ODSJsviD_SU
1 D1 chip [362 TFLOPS, 442.5MB SRAM]
25 D1 chips = 1 training tile [9 PFLOPS, 11GB SRAM]
12 tiles + 1280GB high-bandwidth DRAM + 8TB host DRAM + power supplies = 1 rack/cabinet [108 PFLOPS, 200kW]
10 racks = 1 cluster/ExaPod [1.08 EFLOPS, 1.3TB SRAM, 12.5TB high-speed DRAM, 2MW]
Floating-point precision is BF16/CFP8.
Each D1 chip has 4TBps of communication bandwidth per edge.
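To sanity-check the scaling above, here's a minimal Python sketch — just my own back-of-the-envelope multiplication of the listed per-chip figures, not anything from Tesla's slides:

```python
# Back-of-the-envelope check of the Dojo scaling figures above.
d1_tflops = 362          # BF16/CFP8 TFLOPS per D1 chip
d1_sram_mb = 442.5       # SRAM per D1 chip, in MB

tile_pflops = d1_tflops * 25 / 1000      # 25 chips per training tile
tile_sram_gb = d1_sram_mb * 25 / 1000    # SRAM per tile, in GB

rack_pflops = tile_pflops * 12           # 12 tiles per rack/cabinet
exapod_eflops = rack_pflops * 10 / 1000  # 10 racks per ExaPod

print(f"Tile:   {tile_pflops:.2f} PFLOPS, {tile_sram_gb:.1f} GB SRAM")
print(f"Rack:   {rack_pflops:.1f} PFLOPS")
print(f"ExaPod: {exapod_eflops:.2f} EFLOPS")
# -> ~9.05 PFLOPS and ~11.1 GB SRAM per tile, ~108.6 PFLOPS per rack,
#    ~1.09 EFLOPS per ExaPod, matching the rounded figures above
```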
2024 Nvidia Blackwell
1 Blackwell GPU [10 PFLOPS, 192GB DRAM]
2 Blackwell GPUs + 1 Grace CPU = 1 GB200 Blackwell Superchip [20 PFLOPS, 384GB DRAM + 384GB “Fast Memory”]
2 GB200 Blackwell Superchips = 1 node [40 PFLOPS, 1.7TB DRAM]
18 nodes + assorted NVLink gear = 1 rack, the GB200 NVL72 [720 PFLOPS, 30.6TB DRAM, 120kW]
8 GB200 NVL72 racks + 10 racks of who-knows-what = 1 cluster [5.76 EFLOPS]
Floating-point precision is FP8 (the “AI Performance” numbers on the slides are FP4; a later slide shows FP8 at half that speed).
Each Blackwell GPU is made up of two dies, with 10TBps of communication bandwidth between them.
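Same back-of-the-envelope check for the Blackwell side, again just multiplying the listed figures:

```python
# Back-of-the-envelope check of the Blackwell scaling figures above (FP8).
gpu_pflops = 10               # FP8 PFLOPS per Blackwell GPU
node_dram_tb = 1.7            # DRAM per node, per the figure above

superchip_pflops = gpu_pflops * 2        # 2 GPUs per GB200 Superchip
node_pflops = superchip_pflops * 2       # 2 Superchips per node
rack_pflops = node_pflops * 18           # 18 nodes per GB200 NVL72 rack
cluster_eflops = rack_pflops * 8 / 1000  # 8 NVL72 racks per cluster

rack_dram_tb = node_dram_tb * 18

print(f"Rack:    {rack_pflops} PFLOPS, {rack_dram_tb:.1f} TB DRAM")
print(f"Cluster: {cluster_eflops:.2f} EFLOPS")
# -> 720 PFLOPS and 30.6 TB DRAM per rack, 5.76 EFLOPS per cluster
```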
Comparison
Let's generously assume BF16 throughput converts to double the FP8 throughput; then counting everything in FP8 terms…
Per rack:
- Tesla Dojo: 216 PFLOPS, 200kW, 1.25TB DRAM, 0.13TB SRAM
- Nvidia Blackwell: 720 PFLOPS, 120kW, 30.6TB DRAM
Moore’s Law says everything doubles every 18 months or whatever… looks about right. (To be exact, it's the number of transistors that fit on an integrated circuit that roughly doubles every 18 months, not performance.)
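A quick check on that “looks about right”: taking Dojo as a 2021/2022 design and Blackwell as a 2024 one per the headings above, i.e. roughly 30–36 months apart (the gap is my assumption; the rest is just arithmetic on the per-rack figures):

```python
import math

# Implied doubling time from the per-rack FP8 comparison above.
dojo_pflops = 216         # Dojo rack, BF16 counted as 2x FP8
blackwell_pflops = 720    # GB200 NVL72 rack, FP8

doublings = math.log2(blackwell_pflops / dojo_pflops)  # ~1.74 doublings

for months_apart in (30, 36):  # assumed gap between the two designs
    print(f"{months_apart} months -> one doubling every "
          f"{months_apart / doublings:.0f} months")
# -> one doubling every ~17-21 months, indeed in the ballpark of the
#    18-month figure quoted alongside Moore's Law
```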
Note: the prefixes, from small to large, are kilo, Mega, Giga, Tera, Peta, Exa.
Other rambling
Nvidia's Omniverse ecosystem looks very strong for digital twins and simulation (and it's been impressive for many years now, not something that only launched this year).
But generative AI training and inferencing don't need Omniverse, so even without an equivalent, AMD should still have a shot at catching up in market share.
As for Intel, let's not even bother. 🤪