The Architecture of Colossus: Supermicro’s Groundbreaking AI Supercomputer
With big datasets becoming the norm in many areas of computing, the need for ever more powerful computer systems has never been greater. But as data has grown in both complexity and sheer volume, the demands of detailed analysis and dataset encryption have made AI a major factor in new systems.
When Elon Musk needed a fast and powerful system to train and power Grok – the generative AI chatbot developed by xAI – his team identified Supermicro as the company to build it. But what does it take to develop a computing system that can handle such a huge amount of data at such speed?
Outsized Specifications
Fundamentally, Colossus is a massive AI supercomputer with more than 100,000 NVIDIA HGX H100 GPUs, exabytes of storage, and the extremely fast networking needed to run one of the most ambitious systems in the world. Almost more impressive is the speed with which Colossus was brought online: the multi-billion-dollar data facility in Memphis, TN went from an empty building, with none of the required power generators, transformers, or hall construction in place, to a production-ready AI supercomputer in just over 120 days.
Having identified the site, Supermicro started by developing an architectural model that would optimise the thermal envelope and power requirements of the massive system. The internal layout began with a raised-floor data hall, with electrical distribution above and liquid-cooling pipes running below to the facility chillers in the basement.
Each of the four computing halls has around 25,000 NVIDIA Graphics Processing Units (GPUs) clustered together, as well as all of its own storage, high-speed fibre-optic networking, and power, allowing each hall to operate autonomously.
From there, things become much more specialist and bespoke. The Supermicro liquid-cooled rack is the foundation of Colossus, and it is found in every cluster. Each rack houses eight Supermicro 4U Universal GPU systems, each with two liquid-cooled x86 CPUs and a liquid-cooled NVIDIA HGX H100 8-GPU board.
Each rack therefore contains 64 NVIDIA Hopper GPUs, made up of eight GPU servers, a Supermicro coolant distribution unit (CDU), and coolant distribution manifolds (CDMs). The racks are then grouped in sets of eight, giving 512 GPUs per group, plus a networking rack, to form mini-clusters inside the much larger system. In this way the entire system builds up modularly, so a failed unit can be swapped out quickly if needed.
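To see how those numbers stack up, here is a minimal sketch in Python using only the figures quoted in this article; the exact counts inside Colossus may differ, so treat it as back-of-the-envelope arithmetic rather than an official breakdown.

```python
# Back-of-the-envelope tally of the Colossus GPU hierarchy,
# using the figures quoted in this article.

GPUS_PER_SERVER = 8         # one NVIDIA HGX H100 8-GPU board per Supermicro 4U system
SERVERS_PER_RACK = 8        # eight 4U GPU servers per liquid-cooled rack
RACKS_PER_MINI_CLUSTER = 8  # racks grouped in sets of eight, plus a networking rack

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK               # 64
gpus_per_mini_cluster = gpus_per_rack * RACKS_PER_MINI_CLUSTER   # 512

# Roughly 25,000 GPUs per compute hall, four halls in total.
GPUS_PER_HALL = 25_000
HALLS = 4

mini_clusters_per_hall = GPUS_PER_HALL // gpus_per_mini_cluster  # roughly 48
total_gpus = GPUS_PER_HALL * HALLS                               # roughly 100,000

print(f"GPUs per rack:          {gpus_per_rack}")
print(f"GPUs per mini-cluster:  {gpus_per_mini_cluster}")
print(f"Mini-clusters per hall: ~{mini_clusters_per_hall}")
print(f"Total GPUs (4 halls):   ~{total_gpus:,}")
```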
Cooling the System
Of course, this kind of computing power and data transfer requires a huge amount of cooling, and that became a major feature of the cluster design. The coolant distribution manifold positioned above each server introduces cold liquid and expels heated liquid; quick disconnects allow the liquid-cooling apparatus to be removed or reinstalled with one hand, giving access to the two trays below.
Each rack comprises eight Supermicro Universal GPU Systems designed for liquid-cooled NVIDIA HGX H100 and HGX H200 Tensor Core GPUs. The top tray of each system accommodates the NVIDIA HGX H100 8-GPU configuration, with cold plates on the HGX board for GPU cooling. The lower tray accommodates the motherboard, CPUs, RAM, PCIe switches, and the cold plates for the twin-socket CPUs. The cold plates transfer heat from the GPUs and CPUs into the coolant, which carries it away to be chilled while cold liquid is fed back in its place. It’s complex but highly effective at drawing out the huge amount of heat generated by the computing cores.
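To get a feel for why liquid cooling is essential at this density, here is a rough per-rack heat estimate. The article does not give power figures, so the sketch assumes NVIDIA’s published ~700 W TDP for an H100 SXM GPU and an illustrative allowance for the non-GPU tray; the real numbers inside Colossus will differ.

```python
# Rough per-rack heat-load estimate. The per-GPU TDP is NVIDIA's published
# figure for the H100 SXM; the per-server allowance for CPUs, memory, PCIe
# switches and NICs is an assumption made purely for illustration.

GPU_TDP_W = 700             # assumed: H100 SXM thermal design power
GPUS_PER_SERVER = 8
SERVERS_PER_RACK = 8
OTHER_PER_SERVER_W = 3_000  # assumed allowance for the lower (non-GPU) tray

gpu_heat_w = GPU_TDP_W * GPUS_PER_SERVER * SERVERS_PER_RACK  # 44,800 W
other_heat_w = OTHER_PER_SERVER_W * SERVERS_PER_RACK         # 24,000 W
rack_heat_kw = (gpu_heat_w + other_heat_w) / 1_000           # roughly 69 kW

print(f"Estimated GPU heat per rack:   {gpu_heat_w / 1_000:.1f} kW")
print(f"Estimated total heat per rack: ~{rack_heat_kw:.1f} kW")
# Air cooling alone struggles well below this density, which is why each
# rack routes its heat into the liquid loop via cold plates.
```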
As a safety system, every custom coolant distribution unit has its own monitoring system that tracks flow rate, temperature, and other important parameters, and is connected to the central management interface. Each unit also contains backup pumps and power supplies, so if one fails it can be repaired or replaced in a matter of minutes without disturbing the operation of the wider system, limiting downtime. Originally proven on just a few linked units, the design was scaled up in Colossus and found to work just as effectively in the final system.
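The article does not describe the monitoring software itself, but conceptually it is a telemetry-and-threshold loop. The sketch below is a hypothetical illustration of how per-CDU flow and temperature readings might be checked and surfaced to a central manager; the field names, thresholds, and data structure are invented for the example and are not Supermicro’s actual interface.

```python
# Hypothetical sketch of per-CDU health monitoring: poll telemetry,
# compare against thresholds, and flag units that need attention.
# Field names and limits are illustrative, not Supermicro's real interface.

from dataclasses import dataclass

@dataclass
class CduReading:
    cdu_id: str
    flow_lpm: float        # coolant flow rate, litres per minute
    supply_temp_c: float   # coolant temperature entering the rack
    return_temp_c: float   # coolant temperature leaving the rack
    pump_ok: bool          # primary pump healthy (backup takes over if not)

# Example thresholds (assumed for illustration only).
MIN_FLOW_LPM = 30.0
MAX_RETURN_TEMP_C = 45.0

def check_cdu(reading: CduReading) -> list[str]:
    """Return a list of alert strings for a single CDU reading."""
    alerts = []
    if reading.flow_lpm < MIN_FLOW_LPM:
        alerts.append(f"{reading.cdu_id}: low coolant flow ({reading.flow_lpm:.1f} L/min)")
    if reading.return_temp_c > MAX_RETURN_TEMP_C:
        alerts.append(f"{reading.cdu_id}: return temperature high ({reading.return_temp_c:.1f} C)")
    if not reading.pump_ok:
        alerts.append(f"{reading.cdu_id}: primary pump fault, running on backup")
    return alerts

if __name__ == "__main__":
    sample = CduReading("rack-17-cdu", flow_lpm=28.5,
                        supply_temp_c=27.0, return_temp_c=41.2, pump_ok=True)
    for alert in check_cdu(sample):
        print("ALERT:", alert)
```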
Network Requirements
Having all of that hardware in place meant that Colossus needed thousands of kilometres of connective cabling, and the task of networking every server together is just as impressive as the construction of the building and the cooling system.
NVIDIA’s Spectrum-X Ethernet networking platform is used to manage the data centre’s enormous networks, allowing its large AI clusters to scale to a level that few other technologies can reach. Spectrum-X is a state-of-the-art networking platform built to meet the heavy demands of AI workloads, offering fast and dependable data transport.
Utilising NVIDIA BlueField-3 SuperNICs, each optical link in the cluster runs at an enormous 400 gigabits per second. This is 400GbE: the same underlying technology as a desktop Ethernet cable, but 400 times quicker per link. With nine cables per machine, each GPU computing server can supply 3.6 Tbps of bandwidth, most of which is used by the RDMA (Remote Direct Memory Access) network that the GPUs rely on. Each graphics processing unit is paired with its own BlueField-3 SuperNIC and connected to the Spectrum-X fabric.
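The bandwidth figure quoted above follows directly from the link count. Here is a minimal sketch reproducing the arithmetic from the numbers in this article: nine 400 Gb/s links per GPU server gives 3.6 Tb/s of aggregate bandwidth.

```python
# Per-server network bandwidth from the figures quoted in the article.

LINK_SPEED_GBPS = 400   # 400GbE per optical link (BlueField-3 SuperNIC)
LINKS_PER_SERVER = 9    # nine cables per GPU server, as quoted above

server_bandwidth_gbps = LINK_SPEED_GBPS * LINKS_PER_SERVER   # 3,600 Gb/s
server_bandwidth_tbps = server_bandwidth_gbps / 1_000        # 3.6 Tb/s

# Comparison with common 1 GbE desktop Ethernet.
speedup_vs_desktop = LINK_SPEED_GBPS / 1

print(f"Per-server bandwidth: {server_bandwidth_gbps} Gb/s (~{server_bandwidth_tbps:.1f} Tb/s)")
print(f"Each link is {speedup_vs_desktop:.0f}x faster than 1 GbE desktop Ethernet")
```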
To minimise network congestion and keep the supercomputer’s operations flowing across the cluster, the system employs smart flow management and can offload various security protocols as required.
The Colossus system is a massive undertaking and is likely to be the first of many such developments as AI systems become ever more complex. If you really want to appreciate its scale, Supermicro recently released a video accompanying the build on X. You can see that here.
Unity Developers are a team of professional designers who can help you with your XR content. Why not contact us and see how we can help with your projects.