Nvidia, Microsoft collaborate to build the “most powerful AI supercomputers”

Source: Nvidia

  • Nvidia is bringing tens of thousands of its GPUs, Quantum-2 InfiniBand networking and its full stack of AI software to Microsoft Azure’s advanced supercomputing infrastructure.

Nvidia Corp, the most valuable US semiconductor maker, recently announced a “multi-year collaboration” with Microsoft to build “one of the most powerful AI supercomputers in the world,” designed to handle the huge computing workloads necessary to train and scale AI. The partnership makes Microsoft Azure the first public cloud to incorporate Nvidia’s full AI stack — its GPUs, networking, and AI software.

The supercomputer will be powered by Microsoft Azure’s advanced supercomputing infrastructure combined with NVIDIA GPUs, networking and its full stack of AI software, to help enterprises train, deploy and scale AI, including large, state-of-the-art models, Nvidia said in a statement last week.

Nvidia’s vice president of enterprise computing, Manuvir Das, noted that “AI technology advances as well as industry adoption are accelerating.” The breakthrough of foundation models has triggered a tidal wave of research, fostered new startups and enabled new enterprise applications.

He added that the collaboration between Nvidia and Microsoft “will provide researchers and companies with state-of-the-art AI infrastructure and software to capitalize on the transformative power of AI.” Echoing his thoughts, Microsoft’s executive VP of the Cloud + AI Group Scott Guthrie said, “Our collaboration with NVIDIA unlocks the world’s most scalable supercomputer platform, which delivers state-of-the-art AI capabilities for every enterprise on Microsoft Azure.”

Microsoft and Nvidia bringing cloud and supercomputer together

Azure’s cloud-based AI supercomputer includes powerful and scalable ND- and NC-series virtual machines optimized for AI distributed training and inference, according to Nvidia. Microsoft Azure is also the first public cloud to pair with NVIDIA’s advanced AI stack, adding tens of thousands of NVIDIA A100 and H100 GPUs, NVIDIA Quantum-2 400Gb/s InfiniBand networking and the NVIDIA AI Enterprise software suite to its platform.
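
To give a sense of what “AI distributed training” on these instances involves, below is a minimal sketch of multi-GPU data-parallel training with PyTorch DistributedDataParallel over the NCCL backend, which can run on InfiniBand fabrics such as Quantum-2. The model, data and hyperparameters are placeholders, not details from the announcement.

```python
# Minimal sketch: multi-GPU data-parallel training with PyTorch DDP.
# NCCL handles the GPU-to-GPU communication and can use an InfiniBand fabric.
# The model, data and hyperparameters below are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the master address/port.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(100):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # stand-in batch
        loss = model(x).pow(2).mean()                            # dummy loss
        loss.backward()   # gradients are all-reduced across all GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would typically be launched with torchrun, for example `torchrun --nnodes=2 --nproc_per_node=8 train.py`, giving one process per GPU.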

The collaboration will also see NVIDIA utilizing Azure’s scalable virtual machine instances to research and further accelerate advances in generative AI, “a rapidly emerging area of AI in which foundational models like Megatron Turing NLG 530B are the basis for unsupervised, self-learning algorithms to create new text, code, digital images, video or audio,” Nvidia said.
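
To make the idea concrete, here is a tiny, illustrative example of text generation with a foundation model via the Hugging Face transformers library. Megatron-Turing NLG 530B itself is not publicly downloadable, so the small “gpt2” checkpoint stands in purely for illustration.

```python
# Illustrative only: text generation with a small, publicly available model.
# "gpt2" is a stand-in for far larger foundation models such as MT-NLG 530B.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large AI supercomputers are built to", max_new_tokens=40)
print(result[0]["generated_text"])
```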

To top it off, Nvidia and Microsoft will also collaborate to optimize the latter’s DeepSpeed deep learning optimization software. In turn, NVIDIA’s full stack of AI workflows and software development kits, optimized for Azure, will be made available to Azure enterprise customers. “Combined with Azure’s advanced compute cloud infrastructure, networking and storage, these AI-optimized offerings will provide scalable peak performance for AI training and deep learning inference workloads of any size,” the statement by Nvidia reads.
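
As a rough sketch of how DeepSpeed is typically used to wrap a PyTorch training loop, the example below enables ZeRO stage 2 with fp16. The model and configuration values are illustrative placeholders, not settings from the Azure collaboration, and such jobs are normally started with the `deepspeed` launcher.

```python
# Minimal, illustrative DeepSpeed sketch: ZeRO stage 2 with fp16.
# Model and config values are placeholders, not Azure-specific settings.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in model

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}

# deepspeed.initialize returns an engine that manages the optimizer,
# mixed precision and gradient synchronization internally.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One illustrative training step.
x = torch.randn(8, 1024, device=model_engine.device, dtype=torch.half)
loss = model_engine(x).pow(2).mean()
model_engine.backward(loss)  # DeepSpeed-managed backward pass
model_engine.step()
```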

Once Nvidia and Microsoft’s cloud supercomputer comes online, customers will be able to deploy thousands of GPUs in a single cluster to “train even the biggest language models, build the most complex recommender systems at scale, and enable generative AI at scale,” according to Nvidia. To recall, earlier this year, social media conglomerate Meta claimed that its new AI Research SuperCluster, or RSC, was already among the fastest machines of its type and, when complete in mid-2022, would be the world’s fastest.

AI Research SuperCluster — Meta’s cutting-edge AI supercomputer for AI research

Source: Meta

RSC will be used to train a range of systems across Meta’s businesses: from content moderation algorithms used to detect hate speech on Facebook and Instagram to augmented reality features that will one day be available in the company’s future AR hardware. Meta also says RSC will be used to design experiences for the metaverse.

Work on RSC began around 2019, with Meta’s engineers designing the machine’s various systems — cooling, power, networking, and cabling — entirely from scratch. By January this year, phase one of RSC was already up and running, consisting of 760 Nvidia DGX A100 systems containing 6,080 connected GPUs. Even then, Meta said it was already delivering up to 20 times better performance on its standard machine vision research tasks.