Who’s behind the powerful AI ChatGPT? NVIDIA, revealed Microsoft
- Microsoft improved their infrastructure to cater to the growing need for larger and more complex models, with the latest addition being virtual machines with NVIDIA H100 Tensor Core GPUs and Quantum-2 InfiniBand networking.
- The upcoming NVIDIA GTC event will feature updates and announcements about generative AI, cloud computing, the industrial metaverse, and other topics.
Although ChatGPT’s popularity and effectivenes is widely acknowledged, not everyone is familiar with the technology responsible for its vast-scale computing. It’s important to acknowledge that NVIDIA has played a significant part in empowering ChatGPT to support massive-scale computing for the well-known and dominant AI model.
OpenAI proposed a groundbreaking idea to Microsoft – developing AI systems that would revolutionize human interaction with computers. With its expertise in high-performance computing, Microsoft has been working on AI models to improve language efficiency.
As OpenAI researchers started using more powerful GPUs for complex AI tasks, they realized the need for massive supercomputing infrastructure to scale up the training of increasingly powerful AI models. This urgency led to a partnership between Microsoft and OpenAI to build a dedicated Azure AI supercomputing technology infrastructure that could support large language models.
Microsoft has continued to advance this infrastructure to meet the increasing demand for larger and more complex models. The latest offering is virtual machines integrating NVIDIA H100 Tensor Core GPUs and Quantum-2 InfiniBand networking.
Advancements in large language model training
The breakthroughs in training large language models were made possible by mastering the construction, operation, and maintenance of tens of thousands of GPUs co-located and connected through a high-speed, low-latency InfiniBand network. This scale exceeded what suppliers of GPUs and networking equipment had ever tested, making it uncharted territory with no guarantee that the hardware could withstand the strain.
The computation workload is distributed across thousands of GPUs in a cluster to train large language models, which requires a reliable infrastructure and optimized system-level software for optimal performance.
Over the years, Microsoft has developed software techniques that enable efficient utilization of GPUs and networking equipment, allowing the training of models with tens of trillions of parameters while reducing the required resources and time. Microsoft and its partners have gradually added capacity to the GPU clusters and the InfiniBand network while testing the data center infrastructure to keep the GPU clusters running for weeks, including cooling systems, uninterruptible power supplies, and backup generators.
The Azure infrastructure optimized for large language model training is now available through Azure AI supercomputing capabilities in the cloud, providing the required combination of GPUs, networking hardware, and virtualization software to power the next wave of AI innovation. Microsoft built special-purpose clusters focusing on enabling large training workloads, and OpenAI was one of the early proof points for that.
Microsoft and OpenAI worked closely together to understand the requirements for building training environments and develop the necessary solutions.
The power of NVIDIA AI
Microsoft’s ND H100 v5 VM, which comes in sizes ranging from eight to thousands of NVIDIA H100 GPUs interconnected by NVIDIA Quantum-2 InfiniBand networking, allows customers to experience faster AI model performance than with the last generation ND A100 v4 VMs. The ND H100 v5 VM includes 8 NVIDIA H100 Tensor Core GPUs interconnected via next-gen NVSwitch and NVLink 4.0, 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU with 3.2Tb/s per VM in a non-blocking fat-tree network, NVSwitch and NVLink 4.0 with 3.6TB/s bisectional bandwidth between 8 local GPUs within each VM, 4th Gen Intel Xeon Scalable processors, PCIE Gen5 host to GPU interconnect with 64GB/s bandwidth per GPU, and 16 Channels of 4800MHz DDR5 DIMMs.
The ND H100 v5 VM introduced by Microsoft enables customers to achieve supercomputer-level performance and deploy a new class of large-scale AI models. This capability is especially relevant for organizations like Inflection, NVIDIA, and OpenAI that have committed to large-scale deployments. Microsoft’s AI Infrastructure is designed for this purpose, focusing on scale and optimization for AI.
Azure has integrated AI at scale into its core, with investments in large language model research and the creation of the first AI supercomputer in the cloud. This preparation allowed Azure to leverage the power of generative artificial intelligence when possible. With services like Azure Machine Learning and Azure OpenAI Service, customers can easily access and utilize Azure’s AI infrastructure for model training and large-scale generative AI models. Additionally, Azure has democratized supercomputing capabilities by eliminating the need for massive hardware or software investments.
NVIDIA and Microsoft Azure have collaborated through multiple product generations to deliver cutting-edge AI innovations to global enterprises. The ND H100 v5 VMs represent the latest collaboration, which will usher in a new era of generative AI applications and services.
The upcoming NVIDIA GTC event will feature updates and announcements about generative AI, cloud computing, the industrial metaverse, and other topics.
- Samsung’s leap: Securing 2nm AI chip deal, nipping at TSMC’s Heels
- FBI and UK crime agency finally disrupt Lockbit cyber-gang
- A change in the Nintendo Switch 2 release date? Delayed even further?
- Choosing a tablet for seamless site work: Boost productivity everywhere in the business with Zebra’s ET60 and ET65 solutions
- Keeping stock of your stock: How to build a resilient supply chain in 2024