
NVIDIA to Unveil Innovations That Improve Data Center Performance and Energy Efficiency at Hot Chips

A deep-tech conference for processor and system architects from industry and academia has become a key forum for the trillion-dollar data center computing market.

At Hot Chips 2024 next week, NVIDIA senior engineers will showcase the latest advancements driving the NVIDIA Blackwell platform, as well as research into liquid cooling for data centers and AI agents for chip design.

They will share how:

  • NVIDIA Blackwell brings together multiple chips, systems and NVIDIA CUDA software to power the next generation of AI across all use cases, industries, and countries.
  • The NVIDIA GB200 NVL72, a multi-node, liquid-cooled, rack-scale system that connects 72 Blackwell GPUs and 36 Grace CPUs, raises the bar for AI systems design.
  • NVLink interconnect technology provides all-to-all GPU communication, enabling record high-throughput, low-latency inference for generative AI.
  • The NVIDIA Quasar quantization system pushes the boundaries of physics to accelerate AI computing.
  • NVIDIA researchers are creating AI models that help build AI processors.

An NVIDIA Blackwell talk, taking place on Monday, August 26, will also highlight new architectural details and examples of generative AI models running on Blackwell silicon.

It is preceded by three tutorials on Sunday, August 25, which will cover how hybrid liquid cooling solutions can help data centers transition to more energy-efficient infrastructure, and how AI models, including large language model (LLM)-driven agents, can help engineers design the next generation of processors.

Together, these presentations showcase the ways NVIDIA engineers are innovating across all areas of data center design and computing to deliver unprecedented performance, efficiency and optimization.

Get ready for Blackwell

NVIDIA Blackwell is the ultimate end-to-end computing challenge. It features multiple NVIDIA chips, including the Blackwell GPU, Grace CPU, BlueField data processing unit, ConnectX network interface card, NVLink Switch, Spectrum Ethernet switch and Quantum InfiniBand switch.

Ajay Tirumala and Raymond Wong, directors of architecture at NVIDIA, will provide a first look at the platform and explain how these technologies work together to deliver a new standard for AI and accelerated computing performance while advancing energy efficiency.

The multi-node NVIDIA GB200 NVL72 system is a prime example. LLM inference requires high-throughput, low-latency token generation. GB200 NVL72 acts as a unified system to deliver up to 30x faster inference for LLM workloads, enabling real-time execution of trillion-parameter models.
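As a rough back-of-envelope illustration (the baseline numbers below are assumptions for the sake of arithmetic, not NVIDIA benchmarks), here is how a 30x inference speedup translates into token throughput and per-token latency:

```python
# Illustrative only: hypothetical baseline throughput, not a measured figure.

def tokens_per_second(base_tps: float, speedup: float) -> float:
    """Throughput after applying a speedup factor."""
    return base_tps * speedup

def per_token_latency_ms(tps: float) -> float:
    """Average time to generate one token, in milliseconds."""
    return 1000.0 / tps

base_tps = 5.0  # assumed baseline tokens/sec for a very large model
fast_tps = tokens_per_second(base_tps, 30.0)  # 150.0 tokens/sec

print(per_token_latency_ms(base_tps))  # 200.0 ms per token
print(per_token_latency_ms(fast_tps))  # ~6.67 ms per token
```

At roughly 7 ms per token, generation outpaces human reading speed, which is what "real-time execution" means in practice for interactive LLM serving.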

Tirumala and Wong will also discuss how the NVIDIA Quasar quantization system—which brings together NVIDIA algorithmic innovations, software libraries and tools, and Blackwell's second-generation Transformer Engine—enables high accuracy from low-precision models, highlighting examples using LLMs and visual generative AI.
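Quasar's internals aren't public, but the general idea behind low-precision inference can be sketched with simple symmetric int8 quantization. This toy example (not NVIDIA's algorithm) shows why the round-trip error is bounded by half a quantization step:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map floats onto int8 with one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)

q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = np.abs(weights - recovered).max()

# Rounding error is at most half a quantization step.
assert max_err <= scale / 2 + 1e-6
```

Real systems go much further (per-channel scales, calibration, narrower formats like FP8 or FP4), but the tradeoff is the same: shrink the number format, keep the reconstruction error small enough that model accuracy survives.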

How to keep data centers cool

The traditional hum of air-cooled data centers may become a relic of the past as researchers develop more efficient and sustainable solutions that use hybrid cooling, a combination of air and liquid cooling.

Liquid cooling techniques move heat away from systems more efficiently than air, making it easier for computer systems to stay cool even while processing heavy workloads. Liquid cooling equipment also takes up less space and consumes less energy than air cooling systems, allowing data centers to add more server racks (and therefore more computing power) to their facilities.
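The efficiency gap comes down to basic thermodynamics: the coolant mass flow needed to carry away a heat load Q at a temperature rise ΔT is Q / (c_p·ΔT), and water's specific heat is roughly four times air's (its density is about 800x higher, so the volumetric gap is far larger still). A quick sketch with assumed numbers, using a hypothetical 100 kW rack and a 10 K coolant temperature rise:

```python
# Illustrative physics sketch; rack power and delta-T are assumptions.

def mass_flow_kg_per_s(heat_w: float, cp_j_per_kg_k: float, delta_t_k: float) -> float:
    """Coolant mass flow needed to carry away `heat_w` watts at a given temperature rise."""
    return heat_w / (cp_j_per_kg_k * delta_t_k)

RACK_HEAT_W = 100_000.0  # hypothetical 100 kW rack
DELTA_T_K = 10.0         # assumed coolant temperature rise
CP_AIR = 1005.0          # J/(kg*K), approx. for air
CP_WATER = 4186.0        # J/(kg*K), approx. for water

air_flow = mass_flow_kg_per_s(RACK_HEAT_W, CP_AIR, DELTA_T_K)      # ~9.95 kg/s of air
water_flow = mass_flow_kg_per_s(RACK_HEAT_W, CP_WATER, DELTA_T_K)  # ~2.39 kg/s of water

print(air_flow, water_flow)
```

Moving ten kilograms of air per second through a rack takes large, power-hungry fans; a couple of kilograms of water per second fits in a modest pipe, which is where the space and energy savings come from.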

Ali Heydari, director of data center infrastructure and cooling at NVIDIA, will present several designs for hybrid-cooled data centers.

Some designs retrofit existing air-cooled data centers with liquid cooling units, offering a quick and easy solution for adding liquid cooling capabilities to existing racks. Other designs require plumbing for direct liquid cooling to the chip using cooling distribution units or by fully submerging the servers in immersion cooling tanks. While these options require a larger initial investment, they provide substantial savings in both energy consumption and operating costs.

Heydari will also share his team's work as part of COOLERCHIPS, a U.S. Department of Energy program to develop advanced cooling technologies for data centers. As part of the project, the team is using the NVIDIA Omniverse platform to create physics-based digital twins that will help them model energy consumption and cooling efficiency to optimize their data center designs.
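One metric such digital twins can help optimize is power usage effectiveness (PUE): the ratio of total facility power to the power consumed by the IT equipment itself, where 1.0 is ideal. A minimal sketch with made-up numbers (purely illustrative, not project results):

```python
# Illustrative only: the power figures below are invented for the example.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power over IT power (1.0 is ideal)."""
    return total_facility_kw / it_load_kw

# Hypothetical comparison: the same 1 MW IT load under two cooling designs.
air_cooled = pue(total_facility_kw=1500.0, it_load_kw=1000.0)  # 1.5
hybrid = pue(total_facility_kw=1150.0, it_load_kw=1000.0)      # 1.15

print(air_cooled, hybrid)
```

In this toy comparison, the hybrid design spends 150 kW instead of 500 kW on cooling and other overhead for the same compute, and a physics-based digital twin lets engineers estimate numbers like these before committing to a build.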

AI agents contribute to processor design

Semiconductor design is a massive challenge at a microscopic scale. Engineers developing cutting-edge processors work to cram as much processing power as possible into a piece of silicon just a few centimeters across, testing the limits of what is physically possible.

AI models support this work by improving design quality and productivity, boosting the efficiency of manual processes, and automating some time-consuming tasks. These models include prediction and optimization tools that help engineers quickly analyze and improve designs, as well as LLMs that can help them answer questions, generate code, debug design issues, and more.

Mark Ren, director of design automation research at NVIDIA, will provide an overview of these models and their uses in a tutorial. In a second session, he will focus on agent-based AI systems for chip design.

AI agents powered by LLMs can be directed to complete tasks autonomously, opening the door to broad applications across industries. In microprocessor design, NVIDIA researchers are developing agent-based systems that can reason and take action using custom circuit design tools, interact with experienced designers, and learn from a database of agent and human experiences.
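At its core, an LLM-driven agent alternates between asking a model what to do and invoking a tool on its behalf. The sketch below is a hypothetical, minimal reason-act loop; the tool name, the canned output, and the `call_llm` stub are illustrative placeholders, not NVIDIA's actual system:

```python
# Minimal agent-loop sketch for a chip-design task. Everything here is a
# stand-in: a real system would call an actual LLM and real EDA tools.

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; here it always chooses the timing-analysis tool."""
    return "TOOL:run_timing_analysis"

TOOLS = {
    # Hypothetical tool returning a canned static-timing-analysis observation.
    "run_timing_analysis": lambda: "worst slack: -0.12 ns on path clk->q",
}

def agent_step(task: str) -> str:
    """One reason-act iteration: ask the model, dispatch to a tool, return the observation."""
    decision = call_llm(f"Task: {task}\nAvailable tools: {list(TOOLS)}")
    if decision.startswith("TOOL:"):
        tool_name = decision.split(":", 1)[1]
        return TOOLS[tool_name]()
    return decision  # the model answered directly without a tool

print(agent_step("Why is the design failing timing?"))
```

A production agent would run this loop repeatedly, feeding each tool observation back into the next prompt so the model can reason over results, consult designers, and decide the next action.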

NVIDIA experts are not only developing this technology, they are using it. Ren will share examples of how engineers can use AI agents for timing report analysis, cell cluster optimization and code generation. The cell cluster optimization work recently won the best paper award at the first IEEE International Workshop on LLM-Aided Design.

Sign up for Hot Chips, taking place August 25-27 at Stanford University and online.
