Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is parallel to Ada Lovelace.
Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor and a faster memory subsystem.
The Nvidia Hopper H100 GPU is implemented using the TSMC 4N process with 80 billion transistors. It consists of up to 144 streaming multiprocessors. In SXM5, the Nvidia Hopper H100 offers better performance than PCIe.
The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per SM remains the same between the Ampere and Hopper architectures, 64. The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory. Under TMA, applications may transfer up to 5D tensors. When writing from shared memory to global memory, element wise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp specialized codes. TMA is exposed through cuda::memcpy_async
When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomics in the shared memory of other thread blocks within its cluster, otherwise known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with L2 cache; when used to communicate data between SMs, this can utilize the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 by using the cudaFuncAttributeNonPortableClusterSizeAllowed function, potentially at the cost of reduced number of active blocks.
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Multiprocessors are a core component in all types of computing infrastructure, from phones to datacenters. This course will build on the prerequisites of processor design and concurrency to introduce
Related lectures (17)
,
Modern data management systems aim to provide both cutting-edge functionality and hardware efficiency. With the advent of AI-driven data processing and the post-Moore Law era, traditional memory-bound scale-up data management operations face scalability ch ...
Recent GPU architectures make available numbers of parallel processing units that exceed by orders of magnitude the ones offered by CPU architectures. Whereas programs written using dataflow programming languages are well suited for programming heterogeneo ...
Explores GPUs' architecture, CUDA programming, image processing, and their significance in modern computing, emphasizing early start and correctness in GPU programming.
We present a massively parallel and scalable nodal discontinuous Galerkin finite element method (DGFEM) solver for the time-domain linearized acoustic wave equations. The solver is implemented using the libParanumal finite element framework with extensions ...