Over the past decade, high-performance computing has scaled from teraflop performance to petaflop performance, and is now heading toward the exaflop era. Technology development has had to keep pace to enable these leaps, with notable advancements including the move from SMP architecture to clustered multiprocessing with multi-core processors, and added acceleration from GPUs, FPGAs, and other co-processing technologies.
Historically, increased performance has been achieved with new hardware devices, drivers, middleware, and software applications, each developed separately to improve scalability and maximize throughput. But the limitations of this approach are beginning to show, and the gains it delivers will not continue for long. The performance improvements needed for exascale-class computing will require technology collaboration in all areas. No one company or development effort can efficiently provide all the components necessary to scale performance to such a degree, so discrete development followed by conventional integration will not be feasible. Instead, a system-level approach to exascale computing is already underway.
The era of codesign
Codesign is a collaborative effort among industry thought leaders, academia, and manufacturers to reach exascale performance through a holistic, system-level approach to fundamental performance improvements. A codesign architecture lets every active device in the system act as an acceleration device by orchestrating a more effective mapping of communication between devices. The result is a well-balanced architecture across compute elements, networking, and data storage infrastructure that improves system efficiency and even reduces power consumption.
Exascale computing will undoubtedly include three primary concepts: heterogeneous systems, direct communication through a more sophisticated intelligent network, and backward/forward compatibility. Codesign includes these concepts in order to create an evolutionary architectural approach that will enable exascale-class systems.
Seamless heterogeneous system architecture
One recent example of a more unified approach to heterogeneous systems is the OpenUCX project. OpenUCX is a collaborative effort to create an open, production-grade communication framework for high-performance computing applications. The project is already well underway and addresses fundamental concerns of application portability across a variety of hardware, without the need to migrate applications and the system software stack for every type of infrastructure. The participants in this initiative include IBM, NVIDIA, Mellanox, the University of Houston, Oak Ridge National Laboratory, the University of Tennessee, and many others. The project also includes an advisory panel of thought leaders who guide efforts toward the most effective solutions for exascale.
UCX was initially created by merging three existing high-performance computing frameworks:
- Oak Ridge National Laboratory was working on UCCS, its framework for supporting SHMEM on its systems;
- IBM was working on PAMI, its interface for the Blue Gene/Q supercomputer; and
- Mellanox was working on MXM, its messaging accelerator for MPI and PGAS, which already used a codesign approach to parallel programming libraries.
UCX will replace all three by sitting between the two layers: it supports the communication frameworks (such as MPI, SHMEM, and PGAS libraries) on one side and the hardware interfaces on the other. The result of this approach is an optimized communication path with low software overhead, producing near-bare-metal performance and portability of software from one interconnect to another.
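The layering described above can be sketched in a few lines. This is only an illustration of the idea, not the UCX API: the class names `Transport`, `LoopbackTransport`, and `MessagePassing`, and the function `run_app`, are all hypothetical, and the "interconnect" here is just an in-process queue.

```python
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class Transport(ABC):
    """Hardware-facing side: one subclass per interconnect (hypothetical)."""

    @abstractmethod
    def send(self, dest: int, tag: int, payload: bytes) -> None:
        ...

    @abstractmethod
    def recv(self, dest: int, tag: int) -> bytes:
        ...


class LoopbackTransport(Transport):
    """Stand-in for a real interconnect: messages are queued in process memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._queues = defaultdict(deque)

    def send(self, dest: int, tag: int, payload: bytes) -> None:
        self._queues[(dest, tag)].append(payload)

    def recv(self, dest: int, tag: int) -> bytes:
        return self._queues[(dest, tag)].popleft()


class MessagePassing:
    """Framework-facing side: an MPI-like send/recv built only on Transport."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def send(self, dest: int, data: bytes, tag: int = 0) -> None:
        self._transport.send(dest, tag, data)

    def recv(self, rank: int, tag: int = 0) -> bytes:
        return self._transport.recv(rank, tag)


def run_app(transport: Transport) -> bytes:
    """Application code: unchanged regardless of which transport is plugged in."""
    comm = MessagePassing(transport)
    comm.send(dest=1, data=b"halo-exchange")
    return comm.recv(rank=1)
```

Calling `run_app(LoopbackTransport("fabric-a"))` and `run_app(LoopbackTransport("fabric-b"))` exercises the same application code over two different transport instances, which is the portability property the paragraph describes: the programming model depends only on the middle interface, never on a specific interconnect.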
Direct peer-to-peer communication between acceleration devices is another important concept in achieving exascale computing. This approach significantly decreases latency by removing the CPU from the data path of network communications. Codesign will make direct peer-to-peer communication possible between remote GPUs by bypassing CPU and host memory intervention to move data, which can reduce latency for internode GPU communication by upwards of 70%.
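To see why skipping the host-memory staging copies matters, consider a rough additive latency model. This is a sketch under stated assumptions: the stage names and the microsecond values are placeholders chosen for illustration, not measurements, and real transfers overlap stages in ways a simple sum does not capture.

```python
def path_latency_us(stages: dict[str, float]) -> float:
    """Total latency of a transfer path, modeled as the sum of its stages."""
    return sum(stages.values())


def latency_reduction(staged: dict[str, float], direct: dict[str, float]) -> float:
    """Fraction of latency removed by the direct path relative to the staged one."""
    return 1.0 - path_latency_us(direct) / path_latency_us(staged)


# Hypothetical stage costs in microseconds (placeholders, not measured data).
staged_path = {
    "gpu_to_host_copy": 3.0,   # sender GPU memory -> sender host memory
    "network_transfer": 2.0,   # host memory -> remote host memory
    "host_to_gpu_copy": 3.0,   # receiver host memory -> receiver GPU memory
}
direct_path = {
    "network_transfer": 2.0,   # adapter reads and writes GPU memory directly
}

reduction = latency_reduction(staged_path, direct_path)
print(f"modeled latency reduction: {reduction:.0%}")
```

The model makes one point only: when the two staging copies dominate the end-to-end time, removing them accounts for most of the achievable saving, and the network transfer itself becomes the floor.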
And this is only the beginning. Continued development of this technology will soon lead to the next generation of peer-to-peer transactions, giving the accelerator more control over network operations and offloading both the control plane and the data path from the CPU. The result will further reduce latency, allow much lower-power CPUs to be coupled with GPU acceleration, and reduce power consumption across the peer devices that will be typical in a heterogeneous system balanced between vector and scalar components.
Backward and forward compatibility
Compatibility is another important concept in reaching exascale. Backward compatibility must always be a consideration when advancing technologies with performance improvements, but forward compatibility will be of paramount importance for exascale computing. Whereas it is not uncommon today for 10-20 petaflop machines to be completely replaced within five years, exascale machines will not be supplanted so easily. As such, codesign uses open standards for portability and compatibility, so that systems can grow without clusters needing to be entirely overhauled or upgraded.
A common concern with the traditional approach involves point-to-point processor interconnect technologies such as QPI or HyperTransport. Each such technology defines its own physical, link, routing, transport, and protocol layers, and these have not remained consistent and compatible over time. This not only introduces backward-compatibility issues between chip generations, but also limits adoption of the next generation of integrated elements. Exascale systems must be future-proofed to protect that level of investment, performance, and capability, and to keep millions of lines of application code from being overhauled for every generation of hardware.
EDR InfiniBand is already based on the codesign approach and includes offloading and acceleration capabilities that free CPU cores from communications overhead, allowing the CPU to spend more of its time on application computation. To reach exaflop levels of scalability and performance, the industry must pursue codesign to provide a holistic system perspective that addresses the next order of magnitude of performance.