Using High-Performance Computing for AI Workloads

Most organizations today use Artificial Intelligence (AI) technology in their daily operations. Running AI workloads affects software needs, hardware demands, and HPC requirements. Integrating AI into HPC systems may require changing the programming languages you use, increasing the memory capacity of a system, adopting virtualization and containers, and more.

What Is HPC?

High performance computing (HPC) is the use of aggregated computing resources to perform large volumes of computations or highly complex ones. It is also sometimes referred to as supercomputing. HPC uses clusters of interconnected machines that process workloads in parallel.

In an HPC system, each connected device is called a node. Nodes share resources with the system as a whole, including storage, memory, and processing power. Within each node are two or more central processing units (CPUs) with multiple processing cores. A typical HPC system has between 16 and 64 nodes, although the largest systems can have thousands.
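
To make the node model concrete, here is a minimal sketch in Python using the mpi4py library (the choice of MPI and Python is an assumption for illustration; real clusters use a variety of languages and schedulers). Each process works on its own slice of a problem in parallel, and the partial results are combined at the end.

```python
# Minimal sketch of parallel work across HPC nodes using MPI.
# Assumes mpi4py is installed and the script is launched with, e.g.:
#   mpirun -n 16 python sum_squares.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the cluster
size = comm.Get_size()   # total number of processes

# Each process handles its own slice of the problem in parallel.
N = 1_000_000
local_sum = sum(i * i for i in range(rank, N, size))

# Combine the partial results on the root process (rank 0).
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"Sum of squares below {N}: {total}")
```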

HPC systems may also contain graphics processing units (GPUs). GPUs are specialized processors designed for massively parallel operations, as opposed to the more general-purpose CPUs. These processors can help boost the processing speeds available from HPC systems.
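
As an illustration of GPU offload, the hedged sketch below uses CuPy, one of several Python libraries for GPU computing (the library choice is an assumption, not something prescribed here), to run a large matrix multiplication on a GPU rather than the CPU.

```python
# Sketch: offloading a large matrix multiplication to a GPU with CuPy.
# Assumes a CUDA-capable GPU and the cupy package are available.
import cupy as cp

# Allocate two large matrices directly in GPU memory.
a = cp.random.rand(4096, 4096, dtype=cp.float32)
b = cp.random.rand(4096, 4096, dtype=cp.float32)

# The multiplication runs across the GPU's many parallel cores.
c = a @ b

# Results stay on the GPU until explicitly copied back to the host.
print(float(c.sum()))
```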

Why Is HPC Needed?

HPC enables the processing of data on a scale that was previously impossible or nearly so. It enables developers to train highly complex machine learning algorithms, researchers to perform predictive analyses, and businesses to process streaming data in real time.

Without the massive processing power provided by HPC systems, processing big data would be a significant challenge. Many AI applications would not be possible, and blockchain technologies might not exist. Without HPC, technology would be neither as advanced as it currently is nor growing as fast.

HPC Moves to the Cloud

Traditionally, HPC systems were operated on-premises and imposed significant upfront and ongoing costs on the organizations that owned them. The hardware was purpose-built and required maintenance from teams with specialized expertise.

Additionally, systems were only able to run specific software and required researchers to have significant programming skills. These requirements limited HPC use to only those organizations with significant resources and a high level of technological skill.

However, as cloud computing has gained popularity, HPC and HPC-like resources have become more readily available and accessible. Cloud computing enables even average users to access aggregate computing power greater than what many organizations can achieve on-premises. It also allows for the creation of more flexible HPC systems that can be connected to a wider variety of data sources and analytics tools.

When looking to adopt cloud-based HPC, organizations can use either private or public cloud resources. Private clouds can provide greater security and dedicated hardware but often cost more. Public clouds are often cheaper and let organizations consume HPC power as needed, rather than requiring the creation of a dedicated system.

Currently, all major public cloud providers offer resources optimized for HPC deployments. Some providers, such as Azure, also offer integrations and additional services for creating hybrid HPC implementations. These allow organizations to supplement in-house HPC as needed, significantly reducing costs without sacrificing performance or control.

The Convergence of HPC and AI

HPC systems provide the same sort of processing power that is needed to develop, train, and run machine learning models. These systems can significantly speed up training since processes run in parallel: datasets can be ingested, transformed, and analyzed faster. The result is more training iterations and more advanced models in less time.
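
For example, data-parallel training splits each batch across workers so that many gradient computations happen at once. The sketch below shows the core pattern using PyTorch's DistributedDataParallel wrapper; this is one common approach among several, and the model, data, and launcher details are assumptions for illustration.

```python
# Sketch: data-parallel training, where each process (often one per
# GPU or node) computes gradients on its own shard of the batch and
# the framework averages them. Assumes PyTorch and a launcher such as
# torchrun that sets the usual rendezvous environment variables.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters

model = torch.nn.Linear(128, 10)
model = DDP(model)  # gradients are synchronized across processes

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    # In practice, a DistributedSampler gives each rank its own shard.
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # gradient all-reduce happens during backward
    optimizer.step()

dist.destroy_process_group()
```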

Beyond the benefits that HPC can grant to AI, the similarities between the infrastructure needed for each lead to an inevitable overlap. Both require significant computing power, high-speed interconnects, and performance accelerators.

How AI Is Affecting HPC

Unfortunately, although the infrastructure is similar, the software used for HPC and for AI is often quite different. For these two areas to merge, certain changes to workload management and tooling need to occur. Below are a few examples of the shifts that are underway.

Adjustments in programming languages

HPC programs are typically written in Fortran, C, or C++. Processes are supported and directed with C interfaces, libraries, and extensions. AI development, however, relies heavily on Python and MATLAB.

For both to successfully use the same infrastructure, interfaces and software need to be adapted to accommodate both. Most likely, this means that AI frameworks and languages will be overlaid onto existing applications, which will continue to run as before. This enables both AI and HPC programmers to keep working with their preferred tools without requiring either group to switch languages.
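
As a hedged illustration of that overlay pattern, the Python sketch below calls into a compiled C library through the standard ctypes module, so the numerical kernel keeps running as native code while Python drives the workflow. The library name libsolver.so and its dot_product function are hypothetical stand-ins for real Fortran/C/C++ code.

```python
# Sketch: a Python "overlay" calling an existing compiled HPC kernel.
# libsolver.so and dot_product are hypothetical placeholders.
import ctypes

lib = ctypes.CDLL("./libsolver.so")

# Describe the C signature: double dot_product(double*, double*, int)
lib.dot_product.argtypes = [
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_int,
]
lib.dot_product.restype = ctypes.c_double

n = 4
ArrayType = ctypes.c_double * n
x = ArrayType(1.0, 2.0, 3.0, 4.0)
y = ArrayType(4.0, 3.0, 2.0, 1.0)

# Python drives the workflow; the heavy lifting stays in native code.
result = lib.dot_product(x, y, n)
print(result)  # 20.0
```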

Virtualization and containers

Containers can provide a huge benefit for both HPC and AI applications. These tools enable infrastructure to adapt more easily to changing workload demands and to be deployed across a wider range of environments. For AI, containers also help improve the scalability of Python and Julia applications, because containerization decouples environment configuration from the host infrastructure, something both languages benefit from.

Containerization also pairs well with cloud-based HPC, which is becoming increasingly popular as a more affordable way of accessing HPC. Containers allow teams to create HPC configurations that can be deployed quickly and easily, meaning resources can be provisioned and released as needed without discarding lengthy configuration work.
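
As a sketch of that workflow, the Python example below uses the Docker SDK for Python to launch a containerized job whose environment is baked into the image rather than the host node. The tooling choice is an assumption, and the image name is a hypothetical placeholder.

```python
# Sketch: launching a containerized workload programmatically.
# Assumes the docker Python SDK and a local Docker daemon; the image
# name "myorg/hpc-job:latest" is a hypothetical placeholder.
import docker

client = docker.from_env()

# The container carries its own environment (libraries, language
# runtimes, configuration), independent of the host node.
output = client.containers.run(
    "myorg/hpc-job:latest",
    command="python run_simulation.py",
    environment={"OMP_NUM_THREADS": "8"},
    remove=True,  # clean up the container when the job finishes
)
print(output.decode())
```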

Increased memory

Big data plays a huge role in AI, and datasets are only getting bigger. Ingesting and processing these datasets requires a significant amount of memory to maintain the speed and efficiency that HPC can offer.

To accommodate this, HPC systems are incorporating new technologies for both persistent and ephemeral memory. For example, non-volatile memory (NVRAM) can be used to increase both individual node memory and distributed memory across the system.
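
One established technique in this direction, shown in the hedged Python sketch below, is memory mapping: treating a large on-disk (or NVRAM-backed) file as if it were an in-memory array so that datasets larger than RAM can still be processed efficiently. The file name and sizes are placeholders for the example.

```python
# Sketch: processing a dataset larger than RAM via memory mapping.
# The OS pages data in and out on demand; on NVRAM-backed storage the
# same pattern benefits from near-memory access speeds.
# "features.dat" is a hypothetical placeholder file.
import numpy as np

# Map a large binary file of float32 values without loading it all.
data = np.memmap("features.dat", dtype=np.float32,
                 mode="r", shape=(100_000_000,))

# Reduce in chunks so only a small window is resident at a time.
chunk = 1_000_000
total = 0.0
for start in range(0, data.shape[0], chunk):
    total += float(data[start:start + chunk].sum())

print(f"mean = {total / data.shape[0]}")
```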

Conclusion

HPC has always evolved to accommodate new technologies, and it will evolve for AI as well. Merging these two areas, however, requires changes to workload management and tooling. In any case, AI will contribute to the future definition of HPC and thereby permanently adjust its course. In fact, many artificial intelligence companies, especially industry leaders specializing in cloud services, already offer services for both AI and HPC, enabling businesses to leverage scalable and affordable resources.