IBM has built a cost-effective AI supercomputer in its cloud

IBM’s cost-effective AI supercomputer has been up and running for several months, but the company has only recently disclosed tangible details about the project, known as Vela.

In a blog post, IBM revealed that the research, authored by five employees at the company, tackles the shortcomings of previous supercomputers, namely their lack of readiness for AI tasks.

The company also sheds some light on the decisions it made to adapt the supercomputer model to this type of workload, in particular its use of affordable but powerful hardware.

IBM’s Vela AI supercomputer

The work highlights that “building a [traditional] supercomputer has meant bare metal nodes, high-performance networking hardware… parallel file systems, and other items usually associated with high-performance computing (HPC).” 

These supercomputers can clearly handle heavy AI workloads; one example is the machine designed for OpenAI, the startup behind the popular ChatGPT chatbot. However, a lack of optimization means that traditional supercomputers can fall short on power in some areas while being overprovisioned in others, leading to unnecessary spending.

While it has long been accepted that bare metal nodes are ideal for AI, IBM wanted to explore offering the same capabilities inside a virtual machine (VM). The result, according to Big Blue, is huge performance gains.

“Following a significant amount of research and discovery, we devised a way to expose all of the capabilities on the node (GPUs, CPUs, networking, and storage) into the VM so that the virtualization overhead is less than 5%, which is the lowest overhead in the industry that we’re aware of.”

In terms of node design, Vela is packed with 80GB of GPU memory, 1.5TB of DRAM, and four 3.2TB NVMe storage drives.
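To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above: four 3.2TB NVMe drives per node and a virtualization overhead of under 5%. The function names and the bare-metal throughput of 1,000 samples per second are made up purely for illustration and do not come from IBM.

```python
# Figures quoted in the article for a single Vela node.
NVME_DRIVES_PER_NODE = 4
NVME_DRIVE_TB = 3.2
MAX_VIRT_OVERHEAD = 0.05  # "less than 5%", according to IBM


def node_nvme_capacity_tb(drives=NVME_DRIVES_PER_NODE, drive_tb=NVME_DRIVE_TB):
    """Raw local NVMe capacity of one node, in TB."""
    return drives * drive_tb


def worst_case_vm_throughput(bare_metal_throughput, overhead=MAX_VIRT_OVERHEAD):
    """Lower bound on VM throughput if overhead stays under the given fraction."""
    return bare_metal_throughput * (1.0 - overhead)


if __name__ == "__main__":
    print(f"NVMe per node: {node_nvme_capacity_tb():.1f} TB")
    # 1000.0 samples/s is a hypothetical bare-metal figure, not an IBM number.
    print(f"VM throughput floor: {worst_case_vm_throughput(1000.0):.0f} samples/s")
```

In other words, each node carries roughly 12.8TB of local NVMe storage, and a job that would hit a given throughput on bare metal should retain at least 95% of it inside the VM if IBM's overhead claim holds.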

The Next Platform estimates that, if IBM wanted to feature its supercomputer in the Top500 rankings, it would deliver around 27.9 petaflops of performance, which would put it in 15th place in the November 2022 rankings.

While today’s supercomputers can already handle AI workloads, rapid developments in artificial intelligence, combined with the pressing need for cost efficiency, highlight the need for such a machine.
