AWS launches new instances to turbocharge AI training

Amazon Web Services (AWS) has launched EC2 instances it says are specifically optimized for deep learning training.

The new Amazon EC2 Trn1 instances are powered by AWS Trainium, a second-generation ML chip designed by AWS that follows its AWS Inferentia chip.

The cloud giant claims these new instances are well-suited to large-scale distributed training of complex deep learning models for tasks such as natural language processing and image recognition.

What do users get?

Trn1 instances are available in two configurations and are powered by up to 16 AWS Trainium chips with 128 vCPUs. 

According to AWS, the instances offer up to 512 GB of high-bandwidth memory and deliver up to 3.4 petaFLOPS of TF32/FP16/BF16 compute power. They also feature a NeuronLink interconnect between chips, which helps avoid communication bottlenecks when scaling workloads across multiple Trainium chips.
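To put those headline numbers in per-chip terms, here is a rough back-of-the-envelope sketch. It assumes the "up to" figures all refer to the 16-chip configuration and that memory and compute are split evenly across chips; the article does not state either point, and the names used are purely illustrative.

```python
# Headline figures quoted above for the largest Trn1 configuration.
# Assumption: these totals apply to the 16-chip configuration and are
# divided evenly across chips (not stated in the article).
CHIPS = 16
TOTAL_HBM_GB = 512      # high-bandwidth memory, in GB
TOTAL_TFLOPS = 3400     # 3.4 petaFLOPS expressed in teraFLOPS


def per_chip(total, chips=CHIPS):
    """Return an even per-chip share of an instance-wide total."""
    return total / chips


hbm_per_chip_gb = per_chip(TOTAL_HBM_GB)
tflops_per_chip = per_chip(TOTAL_TFLOPS)

print(f"HBM per chip:     {hbm_per_chip_gb} GB")
print(f"Compute per chip: {tflops_per_chip} TFLOPS")
```

Under those assumptions, each Trainium chip would account for 32 GB of high-bandwidth memory and 212.5 teraFLOPS of the quoted peak.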

In addition, Amazon says Trn1 instances are the first EC2 instances to enable up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth for high-throughput network communication. The instances also come with up to 8 TB of local NVMe SSD storage for fast access to large datasets.

AWS also said its Trainium chips include specific scalar, vector, and tensor engines that are purpose-built for deep learning algorithms. 

Other new features of Trainium chips include support for a wide range of data types (FP32, TF32, BF16, FP16, and UINT8), stochastic rounding, custom operators written in C++, and dynamic tensor shapes.
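For readers unfamiliar with stochastic rounding, the general idea is to round a value up or down at random, with the probability of rounding up equal to how far the value sits above the lower grid point. The rounding error then averages out to zero over many operations. The snippet below is a minimal sketch of that general technique in plain Python; it is not a description of how Trainium implements it in hardware, and the function name and parameters are hypothetical.

```python
import math
import random


def stochastic_round(x, step=1.0, rng=random):
    """Round x to a multiple of `step`.

    The value is rounded up with probability equal to its fractional
    distance above the lower grid point, and down otherwise, so the
    result equals x on average.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower          # in [0, 1)
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step


# Illustration: 0.3 always rounds to 0 under round-to-nearest, but under
# stochastic rounding it lands on 1 roughly 30% of the time.
rng = random.Random(0)             # fixed seed for repeatability
samples = [stochastic_round(0.3, rng=rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(f"mean of stochastically rounded 0.3: {mean:.3f}")
```

Every individual sample here is either 0 or 1, yet the average stays close to 0.3. That unbiasedness is the property that makes stochastic rounding attractive when accumulating many small updates in low-precision formats.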

AWS Trainium shares the same AWS Neuron SDK as AWS Inferentia, which could make the transition to AWS Trainium easier.

Where can I sign up?

You can launch Trn1 instances today in certain AWS Regions, including US East (N. Virginia) and US West (Oregon).

These Trn1 instances can be deployed using AWS Deep Learning AMIs, and container images are available via managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.

To learn more, you can head to Amazon EC2’s Trn1 instances page.
