CMP304-R1: AWS infrastructure for large-scale training at Facebook AI - a podcast by AWS



In this session, the Facebook AI team discusses its major machine learning models and workloads and the infrastructure challenges it faced with large-scale distributed training. The team shares details of how it tested these ML workloads on AWS infrastructure and the results of that benchmarking. We then discuss how the breadth and depth of AWS infrastructure for ML workloads in compute, networking, and storage helps address large-scale ML challenges. Specifically, we dive deep into the AWS machine learning stack to help you choose the right Amazon EC2 platform for your ML workload while leveraging 100 Gbps networking and high-performance file systems to scale efficiently from a single GPU to hundreds or thousands of GPUs.
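The abstract names the scaling pattern but not the code behind it. As an illustrative sketch only, the snippet below shows the single-GPU-to-many-GPU pattern in PyTorch (Facebook AI's primary framework) using DistributedDataParallel; the model, dataset, and hyperparameters are placeholder assumptions, not details from the talk.

```python
# Minimal multi-GPU training sketch with PyTorch DistributedDataParallel.
# Assumes launch via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # NCCL handles inter-GPU communication; across instances it can ride
    # high-bandwidth networking such as the 100 Gbps links the session mentions.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data stand in for a real training workload.
    model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)  # shards the dataset across all ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.cuda(), targets.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun (for example, `torchrun --nnodes=2 --nproc_per_node=8 train.py` across two GPU instances), the same script runs unchanged from one GPU to many; only the launch parameters grow, and the gradient all-reduce step is what benefits from the fast interconnect discussed in the session.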
