Researchers from SONY today announced a new speed record for training ImageNet/ResNet-50: 224 seconds (three minutes and 44 seconds) at 75 percent top-1 validation accuracy, using 2,176 NVIDIA Tesla V100 Tensor Core GPUs. This is the fastest reported training time published for ResNet-50.
The team also achieved over 90% GPU scaling efficiency with 1,088 NVIDIA Tesla V100 Tensor Core GPUs.
GPU scaling efficiency with ImageNet/ResNet-50 training
| ||Processor||Interconnect||GPU scaling efficiency|
|Goyal et al. ||Tesla P100 x256||50Gbit Ethernet||∼90%|
|Akiba et al. ||Tesla P100 x1024||Infiniband FDR||80%|
|Jia et al. ||Tesla P40 x2048||100Gbit Ethernet||87.90%|
|This work||Tesla V100 x1088||Infiniband EDR x2||91.62%|
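The article does not define "GPU scaling efficiency". One common definition, assumed here, is the measured throughput on N GPUs divided by N times the throughput of a single GPU. The sketch below uses a hypothetical function name and made-up throughput figures purely to illustrate the arithmetic; they are not measurements from the paper.

```python
def scaling_efficiency(throughput_n, n_gpus, throughput_single):
    """Ratio of measured throughput to ideal linear scaling.

    Assumed definition: efficiency = T_N / (N * T_1), where T_N is the
    throughput (e.g. images/sec) on N GPUs and T_1 is that of one GPU.
    """
    if n_gpus <= 0 or throughput_single <= 0:
        raise ValueError("n_gpus and throughput_single must be positive")
    return throughput_n / (n_gpus * throughput_single)


def equivalent_ideal_gpus(efficiency, n_gpus):
    """How many perfectly scaling GPUs would deliver the same throughput."""
    return efficiency * n_gpus


if __name__ == "__main__":
    # Hypothetical numbers: one GPU at 100 images/sec, ten GPUs at 900 images/sec.
    print(f"{scaling_efficiency(900.0, 10, 100.0):.2%}")
    # Applying the table's 91.62% figure to 1,088 GPUs.
    print(f"{equivalent_ideal_gpus(0.9162, 1088):.1f}")
```

Under this definition, 91.62% efficiency on 1,088 GPUs corresponds to the throughput of roughly 997 ideally scaling GPUs.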
“As the size of datasets and deep neural network (DNN) model for deep learning increase, the time required to train a model is also increasing,” the SONY team wrote in their paper.
Training time and top-1 validation accuracy with ImageNet/ResNet-50
| ||Batch Size||Processor||DL Library||Time||Accuracy|
|He et al.||256||Tesla P100 x8||Caffe||29 hours||75.30%|
|Goyal et al.||8K||Tesla P100 x256||Caffe2||1 hour||76.30%|
|Smith et al.||8K→16K||full TPU Pod||TensorFlow||30 mins||76.10%|
|Akiba et al.||32K||Tesla P100 x1024||Chainer||15 mins||74.90%|
|Jia et al.||64K||Tesla P40 x2048||TensorFlow||6.6 mins||75.80%|
|This work||34K→68K||Tesla V100 x2176||NNL||224 secs||75.03%|
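To put the training times in the table above on a common scale, the following sketch converts each entry to seconds and computes the ratio against the 224-second result. The times are transcribed from the table; the function name is illustrative. The runs use different hardware, batch sizes, and libraries, so the ratios are a rough comparison rather than a controlled benchmark.

```python
# Training times from the table above, converted to seconds.
TIMES_SEC = {
    "He et al.": 29 * 3600,
    "Goyal et al.": 1 * 3600,
    "Smith et al.": 30 * 60,
    "Akiba et al.": 15 * 60,
    "Jia et al.": 6.6 * 60,
    "This work": 224,
}


def speedup_over(baseline, times=TIMES_SEC, target="This work"):
    """Return how many times faster `target` trained than `baseline`."""
    return times[baseline] / times[target]


if __name__ == "__main__":
    for name in TIMES_SEC:
        if name != "This work":
            print(f"{name}: {speedup_over(name):.1f}x")
```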
To achieve the record, the researchers addressed two primary issues with large-scale distributed training: the instability of large mini-batch training and the communication overhead of gradient synchronization, the team said.
“We adopt a batch size control technique to address large mini-batch instability,” the researchers said. “We [also] develop a 2D-Torus all-reducing scheme to efficiently exchange gradients across GPUs.”
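The article gives only the endpoints of the batch size change (34K→68K in the table), not the schedule itself. As a minimal sketch of what a stepwise batch size schedule could look like, the code below starts at 34K and switches to 68K at a given epoch. The switch epoch of 30, the reading of "K" as 1,024, and all names are assumptions for illustration, not values taken from the paper.

```python
K = 1024  # assumption: "34K" and "68K" are multiples of 1,024

# Each entry is (first epoch at which the size applies, mini-batch size).
# The switch epoch (30) is hypothetical.
SCHEDULE = [(0, 34 * K), (30, 68 * K)]


def batch_size_at(epoch, schedule=SCHEDULE):
    """Return the mini-batch size in effect at `epoch`.

    `schedule` must be sorted by start epoch; the last entry whose
    start epoch is <= `epoch` wins.
    """
    if epoch < schedule[0][0]:
        raise ValueError("epoch precedes the first schedule entry")
    current = schedule[0][1]
    for start, size in schedule:
        if epoch >= start:
            current = size
    return current


if __name__ == "__main__":
    for epoch in (0, 29, 30, 89):
        print(epoch, batch_size_at(epoch))
```

Starting with the smaller batch and enlarging it later is one way to keep early training stable while still benefiting from very large batches once the loss surface is better behaved.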
The 2D-Torus serves as an efficient communication topology that reduces the communications overhead of a collective operation.
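The article does not spell out the steps of the 2D-Torus all-reduce. One way to picture a scheme of this kind, assuming the GPUs are laid out in a rows × cols grid, is in three phases: a reduce-scatter along each row, an all-reduce along each column, and an all-gather along each row. The sketch below simulates those phases with plain Python lists to show that every GPU ends up with the full sum. It does not use NCCL, it models data movement only (not ring-level messaging), and the function name is hypothetical.

```python
def torus_all_reduce(grads, rows, cols):
    """Simulate a three-phase all-reduce on a rows x cols grid.

    grads[r][c] is the gradient vector (list of numbers) held by the GPU
    at row r, column c. Returns a grid of the same shape in which every
    GPU holds the elementwise sum over all GPUs.
    """
    n = len(grads[0][0])
    if n % cols != 0:
        raise ValueError("vector length must be divisible by cols")
    chunk = n // cols

    # Phase 1: reduce-scatter within each row.
    # GPU (r, c) ends up with chunk c of the row-r sum.
    partial = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            partial[r][c] = [
                sum(grads[r][k][i] for k in range(cols)) for i in range(lo, hi)
            ]

    # Phase 2: all-reduce within each column.
    # GPU (r, c) ends up with chunk c of the global sum.
    reduced = [[None] * cols for _ in range(rows)]
    for c in range(cols):
        col_sum = [sum(partial[r][c][i] for r in range(rows)) for i in range(chunk)]
        for r in range(rows):
            reduced[r][c] = list(col_sum)

    # Phase 3: all-gather within each row.
    # Every GPU collects all chunks and holds the full global sum.
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        full = [v for c in range(cols) for v in reduced[r][c]]
        for c in range(cols):
            result[r][c] = list(full)
    return result


if __name__ == "__main__":
    rows, cols, n = 2, 3, 6
    grads = [[[r * cols + c + i for i in range(n)] for c in range(cols)]
             for r in range(rows)]
    print(torus_all_reduce(grads, rows, cols)[0][0])
```

In this arrangement, the column phase only moves one chunk (1/cols of the vector) per GPU, and each ring spans only one row or one column rather than all GPUs, which is where a grid layout can save communication time compared with a single large ring.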
Software: “We used Neural Network Libraries (NNL) and its CUDA extension as a DNN training framework,” the team said. “We also used development branches based on NNL version 1.0.0. CUDA version 9.0 with cuDNN version 7.3.1 is employed to train DNN in GPUs.”
“We used NCCL version 2.3.5 and OpenMPI version 2.1.3 as communication libraries. The 2D-Torus all-reduce is implemented with NCCL2. The above software is packaged in a Singularity container. We used Singularity version 2.5.2 to run distributed DNN training,” the team wrote in their paper.