SONY Breaks ResNet-50 Training Record with NVIDIA V100 Tensor Core GPUs

Researchers from SONY today announced a new speed record for training ResNet-50 on ImageNet: 224 seconds (three minutes and 44 seconds) at 75 percent top-1 validation accuracy, using 2,176 NVIDIA Tesla V100 Tensor Core GPUs. This is the fastest ResNet-50 training time published to date.

The team also achieved over 90% GPU scaling efficiency with 1,088 NVIDIA Tesla V100 Tensor Core GPUs.

GPU scaling efficiency with ImageNet/ResNet-50 training

Work              Processor         Interconnect        GPU scaling efficiency
Goyal et al. [1]  Tesla P100 x256   50Gbit Ethernet     ∼90%
Akiba et al. [5]  Tesla P100 x1024  Infiniband FDR      80%
Jia et al. [6]    Tesla P40 x2048   100Gbit Ethernet    87.90%
This work         Tesla V100 x1088  Infiniband EDR x2   91.62%
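The article does not define "GPU scaling efficiency". A common reading, assumed here, is the measured speedup over a baseline configuration divided by the ideal linear speedup. The function name and the throughput figures in the usage lines are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(base_gpus, base_throughput, n_gpus, throughput):
    """Return measured speedup divided by ideal (linear) speedup.

    Throughput is any consistent rate, e.g. images per second.
    A value of 1.0 means perfect linear scaling from the baseline.
    """
    if min(base_gpus, n_gpus) <= 0 or base_throughput <= 0:
        raise ValueError("GPU counts and baseline throughput must be positive")
    measured_speedup = throughput / base_throughput
    ideal_speedup = n_gpus / base_gpus
    return measured_speedup / ideal_speedup


# Hypothetical numbers for illustration only (not from the paper):
# 1 GPU processes 100 images/s, 4 GPUs together process 360 images/s.
efficiency = scaling_efficiency(1, 100.0, 4, 360.0)
print(f"scaling efficiency: {efficiency:.2%}")
```

Under this definition, a figure such as 91.62% on 1,088 GPUs would mean the cluster delivers roughly nine tenths of the throughput that perfectly linear scaling from the baseline would give.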

“As the size of datasets and deep neural network (DNN) model for deep learning increase, the time required to train a model is also increasing,” the SONY team wrote in their paper.

Training time and top-1 validation accuracy with ImageNet/ResNet-50

Work          Batch Size  Processor         DL Library  Time       Accuracy
He et al.     256         Tesla P100 x8     Caffe       29 hours   75.30%
Goyal et al.  8K          Tesla P100 x256   Caffe2      1 hour     76.30%
Smith et al.  8K→16K      full TPU Pod      TensorFlow  30 mins    76.10%
Akiba et al.  32K         Tesla P100 x1024  Chainer     15 mins    74.90%
Jia et al.    64K         Tesla P40 x2048   TensorFlow  6.6 mins   75.80%
This work     34K→68K     Tesla V100 x2176  NNL         224 secs   75.03%

To achieve the record, the researchers addressed two primary issues with large-scale distributed training: the instability of large mini-batch training and the overhead of synchronization communication, the team said.

“We adopt a batch size control technique to address large mini-batch instability,” the researchers said. “We [also] develop a 2D-Torus all-reducing scheme to efficiently exchange gradients across GPUs.”
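The article reports only that the batch size moved from 34K to 68K during training. It does not say when or how the switch happens. As a rough sketch, batch size control can be pictured as a piecewise-constant schedule keyed on the epoch. The function name, the switch epoch, and the reading of "K" as 1024 are all assumptions made for illustration.

```python
def batch_size_for_epoch(epoch, schedule):
    """Return the global mini-batch size to use at a given epoch.

    schedule is a list of (start_epoch, batch_size) pairs sorted by
    start_epoch; the last pair whose start_epoch <= epoch applies.
    """
    if not schedule or schedule[0][0] != 0:
        raise ValueError("schedule must start at epoch 0")
    size = schedule[0][1]
    for start_epoch, batch_size in schedule:
        if epoch >= start_epoch:
            size = batch_size
        else:
            break
    return size


K = 1024           # assumption: "K" in the table means 1024 samples
SWITCH_EPOCH = 30  # hypothetical placeholder; the article gives no switch point
SCHEDULE = [(0, 34 * K), (SWITCH_EPOCH, 68 * K)]

for epoch in (0, SWITCH_EPOCH - 1, SWITCH_EPOCH):
    print(epoch, batch_size_for_epoch(epoch, SCHEDULE))
```

Starting with the smaller batch and growing it later is one way to keep early training stable while still exploiting more parallelism once optimization has settled. The actual criteria used by the SONY team are described in their paper rather than in this article.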

The 2D-Torus serves as an efficient communication topology that reduces the communications overhead of a collective operation.
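To make the idea concrete, here is a single-process simulation of a grid-structured all-reduce. It assumes the scheme runs in three phases: a reduce-scatter along each row, an all-reduce along each column, and an all-gather along each row. The article does not spell out these phases, so treat this as an illustrative sketch rather than the team's NCCL implementation. The function name and grid layout are hypothetical.

```python
def torus_all_reduce(grads, rows, cols):
    """Simulate a grid-structured all-reduce on rows x cols workers.

    grads[r][c] is the gradient vector (a list of numbers) held by
    worker (r, c). Returns a grid of the same shape in which every
    worker holds the element-wise sum over all workers.
    """
    n = len(grads[0][0])
    if n % cols != 0:
        raise ValueError("vector length must be divisible by cols")
    chunk = n // cols

    # Phase 1: reduce-scatter within each row.
    # Worker (r, c) ends up owning chunk c, summed over its row.
    partial = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            partial[r][c] = [
                sum(grads[r][k][i] for k in range(cols)) for i in range(lo, hi)
            ]

    # Phase 2: all-reduce within each column.
    # Each column only exchanges its own chunk (1/cols of the vector).
    for c in range(cols):
        total = [sum(partial[r][c][i] for r in range(rows)) for i in range(chunk)]
        for r in range(rows):
            partial[r][c] = list(total)

    # Phase 3: all-gather within each row to rebuild the full vector.
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        full = [x for c in range(cols) for x in partial[r][c]]
        for c in range(cols):
            result[r][c] = list(full)
    return result


# Usage: a 2 x 3 grid where each worker holds a distinct 6-element vector.
rows, cols = 2, 3
grads = [
    [[float((r * cols + c) * 10 + i) for i in range(6)] for c in range(cols)]
    for r in range(rows)
]
reduced = torus_all_reduce(grads, rows, cols)
print(reduced[0][0])
```

In this layout the column phase moves only a fraction of the gradient per worker, and each collective spans just one row or one column instead of the whole cluster. That is one plausible source of the reduced communication overhead the article mentions.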

Software: “We used Neural Network Libraries (NNL) and its CUDA extension as a DNN training framework,” the team said.  “We also used development branches based on NNL version 1.0.0. CUDA version 9.0 with cuDNN version 7.3.1 is employed to train DNN in GPUs.”

“We used NCCL version 2.3.5 and OpenMPI version 2.1.3 as communication libraries. The 2D-Torus all-reduce is implemented with NCCL2. The above software is packaged in a Singularity container. We used Singularity version 2.5.2 to run distributed DNN training,” the team wrote in their paper.
