Wed, Oct 9, 2019
Humans are generating and collecting more data than ever. We have devices in our pockets that facilitate the creation of huge amounts of data, such as photos, gps coordinates, audio, and all kinds of personal information we consciously and unconsciously reveal.
Moreover, not only are we individuals generating data for personal reasons, but we’re also collecting data unbeknownst to us from traffic and mobility control systems, video surveillance units, satellites, smart cars, and an infinite array of smart devices.
This trend is here to stay and will continue to rise exponentially. In terms of data points, the International Data Corporation (IDC) predicts that the collective sum of the world’s data will grow from 33 zettabytes (ZB) in 2019 to 175 ZB by 2025, an annual growth rate of 61%.
While we’ve been processing data, first in data centers and then in the cloud, these solutions are not suitable for highly demanding tasks with large data volumes. Network capacity and speed are pushed to the limit and new solutions are required. This is the beginning of the era of edge computing and edge devices.
In this report, we'll benchmark five novel edge devices, using different frameworks and models, to see which combinations perform best. In particular, we'll focus on performance outcomes for machine learning on the edge.
Edge computing consists of delegating data processing tasks to devices on the edge of the network, as close as possible to the data sources. This enables real-time data processing at a very high speed, which is a must for complex IoT solutions with machine learning capabilities. On top of that, it mitigates network limitations, reduces energy consumption, increases security, and improves data privacy.
Under this new paradigm, the combination of specialized hardware and software libraries optimized for machine learning on the edge results in cutting-edge applications and products ready for mass deployment.
The biggest challenges to building these amazing applications are posed by audio, video, and image processing tasks. Deep learning techniques have proven to be highly successful in overcoming these difficulties.
As an example, let’s take self-driving cars. Here, you need to quickly and consistently analyze incoming data, in order to decipher the world around you and take action within a few milliseconds. Addressing that time constraint is why we cannot rely on the cloud to process the stream of data but instead must do it locally.
The downside of doing it locally is that the hardware is not as powerful as a super computer in the cloud, and we cannot compromise on accuracy or speed.
The solution to this is either stronger, more efficient hardware, or less complex deep neural networks. To obtain the best results, a balance of the two is essential.
Therefore, the real question is:
Which edge hardware and what type of network should we bring together in order to maximize the accuracy and speed of deep learning algorithms?
In our quest to identify the optimal combination of the two, we compared several state-of-the-art edge devices in combination with different deep neural network models.
Based on what we think is the most innovative use case, we set out to measure inference throughput in real-time via a one-at-a-time image classification task, so as to get an approximate frames-per-second score.
To accomplish this, we evaluated top-1 inference accuracy across all categories of a specific subset of ImageNetV2 comparing them to some ConvNets models and, when possible, using different frameworks and optimized versions.
While there has been much effort invested over the last few years to improve existing edge hardware, we chose to experiment with these new kids on the block
We included the Raspberry Pi and the NVIDIA RTX 2080 Ti so as to be able to compare the tested hardware against well-known systems, one cloud-based and one edge-based.
The lower bound was a no-brainer. Here at Tryolabs, we design and train our own deep learning models. Because of this, we have a lot of computing power at our disposal. So, we used it. To set this lower bound on inference times, we ran the tests on a 2080 Ti NVIDIA GPU. However, because we were only going to use it as a reference point, we ran the tests using basic models, with no optimizations.
For the upper bound, we went with the defending champion, the most popular single-board computer: the Raspberry Pi 3B.
For almost all the benchmarks, we used publicly available pre-trained models, which we run with different frameworks. With respect to the NVIDIA Jetson, we tried the TensorRT optimization; for the Raspberry, we used Tensor Flow and PyTorch variants; while for Coral devices, we implemented the Edge TPU engine versions of the S, M, and L EfficientNets models. Regarding Intel devices, we used the ResNet-50 compiled with OpenVINO Toolkit. Finally, for Gyrfalcon Plai Plug chip we used their private PyTorch ResNet 50 pretrained model. For this device, all convolutional layers run in the chip but FC and other type of layers run in the CPU host. For this experiment we measured entire inference processing time (chip + CPU), using a MacBook Pro with 2 GHz Quad-Core Intel Core i5 and 16 GB of RAM.
We ran the inference on each image once, saved the inference time, and then found the average. We calculated the top-1 accuracy from all tests, as well as the top-5 accuracy for certain models.
Top-1 accuracy: this is conventional accuracy, meaning that the model’s answer (the one with the highest probability) must equal the exact expected answer.
Top-5 accuracy: means that any one of the model’s top five highest-probability answers must match the expected answer.
Something to keep in mind when comparing the results: for fast device-model combinations, we ran the tests incorporating the entire dataset, whereas we only used parts of datasets for the slower combinations.
The dashboards below display the metrics obtained from the experiments. Due to the large difference in inference times across models and devices, the parameters are shown in logarithmic scale.
In terms of inference time, the winner is the Jetson Nano in combination with ResNet-50, TensorRT, and PyTorch. It finished in 2.67 milliseconds, which is 375 frames per second.
This result was surprising since it outperformed the inference rate reported by NVIDIA by a factor of 10x. This difference in results is most likely related to the fact that NVIDIA used TensorFlow instead of PyTorch.
Coming in second was the combination of Coral Dev. Board with EfficientNet-S. It finished in 5.42 milliseconds, which is 185 frames per second.
These results correspond with the 5.5 milliseconds and 182 frames per second, promised by Google.
Even though speed was high for this combination, accuracy was not. We couldn't acquire the exact validation set used by Google for accuracy reporting, but one hypothesis is that they used the image preprocessing transformations differently than we did. Since quantized 8-bit models are very sensitive to image preprocessing, this could have had a major impact on the results.
The best results in terms of accuracy came from the Jetson Nano in combination with TF-TRT and EfficientNet-B3, which attained an accuracy of over 85%. However, these results are relative, since we trained some models using a bigger dataset than others.
We can see that the accuracy rate is higher when we feed the models smaller datasets, and lower when the entire dataset is used. This results from the fact that we didn't randomly sort the smaller data sets and hence the images were not adequately balanced.
Concerning the usability of these devices, developers will note some major differences.
The Jetson was the most flexible when it came to selecting and employing precompiled models and frameworks. Intel sticks come in second since they provide good libraries, many models and cool projects. Moreover, the sticks have massively improved between the first and second editions. The only drawback is that their vast library, OpenVINO is only supported on Ubuntu 16.04 and not by later Linux OS versions.
Compared to Jetson and Intel sticks, Coral devices present some limitations. If you want to run non-official models on it, you have to convert them to TensorFlow Lite, then quantize and compile them for Edge TPU. Depending on the model, this conversion might not be feasible. Nevertheless, we expect improvements for this Google device with future generations.
Regarding GTI Plai Plug chips, the company not only provides the hardware, but also an SDK to run your apps/demos inferences and MDK (Model Development Kit) to develop, train and convert your own models. This USB accelerator is similar to previously explained Coral device, implementing and converting your own model may present a pronounced learning curve. The biggest drawback of this technology is that you cannot load, quantize and convert an official framework model on the fly. You will have to implement that model using GTI PyTorch/TF/Caffe layers and train from scratch. The documentation is very clear and explains the entire development workflow.
The research presented here is based on our exploration of state-of-the-art edge computing devices designed for deep learning algorithms.
We found that the Jetson Nano and Coral Dev. Board performed very well in terms of inference time.
In terms of accuracy, the Jetson Nano once again achieved great results, though the results were relative.
Given the overall performance of the Jetson Nano, it was our clear winner. 🏆
However, we must mention that we couldn't test the Jetson Nano and Coral with the same model, due to their different design. We believe that each device will have its own best-case scenario, depending on the specific task to be completed.
We encourage you to perform a detailed benchmarking as it pertains to your specific tasks, and share your results and conclusions in the comments section below. Further research of interest could include the design and training of your own model, utilizing quantization-aware training.