Detecting safety helmets in real-time

Tue, May 10, 2022

Authors

Facundo Lezama

Lead Machine Learning Engineer

Agustín Castro

Machine Learning Engineer

At Tryolabs, we are passionate about building computer vision solutions that have real-world impact. Very often, this means integrating them to run on edge devices. In our journey to master AI on the edge, we have developed several projects in the past years, ranging from hackathon robots to community-aware ones such as MaskCam, an open-source smart camera built around the Jetson Nano. These projects push us to dive deep into state-of-the-art technologies and to better understand how to build complex solutions with high-performance requirements.

Collaboration projects are always among our priorities. We love to partner with great organizations to build awesome solutions. This time, we partnered with Seeed, a hardware innovation platform that works closely with technology providers of all scales to provide quality, affordable hardware. For example, they offer a wide variety of NVIDIA products on their Jetson Platform.

The goal was to leverage Seeed’s hardware, particularly their reComputer edge devices built around the Jetson Xavier NX 8GB module, and develop a computer vision analytics solution that tackles a challenging task in today’s Industry 4.0 field. More specifically, we picked the challenge of detecting safety helmets in real-time.

Introducing Industry 4.0

Industry 1, 2, 3... 4? One after another, industrial revolutions have redefined our recent history. First came mechanization, followed by electrification, and automation came third. Each of these revolutions built on the achievements of those that came before and brought novel technological advances, thereby earning the label of revolution. The fourth revolution, Industry 4.0, combines hyper-connectivity, automation, artificial intelligence and real-time data to bring physical and digital technologies together, creating a holistic and better-connected ecosystem for organizations. Supported by machines that keep getting smarter as they get access to more data, industry becomes more efficient, more productive and less wasteful.

Undoubtedly, places such as factories and construction sites have plenty of room for embracing Industry 4.0 and can benefit from its adoption. Abundant opportunities range from optimizing logistics within a factory to rethinking the supply chain to leveraging real-time information. But not all industries are keen to adopt big changes. Although Construction 4.0 is set to be the next big step in the construction industry, construction sites are still holding back on adoption. The great promise of the Construction 4.0 revolution lies in the almost complete automation of the entire project life cycle. This automation involves using digital twins at every step, from planning to operation, including design and construction.

When discussing construction, one has to recognize that it is one of the most dangerous industries in the US. Through the years, Personal Protective Equipment (PPE) has become a mandatory requirement on construction sites due to its importance to workers’ safety. PPE may include items such as safety glasses, earplugs, gloves, or helmets. Industry 4.0 also aims to integrate PPE with the Internet of Things (IoT) to better understand how the equipment is used and even take action when necessary, such as raising an alert when a worker enters a restricted zone to prevent potential dangers.

IoT is not the only technology that applies to this field; AI is also a great match. Given the advancements in deep learning algorithms and the immense amount of data created every day, AI techniques have expanded to diverse tasks and environments, and countless industries have started to adopt them. Computer vision has made enormous progress in recent years, producing great solutions for scene understanding, and construction sites are a natural place to embrace them. For instance, this technology can give insights into PPE compliance. PPE is essential for workers and its use is mandatory in many situations, so monitoring its adoption is key to minimizing risks to both workers’ health and employers’ liability.

Computer vision on the edge

It seems like a long time ago that embedded devices could only perform tasks with low computational cost. Using classical computer vision techniques and writing software in low-level programming languages used to be the only option for developers, and even then the quality of the results was limited by what those techniques could achieve. Through the years, both the available software and hardware have improved. Nowadays, these devices are much more powerful and affordable. State-of-the-art deep learning models have made their way to these devices, allowing for great detection quality with little to no fine-tuning. Yes, you read that right. Machine learning on edge devices has been a thing for a while now. The constant improvement in processing capacity also lets developers write their software faster, since they can now use high-level programming languages (such as Python).

The challenge

Safety helmets help prevent and minimize injuries on construction sites and in factories. This is why their use is mandatory in most of the world. However, shortcomings still exist in the continuous monitoring of their use. In most scenarios, this process is manual, making it very expensive and inefficient. Current technology plants the seed of curiosity to seek and explore better alternatives given our current capabilities and resources.

By partnering with technology providers from hardware to the cloud, Seeed offers a wide array of hardware platforms and sensor modules ready to be integrated with existing IoT platforms. The proposed plan consists of creating an end-to-end solution to monitor the use of safety helmets in real-time and deploying it to a Jetson Xavier NX module provided by Seeed.

Hardware components

The Jetson Xavier NX is a small but powerful module suited for AI applications in embedded and edge devices. It is equipped with a 384-core NVIDIA Volta GPU, a 6-core Carmel ARM CPU, and two NVIDIA Deep Learning Accelerators (NVDLA). It can attain an AI performance of 21 TOPS with a power consumption of 20W (or 14 TOPS in a low-power mode that consumes as little as 10W). These specifications, combined with 8GB of LPDDR4x memory with over 59.7GB/s of bandwidth, make this module an ideal platform for running AI networks with accelerated libraries for deep learning and computer vision. Together with the Jetson Xavier NX, we used a Logitech C922 USB webcam to test the final solution in real-time.

Image of an NVIDIA Jetson Xavier NX device, and image of a Logitech C922 USB webcam.

The model

YOLOv5 is one of the most used algorithms for object detection. Not only is it capable of computing extremely accurate detections, but it also runs lightning fast, allowing its users to create real-time object detection applications. Since its beginning in 2016 with Joseph Redmon’s publication, the YOLO algorithm has been famous for its performance. Its incredibly small size immediately paved the way for mobile devices. The weights of a trained YOLOv5 model are notably smaller than YOLOv4’s, making it easier to deploy YOLOv5 models to embedded devices. YOLOv5 is approximately 88% smaller in size than YOLOv4. When running on an NVIDIA Tesla P100 GPU, YOLOv5 can detect objects at 140 FPS, compared to its predecessor’s maximum of 50 FPS.

A YOLOv5 Medium architecture was trained to continuously monitor the use of safety helmets on construction sites and in factories. The detector locates the faces of the people in a frame and classifies them into the categories “helmet” and “no helmet”. For a specific person in a video, this category should be highly correlated across consecutive frames, so Tryolabs’ open-source tracking library Norfair allows us to get a more robust and less noisy criterion for this classification. By leveraging video tracking, we implemented a voting system that uses the labels associated with several consecutive detections to decide more confidently whether a person is wearing a helmet. Evidence from several frames is therefore needed to classify each person, and a single misclassified detection is not enough to change the category in which a person is placed.
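The voting idea described above can be sketched in a few lines. This is only an illustration under assumptions: the article does not publish its voting rule, so the sliding-window majority vote, the window size, and the names `LabelVoter` and `classify` are all hypothetical. The track IDs are assumed to come from a tracker (Norfair in the article); no Norfair calls are shown here.

```python
from collections import Counter, deque


class LabelVoter:
    """Smooths the per-frame labels of one tracked person by majority vote."""

    def __init__(self, window=15, initial="unknown"):
        # Keep only the most recent `window` labels (window size is an assumption).
        self.votes = deque(maxlen=window)
        self.initial = initial

    def update(self, label):
        """Record the label detected in the current frame and return the vote."""
        self.votes.append(label)
        return self.current()

    def current(self):
        """Return the most frequent label in the window."""
        if not self.votes:
            return self.initial
        return Counter(self.votes).most_common(1)[0][0]


# One voter per track ID; the IDs are assumed to be provided by the tracker.
voters = {}


def classify(track_id, frame_label, window=15):
    """Hypothetical helper: feed one detection label into the person's voter."""
    voter = voters.setdefault(track_id, LabelVoter(window=window))
    return voter.update(frame_label)
```

With a window of 5, for example, a person seen with a helmet in five consecutive frames stays in the “helmet” category after one or two stray “no helmet” detections, and only switches once “no helmet” holds the majority of the window.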

The dataset

Quintillions of bytes of data are created every day, and AI models are taking advantage of this fact. The number of images uploaded to the internet daily has made possible the existence of public datasets for a wide variety of applications. Of course, having access to these images is not the only requirement for creating a dataset. When working with supervised learning, it also takes time and human effort to label each image with the right annotations so that our computers can recognize the patterns that we need them to learn. To make a detector that works well in different environments, the images of the dataset must be taken from many diverse locations and under different lighting conditions. That way, our model can learn patterns that generalize well rather than characteristics unique to a particular scene.

Fortunately for us, public datasets are already available to distinguish faces with and without helmets, such as GDUT-Hardhat Wearing Detection, which we selected for this project. This dataset includes 3869 images, of which 2916 are used for the training set, another 635 for validation, and the remaining 318 are set apart for testing.
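As a rough illustration, a split with these sizes could be produced as follows. Only the three counts come from the article; the seeded random shuffle and the name `split_dataset` are assumptions, since the article does not say how the images were assigned to each subset.

```python
import random

# Subset sizes reported in the article (they add up to 3869 images).
SPLIT_SIZES = {"train": 2916, "val": 635, "test": 318}


def split_dataset(image_ids, sizes=SPLIT_SIZES, seed=0):
    """Shuffle the IDs reproducibly and cut them into consecutive subsets."""
    ids = list(image_ids)
    if len(ids) != sum(sizes.values()):
        raise ValueError("split sizes must add up to the number of images")
    random.Random(seed).shuffle(ids)  # assumption: a random, seeded assignment
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = ids[start:start + size]
        start += size
    return splits


splits = split_dataset(range(3869))
for name, subset in splits.items():
    print(f"{name}: {len(subset)} images ({len(subset) / 3869:.1%})")
```

The proportions work out to roughly three quarters of the images for training, about a sixth for validation, and the rest for testing.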

The results: a look under the hood

We compared the YOLOv5 and Faster R-CNN architectures on a training job using this dataset, with mostly default settings. The training consisted of 26 epochs using multiprocessing and two NVIDIA GeForce RTX 2080 Ti GPUs. YOLOv5 clearly outperformed Faster R-CNN, obtaining better metrics in a much shorter time. In terms of inference time, both models performed similarly, taking around 0.08 seconds per image on the edge device (12.5 FPS). The difference between the models lies in the training time and in the final metrics that indicate the quality of the detections. YOLOv5 took about half a minute per training epoch, whereas Faster R-CNN required more than 3 minutes to finish an epoch. As a result, the YOLOv5 training job was completed in about 14 minutes, while the Faster R-CNN job took around an hour and 20 minutes. Furthermore, the mAP (Mean Average Precision) of the YOLOv5 model was slightly better than that of Faster R-CNN (an absolute difference of about 3.5 percentage points on both mAP50 and mAP50:95), and it was reached in significantly less time.

Comparison graphs between YOLOv5 mAP50 and Faster R-CNN mAP50.

Comparison graphs between YOLOv5 mAP50:95 and Faster R-CNN mAP50:95.
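For readers unfamiliar with the notation, mAP50 counts a detection as correct when its box overlaps a ground-truth box with an intersection over union (IoU) of at least 0.5, while mAP50:95 averages the metric over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU building block and the thresholds; it is not a full mAP implementation, and the names `iou` and `is_match` are illustrative rather than taken from either training codebase.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


MAP50_THRESHOLD = 0.5
# Ten thresholds: 0.50, 0.55, ..., 0.95 (rounded to avoid float noise).
MAP50_95_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def is_match(predicted_box, truth_box, threshold=MAP50_THRESHOLD):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return iou(predicted_box, truth_box) >= threshold
```

Because the stricter thresholds demand tighter boxes, mAP50:95 is always the lower of the two numbers, which is consistent with the values reported for both models.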

Performance comparison

Metric	YOLOv5	Faster R-CNN
Training time	14m 12s (10m 2 GPUs)	1h 19m 27s
Epoch time	33s	3m 10s
Max mAP50	0.9402	0.9065
Max mAP50:95	0.6220	0.5865
Time to mAP50=0.88	2m 16s (epoch 4)	9m 30s (epoch 3)
Time to mAP50:95=0.58	7m 54s (epoch 14)	50m 49s (epoch 16)
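As a quick sanity check, the training times in the table can be compared against the per-epoch times and the 26 epochs mentioned earlier. The snippet below only redoes that arithmetic with the published values; the helper `to_seconds` is illustrative, and small gaps between estimated and reported totals are expected because the epoch times are rounded.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


EPOCHS = 26  # number of training epochs stated in the article

# Values copied from the performance comparison table.
yolo = {"epoch": to_seconds(seconds=33),
        "total": to_seconds(minutes=14, seconds=12)}
frcnn = {"epoch": to_seconds(minutes=3, seconds=10),
         "total": to_seconds(hours=1, minutes=19, seconds=27)}

for name, run in (("YOLOv5", yolo), ("Faster R-CNN", frcnn)):
    estimate = EPOCHS * run["epoch"]
    print(f"{name}: {estimate} s estimated from epoch time, "
          f"{run['total']} s reported")

# Ratio of the reported end-to-end training times.
speedup = frcnn["total"] / yolo["total"]
print(f"YOLOv5 trained about {speedup:.1f}x faster")
```

Both estimates land within a few percent of the reported totals, and the ratio of the two training times comes out at a little over five and a half.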

Optimizing the inference pipeline with NVIDIA DeepStream

NVIDIA has been one of the key players in the industry for a while now, not only as the leading manufacturer of GPUs but also as the provider of a wide range of tools for building fast, complex and scalable AI solutions. For instance, the NVIDIA Metropolis framework, and more precisely the DeepStream SDK, allows developers to build pipelines for AI-based video analytics while shortening development time and achieving outstanding throughput for applications such as object detection, segmentation, and image classification.

By leveraging the DeepStream SDK, the inference time dropped to a staggering 0.012 seconds per image (82.8 FPS) on the same NVIDIA Jetson Xavier NX.
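The relationship between per-image latency and throughput, and the resulting speedup over the 0.08 seconds per image measured before the optimization, can be checked with a couple of lines. The numbers are the ones quoted in the article; the function name `fps_from_latency` is illustrative.

```python
def fps_from_latency(seconds_per_image):
    """Frames per second achieved when each image takes the given time."""
    return 1.0 / seconds_per_image


baseline_latency = 0.08  # seconds per image before DeepStream (from the article)
deepstream_fps = 82.8    # throughput reported with DeepStream

baseline_fps = fps_from_latency(baseline_latency)
speedup = deepstream_fps / baseline_fps

print(f"Baseline: {baseline_fps:.1f} FPS")
print(f"DeepStream: {deepstream_fps:.1f} FPS ({speedup:.1f}x faster)")
```

In other words, the optimized pipeline processes between six and seven times as many frames per second on the same device.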

Finally, the demos!

Conclusions

Monitoring helmet usage in different scenarios yields useful insights for taking preventive action, saving time and resources. It is possible to monitor the usage of safety helmets in different environments from an edge device using state-of-the-art detectors, creating a more efficient and affordable alternative to the usual, coarse manual process. We have been working on predictive maintenance projects for a while now and love to dive into challenging projects that take AI solutions to the edge. If you have a business need that you think could benefit from AI on the edge, please don't hesitate to contact us.