Real-time object detection with AI — YOLO
“You Only Look Once” detects common objects (roughly 9,000 classes in the YOLO9000 variant) and assigns each detection a probability. YOLO is a cutting-edge AI algorithm for detecting objects in images and videos. Its fundamental merits are speed, robustness to variation in objects, and customizability. Its speed comes from an efficiently engineered convolutional neural network (CNN). It handles variation in objects by training on and learning from new variations; having sufficient training data, meaning large volumes of images, improves performance. Customizability is achieved by manually labeling objects, ideally in 200 or more varied images, and transferring intelligence from prior learning on large datasets.
For more action, watch it live on a James Bond trailer.
Object detection is an important concept in the computer vision domain: the detection and classification of objects in image data. It has many applications, from autonomous vehicles to surveillance. It has even helped us unlock our phones just by looking at them, because tapping a finger was too much effort.
YOLO applies the same concept you use to build intelligence. You learn by detecting and classifying objects in your environment; so does the algorithm. YOLO is remarkably accurate, and it becomes more so the more images it sees.
It does a great job of learning representations of objects. When you learned what a bowl was, you learned that other objects that look like bowls, in various sizes and colours, are also bowls. You didn’t have to see every bowl in the world to recognize a new object as a bowl.
True to its name, YOLO only “looks once”: a single neural network pass divides the image into regions and predicts bounding boxes and class probabilities for each region. These bounding boxes are weighted by the predicted class probabilities to produce the final classifications and bounding boxes.
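To make the single-pass idea concrete, here is a rough sketch of decoding such a prediction tensor. The grid size, one-box-per-cell layout, and class count are simplifications for illustration, not YOLO’s exact configuration:

```python
import numpy as np

def decode_predictions(pred, num_classes, conf_threshold=0.5):
    """Decode a toy YOLO-style output tensor of shape (S, S, 5 + C):
    each grid cell predicts one box (x, y, w, h, objectness) plus C
    class probabilities. Real YOLO predicts several boxes per cell."""
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            x, y, w, h, objectness = cell[:5]
            class_probs = cell[5:5 + num_classes]
            # Class-specific confidence = objectness * class probability:
            # the "weighting by predicted class probabilities" step.
            scores = objectness * class_probs
            best_class = int(np.argmax(scores))
            if scores[best_class] >= conf_threshold:
                detections.append(
                    (best_class,
                     float(scores[best_class]),
                     (float(x), float(y), float(w), float(h)))
                )
    return detections

# A 2x2 grid with one confident detection in the top-left cell.
pred = np.zeros((2, 2, 5 + 3))
pred[0, 0] = [0.5, 0.5, 0.2, 0.3, 0.9, 0.1, 0.8, 0.1]
print(decode_predictions(pred, num_classes=3))
```

In practice a non-maximum-suppression step would also merge overlapping boxes, but the one-pass structure above is the core idea.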
The algorithm achieves an average precision of 78.6% at 40 frames per second (FPS), where precision is true positives / (true positives + false positives). YOLO is generalizable and also learns from the environment surrounding labeled objects. It supplies a probability that an object is present and draws a box around each detected object. The algorithm can even detect objects it hasn’t been trained on.
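The precision term in that metric is simple to compute; the counts below are invented for illustration:

```python
def precision(true_positives, false_positives):
    """Of all the boxes the model predicted, the fraction that were correct."""
    return true_positives / (true_positives + false_positives)

# e.g. 72 correct detections and 28 spurious ones
print(precision(72, 28))  # 0.72
```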
As you increase the number of predicted boxes, you get a higher IOU value (intersection over union, a measure of how closely the predicted box matches the ground-truth box). See below: the y-axis is the IOU value and the x-axis is the number of boxes.
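IOU itself is easy to compute for axis-aligned boxes. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a common but not universal convention):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap at all.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping on half their width share 1/3 of their union.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333333333333333
```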
YOLO is trained on vast, well-labeled datasets. It’s modeled on complex neural nets performing 8.52 billion operations rapidly. It utilizes hierarchical classification: for example, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier”, which is a type of “hunting dog”, which is a type of “dog”. Most other competitive models treat labels as mutually distinct.
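In YOLO’s hierarchical scheme, the absolute probability of a leaf label is the product of conditional probabilities along its path up the tree. A toy sketch of that idea, with an invented tree and made-up probabilities:

```python
# Toy hierarchy: each label maps to (parent, P(label | parent)).
# Labels follow the terrier example; probabilities are invented.
tree = {
    "dog": (None, 0.9),
    "hunting dog": ("dog", 0.5),
    "terrier": ("hunting dog", 0.6),
    "Norfolk terrier": ("terrier", 0.3),
}

def absolute_probability(label):
    """Chain conditionals up the tree, e.g.
    P(Norfolk terrier) = P(Norfolk terrier | terrier)
                         * P(terrier | hunting dog)
                         * P(hunting dog | dog) * P(dog)."""
    prob = 1.0
    while label is not None:
        parent, conditional = tree[label]
        prob *= conditional
        label = parent
    return prob

print(absolute_probability("Norfolk terrier"))  # 0.3 * 0.6 * 0.5 * 0.9
```

A model using such a tree can fall back to a confident coarse label (“dog”) even when the fine-grained breed is uncertain, which is what treating labels as distinct cannot do.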
It’s customizable and can predict custom objects with high accuracy. YOLO was customized to detect Microsoft’s HoloLens, and it performed well, as you can see below.
YOLO is trained on labeled images plus augmented data produced by manipulating those images: crops, saturation shifts, rotations, and such. Because it looks at entire images, it derives the context surrounding objects. The algorithm generalizes well by capturing the essential features and fundamentals of objects.
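A minimal numpy-only sketch of that kind of augmentation, assuming images are H × W × 3 float arrays in [0, 1] (the crop ratio and saturation range are illustrative, not YOLO’s actual training settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly cropped, flipped, saturation-shifted copy.
    Note: a real detection pipeline must also adjust the bounding-box
    labels to match the crop and flip."""
    h, w, _ = image.shape
    # Random 90% crop at a random position.
    ch, cw = int(h * 0.9), int(w * 0.9)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = image[top:top + ch, left:left + cw]
    # Random horizontal flip.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Saturation-like jitter: scale each pixel's distance from its gray value.
    gray = out.mean(axis=2, keepdims=True)
    factor = rng.uniform(0.7, 1.3)
    return np.clip(gray + factor * (out - gray), 0.0, 1.0)

image = rng.random((100, 120, 3))
print(augment(image).shape)  # (90, 108, 3)
```

Each call yields a slightly different training example from the same source image, which is how a modest labeled set is stretched into many variations.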
I encourage you to read the paper here for more details. This article provides an in-depth explanation and a programming tutorial for YOLO. Browse the code here.