This is a multi-part post on image recognition and object recognition.
In this part, we briefly explain image recognition using traditional computer vision techniques, by which I mean techniques that are not based on deep learning. I call them traditional because they are rapidly being replaced by deep learning-based techniques. That said, many applications still successfully use traditional computer vision approaches, and many of these algorithms are available in computer vision libraries like OpenCV and work great out of the box.
This series is organized as follows.
- Image Recognition Using Traditional Machine Vision Techniques: Part 1
- Histogram of oriented gradients: Part 2
- Code example for image recognition: Part 3
- Train a better eye detector: Part 4a
- Object recognition using traditional machine vision techniques: Part 4b
- How to train and test your own OpenCV object detector: Part 5
- Image Recognition with Deep Learning: Part 6
- Introduction to neural networks
- Understanding feedforward neural networks
- Image recognition with convolutional neural networks
- Object Recognition with Deep Learning: Part 7
A brief history of image recognition and object recognition
Our story begins in 2001, the year Paul Viola and Michael Jones invented an efficient algorithm for face detection. Their demo, which showed faces being detected in real time on a webcam feed, was the most stunning demonstration of computer vision and its potential at the time. It was soon implemented in OpenCV, and face detection became synonymous with the Viola and Jones algorithm.
Every few years, a new idea comes along that forces people to pause and take note. In object detection, that idea came in a 2005 paper by Navneet Dalal and Bill Triggs. Their feature descriptor, Histograms of Oriented Gradients (HOG), significantly outperformed existing algorithms for pedestrian detection.
Every ten years or so, a new idea comes along that is so effective and powerful that you abandon everything that came before it and embrace it wholeheartedly. Deep learning is the idea of this decade. Deep learning algorithms had been around for a long time, but their resounding success at the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) made them popular in computer vision. In that competition, a deep learning-based algorithm by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton shook the computer vision world with a whopping 85% accuracy, 11% better than the second-place entry! At ILSVRC 2012, this was the only deep learning-based entry. In 2013, all winning entries were based on deep learning, and by 2015 several algorithms based on convolutional neural networks (CNNs) had surpassed the human recognition rate of 95%.
With such success in image recognition, deep learning-based object detection was inevitable. Techniques such as Faster R-CNN produce impressive results on many classes of objects. We will learn about them in later posts, but for now keep in mind that if you have not looked at deep learning-based image recognition and object detection algorithms for your applications, you may be missing out on a huge opportunity to improve your results.
With this overview, we are ready to return to the main goal of this post: understanding image recognition using traditional computer vision techniques.
Image recognition (also known as image classification)
An image recognition algorithm (a.k.a. an image classifier) takes an image (or a patch of an image) as input and outputs what the image contains. In other words, the output is a class label (e.g. "cat", "dog", "table", etc.). How does an image recognition algorithm know the contents of an image? Well, you have to train the algorithm to learn the differences between different classes. If you want to find cats in images, you need to train an image recognition algorithm with thousands of images of cats and thousands of images of backgrounds that do not contain cats. Needless to say, this algorithm can only understand objects / classes it has learned.
For simplicity, in this post we will focus only on two-class (binary) classifiers. You may think this is a very limiting assumption, but keep in mind that many popular object detectors (e.g. face detectors and pedestrian detectors) have a binary classifier under the hood. For example, inside a face detector is an image classifier that says whether a patch of an image is a face or background.
Anatomy of an image classifier
The following diagram illustrates the steps of a traditional image classifier.
Interestingly, many traditional computer vision image classification algorithms follow this process, while deep learning-based algorithms skip the feature extraction step entirely. Let's take a closer look at these steps.
Step 1: Preprocessing
An input image is usually pre-processed to normalize the effects of contrast and brightness. A very common preprocessing step is to subtract the average of the image intensities and divide by the standard deviation. Sometimes gamma correction gives slightly better results. For color images, a color space transformation (such as RGB to LAB color space) can help achieve better results.
Note that I am not prescribing which preprocessing steps are good, because nobody knows in advance which of these preprocessing steps will produce good results. You try a few different ones, and some might give you slightly better results. Here is a paragraph from Dalal and Triggs:
"We evaluated various input pixel representations, including grayscale, RGB, and LAB color spaces, with optional power-law (gamma) equalization. These normalizations have only a modest impact on performance, perhaps because subsequent normalization of the descriptor produces similar results. We use color information where available. RGB and LAB color spaces give comparable results, but limiting grayscale reduces performance by 1.5% at 10−4 FPPW. Root gamma compression square of each channel color improves performance at low FPPW (by 1% at 10−4 FPPW), but record compression is too strong and degrades performance by 2% at 10−4 FPPW.”
As you can see, they didn't know in advance which preprocessing to use. They made reasonable assumptions and used trial and error.
As part of preprocessing, an input image or patch of an image is also cropped and resized to a fixed size. This is essential because the next step, feature extraction, is performed on a fixed-size image.
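The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic image; the gamma value of 0.5 is an arbitrary choice for demonstration, not a recommendation:

```python
import numpy as np

# Synthetic stand-in for a real grayscale image patch (values 0-255).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(128, 64)).astype(np.float32)

# Subtract the mean intensity and divide by the standard deviation.
normalized = (img - img.mean()) / img.std()

# Optional power-law (gamma) correction; gamma = 0.5 is square-root compression.
gamma_corrected = np.power(img / 255.0, 0.5)
```

In a real pipeline the patch would first be cropped and resized to a fixed size (e.g. with cv2.resize), and color space transformations such as RGB to LAB can be done with cv2.cvtColor.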
Step 2: Feature Extraction
The input image contains a lot of extra information that is not necessary for classification. Therefore, the first step in image classification is to simplify the image by extracting the important information it contains and leaving out the rest. For example, if you want to find shirt and coat buttons in images, you will notice significant variation in the RGB pixel values. However, by running an edge detector on an image we can simplify it. You can still easily discern the circular shape of the buttons in these edge images, so we can conclude that edge detection retains the essential information while throwing away the non-essential information. This step is called feature extraction. In traditional computer vision approaches, designing these features is crucial to the performance of the algorithm. It turns out that we can do much better than simple edge detection and find features that are much more reliable. In our example of shirt and coat buttons, a good feature detector would capture not only the circular shape of the buttons but also information about how buttons differ from other circular objects, such as car tires.
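To make the button example concrete, here is a crude sketch: a synthetic filled disc stands in for a button, and a thresholded gradient magnitude stands in for a real edge detector (a real system would use something like cv2.Canny). The edge map keeps the circular outline while discarding the interior pixel values:

```python
import numpy as np

# A synthetic "button": a filled disc of radius 15 centered in a 64x64 image.
size = 64
yy, xx = np.mgrid[0:size, 0:size]
disc = ((xx - 32) ** 2 + (yy - 32) ** 2 <= 15 ** 2).astype(np.float32)

# Central-difference gradient magnitude, thresholded to a binary edge map.
gx = np.zeros_like(disc)
gy = np.zeros_like(disc)
gx[:, 1:-1] = disc[:, 2:] - disc[:, :-2]
gy[1:-1, :] = disc[2:, :] - disc[:-2, :]
edges = np.sqrt(gx ** 2 + gy ** 2) > 0.5

# Edge pixels form a thin ring: far fewer pixels than the filled disc,
# yet the circular shape is still recoverable from them.
```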
Some well-known features used in computer vision are Haar-like features introduced by Viola and Jones, Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), etc.
As a concrete example, let us look at feature extraction using the Histogram of Oriented Gradients (HOG).
Histogram of Oriented Gradients (HOG)
A feature extraction algorithm converts a fixed-size image into a fixed-size feature vector. In the case of pedestrian detection, the HOG feature descriptor is calculated for a 64 × 128 patch of an image and it returns a vector of size 3780. Note that the original dimension of this image patch was 64 × 128 × 3 = 24,576, which is reduced to 3,780 by the HOG descriptor.
HOG is based on the idea that the appearances of local objects can be effectively described by the distribution (histogram) of the edge directions (oriented gradients). Following are the steps to calculate the HOG descriptor for a 64 × 128 image.
- Gradient calculation: Calculate the x and y gradient images, gx and gy, from the original image. This can be done by filtering the original image with the 1-D kernel [-1, 0, 1] (horizontally) and its transpose (vertically).
Using the gradient images gx and gy, we can calculate the magnitude and orientation of the gradient using the following equations: the magnitude is g = √(gx² + gy²) and the orientation is θ = arctan(gy / gx).
The calculated gradients are "unsigned", and therefore θ ranges from 0 to 180 degrees.
- cells: Divide the image into 8×8 cells.
- Calculate the histogram of gradients in these 8×8 cells: At each pixel in an 8×8 cell we know the gradient (magnitude and direction), so we have 64 magnitudes and 64 directions, i.e. 128 numbers. The histogram of these gradients provides a more useful and compact representation. We next convert these 128 numbers into a 9-bin histogram (i.e. 9 numbers). The bins of the histogram correspond to the gradient directions 0, 20, 40 ... 160 degrees. Each pixel votes for one or two bins in the histogram. If the gradient direction at a pixel is exactly 0, 20, 40 ... or 160 degrees, a vote equal to the magnitude of the gradient at that pixel is cast into the corresponding bin. A pixel whose gradient direction is not exactly 0, 20, 40 ... 160 degrees splits its vote between the two nearest bins in proportion to its distance from each bin. For example, a pixel with gradient magnitude 2 and angle 20 degrees will vote for the second bin with a value of 2. On the other hand, a pixel with gradient magnitude 2 and angle 30 degrees contributes a vote of 1 to the second bin (corresponding to angle 20) and a vote of 1 to the third bin (corresponding to angle 40).
- Block normalization: The histogram calculated in the previous step is not very robust to lighting changes; multiplying the image intensities by a constant factor scales the histogram bin values as well. To counter these effects we can normalize the histogram, i.e. think of the histogram as a vector of 9 elements and divide each element by the magnitude of this vector. In the original HOG paper, this normalization is not done over the 8×8 cell that produced the histogram, but over 16×16 blocks. The idea is the same, but instead of a 9-element vector you now have a 36-element vector.
- Feature vector: In the previous steps we figured out how to calculate a histogram over an 8×8 cell and then normalize it over a 16×16 block. To calculate the final feature vector for the entire image, the 16×16 block is moved in steps of 8 (i.e. 50% overlap with the previous block), and the 36 numbers (corresponding to the 4 histograms in a 16×16 block) calculated at each step are concatenated to produce the final feature vector. How big is the final vector?
The input image is 64 × 128 pixels in size and we move 8 pixels at a time. Therefore, we can take 7 steps in the horizontal direction and 15 steps in the vertical direction, making 7 x 15 = 105 steps. In each step we calculate 36 numbers, making the length of the final vector 105 x 36 = 3780.
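The bookkeeping above can be checked with a short NumPy sketch. It computes unsigned gradients with the [-1, 0, 1] kernels, counts cells and blocks for a 64 × 128 patch, and reproduces the vote-splitting example from the histogram step (the random patch is just a stand-in for real image data):

```python
import numpy as np

rng = np.random.default_rng(1)
patch = rng.random((128, 64)).astype(np.float32)   # 128 rows x 64 columns

# Gradient calculation with central differences ([-1, 0, 1] kernels).
gx = np.zeros_like(patch)
gy = np.zeros_like(patch)
gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
gy[1:-1, :] = patch[2:, :] - patch[:-2, :]
magnitude = np.sqrt(gx ** 2 + gy ** 2)
angle = np.degrees(np.arctan2(gy, gx)) % 180.0     # "unsigned": 0-180 degrees

# Cells, blocks, and the final feature vector length.
cells_x, cells_y = 64 // 8, 128 // 8               # 8 x 16 cells of 8x8 pixels
blocks_x, blocks_y = cells_x - 1, cells_y - 1      # 7 x 15 overlapping 16x16 blocks
feature_length = blocks_x * blocks_y * 4 * 9       # 36 numbers per block -> 3780

# Vote splitting for one pixel with magnitude 2 and angle 30 degrees.
bin_width, mag, ang = 20.0, 2.0, 30.0
lo_bin = int(ang // bin_width)                     # bin 1, centered at 20 degrees
frac = (ang - lo_bin * bin_width) / bin_width      # 0.5: halfway between bins
vote_lo, vote_hi = mag * (1.0 - frac), mag * frac  # 1.0 to bin 1, 1.0 to bin 2
```

OpenCV's cv2.HOGDescriptor with its default pedestrian-detection parameters (64×128 window, 16×16 blocks, 8×8 cells, 9 bins) produces exactly this 3780-dimensional vector.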
Step 3: Learning Algorithm for Classification
In the previous section, we learned how to convert an image to a feature vector. In this section, we'll learn how a classification algorithm takes this set of features as input and generates a class label (for example, cat or background).
Before a classification algorithm can do its magic, we need to train it by showing it thousands of examples of cats and backgrounds. Different learning algorithms learn differently, but the general principle is that learning algorithms treat feature vectors as points in a higher-dimensional space, and try to find planes/surfaces that partition that space in such a way that all examples belonging to the same class are on one side of the plane/surface.
For the sake of simplicity, let's take a detailed look at a learning algorithm called Support Vector Machines (SVM).
How does the support vector machine (SVM) work for image classification?
Support Vector Machine (SVM) is one of the most popular supervised binary classification algorithms. Although the ideas used in SVM have been around since 1963, the current version was proposed in 1995 by Cortes and Vapnik.
In the previous step, we learned that the HOG descriptor of an image is a feature vector of length 3780. We can think of this vector as a point in a 3780-dimensional space. Visualizing a higher dimensional space is impossible, so let's simplify things a bit and imagine that the feature vector is only two dimensional.
In our simplified world, we now have 2D points representing the two classes (for example, cats and background). In the figure above, the two classes are represented by two different types of points. All black dots belong to one class and white dots belong to another class. During training, we provide the algorithm with many examples of both classes. In other words, we tell the algorithm the coordinates of the 2D points and also whether the point is white or black.
Different learning algorithms figure out how to separate these two classes in different ways. Linear SVM tries to find the best line that separates the two classes. In the figure above, H1, H2, and H3 are three lines in this 2D space. H1 does not separate the two classes and is therefore not a good classifier. H2 and H3 both separate the two classes, but intuitively H3 looks like a better classifier than H2 because H3 appears to separate the two classes more cleanly. Why? Because H2 is too close to some of the black and white dots, whereas H3 is chosen such that it is at a maximum distance from members of the two classes.
Given the 2D features in the above figure, SVM will find the line H3 for you. If you get a new 2D feature vector corresponding to an image the algorithm has never seen before, you can simply test which side of the line the point lies on and assign it the appropriate class label. If your feature vectors are in 3D, SVM will find the appropriate plane that maximally separates the two classes. As you may have guessed, if your feature vector is in a 3780-dimensional space, SVM will find the appropriate hyperplane.
So far so good, but I know you have one important unanswered question. What if the features belonging to the two classes are not separable by a hyperplane? In such cases, SVM still finds the best hyperplane by solving an optimization problem that tries to increase the distance of the hyperplane from the two classes while trying to make sure many training examples are classified correctly. This trade-off is controlled by a parameter called C. When the value of C is small, a large-margin hyperplane is chosen at the expense of a larger number of misclassifications. Conversely, when C is large, a smaller-margin hyperplane is chosen that tries to correctly classify many more examples.
Now you may be confused about which value you should choose for C. Choose the value that performs best on a validation set that the algorithm was not trained on.
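To make this concrete, here is a minimal NumPy sketch of a linear SVM trained by sub-gradient descent on the hinge loss with a fixed C. The toy solver, the cluster locations, and the hyperparameters are illustrative assumptions; a real application would use a mature solver such as OpenCV's cv2.ml.SVM_create() or scikit-learn:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2D clusters: class -1 around (0, 0), class +1 around (4, 4).
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(4.0, 0.5, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)

# Minimize 0.5*||w||^2 + C * sum(hinge losses) by sub-gradient descent.
w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    mask = margins < 1                 # points misclassified or inside the margin
    grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
    grad_b = -C * y[mask].sum()
    w -= lr * grad_w
    b -= lr * grad_b

# New points are classified by which side of the line w.x + b = 0 they fall on.
pred = np.sign(X @ w + b)
accuracy = (pred == y).mean()          # should be 1.0 on this separable toy set
```

Decreasing C in this sketch puts more weight on the margin term and less on classification errors, which is exactly the trade-off described above.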