Introduction to object detection algorithms (2023)


How much time have you spent looking for lost room keys in a messy and messy house? It happens to the best of us and it remains an incredibly frustrating experience to this day. But what if a simple computer algorithm could find your keys in milliseconds?

This is the power of object detection algorithms. While this is a simple example, object detection applications span many, many industries, from 24/7 surveillance to real-time vehicle detection in smart cities. In short, these are powerful deep learning algorithms.

Introduction to object detection algorithms (1)

Specifically in this article, we will go deeper and examine different algorithms that can be used for object detection. We start with the algorithms of the RCNN family, namely RCNN, Fast RCNN, and Faster RCNN. In the next article in this series, we will cover more advanced algorithms like YOLO, SSD, etc.

If you are new to CNN, you can sign up for this free course where we cover CNN comprehensively:Convolutional Neural Networks (CNN) from scratch

I encourage you to follow it.previous article on object detection, where we cover the basics of this wonderful technique and show an implementation in Python using the ImageAI library.

Part 2 and Part 3 of this series have also been published. You can access here:

  • A practical implementation of the fastest R-CNN algorithm for object detection (part 2)
  • A practical guide to object detection using the popular YOLO framework - Part III (with Python codes)

Let's start!


  1. A simple way to solve an object detection task (using deep learning)
  2. Understanding region-based convolutional neural networks
    1. Intuition by RCNN
    2. Problems with RCNN
  3. Understand Fast RCNN
    1. Intuition by Fast RCNN
    2. Problems with Fast RCNN
  4. Understand RCNN faster
    1. Intuition by Faster RCNN
    2. Problems with RCNN faster
  5. Summary of Algorithms Covered

1. A simple way to solve an object detection task (using deep learning)

The following image is a popular example to illustrate how an object detection algorithm works. Every object in the image, from a person to a dragon, has been located and identified with some degree of precision.

Introduction to object detection algorithms (2)

Let's start with the simplest and most widely used deep learning approach for detecting objects in images: Convolutional Neural Networks, or CNNs. If your understanding of CNN is a bit rusty, I highly recommend reading this.This articleFirst.

But I will briefly summarize the inner workings of aCNNfor you. Consider the following image:

Introduction to object detection algorithms (3)

We pass an image to the network and it is sent through several layers of convolution and pooling. Finally, we get the output in the form of the class of the object. Pretty easy, right?

For each input image, we get a corresponding class as output. Can we use this technique to detect different objects in an image? If we can! Let's see how we can solve a general object detection problem using a CNN.

1. First, we take an image as input:

Introduction to object detection algorithms (4)

2. Next, we divide the image into different regions:

Introduction to object detection algorithms (5)

3. Next, we consider each region as a separate image.
4. Send all these regions (images) to CNN and classify them into different classes.
5. After dividing each region into its corresponding class, we can combine all these regions to get the original image with the detected objects:

Introduction to object detection algorithms (6)

The problem with using this approach is that the objects in the image can have different proportions and spatial positions. For example, in some cases the object may cover most of the image, while in others the object may cover only a small percentage of the image. The shapes of the objects can also be different (this often happens in real applications).

Due to these factors, we would need a large number of regions, which would result in enormous computation time. To solve this problem and reduce the number of regions, we can use the region-based CNN, which selects regions through a suggestion method. Let's understand what worries this region.CNNcan do for us.

2. Understand the region-based convolutional neural network

2.1 Intuition by RCNN

Instead of working on a large number of regions, the RCNN algorithm suggests a series of frames in the image and checks if any of those frames contain an object. RCNNuses a selective search to extract these frames from an image (these frames are called regions).

First, let's understand what selective search is and how it identifies different regions. Basically, there are four areas that make up an object: different scales, colors, textures, and borders. Selective search identifies these patterns in the image and suggests different regions based on them. Here's a quick overview of how selective search works:

  • First take an image as input:
    Introduction to object detection algorithms (7)
  • It then generates initial subsegments so that we have multiple regions of this image:
    Introduction to object detection algorithms (8)
  • The technique then combines the similar regions to form one larger region (based on similarity in color, texture, size, and shape compatibility):
    Introduction to object detection algorithms (9)
  • Finally, these regions generate the final locations of the object (region of interest).

Below is a brief summary of the steps performed in RCNN to detect objects:

  1. First, we take a pretrained convolutional neural network.
  2. This model is then retrained. We train the last layer of the network based on the number of classes that need to be recognized.
  3. The third step is to obtain the region of interest for each image. We then reshape all of these regions to match the CNN input size.
  4. After obtaining the regions, we train the SVM to classify objects and backgrounds. For each class, we train a binary SVM.
  5. Finally, we train a linear regression model to generate tighter bounding boxes for each identified object in the image.

A visual example might give you a better idea of ​​the steps above.(Example images below are fromthis paper) .So let's get one!

  • First, an image is taken as input:
    Introduction to object detection algorithms (10)
  • Next, we get the regions of interest (ROI) using a suggestion method (for example, selective search as shown above):
    Introduction to object detection algorithms (11)
  • All of these regions are reformulated according to CNN's input, and each region is passed to ConvNet:
    Introduction to object detection algorithms (12)
  • CNN then pulls resources for each region and SVMs are used to divide these regions into different classes:
    Introduction to object detection algorithms (13)
  • Finally, a bounding box regression (reg-box) is used to predict the bounding boxes of each identified region:
    Introduction to object detection algorithms (14)

And so, ultimately, a RCNN helps us to recognize objects.

2.2 Problems with RCNN

So far, we have seen how RCNN can help with object detection. But this technique has its own limitations. Training an RCNN model is expensive and time consuming thanks to the following steps:

  • Extraction of 2000 regions for each image based on a selective search
  • Extract features with CNN for each region of the image. Assuming we have N images, the number of CNN resources is N*2000
  • The entire process of object detection with RCNN has three models:
    1. CNN for resource extraction
    2. Linear SVM classifier to identify objects
    3. Regression model for tightening bounding boxes.

All these processes together make RCNN very slow. It takes 40-50 seconds to make predictions for each new image, making the model essentially complicated and virtually impossible when faced with a large data set.

Here's the good news: we have a different object detection technique that fixes most of the limitations we saw in RCNN.

3. Understand Fast RCNN

3.1 Intuition by Fast RCNN

What else can we do to reduce the computation time that an RCNN algorithm normally takes? Instead of running a CNN 2000 times per frame, we can run it just once per frame and get all regions of interest (regions that contain an object).

Ross Girshick, the author of RCNN, came up with the idea of ​​running CNN only once per frame, and then finding a way to share that calculation across all 2000 regions. In Fast RCNN, we feed the input image to CNN, which in turn generates the convolutional feature maps. The proposed regions are extracted from these maps. We then use a RoI pooling layer to reshape all proposed regions to a fixed size so they can be fed into a fully connected network.

Let's break this down into steps to simplify the concept:

  1. As with the previous two techniques, we take an image as input.
  2. This image is passed to ConvNet, which in turn generates the regions of interest.
  3. A RoI pooling layer is applied to all these regions to reshape them according to the ConvNet inputs. Each region is then transitioned to a fully connected network.
  4. A softmax layer is used in the fully connected network to broadcast classes. A linear regression layer is also used in parallel with the softmax layer to generate the bounding box coordinates for the predicted classes.

So instead of using three different models (as in RCNN), Fast RCNN uses a single model that extracts features from regions, splits them into different classes, and returns the bounding boxes for the identified classes, all at once.

To break it down further, I'll visualize each step to add a practical angle to the explanation.

  • We follow the now familiar step of taking an image as input:
    Introduction to object detection algorithms (15)
  • This image is passed to a ConvNet which returns the area of ​​interest accordingly:
    Introduction to object detection algorithms (16)
  • We then apply the RoI clustering layer to the extracted regions of interest to ensure that all regions are the same size:
    Introduction to object detection algorithms (17)
  • Finally, these regions are passed to a fully connected network that sorts and returns bounding boxes simultaneously using softmax layers and linear regression:
    Introduction to object detection algorithms (18)

In this way, Fast RCNN solves two of the main problems of RCNN, i.e., passing one region per image instead of 2000 to ConvNet and using one instead of three different models for feature extraction, classification, and feature generation. bounding boxes.

3.2 Problems with Fast RCNN

But Fast RCNN also has certain problem areas. It also uses selective search as a suggestion method to find areas of interest, a slow and slow process. It takes about 2 seconds per frame to detect objects, which is much better compared to RCNN. But if we look at large real data sets, even a fast RCNN doesn't seem so fast anymore.

But there is another object detection algorithm that beats Fast RCNN. And something tells me you won't be surprised by his name.

4. Understand RCNN faster

4.1. Intuition by Faster RCNN

Faster RCNN is the modified version of Fast RCNN. The main difference between them is that Fast RCNN uses selective search to generate regions of interest, while Faster RCNN uses "Region Proposal Network", also known as RPN. RPN takes image feature maps as input and generates a series of object suggestions, each with an object score as output.

The following steps are generally followed in a faster RCNN approach:

  1. We take an image as input and pass it to ConvNet, which returns the feature map of that image.
  2. A region suggestion network is applied to these resource maps. This returns the object suggestions along with their objectivity rating.
  3. A RoI pooling layer is applied to these proposals so that all proposals have the same size.
  4. Finally, the hints are passed to a fully connected layer, which has a softmax layer on top and a linear regression layer, to classify and generate the bounding boxes of the objects.

Introduction to object detection algorithms (19)

Let me briefly explain how this Regional Proposal Network (RPN) actually works.

First, Faster RCNN handles resource mapsCNNand sends them to the Regional Network of Proposals. RPN uses a sliding window over these feature maps and generates in each windowkAnchor boxes in different shapes and sizes:

Introduction to object detection algorithms (20)

Anchor boxes are fixed-size bounding boxes placed throughout the image that vary in shape and size. For each anchor, the RPN predicts two things:

  • The first is the probability that an anchor is an object (it doesn't take into account what class the object belongs to).
  • The second is replicating the bounding box to adjust the anchors to better fit the object.

We now have bounding boxes of various shapes and sizes that are passed to the RoI pooling layer. Now it is possible that after the RPN step there are proposals that do not have associated classes. We can take each signal and adapt it so that each signal contains an object. The RoI pooling layer takes care of this. Extract fixed-size feature maps for each anchor:

Introduction to object detection algorithms (21)

These feature maps are then passed to a fully connected layer that has a softmax and linear regression layer. Finally it classifies the object and predicts the bounding boxes for the identified objects.

4.2 Problems with faster RCNN

All object detection algorithms discussed so far use regions to identify objects. The network does not look at the entire image at once, but sequentially focuses on parts of the image. This creates two complications:

  • The algorithm requires many passes through a single image to extract all the objects.
  • As different systems work in sequence, the performance of the above systems depends on the performance of the above systems.

5. Summary of the algorithms covered

The following table is a good summary of all the algorithms we cover in this article. I suggest keeping it on hand the next time you're working on an object detection challenge!

algorithmresourcesforecast weather / imagelimitations
CNNIt splits the image into multiple regions and then classifies each region into different classes.It requires many regions to accurately predict and therefore a high computation time.
RCNNUse selective search to generate regions. It extracts around 2000 regions from each image.40-50 secondsHigh computation time as each region is passed separately to CNN and uses three different models to create predictions.
fast RCNNEach image is submitted to CNN only once and feature maps are extracted. Selective search is used on these maps to generate predictions. Combine the three models used in RCNN together.2 secondsThe selective search is slow and therefore the computation time is still high.
faster RCNNReplaced the selective search method with a region suggestion network, making the algorithm much faster.0.2 secondsProposing objects takes time, and since different systems work in sequence, the performance of systems depends on how the previous system worked.


Object detection is an intriguing and very popular field in commercial and research applications. Thanks to advances in modern hardware and computing resources, advances in this area have been rapid and innovative.

This article is just the beginning of our journey into object detection. In the next article (Part 2mipart 3) in this series, we find modern object detection algorithms such as YOLO and RetinaNet. So stay tuned!

I always welcome feedback or suggestions on my articles, so feel free to contact me in the comments section below.



What is object detection algorithms? ›

Object detection is a computer vision technique for locating instances of objects in images or videos. Object detection algorithms typically leverage machine learning or deep learning to produce meaningful results.

What is the latest object detection algorithm 2022? ›

Most Popular Object Detection Algorithms

Convolutional neural networks (R-CNN, Region-BasedConvolutional Neural Networks), Fast R-CNN, and YOLO are popular object detection techniques (You Only Look Once). YOLO belongs to the single-shot detector family, whereas the R-CNNs are members of the R-CNN family.

Where do I start with object detection? ›

  1. Getting Started with Object Detection Using Deep Learning.
  2. Create Training Data for Object Detection. ...
  3. Create Object Detection Network.
  4. Train Detector and Evaluate Results.
  5. Detect Objects Using Deep Learning Detectors.
  6. Detect Objects Using Pretrained Object Detection Models.
  7. MathWorks GitHub.
  8. See Also.

Which language is best for object detection? ›

Let's look at the following reviews:
  • (1) Python. Python currently holds the place as the most popular programming language. ...
  • (2) C / C ++ / C# C / C ++ / C # can also be used for image recognition. ...
  • (3) Matlab. Matlab is an independent programming language that has its own framework. ...
  • (4) Java.
Apr 1, 2022

Which algorithm is best for object tracking? ›

DeepSORT is one of the most popular object tracking algorithms. It is an extension to Simple Online Real-time Tracker or SORT, which is an online-based tracking algorithm. SORT is an algorithm that uses the Kalman filter for estimating the location of the object given the previous location of the same.


Top Articles
Latest Posts
Article information

Author: Geoffrey Lueilwitz

Last Updated: 02/11/2023

Views: 6409

Rating: 5 / 5 (80 voted)

Reviews: 95% of readers found this page helpful

Author information

Name: Geoffrey Lueilwitz

Birthday: 1997-03-23

Address: 74183 Thomas Course, Port Micheal, OK 55446-1529

Phone: +13408645881558

Job: Global Representative

Hobby: Sailing, Vehicle restoration, Rowing, Ghost hunting, Scrapbooking, Rugby, Board sports

Introduction: My name is Geoffrey Lueilwitz, I am a zealous, encouraging, sparkling, enchanting, graceful, faithful, nice person who loves writing and wants to share my knowledge and understanding with you.