From maintaining crop health to detecting cancerous tumours, object detection models have revolutionised many fields, including agriculture, medicine, transportation, surveillance, and autonomous driving.
Object detection is a computer vision task aimed at predicting the location and type of the objects in an image. To learn this, models need annotated data containing the labels and bounding boxes of the objects of interest.
Figure 1: Example of annotated data for object detection. It contains the label and bounding box enclosing each object of interest. Image generated with DALL·E.
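To make this concrete, annotations are commonly stored in a structured format such as COCO-style JSON. Below is a minimal, hypothetical example (the file name, class, and coordinates are illustrative):

```python
# A minimal, hypothetical COCO-style annotation: one image with one object.
# "bbox" is [x, y, width, height] in pixels; "category_id" points to a label.
annotation_example = {
    "images": [{"id": 1, "file_name": "orchard_001.jpg", "width": 1280, "height": 720}],
    "categories": [{"id": 1, "name": "apple"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [412, 250, 96, 88]},
    ],
}
```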
The ability of a model to properly detect objects depends on the accuracy of the annotations, as well as on the number of images used for training. This poses a challenge because annotated data is expensive and difficult to obtain: labellers must create the bounding boxes and class labels manually, which requires a large amount of human labour. As a result, existing open-source annotated data is limited. Some examples of annotated datasets for object detection are COCO, LVIS, and Google's Open Images.
Table 1: Examples of open-source annotated datasets for object detection.
Object detection models trained on the COCO dataset (e.g. YOLO, Fast R-CNN) are helpful for detecting generic objects; however, in domain-specific cases, these pre-trained models often prove inaccurate. Even models trained on bigger datasets, such as LVIS or Open Images, might be inaccurate in specific use cases due to the generic characteristics of the images in those datasets (see Figure 2). This is problematic, as image annotation must be redone every time an object detection model is developed for a specific use case.
Figure 2: Example of LVIS data and annotations. Source: LVIS. Although this dataset is large, it may not contain images from domain-specific cases.
The question is: is there a way to make image annotation for object detection more efficient? Fortunately, the answer is yes, and that is what we will explore in this blog post.
Machine learning-assisted labelling is the process of using models trained on large (and generic) datasets to pre-label new images. In the context of object detection, these pre-trained models generate labels and bounding boxes for the data of interest, making manual labelling much quicker: humans only need to verify and correct the labels rather than annotate images from scratch.
Some of the best-known object detection models used to assist data labelling are Faster R-CNN, SSD (Single Shot Detector), and YOLO (You Only Look Once). What these models have in common is that object localisation and classification are coupled. For instance, SSD predicts the bounding box and the object's class in a single shot: in one forward pass of the network, the presence of an object and its bounding box are predicted simultaneously. Faster R-CNN relies on a Region Proposal Network (RPN) to search for the areas of an image where objects are likely to be found. YOLO, in turn, treats object detection as a single regression problem, predicting bounding boxes and class probabilities in one step.
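As a concrete illustration of pre-labelling, the sketch below runs a COCO-pretrained YOLO model over an image and prints candidate boxes for a human to verify. It assumes the ultralytics package and a hypothetical image path:

```python
# Sketch: pre-labelling an image with a pre-trained YOLO model (ultralytics).
# A human then only verifies/corrects these boxes instead of drawing them.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # COCO-pretrained checkpoint

results = model("images/field_photo.jpg")  # hypothetical input path
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
        label = model.names[int(box.cls)]      # predicted class name
        conf = float(box.conf)                 # confidence score
        print(f"{label} ({conf:.2f}): [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
```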
The significant disadvantage of coupling object localisation and classification is that such models must be trained on detection data, which, as explained in the introduction, is scarce. Classification datasets, by contrast, contain far more images and a much larger repertoire of classes: the ImageNet-21K dataset, for example, contains 14M images and 21K classes. Training object detection models on these huge classification datasets would therefore be ideal, but achieving this requires decoupling the localisation and classification of objects within the model. Luckily, researchers from Meta achieved this by developing an algorithm called Detic: Detector with image classes.
In 2022, researchers from Meta released Detic, an object detection model capable of detecting 21K different object classes with high accuracy. What is the secret ingredient? As mentioned above, they decoupled the object localisation and classification sub-problems, allowing the model to train on the much larger classification datasets available. For object localisation, the authors used a Region Proposal Network, which does not need to be fine-tuned because it uses a sufficient number of proposals at test time (1K proposals for < 20 objects per image [1]). For object classification, the authors used a modified classifier in which language embeddings of the class names, specifically CLIP embeddings, replace the learned classification weights; these embeddings are then co-trained on the ImageNet dataset. The advantage of this training process is that Detic not only recognises more objects than traditional detection models (thanks to ImageNet); it can also identify objects not seen during training, without further fine-tuning (thanks to the CLIP embeddings).
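The core idea can be sketched with the open-source clip package: the classifier's "weights" become text embeddings of the class names, so classification reduces to a similarity lookup and new classes need no retraining. This is an illustrative sketch, not Detic's actual implementation:

```python
# Sketch: a classifier whose weights are CLIP text embeddings of class names
# (illustrative only; Detic's real classifier head is more involved).
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

class_names = ["koala", "pencil", "helmet"]  # any vocabulary, even unseen ones
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_emb = clip_model.encode_text(prompts).float()        # (3, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# In a detector, region features would come from the box head; random
# placeholders are used here just to show the similarity-based scoring.
region_features = torch.randn(5, 512, device=device)
region_features = region_features / region_features.norm(dim=-1, keepdim=True)
scores = region_features @ text_emb.T  # (5, 3): one score per region per class
```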
However, there is an important catch: ImageNet classification data contains only one label per image. Does this harm the ability to detect multiple objects? Not at all, because ImageNet is not the only dataset used to train Detic. When an ImageNet image is used, Detic applies the image-level class to the largest bounding box proposed by the RPN and computes the loss on that proposal; the model weights are then adjusted and training continues. The authors also used detection data during training, following the standard two-stage detection scheme in which both the bounding box and the class of each object in an image are inferred. The training process of Detic is thus as follows:
1. Compose a mini-batch using a set of images from both ImageNet and a detection dataset.
2. Images without bounding boxes (i.e. those belonging to ImageNet) are trained using the modified classifier that uses CLIP embeddings.
3. Images with bounding boxes (i.e. those belonging to the detection dataset) follow the standard two-stage object detection training with the modified classifier.
The training process of Detic is shown below:
Figure 3: Training methodology of Detic. Source: [1]
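To summarise the scheme in code, here is a hedged, pseudocode-level sketch of one training step; all component names (rpn, the loss functions) are stand-ins for the real modules described in [1]:

```python
import torch

def detic_training_step(batch, rpn, detection_loss, classification_loss, optimizer):
    """One hypothetical Detic-style training step over a mixed mini-batch.

    Detection images carry boxes and use the standard two-stage losses;
    ImageNet images carry a single image-level label, which supervises
    only the largest RPN proposal (the max-size loss of [1]).
    """
    proposals = rpn(batch["images"])  # candidate boxes, shape (N, 4), xyxy

    if batch["has_boxes"]:
        # Detection data: standard box regression + classification losses.
        loss = detection_loss(proposals, batch["boxes"], batch["labels"])
    else:
        # Classification data: apply the image label to the largest proposal.
        areas = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
        largest = proposals[torch.argmax(areas)]
        loss = classification_loss(largest, batch["image_label"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```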
The authors evaluated Detic on annotated data from the LVIS dataset. They reported a gain of 1.2 mAP across all classes with respect to YOLO9000 and an increase of 8.3 mAP for novel classes on the LVIS benchmark (see Table 1 in [1]).
These results represent remarkable progress in object detection, especially for machine learning-assisted labelling, where accurate models are needed to make the labelling process more efficient.
How good is Detic in practice? We explored the model with different images to test its object detection capabilities. We also compared Detic's outputs with those of a YOLO model pre-trained on Google's Open Images dataset (a.k.a. YOLOv8-oiv7).
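For reference, the YOLO side of this comparison can be reproduced with the ultralytics package; the checkpoint name follows ultralytics' Open Images releases, and the image path is illustrative:

```python
# Load the Open Images V7 pre-trained YOLOv8 model used in our comparison.
from ultralytics import YOLO

model = YOLO("yolov8x-oiv7.pt")  # checkpoint trained on Open Images V7
results = model("hikers.jpg")    # hypothetical test image
results[0].show()                # visualise predicted boxes and labels
```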
The first experiment was to detect as many people as possible in an image.
Figure 4: Detection of many people hiking up a mountain.
In this case, Detic outperformed YOLO by detecting people even far in the image's background. In the context of assisted labelling, this improves efficiency because human verifiers would need far fewer corrections and additions compared to using YOLO.
Detic is also better at detecting objects that are absent from the open-source annotation datasets. In particular, Detic detects the pencils in the following image, while YOLOv8-oiv7 is unable to detect them:
Figure 5: Detection of pencils. YOLOv8-oiv7 does not detect them because the model was not trained with this object class.
That said, YOLOv8-oiv7 can be good at detecting objects whose classes are part of Google's Open Images dataset. In the following examples, both YOLOv8-oiv7 and Detic detected helmets and koalas in the images; however, Detic was better at detecting all of the koalas.
Figure 6: Detection of helmets and koalas.
Overall, Detic is an accurate model for detecting objects in images, which makes it a powerful tool for machine learning-assisted labelling.
Note that the examples shown here are not a representative sample of Detic's full potential. If you are interested in exploring the model further, you can visit the Colab notebook we prepared, where you can see Detic in action.
We saw that, in general terms, Detic is better than YOLOv8-oiv7 because it is able to detect more objects. However, Detic sometimes fails to detect very domain-specific objects. In this example, Detic could not detect the Espeletia, a shrub that grows in the Colombian high mountains, even when asked to detect it as a tree, plant, or shrub.
Figure 7: Photo of an Espeletia, a plant that grows in some Colombian mountains. Image taken by the author.
Detic's inference is also slow: in our experiments, it was approximately 1,000 times slower than YOLO at detecting objects in a single image. However, this limitation matters less in machine learning-assisted labelling, where no real-time predictions are needed. For applications that require real-time predictions or deployment on edge devices, one desirable pipeline is to create annotation data with Detic and then train a YOLO model on that data, extending YOLO's default recognition categories, as sketched below.
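Here is a hedged sketch of the second half of that pipeline. It assumes Detic's detections have already been exported as YOLO-format label files (one .txt per image) and described by a hypothetical dataset YAML:

```python
# Sketch: fine-tune YOLO on labels produced by Detic (assisted labelling).
# Assumes the exported labels use YOLO format ("class_id x_center y_center
# width height", normalised) and a dataset YAML listing paths and classes.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # start from a COCO-pretrained model
model.train(
    data="detic_labels/dataset.yaml",  # hypothetical dataset config
    epochs=50,
    imgsz=640,
)
metrics = model.val()  # evaluate the distilled model on the validation split
```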
Object detection models are robust, provided that they are trained on high-quality annotated data. However, domain-specific annotation data is hard to obtain due to the amount of human labour required. To overcome this challenge, machine learning-assisted labelling tools can be used to reduce the hours spent creating these datasets. These tools are being improved thanks to the rise of zero-shot algorithms such as Detic, which allow us to detect more objects than other detection methods. Be aware that, in domain-specific use cases, Detic might not perform as expected, leaving manual labelling as the only reliable alternative. In any case, we can expect significant advances in object detection thanks to models like Detic.
[1] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra. Detecting Twenty-thousand Classes using Image-level Supervision. arXiv:2201.02605 (2022). DOI: 10.48550/arXiv.2201.02605