How redundant is your dataset?

Many interesting deep learning applications rely on complex architectures fueled by large datasets. With growing storage capacities and ever easier data collection [1], building a large dataset is simple. However, when doing so, one often ends up with lots of redundancies within the dataset. Many of these redundancies are systematically introduced by the data collection process itself: consecutive frames extracted from a video, or near-identical images collected from the web.

In this blog post, we present the results of a benchmark study showing the benefits of filtering redundant data.

Redundancies can take multiple forms. The simplest one is exact image duplicates. Another form is near-duplicates, i.e. images shifted by a few pixels in some direction or images with slight lighting changes. These redundancies not only bias estimates of the model’s performance, be it accuracy or mean average precision (mAP), but also lead to high annotation costs. Moreover, redundancies have been observed in well-known academic datasets: CIFAR-10, CIFAR-100, and ImageNet [2,3].

This benchmark study investigates the effect of redundancies in the image-based datasets collected by AI Retailer Systems (AIRS), an innovative start-up developing a checkout-free solution for retailers that focuses on answering “who picks up what?”. In this study, we consider an object detection task: an intelligent vision system recognizes products on a shelf or in a customer’s hand. The study was done as part of my role as a machine learning engineer at WhatToLabel (WTL), a tech start-up that improves the machine learning workflow by finding and removing redundancies in unlabeled data.

This blog post is structured as follows: we start by describing the dataset, then list the methods used in this study, present the results obtained, and finally discuss the importance of filtering redundant data.

AI Retailer Systems dataset

Short video sample extracted from AIRS video.

The dataset consists of images extracted from short videos capturing a customer grabbing different products. Two cameras record videos of the shelf, each with a different angle of view. There are 12 different kinds of products, i.e. 12 classes.

The dataset was manually annotated using the open-source annotation tool Vatic, and we measured the annotation rate (number of frames annotated per unit of time): 2.3 ± 0.8 frames per minute. Given that each image contains 51 objects on average, this is equivalent to about 0.51 seconds per bounding box.
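As a sanity check, the per-box annotation time follows directly from the measured frame rate and the average object count:

```python
# Back-of-the-envelope check of the annotation figures reported above.
frames_per_minute = 2.3   # measured annotation rate (mean)
boxes_per_frame = 51      # average number of objects per image

boxes_per_minute = frames_per_minute * boxes_per_frame
seconds_per_box = 60.0 / boxes_per_minute
print(round(seconds_per_box, 2))  # → 0.51
```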

Sample image from Camera 1 with annotations: the box color does not represent the article class.


Sample image from Camera 2 with annotations: the box color does not represent the article class.

The annotated dataset has 7909 images. The training dataset has 2899 images, 80% of which come from Camera 2 and 20% from Camera 1. The test dataset has 5010 images, all from Camera 1.

Visualization of the train-test setting for AIRS dataset.

This specific design of the train and test datasets was motivated by two points. First, it builds an imbalanced dataset with a high fraction of images coming from one camera. Second, it makes the object detection task hard for the model. With this train-test setting, we can compute the fraction of Camera 1 images in the filtered data, and therefore see whether the different filtering methods introduce any re-balancing.

Next, we present the methods used in this case study.

Active learning and Sampling methods

To probe the effects of filtering the dataset, we borrowed ideas from the field of Active learning.

Active learning loop used in this case study.

Active learning aims to find a subset of the training data that achieves the highest possible performance. In this study, we used a pool-based active learning loop that works as follows: start with a small fraction of the training dataset, called the labeled pool; train a model on this labeled pool; then use the model together with a filtering method to select new data points that should be labeled; append these newly selected samples to the labeled pool; and finally retrain the model from scratch on the updated pool. After each cycle, we report the model’s performance on the test dataset for each filtering method. In our case, we used 5% of the training data as the initial labeled pool, trained the model for 50 epochs, and added 20% of the training data in each active learning cycle.
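The loop above can be sketched in a few lines of Python. This is a simplified illustration, not the actual benchmark code: `train_model` and `select_samples` are placeholders standing in for the YOLOv3 training routine and a filtering method, respectively.

```python
import random

def active_learning_loop(dataset, train_model, select_samples,
                         initial_frac=0.05, step_frac=0.20, epochs=50):
    """Pool-based active learning sketch (names are illustrative only)."""
    pool = list(dataset)
    random.shuffle(pool)
    n_init = max(1, int(initial_frac * len(pool)))
    labeled, unlabeled = pool[:n_init], pool[n_init:]
    history = []
    while unlabeled:
        # Retrain from scratch on the current labeled pool.
        model = train_model(labeled, epochs=epochs)
        # Let the filtering method pick the next batch to label.
        n_new = min(len(unlabeled), int(step_frac * len(dataset)))
        picked = select_samples(model, unlabeled, n_new)
        labeled += picked
        unlabeled = [x for x in unlabeled if x not in picked]
        history.append((len(labeled), model))
    return history
```

After each cycle the model would be evaluated on the held-out test set; here we simply record the labeled-pool size along with the trained model.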

The object detection model used in this benchmark study is YOLOv3 (You Only Look Once) [4]. We used the implementation provided by the Ultralytics GitHub repository, slightly modified to introduce the active learning loop.

As for the filtering methods, we used four different filtering methods provided by WhatToLabel:

  • “RSS”: Random sub-sampling, used as a baseline.
  • “WTL_unc”: WhatToLabel uncertainty-based sub-sampling. It selects difficult images that the model is highly uncertain about; the uncertainty is assessed using the model’s predictions.
  • “WTL_CS”: Uses image representations to select images that are both diverse and difficult, combining uncertainty-based sub-sampling with diversity selection. The image representations are obtained with state-of-the-art self-supervised learning methods via the pip package Boris-ml. The advantage of self-supervised learning is that it does not require annotations to generate image representations.
  • “WTL_pt”: Relies on pre-trained models to learn image representations; filtering is performed by removing the most similar images, where similarity is given by the L2 distance between image representations.
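To give a feel for the “WTL_pt” idea, here is a minimal greedy sketch of L2-distance filtering on image representations. The actual WhatToLabel implementation is not public; this sketch assumes the embeddings have already been computed, e.g. by a pre-trained model:

```python
import numpy as np

def filter_near_duplicates(embeddings, threshold):
    """Keep an image only if its embedding is at least `threshold` (in L2
    distance) away from every embedding kept so far; return kept indices."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(np.linalg.norm(emb - embeddings[j]) >= threshold for j in kept):
            kept.append(i)
    return kept

# Two near-identical embeddings: the second one is filtered out.
embs = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0]])
print(filter_near_duplicates(embs, threshold=0.5))  # → [0, 2]
```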

Both “WTL_unc” and “WTL_CS” use active learning, since they rely on the deep learning model to decide which data points to filter. The “WTL_pt” method, in contrast, requires neither labels nor a deep learning model to filter the dataset. For curious readers, this article presents a comprehensive overview of different sampling strategies used in active learning.
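The uncertainty-based idea behind “WTL_unc” can be illustrated with a standard least-confidence score. The exact WhatToLabel scoring is not public, so this is only one common choice: an image whose most confident detection is still weak (or that has no detections at all) is considered the most uncertain.

```python
def least_confidence(predictions):
    """Score each image as 1 minus its most confident detection.
    `predictions` is a list of per-image detection-confidence lists."""
    return [1.0 - max(confs) if confs else 1.0 for confs in predictions]

def select_most_uncertain(predictions, k):
    """Return the indices of the k images the model is most uncertain about."""
    scores = least_confidence(predictions)
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

confs = [[0.95, 0.80], [0.40], []]       # per-image detection confidences
print(select_most_uncertain(confs, 1))   # → [2] (no detections at all)
```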

Benchmark study results

We present the results of these experiments below.

Averaged mAP score for different fractions of the training dataset using 4 seeds.

We see that the mAP score is low at small fractions of the training dataset, then saturates at a value of 0.8 when using only 25% of the training data. Above the saturation point, the mAP score increases very slowly until it reaches its highest value of 0.84. Saturation at such a low fraction of the training dataset indicates that the dataset contains many redundancies.

We notice that for small fractions, e.g. 5%, the “WTL_CS” filtering method is significantly better than the random baseline. For high fractions, e.g. 85%, “WTL_pt” achieves the same performance as training on the full dataset. The “WTL_unc” method is on par with or worse than the random sub-sampling method “RSS”.

Given that saturation is reached within a small fraction of the training dataset, we perform a “zoom-in” experiment, evaluating the model’s performance on fractions of the training dataset between 5% and 25%. In this experiment, we drop “WTL_unc” because of its poor performance.


Zoom-in experiment: Averaged mAP score for different fractions of the training dataset using 4 seeds.

The results above show that the subsets sampled with the “WTL_CS” and “WTL_pt” methods consistently outperform random sub-sampling. In addition, using only 20% of the training dataset, the “WTL_CS” sampling method achieves a mAP score of 0.80, i.e. 90% of the highest mAP score.

Why do “WTL_CS” and “WTL_pt” perform better than random sub-sampling (“RSS”)?

To answer this question, we compare the images selected by the “RSS” method with those selected by “WTL_CS” and “WTL_pt”. For this purpose, we compute the fraction of Camera 1 images among the selected samples, for different fractions of the training dataset and for each filtering method, in both the normal and the zoom-in experiments. Note that in the training dataset, the original fraction of Camera 1 images is around 20%.
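Computing this diagnostic is straightforward; a minimal sketch, assuming each image carries a camera identifier:

```python
def camera1_fraction(selected_ids, camera_of):
    """Fraction of the selected images that come from Camera 1.
    `camera_of` maps an image id to its camera number (illustrative)."""
    from_cam1 = sum(1 for i in selected_ids if camera_of[i] == 1)
    return from_cam1 / len(selected_ids)

# Toy example: 3 selected images, 2 of them from Camera 1.
cameras = {0: 1, 1: 2, 2: 1, 3: 2}
print(round(camera1_fraction([0, 1, 2], cameras), 2))  # → 0.67
```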

The fraction of Camera 1 images in the sampled images as a function of fraction of the training dataset.
Zoom-in experiment: Fraction of Camera 1 images in the sampled images as a function of the fraction of the training dataset.

We see that the sampling methods “WTL_CS” and “WTL_pt” select more samples from Camera 1 and therefore re-balance the sub-sampled training dataset. This explains the performance gain over random sub-sampling: since both “WTL_CS” and “WTL_pt” select non-redundant data, they choose more images from Camera 1, and the sub-sampled dataset is more diverse.

Summary and outlook

In this case study, we have seen the importance of filtering the redundancies within a dataset. We have found the following results:

  • The AIRS dataset contains lots of redundant images.
  • The highest mAP score is reached using only 85% of the training dataset.
  • 90% of the highest mAP score is reached using only 20% of the training dataset.
  • Filtering re-balanced the AIRS dataset.

This benchmark study showed the importance of filtering redundant data. With WhatToLabel filtering methods, it was possible to achieve annotation cost reductions between 15% and 80%. We found many redundancies in the AIRS dataset even though it was collected in a controlled environment: each video contains at least one customer grabbing different products. We expect redundancies to be even more pronounced in an uncontrolled setting, where many different customers grab products in a supermarket.

Anas, Machine Learning Engineer

[1]: Tom Coughlin, R. Hoyt, and J. Handy (2017): “Digital Storage and Memory Technology (Part 1)”, IEEE report.

[2]: Björn Barz, Joachim Denzler (2020): “Do we train on test data? Purging CIFAR of near-duplicates”, arXiv abs/1902.00423.

[3]: Vighnesh Birodkar, H. Mobahi et al. (2019): “Semantic Redundancies in Image-Classification Datasets: The 10% You Don’t Need”, arXiv abs/1901.11409.

[4]: Joseph Redmon, Ali Farhadi (2018): “YOLOv3: An Incremental Improvement”, arXiv abs/1804.02767.
