Data Preparation
for Machine Learning

Improve your ML models by curating your vision data

Sign up for access

Find data redundancy and bias

Find and remove redundancy and bias introduced by the data collection process to reduce overfitting and improve ML model generalization.
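Redundancy of this kind is typically detected by comparing image embeddings. Below is a minimal illustrative sketch, not WhatToLabel's actual pipeline, of greedy near-duplicate removal via cosine similarity on precomputed embeddings (the 0.95 threshold is an assumed, tunable value):

```python
import numpy as np

def remove_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95):
    """Greedily keep samples whose cosine similarity to every
    already-kept sample stays below `threshold`."""
    # normalize rows so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

# three samples: the first two are near-duplicates, the third is distinct
emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(remove_near_duplicates(emb))  # -> [0, 2]
```

The greedy pass is O(n²) in the worst case; real systems would use approximate nearest-neighbor search to scale to millions of samples.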

10x more efficient

Save money on your data-related costs by removing redundancies

Increased accuracy

Reduce overfitting and improve generalization by diversifying your dataset

Manage everything in one place

Understand your data within minutes of collection and before any data labeling.
We use self-supervised learning combined with active learning to accelerate your data preparation pipeline.

Data Selection

Most companies only use between 0.1% and 10% of their data for machine learning. Use our state-of-the-art methods to select the most relevant samples. Let WhatToLabel handle the selection of the data for you while you focus on the training process.
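One common family of selection strategies picks a maximally diverse subset in embedding space. As an illustration only (this is not the product's actual selection algorithm), here is a greedy k-center, or farthest-point, sampling sketch:

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int):
    """Greedy k-center (farthest-point) sampling: repeatedly add the
    sample that lies farthest from everything already selected."""
    selected = [0]  # start from an arbitrary sample
    # distance from every sample to its nearest selected sample
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# two tight clusters: a diverse pick of 2 spans both clusters
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(select_diverse(emb, 2))  # -> [0, 3]
```

Each added sample maximally reduces the worst-case distance to the selected set, which is why the method tends to cover the whole data distribution instead of oversampling dense regions.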

Smart Data Pool

Keep track of the data your team is working on. Our algorithms help you add only relevant data to the existing pool. We store only non-sensitive meta-information on our servers, so you don't have to worry about transfer costs or privacy issues.

Data Analytics

Use our deep data analytics framework to analyze your raw datasets. Get insights about the distribution, diversity, and other key metrics. Find dataset bias before training and evaluating your model.
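As a toy illustration of one such metric (not the product's analytics framework), the sketch below computes the normalized Shannon entropy of a label distribution, assuming class labels or metadata tags are available: 1.0 means perfectly balanced classes, while values near 0 flag a dataset dominated by a single class.

```python
import math
from collections import Counter

def label_balance(labels) -> float:
    """Normalized Shannon entropy of the label distribution."""
    counts = Counter(labels)
    n = len(labels)
    if len(counts) < 2:
        return 0.0  # a single class carries no diversity
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))

print(label_balance(["cat", "dog"] * 50))          # -> 1.0 (balanced)
print(label_balance(["cat"] * 90 + ["dog"] * 10))  # ~0.47 (heavily skewed)
```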

Use Cases


Autonomous Vehicles

Make your vehicle autonomous for the street, sea, or air.

Shipping, Logistics, Airline, Defense & Military

Visual Inspection

Detect defects in infrastructure or manufactured products, or find infected plants.


Railways & Roads, Infrastructure, Manufacturing, Agriculture, Surveillance & Security


Medical Imaging

Find abnormalities in medical images such as X-rays, MRIs, microscope images, and other medical scans.


Health/Life Science, Biotechnology, and Digital Diagnostics/Pathology


Retail / Advertising

Automate checkout and shoplifting detection. Improve your advertising and vision-based products.


E-commerce, Retail, Platforms, Advertising & Marketing


Easy data preparation based on your needs

We have the right solution for any amount of data. Use our webapp together with our PIP package to analyze and filter your first dataset within minutes.

You can try out our limited free version with no payment required!

Filter and analyze your first dataset in 5 minutes:
  1. Register on our webapp and create a new dataset
  2. # create an embedding of your dataset using our pre-trained models
    boris-embed from_folder='path_to_your_dataset'

    # upload your dataset and embeddings
    boris-upload path_to_folder='path_to_your_dataset' path_to_embeddings='path_to_your_embeddings' \
      token='your_token_from_webapp' dataset_id='your_dataset_id_from_webapp'
  3. Have fun analyzing and filtering your dataset in the webapp!
Web App
  • < 1'000 samples
  • Drag n Drop (no coding required)
  • 2048-bit SSL encryption
Python PIP Package (CLI)
  • < 25'000 samples
  • Train custom embedding models using self-supervised learning
  • Option to only upload non-sensitive metadata
  • 2048-bit SSL encryption
On-Premise (Docker)
  • Already used by Fortune 500 companies to process > 1'000'000 samples
  • Neither your raw data nor metadata leave your server
  1. curl -F'for=segmentation' -F'strength=low' -F'token=01234567' -F'file=@yourdataset.tar'
  2. wget https://datasets.whattolabel/uniquelink

Customer Case Studies


"After training a model on the filtered data suggested by WhatToLabel, I saw a dramatic increase in performance on our key metrics. Part of this is certainly because this was the first time we trained a model on any data that we've collected, but I'm fairly certain that performance would not have been as good if we had chosen what data to label at random."

Angelo Stekardis, Computer Vision Lead


"WhatToLabel helped us understand more about our own data gathering process. Through their service, we were able to see that a lot of the data being collected was not meaningful enough for training an accurate model. This led us to change the way we gathered data and ultimately allowed us to create a much more information-dense, higher-quality dataset overall. Needless to say, the performance of our final model was greatly improved."

Nasib Adriano Naimi, Autonomy and Robotics Engineer

Our Blog

Which Optimizer should I use for my Machine Learning Project?

This article provides a summary of popular optimizers used in computer vision, natural language processing, and machine learning in general.

AI Strategy for Business Leaders

In today’s globalized world, competition is becoming more and more intense. Products are getting better and cheaper. Can this race be won? How do you protect yourself from being disrupted by new, innovative products?

Rotoscoping: Hollywood’s video data segmentation?

In Hollywood, video data segmentation has been done for decades. Simple tricks such as color keying with green screens can reduce the work significantly. In late 2018, we worked on a video segmentation toolbox.

As seen on

Improve your data
Today is the day to get the most out of your data. Share our mission with the world — unleash your data's true potential.
Start Now