Data Preparation Platform
for Machine Learning

Filter, Clean, and Optimize your Data

Sign up for access

Boost your machine learning pipeline

Every machine learning project starts with data. After struggling with data preparation ourselves and finding no convincing solutions available, we decided to build the leading data preparation platform. Our mission is to help engineers build machine learning applications faster and more efficiently by understanding raw data.

10x less data

Use our state-of-the-art data selection tools to employ only the most relevant data

10x lower costs

Save money on your data-related costs, such as annotation, storage, and computing

10x faster product cycle

Don't waste time preparing irrelevant data or building your own solution

Increased accuracy

Join our fight against "Garbage in, garbage out"
Reduce overfitting by diversifying your dataset with our filter and augmentation algorithms.

Manage everything in one place

From engineers for engineers. We have used the latest research to build a state-of-the-art platform for data preparation.
We use self-supervised learning combined with reinforcement learning to accelerate your data preparation pipeline.

Data Selection

Most companies use only between 0.1% and 10% of their data for machine learning. Use state-of-the-art methods to select the most relevant samples out of your data pool. Let WhatToLabel handle the selection of the data for you while you focus on the training process.
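
To make the idea of selecting a small, relevant subset concrete, here is a minimal sketch of one common diversity-based approach, greedy farthest-point sampling over embedding vectors. This is an illustration only, not WhatToLabel's actual method: the function name `select_diverse` is hypothetical, and the sketch assumes each sample has already been turned into a numeric embedding.

```python
import math
from typing import List, Sequence


def select_diverse(embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of up to k samples chosen by greedy farthest-point sampling."""
    n = len(embeddings)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)

    selected = [0]  # arbitrary starting point: the first sample
    remaining = set(range(1, n))
    # Distance from every sample to its nearest already-selected sample.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]

    while len(selected) < k:
        # Pick the unselected sample farthest from everything chosen so far.
        idx = max(sorted(remaining), key=lambda i: nearest[i])
        selected.append(idx)
        remaining.remove(idx)
        for i in remaining:
            nearest[i] = min(nearest[i], math.dist(embeddings[i], embeddings[idx]))
    return selected


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    print(select_diverse(points, 3))
```

In this toy example the two near-duplicate pairs each contribute at most one sample, which is the behavior a diversity-driven filter aims for.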

Smart Data Pool

Keep track of the data your team is working on and compare datasets. Our algorithms help you add only relevant data to the existing pool. We store only non-sensitive metadata on our servers, so you don't have to worry about transfer costs or privacy issues.


Use our deep analytics framework to analyze your raw datasets. Get insights about the distribution, diversity, and other key metrics within hours of data collection. With deep analytics, you optimize the data collection workflow within hours instead of weeks.
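
As a rough illustration of what a diversity metric can look like, the sketch below summarizes pairwise distances between sample embeddings and counts near-duplicate pairs. It is an assumption-laden example rather than a description of the platform's analytics: the function name `diversity_report`, the `duplicate_threshold` parameter, and the choice of Euclidean distance are all hypothetical.

```python
import math
from itertools import combinations
from statistics import mean
from typing import Dict, Sequence


def diversity_report(
    embeddings: Sequence[Sequence[float]],
    duplicate_threshold: float = 1e-3,  # hypothetical cutoff for "near duplicate"
) -> Dict[str, float]:
    """Summarize how spread out a set of embeddings is."""
    # Euclidean distance for every unordered pair of samples (O(n^2), fine for a sketch).
    distances = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    if not distances:
        return {"mean_distance": 0.0, "min_distance": 0.0, "near_duplicate_pairs": 0}
    return {
        "mean_distance": mean(distances),
        "min_distance": min(distances),
        "near_duplicate_pairs": sum(d < duplicate_threshold for d in distances),
    }


if __name__ == "__main__":
    print(diversity_report([(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]))
```

A low mean distance or a high near-duplicate count would suggest that newly collected data adds little variety to the existing pool.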

Use Cases


Autonomous Vehicles

Make your vehicle autonomous for the street, sea, or air.


Shipping, Logistics, Airline, Defense & Military

Visual Inspection

Detect defects in infrastructure, manufactured products, or find infected plants.


Railways & Roads, Infrastructure, Manufacturing, Agriculture, Surveillance & Security


Medical Imaging

Find abnormalities in medical images such as X-rays, MRIs, microscopy images, and other medical scans.


Health/Life Science, Biotechnology, and Digital Diagnostics/Pathology


Retail / Advertising

Automate checkout and shoplifting detection. Improve your advertising and vision-based products.


E-commerce, Retail, Platforms, Advertising & Marketing


Easy data preparation based on your needs

We offer different user interfaces. The cloud solution can be accessed in two ways: (1) through the web app, if collaboration and ease of use are desired, or (2) through the command line, if integration into an existing workflow is more important. For highly sensitive data or large amounts of data, we recommend the Docker container.

  1. docker pull whattolabel/data-filtering:latest
  2. docker run --gpus all --rm -it -v INPUT_DIRECTORY:/home/data_input:ro -v OUTPUT_DIRECTORY:/home/experiment_output whattolabel/data-filtering:latest
  3. The results will be stored in OUTPUT_DIRECTORY

  1. curl -F'for=segmentation' -F'strength=low' -F'token=01234567' -F'file=@yourdataset.tar'
  2. wget https://datasets.whattolabel/uniquelink

Customer Case Studies


"After training a model on the filtered data suggested by WhatToLabel, I saw a dramatic increase in performance on our key metrics. Part of this is certainly due to the fact that this was the first time we trained a model on any data that we've collected, but I'm fairly certain that performance would not have been as good if we had chosen what data to label at random."

Angelo Stekardis, Computer Vision Lead

Our Blog

AI Strategy for Business Leaders

In today’s globalized world, competition is becoming more and more intense. Products are getting better and cheaper. Can this race be won? How do you protect yourself from being disrupted by new, innovative products?

Rotoscoping: Hollywood’s video data segmentation?

In Hollywood, video data segmentation has been done for decades. Simple tricks such as color keying with green screens can reduce work significantly. In late 2018 we worked on a video segmentation toolbox.

Introducing What To Label - data preparation

Are you curious about research areas such as active, self-supervised, and semi-supervised learning and how we can optimize datasets rather than optimizing deep learning models? You’re in good company, and this blog post will tell you all about it!

Improve your data
Today is the day to get the most out of your data. Share our mission with the world — unleash your data's true potential.
Start Now