AI Projects – What to Do If You Don’t Have Enough Data

Six tips to create and improve the dataset of your ML projects

Let’s be realistic. Machine learning projects almost never have enough data.

Sad but true, huh?

This happens because machines “think” differently than humans. One of the main differences is that they need A LOT more data to learn. What does this mean in practice?

It means that computers need to see many samples before they can begin to perform their tasks properly. Once they have seen enough, though, they can often do the job faster and more consistently than humans.

Considering this, if you want to increase the probability of success for your ML project, you really should know how to expand its dataset.

We can probably help you with some useful tips. So, read on!

Where to find the data you need

You can choose among many strategies to increase your dataset.

Here are some of the main ones.

1. Creating the necessary data on your own

It is the most obvious solution. But be careful: if you start creating data from scratch without knowing exactly what the AI training will need, you may waste time and money.

To prevent this, you should understand two things: what data you need and how much of it.

A Data Specification will help you with the first question. It is a simple document in which you specify at least the following points:

what kind of data (images, texts, video, etc.) the project requires
what classes of data the project requires (if the target of the project is to identify cats and dogs, it will have at least 2 classes: (1) cats and (2) dogs)
what data quality and data format should be used
what restrictions apply to the project data

It is also a good idea to include some positive and negative examples in this document.
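To make this concrete, here is a minimal sketch of how such a document could be captured in code. The class name, field names, and example values are our own illustrative assumptions, not a standard format; the point is only that each item in the list above becomes something you can check.

```python
from dataclasses import dataclass, field


@dataclass
class DataSpecification:
    """Hypothetical container for the points a Data Specification should cover."""
    data_kind: str        # e.g. "images", "texts", "video"
    classes: list[str]    # classes the project must tell apart
    data_format: str      # required quality and file format
    restrictions: list[str] = field(default_factory=list)
    positive_examples: list[str] = field(default_factory=list)
    negative_examples: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable gaps found in the specification."""
        issues = []
        if not self.data_kind:
            issues.append("data kind is missing")
        if len(self.classes) < 2:
            issues.append("a classification project needs at least 2 classes")
        if not self.data_format:
            issues.append("data format is missing")
        if not self.positive_examples or not self.negative_examples:
            issues.append("add positive and negative examples")
        return issues


# All values below are illustrative placeholders.
spec = DataSpecification(
    data_kind="images",
    classes=["cat", "dog"],
    data_format="RGB JPEG, at least 224x224 pixels",
    restrictions=["no watermarks"],
    positive_examples=["cat_001.jpg"],
    negative_examples=["cartoon_cat.jpg"],
)
print(spec.problems())  # an empty list means no gaps were found
```

An empty result does not mean the specification is good, only that none of the basic points is missing.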

The amount of data you need, on the other hand, is less obvious and quite difficult to calculate. But we would like to share our years of experience with you in one table:

Project example	Result quality example	Number of data samples
Classical ML projects that do not use ANNs. Deep learning projects that use linear dependencies and/or several data classes with well-differentiated attributes.	Minimal possible quality. For R&D products and preliminary research.	Less than 1 000
Deep learning projects with ANNs. Detection and identification tasks, for example, text language identification.		1 000 - 10 000
Deep learning projects with ANNs. All ML/AI tasks can be solved here. For example, basic natural language processing, text/data generation and processing, basic semantic analysis, etc.	Average quality. For commercial products.	10 000 - 50 000
Deep learning projects with ANNs. All ML/AI tasks can be solved here. For example, deep semantic analysis and classification, high-quality classification, etc.	Average quality. For commercial products.	50 000 - 100 000
Deep learning projects with ANNs, image processing for simple tasks. For example, face detection and recognition, etc.	Good quality. For enterprise-level projects and really complicated tasks.	100 000 - 500 000
Deep learning projects with ANNs for deep image processing: image annotation, very high-level semantic analysis, deep natural language processing, etc.		500 000 - 2 000 000
Deep learning projects with ANNs that solve really complicated tasks with unlimited possible data units. For example, intelligent (human-like) chatbots, machine translation, large specific text generation, etc.	Excellent quality. For quasi-AI projects and really intellectual tasks.	2 000 000 and more

This table offers very generic recommendations based on our experience and knowledge. Please keep in mind that your project can easily require different numbers. Still, our advice can be a good starting point!
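If you want to use the table as a quick sanity check, the sketch below turns it into a lookup. It assumes that an empty quality cell continues the quality of the row above and that each range includes its lower bound; both are our reading of the table, and the function name is hypothetical.

```python
import bisect

# Lower bound of each range in the table and the quality label for that range.
# Assumption: an empty quality cell continues the label of the row above.
LOWER_BOUNDS = [0, 1_000, 10_000, 50_000, 100_000, 500_000, 2_000_000]
QUALITY = ["minimal", "minimal", "average", "average", "good", "good", "excellent"]


def expected_quality(n_samples: int) -> str:
    """Return the rough quality level the table associates with a dataset size."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    # Find the last range whose lower bound does not exceed n_samples.
    index = bisect.bisect_right(LOWER_BOUNDS, n_samples) - 1
    return QUALITY[index]


print(expected_quality(2_000))    # minimal
print(expected_quality(160_000))  # good
```

Treat the result as a first guess only: as noted above, a real project can easily require different numbers.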

2. Data augmentation

This option requires that you already have some data. Basically, this technique can be a solution when you have data, but not enough of it.

For example, you have 2 000 images, but you need more than 100 000. In such a case, data augmentation can be a good option for you.

Data augmentation is a strategy for creating new data samples from existing ones via a set of simple transformations (for example, affine transformations such as rotations and shifts in image processing).

Let me show you a practical case. Imagine that you want to classify cats and dogs. You expect to reach a good level of quality for your classifier, but you have just 2 000 images (1 000 cats and 1 000 dogs). How is it possible to increase this number?

Step 1: Mirror each image horizontally. You will have 4 000 examples.
Step 2: Add small rotations to each image from step 1, for example four rotations of up to ±15 degrees (two clockwise and two counterclockwise). After these 4 rotations you will have: 4 000 + 8 000 + 8 000 = 20 000 images. They look very similar to you, but not to the ANN.
Step 3: Add some Gaussian noise to each image from step 2. Now 40 000 examples are at your disposal.
Step 4: Now add a random shift to each image without changing the canvas size. For example, shift by 15% from left to right, by 10% from bottom to top, or by a mix of the two. After this operation, you will have: 40 000 + 40 000 + 40 000 + 40 000 = 160 000 images! It’s amazing, isn’t it?
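The arithmetic behind these four steps is easy to check. In the sketch below, each step keeps all existing images and adds a fixed number of new variants per image; the function and variable names are ours, for illustration only.

```python
def augmented_size(start: int, new_variants_per_step: list[int]) -> int:
    """Dataset size after steps that each keep the originals and add new variants."""
    total = start
    for new_variants in new_variants_per_step:
        total += total * new_variants
    return total


# (step name, new variants created per existing image)
steps = [("mirror", 1), ("rotations", 4), ("noise", 1), ("shifts", 3)]

total = 2_000
for name, new_variants in steps:
    total = augmented_size(total, [new_variants])
    print(f"{name}: {total}")
# mirror: 4000
# rotations: 20000
# noise: 40000
# shifts: 160000
```

Because the steps multiply, even a few transformations with a handful of variants each grow the dataset by a factor of 80 here.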

And now you can make a really cool product for cat and dog classification!

Maybe you noticed that we didn’t use all the possible transformations. That is because not all transformations make sense here.

For example, you could mirror images from bottom to top. But in our world, it’s very difficult to find cats and dogs that walk upside down.
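Here is a dependency-free sketch of three of the transformations used in the example: horizontal mirror, Gaussian noise, and a shift that keeps the canvas size. It assumes a grayscale image stored as rows of pixel values between 0 and 1, and all function names are ours. Rotation is left out because it needs interpolation, which is better handled by an image library. In a real project you would most likely use such a library for all of these steps.

```python
import random

Image = list[list[float]]  # grayscale image: rows of pixel values in [0, 1]


def mirror_horizontal(image: Image) -> Image:
    """Flip the image left to right."""
    return [row[::-1] for row in image]


def add_gaussian_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 1]."""
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def shift(image: Image, dx: int, dy: int, fill: float = 0.0) -> Image:
    """Move the content dx pixels right and dy pixels up, keeping the canvas size."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_x, src_y = x - dx, y + dy  # where this pixel comes from
            if 0 <= src_x < width and 0 <= src_y < height:
                shifted[y][x] = image[src_y][src_x]
    return shifted


image = [[0.0, 1.0],
         [0.5, 0.25]]
rng = random.Random(0)  # fixed seed so the noise is reproducible

print(mirror_horizontal(image))  # [[1.0, 0.0], [0.25, 0.5]]
print(shift(image, dx=1, dy=0))  # [[0.0, 0.0], [0.0, 0.5]]
noisy = add_gaussian_noise(image, sigma=0.05, rng=rng)
```

Note that there is deliberately no vertical mirror here, for the upside-down reason given above.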

For text augmentation, you can use various online SEO tools. Technically, though, that is a case of data generation. So, check the next section!

3. Data generation

This method is quite expensive, but sometimes it is the only way to get data. Data generation means that you use tools to artificially create data according to pre-defined rules (see case 1, Data Specification).

If you decide to use the data generation strategy, you will have 2 ML projects:

Data Generator: most likely a deep learning tool that creates data for your main project. For example, if you need texts, you can use the GPT-2 or GPT-3 models from OpenAI.
Your main project

But be careful. You need to properly control the results of data generation. This means a lot of manual work.

Otherwise, there is a real risk of ending up with a mess. You know… it happens when machines teach other machines!
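Some of that control can be automated before a human looks at the data. The sketch below is a hypothetical first-pass filter for generated texts: it drops samples that are empty, too short, too long, or duplicated. The function name and the word limits are illustrative assumptions; your real rules should come from your Data Specification, and this does not replace manual review.

```python
def filter_generated(samples: list[str], min_words: int = 3, max_words: int = 50) -> list[str]:
    """Drop empty, too short, too long, and duplicate texts, preserving order."""
    seen = set()
    kept = []
    for text in samples:
        # Normalize case and whitespace so near-identical copies count as duplicates.
        normalized = " ".join(text.lower().split())
        n_words = len(normalized.split())
        if not (min_words <= n_words <= max_words):
            continue
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(text.strip())
    return kept


generated = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",  # duplicate after normalization
    "",                          # empty
    "Dog.",                      # too short
    "A dog barked at the mail carrier.",
]
print(filter_generated(generated))
# ['The cat sat on the mat.', 'A dog barked at the mail carrier.']
```

Whatever survives the filter still needs a human check for meaning and correctness.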

Anyway, you may follow this approach when the necessary data for your project is under strict privacy or security regulations.

4. Open-source

Maybe it’s one of the most obvious ways: the use of open-source data. There is plenty of it on the internet, for almost any possible situation.

But there can be a lot of intellectual property issues, because much open data comes with a restrictive license: you will be able to use it for research, but not for commercial purposes.

So, you can use it in your R&D phase, when you need to test your idea or do some research.
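One way to keep the license question under control is to record the license of every dataset you collect and filter on it. The sketch below uses SPDX license identifiers; the dataset names are made up, and the allow-list is only an illustrative assumption that you should review for your own case. Anything not on the list is treated as "needs review" rather than as allowed.

```python
from dataclasses import dataclass

# SPDX identifiers of licenses that permit commercial use.
# Illustrative allow-list only; extend it after reviewing each license.
COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0"}


@dataclass
class DatasetEntry:
    name: str        # hypothetical dataset name
    license_id: str  # SPDX identifier, or "" if unknown


def split_by_commercial_use(entries: list[DatasetEntry]):
    """Return (usable in a commercial product, needs review or research only)."""
    usable = [e for e in entries if e.license_id in COMMERCIAL_OK]
    review = [e for e in entries if e.license_id not in COMMERCIAL_OK]
    return usable, review


catalog = [
    DatasetEntry("pets-photos", "CC-BY-4.0"),
    DatasetEntry("street-scenes", "CC-BY-NC-4.0"),  # non-commercial license
    DatasetEntry("scraped-reviews", ""),            # license unknown
]
usable, review = split_by_commercial_use(catalog)
print([e.name for e in usable])  # ['pets-photos']
print([e.name for e in review])  # ['street-scenes', 'scraped-reviews']
```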

One more issue is the lack of open-source datasets for very specific things. For example, if you decide to model the process of mixing coffee and milk to find the recipe for the perfect cappuccino, you are very unlikely to find such a dataset.

Also, some sectors are very poor in open-source resources: law, fintech, and healthcare. Nobody wants to share their credit history or data about their diseases with the entire world.

5. Markup

This advice is suitable if you have data, but you need to mark features in it. For example, you need to detect and recognize car license plates. One of the first steps will be to mark license plates on the training dataset with bounding boxes.
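A markup result for this example could be stored roughly as follows. This is only a sketch: the class, the field names, the file name, and the coordinates are illustrative assumptions, not the format of any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates, origin at the top-left corner."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str = "license_plate"

    def is_valid(self, image_width: int, image_height: int) -> bool:
        """Check that the box has positive area and lies inside the image."""
        return (
            0 <= self.x_min < self.x_max <= image_width
            and 0 <= self.y_min < self.y_max <= image_height
        )


# One annotated training image (all values are made up).
box = BoundingBox(x_min=120, y_min=340, x_max=380, y_max=410)
annotation = {"image": "car_0001.jpg", "width": 1280, "height": 720, "boxes": [box]}

print(all(b.is_valid(annotation["width"], annotation["height"])
          for b in annotation["boxes"]))  # True
```

A simple validity check like this catches boxes that were drawn backwards or outside the image before they reach training.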

It is a big task. And this task is for humans. OUCH!

So, you can hire a dedicated team or just a freelancer who will do the job for you. There are many such services, for example, Amazon Mechanical Turk or SciFabric.

It is a real option if you need simple markup for a huge number of images or videos, but it will not work for specific cases that require special knowledge.

In fact, the results of this kind of mass markup will not always be as good as the results from a professional data team.
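One way to spot-check mass markup is to compare a sample of the delivered boxes with boxes drawn by someone you trust. The sketch below uses intersection over union (IoU) for the comparison. The 0.5 threshold and the function names are our own illustrative assumptions, so pick a threshold that fits your task.

```python
Box = tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def needs_review(crowd_box: Box, reference_box: Box, threshold: float = 0.5) -> bool:
    """Flag a crowd annotation that overlaps too little with the reference."""
    return iou(crowd_box, reference_box) < threshold


reference = (0, 0, 10, 10)
print(iou(reference, (0, 0, 10, 10)))            # 1.0
print(needs_review((5, 0, 15, 10), reference))   # True
print(needs_review((1, 0, 11, 10), reference))   # False
```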

6. Data and AI team

This is our final piece of advice. Consider delegating the task to a professional team.

You will need to have a consultation with experienced AI professionals and you will have to share your data with them. Then, they will get the job done for you.

In many cases, this is the simplest and cheapest way, because data and AI are their business and they usually have all the necessary resources at their disposal.

Alternatively, they may know where and how to find the best data for your project.

Believe me, sometimes it’s more expensive to do things on your own!

Conclusions

So, that’s all for today. But let me conclude with one final thought. In real life, you rarely need just one of these strategies. You have to try several of them and combine them to succeed.

We hope this article helped you figure out your next steps.