AI Projects – What to Do If You Don’t Have Enough Data

Six tips to create and improve the dataset of your ML projects

Let’s be realistic. Machine learning projects almost never have enough data.

Sad but true, huh?

This happens because machines “think” differently from humans. One of the main differences is that they need a lot more data to learn. What does this mean in practice?

It means that computers need to see a lot of samples before they can begin to perform their tasks properly. Once they have seen enough, though, they can often do the job faster and more consistently than humans.

With this in mind, if you want to increase your ML project's chances of success, you need to know how to expand its dataset.

We can help with some useful tips, so read on!

Where to find the data you need

You can choose among many strategies to increase your dataset.

Here are some of the main ones.

 

1. Creating the necessary data on your own

This is the most obvious solution. But be careful: if you start creating data from scratch without knowing exactly what the AI training will need, you may waste time and money.

To prevent this, you need to answer two questions: what data do you need, and how much of it?

For the first question, a Data Specification will help. It is a simple document in which you specify at least the following points:

  • what kind of data (images, text, video, etc.) the project requires
  • what classes of data the project requires (if the goal of the project is to identify cats and dogs, it will have at least 2 classes: (1) cats and (2) dogs)
  • what data quality and data format should be used
  • what restrictions apply to the project data

It also helps to include some positive and negative examples in this document.
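The points listed above could be captured in a small machine-readable structure, so that a missing item is easy to spot before data collection starts. The following is only a sketch: the class name `DataSpecification`, its field names, and the `missing_points` helper are hypothetical and not part of any standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSpecification:
    """Hypothetical container for the points a Data Specification should cover."""
    data_kind: str                 # e.g. "images", "text", "video"
    classes: List[str]             # classes the model must distinguish
    quality_and_format: str        # e.g. "JPEG, at least 224x224 px"
    restrictions: List[str] = field(default_factory=list)
    positive_examples: List[str] = field(default_factory=list)
    negative_examples: List[str] = field(default_factory=list)

    def missing_points(self) -> List[str]:
        """Return the names of the four main points that are still empty."""
        required = {
            "data_kind": self.data_kind,
            "classes": self.classes,
            "quality_and_format": self.quality_and_format,
            "restrictions": self.restrictions,
        }
        return [name for name, value in required.items() if not value]


# Example for the cats-and-dogs project; all values are illustrative.
spec = DataSpecification(
    data_kind="images",
    classes=["cat", "dog"],
    quality_and_format="",  # not decided yet
)
print(spec.missing_points())
```

In this sketch an empty string or empty list counts as "not specified yet", so a project with no restrictions would record that explicitly (for example `["none"]`).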

The amount of data you need, on the other hand, is less obvious and quite difficult to calculate. But we would like to share our years of experience in one table:

 

| Project example | Result quality | Amount of data samples |
|---|---|---|
| Classical ML projects that do not use ANNs. Deep learning projects that use linear dependencies and/or several data classes with well-differentiated attributes. | Minimal possible quality. For R&D products and preliminary research. | fewer than 1 000 |
| Deep learning projects with ANNs. Detection and identification tasks, for example text language identification. | Minimal possible quality. For R&D products and preliminary research. | 1 000 - 10 000 |
| Deep learning projects with ANNs. All ML/AI tasks can be solved here. For example, non-complicated basic natural language processing, text/data generation and processing, basic semantic analysis, etc. | Average quality. For commercial products. | 10 000 - 50 000 |
| Deep learning projects with ANNs. All ML/AI tasks can be solved here. For example, deep semantic analysis and classification, high-quality classification, etc. | Average quality. For commercial products. | 50 000 - 100 000 |
| Deep learning projects with ANNs, image processing for non-complicated tasks. For example, face detection and recognition, etc. | Good quality. For enterprise-level projects and really complicated tasks. | 100 000 - 500 000 |
| Deep learning projects with ANNs for deep image processing: image annotation, very high-level semantic analysis, deep natural language processing, etc. | Good quality. For enterprise-level projects and really complicated tasks. | 500 000 - 2 000 000 |
| Deep learning projects with ANNs that solve really complicated tasks with unlimited possible data units. For example, intelligent (human-like) chatbots, machine translation, large specific text generation, etc. | Excellent quality. For quasi-AI projects and really intellectual tasks. | 2 000 000 and more |

This table offers very generic recommendations based on our experience and knowledge. Please keep in mind that your project may well require different numbers. Still, our advice can be a good starting point!
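As a rough illustration, the sample ranges from the table can be turned into a small lookup. This is a simplification: it only looks at the number of samples and ignores the project type. The function name `rough_quality_tier` is hypothetical, and so is the decision to count a boundary value such as 10 000 toward the higher tier, since the table's ranges share their endpoints.

```python
import bisect

# Sample counts at which the next quality tier of the table starts.
# Assumption: a boundary value (e.g. exactly 10 000) belongs to the higher tier.
TIER_STARTS = [10_000, 100_000, 2_000_000]
TIER_LABELS = ["minimal possible", "average", "good", "excellent"]


def rough_quality_tier(n_samples: int) -> str:
    """Map a dataset size to the quality tier suggested by the table."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return TIER_LABELS[bisect.bisect_right(TIER_STARTS, n_samples)]


for size in (2_000, 50_000, 100_000, 3_000_000):
    print(size, "->", rough_quality_tier(size))
```

Like the table itself, this is a starting point for planning and not a guarantee of model quality.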

 

2. Data augmentation

This option requires that you already have some data. It can be a solution when you have data, but not enough of it.

For example, you have 2 000 images, but you need more than 100 000. In such a case, data augmentation can be a good option for you.

Data augmentation is a strategy for creating new data samples from existing ones by applying a set of simple transformations (for example, affine transformations such as shifts and flips in image processing).
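To make the idea concrete, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of pixel rows; a real project would more likely use an image or deep learning library. The helper names `flip_horizontal`, `shift`, and `augment` are hypothetical. Each image is shifted by a few pixels in every direction, and each shifted copy is also mirrored.

```python
from typing import List

Image = List[List[int]]  # assumption: grayscale image as rows of pixel values


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift(image: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Translate the image by (dx, dy) pixels; uncovered pixels get `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + dx, y + dy
            if 0 <= new_x < width and 0 <= new_y < height:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def augment(image: Image, max_shift: int = 2) -> List[Image]:
    """Return shifted and mirrored variants, including the unchanged original."""
    variants = []
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            moved = shift(image, dx, dy)
            variants.append(moved)                   # shifted copy
            variants.append(flip_horizontal(moved))  # shifted and mirrored copy
    return variants


tiny_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(len(augment(tiny_image)))  # 5 * 5 shifts, each with a mirrored twin
```

With `max_shift=2` this produces 5 × 5 × 2 = 50 variants per image, so 2 000 images would yield 100 000 samples. Whether such closely related copies are enough depends on the task; they do not replace genuinely new data.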

Let me show you a practical case. Imagine that you want to classify cats and dogs. You expect to reach a good level of quality for your classifier, but you have just 2 000 images (1 000 cats and 1 000 dogs). How is it possible to increase this numb