AI Projects – What to Do If You Don’t Have Enough Data

Six tips to create and improve the dataset of your ML projects

Let’s be realistic. Machine learning projects almost never have enough data.

Sad but true, huh?

This happens because machines “think” differently from humans. One of the main differences is that they need A LOT more data to learn. What does this mean in practice?

It means that computers need to see a lot of samples before they can begin to perform their tasks properly. Once they have seen enough, though, they can often do the job faster and more consistently than humans.

With this in mind, if you want to increase your ML project's chances of success, you really should know how to expand its dataset.

We can probably help you with some useful tips. So, read on!

Where to find the data you need

You can choose among many strategies to increase your dataset.

Here are some of the main ones.


1. Creating the necessary data on your own

This is the most obvious solution. But be careful: if you start creating data from scratch without knowing exactly what the AI training will need, you may waste time and money.

To prevent this, you should understand two things: what data you need and how much of it.

For the first question, a Data Specification will help you. It is a simple document in which you specify at least the following points:

  • what kind of data (images, texts, video, etc.) the project requires
  • what classes of data the project requires (if the target of the project is to identify cats and dogs, it will have at least 2 classes: (1) cats and (2) dogs)
  • what data quality and data format should be used
  • what restrictions the project data has

It is also a good idea to include some positive and negative examples in this document.
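To make the points above concrete, here is a minimal sketch of how a Data Specification could be captured in code, using the cats-and-dogs case. The class name `DataSpecification`, its field names, the `problems` helper, and all the sample values are hypothetical and chosen only for illustration; a plain text document works just as well.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSpecification:
    """Hypothetical container for the points a Data Specification should cover."""

    data_kind: str        # e.g. "images", "texts", "video"
    classes: List[str]    # classes the project has to tell apart
    data_format: str      # e.g. "JPEG"
    quality: str          # free-text quality requirement
    restrictions: List[str] = field(default_factory=list)
    positive_examples: List[str] = field(default_factory=list)
    negative_examples: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of gaps; an empty list means every point is covered."""
        issues = []
        if not self.data_kind.strip():
            issues.append("data kind is missing")
        if not self.classes:
            issues.append("no classes listed")
        if not self.data_format.strip():
            issues.append("data format is missing")
        if not self.quality.strip():
            issues.append("quality requirement is missing")
        if not self.positive_examples or not self.negative_examples:
            issues.append("add positive and negative examples")
        return issues


# Illustrative values only; they are assumptions, not recommendations.
spec = DataSpecification(
    data_kind="images",
    classes=["cat", "dog"],
    data_format="JPEG",
    quality="animal clearly visible, image not blurred",
    restrictions=["no watermarks"],
    positive_examples=["a sharp photo of a single cat"],
    negative_examples=["a drawing of a dog"],
)
print(spec.problems())  # prints []
```

Restrictions are allowed to be empty in this sketch, since some projects may have none; every other point produces a warning when it is left blank.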

The amount of data you need, on the other hand, is less obvious and quite difficult to calculate. But we would like to share our years of experience with you in one table:


Project example | Result quality example | Amount of data samples

Classical ML projects that do not use ANNs. Deep learning projects that use linear dependencies and/or several data classes with well-differentiated attributes.